>>725356971
so, basically you train a denoising model on a huge amount of real-life video and images. you start with lightly noised images, say 10% noise, and gradually work up until you can hand it a 100% noise image (random static) and tell it to denoise that into, let's say, a person holding an apple, which it can do pretty accurately at that point.
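rough sketch of the "add x% noise" part, if it helps. this is a toy, not anyone's actual training code: `add_noise` and the blank 8x8 `clean` image are made-up names, and reading "10% noise" as a variance mix (0.0 = clean, 1.0 = pure static) is my assumption.

```python
import numpy as np

def add_noise(image, noise_level, rng):
    """Blend a clean image with gaussian static.

    noise_level is an assumed convention: 0.0 = clean image, 1.0 = pure static.
    Returns the noisy image and the static that was mixed in.
    """
    static = rng.standard_normal(image.shape)
    noisy = np.sqrt(1.0 - noise_level) * image + np.sqrt(noise_level) * static
    return noisy, static

rng = np.random.default_rng(0)
clean = np.ones((8, 8))  # hypothetical stand-in for a real training image

# (noisy input, clean target) pairs at increasing noise levels;
# a denoiser would be trained to map the first item back to the second
training_pairs = [(add_noise(clean, level, rng)[0], clean)
                  for level in (0.1, 0.5, 1.0)]
```

the point is just that the model always sees a noisy input and the clean answer, at every noise level up to pure static.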
it then denoises the static into an image of a person holding an apple, drawing on its vast amount of training at denoising images of people, and apples, and people holding things, maybe sometimes even holding an apple.
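the static-to-image loop looks roughly like this. big caveats: `toy_denoiser` is a fake stand-in that just returns the answer (a real one is a trained network that reads the text prompt), `sample` is a made-up name, and the update rule is a simple deterministic DDIM-style step i picked for the sketch, not necessarily what any given model uses.

```python
import numpy as np

def toy_denoiser(noisy, level, prompt_image):
    # Hypothetical stand-in for a trained network: it "knows" the clean image
    # the prompt describes and returns it as its guess, ignoring the input.
    return prompt_image

def sample(prompt_image, steps=10, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(prompt_image.shape)  # start from 100% noise
    levels = np.linspace(1.0, 0.0, steps + 1)    # noise level goes 1.0 -> 0.0
    for level, next_level in zip(levels[:-1], levels[1:]):
        clean_guess = toy_denoiser(x, level, prompt_image)
        # the static implied by the current image and the current clean guess
        noise_guess = (x - np.sqrt(1.0 - level) * clean_guess) / np.sqrt(level)
        # re-mix guess and static at the next, lower noise level
        x = np.sqrt(1.0 - next_level) * clean_guess + np.sqrt(next_level) * noise_guess
    return x
```

each pass removes a bit more static, and at noise level 0 you're left with whatever the denoiser thinks the clean image is.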
so then to make a video, you give it a frame and have it predict the next frame by denoising an image that's very similar to the previous frame but slightly different, nudged by a prompt like, say, lifting their arm. each successive frame gets denoised a bit further towards the end goal: the original image, but with a lifted arm.
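and the frame-by-frame idea, again as a toy. everything here is hypothetical: `guess_next_frame` fakes the prompt-conditioned model by sliding a fraction of the way from the previous frame to a known goal frame, and `start_level`, `steps` and `pull` are numbers i made up. a real model never gets handed the goal image, it only gets the prompt.

```python
import numpy as np

def guess_next_frame(prev_frame, goal_frame, pull):
    # Hypothetical stand-in for the trained model: move a fraction `pull`
    # of the way from the previous frame toward the prompted goal frame.
    return prev_frame + pull * (goal_frame - prev_frame)

def next_frame(prev_frame, goal_frame, rng, start_level=0.3, steps=4, pull=0.25):
    # partially noise the previous frame so the result stays close to it
    static = rng.standard_normal(prev_frame.shape)
    x = np.sqrt(1.0 - start_level) * prev_frame + np.sqrt(start_level) * static
    levels = np.linspace(start_level, 0.0, steps + 1)
    for level, next_level in zip(levels[:-1], levels[1:]):
        clean_guess = guess_next_frame(prev_frame, goal_frame, pull)
        noise_guess = (x - np.sqrt(1.0 - level) * clean_guess) / np.sqrt(level)
        x = np.sqrt(1.0 - next_level) * clean_guess + np.sqrt(next_level) * noise_guess
    return x

def make_video(first_frame, goal_frame, n_frames=8, seed=0):
    rng = np.random.default_rng(seed)
    frames = [first_frame]
    for _ in range(n_frames - 1):
        # each new frame is denoised from a noised copy of the previous one
        frames.append(next_frame(frames[-1], goal_frame, rng))
    return frames
```

so `make_video(arm_down, arm_up)` gives you a list of frames that each sit a little closer to the arm-up image than the one before.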
or, magic, it's just magic, good morning saar