>>725762982
These picture/video generators are basically running an image recognition algo backwards.
Normally you train a neural network with a shitload of pictures of a dog, and it creates a black box that can recognize dogs.
Then you feed it literal noise, and it goes reshaping the noise until it can recognize it as a dog.
Add a whole vocabulary to the training and Bob's your uncle: "dog holding gun" makes it try to make a picture that looks as much as possible like "dog", "holding" and "gun".
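
If you want to see the "reshape noise until it gets recognized" idea in code, here's a toy sketch. Everything in it is made up for illustration: the "dog" is just a bright square on an 8x8 grid, the recognizer is a tiny logistic regression, and names like `dog_score` and `reshape_noise` are hypothetical. It only shows the principle from this post and is not how any real generator is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumption: a stand-in for "pictures of a dog" is a bright 4x4 square
# in the middle of an 8x8 image, flattened to 64 pixels.
pattern = np.zeros((8, 8))
pattern[2:6, 2:6] = 1.0
pattern = pattern.ravel()

def make_batch(n):
    # Label 1 = noise plus the pattern ("dog"), label 0 = plain noise.
    labels = rng.integers(0, 2, size=n)
    images = rng.normal(0.0, 1.0, size=(n, 64)) + labels[:, None] * pattern
    return images, labels

# Step 1: train the black box (logistic regression, plain gradient descent).
w = np.zeros(64)
b = 0.0
for _ in range(500):
    x, y = make_batch(64)
    p = sigmoid(x @ w + b)
    w -= 0.1 * (x.T @ (p - y)) / len(y)
    b -= 0.1 * np.mean(p - y)

def dog_score(image):
    # How strongly the trained recognizer thinks this image is a "dog".
    return float(sigmoid(image @ w + b))

# Step 2: feed it literal noise and nudge the pixels so the score goes up.
def reshape_noise(steps=200, lr=0.5):
    image = rng.normal(0.0, 1.0, size=64)
    start = dog_score(image)
    for _ in range(steps):
        p = sigmoid(image @ w + b)
        # Gradient of log(score) with respect to the pixels is (1 - p) * w.
        image += lr * (1.0 - p) * w
    return start, dog_score(image), image

start, end, image = reshape_noise()
print(f"score before: {start:.3f}, after: {end:.3f}")
```

The recognizer's weights stay frozen in step 2; only the pixels change. A prompt with several words would amount to pushing several scores up at once instead of one.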