Emu3.5 works out of the box on 4x4090. It's an LLM built on the Hugging Face Transformers framework, so it automatically does model parallelism across all your GPUs.
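For anyone wondering what "automatic model parallelism" looks like in Transformers in general, here's a minimal sketch. It assumes the usual `device_map="auto"` route (which needs `accelerate` installed). The helper names `load_kwargs` and `load_sharded` are made up for illustration, and `model_path` is a placeholder. I haven't checked that Emu3.5's own inference script loads the model exactly this way.

```python
def load_kwargs() -> dict:
    """Keyword arguments that ask Transformers to spread weights across GPUs."""
    return {
        "device_map": "auto",       # let accelerate place layers on the visible GPUs
        "torch_dtype": "auto",      # keep the dtype stored in the checkpoint
        "trust_remote_code": True,  # assumption: the checkpoint ships custom modeling code
    }


def load_sharded(model_path: str):
    """Load a causal LM with its layers distributed over every visible GPU.

    model_path is a placeholder for a local directory or hub ID.
    """
    # Imported lazily so this sketch can be read and parsed without transformers installed.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(model_path, **load_kwargs())
    # hf_device_map records which device each block of layers ended up on.
    print(model.hf_device_map)
    return model
```

Note that this kind of sharding puts different layers on different cards, so a single forward pass walks through the GPUs in sequence rather than using all four at once.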
The version they released is autoregressive only, meaning it has to predict the visual tokens one by one instead of diffusing them all at once. So it's quite slow: this image took 12 minutes to generate. The model seems to choose the aspect ratio and resolution itself, and there's no config option to specify them. I'm running another gen and asking the model for 512x512 specifically. Let's see if it knows how to do that.
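To put the 12 minutes in perspective, here's a back-of-the-envelope calculation of per-token speed. Only the 12-minute figure comes from my run. The 1024x1024 image size and the 16x-per-side tokenizer downsampling are assumptions I picked for illustration, not measured values, so swap in the real numbers if you know them.

```python
def ar_token_stats(width: int, height: int, downsample: int, seconds: float):
    """Return (visual token count, seconds per token) for one generated image.

    Assumes one token per downsample x downsample pixel patch, generated
    strictly one at a time.
    """
    tokens = (width // downsample) * (height // downsample)
    return tokens, seconds / tokens


# Hypothetical inputs: 1024x1024 output, 16x downsampling, 12 minutes wall time.
tokens, sec_per_token = ar_token_stats(1024, 1024, 16, 12 * 60)
print(tokens, round(sec_per_token, 3))  # 4096 0.176
```

Under those assumptions, a 512x512 image would need a quarter of the tokens, so it should finish in roughly a quarter of the time if the model actually honors the requested size.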