>>8625205
I'll upload the base 1024x checkpoint, a 1536x checkpoint, and a LoRA extract between the two. I'll also probably upload a merge of the last two epochs if it turns out to be good.
>>8625229
>do you have a custom training script or are you using an existing one?
I'm using a modified naifu script.
>what is the training config you have setup for your finetuning, and is there any particular factors that made you consider those hyperparameters?
Full bf16, AdamW4bit + bf16_sr; bs=12, lr=5e-6 for 1024x; bs=4*3 (with gradient accumulation), lr=7e-6 for 1536x; 15 epochs; cosine schedule with warmup; pretrained EDM2 weights. Captions are shuffled with 0.6 probability with the first token kept (for artists), and captions are replaced with zeros with 0.1 probability (for CFG). I settled on these empirically.
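For reference, here is a minimal sketch of what the caption shuffling and dropout described above might look like in a dataset's preprocessing step. The probabilities and the keep-first-token rule follow the post; the function name, tag-splitting convention, and use of an empty string for the dropped caption are assumptions for illustration, not the actual naifu code.

```python
import random

# Hypothetical helper illustrating the caption augmentation described above:
# shuffle tags with p=0.6 while keeping the first token (e.g. the artist tag),
# and drop the whole caption with p=0.1 so the model learns an unconditional
# branch for CFG. Names and structure are assumptions, not the real script.
def augment_caption(caption: str,
                    shuffle_p: float = 0.6,
                    drop_p: float = 0.1) -> str:
    if random.random() < drop_p:
        # Empty caption stands in for the "replaced with zeros" unconditional case.
        return ""
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    if len(tags) > 1 and random.random() < shuffle_p:
        head, rest = tags[0], tags[1:]
        random.shuffle(rest)
        tags = [head] + rest
    return ", ".join(tags)


if __name__ == "__main__":
    print(augment_caption("artist_name, 1girl, solo, outdoors"))
```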
>in terms of data preparation, is the prep for finetuning different from training loras? do you do anything special with the dataset?
Yes and no. You should tag what you see and give it enough room for contrastive learning in general. Obviously no contradicting shit should be present. Multi-level dropout rules like those described in the Illustrious 0.1 tech report will also help with short prompts, but a good implementation would require a more complicated processing pipeline, so I'm not using them.
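A rough sketch of the multi-level dropout idea, under the assumption that it means dropping whole tag groups (character, general, meta, etc.) at different rates so the model also learns from short prompts. The group names and rates below are invented for illustration; the actual rules in the Illustrious 0.1 report may differ, and as noted above the poster is not using this.

```python
import random

# Sketch of group-level tag dropout: tags are assumed to be pre-grouped by
# category, and whole groups are dropped at different rates so short prompts
# appear during training. All group names and probabilities are placeholders.
GROUP_DROP_P = {
    "artist": 0.0,     # assumed: never drop the artist tag
    "character": 0.1,
    "general": 0.3,
    "meta": 0.5,
}

def multilevel_dropout(grouped_tags: dict[str, list[str]]) -> str:
    kept = []
    for group, tags in grouped_tags.items():
        if random.random() >= GROUP_DROP_P.get(group, 0.0):
            kept.extend(tags)
    return ", ".join(kept)


if __name__ == "__main__":
    sample = {
        "artist": ["artist_name"],
        "character": ["character_name"],
        "general": ["1girl", "solo", "outdoors"],
        "meta": ["highres"],
    }
    print(multilevel_dropout(sample))
```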
>how much vram usage are you running at when performing a finetune at your current batch size?
23.0 GB at batch size 4 with gradient accumulation.
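To illustrate the bs=4*3 setup mentioned above (micro-batches of 4 accumulated over 3 steps for an effective batch of 12), here is a generic PyTorch gradient-accumulation loop. This is not the poster's naifu code; model, optimizer, dataloader, and loss_fn are placeholders.

```python
import torch

# Generic gradient-accumulation loop: gradients from 3 micro-batches of 4 are
# accumulated before each optimizer step, giving an effective batch size of 12
# at roughly the memory cost of batch size 4.
ACCUM_STEPS = 3

def train_epoch(model, optimizer, dataloader, loss_fn, device="cuda"):
    model.train()
    optimizer.zero_grad(set_to_none=True)
    for step, (inputs, targets) in enumerate(dataloader):  # micro-batch size 4
        inputs, targets = inputs.to(device), targets.to(device)
        # Scale the loss so the accumulated gradient matches a full batch of 12.
        loss = loss_fn(model(inputs), targets) / ACCUM_STEPS
        loss.backward()  # gradients accumulate across micro-batches
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
```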