>>8630350
On big pretrain runs, where you can't actually overfit the model, loss is a pretty useful metric on top of being an indicator that training is going smoothly (pic related). If you see shit like this, you can immediately tell that something is fucked up.
One shouldn't use it as a metric for tiny diffusion training runs at all: the loss there is dominated by the variance of the randomly sampled timesteps and noise, so over a short run the curve is mostly noise and barely tracks actual sample quality.
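If you still want a sanity check on a big run, something dumb like this works, a minimal sketch where the decay and threshold are made-up numbers you'd tune per run:

# Minimal sketch: flag loss spikes against an exponential moving average.
# decay and threshold are arbitrary assumptions, not canonical values.
def make_spike_detector(decay: float = 0.99, threshold: float = 2.0):
    ema = None
    def check(loss: float) -> bool:
        nonlocal ema
        ema = loss if ema is None else decay * ema + (1 - decay) * loss
        return loss > threshold * ema  # True when loss jumps well above trend
    return check

# check = make_spike_detector()
# for step, loss in enumerate(losses):
#     if check(loss):
#         print(f"step {step}: loss spike, something is fucked up")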
>>8630360
>Gradients
Why don't you actually look at them instead of just speculating?
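Rough sketch of how, assuming a standard PyTorch training loop (model and loss names are placeholders):

# Per-parameter gradient norm logging in PyTorch.
import torch

def log_grad_norms(model: torch.nn.Module) -> dict[str, float]:
    """Collect the L2 norm of each parameter's gradient after backward()."""
    norms = {}
    for name, param in model.named_parameters():
        if param.grad is not None:
            norms[name] = param.grad.detach().norm(2).item()
    return norms

# In the loop, after loss.backward():
#   norms = log_grad_norms(model)
#   total = sum(n ** 2 for n in norms.values()) ** 0.5
#   print(f"total grad norm: {total:.4f}")
# Exploding layers or all-zero grads stand out immediately.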