I did another QLoRA training run today and I'm very happy with the results. It's working well, with no serious mistakes or signs of degraded output quality.
It seems that doing a lighter tune, rather than chasing the absolute lowest validation loss, led to better results.
I'm not sure what helped the most: the lower learning rate, the 0.3 warmup ratio, or the cosine decay. I also disabled double quantization. Dropout and weight decay were kept the same as before, at 0.1 each. Training context is capped at 65k tokens because of a bug in the Liger kernels, which seem to store some value as an int16 and immediately give me NaNs when I try 66k tokens.
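For reference, the warmup-then-cosine schedule can be sketched in a few lines of plain Python. This is just an illustration of the shape of the schedule, not the actual trainer code; the step count and peak learning rate below are made-up example values, with only the 0.3 warmup ratio taken from the run.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_ratio=0.3):
    """Linear warmup over the first warmup_ratio of steps, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # linear ramp from 0 up to peak_lr
        return peak_lr * step / warmup_steps
    # cosine decay from peak_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

# example values, not the run's real settings
total, peak = 1000, 1e-4
print(lr_at(0, total, peak))    # start of warmup: 0
print(lr_at(300, total, peak))  # end of warmup: peak_lr
print(lr_at(1000, total, peak)) # end of training: ~0
```

With a warmup ratio as high as 0.3, nearly a third of the run is spent ramping up, which keeps the early updates small and may be part of why the lighter tune behaves better.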