So I tried iterated LoRA finetuning (or QLoRA, rather). By this I mean: train a LoRA for one epoch, merge it into the base model, train another LoRA on top, and so on.
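Roughly this, as a sketch (using peft/transformers; the model name, rank/alpha, target modules, and dataset are placeholders rather than my actual setup, and I'm glossing over the 4-bit quantization details that merging under QLoRA actually involves):

```python
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

BASE = "some/base-model"   # placeholder
model = AutoModelForCausalLM.from_pretrained(BASE)

num_rounds = 4             # however many rounds of the iterated scheme
for round_idx in range(num_rounds):
    # fresh adapter each round
    lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.1,
                          target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                          task_type="CAUSAL_LM")
    peft_model = get_peft_model(model, lora_cfg)

    trainer = Trainer(
        model=peft_model,
        args=TrainingArguments(
            output_dir=f"round_{round_idx}",
            num_train_epochs=1,        # one epoch per LoRA
            learning_rate=1e-4,        # the higher LR from the iterated runs
            weight_decay=0.1,
        ),
        train_dataset=train_dataset,   # assumed: your tokenized SFT dataset
    )
    trainer.train()

    # fold the adapter into the base weights; the next round starts a new LoRA
    model = peft_model.merge_and_unload()
```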
First, when I looked at the losses I found it interesting but kind of disappointing that it generalized worse than just training a single LoRA for multiple epochs, and seemed to overfit.
Then I realized the results aren't really comparable, because when I trained the single LoRA for many epochs without merging, I used a cosine schedule at a lower learning rate than in the iterated case. And when I actually tested the merged (iterated) model, the quality was very, very bad.
But then when I tested the checkpoint from the iterated run saved during the first epoch (so before any merging had happened), the quality was similarly bad to the fully merged case.
So my conclusion is that it's very important to train with a small learning rate (1e-04 vs 1e-05); the difference really is drastic. At 1e-04, all the apologetic behavior ("you are absolutely right", "I am deeply sorry", "I am malfunctioning") is gone (I'm training on a small dataset that contains none of those phrases and no apologies, with quite rude replies), but the model also becomes very dumb.
When training at 1e-05, even after many epochs the model retains the slop phrases and the apologetic behavior, as well as other undesirable behaviors from the original model, like using HTML codes when it shouldn't and putting ```xml markdown fences before the tool calls.
I am training with what I understand to be quite high values for both dropout and weight decay (0.1 each), so it makes sense that the effect of the LoRA might never become strong enough when training at a low learning rate.
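For reference, those knobs live in two different places in a peft/transformers setup; something like this (the dropout, weight decay, learning rates, and cosine schedule are from my runs, the rest is illustrative):

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_cfg = LoraConfig(
    r=16, lora_alpha=32,
    lora_dropout=0.1,                # the "quite high" dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

args = TrainingArguments(
    output_dir="lora-out",
    learning_rate=1e-5,              # 1e-4 removes the slop but tanks the intelligence
    lr_scheduler_type="cosine",      # what the long single-LoRA run used
    weight_decay=0.1,                # applied to the trainable LoRA params
    num_train_epochs=1,
)
```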
So in conclusion, I'm finding it hard to get rid of the slop with a tiny dataset without hurting the intelligence of the model. I guess I'll just have to keep increasing the size of the dataset, train for only a couple of epochs at the low learning rate, and gradually increase the learning rate (or the number of epochs) as I get more data. I wish I had the money to do a large hyperparameter search.