So I began fine-tuning Gemma on my own cleaned-up logs, like I said I was going to.
But I've run into one crucial issue: I can fit much less context at training time than at inference time. Fine-tuning on these short contexts is hurting long-context performance at inference time, which is especially unfortunate because I don't have generous amounts of context to begin with when serving the model with llama-factory.
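To make the mismatch concrete, here's a minimal sketch of where it shows up. This is not my actual training script: the model name and the two token budgets are placeholders, and the point is just that the truncation length used during fine-tuning sits far below the window the model later sees when served.

```python
# Hypothetical sketch of the training/inference context mismatch.
# Model name and lengths are illustrative, not my real config.
from transformers import AutoTokenizer

MODEL_NAME = "google/gemma-2b"  # placeholder; any Gemma variant
TRAIN_MAX_LEN = 2048   # what fits in GPU memory once gradients/optimizer state are added
INFER_MAX_LEN = 8192   # what the same hardware fits without gradients

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def make_training_example(text: str):
    # Anything past TRAIN_MAX_LEN tokens is simply cut off during
    # fine-tuning, so the model never trains on deeper positions.
    return tokenizer(text, truncation=True, max_length=TRAIN_MAX_LEN)

def make_inference_prompt(text: str):
    # At serving time the window is much larger, but the fine-tuned
    # weights were never updated on sequences anywhere near this long.
    return tokenizer(text, truncation=True, max_length=INFER_MAX_LEN)
```

Everything between 2k and 8k tokens falls into a range the fine-tune never touched, which is exactly where I'm seeing the degradation.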