>>106934583
As I said last thread, I use TensorArt because it is way faster. Pic shows my usual settings.
The only thing not visible is Restarts, which is set to 3. I never use repeats. I just set it to 1 and bump epochs instead for more consistent results.
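Rough sketch of the repeats vs epochs math, assuming the usual convention that one epoch goes over every image `repeats` times. The function name and the 20-image dataset are made up for illustration; they are not TA fields.

```python
# Hypothetical helper for illustration only, not a TensorArt setting.
def samples_seen(num_images: int, repeats: int, epochs: int) -> int:
    """Total number of image presentations over the whole run."""
    return num_images * repeats * epochs


# 20 images is an assumed dataset size.
with_repeats = samples_seen(20, repeats=10, epochs=1)
with_epochs = samples_seen(20, repeats=1, epochs=10)
print(with_repeats, with_epochs)
```

Both calls give the same total, so moving the multiplier from repeats to epochs does not change how much training the run does.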
Batch size shows as blank, but TA sets it to 1 for Qwen behind the scenes. They use Gradient Accumulation Steps instead of a real batch size to avoid OOM errors on their servers.
So if you train Qwen elsewhere with a normal batch size, you would set batch size to 4 and Gradient Accumulation to 1. TA does the opposite: batch size 1 and Gradient Accumulation 4, so the effective batch size is still 4. Hope that is clear.
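If the arithmetic helps, here is a minimal sketch of why the two setups match. It assumes the standard definition of effective batch size as batch size times gradient accumulation steps. The function names and the 200-sample run are my own placeholders, not anything from TA.

```python
import math


# Hypothetical helpers for illustration only.
def effective_batch_size(batch_size: int, grad_accum_steps: int) -> int:
    """Samples that contribute to one optimizer update."""
    return batch_size * grad_accum_steps


def optimizer_steps(num_samples: int, batch_size: int, grad_accum_steps: int) -> int:
    """Optimizer updates needed to go through num_samples once."""
    return math.ceil(num_samples / effective_batch_size(batch_size, grad_accum_steps))


# "Normal" setup elsewhere: real batch of 4, no accumulation.
normal = effective_batch_size(batch_size=4, grad_accum_steps=1)
# TA-style setup: batch of 1, gradients accumulated over 4 forward/backward passes.
ta_style = effective_batch_size(batch_size=1, grad_accum_steps=4)
print(normal, ta_style)

# 200 samples is an assumed run length.
print(optimizer_steps(200, 4, 1), optimizer_steps(200, 1, 4))
```

Both setups update the weights equally often on the same number of samples. The difference is that with accumulation only one sample is in memory per forward/backward pass, which is what keeps VRAM use down.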