I have a hypothesis that the main reason LLMs have improved is not that they're being trained on larger amounts of data, but mostly benchmark-optimizing dataset filtering, larger training context sizes and better training hyperparameters. Most of the user-facing improvements have come from post-training. This has been bugging me for a while now.
Llama 1 was pretrained on 1.4T tokens using 2048 GPUs (no post-training).
Llama 3 was pretrained on 15.6T tokens using 32000 GPUs.
Between Llama 1 and Llama 3 the pretraining data grew by about 11x, but the number of GPUs grew by roughly 15x. This means the newer model saw a similar (or, more likely, lower, given its larger context size) number of total training steps, since the effective training batch size is proportional to the number of GPUs used.
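As a rough back-of-the-envelope sketch (the per-GPU micro-batch size and the context lengths below are my assumptions, not figures reported in the Llama papers), the number of weight updates falls out of total tokens divided by the effective batch size in tokens:

```python
# Illustrative only: per-GPU micro-batch of 4 sequences assumed for both runs.

def optimizer_steps(total_tokens, num_gpus, seqs_per_gpu, context_len):
    tokens_per_step = num_gpus * seqs_per_gpu * context_len  # effective batch, in tokens
    return total_tokens / tokens_per_step

llama1_steps = optimizer_steps(1.4e12, 2048, 4, 2048)    # assumed 2k context
llama3_steps = optimizer_steps(15.6e12, 32000, 4, 8192)  # assumed 8k context

print(f"Llama 1 ~{llama1_steps:,.0f} steps, Llama 3 ~{llama3_steps:,.0f} steps")
# Under these assumptions the newer model gets fewer weight updates despite ~11x the data.
```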
The implications of very large batch sizes for knowledge learning at scale aren't fully understood, but you can observe them at a small scale with finetuning. For pretraining, it's generally assumed that the more data, the better, but if training on larger amounts of data requires more GPUs (and thus larger batch sizes), then every weight update averages the gradients of more samples together than in models trained on less data with fewer GPUs.
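Here's a minimal sketch of that averaging effect (PyTorch, with made-up random data, nothing Llama-specific): with the usual mean-reduced loss, each sample's gradient is scaled by 1/batch_size before the weight update, so any single distinctive sample contributes less the larger the global batch gets.

```python
import torch

torch.manual_seed(0)
w = torch.zeros(8, requires_grad=True)

def batch_gradient(batch_size):
    x = torch.randn(batch_size, 8)    # stand-in for unrelated random samples
    y = torch.randn(batch_size)
    loss = ((x @ w - y) ** 2).mean()  # mean reduction averages per-sample gradients
    (grad,) = torch.autograd.grad(loss, w)
    return grad

# Individual-sample signal gets averaged away: the combined gradient shrinks
# as more mutually unrelated samples are folded into one update.
print(batch_gradient(32).norm(), batch_gradient(4096).norm())
```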
The larger the training GPU cluster, the more every weight update becomes an averaged, sloppy representation of the pretraining data (many thousands of random samples that might have nothing to do with each other), even before the model gets sloppified with carelessly LLM-generated data in post-training.
The models will probably get better at modeling language **on average**, but they will also learn fewer unique facts, ideas and styles because of this large-batch gradient averaging. GPU training clusters can't be scaled up to 1 million GPUs or more without paradoxically making the models less knowledgeable and sloppier.