>>105591939
It seems to me that with synthetic translations + reversal (https://arxiv.org/abs/2403.13799) alone they could obtain almost as much data as they want. With a good enough synthetic pipeline they could even turn web documents and books into conversations, if they wanted, and there seems to be a shortage of those in the training data considering that chatbots are the primary use for LLMs. Verifiable data like math could be generated to an arbitrary extent. There are many trillions of tokens of untapped "toxic" data they could use too, and more epochs effectively count as more data as well.
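Rough sketch of what I mean by "generated to an arbitrary extent" (not anyone's actual pipeline, just showing that a trivial generator can emit as many samples as you want with programmatically verified answers):

import random
import operator

# operator symbols mapped to the functions that verify/compute the answer
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def make_sample(rng):
    # pick two operands and an operator, compute the ground-truth answer
    a, b = rng.randint(0, 10**6), rng.randint(0, 10**6)
    sym, fn = rng.choice(list(OPS.items()))
    answer = fn(a, b)
    return {"prompt": f"What is {a} {sym} {b}?", "answer": str(answer)}

def generate(n, seed=0):
    # deterministic stream of n verifiable samples; n can be as large as you like
    rng = random.Random(seed)
    return [make_sample(rng) for _ in range(n)]

if __name__ == "__main__":
    for ex in generate(3):
        print(ex)

Scale n to whatever token budget you want; the same idea extends to multi-step arithmetic, algebra, or code checked against unit tests.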

This is not even considering multimodal data that could be natively trained together with text in many ways, rather than bolted on as an add-on like many have been doing. In that case, speech could be generated from web data too, for example.

What might have run out (but not really) is the low-hanging fruit, and there's much more than that left to pick. The models aren't being trained on hundreds of trillions of tokens yet.