>>213533708
Yeah, no. This is classic doomer cope. The "ran out of training data" meme assumes the only content worth training on is what was indexed in some 2010-era Google snapshot. The internet isn't static; every single day, millions of words of human-made text, code, and media get posted. News articles, forum posts, research papers, obscure blog essays, Discord rants in niche hobby servers: most of it never touched the early scrapes.
The real bottleneck isn't quantity, it's quality. Models don't need all text, they need the right text. You can feed a model endless Reddit sludge and it'll just get better at generating Reddit sludge. And "AI-tainted" data isn't the apocalypse people imagine; if you can detect it and weight it accordingly, it's just another dataset type. Synthetic training has already been used to make models better at specific skills, and it's one way models can keep improving without human output having to scale with their appetite.
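Rough sketch of the "detect and weight" idea in Python. The detector here is a toy heuristic I made up (real pipelines would use a trained classifier, and the weights would feed a data sampler or the training loss), so treat it as an illustration of the shape of the approach, not the actual method:

import random

def suspect_score(text: str) -> float:
    """Toy stand-in for an AI-text detector: 0.0 = reads human, 1.0 = reads machine-generated."""
    tells = ("as an ai language model", "in conclusion,", "it is important to note")
    hits = sum(t in text.lower() for t in tells)
    return min(1.0, hits / 2)

def sample_weight(text: str, floor: float = 0.1) -> float:
    """Downweight suspect samples instead of throwing them away outright."""
    return max(floor, 1.0 - suspect_score(text))

corpus = [
    "shitpost about the new episode, typos and all",
    "In conclusion, it is important to note that, as an AI language model, ...",
]
for doc in corpus:
    print(round(sample_weight(doc), 2), doc[:40])

# Weighted pick for a training batch: human-looking text gets sampled more often.
batch = random.choices(corpus, weights=[sample_weight(d) for d in corpus], k=4)

Point being: "tainted" data doesn't have to be binary keep/delete, it just gets a smaller say in the gradient.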
Algorithm tweaks absolutely matter too. Data efficiency, retrieval-augmented generation, mixture-of-experts—these make models less reliant on just brute-forcing more tokens. This isn’t 2019 where the only way forward is "bigger hammer, bigger pile of text." The "dry well" narrative is just people LARPing as digital archaeologists mourning the last web ring. The real well isn’t dry—it’s just a different well now, and the bucket’s gotten smarter.
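To make the mixture-of-experts point concrete, here's a toy top-k routing sketch in numpy. Sizes and parameters are made up and untrained; real MoE layers sit inside transformer blocks and are learned end to end. The thing to notice is that only a couple of experts do any work per token, which is exactly the "stop brute-forcing everything through every parameter" idea:

import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# Hypothetical parameters: a gating matrix plus one tiny MLP per expert.
W_gate = rng.normal(size=(d_model, n_experts))
experts = [
    (rng.normal(size=(d_model, 4 * d_model)), rng.normal(size=(4 * d_model, d_model)))
    for _ in range(n_experts)
]

def moe_forward(x):
    """Route a single token vector through only its top-k experts."""
    logits = x @ W_gate                   # one gate score per expert
    chosen = np.argsort(logits)[-top_k:]  # indices of the k highest-scoring experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()              # softmax over just the chosen experts
    out = np.zeros_like(x)
    for w, idx in zip(weights, chosen):
        w1, w2 = experts[idx]
        out += w * (np.maximum(x @ w1, 0.0) @ w2)  # tiny ReLU MLP per expert
    return out

token = rng.normal(size=d_model)
print(moe_forward(token).shape)  # (16,) -- and only 2 of the 8 experts ever ran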