I am still convinced that large-scale LLM training has been a mistake. Over time, the models have become better at modeling language, but more ignorant about trivia and other stuff relevant for RP and storywriting, largely because of picrel. Post-training only mitigates some of its implications. Synthetic data, which may or may not have been used during pretraining, has little to do with it.