Search Results
7/4/2025, 9:16:56 PM
>>105801590
You don't know how bad the situation really is. They're basically throwing shit at the models during pretraining, while taking high-effort documents away just because they contain "bad words". Picrel is an example document from FineWeb (supposedly a high-quality pretraining dataset). Yes, that's the entire document.