Anonymous
6/24/2025, 11:22:56 AM
No.105688325
>>105688283
I mentioned PI's stuff because they claim their code is ready to handle both malicious nodes and smaller GPUs. I think the smaller-GPU part is mostly useful for RL rather than pretraining proper, but maybe I'm wrong about that:
https://xcancel.com/PrimeIntellect/status/1937272179223380282#m
"Pipeline Parallelism
No single GPU holds the full model - each handles a stage, streaming activations forward. This lets smaller GPUs run large models like DeepSeek-R1. Hidden states pass stage to stage; the final GPU decodes a token, sends it back, and the cycle continues."
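For anyone who wants the loop spelled out, here is a toy sketch of what that tweet describes. This is not PI's code: every name in it (Stage, make_layer, embed, decode, generate) is made up for illustration, the "layers" just add a constant, and there is no KV cache or networking. It only shows the control flow: each stage holds its own slice of layers, hidden states go stage to stage, the last stage decodes a token, and the token goes back to the start.

```python
# Toy sketch of pipeline-parallel decoding. All names are hypothetical;
# nothing here is taken from Prime Intellect's actual implementation.

class Stage:
    """One 'GPU': holds only its own slice of the model's layers."""
    def __init__(self, layers):
        self.layers = layers  # list of functions: hidden -> hidden

    def forward(self, hidden):
        for layer in self.layers:
            hidden = layer(hidden)
        return hidden


def make_layer(offset):
    # Stand-in for a transformer block: adds a constant to each element.
    return lambda hidden: [h + offset for h in hidden]


def embed(token, width=4):
    # Stand-in embedding: token id -> hidden-state vector.
    return [float(token)] * width


def decode(hidden, vocab_size=10):
    # Stand-in for the LM head on the final stage: hidden state -> token id.
    return int(sum(hidden)) % vocab_size


def generate(stages, first_token, steps):
    tokens = [first_token]
    for _ in range(steps):
        hidden = embed(tokens[-1])      # first stage embeds the latest token
        for stage in stages:            # activations stream stage to stage
            hidden = stage.forward(hidden)
        tokens.append(decode(hidden))   # final stage decodes, token goes back
    return tokens


# Split 6 "layers" across 3 stages, 2 layers each.
layers = [make_layer(i) for i in range(6)]
stages = [Stage(layers[i:i + 2]) for i in range(0, 6, 2)]
out = generate(stages, first_token=1, steps=3)
print(out)
```

Note that only one stage is busy at any moment per token here, which is why this setup looks fine for inference or RL rollouts but says nothing by itself about pretraining throughput.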