8/2/2025, 11:33:29 AM
The discussion on MoEs is hopelessly confused. People have some wild intuitions, like this square law or whatever.
The reason MoE works is that dense training is absurdly inefficient. Most activations can be pushed to zero at minor cost (see the ReLUfication literature). Dense transformers are not optimal engines for turning FLOPs into intelligence; in terms of the circuits they actually learn, they are about as sparse as MoEs, but their design makes it impossible to easily and predictably zero out everything that doesn't contribute to a given token. This is why DeepSeek went all in on "expert specialization" and fine-grained sparsity almost 2 years ago, and now we have models that do as well as old dense ones at a fraction of the compute cost. We do not know how to train efficient dense models; just cramming in more tokens doesn't work.
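To make the fine-grained idea concrete, here is a minimal sketch of a top-k routed MoE feed-forward layer: many narrow experts, with each token activating only a handful of them, so most of the layer's parameters contribute zero compute for that token. This is an illustration of the general mechanism, not DeepSeek's implementation; all names and sizes are made up.

```python
# Minimal fine-grained MoE FFN sketch (illustrative only, not DeepSeek's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FineGrainedMoE(nn.Module):
    def __init__(self, d_model=512, n_experts=64, d_expert=128, top_k=6):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Fine-grained = many narrow experts instead of a few wide ones.
        self.w_in = nn.Parameter(torch.randn(n_experts, d_model, d_expert) * 0.02)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_expert, d_model) * 0.02)

    def forward(self, x):  # x: [tokens, d_model]
        scores = self.router(x)                              # [tokens, n_experts]
        weights, idx = torch.topk(scores, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                  # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            e = idx[:, slot]                                  # expert id per token
            h = F.relu(torch.einsum('td,tdh->th', x, self.w_in[e]))   # only the routed experts run
            out += weights[:, slot, None] * torch.einsum('th,thd->td', h, self.w_out[e])
        return out

x = torch.randn(4, 512)
print(FineGrainedMoE()(x).shape)  # torch.Size([4, 512])
```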
MoEs won before they were even invented. Were we to fail to develop fine-grained MoEs, we'd just be doing PowerInfer-like stuff to accelerate dense models by training conditional sparsity into them.
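A rough sketch of what that PowerInfer-like path looks like: a ReLUfied dense FFN plus a cheap predictor that guesses which neurons will be nonzero for the current token, so inference only touches the predicted-active rows. The predictor, threshold, and sizes below are all assumptions for illustration, not PowerInfer's actual code.

```python
# Conditional-sparsity inference sketch for a ReLUfied dense FFN (illustrative only).
import torch
import torch.nn.functional as F

d_model, d_ff = 512, 2048
w_up = torch.randn(d_ff, d_model) * 0.02       # up-projection
w_down = torch.randn(d_model, d_ff) * 0.02     # down-projection
predictor = torch.randn(d_ff, d_model) * 0.02  # cheap activation predictor (assumed trained)

def sparse_ffn(x):  # x: [d_model], single token for clarity
    # Guess which neurons ReLU will keep; the zero threshold is illustrative.
    active = (predictor @ x) > 0                # [d_ff] boolean mask
    idx = active.nonzero(as_tuple=True)[0]      # indices of predicted-active neurons
    h = F.relu(w_up[idx] @ x)                   # compute only those neurons
    return w_down[:, idx] @ h                   # and only those columns on the way down

print(sparse_ffn(torch.randn(d_model)).shape)   # torch.Size([512])
```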