8/2/2025, 1:21:49 PM
>>106114920
No, it would not be a dense model in every way, because it would have sharded MLPs with small intermediate dimensions. Yes, it'd be a very dumb design, but that doesn't matter; this is a question of whether the mathematical model is appropriate. In literally every serious paper and real-world example, MoEs sit on a much better Pareto frontier than dense models. A 25% active MoE would probably already beat the dense equivalent.
Here's the actual sparsity scaling law from Kimi; adapt it for the dense case if you want.
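For anyone who wants to poke at the "sharded MLPs with small intermediate dimensions" point, here's a rough pure-Python sketch of a fully active MoE. Everything in it is an assumption for illustration, not something from the post or from Kimi: the toy sizes, the tanh activation, the softmax router, and the helper names (`expert`, `all_active_moe`). It slices one dense MLP's hidden units into equal shards, runs every shard as an "expert", and compares the combined output to the original dense MLP.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mlp(W_in, W_out, x):
    """Two-layer MLP with an elementwise tanh (activation is an assumption)."""
    hidden = [math.tanh(h) for h in matvec(W_in, x)]
    return matvec(W_out, hidden)

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

random.seed(0)
d_model, d_ff, n_experts = 4, 8, 4      # hypothetical toy sizes
shard = d_ff // n_experts               # intermediate dim per expert

# One dense MLP; the experts below are slices of its hidden units.
W_in = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
W_out = [[random.gauss(0, 1) for _ in range(d_ff)] for _ in range(d_model)]
W_router = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_experts)]
x = [random.gauss(0, 1) for _ in range(d_model)]

def expert(e, x):
    """Expert e owns hidden units [e*shard, (e+1)*shard) of the dense MLP."""
    rows = range(e * shard, (e + 1) * shard)
    W_in_e = [W_in[r] for r in rows]
    W_out_e = [[row[r] for r in rows] for row in W_out]
    return mlp(W_in_e, W_out_e, x)

def all_active_moe(x, gates):
    """Run every expert and sum the outputs, each scaled by its gate."""
    out = [0.0] * d_model
    for e in range(n_experts):
        y = expert(e, x)
        out = [o + gates[e] * yi for o, yi in zip(out, y)]
    return out

dense = mlp(W_in, W_out, x)
unit = all_active_moe(x, [1.0] * n_experts)               # no router weighting
routed = all_active_moe(x, softmax(matvec(W_router, x)))  # softmax router gates

max_gap_unit = max(abs(a - b) for a, b in zip(dense, unit))
max_gap_routed = max(abs(a - b) for a, b in zip(dense, routed))
print("dense vs all-active, unit gates:  ", max_gap_unit)
print("dense vs all-active, router gates:", max_gap_routed)
```

In this toy setup, the shards summed with gates fixed at 1 match the dense MLP up to float rounding, while the router-weighted sum does not, because each shard's output gets rescaled per token. It says nothing about quality or scaling; it only shows where the function computed by a 100% active MoE can differ from a plain dense MLP.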