8/2/2025, 1:21:49 PM

>>106114920
No, it would not be a dense model in every way, because it would have sharded MLPs with small intermediate dimensions. Yes, it'd be a very dumb design, but that doesn't matter; this is a question of whether the mathematical model is appropriate. In literally every serious paper and example, MoE performance scales along a much higher Pareto frontier than dense models do. It's probable that a 25% active MoE would already beat the dense equivalent.
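To make the "sharded MLPs" and "25% active" terms concrete, here's a rough parameter-count sketch. All sizes (d_model, d_ff_total, expert count, top_k) are made-up illustrative numbers, not taken from any real model, and the router's own weights are ignored.

```python
def mlp_params(d_model: int, d_ff: int) -> int:
    """Weights in a plain two-matrix MLP (up + down projection), biases ignored."""
    return 2 * d_model * d_ff


def moe_layer_stats(d_model: int, d_ff_total: int, n_experts: int, top_k: int) -> dict:
    """Split one MLP's intermediate dimension into n_experts shards.

    Assumption: every expert gets d_ff_total // n_experts intermediate units
    and exactly top_k experts run per token. Router weights are not counted.
    """
    d_ff_expert = d_ff_total // n_experts
    per_expert = mlp_params(d_model, d_ff_expert)
    total = n_experts * per_expert
    active = top_k * per_expert
    return {
        "d_ff_expert": d_ff_expert,
        "total_params": total,
        "active_params": active,
        "active_fraction": active / total,
    }


# Hypothetical sizes, chosen only for illustration.
sparse = moe_layer_stats(d_model=4096, d_ff_total=16384, n_experts=8, top_k=2)
all_on = moe_layer_stats(d_model=4096, d_ff_total=16384, n_experts=8, top_k=8)

print(sparse["active_fraction"])  # 2 of 8 experts active
print(all_on["active_fraction"])  # every expert active
print(all_on["d_ff_expert"])      # each shard still has the small intermediate dim
```

Even with every expert switched on, the total weight count matches the dense MLP, but each shard keeps its small intermediate dimension and is still weighted by the router, which is the structural difference being pointed at.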

Here's the actual sparsity scaling law from Kimi; adapt it for the dense case if you want.