8/2/2025, 12:52:42 PM
>>106114576
"square root law" is bro science, not supported by anything ever. Qwen 30B-A3B is roughly equivalent to Qwen-14B, not 8B.
More devastatingly for square root law bros, it's inherently retarded. In the limit, it suggests that a MoE with a single active parameter is equivalent to a dense model with sqrt(total) parameters (in reality it would be braindead, and its training cost would be negligible), and that a full-activation MoE is only as effective as a dense model of the same scale, at the same training and inference cost (actually a worse deal, given the MoE MFU penalty). We know that well-designed MoEs are something like 20-40x more compute-efficient than dense models, so the curve cannot look like this.
This has always been mere VRAMlet cope.
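For concreteness, here is a small Python sketch of the heuristic being argued against, taking "square root law" to mean equivalent dense size ≈ sqrt(active × total) (the geometric mean); that reading, the helper name, and the limit cases are my assumptions, with the Qwen 3B-active / 30B-total figures taken from the post.

import math

def sqrt_law_equivalent(active: float, total: float) -> float:
    """Geometric-mean estimate of the dense parameter count a MoE supposedly matches (assumed reading of the 'square root law')."""
    return math.sqrt(active * total)

cases = [
    ("Qwen3-30B-A3B",          3e9, 30e9),   # heuristic: ~9.5B; the post claims ~14B in practice
    ("1 active parameter",       1, 30e9),   # limit case: heuristic still credits ~sqrt(30e9) ~ 173K params to a braindead model
    ("full activation (30B)", 30e9, 30e9),   # limit case: heuristic gives exactly 30B, i.e. no benefit over dense
]
for label, active, total in cases:
    print(f"{label:>22}: ~{sqrt_law_equivalent(active, total):,.0f} equivalent dense params")

Running this reproduces the two extremes the post objects to: near-zero activation still gets a nonzero "equivalent" size, and full activation gets no credit at all despite costing at least as much as the dense model.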
"square root law" is bro science, not supported by anything ever. Qwen 30B-A3B is roughly equivalent to Qwen-14B, not 8B.
More devastatingly for square cube law bros, it's inherently retarded. In the limit, it suggests that a MoE with 1 active parameter is equivalent to a square root of total (but in reality it'd be braindead, and its training costs would be negligible) and that a full-activation MoE would be only as effective as a dense model of the same scale, with the same (actually lower, due to MoE MFU penalty) training and inference cost. We know that well-designed MoEs are like 20-40x more compute-efficient than dense, so the curve cannot be like this.
This has always been mere VRAMlet cope.