8/2/2025, 12:52:42 PM
>>106114576
"square root law" is bro science, not supported by anything ever. Qwen 30B-A3B is roughly equivalent to Qwen-14B, not 8B.
More devastatingly for square root law bros, it's inherently retarded. In the limit, it suggests that a MoE with a single active parameter is equivalent to a dense model with sqrt(total) parameters (in reality it would be braindead, and its training cost would be negligible), and that a full-activation MoE is only as effective as a dense model of the same scale, at the same training and inference cost (actually a worse deal, given the MoE MFU penalty). We know that well-designed MoEs are something like 20-40x more compute-efficient than dense models, so the curve cannot look like this.
This has always been mere VRAMlet cope.
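For concreteness, here is a small Python sketch of the heuristic being argued against, taking "square root law" to mean equivalent dense size ≈ sqrt(active × total) (the geometric mean); that reading, the helper name, and the limit cases are my assumptions, with the Qwen 3B-active / 30B-total figures taken from the post.

import math

def sqrt_law_equivalent(active: float, total: float) -> float:
    """Geometric-mean estimate of the dense parameter count a MoE supposedly matches (assumed reading of the 'square root law')."""
    return math.sqrt(active * total)

cases = [
    ("Qwen3-30B-A3B",          3e9, 30e9),   # heuristic: ~9.5B; the post claims ~14B in practice
    ("1 active parameter",       1, 30e9),   # limit case: heuristic still credits ~sqrt(30e9) ~ 173K params to a braindead model
    ("full activation (30B)", 30e9, 30e9),   # limit case: heuristic gives exactly 30B, i.e. no benefit over dense
]
for label, active, total in cases:
    print(f"{label:>22}: ~{sqrt_law_equivalent(active, total):,.0f} equivalent dense params")

Running this reproduces the two extremes the post objects to: near-zero activation still gets a nonzero "equivalent" size, and full activation gets no credit at all despite costing at least as much as the dense model.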
"square root law" is bro science, not supported by anything ever. Qwen 30B-A3B is roughly equivalent to Qwen-14B, not 8B.
More devastatingly for square cube law bros, it's inherently retarded. In the limit, it suggests that a MoE with 1 active parameter is equivalent to a square root of total (but in reality it'd be braindead, and its training costs would be negligible) and that a full-activation MoE would be only as effective as a dense model of the same scale, with the same (actually lower, due to MoE MFU penalty) training and inference cost. We know that well-designed MoEs are like 20-40x more compute-efficient than dense, so the curve cannot be like this.
This has always been mere VRAMlet cope.