8/11/2025, 10:30:16 AM
How do speed gains in token generation work with MoE models in the era of "exps=cpu"? I currently have access to a big DDR5 server with 12x DDR5-6400 and an Epyc Turin CPU. Running purely off CPU, this thing runs laps around my personal DDR4-2400 shitbox. The former has more than 4 times the RAM bandwidth and it shows in actual use: 1.7 t/s vs 6.3 t/s with ngl 0 and only the KV cache on GPU.
However, if I load a model like I usually would (in this case GLM4.5 Q4_K) with only the experts in RAM, the gains become much more marginal. Running exps=cpu with ngl 99 on both, the DDR4-2400 shitbox now gens at about 6.5 t/s @ 4k ctx while the DDR5 one is at about 15 t/s. Both are using an A6000 as their GPU. Obviously, this is purely about token generation; prompt processing (pp) is handled by the GPU.
What's the limiting factor here? Would a faster GPU increase the gains, or is this down to PCIe 4.0 bottlenecking the gen speeds?
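To put rough numbers on the question, here is a back-of-the-envelope sketch in Python. It is not a measurement and makes strong assumptions: that each memory channel is 64 bits wide, that the CPU-only speedup is a fair proxy for how much faster the RAM-bound part runs on the DDR5 box, and that time per token splits into a RAM-bound part plus a fixed part (GPU work, PCIe transfers, sync overhead) that is the same on both machines. The function names are made up for illustration.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers: int, bus_bytes: int = 8) -> float:
    """Theoretical peak RAM bandwidth in GB/s (assumes 64-bit channels)."""
    return channels * mega_transfers * bus_bytes / 1000


def split_token_time(slow_tps: float, fast_tps: float, ram_speedup: float):
    """Two-point, Amdahl-style split of seconds per token.

    Assumed model (not confirmed by any profiling):
        t_slow = fixed + ram
        t_fast = fixed + ram / ram_speedup
    Returns (fixed, ram), where ram is the RAM-bound time on the slow box.
    """
    t_slow = 1.0 / slow_tps
    t_fast = 1.0 / fast_tps
    ram = (t_slow - t_fast) / (1.0 - 1.0 / ram_speedup)
    fixed = t_slow - ram
    return fixed, ram


# Numbers taken from the post above.
cpu_only_speedup = 6.3 / 1.7   # ngl 0, KV cache on GPU
hybrid_speedup = 15.0 / 6.5    # exps=cpu, ngl 99

# Peak bandwidth of the DDR5 server: 12 channels of DDR5-6400.
ddr5_peak = peak_bandwidth_gbs(12, 6400)

# Assumption: the CPU-only speedup stands in for the RAM-bound speedup.
fixed_s, ram_s = split_token_time(6.5, 15.0, cpu_only_speedup)
fixed_share_fast = fixed_s / (1.0 / 15.0)

print(f"DDR5 peak bandwidth: {ddr5_peak:.1f} GB/s")
print(f"CPU-only speedup: {cpu_only_speedup:.2f}x, hybrid speedup: {hybrid_speedup:.2f}x")
print(f"fixed time per token: {fixed_s * 1000:.1f} ms "
      f"({fixed_share_fast:.0%} of the DDR5 box's token time)")
```

Under these assumptions, the speedup drops from roughly 3.7x on CPU only to roughly 2.3x with exps=cpu, and the model attributes a few tens of milliseconds per token to work that does not scale with RAM bandwidth. Which component actually owns that fixed time (GPU compute, PCIe, or launch/sync overhead) is exactly what the question is asking, and this sketch cannot tell them apart.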