Is there a way to roughly estimate the token generation speed you can expect from a CPU + GPU build when running MoE models with -ot exps=cpu?
In particular, what would happen if you upgraded the GPU in your server to something a lot faster, like a 5090, or swapped out the CPU + RAM for one of those server boards with lots of memory channels?
Splitting a model across both RAM and GPU obviously means speed stops scaling linearly with the bandwidth of either one, which is confusing. Is there a bottleneck hidden in here somewhere? For example, even with a 12-channel DDR5 server, would there be little difference in token generation speed between a 3090 and a 5090, despite the latter having about twice the memory bandwidth, because of something dumb like PCIe bandwidth getting in the way?
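To make the question concrete, here is the back-of-the-envelope model I have in mind, as a rough sketch and not a measured result. It assumes token generation is purely memory-bandwidth-bound, that the CPU part (experts) and the GPU part (everything else) run one after the other for each token, and that any PCIe or synchronization cost can be lumped into a single per-token overhead. The function name, the 10 GB / 4 GB per-token read sizes, and the overhead parameter are all made up for illustration; the bandwidth figures are approximate spec-sheet peaks, which real systems do not reach.

```python
def estimate_tokens_per_second(cpu_gb_per_token, gpu_gb_per_token,
                               ram_bw_gb_s, vram_bw_gb_s,
                               overhead_s=0.0):
    """Rough bandwidth-bound estimate for hybrid CPU + GPU decoding.

    cpu_gb_per_token: GB of expert weights read from RAM per token (hypothetical)
    gpu_gb_per_token: GB of weights read from VRAM per token (hypothetical)
    ram_bw_gb_s / vram_bw_gb_s: memory bandwidth in GB/s
    overhead_s: assumed per-token cost for PCIe transfers and sync (a guess)
    """
    cpu_time = cpu_gb_per_token / ram_bw_gb_s   # seconds spent on the CPU side
    gpu_time = gpu_gb_per_token / vram_bw_gb_s  # seconds spent on the GPU side
    # Assumption: the two parts are serialized, so their times add up.
    return 1.0 / (cpu_time + gpu_time + overhead_s)


# Approximate peak bandwidths (spec-sheet values, not achieved throughput).
RAM_12CH_DDR5_4800 = 12 * 38.4  # 460.8 GB/s
RTX_3090 = 936.0                # GB/s
RTX_5090 = 1792.0               # GB/s

# Hypothetical model: 10 GB of active experts in RAM, 4 GB read on the GPU.
tps_3090 = estimate_tokens_per_second(10.0, 4.0, RAM_12CH_DDR5_4800, RTX_3090)
tps_5090 = estimate_tokens_per_second(10.0, 4.0, RAM_12CH_DDR5_4800, RTX_5090)

print(f"3090: {tps_3090:.1f} t/s, 5090: {tps_5090:.1f} t/s, "
      f"speedup: {tps_5090 / tps_3090:.2f}x")
```

If this model is anywhere near right, the CPU term dominates with these made-up numbers, so nearly doubling VRAM bandwidth only buys a small speedup. What I can't tell is whether the serialized assumption holds and how large the PCIe overhead term really is.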