>>106969311
If you manually offload layers to the GPU, do you still want to pass --n-cpu-moe 999 to keep all the experts on the CPU?
As I understand it, we want to force all the experts onto the CPU, and as many transformer layers onto the GPU as possible?
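Roughly, the layout I'm aiming for looks like this (a sketch, not a tested command; the model filename is made up, and --n-cpu-moe may need a recent llama.cpp build):

```
# offer every repeating layer to the GPU, then force the MoE expert weights
# of all layers back onto the CPU; attention + dense tensors stay on the GPU
./llama-server -m GLM-4.6-Q4_K_M.gguf -ngl 999 --n-cpu-moe 999 -c 16384
```

The older way to get the same split is --override-tensor with a regex matching the ffn_*_exps tensors and sending them to CPU, if I understand the flag correctly.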
Another Q, see picrel: For this quant (and for GLM 4.6) the # of layers isn't readily stated. Which parameter in the model info am I looking for? Or do I need to go and look at the unquantized one?
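My guess is that the count I'm after is the block_count key in the GGUF metadata (prefixed with the architecture name, something like glm4moe.block_count, though I'm not sure of the exact prefix for GLM 4.6). A quick sketch to dump it locally with the gguf pip package (filename made up):

```python
# rough sketch using the gguf package (llama.cpp's gguf-py); prints any
# "<arch>.block_count" key, which should be the number of transformer layers
from gguf import GGUFReader

reader = GGUFReader("GLM-4.6-Q4_K_M.gguf")  # hypothetical path
for name, field in reader.fields.items():
    if name.endswith(".block_count"):
        # scalar metadata values sit in the last "part" of the field
        print(name, "=", field.parts[-1][0])
```

If the same keys show up in the metadata viewer on the quant's Hugging Face page, the unquantized repo shouldn't be necessary.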