I'm running GLM-4.5 Air on a 3080 + 3090 with 128 GB of DDR4 and a 10850K. I cap out at about 8.2 t/s on Q2_K_XL. I've offloaded as much as I can with the --override-tensor flags and all the VRAM is in use. Is this the best I can do? Is anyone getting better results with a similar setup? --no-mmap just seemed to slow it down.

.\llama-server.exe -m "C:\Users\____\Downloads\GLM-4.5-Air-UD-Q2_K_XL.gguf" --port 5000 --override-tensor "(31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67|68|69|70|71|72|73|74|75|76|77|78|79|80|81|82|83|84|85|86|87|88|89|90|91|92|93).ffn_.*_exps.=CPU" --override-tensor "(17|16|15|14|13|12|11|10).ffn_.*_exps.=CUDA1" -ngl 200 -c 8192 -fa --threads 19
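
For anyone who wants to sanity-check which tensors those two --override-tensor patterns catch, here is a rough Python sketch. It rebuilds the same layer ranges (31 to 93 on CPU, 10 to 17 on CUDA1) without typing every number, then tests them against sample tensor names. The helper names `layer_pattern` and `device_for` are made up for illustration. The `blk.N.ffn_*_exps.weight` naming and the "first matching pattern wins" order are assumptions about how llama.cpp names and matches tensors, not something confirmed in this post. The generated patterns also escape the dots and anchor on `blk.`, which is slightly stricter than the command above.

```python
import re


def layer_pattern(first, last, device):
    """Build an --override-tensor value for expert FFN tensors of layers first..last."""
    layers = "|".join(str(n) for n in range(first, last + 1))
    # Escaped dots and the "blk." prefix keep "10" from matching inside another number.
    return rf"blk\.({layers})\.ffn_.*_exps\.={device}"


def device_for(tensor_name, overrides):
    """Return the device of the first override whose regex matches, else None."""
    for override in overrides:
        pattern, device = override.rsplit("=", 1)
        if re.search(pattern, tensor_name):
            return device
    return None  # no override matched; placement is left to -ngl


# Same split as the command above: layers 31-93 to CPU, layers 10-17 to CUDA1.
overrides = [layer_pattern(31, 93, "CPU"), layer_pattern(10, 17, "CUDA1")]

if __name__ == "__main__":
    # Tensor names below are assumed examples, not read from the GGUF file.
    for name in ("blk.12.ffn_up_exps.weight",
                 "blk.40.ffn_down_exps.weight",
                 "blk.5.ffn_gate_exps.weight",
                 "blk.40.attn_q.weight"):
        print(name, "->", device_for(name, overrides))
```

Each string in `overrides` can be passed as the value of one --override-tensor flag. Under these assumptions, only the expert tensors of the listed layers are moved, and everything else (attention weights, layers 0 to 9, layers 18 to 30) stays wherever -ngl puts it.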
