
>>106335536
>try glm 4.5 air, offload almost half of all the layers to a 5090, the rest on 124 gb of ram
~/Github/ik_llama.cpp/build/bin/llama-server \
-m ~/models/GLM-4.5-Air-IQ3_KS-00001-of-00002.gguf \
--ctx-size 65536 \
-ub 1024 -b 1024 \
-ctk q6_0 -ctv q6_0 \
--temp 0.6 \
--n-gpu-layers 23 \
--top-p 0.8 \
--top-k 20 \
--min-p 0.0 \
-fa \
-fmoe \
--jinja \
--threads 8 \
--mlock

>it runs like shit, way slower than reading speed
>let's try a different config
>run the big glm 4.5, might as well try draft models for the first time too, why not
~/Github/ik_llama.cpp/build/bin/llama-server \
-m ~/models/GLM-4.5-IQ2_KL-00001-of-00003.gguf \
--ctx-size 65536 \
-fa -fmoe \
-ctk q8_0 -ctv q8_0 \
-ub 4096 -b 4096 \
-ngl 99 \
-ot exps=CPU \
--parallel 1 \
--threads 8 \
--host 127.0.0.1 \
--port 8080 \
--no-mmap \
-md DRAFT-0.6B-Q4_0.gguf \
-ngld 99 \
--draft 64

>even doe it's twice as big as air and my gpu isn't being fully squeezed for all its vram's worth at 29/32 gb, it's easily at least twice as fast and probably smarter too
draft gods... i kneel...
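
side note for anyone copying the second command: with backslash continuations, one missing `\` silently cuts the command short and the draft flags never reach the server. a rough sketch of one way around that, assuming bash (arrays are not POSIX sh): keep the flags in an array. the `server`, `args` and `cmd` variable names are made up for illustration, and the flags are just the ones from the post, not checked against any particular ik_llama.cpp build.

```bash
#!/usr/bin/env bash
# Sketch only: same flags as the big GLM 4.5 command above, stored in a bash
# array so no line depends on a trailing backslash.
# `server`, `args` and `cmd` are illustrative names, not anything llama-server needs.
server="$HOME/Github/ik_llama.cpp/build/bin/llama-server"

args=(
  -m "$HOME/models/GLM-4.5-IQ2_KL-00001-of-00003.gguf"
  --ctx-size 65536
  -fa -fmoe
  -ctk q8_0 -ctv q8_0
  -ub 4096 -b 4096
  -ngl 99
  -ot exps=CPU
  --parallel 1
  --threads 8
  --host 127.0.0.1
  --port 8080
  --no-mmap
  # draft model flags, exactly as written in the post
  -md DRAFT-0.6B-Q4_0.gguf
  -ngld 99
  --draft 64
)

cmd=("$server" "${args[@]}")

# Print the assembled command for inspection instead of launching it.
printf '%q ' "${cmd[@]}"
echo
```

to actually launch, replace the `printf`/`echo` pair with `"${cmd[@]}"`. comments are allowed between array elements, which the backslash style can't do.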
