Why is my performance so horrible? llama-server.exe on Windows 10 is mind-bogglingly slow with multiple GPUs. During inference, the reported power usage of my GPUs is 80-90 W even though their limit is 350 W.
```
call ".\llama-server.exe" ^
  --ctx-size 16384 ^
  --jinja ^
  --flash-attn auto ^
  --cache-type-v f16 ^
  --mlock --no-mmap ^
  --device CUDA0,CUDA1,CUDA2 ^
  --gpu-layers 48 ^
  --split-mode layer ^
  --tensor-split 33,33,33 ^
  --main-gpu 0 ^
  --model "D:\Models\GLM 4.5 Air 106bA12b Zhipu AI -- bartowski_zai-org_GLM-4.5-Air-GGUF\zai-org_GLM-4.5-Air-IQ4_XS-00001-of-00002.gguf"
```
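In case it helps with diagnosis, this is the kind of single-GPU isolation test I'd run first (a sketch, not a definitive fix: it assumes llama-bench.exe from the same release zip as llama-server.exe, and uses CUDA_VISIBLE_DEVICES to hide two of the three cards; -ngl is lowered because a full 48-layer offload of this model won't fit on one 24 GB 3090):

```shell
:: Sketch: benchmark one GPU in isolation to see if a single card hits
:: normal speed. CUDA_VISIBLE_DEVICES=0 hides the other two GPUs from CUDA.
set CUDA_VISIBLE_DEVICES=0

:: Partial offload (-ngl lowered so the layers fit in 24 GB on one card).
.\llama-bench.exe -m "D:\Models\GLM 4.5 Air 106bA12b Zhipu AI -- bartowski_zai-org_GLM-4.5-Air-GGUF\zai-org_GLM-4.5-Air-IQ4_XS-00001-of-00002.gguf" -ngl 16
```

If single-card numbers look sane, the problem is in the multi-GPU split rather than the cards themselves; comparing `-sm layer` against `-sm row` across all three GPUs would be the next step.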
GLM 4.5 Air IQ4_XS with zero context shouldn't be this slow on 3090s, should it?