>>107182640
>>107182615
>>107182656
This actually helped a lot. I copied the command from the image and now I'm getting 15 tokens/s, albeit at medium reasoning.
llama-server --model models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -b 4096 -ub 4096 -fa 1 --gpu-layers 99 -cmoe --mlock --no-mmap --ctx-size 0 --jinja

Very nice. I wonder why I can't get up to 20. Maybe it's cuz I cheaped out and got DDR4 RAM on this box.
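Napkin math on the DDR4 theory. Rough sketch assuming the published ~5.1B active params for gpt-oss-120b, mxfp4 at ~4.25 bits/param (4-bit values plus shared scales), and dual-channel DDR4-3200 at its theoretical 51.2 GB/s; with -cmoe only the expert weights actually stream from RAM, so this is a pessimistic ceiling:

```python
# back-of-the-envelope: decode speed ceiling when active weights stream from RAM
active_params = 5.1e9        # gpt-oss-120b active params per token (published spec)
bits_per_param = 4.25        # mxfp4: 4-bit values + shared block scales
bytes_per_token = active_params * bits_per_param / 8

bandwidth = 51.2e9           # dual-channel DDR4-3200, theoretical B/s

print(bytes_per_token / 1e9)        # ~2.7 GB read from RAM per token
print(bandwidth / bytes_per_token)  # ~19 tok/s theoretical ceiling
```

Real-world sustained bandwidth is usually 70-80% of theoretical, which lands right around 15 tok/s, so DDR4 being the bottleneck checks out.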
>is the model embedded one bugged?
When OpenAI first dropped the models, llama-server wasn't respecting changes to the reasoning effort, so I just made a custom jinja template with it explicitly set, which worked.
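If anyone wants to replicate the template trick, the gist was something like this (sketch from memory; the exact harmony text should come from the template bundled in the GGUF, and double-check the flag against your build's --help):

```
# copy the model's stock template, hardcode the effort line
# (gpt-oss templates carry a "Reasoning: <effort>" line in the system block),
# then point llama-server at the edited copy instead of the built-in one
llama-server --model models/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
  --jinja --chat-template-file gpt-oss-high.jinja
```

Newer builds may handle reasoning effort natively, so try the stock template first before bothering with this.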