8/7/2025, 2:54:08 PM
>unsloth Q2_K, offloading about 25 layers to the 5090, rest on DDR5, 42k context with q8 cache on oobabooga
>1.4 tokens per second after painfully processing 25k tokens of context
>switch to iq4_kss from ubergarm + ik_llama.cpp, bump context to 64k, 20 layers on GPU, same context batch size as oobabooga
>same exact prompt now runs at 4.3 tokens per second after quickly processing context
>will probably get it faster with q3 and playing with command flags
ik_llama.cpp gods... I kneel...
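The ik_llama.cpp setup described above could look roughly like the launch below. This is a hedged sketch, not the poster's actual command: the model filename is a placeholder, and the flags assume ik_llama.cpp follows the usual llama.cpp conventions (`-ngl` for GPU layers, `-c` for context, `-ctk`/`-ctv` for KV-cache quantization).

```shell
# Hypothetical ik_llama.cpp server launch approximating the post's settings.
# The .gguf path is a placeholder; substitute the actual ubergarm quant file.
./llama-server \
  -m ./model-IQ4_KSS.gguf \
  -ngl 20 \
  -c 65536 \
  -ctk q8_0 -ctv q8_0

# -ngl 20        : offload 20 layers to the 5090, keep the rest in system RAM
# -c 65536       : 64k context window
# -ctk/-ctv q8_0 : q8-quantized KV cache, matching the oobabooga run
```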