Anonymous
8/6/2025, 3:22:50 AM
No.106157687
>>106157667
>On a related note, long context doesn't really take a lot of VRAM, but due to the sliding window it reprocesses the prompt every time by default (in llama.cpp), and for some reason prompt processing seems much slower than it should be, even after setting batch size to 8k tokens.
ah so that's why it kept fucking reprocessing the prompt so often! never mind, i had that issue with GLM 4.5 Air too
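for anyone else hitting this, a rough sketch of what i mean by bumping the batch size and keeping the full-size KV cache so sliding-window models don't redo the whole prompt every turn. the --swa-full flag and the model filename here are assumptions about a recent llama.cpp build, so check llama-server --help on yours before copying:

import subprocess

# assumed paths/flags for a recent llama.cpp build - verify with `llama-server --help`
cmd = [
    "./llama-server",
    "-m", "model-Q3_K_M.gguf",   # hypothetical filename, point this at your own gguf
    "--ctx-size", "32768",
    "--batch-size", "8192",      # logical batch used for prompt processing
    "--ubatch-size", "2048",     # physical batch per forward pass
    "--swa-full",                # assumed flag: keep the full KV cache for SWA models
    "--n-gpu-layers", "99",
]
subprocess.run(cmd, check=True)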
>>106157672
dang anon, i can run it at 8 t/s on my 12GB 3060 / 64GB DDR4 rig at Q3_K_M
what are you running it on?