>>105702790
With Epyc + DDR5 I can run Deepseek at Q6 with its full 160k context, getting 3.5 t/s initially and dropping to 1 t/s as the context fills. Usable for overnight tasks in OpenHands or Roo Code, but not much else realistically.
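For reference, the CPU-only launch looks roughly like this (model filename and thread count are placeholders, not my exact command; --numa distribute helps on multi-NUMA Epyc boards):

# -c 163840 is the full 160k window; -t should match your physical core count
./llama-server -m DeepSeek-R1-Q6_K.gguf -c 163840 -t 48 --numa distribute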

But at Q2_K_XL, with tensor offloading and a 24GB GPU, I can squeeze in 100k context and generate at 15 t/s, dropping to 10 t/s when it's full. That's great for daily use on everything, and it's still smarter than any non-Deepseek model. Might actually be able to fit 128k if I requanted it, since I believe some memory gets wasted when using normal quants of MLA models on the ik fork, but it's too much trouble to download the full weights.
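If anyone wants to try the offload setup, it's roughly this (filename, context value, and the regex are illustrative, adjust for your rig; the "exps=CPU" override keeps the huge MoE expert tensors in system RAM while attention, shared experts, and KV cache fill the 24GB card):

# -ngl 99 offloads all layers to GPU, then -ot pins the expert tensors back to CPU
./llama-server -m DeepSeek-R1-Q2_K_XL.gguf -c 102400 -ngl 99 -ot "exps=CPU" -fa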