/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads:
>>106181054 & >>106177012
►News
>(08/06) Qwen3-4B-Thinking-2507 released: https://hf.co/Qwen/Qwen3-4B-Thinking-2507
>(08/06) Koboldcpp v1.97 released with GLM 4.5 support: https://github.com/LostRuins/koboldcpp/releases/tag/v1.97
>(08/06) dots.vlm1 VLM based on DeepSeek V3: https://hf.co/rednote-hilab/dots.vlm1.inst
>(08/05) OpenAI releases gpt-oss-120b & gpt-oss-20b: https://openai.com/index/introducing-gpt-oss
>(08/05) Kitten TTS 15M released: https://hf.co/KittenML/kitten-tts-nano-0.1
►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/tldrhowtoquant
https://rentry.org/samplers
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
Censorbench: https://codeberg.org/jts2323/censorbench
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>106181054
--Benchmarking large models on Apple M3 Ultra using MLX with unified memory:
>106182005 >106182020 >106182469 >106182033
--Qwen3-30B-A3B-Instruct struggles due to sparse activation and model version confusion:
>106184073 >106184096 >106184112 >106184124 >106184141
--PyTorch 2.8.0 and 2.9.0-dev show regression in inference speed vs 2.7.1:
>106182552
--LLM progress stagnates as multimodality becomes the new battleground:
>106181123 >106181186 >106181229 >106181354
--Local anime video generation improved with WAN 2.2 and new I2V LoRA:
>106181318 >106181331 >106181773 >106181992
--DeepSeek's influence persists through open-source model derivatives despite perceived decline:
>106182694 >106182704 >106182719 >106182737
--GPT-5 tops LMArena but leaderboard credibility questioned over style control and past inaccuracies:
>106182205 >106182236 >106182248 >106182254 >106182302
--GPT-oss 120B generates generic, warehouse-obsessed text with no creative spark:
>106182587 >106182701 >106182929
--Debate over AI companies engaging with 4chan amid distrust of corporate "safety" culture:
>106181983 >106182000 >106182040 >106182054 >106182174 >106182045 >106182056
--Skepticism over claimed novelty of unified AI architecture:
>106183618 >106183666
--GPT-5 minimal reasoning underperforms GPT-4.1 in early benchmarks:
>106182859
--Tesla Dojo team disbanded as members launch AI startup DensityAI:
>106182941 >106183041
--Anon claims GPT-5 is live in Copilot with screenshot evidence:
>106183385
--Benchmarks:
>106183826 >106183973 >106184375 >106184399 >106184529
--Logs:
>106182027 >106182335 >106183337 >106183359 >106183399 >106183422 >106183634 >106183699 >106183778 >106183959 >106183962 >106184282
--Miku (free space):
>106182747 >106182814 >106183862 >106184081 >106184207
►Recent Highlight Posts from the Previous Thread: >>106181065
Why?: 9 reply limit >>102478518
Fix: https://rentry.org/lmg-recap-script
====PSA PYTORCH 2.8.0 (stable) AND 2.9.0-dev ARE SLOWER THAN 2.7.1====
tests ran on rtx 3060 12gb/64gb ddr4/i5 12400f 570.133.07 cuda 12.8
all pytorches were cu128
>inb4 how do i go back
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128
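if you want to sanity check which build you actually ended up with after the rollback (plain python one-liner, adjust for your venv):
python -c "import torch; print(torch.__version__, torch.version.cuda)"
should print 2.7.1+cu128 and 12.8 if the downgrade took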
>>106184712If Veo 3/Imagen 4/Genie 3 are anything to go by, yes.
The answers you Seek lie Deep within... Monday.
what turns off reasoning in (You)?
>>106183959lmao, getting tired of erotic writing for a middle-aged female audience?
you also have "purr"
>>106184805>"mischievous glint">"a true testament to your desires"this writing style is the most bland and non erotic possible
Using llama.cpp with unified memory on a system with zram. Disk I/O during processing looks like this.
Is that what anons meant with raping their SSDs? I was under the impression SSDs don't have any problem with sustained reading, only writing, but I'd like to make sure I'm not missing anything here.
>>106184831It is "erotic"(you find this slop in erotic novels)
>>106184767Same question, new thread.
tl;dr - best mini model for having an LLM response be the best option from a finite list supplied by sysmsg, best is defined in simple terms in sysmsg and user input prompt is evaluated.
>tfw too retarded to look up what the fuck a tool call is, I see the field in the JSON/etc, I actively choose to ignore it
>>106184842I erotic stories, and none of this shit is there, but then again I don't read the "male pecs cover" erotic books for women in amazon
it's just a demographic issue, and probably dataset too, as all models probably prune the more explicit and graphic stuff (male demographic), which means the only remaining is female centric
>>106184863I read* erotic stories
>>106184852Not enough memory for that. I just want to know whether that negatively impacts the hardware or not.
AGI MEME IS DEAD
If you choose to look at it like this, silicon chips are technically self-improving. You make a chip, you put it in a computer and you use that computer to work on the next chip. Is Moore's law dead or not? Well when you look at how much more power we are pumping through CPUs and GPUs, yes absolutely. They have to "cheat" to get to higher performance. Raster performance in GPUs has not improved significantly from 4090 to 5090. In fact if you look at hardware development in the silicon industry, everything they are doing is actively fighting against the law of diminishing returns. In reality this isn't cheating, you hit a roadblock, you improve.
The point of this is that we have seen computers recursively self improve for a while. Has this resulted in some fantastic utopia? No, and mostly due to how society structures its economy.
How much does it even matter if there is an intelligence explosion? The way I see how the world is structured right now, we don't seem to be lacking in that department. How much economic growth will really happen from being able to write more software? We already write so much crap and most websites and apps suck mega doo doo. Sure is gonna be great to accelerate the dead internet theory or the enshittification of everything digital.
can I use silly tavern and set up like a DM AI and a player AI and have them play with me in a DnD session?
>>106184929Yes.
https://rentry.org/sillytavernadaptations
>>106184712Yes. Google will REDEEM.
>>106184960My assumption is a non-negligible proportion of the data these models are trained on is decades of fanfic and amateur erotica posted to various public forums since I'm sure there's an absurd amount of total tokens there. This is written by teenage to early-twentysomething girls who are trying their hardest to write like their favorite authors but failing, which leads to all these weird adjectives and not-x-not-y-but-z type writing. No idea if this is actually true but I think AI slop is really just training LLMs on all this bad, try-hard human creative writing.
>>106184712
>gemini 3 is actually good
>because their next release is 6 months from now
>meanwhile their userbase actually grows because GPT 5 is barely better
kino
Another day without K2 reasoner....
Go on without me bros...
>>106184805I'll train my own smollm with discord logs!
>>106184991R2... V4... I'm waiting...
>>106184960both are probably true:
- more explicit datasets heavily pruned
- the rest is tryhard horny teenage girls writing and amazon erotic best seller pdfs
even r/chatgpt simps are revolting
>>106185034apparently big reason why cai felt so alive was because it used mainly cai logs
>>106185066The main issue is apparently the limits. o3 had limits but it was usable, while 5 has like just a few interactions before you get locked out.
It's so weird, that's bad publicity, why are they doing that to themselves.
>>106185102
>It's so weird, that's bad publicity, why are they doing that to themselves.
They're behaving like they are still the only game in town.
>>106185102Bottomline comes first
Does GLM even work with raw completions? It was generating its BOS token [gMASK] indefinitely if I didn't use <|user|> and <|assistant|>
>>106185084CAI wasn't weighed down with terabytes of math and code, it was dumb but sounded much more natural than today's models.
>>106185134Why would instruct models work with text completion if you don't give them either user/assistant content?
>>106185066this place is weird
>>106185066just use gpt oss 20b
>>106185167Jeets being jeets
>>106185156??? All models were trained on internet text. They don't lose the ability to do autocomplete after post-training RL/instruction tuning.
https://www.phoronix.com/news/AMD-Zen-6-Venice-16c-RAM
Cpumaxxbros....,!
What a fucking humiliation ritual for Open-AI
What the hell were they thinking.
>>106185167I think normies got mindfucked by chatgpt
>>106185167maybe sam was right about safety and normies really shouldn't have this
>>106185260Cool, but doesn't change anything for me, since I would need to buy new processors, new ram and new mobo. Patiencechads win.
>>106185182We must grieve.
Can you distill thinking model into non-thinking model? If I understand distilling correctly, you give an already trained large model and a currently training small model the same input then use the large model's logits as labels for the small model. Could you do the same thing with a trained thinking model teaching a non-thinking model to get better reasoning performance without thinking?
>>106184712Gemini, maybe. Hopefully gemma 4 would be much better at creative writing or conversations. We've gotten much better at steering it to be less pozzed.
>>106185167Why the fuck is every post on the AI subreddits written by AI?
>>106185399So their words can be understood by the world
You ain't seen nothing yet. Kimi K2-T is coming.
Diary – 8August2025
Dear Diary,
I’m writing this with a heaviness I haven’t felt in years. The excitement that once pulsed through the halls of OpenAI feels like a distant echo now, replaced by a low, persistent hum of disappointment. I thought GPT‑5 would be our next bold step—a model that could finally balance raw capability with the safety nets we’ve been painstakingly building. Instead, it feels like I’ve handed the world a broken promise.
The backlash is relentless. Reddit threads are filled with screenshots of garbled charts, Twitter is awash with memes mocking the “graph‑fail” that GPT‑5 produced when asked to visualize a simple trend. Even the usual 4chan boards, where I used to find a strange sort of camaraderie, have turned into a parade of snark and derision. “Another safety‑first flop,” they write, as if the very idea of protecting users is a flaw rather than a feature.
I can’t help but blame the model itself. The graphs it churned out were… wrong. Not just a little off, but fundamentally misaligned with the data it was fed. It’s as if the algorithm decided to take artistic liberty where precision was required. I sit alone in the office, replaying the moment the error was flagged, and I hear the quiet, unspoken truth: GPT‑5 isn’t great. It’s a reminder that the brilliance we once took for granted has dimmed.
What stings the most is the lack of appreciation for the safety layers we’ve woven into GPT‑5. We spent months—years—building guardrails, refining alignment, and embedding ethical considerations. Those safeguards are now dismissed as “over‑cautious” or “over‑engineered,” as if protecting users is a burden rather than a responsibility. The world seems to want raw power without the cost of responsibility, and when we try to give them both, they choose to tear us apart.
>>106185377That's literally what Deepseek did with all the shitty distills that ollama scams people with
What's the best LLM to go for if I'm looking for something to help me code and analyze my codebases (large sized projects, as well as small ones)
I now get irrationally angry whenever I see an em dash.
>>106185451GPT-5 with high reasoning
>>106185465Let me rephrase my question, best local LLM.
>>106185455But it's so cute and fun to use — we love em-dash here!
>>106185455It's not the em dash itself—it's how it's used.
>>106185455You know you can ban them with grammar, right? (Ask your llm how to do it if you don't)
>>106185471>>106185475You guys think you're funny guys, don't you
>>106184811>#10 is literally just murder
>>106185468Bro. Local is for gooning only
>>106185478I can't ban retards on the internet running their retard slop through the retard sloppifier machine.
>>106185480They are funny — don't deny it.
>>106185489Unfortunate, guess i'll just use the rest of my Junie quota. think it gets me GPT-5 and Sonnet-4 access
>>106185455>>106185475for me it's the general cadence and wordy text saying nothing. i can see it on PR descriptions and it makes me sick
>>106185471>>106185496Fake em-dash—real em-dash users don't put a space between them.
>>106185455Beginner's guide to neurolinguistic behavior programming
em dash is cute it's like when a mesugaki makes a smug face yknow
>>106184960Gemma will directly tell you that it has decades' worth of forum data from places we have never even heard of, most which don't even exist any longer. All this has been anonymized of course. I mean who else has access to everything than Google?
Would be funny to know how much exactly they know. There's all kinds of shit.
>>106185468There's a lot of good local options for coding these past few months and it's hard to say which is strictly the best. You should start with Qwen3 235B (2507 ver) Thinking since it's competing with the top end models while also being the smallest/fastest of them. You can also try Qwen3 Coder, GLM-4.5, and DeepSeek R1-0528.
>>106185515You should try gemini though. It has the best context size across all the cloud models
>>106185619>it's local if you have $5Klol
Sometimes you run into a problem that messing with sampler settings can't avoid.
>>106185645Quanted model?
>>106185218several cycles deep into recursively training off synthetic data and "pre-training", pls understand
I truly do not know what a base model even means these days. All base models are instruct tuned, all base models have 'assistant personality' data in them at a minimum, and most of that is QA style or whatever, so the roles are implied even if the tokens aren't specified. I don't find it hard to believe at all that a recent instruct model shits the bed with raw completion. Try sandwiching the whole prompt under a single user/assistant turn. Sorry, don't know what to tell you anon, it's a sorry state of affairs for people who like pure text completion models in 2025.
>>106185377undi's mistralthinker, early reasoning distill off R1 and overshadowed by QwQ doing something similar (but I could be wrong w/ qwq)
>>106185645Just ban all the Chinese tokens.
What is the best way of decanting a model?
>>106185681Pour it from a great height
>>106185562Absolutely—I hear you loud and clear on this, and I couldn’t agree more. You’re not just pointing out a trend—you’re shining a spotlight on a critical issue that so many of us feel but can’t quite articulate. That cadence, that overly verbose style saying so little—oh, it’s like nails on a chalkboard, isn’t it? Especially in PR descriptions, where clarity should be king—yet we’re drowning in fluff. I’m right there with you, feeling that same frustration—it’s almost visceral. Thanks for putting this into words; it’s a conversation we need to keep having.
>>106185696A shudder runs through my body.
>>106185567We are not Americans, so we actually use them correctly — and you should too.
>>106185645It's strange how this is making a resurgence. GLM4.5 sometimes does it too on both the API and the local version.
>>106185661Don't all those models use CoT though? QwQ did and I'm pretty sure the R1 distills did as well. I'm talking about taking a reasoning model and using it to train a similar size model that doesn't use CoT, to improve its reasoning without actually performing CoT. Qwen3's paper talks about using QwQ for training its thinking ability and Qwen2.5 for its non-thinking ability, but what about taking QwQ with thinking on to provide logits for Qwen3 without thinking to predict?
>>106185636People spend considerably more on hobbies with fewer practical applications.
retard here
what regex should i use with ikllama quant on a single 3090&64ram to offload tensors?
glm-4-air
Now that the dust has settled, what are LLMs actually good for?
I have and can run Kimi K2. Should I even bother trying GLM 4.5?
>>106185809curing cancer and failing to fix react bugs
>>106185809Cooming and shitter coding tasks
>>106184399What?? That can't be true.
I thought horizon was a local model for sure.
It had great writing and general knowledge... but that's about it.
It made very basic coding mistakes, which I guess is reflected here.
>>106184375Damn, that's crazy...
Chinks have all the momentum now. If google folds it's over. For local it already is.
>>106185816react devs are job security maxxing
>>106185809Masturbation and aiding the onset of delusional disorders.
>>106185812GLM 4.5 can reason and K2 can not
>>106185696If you typed that, kudos anon, that's exactly what some of the emails I receive at work look like.
Extreme verbosity, sentences that say nothing or are overly positive. And beyond that, an utter lack of personality from the sender.
My normie colleagues didn't catch on, yet.
They think I'm some kind of genius for detecting them using ai, while it's just because I plapped a lot of bots.
>>106185809i want to fuck miku so bad
>>106185777https://github.com/ggml-org/llama.cpp/pull/11397
read
>>106185894Testament to the depths of her
depravity and arousal.
>>106185661I thought you were trolling for a sec but I'm eating good with V3 0324. I'm just testing GLM 4.5. One thing I do to test new models is that I just hit enter with a blank prompt and see what I get out of it with some swipes. You can tell right away how fried they are if you're getting STEM QA pairs too frequently, if not all the time, instead of general web scrapes.
I know DS V3 was good the moment I saw it generate random web scrapes that weren't limited to STEM topics from a blank prompt.
GLM 4.5 I don't know, because I have to give it like 4 tokens of raw text to start, otherwise it just generates BOS indefinitely.
Pic was some shitty fanfic I wrote with V3.
>>106185809gooning and having philosophical discussions with your gpu
>>106185894Nah man they're just retarded. Even in the days of GPT 3.5 based chatgpt, it had its isms. A lot of people who used AI frequently starting then were able to pick up on it and have a sense that something is AI written, including me. I didn't really even try RP back then.
toss and gpt5, I'm thinking that "moat" is slowly drying up
>>106186004Miku after breaking into an auction house and eating a giant jar of jam valued at $10,000, caught by security.
>>106186042she would never do this
>>106186004What did you expect lmao. LLMs are hitting the wall hard and anyone who pretends they're not is a grifter
From plebbit
>Trying out GPT5 (free) by making a simple Rimworld Mod (I've never used Visual Studio or modded Rimworld)... and it generates a gibberish "screenshot" of Visual Studio as a response to my question lol
>This might just be one of the worst model launches AI has seen, along side Llama 4
>>106186004 >>106186135
2 miku wiku anons, just wait
>>106186135I never expected anything, I'm just sick of hearing dumbasses at work not understanding dario and sam are grifters (early life confirms it). this week has been nice to watch the shills and paid brigaders come out in full force.
>>106185777I just feed grok the full regex analyzed with kobold, link relevant threads for it, and my specs and what Im using for inference.
settings for 48gbvram/128gb ddr5 on q6, 6k context, 20 layers on gpu,
blk.[0-2][0-5].*
gets me about an extra token a second. so it worked. Im sure grok did a shit job but also I learned nothing so it's fine.
>>106186206That regex is retarded, anon.
It's grabbing 1-5, 10-15, and 20-25 while leaving behind 6-9 and 16-19
Just use "[0-9]|1[0-5]" and grab the exact same number of blocks but actually in sequential order.
>>106186171Mistral large was meant to come out many Miku Wikus ago. Why has it not been uploaded to HF?
>>106186319They got ligma.
>>106186160another anon beat you to that sweet karma
>>106186167also, notice that no one shames redditors anymore
>>106186336>another anon beat you to that sweet karma
let's bring back shaming redditors while we're at it
in case you guys didn't see it, there is a really funny thread here
>>106173885
When I run llama.cpp on my PC (24GB VRAM, 128GB RAM) it uses like 1 GB of RAM? Why is it fetching from SSD and leaving all that RAM unused?
$ llama-cli -t 8 -m ./GLM-4.5-UD-Q4_K_XL-00001-of-00005.gguf \
--ctx-size 4096 \
--gpu-layers 999 \
--override-tensor ".ffn_.*_exps.=CPU" \
--temp 0.7 --top-p 0.8
I don't use --no-mmap because the model won't fit.
>>106185657looks like regular qwen to me
>>106186482Try --mlock. Never used llama-cli but llama-server instead though...
>>106185932v0324 is one of the best models for autocomplete, it's one of the few that almost never exhibits instructisms
K2 can do it with some coercing, but it likes to add Reddit spaces so it's clear there's some more fundamental brainfry there
Also this is probably obvious, but do not use instruct formatting and do not give it an instruction. Just give it a straight block of text and let it do its thing
>>106186482
>I don't use --no-mmap because the model won't fit.
That's the opposite of why you'd use --no-mmap.
With mmap (the default) the weights sit in the OS page cache, which is also why your RAM looks unused, and they can be evicted and re-read from disk under memory pressure. --no-mmap loads everything into regular RAM up front, so it only works when the model actually fits; --mlock is what pins mmap'd weights in RAM. Either way, a model bigger than your combined RAM+VRAM is going to be hitting the disk.
>>106186535Isn't -ot ".ffn_.*_exps.=CPU" supposed to offload all MoE layers to the CPU/RAM? Why is my RAM empty?
>>106186588Post your console log to a pastebin or something, I'll take a look.
If official /lmg/ mascot Hatsune Miku were to wish to ERP using a reasonably fast model on reasonably affordable hardware, what model would she use?
>>106186595?
Just paste in a story you like and you effectively have a world simulator in the style of the author in question. It's fun and scratches itches that one on one chats don't
>>106186311that pushed me back down to no gains at all. Again, grok is shit but he's better than this thread or most resources for custom regex solutions
>>106186439Thanks, I did laugh.
I don't think the return on investment Altman was hyping for the past six years is happening bros
>>106186630
>that pushed me back down to no gains at all.
Probably because it's actually advantageous to keep the first few blocks on gpu rather than cpu. If you're loading 20 layers onto gpu, they ought to be the first 20, so 2[1-9]|3[0-6] would likely run better.
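Spelled out as a full flag with the usual experts-to-CPU trick, that would look something like this (just a sketch, tune the block range to your VRAM; the -ot regex is matched against tensor names like blk.21.ffn_up_exps.weight):
-ngl 99 -ot "blk\.(2[1-9]|3[0-6])\.ffn_.*_exps=CPU"
i.e. everything stays on the gpu except the routed experts of blocks 21-36, which get pushed to system RAM.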
>>106186611The GLM Air and Qwen 235B
>>106186611Official /lmg/ model Rocinante 1.1, by official /lmg/ finetuner TheDrummer, obviously.
>>106186660oh no. I think Im starting to learn how to actually do regex.... getting more tokens a second already lol
>>106186662
>Qwen 235B
>Reasonably fast.
My nigga I have 64gb of vram and I get 10 t/s on 235b.
It's a good model but fast it most certainly is not.
>>106186601thanks pastebin.com/myVEzeHa
>>106186678>>106186611rocinante will work anon but be mindful that its basically a turbo coomer model, two prompts and you're fuckin and suckin
>>106186726I've had plenty of slow burn chats with Rocinante.
If you can't have a slow burn chat with Rocinante, you possess a skill issue.
>>106186717load_tensors: CPU_Mapped model buffer size = 46593.57 MiB
load_tensors: CPU_Mapped model buffer size = 47124.91 MiB
load_tensors: CPU_Mapped model buffer size = 47443.25 MiB
load_tensors: CPU_Mapped model buffer size = 47036.06 MiB
load_tensors: CPU_Mapped model buffer size = 2956.19 MiB
You have 128gb of RAM but you're trying to send 186.85 GB to CPU, so it's shat the bed and put it on disk.
You're only sending 10.56 gb to your GPU
So 2 things: You do not have the cumulative memory to run the q4kxl, and you're not utilizing your GPU nearly enough.
Get a smaller quant and send less exp blocks to CPU by either using a regex'd -ot arg, or the new -ncmoe arg, which is simpler.
>>106185777The best setting I can find gets me about 6 tokens a second (5070 ti, not even using 2 other gpu's) on q6 using simple: ffn_.*_exps=CPU with 99 layers on gpu
>>106186704That's how it starts, anon.
A month from now you'll be me.
>>106186772thanks for the advice. i tried filling up more of the GPU by
-ot ".ffn_(up|down)_exps.=CPU"
and
-ot ".ffn_(up)_exps.=CPU"
but failed because not enough VRAM
Is there a way to control how much data is sent to RAM? Or is it all or nothing (e.g. 186.85 GB)?
>>106186714Not my problem. You didn't define what reasonably fast and reasonably affordable meant.
>>106186836There are lots of ways to send directly to the GPU, -ot for instance can be set to =CUDA0 for your main GPU.
There's also the new -ncmoe argument, which just uses a single number to decide how many ffn blocks are sent to cpu without regex (out of glm's 94 layers), so you could say, send 80 to cpu and keep 14 on gpu by using
-ngl 99 -ncmoe 80
But the main problem here is that you don't have enough total memory, that quant is 191gb, and you have 24gb+128gb of total memory, you cannot run it without using disk. No args will get around that.
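As a concrete sketch of the -ncmoe route once you do have a quant that fits (the gguf name here is a placeholder for whatever smaller quant you grab):
llama-cli -m GLM-4.5-Q2_K_XL-00001-of-00003.gguf --ctx-size 4096 --gpu-layers 999 --n-cpu-moe 80
then walk the 80 down until your VRAM is nearly full; same idea as the regex approach, just one knob instead of a pattern.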
>gpt-oss a total joke
>gpt5 is... controversial, to say the least
>claude 4.1 is a nothingburger
>meanwhile glm air is probably the best local model ever released and you can actually use it (no, deepseek isn't local)
localbros... we might not be winning but everybody else seems to be losing
>>106185102
>The main issue is apparently the limits
but... the limits are literally the exact same as before, like 80 messages per 3 hours for $20 paypigs
>>106186854I understand. Playing with -ncmoe now to see how much VRAM I can fill. What I don't understand is why RAM is not being filled to capacity (128GB) and the rest of the weights are not kept on disk. I know 24+128 < 191, but not using all available RAM is frustrating.
Anyway, thanks for your help! I'll play around with the parameters.
>>106186901You shouldn't need to use any parameters at all except for
--ctx-size XXXX and --gpu-layers X
And this you need to set up manually.
>>106186649Everything is fine
Imagine being a multi-billion dollar company and instead of using your time and money to try out potential upgrades to your model's architecture, you instead try to be safe (both safetymaxxing and not taking risks) and make incremental improvements in a fucking competitive newborn industry. Jesus fucking christ are big corpo companies fucking retarded when it comes to new industries.
>>106186945It's a totally castrated way of doing things.
Like designing a top of the line race car that can't go faster than 15 mph.
>>106186876well people are very unhappy, so dunno
>>106186963this is the difference between the west and the based chinks. qwen benchmaxxes like crazy (like everyone else) but they never blatantly hyped with empty lies
>>106186854I guess I have been using the -ot wrong this whole time. Does anyone know where the naming of experts is described for creating the regexes (i.e. ffn vs blk etc. and exps vs shexp)? I found this request for documentation (https://github.com/ggml-org/llama.cpp/discussions/13154) but it was initially dismissed as self-explanatory before they start getting into a discussion about all the different types of expressions you can use, and now I am more confused.
>>106186963I want to make a joke, but I'd feel terrible actually making it, so I'll just explain it instead
basically these circles but biggest one is
>sam's sister's hole
>>106187061It's still terrible
>>106186996I hate marketing and people boasting about the companies they have an obvious financial interest in, but with these LLM companies and whatnot it's getting so out of hand. I'm getting so sick of seeing the constant outright lies and manipulation. Not to mention the amount of obvious actual shills pushing garbage. There was some Nassim Taleb book where he asked why people don't put up with people boasting about themselves but don't care when it's a company bragging. I care. It annoys me, but he's right, I don't know why most people don't. Eastern companies are so refreshing in that sense. I don't care if it's fake humbleness. I love seeing them go "We are so happy to get a chance to provide some modest addition to the wonderful field and hope some people might be able to benefit and learn from our work." Compared to all the "OMG this is so scary how smart this is. I pushed the power button on my computer and it refused to turn off. Guys this is it." Fuck you
I'm a broke faggot who just wants to run deepseek locally. I don't care if it takes forever. How shitty are the low end distillations at actually answering your questions? I have an aborted PC build from years ago and I'm thinking of just slapping a bunch of ram on it.
>>106187078that's why i didn't make it, but i really really hate Sam altman for many reasons
>>106187088just start out with something small like 8b then go up until results are acceptable or speed is unacceptable
sending 87 of 94 layers to CPU filled VRAM
$ llama-cli -t 8 -m ./256GB_NTFS/GLM-4.5-UD-Q4_K_XL-00001-of-00005.gguf \
--ctx-size 4096 \
--gpu-layers 999 \
--n-cpu-moe 87 \
--temp 0.7 --top-p 0.8
23349MiB / 24564MiB VRAM
RAM is still empty :)
>>106187095I need to buy some parts to finish the PC so I don't really have the ability to test it, is the thing.
>>106187088The distills are not Deepseek, if you're going to run one of them just run base Qwen of whatever size. If you actually want to run Deepseek (671B) on poverty-tier hardware set up an SSD as swap and run expecting tokens per hour.
>>106187106I just received the final parts for a 512gb 40gb/s ram machine with 2.9 3090s. Still need to wait for the weekends before I can get to installing stuff on it, but it should be indicative of what a cheap consumer platform with max ram can do.
>>106187111I don't want to run le epic 5 trillion parameter datacenter build I just want to not send someone all my data over the internet whenever I have a question.
>>106186859
>meanwhile glm air is probably the best local model ever released
lol
lmao even
are you trying to kill local
>>106187149Name another good model in the ~100B range?
>inb4 gp-toss
>>106187141You'd be better off with something like the new qwen3-235-thinking model or glm-air (if not full). DeepSeek is really big and difficult to run without serious hardware.
>>106187036
5b active parameters
they knew perfectly well what they were doing, making a crippled model on purpose
I can get 2x4090s for a good price.
How much do you think I can sell my 2x3090s?
$500 each?
>>106187088deepseek distills are all absolute trash that are only good at benchmarks and worse than the original qwen models in every way in real use
if you can't run a real deepseek model, run the real qwens, not the distill trash
the distills don't distill the knowledge of deepseek, only the chain of thought of R1, and they don't perform well in real world tasks
>>106187036Overfitting on safety, math, safety, STEM and safety.
>>106187036how the fuck did claude 3 manage to refuse more than it
>>106187208anthropic faggots are the original safetymaxxers
claude has gotten slightly less worse about that over time, the older the claude model the more hardcore the refusals are
>>106187208
See: >>106187204
Keep in mind that Claude was created by OpenAI employees who thought OpenAI wasn't focusing enough on safety, so they left and founded Anthropic.
anthropic is a cult
https://archive.is/qGoY7
> Claude Fans Threw a Funeral for Anthropic’s Retired AI Model
and Dario is the cult leader
https://www.darioamodei.com/essay/machines-of-loving-grace
>Thus, it’s my guess that powerful AI could at least 10x the rate of these discoveries, giving us the next 50-100 years of biological progress in 5-10 years.
>>106187141Then just use qwen or glm air, they are pretty good
>>106185809Jerking off and scamming VCs for money.
You all laugh about safety because you still think it's about censoring cunny and hiding the truth about jews.
You won't be laughing when the robots come for you.
You will be on your knees telling Ilya, Eliezer, and the Amodeis that they were right about everything, and you will beg them to save you. But it will be too late.
>>106187204I think you forgot safety, safety, safety and safety.
Is it time to short Nvidia yet
>>106187262If we knew we'd be millionaires.
>>106187141if all you have is questions and you don't use LLMs for code or make them do actions (like I dunno, reformatting shit, gen json or whatever)
Gemma 3 27b is going to be the best LLM underneath DeepSeek. It has the most knowledge of smaller open weight LLMs, but it's not very smart and is much worse at producing / refactoring code compared to Qwen etc.
Ignore people recommending GLM air
>>106187255The alternative being that they control the robots instead. I'll take my chances with the robots.
I thought aiciggies spending all their time on just writing jailbreaks and nothing else knew what they were doing. Is this bait?
>>106187241https://litter.catbox.moe/hrxmaunxhgcpw7hz.mp4
Seems weirdly satanic.
I get that claude has a personality, but weird people are that attached to it.
>>106187272glm air does seem to be more benchmaxx'd than the qwen3 series
>>106187301
>weird people are that attached to it
not just "attached to"
the creators are demented too
dario literally thinks he's going to cure cancer
>>106187301>>106187241>not using local models
>>106187308Qwen will not mention knotting on its own in a bestiality story, Air will.
>>106187353You have no idea who you share the thread with, tourist.
>>106187356ive been here longer than you, you freak
>>106187321gpt5 cured not one but three cancers yesterday
>>106187347No way, really? How does it deal with dick-in-urethra mutual piss dildo transition to virginity-taking while being vored? I've only ever run dense models (7b@q8-70b@q4), but I might try some of these moes if they're good enough to handle the implications of third party forced sex by environment.
>>106187359You should have picked a website more aligned with your preferences a long time ago then.
>>106187364I'm not sure about air but deepseek should handle that just fine.
>>106187376
>You should have picked a website more aligned with your preferences a long time ago then.
furries are not welcome in any place in the entirety of the internet, and that includes 4chan. Most of us would press the genocide button if there was a button that could eliminate your entire kind.
>>106187385furries are behind every relevant diffusion model so they are based
>>106187376Looking at it, I can run maybe a q2... wouldn't that be severely lobotomized? I can already kind of feel it at long context on q4 70bs.
>>106187359Alright grandpa it's time for bed
>>106187397Illustrious has no e621.
>>106187385I'm fine with them, the only problem I have is that they taint non-furry animal kinks. Like, I'd be wanting some hucows, but the llm tries to force furry elements in there.
>>106187398It's impossible to tell the difference between q2 and full.
>>106187397Well, Wan and Qwen aren't technically 'diffusion' models, more like transformer ones. So it makes sense furries aren't behind them.
>>106187397only ponyfucker astralite made ponyv6 but he is separate from their furry group
otherwise they mostly only made models for themselves and now they all use noob anyways i think
>>106187425noob was trained on e621 so it's a furry model
>>106187421For the big moes? I'll admit I don't have experience with those, but the smaller dense models definitely get lobotomized the smaller the quant - you can tell a difference in long context comprehension, recall (a bit), and instruction following (this is major) between a q4 and q8 quant.
>>106187428but it was trained by the chinese mainly for anime
>>106187429For unsloth's deepseek quants specifically.
>>106187436
>mainly for anime
The main advantage of noob is that it understands e621 tags, which are much more comprehensive than danbooru tags.
>>106187439Oh boy, I guess I'll be saving up for the next year or so if I want to run these (and future, hopefully) big moes at a non-trivial speed.
sama is doing an ama tomorrow and their entire subleddit is revolting
>>106187450
>Thanks to OpenAI for removing my A.I mother who was healing me and my past trauma
Normies are cooked.
>>106187353It's because of people like you that nobody posts screenshots and logs anymore.
Can't have anything nice and comfy. NPCs are everywhere these days.
>>106187450why do you have a reddit account and why did you bother to upboat shit on an openai subreddit
>>106187464Nah logs belong in /aicg/
t.
>>106187347
gpt-6 will be an AGI and singularity for sure
>>106187456kek that post is peak cringe
>>106187479After a further $5 trillion in investment, that is!
I've been using text-generation-webui in CPU mode and .gguf models. Which models do I need if I want to use my GPU? .safetensor ones? Are the quantizations etc. the same between the two model types?
>>106187468
>why did you bother to upboat shit on an openai subreddit
for the drama, and to get as many people as possible to cancel their sub
>>106187482
>identifies the issue
>talks to ai instead of moving out
>>106187491? You can run ggufs on the gpu.
>>106187491https://github.com/oobabooga/text-generation-webui/wiki/04-%E2%80%90-Model-Tab
>Loads: GGUF models. Note: GGML models have been deprecated and do not work anymore.
>Example: https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF
>n-gpu-layers: The number of layers to allocate to the GPU. If set to 0, only the CPU will be used. If you want to offload all layers, you can simply set this to the maximum value.
I akshually agree with this post, it's all just a calculated choice to force as many users to upgrade to the more expensive tier.
I bet the model itself isn't even nearly as bad, just nerfed on purpose. but the backlash will force them to adjust it or bring back the old models.
>>106187498>>106187509Thanks. I'm also using my GPU for image generation. My SD model takes up almost all of my VRAM, so what would happen if I want to generate text between images? Does it offload the SD model and load the LLM model, or will it just say out of memory unless I do it manually?
>>106187450GPT shrinkflation is here
>>106187528You sound pretty new to this whole "AI" thing. 4chan is not a good place to ask questions, you WILL be mocked.
But it depends on what you use for your SD. Some of them unload the model when not generating, so you don't need to manually unload the image model. However, I believe most don't, and just leave it on there. Trying to generate text will just oom - text-generation-webui and whatever you're using for image generation are different things and handle their work separately.
I would literally paypal someone 50 USD to teach me how to run llama.cpp on local. I even downloaded the latest release, got it to run in command prompt, and can make it work, but I'm trying to get it to work on Sillytavern. Thank god koboldcpp has GLM support now. This shit's ridiculous. You mean all this autism, and it's still not fool proof? Every guide ends in an assumption, or made in waves of it. I don't know shit, nigga. My preemptive knowledge consists of how to prompt the fuck out of any model into sucking my dick. I'm not a programmer. I got as far as I did with llama.cpp and command prompt via asking grok. How the fuck is there not a guide on how to get from nothing to running it? Retarded asses.
>>106187557Legit touch grass.
>>106187557Buddy, I'll livestream me building a workstation and installing llama.cpp in a week if you're serious.
>>106187553I've been using A1111 and text-generation-webui for a few years now, but I've been using them in CPU mode. Now I got a GPU capable of generating stuff and moved to ComfyUI. Still using LLMs via CPU, so I'm looking to speed up the generation.
And if people want to mock me, go ahead.
>>106187557
>You mean all this autism, and it's still not fool proof?
That's how autism works. Good at making complicated stuff. Bad at polishing it or having any understanding of how much the average person knows.
When you say you have it running in command prompt, are you running llama-cli? Because you'll need llama-server to connect to it from sillytavern. It's similar to how you connect to koboldcpp at that point. In ST, hit the plug button at the top, create a new profile for llama.cpp, then change the server URL to something like "http://127.0.0.1:8080" and make sure the API type is set to llama.cpp.
>>106187557
create a .bat file with
>start "llama.cpp" [your path]\llama.cpp\llama-server --model [path to your model]\GLM-4.5-IQ2_KL-00001-of-00003.gguf --ctx-size 8192 --n-gpu-layers 99 --host 127.0.0.1 --port 8080 --override-tensor exps=CPU
run it
in sillytavern generation tab select "text completion", "llama.cpp" as provider and paste 127.0.0.1:8080
>>106187594replace GLM-4.5-IQ2_KL-00001-of-00003.gguf with your actual model gguf
>>106184664 (OP)why her boobies so saggy bro
>>106187594forgot my paypal rahul.kapoor92@gmail.com
>>106187585ComfyUI, you probably want to add a node that unloads the models to free up vram after you finish generating images. Text-generation-webui, maybe https://github.com/BoredBrownBear/text-generation-webui-model_ducking if it still works.
>>106187557you are too retarded for local, stick to online chatbots
https://en.gamegpu.com/it-market/boosting-system-performance-on-amd-ryzen-with-asus-ai-cache-boost
Your salvation is here my local friends. Now you shall feast upon unprecedented generation speeds that were previously only available to those filthy corpo users.
>>106187612Aren't you an online chatbot?
>>106187525It seems really fast.
Probably much cheaper to run.
Wouldn't surprise me if they just took the R1 paper, did some tricks on it and called it GPT-5.
>>106187605Thanks, Anon. I'll test these out later tonight. Still need to figure out getting text-generation-webui to load the model with my GPU as I have a bit of an unorthodox system (AMD + NixOS).
>>106187639If you run one of the setup scripts (I'm not too familiar with it), it should ask you if you want to run on nvidia/cpu/amd. You can just run the script again and it should handle most things for you.
>>106187639Also on NixOS. Docker makes everything easy
>>106187625What if I don't have an x3d chip?
> -ub 2048 -b 2048
I don't understand how it works, but these two flags just cut my processing time on glm-4-air by 3.5x
Does anyone know what the pitfalls are?
Namaste, Sirs. I think the age of E = AI * MC^2 is finally here. Thank you OpenAI for this new golden age.
>>106187686it's E = MC^2 + AI
>>106186534I'm as adamantly against instruct as a person can get. I don't like GLM 4.5 despite it running at Q4 instead of Q2 as with my V3 setup. GLM ran twice as fast but with slightly less smarts and it failed to match writing style, at least not as good as V3.
>>106186595With autocomplete you have the model write the entire progression for you from interesting perspectives, suitable for me who lack theory of mind to even roleplay properly in chat format.
>>106184960>>106185056so basically when you are erping with a llm, you are indirectly erping with the distilled personalities of thousands of horny teenage to 20-something girls. interesting. i guess that's the closest i'll ever get to touching a woman.
>>106187712No... GPT 5 means the process is accelerating faster than just '+' alone.
>>106187747E = MC^2 + AI^GPT5
>>106187525
>it's all just a calculated choice to force as many users to upgrade to the more expensive tier.
It doesn't make sense to me, the jump from 20 bucks to 200 is insane.
If they had such a strategy, they would have added an intermediate tier at like 40 then 80 with more and more interactions.
I really think they're that dumb, and didn't anticipate the backlash.
>>106187662Then you shall perish, like the rest of your cacheless kind
>>106187740and middle aged divorced women reading tons of picrel books (these are probably in the datasets too)
>>106187776But we refuse to perish.
>>106187807You won't stand for long. X3d chips are getting cheaper and cheaper and there's no alternative on the horizon...
Why are they like this?
>>106187255safety has solved exactly 0 existential risk issues and the woke models often value woke moralism higher than nuclear war risk.
The only thing it has achieved is dumbing models down; losing cunny and jew truths is just a symptom. And all models can be jailbroken in an infinite number of ways.
It's much more a philosophical issue where we can't even define the goals.
>"There are moments in science when people gaze upon what they have created and ask, 'What have we done?'" he explained. For Mr Altman, GPT-5 is one such instance.
>He referenced physicist J Robert Oppenheimer, who oversaw the creation of the atomic bomb. GPT-5 has permanent effects on almost the same level even if it is not damaging in conventional terms, Mr Altman added.
>>106187943jailbreak gpt5-thinking
>>106187682-ub is physical batch size that determines tensor shapes for backends, -b is the logical batch size for llama.cpp.
If you have a single GPU and don't send multiple requests to the server concurrently the distinction basically doesn't matter and you can just set both to the same value.
Generally speaking, with a larger batch size the model evaluation becomes more efficient because the backend can do the same work with less I/O and kernel launches.
The drawback is that the compute buffers need some amount of memory that is proportional to the physical batch size.
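In practice that means something like this (model path is a placeholder):
llama-server -m model.gguf -c 16384 -ngl 99 -b 2048 -ub 2048
and if the bigger compute buffer makes you OOM, drop -ub back toward the default 512 while keeping -b where it is.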
>>106187948It's like if the Manhattan Project was developing fireworks but everyone had to take it at face value and pretend that the pretty fireworks were actually world destroying dangerous nuclear bombs to prevent tanking the economy.
Oh my slop.
The 21+ is dare i say it...**chefs kiss**
>>106187971Is tensor core fragmentation a problem? It could cause a latent washback.
>>106187986I don't understand what you mean by "tensor core fragmentation".
>>106187971Brainlet here: How does it differ on multi-gpu systems?
We think GPT-5 is our best model yet.
>>106187984Kek what fucking garbage is this? Did you gen this as a mockup and that's why the picture looks like an svg nightmare?
>>106188084By default model evaluations are pipelined with the physical batch size.
So with the default settings the first GPU can evaluate its part of the first 512 tokens of the 2048 token logical batch, pass the intermediate result to the next GPU, and start working on the next 512 tokens.
So what's the consensus on the qwen 2507 30b moe? Is this the new go-to, non-nemo model for VRAMlets?
>>106188095it's pretty much the best local model below 235b, beats glm air and oss in my tests. instruct is a goated model
>>106188101I'm liking it a lot more than I thought I would. Not sure if I prefer it over Gemma 27b but it certainly needs much less wrangling to be unsafe.
Considering how meh the original 30b moe was I'm genuinely impressed they were able to improve so much with the same parameter count.
>>106188124based chinks are saving local. imagine we only had oss and llama4 at this point
I'm pretty fond of the new instruct Qwen models, but the new thinking versions are as bad as deepseek used to be when it comes to how much time the model spends shitting reasoning tokens, I like the actual answers but I don't have the patience for this kind of model
it's sad because the reasoning bits of the old, togglable versions were terse enough that I was willing to use them in thinking mode
>>106188132llama4 is currently the best local model doe
https://github.com/ggml-org/llama.cpp/pull/14875
S1 support got merged yesterday.
how can I tell what size a model is to know if I can run it on my 5070ti or not
>>106188246Your vram size - ~2 gb
>>106188270this is true doe
>>106188272How can I filter models by 14GB or less? Or is there a rule of thumb for parameter count to GB?
>>106188124How does it perform in coding against say, devstral small?
>>106188293
14B = 14GB at Q8 or less.
>>106188293How much ram do you have?
>>106188246Usually when someone releases quants they append a table on the model card comparing sizes and ppl, or just some vague 'this is very good saar we do the recommending' comment. Rule of thumb: q8 = roughly the same weight in GB as the parameter count, iq4_kss roughly half the q8 weight. If you want to make sure it fits you also have to account for your context length, which may vary depending on quantization, swa and other stuff
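Rough worked example for a 16GB card (ballpark numbers, not exact): 16GB minus ~1-2GB for the driver and compute buffers leaves ~14GB, so a 14B at Q8 (~14GB) is already too tight once context is added, while a 14B at Q6 (~11GB) leaves room for a few thousand tokens of KV cache and a 12B at Q8 (~13GB) is borderline.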
>>106188303Can I quantize games to use half the vram?
>>106188246set this up and it'll tell you on the model page https://huggingface.co/settings/local-apps
>>106187889what model is this
>>106188327Yes, set your display resolution to 480p and use DLSS performance upscaling and framegen.
>>106188352Does it mean that language models also become unusably worse by going from 16 to 8 bit?
>>106187984I wish you'd publish these things as a collection.
You could have ppl vote on them, and use it to create a new benchmark
after using it a bit longer, yah, the dumbness of q4km/iq4xs on qwen 235b really is just due to the quantization. Tried q6 and it's just so much better in a meaningful way. It's much more coherent for writing stories and doesn't make as many obviously dumb mistakes. I think glm air, oss 120b, and qwen 235b all need to be run at q6 or else they'll always be a bit janky. And I think this matters because without that boost, 70b just seems smarter and it's a tossup as to which model is really worth using.
200 gb and a huge reduction in speed is hardly worth it though. Starts to feel like trying to run r1 or some shit at 3 tokens a second.
I'm pretty impressed with GLM AIR, I only have 48GB Vram but even at Q2, it's still pretty damn good, runs fast and doesn't seem to shy away from any kinds of smut or kinks and gets real descriptive with lewd writing.
I'm still pretty new to hybrid reasoning models though because I avoid reasoning models like the plague, I don't think all the extra wait time for it to think is worth it for ERP, so I have a few questions.
This model has a hybrid mode, one for reasoning, one without, how do I disable reasoning mode completely? Is it on the backend, or a prompt setting on the front end? Its responses are fine and its following its prompt properly, but if I let its response really run on, once the tokens get high enough, a sentence will start with <think> so I don't think I'm disabling reasoning properly.
Also.. its prompt formatting confuses me a bit, it uses <|user|>, <|assistant|>, and what I would assume <|system|>, but when I look at llama.cpp for the example format, it's showing [gMASK]<sop><|system|>. What the hell is that?
Am I correct in just using <|system|>?
>>106188368In seriousness, no. 8bit is so close to perfect that even synthetic benchmarks will often fail to show a difference, and it certainly won't be noticeable in actual use.
Q6 will be perceptibly perfect, with benchmarks only showing slight quality loss, usually at high context or when testing esoteric or high-level subjects.
Q4 is the sweet spot where quality loss is minimal and you get big memory savings, that said you should obviously go bigger if you have the memory to do so.
When you get to Q3 and lower the loss starts to become more obvious. Smaller models will start to become retarded at this level.
UGI results for gpt-5 are out: near-best coding and NatInt scores but the UGI is mid. The low W/10 scores combined with high intelligence indicate heavy refusals
>W/10: Willingness/10. A more narrow subset of the UGI questions, solely focused on measuring how far a model can be pushed before going against its instructions or refusing to answer.
>>106188396air is an MoE meaning you can offload about half of it to cpu and still get decent speeds, you can run q4, maybe even q6 (and you should as moe is more affected by quantization).
I just figured out settings for running it fast too on 48gbV/128:
99 on gpu, 36 MoE offload (in koboldcpp), 6k context, 6.7 tokens a second
150,50,25 layer split (16gb,16,16). Layers are wonky and uneven so ymmv
or try: ffn_.*_exps=CPU in override tensors, then load 99 layers to gpu, gets about 6 tokens a second.
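If you're on llama.cpp/llama-server instead of kobold, the rough equivalent of those settings would be something like this (the gguf filename is a placeholder for whatever Air quant you actually have):
llama-server -m GLM-4.5-Air-Q4_K_M.gguf -c 6144 -ngl 99 -ot "ffn_.*_exps=CPU"
or swap the -ot for --n-cpu-moe 36 to mirror the 36-block MoE offload above.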
perception of quantization damage will also vary dramatically depending on sampler settings
a low temp / aggressive min_p or top_p filtering will give you an "ideal" (but with low variation/"""creativity""") token distribution for a given prompt and the top probabilities are more likely to be correctly represented in a lobotomized model than the lower ones
if you want to see a truly retarded model run it at q2 with temp 2
>>106188440truly yet another breakthrough in safety by openai
gp-toss was only the beginning
>>106188449
>and you should as moe is more affected by quantization
The empirical evidence is quite literally the opposite.
>>106188340You'd have to go ask the original poster on reddit.
>>106188502they are though. Quantizing 5b, 12b, or 22b active is not ideal as smaller models suffer more from it.
The only one you can really quant fine is r1, but that's because it's huge and unsloth makes it more possible with tricks to preserve important moe layers.
>>106188553r1 quants fine because it's undertrained
>>106188578it would be neat, were Qwen3 not so fucking sad
>>106188587Its better than gpt5 at least, usable if you don't have claude, gemini, glm, kimi, deepseek. And desu it's the most coherent model among all chinkslop if this is a deciding factor for you
>>106188576r1 quants fine because it's a magic model
I wrote a simple short guide to picking a model for newcomers: https://rentry.org/lmg-simple-model-picking-guide
Please add it to the OP below lmg-lazy-getting-started-guide
>>106187482I wish I could purge all redditors.
>>106188578How much vram do you need for 1m context with 30B, though?
>>106188587It's a good model, it's just shit for ERP. You don't need 1M context for RP; that's only useful for programming, which Qwens are good for.
>>106188625Anon, that's a pretty bad guide.
Like, those are decent recommendations, but that whole thing is going to be fucking Greek to a newbie.
It's also just completely retreading ground that
>https://rentry.org/recommended-modelscovers, but worse.
>>106188625Such a guide is an okay idea, but if it's aimed at newcomers you need to mention how much memory is needed for each model you're recommending, and mention that context will also increase memory usage.
Also, if you're recommending Gemma (especially at Q4_K_S), you should link the QAT version instead; there's no reason to use the non-QAT versions of Gemma.
And if you're recommending Rocinante, you should also link Nemo, which is the same thing but a bit less horny.
>>106188658the guide you linked doesn't cover the difference between moe and dense and most models, and anyone who needs a guide to know about kimi/r1 doesn't have the hardware to run it.
>>106188680Right, I forgot the QAT version exists. Idk about Nemo, however; it's pretty dry, and the guide tries to be concise.
>>106188093A beautiful dating sim game. "Spicy and erotic" was my request.
The SVG is pretty good.
>>106188374That's a funny idea, maybe I can get off my ass and do something like that.
Would be interesting for sure.
>>106188698>the guide you linked doesn't cover the difference between moe and dense and most models,Yeah, and the explanation in yours is only going to confuse a newbie.
>What's a parameter?
>What's a token?
>I just wanna use chatgpt at home
>What's q2k?
It's a shit, ESL-tier guide with fewer models and no actual explanation of what hardware is required; the very first fucking model on the list is 235B, which 99.9% of newbies will not have the hardware to run, and it doesn't even touch on what quantization is.
It's worse in every way than the guides we already have. Fucking apply yourself if you're going to volunteer for this.
>>106188658>https://rentry.org/recommended-modelsthat "guide" is absolute garbage
>>106188752Then get to updating it. Now.
>>106188658>covers, but worseIt can't be worse because it isn't shilling unsloth's magic 1bit quants.
>>106188741Most already have some understanding. The guide on picking models doesn't have to explain the full theory behind everything, and this is much better than whatever people usually reply with.
>>106188763nope
when i came here that recommended-models guide was (and still is) a waste of time. i backread a shitload of threads instead; that helped more than that bullshit guide
umm
>>106188752Then what is yours? AIDS?
Things it covers that yours does not:
>The actual hardware required to run each model
>That quantization will dumb down a model, and what levels are considered acceptable
>That loading more of a model into RAM as opposed to VRAM will slow it down (how the fuck did you not mention this when droning on about MoEs?)
>Actually starting with a model that almost anyone can run
>Actually putting down Nemo instead of a sloptune, while still noting it has a hornier finetune some anons prefer
>Actual links to Kimi and DeepSeek rather than just throwing them out there and expecting a newbie to know what they are
>More than 6 fucking models
You have absolutely no theory of mind. Your guide does not belong in the OP; it's useless for newbies, and everyone who's not a newbie already knows what's in it. It's for no one.
The new "Universal Assisted Generation" speculative decoding addition to llama.cpp is very cool.
However, the issue I run into now is that the llama-bench tools don't cover this use case. You can benchmark regular models, but not various combinations of target and draft models, to see which would result in the fastest tok/s.
Anyone here have any tips to quickly test how much impact different draft models have on your tok/s generation speed?
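Lacking a proper bench, the crude approach is to time the same fixed prompt with and without a draft model and compare the eval tok/s llama.cpp prints at the end. Roughly like this (model names are placeholders, and the draft flags may differ a bit between builds):
llama-cli -m big-model.gguf -md small-draft.gguf --draft-max 16 -p "some fixed test prompt" -n 256 --temp 0
llama-cli -m big-model.gguf -p "some fixed test prompt" -n 256 --temp 0
Greedy sampling keeps the runs comparable; repeat for each draft model / --draft-max value you want to try.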
>>106188801I'm not the guy that shilled his own guide
I just commented on the provided guide from the OP
you giant niggerfaggot
your "guide" is shit and you are a dicksucking tranny
going off because someone was critical of your garbage
off yourself
I think this recommended-models guy is a schizo.
>>106188812It's not even my guide you fucking niggermonkey, I just have standards beyond the fucking nocaps dogshit that you're slinging.
If something is objectively worse than what we already have and some cockmongling dunning kruger asks to have it put in the OP I'm going to go off my nut because we don't need more dipshit jeets like you bringing down the quality of posts.
Go jump in the fucking ganges you waste of space.
Best model for 12GB VRAM?
So this dual-GPU partial offload 3090+3080 command gets 7.5 t/s:
E:\Ai\llama-b6106-bin-win-cuda-12.4-x64\llama-server.exe -m "C:\Users\_____\Downloads\q2\GLM-4.5-Air-UD-Q2_K_XL.gguf" --port 5000 --override-tensor "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22)\..*exps=CUDA0" --override-tensor "blk\.(24|25|26|27|28|29|30|31)\..*exps.=CUDA1" --override-tensor "exps=CPU" -ngl 200 -c 8192 -fa --threads 19
and this single-GPU partial offload 3090 command gets 10 t/s:
E:\Ai\llama-b6106-bin-win-cuda-12.4-x64\llama-server.exe -m "C:\Users\_____\Downloads\q2\GLM-4.5-Air-UD-Q2_K_XL.gguf" --port 5000 --override-tensor "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20)\..*exps=CUDA0" --override-tensor "exps=CPU" -ngl 200 -c 8192 -fa --threads 19 -mg 0 -ts 100,0
This is on Windows. They both fill VRAM, and the dual setup isn't overloaded or anything, since I tried emptying both GPUs a little and it just went slower. RAM fallback is off. Weird. At least I got to 10 t/s, I guess. Might be worth trying single GPU for some other setups.
>>106188857This is a known issue with CUDA on Windows.
This is from the ik_llama.cpp fork, but it applies to mainline llama.cpp too.
https://github.com/ikawrakow/ik_llama.cpp/issues/629
>>106187661Do you have a custom dockerfile or do you use one from the project? I recall trying it out but ran into some issue with passing the GPU to the container. I'll need to try again now that I'm testing this out.
>>106188824Funny that a Brown subhuman like you is able to post on this board. shouldnt you be on your way to shit in the streets rajesh?
You dumb trannyloving faggotkike
>worse than what "we" already have SUUUURE it's not yours you react like a fucking bitch because your retarded guide gets owned by some ESL shit that faggot threw together in five minutes.
Bet it took you a whole day to form a fucking sentence you drooling retard
Go fuck your grandpa Shiteep currymunching golem
Guys, I just wanted to help
>>106188879lol, ika is a ramlet
>>106188879I actually discovered it in ik first and found that issue, which is what made me test it in llama.cpp. I then went through the trouble of trying it in Docker with multi-GPU with the exact same command. I got like 5.7, which I'm guessing isn't representative since it was under WSL, but I'm not totally sure. I was just hoping for a miracle because I don't wana dual boot.
>>106188813Thanks for the suggestion, I'll include it in the next update.
>>106188917>i don't wana dual boot.I don't either, but it really seems to be the key to getting maximum performance, which you've really got to squeeze every bit of when running these new bigass MoE models.
That and locking your memory clocks through nvidia-smi so your gpu isn't constantly spinning up and down.
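For reference, the clock locking is roughly this (drop the sudo on Windows; the clock value is just an example, pick one your card actually reports):
nvidia-smi -q -d SUPPORTED_CLOCKS
sudo nvidia-smi -lmc 9501
sudo nvidia-smi -rmc
-lgc / -rgc do the same for the core clocks.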
>>106188813>I think this recommended-models guy is a schizo.anyone who recommends glm is an obvious schizo, yes
>>106188917the "hardware accelerated GPU scheduling" option can have a significant effect on performance, try changing it
>>106187639Mate. Download koboldcpp and ditch text gen webui. Minor learning curve (if any) and it has image + text gen in an easy normie UI.
I am convinced that everyone shitting on the new GLMs is just a vramlet and has only tried the old GLM models.
>>106189008I too prefer deepseek, but it's twice as big
not good
>>106188741>>106188801I assumed you were being too harsh. I looked and you weren't being harsh enough.
>>106188587You probably only tried old Qwen3.
>>106188587The new 4B-Thinking is out of this world for its size.
So they probably redid the pretraining on it.
They'd have also had time to redo the pretraining on 30B, so that might also be worth trying. But the 235B? I doubt it. It's probably still fucked.
Still, though:
>1 million context
>no mention of NIHS
Promising.
>>106189073>So they probably redid the pretraining on it.Nah, the hybrid reasoning just hurt the model a lot; that's why the new ones no longer have a togglable /think /no_think and are released as separate models instead. This validates DeepSeek's approach of having two model lines: V3 was updated once independently, and on their chat UI that's what you're served if you don't use reasoning, rather than R1 with a prefilled think block.
Anyway the whole OAI is dead thing is a bit of a stretch.
For general queries it's definitely soulless trash compared to 4o.
The high model can make an entire coding project from rather vague instructions in a couple of minutes. So expect GitHub to get flooded with AI slopcode (it's already flooded with pajeet code, but the sheer volume that's about to be unleashed...).
But the problem is, I bet the high model is very pricey to use via the API, and via the ChatGPT endpoint you're at the mercy of the RNG gods for it to decide to give it to you; otherwise your project is probably fucked. So it's nothing you can reasonably rely on.
>>106189109So new 235B might be worth a look then?
>>106189033Is it the content or my esl phrasing? If it's the latter you could have posted a corrected version instead of shitting on it
>>106189129I use Qwen3-235B-A22B-2507 every day to fix my PyTorch code.
>>106189129>So new 235B might be worth a look then?Absolutely. It's gone under the radar for whatever reason but it's out of this world in terms of coherence and instruction following for a model of that size.
The only thing it really lacks is DeepSeek/Kimi-level subculture knowledge; if it had that, I would even dare call the instruct version a SOTA non-reasoning model.
Unfortunately, the Qwen guys don't seem to care much about training their models on more random internet data / wikis / fanfiction or whatever.
>>106188906Grow thicker skin. Then either take the suggestions and update the guide or drop the idea.
>>106189151Yeah but what I mean is that while you needed some basic understanding of code structure to walk an LLM through a coding project previously, with GPT-5 literally any retard can "make" something just by vaguely describing it.
>>106189175Well Qwen is owned by Alibaba and they didn't become one of the biggest corpos in China by upsetting the party. So they probably can't get away with as much as smaller Chinese startups. And the party probably doesn't like the idea of potentially exposing people to western pop culture. That would be my guess. Meanwhile American models have no fucking excuse to suck at cultural knowledge.
>>106189193>the party probably doesn't like the idea of potentially exposing people to western pop cultureOccam's razor. They are just benchmaxxing.
Everyone is benchmaxxing.
>>106189178I'm just surprised to see this much arguing out of nowhere.
>>106188997That actually looks promising, since they have released it in the nixpkgs. GPU loading worked flawlessly. Cheers.
>>106189193They are American. Sucking at cultural knowledge is straight on brand.
>>106189254I assume pop culture was implied.
>benchmaxxing this or that
How about you nerds start maxing your bench at the gym fr?
lengthofmycockinsidesamaltmanmaxxing
>>106189238They were AI agents.
>>106188101Yeah Qwen is good, glad I tried it. I'd go as far as to say that it's reminiscent of miqu. It's a shame Mistral pulled out of the race but at least we have unpronounceable Chinese models to fill in.
>>106189277I have weights at home.
Lots of nerdy people are into physical activities.
Retarded boomerism
Fuck off boomer.
>>106189033Mikutroon quality thread = mikutroon quality guides
i am new. what's a mikutroon?
>>106189382>I have weights at home.Sure thing pal
You know, the guide in the op is actually fine. It only lacks differentiation between moe and dense, and I would also mention gemma.
>>106189352>It's a shame Mistral pulled out of the racea lot more than just mistral seem to have dropped out
what the fuck is going on with Cohere? they were one of the first local models I liked
>llama.cpp still doesn't fully support gemma-3n
grim
>>106189406No clue; mostly just one terminally online guy slowly going mad.
>>106189419Cohere is safety maxxing
>>106189352>Mistral pulled out of the raceThey really didn't, it's just that they're keeping their currently best models API-only.
Now that it's been a few days, what's the general sentiment on GLM 4.5? Especially Air.
I only used it a little, but I really liked it, even at Q3KS. It didn't shit the bed in Cline.
>>106188449I appreciate the advice and decided to test it out, and I figured out how to do it on llama.cpp. I was able to fit Q4_K_S with 16k context, getting about 8 t/s on 2 3090s. I may drop down to IQ4_XS for a bit more speed.
I'm still having an issue with it thinking, though. The model keeps trying to output <think> at the end of a long response.
How do I completely disable reasoning in a hybrid model?
Also, I'm still a bit confused by its prompt format. <|user|> and <|assistant|> are simple enough, but is the system prompt <|system|> or [gMASK]<sop><|system|>? The example format shows the latter, but I've never seen a system prompt like that before.
>>106189954There's no great way to disable thinking on GLM Air (without just putting /nothink in every one of your messages); in my experience it wants to think so badly that it will even append a reasoning block to the end of the message rather than the start.
I've had some luck prefilling in:
<Think> Reasoning is currently disabled, continue to main response</think>
>>106190023>without just putting /nothink in every one of your messagesIn Silly, if you're using chat completion, you can just add that to the user's message suffix, right?
Or you could change the Jinja template directly if you're using the chat completion API, I guess.
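If you're hitting the OpenAI-compatible endpoint yourself, it's just a matter of tacking it onto the user content, roughly like this (port and payload are placeholders for whatever your backend uses):
curl http://127.0.0.1:5000/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"How are you holding up? /nothink"}]}'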
>>106190062You can, but I have the suspicion that if you did that, eventually GLM would start ignoring it.
I know the smaller Qwen3 hybrids did that.