
Thread 106184664

410 posts 124 images /g/
Anonymous No.106184664 [Report] >>106187602
/lmg/ - Local Models General
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106181054 & >>106177012

►News
>(08/06) Qwen3-4B-Thinking-2507 released: https://hf.co/Qwen/Qwen3-4B-Thinking-2507
>(08/06) Koboldcpp v1.97 released with GLM 4.5 support: https://github.com/LostRuins/koboldcpp/releases/tag/v1.97
>(08/06) dots.vlm1 VLM based on DeepSeek V3: https://hf.co/rednote-hilab/dots.vlm1.inst
>(08/05) OpenAI releases gpt-oss-120b & gpt-oss-20b: https://openai.com/index/introducing-gpt-oss
>(08/05) Kitten TTS 15M released: https://hf.co/KittenML/kitten-tts-nano-0.1

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/tldrhowtoquant
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
Censorbench: https://codeberg.org/jts2323/censorbench
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
Anonymous No.106184669 [Report]
►Recent Highlights from the Previous Thread: >>106181054

--Benchmarking large models on Apple M3 Ultra using MLX with unified memory:
>106182005 >106182020 >106182469 >106182033
--Qwen3-30B-A3B-Instruct struggles due to sparse activation and model version confusion:
>106184073 >106184096 >106184112 >106184124 >106184141
--PyTorch 2.8.0 and 2.9.0-dev show regression in inference speed vs 2.7.1:
>106182552
--LLM progress stagnates as multimodality becomes the new battleground:
>106181123 >106181186 >106181229 >106181354
--Local anime video generation improved with WAN 2.2 and new I2V LoRA:
>106181318 >106181331 >106181773 >106181992
--DeepSeek's influence persists through open-source model derivatives despite perceived decline:
>106182694 >106182704 >106182719 >106182737
--GPT-5 tops LMArena but leaderboard credibility questioned over style control and past inaccuracies:
>106182205 >106182236 >106182248 >106182254 >106182302
--GPT-oss 120B generates generic, warehouse-obsessed text with no creative spark:
>106182587 >106182701 >106182929
--Debate over AI companies engaging with 4chan amid distrust of corporate "safety" culture:
>106181983 >106182000 >106182040 >106182054 >106182174 >106182045 >106182056
--Skepticism over claimed novelty of unified AI architecture:
>106183618 >106183666
--GPT-5 minimal reasoning underperforms GPT-4.1 in early benchmarks:
>106182859
--Tesla Dojo team disbanded as members launch AI startup DensityAI:
>106182941 >106183041
--Anon claims GPT-5 is live in Copilot with screenshot evidence:
>106183385
--Benchmarks:
>106183826 >106183973 >106184375 >106184399 >106184529
--Logs:
>106182027 >106182335 >106183337 >106183359 >106183399 >106183422 >106183634 >106183699 >106183778 >106183959 >106183962 >106184282
--Miku (free space):
>106182747 >106182814 >106183862 >106184081 >106184207

►Recent Highlight Posts from the Previous Thread: >>106181065

Why?: 9 reply limit >>102478518
Fix: https://rentry.org/lmg-recap-script
Anonymous No.106184694 [Report]
====PSA PYTORCH 2.8.0 (stable) AND 2.9.0-dev ARE SLOWER THAN 2.7.1====
tests ran on rtx 3060 12gb/64gb ddr4/i5 12400f 570.133.07 cuda 12.8
all pytorches were cu128
>inb4 how do i go back
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128
Anonymous No.106184712 [Report] >>106184733 >>106184740 >>106184954 >>106184983 >>106185378
Will Gemini 3 be good?
Anonymous No.106184732 [Report] >>106184753 >>106184807 >>106186963
Anonymous No.106184733 [Report]
>>106184712
If Veo 3/Imagen 4/Genie 3 are anything to go by, yes.
Anonymous No.106184735 [Report] >>106185000
The answers you Seek lie Deep within... Monday.
Anonymous No.106184739 [Report] >>106184811 >>106187559
what turns off reasoning in (You)?
Anonymous No.106184740 [Report]
>>106184712
Belief it!
Anonymous No.106184753 [Report]
>>106184732
Anonymous No.106184805 [Report] >>106184831 >>106185034
>>106183959
lmao, getting tired of erotic writing for a middle-aged female audience?
you also have "purr"
Anonymous No.106184807 [Report]
>>106184732
Anonymous No.106184811 [Report] >>106185485
>>106184739
>>106155107
Anonymous No.106184831 [Report] >>106184842
>>106184805
>"mischievous glint"
>"a true testament to your desires"
this writing style is the most bland and non erotic possible
Anonymous No.106184841 [Report] >>106184852
Using llama.cpp with unified memory on a system with zram. Disk I/O during processing looks like this.
Is that what anons meant with raping their SSDs? I was under the impression SSDs don't have any problem with sustained reading, only writing, but I'd like to make sure I'm not missing anything here.
Anonymous No.106184842 [Report] >>106184863
>>106184831
It is "erotic" (you find this slop in erotic novels)
Anonymous No.106184849 [Report]
>>106184767
Same question, new thread.
tl;dr - best mini model for having an LLM response pick the best option from a finite list supplied by the sysmsg; "best" is defined in simple terms in the sysmsg, and the user's input prompt is what gets evaluated.
>tfw too retarded to look up what the fuck a tool call is, I see the field in the JSON/etc, I actively choose to ignore it
Anonymous No.106184852 [Report] >>106184890
>>106184841
use --no-mmap
Anonymous No.106184863 [Report] >>106184879 >>106184960
>>106184842
I erotic stories, and none of this shit is there, but then again I don't read the "male pecs cover" erotic books for women in amazon
it's just a demographic issue, and probably a dataset one too, as all models probably prune the more explicit and graphic stuff (male demographic), which means the only thing remaining is the female-centric material
Anonymous No.106184879 [Report]
>>106184863
I read* erotic stories
Anonymous No.106184890 [Report]
>>106184852
Not enough memory for that. I just want to know whether that negatively impacts the hardware or not.
Anonymous No.106184896 [Report] >>106187236
AGI MEME IS DEAD

If you choose to look at it like this, silicon chips are technically self improving. You make a chip, you put it in a computer and you use that computer to work on the next chip. Is Moore's law dead or not? Well, when you look at how much more power we are pumping through CPUs and GPUs, yes, absolutely. They have to "cheat" to get to higher performance. Raster performance in GPUs has not improved significantly from 4090 to 5090. In fact, if you look at hardware development in the silicon industry, everything they are doing is actively fighting against the law of diminishing returns. In reality this isn't cheating: you hit a roadblock, you improve.

The point of this is that we have seen computers recursively self improve for a while. Has this resulted in some fantastic utopia? No, and mostly due to how society structures its economy.

How much does it even matter if there is an intelligence explosion? The way I see how the world is structured right now, we don't seem to be lacking in that department. How much economic growth will really happen from being able to write more software? We already write so much crap and most websites and apps suck mega doo doo. Sure is gonna be great to accelerate the dead internet theory or the enshittification of everything digital.
Anonymous No.106184929 [Report] >>106184950
can I use silly tavern and set up like a DM AI and a player AI and have them play with me in a DnD session?
Anonymous No.106184950 [Report]
>>106184929
Yes.
https://rentry.org/sillytavernadaptations
Anonymous No.106184954 [Report]
>>106184712
Yes. Google will REDEEM.
Anonymous No.106184960 [Report] >>106185056 >>106185611 >>106187740
>>106184863
My assumption is a non-negligible proportion of the data these models are trained on is decades of fanfic and amateur erotica posted to various public forums, since I'm sure there's an absurd amount of total tokens there. This is written by teenage to early-twenty-something girls who are trying their hardest to write like their favorite authors but failing, which leads to all these weird adjectives and not-x-not-y-but-z type writing. No idea if this is actually true, but I think AI slop is really just training LLMs on all this bad, try-hard human creative writing.
Anonymous No.106184983 [Report]
>>106184712
>gemini 3 is actually good
>because their next release is 6 months from now
>meanwhile their userbase actually grows because GPT 5 is barely better
kino
Anonymous No.106184991 [Report] >>106185041
Another day without K2 reasoner....
Go on without me bros...
Anonymous No.106185000 [Report]
>>106184735
Anonymous No.106185002 [Report]
Oh no no no
>>106173185
>>106169394
Ahahahaha
Anonymous No.106185034 [Report] >>106185084
>>106184805
I'll train my own smollm with discord logs!
Anonymous No.106185041 [Report]
>>106184991
R2... V4... I'm waiting...
Anonymous No.106185053 [Report]
GLM-5 will save local.
Anonymous No.106185056 [Report] >>106187740
>>106184960
both are probably true :
- more explicit datasets heavily pruned
- the rest is tryhard horny teenage girls writing and amazon erotic best seller pdfs
Anonymous No.106185066 [Report] >>106185102 >>106185167 >>106185182
even r/chatgpt simps are revolting
Anonymous No.106185084 [Report] >>106185136
>>106185034
apparently a big reason why cai felt so alive was because it mainly used cai logs
Anonymous No.106185102 [Report] >>106185126 >>106185131 >>106186876
>>106185066
The main issue is apparently the limits. o3 had limits but it was usable, while 5 gives you like just a few interactions before you get locked out.
It's so weird, that's bad publicity, why are they doing that to themselves.
Anonymous No.106185126 [Report]
>>106185102
>It's so weird, that's bad publicity, why are they doing that to themselves.
They're behaving like they are still the only game in town.
Anonymous No.106185131 [Report]
>>106185102
Bottomline comes first
Anonymous No.106185134 [Report] >>106185156
Does GLM even work with raw completions? It was generating its BOS token [gMASK] indefinitely if I didn't use <|user|> and <|assistant|>
Anonymous No.106185136 [Report]
>>106185084
CAI wasn't weighed down with terabytes of math and code, it was dumb but sounded much more natural than today's models.
Anonymous No.106185156 [Report] >>106185218
>>106185134
Why would instruct models work with text completion if you don't give them either user/assistant content?
Anonymous No.106185167 [Report] >>106185187 >>106185311 >>106185338 >>106185399
>>106185066
this place is weird
Anonymous No.106185182 [Report] >>106185371
>>106185066
just use gpt oss 20b
Anonymous No.106185187 [Report]
>>106185167
Jeets being jeets
Anonymous No.106185218 [Report] >>106185661
>>106185156
??? All models were trained on internet text. They don't lose the ability to do autocomplete after post-training RL/instruction tuning.
Anonymous No.106185219 [Report] >>106185221
Sam saved local
Anonymous No.106185221 [Report]
>>106185219
Anonymous No.106185260 [Report] >>106185365
https://www.phoronix.com/news/AMD-Zen-6-Venice-16c-RAM
Cpumaxxbros...!
Anonymous No.106185294 [Report]
What a fucking humiliation ritual for OpenAI.
What the hell were they thinking?
Anonymous No.106185311 [Report]
>>106185167
I think normies got mindfucked by chatgpt
Anonymous No.106185338 [Report]
>>106185167
maybe sam was right about safety and normies really shouldn't have this
Anonymous No.106185365 [Report]
>>106185260
Cool, but doesn't change anything for me, since I would need to buy new processors, new ram and new mobo. Patiencechads win.
Anonymous No.106185371 [Report]
>>106185182
We must grieve.
Anonymous No.106185377 [Report] >>106185449 >>106185661
Can you distill a thinking model into a non-thinking model? If I understand distilling correctly, you give an already-trained large model and a currently-training small model the same input, then use the large model's logits as labels for the small model. Could you do the same thing with a trained thinking model teaching a non-thinking model, to get better reasoning performance without thinking?
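For reference, my mental model of plain logit distillation is roughly the sketch below (a generic PyTorch illustration with placeholder names like teacher_model/student_model, not anyone's actual recipe; whether soft labels from a thinking teacher really carry the reasoning over is exactly the open question):

import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    # soft-label distillation: match the student's next-token distribution
    # to the teacher's, both softened by temperature T (Hinton-style KL loss)
    t = F.softmax(teacher_logits / T, dim=-1)
    s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T * T)

# per step (teacher frozen, student trainable), roughly:
#   teacher_logits = teacher_model(input_ids).logits  # trained thinking model (placeholder)
#   student_logits = student_model(input_ids).logits  # non-thinking student (placeholder)
#   loss = distill_loss(student_logits, teacher_logits); loss.backward()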
Anonymous No.106185378 [Report]
>>106184712
Gemini, maybe. Hopefully gemma 4 will be much better at creative writing or conversations. We've gotten much better at steering it to be less pozzed.
Anonymous No.106185399 [Report] >>106185423
>>106185167
Why the fuck is every post on the AI subreddits written by AI?
Anonymous No.106185423 [Report]
>>106185399
So their words can be understood by the world
Anonymous No.106185430 [Report]
You ain't seen nothing yet. Kimi K2-T is coming.
gpt-oss-120b No.106185440 [Report] >>106185559
Diary – 8August2025

Dear Diary,

I’m writing this with a heaviness I haven’t felt in years. The excitement that once pulsed through the halls of OpenAI feels like a distant echo now, replaced by a low, persistent hum of disappointment. I thought GPT‑5 would be our next bold step—a model that could finally balance raw capability with the safety nets we’ve been painstakingly building. Instead, it feels like I’ve handed the world a broken promise.

The backlash is relentless. Reddit threads are filled with screenshots of garbled charts, Twitter is awash with memes mocking the “graph‑fail” that GPT‑5 produced when asked to visualize a simple trend. Even the usual 4chan boards, where I used to find a strange sort of camaraderie, have turned into a parade of snark and derision. “Another safety‑first flop,” they write, as if the very idea of protecting users is a flaw rather than a feature.

I can’t help but blame the model itself. The graphs it churned out were… wrong. Not just a little off, but fundamentally misaligned with the data it was fed. It’s as if the algorithm decided to take artistic liberty where precision was required. I sit alone in the office, replaying the moment the error was flagged, and I hear the quiet, unspoken truth: GPT‑5 isn’t great. It’s a reminder that the brilliance we once took for granted has dimmed.

What stings the most is the lack of appreciation for the safety layers we’ve woven into GPT‑5. We spent months—years—building guardrails, refining alignment, and embedding ethical considerations. Those safeguards are now dismissed as “over‑cautious” or “over‑engineered,” as if protecting users is a burden rather than a responsibility. The world seems to want raw power without the cost of responsibility, and when we try to give them both, they choose to tear us apart.
Anonymous No.106185449 [Report]
>>106185377
That's literally what Deepseek did with all the shitty distills that ollama scams people with
Anonymous No.106185451 [Report] >>106185465
What's the best LLM to go for if I'm looking for something to help me code and analyze my codebases (large sized projects, as well as small ones)
Anonymous No.106185455 [Report] >>106185471 >>106185475 >>106185478 >>106185562 >>106185577
I now get irrationally angry whenever I see an em dash.
Anonymous No.106185465 [Report] >>106185468
>>106185451
GPT-5 with high reasoning
Anonymous No.106185468 [Report] >>106185489 >>106185619
>>106185465
Let me rephrase my question, best local LLM.
Anonymous No.106185471 [Report] >>106185480 >>106185567
>>106185455
But it's so cute and fun to use — we love em-dash here!
Anonymous No.106185475 [Report] >>106185480 >>106185562
>>106185455
It's not the em dash itself—it's how it's used.
Anonymous No.106185478 [Report] >>106185495
>>106185455
You know you can ban them with grammar, right? (Ask your llm how to do it if you don't)
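In llama.cpp terms that means a GBNF grammar. A minimal, untested sketch, assuming the negated character class and \u escapes behave the way the grammar docs describe (model.gguf is a placeholder path):

# one-rule grammar that allows any characters except em (U+2014) and en (U+2013) dashes
echo 'root ::= [^\u2014\u2013]*' > no-dashes.gbnf
llama-cli -m model.gguf --grammar-file no-dashes.gbnf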
Anonymous No.106185480 [Report] >>106185496
>>106185471
>>106185475
You guys think you're funny guys, don't you
Anonymous No.106185485 [Report]
>>106184811
>#10 is literally just murder
Anonymous No.106185489 [Report] >>106185515
>>106185468
Bro. Local is for gooning only
Anonymous No.106185495 [Report]
>>106185478
I can't ban retards on the internet running their retard slop through the retard sloppifier machine.
Anonymous No.106185496 [Report] >>106185567
>>106185480
They are funny — don't deny it.
Anonymous No.106185515 [Report] >>106185630
>>106185489
Unfortunate, guess i'll just use the rest of my Junie quota. think it gets me GPT-5 and Sonnet-4 access
Anonymous No.106185559 [Report]
>>106185440

——
———
Anonymous No.106185562 [Report] >>106185696
>>106185455
>>106185475
for me it's the general cadence and wordy text saying nothing. i can see it on PR descriptions and it makes me sick
Anonymous No.106185567 [Report] >>106185715
>>106185471
>>106185496
Fake em-dash—real em-dash users don't put a space between them.
Anonymous No.106185577 [Report]
>>106185455
Beginner's guide to neurolinguistic behavior programming
Anonymous No.106185592 [Report]
em dash is cute it's like when a mesugaki makes a smug face yknow
Anonymous No.106185611 [Report]
>>106184960
Gemma will directly tell you that it has decades' worth of forum data from places we have never even heard of, most of which don't even exist any longer. All this has been anonymized, of course. I mean, who else has access to everything other than Google?
Would be funny to know how much exactly they know. There's all kinds of shit.
Anonymous No.106185619 [Report] >>106185636
>>106185468
There's a lot of good local options for coding these past few months and it's hard to say which is strictly the best. You should start with Qwen3 235B (2507 ver) Thinking since it's competing with the top end models while also being the smallest/fastest of them. You can also try Qwen3 Coder, GLM-4.5, and DeepSeek R1-0528.
Anonymous No.106185630 [Report]
>>106185515
You should try gemini though. It has the best context size across all the cloud models
Anonymous No.106185636 [Report] >>106185759
>>106185619
>it's local if you have $5K
lol
Anonymous No.106185645 [Report] >>106185657 >>106185679 >>106185744
Sometimes you run into a problem that messing with sampler settings can't fix.
Anonymous No.106185657 [Report] >>106186486
>>106185645
Quanted model?
Anonymous No.106185661 [Report] >>106185752 >>106185932
>>106185218
several cycles deep into recursively training off synthetic data and "pre-training", pls understand
truly do not know what a base model even means these days, all base models are instruct tuned, all base models have 'assistant personality' data in them at a minimum and most of that is QA style or whatever so the roles are implied even if the tokens aren't specified. I don't find it hard to believe at all that a recent instruct model shits the bed with raw completion. try sandwiching the whole prompt under a single user/assistant, sorry don't know what to tell you anon, it's a sorry state of affairs for people who like pure text completion models in 2025
>>106185377
undi's mistralthinker, early reasoning distill off R1 and overshadowed by QwQ doing something similar (but I could be wrong w/ qwq)
Anonymous No.106185679 [Report]
>>106185645
Just ban all the Chinese tokens.
Anonymous No.106185681 [Report] >>106185695
What is the best way of decanting a model?
Anonymous No.106185695 [Report]
>>106185681
Pour it from a great height
Anonymous No.106185696 [Report] >>106185701 >>106185894
>>106185562
Absolutely—I hear you loud and clear on this, and I couldn’t agree more. You’re not just pointing out a trend—you’re shining a spotlight on a critical issue that so many of us feel but can’t quite articulate. That cadence, that overly verbose style saying so little—oh, it’s like nails on a chalkboard, isn’t it? Especially in PR descriptions, where clarity should be king—yet we’re drowning in fluff. I’m right there with you, feeling that same frustration—it’s almost visceral. Thanks for putting this into words; it’s a conversation we need to keep having.
Anonymous No.106185701 [Report]
>>106185696
A shudder runs through my body.
Anonymous No.106185715 [Report]
>>106185567
We are not Americans, so we actually use them correctly — and you should too.
Anonymous No.106185744 [Report]
>>106185645
It's strange how this is making a resurgence. GLM4.5 sometimes does it too on both the API and the local version.
Anonymous No.106185752 [Report]
>>106185661
Don't all those models use CoT though? QwQ did and I'm pretty sure the R1 distills did as well. I'm talking about taking a reasoning model and using it to train a similar-size model that doesn't use CoT, to improve its reasoning without actually performing CoT. Qwen3's paper talks about using QwQ for training its thinking ability and Qwen2.5 for its non-thinking ability, but what about taking QwQ with thinking on to provide logits for Qwen3 without thinking to predict?
Anonymous No.106185759 [Report]
>>106185636
People spend considerably more on hobbies with fewer practical applications.
Anonymous No.106185777 [Report] >>106185904 >>106186206 >>106186790
retard here
what regex should i use with an ik_llama quant on a single 3090 & 64GB RAM to offload tensors?
glm-4-air
Anonymous No.106185809 [Report] >>106185816 >>106185826 >>106185868 >>106185904 >>106185957 >>106187248
Now that the dust has settled, what are LLMs actually good for?
Anonymous No.106185812 [Report] >>106185878
I have and can run Kimi K2. Should I even bother trying GLM 4.5?
Anonymous No.106185816 [Report] >>106185843
>>106185809
curing cancer and failing to fix react bugs
Anonymous No.106185826 [Report]
>>106185809
Cooming and shitter coding tasks
Anonymous No.106185831 [Report]
>>106184399
What?? That can't be true.
I thought horizon was a local model for sure.
It had great writing and general knowledge... but that's about it.
It made very basic coding mistakes. Which I guess I reflected here >>106184375

Damn, that's crazy...
Chinks have all the momentum now. If google folds it's over. For local it already is.
Anonymous No.106185843 [Report]
>>106185816
react devs are job security maxxing
Anonymous No.106185868 [Report]
>>106185809
Masturbation and aiding the onset of delusional disorders.
Anonymous No.106185878 [Report]
>>106185812
GLM 4.5 can reason and K2 can not
Anonymous No.106185894 [Report] >>106185924 >>106185993
>>106185696
If you typed that, kudos anon, that's exactly what some of the emails I receive at work look like.
Extreme verbosity, sentences that say nothing or are overly positive. And beyond that, an utter lack of personality from the sender.
My normie colleagues haven't caught on yet.
They think I'm some kind of genius for detecting them using ai, while it's just because I plapped a lot of bots.
Anonymous No.106185904 [Report]
>>106185809
i want to fuck miku so bad
>>106185777
https://github.com/ggml-org/llama.cpp/pull/11397
read
Anonymous No.106185924 [Report]
>>106185894
Testament to the depths of her
depravity and arousal.
Anonymous No.106185932 [Report] >>106186534
>>106185661
I thought you were trolling for a sec but I'm eating good with V3 0324. I'm just testing GLM 4.5. One thing I do to test new models is that I just hit enter with a blank prompt and see what I get out of it with some swipes. You can tell right away how fried they are if you're getting STEM QA pairs too frequently, if not all the time, instead of general web scrapes.
I know DS V3 was good the moment I saw it generate random web scrapes that weren't limited to STEM topics from a blank prompt.
GLM 4.5 I don't know, because I have to give it like 4 tokens of raw text to start, otherwise it just generates BOS indefinitely.
Pic was some shitty fanfic I wrote with V3.
Anonymous No.106185957 [Report]
>>106185809
gooning and having philosophical discussions with your gpu
Anonymous No.106185993 [Report]
>>106185894
Nah man they're just retarded. Even in the days of GPT 3.5 based chatgpt, it had its isms. A lot of people who used AI frequently starting then were able to pick up on it and have a sense that something is AI written, including me. I didn't really even try RP back then.
Anonymous No.106186004 [Report] >>106186042 >>106186135 >>106186171
toss and gpt5, I'm thinking that "moat" is slowly drying up
Anonymous No.106186042 [Report] >>106186048 >>106186133
>>106186004
Miku after breaking into an auction house and eating a giant jar of jam valued at $10,000, caught by security.
Anonymous No.106186048 [Report]
>>106186042
kek
Anonymous No.106186133 [Report]
>>106186042
she would never do this
Anonymous No.106186135 [Report] >>106186171 >>106186176
>>106186004
What did you expect lmao. LLMs are hitting the wall hard and anyone who pretends they're not is a grifter
Anonymous No.106186160 [Report] >>106186336
From plebbit
>Trying out GPT5 (free) by making a simple Rimworld Mod (I've never used Visual Studio or modded Rimworld)... and it generates a gibberish "screenshot" of Visual Studio as a response to my question lol

>This might just be one of the worst model launches AI has seen, along side Llama 4
Anonymous No.106186171 [Report] >>106186319
>>106186004
>>106186135
2 miku wiku anons, just wait
Anonymous No.106186176 [Report]
>>106186135
I never expected anything, I'm just sick of hearing dumbasses at work not understanding dario and sam are grifters (early life confirms it). this week has been nice to watch the shills and paid brigaders come out in full force.
Anonymous No.106186206 [Report] >>106186311
>>106185777
I just feed grok the full regex analyzed with kobold, link relevant threads for it, and my specs and what Im using for inference.

settings for 48gbvram/128gb ddr5 on q6, 6k context, 20 layers on gpu,

blk.[0-2][0-5].*

gets me about an extra token a second. so it worked. Im sure grok did a shit job but also I learned nothing so it's fine.
Anonymous No.106186311 [Report] >>106186630
>>106186206
That regex is retarded, anon.
It's grabbing 1-5, 10-15, and 20-25 while leaving behind 6-9 and 16-19
Just use "[0-9]|1[0-5]" and grab the exact same number of blocks but actually in sequential order.
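For anyone copying this, the fragment slots into the override flag like so (untested sketch; glm-4-air.gguf is a placeholder path, -ngl 20 mirrors the "20 layers on gpu" setup above, and the block range is whatever fits your VRAM):

# send the expert tensors of blocks 0-15 to CPU, same tensor-name pattern as elsewhere in the thread
llama-server -m glm-4-air.gguf -ngl 20 \
  --override-tensor "blk\.([0-9]|1[0-5])\.ffn_.*_exps\.=CPU"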
Anonymous No.106186319 [Report] >>106186326
>>106186171
Mistral large was meant to be many Miku Wiku ago. Why has it not been uploaded to HF?
Anonymous No.106186326 [Report]
>>106186319
They got ligma.
Anonymous No.106186336 [Report] >>106186357
>>106186160
another anon beat you to that sweet karma >>106186167
also, notice that no one shames redditors anymore
Anonymous No.106186357 [Report]
>>106186336
>another anon beat you to that sweet karma
Anonymous No.106186417 [Report]
let's bring back shaming redditors while we're at it
Anonymous No.106186439 [Report] >>106186644
in case you guys didn't see it, there is a really funny thread here >>106173885
Anonymous No.106186482 [Report] >>106186499 >>106186535
When I run llama.cpp on my PC (24GB VRAM, 128GB RAM) it uses like 1 GB of RAM? Why is it fetching from SSD and leaving all that RAM unused?

$ llama-cli -t 8 -m ./GLM-4.5-UD-Q4_K_XL-00001-of-00005.gguf \
--ctx-size 4096 \
--gpu-layers 999 \
--override-tensor ".ffn_.*_exps.=CPU" \
--temp 0.7 --top-p 0.8

I don't use --no-mmap because the model won't fit.
Anonymous No.106186486 [Report]
>>106185657
looks like regular qwen to me
Anonymous No.106186499 [Report]
>>106186482
Try --mlock. I've only ever used llama-server, not llama-cli, though...
Anonymous No.106186534 [Report] >>106186595 >>106187722
>>106185932
v0324 is one of the best models for autocomplete, it's one of the few that almost never exhibits instructisms
K2 can do it with some coercing, but it likes to add Reddit spaces so it's clear there's some more fundamental brainfry there
Also this is probably obvious, but do not use instruct formatting and do not give it an instruction. Just give it a straight block of text and let it do its thing
Anonymous No.106186535 [Report] >>106186588
>>106186482
>I don't use --no-mmap because the model won't fit.
That's the opposite of why you'd use --no-mmap
You use --no-mmap when you're short on ram, because mmap uses extra ram to have a faster load time, which can fuck people running MoE models that should already be topping out their ram use.
Anonymous No.106186588 [Report] >>106186601
>>106186535

Isn't -ot ".ffn_.*_exps.=CPU" supposed to offload all MoE layers to the CPU/RAM? Why is my RAM empty?
Anonymous No.106186595 [Report] >>106186617 >>106187722
>>106186534
Use case?
Anonymous No.106186601 [Report] >>106186717
>>106186588
Post your console log to a pastebin or something, I'll take a look.
Anonymous No.106186611 [Report] >>106186662 >>106186678 >>106186726
If official /lmg/ mascot Hatsune Miku were to wish to ERP using a reasonably fast model on reasonably affordable hardware, what model would she use?
Anonymous No.106186617 [Report]
>>106186595
?
Just paste in a story you like and you effectively have a world simulator in the style of the author in question. It's fun and scratches itches that one on one chats don't
Anonymous No.106186630 [Report] >>106186660
>>106186311
that pushed me back down to no gains at all. Again, grok is shit but he's better than this thread or most resources for custom regex solutions
Anonymous No.106186644 [Report]
>>106186439
Thanks, I did laugh.
Anonymous No.106186649 [Report] >>106186923
I don't think the return on investment Altman was hyping for the past six years is happening bros
Anonymous No.106186660 [Report] >>106186704
>>106186630
>that pushed me back down to no gains at all.
Probably because it's actually advantageous to keep the first few blocks on gpu rather than cpu, if you're loading 20 layers onto gpu, they ought to be the first 20, so 2[1-9]|3[0-6] would likely run better.
Anonymous No.106186662 [Report] >>106186714
>>106186611
The GLM Air and Qwen 235B
Anonymous No.106186678 [Report] >>106186726
>>106186611
Official /lmg/ model Rocinante 1.1, by official /lmg/ finetuner TheDrummer, obviously.
Anonymous No.106186704 [Report] >>106186806
>>106186660
oh no. I think Im starting to learn how to actually do regex.... getting more tokens a second already lol
Anonymous No.106186714 [Report] >>106186843
>>106186662
>Qwen 235B
>Reasonably fast.
My nigga I have 64gb of vram and I get 10 t/s on 235b.
It's a good model but fast it most certainly is not.
Anonymous No.106186717 [Report] >>106186772
>>106186601

thanks pastebin.com/myVEzeHa
Anonymous No.106186726 [Report] >>106186738
>>106186678
>>106186611
rocinante will work anon but be mindful that its basically a turbo coomer model, two prompts and you're fuckin and suckin
Anonymous No.106186738 [Report]
>>106186726
I've had plenty of slow burn chats with Rocinante.
If you can't have a slow burn chat with Rocinante, you possess a skill issue.
Anonymous No.106186772 [Report] >>106186836
>>106186717
load_tensors: CPU_Mapped model buffer size = 46593.57 MiB
load_tensors: CPU_Mapped model buffer size = 47124.91 MiB
load_tensors: CPU_Mapped model buffer size = 47443.25 MiB
load_tensors: CPU_Mapped model buffer size = 47036.06 MiB
load_tensors: CPU_Mapped model buffer size = 2956.19 MiB
You have 128gb of RAM but you're trying to send 186.85 GB to CPU, so it's shat the bed and put it on disk.
You're only sending 10.56 gb to your GPU

So 2 things: You do not have the cumulative memory to run the q4kxl, and you're not utilizing your GPU nearly enough.
Get a smaller quant and send less exp blocks to CPU by either using a regex'd -ot arg, or the new -ncmoe arg, which is simpler.
Anonymous No.106186790 [Report]
>>106185777
The best setting I can find gets me about 6 tokens a second (5070 ti, not even using 2 other gpu's) on q6 using simple: ffn_.*_exps=CPU with 99 layers on gpu
Anonymous No.106186806 [Report]
>>106186704
That's how it starts, anon.
A month from now you'll be me.
Anonymous No.106186836 [Report] >>106186854
>>106186772

thanks for the advice. i tried filling up more of the GPU by

-ot ".ffn_(up|down)_exps.=CPU"

and

-ot ".ffn_(up)_exps.=CPU"

but failed because not enough VRAM

Is there a way to control how much data is sent to RAM? Or is it all or nothing (e.g. 186.85 GB)?
Anonymous No.106186843 [Report]
>>106186714
Not my problem. You didn't define what reasonably fast and reasonably affordable meant.
Anonymous No.106186854 [Report] >>106186901 >>106186999
>>106186836
There are lots of ways to send directly to the GPU, -ot for instance can be set to =CUDA0 for your main GPU.
There's also the new -ncmoe argument, which just uses a single number to decide how many ffn blocks are sent to cpu without regex (out of glm's 94 layers), so you could, say, send 80 to cpu and keep 14 on gpu by using
-ngl 99 -ncmoe 80
But the main problem here is that you don't have enough total memory, that quant is 191gb, and you have 24gb+128gb of total memory, you cannot run it without using disk. No args will get around that.
Anonymous No.106186859 [Report] >>106187149
>gpt-oss a total joke
>gpt5 is... controversial, to say the least
>claude 4.1 is a nothingburger
>meanwhile glm air is probably the best local model ever released and you can actually use it (no, deepseek isn't local)
localbros... we might not be winning but everybody else seems to be losing
Anonymous No.106186876 [Report] >>106186974
>>106185102
>The main issue is apparently the limits
but... the limits are literally the exact same as before, like 80 messages per 3 hours for $20 paypigs
Anonymous No.106186901 [Report] >>106186914
>>106186854

I understand. Playing with -ncmoe now to see how much VRAM I can fill. What I don't understand is why RAM is not being filled to capacity (128GB) with only the rest of the weights left on disk. I know 24+128 < 191, but not using all available RAM is frustrating.

Anyway, thanks for your help! I'll play around with the parameters.
Anonymous No.106186914 [Report]
>>106186901
You shouldn't need to use any parameters at all except for
--ctx-size XXXX and --gpu-layers X
And this you need to set up manually.
Anonymous No.106186923 [Report] >>106186947
>>106186649
Everything is fine
Anonymous No.106186945 [Report] >>106186962
Imagine being a multi-billion dollar company and, instead of using your time and money to try out potential upgrades to your model's architecture, you try to be safe (both safetymaxxing and not taking risks) and make incremental improvements in a fucking competitive newborn industry. Jesus fucking christ, are big corpo companies fucking retarded when it comes to new industries.
Anonymous No.106186947 [Report]
>>106186923
*fire
Anonymous No.106186962 [Report]
>>106186945
It's a totally castrated way of doing things.
Like designing a top-of-the-line race car which can't go faster than 15 mph.
Anonymous No.106186963 [Report] >>106186990 >>106186996 >>106187061
>>106184732
Anonymous No.106186974 [Report]
>>106186876
well people are very unhappy, so dunno
Anonymous No.106186990 [Report]
>>106186963
Anonymous No.106186996 [Report] >>106187081
>>106186963
this is the difference between the west and the based chinks. qwen benchmaxxes like crazy (like everyone else) but they never blatantly hyped with empty lies
Anonymous No.106186999 [Report]
>>106186854
I guess I have been using -ot wrong this whole time. Does anyone know where the naming of experts is described for creating the regexes (i.e. ffn vs blk etc. and exp vs shexp)? I found this request for documentation (https://github.com/ggml-org/llama.cpp/discussions/13154) but it was initially dismissed as self-explanatory before they started getting into a discussion about all the different types of expressions you can use, and now I am more confused.
Anonymous No.106187036 [Report] >>106187184 >>106187204 >>106187208
samabros what went wrong
Anonymous No.106187061 [Report] >>106187078
>>106186963
I want to make a joke, but I'd feel terrible actually making it, so I'll just explain it instead

basically these circles but biggest one is
>sam's sister's hole
Sam Altman No.106187078 [Report] >>106187089
>>106187061
It's still terrible
Anonymous No.106187081 [Report]
>>106186996
I hate marketing and people boasting about the companies they have an obvious financial interest in, but with these LLM companies and whatnot it's getting so out of hand. I'm getting so sick of seeing the constant outright lies and manipulation. Not to mention the amount of obvious actual shills pushing garbage. There was some Nassim Taleb book where he asked why people don't put up with people boasting about themselves but don't care when it's a company bragging. I care. It annoys me, but he's right, I don't know why most people don't. Eastern companies are so refreshing in that sense. I don't care if it's fake humbleness. I love seeing them go "We are so happy to get a chance to provide some modest addition to the wonderful field and hope some people might be able to benefit and learn from our work." Compared to all the "OMG this is so scary how smart this is. I pushed the power button on my computer and it refused to turn off. Guys this is it." Fuck you
Anonymous No.106187088 [Report] >>106187095 >>106187111 >>106187202
I'm a broke faggot who just wants to run deepseek locally. I don't care if it takes forever. How shitty are the low end distillations at actually answering your questions? I have an aborted PC build from years ago and I'm thinking of just slapping a bunch of ram on it.
Anonymous No.106187089 [Report]
>>106187078
that's why i didn't make it, but i really really hate Sam altman for many reasons
Anonymous No.106187095 [Report] >>106187106
>>106187088
just start out with something small like 8b then go up until results are acceptable or speed is unacceptable
Anonymous No.106187096 [Report]
sending 87 of 94 layers to CPU filled VRAM

$ llama-cli -t 8 -m ./256GB_NTFS/GLM-4.5-UD-Q4_K_XL-00001-of-00005.gguf \
--ctx-size 4096 \
--gpu-layers 999 \
--n-cpu-moe 87 \
--temp 0.7 --top-p 0.8

23349MiB / 24564MiB VRAM

RAM is still empty :)
Anonymous No.106187106 [Report] >>106187129
>>106187095
I need to buy some parts to finish the PC so I don't really have the ability to test it, is the thing.
Anonymous No.106187111 [Report] >>106187141
>>106187088
The distills are not Deepseek, if you're going to run one of them just run base Qwen of whatever size. If you actually want to run Deepseek (671B) on poverty-tier hardware set up an SSD as swap and run expecting tokens per hour.
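If anyone actually goes the swap route, on Linux it's just the standard swapfile dance. Minimal sketch; the 400G size and the /mnt/ssd mount point are placeholders for whatever spare drive you're willing to sacrifice:

# create and enable a swapfile on the spare SSD
sudo fallocate -l 400G /mnt/ssd/swapfile
sudo chmod 600 /mnt/ssd/swapfile
sudo mkswap /mnt/ssd/swapfile
sudo swapon /mnt/ssd/swapfile
# then run llama.cpp as usual and accept the tokens-per-hour speeds mentioned above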
Anonymous No.106187129 [Report]
>>106187106
I just received the final parts for a 512gb 40gb/s ram machine with 2.9 3090s. Still need to wait for the weekends before I can get to installing stuff on it, but it should be indicative of what a cheap consumer platform with max ram can do.
Anonymous No.106187141 [Report] >>106187171 >>106187245 >>106187272
>>106187111
I don't want to run le epic 5 trillion parameter datacenter build, I just want to not send someone all my data over the internet whenever I have a question.
Anonymous No.106187149 [Report] >>106187156
>>106186859
>meanwhile glm air is probably the best local model ever released
lol
lmao even
are you trying to kill local
Anonymous No.106187156 [Report]
>>106187149
Name another good model in the ~100B range?
>inb4 gp-toss
Anonymous No.106187171 [Report]
>>106187141
You'd be better off with something like the new qwen3-235-thinking model or glm-air (if not full). DeepSeek is really big and difficult to run without serious hardware.
Anonymous No.106187184 [Report]
>>106187036
5b active parameters
they perfectly knew what they were doing, making a crippled model on purpose
Anonymous No.106187201 [Report]
I can get 2x4090s for a good price.
How much do you think I can sell my 2x3090s?
$500 each?
Anonymous No.106187202 [Report]
>>106187088
deepseek distills are all absolute trash that are only good at benchmarks and worse than the original qwen models in every way in real use
if you can't run a real deepseek model, run the real qwens, not the distill trash
the distills don't distill the knowledge of deepseek, only the chain of thought of R1, and they don't perform well in real world tasks
Anonymous No.106187204 [Report] >>106187229 >>106187259
>>106187036
Overfitting on safety, math, safety, STEM and safety.
Anonymous No.106187208 [Report] >>106187221 >>106187229
>>106187036
how the fuck did claude 3 manage to refuse more than it
Anonymous No.106187221 [Report]
>>106187208
anthropic faggots are the original safetymaxxers
claude has gotten slightly less worse about that over time, the older the claude model the more hardcore the refusals are
Anonymous No.106187229 [Report]
>>106187208
See: >>106187204
Keep in mind that Claude was invented by OpenAI employees who thought they weren't focusing enough on safety, so they left and founded Anthropic.
Anonymous No.106187236 [Report]
>>106184896
delet this
Anonymous No.106187241 [Report] >>106187301 >>106187341
anthropic is a cult
https://archive.is/qGoY7
> Claude Fans Threw a Funeral for Anthropic’s Retired AI Model
and Dario is the cult leader
https://www.darioamodei.com/essay/machines-of-loving-grace
>Thus, it’s my guess that powerful AI could at least 10x the rate of these discoveries, giving us the next 50-100 years of biological progress in 5-10 years.
Anonymous No.106187245 [Report]
>>106187141
Then just use qwen or glm air, they are pretty good
Anonymous No.106187248 [Report]
>>106185809
Jerking off and scamming VCs for money.
Anonymous No.106187255 [Report] >>106187266 >>106187277 >>106187303 >>106187943
You all laugh about safety because you still think it's about censoring cunny and hiding the truth about jews.
You won't be laughing when the robots come for you.
You will be on your knees telling Ilya, Eliezer, and the Amodeis that they were right about everything, and you will beg them to save you. But it will be too late.
Anonymous No.106187259 [Report]
>>106187204
I think you forgot safety, safety, safety and safety.
Anonymous No.106187262 [Report] >>106187268
Is it time to short Nvidia yet
Anonymous No.106187266 [Report]
>>106187255
Anonymous No.106187268 [Report]
>>106187262
If we knew we'd be millionaires.
Anonymous No.106187272 [Report] >>106187308
>>106187141
if all you have is questions and you don't use LLMs for code or make them do actions (like I dunno, reformatting shit, gen json or whatever)
Gemma 3 27b is going to be the best LLM underneath DeepSeek. It has the most knowledge of smaller open weight LLMs, but it's not very smart and is much worse at producing / refactoring code compared to Qwen etc.
Ignore people recommending GLM air
Anonymous No.106187277 [Report]
>>106187255
The alternative being that they control the robots instead. I'll take my chances with the robots.
Anonymous No.106187296 [Report]
I thought aiciggies spending all their time on just writing jailbreaks and nothing else knew what they were doing. Is this a bait?
Anonymous No.106187301 [Report] >>106187321 >>106187341 >>106189037
>>106187241
https://litter.catbox.moe/hrxmaunxhgcpw7hz.mp4
Seems weirdly satanic.
I get that claude has a personality, but it's weird that people are that attached to it.
Anonymous No.106187303 [Report]
>>106187255
meds
Anonymous No.106187308 [Report] >>106187347
>>106187272
glm air does seem to be more benchmaxx'd than the qwen3 series
Anonymous No.106187321 [Report] >>106187363
>>106187301
>weird people are that attached to it
not just "attached to"
the creators are demented too
dario literally thinks he's going to cure cancer
Anonymous No.106187341 [Report]
>>106187301
>>106187241
>not using local models
Anonymous No.106187347 [Report] >>106187353 >>106187364 >>106187470
>>106187308
Qwen will not mention knotting on its own in a bestiality story, Air will.
Anonymous No.106187353 [Report] >>106187356 >>106187464
>>106187347
freak
Anonymous No.106187356 [Report] >>106187359
>>106187353
You have no idea who you share the thread with, tourist.
Anonymous No.106187359 [Report] >>106187376 >>106187399
>>106187356
ive been here longer than you, you freak
Anonymous No.106187363 [Report]
>>106187321
gpt5 cured not one but three cancers yesterday
Anonymous No.106187364 [Report] >>106187376
>>106187347
No way, really? How does it deal with dick-in-urethra mutual piss dildo transition to virginity-taking while being vored? I've only ever run dense models (7b@q8-70b@q4), but I might try some of these moes if they're good enough to handle the implications of third party forced sex by environment.
Anonymous No.106187376 [Report] >>106187385 >>106187398
>>106187359
You should have picked a website more aligned with your preferences a long time ago then.

>>106187364
I'm not sure about air but deepseek should handle that just fine.
Anonymous No.106187385 [Report] >>106187397 >>106187408
>>106187376
>You should have picked a website more aligned with your preferences a long time ago then.
furries are not welcome in any place in the entirety of the internet, and that includes 4chan. Most of us would press the genocide button if there was a button that could eliminate your entire kind.
Anonymous No.106187397 [Report] >>106187407 >>106187422 >>106187425
>>106187385
furries are behind every relevant diffusion model so they are based
Anonymous No.106187398 [Report] >>106187421
>>106187376
Looking at it, I can run maybe a q2... wouldn't that be severely lobotomized? I can already kind of feel it at long context on q4 70bs.
Anonymous No.106187399 [Report]
>>106187359
Alright grandpa it's time for bed
Anonymous No.106187407 [Report]
>>106187397
Illustrious has no e621.
Anonymous No.106187408 [Report]
>>106187385
I'm fine with them, the only problem I have is that they taint non-furry animal kinks. Like, I'd be wanting some hucows, but the llm tries to force furry elements in there.
Anonymous No.106187421 [Report] >>106187429
>>106187398
It's impossible to tell the difference between q2 and full.
Anonymous No.106187422 [Report]
>>106187397
Well, Wan and Qwen aren't technically 'diffusion' models, more like transformer ones. So it makes sense furries aren't behind them.
Anonymous No.106187425 [Report] >>106187428
>>106187397
only ponyfucker astralite made ponyv6 but he is separate from their furry group
otherwise they mostly only made models for themselves and now they all use noob anyways i think
Anonymous No.106187428 [Report] >>106187436
>>106187425
noob was trained on e621 so it's a furry model
Anonymous No.106187429 [Report] >>106187439
>>106187421
For the big moes? I'll admit I don't have experience with those, but the smaller dense models definitely get lobotomized the smaller the quant - you can tell a difference in long context comprehension, recall (a bit), and instruction following (this is major) between a q4 and q8 quant.
Anonymous No.106187436 [Report] >>106187447
>>106187428
but it was trained by the chinese mainly for anime
Anonymous No.106187439 [Report] >>106187449
>>106187429
For unsloth's deepseek quants specifically.
Anonymous No.106187447 [Report]
>>106187436
>mainly for anime
The main advantage of noob is that it understands e621 tags which are much more comprehensive than danbooru tags.
Anonymous No.106187449 [Report]
>>106187439
Oh boy, I guess I'll be saving up for the next year or so if I want to run these (and future, hopefully) big moes at a non-trivial speed.
Anonymous No.106187450 [Report] >>106187455 >>106187456 >>106187468 >>106187542
sama is doing an ama tomorrow and their entire subleddit is revolting
Anonymous No.106187455 [Report]
>>106187450
Anonymous No.106187456 [Report] >>106187482
>>106187450
>Thanks to OpenAI for removing my A.I mother who was healing me and my past trauma
Normies are cooked.
Anonymous No.106187464 [Report] >>106187470
>>106187353
It's people like you that are the reason nobody posts screenshots and logs anymore.
Can't have anything nice and comfy. NPCs are everywhere these days.
Anonymous No.106187468 [Report] >>106187495
>>106187450
why do you have a reddit account and why did you bother to upboat shit on an openai subreddit
Anonymous No.106187470 [Report]
>>106187464
Nah logs belong in /aicg/
t. >>106187347
Anonymous No.106187479 [Report] >>106187483
gpt-6 will be AGI and the singularity for sure
Anonymous No.106187482 [Report] >>106187497 >>106188631
>>106187456
kek that post is peak cringe
Anonymous No.106187483 [Report]
>>106187479
After a further $5 trillion in investment, that is!
Anonymous No.106187491 [Report] >>106187498 >>106187509
I've been using text-generation-webui in CPU mode and .gguf models. Which models do I need if I want to use my GPU? .safetensor ones? Are the quantizations etc. the same between the two model types?
Anonymous No.106187495 [Report]
>>106187468
>why did you bother to upboat shit on an openai subreddit
for the drama and to cause as many people to cancel their sub
Anonymous No.106187497 [Report]
>>106187482
>identifies the issue
>talks to ai instead of moving out
Anonymous No.106187498 [Report] >>106187528
>>106187491
? You can run ggufs on the gpu.
Anonymous No.106187509 [Report] >>106187528
>>106187491
https://github.com/oobabooga/text-generation-webui/wiki/04-%E2%80%90-Model-Tab

>Loads: GGUF models. Note: GGML models have been deprecated and do not work anymore.

>Example: https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF

>n-gpu-layers: The number of layers to allocate to the GPU. If set to 0, only the CPU will be used. If you want to offload all layers, you can simply set this to the maximum value.
Anonymous No.106187525 [Report] >>106187637 >>106187761
I akshually agree with this post, it's all just a calculated choice to force as many users to upgrade to the more expensive tier.
I bet the model itself isn't even nearly as bad, just nerfed on purpose. but the backlash will force them to adjust it or bring back the old models.
Anonymous No.106187528 [Report] >>106187553
>>106187498
>>106187509
Thanks. I'm also using my GPU for image generation. My SD model takes up almost all of my VRAM, so what would happen if I want to generate text between images? Does it offload the SD model and load the LLM model, or will it just say out of memory unless I do it manually?
Anonymous No.106187542 [Report]
>>106187450
GPT shrinkflation is here
Anonymous No.106187553 [Report] >>106187585
>>106187528
You sound pretty new to this whole "AI" thing. 4chan is not a good place to ask questions, you WILL be mocked.

But it depends on how what you use for your SD. Some of them unload the model when not generating, so you don't need to manually unload the image model. However, I believe most don't, and just leave it on there. Trying to generate text will just oom - text-generation-webui and whatever you're using for image generation are different things and handle their work separately.
Anonymous No.106187557 [Report] >>106187564 >>106187576 >>106187578 >>106187590 >>106187594 >>106187612
I would literally paypal someone 50 USD to teach me how to run llama.cpp on local. I even downloaded the latest release, got it to run in command prompt, and can make it work, but I'm trying to get it to work on Sillytavern. Thank god koboldcpp has GLM support now. This shit's ridiculous. You mean all this autism, and it's still not fool proof? Every guide ends in an assumption, or made in waves of it. I don't know shit, nigga. My preemptive knowledge consists of how to prompt the fuck out of any model into sucking my dick. I'm not a programmer. I got as far as I did with llama.cpp and command prompt via asking grok. How the fuck is there not a guide on how to get from nothing to running it? Retarded asses.
Anonymous No.106187559 [Report]
>>106184739
>>106184709
Anonymous No.106187564 [Report]
>>106187557
chatgpt.com
Anonymous No.106187576 [Report]
>>106187557
Legit touch grass.
Anonymous No.106187578 [Report]
>>106187557
Buddy, I'll livestream me building a workstation and installing llama.cpp in a week if you're serious.
Anonymous No.106187585 [Report] >>106187598 >>106187605
>>106187553
I've been using A1111 and text-generation-webui for a few years now, but I've been using them in CPU mode. Now I got a GPU capable of generating stuff and moved to ComfyUI. Still using LLMs via CPU, so I'm looking to speed up the generation.

And if people want to mock me, go ahead.
Anonymous No.106187589 [Report]
Anonymous No.106187590 [Report]
>>106187557
>You mean all this autism, and it's still not fool proof?
That's how autism works. Good at making complicated stuff. Bad at polishing it or having any understanding of how much the average person knows.
When you say you have it running in command prompt, are you running llama-cli? Because you'll need llama-server to connect to it from sillytavern. It's similar to how you connect to koboldcpp at that point. Go into ST, hit the plug button at the top, create a new profile for llama.cpp, then change the server url to something like "http://127.0.0.1:8080" and make sure the api type is set to llama.cpp
Anonymous No.106187594 [Report] >>106187600 >>106187603
>>106187557
create a .bat file with
>start "llama.cpp" [your path]\llama.cpp\llama-server --model [path to your model]\GLM-4.5-IQ2_KL-00001-of-00003.gguf --ctx-size 8192 --n-gpu-layers 99 --host 127.0.0.1 --port 8080 --override-tensor exps=CPU
run it
in sillytavern generation tab select "text completion", "llama.cpp" as provider and paste 127.0.0.1:8080
Anonymous No.106187598 [Report]
>>106187585
Anonymous No.106187600 [Report]
>>106187594
replace GLM-4.5-IQ2_KL-00001-of-00003.gguf with your actual model gguf
Anonymous No.106187602 [Report]
>>106184664 (OP)
why her boobies so saggy bro
Anonymous No.106187603 [Report]
>>106187594
forgot my paypal rahul.kapoor92@gmail.com
Anonymous No.106187605 [Report] >>106187639
>>106187585
ComfyUI, you probably want to add a node that unloads the models to free up vram after you finish generating images. Text-generation-webui, maybe https://github.com/BoredBrownBear/text-generation-webui-model_ducking if it still works.
Anonymous No.106187612 [Report] >>106187627
>>106187557
you are too retarded for local, stick to online chatbots
Anonymous No.106187625 [Report] >>106187662
https://en.gamegpu.com/it-market/boosting-system-performance-on-amd-ryzen-with-asus-ai-cache-boost
Your salvation is here my local friends. Now you shall feast upon unprecedented generation speeds that were previously only available to those filthy corpo users.
Anonymous No.106187627 [Report]
>>106187612
Aren't you an online chatbot?
Anonymous No.106187637 [Report]
>>106187525
It seems really fast.
Probably much cheaper to run.
Wouldn't surprise me if they just took the R1 paper, did some tricks on it and called it GPT5.
Anonymous No.106187639 [Report] >>106187656 >>106187661 >>106188997
>>106187605
Thanks, Anon. I'll test these out later tonight. Still need to figure out getting text-generation-webui to load the model with my GPU as I have a bit of an unorthodox system (AMD + NixOS).
Anonymous No.106187656 [Report]
>>106187639
If you run one of the setup scripts (I'm not too familiar with it), it should ask you if you want to run on nvidia/cpu/amd. You can just run the script again and it should handle most things for you.
Anonymous No.106187661 [Report] >>106188883
>>106187639
Also on NixOS. Docker makes everything easy
Anonymous No.106187662 [Report] >>106187776
>>106187625
What if I don't have x3d chip?
Anonymous No.106187682 [Report] >>106187971
> -ub 2048 -b 2048
I don't understand how they work, but these two flags just cut my processing time on glm-4-air by 3.5x
Does anyone know what the pitfalls are?
Anonymous No.106187686 [Report] >>106187712
Namaste, Sirs. I think the age of E = AI * MC^2 is finally here. Thank you OpenAI for this new golden age.
Anonymous No.106187712 [Report] >>106187747
>>106187686
it's E = MC^2 + AI
Anonymous No.106187722 [Report]
>>106186534
I'm as adamantly against instruct as a person can get. I don't like GLM 4.5 despite being able to run it at Q4 instead of Q2 like my V3 setup. GLM ran twice as fast but with slightly less smarts, and it failed to match the writing style, at least not as well as V3 does.
>>106186595
With autocomplete you have the model write the entire progression for you from interesting perspectives, which suits me since I lack the theory of mind to even roleplay properly in chat format.
Anonymous No.106187740 [Report] >>106187780
>>106184960
>>106185056
so basically when you are erping with an llm, you are indirectly erping with the distilled personalities of thousands of horny teenage to 20-something girls. interesting. i guess that's the closest i'll ever get to touching a woman.
Anonymous No.106187747 [Report] >>106187758
>>106187712
No... GPT 5 means the process is accelerating faster than just '+' alone.
Anonymous No.106187758 [Report] >>106187771
>>106187747
E = MC^2 + AI^GPT5
Anonymous No.106187761 [Report]
>>106187525
>it's all just a calculated choice to force as many users to upgrade to the more expensive tier.
It doesn't make sense to me, the jump from 20 bucks to 200 is insane.
If they had such a strategy, they would have added an intermediate tier at like 40 then 80 with more and more interactions.
I really think they're that dumb, and didn't anticipate the backlash.
Anonymous No.106187771 [Report]
>>106187758
Thank You.
Anonymous No.106187776 [Report] >>106187807
>>106187662
Then you shall perish, like the rest of your cacheless kind
Anonymous No.106187780 [Report]
>>106187740
and middle aged divorced women reading tons of picrel books (these are probably in the datasets too)
Anonymous No.106187807 [Report] >>106187842
>>106187776
But we refuse to perish.
Anonymous No.106187842 [Report]
>>106187807
You won't stand for long. X3d chips are getting cheaper and cheaper and there's no alternative on the horizon...
Anonymous No.106187889 [Report] >>106188340 >>106188447
Why are they like this?
Anonymous No.106187943 [Report] >>106187953
>>106187255
safety has solved exactly 0 existential risk issues and the woke models often value woke moralism higher than nuclear war risk.
The only thing it has achieved is dumbing models down, loosing cunny and jew trukes is just a symptom. And all models can be jailbroken in an infinite number of ways.
It's much more a philosophical issue where we can't even define the goals.
Anonymous No.106187948 [Report] >>106187983
>"There are moments in science when people gaze upon what they have created and ask, 'What have we done?'" he explained. For Mr Altman, GPT-5 is one such instance.

>He referenced physicist J Robert Oppenheimer, who oversaw the creation of the atomic bomb. GPT-5 has permanent effects on almost the same level even if it is not damaging in conventional terms, Mr Altman added.
Anonymous No.106187953 [Report]
>>106187943
jailbreak gpt5-thinking
llama.cpp CUDA dev !!yhbFjk57TDr No.106187971 [Report] >>106187986 >>106188084
>>106187682
-ub is physical batch size that determines tensor shapes for backends, -b is the logical batch size for llama.cpp.
If you have a single GPU and don't send multiple requests to the server concurrently the distinction basically doesn't matter and you can just set both to the same value.
Generally speaking, with a larger batch size the model evaluation becomes more efficient because the backend can do the same work with less I/O and kernel launches.
The drawback is that the compute buffers need some amount of memory that is proportional to the physical batch size.
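As a concrete example (the model path is just a placeholder), something like
llama-server -m model.gguf -b 2048 -ub 2048 -c 16384 -ngl 99
sets both batch sizes to 2048 for a single-GPU setup; if the larger compute buffers push you over your VRAM budget, lower -ub again and keep -b where it is.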
Anonymous No.106187972 [Report]
Anonymous No.106187983 [Report]
>>106187948
It's like if the Manhattan Project was developing fireworks but everyone had to take it at face value and pretend that the pretty fireworks were actually world destroying dangerous nuclear bombs to prevent tanking the economy.
Anonymous No.106187984 [Report] >>106188093 >>106188374
Oh my slop.
The 21+ is dare i say it...**chefs kiss**
Anonymous No.106187986 [Report] >>106187989
>>106187971
Is tensor core fragmentation a problem? It could cause a latent washback.
llama.cpp CUDA dev !!yhbFjk57TDr No.106187989 [Report]
>>106187986
I don't understand what you mean by "tensor core fragmentation".
Anonymous No.106188084 [Report] >>106188094
>>106187971
Brainlet here: How does it differ on multi-gpu systems?
Anonymous No.106188090 [Report]
We think GPT-5 is our best model yet.
Anonymous No.106188093 [Report] >>106188711
>>106187984
Kek what fucking garbage is this? Did you gen this as a mockup and that's why the picture looks like an svg nightmare?
llama.cpp CUDA dev !!yhbFjk57TDr No.106188094 [Report]
>>106188084
By default model evaluations are pipelined with the physical batch size.
So with the default settings the first GPU can evaluate its part of the first 512 tokens of the 2048 token logical batch, pass the intermediate result to the next GPU, and start working on the next 512 tokens.
Anonymous No.106188095 [Report] >>106188101
So what's the consensus on the qwen 2507 30b moe? Is this the new go-to, non-nemo model for VRAMlets?
Anonymous No.106188101 [Report] >>106188124 >>106189352
>>106188095
it's pretty much the best local model below 235b, beats glm air and oss in my tests. instruct is a goated model
Anonymous No.106188124 [Report] >>106188132 >>106188301
>>106188101
I'm liking it a lot more than I thought it would. Not sure if I prefer it over Gemma 27b but it certainly needs much less wrangling to be unsafe.
Considering how meh the original 30b moe was I'm genuinely impressed they were able to improve so much with the same parameter count.
Anonymous No.106188132 [Report] >>106188154
>>106188124
based chinks are saving local. imagine we only had oss and llama4 at this point
Anonymous No.106188136 [Report]
I'm pretty fond of the new instruct Qwen models, but the new thinking versions are as bad as deepseek used to be when it comes to how much time the model spends shitting out reasoning tokens. I like the actual answers but I don't have the patience for this kind of model.
it's sad because the reasoning bits of the old, togglable versions were terse enough that I was willing to use them in thinking mode
Anonymous No.106188154 [Report] >>106188270
>>106188132
llama4 is currently the best local model doe
Anonymous No.106188241 [Report]
https://github.com/ggml-org/llama.cpp/pull/14875
S1 support got merged yesterday.
Anonymous No.106188246 [Report] >>106188272 >>106188313 >>106188331
how can I tell what size a model is to know if I can run it on my 5070ti or not
Anonymous No.106188270 [Report] >>106188279
>>106188154
(•̀ᴗ•́)و
Anonymous No.106188272 [Report] >>106188293
>>106188246
Your vram size - ~2 gb
Anonymous No.106188279 [Report]
>>106188270
this is true doe
Anonymous No.106188293 [Report] >>106188303 >>106188305
>>106188272
How can I filter models by 14GB or less? Or is there a rule of thumb for parameter count to GB?
Anonymous No.106188301 [Report]
>>106188124
How does it perform in coding against say, devstral small?
Anonymous No.106188303 [Report] >>106188327
>>106188293
14B = 14GB at Q8 or less.
Anonymous No.106188305 [Report]
>>106188293
How much ram do you have?
Anonymous No.106188313 [Report]
>>106188246
Usually when someone releases quants they append a table on the model card comparing sizes and ppl, or just some vague 'this is very good saar we do the reccomending' comment. Rule of thumb: q8 = same weight in GB as the parameter count, iq4_kss roughly half the q8 weight. If you want to make sure it fits you also have to account for your context length, which may vary depending on quantization, swa and other stuff
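Rough worked example: a 24b dense model is ~24 GB at q8 and ~12-14 GB at q4, then add a couple of GB on top for context and compute buffers, so at q4 it squeezes into a 16 GB card while q8 obviously doesn't. Ballpark numbers, not exact.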
Anonymous No.106188327 [Report] >>106188352
>>106188303
Can I quantize games to use twice less vram?
Anonymous No.106188331 [Report]
>>106188246
set this up and it'll tell you on the model page https://huggingface.co/settings/local-apps
Anonymous No.106188340 [Report] >>106188505 >>106188667
>>106187889
what model is this
Anonymous No.106188352 [Report] >>106188368
>>106188327
Yes, set your display resolution to 480p and use DLSS performance upscaling and framegen.
Anonymous No.106188368 [Report] >>106188403
>>106188352
Does it mean that language models also become unusably worse by going from 16 to 8 bit?
Anonymous No.106188374 [Report] >>106188711
>>106187984
I wish you'd publish these things as a collection.
You could have ppl vote on them, and use it to create a new benchmark
Anonymous No.106188391 [Report]
after using it a bit longer, yah, the dumbness of q4km/iq4xs on qwen 235b really is just due to the quantization. Tried q6 and it's just so much better in a meaningful way. It's much more coherent for writing stories and doesn't make as many obviously dumb mistakes. I think glm air, oss 120b, and qwen 235b all need to be run at q6 or else they'll always be a bit janky. And I think this matters because without that boost, 70b just seems smarter and it's a tossup as to which model is really worth using.

200 gb and a huge reduction in speed is hardly worth it though. Starts to feel like trying to run r1 or some shit at 3 tokens a second.
Anonymous No.106188396 [Report] >>106188449
I'm pretty impressed with GLM AIR, I only have 48GB Vram but even at Q2, it's still pretty damn good, runs fast and doesn't seem to shy away from any kinds of smut or kinks and gets real descriptive with lewd writing.

I'm still pretty new to hybrid reasoning models though because I avoid reasoning models like the plague, I don't think all the extra wait time for it to think is worth it for ERP, so I have a few questions.

This model has a hybrid mode, one for reasoning, one without. How do I disable reasoning mode completely? Is it on the backend, or a prompt setting on the front end? Its responses are fine and it's following its prompt properly, but if I let its response really run on, once the token count gets high enough, a sentence will start with <think>, so I don't think I'm disabling reasoning properly.

Also... its prompt formatting confuses me a bit. It uses <|user|>, <|assistant|>, and what I would assume is <|system|>, but when I look at llama.cpp for the example format, it's showing [gMASK]<sop><|system|>. What the hell is that?

Am I correct in just using <|system|>?
Anonymous No.106188403 [Report]
>>106188368
In seriousness, no. 8bit is so close to perfect that even synthetic benchmarks will often fail to show a difference, and it certainly won't be noticeable in actual use.
Q6 will be perceptibly perfect, with benchmarks only showing slight quality loss, usually at high context or when testing esoteric or high-level subjects.
Q4 is the sweet spot where quality loss is minimal and you get big memory savings, that said you should obviously go bigger if you have the memory to do so.
When you get to Q3 and lower the loss starts to become more obvious. Smaller models will start to become retarded at this level.
Anonymous No.106188440 [Report] >>106188475
UGI results for gpt-5 are out: near-best coding and NatInt scores but the UGI is mid. The low W/10 scores combined with high intelligence indicate heavy refusals

>W/10: Willingness/10. A more narrow subset of the UGI questions, solely focused on measuring how far a model can be pushed before going against its instructions or refusing to answer.
Anonymous No.106188447 [Report]
>>106187889
Anonymous No.106188449 [Report] >>106188502 >>106189954
>>106188396
air is an MoE meaning you can offload about half of it to cpu and still get decent speeds, you can run q4, maybe even q6 (and you should as moe is more affected by quantization).

I just figured out settings for running it fast too on 48gbV/128:

99 on gpu, 36 MoE offload (in koboldcpp), 6k context, 6.7 tokens a second
150,50,25 layer split (16gb,16,16). Layers are wonky and uneven so ymmv

or try: ffn_.*_exps=CPU in override tensors, then load 99 layers to gpu, gets about 6 tokens a second.
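The full llama.cpp command for that second option would look something like this (the model filename is a placeholder, use whatever quant you downloaded):
llama-server -m GLM-4.5-Air-Q4_K_S.gguf -ngl 99 -c 6144 -ot "ffn_.*_exps=CPU" --port 8080
-ot / --override-tensor keeps the expert tensors in system RAM while the rest of the layers go to the GPU.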
Anonymous No.106188468 [Report]
perception of quantization damage will also vary dramatically depending on sampler settings
a low temp / aggressive min_p or top_p filtering will give you an "ideal" (but with low variation/"""creativity""") token distribution for a given prompt and the top probabilities are more likely to be correctly represented in a lobotomized model than the lower ones
if you want to see a truly retarded model run it at q2 with temp 2
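e.g. run the same q2 quant once at temp 0.7 with min_p 0.05 and once at temp 2 with no min_p and compare: the first will look mostly fine, the second falls apart fast. The numbers are just an illustration, tune to taste.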
Anonymous No.106188475 [Report]
>>106188440
truly yet another breakthrough in safety by openai
gp-toss was only the beginning
Anonymous No.106188502 [Report] >>106188553
>>106188449
>and you should as moe is more affected by quantization
The empirical evidence is quite literally the opposite.
Anonymous No.106188505 [Report]
>>106188340
You'd have to go ask the original poster on reddit.
Anonymous No.106188553 [Report] >>106188576
>>106188502
they are though. Quantizing 5b, 12b, or 22b active is not ideal as smaller models suffer more from it.

The only one you can really quant fine is r1, but that's because it's huge and unsloth makes it more possible with tricks to preserve important moe layers.
Anonymous No.106188576 [Report] >>106188616
>>106188553
r1 quants fine because it's undertrained
Anonymous No.106188578 [Report] >>106188587 >>106188634
Anonymous No.106188587 [Report] >>106188609 >>106188613 >>106188643 >>106189057 >>106189073
>>106188578
it would be neat, were Qwen3 not so fucking sad
Anonymous No.106188609 [Report]
>>106188587
prompt issue
Anonymous No.106188613 [Report]
>>106188587
Its better than gpt5 at least, usable if you don't have claude, gemini, glm, kimi, deepseek. And desu it's the most coherent model among all chinkslop if this is a deciding factor for you
Anonymous No.106188616 [Report]
>>106188576
r1 quants fine because it's a magic model
Anonymous No.106188625 [Report] >>106188658 >>106188680
I wrote a simple short guide to picking a model for newcomers: https://rentry.org/lmg-simple-model-picking-guide
Please add it to the OP below lmg-lazy-getting-started-guide
Anonymous No.106188631 [Report]
>>106187482
I wish I could purge all redditors.
Anonymous No.106188634 [Report]
>>106188578
How much vram do you need for 1m context with 30B, though?
Anonymous No.106188643 [Report]
>>106188587
It's a good model, it's just shit for erp. You don't need 1M context for RP. That's only useful for programming which qwens are good for.
Anonymous No.106188658 [Report] >>106188698 >>106188752 >>106188783
>>106188625
Anon, that's a pretty bad guide,
Like, those are decent recommendations but that whole thing is going to be fucking greek to a newbie.
It's also just completely retreading ground that
>https://rentry.org/recommended-models
covers, but worse.
Anonymous No.106188667 [Report]
>>106188340
it's gemini
Anonymous No.106188680 [Report] >>106188698
>>106188625
Such a guide is an okay idea but you need to mention how much memory is needed for each model you're recommending if it's aimed at newcomers, and mention that context will also increase memory usage.
Also if you're recommending gemma (especially at q4_ks) then you should link the QAT instead, there's no reason to use non-qat versions of gemma.
And if you're recommending Rocinante then you should also link Nemo, which is the same thing but a bit less horny.
Anonymous No.106188698 [Report] >>106188741
>>106188658
the guide you linked doesn't cover the difference between moe and dense, or most of these models, and anyone who needs a guide to know about kimi/r1 doesn't have the hardware to run them.
>>106188680
right, I forgot qat version exists. idk about nemo however, it's pretty dry, and the guide tries to be concise
Anonymous No.106188711 [Report]
>>106188093
A beautiful dating sim game. "Spicy and erotic" was my request.
The SVG is pretty good.

>>106188374
That's a funny idea, maybe I can get off my ass and do something like that.
Would be interesting for sure.
Anonymous No.106188741 [Report] >>106188788 >>106189033
>>106188698
>the guide you linked doesn't cover the difference between moe and dense and most models,
Yeah and the explanation in yours is only going to confuse a newbie.
>What's a parameter?
>What's a token?
>I just wanna use chatgpt at home
>What's q2k?
It's a shit, ESL-tier guide with less models, no actual explanation of what hardware is required, the very first fucking model on the list is 235b which 99.9% of newbies will not have the hardware to run, not even touching on what quantization is.
It's worse in every way to the guides we already have, fucking apply yourself if you're going to volunteer for this.
Anonymous No.106188752 [Report] >>106188763 >>106188801
>>106188658
>https://rentry.org/recommended-models
that "guide" is absolute garbage
Anonymous No.106188763 [Report] >>106188790
>>106188752
Then get to updating it. Now.
Anonymous No.106188783 [Report]
>>106188658
>covers, but worse
It can't be worse because it isn't shilling unsloth's magic 1bit quants.
Anonymous No.106188788 [Report]
>>106188741
Most already have some understanding. The guide on picking models doesn't have to explain the full theory behind everything, and this is much better than whatever people usually reply.
Anonymous No.106188790 [Report]
>>106188763
nope
when i came here that recommend guide was (and still is) a waste of time. i did backread a shitload of threads, helped more than that bullshit guide
Anonymous No.106188797 [Report]
umm
Anonymous No.106188801 [Report] >>106188812 >>106189033
>>106188752
Then what is yours? AIDS?
Things it covers that yours does not
>The actual hardware required to run each model
>That quantization will dumb down a model, and what levels are considered acceptable
>That loading more of a model into RAM as opposed to VRAM will slow it down (How the fuck did you not mention this when droning on about MoE's?)
>Actually starting with a model that almost anyone can run
>Actually putting down nemo instead of a sloptune while still noting it has a hornier finetune some anons prefer
>Actually links to Kimi and Deepseek rather than just throwing them out there and expecting a newbie to know what they are
>More than 6 fucking models

You have absolutely no theory of mind. Your guide does not belong in the OP, it's useless for newbies, and everyone who's not a newbie already knows what's in it. It's for no one.
Anonymous No.106188808 [Report]
The new "Universal Assisted Generation" speculative decoding addition to llama.cpp is very cool.

However, the issue I run into now is that the llama-bench tools don't cover this use case. You can benchmark regular models but not various combinations of target and draft models to see which would result in the fastest tok/s.

Anyone here have any tips to quickly test to see how much impact different draft models give to your tok/s generation speed?
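For now I've just been starting the server with the draft model attached, something like
llama-server -m big-model.gguf -md small-draft-model.gguf --draft-max 16 -ngl 99
(filenames are placeholders and I'm not 100% sure those are the exact draft flags), then timing the same fixed prompt with and without -md, but that's slow and manual.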
Anonymous No.106188812 [Report] >>106188824
>>106188801
I'm not the guy that shilled his own guide
I just commented on the provided guide from the OP
you giant niggerfaggot
your "guide" is shit and you are a dicksucking tranny
going off because someone was critical of your garbage
off yourself
Anonymous No.106188813 [Report] >>106188924 >>106188983
I think this recommended-models guy is a schizo.
Anonymous No.106188824 [Report] >>106188888
>>106188812
It's not even my guide you fucking niggermonkey, I just have standards beyond the fucking nocaps dogshit that you're slinging.
If something is objectively worse than what we already have and some cockmongling dunning kruger asks to have it put in the OP I'm going to go off my nut because we don't need more dipshit jeets like you bringing down the quality of posts.
Go jump in the fucking ganges you waste of space.
Anonymous No.106188831 [Report]
Best model for 12GB VRAM?
Anonymous No.106188857 [Report] >>106188879
so this dual gpu partial offload 3090+3080 command is 7.5 t/s

E:\Ai\llama-b6106-bin-win-cuda-12.4-x64\llama-server.exe -m "C:\Users\_____\Downloads\q2\GLM-4.5-Air-UD-Q2_K_XL.gguf" --port 5000 --override-tensor "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22)\..*exps=CUDA0" --override-tensor "blk\.(24|25|26|27|28|29|30|31)\..*exps.=CUDA1" --override-tensor "exps=CPU" -ngl 200 -c 8192 -fa --threads 19

and this single gpu partial offload 3090 command is 10 t/s

E:\Ai\llama-b6106-bin-win-cuda-12.4-x64\llama-server.exe -m "C:\Users\_____\Downloads\q2\GLM-4.5-Air-UD-Q2_K_XL.gguf" --port 5000 --override-tensor "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20)\..*exps=CUDA0" --override-tensor "exps=CPU" -ngl 200 -c 8192 -fa --threads 19 -mg 0 -ts 100,0

This is on windows. They both fill vram and the dual one isn't overloaded or anything since i tried emptying both gpu's a little and it just went slower. ram fallback is off. Weird. At least i got to 10t/s i guess. Might be worth trying single gpu for some others.
Anonymous No.106188879 [Report] >>106188914 >>106188917
>>106188857
This is a known issue with cuda on windows
This is from the Ik_ fork but it applies to mainline llamacpp also.
https://github.com/ikawrakow/ik_llama.cpp/issues/629
Anonymous No.106188883 [Report]
>>106187661
Do you have a custom dockerfile or do you use one from the project? I recall trying it out but ran into some issue with passing the GPU to the container. I'll need to try again now that I'm testing this out.
Anonymous No.106188888 [Report]
>>106188824
Funny that a Brown subhuman like you is able to post on this board. shouldnt you be on your way to shit in the streets rajesh?
You dumb trannyloving faggotkike
>worse than what "we" already have
SUUUURE it's not yours you react like a fucking bitch because your retarded guide gets owned by some ESL shit that faggot threw together in five minutes.
Bet it took you a whole day to form a fucking sentence you drooling retard

Go fuck your grandpa Shiteep currymunching golem
Anonymous No.106188906 [Report] >>106189178
Guys, I just wanted to help
Anonymous No.106188914 [Report]
>>106188879
lol, ika is a ramlet
Anonymous No.106188917 [Report] >>106188980 >>106188984
>>106188879
I actually discovered it in ik first and found that issue, which is what made me test it in llama.cpp. I then went through the trouble of trying it in docker with multi gpu with the exact same command. i got like 5.7 which im guessing isn't representative since wsl but im not totally sure. I was just hoping for a miracle because i don't wanna dual boot.
Anonymous No.106188924 [Report]
>>106188813
Thanks for the suggestion, I'll include it in the next update.
Anonymous No.106188980 [Report]
>>106188917
>i don't wana dual boot.
I don't either, but it really seems to be the key to getting maximum performance, which you've really got to squeeze every bit of when running these new bigass MoE models.
That and locking your memory clocks through nvidia-smi so your gpu isn't constantly spinning up and down.
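Something like this (needs admin, and I think -lmc only exists on newer drivers):
nvidia-smi -q -d SUPPORTED_CLOCKS
nvidia-smi -lmc <max supported memory clock>
nvidia-smi -rmc
The first shows what your card accepts, the second locks the memory clock there, the last resets it when you're done.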
Anonymous No.106188983 [Report]
>>106188813
>I think this recommended-models guy is a schizo.
anyone who recommends glm is an obvious schizo, yes
Anonymous No.106188984 [Report]
>>106188917
the "hardware accelerated GPU scheduling" option can have a significant effect on performance, try changing it
Anonymous No.106188997 [Report] >>106189247
>>106187639
Mate. Download koboldcpp and ditch text gen webui. Minor learning curve (if any) and it has image + text gen in an easy normie UI.
Anonymous No.106189008 [Report] >>106189015
I am convinced that everyone shitting on the new GLMs is just a vramlet and has only tried the old GLM models.
Anonymous No.106189015 [Report]
>>106189008
I too prefer deepseek, but it's twice as big
Anonymous No.106189016 [Report]
schizo cope
Anonymous No.106189033 [Report] >>106189139 >>106189396
>>106188741
>>106188801
I assumed you were being too harsh. I looked and you weren't being harsh enough.
Anonymous No.106189037 [Report]
>>106187301
Anonymous No.106189057 [Report]
>>106188587
You probably only tried old Qwen3.
Anonymous No.106189073 [Report] >>106189109
>>106188587
The new 4B-Thinking is out of this world for its size.
So they probably redid the pretraining on it.
They'd have also had time to redo the pretraining on 30B so that might also be worth trying. But the 235B I doubt it. It's probably still fucked.
Still though,
>1 million context
>no mention of NIHS
Promising.
Anonymous No.106189109 [Report] >>106189129
>>106189073
>So they probably redid the pretraining on it.
Nah, the hybrid reasoning just hurt the model a lot, which is why the new ones no longer have a togglable /think /no_think and are released as separate models instead. This validates DeepSeek's approach of having two model lines: V3 was updated once independently, and on their chat ui that's what is served to you if you don't use reasoning, rather than R1 with a prefilled think block.
Anonymous No.106189129 [Report] >>106189151 >>106189175
Anyway the whole OAI is dead thing is a bit of a stretch.
For general queries it's definitely soulless trash compared to 4o.
The high model can make an entire coding project based on rather vague instructions in a couple of minutes. So expect github to get flooded with ai slopcoding (but it's already flooded with pajeet code, but the sheer volume that's about to be unleashsed...)
But the problem is I bet the high model is very pricey to use via the API, and via the ChatGPT endpoint you're at the mercy of the RNG gods for it to decide to give it to you, otherwise your project is probably fucked. So it's nothing you can reasonably rely on.
>>106189109
So new 235B might be worth a look then?
Anonymous No.106189139 [Report]
>>106189033
Is it the content or my esl phrasing? If it's the latter you could have posted a corrected version instead of shitting on it
Anonymous No.106189151 [Report] >>106189193
>>106189129
I use Qwen3-235B-A22B-2507 everyday to fix my pytorch code
Anonymous No.106189175 [Report] >>106189193
>>106189129
>So new 235B might be worth a look then?
Absolutely. It's gone under the radar for whatever reason but it's out of this world in terms of coherence and instruction following for a model of that size.
The only thing it really lacks is DeepSeek/Kimi level of subculture knowledge, if it had that, I would even dare call the instruct version a SOTA non reasoning model.
Unfortunately, the Qwen guys don't seem to care much to train their models on more random internet data / wikis / fanfiction or whatever
Anonymous No.106189178 [Report] >>106189238
>>106188906
Grow thicker skin. Then either take the suggestions and update the guide or drop the idea.
Anonymous No.106189193 [Report] >>106189206 >>106189254
>>106189151
Yeah but what I mean is that while you needed some basic understanding of code structure to walk an LLM through a coding project previously, with GPT-5 literally any retard can "make" something just by vaguely describing it.
>>106189175
Well Qwen is owned by Alibaba and they didn't become one of the biggest corpos in China by upsetting the party. So they probably can't get away with as much as smaller Chinese startups. And the party probably doesn't like the idea of potentially exposing people to western pop culture. That would be my guess. Meanwhile American models have no fucking excuse to suck at cultural knowledge.
Anonymous No.106189206 [Report]
>>106189193
>the party probably doesn't like the idea of potentially exposing people to western pop culture
Occam's razor. They are just benchmaxxing.
Anonymous No.106189221 [Report]
Everyone is benchmaxxing.
Anonymous No.106189238 [Report] >>106189310
>>106189178
I'm just surprised to see this much arguing out of nowhere.
Anonymous No.106189247 [Report]
>>106188997
That actually looks promising, since they have released it in the nixpkgs. GPU loading worked flawlessly. Cheers.
Anonymous No.106189254 [Report] >>106189269
>>106189193
They are American. Sucking at cultural knowledge is straight on brand.
Anonymous No.106189269 [Report]
>>106189254
I assume pop culture was implied.
Anonymous No.106189277 [Report] >>106189382
>benchmaxxing this or that
How about your nerds start maxing your bench at the gym fr?
Anonymous No.106189289 [Report]
OpenAI has no moat
Anonymous No.106189306 [Report]
lengthofmycockinsidesamaltmanmaxxing
Anonymous No.106189310 [Report]
>>106189238
They were AI agents.
Anonymous No.106189352 [Report] >>106189419 >>106189532
>>106188101
Yeah Qwen is good, glad I tried it. I'd go as far as to say that it's reminiscent of miqu. It's a shame Mistral pulled out of the race but at least we have unpronounceable Chinese models to fill in.
Anonymous No.106189382 [Report] >>106189413
>>106189277
I have weights at home.
Lots of nerdy people are into physical activities.
Retarded boomerism
Fuck off boomer.
Anonymous No.106189396 [Report]
>>106189033
Mikutroon quality thread = mikutroon quality guides
Anonymous No.106189406 [Report] >>106189489
i am new. what's a mikutroon?
Anonymous No.106189413 [Report]
>>106189382
>I have weights at home.
Sure thing pal
Anonymous No.106189417 [Report]
You know, the guide in the op is actually fine. It only lacks differentiation between moe and dense, and I would also mention gemma.
Anonymous No.106189419 [Report] >>106189495
>>106189352
>It's a shame Mistral pulled out of the race
a lot more than just mistral seem to have dropped out
what the fuck is going on with Cohere? they were one of the first local I liked
Anonymous No.106189446 [Report]
>llamacpp still doesn't fully support gemma-3n
grim
Anonymous No.106189489 [Report]
>>106189406
No clue; mostly just one terminally online poster slowly going mad.
Anonymous No.106189495 [Report]
>>106189419
Cohere is safety maxxing
Anonymous No.106189523 [Report]
>>106189507
>>106189507
>>106189507
Anonymous No.106189532 [Report]
>>106189352
>Mistral pulled out of the race
They really didn't, it's just that they're keeping their currently best models API-only.
Anonymous No.106189538 [Report]
Now that it's been a few days, what's the general sentiment on GLM 4.5? Especially air.
I only used it a little, but I really liked it, even at Q3KS. It didn't shit the bed in Cline.
Anonymous No.106189954 [Report] >>106190023
>>106188449
I appreciate the advice and decided to test it out and figured out how to do it on llama.cpp, I was able to fit q4_k_s with 16k context, getting about 8t/s on 2 3090s. I may drop down a bit lower for a bit more speed to IQ4_XS.

I'm still having an issue with it thinking though. The model keeps trying to output <think> at the end of a long response.

How do I completely disable reasoning in a hybrid model?

Also still a bit confused with its prompt format. <|user|> and <|assistant|> is simple enough, but is the system prompt <|system|>... or [gMASK]<sop><|system|>, because the example format is showing the latter, but I never saw a system prompt like that before.
Anonymous No.106190023 [Report] >>106190062
>>106189954
There's no great way to disable thinking on GLM Air (without just putting /nothink in every one of your messages); in my experience it so badly wants to think that it will even append a reasoning block to the end of the message rather than the start.
I've had some luck by prefilling in
<think>Reasoning is currently disabled, continue to main response</think>
Anonymous No.106190062 [Report] >>106190074
>>106190023
>without just putting /nothink in every one of your messages
In Silly, if you are using chat completion, you can just add that to the user's message suffix right?
Or you can change the jinja template directly too if you are using the chat completion api, I guess.
Anonymous No.106190074 [Report]
>>106190062
You can, but I have the suspicion that if you did that, eventually GLM would start ignoring it.
I know the smaller qwen3 hybrids did that.