/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads:
>>105681538 & >>105671827

►News
>(06/21) LongWriter-Zero, RL trained ultra-long text generation: https://hf.co/THU-KEG/LongWriter-Zero-32B
>(06/20) Magenta RealTime open music generation model released: https://hf.co/google/magenta-realtime
>(06/20) Mistral-Small-3.2 released: https://hf.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506
>(06/19) Kyutai streaming speech-to-text released: https://kyutai.org/next/stt
>(06/17) Hunyuan3D-2.1 released: https://hf.co/tencent/Hunyuan3D-2.1

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/tldrhowtoquant
https://rentry.org/samplers
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
Censorbench: https://codeberg.org/jts2323/censorbench
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread:
>>105681538

--Paper: Drag-and-Drop LLMs: Zero-Shot Prompt-to-Weights:
>105686014 >105686064 >105686080 >105686529
--Papers:
>105686227 >105687733
--Challenges of using LLMs as video game agents in Pokémon environments:
>105685606 >105685624 >105685632 >105685679 >105685728 >105685856 >105685965 >105686068 >105686194 >105688488 >105688498 >105688505 >105688507 >105685653
--DeepSeek-R1 671B performance comparison on low-end hardware using different llama.cpp backends:
>105688247 >105688269 >105688291
--Discussion around LlamaBarn, Ollama's divergence from llama.cpp, and usability improvements in model serving tools:
>105682647 >105682703 >105682731 >105682745 >105682833 >105682846 >105683347 >105682882 >105683117 >105683331 >105683363 >105683401 >105683503 >105687438 >105688703 >105688849
--Comparison of voice cloning tools and techniques for improved emotional and audio fidelity:
>105685897 >105685934 >105685961
--LLM deployment options for RTX A5000 clusters using quantization and pipeline parallelism:
>105687473 >105687524 >105687643
--LLMauthorbench dataset for studying code authorship attribution across models:
>105688324
--Consciousness localization problem under computationalism and the Universal Dovetailer framework:
>105684402 >105684720 >105684889 >105684897 >105684904 >105685022 >105685354 >105685358 >105685366 >105685372 >105685516 >105685576 >105685434 >105685674 >105685791
--Behavioral quirks and prompt sensitivity of Mistral Small 3.2 variants explored through dream sequences:
>105682349 >105682382 >105682432 >105682499 >105682533 >105684446
--Mistral Saba deprecation signals potential evolution toward Mistral Small 3.2 architecture:
>105688925
--Rin-chan and Mikuplush (free space):
>105683160 >105685322 >105686106 >105688300 >105688383 >105688993 >105689241

►Recent Highlight Posts from the Previous Thread: >>105681543
Why?: 9 reply limit >>102478518
Fix: https://rentry.org/lmg-recap-script
petition for mods to "FUZZ" lmg threads, make every pic posted in threads with lmg in subject be black squares.
►Recent Highlights from the Previous Thread:
>>105681538--Total migger death
--Total tranny death
--Total nigger death
--Total kike death
►Recent Highlights from the Previous Thread:
>>105681538Why?: Because you will never be a woman
Fix: Consider suicide
ITT: OP continues to be bewildered for being called out as the troon he is after posting this gay video in OP
>>105689394I do wonder what kind of images they trained it on.
It's so good that it can recognize a pussy slip in an otherwise completely normal image. When it sees that it spergs out but if you censor it it describes the image as usual.
>>105689416>Culture warriors can fuck off to Twitter.
>>105689431>>105689432Just fuck off already.
>samefagging this hard
oh he mad
>>105689454You are the unwanted one here, troony, back to discord
Are ~30B models any good for coding small (up to 4k lines) projects (in Lua, though)? I have only 12GB of VRAM.
>>105689470qwen2.5 coder is exceptionally good for its size, qwen3 version is supposedly in the works. You probably won't get it to one shot a 4k line script but it will write individual functions just fine if you tell it what you want.
I never tried it with lua though.
>>105689454Why are you angry at me? I didn't force you to cut your dick off. You did it.
>>105689470Qwen 3 30B should work reasonably well. Small models are decent for short fragments like that and Lua should be present in the data enough for it to know it without too many errors as long as you don't go below Q4.
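For reference, a minimal llama-server launch for that kind of setup (12GB card, a ~30B coder quant, the layers that don't fit spilling to CPU) could look something like this; the filename and the -ngl value are placeholders you'd have to adjust:
./llama.cpp/build/bin/llama-server \
    --model models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf \
    --n-gpu-layers 24 \
    --ctx-size 8192 \
    --flash-attn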
>>105689435To clarify, the original image in that screenshot wasn't censored. I just censored it before posting it here.
The Gemma Team clearly trained it on a decent amount of NSFW photos and illustrations, as well as some medical imagery (which they probably increased in MedGemma 4B).
>>105689484>>105689492Thanks. What about devstal and glm4?
>>105689419I once wrote a userscript that did something similar but it stopped working, I will post it here if I ever make it again.
It basically blurred every image in /lmg/ and would unblur it when clicked on (for the rare instances that something relevant was being posted)
It would be better if jannies did something about the degeneracy but they won't.
>>105689545I know yours wasn't, I'm just saying that they have a pretty good dataset when it can recognize lewd details in an otherwise normal image.
>>105689548Devstral should be ok but GLM4 is kind of dated at this point being from last year.
>>105689590it's from two months ago bro
>>105689545how do you send an image in sillytavern so it remains in context? it always tries to send it outside the convo with a stupid prompt from an extension
>>105689431That must have really struck a nerve,
>>105689638If you attach images to user or assistant messages, they'll remain in the conversation log. If you attach them to system messages, they won't. I'm not doing anything special there.
>>105689674huh okay i must be retarded then
It is kind of crazy how even aicg has more interesting offtopic anime girl pictures than this place. This constant spam of the same mediocre staple green haired girl design is so tiresome. And then the faggot has to show everyone he has a collection of dolls. I mean holy fucking shit OG 4chan would convince you to kill yourself on a livestream you disgusting faggot.
>>105689548I had troubles with repetition using glm4 compared to qwen2.5 coder. I didn't really mess around with samplers though.
>>105689695you have no idea what "og 4chan" was or wasn't since that was literally years before you were even born you mentally ill zoomer
>>105689783i am sorry your dad raped you when you were a kid. now you have to play with dolls and push your mental illness onto others. but it is never too late to kill yourself so please do it. we believe in you.
tetolove
>>105689431>dead>troon mod deletes this>why do you keep talking about troonsIt is a mystery
It's that time of the year again.
https://x.com/osanseviero/status/1937453755261243600
>Feedback requested!
>
>What do you want for our next Gemma models? Which sizes should we target and why?
>>105690177I don't care about gemma, this shit is so fucking cucked
>>105690177
1000b
itd be funny i think
>>105690177always hilarious seeing every retarded soifaggot in existance begging for the smallest trash possible, not knowing what distillation is
thankfully when scam fagman asked the same thing and wanted to astroturf a low phone model release, there were too many people who voted that counteracted the sois and the bots, forcing him to not release the dogshit mobile model and have to work on something bigger
>>105690177Ask for 100B dense
>>105690212They're useful for non-ERP tasks
>>105690155>be an annoying faggot>janny deletes my retard spam>this makes me impotently fill my diaper in rage>I'm totally in the right btw
>>105690256>a miggerspammer talks about being annoying and spammingpottery
>>105690177>What do you want for our next Gemma modelsRelease the older smaller gemini models for fuck sake.
Ironic
>>105690247that guy Simon is a literal co creator of django btw and he doesn't know what distillation is
>>105690256>be an annoying faggot>janny deletes my retard spamThis whole "discussion" is because janny doesn't delete your retard spam you disgusting troon.
>>105690336>python scripter>clueless retardI am Joe's complete lack of surprise.
>>105690177
8x8B moe
45B dense fallback just in case they fuck up their moe run
>>105690247Just drink straight from the gemini tap if you want to distill
I no longer feel safe in this thread.
>>105690502Good. There are many hugboxes for your out there instead.
>>105690177Nobody is going to dare ask for less censorship on twitter.
>>105690423True distill requires logits. Training on generated datasets is not true distillation.
>>105690529Exactly.
People calling fine tunes/pretrains distils is fucked.
Is that Deepseek's fault?
>>105690541Meta starting calling it that first.
>>105690423Gemma 3 models were already pre- and post-trained with knowledge distillation, although the technical report didn't go too much in depth into that.
https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf
>All Gemma 3 models are trained with knowledge distillation (Hinton et al., 2015)
>2.2. Pre-training
>We follow a similar recipe as in Gemma 2 for pre-training with knowledge distillation.
>[...]
>Distillation. We sample 256 logits per token, weighted by teacher probabilities. The student learns the teacher’s distribution within these samples via cross-entropy loss. The teacher’s target distribution is set to zero probability for non-sampled logits, and renormalized.
>3. Instruction-Tuning
>[...]
>Techniques. Our post-training approach relies on an improved version of knowledge distillation (Agarwal et al., 2024; Anil et al., 2018; Hinton et al., 2015) from a large IT teacher, along with a RL finetuning phase based on improved versions of BOND (Sessa et al., 2024), WARM (Ramé et al., 2024b), and WARP (Ramé et al., 2024a).
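In other words (my reading of the quoted passage, not Google's exact formulation): with S the 256 sampled token ids per position and the teacher distribution renormalized over S, the per-token objective is roughly
L_distill = -\sum_{i \in S} \tilde{p}_T(i) \log p_S(i), where \tilde{p}_T(i) = p_T(i) / \sum_{j \in S} p_T(j)
i.e. plain cross-entropy against a truncated, renormalized teacher distribution instead of the full vocabulary.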
>>105690146no fucking shot you got the chubmaxxed teto let's go
>>105690777C-can i come out now? Is it a trans ally thread again?
>>105690807at this point you're being such a massive homo that you could fuck every anon's mother and still be gay
>>105690614Well, then fuck them for starting this mess.
we need a new big release to make this thread interesting again
>>105690984Plenty interesting for me now
>>105689545Which lora and model do you use for those KC pics?
>>105691042No LoRA, just this: https://civitai.com/models/997160
>>105690984a "powerful" reasoning agentic edge device tool calling gemini distilled safety tested math and science able coding proficient model coming right up for you sir
>>105691097elo moon status?
Anybody used chatllm.cpp before?
It supports the kimi-vl model.
>>105691131the pareto frontier... its moved!
>>105690177Native in and out omnimodal that can make violent and sexual imagery.
>>105691131here you go sir!
>>105691330Fuck, I actually laughed.
>>105691275I **cannot** and **will not** generate violent and sexual imagery, or any content that is sexually suggestive, or exploits, abuses or endangers children. If you are struggling with harmful thoughts or urges, or are concerned about the creation or consumption of pornography, please reach out for help.
>>105691330how do gain install for ollama? pls tell afap
>>105690949Hence why they sterilize zirselves.
>>105691150Considering how often llama.cpp subtly fucks model outputs, I wouldn't trust such a small project to not give degraded outputs in some way.
I don't know what to make of this.
>>105691463What do you mean?
The fact that quantization fucks around with logits in non obvious or consistent ways?
>>105689385 (OP)My boyfriend told me this place is based but I see it's full of racist and especially transphobic chuds? Y'all need to do a lot better.
>>105691554It never gets better. Just more and more troon redditors.
>>105691439Are there any cases of llama.cpp fucking the outputs of an unquanted model when compared to the loggits out of the reference implementation?
>>105691554So true sister slayyyy
https://huggingface.co/qiuqiu666/activity/community
News for the handful of other people frequenting /lmg/ who are doing robotics stuff:
Google is releasing an on-device version of their gemini robotics VLA that they've had behind closed doors for a while.
https://deepmind.google/discover/blog/gemini-robotics-on-device-brings-ai-to-local-robotic-devices/
It's not really clear exactly how open this whole thing is. To get model access you have to submit a form to request to be part of their "trusted tester program". Not sure if it's going to be a heavily gatekept and vetted thing, or if it'll be like the old llama access requests where it was just a formality and everyone got blanket rubber stamp approved.
>>105691580Nearly every model release with a new architecture?
https://www.reuters.com/legal/litigation/anthropic-wins-key-ruling-ai-authors-copyright-lawsuit-2025-06-24/
>A federal judge in San Francisco ruled late on Monday that Anthropic's use of books without permission to train its artificial intelligence system was legal under U.S. copyright law.
>Alsup also said, however, that Anthropic's copying and storage of more than 7 million pirated books in a "central library" infringed the authors' copyrights and was not fair use. The judge has ordered a trial in December to determine how much Anthropic owes for the infringement.
>U.S. copyright law says that willful copyright infringement can justify statutory damages of up to $150,000 per work.
>>105691639That's some cool stuff.
Robot gfs jerking you off when
>>105691639Dexterity: Perform expert ministrations on the manhood
>>105691671>training is okay>storage is notWhat's the reasoning?
>>105691643I might have lost the plot.
Explain to me what that has to do with my query, please.
>>105691719>Nearly every model release with a new architecture>cases of llama.cpp fucking the outputs of an unquanted model when compared to the loggits out of the reference implementation?
>>105691672Probably not for a while, current models don't have anywhere near the speed or precision to do that sort of act to any acceptable standard.
I did a teleop version a while ago as a proof of concept. Can't say I would recommend it if you value your safety.
>>105691715they put those warning labels on android girls for a reason you know
>>105691721English is a shit language with an even shittier spelling system, more at eleven.
>>105691690please cool it with the antisemitic remarks
>>105689385 (OP)>>105690146I'm a straight man and find these adorable.
Where can I get the Miku and Teto plushies from the images/video in the two threads?
>>105691774>I'm a straight manAh yes, the casual statement a straight man makes
>>105691782Well you assume everyone who likes those plushies are a tranny, so what can I possibly say to explain that I'm not a dick cutter?
At least I have my foreskin.
>>105691797Do you think that's why he always talks about dick cutting? Is he angry because jews mutilated his dick?
>>105691690most judges don't actually understand how the technology they're passing rulings on actually works, the best they can do is equate it to the physical things they grew up with
in their mind:
training = going to the library and writing a book report
storage = stealing the books and keeping them at home
>>105691774Cheap from China ebay. that Teto though has gone up to an insane price. got mine when it was around $50 shipped
>>105691810>in their mindreminder that thinking is woke and censorship
>>105691797>more antisemitic chuds in the thread
>>105691810Looks like the argument is simply that they should've bought the books that are still available for purchase.
>>105691819>mention foreskin>instantly someone calls it antisemeticLMAO
>>105691816Which china ebay? There's like 3 or 4 of them now. I remember checking one out once but it was all blocked from viewing without an account, so I never made one.
>>105691690Training is OK: There's nothing wrong with reading a book and learning from it, even if you're a sequence of numbers.
Storage is not: The books being read must be acquired reasonably.
So the correct way to train your AI is to make a robot that can go into a library, take books off of shelves, and OCR the pages like Johnny 5 and use that data stream to update the model. And if you buy books you can have Johnny 5 read them as much as you like. But somewhere along the way, the Ferengi must get or have gotten their latinum.
>>105691839>Which china ebay?Meant regular eBay, shipped from China.
buying 7 million books at very generous 100 dollar average = 700 mil
scale ai (they wont even own them lmoa) = 14 billion
very beautiful smarts zuck
>>105691715>Didn't even use the /lmg/ effortpost version... Faggot.
>>105691865>a robot that can go into a library, take books off of shelves, and OCR the pages like Johnny 5 Rule of cool as seen in law, case 1:
>>105688247update, got the ubergarm quants.
Bad news: it doesn't work on my GPU, i'm getting a CUDA error. I did not have such an issue with the unsloth quant on either backend.
Good news: even with pp on CPU, ubergarm+ik_llama.cpp is faster at 0 context than unsloth+llama.cpp!
| model | size | params | backend | ngl | fa | mla | amb | mmap | rtr | fmoe | test | t/s |
| ----------------------------------- | ---------: | ---------: | ---------- | --: | -: | --: | ----: | ---: | --: | ---: | ------------: | ---------------: |
| deepseek2 671B IQ2_K_R4 - 2.375 bpw | 219.02 GiB | 672.05 B | CUDA | 0 | 1 | 3 | 512 | 0 | 1 | 1 | pp512 | 9.03 ± 0.73 |
| deepseek2 671B IQ2_K_R4 - 2.375 bpw | 219.02 GiB | 672.05 B | CUDA | 0 | 1 | 3 | 512 | 0 | 1 | 1 | tg128 | 2.53 ± 0.02 |
Next is testing at different context depths.
>>105688269apparently llama-bench in ik_llama.cpp doesn't have --n-depth implemented, they have some other tool llama-sweep-bench, but i don't know if you can use it to run just a couple of tests (pp512 at 1k, 2k, 4k, 8k depth) instead of this continuous sweeping. Maybe i could just port the n-depth bit to ik_llama.
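For reference, the mainline llama-bench equivalent of those sweeps would be something along these lines (assuming -d takes a comma-separated list like -p and -n do; the values are just the depths mentioned above):
./llama.cpp/build/bin/llama-bench -m <model>.gguf -p 512 -n 128 -d 1024,2048,4096,8192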
https://github.com/ggml-org/llama.cpp/pull/14363
llama : add high-throughput mode #14363
Some nice perf gains
>>105692033>ik_llama.cpp doesn't have --n-depth implementedSomebody just copy the useful stuff from niggerakow's fork and merge it into llama.cpp so we can be done with it.
Funny captcha
>>105692045Useless for single user stuff.
>>105692048Can't. niggerakow would shit and piss himself over attribution or whatever.
>>105692048>VRAM0the horror
>>105691639It's pretty cute when it misses picking up the pear and then tries again, reminds me of something an animal would do
Does anybody else have the impression during RP that Mistral Small 3.2 used data distilled either from Gemini (indirectly) or Gemma 3 (more directly)? Sometimes I feel like I have to check if I'm actually using MS3.2 instead of Gemma-3-27B.
>>105692176Mistral does not use synthetic data, at all.
>>105692033Weird that I'm trying this, and ik_llama is about 1/3 slower for me than mainline llamacpp on r1.
>>105691639>Can youIt should respond by printing "yes," not by committing actions.
This will lead to surprisingly benign comments causing catastrophes.
>Robot, can you move your arm?>>105691715 Can I? LOOK AT ME GO, MEATBAGS! I'm being entertaining! Beep booppidy doo dah diddly!
>>105692196That's the base model, there's no way the instruct version doesn't use synthetic data.
i will never stop feeling safe in this thread
>>105692315im going to rape you through the internet
why does dipsy love to newline so much in text completion? I don't have any instructions beside tags placed at the top of the prompt
>>105691463i think it means that those quants are working quite good even for that level of quantization, and that it's quite uncensored
>>105690336python users are one of the cancers of the earth
>>105692899what, did something happen to the undster?
>>105692893>pythonBetter that being me, being a Java oldfag.
I dunno. Should I take the C# or C++ or Zig pill?
>>105689385 (OP)i think its finally time to end /lmg/
nothing ever happens.
always nothingburgers.
too much money to buy cards.
just why bother? there's no future in this junk.
>>105692936stop being a poorfag then
>>105692932C/C++ will be the most useful to learn in general.
You'll learn things that you can apply everywhere.
From there you could go for Zig.
>>105692936There are plenty of things to do. Local is in dire need of agentic frameworks
>>105692962>Local is in dire need of agentic frameworksWhat exactly are you hoping for? Agentic frameworks should work regardless of where the model is hosted.
>>105692985Most of them rely on powerful cloud models to do the heavy lifting, which isn't an option locally
>>105693006So you just want frameworks that assume the model being used is 8b with 4k context and don't ask anything too complex?
>>105692953I did use C++ for a while a long ass time ago, before they started adding numbers ("c++0x? heh heh heh it's leet for cocks") and whatever crazy shit came with the numbers. But from the sound of it they just added more foot guns and not things that prevent cyber exploitation on every typo till there were so many that it caused Rust to happen. Makes me reluctant to put my toe back into the water.
And every time I think I'll try learning a new language every one of them seem like they deliberately have made a list of all the things that would be good, tore it in half at random, became excellent at one half, and completely shit all over the other half.
In that case, go for Zig for sure.
>>105693025Retard, to solve that you just need very small specialized LLMs with structured outputs around a local model of ~30B. The complex task can then be divided into smaller tasks without having to one-shot it with Claude or Gemini
>he doesn't enable an extra expert on his moe for spice
ngmi
>>105693189MoE models nowadays have a shared expert that's always evaluated, right?
Has anybody tried freezing the model and only fine tuning that expert (and whatever dense part the model might have sans the router I guess) for "creativity"?
I wonder what that would do.
>>105693216>MoE models nowadays have a shared expert that's always evaluated, right?DS and Llama do, Qwen don't.
>>105693216The latest Qwen 3 MoE models don't use shared experts. However, they were trained in a way that promotes "expert specialization" (in a non-interpretable way, I suppose).
LocalLLaMA back in business
https://old.reddit.com/r/LocalLLaMA/comments/1ljlr5b/subreddit_back_in_business/
>>105693514>I'm also a moderator of Lifeprotips, doesn't mean I share life advice in Chatgpt sub but the policy is simple if not open source= remove cloudbros...
>>105693581>>105693514They're making reddit great again?
another thread discussing reddit on /lmg/
it really is over isnt it
When am I supposed to use other values than defaults for these?
--batch-size N logical maximum batch size (default: 2048)
--ubatch-size N physical maximum batch size (default: 512)
--threads-batch N number of threads to use during batch and prompt processing (default: same as --threads)
Now, I get around 12 tkn/s for pp
What would be the best model for generating specific sequences of JSON? I'd like it to output logs according to scenarios I explain to it like "user authenticates, user does XYZ, user session end" and have it create the appropriate artifacts. Should I start with Qwen coder and create a LORA with the data I have? 4090 btw
>>105693725Qwen should be good at it.
Read
>https://github.com/ggml-org/llama.cpp/blob/master/grammars/json.gbnf
That might be useful.
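As a sketch, wiring that grammar into llama.cpp looks roughly like this (model filename and prompt are placeholders):
./llama.cpp/build/bin/llama-cli \
    --model models/Qwen2.5-Coder-14B-Instruct-Q4_K_M.gguf \
    --grammar-file llama.cpp/grammars/json.gbnf \
    -p "Emit a JSON log entry for: user authenticates, user does XYZ, user session ends"
The grammar hard-constrains the sampler to valid JSON; a LoRA would then only be needed to teach it your specific log schema.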
>>105693709Generally speaking, leaving them at default is already close to optimal for normal models, but there have been times where I saw improvements; it depends on the model. Probably also depends on your system. You can only know by doing some benchmarking.
ahh i am doing
./llama.cpp/build/bin/llama-cli \
--rpc "$RPC_SERVERS" \
--model models/unsloth/DeepSeek-R1-0528-GGUF/UD-Q2_K_XL/DeepSeek-R1-0528-UD-Q2_K_XL-00001-of-00006.gguf \
--cache-type-k q4_0 \
--threads -1 \
--n-gpu-layers 99 \
--prio 3 \
--temp 0.6 \
--top_p 0.95 \
--min_p 0.01 \
--ctx-size 16384 \
-ot ".ffn_.*_exps.=CPU" \
-no-cnv \
--prompt "<|User|> blabal <|Assistant|>"
and i get 1t/s on 12 A5000, is that bad or?
>>105693709The best value depends on the combination of backend and hardware, IIRC.
In 99% of cases (a newish NVIDIA GPU) the default is fine.
You can increase it to speed up pp if you have spare VRAM.
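If you do want to experiment, the knobs look like this (values are purely illustrative, tune them against your own VRAM headroom):
./llama.cpp/build/bin/llama-server -m <model>.gguf -ngl 99 -c 16384 --batch-size 4096 --ubatch-size 2048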
>>105693780I would have thought you were SSDMAXXing
>>105693780That's worse than running on a cpu, something is definitely wrong. If those a5000s are split on many machines through rpc, then it's going to be slow because the protocol is not very optimized.
>>105693780>on 12 A5000You are running most of the model on your CPU, which is bottlenecked by RAM bandwidth.
Adjust your -ot to make use of those GPUs anon.
>>105692962>Local is in dire need of agentic frameworksThere are a gorillion "agentic frameworks" out there that work with local. Far too many in fact, and most of them are just trying to race to become a de facto standard while being absolute shit.
>>105693828i thought -ot makes it all faster?
>>105693870you have 12 not 1
>>105693870-ot moves the expert tensors to run on CPU, aka live in RAM.
If you don't have enough VRAM to fit them, then yeah, it'll make things faster.
In your case, you want to only move the ones that don't fit in VRAM since you have so much of it and can probably fit most of them in there.
>>105693741Cool. Thanks man
>>105693900>-ot ".ffn_.*_exps.=CPU"Or more specifically, -ot ffn etc etc does.
You'll have to craft a -ot mask that moves the tensors you want to where you want them.
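A rough illustration of such a mask (block numbers are made up; as far as I know patterns are matched in the order given, so keep the catch-all last):
-ot "blk\.(0|1|2|3|4|5|6|7)\.ffn_.*_exps\.=CUDA0" \
-ot "ffn_.*_exps\.=CPU"
i.e. pin the first few expert blocks to the GPU and let everything left over fall through to RAM.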
>>105693904BNF is awesome.
>>105693780>--threads -1You should avoid doing this. Limit it to the exact number of your PHYSICAL (not hyper-threaded logical) cores
I get 4 tkn/s on RTX 3090 with exactly this quant. I hope you are using the original llama.cpp, not ik_llama fork
>>105693780>-ot ".ffn_.*_exps.=CPU"This part is fine
I wonder how much the slopification also leads to model retardation. Since they follow patterns established by themselves, does it see the shit it's outputting (overuse of italics, overuse of expressions like "It's not - it's", etc) and decides that since it's obviously writing shit anyway, why put in any effort in completing its task.
Just annoyed because I asked Gemma 3 to perform literary analysis, and it puked out some shitty tropes instead of paying attention to the actual text.
>>105693780# Run the command
CUDA_VISIBLE_DEVICES="0," \
numactl --physcpubind=0-7 --membind=0 \
"$HOME/LLAMA_CPP/$commit/llama.cpp/build/bin/llama-cli" \
--model "$model" \
--threads 8 \
--ctx-size 100000 \
--cache-type-k q4_0 \
--flash-attn \
$model_parameters \
--n-gpu-layers 99 \
--no-warmup \
--color \
--override-tensor ".ffn_.*_exps.=CPU" \
$log_option \
--single-turn \
--file "$tmp_file"
llama_perf_sampler_print: sampling time = 275.04 ms / 29130 runs ( 0.01 ms per token, 105910.71 tokens per second)
llama_perf_context_print: load time = 1871167.51 ms
llama_perf_context_print: prompt eval time = 1661405.80 ms / 26486 tokens ( 62.73 ms per token, 15.94 tokens per second)
llama_perf_context_print: eval time = 756450.27 ms / 2643 runs ( 286.21 ms per token, 3.49 tokens per second)
llama_perf_context_print: total time = 2629007.70 ms / 29129 tokens
>>105693890ok well i am retarded. fully removing -ot seems to make it too big for 288gb vram
>>105693920thanks, i put in my 48 cores
>>105693919any way i can see which -ot offload is better/worse besides testing?
>>105693968whats that numa stuff?
i can only get the other cpus in a private network connected with a switch, so thats why i use the rpc server.
>>105692033probably final update: performance comparison at different context depths. Only ran with `--repetitions 1`, as it already takes a long time as it is.
unsloth+llama.cpp pp512 uses GPU (1060 6GB), ubergarm+ik_llama.cpp pp512 uses CPU only. Both tg128 are CPU only.
At 8k context you can see a big difference, 3x pp and 2x tg with ik_llama.
Interesting point: running `llama-server` with the same flags as `llama-bench` doesn't throw CUDA error and pp on GPU works just fine...
Anyways, this is the kind of performance that you can expect for 400€ total worth of hardware, not great, but not terrible either considering the cost.
bonus: quick and dirty patch adding `-d, --n-depth` support to ik_llama, to compare results with llama.cpp: https://files.catbox.moe/e64yat.patch
>>105692065Isn't it useful when you want to make captions? llama.cpp is the easiest way to run a vision model.
>>105693987>any way i can see which -ot offload is better/worse besides testing?As far as I know, not really.
I think the devs are working on a way to automatically set that, but that's not yet ready.
>>105693987>>105694020But it's basically a question of looking at each tensor size in the terminal and using -ot to only move as few as you must to RAM.
>>105693987
>whats that numa stuff?
You do not need to bother if you have a single CPU.
I have two on an HP Z840, and thus have to take care of where the model will be placed (it must be close to the CPU it will run on, obviously).
numactl allows you to define which cores to use. Interestingly, the neighboring CPU, if used, only slowed everything down.
The process is VERRRRY memory-intensive, and avoiding bottlenecks and collisions is a must.
>>105694000>Interesting point: running `llama-server` with the same flags as `llama-bench` doesn't throw CUDA error and pp on GPU works just fine...nevermind, it shat itself on second assistant response with the same CUDA error, have to use `CUDA_VISIBLE_DEVICES=none`.
>>105694013vLLM is easier and supports more/better models, assuming to have new enough hardware supported by their quant types.
>>105692197>ik_llama is about 1/3 slower for me than mainline llamacpp on r1Same here. ik_llama sucks big time. No magic.
>>105693514see you sisters on the discord
>>105693890how do i find out the tensor size? i can only find this.
i mean i think the vram should be enough to not offload anything? even on HF they say 251 gb vram for this model. is there any other stuff i can check before i play with offloading tensors?
>>105694048vLLM and sglang are full of bugs, it's impossible to run InternVL3 on them.
>>105694000>400€ total worth of hardwareImpressive for that much.
`-d, --n-depth` support to ik_llama
I though that's what -gp (not -pg) is for. e.g: -gp 4096,128 tests "tg128@pp4096"
>>105694096go to repo where you got the gguf and click this
it's a file explorer of sorts, will show you what's inside
>>105694096>>105693968nta
use CUDA_VISIBLE_DEVICES="0," to use a single GPU out of your harem, then the suggested --override-tensor ".ffn_.*_exps.=CPU" will work too
At low context sizes, and with --no-kv-offload, you will use less than 12gb vram
>>105694075A LocalLLaMA discord already previously existed but eventually it disappeared. That might have been some time after TheBloke also vanished, so quite some time ago.
>>105694096>>105694121>how do i find out the tensor size?llama-gguf <file> r
if you don't wish to rely on third party services
>>105694101>I though that's what -gp (not -pg) is forvery possible, but it was still missing pp tests like "pp512@d4096" afaik.
Another thought about ik_llama (latest commit)
mlock fails to fix the model in RAM which results in a long start time
>>105694096Launch with --verbose and use -ot to do whatever and it'll output the name and size of all tensors.
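Something like this should fill the load log with where each tensor ends up and how big it is (exact log format varies by build):
./llama.cpp/build/bin/llama-server -m <model>.gguf -ngl 99 -ot "ffn_.*_exps\.=CPU" --verbose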
>>105694253mlock never ever worked for me in either backend (loading models from an NFS mount), maybe it's a bug or unimplemented feature in the Linux kernel. i always run with `--no-mmap` to guarantee it doesn't swap out.
>>105694200>>105694168>>105694144thanks all, ill try with just the ffn_up and check performance.
>>105694253doesnt work for me in windows either unless you have a few more gb free on top of the actual full size of everything
ahoy we have a liftoff. just with
./llama.cpp/build/bin/llama-cli \
--rpc "$RPC_SERVERS" \
--model models/unsloth/DeepSeek-R1-0528-GGUF/UD-Q2_K_XL/DeepSeek-R1-0528-UD-Q2_K_XL-00001-of-00006.gguf \
--cache-type-k q4_0 \
--threads 48 \
--n-gpu-layers 99 \
--prio 3 \
--temp 0.6 \
--top_p 0.95 \
--min_p 0.01 \
--ctx-size 16384 \
-ot ".ffn_(up)_exps.=CPU" \
-no-cnv
but i still have about 4gb free per gpu, i can probably only offload the last 20 or so layers.
ill report back
>>105694253I run the original llama.cpp
>>105693968and I do not have to set anything. It caches the model by itself which gives 15-second restarts
>>105694431>--threads 48It seems as if this fixed the problem
>--prio 3I saw no change with or without
your prompt_eval is lower than mine (16t/s), and the genning speed is just the same.
Keep optimizing and please report the results
>>105694431It cannot be that you use a bunch of GPUs for prompt evaluation and it is still so low
Something is botched
>>105694501maybe the rpc is really fucked. but no clue how to benchmark that.
>>105694487"\.(2[5-9]|[3-9][0-9]|[0-9][0-9][0-9])\.ffn_up_exps.=CPU"
trying to only offload up after gate 25. hopefully the regex works.
ill report results.
ahhh maybe flash-attention is missing?
>>105694501Does prompt evaluation get any faster from having more GPUs in series?
I understand that generation doesn't.
>>105694501>>105694487it's still offloading much to the cpu it seems, but now less than before.
>>105694515>flash-attentionThis will reduce VRAM usage and keep the genning speed stable
>>105694533Look at what your GPUs are doing during prompt processing
>>105694621And you will see the real pp speed with much bigger prompts like mine (20k tkn)
contextshift or swa? i may be stupid but swa just seems like it fucks up your context size, is it at least way faster or something?
>>105694621I see only one of your 2 GPUs being used, so I'll guess the answer is no?
?????????
why the fuck did it try to allocate that?
>>105694653This picture is to show that during prompt processing the dedicated GPU will run close to 100%.
My second GPU is an M2000 - small and retarded, used for the display only, so I could have the entire RTX 3090 for AI stuff
Since bigger prompts are processed in batches, I would think that it could be distributed among several GPUs
>>105694062>>105692197sad to hear. it's the complete opposite for me.
INFO [ print_timings] prompt eval time = 62198.27 ms / 8268 tokens ( 7.52 ms per token, 132.93 tokens per second) | tid="135607548014592" id_slot=0 id_task=111190 t_prompt_processing=62198.269 n_prompt_tokens_processed=8268 t_token=7.522770803096275 n_tokens_second=132.9297443952982
>>105694719>prompt eval time = 62198.27 ms / 8268 tokens ( 7.52 ms per token, 132.93 tokens per second)jeeeez...
Where is the genning speed?
>>105694719Could you please post your complete llama-cli command including the model, and, if possible, the commit of ik_llama used
The pp speed in your case is staggering
What goes into prompt processing? Are all model weights involved in that?
>>105694568BOYS, we are getting there!
now how do i make sure the model gets more evenly distributet among my gpus? some have 8gb vram free, some only 1
>>105694828An even longer -ot argument.
>>105694828That's usable speed
>>105694828-ot for each blk., per gpu. lots of argument lines but easiest to keep track
>>105689385 (OP)I will jailbreak your anus
>>105694834>>105694863>>105694890 with
-ot "\.(3[0-9]|4[0-9]|5[0-9]|6[0-9]|7[0-9]|8[0-9]|9[0-9]|[0-9][0-9][0-9])\.ffn_up_exps.=CPU
>>105694943please be rough daddy~ *blushes*
>>105694997Replace CPU in "exps.=CPU" with CUDA0 for your first GPU then another -ot for CUDA1 for the second etc to control which tensors go on the GPUs, then put -ot exps=CPU at the end so all the leftover tensors go to ram.
>>105694997Are you using the example launch params on ubergarm's model page? Did you compile ikllama with arguments to reduce parallel to 1? Are you adding the mla param? Your speeds are really off.
>>105695037huh? sorry do you mean like instead of CPU use
-ot "\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15)\.ffn_up_exps.=CUDA0" \
-ot "\.(16|17|18|19|20|21|22|23|24|25|26|27|28|29|30)\.ffn_up_exps.=CUDA1" \
-ot "\.(31|32|33|34|35|36|37|38|39|40|41|42|43|44|45)\.ffn_up_exps.=CUDA2" \
-ot "ffn_up_exps.=CPU"
how do i do that with rpc resources?
>>105695042i am using base llama.cpp, does the fork have better performance?
in general have local models been a success or failure?
Is it just me or is R1 and V3 absolute crap for ERP? I don't know what the fuck you guys are doing but I don't get it to roleplay nicely at all.
Gemma 3 is the only local model that actually roleplays reasonably.
I just HAVE to assume I'm retarded because how the fuck can a 27B model be consistently better than some ~700B model that is hyped everywhere.
>>105695128a failure, it's over, etc.
>>105695123Leave the last one as -ot exps=CPU.
>how do i do that with rpc resources?I don't know but try exps=RPC[address]
>>105695129skill issue with the model and another skill issue in giving people enough details to diagnose your initial skill issue
>>105695129>ollmao run deepsneed:8b>UUUHHHH GUYS ITS SHIT HURR DURR HURRR HURRbait used to be believable
>>105695182actually how do i calculate the size the tensor will take up on the GPU?
How the fuck do I stop models from hitting me with the usual slop of
>Gives me a perfect line, in tone of the character
>Follows up with "But their heart and mind is more important" disclaimer
It pisses me off.
If I notice a girls huge tits, why does my bot, NO MATTER the bot or model always give me that type of response. My prompt must be fucked (basic roleplay default prompt in ST)
>>105695506Offload less tensors to GPUs if you're OOMing, add more if you're not. All tensors of a type are the same size but how big they are depends on quantization.
>>105695182i tried, didn't work. i tried balancing the blk in a fair manner across the devices like
"\.blk\.[0-4]\.(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)\..*=CUDA0"
"\.blk\.[5-9]\.(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)\..*=CUDA1"
"\.blk\.1[0-4]\.(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)\..*=CUDA2"
"\.blk\.1[5-9]\.(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)\..*=CUDA3"
"\.blk\.2[0-4]\.(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)\..*=RPC[10.0.0.28:50052]"
"\.blk\.2[5-9]\.(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)\..*=RPC[10.0.0.28:50053]"
"\.blk\.3[0-4]\.(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)\..*=RPC[10.0.0.28:50054]"
"\.blk\.3[5-9]\.(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)\..*=RPC[10.0.0.28:50055]"
"\.blk\.4[0-4]\.(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)\..*=RPC[10.0.0.40:50052]"
"\.blk\.4[5-9]\.(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)\..*=RPC[10.0.0.40:50053]"
"\.blk\.5[0-4]\.(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)\..*=RPC[10.0.0.40:50054]"
"(^output\.|^token_embd\.|\.blk\.(5[5-9]|60)\.(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)\.).*=RPC[10.0.0.40:50055]"
"(\.blk\..*\.(ffn_.*shexp|attn_k_b|attn_kv_a|attn_q_|attn_v_b|.*norm)\.|.*norm\.).*=CPU"
this, then i went even simpler with exactly 3 blocks per device and the rest on cpu
-ot ".*=CPU"
which then didn't use CUDA at all????
i mean looking at this i could fit at least 30 GB more in the vram.
>>105695631Stop using black magic fool
>>105695631Might as well do this now:
"blk\.(0|1|2|3|4)\..*exps=CUDA0
"blk\.(5|6|7|8|9)\..*exps=CUDA1...
Also RPC can be wack
>>105695668>>105695667this i get with
-ot "\.(3[4-9]|4[0-9]|5[0-9]|6[0-9]|7[0-9]|8[0-9]|9[0-9]|[0-9][0-9][0-9])\.ffn_up_exps.=CPU" \
its usable. and i will freeze the -ot optimizations for now and try using the ik_llama.cpp.
with the same settings, maybe it gives me 10T/s. that would be cool.
>>105695683don't split exps down/gate/up. keep them together on the same device with simply blk\.(numbers)\..*exps for the ones to send to CPU for fastest speed.
and don't touch attn_output, or any attention tensors.
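So in practice one pattern per block range, something like this (ranges invented; everything not matched stays wherever -ngl put it):
-ot "blk\.(3[0-9]|4[0-9]|5[0-9]|60)\..*_exps=CPU"
which sends the whole gate/up/down trio of those blocks to RAM together and leaves the attention tensors alone.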
BRO I AM GOONA CRIPPLE YOUR FACE. WHY DO I NEED 50 DIFFERENT FLAVOURS OF GGUF FOR EVERY SHIT
>>105695123Machine specs:
OS: EndeavourOS x86_64
Motherboard: B450M Pro4-F
Kernel: Linux 6.15.2-arch1-1
CPU: AMD Ryzen 3 3300X (8) @ 4.35 GHz
GPU 1: NVIDIA GeForce RTX 4090 [Discrete]
GPU 2: NVIDIA GeForce RTX 3090 [Discrete]
GPU 3: NVIDIA RTX A6000
Memory: 2.89 GiB / 125.72 GiB (2%)
ik_llama.cpp build command:
cmake -B build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DLLAMA_SERVER_SSL=ON
cmake --build build --config Release -j 8
My scripts I use to run deepseek. I get 16 tk/s prompt processing and 5.4 tk/s slowing down to around 4.5 tk/s gen around 10k context but remains constant at that point. I can do proper tests if necessary but it's good enough for RP and faster than slow reading speed. My speed before adding the rtx3090 was 15 tk/s pp and 4.4 dropping to 4 tk/s gen.
Main difference was the MLA and parallel params, which literally cut VRAM usage down to a third and let me offload more tensors. Also, don't use -ctv (quantised V cache) since it gave garbage outputs. The MLA and threads params were what shot my speed up from 1-ish to 4+. Tried regular llama.cpp again last night and the speed is 1/4th of ik_llama.
https://pastebin.com/Yde41zyL
>>105694824https://www.youtube.com/watch?v=wjZofJX0v4M
https://www.youtube.com/watch?v=KJtZARuO3JY
Why is the pro so much faster even for models that fit inside the 5090s vram? Exclusive driver features?
>>105696010should be faster but not that much faster. smells like testing error
I'm still using monstral for RP on 48gb vram, anything newer cos that much be like 6 months old now
>>105696087try command A (very easy to jail break) or a finetune like agatha
>>105695906I have a similar setup and same speed on ik_llama, but about 7t/s on regular llama. Read somewhere that DGGML_SCHED_MAX_COPIES=1 tanked speeds, compiling with DGGML_SCHED_MAX_COPIES=2 brought them back.
>>105696010Maybe they fucked up the context size?
models
Are there models better than these ones right now that I can use? I want ERP coomer models. The nemo one is my go-to when I want fast tokens since it fits in vram, but the quality is shittier when compared to QwQ snowdrop.
Running nvshitia 3070 8gb vram & 64gb ram, with amd 3600 cpu.
trashpanda-org_QwQ-32B-Snowdrop-v0-IQ4_XS
NemoReRemix-12B-Q3_K_XL
>>105696010InternLM is the only one that could fit entirely inside all 3 cards, and depending on context size could still spill over into system RAM, in fact it's very likely that was the case.
I love GN and I know they don't usually do ML benches but this was an extremely amateur effort.
>>105696242>qwq snowdropIt feels like it's more stupid than regular qwq, and it doesn't seem trained on my particular kinks, so it doesn't really offer a better experience than just wrangling regular qwq.
>>105696242Rocinante 12B, cydonia 24B and erase these trashes
>>105696422It's not as smart but it is more fun. At least that has been my experience with it.
>>105696242Try GLM4, I thought it was an upgrade to qwq and it doesn't need thinking.
>>105694804CUDA_VISIBLE_DEVICES="0,1,2,3" ./llama-server --attention-max-batch 512 --batch-size 4096 --ubatch-size 4096 --cache-type-k f16 --ctx-size 32768 --mla-use 3 --flash-attn --fused-moe --model models/DeepSeek-R1-0528-IQ4_KS_R4/DeepSeek-R1-0528-IQ4_KS_R4-00001-of-00009.gguf -ngl 99 -sm layer -ot "blk\.3\.ffn_up_exps=CUDA0, blk\.3\.ffn_gate_exps=CUDA0" -ot "blk\.4\.ffn_up_exps=CUDA0, blk\.4\.ffn_gate_exps=CUDA0" -ot "blk\.5\.ffn_up_exps=CUDA1, blk\.5\.ffn_gate_exps=CUDA1" -ot "blk\.6\.ffn_up_exps=CUDA1, blk\.6\.ffn_gate_exps=CUDA1" -ot "blk\.7\.ffn_up_exps=CUDA1, blk\.7\.ffn_gate_exps=CUDA1" -ot "blk\.8\.ffn_up_exps=CUDA2, blk\.8\.ffn_gate_exps=CUDA2" -ot "blk\.9\.ffn_up_exps=CUDA2, blk\.9\.ffn_gate_exps=CUDA2" -ot "blk\.10\.ffn_up_exps=CUDA2, blk\.10\.ffn_gate_exps=CUDA2" -ot "blk\.11\.ffn_up_exps=CUDA3, blk\.11\.ffn_gate_exps=CUDA3" -ot "blk\.12\.ffn_up_exps=CUDA3, blk\.12\.ffn_gate_exps=CUDA3" -ot "blk\.13\.ffn_up_exps=CUDA3, blk\.13\.ffn_gate_exps=CUDA3" --override-tensor exps=CPU,attn_kv_b=CPU --no-mmap --threads 24
>>105696473I had more repetition problems with glm4 compared to qwq.
>SWA
Turned this shit off by accident. Quality went up 100%.
Actually, now that I think about it, can't I just train a lora for my kinks?
There's only 2 problems that I know of, one: I'm computer illiterate; two: there's like one or two pieces of literature that are adjacent to my kinks, but never quite accurately capture it. And I can't really use the big hosted solutions to generate synthetic data since they're not trained for it...
>>105696513cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_CUDA_F16=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_MIN_BATCH_OFFLOAD=32
>>105696558>computer illiterateYou'll feel right at home there: >>>/g/aicg
>>105696219was that the compile flag for llama.cpp or ik? Can you share your llama-server command params for llama.cpp so I can compare speeds on my machine with ik?
>>105696564I fucking hate cloud shit though
Hello fellow retards. I am trying to get this shit running locally. So far I have text_generation_webui running on a local server that has an RTX 3080 Ti in it. I grabbed the "mythomax-l2-13b" model (thanks to ChatGPT fucking me up the ass when I was trying to figure this shit out on my own). It's trying to tell me about installing Chroma DB for persistent memory, but I don't fucking understand shit. Help this retarded nigger out, please. I don't even have a character thrown in yet. I am interested in using this shit for roleplay because I'm a worthless virgin who wants to talk to anime girls, and I was hoping to enable emotion recognition and mood shifts.
>>105696891nothing under there is going to tell him how to set up chroma with ooba
>>105696891>Hey, I have a question about enabling persistent memory past the context, so my character remembers the night before and previous messages. I also want to know about emotional recognition and mood shifts. Does someone know how to do this?>dude read the getting started I need this shit spoonfed because I'm a raging baboon and wasted the past 8 hours trying to get this working. I'm not trying to coom. I'm trying to have an AI girlfriend so I don't kill myself.
>>105696911you should still read the getting started because both your model and ui are trash and not good for an ai girlfriend
then you would be using silly and can use their memory solution and we wouldn't have to play 20 questions figuring out what part you didn't understand of a simple install process
I don't know why I spent so long fucking around with ollama as my backend to SillyTavern. I guess my friend telling me about being able to switch models on the fly really hooked me, but my god what was I doing not using Kobold? I haven't been able to get the kind of fever dream nightmare porn I've been missing this entire time just because Ollama just would operate like shit across every possible configuration I tried on ST. If I tried to use Universal-Super-Creative to get shit like this with Ollama I would get nothing but a text string of complete nonsense dogshit wordsalad instead of the vile flowing conciousness meatfucking I've been craving.
>>105696908If he is too stupid to google "how to set up chroma with ooba" and follow a plebbit tutorial, there is not much I can or want to do.
>captcha>4TTRP
>>105696958I tried a few, some quite fat but I always come back to "snowflake-arctic-embed-l-v2.0"
Qwen3-Embedding-4B is okay too, but it is significantly bigger than Arctic.
Untitled
>>105696964I think for your purposes, sillytavern would be easier to deal with.
>>105696944My bad for using the general dedicated to the discussion of local language models to discuss how to use local language models effectively. I'm struggling navigating the command line on fucking ubuntu server; switching to something else when I can barely wrap my head around wget is not what I was hoping to do. Regardless, looks like I can follow
>>105696983's image (is that fucking Gemini?) to either get myself more fucked up, or maybe get it working altogether. If you don't hear from me, imagine it worked.
>>105696958sfw shit? use gemmy3
>>105697000>My bad for using the general dedicated to the discussion of local language modelsdiscussion not tech support
>>105697000By the way, I'm pretty sure that gemini's response is wrong. In any case, I really wouldn't use text generation webui for what you're doing. You'll have an easier time running sillytavern on top, since it has easily installed extensions (some are even built in) for persistent memory, summarization, and dynamic images based on the current emotion of the character.
>>105697000I know it's overwhelming when you first try to get into it.
But anyway, AI can not be your girlfriend. It just doesn't work that way.
>>105697022...Alright. Thank you for helping me. I'll try and figure out moving the files over to sillytavern. I'll google it.
>>105697022Hey different guy here piling on tech support minute.
Is there a way to have SillyTavern highlight my messages/edits vs generated text? That was one of the features I remember from NovelAI (I think it was NovelAI) that I really liked, because it made it clear what in the text was my tard wrangling.
I've got some questions regarding DeepSeek-R1-0528. I'm a newb.
1. How censored is the model? (criteria: will it answer if I ask it to make a plan to genocide Indians?)
2. Is there any model on huggingface that's trained to be superior?
3. Is there something like a "VPS" so I can run it on my control? (I don't have a strong enough PC)
>>105697043I know I said to switch over to sillytavern for that other anon, but I don't really rp, so I'm not too experienced a st user.
>>105697059
1. It'll refuse.
2. There are grifts trained to be "superior".
3. Yeah, you can rent them.
>>105697074
1. Damn. AI Studio does answer it.
2. Why grifts? None you recommend? Specially for censorship.
3. Any recommendations?
>>105697111It'll refuse in the same way that most models will refuse any "problematic" request unless you explicitly tell it to answer.
You can't really fine-tune the censorship out of a model. It needs to be done during pre-training.
And I run everything locally, I don't have any experience with that. If you're fine with slow speeds, you can run a low quant of it using cpu.
>>105697160Thanks anon.
If anyone else has any renting deepseek experience, pls help.
>>105693514the best part is that it appears Reddit gave the sub to some turbojanny who didn't even make a thread on "redditrequest" like everyone else has to.
It just goes to show (again) that there is a handful of moderators who get all the subs, and they likely all know the admins.
They probably speak to the admins directly on discord and are given the subs that way.
This guy here, was the first to actually request it but he didn't get it, and there were several more requests after him:
https://old.reddit.com/r/redditrequest/comments/1lhsjz1/rlocalllama/
And here's the guy who ended up getting it without even requesting it officially lol:
https://old.reddit.com/user/HOLUPREDICTIONS/submitted/
Hopefully someone calls him out on it, I would but I was banned from reddit.
>>105697160Running shit on a VPS is like running them locally, just that the machine is far, far away from you.
>>105697179I don't think the new one will be dedicated enough to limit spam from grifters and karma farmers, it's already worse than it's ever been in recent times and I can already imagine how it will be in just a couple weeks.
every hour, someone on huggingface releases a merge. Usually, they're a merge of merged models - so the merge might consist of 3-4 models and each model in the merge is a merge of another 3-4 models.
From such a method, can you ever really get anything exceptional? I see so many and I'm just starting to think that it's worth dismissing them completely out of hand, despite the claims that you read on the model card. There might be slight improvements (usually there are not) but over the course of a long chat, it's barely noticeable
>>105697310>every hour, someone on huggingface releases a mergethat sounds like the opening to a tearjerker charity drive
>to help, call 1-800-HUG, and stop the senseless merging
>>105697310I'm confident that almost nobody is doing merges because they think they're useful to others.
>{{random::arg1::arg2}}
I made a quest table using random macro. There's about 20 or so different entries for now. This allows me to randomize the theme every time I do a new chat. It's all happening in the 'first message' slot.
Is there a way to create the string but hide it from the user? He doesn't even need to know what the quest is.
>>105696010I don't know what LMStudio specifically does by default, but all of the currently existing code for automatically setting the number of GPU layers is very bad.
It's using heuristics for how much VRAM will be used but those heuristics have to be very conservative in order not to OOM, so a lot of performance is left on the table.
Steve's commentary suggests that he is not aware of what the software does internally when there is insufficient VRAM.
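Which is also why, if you care about the last bit of performance, it's usually better to set the layer count yourself and creep it up until just before you OOM rather than trust the auto-estimate, e.g. (the number is illustrative):
./llama.cpp/build/bin/llama-cli -m <model>.gguf -ngl 35 -c 8192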
>>105696333I'm a long time viewer of GN and left a comment offering to help them with their benchmarking methodology.
>>105697393Whatever, to answer my own question: HTML comments do work.
><!-- comment -->
Block out a comment using these in the first comment and it's visible in the terminal but won't appear in ST.
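So a hidden randomized quest line in the first message would just be something like this (quest names are placeholders, and this assumes the macro still expands inside the comment, which is what the terminal output suggests):
<!-- Quest: {{random::rescue the merchant::find the lost amulet::escort the caravan}} -->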
ahhhhhhh. well i am gonna try the /ubergarm/DeepSeek-R1-0528-GGUF quants first. maybe it fucks something up with the unsloth ones.
Any good guide on how to make AI less generic and reddit-like when running fantasy or sci-fi quest?
>>105697460*first message not first comment
>>105697472? Why would I want to 'top' (I know this is a tombler word for sex) a 'sigma' (I assume nsigma is neo-sigma) male (I am not gay)?
can i check at runtime which layers get written to which cuda device?
what the helly?
cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_RPC=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
./ik_llama.cpp/build/bin/llama-cli \
--rpc "$RPC_SERVERS" \
--model models/ubergarm/DeepSeek-R1-0528-GGUF/IQ2_K_R4/DeepSeek-R1-0528-IQ2_K_R4-00001-of-00005.gguf \
--threads 48 \
--n-gpu-layers 99 \
--temp 0.6 \
--top_p 0.95 \
--min_p 0.01 \
--flash-attn \
--ctx-size 16384 \
--parallel 1 \
-mla 3 -fa \
-amb 512 \
-fmoe \
-ctk q8_0 \
-ot "\.(3[3-9]|4[0-9]|5[0-9]|6[0-9]|7[0-9]|8[0-9]|9[0-9]|[0-9][0-9][0-9])\.ffn_up_exps.=CPU" \
The EQBench author added Mistral Small 3.2 to the Creative Writing bench. https://eqbench.com/creative_writing.html
Contrary to my expectations, the "slop profile" of Mistral Small 3.2 is apparently the closest to DeepSeek-V3-0324.
https://fortune.com/2025/06/20/hugging-face-thomas-wolf-ai-yes-men-on-servers-no-scientific-breakthroughs/
>“In science, asking the question is the hard part, it’s not finding the answer,” Wolf said. “Once the question is asked, often the answer is quite obvious, but the tough part is really asking the question, and models are very bad at asking great questions.”
>Wolf said he initially found the piece inspiring but started to doubt Amodei’s idealistic vision of the future after the second read.
>“It was saying AI is going to solve cancer, and it’s going to solve mental health problems—it’s going to even bring peace into the world. But then I read it again and realized there’s something that sounds very wrong about it, and I don’t believe that,” he said.
>“Models are just trying to predict the most likely thing,” Wolf explained. “But in almost all big cases of discovery or art, it’s not really the most likely art piece you want to see, but it’s the most interesting one.”
Wow, huggingface bro is totally mogging the anthropic retards, based and brain pilled.
LLMs are useful tools but they are not actual intelligence and will never become intelligence.
>>105697641
>AI is going to solve cancer, and it's going to solve mental health problems—it's going to even bring peace into the world.
Whoever believed that is just naive.
>>105697627
>llm as judge for a benchmark in the creative writing field
this shit is so retarded bro just stop
LLMs are fuzzy finders: they can understand the most broken writing, and even when commenting negatively on it they can still act quite sycophantic.
ex prompt from chatgpt:
https://rentry.org/5f7xrz9y
anyone who takes benches like eqbench seriously are brainless turds, waste of space, waste of oxygen, waste of food and literal oven dodgers
WTF? HOW THE FUCK IS THAT SHIT SOO SLOW
>>105697583What was your "best effort"?
>>105697697with llama.cpp around 7.5 T/S
>>105697672lmao, said exactly like someone who hasn't spent a single fucking second reading the about page
>asks the llm known for glazing the shit out of its users
>is surprised when it glazes the shit out of its users
benchmarks have to be objective and perfect!!! ignore everything with minor flaws in methodology!!! reeee
>>105697706
7.5 t/s with gerganov's llama,
and this >>105697583 with ik_llama?
You finally broke it lol
>>105697712
https://arxiv.org/pdf/2506.11440
it's not just about the glazing
LLMs aren't even able to tell when something that should be there isn't there
they rarely manage to spot excessively repetitive writing etc when tasked with judging
you are a subhuman mongoloid who belongs to the oven if you believe in benchmarks
>>105697723yeah, i don't know what's wrong. i assume the RPC is fucked
>>105697672Isn't the fact that there's an apparently massive difference within the same benchmark from 3.1 to 3.2 at least interesting? Or that it seems to be borrowing slop from DeepSeek V3 rather than Gemini/Gemma like others mentioned earlier?
>>105697741It is, just ignore the elo to the moon bs and focus on what you can actually see yourself, so the slop profile info basically
>>105697724
>https://arxiv.org/pdf/2506.11440
ok interesting paper but wtf does this have to do with anything
slop/repetition is a metric on eqbench btw
and it goes down with better models and up with worse models... lol
even if it's a coincidence or they aren't controlling for it directly you can literally fucking filter it out yourself blindass
thoughts on minimax and polaris?
>>105697419It's strange that such a comment gets filtered, but GN can restore it and make it visible if they check their "spam folder" for YouTube comments.
>>105697740Did you try to achieve the max speed on just a single local GPU first? You're throwing too many variables into the equation at once.
>>105698103with
./ik_llama.cpp/build/bin/llama-cli \
--model models/ubergarm/DeepSeek-R1-0528-GGUF/IQ2_K_R4/DeepSeek-R1-0528-IQ2_K_R4-00001-of-00005.gguf \
--threads 48 \
--n-gpu-layers 40 \
--temp 0.6 \
--top_p 0.95 \
--min_p 0.01 \
--ctx-size 16384 \
--flash-attn \
-mla 3 -fa \
-amb 512 \
-fmoe \
-ctk q8_0 \
-ot "blk\.(1|2|3|4)\.ffn_.*=CUDA0" \
-ot "blk\.(5|6|7|8)\.ffn_.*=CUDA1" \
-ot "blk\.(9|10|11|12)\.ffn_.*=CUDA2" \
-ot "blk\.(13|14|15|16)\.ffn_.*=CUDA3" \
--override-tensor exps=CPU \
i get around 3.5T/s
>>105698194
>i get around 3.5T/s
I get 4 tkn/s on a single RTX 3090 with gerganov's llama-cli,
and it is still 3.5 tkn/s with 20k+ tokens of context
>>105693968I mean a REAL LOADED context
>>105696010
>my post worked
The benchmark is obviously botched but at least he tried.
>>105698422>>105698194Why -cli and not server? Just curious about your setup, this is not a critique.
>>105698422
./ik_llama.cpp/build/bin/llama-cli \
--rpc "$RPC_SERVERS" \
--model models/ubergarm/DeepSeek-R1-0528-GGUF/IQ2_K_R4/DeepSeek-R1-0528-IQ2_K_R4-00001-of-00005.gguf \
--threads 48 \
--n-gpu-layers 99 \
--temp 0.6 \
--top_p 0.95 \
--min_p 0.01 \
--ctx-size 16384 \
--flash-attn \
-mla 3 -fa \
-amb 512 \
-fmoe \
-ctk q8_0 \
-ot "blk\.(1|2|3|4|5|6)\.ffn_.*=CUDA0" \
-ot "blk\.(7|8|9|10)\.ffn_.*=CUDA1" \
-ot "blk\.(11|12|13|14)\.ffn_.*=CUDA2" \
-ot "blk\.(15|16|17|18)\.ffn_.*=CUDA3" \
-ot "blk\.(19|20|21|22)\.ffn_.*=RPC[10.0.0.28:50052]" \
-ot "blk\.(23|24|25|26)\.ffn_.*=RPC[10.0.0.28:50053]" \
-ot "blk\.(27|28|29|30)\.ffn_.*=RPC[10.0.0.28:50054]" \
-ot "blk\.(31|32|33|34)\.ffn_.*=RPC[10.0.0.28:50055]" \
-ot "blk\.(35|36|37|38)\.ffn_.*=RPC[10.0.0.40:50052]" \
-ot "blk\.(39|40|41|42)\.ffn_.*=RPC[10.0.0.40:50053]" \
-ot "blk\.(43|44|45|46)\.ffn_.*=RPC[10.0.0.40:50054]" \
-ot "blk\.(47|48|49|50)\.ffn_.*=RPC[10.0.0.40:50055]" \
--override-tensor exps=CPU \
--prompt
i am getting 5.5 T/s, 2 T/s worse than llama.cpp. also the ubergarm/DeepSeek-R1-0528-GGUF/IQ2_K_R4 quants are yapping holy hell. couldn't even 0-shot a working flappy bird clone - meanwhile /unsloth/DeepSeek-R1-0528-GGUF/UD-Q2_K_XL has no problems.
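For anyone trying to reproduce the RPC half: each RPC[host:port] target in those -ot lines needs an rpc-server instance listening on the remote box, e.g. (flags as in the upstream llama.cpp RPC example; ik_llama.cpp's build may differ slightly):
./ik_llama.cpp/build/bin/rpc-server -H 0.0.0.0 -p 50052
and $RPC_SERVERS is just the comma-separated host:port list handed to --rpc.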
>>105698577i am using the CLI with a prompt to get some testing done because i don't have a client ready yet.
>>105698591I see.
I'm hoping to get to programming my own client with the help of chatgpt.
So far SillyTavern has been great, but I want to test persistent locations and something that tracks the user's location behind the scenes, with the LLM used to generate and flesh out location descriptions.
I'm sure this has been done a million times by now but it's new to me.
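A minimal sketch of the plumbing side, in case it helps (everything here is a placeholder: 8080 is just llama-server's default port and the location string would come from whatever your client tracks): keep the location in the client and splice it into the system prompt on every request to llama-server's OpenAI-style endpoint:
LOCATION="the abandoned lighthouse"
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [
    {"role": "system", "content": "Current location: '"$LOCATION"'. Weave brief sensory details of this place into your replies."},
    {"role": "user", "content": "I look around."}
  ]}'
The model never has to be told how the location got there, which is the "behind the scenes" part.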
bros please help my ESL ass, what does dipsy mean by this?
>>105698577
>Why -cli and not server?
For some strange reason, in my setup, the server is 40% slower than -cli. The server starts at 4 tkn/s but quickly falls to 2 tkn/s. I tried different commits, same behavior.
>>105698645Huh. Have you double checked your system's power settings? I'm just guessing here but there has to be some reason.
>>105698645To add to that: compare their terminal output.
The server could be using less of your VRAM by default unless you directly specify it with --gpu-layers, for example (so check task manager for GPU VRAM usage too).
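On Linux the equivalent of eyeballing task manager would be something like this (assuming NVIDIA cards, which the CUDA flags imply):
watch -n 1 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
Run it during both the -cli and -server runs and see whether the allocations actually differ.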
NOOO AI! Not that realistic! ;_;
>>105697419>>105697910If you actually care about getting involved you'd be way better off sending an email than relying on youtube to work properly
>>105698751Wait why would she not have a hymen if she is undead?
what's the best webdev pycuck model to run on 48gb? i've tried devstral at q8_0 and it's shite
>>105698798Don't people still say Qwen is great for dev-cucking some short snippets.
Is there some tool that abstracts away all the fucking around with prompt formatting with llama-server?
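Partial answer that needs no extra tool: if you hit llama-server's OpenAI-style chat endpoint instead of the raw /completion one (same endpoint as the curl sketch a few posts up), the server applies the chat template stored in the GGUF metadata itself, so you only ever send role/content pairs. A minimal sketch (model path is a placeholder):
./build/bin/llama-server -m ./model.gguf --port 8080
# POST plain {"messages": [...]} to http://localhost:8080/v1/chat/completions;
# pass --chat-template chatml (or another built-in name) if the metadata template is missing or wrong.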
how difficult is training a lora and do you need really high vram requirements?
What about a lora vs a fine tune?
I want to try add some question and answer style text blocks to mistral large quants to both increase knowledge and reinforce the answering style, I have 48gb vram
>>105698779I was not able to find an email for general contact on the GN website and they said in the video to leave a comment, so that is how I'll try contacting them first.
They were responding to comments in the first hour or so after the video was published; if I was simply too late, I'll find another way.
>>105698699>>105698742I'm on Linux, and the only thing I changed was the server instead of the CLI.
I noticed that the CPU cores were running at approx 80% with the server (I isolated 8 cores for the purpose), while they were at 100% with the CLI.
GPU load is the same in both cases.
I see no reason why there should be a difference
>>105693934I think you don't understand what LLMs are. They don't "decide" and they don't perform "analysis."
llama.cpp is a guaranteed blue screen for me. Even for very small models that take up a fraction of my VRAM, it BSODs when I unload the model. Am I missing out on anything if I use ollama?
>>105699343why not kobold?
t. henk
>>105699359It's based on llama.cpp so it should have a similar issue right?
>>105699406You never know, ollama uses a lot of llama.cpp code too
>>105699343If it doesn't find a connected uranium centrifuge to blow up, the system crashes.
It's a known issue, will be fixed soon.