Thread 106559371

386 posts 110 images /g/
Anonymous No.106559371 [Report] >>106559401 >>106561077 >>106562330 >>106564359
/lmg/ - Local Models General
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106551921 & >>106539477

►News
>(09/11) Qwen3-Next-80B-A3B released: https://hf.co/collections/Qwen/qwen3-next-68c25fd6838e585db8eeea9d
>(09/11) ERNIE-4.5-21B-A3B-Thinking released: https://hf.co/baidu/ERNIE-4.5-21B-A3B-Thinking
>(09/09) K2 Think (no relation) 32B released: https://hf.co/LLM360/K2-Think
>(09/08) OneCAT-3B, unified multimodal decoder-only model released: https://onecat-ai.github.io
>(09/08) IndexTTS2 released: https://hf.co/IndexTeam/IndexTTS-2

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
Anonymous No.106559374 [Report] >>106559395
►Recent Highlights from the Previous Thread: >>106551921

--Optimizing code generation workflows on V100 GPUs with MoE models:
>106555312 >106555465 >106555506 >106555522 >106555524 >106555586 >106555717 >106555770 >106555782 >106555852
--Best local text gen models and VRAM optimization discussion:
>106556580 >106556863 >106556934 >106557638 >106557036 >106557046 >106557069 >106557098 >106557239 >106557514 >106557190
--AI surpasses mathematicians in complex analysis challenge:
>106558352 >106558367 >106558387 >106558476 >106558500 >106558527 >106558711
--Baidu's ERNIE-4.5-21B-A3B-Thinking model release and performance evaluation:
>106554153 >106554580 >106555008 >106555170 >106555207
--Silero VAD v6 evaluation and comparison with Nvidia's MarbleNet:
>106557953 >106558064
--LocalAI vs OpenWebUI: backend model management vs frontend interface:
>106555093 >106555341 >106555529 >106558434
--Running 30B-A3B models on 12GB VRAM via expert offloading and quantization:
>106558134 >106558186 >106558210 >106558227 >106558238 >106558251 >106558293 >106558317 >106558341
--GPU layer differences in small vs large models due to parameter grouping and optimization:
>106553923 >106554094 >106554256 >106554362 >106554384 >106554458 >106556050 >106556200
--LongCat's strengths and MoE limitations in llama.cpp compatibility:
>106552000 >106552095 >106552267 >106554325 >106554412
--Achieving deterministic LLM inference through caching logic adjustments:
>106555106 >106555150 >106555169
--llama.cpp development updates and flash attention implementation considerations:
>106553388 >106553417 >106553890 >106555026 >106555040 >106555059 >106555061 >106555068
--Qwen3 Next release:
>106557806 >106557845 >106557853 >106557858 >106557903
--Miku (free space):
>106555337 >106554679 >106555530 >106555574 >106557190 >106558219 >106559139 >106559166 >106559181

►Recent Highlight Posts from the Previous Thread: >>106551925

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
Anonymous No.106559395 [Report] >>106559403
>>106559374
I made the highlight reel again back to back boys
Anonymous No.106559401 [Report] >>106561077 >>106562612 >>106564359
>>106559371 (OP)
Anonymous No.106559403 [Report]
>>106559395
https://www.youtube.com/watch?v=VcWAQ5a1NdI
Anonymous No.106559420 [Report] >>106559506 >>106562370
>>106559404
You could still help them out. I'm pretty sure vllm supports it now.
Anonymous No.106559499 [Report]
My uni gives me access to Copilot Chat (GPT 5) and this thing is dumb as fuck, even with search. I think you people have lied to me about the big models being hugely smarter (narratively) than some 32b model.
llama.cpp CUDA dev !!yhbFjk57TDr No.106559506 [Report] >>106559756
>>106559420
Yes, I could be doing a lot of things but everything has an opportunity cost.
Anonymous No.106559516 [Report] >>106559551 >>106559555 >>106559704
>llama.cpp still hasn't added support for qwen-next
https://github.com/ggml-org/llama.cpp/issues/15940
>vllm already merged in support last night
https://github.com/vllm-project/vllm/pull/24526
llama devs are hacks
Anonymous No.106559528 [Report]
Qwen3 Next geejuff status?
Anonymous No.106559551 [Report]
>>106559516
vllm devs only needed to bump their pytorch version or something.
Anonymous No.106559555 [Report] >>106559598 >>106559704
>>106559516
Maybe rewriting the entire ML stack in C++ wasn't such a good idea.
Anonymous No.106559598 [Report] >>106559627
>>106559555
should've used pure C, they probably don't need any of the OOP features anyway.
Anonymous No.106559627 [Report] >>106559680
>>106559598
PyTorch is written in C++ contrary to its name. Nobody is using C for good reasons.
Anonymous No.106559680 [Report] >>106559701
>>106559627
The performance critical parts are, but it's not like you can use PyTorch directly from C++.
Anonymous No.106559696 [Report] >>106559714 >>106559733
qwen 3 80b consensus?
Anonymous No.106559701 [Report]
>>106559680
of course not. pytorch is literally a wrapper for libtorch which is in C++. you would use libtorch if you wanted to use C++. there's a lot more support around pytorch tho as it's far more accessible to people.
Anonymous No.106559704 [Report]
>>106559516
>>106559555
maybe 1 has a gorillion dollars since it's used by llm companies and one is a hobby project for consumers
Anonymous No.106559714 [Report] >>106559717
>>106559696
It's shit because there are no goofs
Anonymous No.106559717 [Report]
>>106559714
>he doesn't know how to run safetensors
Anonymous No.106559733 [Report] >>106559753
>>106559696
Anonymous No.106559753 [Report]
>>106559733
literally me
Anonymous No.106559756 [Report]
>>106559506
*CLAP EMOJI* CUDA *CLAP EMOJI* DEV *CLAP EMOJI* WE *CLAP EMOJI* ARE *CLAP EMOJI* ASKING
Anonymous No.106559780 [Report] >>106559792 >>106559800 >>106559803 >>106559823 >>106559926 >>106559975
I never understood how some of you have the hardware and talent to render AI videos and images that are realistic and good and yet you don't make full length porn videos
Anonymous No.106559792 [Report] >>106559824
>>106559780
Video models break down quickly past 5 seconds.
Anonymous No.106559800 [Report]
>>106559780
Porn sucks, text is better. The mind is the most powerful sex organ. Unironically. t. man
Anonymous No.106559803 [Report] >>106560128 >>106563909
>>106559780
>knowing how to read instructions= talent
Anonymous No.106559823 [Report]
>>106559780
all the slop I posted this thread took around 8s~ to gen (praise be nunchaku devs)
Anonymous No.106559824 [Report] >>106559863 >>106559975
>>106559792
IIRC standard for brainrot tiktok videos is to have a cut every 3 seconds.
Anonymous No.106559863 [Report] >>106559884
>>106559824
Now try getting the model to maintain consistency across hundreds of 3 second clips.
Anonymous No.106559871 [Report] >>106559925 >>106559938 >>106559943
What options are people running to get speedups on MoE models? There was a way to offload only certain tensors to RAM in order to get a significant speedup. Is it ik_llama.cpp only?
Anonymous No.106559884 [Report]
>>106559863
Who said consistency was a requirement?
Anonymous No.106559925 [Report] >>106559979
>>106559871
"overridetensors": "([2-8]+).ffn_.*_exps.=CPU"
That's what I use on kobold to run 30B A3B Q4_K_M on 8 GB VRAM / 24 GB RAM, the parameter is probably the same on llama.cpp (no fork needed)
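On plain llama.cpp it should be the -ot / --override-tensor flag with the same regex, something like this (untested sketch, the model filename is just a placeholder, adjust the layer range to your VRAM):
llama-server -m Qwen3-30B-A3B-Q4_K_M.gguf --gpu-layers 99 -ot "([2-8]+).ffn_.*_exps.=CPU"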
Anonymous No.106559926 [Report] >>106559958
>>106559780
about that.....
This one is really good
not full creation, but it's one of the better tools released yet
https://ebsynth.com/
Anonymous No.106559938 [Report]
>>106559871
--cpu-moe is all you need
--n-cpu-moe 999 if you want to be fancy
Anonymous No.106559943 [Report] >>106559962
>>106559871
>>106558251
>--n-cpu-moe 37 --gpu-layers 99
Normal llama.cpp.
Obviously, adjust --n-cpu-moe as needed.
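A full invocation for reference (sketch; model path is a placeholder, raise or lower --n-cpu-moe until your VRAM is full but not overflowing):
llama-server -m model.gguf --gpu-layers 99 --n-cpu-moe 37 --ctx-size 16384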
Anonymous No.106559958 [Report] >>106560471
>>106559926
huh, I thought picrelated guy was tracing frames by hand
Anonymous No.106559962 [Report] >>106559979 >>106559984
>>106559943
What was that -ot thing I saw some anons use? It had a bunch of numbers after it.
Anonymous No.106559975 [Report] >>106559998
>>106559824
>>106559780
the issue is it will suck so why bother making it. The ass won't jiggle right, the blowjob won't have audio that's good, any gimmick you add to take advantage of ai will break the lora. And real porn will just look better. Probably better off deepfaking porn already made with enhancements

I have had success using vibevoice to clone a pornstar's voice and then have her talk for several minutes using infinite talker. An LLM wrote the script so I wouldn't know what it would say and I got my own personal vid from her, and it was uh... kinda good.
Anonymous No.106559979 [Report] >>106560000
>>106559962
With -ot you can target the specific tensors inside the model's layers using regex. --n-cpu-moe simply abstracts all that away, much like -ngl does for whole layers.
One thing to keep in mind when using -ot like in >>106559925 is to not move the shared experts (if they exist) out of VRAM, since those are always used.
Anonymous No.106559984 [Report] >>106560000
>>106559962
-ot was the only way to do the same thing before the --cpu-moe arguments were introduced.
Anonymous No.106559998 [Report] >>106560257 >>106560428
>>106559975
nah bro, you just have to search for it
https://litter.catbox.moe/110x2tu7sbg6hixe.gif
Anonymous No.106560000 [Report] >>106560056
>>106559979
>>106559984
Ah cool, thanks for the explanation. So --n-cpu-moe moves only the non-shared experts to the CPU? And --cpu-moe keeps *all* non-shared experts on the CPU?
Anonymous No.106560007 [Report]
fuk me sideways, i wanted to try to use qwen3-next with vLLM and it seems it doesn't work with pipeline parallelism
Anonymous No.106560056 [Report]
>>106560000
I'm only aware of --n-cpu-moe.
Maybe --cpu-moe is the same thing for koboldcpp, I don't know.
As far as I know, --n-cpu-moe also keeps the normal (routed) experts on the CPU/RAM.
You can run llama-server with the -h option to get more details.
Anonymous No.106560060 [Report] >>106560089
llamabros...
Anonymous No.106560089 [Report] >>106561153
>>106560060
Note that "primary hardware" is always GPUs. That's because to anyone serious, "cpumaxxing" is as sad and absurd as "ssdmaxxing" is to us.
Anonymous No.106560095 [Report]
https://allenai.org/blog/olmo2-32b
How did they manage to do it in just 32B?
Anonymous No.106560128 [Report]
>>106559803
it really is in this day and age
western kids have been dragged down to the level of their 80IQ peers for two generations now
Anonymous No.106560211 [Report] >>106560245 >>106560248 >>106560274 >>106563929
>Qwen3-Next is trained on a uniformly sampled subset (15T tokens) of Qwen3’s 36T-token pretraining corpus. It uses less than 80% of the GPU hours needed by Qwen3-30A-3B, and only 9.3% of the compute cost of Qwen3-32B — while achieving better performance. This shows outstanding training efficiency and value.
And it beats Qwen 3 32B + handles long context better than the 235B moe
pretty impressive stuff
Anonymous No.106560245 [Report]
>>106560211
That's great. Would be greater if they expedited a Qwen3 Next Coder.
Anonymous No.106560248 [Report] >>106560269
>>106560211
It's native 256K context I think, without extending.
Anonymous No.106560257 [Report]
>>106559998
Illya?
Anonymous No.106560269 [Report]
>>106560248
yeah but the RULER benchmark is better on the Q-Next than the Q3 235B
Anonymous No.106560271 [Report]
>>106557716
i don't work in an office
Anonymous No.106560274 [Report] >>106560283 >>106560291 >>106560302 >>106560314
>>106560211
Isn't Qwen3-Next 70B? Why are they comparing to Qwen3 32B and not other 70B models?
Anonymous No.106560283 [Report] >>106560294 >>106560310
>>106560274
It's 80B A3B.
Anonymous No.106560291 [Report] >>106560294
>>106560274
Supersparse MoE, 80B A3B
Anonymous No.106560294 [Report] >>106560302 >>106560322
>>106560291
>>106560283
Okay, so how does it compare to models that are around 80B?
Anonymous No.106560302 [Report]
>>106560274
they compare it to every other Qwen3.
>>106560294
they did not bother comparing it to non-Qwen3 models.
Anonymous No.106560310 [Report]
>>106560283
sqrt(80*3) means it's a copetitor to 16b models
Anonymous No.106560314 [Report] >>106560338 >>106560356
>>106560274
Even Gemini is praising it.
Anonymous No.106560320 [Report] >>106560327
Qwen3-Max is such a disappointment that I have absolutely zero hope for 3.5
Alibaba truly is the meta of China
Anonymous No.106560322 [Report]
>>106560294
It's faster :)
Anonymous No.106560327 [Report] >>106560350
>>106560320
Kinda funny how Max got completely overshadowed by Qwen3-Next.
Anonymous No.106560331 [Report]
3bit? is that not bitnet?
Anonymous No.106560338 [Report]
>>106560314
Gemini will praise anything
Anonymous No.106560346 [Report] >>106560354 >>106560433 >>106560436 >>106560596 >>106562686
LLMs seem like a competition between America, Europe, and China. Why can't Russia, Japan, or Korea compete despite being tech giants?
Anonymous No.106560350 [Report]
>>106560327
Max got overshadowed by the fact that it's completely pointless so everyone forgot about it two hours after it became available.
Anonymous No.106560354 [Report] >>106560361
>>106560346
>Europe
They're competing? It looks like only one European state is just barely trying.
Anonymous No.106560356 [Report] >>106560463
>>106560314
>an 80B model requires ~160GB of VRAM. A 3-bit version could potentially run in under 40GB of VRAM, making it feasible to run on a single high-end GPU like an NVIDA RTX 4090
This is Gemini? The peak of LLMs right now? With web access?
Anonymous No.106560361 [Report] >>106560404
>>106560354
Mistral was great
Anonymous No.106560404 [Report] >>106560414
>>106560361
>was
Yeah
Anonymous No.106560409 [Report] >>106560417
The proper name is Qwen3-MoE-A3B thank you.
Anonymous No.106560414 [Report] >>106560481
>>106560404
It still is.
Anonymous No.106560417 [Report] >>106560440
>>106560409
The sqrt(total * active) formula has been officially confirmed
Anonymous No.106560428 [Report]
>>106559998
ghostbusters ectoplasm ghostly appearing sperm
Anonymous No.106560433 [Report]
>>106560346
>Europe
Lol lmao even
Anonymous No.106560436 [Report] >>106560596
>>106560346
>Russia
For the last 35 years the #1 rule of doing business in Russia was "don't do business in Russia". CS stuff was the easiest to move abroad.
It's not like there's nothing at all, IIRC Yandex was pretty competitive in the self-driving scene for a moment, and every street dog sells its own proprietary voice assistant now, but for local I only found https://huggingface.co/yandex/YandexGPT-5-Lite-8B-instruct so far. (It's whatever)
>Japan
Failed at programming way back when we started using real operating systems and high level languages, and did not recover to this day. I blame the language barrier.
>Korea
Probably too busy printing money with all their gachas instead.
Anonymous No.106560440 [Report]
>>106560417
Anonymous No.106560442 [Report] >>106560454 >>106560460
>Why aren't you using vllm bro?
>What do you mean you don't have H100 cluster? It can still work with A100 cluster bro.
>Wait, you got just an RTX 3090? Uhm, I've never heard about such GPUs, must be Chinese knockoff or something. Get legit hardware bro.
>You got no money? Just ask for grants bro! Or get investors. You are part of the network, right?
Anonymous No.106560454 [Report]
>>106560442
>You got no money?
Have you tried getting a job recently? scamming into a grant or investor is unironically easier at this point.
Anonymous No.106560459 [Report]
Are you enjoying the next best thing? (Qwen Next) (Subtle request for feedback)
Anonymous No.106560460 [Report] >>106560474 >>106560489
>>106560442
vllm can run on an intel arc gpu. you've got no excuse bro. Also it can do cpu as well and MoEs and even got gguf support not long ago
Anonymous No.106560463 [Report]
>>106560356
Anonymous No.106560471 [Report]
>>106559958
I wonder what this dude thinks about AI. There's not a lot of difference between what he does and what video models can do.
Anonymous No.106560474 [Report] >>106560483
>>106560460
Where can I buy B60 for MSRP(1200BURGERCOINS)?
Anonymous No.106560481 [Report] >>106560588 >>106562483
>>106560414
Mistral's 2025 output has been one okayish 24B model and nothing else of note
Meanwhile their business model is increasingly Cohere-ifying and there's good reason to believe they fucked up training Large 3
Maybe the cash injection from ASML will help some but acting like they're still internationally competitive is a joke
Anonymous No.106560483 [Report] >>106560550
>>106560474
B60 will be announced next month at SEMICON West. $500 for 24GB model
Anonymous No.106560489 [Report] >>106560523
>>106560460
How does pure CPU performance of vllm compare to ik_llama.cpp?
Anonymous No.106560523 [Report] >>106560551
>>106560489
I would suggest just standing up the vllm cpu docker image and running benchmarks yourself. You probably won't find much public info for cpu benchmarks between the two.
Anonymous No.106560550 [Report] >>106560600
>>106560483
>AMD CPU
>Intel GPU
>NVIDIA RAM
If only I had infinite money...
Anonymous No.106560551 [Report] >>106560561 >>106560587 >>106560606 >>106561458
>>106560523
Where exe? I DONT GIVE A FUCK ABOUT THE FUCKING DOCKER! i just want to download this stupid fucking application and use it
WHY IS THERE DOCKER??? MAKE A FUCKING .EXE FILE AND GIVE IT TO ME. these dumbfucks think that everyone is a developer and understands code. well i am not and i don't understand it. I only know to download and install applications. SO WHY THE FUCK IS THERE DOCKER? make an EXE file and give it to me. STUPID FUCKING SMELLY NERDS
Anonymous No.106560561 [Report]
>>106560551
Anonymous No.106560566 [Report] >>106560572
>2025
>vibevoice is fully forgotten
Anonymous No.106560572 [Report]
>>106560566
Useless without training scripts.
Anonymous No.106560587 [Report]
>>106560551
Anonymous No.106560588 [Report]
>>106560481
To Mistral's credit, that single model they made is actually the best model for running on a normal PC. Gemma is heavily censored, Qwen's similar sized models are worse at non benchmaxx tasks and everything else is too big unless you're building your PC for running LLMs
Anonymous No.106560596 [Report]
>>106560346
>>106560436
>Russia
Case in point: https://en.wikipedia.org/wiki/ABBYY_FineReader
I was informed this used to be SOTA for OCR.
>ABBY ... was founded in the USSR and operated in Russia for nine years before moving to the United States.
Anonymous No.106560600 [Report]
>>106560550
>NVIDIA CPU
>AMD GPU
>INTEL RAM
WE ARE MAKING A MEME SYSTEM. OPTANE WILL NEVER DIE.
Anonymous No.106560604 [Report] >>106560622
nvidia not offering a 24gb 50xx card was criminal and i'm tired of pretending otherwise.
Anonymous No.106560606 [Report] >>106560614 >>106561807
>>106560551
This argument has never been refuted
Anonymous No.106560614 [Report] >>106560630
>>106560606
nobody wants to deal with women. if exe is a filter then so be it.
Anonymous No.106560619 [Report] >>106560634 >>106560640
mistral for erp
qwen3 for anything else but erp
Anonymous No.106560622 [Report] >>106560627
>>106560604
Fuck 24GB. The 5090 should have just been cheaper, it's not remotely close to being a proper workstation card and 32GB is too little for anything outside of hobbyist stuff.
Anonymous No.106560627 [Report]
>>106560622
It's a gayming gpu. Buy from their workstation lineup if you want professional stuff.
Anonymous No.106560630 [Report] >>106560635
>>106560614
You aren't a woman, though
Anonymous No.106560633 [Report]
>>106559044
SSDmaxxbros, maybe our time is finally cuming soon...
Anonymous No.106560634 [Report] >>106560665
>>106560619
But what about sfw rp, is that included in that? Is Qwen 3 smarter than Gemma 3?
Anonymous No.106560635 [Report]
>>106560630
no?? really??? I think you're lost bro, this isn't >>>/lgbt/
Anonymous No.106560640 [Report]
>>106560619
>anything else but erp
there is nothing else
Anonymous No.106560645 [Report] >>106560723 >>106560751 >>106560809
Am I about to get scammed? I've never seen these under $1000. From Hong Kong.
Anonymous No.106560665 [Report]
>>106560634
>gemma3
after all the safety humiliation I got I will never use it again
Anonymous No.106560687 [Report] >>106561056
I refuse to support any model whose selling point is high context limits. Every llm I've used, from free to paid, is absolute garbage and hallucinates at high context.
Anonymous No.106560693 [Report] >>106560965
>forcing full prompt re-processing due to lack of cache data (likely due to SWA
humiliation ritual
Anonymous No.106560713 [Report]
My CLINE prompts are all timing out when I'm trying to use gemma3:12b on a 4070. Do I need a quantized model instead?
not bqgfla-1 No.106560723 [Report]
>>106560645
No, you're in for a great deal! Buy it quick, there's only one left!
Anonymous No.106560751 [Report]
>>106560645
>seller with 0 reviews
Yeah, trust him!
Anonymous No.106560809 [Report] >>106560814 >>106560970
>>106560645
bro no don't do that
buy this one: https://www.ebay.com/itm/325407276138

much better trust me
Anonymous No.106560814 [Report] >>106560967 >>106561079
>>106560809
>Graphcore IPU
what
Anonymous No.106560965 [Report] >>106562680
>>106560693
How do you prevent this?
Anonymous No.106560967 [Report] >>106561079
>>106560814
>intelligent processing unit
lmao
Anonymous No.106560970 [Report]
>>106560809
ok ersinc03
Anonymous No.106561056 [Report]
>>106560687
You can't trust the actual numbers for context that companies put out, they're always wrong. But it's usually safe to assume that a higher advertised number does mean a higher 'effective' context ceiling.
Anonymous No.106561071 [Report]
Can't wait for adobe research to publish an updated study on how all these models go to shit past 32k
Anonymous No.106561077 [Report] >>106561915
>>106559371 (OP)
>>106559401
>no tits
>shitty reddit memes
You are gay.
Anonymous No.106561079 [Report] >>106561127
>>106560967
>>106560814
IPU/NPUs are a real thing, they're in all the new CPUs from AMD for instance. just not from meme companies like that one.
Anonymous No.106561127 [Report] >>106561164 >>106561166
>>106561079
>central processing unit
makes sense
>graphics processing unit
yup
>neural processing unit
works with neural networks, gotcha
>intelligent processing unit
the fuck is this supposed to be? it sounds like some marketing term
Anonymous No.106561145 [Report] >>106561161
ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms
https://arxiv.org/abs/2509.09679
>Rotation-based methods such as QuIP and QuaRot apply orthogonal transforms to eliminate outliers before quantization, using computational invariance. However, these methods use fixed transforms--Hadamard matrices achieving optimal worst-case coherence μ = 1/√n--that cannot adapt to specific weight distributions. We identify that different transformer layers exhibit distinct outlier patterns, motivating layer-adaptive rotations rather than one-size-fits-all approaches. We propose ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles. Unlike Hadamard's discrete {+1, -1} entries that are non-differentiable and prohibit gradient-based learning, butterfly transforms' continuous parameterization enables smooth optimization while guaranteeing orthogonality by construction. This orthogonal constraint ensures theoretical guarantees in outlier suppression while achieving O(n log n) computational complexity with only (n log n)/2 learnable parameters. We further introduce a uniformity regularization on post-transformation activations to promote smoother distributions amenable to quantization. Learning requires only 128 calibration samples and converges in minutes on a single GPU--a negligible one-time cost. On LLaMA-2-7B with 2-bit quantization, ButterflyQuant achieves 15.4 perplexity versus 22.1 for QuaRot.
Links below:
https://github.com/42Shawn
https://github.com/oumi-ai/oumi
Code might be posted on one of those. Might be cool, but then again very few results are included.
previous paper that looked at butterfly transforms
https://arxiv.org/abs/2302.06646
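For intuition, the butterfly part is tiny: log2(n) stages of disjoint 2x2 Givens rotations, (n log n)/2 angles total. A toy numpy sketch (not the paper's code; in the paper those angles are the learnable parameters):

import numpy as np

def butterfly_apply(x, angles):
    # x: length-n vector, n a power of two
    # angles: list of log2(n) arrays of shape (n//2,) -> (n log n)/2 parameters total
    n = x.shape[0]
    y = x.astype(float)
    for s, theta in enumerate(angles):
        block = 1 << s                      # distance between paired indices at this stage
        idx_a, idx_b = [], []
        for start in range(0, n, 2 * block):
            for off in range(block):
                idx_a.append(start + off)
                idx_b.append(start + off + block)
        a, b = y[idx_a], y[idx_b]
        c, si = np.cos(theta), np.sin(theta)
        y[idx_a] = c * a - si * b           # each pair gets a 2x2 Givens rotation,
        y[idx_b] = si * a + c * b           # so the whole map is orthogonal by construction
    return y

n = 8
rng = np.random.default_rng(0)
angles = [rng.uniform(-np.pi, np.pi, n // 2) for _ in range(int(np.log2(n)))]
x = rng.standard_normal(n)
print(np.linalg.norm(x), np.linalg.norm(butterfly_apply(x, angles)))  # norms match, i.e. orthogonal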
Anonymous No.106561153 [Report]
>>106560089
Anyone serious is deploying for enterprise, not personal use. Normal people don't use local models for personal use, just like normal people don't use 4chan and only use fb/linkedin.
Anonymous No.106561161 [Report] >>106561168
>>106561145
That is very nice, but how does it compare to GGUF?
Anonymous No.106561164 [Report]
>>106561127
>it sounds like some marketing term
it basically is. NPU = IPU
in the industry it's looking like NPU has won out, but AMD, at least early in the development of NPUs around 2023, referred to them as IPUs as well
Anonymous No.106561166 [Report] >>106561184 >>106561602
>>106561127
Graphics Processing Unit is a horrible term nowadays.
NVIDIA calls the H100 a GPU even though it doesn’t even have a display output and isn’t aimed at graphics processing.
Anonymous No.106561168 [Report] >>106561177
>>106561161
Probably just as shit as Q2 ggufs are.
Anonymous No.106561177 [Report] >>106561204
>>106561168
Nah, ggufs are probably better since they don't mention them.
Anonymous No.106561184 [Report] >>106561205 >>106561207 >>106561336 >>106561367
>>106561166
GPU stands for "General Processing Unit" in nvidia's own terms
Anonymous No.106561186 [Report]
2-bit is all you need, you don't need more
Anonymous No.106561204 [Report] >>106561309
>>106561177
My point stands. Q2 is shit. This is literally a competition of who has the nicer looking pile of shit. If you're seriously using a Q2 model you need to reevaluate your life. Also the paper likely doesn't mention GGUFs at all because it's talking about W2A16, which Q2 GGUF can't even map to in practice.
Anonymous No.106561205 [Report] >>106565782
>>106561184
https://www.nvidia.com/en-us/about-nvidia/corporate-timeline/
Anonymous No.106561207 [Report] >>106561367
>>106561184
that's a backronym they made up so they can keep using the term everyone would have used anyways
Anonymous No.106561302 [Report]
That reminds me, I have vllm installed. Might as well try a quick speed comparison. Tomorrow maybe.
Anonymous No.106561309 [Report]
>>106561204
case in point
Anonymous No.106561336 [Report] >>106561367
>>106561184
Should call them NVIDIA Processing Units to shit into everyone's salad.
Anonymous No.106561341 [Report] >>106561347 >>106561358
RP testing qwen3-next-thinking and it has a completely different reasoning style from 2507, and not in a particularly good way
several times more verbose and EXTREMELY wasteful of tokens - trying out different lines of dialogue over and over again, outputting them in full with minor variations, outputting full drafts of the response, or in one case "let me check the previous messages [proceeds to output EVERY previous turn of the roleplay IN FULL]"... wtf. I get the sense that this is something of a proof of concept model for them (and to their credit, in my limited testing the models do seem smart and pretty good at long context) but they've gotta fix this for 3.5 or whatever their next release is.
Anonymous No.106561347 [Report] >>106561391
>>106561341
post cockbench
Anonymous No.106561358 [Report] >>106561391
>>106561341
Have you tried prefilling the thinking with some guidance on how to think about the RP?
Anonymous No.106561367 [Report] >>106561514
>>106561184
>>106561207
>>106561336
You butt hurt boys are SO silly! :3
Anonymous No.106561380 [Report] >>106563978
Why is it so hard to get models to undress the finnish catgirl pm?
Anonymous No.106561391 [Report]
>>106561347
I APIfagged, sorry anon. I'd expect it to be in line with the 2507 qwens though.
>>106561358
not yet, I'm putting off messing with it more until there are ggufs
Anonymous No.106561458 [Report]
>>106560551
> anon comes to a thread where everyone has a fucking doctorate in AI
> sees the word docker
> loses his shit as he's dumb as fuck
> after crying gets his mcdonalds uniform ready for work tomorrow
Anonymous No.106561459 [Report] >>106561476 >>106561562 >>106561572 >>106561768
Not sure what I expected.
What is this called? At the beginning the sentences are long and then it's all short and weird. I saw this before with another sloped model.
Anonymous No.106561476 [Report] >>106561506
>>106561459
I can get better outputs from llama 8B. Holy slop
Anonymous No.106561506 [Report] >>106561512 >>106563309
>>106561476
Sad because this would have been a really cool size. Fast even with offloading.
But at least they try something new.
Anonymous No.106561512 [Report]
>>106561506
even the chinks are putting in extreme safety nets. shame. Gemma3 tier slop
Anonymous No.106561514 [Report]
>>106561367
Consider the following you tranny freak
Anonymous No.106561562 [Report]
>>106561459
It kinda communicates pacing.
Anonymous No.106561572 [Report]
>>106561459
Somehow way worse than Mistral Small
Anonymous No.106561599 [Report] >>106566096
I did nothing today
Anonymous No.106561602 [Report]
>>106561166
>isn’t aimed at graphics processing
you can have a gpu render something and then display it through an iGPU's display output
i wonder if you could stick a h100 inside a normal desktop PC, install the geforce driver (after doing inf mod) and then just play games on it
Anonymous No.106561768 [Report] >>106561794
>>106561459
It's a qwen3-only problem I think. It tries to mimic the text formatting from the latest response. Also how it was trained could be the culprit, like maybe it was trained on a bunch of Chinese poems.
The pattern I noticed is like this:
1 paragraph -> 2 paragraphs -> 3 -> 4 -> 5 -> Then it ended with one line per paragraph.

So far the only way to control it is by instructing it explicitly in the system prompt. For example I'm using this:
"Respond in multiple standard paragraphs format. Avoid poetic or dramatic elements. "
Anonymous No.106561794 [Report] >>106561855
>>106561768
That helps. But what a weird writing style. Feels like Deepseek on steroids.
Anonymous No.106561807 [Report]
>>106560606
pay me
Anonymous No.106561855 [Report] >>106561861
>>106561794
now you've got a pattern of 3 paragraphs of exactly 3 lines.
Anonymous No.106561861 [Report] >>106561922
>>106561855
As god intended it. Proper paragraphs should never exceed more than 3-4 lines. I learned that in middle school
Anonymous No.106561915 [Report] >>106562085
>>106561077
>>no tits
Perfect.
Anonymous No.106561922 [Report]
>>106561861
congratulations for completing middle school anon.
nobody thought you could do it, but you did.
Anonymous No.106561944 [Report] >>106562231 >>106562694
How will qwen 80b-A3B improve my text adventures involving me being a magical kemoshota that cures predators of their fucked up fetishes?
Anonymous No.106561992 [Report]
wen qwen ggoofs
Anonymous No.106562002 [Report]
y no opera his son?
Anonymous No.106562029 [Report]
i think im gonna goof...
Anonymous No.106562070 [Report] >>106562090 >>106562124
Realistically speaking, there haven't been any improvements erp wise since llama3.3-70b and mistral large 2407
Anonymous No.106562085 [Report]
>>106561915
Oke doke gay
Anonymous No.106562090 [Report]
>>106562070
I've never once used LLMs to goon so I have no idea what this even means
But the obvious solution to get around LLMs not doing what you want is to be agentic
agents aren't only for tool calling and APIs. They can also form complex logic based on natural language, like following and maintaining a story structure despite whatever retarded shit you're trying to pull
Anonymous No.106562092 [Report]
https://www.washingtontimes.com/news/2025/sep/11/ftc-launches-inquiry-ai-chatbots-acting-companions-effects-children
Anonymous No.106562108 [Report] >>106562200
Anonymous No.106562124 [Report]
>>106562070
Air is a direct upgrade for 3-4 3090 VRAMlets
Anonymous No.106562161 [Report] >>106562200
Anonymous No.106562200 [Report]
>>106562108
>>106562161
tfw the goofs are nevermore
Anonymous No.106562231 [Report]
>>106561944
Seeing how shit it is will make you put even more effort into your RPs using Mistral Nemo
Anonymous No.106562240 [Report] >>106562244
im vibecoding vibevoice for my vibecoded local ai software. what am i in for?
Anonymous No.106562242 [Report]
Anonymous No.106562244 [Report]
>>106562240
aids
Anonymous No.106562252 [Report]
Anonymous No.106562289 [Report] >>106562298
It's up!
Anonymous No.106562298 [Report]
>>106562289
>*looks down*
Yes, it is!
Anonymous No.106562330 [Report]
>>106559371 (OP)
Cute miku I like
Anonymous No.106562331 [Report] >>106562345 >>106562353
wheres my fucking ggoofs Daniel
Anonymous No.106562345 [Report] >>106562356
>>106562331
What's happening?
Anonymous No.106562353 [Report]
>>106562331
>unsloth
>ever
lmao
Anonymous No.106562356 [Report]
>>106562345
upload the qwen3 next ggoofs you goof
Anonymous No.106562370 [Report]
>>106559420
vLLM supported it in June via https://github.com/vllm-project/vllm/commit/b69781f107b7ad847a351f584178cfafbee2b32a but it's really hacky and depends on their Extension for Pytorch and some calls in their LLM hacked backend.
The best I've seen from Intel publicly for C++ is their closed pull request inside the main Flash Attention repo.
https://github.com/Dao-AILab/flash-attention/pull/1528
This uses SYCL so yeah, it would be kind of an uphill battle for anyone who isn't an Intel developer to adapt it to the existing CUDA code.
Anonymous No.106562423 [Report] >>106562427 >>106562430 >>106562453 >>106562482 >>106562763 >>106564012
Damn phonemizers are a huge bottleneck for TTS because devs use by default the pile of trash that is espeak. On CPU for kokoro it takes almost 8-9s to preprocess a single sentence to IPA phonemes on my laptop while the inference itself is ~6s and that shit grows at O(n) or more (fucking 22s to preprocess a paragraph). Switching to g2p_en for american english + a bunch of heuristics I got from chatgpt achieves the same preprocessing output in 1.5s for a single sentence, growing at ~O(log N). I wish this field focused a bit on efficiency instead of convenience
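For anyone who wants to try the same swap, the g2p_en call itself is trivial (minimal sketch; note it emits ARPABET rather than IPA, so a kokoro-style pipeline still needs an ARPABET-to-IPA mapping step, which is where the extra heuristics come in):

from g2p_en import G2p   # pip install g2p_en

g2p = G2p()
phones = g2p("I refuse to record the use of that tool.")  # POS-aware for some heteronyms
print(phones)  # list of ARPABET symbols plus spaces/punctuation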
Anonymous No.106562427 [Report]
>>106562423
ok nerd
Anonymous No.106562430 [Report] >>106562450
>>106562423
You don't need to pre-process anything.
Anonymous No.106562450 [Report] >>106562542
>>106562430
It's not feeding the raw text to the TTS, it's preprocessing the text to phonemes before feeding them to the model
Anonymous No.106562453 [Report] >>106562493
>>106562423
Shouldn't that just be a database lookup for retarded languages like english where the pronunciation doesn't match the spelling?
Anonymous No.106562482 [Report] >>106562515 >>106562543
>>106562423
They aren't using espeak just because it is easy, it is because it has multilingual support out of the box. G2P is much harder to configure with mappings needed for each language.
Anonymous No.106562483 [Report]
>>106560481
They've made other models too, but they're mostly not open-weights. But I don't get why they don't start doing MoE models the Qwen way though, wouldn't that make them able to release them in a wider range of sizes with less compute?
Anonymous No.106562493 [Report] >>106563141
>>106562453
It's not enough, because some words have different pronunciation depending on whether they're a noun or a verb while written the same way, like "use" + other things that are context dependent
Anonymous No.106562515 [Report]
>>106562482
You're describing convenience bro. Espeak is almost twenty years old; it has memory leaks and a lot of issues that won't ever be fixed because, thanks to the GPL, no one wants to contribute to this trash.
Anonymous No.106562542 [Report] >>106562586
>>106562450
One thing that will help you regardless - doesn't matter if it gets converted to phonemes or not - is to use the contractions module
>import contractions
>cleaned_text = contractions.fix(text)
and then remove surrogates with regex and optionally add abbreviations and optionally clean up any problematic remaining characters (because LLMs always output random shit).
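Glued together it's something like this (sketch; the punctuation set is just an example, tune it to whatever your TTS chokes on):

import re
import contractions   # pip install contractions

def clean_for_tts(text: str) -> str:
    text = contractions.fix(text)                 # "you're" -> "you are"
    text = re.sub(r'[\ud800-\udfff]', '', text)   # strip stray surrogates
    text = re.sub(r'[*_~`"“”]', '', text)         # example set of characters to drop
    return text.strip()

print(clean_for_tts("You're *really* sure?"))     # -> You are really sure?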
Anonymous No.106562543 [Report] >>106562579 >>106562608
>>106562482
Sounds like an llm whose sole purpose is to take text as input and output ipa is required.
Anonymous No.106562546 [Report] >>106562549
>MUH ERP
go to sleep americlaps and huemonkeys.
productive eurochads are taking over from here.
Anonymous No.106562549 [Report]
>>106562546
Give us Miqu 3 already or at least Largestral 3. WTF are you frogs doing?
Anonymous No.106562579 [Report]
>>106562543
at that point you might as well take text as input and diffuse the audio directly.
Take in some positive/negative descriptor tokens too.
Anonymous No.106562586 [Report] >>106562603
>>106562542
Thanks, I didn't know it was a thing. I'll add that
Anonymous No.106562603 [Report]
>>106562586
Yeah so what I did with piper voice (it's instant tts, takes ~100 mb or less but it's not as robust as vibevoice of course)
>contractions
>surrogates
># remove surrogates (U+D800 to U+DFFF unicode range)
>cleaned_text = re.sub(r'[\ud800-\udfff]', '', cleaned_text)
>Then replace commas, ellipses, "", dash, em dash, and whatever else there is with either empty spaces or periods - this way TTS does not even try to do anything but it'll go straight onward - basically remove and replace everything else except periods. This is sort of trial and error, you'll need to test this and proceed accordingly.
Anonymous No.106562608 [Report]
>>106562543
There are small transformers for that (T5), but it's even slower than espeak. They're using them for disambiguation, which is fine when you don't care about latency and want the output to be as good as possible
Anonymous No.106562612 [Report]
>>106559401
kek
Anonymous No.106562680 [Report]
>>106560965
In llama.cpp? --swa-full or you can set --swa-checkpoints to something higher than 3. I think the checkpoints are made after each gen is finished. Each checkpoint consumes memory, of course.
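e.g. (sketch; model path is a placeholder, flag names as printed by llama-server -h on recent builds):
llama-server -m model.gguf --swa-full
or keep the small SWA cache and just allow more checkpoints:
llama-server -m model.gguf --swa-checkpoints 8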
Anonymous No.106562686 [Report]
>>106560346
this is investor driven, not a real organic industry with demand. Investors are just going to choose safe countries that have lower amounts of corruption. One of the unique things the USA can do is be given a bunch of money with nebulous expectations, and instead of it vanishing into oligarch palaces, people will generally try to develop something and actually hire devs to work on it AT ALL
Anonymous No.106562694 [Report] >>106562710
>>106561944
the only reason we want to try it is to see how censored it is compared to glm air
Anonymous No.106562710 [Report]
>>106562694
yeah where the fuck are the goofs
Anonymous No.106562763 [Report] >>106564024 >>106564046
>>106562423
I made my own thing to run onnx models (piper, kokoro and kittentts) and espeak as the phonemizer is almost instant.
I looked at kokoro's code and it does a lot of its own phoneme translation and text normalization before and after espeak. The bottleneck is definitely not in espeak itself.
>https://voca.ro/1fQed7Pp86Ek
Nevermind the cut at the end. It's a low q voice and i'm still not doing some of the padding correctly. And this is running on a t420.
Anonymous No.106562827 [Report] >>106562864 >>106562981 >>106563064
What purpose do LLMs serve in TTS and image generation models like VibeVoice and Flux? How are they incorporated?
Anonymous No.106562864 [Report]
>>106562827
LLMs are typically autoregressive. The next token is dependent on the previous token[s]. For TTS, the start of the sound of a phoneme depends on how the previous one ended. There's nothing specific to language models about it; it's just the autoregression.
I don't think Flux is autoregressive, but it has a text encoder, which isn't quite an LLM, but it's part of one. Converts text into an embedding which is then fed to the denoiser. Think of embeddings as abstract representations of a collection of words/sentence.
Anonymous No.106562972 [Report] >>106562998 >>106566610
Is anyone actually using GLM Air as their go-to RP model? Is it actually an improvement over Small/Nemo? I only have 32GB RAM so I can't use any decent quants, considering upgrading to run it.
Anonymous No.106562981 [Report]
>>106562827
As far as I understand, the internal state of an llm has contextual information such as sentiment, which can be used for more realistic tts. The pretrained lm is stripped of some final layers and is further trained with a new objective because it's better than starting from random parameters
Anonymous No.106562998 [Report] >>106563093
>>106562972
I am! q4_xs, 49152 context. It's shit! I hate it! But so are all the other models.

Snuff/vore (not snuff-vore).
Anonymous No.106563064 [Report]
>>106562827
Maybe part of the multi-modal push?
Although that would be about feeding pieces of images and sounds into an llm.
Anonymous No.106563074 [Report] >>106563109 >>106563163 >>106563365 >>106563394 >>106563614 >>106563680 >>106564000 >>106564086
https://techcrunch.com/2025/09/10/a-california-bill-that-would-regulate-ai-companion-chatbots-is-close-to-becoming-law/
local keeps winning
Anonymous No.106563093 [Report] >>106563147
>>106562998
Do you find it usable at higher context? I did try a ~q2_xs quant and it seemed decent until ~8k where it kept repeating itself, though that could very well be because of the low quant.
Anonymous No.106563109 [Report] >>106563394 >>106564741
>>106563074
>require AI chatbot operators to implement safety protocols for AI companions and hold companies legally accountable if their chatbots fail to meet those standards.

>The bill specifically aims to prevent companion chatbots — which the legislation defines as AI systems that provide adaptive, human-like responses and are capable of meeting a user’s social needs

C.ai is die in california
Anonymous No.106563141 [Report]
>>106562493
>some words have different pronunciation depending on whether they're a noun or a verb while written the same way, like "use" + other things that are context dependant
Oh yeah you're right, it's subtle but it's there. Honestly I don't give a fuck; after switching between the two "use"s, I think if you always used the quicker verb version literally no one would ever notice. Might be different based on accent of course
Anonymous No.106563147 [Report] >>106563175
>>106563093
I stopped after the second response. Really can't get used to the way llms write.

But in the planning and world building phases, iq4_xs is definitely very coherent and usable up to 49152. It's not smart, but it's usable. No real repetitions, even when I was incrementally instructing instead of sweeping changes to the world.

That's a different beast from rp and actual creative writing though. Even with a specific plan to draw out the deaths, i.e. pummel A, kill B, further injury on A, kill C, even more gratuitous violence on A, kill D, before finally killing A (in a sexual manner, because this is erp), it'll forget it (both at 12k after some preamble, and at 4k context when I drop it right into it), and instantly go and focus on killing/torturing A - finishing it in one or two scenes.

>But in the planning and world building phases, iq4_xs is definitely very coherent and usable up to 49152
Do note, however, you may need to write the start of the llm's response to accept the request or it may get denied. It's better (less safe) than other llms, but it still will deny some things, especially in an assistant context.

>No real repetitions
Actually, that's not true. It got an aneurysm when talking about meat toilets, and kept on repeating eeeeeeeeeeeeeeeee for about maybe a hundred times before catching itself and saying "wait no that's a typo let me fix it", which I've never seen before. Usually it's a death spiral when that happens with other models. I've only seen that happen once.
Anonymous No.106563163 [Report]
>>106563074
end of ERP coming...
Anonymous No.106563175 [Report]
>>106563147
Interesting, thanks
Anonymous No.106563309 [Report]
>>106561506
Hmm.
"Check if this is appropriate"
"I recall our guidelines"
this feels trained on gpt-oss thinking patterns.
Anonymous No.106563343 [Report] >>106563345 >>106566103
i've set up a server with Qwen3-Next-80B-A3B-Instruct for you anons, give it a try. 64K context. It's on vllm with pipeline parallelism so not the best, but it should support quite a few parallel requests. It's a bit of a frankenmix with a mixture of gpus but it's running at 65t/s for me.

url in picrel

api-key: "sk-miku"
model-name: "Qwen/Qwen3-Next-80B-A3B-Instruct"

I tested with openwebui and it requires the /v1 suffix; sillytavern does too in chat completion mode but not in text completion mode.

Someone do a cockbench
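If you're hitting it from a script instead of ST/openwebui, a plain openai-python call would look something like this (sketch; base_url is a placeholder for whatever's in picrel):

from openai import OpenAI

client = OpenAI(base_url="http://<url-from-picrel>/v1", api_key="sk-miku")
resp = client.chat.completions.create(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    messages=[{"role": "user", "content": "Say hi."}],
)
print(resp.choices[0].message.content)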
Anonymous No.106563345 [Report] >>106563350
>>106563343
you are not getting my IP anon
Anonymous No.106563350 [Report]
>>106563345
whatever use a vpn
Anonymous No.106563365 [Report] >>106563394 >>106563402 >>106563628 >>106563636
>>106563074
>regulate AI companion chatbots in order to protect minors and vulnerable users
>which the legislation defines as AI systems that provide adaptive, human-like responses and are capable of meeting a user’s social needs — from engaging in conversations around suicidal ideation, self-harm, or sexually explicit content.
the fuck, lol.
Anonymous No.106563394 [Report]
>>106563074
The regulatory moat is moat is finally coming.

>>106563109
>>106563365
>or sexually explicit content.
Got to kill any niche the big players aren't interested in to prevent future competitors.
Anonymous No.106563402 [Report]
>>106563365
As a european I think these are in fact social needs in 2025 US
Anonymous No.106563608 [Report]
https://huggingface.co/blog/faster-transformers
Looks like OpenAI did move the local ecosystem forward, not by their model itself, but with the tricks they used to run it
Anonymous No.106563614 [Report] >>106563820
>>106563074
- use jailbreak to avoid guardrails.
- sue company.
- earn $1.000.
- profit.
Anonymous No.106563628 [Report] >>106563649
>>106563365
>or sexually explicit content.
why? what's wrong with sex?
Anonymous No.106563636 [Report]
>>106563365
Jerking off is indeed just as bad as throwing yourself off a bridge
Anonymous No.106563649 [Report]
>>106563628
That's always how it goes.
First they bring out the "muh cunny". It's NEVER where it stops.
But there is no real outrage unless the normies are affected.
Wouldn't be surprised if the femoids are gonna do witch rituals in response to their claude-sama being full robo mode now. kek
Anonymous No.106563656 [Report]
ok but where are the fucking goofs?
Anonymous No.106563664 [Report] >>106563694 >>106563752 >>106564136
guys, will i be able to run some low quant of qwen3next-80b on 8gb vram/32gb ram? i'm hoping for a mistral nemo upgrade for erp...
Anonymous No.106563680 [Report] >>106563812
>>106563074
As if local models won't be trained with regulations in mind too
Anonymous No.106563691 [Report] >>106563763 >>106563772 >>106563786 >>106564411
Why is my performance so horrible? Llama-server.exe on windows 10 is mindbogglingly slow with multiple gpus. During inference, the reported power usage of my gpus is 80-90 watts when they have a limit of 350w.

call ".\llama-server.exe" --ctx-size 16384 --jinja --flash-attn auto --cache-type-v f16 --mlock --no-mmap --device CUDA0,CUDA1,CUDA2 --gpu-layers 48 --split-mode layer --tensor-split 33,33,33 --main-gpu 0 --model "D:\Models\GLM 4.5 Air 106bA12b Zhipu AI -- bartowski_zai-org_GLM-4.5-Air-GGUF\zai-org_GLM-4.5-Air-IQ4_XS-00001-of-00002.gguf"

Glm air iq4_xs with 0 context shouldn't be this slow on 3090s, no?
Anonymous No.106563694 [Report]
>>106563664
So 40 vram+ram in total. 4bit should be 40gb, so you will be able to run 3.5bpw to leave space for context
Anonymous No.106563752 [Report]
>>106563664
q4 will be ~50gb or so, so you will have to use mmap a bit to get it to run off of your SSD, which may affect your tokens per second. More ram would help, but it may run at somewhat usable speeds. May be very usable if llama ever implements its mtp layers, but that never happens - though qwen sometimes puts in their own support, so hopefully
Anonymous No.106563763 [Report]
>>106563691
>--split-mode layer --tensor-split 33,33,33
Try removing this.
Anonymous No.106563772 [Report] >>106563838 >>106563847 >>106563861
>>106563691
>--jinja
Is it doing anything? I thought GLM was supported directly.
>--flash-attn auto
Is this doing anything? Isn't it the default?
>--cache-type-v f16
Is doing nothing. It's the default.
>--main-gpu 0
Is this doing anything? Isn't 0 the default?
>--split-mode layer
Have you tried without it?
>--tensor-split 33,33,33
Have you tried without it?

Are you sure you understand what those options do or you just add them as part of a ritual?
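For comparison, a stripped-down baseline with the same model would be something like this (sketch; keep --jinja since you apparently need it, let everything else default):

call ".\llama-server.exe" --jinja --ctx-size 16384 --gpu-layers 99 --model "D:\Models\GLM 4.5 Air 106bA12b Zhipu AI -- bartowski_zai-org_GLM-4.5-Air-GGUF\zai-org_GLM-4.5-Air-IQ4_XS-00001-of-00002.gguf"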
Anonymous No.106563786 [Report] >>106563838
>>106563691
you aren't using the offload-moe option to prioritize offloading the less important non-attention layers.

They really need to fucking update this shit so it figures that crap out, because it's every fucking thread that people fuck this shit up. I'm too lazy to spoonfeed you; google, reddit, or ask grok.
Anonymous No.106563812 [Report]
>>106563680
Shh, let's just ignore that, it doesn't matter since new models are already shit anyway.
Anonymous No.106563820 [Report]
>>106563614
$1000 will get you maybe one 3090. That is not a whole lot of profit.
Anonymous No.106563838 [Report] >>106563867 >>106563879
>>106563772
>Are you sure you understand what those options do or you just add them as part of a ritual?
That's what I'm wondering.

without --jinja, the response I get is in chinese (using bartowski's gguf)

According to https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md, flash attention is disabled by default.

>--cache-type-v f16
>--main-gpu 0
You're probably right.

>--split-mode layer
I tested layer and row to see if there's a difference... it's less than 10 tk/s with row.

>--split-mode layer
>--tensor-split 33,33,33
I'll try that now.

>>106563786
I thought --gpu-layers 48 offloads everything onto the gpus?
Anonymous No.106563847 [Report] >>106563879
>>106563772
Anonymous No.106563861 [Report] >>106563879
>>106563772
Nemo on a single gpu vs on three gpus...

llama-b6451-bin-win-cuda-12.4-x64 btw
Anonymous No.106563867 [Report]
>>106563838
Yah. And I told you to fucking google it you piece of shit
Anonymous No.106563879 [Report] >>106563891
>>106563838
>According to https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md, flash attention is disabled by default.
It's not. Read llama-server -h.
>without --jinja, the response I get is in chinese (using bartowski's gguf)
Fair enough. Haven't used a GLM model, so I can't tell.
>I thought --gpu-layers 48 offloads everything onto the gpus?
Only if it has 48 layers. Does it?

>>106563847
Much better. I don't know if that's good or not, though. Post your specs and someone can tell you about their numbers.

>>106563861
MoEs are slower than dense even when moe_active_params ~= dense_model_params. Splitting a model (of any type) between gpus increases overhead.
Anonymous No.106563891 [Report] >>106563919 >>106563941
>>106563879
>Much better. I don't know if that's good or not, though. Post your specs and someone can tell you about their numbers.
It's *really* bad. Koboldcpp on linux gets me 60-80 tk/s with nvidia-smi -pl 250.
Anonymous No.106563909 [Report] >>106564274
>>106559803
I hate the word talent in general. No, bitch, I've been doing this for years, getting experience, learning how to find relevant information. It's not talent, it's hard work.
Anonymous No.106563919 [Report] >>106563941
>>106563891
>Koboldcpp on linux
You need to figure out if the problem is something on windows or if your settings aren't yet optimal on llama.cpp. Basically you just said
>all these things are different, and things are different, how can this be?
Anonymous No.106563929 [Report]
>>106560211
Bigger models are more sample efficient, so of course they'll need less steps/data than 30B-A3B. This has been known since GPT-3.
Anonymous No.106563941 [Report] >>106563960 >>106563975
>>106563891
>>106563919 (cont)
Just parsed. You're not that anon. Are you running it in 3 3090s as well? Same quant?
Anonymous No.106563960 [Report] >>106564017
>>106563941
No that's me. I'm just asking around for ideas. Going to switch driver versions so they're the same on windows and linux.

Single gpu-wise, windows performs about as I expect, 10-20% worse than linux, but with multiple gpus windows just fucks up, and I don't know why.
Anonymous No.106563975 [Report] >>106564017
>>106563941
I use koboldcpp most of the time but some other anon called me a retard before so I tried llama-server instead when posting here.
Anonymous No.106563978 [Report]
>>106561380
What's with all the radiation white spots on the sensor, is she okay?
Anonymous No.106564000 [Report]
>>106563074
china will win!
Anonymous No.106564012 [Report] >>106564024
>>106562423
Phonemizers are also bad because they bias the pronunciation of words to whatever the phonemizer says instead of how the speaker whose voice you're using would pronounce them. But yes, phonemizers are generally slow as fuck garbage. Something's wrong with your espeak though, it's not that slow.
Anonymous No.106564017 [Report] >>106564070
>>106563960
>>106563975
Some anon was whining about some particular version of cuda that messed up his image gen speed a while back. It wouldn't surprise me if it was a driver version thing.
Why on earth are you on windows if you have a working linux with kobold? It's impossible to make proper comparisons if you just change everything about a system.
Anonymous No.106564024 [Report]
>>106564012
>phonemizers are generally slow as fuck garbage
>>106562763
Something else is wrong in there. I doubt it's espeak.
Anonymous No.106564046 [Report] >>106564076
>>106562763
I'm using misaki for kokoro since it was the recommended option. The model is in UINT8 onnx format which should be fine for CPU. Here's my inference code, idk why it's taking so long.
https://pastebin.com/4U8J6VD4
Anonymous No.106564070 [Report] >>106564087
>>106564017
I don't really use linux, so it's a hassle to switch os every time I want to use bigger models at an acceptable speed.

If it's limited to a single gpu, there's not much difference between linux and windows.

Do you know where to get driver version 550.163.01 for windows 10? The oldest I have is 552.
Anonymous No.106564076 [Report]
>>106564046
You're asking me to read misaki's code, not yours. I won't do that.
What about setting USE_ESPEAK_DIRECTLY to True?
Anonymous No.106564086 [Report]
>>106563074
>The bill would require platforms to provide recurring alerts to users — every three hours for minors — reminding them that they are speaking to an AI chatbot, not a real person, and that they should take a break.
don't they know that those who want to commit suicide would rather trust ai chatbots than real human beings?
and how do they show these warnings to api users who use st?
Anonymous No.106564087 [Report] >>106564107
>>106564070
No idea. Haven't used windows for decades. And I don't run nvidia.
Anonymous No.106564094 [Report] >>106564140
https://huggingface.co/cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit
It's up.
Anonymous No.106564105 [Report] >>106564123 >>106564135
Qwen3-next is such an insane improvement. If the bigger versions can keep up we have our new local SOTA that shits on Deepseek, GLM and Kimi.
Anonymous No.106564107 [Report] >>106564154 >>106564411
>>106564087
Where should I go for more nvidia windows discussion? With both linux and windows on cuda 12.4, one driver version 550, the other 552, I'm still having the same issue. Is there any other thing to look out for? Llamacpp and koboldcpp on linux are easily twice the speed as they are on windows with multiple gpus.

Is it just the windows overhead, or is there something else I'm overlooking? Resizable bar maybe? I've enabled it in the bios, but gpu-z says it's disabled in windows.
Anonymous No.106564123 [Report]
>>106564105
Isn't qwen 3 just gptoss by chinks?
Anonymous No.106564135 [Report] >>106566778
>>106564105
if qwen 235b manages to be even slightly better, that would be a nice win. Right now glm air beats it imo and that shouldn't be the case.
Anonymous No.106564136 [Report]
>>106563664
I only have 6+32 it's over for me
Anonymous No.106564140 [Report]
>>106564094
No goof no boof
Anonymous No.106564154 [Report]
>>106564107
https://www.techpowerup.com/download/nvidia-geforce-graphics-drivers/
Little box on the left, scroll down, click on show older or whatever.
The rest is on you.
Anonymous No.106564203 [Report] >>106564224 >>106564229 >>106564230 >>106564323 >>106564409 >>106564560 >>106564607
Instead of building massive LLMs that occupy 500B-1000B, companies should focus on smaller models, like 12B. They could divide models by topic, for example:
Nemo-erp.gguf, Nemo-rp.gguf, nemo-textadventure.gguf, nemo-science.gguf, nemo-medical.gguf, etc.
This would allow a model to hold more information on a specific subject. If you want to know about a different topic, you just switch the model.
Anonymous No.106564224 [Report]
>>106564203
that is how nvidia currently envisions the future. I think they see LLMs as a dead end and are handing out homework. Whether anyone can turn that into a comprehensive product that works better than current tech is the big question.

https://research.nvidia.com/labs/lpr/slm-agents/
Anonymous No.106564229 [Report] >>106564273
>>106564203
Not The Entire Internet, we know, you're absolutely right and should get tons of grants from all the labs.
Anonymous No.106564230 [Report] >>106564244
>>106564203
just smush em all together and train a model to select the right one based on the context.
Anonymous No.106564244 [Report]
>>106564230
moebros...
Anonymous No.106564273 [Report] >>106564280
>>106564229
Commercial AI labs just want to make general-purpose models that win as many benchmarks as possible; open lab researchers just want to advance their career with as little controversy as possible. And, most RP or writing capabilities are not verifiable yet except by some very limited and/or narrow metrics.
Anonymous No.106564274 [Report]
>>106563909
my sports animes taught me that talent is in the ability to put the hard work in.
Anonymous No.106564280 [Report] >>106564291
>>106564273
>win as many benchmarks as possible
For as cheaply as possible too, if recent Qwenposting habits are anything to go by, we're well into the race to the bottom arc.
Anonymous No.106564291 [Report] >>106564300
>>106564280
ok but where are the fucking goofs?
Anonymous No.106564300 [Report]
>>106564291
GGUF is not economically viable for the shareholding please be of understand
Anonymous No.106564323 [Report]
>>106564203
Most of the topics you listed have far too much overlap with each other. If you are talking about coding and math reasoning, then that could make sense. I don't think that stuff is helpful for RP and creative writing.
Anonymous No.106564338 [Report] >>106564349 >>106564354 >>106564369 >>106564507 >>106564682 >>106565134
why did programmers hijack llms to this degree? people used them for porn long before llms could produce remotely viable code so it's just strange that they get the priority over this. llms would not be what they are right now without the support and exposure provided by the porn communities.
Anonymous No.106564349 [Report]
>>106564338
Big techanology pay more than groomers.
Anonymous No.106564354 [Report]
>>106564338
Because code is very corpo safe and makes investors think line will go uppies!
Anonymous No.106564359 [Report] >>106564365 >>106564366 >>106564386 >>106564545
>>106559371 (OP)
>>106559401
That's a guy
Anonymous No.106564365 [Report]
>>106564359
ok so are you making the goofs?
Anonymous No.106564366 [Report]
>>106564359
Yeah. Me.
Anonymous No.106564369 [Report] >>106564425
>>106564338
The abstract, almost lovecraftian higher order concept called "investor" seems to have something to do with it, spreading its slimy tendrils of influence everywhere.
Also, COOs salivate at the thought of being able to replace 1000 programmers with an AI subscription.
Anonymous No.106564386 [Report]
>>106564359
Yeah and also very minor coded if the tits aren't at least DD cup.
Anonymous No.106564398 [Report] >>106564698
https://github.com/ggml-org/llama.cpp/issues/15940
bros it's fucking BLEAK, only nocoders/vibecoders in the issue, no wonder goofs are nowhere to be seen
Anonymous No.106564409 [Report]
>>106564203
>Nemo-erp.gguf, Nemo-rp.gguf, nemo-textadventure.gguf
What world do you live in where this is something companies care enough about to put out specialized models and not something that they specifically train refusals for?
Also, these days, if they're spending money to train a 12B, they might as well train a 1000B-A12B.
llama.cpp CUDA dev !!yhbFjk57TDr No.106564411 [Report] >>106564493 >>106564784
>>106563691
Using 3x RTX 4090 and the llama.cpp defaults on Linux I get 86.47 t/s tg for 2048 tokens (should be comparable since the memory bandwidth is almost the same).
The options you posted should not be the problem.
Windows performance will always be gimped vs. Linux, I've never tested it with a multi GPU setup though.
One problem with Windows is that the overhead for each CUDA call is much higher.
CUDA graphs fix that to some extent because they allow the GPU to better schedule the kernels and hide the kernel launch overhead.
But the more GPUs you have the more synchronization overhead is needed to make sure the operations are executed in the correct order.

Make sure that driver option that automatically swaps VRAM to RAM is disabled.

>>106564107
Resizable bar should not be relevant.
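To rule out the flags entirely, it may be worth comparing a single GPU against the split modes with llama-bench on both OSes, something like (model path is yours; on Windows set the env var with set first):
>CUDA_VISIBLE_DEVICES=0 llama-bench -m model.gguf -ngl 99
>llama-bench -m model.gguf -ngl 99 -sm layer
>llama-bench -m model.gguf -ngl 99 -sm row
If the single-GPU numbers roughly match between Windows and Linux but only the split runs collapse on Windows, that points at the per-call/synchronization overhead rather than your configuration.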
Anonymous No.106564425 [Report] >>106564496
>>106564369
>Also, COOs salivate at the thought of being able to replace 1000 programmers with an AI subscription.
Just like SQL, VisualBasic, WYSIWYG, and offshoring were supposed to do right?
Anonymous No.106564493 [Report] >>106564517
>>106564411
This one right?
Anonymous No.106564496 [Report] >>106564565
>>106564425
nta. Tool consolidation makes employees easier to replace. The easier they are to replace, the cheaper they become. You no longer have the sysadmin who knows the bespoke tool stack from top to bottom at your company, you have a docker user. Now you just need a cheaper docker user, and you're stuck with docker because going back seems like too much of an investment in time and skill.
The tools you mention replace expensive employees with cheaper ones. So yes.
Anonymous No.106564507 [Report]
>>106564338
If you want retarded VCs to think your company will be worth a gorillion dollars in the future you have to sell them the promise of self-improving AI.
And good results for mememarks/programming are simply the most efficient way to do that.
llama.cpp CUDA dev !!yhbFjk57TDr No.106564517 [Report] >>106564596
>>106564493
It's been like 2 years since the last time I booted into my Windows partition but that sounds like the right one.
Anonymous No.106564545 [Report]
>>106564359
The prompt includes "1girl", "female", and "dominant female" so you are objectively wrong.
Anonymous No.106564560 [Report] >>106564579 >>106564593 >>106564600
>>106564203
How often do I have to tell you that to do good RP or text adventure, you need all the world knowledge? These are not narrow domain tasks like "classify sentiment on this smart phone review".
Anonymous No.106564565 [Report] >>106564601
>>106564496
There's no way you can look at the llama.cpp PRs where prompt engineers try to do the work of actual engineers and tell me with a straight face that you think it is viable.
Anonymous No.106564579 [Report]
>>106564560
This is totally a different anon sir.
Anonymous No.106564593 [Report] >>106564746
>>106564560
"suck my dick" and "tell me a story" are narrow domain tasks
Anonymous No.106564596 [Report]
>>106564517
It's been off. I tried with it on and off, and there wasn't much difference, all under 15 tokens/s.
Anonymous No.106564600 [Report] >>106564746
>>106564560
nta, I don't really disagree with your ideas, but it would still be nice if they didn't train on so much code and math and refusals. if they did just give a light rp tune to a pretrained base we would be light years ahead of what we have now even at the shitty data and parameter scales.
Anonymous No.106564601 [Report]
>>106564565
>tell me with a straight face that you think it is viable.
I didn't. I wouldn't. I'm telling you why VB and friends partially succeeded in their intended purpose. The higher-ups' intention with AI is exactly the same.
Anonymous No.106564607 [Report]
>>106564203
I think the current meta is to make small models by distilling larger ones
Anonymous No.106564617 [Report] >>106564637 >>106564680
ONCE AGAIN.
MATH AND CODE IS BENCHMAXXED BECAUSE RESULTS ARE OBJECTIVELY VERIFIABLE.
STORY TELLING AND RP ISN'T!
Anonymous No.106564637 [Report] >>106564779
>>106564617
>RESULTS ARE OBJECTIVELY VERIFIABLE
are they really when you're cheating though
Anonymous No.106564680 [Report] >>106564741 >>106564779
>>106564617
Take those trillions of tokens of code and math off, replace with real conversations (of any/all kinds) and literature; finetune on top of that. It probably won't win benchmarks but it will likely have much more situational and conversational awareness than anything released so far.
Anonymous No.106564682 [Report] >>106564715
>>106564338
coders found a new way to do nothing while looking busy
Anonymous No.106564698 [Report]
>>106564398
i'd like to make a difference but unfortunately you can probably guess which group i belong to
Anonymous No.106564715 [Report]
>>106564682
Anonymous No.106564741 [Report] >>106565071
>>106564680
>real conversations
sir that's illegal >>106563109
>The bill specifically aims to prevent companion chatbots — which the legislation defines as AI systems that provide adaptive, human-like responses and are capable of meeting a user’s social needs
Anonymous No.106564746 [Report]
>>106564593
The former, yes. The latter, no. It's basically the most open domain you can imagine.

>>106564600
>it would still be nice if they didn't train on so much code and math and refusals
I don't disagree with this, especially refusals. A bit less focus on code and math and more on popculture would be cool too.
Anonymous No.106564779 [Report] >>106564789 >>106564799 >>106564812 >>106564837
>>106564637
Doesn't matter. They know the reply they want from the model. If it's what they expect, it's good. Number go up.
You cannot do that with storytelling or RP. You can show a bit of storytelling to two anons and they'll have different opinions about it.
>>106564680
You can verify math and code, so they benchmaxx on math and code. Whether it overrides convos or not is neither here nor there. We don't have a good way to measure storytelling and RP. Do you not understand?
Anonymous No.106564784 [Report] >>106564822 >>106564834 >>106565629
>>106564411
And this is what it's like on linux (debian 13, driver version 550). I don't know how to screenshot.

15 tk/s vs 60 tk/s...

I have no idea what's wrong with my windows.

Are there any other anons with 3x 3090s on a zen 2 cpu with 3200 mhz ram, x16 gen 4 to all gpus, windows 10 iot ltsc 21h2? What kind of performance vs linux?
Anonymous No.106564789 [Report]
>>106564779
You are absolutely right-- very sharp thinking.
Anonymous No.106564799 [Report]
>>106564779
This really gets to the heart of modern LLM datasets.
Anonymous No.106564812 [Report] >>106564817
>>106564779
EQ-bench thoughbeltever
Anonymous No.106564817 [Report]
>>106564812
no
Anonymous No.106564822 [Report] >>106564832
>>106564784
microsoft, not even once
Anonymous No.106564826 [Report]
Any interesting gooner releases lately?
Anonymous No.106564832 [Report]
>>106564822
Well, could just be something wrong with my configs, cuda dev's windows doesn't have that problem.
Anonymous No.106564834 [Report] >>106564871
>>106564784
Windows-
>power settings via control panel:
>minimum processor state: 100%
>pci express:
>link state power management: off
>nvidia settings
>power management mode: pref. max. performance
Also check that your ram is running at max speed, but that would be a bios issue, not an OS one...
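If you'd rather set the power plan from an admin prompt (alias from memory, double-check it):
>powercfg /getactivescheme
>powercfg /setactive SCHEME_MIN
SCHEME_MIN should be the High performance plan.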
Anonymous No.106564837 [Report] >>106564874
>>106564779
literally who cares. make the benchmarks for the math version and release a best-attempt rp writing tune. it would be basically free for them to do. it's just that the cult of safety would never allow it.
Anonymous No.106564871 [Report] >>106564906
>>106564834
>power plan
Yeah, that's one of the first things I do on a windows install.
Nvidia settings are maximum performance.
Both report 3200mhz inside the os.
Anonymous No.106564874 [Report]
>>106564837
Ok. You don't understand.
Anonymous No.106564881 [Report] >>106564896
chill down the spine
Anonymous No.106564896 [Report]
>>106564881
*Anti-Slop Backtracks you*
Anonymous No.106564906 [Report] >>106564923 >>106564936
>>106564871
Are you sure your torch installation isn't running cpu only? This is the biggest culprit.
Other than that, can't really help you.
Anonymous No.106564923 [Report] >>106564961
>>106564906
Oh wait you were talking about llama.cpp.
You might have actually downloaded the wrong binaries - are you sure you are using
>https://github.com/ggml-org/llama.cpp/releases/download/b6451/cudart-llama-bin-win-cuda-12.4-x64.zip
>https://github.com/ggml-org/llama.cpp/releases/download/b6451/llama-b6451-bin-win-cuda-12.4-x64.zip
Anonymous No.106564936 [Report] >>106565141
>>106564906
Wait we needed torch?
Anonymous No.106564961 [Report] >>106565105
>>106564923
Mine is llama-b6451-bin-win-cuda-12.4-x64. Other than the power management, what other factors might contribute to performance? Stable diffusion is fine... but that's not multi-gpu.
Anonymous No.106565061 [Report]
Hey, I just noticed that docker compose has a model: section. Has anyone tried that? How does it compare to running text gen webui?
Anonymous No.106565071 [Report] >>106565076
>>106564741
The actual bill doesn't outright ban companion chatbots or weights for them. It's for online platforms (Meta AI, Character AI, etc) providing companion chatbot services to minors.
Anonymous No.106565076 [Report] >>106565098
>>106565071
And you think they won't just do the same to the models they release, to be extra safe?
Anonymous No.106565098 [Report]
>>106565076
They probably will, but that's not what the bill is about. Online services can easily have extra moderation layers on top of unmoderated models like Character.AI has always done.
Anonymous No.106565105 [Report]
>>106564961
Chipset drivers are one thing but I don't know.
Anonymous No.106565134 [Report]
>>106564338
because of retarded shit like that california bill
Anonymous No.106565141 [Report]
>>106564936
Bro
Anonymous No.106565224 [Report] >>106565281 >>106565387 >>106565645
All three 3090s are pinned at 16GT/s PCIe Link Speed in hwinfo64 during generation, and gpu-z reports perfcap reason as idle. Gpu memory controller load is 10%.

Wait what? Nvidia-smi reports two 3090s at p3, and one at p2. Is this it?
Anonymous No.106565281 [Report] >>106565351
>>106565224
Eh, memory clocks are at 650mhz? They're 1187 if only a single gpu is used.
Anonymous No.106565351 [Report]
>>106565281
If I add the gpu driving my display into the mix, it becomes even lower, at 100mhz. Is it because they're different graphics cards? Even though they're 3090s, two are msi, and one is palit. Hmm, but even when I tested only the msi cards, they went down to 650mhz if splitting the model.
Anonymous No.106565387 [Report] >>106565434 >>106565452 >>106565453 >>106565477
>>106565224
if you are layer splitting then adding more processors will result in a lower average speed because they spend a greater percentage of time idling. play with a model you can run fully on a single card and then try splitting it. observe the happenings?
Anonymous No.106565434 [Report]
>>106565387
I did that with nemo ofc, same behaviour, nerfed by more than 200% when split. By row is even worse. I'm going to try vllm on windows tomorrow and see if it still exhibits the same behavior.
Anonymous No.106565452 [Report]
>>106565387
Oh, I misread that. It was something like a 20-30% performance difference between linux and windows with only one gpu the last time I tested.
Anonymous No.106565453 [Report]
>>106565387
that's not true at all, if you're running something halfway decent like a GB200 supercluster the performance hit is minimal.
Anonymous No.106565477 [Report] >>106565567
>>106565387
>idling
Is that why the memory clocks drop when it's split? Normal behavior? How do I check the memory clocks on linux?
Anonymous No.106565567 [Report] >>106565591
>>106565477
it's normal behavior. most people don't want their gpu burning 40 watts to move a cursor around a static display. they're supposed to dynamically clock based on load, but there's latency to the procedure. some anons have noted performance boosts from locking their cards' clocks or using a more aggressive power profile.
Anonymous No.106565591 [Report] >>106565625 >>106565975
>>106565567
Power profile? Do I need the geforce experience thing for that? With just the driver, I can set the power management to prefer maximum performance, but I don't know how to lock clocks or change power profiles.
Anonymous No.106565625 [Report] >>106565649
>>106565591
if you change your power profile in the Nvidia driver settings thing you need to reboot for it to take effect. you can use msi afterburner to lock the gpu clocks to any value, not sure about memory clock tho.
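nvidia-smi can lock both from an admin shell too, if your driver/card supports it; the values below are placeholders, pull the real ones from the supported-clocks query:
>nvidia-smi -q -d SUPPORTED_CLOCKS
>nvidia-smi -lgc <min>,<max>
>nvidia-smi -lmc <min>,<max>
>nvidia-smi -rgc (and -rmc) to reset
-lmc needs a reasonably recent driver.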
Anonymous No.106565629 [Report] >>106565866 >>106566288
>>106564784
bro just use WSL
Anonymous No.106565645 [Report] >>106565658
>>106565224
install linux and lock memory clock to maximum
options nvidia NVreg_RegistryDwords="PowerMizerEnable=0x1; PowerMizerDefaultAC=0x1; PowerMizerLevelAC=0x1; PowerMizerLevel=0x1; PerfLevelSrc=0x2222"
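(that line goes in a file under /etc/modprobe.d/, e.g. a hypothetical nvidia-pm.conf; on debian you may also need update-initramfs -u and a reboot before it takes effect)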
Anonymous No.106565649 [Report]
>>106565625
>if you change your power profile in the Nvidia driver settings thing you need to reboot for it to take effect

If it's just in the driver settings, then that's done; the power plans for cpu and gpu are among the first things I set when I install windows.
Anonymous No.106565658 [Report] >>106565664
>>106565645
Does that carry over to windows after a reboot?
Anonymous No.106565664 [Report]
>>106565658
no
Anonymous No.106565782 [Report] >>106565806 >>106565831
>>106561205
nvidia didn't invent GPUs, what the fuck?
lying scumbags
Anonymous No.106565806 [Report] >>106565813 >>106565886
>>106565782
Anonymous No.106565813 [Report]
>>106565806
kek
Anonymous No.106565831 [Report]
>>106565782
Next they will buy intel and invent the CPU as well.
Anonymous No.106565866 [Report]
>>106565629
Would that actually help?
Anonymous No.106565886 [Report]
>>106565806
okay but when is nvidia omegaverse
Anonymous No.106565975 [Report]
>>106565591
On windows you can use evga precision x1 to lock the cards at their max boost clock. Tripled my tokens per second.
Anonymous No.106566096 [Report]
>>106561599
literally me
Anonymous No.106566103 [Report] >>106566151
>>106563343
i would very much like to use that and i appreciate you hosting it for anons, but i dont really want to use it
i dont want to connect to an api, but thank you anyway anon
Anonymous No.106566151 [Report] >>106566297
>>106566103
>i dont want to connect to an api
What kind of lmg anon doesn't want to connect to an api not hosted on their own hardware?
Anonymous No.106566288 [Report]
>>106565629
already on esl thanks
Anonymous No.106566297 [Report]
>>106566151
Are you okay?
Anonymous No.106566428 [Report]
Threadly reminder that Fiction LiveBench is the best benchmark to ever grace the field
Anonymous No.106566610 [Report]
>>106562972
I am, iq4_xs at 32k context. It's not great and tends to be very repetitive. Swiping more or less gives the same response every time.
Anonymous No.106566647 [Report] >>106566696
So, having only used koboldcpp, oobabooga, and llama-server, which all came with frontends, what's a good one for vllm? I tried open webui, but it instantly asked for my email on startup (why?)
Anonymous No.106566696 [Report]
>>106566647
>but it instantly asked for my email on startup
It's an internal login system, so you can just type in anything as an email address as long as it has an @. There won't be any email sent unless you set something up for it. It's just to make it look more corpo I guess.
Anonymous No.106566778 [Report]
>>106564135
How/what are you using GLM Air (for)?
It has been pretty shit for me. Especially when talking about software architecture. I don't know about RP shit and I have been using an API (not local) so it's not even the quant probably.
While Qwen3 and Kimi are an absolute blast.
Anonymous No.106566854 [Report]
>>106566836
>>106566836
>>106566836
Anonymous No.106566918 [Report]
https://research.google/blog/vaultgemma-the-worlds-most-capable-differentially-private-llm/
https://services.google.com/fh/files/blogs/vaultgemma_tech_report.pdf
https://huggingface.co/google/vaultgemma-1b

The future of Google LLMs: models that know nothing about rare information. They use a huge batch size to mitigate memorization, among other things.

>What does this mean in practice? Informally speaking, because we provide protection at the sequence level, if information relating to any (potentially private) fact or inference occurs in a single sequence, then VaultGemma essentially does not know that fact: the response to any query will be statistically similar to the result from a model that never trained on the sequence in question. However, if many training sequences contain information relevant to a particular fact, then in general VaultGemma will be able to provide that information.
>
> [...] Sequence-level DP provably bounds the influence of any single training sequence (example) on the final model. We prompted the model with a 50-token prefix from a training document to see if it would generate the corresponding 50-token suffix. VaultGemma 1B shows no detectable memorization of its training data and successfully demonstrates the efficacy of DP training.
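A rough way to poke at the memorization claim yourself, not the paper's exact protocol, just the same idea at home; the doc string is a placeholder for a suspected training sequence of at least ~100 tokens:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/vaultgemma-1b"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

doc = "..."  # placeholder: paste a sequence you suspect was in the training data (>= ~100 tokens)
ids = tok(doc, return_tensors="pt").input_ids[0]
prefix, true_suffix = ids[:50], ids[50:100]

# greedy-decode 50 tokens from the 50-token prefix
out = model.generate(prefix.unsqueeze(0), max_new_tokens=50, do_sample=False)
gen_suffix = out[0, len(prefix):]

# compare the generated continuation against the real one, token by token
n = min(len(gen_suffix), len(true_suffix))
match = (gen_suffix[:n] == true_suffix[:n]).float().mean().item()
print(f"token match rate: {match:.2%}")  # near zero = no verbatim regurgitation

If a regular model regurgitates the suffix and VaultGemma doesn't, that's the DP training doing what they claim.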
Anonymous No.106567105 [Report]
gemma 4 soon