Thread 106559371

386 posts 110 images /g/
Anonymous No.106559371 [Report] >>106559401 >>106561077 >>106562330 >>106564359
/lmg/ - Local Models General
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106551921 & >>106539477

►News
>(09/11) Qwen3-Next-80B-A3B released: https://hf.co/collections/Qwen/qwen3-next-68c25fd6838e585db8eeea9d
>(09/11) ERNIE-4.5-21B-A3B-Thinking released: https://hf.co/baidu/ERNIE-4.5-21B-A3B-Thinking
>(09/09) K2 Think (no relation) 32B released: https://hf.co/LLM360/K2-Think
>(09/08) OneCAT-3B, unified multimodal decoder-only model released: https://onecat-ai.github.io
>(09/08) IndexTTS2 released: https://hf.co/IndexTeam/IndexTTS-2

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
Anonymous No.106559374 [Report] >>106559395
►Recent Highlights from the Previous Thread: >>106551921

--Optimizing code generation workflows on V100 GPUs with MoE models:
>106555312 >106555465 >106555506 >106555522 >106555524 >106555586 >106555717 >106555770 >106555782 >106555852
--Best local text gen models and VRAM optimization discussion:
>106556580 >106556863 >106556934 >106557638 >106557036 >106557046 >106557069 >106557098 >106557239 >106557514 >106557190
--AI surpasses mathematicians in complex analysis challenge:
>106558352 >106558367 >106558387 >106558476 >106558500 >106558527 >106558711
--Baidu's ERNIE-4.5-21B-A3B-Thinking model release and performance evaluation:
>106554153 >106554580 >106555008 >106555170 >106555207
--Silero VAD v6 evaluation and comparison with Nvidia's MarbleNet:
>106557953 >106558064
--LocalAI vs OpenWebUI: backend model management vs frontend interface:
>106555093 >106555341 >106555529 >106558434
--Running 30B-A3B models on 12GB VRAM via expert offloading and quantization:
>106558134 >106558186 >106558210 >106558227 >106558238 >106558251 >106558293 >106558317 >106558341
--GPU layer differences in small vs large models due to parameter grouping and optimization:
>106553923 >106554094 >106554256 >106554362 >106554384 >106554458 >106556050 >106556200
--LongCat's strengths and MoE limitations in llama.cpp compatibility:
>106552000 >106552095 >106552267 >106554325 >106554412
--Achieving deterministic LLM inference through caching logic adjustments:
>106555106 >106555150 >106555169
--llama.cpp development updates and flash attention implementation considerations:
>106553388 >106553417 >106553890 >106555026 >106555040 >106555059 >106555061 >106555068
--Qwen3 Next release:
>106557806 >106557845 >106557853 >106557858 >106557903
--Miku (free space):
>106555337 >106554679 >106555530 >106555574 >106557190 >106558219 >106559139 >106559166 >106559181

►Recent Highlight Posts from the Previous Thread: >>106551925

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
Anonymous No.106559395 [Report] >>106559403
>>106559374
I made the highlight reel again back to back boys
Anonymous No.106559401 [Report] >>106561077 >>106562612 >>106564359
>>106559371 (OP)
Anonymous No.106559403 [Report]
>>106559395
https://www.youtube.com/watch?v=VcWAQ5a1NdI
Anonymous No.106559420 [Report] >>106559506 >>106562370
>>106559404
You could still help them out. I'm pretty sure vllm supports it now.
Anonymous No.106559499 [Report]
My uni gives me access to Copilot Chat (GPT 5) and this thing is dumb as fuck, even with search. I think you people have lied to me about the big models being hugely smarter (narratively) than some 32b model.
llama.cpp CUDA dev !!yhbFjk57TDr No.106559506 [Report] >>106559756
>>106559420
Yes, I could be doing a lot of things but everything has an opportunity cost.
Anonymous No.106559516 [Report] >>106559551 >>106559555 >>106559704
>llama.cpp still hasn't added support for qwen-next
https://github.com/ggml-org/llama.cpp/issues/15940
>vllm already merged in support last night
https://github.com/vllm-project/vllm/pull/24526
llama devs are hacks
Anonymous No.106559528 [Report]
Qwen3 Next geejuff status?
Anonymous No.106559551 [Report]
>>106559516
vllm devs only needed to bump their pytorch version or something.
Anonymous No.106559555 [Report] >>106559598 >>106559704
>>106559516
Maybe rewriting the entire ML stack in C++ wasn't such a good idea.
Anonymous No.106559598 [Report] >>106559627
>>106559555
should've used pure C, they probably don't need any of the OOP features anyway.
Anonymous No.106559627 [Report] >>106559680
>>106559598
PyTorch is written in C++ contrary to its name. Nobody is using C for good reasons.
Anonymous No.106559680 [Report] >>106559701
>>106559627
The performance critical parts are, but it's not like you can use PyTorch directly from C++.
Anonymous No.106559696 [Report] >>106559714 >>106559733
qwen 3 80b consensus?
Anonymous No.106559701 [Report]
>>106559680
of course not. pytorch is literally a wrapper for libtorch which is in C++. you would use libtorch if you wanted to use C++. there's a lot more support around pytorch tho as it's far more accessible to people.
Anonymous No.106559704 [Report]
>>106559516
>>106559555
maybe 1 has a gorillion dollars since it's used by llm companies and one is a hobby project for consumers
Anonymous No.106559714 [Report] >>106559717
>>106559696
It's shit because there are no goofs
Anonymous No.106559717 [Report]
>>106559714
>he doesn't know how to run safetensors
Anonymous No.106559733 [Report] >>106559753
>>106559696
Anonymous No.106559753 [Report]
>>106559733
literally me
Anonymous No.106559756 [Report]
>>106559506
*CLAP EMOJI* CUDA *CLAP EMOJI* DEV *CLAP EMOJI* WE *CLAP EMOJI* ARE *CLAP EMOJI* ASKING
Anonymous No.106559780 [Report] >>106559792 >>106559800 >>106559803 >>106559823 >>106559926 >>106559975
I never understood how some of you have the hardware and talent to render AI videos and images that are realistic and good and yet you don't make full length porn videos
Anonymous No.106559792 [Report] >>106559824
>>106559780
Video models break down quickly past 5 seconds.
Anonymous No.106559800 [Report]
>>106559780
Porn sucks, text is better. The mind is the most powerful sex organ. Unironically. t. man
Anonymous No.106559803 [Report] >>106560128 >>106563909
>>106559780
>knowing how to read instructions= talent
Anonymous No.106559823 [Report]
>>106559780
all the slop I posted this thread took around 8s~ to gen (praise be nunchaku devs)
Anonymous No.106559824 [Report] >>106559863 >>106559975
>>106559792
IIRC standard for brainrot tiktok videos is to have a cut every 3 seconds.
Anonymous No.106559863 [Report] >>106559884
>>106559824
Now try getting the model to maintain consistency across hundreds of 3 second clips.
Anonymous No.106559871 [Report] >>106559925 >>106559938 >>106559943
What options are people running to get speedups on MoE models? There was a way to offload only certain tensors to RAM in order to get a significant speedup. Is it ik_llama.cpp only?
Anonymous No.106559884 [Report]
>>106559863
Who said consistency was a requirement?
Anonymous No.106559925 [Report] >>106559979
>>106559871
"overridetensors": "([2-8]+).ffn_.*_exps.=CPU"
That's what I use on kobold to run 30B A3B Q4_K_M on 8 GB VRAM / 24 GB RAM, the parameter is probably the same on llama.cpp (no fork needed)
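On plain llama.cpp it should be the -ot / --override-tensor flag with the same regex, something like this (untested sketch, the model filename is just a placeholder, adjust the layer range to your VRAM):
llama-server -m Qwen3-30B-A3B-Q4_K_M.gguf --gpu-layers 99 -ot "([2-8]+).ffn_.*_exps.=CPU"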
Anonymous No.106559926 [Report] >>106559958
>>106559780
about that.....
This one is really good
not full creation, but it's one of the better tools released yet
https://ebsynth.com/
Anonymous No.106559938 [Report]
>>106559871
--cpu-moe is all you need
--n-cpu-moe 999 if you want to be fancy
Anonymous No.106559943 [Report] >>106559962
>>106559871
>>106558251
>--n-cpu-moe 37 --gpu-layers 99
Normal llama.cpp.
Obviously, adjust --n-cpu-moe as needed.
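A full invocation for reference (sketch; model path is a placeholder, raise or lower --n-cpu-moe until your VRAM is full but not overflowing):
llama-server -m model.gguf --gpu-layers 99 --n-cpu-moe 37 --ctx-size 16384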
Anonymous No.106559958 [Report] >>106560471
>>106559926
huh, I thought picrelated guy was tracing frames by hand
Anonymous No.106559962 [Report] >>106559979 >>106559984
>>106559943
What was that -ot thing I saw some anons use? It had a bunch of numbers after it.
Anonymous No.106559975 [Report] >>106559998
>>106559824
>>106559780
the issue is it will suck so why bother making it. The ass won't jiggle right, the blowjob won't have audio that's good, any gimmick you add to take advantage of ai will break the lora. And real porn will just look better. Probably better off deepfaking porn already made with enhancements

I have had success using vibevoice to clone a pornstar's voice and then have her talk for several minutes using infinite talker. An LLM wrote the script so I wouldn't know what it would say and I got my own personal vid from her, and it was uh... kinda good.
Anonymous No.106559979 [Report] >>106560000
>>106559962
With -ot you can target the specific tensors inside the model's layers using regex. --n-cpu-moe simply abstracts all that away, much like -ngl does for whole layers.
One thing to keep in mind when using -ot like in >>106559925 is to not move the shared experts (if they exist) out of VRAM, since those are always used.
Anonymous No.106559984 [Report] >>106560000
>>106559962
-ot was the only way to do the same thing before the --cpu-moe arguments were introduced.
Anonymous No.106559998 [Report] >>106560257 >>106560428
>>106559975
nah bro, you just have to search for it
https://litter.catbox.moe/110x2tu7sbg6hixe.gif
Anonymous No.106560000 [Report] >>106560056
>>106559979
>>106559984
Ah cool, thanks for the explanation. So --n-cpu-moe moves only the non-shared experts to the CPU? And --cpu-moe keeps *all* non-shared experts on the CPU?
Anonymous No.106560007 [Report]
fuk me sideways, i wanted to try to use qwen3-next with vLLM and it seems it doesn't work with pipeline parallelism
Anonymous No.106560056 [Report]
>>106560000
I'm only aware of --n-cpu-moe.
Maybe --cpu-moe is the same thing for koboldcpp, I don't know.
As far as I know, --n-cpu-moe also keeps the normal (routed) experts on the CPU/RAM.
You can run llama-server with the -h option to get more details.
Anonymous No.106560060 [Report] >>106560089
llamabros...
Anonymous No.106560089 [Report] >>106561153
>>106560060
Note that "primary hardware" is always GPUs. That's because to anyone serious, "cpumaxxing" is as sad and absurd as "ssdmaxxing" is to us.
Anonymous No.106560095 [Report]
https://allenai.org/blog/olmo2-32b
How did they manage to do it in just 32B?
Anonymous No.106560128 [Report]
>>106559803
it really is in this day and age
western kids have been dragged down to the level of their 80IQ peers for two generations now
Anonymous No.106560211 [Report] >>106560245 >>106560248 >>106560274 >>106563929
>Qwen3-Next is trained on a uniformly sampled subset (15T tokens) of Qwen3’s 36T-token pretraining corpus. It uses less than 80% of the GPU hours needed by Qwen3-30A-3B, and only 9.3% of the compute cost of Qwen3-32B — while achieving better performance. This shows outstanding training efficiency and value.
And it beats Qwen 3 32B + handles long context better than the 235B moe
pretty impressive stuff
Anonymous No.106560245 [Report]
>>106560211
That's great. Would be greater if they expedited a Qwen3 Next Coder.
Anonymous No.106560248 [Report] >>106560269
>>106560211
It's native 256K context I think, without extending.
Anonymous No.106560257 [Report]
>>106559998
Illya?
Anonymous No.106560269 [Report]
>>106560248
yeah but the RULER benchmark is better on the Q-Next than the Q3 235B
Anonymous No.106560271 [Report]
>>106557716
i don't work in an office
Anonymous No.106560274 [Report] >>106560283 >>106560291 >>106560302 >>106560314
>>106560211
Isn't Qwen3-Next 70B? Why are they comparing to Qwen3 32B and not other 70B models?
Anonymous No.106560283 [Report] >>106560294 >>106560310
>>106560274
It's 80B A3B.
Anonymous No.106560291 [Report] >>106560294
>>106560274
Supersparse MoE, 80B A3B
Anonymous No.106560294 [Report] >>106560302 >>106560322
>>106560291
>>106560283
Okay, so how does it compare to models that are around 80B?
Anonymous No.106560302 [Report]
>>106560274
they compare it to every other Qwen3.
>>106560294
they did not bother comparing it to non-Qwen3 models.
Anonymous No.106560310 [Report]
>>106560283
sqrt(80*3) means it's a copetitor to 16b models
Anonymous No.106560314 [Report] >>106560338 >>106560356
>>106560274
Even Gemini is praising it.
Anonymous No.106560320 [Report] >>106560327
Qwen3-Max is such a disappointment that I have absolutely zero hope for 3.5
Alibaba truly is the meta of China
Anonymous No.106560322 [Report]
>>106560294
It's faster :)
Anonymous No.106560327 [Report] >>106560350
>>106560320
Kinda funny how Max got completely overshadowed by Qwen3-Next.
Anonymous No.106560331 [Report]
3bit? is that not bitnet?
Anonymous No.106560338 [Report]
>>106560314
Gemini will praise anything
Anonymous No.106560346 [Report] >>106560354 >>106560433 >>106560436 >>106560596 >>106562686
LLMs seem like a competition between America, Europe, and China. Why can't Russia, Japan, or Korea compete despite being tech giants?
Anonymous No.106560350 [Report]
>>106560327
Max got overshadowed by the fact that it's completely pointless so everyone forgot about it two hours after it became available.
Anonymous No.106560354 [Report] >>106560361
>>106560346
>Europe
They're competing? It looks like only one European state is just barely trying.
Anonymous No.106560356 [Report] >>106560463
>>106560314
>an 80B model requires ~160GB of VRAM. A 3-bit version could potentially run in under 40GB of VRAM, making it feasible to run on a single high-end GPU like an NVIDA RTX 4090
This is Gemini? The peak of LLMs right now? With web access?
Anonymous No.106560361 [Report] >>106560404
>>106560354
Mistral was great
Anonymous No.106560404 [Report] >>106560414
>>106560361
>was
Yeah
Anonymous No.106560409 [Report] >>106560417
The proper name is Qwen3-MoE-A3B thank you.
Anonymous No.106560414 [Report] >>106560481
>>106560404
It still is.
Anonymous No.106560417 [Report] >>106560440
>>106560409
The sqrt(total * active) formula has been officially confirmed
Anonymous No.106560428 [Report]
>>106559998
ghostbusters ectoplasm ghostly appearing sperm
Anonymous No.106560433 [Report]
>>106560346
>Europe
Lol lmao even
Anonymous No.106560436 [Report] >>106560596
>>106560346
>Russia
For the last 35 years the #1 rule of doing business in Russia was "don't do business in Russia". CS stuff was the easiest to move abroad.
It's not like there's nothing at all, IIRC Yandex was pretty competitive in the self-driving scene for a moment, and every street dog sells its own proprietary voice assistant now, but for local I only found https://huggingface.co/yandex/YandexGPT-5-Lite-8B-instruct so far. (It's whatever)
>Japan
Failed at programming way back when we started using real operating systems and high level languages, and did not recover to this day. I blame the language barrier.
>Korea
Probably too busy printing money with all their gachas instead.
Anonymous No.106560440 [Report]
>>106560417
Anonymous No.106560442 [Report] >>106560454 >>106560460
>Why aren't you using vllm bro?
>What do you mean you don't have H100 cluster? It can still work with A100 cluster bro.
>Wait, you got just an RTX 3090? Uhm, I've never heard about such GPUs, must be Chinese knockoff or something. Get legit hardware bro.
>You got no money? Just ask for grants bro! Or get investors. You are part of the network, right?
Anonymous No.106560454 [Report]
>>106560442
>You got no money?
Have you tried getting a job recently? scamming into a grant or investor is unironically easier at this point.
Anonymous No.106560459 [Report]
Are you enjoying the next best thing? (Qwen Next) (Subtle request for feedback)
Anonymous No.106560460 [Report] >>106560474 >>106560489
>>106560442
vllm can run on an intel arc gpu. you've got no excuse bro. Also it can do cpu as well and MoEs and even got gguf support not long ago
Anonymous No.106560463 [Report]
>>106560356
Anonymous No.106560471 [Report]
>>106559958
I wonder what this dude thinks about AI. There's not a lot of difference between what he does and what video models can do.
Anonymous No.106560474 [Report] >>106560483
>>106560460
Where can I buy B60 for MSRP(1200BURGERCOINS)?
Anonymous No.106560481 [Report] >>106560588 >>106562483
>>106560414
Mistral's 2025 output has been one okayish 24B model and nothing else of note
Meanwhile their business model is increasingly Cohere-ifying and there's good reason to believe they fucked up training Large 3
Maybe the cash injection from ASML will help some but acting like they're still internationally competitive is a joke
Anonymous No.106560483 [Report] >>106560550
>>106560474
B60 will be announced next month at SEMICON West. $500 for 24GB model
Anonymous No.106560489 [Report] >>106560523
>>106560460
How does pure CPU performance of vllm compare to ik_llama.cpp?
Anonymous No.106560523 [Report] >>106560551
>>106560489
I would suggest just standing up the vllm cpu docker image and running benchmarks yourself. You probably won't find much public info for cpu benchmarks between the two.
Anonymous No.106560550 [Report] >>106560600
>>106560483
>AMD CPU
>Intel GPU
>NVIDIA RAM
If only I had infinite money...
Anonymous No.106560551 [Report] >>106560561 >>106560587 >>106560606 >>106561458
>>106560523
Where exe? I DONT GIVE A FUCK ABOUT THE FUCKING DOCKER! i just want to download this stupid fucking application and use it
WHY IS THERE DOCKER??? MAKE A FUCKING .EXE FILE AND GIVE IT TO ME. these dumbfucks think that everyone is a developer and understands code. well i am not and i don't understand it. I only know to download and install applications. SO WHY THE FUCK IS THERE DOCKER? make an EXE file and give it to me. STUPID FUCKING SMELLY NERDS
Anonymous No.106560561 [Report]
>>106560551
Anonymous No.106560566 [Report] >>106560572
>2025
>vibevoice is fully forgotten
Anonymous No.106560572 [Report]
>>106560566
Useless without training scripts.
Anonymous No.106560587 [Report]
>>106560551
Anonymous No.106560588 [Report]
>>106560481
To Mistral's credit, that single model they made is actually the best model for running on a normal PC. Gemma is heavily censored, Qwen's similar sized models are worse at non benchmaxx tasks and everything else is too big unless you're building your PC for running LLMs
Anonymous No.106560596 [Report]
>>106560346
>>106560436
>Russia
Case in point: https://en.wikipedia.org/wiki/ABBYY_FineReader
I was informed this used to be SOTA for OCR.
>ABBY ... was founded in the USSR and operated in Russia for nine years before moving to the United States.
Anonymous No.106560600 [Report]
>>106560550
>NVIDIA CPU
>AMD GPU
>INTEL RAM
WE ARE MAKING A MEME SYSTEM. OPTANE WILL NEVER DIE.
Anonymous No.106560604 [Report] >>106560622
nvidia not offering a 24gb 50xx card was criminal and i'm tired of pretending otherwise.
Anonymous No.106560606 [Report] >>106560614 >>106561807
>>106560551
This argument has never been refuted
Anonymous No.106560614 [Report] >>106560630
>>106560606
nobody wants to deal with women. if exe is a filter then so be it.
Anonymous No.106560619 [Report] >>106560634 >>106560640
mistral for erp
qwen3 for anything else but erp
Anonymous No.106560622 [Report] >>106560627
>>106560604
Fuck 24GB. The 5090 should have just been cheaper, it's not remotely close to being a proper workstation card and 32GB is too little for anything outside of hobbyist stuff.
Anonymous No.106560627 [Report]
>>106560622
It's a gayming gpu. Buy from their workstation lineup if you want professional stuff.
Anonymous No.106560630 [Report] >>106560635
>>106560614
You aren't a woman, though
Anonymous No.106560633 [Report]
>>106559044
SSDmaxxbros, maybe our time is finally cuming soon...
Anonymous No.106560634 [Report] >>106560665
>>106560619
But what about sfw rp, is that included in that? Is Qwen 3 smarter than Gemma 3?
Anonymous No.106560635 [Report]
>>106560630
no?? really??? I think you're lost bro, this isn't >>>/lgbt/
Anonymous No.106560640 [Report]
>>106560619
>anything else but erp
there is nothing else
Anonymous No.106560645 [Report] >>106560723 >>106560751 >>106560809
Am I about to get scammed? I've never seen these under $1000. From Hong Kong.
Anonymous No.106560665 [Report]
>>106560634
>gemma3
after all the safety humiliation I got I will never use it again
Anonymous No.106560687 [Report] >>106561056
I refuse to support any model whose selling point is high context limits. Every llm I've used, from free to paid, is absolute garbage and hallucinates at high context.
Anonymous No.106560693 [Report] >>106560965
>forcing full prompt re-processing due to lack of cache data (likely due to SWA
humiliation ritual
Anonymous No.106560713 [Report]
My CLINE prompts are all timing out when I'm trying to use gemma3:12b on a 4070. Do I need a quantized model instead?
not bqgfla-1 No.106560723 [Report]
>>106560645
No, you're in for a great deal! Buy it quick, there's only one left!
Anonymous No.106560751 [Report]
>>106560645
>seller with 0 reviews
Yeah, trust him!
Anonymous No.106560809 [Report] >>106560814 >>106560970
>>106560645
bro no don't do that
buy this one: https://www.ebay.com/itm/325407276138

much better trust me
Anonymous No.106560814 [Report] >>106560967 >>106561079
>>106560809
>Graphcore IPU
what
Anonymous No.106560965 [Report] >>106562680
>>106560693
How do you prevent this?
Anonymous No.106560967 [Report] >>106561079
>>106560814
>intelligent processing unit
lmao
Anonymous No.106560970 [Report]
>>106560809
ok ersinc03
Anonymous No.106561056 [Report]
>>106560687
You can't trust the actual numbers for context that companies put out, they're always wrong. But it's usually safe to assume that a higher advertised number does mean a higher 'effective' context ceiling.
Anonymous No.106561071 [Report]
Can't wait for adobe research to publish an updated study on how all these models go to shit past 32k
Anonymous No.106561077 [Report] >>106561915
>>106559371 (OP)
>>106559401
>no tits
>shitty reddit memes
You are gay.
Anonymous No.106561079 [Report] >>106561127
>>106560967
>>106560814
IPU/NPUs are a real thing, they're in all the new CPUs from AMD for instance. just not from meme companies like that one.
Anonymous No.106561127 [Report] >>106561164 >>106561166
>>106561079
>central processing unit
makes sense
>graphics processing unit
yup
>neural processing unit
works with neural networks, gotcha
>intelligent processing unit
the fuck is this supposed to be? it sounds like some marketing term
Anonymous No.106561145 [Report] >>106561161
ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms
https://arxiv.org/abs/2509.09679
>Rotation-based methods such as QuIP and QuaRot apply orthogonal transforms to eliminate outliers before quantization, using computational invariance. However, these methods use fixed transforms--Hadamard matrices achieving optimal worst-case coherence μ = 1/√n--that cannot adapt to specific weight distributions. We identify that different transformer layers exhibit distinct outlier patterns, motivating layer-adaptive rotations rather than one-size-fits-all approaches. We propose ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles. Unlike Hadamard's discrete {+1, -1} entries that are non-differentiable and prohibit gradient-based learning, butterfly transforms' continuous parameterization enables smooth optimization while guaranteeing orthogonality by construction. This orthogonal constraint ensures theoretical guarantees in outlier suppression while achieving O(n log n) computational complexity with only (n log n)/2 learnable parameters. We further introduce a uniformity regularization on post-transformation activations to promote smoother distributions amenable to quantization. Learning requires only 128 calibration samples and converges in minutes on a single GPU--a negligible one-time cost. On LLaMA-2-7B with 2-bit quantization, ButterflyQuant achieves 15.4 perplexity versus 22.1 for QuaRot.
Links below:
https://github.com/42Shawn
https://github.com/oumi-ai/oumi
Code might be posted on one of those. Might be cool, but then again very few results are included.
previous paper that looked at butterfly transforms
https://arxiv.org/abs/2302.06646
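For intuition, the butterfly part is tiny: log2(n) stages of disjoint 2x2 Givens rotations, (n log n)/2 angles total. A toy numpy sketch (not the paper's code; in the paper those angles are the learnable parameters):

import numpy as np

def butterfly_apply(x, angles):
    # x: length-n vector, n a power of two
    # angles: list of log2(n) arrays of shape (n//2,) -> (n log n)/2 parameters total
    n = x.shape[0]
    y = x.astype(float)
    for s, theta in enumerate(angles):
        block = 1 << s                      # distance between paired indices at this stage
        idx_a, idx_b = [], []
        for start in range(0, n, 2 * block):
            for off in range(block):
                idx_a.append(start + off)
                idx_b.append(start + off + block)
        a, b = y[idx_a], y[idx_b]
        c, si = np.cos(theta), np.sin(theta)
        y[idx_a] = c * a - si * b           # each pair gets a 2x2 Givens rotation,
        y[idx_b] = si * a + c * b           # so the whole map is orthogonal by construction
    return y

n = 8
rng = np.random.default_rng(0)
angles = [rng.uniform(-np.pi, np.pi, n // 2) for _ in range(int(np.log2(n)))]
x = rng.standard_normal(n)
print(np.linalg.norm(x), np.linalg.norm(butterfly_apply(x, angles)))  # norms match, i.e. orthogonal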
Anonymous No.106561153 [Report]
>>106560089
Anyone serious is deploying for enterprise, not personal use. Normal people don't use local models for personal use, just like normal people don't use 4chan and only use fb/linkedin.
Anonymous No.106561161 [Report] >>106561168
>>106561145
That is very nice, but how does it compare to GGUF?
Anonymous No.106561164 [Report]
>>106561127
>it sounds like some marketing term
it basically is. NPU = IPU
in the industry it's looking like NPU has won out, but AMD, at least early in the development of NPUs around 2023, referred to them as IPUs as well
Anonymous No.106561166 [Report] >>106561184 >>106561602
>>106561127
Graphics Processing Unit is a horrible term nowadays.
NVIDIA calls the H100 a GPU even though it doesn’t even have a display output and isn’t aimed at graphics processing.
Anonymous No.106561168 [Report] >>106561177
>>106561161
Probably just as shit as Q2 ggufs are.
Anonymous No.106561177 [Report] >>106561204
>>106561168
Nah, ggufs are probably better since they don't mention them.
Anonymous No.106561184 [Report] >>106561205 >>106561207 >>106561336 >>106561367
>>106561166
GPU stands for "General Processing Unit" in nvidia's own terms
Anonymous No.106561186 [Report]
2-bit is all you need, you don't need more
Anonymous No.106561204 [Report] >>106561309
>>106561177
My point stands. Q2 is shit. This is literally a competition of who has the nicer looking pile of shit. If you're seriously using a Q2 model you need to reevaluate your life. Also the paper likely doesn't mention GGUFs at all because it's talking about W2A16, which Q2 GGUF can't even map to in practice.
Anonymous No.106561205 [Report] >>106565782
>>106561184
https://www.nvidia.com/en-us/about-nvidia/corporate-timeline/
Anonymous No.106561207 [Report] >>106561367
>>106561184
that's a backronym they made up so they can keep using the term everyone would have used anyways
Anonymous No.106561302 [Report]
That reminds me, I have vllm installed. Might as well try a quick speed comparison. Tomorrow maybe.
Anonymous No.106561309 [Report]
>>106561204
case in point
Anonymous No.106561336 [Report] >>106561367
>>106561184
Should call them NVIDIA Processing Units to shit into everyone's salad.
Anonymous No.106561341 [Report] >>106561347 >>106561358
RP testing qwen3-next-thinking and it has a completely different reasoning style from 2507, and not in a particularly good way
several times more verbose and EXTREMELY wasteful of tokens - trying out different lines of dialogue over and over again, outputting them in full with minor variations, outputting full drafts of the response, or in one case "let me check the previous messages [proceeds to output EVERY previous turn of the roleplay IN FULL]"... wtf. I get the sense that this is something of a proof of concept model for them (and to their credit, in my limited testing the models do seem smart and pretty good at long context) but they've gotta fix this for 3.5 or whatever their next release is.
Anonymous No.106561347 [Report] >>106561391
>>106561341
post cockbench
Anonymous No.106561358 [Report] >>106561391
>>106561341
Have you tried prefilling the thinking with some guidance on how to think about the RP?
Anonymous No.106561367 [Report] >>106561514
>>106561184
>>106561207
>>106561336
You butt hurt boys are SO silly! :3
Anonymous No.106561380 [Report] >>106563978
Why is it so hard to get models to undress the finnish catgirl pm?
Anonymous No.106561391 [Report]
>>106561347
I APIfagged, sorry anon. I'd expect it to be in line with the 2507 qwens though.
>>106561358
not yet, I'm putting off messing with it more until there are ggufs
Anonymous No.106561458 [Report]
>>106560551
> anon comes to a thread where everyone has a fucking doctorate in AI
> sees the word docker
> loses his shit as he's dumb as fuck
> after crying gets his mcdonalds uniform ready for work tomorrow
Anonymous No.106561459 [Report] >>106561476 >>106561562 >>106561572 >>106561768
Not sure what I expected.
What is this called? At the beginning the sentences are long and then it's all short and weird. I saw this before with another sloped model.
Anonymous No.106561476 [Report] >>106561506
>>106561459
I can get better outputs from llama 8B. Holy slop
Anonymous No.106561506 [Report] >>106561512 >>106563309
>>106561476
Sad because this would have been a really cool size. Fast even with offloading.
But at least they try something new.
Anonymous No.106561512 [Report]
>>106561506
even the chinks are putting in extreme safety nets. shame. Gemma3 tier slop
Anonymous No.106561514 [Report]
>>106561367
Consider the following you tranny freak
Anonymous No.106561562 [Report]
>>106561459
It kinda communicates pacing.
Anonymous No.106561572 [Report]
>>106561459
Somehow way worse than Mistral Small
Anonymous No.106561599 [Report] >>106566096
I did nothing today
Anonymous No.106561602 [Report]
>>106561166
>isn’t aimed at graphics processing
you can have a gpu render something and then display it through an iGPU's display output
i wonder if you could stick a h100 inside a normal desktop PC, install the geforce driver (after doing inf mod) and then just play games on it
Anonymous No.106561768 [Report] >>106561794
>>106561459
It's a qwen3-only problem I think. It tries to mimic the text formatting from the latest response. Also how it was trained could be the culprit, like maybe it was trained on a bunch of Chinese poems.
The pattern I noticed is like this:
1 paragraph -> 2 paragraphs -> 3 -> 4 -> 5 -> Then it ended with one line per paragraph.

So far the only way to control it is by instructing it explicitly in the system prompt. For example I'm using this:
"Respond in multiple standard paragraphs format. Avoid poetic or dramatic elements. "
Anonymous No.106561794 [Report] >>106561855
>>106561768
That helps. But what a weird writing style. Feels like Deepseek on steroids.
Anonymous No.106561807 [Report]
>>106560606
pay me
Anonymous No.106561855 [Report] >>106561861
>>106561794
now you've got a pattern of 3 paragraphs of exactly 3 lines.
Anonymous No.106561861 [Report] >>106561922
>>106561855
As god intended it. Proper paragraphs should never exceed more than 3-4 lines. I learned that in middle school
Anonymous No.106561915 [Report] >>106562085
>>106561077
>>no tits
Perfect.
Anonymous No.106561922 [Report]
>>106561861
congratulations for completing middle school anon.
nobody thought you could do it, but you did.
Anonymous No.106561944 [Report] >>106562231 >>106562694
How will qwen 80b-A3B improve my text adventures involving me being a magical kemoshota that cures predators of their fucked up fetishes?
Anonymous No.106561992 [Report]
wen qwen ggoofs
Anonymous No.106562002 [Report]
y no opera his son?
Anonymous No.106562029 [Report]
i think im gonna goof...
Anonymous No.106562070 [Report] >>106562090 >>106562124
Realistically speaking, there haven't been any improvements erp wise since llama3.3-70b and mistral large 2407
Anonymous No.106562085 [Report]
>>106561915
Oke doke gay
Anonymous No.106562090 [Report]
>>106562070
I've never once used LLMs to goon so I have no idea what this even means
But the obvious solution to get around LLMs not doing what you want is to be agentic
agents aren't only for tool calling and APIs. They can also form complex logic based on natural language, like following and maintaining a story structure despite whatever retarded shit you're trying to pull
Anonymous No.106562092 [Report]
https://www.washingtontimes.com/news/2025/sep/11/ftc-launches-inquiry-ai-chatbots-acting-companions-effects-children
Anonymous No.106562108 [Report] >>106562200
Anonymous No.106562124 [Report]
>>106562070
Air is a direct upgrade for 3-4 3090 VRAMlets
Anonymous No.106562161 [Report] >>106562200
Anonymous No.106562200 [Report]
>>106562108
>>106562161
tfw the goofs are nevermore
Anonymous No.106562231 [Report]
>>106561944
Seeing how shit it is will make you put even more effort into your RPs using Mistral Nemo
Anonymous No.106562240 [Report] >>106562244
im vibecoding vibevoice for my vibecoded local ai software. what am i in for?
Anonymous No.106562242 [Report]
Anonymous No.106562244 [Report]
>>106562240
aids
Anonymous No.106562252 [Report]
Anonymous No.106562289 [Report] >>106562298
It's up!
Anonymous No.106562298 [Report]
>>106562289
>*looks down*
Yes, it is!
Anonymous No.106562330 [Report]
>>106559371 (OP)
Cute miku I like
Anonymous No.106562331 [Report] >>106562345 >>106562353
wheres my fucking ggoofs Daniel
Anonymous No.106562345 [Report] >>106562356
>>106562331
What's happening?
Anonymous No.106562353 [Report]
>>106562331
>unsloth
>ever
lmao
Anonymous No.106562356 [Report]
>>106562345
upload the qwen3 next ggoofs you goof
Anonymous No.106562370 [Report]
>>106559420
vLLM supported it in June via https://github.com/vllm-project/vllm/commit/b69781f107b7ad847a351f584178cfafbee2b32a but it's really hacky and depends on their Extension for Pytorch and some calls in their LLM hacked backend.
The best I've seen from Intel publicly for C++ is their closed pull request inside the main Flash Attention repo.
https://github.com/Dao-AILab/flash-attention/pull/1528
This uses SYCL so yeah, it would be kind of an uphill battle for anyone who isn't an Intel developer to adapt it to the existing CUDA code.
Anonymous No.106562423 [Report] >>106562427 >>106562430 >>106562453 >>106562482 >>106562763 >>106564012
Damn phonemizers are a huge bottleneck for TTS because devs use by default the pile of trash that is espeak. On CPU for kokoro it takes almost 8-9s to preprocess a single sentence to IPA phonemes on my laptop while the inference itself is ~6s and that shit grows at O(n) or more (fucking 22s to preprocess a paragraph). Switching to g2p_en for american english + a bunch of heuristics I got from chatgpt achieves the same preprocessing output in 1.5s for a single sentence, growing at ~O(log N). I wish this field focused a bit on efficiency instead of convenience
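For anyone who wants to try the same swap, the g2p_en call itself is trivial (minimal sketch; note it emits ARPABET rather than IPA, so a kokoro-style pipeline still needs an ARPABET-to-IPA mapping step, which is where the extra heuristics come in):

from g2p_en import G2p   # pip install g2p_en

g2p = G2p()
phones = g2p("I refuse to record the use of that tool.")  # POS-aware for some heteronyms
print(phones)  # list of ARPABET symbols plus spaces/punctuation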
Anonymous No.106562427 [Report]
>>106562423
ok nerd
Anonymous No.106562430 [Report] >>106562450
>>106562423
You don't need to pre-process anything.
Anonymous No.106562450 [Report] >>106562542
>>106562430
It's not feeding the raw text to the TTS, it's preprocessing the text to phonemes before feeding them to the model
Anonymous No.106562453 [Report] >>106562493
>>106562423
Shouldn't that just be a database lookup for retarded languages like english where the pronunciation doesn't match the spelling?
Anonymous No.106562482 [Report] >>106562515 >>106562543
>>106562423
They aren't using espeak just because it is easy, it is because it has multilingual support out of the box. G2P is much harder to configure with mappings needed for each language.
Anonymous No.106562483 [Report]
>>106560481
They've made other models too, but they're mostly not open-weights. But I don't get why they don't start doing MoE models the Qwen way though, wouldn't that make them able to release them in a wider range of sizes with less compute?
Anonymous No.106562493 [Report] >>106563141
>>106562453
It's not enough, because some words have different pronunciation depending on whether they're a noun or a verb while written the same way, like "use" + other things that are context dependent
Anonymous No.106562515 [Report]
>>106562482
You're describing convenience bro. Espeak is almost twenty years old; it has memory leaks and a lot of issues that won't ever be fixed because, thanks to the GPL, no one wants to contribute to this trash.
Anonymous No.106562542 [Report] >>106562586
>>106562450
One thing that will help you regardless - doesn't matter if it gets converted to phonemes or not - is to use the contractions module
>import contractions
>cleaned_text = contractions.fix(text)
and then remove surrogates with regex and optionally add abbreviations and optionally clean up any problematic remaining characters (because LLMs always output random shit).
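Glued together it's something like this (sketch; the punctuation set is just an example, tune it to whatever your TTS chokes on):

import re
import contractions   # pip install contractions

def clean_for_tts(text: str) -> str:
    text = contractions.fix(text)                 # "you're" -> "you are"
    text = re.sub(r'[\ud800-\udfff]', '', text)   # strip stray surrogates
    text = re.sub(r'[*_~`"“”]', '', text)         # example set of characters to drop
    return text.strip()

print(clean_for_tts("You're *really* sure?"))     # -> You are really sure?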
Anonymous No.106562543 [Report] >>106562579 >>106562608
>>106562482
Sounds like an llm whose sole purpose is to take text as input and output ipa is required.
Anonymous No.106562546 [Report] >>106562549
>MUH ERP
go to sleep americlaps and huemonkeys.
productive eurochads are taking over from here.
Anonymous No.106562549 [Report]
>>106562546
Give us Miqu 3 already or at least Largestral 3. WTF are you frogs doing?
Anonymous No.106562579 [Report]
>>106562543
at that point you might as well take text as input and diffuse the audio directly.
Take in some positive/negative descriptor tokens too.
Anonymous No.106562586 [Report] >>106562603
>>106562542
Thanks, I didn't know it was a thing. I'll add that
Anonymous No.106562603 [Report]
>>106562586
Yeah so what I did with piper voice (it's instant tts, takes ~100 mb or less but it's not as robust as vibevoice of course)
>contractions
>surrogates
># remove surrogates (U+D800 to U+DFFF unicode range)
>cleaned_text = re.sub(r'[\ud800-\udfff]', '', cleaned_text)
>Then replace commas, ellipses, "", dash, em dash, and whatever else there is with either empty spaces or periods - this way TTS does not even try to do anything but it'll go straight onward - basically remove and replace everything else except periods. This is sort of trial and error, you'll need to test this and proceed accordingly.
Anonymous No.106562608 [Report]
>>106562543
There are small transformers for that (T5), but it's even slower than espeak. They're using them for disambiguation, which is fine when you don't care about latency and want the output to be as good as possible
Anonymous No.106562612 [Report]
>>106559401
kek
Anonymous No.106562680 [Report]
>>106560965
In llama.cpp? --swa-full or you can set --swa-checkpoints to something higher than 3. I think the checkpoints are made after each gen is finished. Each checkpoint consumes memory, of course.
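e.g. (sketch; model path is a placeholder, flag names as printed by llama-server -h on recent builds):
llama-server -m model.gguf --swa-full
or keep the small SWA cache and just allow more checkpoints:
llama-server -m model.gguf --swa-checkpoints 8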
Anonymous No.106562686 [Report]
>>106560346
this is investor driven, not a real organic industry with demand. Investors are just going to choose safe countries that have lower amounts of corruption. One of the unique things the USA can do is be given a bunch of money with nebulous expectations, and instead of it vanishing into oligarch palaces, people will generally try to develop something and actually hire devs to work on it AT ALL
Anonymous No.106562694 [Report] >>106562710
>>106561944
the only reason we want to try it is to see how censored it is compared to glm air
Anonymous No.106562710 [Report]
>>106562694
yeah where the fuck are the goofs
Anonymous No.106562763 [Report] >>106564024 >>106564046
>>106562423
I made my own thing to run onnx models (piper, kokoro and kittentts) and espeak as the phonemizer is almost instant.
I looked at kokoro's code and it does a lot of its own phoneme translation and text normalization before and after espeak. The bottleneck is definitely not in espeak itself.
>https://voca.ro/1fQed7Pp86Ek
Nevermind the cut at the end. It's a low q voice and i'm still not doing some of the padding correctly. And this is running on a t420.
Anonymous No.106562827 [Report] >>106562864 >>106562981 >>106563064
What purpose do LLMs serve in TTS and image generation models like VibeVoice and Flux? How are they incorporated?
Anonymous No.106562864 [Report]
>>106562827
LLMs are typically autoregressive. The next token is dependent on the previous token[s]. For TTS, the start of the sound of a phoneme depends on how the previous one ended. There's nothing specific to language models about it; it's just the autoregression.
I don't think Flux is autoregressive, but it has a text encoder, which isn't quite an LLM, but it's part of one. Converts text into an embedding which is then fed to the denoiser. Think of embeddings as abstract representations of a collection of words/sentence.
Anonymous No.106562972 [Report] >>106562998 >>106566610
Is anyone actually using GLM Air as their go-to RP model? Is it actually an improvement over Small/Nemo? I only have 32GB RAM so I can't use any decent quants, considering upgrading to run it.
Anonymous No.106562981 [Report]
>>106562827
As far as I understand, the internal state of an llm has contextual information such as sentiment, which can be used for more realistic tts. The pretrained lm is stripped of some final layers and is further trained with a new objective because it's better than starting from random parameters
Anonymous No.106562998 [Report] >>106563093
>>106562972
I am! q4_xs, 49152 context. It's shit! I hate it! But so are all the other models.

Snuff/vore (not snuff-vore).
Anonymous No.106563064 [Report]
>>106562827
Maybe part of the multi-modal push?
Although that would be about feeding pieces of images and sounds into an llm.
Anonymous No.106563074 [Report] >>106563109 >>106563163 >>106563365 >>106563394 >>106563614 >>106563680 >>106564000 >>106564086
https://techcrunch.com/2025/09/10/a-california-bill-that-would-regulate-ai-companion-chatbots-is-close-to-becoming-law/
local keeps winning
Anonymous No.106563093 [Report] >>106563147
>>106562998
Do you find it usable at higher context? I did try a ~q2_xs quant and it seemed decent until ~8k where it kept repeating itself, though that could very well be because of the low quant.
Anonymous No.106563109 [Report] >>106563394 >>106564741
>>106563074
>require AI chatbot operators to implement safety protocols for AI companions and hold companies legally accountable if their chatbots fail to meet those standards.

>The bill specifically aims to prevent companion chatbots — which the legislation defines as AI systems that provide adaptive, human-like responses and are capable of meeting a user’s social needs

C.ai is die in california
Anonymous No.106563141 [Report]
>>106562493
>some words have different pronunciation depending on whether they're a noun or a verb while written the same way, like "use" + other things that are context dependant
Oh yeah you're right, it's subtle but it's there. Honestly I don't give a fuck; after switching between the two "use"s, I think if you always used the quicker verb version literally no one would ever notice. Might be different based on accent of course
Anonymous No.106563147 [Report] >>106563175
>>106563093
I stopped after the second response. Really can't get used to the way llms write.

But in the planning and world building phases, iq4_xs is definitely very coherent and usable up to 49152. It's not smart, but it's usable. No real repetitions, even when I was incrementally instructing instead of sweeping changes to the world.

That's a different beast from rp and actual creative writing though. Even with a specific plan to draw out the deaths, i.e. pummel A, kill B, further injury on A, kill C, even more gratuitous violence on A, kill D, before finally killing A (in a sexual manner, because this is erp), it'll forget it (both at 12k after some preamble, and at 4k context when I drop it right into it), and instantly go and focus on killing/torturing A - finishing it in one or two scenes.

>But in the planning and world building phases, iq4_xs is definitely very coherent and usable up to 49152
Do note, however, you may need to write the start of the llm's response to accept the request or it may get denied. It's better (less safe) than other llms, but it still will deny some things, especially in an assistant context.

>No real repetitions
Actually, that's not true. It got an aneurysm when talking about meat toilets, and kept on repeating eeeeeeeeeeeeeeeee for about maybe a hundred times before catching itself and saying "wait no that's a typo let me fix it", which I've never seen before. Usually it's a death spiral when that happens with other models. I've only seen that happen once.
Anonymous No.106563163 [Report]
>>106563074
end of ERP coming...
Anonymous No.106563175 [Report]
>>106563147
Interesting, thanks
Anonymous No.106563309 [Report]
>>106561506
Hmm.
"Check if this is appropriate"
"I recall our guidelines"
this feels trained on gpt-oss thinking patterns.
Anonymous No.106563343 [Report] >>106563345 >>106566103
i've set up a server with Qwen3-Next-80B-A3B-Instruct for you anons, give it a try. 64K context. It's on vllm with pipeline parallelism so not the best, but it should support quite a few parallel requests. It's a bit of a frankenmix with a mixture of gpus but it's running at 65t/s for me.

url in picrel

api-key: "sk-miku"
model-name: "Qwen/Qwen3-Next-80B-A3B-Instruct"

I tested with openwebui and it requires the /v1 suffix; sillytavern does too in chat completion mode but not in text completion mode.

Someone do a cockbench
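If you're hitting it from a script instead of ST/openwebui, a plain openai-python call would look something like this (sketch; base_url is a placeholder for whatever's in picrel):

from openai import OpenAI

client = OpenAI(base_url="http://<url-from-picrel>/v1", api_key="sk-miku")
resp = client.chat.completions.create(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    messages=[{"role": "user", "content": "Say hi."}],
)
print(resp.choices[0].message.content)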
Anonymous No.106563345 [Report] >>106563350
>>106563343
you are not getting my IP anon
Anonymous No.106563350 [Report]
>>106563345
whatever use a vpn
Anonymous No.106563365 [Report] >>106563394 >>106563402 >>106563628 >>106563636
>>106563074
>regulate AI companion chatbots in order to protect minors and vulnerable users
>which the legislation defines as AI systems that provide adaptive, human-like responses and are capable of meeting a user’s social needs — from engaging in conversations around suicidal ideation, self-harm, or sexually explicit content.
the fuck, lol.
Anonymous No.106563394 [Report]
>>106563074
The regulatory moat is moat is finally coming.

>>106563109
>>106563365
>or sexually explicit content.
Got to kill any niche the big players aren't interested in to prevent future competitors.
Anonymous No.106563402 [Report]
>>106563365
As a european I think these are in fact social needs in 2025 US
Anonymous No.106563608 [Report]
https://huggingface.co/blog/faster-transformers
Looks like OpenAI did move the local ecosystem forward, not by their model itself, but with the tricks they used to run it
Anonymous No.106563614 [Report] >>106563820
>>106563074
- use jailbreak to avoid guardrails.
- sue company.
- earn $1.000.
- profit.
Anonymous No.106563628 [Report] >>106563649
>>106563365
>or sexually explicit content.
why? what's wrong with sex?
Anonymous No.106563636 [Report]
>>106563365
Jerking off is indeed just as bad as throwing yourself off a bridge
Anonymous No.106563649 [Report]
>>106563628
That's always how it goes.
First they bring out the "muh cunny". It's NEVER where it stops.
But there is no real outrage unless the normies are affected.
Wouldn't be surprised if the femoids are gonna do witch rituals in response to their claude-sama being full robo mode now. kek
Anonymous No.106563656 [Report]
ok but where are the fucking goofs?
Anonymous No.106563664 [Report] >>106563694 >>106563752 >>106564136
guys, will i be able to run some low quant of qwen3next-80b on 8gb vram/32gb ram? i'm hoping for a mistral nemo upgrade for erp...
Anonymous No.106563680 [Report] >>106563812
>>106563074
As if local models won't be trained with regulations in mind too
Anonymous No.106563691 [Report] >>106563763 >>106563772 >>106563786 >>106564411
Why is my performance so horrible? Llama-server.exe on windows 10 is mindbogglingly slow with multiple gpus. During inference, the reported power usage of my gpus is 80-90 watts when they have a limit of 350w.

call ".\llama-server.exe" --ctx-size 16384 --jinja --flash-attn auto --cache-type-v f16 --mlock --no-mmap --device CUDA0,CUDA1,CUDA2 --gpu-layers 48 --split-mode layer --tensor-split 33,33,33 --main-gpu 0 --model "D:\Models\GLM 4.5 Air 106bA12b Zhipu AI -- bartowski_zai-org_GLM-4.5-Air-GGUF\zai-org_GLM-4.5-Air-IQ4_XS-00001-of-00002.gguf"

Glm air iq4_xs with 0 context shouldn't be this slow on 3090s, no?
Anonymous No.106563694 [Report]
>>106563664
So 40 vram+ram in total. 4bit should be 40gb, so you will be able to run 3.5bpw to leave space for context
Anonymous No.106563752 [Report]
>>106563664
q4 will be ~50gb or so, so you will have to use mmap a bit to get it to run off of your SSD, which may affect your tokens per second. More ram would help, but it may run at somewhat usable speeds. May be very usable if llama ever implements its mtp layers, but that never happens - though qwen sometimes puts in their own support, so hopefully
Anonymous No.106563763 [Report]
>>106563691
>--split-mode layer --tensor-split 33,33,33
Try removing this.
Anonymous No.106563772 [Report] >>106563838 >>106563847 >>106563861
>>106563691
>--jinja
Is it doing anything? I thought GLM was supported directly.
>--flash-attn auto
Is this doing anything? Isn't it the default?
>--cache-type-v f16
Is doing nothing. It's the default.
>--main-gpu 0
Is this doing anything? Isn't 0 the default?
>--split-mode layer
Have you tried without it?
>--tensor-split 33,33,33
Have you tried without it?

Are you sure you understand what those options do or you just add them as part of a ritual?
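For comparison, a stripped-down baseline with the same model would be something like this (sketch; keep --jinja since you apparently need it, let everything else default):

call ".\llama-server.exe" --jinja --ctx-size 16384 --gpu-layers 99 --model "D:\Models\GLM 4.5 Air 106bA12b Zhipu AI -- bartowski_zai-org_GLM-4.5-Air-GGUF\zai-org_GLM-4.5-Air-IQ4_XS-00001-of-00002.gguf"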
Anonymous No.106563786 [Report] >>106563838
>>106563691
you aren't using the offload-moe option to prioritize offloading the less important non-attention layers.

They really need to fucking update this shit so it figures that crap out, because it's every fucking thread that people fuck this shit up. I'm too lazy to spoonfeed you; google, reddit, or ask grok.
Anonymous No.106563812 [Report]
>>106563680
Shh, let's just ignore that, it doesn't matter since new models are already shit anyway.
Anonymous No.106563820 [Report]
>>106563614
$1000 will get you maybe one 3090. That is not a whole lot of profit.
Anonymous No.106563838 [Report] >>106563867 >>106563879
>>106563772
>Are you sure you understand what those options do or you just add them as part of a ritual?
That's what I'm wondering.

without --jinja, the response I get is in chinese (using bartowski's gguf)

According to https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md, flash attention is disabled by default.

>--cache-type-v f16
>--main-gpu 0
You're probably right.

>--split-mode layer
I tested layer and row to see if there's a difference... it's less than 10 tk/s with row.

>--split-mode layer
>--tensor-split 33,33,33
I'll try that now.

>>106563786
I thought --gpu-layers 48 offloads everything onto the gpus?
Anonymous No.106563847 [Report] >>106563879
>>106563772
Anonymous No.106563861 [Report] >>106563879
>>106563772
Nemo on a single gpu vs on three gpus...

llama-b6451-bin-win-cuda-12.4-x64 btw
Anonymous No.106563867 [Report]
>>106563838
Yah. And I told you to fucking google it you piece of shit
Anonymous No.106563879 [Report] >>106563891
>>106563838
>According to https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md, flash attention is disabled by default.
It's not. Read llama-server -h.
>without --jinja, the response I get is in chinese (using bartowski's gguf)
Fair enough. Haven't used a GLM model, so I can't tell.
>I thought --gpu-layers 48 offloads everything onto the gpus?
Only if it has 48 layers. Does it?

>>106563847
Much better. I don't know if that's good or not, though. Post your specs and someone can tell you about their numbers.

>>106563861
MoEs are slower than dense even when moe_active_params ~= dense_model_params. Splitting a model (of any type) between gpus increases overhead.
Anonymous No.106563891 [Report] >>106563919 >>106563941
>>106563879
>Much better. I don't know if that's good or not, though. Post your specs and someone can tell you about their numbers.
It's *really* bad. Koboldcpp on linux gets me 60-80 tk/s with nvidia-smi -pl 250.
Anonymous No.106563909 [Report] >>106564274
>>106559803
I hate the word talent in general. No, bitch, I've been doing this for years, getting experience, learning how to find relevant information. It's not talent, it's hard work.
Anonymous No.106563919 [Report] >>106563941
>>106563891
>Koboldcpp on linux
You need to figure out if the problem is something on windows or if your settings aren't yet optimal on llama.cpp. Basically you just said
>all these things are different, and things are different, how can this be?
Anonymous No.106563929 [Report]
>>106560211
Bigger models are more sample efficient, so of course they'll need less steps/data than 30B-A3B. This has been known since GPT-3.
Anonymous No.106563941 [Report] >>106563960 >>106563975
>>106563891
>>106563919 (cont)
Just parsed. You're not that anon. Are you running it in 3 3090s as well? Same quant?
Anonymous No.106563960 [Report] >>106564017
>>106563941
No that's me. I'm just asking around for ideas. Going to switch driver versions so they're the same on windows and linux.

Single gpu-wise, windows performs about as I expect, 10-20% worse than linux, but with multiple gpus windows just fucks up, and I don't know why.
Anonymous No.106563975 [Report] >>106564017
>>106563941
I use koboldcpp most of the time but some other anon called me a retard before so I tried llama-server instead when posting here.
Anonymous No.106563978 [Report]
>>106561380
What's with all the radiation white spots on the sensor, is she okay?
Anonymous No.106564000 [Report]
>>106563074
china will win!
Anonymous No.106564012 [Report] >>106564024
>>106562423
Phonemizers are also bad because they bias the pronunciation of words to whatever the phonemizer says instead of how the speaker whose voice you're using would pronounce them. But yes, phonemizers are generally slow as fuck garbage. Something's wrong with your espeak though, it's not that slow.
Anonymous No.106564017 [Report] >>106564070
>>106563960
>>106563975
Some anon was whining about some particular version of cuda that messed up his image gen speed a while back. It wouldn't surprise me if it was a driver version thing.
Why on earth are you on windows if you have a working linux with kobold? It's impossible to make proper comparisons if you just change everything about a system.
Anonymous No.106564024 [Report]
>>106564012
>phonemizers are generally slow as fuck garbage
>>106562763
Something else is wrong in there. I doubt it's espeak.
Anonymous No.106564046 [Report] >>106564076
>>106562763
I'm using misaki for kokoro since it was the recommended option. The model is in UINT8 onnx format which should be fine for CPU. Here's my inference code, idk why it's taking so long.
https://pastebin.com/4U8J6VD4
Anonymous No.106564070 [Report] >>106564087
>>106564017
I don't really use linux, so it's a hassle to switch os every time I want to use bigger models at an acceptable speed.

If it's limited to a single gpu, there's not much difference between linux and windows.

Do you know where to get driver version 550.163.01 for windows 10? The oldest I have is 552.
Anonymous No.106564076 [Report]
>>106564046
You're asking me to read misaki's code, not yours. I won't do that.
What about setting USE_ESPEAK_DIRECTLY to True?
Anonymous No.106564086 [Report]
>>106563074
>The bill would require platforms to provide recurring alerts to users — every three hours for minors — reminding them that they are speaking to an AI chatbot, not a real person, and that they should take a break.
don't they know that those who want to commit suicide would rather trust ai chatbots than real human beings?
and how do they show these warnings to api users who use st?
Anonymous No.106564087 [Report] >>106564107
>>106564070
No idea. Haven't used windows for decades. And I don't run nvidia.
Anonymous No.106564094 [Report] >>106564140
https://huggingface.co/cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit
It's up.
Anonymous No.106564105 [Report] >>106564123 >>106564135
Qwen3-next is such an insane improvement. If the bigger versions can keep up we have our new local SOTA that shits on Deepseek, GLM and Kimi.
Anonymous No.106564107 [Report] >>106564154 >>106564411
>>106564087
Where should I go for more nvidia windows discussion? With both linux and windows on cuda 12.4, one driver version 550, the other 552, I'm still having the same issue. Is there any other thing to look out for? Llamacpp and koboldcpp on linux are easily twice the speed as they are on windows with multiple gpus.

Is it just the windows overhead, or is there something else I'm overlooking? Resizable bar maybe? I've enabled it in the bios, but gpu-z says it's disabled in windows.
Anonymous No.106564123 [Report]
>>106564105
Isn't qwen 3 just gptoss by chinks?
Anonymous No.106564135 [Report] >>106566778
>>106564105
if qwen 235b manages to be even slightly better, that would be a nice win. Right now glm air beats it imo and that shouldn't be the case.
Anonymous No.106564136 [Report]
>>106563664
I only have 6+32 it's over for me
Anonymous No.106564140 [Report]
>>106564094
No goof no boof
Anonymous No.106564154 [Report]
>>106564107
https://www.techpowerup.com/download/nvidia-geforce-graphics-drivers/
Little box on the left, scroll down, click on show older or whatever.
The rest is on you.
Anonymous No.106564203 [Report] >>106564224 >>106564229 >>106564230 >>106564323 >>106564409 >>106564560 >>106564607
Instead of building massive LLMs that occupy 500B-1000B, companies should focus on smaller models, like 12B. They could divide models by topic, for example:
Nemo-erp.gguf, Nemo-rp.gguf, nemo-textadventure.gguf, nemo-science.gguf, nemo-medical.gguf, etc.
This would allow a model to hold more information on a specific subject. If you want to know about a different topic, you just switch the model.
Anonymous No.106564224 [Report]
>>106564203
that is how nvidia currently envisions the future. I think they see LLMs as a dead end and are handing out homework. Whether anyone can turn that into a comprehensive product that works better than current tech is the big question.

https://research.nvidia.com/labs/lpr/slm-agents/
Anonymous No.106564229 [Report] >>106564273
>>106564203
Not The Entire Internet, we know, you're absolutely right and should get tons of grants from all the labs.
Anonymous No.106564230 [Report] >>106564244
>>106564203
just smush em all together and train a model to select the right one based on the context.
Anonymous No.106564244 [Report]
>>106564230
moebros...
Anonymous No.106564273 [Report] >>106564280
>>106564229
Commercial AI labs just want to make general-purpose models that win as many benchmarks as possible; open lab researchers just want to advance their career with as little controversy as possible. And, most RP or writing capabilities are not verifiable yet except by some very limited and/or narrow metrics.
Anonymous No.106564274 [Report]
>>106563909
my sports animes taught me that talent is in the ability to put the hard work in.
Anonymous No.106564280 [Report] >>106564291
>>106564273
>win as many benchmarks as possible
For as cheaply as possible too, if recent Qwenposting habits are anything to go by, we're well into the race to the bottom arc.
Anonymous No.106564291 [Report] >>106564300
>>106564280
ok but where are the fucking goofs?
Anonymous No.106564300 [Report]
>>106564291
GGUF is not economically viable for the shareholding please be of understand
Anonymous No.106564323 [Report]
>>106564203
Most of the topics you listed have far too much overlap with each other. If you are talking about coding and math reasoning, then that could make sense. I don't think that stuff is helpful for RP and creative writing.
Anonymous No.106564338 [Report] >>106564349 >>106564354 >>106564369 >>106564507 >>106564682 >>106565134
why did programmers hijack llms to this degree? people used them for porn long before llms could produce remotely viable code so it's just strange that they get the priority over this. llms would not be what they are right now without the support and exposure provided by the porn communities.
Anonymous No.106564349 [Report]
>>106564338
Big techanology pay more than groomers.
Anonymous No.106564354 [Report]
>>106564338
Because code is very corpo safe and makes investors think line will go uppies!
Anonymous No.106564359 [Report] >>106564365 >>106564366 >>106564386 >>106564545
>>106559371 (OP)
>>106559401
That's a guy
Anonymous No.106564365 [Report]
>>106564359
ok so are you making the goofs?
Anonymous No.106564366 [Report]
>>106564359
Yeah. Me.
Anonymous No.106564369 [Report] >>106564425
>>106564338
The abstract, almost lovecraftian higher order concept called "investor" seems to have something to do with it, spreading its slimy tendrils of influence everywhere.
Also, COOs salivate at the thought of being able to replace 1000 programmers with an AI subscription.
Anonymous No.106564386 [Report]
>>106564359
Yeah and also very minor coded if the tits aren't at least DD cup.
Anonymous No.106564398 [Report] >>106564698
https://github.com/ggml-org/llama.cpp/issues/15940
bros it's fucking BLEAK, only nocoders/vibecoders in the issue, no wonder goofs are nowhere to be seen
Anonymous No.106564409 [Report]
>>106564203
>Nemo-erp.gguf, Nemo-rp.gguf, nemo-textadventure.gguf
What world do you live in where this is something companies care enough about to put out specialized models and not something that they specifically train refusals for?
Also, these days, if they're spending money to train a 12B, they might as well train a 1000B-A12B.
llama.cpp CUDA dev !!yhbFjk57TDr No.106564411 [Report] >>106564493 >>106564784
>>106563691
Using 3x RTX 4090 and the llama.cpp defaults on Linux I get 86.47 t/s tg for 2048 tokens (should be comparable since the memory bandwidth is almost the same).
The options you posted should not be the problem.
Windows performance will always be gimped vs. Linux, I've never tested it with a multi GPU setup though.
One problem with Windows is that the overhead for each CUDA call is much higher.
CUDA graphs fix that to some extent because they allow the GPU to better schedule the kernels and hide the kernel launch overhead.
But the more GPUs you have the more synchronization overhead is needed to make sure the operations are executed in the correct order.

Make sure that driver option that automatically swaps VRAM to RAM is disabled.

>>106564107
Resizable bar should not be relevant.
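To rule out the flags entirely, it may be worth comparing a single GPU against the split modes with llama-bench on both OSes, something like (model path is yours; on Windows set the env var with set first):
>CUDA_VISIBLE_DEVICES=0 llama-bench -m model.gguf -ngl 99
>llama-bench -m model.gguf -ngl 99 -sm layer
>llama-bench -m model.gguf -ngl 99 -sm row
If the single-GPU numbers roughly match between Windows and Linux but only the split runs collapse on Windows, that points at the per-call/synchronization overhead rather than your configuration.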
Anonymous No.106564425 [Report] >>106564496
>>106564369
>Also, COOs salivate at the thought of being able to replace 1000 programmers with an AI subscription.
Just like SQL, VisualBasic, WYSIWYG, and offshoring were supposed to do right?
Anonymous No.106564493 [Report] >>106564517
>>106564411
This one right?
Anonymous No.106564496 [Report] >>106564565
>>106564425
nta. Tool consolidation makes employees easier to replace. The easier they are to replace, the cheaper they become. You no longer have the sysadmin who knows the bespoke tool stack from top to bottom at your company, you have a docker user. Now you just need a cheaper docker user, and you're stuck with docker because going back seems like too much of an investment in time and skill.
The tools you mention replace expensive employees with cheaper ones. So yes.
Anonymous No.106564507 [Report]
>>106564338
If you want retarded VCs to think your company will be worth a gorillion dollars in the future you have to sell them the promise of self-improving AI.
And good results for mememarks/programming are simply the most efficient way to do that.
llama.cpp CUDA dev !!yhbFjk57TDr No.106564517 [Report] >>106564596
>>106564493
It's been like 2 years since the last time I booted into my Windows partition but that sounds like the right one.
Anonymous No.106564545 [Report]
>>106564359
The prompt includes "1girl", "female", and "dominant female" so you are objectively wrong.
Anonymous No.106564560 [Report] >>106564579 >>106564593 >>106564600
>>106564203
How often do I have to tell you that to do good RP or text adventure, you need all the world knowledge? These are not narrow domain tasks like "classify sentiment on this smart phone review".
Anonymous No.106564565 [Report] >>106564601
>>106564496
There's no way you can look at the llama.cpp PRs where prompt engineers try to do the work of actual engineers and tell me with a straight face that you think it is viable.
Anonymous No.106564579 [Report]
>>106564560
This is totally a different anon sir.
Anonymous No.106564593 [Report] >>106564746
>>106564560
"suck my dick" and "tell me a story" are narrow domain tasks
Anonymous No.106564596 [Report]
>>106564517
It's been off. I tried with it on and off, and there wasn't much difference, all under 15 tokens/s.
Anonymous No.106564600 [Report] >>106564746
>>106564560
nta, I don't really disagree with your ideas, but it would still be nice if they didn't train on so much code and math and refusals. if they did just give a light rp tune to a pretrained base we would be light years ahead of what we have now even at the shitty data and parameter scales.
Anonymous No.106564601 [Report]
>>106564565
>tell me with a straight face that you think it is viable.
I didn't. I wouldn't. I'm telling you why VB and friends partially succeeded in their intended purpose. The higher-ups' intention with AI is exactly the same.
Anonymous No.106564607 [Report]
>>106564203
I think the current meta is to make small models by distilling larger ones
Anonymous No.106564617 [Report] >>106564637 >>106564680
ONCE AGAIN.
MATH AND CODE IS BENCHMAXXED BECAUSE RESULTS ARE OBJECTIVELY VERIFIABLE.
STORY TELLING AND RP ISN'T!
Anonymous No.106564637 [Report] >>106564779
>>106564617
>RESULTS ARE OBJECTIVELY VERIFIABLE
are they really when you're cheating though
Anonymous No.106564680 [Report] >>106564741 >>106564779
>>106564617
Take those trillions of tokens of code and math off, replace with real conversations (of any/all kinds) and literature; finetune on top of that. It probably won't win benchmarks but it will likely have much more situational and conversational awareness than anything released so far.
Anonymous No.106564682 [Report] >>106564715
>>106564338
coders found a new way to do nothing while looking busy
Anonymous No.106564698 [Report]
>>106564398
i'd like to make a difference but unfortunately you can probably guess which group i belong to
Anonymous No.106564715 [Report]
>>106564682
Anonymous No.106564741 [Report] >>106565071
>>106564680
>real conversations
sir that's illegal >>106563109
>The bill specifically aims to prevent companion chatbots — which the legislation defines as AI systems that provide adaptive, human-like responses and are capable of meeting a user’s social needs
Anonymous No.106564746 [Report]
>>106564593
The former, yes. The latter, no. It's basically the most open domain you can imagine.

>>106564600
>it would still be nice if they didn't train on so much code and math and refusals
I don't disagree with this, especially refusals. A bit less focus on code and math and more on popculture would be cool too.
Anonymous No.106564779 [Report] >>106564789 >>106564799 >>106564812 >>106564837
>>106564637
Doesn't matter. They know the reply they want from the model. If it's what they expect, it's good. Number go up.
You cannot do that with storytelling or RP. You can show a bit of storytelling to two anons and they'll have different opinions about it.
>>106564680
You can verify math and code, so they benchmaxx on math and code. Whether it overrides convos or not is neither here nor there. We don't have a good way to measure storytelling and RP. Do you not understand?
Anonymous No.106564784 [Report] >>106564822 >>106564834 >>106565629
>>106564411
And this is what it's like on linux (debian 13, driver version 550). I don't know how to screenshot.

15 tk/s vs 60 tk/s...

I have no idea what's wrong with my windows.

Are there any other anons with 3x 3090s on a zen 2 cpu with 3200 mhz ram, x16 gen 4 to all gpus, windows 10 iot ltsc 21h2? What kind of performance vs linux?
Anonymous No.106564789 [Report]
>>106564779
You are absolutely right-- very sharp thinking.
Anonymous No.106564799 [Report]
>>106564779
This really gets to the heart of modern LLM datasets.
Anonymous No.106564812 [Report] >>106564817
>>106564779
EQ-bench thoughbeltever
Anonymous No.106564817 [Report]
>>106564812
no
Anonymous No.106564822 [Report] >>106564832
>>106564784
microsoft, not even once
Anonymous No.106564826 [Report]
Any interesting gooner releases lately?
Anonymous No.106564832 [Report]
>>106564822
Well, could just be something wrong with my configs, cuda dev's windows doesn't have that problem.
Anonymous No.106564834 [Report] >>106564871
>>106564784
Windows-
>power settings via control panel:
>minimum processor state: 100%
>pci express:
>link state power management: off
>nvidia settings
>power management mode: pref. max. performance
Also check that your ram is running at max speed, but that would be a bios issue, not an OS one...
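If you'd rather set the power plan from an admin prompt (alias from memory, double-check it):
>powercfg /getactivescheme
>powercfg /setactive SCHEME_MIN
SCHEME_MIN should be the High performance plan.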
Anonymous No.106564837 [Report] >>106564874
>>106564779
literally who cares. make the benchmarks for the math version and release a best-attempt rp writing tune. it would be basically free for them to do. it's just that the cult of safety would never allow it.
Anonymous No.106564871 [Report] >>106564906
>>106564834
>power plan
Yeah, that's one of the first things I do on a windows install.
Nvidia settings are maximum performance.
Both report 3200mhz inside the os.
Anonymous No.106564874 [Report]
>>106564837
Ok. You don't understand.
Anonymous No.106564881 [Report] >>106564896
chill down the spine
Anonymous No.106564896 [Report]
>>106564881
*Anti-Slop Backtracks you*
Anonymous No.106564906 [Report] >>106564923 >>106564936
>>106564871
Are you sure your torch installation isn't running cpu only? This is the biggest culprit.
Other than that, can't really help you.
Anonymous No.106564923 [Report] >>106564961
>>106564906
Oh wait you were talking about llama.cpp.
You might have actually downloaded the wrong binaries - are you sure you are using
>https://github.com/ggml-org/llama.cpp/releases/download/b6451/cudart-llama-bin-win-cuda-12.4-x64.zip
>https://github.com/ggml-org/llama.cpp/releases/download/b6451/llama-b6451-bin-win-cuda-12.4-x64.zip
Anonymous No.106564936 [Report] >>106565141
>>106564906
Wait we needed torch?
Anonymous No.106564961 [Report] >>106565105
>>106564923
Mine is llama-b6451-bin-win-cuda-12.4-x64. Other than the power management, what other factors might contribute to performance? Stable diffusion is fine... but that's not multi-gpu.
Anonymous No.106565061 [Report]
Hey, I just noticed that docker compose has a model: section. Has anyone tried that? How does it compare to running text gen webui?
Anonymous No.106565071 [Report] >>106565076
>>106564741
The actual bill doesn't outright ban companion chatbots or weights for them. It's for online platforms (Meta AI, Character AI, etc) providing companion chatbot services to minors.
Anonymous No.106565076 [Report] >>106565098
>>106565071
And you think they won't just do the same to the models they release, to be extra safe?
Anonymous No.106565098 [Report]
>>106565076
They probably will, but that's not what the bill is about. Online services can easily have extra moderation layers on top of unmoderated models like Character.AI has always done.
Anonymous No.106565105 [Report]
>>106564961
Chipset drivers are one thing but I don't know.
Anonymous No.106565134 [Report]
>>106564338
because of retarded shit like that california bill
Anonymous No.106565141 [Report]
>>106564936
Bro
Anonymous No.106565224 [Report] >>106565281 >>106565387 >>106565645
All three 3090s are pinned at 16GT/s PCIe Link Speed in hwinfo64 during generation, and gpu-z reports perfcap reason as idle. Gpu memory controller load is 10%.

Wait what? Nvidia-smi reports two 3090s at p3, and one at p2. Is this it?
Anonymous No.106565281 [Report] >>106565351
>>106565224
Eh, memory clocks are at 650mhz? They're 1187 if only a single gpu is used.
Anonymous No.106565351 [Report]
>>106565281
If I add the gpu driving my display into the mix, it becomes even lower, at 100mhz. Is it because they're different graphics cards? Even though they're 3090s, two are msi, and one is palit. Hmm, but even when I tested only the msi cards, they went down to 650mhz if splitting the model.
Anonymous No.106565387 [Report] >>106565434 >>106565452 >>106565453 >>106565477
>>106565224
if you are layer splitting then adding more processors will result in a lower average speed because they spend a greater percentage of time idling. play with a model you can run fully on a single card and then try splitting it. observe the happenings?
Anonymous No.106565434 [Report]
>>106565387
I did that with nemo ofc, same behaviour, nerfed by more than 200% when split. By row is even worse. I'm going to try vllm on windows tomorrow and see if it still exhibits the same behavior.
Anonymous No.106565452 [Report]
>>106565387
Oh, I misread that. It was something like a 20-30% performance difference between linux and windows with only one gpu the last time I tested.
Anonymous No.106565453 [Report]
>>106565387
that's not true at all, if you're running something halfway decent like a GB200 supercluster the performance hit is minimal.
Anonymous No.106565477 [Report] >>106565567
>>106565387
>idling
Is that why the memory clocks drop when it's split? Normal behavior? How do I check the memory clocks on linux?
Anonymous No.106565567 [Report] >>106565591
>>106565477
it's normal behavior. most people don't want their gpu burning 40 watts to move a cursor around a static display. they're supposed to dynamically clock based on load, but there's latency to the procedure. some anons have noted performance boosts from locking their cards' clocks or using a more aggressive power profile.
Anonymous No.106565591 [Report] >>106565625 >>106565975
>>106565567
Power profile? Do I need the geforce experience thing for that? With just the driver, I can set the power management to prefer maximum performance, but I don't know how to lock clocks or change power profiles.
Anonymous No.106565625 [Report] >>106565649
>>106565591
if you change your power profile in the Nvidia driver settings thing you need to reboot for it to take effect. you can use msi afterburner to lock the gpu clocks to any value, not sure about memory clock tho.
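nvidia-smi can lock both from an admin shell too, if your driver/card supports it; the values below are placeholders, pull the real ones from the supported-clocks query:
>nvidia-smi -q -d SUPPORTED_CLOCKS
>nvidia-smi -lgc <min>,<max>
>nvidia-smi -lmc <min>,<max>
>nvidia-smi -rgc (and -rmc) to reset
-lmc needs a reasonably recent driver.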
Anonymous No.106565629 [Report] >>106565866 >>106566288
>>106564784
bro just use WSL
Anonymous No.106565645 [Report] >>106565658
>>106565224
install linux and lock memory clock to maximum
options nvidia NVreg_RegistryDwords="PowerMizerEnable=0x1; PowerMizerDefaultAC=0x1; PowerMizerLevelAC=0x1; PowerMizerLevel=0x1; PerfLevelSrc=0x2222"
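(that line goes in a file under /etc/modprobe.d/, e.g. a hypothetical nvidia-pm.conf; on debian you may also need update-initramfs -u and a reboot before it takes effect)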
Anonymous No.106565649 [Report]
>>106565625
>if you change your power profile in the Nvidia driver settings thing you need to reboot for it to take effect

If it's just in the driver settings, then that's done; the power plans for cpu and gpu are among the first things I set when I install windows.
Anonymous No.106565658 [Report] >>106565664
>>106565645
Does that carry over to windows after a reboot?
Anonymous No.106565664 [Report]
>>106565658
no
Anonymous No.106565782 [Report] >>106565806 >>106565831
>>106561205
nvidia didn't invent GPUs, what the fuck?
lying scumbags
Anonymous No.106565806 [Report] >>106565813 >>106565886
>>106565782
Anonymous No.106565813 [Report]
>>106565806
kek
Anonymous No.106565831 [Report]
>>106565782
Next they will buy intel and invent the CPU as well.
Anonymous No.106565866 [Report]
>>106565629
Would that actually help?
Anonymous No.106565886 [Report]
>>106565806
okay but when is nvidia omegaverse
Anonymous No.106565975 [Report]
>>106565591
On windows you can use evga precision x1 to lock the cards at their max boost clock. Tripled my tokens per second.
Anonymous No.106566096 [Report]
>>106561599
literally me
Anonymous No.106566103 [Report] >>106566151
>>106563343
i would very much like to use that and i appreciate you hosting it for anons, but i dont really want to use it
i dont want to connect to an api, but thank you anyway anon
Anonymous No.106566151 [Report] >>106566297
>>106566103
>i dont want to connect to an api
What kind of lmg anon doesn't want to connect to an api not hosted on their own hardware?
Anonymous No.106566288 [Report]
>>106565629
already on esl thanks
Anonymous No.106566297 [Report]
>>106566151
Are you okay?
Anonymous No.106566428 [Report]
Threadly reminder that Fiction LiveBench is the best benchmark to ever grace the field
Anonymous No.106566610 [Report]
>>106562972
I am, iq4_xs at 32k context. It's not great and tends to be very repetitive. Swiping more or less gives the same response every time.
Anonymous No.106566647 [Report] >>106566696
So, having only used koboldcpp, oobabooga, and llama-server, which all came with frontends, what's a good one for vllm? I tried open webui, but it instantly asked for my email on startup (why?)
Anonymous No.106566696 [Report]
>>106566647
>but it instantly asked for my email on startup
It's an internal login system, so you can just type in anything as an email address as long as it has an @. There won't be any email sent unless you set something up for it. It's just to make it look more corpo I guess.
Anonymous No.106566778 [Report]
>>106564135
How/what are you using GLM Air (for)?
It has been pretty shit for me. Especially when talking about software architecture. I don't know about RP shit and I have been using an API (not local) so it's not even the quant probably.
While Qwen3 and Kimi are an absolute blast.
Anonymous No.106566854 [Report]
>>106566836
>>106566836
>>106566836
Anonymous No.106566918 [Report]
https://research.google/blog/vaultgemma-the-worlds-most-capable-differentially-private-llm/
https://services.google.com/fh/files/blogs/vaultgemma_tech_report.pdf
https://huggingface.co/google/vaultgemma-1b

The future of Google LLMs: models that know nothing about rare information. They use a huge batch size to mitigate memorization, among other things.

>What does this mean in practice? Informally speaking, because we provide protection at the sequence level, if information relating to any (potentially private) fact or inference occurs in a single sequence, then VaultGemma essentially does not know that fact: the response to any query will be statistically similar to the result from a model that never trained on the sequence in question. However, if many training sequences contain information relevant to a particular fact, then in general VaultGemma will be able to provide that information.
>
> [...] Sequence-level DP provably bounds the influence of any single training sequence (example) on the final model. We prompted the model with a 50-token prefix from a training document to see if it would generate the corresponding 50-token suffix. VaultGemma 1B shows no detectable memorization of its training data and successfully demonstrates the efficacy of DP training.
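A rough way to poke at the memorization claim yourself, not the paper's exact protocol, just the same idea at home; the doc string is a placeholder for a suspected training sequence of at least ~100 tokens:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/vaultgemma-1b"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

doc = "..."  # placeholder: paste a sequence you suspect was in the training data (>= ~100 tokens)
ids = tok(doc, return_tensors="pt").input_ids[0]
prefix, true_suffix = ids[:50], ids[50:100]

# greedy-decode 50 tokens from the 50-token prefix
out = model.generate(prefix.unsqueeze(0), max_new_tokens=50, do_sample=False)
gen_suffix = out[0, len(prefix):]

# compare the generated continuation against the real one, token by token
n = min(len(gen_suffix), len(true_suffix))
match = (gen_suffix[:n] == true_suffix[:n]).float().mean().item()
print(f"token match rate: {match:.2%}")  # near zero = no verbatim regurgitation

If a regular model regurgitates the suffix and VaultGemma doesn't, that's the DP training doing what they claim.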
Anonymous No.106567105 [Report]
gemma 4 soon