
Thread 106566836

238 posts 84 images /g/
Anonymous No.106566836 [Report] >>106567628 >>106570386
/lmg/ - Local Models General
/lmg/ - a general dedicated to the discussion and development of local language models.

The Raven Edition

Previous threads: >>106559371 & >>106551921

►News
>(09/11) Qwen3-Next-80B-A3B released: https://hf.co/collections/Qwen/qwen3-next-68c25fd6838e585db8eeea9d
>(09/11) ERNIE-4.5-21B-A3B-Thinking released: https://hf.co/baidu/ERNIE-4.5-21B-A3B-Thinking
>(09/09) K2 Think (no relation) 32B released: https://hf.co/LLM360/K2-Think
>(09/08) OneCAT-3B, unified multimodal decoder-only model released: https://onecat-ai.github.io
>(09/08) IndexTTS2 released: https://hf.co/IndexTeam/IndexTTS-2

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
Anonymous No.106566844 [Report]
►Recent Highlights from the Previous Thread: >>106559371

--Paper: ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms:
>106561145 >106561161
--Troubleshooting llama-server.exe performance with multi-GPU configurations:
>106563691 >106563763 >106563772 >106563861 >106563838 >106563879 >106563891 >106563919 >106563941 >106563960 >106564017 >106564070 >106564107 >106564154 >106564411 >106564784
--Qwen3-Next model efficiency and performance analysis:
>106560211 >106560245 >106560248 >106560269 >106560274 >106560283 >106560310 >106560291 >106560294 >106560322 >106560314 >106560338 >106560356 >106560302 >106563929
--Optimizing MoE models via selective tensor offloading:
>106559871 >106559925 >106559938 >106559943 >106559962 >106559979 >106559984 >106560000 >106560056
--Role of LLMs in TTS and image generation via autoregression and embeddings:
>106562827 >106562864 >106562981 >106563064
--Qwen3 model's verbose reasoning issues during roleplay testing:
>106561341 >106561358 >106561391
--Public server setup for Qwen3-Next-80B-A3B-Instruct with 65t/s performance:
>106563343
--TTS phonemizer bottlenecks and optimization:
>106562423 >106562450 >106562542 >106562586 >106562603 >106562493 >106563141 >106562482 >106562515 >106562543 >106562579 >106562608 >106562763 >106564046 >106564012 >106564024
--California bill to regulate AI companion chatbots:
>106563074 >106563109 >106563402 >106563820 >106563680 >106564086 >106563394
--OpenAI optimization techniques boost local transformer efficiency:
>106563608
--FTC probes major tech firms' AI chatbots for child safety risks:
>106562092
--Specialized small LLMs vs multi-purpose models:
>106564203 >106564224 >106564273 >106564280 >106564230 >106564323 >106564409 >106564560 >106564600 >106564607
--Miku (free space):
>106559401 >106562108 >106562161 >106562252

►Recent Highlight Posts from the Previous Thread: >>106559374

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
Anonymous No.106566876 [Report] >>106566903 >>106566972
Ramlets, how are we doing today?
Anonymous No.106566895 [Report]
Mikulove
Anonymous No.106566903 [Report] >>106566908 >>106566920
>>106566876
I compress my ram - gives approx. 2 times more memory.
Anonymous No.106566908 [Report]
>>106566903
I sparsify my ram - gives 2 times more speed.
Anonymous No.106566920 [Report]
>>106566903
I overclock my ram,
It costs twice as much because it breaks.
Anonymous No.106566923 [Report] >>106566952
>>106566778
If you're using it on API, why are you using Air instead of the big one?
Anonymous No.106566924 [Report]
https://www.downloadmoreram.com/
Anonymous No.106566936 [Report]
Back in the MS-DOS days, I actually had a driver that compressed my RAM. It broke some things though.
Anonymous No.106566944 [Report] >>106567707 >>106568337 >>106568344 >>106568931
https://research.google/blog/vaultgemma-the-worlds-most-capable-differentially-private-llm/
https://services.google.com/fh/files/blogs/vaultgemma_tech_report.pdf
https://huggingface.co/google/vaultgemma-1b

The future of Google LLMs: models that know nothing about rare information. They use huge batch sizes to mitigate memorization, among other things.

>What does this mean in practice? Informally speaking, because we provide protection at the sequence level, if information relating to any (potentially private) fact or inference occurs in a single sequence, then VaultGemma essentially does not know that fact: the response to any query will be statistically similar to the result from a model that never trained on the sequence in question. However, if many training sequences contain information relevant to a particular fact, then in general VaultGemma will be able to provide that information.
>
> [...] Sequence-level DP provably bounds the influence of any single training sequence (example) on the final model. We prompted the model with a 50-token prefix from a training document to see if it would generate the corresponding 50-token suffix. VaultGemma 1B shows no detectable memorization of its training data and successfully demonstrates the efficacy of DP training.
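The 50-token probe is easy to replicate at home, by the way. A rough sketch (greedy decoding, the min/max token counts, and the exact-token-overlap scoring are my assumptions; the tech report may do it differently):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "google/vaultgemma-1b"
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # a document you suspect was in the training set (hypothetical file)
    ids = tok(open("training_doc.txt").read(), return_tensors="pt").input_ids[0]
    prefix, suffix = ids[:50], ids[50:100]  # 50-token prefix, 50-token reference suffix

    out = model.generate(prefix.unsqueeze(0), max_new_tokens=50,
                         min_new_tokens=50, do_sample=False)
    gen = out[0, 50:]  # the 50 tokens generated after the prefix
    overlap = (gen == suffix).float().mean().item()
    print(f"verbatim suffix overlap: {overlap:.0%}")  # DP training should keep this near zero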
Anonymous No.106566952 [Report] >>106567795 >>106567977
>>106566923
Maybe he means he tried it, and found it lacking (in addition to the obese one).
Anonymous No.106566972 [Report] >>106567440 >>106567457
>>106566876
As a 24GB vramlet I just went back to gemma3 qat and it's still the goat for ST-style RP. The writing is always fresh and pleasant to read, with outstanding vocabulary. Fucked around with offloading glm4.5-air q3 and other drummer finetunes and they seemed broken and inconsistent in their responses. Google needs to release an 80b moe gemma 4
Anonymous No.106567050 [Report] >>106567073
Is this legit or over-hyped?
I find it hard to believe that just 32B is enough to match GPT4 and 200B checkpoints.
With just 32B you could run it locally on your PC and have a private GPT4... sounds too good to be true.
Anonymous No.106567073 [Report]
>>106567050
Hard to say, can't see the image any longer.
Anonymous No.106567118 [Report] >>106567127 >>106567150 >>106567193 >>106567806 >>106567898 >>106568369
Is this legit or over-hyped?
I find it hard to believe that just 32B is enough to match GPT4 and 200B checkpoints.
With just 32B you could run it locally on most PCs with a decent quantization that doesn't sacrifice much, and have a private GPT4 with no quota limitations... sounds too good to be true.
Anonymous No.106567127 [Report]
>>106567118
Sounds like a typical marketing department sales pitch.
Anonymous No.106567150 [Report]
>>106567118
>sounds too good to be true
Congratulations, you have a brain.
Anonymous No.106567193 [Report]
>>106567118
>reasoning
is and always has been a meme
Anonymous No.106567440 [Report]
>>106567105
>>106566972
Welcome to /lmg/ thread google stealth marketing engineer technician saars. Please kindly inform us if the next gemma will be as cucked as the last one so I can decide if I should post funny jeet pictures or not.
Anonymous No.106567457 [Report]
>>106566972
> glm4.5-air
It works well for me; right now it's the best model for vramlets, so you must be doing something wrong. This model is ultra sensitive to temperature at long context: once you get deep into the context, it's time to lower the temperature.
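Roughly this, as code (a sketch; the thresholds are just what works for me, not from any docs):

    def temp_for_context(n_tokens: int) -> float:
        # lower the temperature as the context fills up; tune the numbers yourself
        if n_tokens < 8192:
            return 0.9
        if n_tokens < 16384:
            return 0.7
        return 0.5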
Anonymous No.106567628 [Report] >>106567937 >>106568779
>>106566836 (OP)
quoth the raven.....
Anonymous No.106567662 [Report] >>106567715
>>106565629
It's been stuck for four hours...
Anonymous No.106567707 [Report] >>106568235
>>106566944
If this works the way it looks like it works, based on what you quoted and the image, then the model would theoretically be equivalent to stuff like Phi. Maybe a bit better. But ultimately it will have trouble with real-world user queries since it would lack OOD skills and knowledge. This technique can only create models for extremely narrow use cases, not general assistant models. So if they do it for all future models, Google would be shooting themselves in the foot and losing all the market share they just spent tons of effort clawing back.
Anonymous No.106567715 [Report]
>>106567662
It's probably for the better.
WSL is a janky half-measure.
If you want to run windows on your main gaming/home office PC but want linux for LLM stuff, just get a second system and run linux properly.
Anonymous No.106567795 [Report] >>106567814 >>106567977
>>106566952
GLM-chan is NOT obese!
Anonymous No.106567806 [Report] >>106568384
>>106567118
Both can be true

30b's are seeing massive improvements in abilities but that has to do with coding, physics, chemical equations etc. And who gives a fuck about that. It's a glorified wikipedia. Good for grunt work.

For stuff like writing and other more complex tasks, size is still king and may be for a long time. My LLM needs to know every dragon ball z character and understand that yes, piccolo <could> wipe his ass with goku's face if he wanted to. If you want nuance and complexity, simple trivia is not gonna do it for ya.
Anonymous No.106567814 [Report]
>>106567795
(stomp stomp stomp)
Anonymous No.106567898 [Report] >>106567970
>>106567118
I'll ignore all the rest of the shit in the post. What caught my attention was
>2000 tokens/s on cerebras
>most reasoning models crawl at 200 tok/sec
What the fuck does reasoning have to do with the token generation speed?
And why the fuck are you paying attention to that retard?
Anonymous No.106567937 [Report]
>>106567628
ggufs nevermore
Anonymous No.106567970 [Report]
>>106567898
Assuming that's total throughput over multiple concurrent requests, that guy has skill issues with the other models or cerebras is shit.
Anonymous No.106567977 [Report] >>106568544
>>106567795
>>106566952
This is just AIR-CHAN; GLM-CHAN won't even fit in the frame.
Anonymous No.106568235 [Report] >>106568661
>>106567707
They were already bragging about how memorization in Gemma 3 was lower than in Gemma 2, so I think that's the direction things are going.
Anonymous No.106568337 [Report] >>106568567
>>106566944
>adding noise to the model so it doesn't precisely reproduce input data
>reduces training stability, increases computation costs
>today’s private training methods produce models with utility comparable to that of non-private models from roughly 5 years ago, highlighting the important gap our work will help the community systematically close.

Okay, so adding noise to the model makes it significantly worse (as one would expect). They seem to think that's avoidable, but I don't know how.
Anonymous No.106568344 [Report]
>>106566944
>make your model retarded
>brag about it
Anonymous No.106568369 [Report]
>>106567118
I think punching above its weight and trading blows with gpt 4 started in 2024.
Anonymous No.106568374 [Report] >>106572145
macchads just cant stop winning
Anonymous No.106568384 [Report]
>>106567806
>piccolo <could> wipe his ass with goku's face if he wanted to
That is only at the start of dbz.
Anonymous No.106568398 [Report]
is next better than 235B for sex?
Anonymous No.106568414 [Report] >>106572145
macchads just cant stop winning/2
Anonymous No.106568426 [Report] >>106572145
macchads just cant stop winning/3
Anonymous No.106568544 [Report]
>>106567977
Fake news! GLM-chan is a slender young lady.
Anonymous No.106568567 [Report]
>>106568337
Differential Privacy is an area of research. Researchers work on it, promising "it'll be good soon", and push out papers. Everyone not working in that field ignores them, unless they need to bring it up for compliance, like "we're working on DP, don't worry".
Anonymous No.106568601 [Report]
Macfags stopped bragging about glm. It's the one thing they had. They will stop bragging about 80b-a3b soon enough. It's the one thing they'll have for a few days.
Then it's back to just sucking cock.
Anonymous No.106568645 [Report] >>106568680 >>106568796 >>106568907 >>106569780
Qwen Image Edit abuse
Anonymous No.106568659 [Report] >>106568674
anyone else testing qwen 80b right now? it feels scarily close to 235b, long context, barely hallucinates, incredibly fast. their new moe architecture must be getting close to the closed models' sota.

(testing the mlx version)
Anonymous No.106568661 [Report] >>106568688
>>106568235
That's weird since Gemma 3 still knows a ton of trivia compared to other similarly sized models. If it truly didn't memorize things, then it should be worse than even Qwen. Also that graph is weird. How does 4B have the exact same memorization rate as 9B and 27B?
Anonymous No.106568674 [Report] >>106568697
>>106568659
It's honestly been garbage for me; outputs seem comparable to, or even a little worse than, 30B A3B
Anonymous No.106568680 [Report] >>106568690 >>106568700 >>106569014
>>106568645
>needs a lora to do it
Local nano banana when?
Anonymous No.106568688 [Report] >>106568747
>>106568661
you don't need or want the model to memorize single examples verbatim. it's supposed to generalize. if the information is presented many times it will get integrated, just not from a single example. it's just to prevent personal information from getting memorized.
Anonymous No.106568690 [Report]
>>106568680
If you praise china hard enough, two more weeks.
If you dont 4 more months
Anonymous No.106568697 [Report]
>>106568674
seethe ggoof
Anonymous No.106568700 [Report] >>106568713 >>106568743
>>106568680
>need a lora
wrong mentality, reframe: you can train the model to do new things if desired.
Also it can already do this without one; the lora just enhances the result's faithfulness (fabric texture on the fumo, painting style).
Anonymous No.106568713 [Report] >>106568718
>>106568700
But I don't want to browse, search for, and maintain a library of loras like I did during the SD days. Not again...
Anonymous No.106568718 [Report]
>>106568713
Yeah me neither
But I also understand that no model, ever, will be able to do enough for what I want to see.
Being able to teach a model new stuff is important to me.
Granted this isn't super complex stuff and I can understand why you'd want it out of the box.
Anonymous No.106568743 [Report] >>106568753
>>106568700
Let me guess you need 100gb of vram to train a Qwen lora?
Anonymous No.106568747 [Report] >>106568803 >>106568808
>>106568688
So is it a bad thing then? Won't the model have more room to learn things when it's not spending time memorizing everything word for word? I mean, could this explain why Gemma seems to know so much for its parameter size?
Anonymous No.106568753 [Report]
>>106568743
yeah it's a dire situation there, I had to use prosumer hardware.
still, the results were fantastic on even a tiny dataset; the model's understanding is clever even if the base results are dog.
Anonymous No.106568779 [Report]
>>106567628
2 weeks more
Anonymous No.106568789 [Report] >>106572155
/lmg/ still struggling to cope with the fact that apple won local
Anonymous No.106568796 [Report] >>106568925
>>106568645
How well does that lora work with fancy outfits?
Anonymous No.106568803 [Report]
>>106568747
I think it is a valid idea. it shouldn't hurt models at the trillions-of-training-tokens scale. anything important will be seen multiple times from multiple examples. it just won't be able to reproduce some random person's medical information or ssn.
Anonymous No.106568808 [Report]
>>106568747
They possibly have very good/long post-training and include general trivia there.
Anonymous No.106568821 [Report] >>106568866
my toaster is already creeping up on the 6 year mark
Planning to buy my next machine this black friday, will prolly get an rtx 5090, a ryzen 9950X3D cpu and 256gb of ddr5 memory. what are some of the more memory-hungry models I should try then?
Anonymous No.106568863 [Report] >>106568879
What are some must-have extensions for SillyTavern?
Anonymous No.106568866 [Report]
>>106568821
Qwen3-next
Anonymous No.106568875 [Report] >>106568897 >>106568949
why is training a model to make it specialized so fucking hard?
Anonymous No.106568879 [Report] >>106568892
>>106568863
a girlfriend
Anonymous No.106568892 [Report] >>106568898
>>106568879
I already have one. She doesn't like AI or me using it but she respects my drive to learn and be skilled with computers.
Anonymous No.106568897 [Report]
>>106568875
it's easier now than it was 10 years ago.
Anonymous No.106568898 [Report]
>>106568892
does she know you are having sex with it?
Anonymous No.106568907 [Report] >>106569204
>>106568645
>spyware UI
any other options?
Anonymous No.106568916 [Report] >>106568923
qwen goofs???????????????
Anonymous No.106568923 [Report] >>106568936
>>106568916
https://huggingface.co/mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit
Anonymous No.106568925 [Report] >>106569285 >>106569780
>>106568796
dunno, you can crank an arbitrary resolution though
Anonymous No.106568931 [Report]
>>106566944
safetymaxxing niggers
Anonymous No.106568936 [Report] >>106568962
>>106568923
>mlx
I said goofs nigger, im not downloading lm studio or vllm for the awq variant
Anonymous No.106568947 [Report]
can I run glm4.5v in llama.cpp? I can't find a gguf for it
Anonymous No.106568949 [Report]
>>106568875
If you're trying to teach it domain-specific information, it's pretty much impossible with a tiny LoRA, and/or without a learning rate huge enough to burn the information into the weights, which heavily damages previous knowledge.
Using summarized information might work better/faster than entire documents, but good luck training the model in a way that makes it truly understand the information rather than just parroting it (verbatim, even).
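For reference, the usual starting point looks something like this (a minimal peft sketch; "your-base-model" is a placeholder and the rank/alpha/module choices are illustrative guesses, not a recipe that fixes the parroting problem):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder id
    cfg = LoraConfig(
        r=64, lora_alpha=128,  # higher rank than a style tune, since the goal is new knowledge
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],  # attention + MLP; names vary per arch
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, cfg)
    model.print_trainable_parameters()  # sanity check before burning compute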
Anonymous No.106568962 [Report] >>106568998
>>106568936
wait 2 more weeks then faggot
Anonymous No.106568998 [Report]
>>106568962
Anonymous No.106569008 [Report] >>106569021 >>106569026 >>106569029 >>106569082
qwen next is officially goated
Anonymous No.106569014 [Report] >>106569044
>>106568680
judging by the google studio api outages the chinese are already working on it
Anonymous No.106569021 [Report]
>>106569008
what happens at 100%?
Anonymous No.106569026 [Report]
>>106569008
>80b
>worse than 30b coder
go fuck your herpes infested goat
Anonymous No.106569029 [Report] >>106569072
>>106569008
why is the thinking variant so much worse than the regular chat version?
Anonymous No.106569044 [Report]
>>106569014
But distilling never gets the performance of the original...
Anonymous No.106569072 [Report] >>106569104
>>106569029
it hasnt been trained on the secret benchmaxx
Anonymous No.106569075 [Report] >>106569111
yeah im starting to think its over
Anonymous No.106569082 [Report] >>106570295
>>106569008
>officially goated
>Lost to Qwen3-Coder-30B
Dense bros we can't stop winning
Anonymous No.106569104 [Report]
>>106569072
wouldn't that kind of defeat the point if they have to train it specifically for the benchmark? That's like one of the big flaws of test-driven programming, where you make your program fit your test rather than your actual problem.
Anonymous No.106569111 [Report]
>>106569075
GLM4.5 AIR BROS, WE CANT STOP WINNING!!!
Anonymous No.106569153 [Report]
>pull vllm
>follow the same steps I did last time that I wrote down for successfully compiling it, which I had to change and come up with new steps for as the ones previous stopped working at some point
>doesn't work anymore either, giving different errors
Sigh...
Anonymous No.106569204 [Report] >>106569281
>>106568907
>>spyware UI
Uhh what??
Anonymous No.106569224 [Report]
so... has anyone modified an onahole to interact with an LLM yet?
Anonymous No.106569237 [Report]
Groksirs when is we getting supports?
https://github.com/ggml-org/llama.cpp/pull/15539
@CUDAdev sir please do the needful Vishnu bless yo
Anonymous No.106569268 [Report] >>106569331 >>106569335 >>106569349 >>106569356 >>106569357
>vllm supports gguf guys!!!
>try loading a gguf
>it just errors out
My ass it's supported. Now I'm downloading some safetensors to try again and see if it's a model issue or my build is just fucked for some reason.
Anonymous No.106569281 [Report] >>106571956
>>106569204
sends your data to jewgle on startup. the API nodes and the electron build have it too, but those are optional. the manager phones home
Anonymous No.106569285 [Report]
>>106568925
>/gig/ on /lmg/
Weird collab
Anonymous No.106569299 [Report] >>106569319
https://www.trendforce.com/news/2025/09/11/news-kioxia-reportedly-eyes-2027-launch-for-nvidia-partnered-ai-ssds-with-100x-speed-boost/
>Kioxia, in partnership with NVIDIA, is developing next-generation SSDs aimed at AI servers, targeting commercialization by 2027 with read speeds nearly 100 times faster than current models, reaching up to 200 million IOPS using PCIe 7.0 technology. These SSDs, designed to partially replace HBM as GPU memory expanders, reflect the growing AI-driven demand in storage, with projections indicating that AI-related NAND will comprise 34% of the global NAND market by 2029, adding $29 billion in total addressable market (TAM). As a result, a U.S. investment firm warns of a potential NAND shortage starting in 2026, exacerbated by increased adoption of QLC eSSDs, Nearline SSDs, and high-bandwidth flash in response to tightening HDD supplies and AI infrastructure needs.
SSDmaxxers, 2027 will be your year!
Anonymous No.106569319 [Report]
>>106569299
>Kioxia, in partnership with NVIDIA
doa
Anonymous No.106569331 [Report] >>106569357
>>106569268
When I tried it, I couldn't get a single moe gguf to load. I was expecting it to be slow and unoptimized, but it didn't even load.
Anonymous No.106569335 [Report] >>106569357
>>106569268
just use llama for goofs bro
Anonymous No.106569349 [Report] >>106569357
>>106569268
Support for gguf on vllm is ass anyway
Anonymous No.106569356 [Report]
>>106569268
>he expects corposlop pyshit to "just work" without suffering through dependency hell
Anonymous No.106569357 [Report] >>106569385 >>106569594
>>106569268
Ok I just tried loading a small safetensors model and it also failed. Searching the error on the github issues gives 0 hits.
Wtf is wrong with vllm man.
I suppose GPU would probably work fine since I can just download the prebuilt wheels, but the CPU build is in a bad state.

>>106569331
Thanks.
I think CPU AND goof support just simply cannot be expected to be stable on vllm. Let alone GPU + CPU inference which isn't currently supported.

>>106569335
>>106569349
It seems even safetensors don't work on my build kek. They don't truly "support" goofs or CPU either.
Anonymous No.106569367 [Report] >>106569391 >>106569402
you will give me ze best GERMAN low latency TTS or STS model right now!
I'm tired of my models turning into drooling retards when trying to pronounce 25 aka FÜNFUNDZWANZIG!
>fünfffffuuffuhffffzwwffggg 'hick!
Don't make me use jewgle voices...(fuck elevenlabs, they aren't even that good).
Anonymous No.106569379 [Report] >>106569407
Did Drummer finally troon out?
Anonymous No.106569385 [Report] >>106569553
>>106569357
Their github is practically useless; it seems like all support happens through their discord.
What error did you get? Try using the V0 engine. They rushed getting V1 out and making it the default while it was still a broken pile of shit missing more than half the features of V0.
Anonymous No.106569391 [Report] >>106569520
>>106569367
Just directly pass your ipa transcription?
Anonymous No.106569402 [Report]
>>106569367
VibeVoice
>low latency
oh...
Anonymous No.106569407 [Report] >>106569413
>>106569379
What's with this tech to troon pipeline?
Anonymous No.106569413 [Report] >>106569471
>>106569407
>What's with this tech to troon pipeline?
The terminally online mostly fit into two groups: the mentally ill and assholes. If you are weak to praise and groupthink, the former is where you will stay. If you just want to solve problems, you are going to argue, try things out, fix it, come back, and then call everyone a dumbass.
Anonymous No.106569441 [Report] >>106569488
How do you guys usually write character prompts? Do you just write a paragraph describing them, or something more structured?
Anonymous No.106569471 [Report] >>106569492
>>106569413
your logic is a self report
anyway, maybe focus less on people, or do you keep your nose firmly buried up everyone's ass
Anonymous No.106569474 [Report] >>106569564
Since I'm an esl, I'd like to know if this heuristics-based prosody (kokoro) sounds acceptable to americans: https://vocaroo.com/17z5mdm2a0yU
The sample is complex on purpose so I can test a bunch of heuristics at once: "At 11:30 p.m. on Jan. 3, 2024, the project's lead (Dr. O'Neil) announced, "We'll ship v2.1 to the U.S. and EU by Q2," but a $1250.50 shortfall, a 5% processing fee, and three critical API regressions forced the team to triage, reassign tasks, and reconvene Monday—prioritizing user-facing fixes over backend refactors to preserve product quality."
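For context, the heuristics are just string rewrites applied before the text reaches the model. Two of them, simplified (the real set also covers dates, versions, percents, and so on):

    import re

    def normalize(text: str) -> str:
        # $1250.50 -> "1250 dollars and 50 cents" (naive; ignores thousands separators)
        text = re.sub(r"\$(\d+)\.(\d{2})",
                      lambda m: f"{m.group(1)} dollars and {m.group(2)} cents", text)
        # 11:30 -> "11 30" so the colon doesn't trip up the model
        text = re.sub(r"(\d{1,2}):(\d{2})", r"\1 \2", text)
        return text

    print(normalize("a $1250.50 shortfall at 11:30 p.m."))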
Anonymous No.106569488 [Report]
>>106569441
Everyone here just asks GPT5 to write it for them and improve it. Nobody uses local models for roleplay, GPT5 is the current meta.
Anonymous No.106569492 [Report] >>106569510 >>106569510
>>106569471
What isn't a self report? I'm writing my own opinion, what else am I supposed to write or think
Anonymous No.106569503 [Report] >>106569614
The hell kinda quant method are you supposed to use again? I've seen conflicting reports.
Anonymous No.106569510 [Report]
>>106569492
>>106569492
penis
Anonymous No.106569520 [Report] >>106569557
>>106569391
Local or private, show me a conversational STT>TTS method with decent enough latency. Best I found was some hugging face space from the transformers.js dev. But it was kinda meh and engerrish only, and it had no dialogue turn system or whatever that interruption mechanic is called. I really cba developing this as I'm more interested in the backend stuff. I'd just use the openAI realtime API for prototyping, but fuck me, those prices are surreal.
Anonymous No.106569553 [Report] >>106569630
>>106569385
>discord
Ugh.
The error is "TypeError: Cannot instantiate typing.Literal". I guess I could ask my llm about it to see if it possibly has any solutions.
How do I use the V0 engine? I tried the environment variable but it doesn't seem to do anything?
Anonymous No.106569557 [Report] >>106569617
>>106569520
Check this https://github.com/Open-LLM-VTuber/Open-LLM-VTuber it has a barge-in system which is the interruption mechanic you're looking for
Anonymous No.106569564 [Report]
>>106569474
>https://vocaroo.com/17z5mdm2a0yU
sounds fine
Anonymous No.106569594 [Report]
>>106569357
I was actually fucking with the cpu install just to see if I could give next a little test or two, but I could smell it becoming a migraine the minute I started running into weird dependency mismatches. I'd honestly rather wait the multiple weeks it'll take to get support in llamacpp, only to test it and go "yeah, it's pretty shit for writing" anyway. Shame, because small-active-parameter models are great for cheap context and being relatively fast off the bat. Even jamba, with more active parameters, is still pretty snappy if you put enough of it into vram, but sperganov has yet to fix the constant reprocessing on new messages for it, or for a couple of other models where whatever they coded causes the same thing.
Anonymous No.106569614 [Report] >>106569646
>>106569503
IQ2 is the new meta, really. You will not notice any difference even when using smaller models.
Anonymous No.106569617 [Report] >>106569672
>>106569557
yeah that looks promising, will give it a shot. thanks, pedo.
Anonymous No.106569630 [Report] >>106569666
>>106569553
>How do I use the V0 engine? I tried the environment variable but it doesn't seem to do anything?
Read the output on startup, it should tell you which engine is being used.
>TypeError: Cannot instantiate typing.Literal
See if there's any hints in the stack trace before this part. For me, the only success I had when vllm decided to throw errors was upgrading or downgrading (at the cost of model support) the vllm version. Using the V0 engine solved a lot of trouble for me, but once they hit v10, I gave up on messing with it.
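If memory serves, the knob is the VLLM_USE_V1 environment variable, e.g.

    VLLM_USE_V1=0 vllm serve your-model

That worked on the versions I tried; no promises it still does anything on v10+.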
Anonymous No.106569646 [Report]
>>106569614
He says, as he spins on his heel and then claims q8 kv cache is disastrous for models or something
Anonymous No.106569666 [Report]
>>106569630
Yeah I think I'll just stop here if my LLM doesn't solve it. Don't feel like trying out various versions.
And honestly I have a feeling the CPU performance is worse than Llama.cpp's anyway, but it'd be nice to actually confirm.
Anonymous No.106569672 [Report]
>>106569617
break a leg
Anonymous No.106569780 [Report]
>>106568925
>>106568645
damn, weren't qwen-image and qwen-image edit supposed to be slopped failures not worth running?
Anonymous No.106569817 [Report] >>106569822 >>106569839 >>106569846 >>106569869 >>106569923 >>106570369
Is there some kind of model that can act as a sort of twitch chat for when you're playing games by yourself? Like you give it a video feed and it reacts in some way. Just so that it's not so lonely.
Anonymous No.106569822 [Report] >>106569829
>>106569817
im gonna use this idea to become a billionaire
thanks
Anonymous No.106569829 [Report]
>>106569822
You'd be lucky to make lunch money. The only billionaire is the owner of whatever streaming platform you use.
Anonymous No.106569832 [Report] >>106569878
>finally found a model with little censoring and pretty comptetent logic
>leaks chatml all over the place half the time
sigh
Anonymous No.106569839 [Report]
>>106569817
Having used 30b+ models, it depends. If you start off in a prose setting, then ask it to interject with something like a chat/review section (e.g. a character reads chat mid-story), it will fuck it up. Off the bat with examples, maybe. As for giving an llm a video feed, I don't think that's feasible at the moment unless you have a shitload of vram or some kind of highly specialized, hand-written pipeline
Anonymous No.106569846 [Report]
>>106569817
>Just so that it's not so lonely
This is a general for pretend sex with computers and yet this post is one of the most pathetic things I've ever read
Anonymous No.106569856 [Report] >>106569868 >>106569874 >>106569888 >>106569905 >>106572139
Just became a 128gb ramGOD with 24gb vram. What's the best I can run now?
Anonymous No.106569868 [Report]
>>106569856
>ram
>What's the best I can run now
You mean crawl?
Anonymous No.106569869 [Report] >>106569876 >>106570004 >>106570036 >>106570343 >>106570369
>>106569817
for the millionth time, no we cant build screenreading twitch yet, no we dont know how neuro does it and it cannot be done locally for any sane price
Anonymous No.106569874 [Report]
>>106569856
Probably glm 4.5 at iq3, with a whopping 9 t/s on empty context
Anonymous No.106569876 [Report]
>>106569869
Sorry...
Anonymous No.106569878 [Report]
>>106569832
What model and how does it "leak" "chatml" "all" "over" "the" "place"?
Anonymous No.106569888 [Report]
>>106569856
qwen235b q4
Anonymous No.106569905 [Report] >>106569920
>>106569856
>128gb ramGOD
sounds like you are a channellet with 2 channels at most
Anonymous No.106569920 [Report]
>>106569905
Times are changing, old man. I could barely fit a llama2 13b at 4k context, and now I can run 100b moes and 32b dense models with a bare minimum of 16k context, yet I haven't bothered buying new hardware
Anonymous No.106569923 [Report] >>106569955
>>106569817
Could you get a primitive version by sending screenshots through a multimodal model?
Anonymous No.106569955 [Report]
>>106569923
If you don't mind minute+ long latency.
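The primitive loop really is just this, for what it's worth (a sketch against a llama.cpp server running a vision model; the port, prompt, and 10-second cadence are placeholders):

    import base64, time, requests
    from mss import mss
    from mss.tools import to_png

    API = "http://localhost:8080/v1/chat/completions"  # llama-server with a multimodal model loaded

    def one_reaction() -> str:
        with mss() as sct:
            shot = sct.grab(sct.monitors[1])  # primary monitor
            png = to_png(shot.rgb, shot.size)
        b64 = base64.b64encode(png).decode()
        r = requests.post(API, json={
            "messages": [{"role": "user", "content": [
                {"type": "text", "text": "You are twitch chat. React to this frame in one short message."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]}],
            "max_tokens": 60,
        }, timeout=120)
        return r.json()["choices"][0]["message"]["content"]

    while True:
        print(one_reaction())
        time.sleep(10)  # on weak hardware, generation itself will dominate the wait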
Anonymous No.106570004 [Report] >>106570013 >>106570016
>>106569869
>no we dont know how neuro does it
It's still hilarious that some random guy built a better utilization for AI than trillions in VC cash between every major corporation
Anonymous No.106570013 [Report] >>106570017 >>106570027
>>106570004
Did you forget Ani exists?
Anonymous No.106570016 [Report]
>>106570004
>It's still hilarious that some random guy built a better utilization for AI than trillions in VC cash between every major corporation
Not really. If you dig into anything you will realize it's a very small group of people actually doing anything at all; sometimes it's just one hyperfocused dude who does nothing but that for years because of autism.
Anonymous No.106570017 [Report]
>>106570013
I wish I could
Anonymous No.106570027 [Report]
>>106570013
Someone post the mouth animations
Anonymous No.106570036 [Report] >>106570048
>>106569869
Uhh, techlet?
>stream has a few minutes long delay (this is what most parasites do normally even)
>selected twitch chat entries are redirected to llm
It's not rocket science.
He wrote a backend that controls the character and the llm and integrates them together, but I can assure you I could make a demo if I had more interest.
Anonymous No.106570048 [Report] >>106570094 >>106570114
>>106570036
>I can assure I could make a demo if I had more interest.
That means you can't, and no one else has cracked it as well and made it available.
Anonymous No.106570054 [Report] >>106570278
>I could
lol
Anonymous No.106570087 [Report]
The new fiscal quarter starts in October. As usual, this will be when companies start pushing out new models to look good.
Two more weeks and the big releases start.
Anonymous No.106570094 [Report] >>106570100
>>106570048
>no one else has cracked it as good and made it available.
What is the incentive to put in that much work just to make it available for free because you want it? Even if I put in that much effort, I would just make a Neuro knockoff and try to make money off it.
Anonymous No.106570100 [Report]
>>106570094
Okay, that's fair, but still, if you can clone it and make money, why not? How come none of the "I made my own neuro" projects come close to his?
Anonymous No.106570107 [Report]
My implementation is cool she's just on the Canadian Github
Anonymous No.106570114 [Report]
>>106570048
You are just too stupid and/or underage, even. Jesus christ, these ERPers shouldn't even be allowed to post in this thread.
Anonymous No.106570266 [Report]
I just did a test of GPU performance with vllm and llama.cpp. With Qwen 4B, original BF16 safetensors, I got around 71 t/s on vllm with empty context, and 45 t/s on llama.cpp with a BF16 GGUF. At 8k context, Llama.cpp got 44 t/s, and vllm got 60 t/s. I also tried an F16 GGUF and it was like 2 t/s faster. These results suggest that at least on my system with Qwen 4B on full precision, there is something bottlenecking Llama.cpp. Maybe it'd be closer with a larger parameter model, but I don't have the VRAM to hold the full precision weights.
Anonymous No.106570278 [Report]
>>106570054
Problem nigger?
Anonymous No.106570295 [Report] >>106572734
>>106569082
So, a general instruct model lost to a model that was specialized for coding at coding, and that's supposed to be a mark against the general instruct model?
Anonymous No.106570343 [Report] >>106570369
>>106569869
Nah you coomers are braindead. There are bazillions of projects like these on github https://github.com/kimjammer/Neuro
Anonymous No.106570369 [Report] >>106570480 >>106570488
>>106569817
>>106569869
>>106570343
the one that can play games with ai:
https://github.com/moeru-ai/airi
Anonymous No.106570386 [Report] >>106572714
>>106566836 (OP)
Question about the -ts flag in llama.cpp: when picking a split, should I account for the fact that my main (first) GPU already has VRAM in use from other programs and windows, or does llama.cpp take that into consideration and balance it properly? Are there any other flags that will just split the VRAM evenly between two cards without having to tune numbers with -ts? I find myself using odd -ts combos to get an almost even VRAM split, and I don't know why.

For example, currently -ts 24,15 splits it almost evenly between my cards, which makes no sense to me considering my 1st card is using VRAM for other programs and windows. I just don't like having to reload the model over and over, trying different numbers until I find a combo that splits it properly.
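For reference, the kind of invocation I mean (model path elided):

    llama-server -m model.gguf -ngl 99 -ts 24,15 -mg 0

where -mg is the main-GPU flag, if I'm reading the help text right.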
Anonymous No.106570396 [Report] >>106570405
What if my computer is over a decade old with no vram? Are local llms the hobby for me?
Anonymous No.106570405 [Report]
>>106570396
Anonymous No.106570480 [Report] >>106570646 >>106570647
>>106570369
Wait, so it is possible? Why were anons being mean to me? Are they trying to keep this tech all to themselves?
Anonymous No.106570488 [Report] >>106570796
>>106570369
>ElevenLabs voice synthesis
Anonymous No.106570646 [Report]
>>106570480
You talked to clueless retards. Very few here know more than edging to chub cards
Anonymous No.106570647 [Report]
>>106570480
they're all tsun with little dere here
Anonymous No.106570796 [Report] >>106570807
>>106570488
It's the best and will continue to be the best
Anonymous No.106570807 [Report]
>>106570796
China will rip off vibe voice and make it better.
I believe
Anonymous No.106570867 [Report] >>106570892 >>106570901 >>106572018 >>106572287 >>106572459
Finally a model that passes this test and it's only 1.7B and open sourced. wild
Anonymous No.106570892 [Report] >>106570902 >>106571715
>>106570867
Didn't it get the 8th character wrong?
Anonymous No.106570901 [Report] >>106570943
>>106570867
the model alone or the whole stack?
Anonymous No.106570902 [Report]
>>106570892
well fuck, guess there's always next model
Anonymous No.106570943 [Report]
>>106570901
just the model
Anonymous No.106570964 [Report] >>106571077
What is a good model for being negative and critical? I hate how they agree with everything. I want to be told I'm wrong or being an idiot.
Anonymous No.106571070 [Report] >>106571360
For those of you who use the models for anything other than porn, what is the best way to let the model browse the web to search for info?
In my opinion the difference nowadays between proprietary models and local is mostly in the tooling integration rather than the actual models.
Anonymous No.106571077 [Report] >>106571090 >>106571105
>>106570964
Kimi k2 is the GOAT
Anonymous No.106571090 [Report] >>106571094 >>106571095 >>106571099
>>106571077
>Kimi k2
Is kimi k2 local? can you run it?
Anonymous No.106571094 [Report] >>106571102
>>106571090
No but I understand that one or two people here can run it :)
nta btw
Anonymous No.106571095 [Report]
>>106571090
yes
Anonymous No.106571099 [Report]
>>106571090
It's 1T/30A
Anonymous No.106571102 [Report]
>>106571094
>one or two people here can run it :)
I wish i was one of them.
Anonymous No.106571105 [Report] >>106571123
>>106571077
Can it still talk about medical or mental stuff or does it just shut down?
Anonymous No.106571123 [Report]
>>106571105
post your full medical history and social security number and i'll ask my buddy kimi
Anonymous No.106571216 [Report] >>106571237 >>106571243
any kokoro voice recs?
https://voca.ro/1jAMPLyV0zJA
Anonymous No.106571237 [Report]
>>106571216
Bateman is always good.
https://files.catbox.moe/bwv1fc.mp3
Anonymous No.106571243 [Report] >>106571266 >>106571337
>>106571216
Can it do Japanese sex, moaning, and blowjob noises?
If no, it's worthless
Anonymous No.106571266 [Report]
>>106571243
VibeVoice can, but no api support yet
Anonymous No.106571337 [Report] >>106571379
>>106571243
>braindead coomer
Anonymous No.106571347 [Report] >>106571466
There aren't any coomers here. We are all using this technology safely, to enhance our lives and work abilities.
Anonymous No.106571360 [Report]
>>106571070
>what is the best way to let the model browse the web to search for info?
MCP
Anonymous No.106571379 [Report]
>>106571337
gooners are the reason AI has advanced so much
a 4chan holo gooner invented chain of thought
Anonymous No.106571466 [Report] >>106571553
>>106571347
>There arent any coomers here
Sorry I was offline for a bit, I'm back now.
Anonymous No.106571553 [Report] >>106571961
>>106571466
show me your coom
Anonymous No.106571656 [Report]
KH music came on in my playlist and I remembered the lyrics poster :)
Anonymous No.106571715 [Report]
>>106570892
Come on now, let the man rest
Anonymous No.106571835 [Report] >>106571849 >>106571876 >>106571916
I made a Miku for you guys. Feel free to use her at your leisure.
Anonymous No.106571836 [Report] >>106571852 >>106571854
Can you guys give me a list of safetymaxxing companies so I know to ignore their model releases?
Anonymous No.106571849 [Report] >>106571876
>>106571835
Anonymous No.106571852 [Report]
>>106571836
Pretty much everyone except Mistral and the Chinese.
Anonymous No.106571853 [Report] >>106571876
Textless, exploitable version.
Anonymous No.106571854 [Report]
>>106571836
All of them
Anonymous No.106571856 [Report] >>106571876
Exploitable transparency version.
Enjoy your images of official /lmg/ mascot Hatsune Miku!
Anonymous No.106571876 [Report] >>106571885
>>106571835
>>106571849
>>106571853
>>106571856
fuck off, spammer.
Anonymous No.106571883 [Report]
stay, cute normal poster
Anonymous No.106571885 [Report]
>>106571876
I'm sorry for contributing OC. Really, I am.
I'll go back to enjoying my chat with official /lmg/ roleplaying model Rocinante 1.1, the best roleplaying model made by the best finetuner, TheDrummer!
Anonymous No.106571916 [Report] >>106571940
>>106571835
Cute migu
Anonymous No.106571940 [Report]
>>106571916
Yeah I'm happy with how that artist tag blend turned out.
The key is Namori. The Namori tag makes anything cute.
Anonymous No.106571956 [Report]
>>106569281
fork it and edit out the homing beacon
Anonymous No.106571961 [Report]
>>106571553
Anonymous No.106572018 [Report] >>106572081
>>106570867
wasn't the mokuro manga thing able to do this already?
>https://github.com/kha-white/mokuro
Anonymous No.106572081 [Report]
>>106572018
where are you supposed to get the high quality raws for this though
Anonymous No.106572139 [Report] >>106572166
>>106569856
>24
Anonymous No.106572145 [Report]
>>106568374
>>106568414
>>106568426
You do know transformers still exists, has all the model support, and is where everything goes first, right? Most of the internet only mentions GGUF because people don't want to waste space downloading the raw model, and they use AWQ for non-fine-grained 4/8-bit inference because most people don't overspend on compute and are running <1-2k USD builds for these models.
Anonymous No.106572155 [Report]
>>106568789
Apple didn't win jack shit when it is slower per dollar than AMD's Strix Halo and harder to use overall for anything <=128 GB of RAM. Maybe their matmul implementation in the A19/M5 is worth a shit, but I am leaning towards no unless proven otherwise, given how shit Apple is at AI.
Anonymous No.106572165 [Report]
w y w a
y d h m s
p
o b
d g
Anonymous No.106572166 [Report]
>>106572139
b-but I make up for it with my prompting...
Anonymous No.106572171 [Report]
my prompts turn 30B models into 500B behemoths
Anonymous No.106572287 [Report]
>>106570867
Haha, nice to see my image still floating around.
The 8th character, like other people said, and also the KA hiragana towards the end.
Damn, 2 years and they all still struggle.
In 2023 I thought we would have a local gaming buddy by now, one I could have in the background translating games with an overlay.

At least drummer finetunes are good enough for lunatranslator. That works pretty well most of the time.
I remember the old ATLAS translations back in the 00s. kek
Anonymous No.106572459 [Report] >>106572468
>>106570867
It failed though. There is one big error and three small ones.
Anonymous No.106572468 [Report]
>>106572459
You're absolutely right anon! It really is a testament to your genius to point this out!
Anonymous No.106572569 [Report] >>106572592
what did they mean by this
Anonymous No.106572592 [Report] >>106572601 >>106572637 >>106572714
>>106572569
>everyone picking up mi50s and v100s even though the next major ML stack releases from AMD and Nvidia are dropping them
I don't get it at all. Even if you had to pay double the price, it is still worth having software support over trying to hack things together after that point and praying the Vulkan backend gets super optimized one day so you can keep using your card.
Anonymous No.106572601 [Report] >>106572669
>>106572592
i meant the little green display but yes the gpu choice is also questionable
Anonymous No.106572637 [Report] >>106572669
>>106572592
What could be the reasons for updating your drivers? The last time I assembled my LLM machine was last year and I had to downgrade drivers for better power savings; it still works to this day. The only things I've heard about these drivers are that they get slower in newer versions, and that power efficiency on idle has been broken for more than a year now
Anonymous No.106572653 [Report] >>106572669
And when it comes to AMD drivers, if you find a version that somewhat works, you'd better never touch it again
Anonymous No.106572669 [Report] >>106572729
>>106572601
Oh didn't notice. Yeah, won't comment on that. I still think microATX is way too small to fit cards like that even with blowers but I guess that's why noise is never a factor to consider.
>>106572637
Depends on what card you have. Ada and now Blackwell are still getting performance improvements and fixes. If you locked your hardware stack now especially on Blackwell, you're missing out on Nvidia actually providing value in unbreaking shit, although to be fair, it's shit they broke in the first place. CUDA also does get a bunch of API changes between major releases.
>>106572653
For AMD, you especially want to run nightly ROCm if you can build it yourself.
Of course, that's from a developer/tinkerer standpoint. If you want shit to just work, then okay, you do you in order to keep software stability at all costs.
Anonymous No.106572696 [Report] >>106572729
just don't use AYYMD and you will be happy
llama.cpp CUDA dev !!yhbFjk57TDr No.106572714 [Report]
>>106570386
Unless someone changed it when I wasn't looking, the default in llama.cpp is to use the GPU with index 0 as "main GPU".
Note that the order in which llama.cpp/ggml receives GPUs is not necessarily consistent with the one reported elsewhere.
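If the ordering looks wrong, the usual workaround is to reorder devices before launch, e.g.

    CUDA_VISIBLE_DEVICES=1,0 ./llama-server -m model.gguf -ts 1,1

though whether that matches the numbering nvidia-smi reports is something you have to verify per system.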

>>106572592
Essentially all GPUs you buy are a depreciating asset.
Even if you have to replace them earlier and they end up as e-waste, that may have been a better deal than buying and later selling a more expensive GPU.
Though as long as there are drivers, I intend to maintain llama.cpp/ggml support for Pascal and Vega (Mi50).
Anonymous No.106572729 [Report]
>>106572669
Have you ever experienced a t/s increase after updating nvidia drivers?
>>106572696
People who buy AMD are either desperate enough or in it for the ride. Someone has to finish off that lunatic extra phantasm, you know
Anonymous No.106572734 [Report]
>>106570295
>So, a general instruct model lost to a model that was specialized for coding at coding, and that's supposed to be a mark against the general instruct model?
the general instruct usually did better than the previous coder-focused model, yes.
Qwen 3 instructs (the general instruct, not coder) are better than 2.5 coder.
A new model being worse than a previous model is a sign that the tech is stalling.