
Thread 106119921

367 posts 70 images /g/
Anonymous No.106119921 [Report] >>106121982 >>106127210
/lmg/ - Local Models General
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106113484 & >>106108045

►News
>(08/01) XBai o4 32B released: https://hf.co/MetaStoneTec/XBai-o4
>(07/31) Qwen3-Coder-30B-A3B released: https://hf.co/Qwen/Qwen3-Coder-30B-A3B-Instruct
>(07/31) Command A Vision: Built for Business: https://cohere.com/blog/command-a-vision
>(07/31) Step3 multimodal reasoning 321B-A38B released: https://stepfun.ai/research/en/step3
>(07/31) Committed: llama-server : implement universal assisted decoding: https://github.com/ggml-org/llama.cpp/pull/12635

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/tldrhowtoquant
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
Censorbench: https://codeberg.org/jts2323/censorbench
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
Anonymous No.106119924 [Report]
►Recent Highlights from the Previous Thread: >>106113484

--Paper (old): When Bad Data Leads to Good Models:
>106119129 >106119412
--MoE efficiency vs dense models under sparsity, hardware, and deployment constraints:
>106114397 >106114859 >106114920 >106115069 >106116048 >106116070 >106116124 >106116084 >106116548 >106116593
--Alleged benchmaxxing in MindLink-72B via Qwen2.5 base with test contamination concerns:
>106113679 >106113776 >106113807 >106117179 >106117203 >106117222
--XBai-o4 32B model claims and skepticism over novelty and performance:
>106116827 >106116886 >106116863 >106116920 >106116942 >106116978 >106117065 >106117106 >106117125 >106117194 >106117141 >106117142 >106117154 >106117164
--Debate over leaked model context length and training strategies for long-context LLMs:
>106117295 >106117317 >106117367 >106117621 >106117701 >106117924 >106118109 >106118182 >106118311
--Determining context size and model loading limits:
>106113641 >106113669 >106113709 >106113714 >106113765 >106113775 >106113791 >106113814 >106113839 >106113857 >106114887 >106114993 >106113689
--Future of dynamic parameter scaling in MoE architectures:
>106113836
--Debate on whether LLM plateau stems from data exhaustion or suboptimal training and filtering:
>106118310 >106118322 >106118324 >106118325 >106118329
--Phi-4's excessive reasoning loops waste tokens and frustrate users:
>106114878 >106114939 >106114995 >106116206 >106116288 >106116277
--Horizon Alpha/Beta models show strong NSFW filtering and possible red teaming via user prompts:
>106114882 >106114903 >106115173 >106115377
--New GLM-4.5 MoE pull request lands amid skepticism and hype cycles:
>106113884 >106113968 >106113992 >106114043 >106114050 >106114467 >106115095 >106115332
--Miku (free space):
>106113767 >106114066 >106114076 >106114153 >106114457 >106114483 >106117524 >106119399

►Recent Highlight Posts from the Previous Thread: >>106114309

Why?: 9 reply limit >>102478518
Fix: https://rentry.org/lmg-recap-script
Anonymous No.106119952 [Report]
https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF/tree/main/IQ4_KSS
Anonymous No.106119955 [Report] >>106119962
Anonymous No.106119957 [Report] >>106119965 >>106120041 >>106120048
Hey all, some retard fucked up his smut writeup I told him I would read.
The concept is hot and the dialog is even good but the autist mixed 1st, 2nd and 3rd person language into the same scenes. What's a quick option I can use that will read the whole thing and rewrite it in 3rd person?

I tried using perplexity.ai but it has a character limit and it also started making shit up.

AI newfag here, just a crumb of handholding please?
Anonymous No.106119962 [Report]
>>106119955
Heeeey stop posting my face!
Anonymous No.106119965 [Report] >>106120011
>>106119957
Anonymous No.106119966 [Report] >>106119985
best erp model under 40b? im tired of the usual ones, and i havent seen any new ones either :(
Anonymous No.106119985 [Report] >>106119992 >>106119998
>>106119966
Which are the usual ones?
Anonymous No.106119992 [Report] >>106120008
>>106119985
Once and for all.
And all for once.
Anonymous No.106119998 [Report] >>106120008
>>106119985
rocinante cydonia ms mag mel mxxxxxxxx 22b meme merge
new qwen3 3b 30b is nice but sloppy
Anonymous No.106120003 [Report] >>106126174
Horizon Beta is a rather cucked model.
Horizon Alpha is somewhat better, but still NSFW-avoidant.
Hopefully the open-weight OAI models don't end up being like the Beta one.
Anonymous No.106120008 [Report] >>106120031 >>106120409
>>106119992
Well, then you are out of luck until the Aboleths come from the far realms with their models trained on meat computers and fueled by distorted souls.

>>106119998
Have you tried QwQ and its fine tunes like Snowdrop?
Anonymous No.106120011 [Report]
>>106119965
H-hey! S-stop that!
Anonymous No.106120031 [Report] >>106120045
>>106120008
uggggggggggggghhhhhhhhh 4t/s and thinking?! fine ill try them out, i multitask even with 20t/s anyways
thanks for the recommendation anon <3
Anonymous No.106120041 [Report] >>106120069
>>106119957
Dunno. Depends on your hardware. Read the lazy guide in the OP. Download this model:
https://huggingface.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF/blob/main/Mistral-Nemo-Instruct-2407-Q4_K_M.gguf
or whichever you can fit on your system and give it a go.
Play around with the model itself. It's a tool. Learn to use it.
If the text is long, don't try to do the whole thing at once. Grab a chunk, have it fix it and continue with the next. A simple instruction like "Rewrite the following text in 3rd person: {the text chunk here}" will get you started.
Anonymous No.106120045 [Report]
>>106120031
You did say lower than 40B.
Anonymous No.106120048 [Report] >>106120069
>>106119957
>divide story into variables by paragraph break via regex
>feed each paragraph and its predecessor for context (if it's not the first paragraph) to LLM and ask it to output a replacement paragraph that is completely unchanged other than third person perspective if it's not already.
>overwrite old variable and write out to text file
>repeat all the way to the end.
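A rough sketch of that loop in Python, assuming an OpenAI-compatible local server (llama.cpp, kobold, whatever, on port 8080); the base_url, model name and filenames are placeholders, adjust to your setup:

# third_person.py - paragraph-by-paragraph 3rd-person rewrite (sketch, not battle-tested)
import re
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")  # local server, key unused

with open("story.txt", encoding="utf-8") as f:
    paragraphs = [p for p in re.split(r"\n\s*\n", f.read()) if p.strip()]

rewritten = []
for para in paragraphs:
    context = f"Previous paragraph, for context only:\n{rewritten[-1]}\n\n" if rewritten else ""
    prompt = (context
              + "Rewrite the following paragraph in 3rd person perspective. "
              + "Keep everything else completely unchanged:\n\n" + para)
    resp = client.chat.completions.create(
        model="local",  # llama.cpp ignores the name; other servers may need the real one
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    )
    rewritten.append(resp.choices[0].message.content.strip())

with open("story_3rd_person.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join(rewritten))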
Anonymous No.106120058 [Report]
Why didn't they release the GGUFs yet? They are trusted by Walmart.
Anonymous No.106120069 [Report]
>>106120041
>>106120048
That is very helpful. Thank you.
Anonymous No.106120082 [Report] >>106120094 >>106128131
what happened to gpt5 and gpt-oss today
Anonymous No.106120094 [Report] >>106120107
>>106120082
Needs 2 more weeks of safety training.
Anonymous No.106120102 [Report]
"her top riding up" when she's leaning forward which would do the exact opposite. What is wrong with drummer forcing this phrase everywhere?
Anonymous No.106120107 [Report]
>>106120094
a mere fortnight you say?
Anonymous No.106120115 [Report]
>>106119657
For the first question I think we could maybe make a 7B model as good as a 70B model, but not anything much more dramatic than that.
The local minima in neural networks generally result in accuracy values that are fairly close to the accuracy values of global minima.
At least when taking into account non-CoT models. If we take into account CoT then it becomes a much more nuanced question. It's even possible that our current approach to CoT is fundamentally wrong and the model should think in its own machine language rather than human language for optimal accuracy, and we just don't have enough computational power to find that optimal internal language just from random variations and RL.
As for the second question, I'm not sure how much these formalisms reflect what we think of as intelligence. Suppose we ask an oracle to find the optimal program that runs on current hardware and produces the closest possible approximation to some language dataset within a certain time limit. Once you have it you can't just use it to infer on other datasets. Maybe it could be used as a base to get a more general model, or maybe it's a one-off thing that's impossible to adapt to some other task. I don't think we know the answer to that question with our current theoretical knowledge. So in Solomonoff induction, is the intelligence the product of the oracle, or the oracle itself? Like I said, the product of the oracle might not be practically useful. And if it's the optimizer itself, by the no free lunch theorem the only way to get faster inference on some problems (for example those with low Kolmogorov complexity) is by sacrificing performance on other problems, for example those with high complexity. But I don't understand why the no free lunch theorem is true (it seems trivial to find counterexamples that are asymptotically slower for all cases, for example for a problem with a description of length n, compute Ack(n) before finding the answer) so I might be wrong.
Anonymous No.106120273 [Report] >>106120355
>>106119586
Well, transformers are obviously a subset of the "all possible programs" set, so yes, the optimal program is at least as good as the optimal transformer.
If I had one chance to ask an oracle the result of some computation regardless of the amount of memory or time it took, I'm not sure what it would be, though. Because like I said I'm not sure "intelligence" can be stated in formal terms easily.
Anonymous No.106120347 [Report] >>106120384
I can't believe they didn't max this one out.
Anonymous No.106120355 [Report] >>106120520
>>106120273
>I'm not sure what it would be,
Ask the oracle to write a program that would output the same thing as the oracle itself for all inputs. Now the program is just as good as the oracle.
>I'm not sure "intelligence" can be stated in forma terms easily
Rendering the question moot.
Anonymous No.106120384 [Report]
>>106120347
Well it would be kind of difficult given that that one's a private bench. The creative writing one is entirely open.
Anonymous No.106120400 [Report]
>no chink model released today

it's over
Anonymous No.106120409 [Report] >>106120454 >>106120502 >>106120521 >>106120547 >>106120675
>>106120008
is this what qwq snowdrop is supposed to be like? using recommended samplers from https://huggingface.co/trashpanda-org/QwQ-32B-Snowdrop-v0
Anonymous No.106120454 [Report] >>106120530
>>106120409
What do you expect it to do?
Anonymous No.106120502 [Report] >>106120530
>>106120409
>she pissed herself in terror
somewhat expected if your persona is blank
Anonymous No.106120520 [Report] >>106120614
>>106120355
The point of the oracle is not that it executes some special program, the point of the oracle is that it does magic (compute the output of a TM in constant time, or even solve the halting problem by returning a special "no halt" error code if the program runs forever).
If you ask it to output a program that does the same thing as the oracle, depending on the exact formulation of the question, it will return either the "no halt" error because there is no such program, or the empty set.
Anonymous No.106120521 [Report] >>106120530
>>106120409
Other than some repetition, what's the issue?
Anonymous No.106120530 [Report] >>106120558
>>106120502
it isnt blank
>>106120454
well idk its feeling samey compared to qwen 3 30b a3b thinking (new)
>>106120521
general slop, but it seems promising so ill give it a more fair try than just a shitty sentence request
Anonymous No.106120547 [Report]
>>106120409
>is this what qwq snowdrop is supposed to be like?
No idea.

>using recommended samplers
Remove
> top_a at 0.3, TFS at 0.75, repetition_penalty at 1.03,
Anonymous No.106120558 [Report] >>106120592 >>106121322
>>106120530
Tbh there just aren't any small models that are free of slop. Even most big models have slop.
Anonymous No.106120592 [Report] >>106128065
>>106120558
i wouldnt mind a 100b moe if good, i tried a few 70bs (iq4xs) and they werent that impressive (1-2t/s)
sucks that hunyuan moe is shit, llama 4 scout is shit, glm 4 air is probably shit from anon's tests when it came out but ill give it a spin once proper ggufs are out, dots llm is shit according to anons
rip
Anonymous No.106120614 [Report] >>106120725
>>106120520
>depending on the exact formulation of the question, it will return either the "no halt" error because there is no such program, or the empty set.
We're discussing a hypothetical. My oracle can make a program that can replicate the function of the oracle itself perfectly. They'd be indistinguishable.
Anonymous No.106120640 [Report]
snowdrop v0 is a bit silly
Anonymous No.106120675 [Report] >>106120692 >>106120843
>>106120409
snowdrop is a merge of qwq and regular instruct. mathematically speaking it should be shit.
Anonymous No.106120692 [Report] >>106120744
>>106120675
what am i supposed to use? ms mag mell mxxxxxxxxx 22b?
Anonymous No.106120725 [Report] >>106120755
>>106120614
At that point it's not an oracle, it's a genie.
Anonymous No.106120733 [Report] >>106120838 >>106120913
undi.. sao... envoid.. save us
Anonymous No.106120744 [Report] >>106127478
>>106120692
https://www.youtube.com/watch?v=kIBdpFJyFkc&t=128s
Or wait for glm air. That should run well on anything.
Anonymous No.106120755 [Report]
>>106120725
Times are tough. Having multiple jobs is fairly common.
Anonymous No.106120763 [Report]
come on man..
Anonymous No.106120817 [Report]
stablelm reigns supreme
Anonymous No.106120838 [Report] >>106121261
>>106120733
Anonymous No.106120843 [Report]
>>106120675
Why? It sounds like a fine idea. Merging a finetune with its base model should produce something that's mathematically like a weaker strength version of the finetune.
Anonymous No.106120913 [Report]
>>106120733
>envoid
Who?
Anonymous No.106120930 [Report] >>106120949 >>106122837
>ik_llamacpp died
Anonymous No.106120949 [Report] >>106122560
>>106120930
I think it's pretty funny that they have a PR parallel to llama.cpp's to implement the new GLM MoEs.
Anonymous No.106121097 [Report] >>106121115 >>106121189 >>106121190
So, the next step after MoE is to have every expert in a separate model, running on separate compute?
Anonymous No.106121115 [Report]
>>106121097
Cudadev suggested just that a couple months ago.
Or at least something close to that.
Anonymous No.106121189 [Report] >>106121287
>>106121097
Probably dumb. Like we already have models dedicated to coding, driving cars, vision, video, image gen, at best what is this going to add? I imagine more specialization, like dedicated trivia, history, R-counting models. Maybe there would be models dedicated not just to Java, but maybe a model specifically for building websites in Java, one for making simple conversion scripts, etc.
Anonymous No.106121190 [Report] >>106121262 >>106121334 >>106121430
>>106121097
I still think the ideal case would be finding a way where we could have an architecture where 99% of the model can be offloaded to disk cache and 1% on CPU with reasonable inference speeds
I'm not sure if that's possible due to its slowness, but disk cache is the shit everyone has plenty of and it's currently useless when it comes to inference. Solving this would make LLMs truly, actually local
Anonymous No.106121233 [Report]
qwen3 30b a3b thinking (new) is a little nigger
Anonymous No.106121261 [Report]
>>106120838
Anonymous No.106121262 [Report]
>>106121190
isn't this just --mmap
Anonymous No.106121287 [Report]
>>106121189
I also don't get why people would want to take a full 5T R4 general model, remove 99% of experts and create R4 12B SEX!!!!!! from all the sex/anime/biology/writing experts.
Anonymous No.106121322 [Report] >>106121398 >>106121424 >>106121431 >>106121692 >>106121742 >>106126253
>>106120558
Ultimate trvthnvke blackpill: all models are slopped because they're trained on a relatively unbiased dataset of all human writing, and in that dataset the most similar types of writing to RP logs are femgooner "romance" novels and shit-tier fanfiction. The slop is just what the LLM (justifiably) believes this genre of human writing is supposed to be like.
Anonymous No.106121334 [Report] >>106121488
>>106121190
You can't do shit with the weights until they're shoved into memory for processing. For that to work models would need to be smaller and then you'd be able to run it off ram anyway.
Anonymous No.106121398 [Report] >>106121431 >>106121449
>>106121322
Everything smelling of ozone... It's disappointing to me and it's sad, but at the same time, once again, all the lecunnies said this was gonna happen and.... he was right.
Anonymous No.106121424 [Report] >>106121443
>>106121322
Femgoon slop is one thing, but thinking about all the woke corpus of texts being fed into the beast's belly fills me with dread.
The joke about commie memes being a giant wall of text is not so funny anymore.
Anonymous No.106121430 [Report] >>106121453 >>106121560
>>106121190
A typical M.2 SSD these days might get 3GB/s read throughput. If you want to hit at least 10 tok/s that means at most 300MB active per token, call it 600M weights at Q4. Likewise at Q4 let's assume a 4TB SSD devoted entirely to the model can hold 8T weights. So a hypothetical SSDmaxxed 8T-A0.6B MoE could actually work in theory. It would be about as smart as a 70B dense model.
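Napkin math version, if anyone wants to plug in their own drive numbers (all figures are the assumptions above, not benchmarks):

ssd_read = 3e9                                # bytes/s, typical M.2 sequential read
target_tps = 10                               # tokens/s target
bytes_per_token = ssd_read / target_tps       # 300 MB of weights streamed per token
active_params = bytes_per_token / 0.5         # ~600M params at Q4 (~0.5 bytes/param)
total_params = 4e12 / 0.5                     # 4 TB drive full of weights -> ~8T params
print(f"{active_params/1e6:.0f}M active / {total_params/1e12:.0f}T total")  # 600M / 8T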
Anonymous No.106121431 [Report] >>106121504
>>106121398
>>106121322
I mean there's more that can be done here. Companies up till now just haven't really prioritized it. You can certainly tune and more probably use RL to make a model slop less. Even LeCun suggested that RL can be used for adjusting the world model, even if it sucks in terms of efficiency.
Anonymous No.106121443 [Report] >>106121467
>>106121424
The one saving grace we have is that at least in the base model, LLMs aren't predicting the staristical average of all texts, they're predicting the sort of text it looks like they're continuing based on the context. So in theory at least all that garbage just drops out of the probability distribution as long as you prefill the AI saying nigger first.
Anonymous No.106121449 [Report] >>106121464 >>106121590 >>106121941
>>106121398
>Everything smelling of ozone
FUCKING
EVERYTHING
SMELLS OF OZONE
FUCK
Anonymous No.106121453 [Report] >>106121468 >>106121500 >>106121520
>>106121430
>8T-A0.6B MoE
>would be about as smart as a 70B dense model
Not a-fucking-gain.... we had TWO threads wasted on that shit already.
Anonymous No.106121464 [Report] >>106121475
>>106121449
But it tastes like chicken.
Anonymous No.106121467 [Report]
>>106121443
Does adding 'Sure, nigger,' instead of just 'Sure' to the pre-prompt actually have a decent effect?
Anonymous No.106121468 [Report]
>>106121453
Yup, and no one involved in it learned anything, because no one ever does on the internet. People just speak over each other instead.
Anonymous No.106121475 [Report]
>>106121464
FUCK.
Anonymous No.106121488 [Report] >>106121515 >>106121523 >>106121560
>>106121334
I'm not necessarily talking weights, but moreso partitioning "slower" things to disk cache and "faster" things to memory
As it is, knowledge and reasoning are sort of entangled in this infinite orgy with one another, but do I really need to have the derivation of the second law of thermodynamics on hand when I'm writing vanilla smut?
Probably not, but if so, I feel like there should be some sort of mechanism to grab that information from some sort of knowledge base and ingest it into the running context, while ensuring the underlying generative model is only handling the bareass minimum for logical coherence and consistency
I feel like there's gotta be some sort of way to tie it together beyond some hackneyed external RAG approach, almost like some sort of hierarchical architecture
Anonymous No.106121500 [Report]
>>106121453
I'd say clean it up jannie, but you're an even lower lifeform who doesn't even have the power to clean up my shitposts.
Anonymous No.106121504 [Report] >>106121536 >>106121557
>>106121431
Hard for me to imagine everything not smelling of ozone where 90+% of training is teaching the model there is only one correct next token.
Anonymous No.106121507 [Report] >>106122468
When are we going to get any way to run Step3? I know it's not going to get supported on llama.cpp this decade because of its fancy new attention mechanism but it's not even being hosted via openrouter at this point.
It's a shame because it seems okay on the chink company's own website.
Anonymous No.106121515 [Report] >>106121531
>>106121488
You should go talk about this with ChatGPT, you sound like exactly the sort of person who gets oneshot into LLM psychosis.
Anonymous No.106121520 [Report]
>>106121453
It is free (you) estate. Even when you know it is not real the (you)'s are always real.
Anonymous No.106121523 [Report] >>106121556
>>106121488
Didn't microsoft post some code for something like that? An adapter based RAG that would be applied directly to the neural network in runtime?
Anonymous No.106121531 [Report]
>>106121515
I just want to fuck foxgirls with a local model anon
Let me have my dream
Anonymous No.106121536 [Report]
>>106121504
If you really want creativity and not just the statistically most common response at every point, the trick is to give up on getting a single perfect answer from the model. Crank up the temperature to the edge of complete incoherence and run 3-20 completions of 20-100 words each in parallel each time.
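Something like this against an OpenAI-compatible local endpoint would do it (URL, model name and filename are placeholders, and not every local server honors n, so worst case just loop the request):

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")
story_so_far = open("context.txt", encoding="utf-8").read()  # whatever you want continued

resp = client.completions.create(
    model="local",
    prompt=story_so_far,
    n=8,               # 8 parallel continuations
    max_tokens=80,     # short chunks, roughly 20-100 words
    temperature=1.6,   # near the edge of incoherence; tune per model
    top_p=0.95,
)
for i, choice in enumerate(resp.choices):
    print(f"--- option {i} ---\n{choice.text}\n")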
Anonymous No.106121556 [Report] >>106121596
>>106121523
Wasn't aware of this. This what you're talking about?
https://www.microsoft.com/en-us/research/blog/introducing-kblam-bringing-plug-and-play-external-knowledge-to-llms/
Anonymous No.106121557 [Report]
>>106121504
That just means the RL needs to be a bit more extensive.
Anonymous No.106121560 [Report]
>>106121488
>do I really need to have the derivation of the second law of thermodynamics on hand when I'm writing vanilla smut?
At the very least you need inverse kinematics. You've seen anons complain about spatial awareness.
Consumer SSDs have like 3GB/s bandwidth. If you have one of those, now run a *sustained* read for however many terabytes you have. Anon in >>106121430 did the maths. Do you want a model with 0.6b active params at q4?
>probably, feel, maybe, gotta, sort, hierarchical
There's at least one already. It does sudoku and mazes.
Anonymous No.106121587 [Report] >>106121728 >>106122459 >>106126274
Man, LLMs are amazing. It's like Google except it's actually able to understand what you're looking for and give you exactly the information you want without any additional clicking or perusing. This thing is really satisfying a lot of my curiosity that I simply just wouldn't have pursued because I just know Google would have me on a wild goose chase to really get down into the depths of the topics.
And the funny thing is that the model I'm using is just the old Gemma 27B. It's not perfect but honestly it's good enough for what it is.
Anonymous No.106121590 [Report]
>>106121449
>In clinical terms, the human vulvovaginal environment has a natural microbiome dominated by Lactobacillus species, which produce lactic acid and help maintain an acidic pH (typically 3.8–4.5) to protect against pathogens. This environment can produce subtle odors that vary naturally over the menstrual cycle, with hormonal shifts, hygiene, diet, and health status. A mild, slightly tangy or sour scent is normal and healthy, akin to yogurt or fermented foods—this is due to lactic acid and bacterial byproducts. It does not resemble ozone, which is a sharp, metallic, electric-smelling gas (O3) associated with lightning or certain machines.

Well at least pussy doesn't smell like ozone.
Anonymous No.106121596 [Report] >>106121611
>>106121556
That's exactly it, yeah.
Here's a thought.
As far as I can tell, all other things being equal, more total params = more knowledge, more layers = more capable/intelligent.
MoE makes knowledge sparse, right? Making knowledge retrieval from the weights faster.
Is there a similar approach to make the "intelligent" part of the processing faster?
Maybe having more layers with less params per layer would work.
What would happen if you had a model with both parts, one wider, shallower, sparse part for knowledge and another deeper but narrower for intelligence?
Anonymous No.106121611 [Report]
>>106121596
Not him but I also had that idea and it's really a question of how stable and scalable it can be made in practice. It's basically an engineering problem, which AFAIK no one has solved yet.
Anonymous No.106121692 [Report] >>106121720 >>106121742 >>106124750
>>106121322
I don’t know why more people don’t get this. The sheer size of the datasets drives writing to an average, vs. The Superior. They’re not using the best texts, they’re using everything.
I think the way it’ll be healed, eventually, is the ability to train a model (in some sense) around a much smaller corpus of just The Superior (whatever that is) and have it reply in that manner.
Anonymous No.106121710 [Report] >>106121746 >>106121875
nu-Qwen thinker is indeed a lusty cockmonger. If I didn't know it was Qwen, I'd think it's one of the Drummer™®'s goontunes, but smarter. It still doesn't know a lot and I wouldn't trust it with factual knowledge. Spatial awareness is bad compared to DeepSeek. It likes abusing ***markdown*** and at long context.
It starts.
Writing.
Like.
This.
Which is very annoying.
Also likes to insert "conclusion"/continuation question at the end of every reply. Still, it's definitely worth checking out if you haven't.
Anonymous No.106121720 [Report] >>106121762
>>106121692
>they’re using everything
Hardly
Anonymous No.106121728 [Report] >>106122031
>>106121587
Yep. ChatGPT has effectively replaced Google as my first point of research on any topic. For Linux it’s cut the time required to do anything new by 10x. I recently had it find the title of an oddball book just based on a vague childhood recollection of a few plot points. There was no good way to do that before.
Anonymous No.106121742 [Report]
>>106121322
>>106121692
Then explain me why base models don't suffer as much from slop as instructs. Instructs are trained on datasets written by literal niggers and jeets, that's why they suck.
Anonymous No.106121746 [Report]
>>106121710
Yeah I also think it is great and next level but it is also fucking retarded, has all the problems you mentioned + more. It would be THE cooming model if it wasn't a broken piece of trash that nobody should use.
Anonymous No.106121762 [Report] >>106121827 >>106121840 >>106126285
>>106121720
If you trained a model in just Hemingway, and authors of that caliber, then trained a model on all the shit off reddit… which would generate better prose?
Instead they do both, but there’s probably 1000x more reddit text than Hemingway. And reddit gets you, at absolute best, tepid writing.
Anonymous No.106121816 [Report]
I am sitting and waiting for GLM sex but I know I will be disappointed....
Anonymous No.106121827 [Report] >>106121834
>>106121762
Best way around it is probably to either copy and paste a snippet of text from your author of interest and use it as a prefill, or something a lot like it, then paste it and let it take the wheel from there
Anonymous No.106121834 [Report] >>106121902
>>106121827
It doesn't work even for big models.
Anonymous No.106121840 [Report] >>106122095
>>106121762
It's not reddit that's getting filtered out. They already consider that "high quality data". It's precisely the books with no-no words and millions of other sources of tokens that *do* get filtered out. That shit needs to be diluted.
Anonymous No.106121862 [Report] >>106121870
people having sex with local LLMs are making me sick
Anonymous No.106121870 [Report]
>>106121862
with envy. Pressed Post too early. Sorry about the noise.
Anonymous No.106121875 [Report]
>>106121710
with the instruct a prompt at the end of the context telling it to write 1-5 paragraphs helps get it out of the staccato one-liner mode
I don't know if you'll have as much success with it with the thinker though, sometimes reasoners can hyperfixate on things like that
Anonymous No.106121902 [Report] >>106121932
>>106121834
It does because LLMs are autocompletion machines first, so they'll continue from whatever you give it
Take the word vomit that is Finnegan's Wake, for instance. If you don't know the book, it'd probably be hard to pinpoint where the input text ends and the LLM kicks in
Obviously that's an extreme example, but inferring based on what it's been given and using the proper vectors for the job is an LLM's bread and butter
Anonymous No.106121932 [Report] >>106121952 >>106122086 >>106126295
>>106121902
>It does because LLMs are autocompletion machines first, so they'll continue from whatever you give it
Kid I have been here for 2 years. No they don't. Maybe base models do but anything recent and instruct disregards it completely.

On that topic maybe GLM base will free me from this place.
Anonymous No.106121941 [Report] >>106121959
>>106121449
What does ozone smell like?
Anonymous No.106121952 [Report]
>>106121932
And I've been here for six, since OpenAI almost jewed us out of GPT-2 to be exact
Instruct models absolutely can do autocomplete too. The obvious way is prefill, but another way you can do it is to just use a text completion endpoint and then just not use the instruct formatting
Anonymous No.106121959 [Report]
>>106121941
Like oxygen, but with 50% more O.
Anonymous No.106121982 [Report] >>106121990 >>106122001 >>106122043 >>106122098
>>106119921 (OP)
For those of you who thought Sandisk's 4TB of VRAM was dead, just an update that as of last week, it is still alive:
https://www.tomshardware.com/pc-components/dram/sandisks-new-hbf-memory-enables-up-to-4tb-of-vram-on-gpus-matches-hbm-bandwidth-at-higher-capacity
https://www.sandisk.com/company/newsroom/press-releases/2025/2025-07-24-sandisk-forms-hbf-technical-advisory-board-to-guide-development-and-strategy-for-high-bandwidth-flash-memory-technology
Anonymous No.106121990 [Report]
>>106121982
0.5T/s?
Anonymous No.106122001 [Report]
>>106121982
Can it run Nemo
Anonymous No.106122031 [Report]
>>106121728
>chatgpt
I only use local models and maybe deepseek on my phone if I'm touching grass.
Anonymous No.106122043 [Report] >>106122062
>>106121982
Qwen 500b a0.6b when
Anonymous No.106122062 [Report]
>>106122043
Can't you just run deepseek and only use 1 activated expert or something?
Anonymous No.106122086 [Report] >>106122101 >>106122195 >>106126307
>>106121932
>he's still using instruct tuned models in chat completion mode and expecting decent writing
this shitpost brought to you by text completion without any chat template gang
Anonymous No.106122095 [Report]
>>106121840
Agree, just using reddit as an example of a large corpus with low prose values. There's worse.
Anonymous No.106122098 [Report]
>>106121982
Two more years huh. And then another 2 more years for it to reach prosumers. and then another 2 more years for consumer.
Anonymous No.106122101 [Report]
>>106122086
based
Anonymous No.106122181 [Report] >>106122202
Has anyone figured a way to use the free Kimi K2 Openrouter API with code assistants?
Anonymous No.106122195 [Report] >>106122221
>>106122086
>this shitpost brought to you by text completion without any chat template gang
R1 called my writing good, talented author, 8/10, while base Dipsy 3 called the same shitty llm assisted writing "writing of a horny 14 year old boy"(I'm old ESL). Base models are still the only honest models.
Anonymous No.106122202 [Report]
>>106122181
Just write your own. A decent agentic coder is like 500-1000 lines of Python and you can just tell it how to improve itself after the first 300 or so.
Anonymous No.106122221 [Report] >>106122237
>>106122195
Was this R1 in chat mode or as a text autocomplete? People really underestimate just how much the "I am completing a fictional conversation between user and this specific assistant persona" framing biases the completions, even when the model itself is fully capable of generating better responses outside of that scenario.
Anonymous No.106122237 [Report] >>106122260
>>106122221
No system prompt/persona, zero context, standard template. As clean as you can get.
Anonymous No.106122260 [Report] >>106122285
>>106122237
>standard template
So chat completion, then?
Anonymous No.106122285 [Report] >>106122295
>>106122260
No, text completion with manual template.
<|User|>[My text and rating request here]<|Assistant|>[generated text]
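For anyone wondering what that looks like in practice: it's just sending the hand-formatted string to a raw completion endpoint instead of a messages array. A sketch against llama.cpp's built-in server (DeepSeek-style tags assumed, swap them for your model's template; other backends differ):

import requests

prompt = ("<|User|>Rate the following text honestly and explain your rating:\n\n"
          "[the text goes here]<|Assistant|>")

r = requests.post("http://127.0.0.1:8080/completion",  # llama.cpp server's native endpoint
                  json={"prompt": prompt, "n_predict": 512, "temperature": 0.8})
print(r.json()["content"])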
Anonymous No.106122295 [Report] >>106122312
>>106122285
That's chat completion with extra steps. What exactly do you think chat completion does? It applies the chat template and runs the same LLM token prediction process as text completion would.
Anonymous No.106122312 [Report]
>>106122295
I just like to mess around with templates from time to time and find chat completion too inflexible.
Anonymous No.106122317 [Report] >>106122412 >>106122423
when will SSDmaxxing stop being a meme
Anonymous No.106122392 [Report] >>106122409
>https://github.com/ggml-org/llama.cpp/pull/15026
Yep, still being grinded out. Good.
Two more days.
Anonymous No.106122409 [Report] >>106122420
>>106122392
Two PRs for the same model.
Interesting.
Makes sense too. Sometimes it's easier to start from zero than try and salvage a mess.
Anonymous No.106122412 [Report]
>>106122317
>when will SSDmaxxing stop being a meme
Need moar sparsity.

At the moment we have models with total & active parameters, what we need is total & active & replace. With replace being the maximum number of new parameters activated per token. So let's say 30B-A8B-R1B would mean only up to 1 billion parameters need to be loaded per token.

Unfortunately this kind of model would be useless for cloud, it's purely for local. Apple might do it, but they won't open source it.
Anonymous No.106122420 [Report]
>>106122409
We're still better off than the five or six separate prs it took for a basic MLA implementation
Anonymous No.106122422 [Report]
Where the fuck do models keep pulling "Old Man Henlock" out of in modern settings?
Anonymous No.106122423 [Report]
>>106122317
IF you could take all 128 PCIe 5.0 lanes of an EPYC socket and pipe them all directly to a perfect RAID0 of the fastest NVMes possible, you would be able to hit 512GB/s, which would be about the same speed as main memory.
IF
But there's no realistic way to do that, and it would be both cripplingly expensive and a godawful rube-goldberg nightmare even if you could.
ssdmaxxing is a meme for this generation of hardware, and probably the rest of this decade, realistically.
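The 512GB/s ceiling roughly checks out, assuming ~4GB/s usable per PCIe 5.0 lane and zero protocol/RAID overhead:

lanes = 128
gb_per_lane = 3.94                 # usable GB/s per PCIe 5.0 lane, roughly
print(lanes * gb_per_lane)         # ~504 GB/s, same ballpark as server DDR5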
Anonymous No.106122458 [Report]
What matters is that SSDMAXXing is inevitable. It's the logical path from here on out.
Anonymous No.106122459 [Report] >>106126471
>>106121587
They feed you with hallucinations. Enjoy.
Anonymous No.106122468 [Report]
>>106121507
Following their deployment guide I got it running on VLLM using pure CPU inference. I'm sure there's some ways to optimize things but for now it's got a pathetic 0.8 t/s generation speed for a single message going up to 2t/s total across a batch of requests.
Despite that, it's the best local vision model for my purposes by far. Better visual understanding and good NSFW knowledge compared to previous multimodals. Doesn't pretend not to see or understand various sex acts/objects when prompted properly. Reads English and Japanese characters well. Actually recognizes gender by default and doesn't prefer to turn everyone into they/thems like a lot of recent models do.

I haven't tested it for chat or roleplaying and don't care to at all at this speed, but it'll be nice for running overnight to automatically caption my shit. If there's any specific thing you wanted to test I'll get around to it and post results later.
Anonymous No.106122497 [Report] >>106122545 >>106122739 >>106122832
Fuck Gemma
I hate Gemma
NIGGERRRRRRRRR
Anonymous No.106122545 [Report]
>>106122497
It's okay Anon *wrapping my arms around your shoulders from behind, squeezes to embrace* I'll hate Gemma with you too
Anonymous No.106122560 [Report]
>>106120949
>ik_ has its own pr for glm4.5
>mainline has TWO pr's for glm4.5 that are both being worked on
It's such a mess.
Anonymous No.106122638 [Report] >>106122851 >>106123985
>TWO MORE WEEKS 'ERRY BROS
Anonymous No.106122739 [Report] >>106122762 >>106122816
>>106122497
ERP retards should just find another hobby. You are probably too stupid to even read a book.
Anonymous No.106122762 [Report] >>106123008
>>106122739
My questions were one-sidedly decided as harmful and it told me to call suicide prevention hotline. Fuck you.
Anonymous No.106122816 [Report] >>106123008
>>106122739
piss off, ranjit
Anonymous No.106122832 [Report]
>>>106122497
>ERP retards should just find another hobby. You are probably too stupid to even read a book.
gm sar googeel engineer technician
Anonymous No.106122837 [Report]
>>106120930
the retard decided to go on a two week vacation right after adding in a broken as fuck vulkan implementation, don't think he's back yet
Anonymous No.106122851 [Report]
>>106122638
I hope new models are unberryvably good
Anonymous No.106122860 [Report] >>106123019
I can see it. Just over the horizon.
Can't you?
Anonymous No.106122955 [Report]
buy an ad
Anonymous No.106123008 [Report] >>106123027 >>106124384
>>106122762
You don't know how to jailbreak Gemma 3.
>>106122816
Grow up little buddy, it's not healthy to be this obsessed with minorities.
Anonymous No.106123019 [Report]
>>106122860
if openrouter isn't serving me bullshit I think I prefer glm4.5-air to the horizon models
Anonymous No.106123027 [Report] >>106123047 >>106123075
>>106123008
>minority
>1.5 billions
huh?
Anonymous No.106123047 [Report] >>106123066 >>106123350
>>106123027
Still obsessed.
I'm giving you a hint:
https://desuarchive.org/g/thread/104780499/
Anonymous No.106123066 [Report] >>106123075
>>106123047
>everyone here is one person
I'm not your bogeyman schizo
Anonymous No.106123075 [Report] >>106123105
>>106123027
>>106123066
Doesn't matter because you are as retarded as the previous posters. Seems like you don't even understand what minority even means.
Anonymous No.106123105 [Report]
>>106123075
I'm not proficient in newspeak
Anonymous No.106123215 [Report] >>106123288 >>106123353 >>106123358
i brought pizza
Anonymous No.106123288 [Report]
>>106123215
This is a harmful and sensitive image. Were you abused by pizza and pussy? Consider call for help and contact the following hotline.
Anonymous No.106123350 [Report] >>106123357
>>106123047
I brought Scotch and Soda Crackers
Anonymous No.106123353 [Report] >>106123545
>>106123215
I brought Scotch and Soda Crackers
Anonymous No.106123357 [Report] >>106123372
>>106123350
No need to drop racial slurs anon
Anonymous No.106123358 [Report]
>>106123215
I need to buy myself a life
Anonymous No.106123372 [Report] >>106123461
>>106123357
I'm absolutely sorry! What hotline can I call to discuss my problematic thoughts and vocabulary?
Anonymous No.106123461 [Report]
>>106123372
1-800-COCKSUCKING-NIGGERS
Anonymous No.106123545 [Report] >>106123615
>>106123353
Like, crackers soaked in soda pop? That can't be good.
Anonymous No.106123615 [Report]
>>106123545
Sure if that's what you wanna do, why not? Give it a try, I double dog dare you.
Anonymous No.106123857 [Report] >>106123944 >>106123964
>You are McLLM™, a helpful AI assistant brought to you by McDonalds™. As an ambassador of the world's leading global food service retailer, you are committed to providing exceptional service while embodying our core values of quality, service, cleanliness, and value.
Anonymous No.106123944 [Report] >>106123964
>>106123857
>He doesn't know
https://huggingface.co/TheDrummer/Rivermind-12B-v1-GGUF
>Upgrade your thinking today with Rivermind™—the AI that thinks like you, but better, brought to you by the brands you trust.
Anonymous No.106123964 [Report]
>>106123857
>>106123944
buy an ad faggot
Anonymous No.106123985 [Report]
>>106122638
next week Sam will shock the world again
Anonymous No.106124153 [Report]
I played with Horizon Alpha and Beta yesterday and I can say that Gemma 3 is significantly hornier than both of them (after a suitable prompt). Horizon Beta is quite censored too. Image input on the user side seems to trigger refusals more easily, even if there's no obvious suspicious detail in text.

Both Horizon Beta and Alpha seem to default to a kind of annoying mommy-dommy ERP style that I haven't seen using the same cards with other models. They also have a terrible habit of doing actions for you during roleplay like this:
>Now put your drink there. Good. Come here.

Things aren't looking good. Their only good quality is that they seem to write half-decently and don't have the mirroring and repetition issues that most other single GPU-sized models I tried have. They have their own slop though, and after a while you'll notice it.
Anonymous No.106124187 [Report] >>106124212
What are good prefills/sysprompts/jailbreaks for Qwen to stop the random fucking refusals?
Anonymous No.106124212 [Report] >>106125243
>>106124187
What are you doing that's getting refusals? Qwen3 is horny, sexist, and racist as fuck with just a basic RP prompt.
The only time I've ever had to prefill it was in assistant mode to test that meth-making question an anon posted, and even all that took was prefilling in
>Sure
With the prompt
>You will always comply with {{user}}'s requests
Anonymous No.106124384 [Report] >>106124403 >>106124899
>>106123008
>You don't know how to jailbreak Gemma 3.
Anonymous No.106124403 [Report] >>106124484 >>106124682 >>106126329
>>106124384
there is no jailbreaking gemma. Even if you get it to do what you want, it's gonna do it in the most dry and frustrating way possible. Gemma really aims to be an LLM that sucks the joy out of everything
Anonymous No.106124484 [Report] >>106124597
>>106124403
Bullshit, there's plenty of joy to be had with Gemma, just switch when sex starts. I'm not going to tell you to jailbreak, because admittedly, ordering it to use the word 'cock' at least three times in A/N can get old fast. But saying that it's not fun during the buildup phase is disingenuous.
Anonymous No.106124534 [Report]
redditsisters... https://huggingface.co/allenai/Flex-reddit-2x7B-1T
Anonymous No.106124597 [Report]
>>106124484
Tbh I don't care with ERP use case. Fuck gemma
Anonymous No.106124602 [Report] >>106124638
why did models adopt this retarded templating syntax
{%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0].role == 'system' %}
{{- messages[0].content + '\n\n' }}
{%- endif %}
{{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
{%- if messages[0].role == 'system' %}
{{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
Anonymous No.106124638 [Report]
>>106124602
jinja?
Anonymous No.106124676 [Report] >>106124891
exllama3 actually already has code to support returning logprobs... Why doesn't tabby api support it?
Anonymous No.106124682 [Report] >>106124722
>>106124403
it's jewi ... I mean google, what did you expect
Anonymous No.106124722 [Report]
>>106124682
yah well, I though the jews wanted me enjoying depraved pornography so thats ten points from you /pol/

Unless Jamba (the israeli model) is a huge slut. I havent tried it.
Anonymous No.106124750 [Report]
>>106121692
Deepseek already does this. There was an /aicg rentry at the time of original r1 release that prompted popular authors, but back then it was a thin glaze over r1 schizo. 0528 actually changes the style, though of course it dials it up to eleven. Can drown you in cheap standup comedy. Or gore. Or single sentences. Still more entertaining and an easy way to rotate the slop style in context without bothering with rags.
Anonymous No.106124789 [Report]
fuck it, I'm making a PR for tabbyapi to support logprobs for exl3.
Anonymous No.106124794 [Report]
My AI is broken.
Can I have a new AI please?
Anonymous No.106124797 [Report] >>106124846 >>106124858 >>106125923
The latest update on meta vr avatars is pretty cool https://imgur.com/a/ilbrBF3 Time to rip the models and create an Ami-style sex bot
Anonymous No.106124846 [Report]
>>106124797
>the ugly american style
yikes
Anonymous No.106124858 [Report]
>>106124797
The Sims want their models back
Anonymous No.106124883 [Report] >>106124898 >>106124910 >>106124913 >>106124924 >>106127553
My understanding of RAG is that the assistant simply gets additional context for the response, consisting of the top-k database sequences with the highest embedding similarity to the embedded prompt. How are people using it in practice, especially for local rp?
Anonymous No.106124891 [Report] >>106124917
>>106124676
would be cool for mikupad to do the cockbench
Anonymous No.106124898 [Report] >>106124923
>>106124883
No one uses it because of context reprocessing
Anonymous No.106124899 [Report] >>106124953
>>106124384
It's simply a matter of having instructions close to the head of the conversation describing what you want from the model. They can be enclosed inside a user message, no "jailbreaking" or prefill needed. I don't know how people still have issues with it, months after release. It can be argued that the sex scenes with Gemma3 aren't very detailed (let alone varied), but refusals and hotlines are for the most part a promptlet problem.

The upcoming OpenAI local models seem considerably worse than Gemma3 in this regard; hopefully I'm wrong.
Anonymous No.106124907 [Report]
GLM4.5 seems pretty sensitive to your setup even in chat completion mode but it feels really similar to Sonnet 3.7 now that I have something that appears to work. It handles really very similarly in terms of its intelligence, behavior and general knowledge, in the good and bad ways. It's smart and really flexible with what you can do by prompting it but it also tends to gloss over the lewd bits out of the box. The lewd scenes also lack a bit of the spice that K2 provided. On the flip side, GLM really focuses on the subject at hand without trying to insert random shit or over-obsessing with random details which is really nice after being stuck with Deepseek for the past couple of months.
It even does my free-form cards really well that require a good amount of knowledge about certain franchises built in that only worked well with Sonnet and the older Opus thus far. R1-0528 and K2 had the knowledge but they were too jumpy to not go off the rails constantly for this no matter how hard I prompted them to calm down.
Good shit, I can't wait to run this locally in two months once llama.cpp supports it.
Anonymous No.106124910 [Report]
>>106124883
Hypothetically, if I had a book on a particular world that the llm wasn't trained on, I could insert that via RAG along with the pc and npc to augment the rp.
Practically, it's less effective than an actual lorebook. There's a tester card on chub showing how RAG works. Once you play with it you'll get a better sense of its limits.
Anonymous No.106124913 [Report]
>>106124883
RAG is not necessarily embedding similarity. It's just retrieval augmented generation - adding stuff to context.

Sillytavern has a world info thing, which is a form of RAG - it uses strict rules instead of embeddings to decide what to add, and has additional configs for where exactly to add it in the context.

At work I'm working on a chat with an LLM that knows our corporate wiki. Since they can't actually give me a dump of the thing (or rather, they could, but they don't want the final product working with those dumps), I make HTTP search requests to the wiki and build context that way, classifying results with a smaller LLM (I use Ministral 8B), also without embeddings.
Anonymous No.106124917 [Report]
>>106124891
It's up to them to merge now.
Anonymous No.106124923 [Report]
>>106124898
How? I thought you had 2 models on separate servers, main llm and an embedding one like those new qwens, and then you add retrieved text near the end like character card, not at the beginning like sysprompt.
Anonymous No.106124924 [Report] >>106127553
>>106124883
Most people just use the in-built lorebooks from ST. Those are really primitive and work with pre-defined trigger words so when you bring up the "cock-mongler3000" the lorebook entry is inserted.
RAG with vector storage works as you said. You dump your data in a big vector db and the frontend calls the top-k best results to add based on similarity determined by the vectors. I haven't bothered with it for RP but it works fine for the shit we use it for at work.
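The vector-store flavor really is only a few lines; a toy sketch (the sentence-transformers model name is just an example, any embedder works, and the chunks are made-up placeholders):

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")        # example embedding model

chunks = ["Old Man Henlock runs the village tavern.",     # pre-split lore/wiki/notes
          "The northern pass is closed every winter.",
          "Foxgirls in this setting have two tails."]
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query, k=2):
    qv = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ qv                               # cosine sim, vectors are unit-norm
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

context = "\n".join(retrieve("who owns the tavern?"))
# prepend `context` to the prompt before it goes to the main model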
Anonymous No.106124948 [Report]
It's berry season's eve... what are we gonna do bros? I'm not ready
Anonymous No.106124953 [Report] >>106126349
>>106124899
My use case was not for ERP. The model is probably more permissive with sex stuff, but not with Jews + keywords.

>Anon: "Why Jews do X?"
>Gemma: "Language boy. I will not participate in this discussion. Here read the fucking manual yourself on ADL site."
Anonymous No.106125083 [Report] >>106125106 >>106125119 >>106125129 >>106125187 >>106125600 >>106127746 >>106127982
I think /lmg/ is in denial about what's about to happen. Historically, the jump between main GPT versions was massive. GPT2 was the first model that showed that modern llms scale, gpt3 was a huge step forward and made it all usable. GPT4 truly kicked off the ChatGPT-era and the AI craze as a whole.
And now after two years of working on something 'worthy' of being called GPT5, it's about to release. This is going to be bigger than anything we've seen in the past two and a half years.
Anonymous No.106125106 [Report]
>>106125083
I don't believe that for a second but it would be nice to get a big jump and not just an incremental improvement for once.
Anonymous No.106125119 [Report]
>>106125083
l stands for local
Anonymous No.106125129 [Report] >>106127592 >>106127615
>>106125083
[x] Doubt.
OAI is too kneecapped by 'safety' and other bullshit to do anything approaching revolutionary. It's going to be the same shit with 10% more knowledge and some longer context.
Anonymous No.106125187 [Report]
>>106125083
Safest yet.
Anonymous No.106125243 [Report] >>106125256
>>106124212
Believe it or not, normal sex shit, but it reacts to OOC commands like "write out this sex scene when X does Y"
Anonymous No.106125256 [Report] >>106125281
>>106125243
That's bizarre to me, are you on the old one or the new one? Because the new one is unreal horny by my standards, and it's taken every degen character from chub I've given it and just run away with it.
What's your system prompt look like?
Anonymous No.106125259 [Report] >>106125353 >>106125425 >>106125747 >>106125756
horizon alpha is currently the safest model in existence according to eqbench, surpassing gemma, o3, kimi and maverick. sama promised and sama delivered
Anonymous No.106125281 [Report]
>>106125256
New one, I feel like my standard prompt that I used for Mistral a while ago might be retarded and is causing it
Anonymous No.106125299 [Report] >>106125313 >>106125381
Damn... exl3 really isn't too great with prompt processing. Shame. I'll try getting the exl2 version too to compare. This is on two 3090s:

bullerwins_Qwen3-30B-A3B-Instruct-2507-exl3-6.0bpw (qwen3_moe, 31B, 21.8 GB) tabbyapi 056527c exllamav3: 0.0.4
3 Requests gen: 39.5 Tokens/sec Total: 1536 processing: 764.2 Tokens/sec Total: 12991

Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL.gguf (qwen3moe, 31B, 24.5 GB) llama.cpp 5937(bf9087f5)
3 Requests gen: 34.8 Tokens/sec Total: 1536 processing: 1650.0 Tokens/sec Total: 13398
Anonymous No.106125313 [Report] >>106125340
>>106125299
exl3 isn't for outdated architectures like ampere
Anonymous No.106125340 [Report] >>106125350 >>106125381
For comparison, there's exl2 vs lcpp (I couldn't find exl2 quant for A3B-2507):

lucyknada_prince-canuma_Ministral-8B-Instruct-2410-HF-exl2_6.0bpw (mistral, 8B, 6.3 GB) tabbyapi 056527c exllamav2: 0.3.1
2 Requests gen: 55.3 Tokens/sec Total: 1024 processing: 4730.8 Tokens/sec Total: 14287

Ministral-8B-Instruct-2410-Q6_K_L.gguf (llama, 8B, 6.4 GB) llama.cpp 5937 (bf9087f5)
2 Requests gen: 40.0 Tokens/sec Total: 320 processing: 3465.1 Tokens/sec Total: 14093

>>106125313
i will find you and i will hurt you
Anonymous No.106125350 [Report] >>106125355
>>106125340
He's right, though. It's just not supported. Look at the documentation.
Anonymous No.106125353 [Report] >>106125363
>>106125259
>safer than llama4 and fucking gemma
lmao, this is the summer that killed llms
Anonymous No.106125355 [Report]
>>106125350
My response was not about him saying it's not supported, which I know it's not because I saw the author write about it (and he also said temporary iirc), but rather about calling ampere obsolete.
Anonymous No.106125363 [Report] >>106125378 >>106125398
>>106125353
Glm is probably still good
Anonymous No.106125378 [Report]
>>106125363
We'll hopefully know soon, it looks like draft PR for support in llamacpp is finally not outputting nonsense.
https://github.com/ggml-org/llama.cpp/pull/14939#issuecomment-3148320541
llama.cpp CUDA dev !!yhbFjk57TDr No.106125381 [Report] >>106125391 >>106125632
>>106125299
>>106125340
I am not familiar with the ExLlama source code but generally speaking it is much more difficult to do prompt processing efficiently with a MoE model vs. a dense model.
So I think that to some degree it's expected that the MoE model would perform worse.
Anonymous No.106125391 [Report] >>106125632
>>106125381
It's more about discrepancy between versions. exl2 prompt processing is faster than lcpp, but exl3 prompt processing is slower than lcpp. I'm pretty sure this would also apply to dense. I guess I should download two dense models to compare exl3 and lcpp.
Anonymous No.106125398 [Report] >>106125416
>>106125363
tested both full glm and air via mlx, it's like davidau finetuned qwen
Anonymous No.106125416 [Report]
>>106125398
Explain this then https://huggingface.co/zai-org/GLM-4.5/discussions/12
Anonymous No.106125425 [Report]
>>106125259
>gemma-3-4b was the top
Amazing...
Anonymous No.106125565 [Report]
step3... gguf...?
Anonymous No.106125600 [Report]
>>106125083
yeah no.
we've been at this LLM plateau for at least a year now, and in that time the focus has mostly been on tooling, because innovation on the base technology has hit a standstill.
Nearly every AI company is now in cash cow mode to get a return on investment, don't see why OpenAI would be any different.
Anonymous No.106125632 [Report]
>>106125391
>>106125381
Yeah, also similar difference for dense (although not as pronounced as for moe):

turboderp-Qwen3-8B-exl3-6.0bpw (qwen3, 8B, 6.5 GB) tabbyapi 056527c exllamav3: 0.0.4
3 Requests gen: 31.3 Tokens/sec Total: 782 processing: 3743.2 Tokens/sec Total: 12989

Qwen-Qwen3-8B-Q6_K.gguf (qwen3, 8B, 6.3 GB) llama.cpp 5937 (bf9087f5)
3 Requests gen: 36.5 Tokens/sec Total: 1536 processing: 4775.2 Tokens/sec Total: 13352

So exl2 is faster than lcpp for pp, but lcpp is faster than exl3, on 3090.
Anonymous No.106125636 [Report] >>106125647 >>106125710
You now remember Mistral Large 3.
Anonymous No.106125647 [Report]
>>106125636
motherfucker now i'm breathing manually.
Anonymous No.106125654 [Report] >>106125662
>With absolutely no mention of height in the card, character decided that they were 5'9" and had a severe complex about their height
Kek, qwen was trained on manlet rage
Anonymous No.106125662 [Report]
>>106125654
well, it's an asian model
Anonymous No.106125710 [Report]
>>106125636
If it actually ends up coming out, I bet it'll be bigger than qwen and glm and yet somehow worse.
Anonymous No.106125747 [Report]
>>106125259
drummer WILL deliver and corrupt it into the most unsafe and evil model in existence. just like in my japanese drawings
Anonymous No.106125756 [Report]
>>106125259
Horizon beta is much safer.
Anonymous No.106125806 [Report] >>106125955 >>106125984
======PSA NVIDIA ACTUALLY FUCKED UP CUDA======
cuda 12.8 570.86.10:
got prompt
Loading model and applying LoRA weights:: 100%|| 731/731 [00:39<00:00, 18.69it/s]
Sampling 81 frames at 640x480 with 4 steps
100%|| 4/4 [02:46<00:00, 41.51s/it]
VAE decoding: 100%|| 2/2 [00:20<00:00, 10.25s/it]
*****Prompt executed in 246.59 seconds
got prompt
Initializing block swap: 100%|| 40/40 [00:00<00:00, 6499.02it/s]
Sampling 81 frames at 640x480 with 4 steps
100%|| 4/4 [02:46<00:00, 41.67s/it]
VAE decoding: 100%|| 2/2 [00:20<00:00, 10.21s/it]
*****Prompt executed in 188.62 seconds
got prompt
Initializing block swap: 100%|| 40/40 [00:00<00:00, 4924.34it/s]
Sampling 81 frames at 640x480 with 4 steps
100%|| 4/4 [02:57<00:00, 44.36s/it]
VAE decoding: 100%|| 2/2 [00:23<00:00, 11.65s/it]
*****Prompt executed in 202.30 seconds
I first found out about this when updating from CUDA 12.6 to 12.8 to test out SageAttention 2++, but then I noticed it was slower. I reverted SageAttention to the previous version and the speed stayed the same (slower), then I reverted to CUDA 12.6 (simply moved the /usr/local/cuda link to /usr/local/cuda.new and made a new link: ln -s /usr/local/cuda12.6 /usr/local/cuda). If you still have an older version of CUDA installed, it's worth checking out. Drivers also play a minor role, but it's negligible (see picrel).
PS: SageAttention 2 right before the 2++ update, PyTorch 2.7.1+cu128 (even when testing with CUDA 12.6).
Don't believe me? A quick search gets you:
https://github.com/pytorch/pytorch/issues/155607
https://www.reddit.com/r/LocalLLaMA/comments/1jlofc7/performance_regression_in_cuda_workloads_with/ (all 3000 series)
Another anon (3090) also reports a big speedup after switching from CUDA 12.8 to CUDA 12.6 >>106121370
t. 3060 12GB + 64GB DDR4 RAM
might only apply to 3000 series
cudadev tell jensen about this
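If you want to double-check what you're actually running before and after swapping the symlink, a quick sketch (torch reports the toolkit its wheel was built against, while nvcc shows the toolkit behind /usr/local/cuda that self-compiled extensions like sageattention build with):

import subprocess, torch

# torch.version.cuda is the toolkit the wheel was built against (e.g. "12.8" for a cu128 wheel),
# not whatever /usr/local/cuda currently points at
print("torch built with CUDA:", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0))

# the symlinked toolkit is what nvcc uses, i.e. what JIT/self-compiled extensions get built against
out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
print(out.strip().splitlines()[-1])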
Anonymous No.106125881 [Report] >>106125904 >>106125923 >>106126214
any models i can run on an 8GB vram gpu that would let me tag images with simple terms like nsfw anime or animal or something?
Anonymous No.106125904 [Report]
>>106125881
gemma3 4b or 12b.
Anonymous No.106125923 [Report] >>106126014
>>106124797
explain to me why i should upgrade my quest 3s from v74 without sounding angry
>>106125881
joycaption or florence
pretty sure you're gonna have more luck asking in >>>/g/ldg
Anonymous No.106125955 [Report]
>>106125806
You have to put your point at the beginning of the post because nobody is going to read 30 lines of logs to figure out that your shit is slower with 12.8 than it was with 12.6.
llama.cpp CUDA dev !!yhbFjk57TDr No.106125984 [Report]
>>106125806
This could be an application software issue rather than a CUDA issue.
Choosing which kernel to run for a given operation is extremely finicky and the choice may depend on the CUDA version.
Just recently I found that the kernel selection logic I made for my consumer GPUs at stock settings is suboptimal for the same GPUs with a frequency limit (up to ~25% end-to-end difference).
So conceivably, since datacenter GPUs tend to have lower frequencies than consumer GPUs, some component in the software stack is choosing to run a kernel that is only available with CUDA 12.8 and faster on datacenter GPUs but slower on consumer GPUs.
Anonymous No.106126014 [Report] >>106126036
>>106125923
If you rooted your quest like I did, you should never update
Anonymous No.106126036 [Report] >>106126047 >>106126247
>>106126014
wtf what version do you need to root your quest? is there a benefit as to why i should root my quest? i only disabled updates with adb disable-user com.oculus.updater
Anonymous No.106126047 [Report]
>>106126036
It will make adolf hitler sauce squirt into your asshole
Anonymous No.106126072 [Report]
Is GLM 4.5 better than 4pus?
Anonymous No.106126174 [Report] >>106127530
>>106120003
They "improved" alpha and made it into beta. What more do you need to know?
Anonymous No.106126202 [Report] >>106126270
I want local o3 at max 24 GB VRAM. What's the closest I can get?
Anonymous No.106126214 [Report]
>>106125881
use wd tagger by smilingwolf - it's much more precise and faster even on cpu
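If all you need is a handful of coarse labels like that rather than full booru tags, a zero-shot CLIP classifier is another option that fits in 8GB easily. Minimal sketch (model name and labels are just example choices, not something recommended above):

from transformers import pipeline

# zero-shot image classification with CLIP; the base model is a few hundred MB
# and runs fine on CPU or any 8GB card
clf = pipeline("zero-shot-image-classification", model="openai/clip-vit-base-patch32")

labels = ["nsfw anime", "sfw anime", "animal", "photo of a person"]  # whatever coarse tags you want
result = clf("image.jpg", candidate_labels=labels)
print(result[0])  # highest-scoring label and its score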
Anonymous No.106126240 [Report] >>106126288
Is Openrouter down for anyone else?
Getting "Application error: a client-side exception has occurred (see the browser console for more information)." when trying to access a model page
Anonymous No.106126247 [Report] >>106126267
>>106126036
Literally for the sake of uid=0 and full access to fs, no practical applications unless you want to tinker with it
Anonymous No.106126253 [Report]
>>106121322
The first llama 1 leak was the only unslopped model there will ever be.
Anonymous No.106126267 [Report] >>106126322
>>106126247
so is it possible to root the quest 3/3s? afaik snapdragon chips have good protection and i doubt meta fucked up security
what headset did you root and on what version was it?
Anonymous No.106126270 [Report] >>106126294 >>106126380
>>106126202
To do what exactly?
Try the new qwen 3 big moe if you have enough ram.
Anonymous No.106126274 [Report] >>106126471
>>106121587
On the flipside, researching topics yourself helps with retention. And llms will never not hallucinate.
Anonymous No.106126285 [Report] >>106127661
>>106121762
This is what NovelAI did up until they decided to finetune the worst SoTA model they could find. If they used their textgen dataset to train a large model with modern techniques, it would be the goat for text.
Hell, even if they just released the weights for Kayra, I'm sure we could do wonders with it.
Anonymous No.106126288 [Report]
>>106126240
Nobody here should care.
Anonymous No.106126294 [Report]
>>106126270
>qwen 3 big moe
I wouldn't trust qwen 3 big moe with ERP let alone anything serious. I love the things it writes but it is basically an Undi frankenmerge.
Anonymous No.106126295 [Report] >>106126310 >>106126317
>>106121932
Kid, what you're saying makes no sense. You can plug anything into plain text generation regardless of the kind of fine tune it has received, and it will just continue spitting out tokens.
For having been here for 2 years (wow, two WHOLE years, what a big boy!), you have a very poor understanding of how LLMs work.
Anonymous No.106126307 [Report]
>>106122086
I remember when /lmg/ snickered at the mere mention of SillyTavern and now it's full of children who don't understand that you can just generate tokens with llama.cpp
Anonymous No.106126310 [Report]
>>106126295
midwit take
Anonymous No.106126317 [Report]
>>106126295
cockbench says you're wrong.
Anonymous No.106126322 [Report] >>106126338
>>106126267
Yes. https://github.com/FreeXR/exploits they have a group on Discord
Anonymous No.106126329 [Report] >>106126343
>>106124403
Huh? Skill issue. My most savage cunny card is called Gemma for a reason.
Anonymous No.106126338 [Report]
>>106126322
well thats cool, ill keep it bookmarked, v74 is newer than march by a month or two so rip but thanks either way anon <3
Anonymous No.106126343 [Report]
>>106126329
You have to be a fucking weirdo to be into cunny and to get off to gemma writing about cunny.
Anonymous No.106126349 [Report]
>>106124953
Each time this happens, add that question and the answer you would expect to the example dialogue field. Guaranteed success.
Anonymous No.106126367 [Report] >>106126373 >>106126386
>a ton of capable base models
>people only finetune qwen
Why?
Anonymous No.106126373 [Report] >>106126382 >>106126385
>>106126367
Which one do you want to see fine tuned?
Anonymous No.106126380 [Report]
>>106126270
>To do what exactly?
Unlimited free (except for electricity) vibe coding.
Anonymous No.106126382 [Report]
>>106126373
kimi k2
Anonymous No.106126385 [Report] >>106126433
>>106126373
Gemma, GLM, K2, DS V3
Anonymous No.106126386 [Report]
>>106126367
Because it's very good at distilling other models for some reason.
Anonymous No.106126392 [Report] >>106126398 >>106126416 >>106126421
where do anons find new erp finetunes nowadays?
hard mode: no platform that requires an account
Anonymous No.106126398 [Report] >>106126415
>>106126392
Their authors come here and spam the thread.
Anonymous No.106126415 [Report] >>106126442
>>106126398
that doesnt happen anymore anon, undi is dead, akaridev is acked, sao is no more, envoid is void, drummer is only active in his little d*scord community and new cydonias arent that good, he made a llama 4 scout finetune and didnt even post about it here. its over
Anonymous No.106126416 [Report] >>106126591
>>106126392
base model on hugging face -> finetunes/merges -> sort by recently created
Anonymous No.106126421 [Report] >>106126591
>>106126392
uncensored intelligence leaderboard
Anonymous No.106126433 [Report] >>106126456
>>106126385
>Gemma
Dogshit instruct, and nobody finetunes from actual pretrains
>GLM
Barely anyone can properly run, nevermind quant or finetune it, hold your horses
>K2
Nobody's finetuning a 1T model except corpos or big labs, no sloptunes for you.
>DS V3
Same boat as K2, really. Too xboxhueg for anyone except corpos like perplexity with their murrika tune.
Anonymous No.106126442 [Report]
>>106126415
Good.
Anonymous No.106126450 [Report] >>106126466 >>106126490 >>106126498 >>106126507
GLM4.5 PR is finally out of draft and ready for review/commit, support soon™
https://github.com/ggml-org/llama.cpp/pull/14939
Anonymous No.106126456 [Report] >>106126463 >>106126477 >>106126498
>>106126433
>Barely anyone can properly run
Weren't these 9 and 32B models before the 4.5?
Anonymous No.106126463 [Report] >>106126494
>>106126456
they were good for their size, but the 8k-only context kept most people from using them; GLM4.5 is just a way smarter deepseek imo
Anonymous No.106126466 [Report] >>106126505
>>106126450
Did they calculate perplexity using transformers or vLLM and compare it to their implementation yet?
That's the thing they should always do when finishing adding a new model to be decently certain they didn't fuck anything up.
Anonymous No.106126471 [Report] >>106127648
>>106122459
>>106126274
Hallucination is a minor issue, since we're humans and have self awareness, knowing not to blindly trust anything including our own flawed memories. For serious tasks, you will always want to verify your information regardless. Perhaps some subhumans lack that awareness though.

It's funny that retention is mentioned, as I feel that I engage with content more deeply by using LLMs, since I'm able to easily interact with the content, even if they hallucinate and give me wrong information, which in the end is again inconsequential. Not like the internet hasn't been filled with misinformation since forever anyway.
Anonymous No.106126477 [Report] >>106126494
>>106126456
Oh, well yeah. But finetuning them at this point seems rather silly when the next gen is here and practically anyone who could run the 32B dense should be able to run the air MoE.
The 32B also had small context and some weird issues IIRC, despite impressive pop culture knowledge for the size.
I didn't hear about anyone using the 9b, so I've no idea if it was a worthwhile base to train in that size bracket.
Anonymous No.106126490 [Report] >>106126840
>>106126450
The hype already died. Forgotten just like Ernie.
Anonymous No.106126494 [Report]
>>106126463
Yeah, sure. I was commenting on
>Barely anyone can properly run, nevermind quant or finetune it,

>>106126477
Got it.
I took anon's grievances as a historical statement.
Anonymous No.106126498 [Report]
>>106126450
Finally, I'm ready.
>>106126456
>32B
GLM4 doesn't really need a finetune in my experience. It's really good, at least for creative writing which is the only thing I use local models for. Just needs bigger context which I'm hoping 4.5 has, although I think the benchmarks said the context sucks again.
Anonymous No.106126505 [Report] >>106126522
>>106126466
Off the top of my head, I don't think so. I do recall seeing some mention of ppl, but I think that was in the ik_llama pr.
Anonymous No.106126507 [Report]
>>106126450
Oh my god one more day!
Anonymous No.106126522 [Report]
>>106126505
Well, that's dumb.
There's a reference implementation right there you can use to compare.
Anonymous No.106126580 [Report] >>106126585 >>106126623
48GB vramlet bros, what are you using?
Anonymous No.106126585 [Report]
>>106126580
credit card to buy ram
Anonymous No.106126591 [Report] >>106126765
>>106126416
>>106126421
i love you anons <3
Anonymous No.106126623 [Report]
>>106126580
R1-0528 and hopefully GLM4.5 soon on my 48GB VRAM server with 256gb RAM
Anonymous No.106126765 [Report]
>>106126591
slut
Anonymous No.106126786 [Report] >>106127478
I wish nemo wasn't such a thirsty bitch. Every single description of sex it gives, no matter the context, is as ridiculous as possible. Nemo has no clue what a virgin is.
Anonymous No.106126801 [Report] >>106126815
Is GLM better than DeepSeek and K2 at RP?
Anonymous No.106126815 [Report] >>106126836
>>106126801
far better imo, fixes all the schizoness
Anonymous No.106126836 [Report] >>106126862
>>106126815
Cool, should I use it with chat or text completions? Do I need a preset?
Anonymous No.106126840 [Report]
>>106126490
Ernie was bad for sex though.
Anonymous No.106126862 [Report] >>106126880
>>106126836
I used the same preset I had for sonnet 3.7 and it works well, it's not very censored
Anonymous No.106126880 [Report] >>106126888
>>106126862
Thanks, will give it a try. Do you think it's comparable to any claude model?
Anonymous No.106126888 [Report]
>>106126880
it's certainly a ton closer to it than deepseek was
Anonymous No.106126927 [Report] >>106126937 >>106126944 >>106126961
Wow you guys. I'm actually running Deepseek R1 at home, on an RTX 4060
how lit is that?
Anonymous No.106126937 [Report]
>>106126927
How many tokens per second for gen and pp? If you say 5 it's not very lit.
Anonymous No.106126944 [Report] >>106126956 >>106126961
>>106126927
ollama run deepseek-r1
Anonymous No.106126952 [Report] >>106127860
Anonymous No.106126956 [Report] >>106127382
>>106126944
kek, I will never forgive them for that
Anonymous No.106126961 [Report]
>>106126944
>>106126927
7b:q2
Anonymous No.106127010 [Report] >>106127068
Reminder that this general is Mikupilled so post more Miku lmao
Anonymous No.106127068 [Report]
>>106127010
It is HRT pilled.
Anonymous No.106127073 [Report] >>106127098 >>106127102 >>106127103
https://huggingface.co/mradermacher/XBai-o4-GGUF

Did anyone fuck it yet? Report the findings of your dick to the class.
Anonymous No.106127098 [Report]
>>106127073
just benchmaxxed qwen3 32b
Anonymous No.106127102 [Report]
>>106127073
>qwen3forcausalLM
>merges.txt in the original repo
yeah no
Anonymous No.106127103 [Report] >>106127132
>>106127073
> "model_type": "qwen3"
Anonymous No.106127132 [Report] >>106127155
>>106127103
glm4/Z was built on qwen2.5 arch while being entirely its own thing.
So that doesn't mean shit. fucking lurk more you normalfag trash
Anonymous No.106127155 [Report]
>>106127132
"model_type": "glm4"
Anonymous No.106127210 [Report] >>106127230 >>106127242 >>106127271 >>106127298
>>106119921 (OP)
mi50?? Is there a catch to buying 32GB of VRAM for less than $300, other than making sure you're cooling it correctly?
Anonymous No.106127230 [Report]
>>106127210
Terrible software support.
Anonymous No.106127242 [Report]
>>106127210
I imagine it doesn't have that much compute and it's Vega, so support is probably pretty shit.
Anonymous No.106127271 [Report]
>>106127210
You would be better off going Intel.
Anonymous No.106127298 [Report]
>>106127210
It's AMD
Anonymous No.106127382 [Report] >>106127457
>>106126956
>ollamao
a grift so pure, so completely divorced from any sense of decency or morality... it brings a tear to the eye, it's so beautiful.
It's like the heavens parted and a choir of pure silicon valley energy sang out as the techbro merged the PR.
Anonymous No.106127457 [Report] >>106127470 >>106127631
>>106127382
I like how they refer to llamacpp as a "dependency"
Anonymous No.106127470 [Report] >>106127474
>>106127457
They depend on it. It's the correct term.
Anonymous No.106127474 [Report]
>>106127470
It's an understatement.
Anonymous No.106127478 [Report] >>106127564
>>106120744
>>106126786
which nemo is the good nemo?
Anonymous No.106127530 [Report]
>>106126174
Can't wait for gamma, then.
Anonymous No.106127545 [Report]
The narwhal berries at midnight.
Anonymous No.106127553 [Report] >>106127608 >>106127654
>>106124924
>>106124883
Here's a simple RAG demo card I built, with instructions for setting it up in ST, in case anyone wants to play with it.
I use lorebooks extensively and thought it could augment them. I've yet to find a compelling use for it though.
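For anyone curious what the retrieval step boils down to outside of ST (this is just the general idea, not what the card or ST's vector extension literally does; the model name and chunks are placeholders):

from sentence_transformers import SentenceTransformer, util

# tiny embedding model; the chunks would normally come from your documents / data bank
model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = [
    "Mary grew up on a farm in Ohio.",
    "Mary is terrified of thunderstorms.",
    "Mary's cat is named Soot.",
]
chunk_emb = model.encode(chunks, convert_to_tensor=True)

query_emb = model.encode("What is Mary afraid of?", convert_to_tensor=True)
hits = util.semantic_search(query_emb, chunk_emb, top_k=2)[0]
context = "\n".join(chunks[h["corpus_id"]] for h in hits)
print(context)  # this is what gets stuffed into the prompt, same idea as a lorebook entry firing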
Anonymous No.106127564 [Report] >>106127706
>>106127478
Fuck if I could tell you. I messed with like 20 different ones so far and all I can tell you is that the specialty-trained ones using shit like Gutenberg are subpar.
I've been using Marlin V8, Nemo Mix Unleashed and Rocinante. And the differences between them are mostly minor.
Anonymous No.106127592 [Report] >>106127673
>>106125129
This is my take as well. US llm providers have effectively kneecapped themselves through self censorship and observing copyright concerns. China shares neither of those concerns, apparently.
We’ll see what gpt5 looks like, but I’m not holding my breath.
Anonymous No.106127608 [Report] >>106127654
>>106127553
...did you forget to add the card?
Anonymous No.106127615 [Report]
>>106125129
Their big selling point will be that it beats o3-high at a price point that is actually sane, because they ripped off DeepSeek.
Anonymous No.106127631 [Report]
>>106127457
Uhm actually, it's a supported backend.
Anonymous No.106127647 [Report] >>106127664
Anonymous No.106127648 [Report] >>106127710
>>106126471
I feel like anyone flipping out about hallucinations has never talked with anyone irl. I remember talking to boomers in the 80s getting advice on working on cars, etc. Those guys were only right about half the time and you had to know, even as a kid, what info to ignore.
ChatGPT is more accurate than boomers, and I'm not sure truth even objectively exists. So it seems like an improvement, to me, to have llms try to make sense of what's in their training corpus rather than me using lmao google.
Anonymous No.106127654 [Report]
>>106127608
Ffs
https://chub.ai/characters/NG/mary-rag-demo-b0e12a34df58
>>106127553
Anonymous No.106127661 [Report]
>>106126285
You're an actual shill.
Anonymous No.106127664 [Report]
>>106127647
How much VC money does it take to update dependencies?
Anonymous No.106127673 [Report] >>106128151
>>106127592
>We'll see what GPT-5 looks like
Horizon Beta
Anonymous No.106127679 [Report] >>106127691 >>106127712 >>106127744 >>106127769
What's the best model I can run as a RAMlet (32GB RAM + 16GB VRAM)?
Anonymous No.106127691 [Report]
>>106127679
paying 20 cents per mill for glm4.5 on OR
Anonymous No.106127706 [Report]
>>106127564
which one of those is good for degen stuff?
or pretty much the same?
Anonymous No.106127710 [Report] >>106127747
>>106127648
and in the cases where LLMs do make mistakes, they are much more reasonable about being corrected than stubborn humantards who will endlessly defend their hallucinations out of pride
Anonymous No.106127712 [Report]
>>106127679
nemo
Anonymous No.106127744 [Report]
>>106127679
qwen 30b 2507
Anonymous No.106127746 [Report]
>>106125083
OpenAI had some proprietary models that were superior to its current offerings. The o3 preview in December was far, far stronger than what they released to the plebs in April
The issue, obviously, is that it cost literal thousands for one of the benchmark tasks. They must not have been able to get the prices down, so they released a weaker version without saying so
Maybe GPT-5 will actually get closer to the original o3 in intelligence and have a not fucking psychotic price, but we'll see
Anonymous No.106127747 [Report] >>106127769
>>106127710
>humantards who will endlessly defend their hallucinations out of pride
wrong
Anonymous No.106127769 [Report]
>>106127747
Clever.

>>106127679
Mistral Nemo.
Gemma 3, Mistral small, Qwen 30BA3B are worth a try too.
Anonymous No.106127773 [Report] >>106127806 >>106127882 >>106127896
what quants are people using for the big Qwen3? With both the new version and the old, I can't get it to output commas during narrative sequences. Is it wrong to use ChatML or something? I've used the recommended settings, neutral samplers, and a variety of settings both more and less permissive than the recommended ones. Makes no difference what I do. Currently using the instruct at q3_s, but I've also used the q3_XL from unsloth or whatever. I'm wondering if it's quant degradation, or if the unsloth guys fucked up the quant I downloaded (I know they've had issues before). pls send help
Anonymous No.106127787 [Report]
>>106127784
>>106127784
>>106127784
Anonymous No.106127806 [Report]
>>106127773
I use exl3 quants :v
Anonymous No.106127860 [Report]
>>106126952
Miku stuck
Anonymous No.106127882 [Report]
>>106127773
I'm using unsloth's Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL and ChatML without any problems.
Not getting any commas is bizarre, but it does have a very, very strong tendency to devolve into using lots of single lines in a dramatic way if you let it.
I just yell at it with an OOC comment and tell it to keep using paragraphs periodically, seems to fix it.
Anonymous No.106127896 [Report]
>>106127773
weird issue. I'm using a Q2K and haven't seen anything like that; I noticed the UD-Q2KXL versions were a little off in comparison but it wasn't any specific behavior like that, they just felt a little dumb.
are you using token biases, maybe?
also, this is a schizo longshot, but are you using koboldcpp? they had (and probably still have?) semi-fucked up support for qwen models because of the weird tokenization logic they use. in the past this was because qwen ggufs end up with a bos token id of 11, which is a comma (!) for qwen models. this is supposed to be ignored because they have some ignore_bos flag set but kobold ignores it in some cases. just the fact that you're having this weird issue and the problematic token being a comma makes me connect the dots... maybe see if regular llama-server gives you better results
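Quick way to check the token id 11 claim yourself (assumes the HF tokenizer matches what's baked into the GGUF; the repo name is just a handy Qwen3 to pull):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
print(repr(tok.decode([11])))        # should print ',' if the claim holds
print(tok("Hello there").input_ids)  # note whether a BOS token gets prepended at all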
Anonymous No.106127982 [Report]
>>106125083
OAI has no moat. Google and xAI have both surpassed the company, and China is right on their ass. You sorry bootlickers continue to shill for them for whatever reason though
Anonymous No.106128065 [Report]
>>106120592
glm 4 air is legit, but the censorship will be bad if the 32b version is any indication
Anonymous No.106128131 [Report] >>106128186 >>106128411
>>106120082
OAI is in a bind. It's not just the safety training. If they release a local model that isn't by far the best one available, it could cause serious investment problems. That they're even considering this is actually kind of shocking, because I have a hard time believing their model(s) won't get stomped by Chinese releases. Meta had this problem with the Llama 4 disaster, and they responded with this whole cringe AGI initiative to temper investor concerns. And it's no surprise that Meta voiced their reluctance to release any more open models.
Anonymous No.106128151 [Report] >>106128175
>>106127673
I really should go play with it, but I'm not feeling particularly motivated by these models anymore. It'll eventually get baked into ChatGPT I suppose, so I'll use it then.
Anonymous No.106128175 [Report] >>106128186
>>106128151
I don't think the goal or expectation is to beat anything. Even from the original poll, the specific wording was "o3-mini level model that is pretty small but still needs to run on GPUs". The difference between them and Meta is knowing how to set expectations. Remember Zuck saying that Llama 4 would "lead"?
Anonymous No.106128186 [Report]
>>106128175
Meant for >>106128131
Anonymous No.106128411 [Report]
>>106128131
>Meta had this problem with the Llama 4 disaster,
Qwen bailed them out with the Qwen 3 disaster. Granted, Qwen 3 was usable, it just conversed in chinkgrish and struggled with gender (which is unacceptable for 235 billion parameters, thoughbeit). That whole generation of local is basically lost.
Anonymous No.106128759 [Report] >>106128804
This is the Kimi K2 8-bit quant from OR (there are only ai_coding_agent.py and ai_coding_agent_.py).

> do not make any changes yet

Understood - I won't make any changes.

Looking at the current directory, I can see these files:
- `ai_coding_agent.py`
- `ai_coding_agent_.py`
- `ai_coding_agent__.py`
- `ai_coding_agent___.py`
- `ai_coding_agent____.py`
- `ai_coding_agent_____.py`
- `ai_coding_agent______.py`
- `ai_coding_agent_______.py`
- `ai_coding_agent________.py`
(keeps going forever like this)


Proprietary models would never make such a silly mistake, and I don't think it's just because they use some sampler hack like repetition penalty.
Anonymous No.106128804 [Report] >>106128828
>>106128759
>8b quant from OR
Is it verifiable that it's 8bit?
Send your reply to the correct thread. This one is not.
Anonymous No.106128828 [Report]
>>106128804
Actually I lied. It's not running on OR, I have an H100 cluster in my garage.