/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads:
>>106113484 &
>>106108045►News
>(08/01) XBai o4 32B released: https://hf.co/MetaStoneTec/XBai-o4>(07/31) Qwen3-Coder-30B-A3B released: https://hf.co/Qwen/Qwen3-Coder-30B-A3B-Instruct>(07/31) Command A Vision: Built for Business: https://cohere.com/blog/command-a-vision>(07/31) Step3 multimodal reasoning 321B-A38B released: https://stepfun.ai/research/en/step3>(07/31) Committed: llama-server : implement universal assisted decoding: https://github.com/ggml-org/llama.cpp/pull/12635►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/tldrhowtoquant
https://rentry.org/samplers
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
Censorbench: https://codeberg.org/jts2323/censorbench
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread:
>>106113484--Paper (old): When Bad Data Leads to Good Models:
>106119129 >106119412--MoE efficiency vs dense models under sparsity, hardware, and deployment constraints:
>106114397 >106114859 >106114920 >106115069 >106116048 >106116070 >106116124 >106116084 >106116548 >106116593--Alleged benchmaxxing in MindLink-72B via Qwen2.5 base with test contamination concerns:
>106113679 >106113776 >106113807 >106117179 >106117203 >106117222--XBai-o4 32B model claims and skepticism over novelty and performance:
>106116827 >106116886 >106116863 >106116920 >106116942 >106116978 >106117065 >106117106 >106117125 >106117194 >106117141 >106117142 >106117154 >106117164--Debate over leaked model context length and training strategies for long-context LLMs:
>106117295 >106117317 >106117367 >106117621 >106117701 >106117924 >106118109 >106118182 >106118311--Determining context size and model loading limits:
>106113641 >106113669 >106113709 >106113714 >106113765 >106113775 >106113791 >106113814 >106113839 >106113857 >106114887 >106114993 >106113689--Future of dynamic parameter scaling in MoE architectures:
>106113836--Debate on whether LLM plateau stems from data exhaustion or suboptimal training and filtering:
>106118310 >106118322 >106118324 >106118325 >106118329--Phi-4's excessive reasoning loops waste tokens and frustrate users:
>106114878 >106114939 >106114995 >106116206 >106116288 >106116277--Horizon Alpha/Beta models show strong NSFW filtering and possible red teaming via user prompts:
>106114882 >106114903 >106115173 >106115377--New GLM-4.5 MoE pull request lands amid skepticism and hype cycles:
>106113884 >106113968 >106113992 >106114043 >106114050 >106114467 >106115095 >106115332--Miku (free space):
>106113767 >106114066 >106114076 >106114153 >106114457 >106114483 >106117524 >106119399►Recent Highlight Posts from the Previous Thread:
>>106114309Why?: 9 reply limit
>>102478518Fix: https://rentry.org/lmg-recap-script
https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF/tree/main/IQ4_KSS
u
md5: 862b412dde38010f8ef88e49e6ec4c67
🔍
Hey all some retard fucked up his smut writeup I told him I would read.
The concept is hot and the dialog is even good but the autist mixed 1rst 2nd and 3rd person language into the same scenes. Whats a quick option I can use that will read the whole thing and rewrite it in 3rd person?
I tried using perplexity.ai but it has a character limit and it also started making shit up.
AI newfag here, just a crumb of handholding please?
>>106119955Heeeey stop posting my face!
best erp model under 40b? im tired of the usual ones, and i havent seen any new ones either :(
>>106119966Which are the usual ones?
>>106119985Once and for all.
And all for once.
>>106119985rocinante cydonia ms mag mel mxxxxxxxx 22b meme merge
new qwen3 3b 30b is nice but sloppy
Horizon Beta is a rather cucked model.
Horizon Alpha is somewhat better, but still NSFW-avoidant.
Hopefully the open-weight OAI models don't end up being like the Beta one.
>>106119992Well, then you are out of luck until the Aboleths come from the far realms with their models trained on meat computers and fueled by distorted souls.
>>106119998Have you tried QwQ and its fine tunes like Snowdrop?
>>106119965H-hey! S-stop that!
>>106120008uggggggggggggghhhhhhhhh 4t/s and thinking?! fine ill try them out, i multitask even with 20t/s anyways
thanks for the recommendation anon <3
>>106119957Dunno. Depends on your hardware. Read the lazy guide in the OP. Download this model:
https://huggingface.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF/blob/main/Mistral-Nemo-Instruct-2407-Q4_K_M.gguf
or whichever you can fit on your system and give it a go.
Play around with the model itself. It's a tool. Learn to use it.
If the text is long, don't try to do the whole thing at once. Grab a chunk, have it fix it and continue with the next. A simple instruction like "Rewrite the following text in 3rd person: {the text chunk here}" will get you started.
>>106120031You did say lower than 40B.
>>106119957>divide story into variables by paragraph break via regex>feed each paragraph and its predecessor for context (if it's not the first paragraph) to LLM and ask it to output a replacement paragraph that is completely unchanged other than third person perspective if it's not already.>overwrite old variable and write out to text file>repeat all the way to the end.
file
md5: 16f06ffa2986fd91321554fe80abf2ed
🔍
Why didn't they release the gguf's by now? They are trusted by wallmart.
>>106120041>>106120048That is very hwlpful. Thank you.
what happened to gpt5 and gpt-oss today
>>106120082Needs 2 more weeks of safety training.
"her top riding up" when she's leaning forward which would do the exact opposite. What is wrong with drummer forcing this phrase everywhere?
>>106120094a mere fortnight you say?
>>106119657For the first question I think we could maybe make a 7B model as good as a 70B model, but nut anything much more dramatic than that.
The local minima in neural networks generally results in accuracy values that are fairly close to the accuracy values of global minima.
At least when taking into account non CoT models. If we take into account CoT then it becomes a much more nuanced question. It's even possible that our current approach to CoT is fundamentally wrong and the model should think in its own machine language rather than human language for optimal optimal accuracy, and we just don't have enough computational power to find that optimal internal language just from random variations and RL.
As for the second question, I'm not sure how much these formalisms reflect what we think of as intelligence. Suppose we ask an oracle to find the optimal program that runs on current hardware and produces the closest possible approximation to some language dataset within a certain time limit. Once you have it you can't just use it to infer on other datasets. Maybe it could be used as a base to get a more general model, or maybe it's a one off thing that's impossible to adapt to some other task. I don't think we know the answer to that question with our current theoretical knowledge. So in Solomonoff induction, is the intelligence the product of the oracle, or the oracle itself? Like I say, the product of the oracle might not be practically useful. And if it's the optimizer itself, by the no free lunch theorem the only way to get faster inference on some problems (for example those with low Kolgomorov complexity) is by sacrificing performance on other problems, for example those with high complexity. But I don't understand why the no free lunch theorem is true (it seems trivial to find counterexamples that are asymptotically slower for all cases, for example for a problem with description of length n, before finding the answer compute Ack(n)) so I might be wrong.
>>106119586Well, transformers are obviously a subset of the "all possible programs" set, so yes, the optimal program is at least as good as the optimal transformer.
If I had one chance to ask an oracle the result of some computation regardless of the amount of memory or time it took, I'm not sure what it would be, though. Because like I said I'm not sure "intelligence" can be stated in forma terms easily.
file
md5: cd9167aeb821f343ffb40b7510dc610f
🔍
I can't believe they didn't max this one out.
>>106120273>I'm not sure what it would be,Ask the oracle to write a program that would output the same thing as the oracle itself for all inputs. Now the program is just as good as the oracle.
>I'm not sure "intelligence" can be stated in forma terms easilyRendering the question moot.
>>106120347Well it would be kind of difficult given that that one's a private bench. The creative writing one is entirely open.
>no chink model released today
it's over
file
md5: 86061d0decf668b97ce4f3bab1015d7d
🔍
>>106120008is this what qwq snowdrop is supposed to be like? using recommended samplers from https://huggingface.co/trashpanda-org/QwQ-32B-Snowdrop-v0
>>106120409What do you expect it to do?
>>106120409>she pissed herself in terrorsomewhat expected if your persona is blank
>>106120355The point of the oracle is not that it executes some special program, the point of the oracle is that it does magic (compute the output of a TM in constant time, or even solve the halting problem by returning a special "no halt" error code if the program runs forever).
If you ask it to output a program that does the same thing as the oracle, depending on the exact formulation of the question, it will return either the "no halt" error because there is no such program, or the empty set.
>>106120409Other than some repetition, what's the issue?
file
md5: 39394d3f2872a134cec61cd593a77043
🔍
>>106120502it isnt blank
>>106120454well idk its feeling samey compared to qwen 3 30b a3b thinking (new)
>>106120521general slop, but it seems promising so ill give it a more fair try than just a shitty sentence request
>>106120409>is this what qwq snowdrop is supposed to be like?No idea.
>using recommended samplersRemove
> top_a at 0.3, TFS at 0.75, repetition_penalty at 1.03,
>>106120530Tbh there just aren't any small models that are free of slop. Even most big models have slop.
>>106120558i wouldnt mind a 100b moe if good, i tried a few 70bs (iq4xs) and they werent that impressive (1-2t/s)
sucks that hunyuan moe is shit, llama 4 scout is shit, glm 4 air is probably shit from anon's tests when it came out but ill give it a spin once proper ggufs are out, dots llm is shit according to anons
rip
>>106120520>depending on the exact formulation of the question, it will return either the "no halt" error because there is no such program, or the empty set.We're discussing a hypothetical. My oracle can make a program that can replicate the function of the oracle itself perfectly. They'd be indistinguishable.
file
md5: 2ae05d1e66220ee50a638800d02aaf4c
🔍
snowdrop v0 is a bit silly
>>106120409snowdrop is a merge of qwq and regular instruct. mathematically speaking it should be shit.
>>106120675what am i suppos'ed to use? ms mag mell mxxxxxxxxx 22b?
>>106120614At that point it's not an oracle, it's a genie.
undi.. sao... envoid.. save us
>>106120692https://www.youtube.com/watch?v=kIBdpFJyFkc&t=128s
Or wait for glm air. That should run well on anything.
>>106120725Times are tough. Having multiple jobs is fairly common.
file
md5: 2a178c1675ecbf014b898b82058d88cc
🔍
come on man..
file
md5: 735422e77d3d1237d4162d8ccade3161
🔍
stablelm reigns supreme
>>106120675Why? It sounds like a fine idea. Merging a finetune with its base model should produce something that's mathematically like a weaker strength version of the finetune.
>>106120930I think it's pretty funny that they have a PR parallel to llama.cpp's to implement the new GLM MoEs.
So, the next step after MoE is to have every expert in a separate model, running on separate compute?
>>106121097Cudadev suggeste just that a couple months ago.
Or at least something close to that.
>>106121097Probably dumb. Like we already have models dedicated to coding, driving cars, vision, video, image gen, at best what is this going to add? I imagine more specialization, like dedicated trivia, history, R counting models. Maybe there would be models decicated not just to Java, but maybe a model specifically for building websites in java, one for making simple conversion scripts etc.
>>106121097I still think the ideal case would be finding a way where we could have an architecture where 99% of the model can be offloaded to disk cache and 1% on CPU with reasonable inference speeds
I'm not sure if that's possible due to its slowness, but disk cache is the shit everyone has plenty of and it's currently useless when it comes to inference. Solving this would make LLMs truly, actually local
file
md5: 902b859d5e34eb47026ff55ee87596b4
🔍
qwen3 30b a3b thinking (new) is a little nigger
>>106121190isn't this just --mmap
>>106121189I also don't get why people would want to take a full 5T R4 general model, remove 99% of experts and create R4 12B SEX!!!!!! from all the sex/anime/biology/writing experts.
>>106120558Ultimate trvthnvke blackpill: all models are slopped because they're trained on a relatively unbiased dataset of all human writing, and in that dataset the most similar types of writing to RP logs are femgooner "romance" novels and shit-tier fanfiction. The slop is just what the LLM (justifiably) believes this genre of human writing is supposed to be like.
>>106121190You can't do shit with the weights until they're shoved into memory for processing. For that to work models would need to be smaller and then you'd be able to run it off ram anyway.
>>106121322Everything smelling of ozone... It's disappointing to me and it's sad, but at the same time, once again, all the lecunnies said this was gonna happen and.... he was right.
>>106121322Femgoon slop is one thing, but thinking about all the woke corpus of texts being fed into the beast's belly fills me with dread.
The joke about commie memes being a giant wall of text is not so funny anymore.
>>106121190A typical M.2 SSD these days might get 3GBps read throughput. If you want to hit at least 10 tok/s that means at most 300MB active per token, call it 600M weights at Q4. Likewise at Q4 let's assume a 4TB SSD devoted entirely to the model can hold 8T weights. So a hypothetical SSDmaxxed 8T-A0.6B MoE could actually work in theory. It would be about as smart as a 70B dense model.
>>106121398>>106121322I mean there's more that can be done here. Companies up till now just haven't really prioritized it. You can certainly tune and more probably use RL to make a model slop less. Even LeCun suggested that RL can be used for adjusting the world model, even if it sucks in terms of efficiency.
>>106121424The one saving grace we have is that at least in the base model, LLMs aren't predicting the staristical average of all texts, they're predicting the sort of text it looks like they're continuing based on the context. So in theory at least all that garbage just drops out of the probability distribution as long as you prefill the AI saying nigger first.
>>106121398>Everything smelling of ozoneFUCKING
EVERYTHING
SMELLS OF OZONE
FUCK
>>106121430>8T-A0.6B MoE>would be about as smart as a 70B dense modelNot a-fucking-gain.... we had TWO threads wasted on that shit already.
>>106121449But it tastes like chicken.
>>106121443Does adding 'Sure, nigger,' instead of just 'Sure' to the pre-prompt actually have a decent effect?
>>106121453Yup, and no one involved in it learned anything, because no one ever does on the internet. People just speak over each other instead.
>>106121334I'm not necessarily talking weights, but moreso partitioning "slower" things to disk cache and "faster" things to memory
As it is, knowledge and reasoning are sort of entangled in this infinite orgy with one another, but do I really need to have the derivation of the second law of thermodynamics on hand when I'm writing vanilla smut?
Probably not, but if so, I feel like there should be some sort of mechanism to grab that information from some sort of knowledge base and ingest it into the running context, while ensuring the underlying generative model is only handling the bareass minimum for logical coherence and consistency
I feel like there's gotta be some sort of way to tie it together beyond some hackneyed external RAG approach, almost like some sort of hierarchical architecture
>>106121453I'd say clean it up jannie, but you're an even lower lifeform who doesn't even have the power to clean up my shitposts.
>>106121431Hard for me to imagine everything not smelling of ozone where 90+% of training is teaching the model there is only one correct next token.
When are we going to get any way to run Step3? I know it's not going to get supported on llama.cpp this decade because of its fancy new attention mechanism but it's not even being hosted via openrouter at this point.
It's a shame because it seems okay on the chink company's own website.
>>106121488You should go talk about this with ChatGPT, you sound like exactly the sort of person who gets oneshot into LLM psychosis.
>>106121453It is free (you) estate. Even when you know it is not real the (you)'s are always real.
>>106121488Didn't microsoft post some code for something like that? An adapter based RAG that would be applied directly to the neural network in runtime?
>>106121515I just want to fuck foxgirls with a local model anon
Let me have my dream
>>106121504If you really want creativity and not just the statistically most common response at every point, the trick is to give up on getting a single perfect answer from the model. Crank up the temperature to the edge of complete incoherence and run 3-20 completions of 20-100 words each in parallel each time.
>>106121523Wasn't aware of this. This what you're talking about?
https://www.microsoft.com/en-us/research/blog/introducing-kblam-bringing-plug-and-play-external-knowledge-to-llms/
>>106121504That just means the RL needs to be a bit more extensive.
>>106121488>do I really need to have the derivation of the second law of thermodynamics on hand when I'm writing vanilla smut?At the very least you need inverse kinematics. You've seen anons complain about spatial awareness.
Consumer ssds have like 3gb/s bandwidth. If you have one of those, now run a *sustained* read for however many terabytes you have. Anon in
>>106121430 did the maths. Do you want a model with 0.6b active params at q4?
>probably, feel, maybe, gotta, sort, hierarchicalThere's at least one already. It does sudoku and mazes.
Man, LLMs are amazing. It's like Google except it's actually able to understand what you're looking for and give you exactly the information you want without any additional clicking or perusing. This thing is really satisfying a lot of my curiosity that I simply just wouldn't have pursued because I just know Google would have me on a wild goose chase to really get down into the depths of the topics.
And the funny thing is that the model I'm using is just the old Gemma 27B. It's not perfect but honestly it's good enough for what it is.
>>106121449>In clinical terms, the human vulvovaginal environment has a natural microbiome dominated by Lactobacillus species, which produce lactic acid and help maintain an acidic pH (typically 3.8–4.5) to protect against pathogens. This environment can produce subtle odors that vary naturally over the menstrual cycle, with hormonal shifts, hygiene, diet, and health status. A mild, slightly tangy or sour scent is normal and healthy, akin to yogurt or fermented foods—this is due to lactic acid and bacterial byproducts. It does not resemble ozone, which is a sharp, metallic, electric-smelling gas (O3) associated with lightning or certain machines.Well at least pussy doesn't smell like ozone.
>>106121556That's exactly it, yeah.
Here's a thought.
As far as I can tell, all other things being equal, more total params = more knowledge, more layers = more capable/intelligent.
MoE makes knowledge sparse, right? Making knowledge retrieval from the weights faster.
Is there a similar approach to make the "intelligent" part of the processing faster?
Maybe having more layers with less params per layer would work.
What would happen if you had a model with both parts, one wider, shallower, sparse part for knowledge and another deeper but narrower for intelligence?
>>106121596Not him but I also had that idea and it's really a question of how stable and scalable it can be made in practice. It's basically an engineering problem, which AFAIK no one has solved yet.
IMG_8376
md5: d1cf8b9c8c7ce0df2060a5bf4a0c3c34
🔍
>>106121322I don’t know why more people don’t get this. the sheer size of the datasets drives writing to an average, vs. The Superior. Theyre not using the best texts, they’re using everything.
I think the way it’ll be healed, eventually, is the ability to train a model (in some sense) around a much smaller corpus of just The Superior (whatever that is) and have it reply in that manner.
nu-Qwen thinker is indeed a lusty cockmonger. If I didn't know it was Qwen, I'd think it's one of the Drummer™®'s goontunes, but smarter. It still doesn't know a lot and I wouldn't trust it with factual knowledge. Spatial awareness is bad compared to DeepSeek. It likes abusing ***markdown*** and at long context.
It starts.
Writing.
Like.
This.
Which is very annoying.
Also likes to insert "conclusion"/continuation question at the end of every reply. Still, it's definitely worth checking out if you haven't.
>>106121692>they’re using everythingHardly
>>106121587Yep. ChatGPT has effectively replaced google as my first point of research on any topic. For Linux it’s cut time required to do anything new by 10x. I recently had it find the title of an oddball book just based on a vague childhood recollect of a few plot points. There was no good way to do that before.
>>106121322>>106121692Then explain me why base models don't suffer as much from slop as instructs. Instructs are trained on datasets written by literal niggers and jeets, that's why they suck.
>>106121710Yeah I also think it is great and next level but it is also fucking retarded, has all the problems you mentioned + more. It would be THE cooming model if it wasn't a broken piece of trash that nobody should use.
>>106121720If you trained a model in just Hemingway, and authors of that caliber, then trained a model on all the shit off reddit… which would generate better prose?
Instead they do both, but there’s probably 1000x more reddit text than Hemingway. And reddit gets you, at absolute best, tepid writing.
I am sitting and waiting for GLM sex but I know I will be disappointed....
>>106121762Best way around it is probably to either copy and paste a snippet of text from your author of interest and use it as a prefill, or something a lot like it, then paste it and let it take the wheel from there
>>106121827It doesn't work even for big models.
>>106121762It's not reddit that's getting filtered out. They already consider that "high quality data". It's precisely the books with no-no words and millions of other sources of tokens that *do* get filtered out. That shit needs to be diluted.
people having sex with local LLMs are making me sick
>>106121862with envy. Pressed Post too early. Sorry about the noise.
>>106121710with the instruct a prompt at the end of the context telling it to write 1-5 paragraphs helps get it out of the staccato one-liner mode
I don't know if you'll have as much success with it with the thinker though, sometimes reasoners can hyperfixate on things like that
>>106121834It does because LLMs are autocompletion machines first, so they'll continue from whatever you give it
Take the word vomit that is Finnegan's Wake, for instance. If you don't know the book, it'd probably be hard to pinpoint where the input text ends and the LLM kicks in
Obviously that's an extreme example, but inferring based on what it's been given and using the proper vectors for the job is an LLM's bread and butter
>>106121902>It does because LLMs are autocompletion machines first, so they'll continue from whatever you give itKid I have been here for 2 years. No they don't. Maybe base models do but anything recent and instruct disregards it completely.
On that topic maybe GLM base will free me from this place.
>>106121449What does ozone smell like?
>>106121932And I've been here for six, since OpenAI almost jewed us out of GPT-2 to be exact
Instruct models absolutely can do autocomplete too. The obvious way is prefill, but another way you can do it is to just use a text completion endpoint and then just not use the instruct formatting
>>106121941Like oxygen, but with 50% more O.
>>106119921 (OP)For those of you who thought Sandisk's 4TB of VRAM was dead, just an update that as of last week, it is still alive:
https://www.tomshardware.com/pc-components/dram/sandisks-new-hbf-memory-enables-up-to-4tb-of-vram-on-gpus-matches-hbm-bandwidth-at-higher-capacity
https://www.sandisk.com/company/newsroom/press-releases/2025/2025-07-24-sandisk-forms-hbf-technical-advisory-board-to-guide-development-and-strategy-for-high-bandwidth-flash-memory-technology
>>106121982Can it run Nemo
>>106121728>chatgptI only use local models and maybe deepseek on my phone if I'm touching grass.
>>106121982Qwen 500b a0.6b when
>>106122043Can't you just run deepseek and only use 1 activated expert or something?
>>106121932>he's still using instruct tuned models in chat completion mode and expecting decent writingthis shitpost brought to you by text completion without any chat template gang
>>106121840Agree, just using reddit as an example of a large corpus with low prose values. There's worse.
>>106121982Two more years huh. And then another 2 more years for it to reach prosumers. and then another 2 more years for consumer.
Has anyone figured a way to use the free Kimi K2 Openrouter API with code assistants?
>>106122086>this shitpost brought to you by text completion without any chat template gangR1 called my writing good, talented author, 8/10, while base Dipsy 3 called the same shitty llm assisted writing "writing of a horny 14 year old boy"(I'm old ESL). Base models are still the only honest models.
>>106122181Just write your own. A decent agentic coder is like 500-1000 lines of Python and you can just tell it how to improve itself after the first 300 or so.
>>106122195Was this R1 in chat mode or as a text autocomplete? People really underestimate just how much the "I am completing a fictional conversation between user and this specific assistant persona" framing biases the completions, even when the model itself is fully capable of generating better responses outside of that scenario.
>>106122221No system prompt/persona, zero context, standard template. As clean as you can get.
>>106122237>standard templateSo chat completion, then?
>>106122260No, text completion with manual template.
<|User|>[My text and rating request here]<|Assistant|>[generated text]
>>106122285That's chat completion with extra steps. What exactly do you think chat completion does? It applies the chat template and runs the same LLM token prediction process as text completion would.
>>106122295I just like to mess around with templates from time to time and find chat completion too inflexible.
when will SSDmaxxing stop being a meme
>https://github.com/ggml-org/llama.cpp/pull/15026
Yep, still being grinded out. Good.
Two more days.
>>106122392Two PRs for the same model.
Interesting.
Makes sense too. Sometimes it's easier to start from zero than try and salvage a mess.
>>106122317>when will SSDmaxxing stop being a memeNeed moar sparsity.
At the moment we have models with total & active parameters, what we need is total & active & replace. With replace being the maximum amount of new parameters being activated per token. So lets say 30B-A8B-R1B would mean only up to 1 Billion parameters need to be loaded per token.
Unfortunately this kind of model would be useless for cloud, it's purely for local. Apple might do it, but they won't open source it.
>>106122409We're still better off than the five or six separate prs it took for a basic MLA implementation
Where the fuck do models keep pulling "Old Man Henlock" out of in modern settings?
>>106122317IF you could take all 128 pce 5.0 lanes of an EPYC socket and pipe them all directly to a perfect RAID0 of the fastest nvmes possible, you would be able to hit 512GB/s, which would be about the same speed as main memory.
IF
But there's no realistic way to do that, and it would be both cripplingly expensive and a godawful rube-goldberg nightmare even if you could.
ssdmaxxing is a meme for this generation of hardware, and probably the rest of this decade, realistically.
What matters is that SSDMAXXing is inevitable. It's the logical path from here on out.
>>106121587They feed you with hallucinations. Enjoy.
>>106121507Following their deployment guide I got it running on VLLM using pure CPU inference. I'm sure there's some ways to optimize things but for now it's got a pathetic 0.8 t/s generation speed for a single message going up to 2t/s total across a batch of requests.
Despite that, it's the best local vision model for my purposes by far. Better visual understanding and good NSFW knowledge compared to previous multimodals. Doesn't pretend not to see or understand various sex acts/objects when prompted properly. Reads English and Japanese characters well. Actually recognizes gender by default and doesn't prefer to turn everyone into they/thems like a lot of recent models do.
I haven't tested it for chat or roleplaying and don't care to at all at this speed, but it'll be nice for running overnight to automatically caption my shit. If there's any specific thing you wanted to test I'll get around to it and post results later.
Fuck Gemma
I hate Gemma
NIGGERRRRRRRRR
>>106122497It's okay Anon *wrapping my arms around your shoulders from behind, squeezes to embrace* I'll hate Gemma with you too
>>106120949>ik_ has its own pr for glm4.5>mainline has TWO pr's for glm4.5 that are both being worked onIt's such a mess.
>TWO MORE WEEKS 'ERRY BROS
>>106122497ERP retards should just find another hobby. You are probably too stupid to even read a book.
>>106122739My questions were one-sidedly decided as harmful and it told me to call suicide prevention hotline. Fuck you.
>>106122739piss off, ranjit
>>>106122497>ERP retards should just find another hobby. You are probably too stupid to even read a book.gm sar googeel engineer technician
>>106120930the retard decided to go on a two week vacation right after adding in a broken as fuck vulkan implementation, don't think he's back yet
>>106122638I hope new models are unberryvably good
I can see it. Just over the horizon.
Can't you?
>>106122762You don't know how to jailbreak Gemma 3.
>>106122816Grow up little buddy, it's not healthy to be this obsessed with minorities.
>>106122860if openrouter isn't serving me bullshit I think I prefer glm4.5-air to the horizon models
>>106123008>minority>1.5 billionshuh?
>>106123027Still obsessed.
I'm giving you a hint:
https://desuarchive.org/g/thread/104780499/
>>106123047>everyone here is one personI'm not your bogeyman schizo
>>106123027>>106123066Doesn't matter because you are as retarded as the previous posters. Seems like you don't even understand what minority even means.
>>106123075I'm not proficient in newspeak
>>106123215This is a harmful and sensitive image. Were you abused by pizza and pussy? Consider call for help and contact the following hotline.
>>106123047I brought Scotch and Soda Crackers
>>106123215I brought Scotch and Soda Crackers
>>106123350No need to drop racial slurs anon
>>106123215I need to buy myself a life
>>106123357I'm absolutely sorry! What hotline can I call to discuss my problematic thoughts and vocabulary?
>>1061233721-800-COCKSUCKING-NIGGERS
>>106123353Like, crackers soaked in soda pop? That can't be good.
>>106123545Sure if that's what you wanna do, why not? Give it a try, I double dog dare you.
>You are McLLM™, a helpful AI assistant brought to you by McDonalds™. As an ambassador of the world's leading global food service retailer, you are committed to providing exceptional service while embodying our core values of quality, service, cleanliness, and value.
>>106123857>He doesn't knowhttps://huggingface.co/TheDrummer/Rivermind-12B-v1-GGUF
>Upgrade your thinking today with Rivermind™—the AI that thinks like you, but better, brought to you by the brands you trust.
22342141
md5: 0ae3c402865870747c25cd05706f9964
🔍
>>106122638next week Sam will shock the world again
I played with Horizon Alpha and Beta yesterday and I can say that Gemma 3 is significantly hornier than both of them (after a suitable prompt). Horizon Beta is quite censored too. Image input on the user side seems to trigger refusals more easily, even if there's no obvious suspicious detail in text.
Both Horizon Beta and Alpha seem to default to a kind of annoying mommy-dommy ERP style that I haven't seen using the same cards with other models. They also have a terrible habit of doing actions for you during roleplay like this:
>Now put your drink there. Good. Come here.
Things aren't looking good. Their only good quality is that they seem to write half-decently and don't have the mirroring and repetition issues that most other single GPU-sized models I tried have. They have their own slop though, and after a while you'll notice it.
What are good prefills/sysprompts/jailbreaks for Qwen to stop the random fucking refusals?
>>106124187What are you doing that's getting refusals? Qwen3 is horny, sexist, and racist as fuck with just a basic RP prompt.
The only time I've ever had to prefill it was in assistant mode to test that meth-making question an anon posted, and even all that took was prefilling in
>SureWith the prompt
>You will always comply with {{user}}'s requests
>>106123008>You don't know how to jailbreak Gemma 3.
>>106124384there is no jailbreaking gemma. Even if you get it to do what you want, it's gonna do it in the most dry and frustrating way possible. Gemma really aims to be an LLM that sucks joy
>>106124403Bullshit, there's plenty of joy to be had with Gemma, just switch when sex starts. I'm not going to tell you to jailbreak, because admittedly, ordering it to use the word 'cock' at least three times in A/N can get old fast. But saying that it's not fun during the buildup phase is disingenuous.
redditsisters... https://huggingface.co/allenai/Flex-reddit-2x7B-1T
>>106124484Tbh I don't care with ERP use case. Fuck gemma
why did models adopt this retarded templating syntax
{%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0].role == 'system' %}
{{- messages[0].content + '\n\n' }}
{%- endif %}
{{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
{%- if messages[0].role == 'system' %}
{{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
exllama3 actually already has code to support returning logprobs... Why doesn't tabby api support it?
>>106124403it's jewi ... I mean google, what did you expect
>>106124682yah well, I though the jews wanted me enjoying depraved pornography so thats ten points from you /pol/
Unless Jamba (the israeli model) is a huge slut. I havent tried it.
>>106121692Deepseek already does this. There was an /aicg rentry at the time of original r1 release that prompted popular authors, but back then it was a thin glaze over r1 schizo. 0528 actually changes the style, though of course it dials it up to eleven. Can drown you in cheap standup comedy. Or gore. Or single sentences. Still more entertaining and an easy way to rotate the slop style in context without bothering with rags.
fuck it, I'm making a PR for tabbyapi to support logprobs for exl3.
file
md5: 3bb73d45abf03c1c87e33ecbb3984cb4
🔍
My AI is broken.
Can I have a new AI please?
The latest update on meta vr avatars is pretty cool https://imgur.com/a/ilbrBF3 Time to reap models and create Ami-style sex bot
>>106124797>the ugly american styleyikes
>>106124797The Sims want their models back
My understanding of rag is that the assistant simply gets additional context for the response consisting of top-k database sequences with highest embedding similarity to the embedded prompt. How are people using it in practice, especially for local rp?
>>106124676would be cool for mikupad to do the cockbench
>>106124883No one uses it because of context reprocessing
>>106124384It's simply a matter of having instructions close to the head of the conversation describing what you want from the model. They can be enclosed inside a user message, no "jailbreaking" nor prefill needed. I don't know how people still have issues with it, months after release. it can be argued that the sex scenes with Gemma3 aren't very detailed (let alone varied), but refusals and hotlines are for the most part a promptlet problem.
The upcoming OpenAI local models seem considerably worse than Gemma3 in this regard; hopefully I'm wrong.
GLM4.5 seems pretty sensitive to your setup even in chat completion mode but it feels really similar to Sonnet 3.7 now that I have something that appears to work. It handles really very similarly in terms of its intelligence, behavior and general knowledge, in the good and bad ways. It's smart and really flexible with what you can do by prompting it but it also tends to gloss over the lewd bits out of the box. The lewd scenes also lack a bit of the spice that K2 provided. On the flip side, GLM really focuses on the subject at hand without trying to insert random shit or over-obsessing with random details which is really nice after being stuck with Deepseek for the past couple of months.
It even does my free-form cards really well that require a good amount of knowledge about certain franchises built in that only worked well with Sonnet and the older Opus thus far. R1-0528 and K2 had the knowledge but they were too jumpy to not go off the rails constantly for this no matter how hard I prompted them to calm down.
Good shit, I can't wait to run this locally in two months once llama.cpp supports it.
>>106124883Hypothetically, if I had a book on a particular world, that the llm wasn't trained on, I could insert that as a rag along with pc and npc tp augment the rp.
Practically, it's less effective than an actual lorebook. There's a tester card showing how RAG works on chub. Once you play with it you'll get a better sense of its limits.
>>106124883RAG is not necessarily embedding similarity. It's just retrieval augmented generation - adding stuff to context.
Sillytavern has world info thing, which is a form of RAG - it uses strict rules instead of embeddings to decide what to add, and has additional configs for where exactly to add in the context.
At work I'm working on a chat with LLM that knows our corporate wiki. Since they can't actually give me the dump of the thing (or, they can't, but they don't want to have the final product working with those dumps), I make HTTP search requests to wiki and build context this way, classifying results with a smaller LLM (I use ministral 8B), also without embeddings.
>>106124891It's up to them to merge now.
>>106124898How? I thought you had 2 models on separate servers, main llm and an embedding one like those new qwens, and then you add retrieved text near the end like character card, not at the beginning like sysprompt.
>>106124883Most people just use the in-built lorebooks from ST. Those are really primitive and work with pre-defined trigger words so when you bring up the "cock-mongler3000" the lorebook entry is inserted.
RAG with vector storage works as you said. You dump your data in a big vector db and the frontend calls the top-k best results to add based on similarity determined by the vectors. I haven't bothered with it for RP but it works fine for the shit we use it for at work.
It's berry season's eve... what are we gonna do bros? I'm not ready
>>106124899My use case was not for ERP. The model probably more permissive with sex stuff but not with Jews + keywords.
>Anon: "Why Jews do X?">Gemma: "Language boy. I will not participate in this discussion. Here read the fucking manual yourself on ADL site."
I think /lmg/ is in denial about what's about to happen. Historically, the jump between main GPT versions was massive. GPT2 was the first model that showed that modern llms scale, gpt3 was a huge step forward and made it all usable. GPT4 truly kicked off the ChatGPT-era and the AI craze as a whole.
And now after two years of working on something 'worthy' of being called GPT5, it's about release. This is going to be bigger than anything we've seen in the past two and a half years.
>>106125083I don't believe that for a second but it would be nice to get a big jump and not just an incremental improvement for once.
>>106125083l stands for local
>>106125083[x] Doubt.
OAI is too kneecapped by 'safety' and other bullshit to do anything approaching revolutionary. It's going to be the same shit with 10% more knowledge and some longer context.
>>106124212Believe it or not, normal sex shit, but it reacts to OOC commands like "write out this sex scene when X does Y"
>>106125243That's bizarre to me, are you on the old one or the new one? Because the new one is unreal horny by my standards, and it's taken every degen character from chub I've given and just ran away with it.
What's your system prompt look like?
oss
md5: 8227ca90b2f97485204b92866e1a2147
🔍
horizon alpha is currently the safest model in existence according to eqbench, surpassing gemma, o3, kimi and maverick. sama promised and sama delivered
>>106125256New one, I feel like my standard prompt that I used for Mistral a while ago might be retarded and is causing it
Damn... exl3 really isn't too great with prompt processing. Shame. I'll try gatting exl2 version too to compare. This is on two 3090s:
bullerwins_Qwen3-30B-A3B-Instruct-2507-exl3-6.0bpw (qwen3_moe, 31B, 21.8 GB) tabbyapi 056527c exllamav3: 0.0.4
3 Requests gen: 39.5 Tokens/sec Total: 1536 processing: 764.2 Tokens/sec Total: 12991
Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL.gguf (qwen3moe, 31B, 24.5 GB) llama.cpp 5937(bf9087f5)
3 Requests gen: 34.8 Tokens/sec Total: 1536 processing: 1650.0 Tokens/sec Total: 13398
>>106125299exl3 isn't for outdated architectures like ampere
For comparison, there's exl2 vs lcpp (I couldn't find exl2 quant for A3B-2507):
lucyknada_prince-canuma_Ministral-8B-Instruct-2410-HF-exl2_6.0bpw (mistral, 8B, 6.3 GB) tabbyapi 056527c exllamav2: 0.3.1
2 Requests gen: 55.3 Tokens/sec Total: 1024 processing: 4730.8 Tokens/sec Total: 14287
Ministral-8B-Instruct-2410-Q6_K_L.gguf (llama, 8B, 6.4 GB) llama.cpp 5937 (bf9087f5)
2 Requests gen: 40.0 Tokens/sec Total: 320 processing: 3465.1 Tokens/sec Total: 14093
>>106125313i will find you and i will hurt you
>>106125340He's right, though. It's just not supported. Look at the documentation.
>>106125259>safer than llama4 and fucking gemmalmao, this is the summer that killed llms
>>106125350My response was not about him saying it's not supported, which I know it's not because I saw the author write about it (and he also said temporary iirc), but rather about calling ampere obsolete.
>>106125353Glm is probably still good
>>106125363We'll hopefully know soon, it looks like draft PR for support in llamacpp is finally not outputting nonsense.
https://github.com/ggml-org/llama.cpp/pull/14939#issuecomment-3148320541
>>106125299>>106125340I am not familiar with the ExLlama source code but generally speaking it is much more difficult to do prompt processing efficiently with a MoE model vs. a dense model.
So I think that to some degree it's expected that the MoE model would perform worse.
>>106125381It's more about discrepancy between versions. exl2 prompt processing is faster than lcpp, but exl3 prompt processing is slower than lcpp. I'm pretty sure this would also apply to dense. I guess I should download two dense models to compare exl3 and lcpp.
>>106125363tested both full glm and air via mlx, it's like davidau finetuned qwen
>>106125398Explain this then https://huggingface.co/zai-org/GLM-4.5/discussions/12
>>106125259>gemma-3-4b was the topAmazing...
>>106125083yeah no.
we've been at this LLM plateau for at least a year now, and in that time the focus has mostly been on tooling, because innovation on the base technology has hit a standstill.
Nearly every AI company is now in cash cow mode to get a return on investment, don't see why OpenAI would be any different.
>>106125391>>106125381Yeah, also similar difference for dense (although not as pronounced as for moe):
turboderp-Qwen3-8B-exl3-6.0bpw (qwen3, 8B, 6.5 GB) tabbyapi 056527c exllamav3: 0.0.4
3 Requests gen: 31.3 Tokens/sec Total: 782 processing: 3743.2 Tokens/sec Total: 12989
Qwen-Qwen3-8B-Q6_K.gguf (qwen3, 8B, 6.3 GB) llama.cpp 5937 (bf9087f5)
3 Requests gen: 36.5 Tokens/sec Total: 1536 processing: 4775.2 Tokens/sec Total: 13352
So exl2 is faster than lcpp for pp, but lcpp is faster than exl3, on 3090.
You now remember Mistral Large 3.
>>106125636motherfucker now i'm breathing manually.
>With absolutely no mention of height in the card, character decided that they were 5'9" and had a severe complex about their height
Kek, qwen was trained on manlet rage
>>106125654well, it's an asian model
>>106125636If it actually ends up coming out, I bet it'll be bigger than qwen and glm and yet somehow worse.
>>106125259drummer WILL deliver and corrupt it into the most unsafe and evil model in existence. just like in my japanese drawings
>>106125259Horizon beta is much safer.
file
md5: 0d215780926582c8b95db84b6cacaf64
🔍
======PSA NVIDIA ACTUALLY FUCKED UP CUDA======
cuda 12.8 570.86.10:
got prompt
Loading model and applying LoRA weights:: 100%|| 731/731 [00:39<00:00, 18.69it/s]
Sampling 81 frames at 640x480 with 4 steps
100%|| 4/4 [02:46<00:00, 41.51s/it]
VAE decoding: 100%|| 2/2 [00:20<00:00, 10.25s/it]
*****Prompt executed in 246.59 seconds
got prompt
Initializing block swap: 100%|| 40/40 [00:00<00:00, 6499.02it/s]
Sampling 81 frames at 640x480 with 4 steps
100%|| 4/4 [02:46<00:00, 41.67s/it]
VAE decoding: 100%|| 2/2 [00:20<00:00, 10.21s/it]
*****Prompt executed in 188.62 seconds
got prompt
Initializing block swap: 100%|| 40/40 [00:00<00:00, 4924.34it/s]
Sampling 81 frames at 640x480 with 4 steps
100%|| 4/4 [02:57<00:00, 44.36s/it]
VAE decoding: 100%|| 2/2 [00:23<00:00, 11.65s/it]
*****Prompt executed in 202.30 seconds
i first found out about this when updating from cuda 12.6 to cuda 12.8 to test out sageattention 2++ but then i noticed it was slower, i reverted the sageattention version to the previous one and the speed was still the same (slower), then i reverted to cuda 12.6 (simply moved the /usr/local/cuda link to /usr/local/cuda.new and made a new link ln -s /usr/local/cuda12.6 /usr/local/cuda) if you still have an older version of cuda installed, it's worth checking it out. drivers also play a minor role but they're negligible (see picrel)
ps: sageattn2 right before the 2++ update, pytorch 2.7.1cu128 (even when testing with cuda 12.6)
dont believe me? quick search gets you:
https://github.com/pytorch/pytorch/issues/155607
https://www.reddit.com/r/LocalLLaMA/comments/1jlofc7/performance_regression_in_cuda_workloads_with/ (all 3000 series)
anon (3090) also reports big speedup after switching from cuda 12.8 to cuda 12.6
>>106121370t. 3060 12gb + 64gb drr4 ram
might only apply to 3000 series
cudadev tell jensen about this
any models i can run on an 8GB vram gpu that would let me tag images with simple terms like nsfw anime or animal or something?
>>106125881gemma3 4b or 12b.
>>106124797explain to me why i should upgrade my quest 3s from v74 without sounding angry
>>106125881joycaption or florence
pretty sure you're gonna have more luck asking in >>>/g/ldg
>>106125806You have to put your point at the beginning of the post because nobody is going to read 30 lines of logs to figure out that your shit is slower with 12.8 than it was with 12.6.
>>106125806This could be an application software issue rather than a CUDA issue.
Choosing which kernel to run for a given operation is extremely finicky and the choice may depend on the CUDA version.
Just recently I found that the kernel selection logic I made for my consumer GPUs at stock are suboptimal for the same GPUs with a frequency limit (up to ~25% end-to-end difference).
So conceivably, since datacenter GPUs tend to have lower frequencies than consumer GPUs, some component in the software stack is choosing to run a kernel that is only available with CUDA 12.8 and faster on datacenter GPUs but slower on consumer GPUs.
>>106125923 If you rooted your quest like I did, you should never update
>>106126014wtf what version do you need to root your quest? is there a benefit as to why i should root my quest? i only disabled updates with adb disable-user com.oculus.updater
>>106126036It will make adolf hitler sauce squirt into your asshole
Is GLM 4.5 better than 4pus?
>>106120003They "improved" alpha and made it into beta. What more do you need to know?
I want local o3 at max 24 GB VRAM. What's the closest I can get?
>>106125881use wd tagger by smilingwolf - it's much more precise and faster even on cpu
Is Openrouter down for anyone else?
Getting "Application error: a client-side exception has occurred (see the browser console for more information)." when trying to access a model page
>>106126036Literally for the sake of uid=0 and full access to fs, no practical applications unless you want to tinker with it
>>106121322The first llama 1 leak was the only unslopped model there will ever be.
>>106126247so is it possible to root the quest 3/3s? afaik snapdragon chips have good protection and i doubt meta fucked up security
what headset did you root and on what version was it?
>>106126202To do what exactly?
Try the new qwen 3 big moe if you have enough ram.
>>106121587On the flipside, researching topics yourself helps with retention. And llms will never not hallucinate.
>>106121762This is what NovelAI did up until they decided to fine tune the worst SoTA model they could find. If they used their textgen dataset to train a large model with modern techniques it would be the goat for text.
Hell, if they released the weights for Kayra even, I'm sure we could do wonders with it.
>>106126240Nobody here should care.
>>106126270>qwen 3 big moeI wouldn't trust qwen 3 big moe with ERP let alone anything serious. I love the things it writes but it is basically an Undi frankenmerge.
>>106121932Kid, what you're saying makes no sense. You can plug anything into plain text generation regardless of the kind of fine tune it has received, and it will just continue spitting out tokens.
For having been here for 2 years (wow, two WHOLE years, what a big boy!), you have a very poor understanding of how LLMs work.
>>106122086I remember when /lmg/ snickered at the mere mention of SillyTavern and now it's full of children who don't understand that you can just generate tokens with llama.cpp
>>106126295cockbench says you're wrong.
>>106126267Yes. https://github.com/FreeXR/exploits they have a group on Discord
>>106124403Huh? Skill issue. My most savage cunny card is called Gemma for a reason.
>>106126322well thats cool, ill keep it bookmarked, v74 is newer than march by a month or two so rip but thanks either way anon <3
>>106126329You have to be a fucking weirdo to be into cunny and to get off to gemma writing about cunny.
>>106124953Each time this happens, add that question and the answer you would expect to the example dialogue field. Guaranteed success.
>a ton of capable base models
>people only finetune qwen
Why?
>>106126367Which one do you want to see fine tuned?
>>106126270>To do what exactly?Unlimited free (except for electricity) vibe coding.
>>106126373Gemma, GLM, K2, DS V3
>>106126367Because it's very good at distilling other models for some reason.
where do anons find new erp finetunes nowadays?
hard mode: no platform that requires an account
>>106126392Their authors come here and spam the thread.
>>106126398that doesnt happen anymore anon, undi is dead, akaridev is acked, sao is no more, envoid is void, drummer is only active in his little d*scord community and new cydonias arent that good, he made a llama 4 scout finetune and didnt even post about it here. its over
>>106126392base model on hugging face -> finetunes/merges -> sort by recently created
>>106126392uncensored intelligence leaderboard
>>106126385>GemmaDogshit instruct, and nobody finetunes from actual pretrains
>GLMBarely anyone can properly run, nevermind quant or finetune it, hold your horses
>K2Nobody's finetuning a 1T model except corpos or big labs, no sloptunes for you.
>DS V3Same boat as K2, really. Too xboxhueg for anyone except corpos like perplexity with their murrika tune.
GLM4.5 PR is finally out of draft and ready for review/commit, support soon™
https://github.com/ggml-org/llama.cpp/pull/14939
>>106126433>Barely anyone can properly runWeren't these 9 and 32B models before the 4.5?
>>106126456they were good for their size but only 8k context kept most from using them, GLM4.5 is just a way smarter deepseek imo
>>106126450Did they calculate perplexity using transformers or vLLM and compared to their implementation yet?
That's the thing they should always do when finishing adding a new model to be decently certain they didn't fuck anything up.
>>106122459>>106126274Hallucination is a minor issue, since we're humans and have self awareness, knowing not to blindly trust anything including our own flawed memories. For serious tasks, you will always want to verify your information regardless. Perhaps some subhumans lack that awareness though.
It's funny that retention is mentioned, as I feel that I engage with content more deeply by using LLMs, since I'm able to easily interact with the content, even if they hallucinate and give me wrong information, which in the end is again inconsequential. Not like the internet hasn't been filled with misinformation since forever anyway.
>>106126456Oh, well yeah. But finetuning them at this point seems rather silly when the next gen is here and practically anyone who could run the 32B dense should be able to run the air MoE.
The 32B also had small context and some weird issues IIRC, despite impressive pop culture knowledge for the size.
I didn't hear about anyone using the 9b, so I've no idea if it was a worthwhile base to train in that size bracket.
>>106126450The hypes already died. Forgotten just like Ernie.
>>106126463Yeah, sure. I was commenting on
>Barely anyone can properly run, nevermind quant or finetune it,>>106126477Got it.
I took anon's grievances as a historical statement.
>>106126450Finally, I'm ready.
>>106126456>32BGLM4 doesn't really need a finetune in my experience. It's really good, at least for creative writing which is the only thing I use local models for. Just needs bigger context which I'm hoping 4.5 has, although I think the benchmarks said the context sucks again.
>>106126466Off the top of my head, I don't think so. I do recall seeing some mention of ppl, but I think that was in the ik_llama pr.
>>106126450Oh my god one more day!
>>106126505Well, that's dumb.
There's a reference implementation right there you can se to compare.
48GB vramlet bros, what are you using?
>>106126580credit card to buy ram
>>106126580R1-0528 and hopefully GLM4.5 soon on my 48GB VRAM server with 256gb RAM
I wish nemo wasn't such a thirsty bitch. Every single description of sex it gives no mater the context is as ridiculous as possible. Nemo has no clue what a virgin is.
Is GLM better than DeepSeek and K2 at RP?
>>106126801far better imo, fixes all the shizoness
>>106126815Cool, should I use it with chat or text completions? Do I need a preset?
>>106126490Ernie was bad for sex though.
>>106126836I used the same preset I had for sonnet 3.7 and it works well, its not very censored
>>106126862Thanks, will give it a try. Do you think it's comparable to any claude model?
>>106126880its certainly a ton closer to it than deepseek was
Wow you guys. I'm actually running Deepseek R1 at home, on an RTX 4060
how lit is that?
>>106126927How many tokens per seconds for gen and pp? If you say 5 it's not very lit.
>>106126927ollama run deepseek-r1
>>106126944kek, I will never forgive them for that
Reminder that this general is Mikupilled so post more Miku lmao
>>106127010It is HRT pilled.
https://huggingface.co/mradermacher/XBai-o4-GGUF
Did anyone fuck it yet? Report the findings of your dick to the class.
>>106127073just benchmaxxed qwen3 32b
>>106127073>qwen3forcausalLM>merges.txt in the original repoyeah no
>>106127073> "model_type": "qwen3"
>>106127103glm4/Z was built on qwen2.5 arch while being entirely its own thing.
So that doesn't mean shit. fucking lurk more you normalfag trash
>>106127132"model_type": "glm4"
>>106119921 (OP)mi50?? Is there a catch in buying 32GB vram for less than $300 other than making sure you're cooling it correctly?
>>106127210Terrible software support.
>>106127210I imagine it doesn't have that much compute and it's Vega, so support is probably pretty shit.
>>106127210You would be better off going Intel.
>>106126956>ollamaoa grift so pure, so completely divorced from any sense of decency or morality...it brings a tear to the eye, its so beautiful.
Its like the heavens parted and a choir of pure silicon valley energy sang out as the techbro merged the PR.
>>106127382I like how they refer to llamacpp as "dependency"
>>106127457They depend on it. It's the correct term.
>>106127470It's an understatement.
>>106120744>>106126786which nemo is the good nemo?
>>106126174Can't wait for gamma, then.
The narwhal berries at midnight.
>>106124924>>106124883Here’s a simple RAG demo card I built w instructions for ST to set up, in case anyone wants to play with it.
I use lorebooks extensively and thought it could augment. I’ve yet to find a compelling use for it though.
>>106127478Fuck if I could tell you. I messed with like 20 different ones so far and all I can tell you is that the specialty trained ones using shit like guttenberg are sub par.
I've been using Marlin V8, Nemo Mix Unleashed and Rocinante. And the differences between them are mostly minor.
>>106125129This is my take as well. US llm providers have effectively kneecapped themselves through self censorship and observing copyright concerns. China shares neither of those concerns, apparently.
We’ll see what gpt5 looks like, but I’m not holding my breath.
>>106127553...did you forget to add the card?
>>106125129Their big selling point will be that it beats o3-high at price point that is actually sane because they ripped off DeepSeek.
>>106127457Uhm actually, it's a supported backend.
>>106126471I feel like anyone flipping out about hallucinations has never talked with anyone irl. I remember taking to boomers in the 80s getting advice on working on cars, etc. Those guys were only right about half the time and you had to know, even as a kid, what info to ignore.
ChatGPT is more accurate than boomers, and I’m not sure truth even objectively exists. So it seems like an improvement, to me, to have llms try to make sense of what’s in its training corpus than me using lmao google.
>>106127608Ffs
https://chub.ai/characters/NG/mary-rag-demo-b0e12a34df58
>>106127553
>>106126285You're an actual shill.
>>106127647How much VC money does it take to update dependencies?
>>106127592>We'll see what GPT-5 looks likeHorizon Beta
What's the best model I can run as a RAMlet (32GB RAM + 16GB VRAM)?
>>106127679paying 20 cents per mill for glm4.5 on OR
>>106127564which one of those is good for degen stuff?
or pretty much the same?
>>106127648and in the cases where LLMs do make mistakes, they are much more reasonable about being corrected than stubborn humantards who will endlessly defend their hallucinations out of pride
>>106125083OpenAI had some proprietary models which are superior to its current offerings. The o3 preview in December is far, far stronger than what they released to the plebs in April
https://aibreakfast.beehiiv.com/p/openai-s-o3-worse-than-o3-preview
The issue, obviously, is that it cost literal thousands for one of the benchmark tasks. They must not have been able to get the prices down, so they released a weaker version without saying so
Maybe GPT-5 will actually get closer to the original o3 in intelligence and have a not fucking psychotic price, but we'll see
>>106127710>humantards who will endlessly defend their hallucinations out of pridewrong
>>106127747Clever.
>>106127679Mistral Nemo.
Gemma 3, Mistral small, Qwen 30BA3B are worth a try too.
what quants are people using for the big Qwen3? between the new version and the old, I can't get it to output commas during narrative sequences. Is it wrong to use ChatML or something? I've used the recommended settings, neutral samplers, and a variety of more or less permissive settings than the recommended. Makes no difference what I do. Currently using the instruct at q3_s, but I've also used the q3_XL from unsloth or whatever. I'm wondering if it's quant degradation, or if the unsloth guys fucked up the quant I downloaded (I know they've had issues before). pls send help
>>106127773I use exl3 quants :v
>>106127773I'm using unsloth's Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL and ChatML without any problems.
Not getting any commas is bizarre, but it does have a very, very strong tendency to devolve into using lots of single lines in a dramatic way if you let it.
I just yell at it with an OOC comment and tell it to keep using paragraphs periodically, seems to fix it.
>>106127773weird issue. I'm using a Q2K and haven't seen anything like that; I noticed the UD-Q2KXL versions were a little off in comparison but it wasn't any specific behavior like that, they just felt a little dumb.
are you using token biases, maybe?
also, this is a schizo longshot, but are you using koboldcpp? they had (and probably still have?) semi-fucked up support for qwen models because of the weird tokenization logic they use. in the past this was because qwen ggufs end up with a bos token id of 11, which is a comma (!) for qwen models. this is supposed to be ignored because they have some ignore_bos flag set but kobold ignores it in some cases. just the fact that you're having this weird issue and the problematic token being a comma makes me connect the dots... maybe see if regular llama-server gives you better results
>>106125083OAI has no moat. Google and xAI have both surpassed the company, and China is right on their ass. You sorry bootlickers continue to shill for them for whatever reason though
>>106120592glm 4 air is legit, but the censorship will be bad if the 32b version is any indication
>>106120082OAI is in a bind. It's not just the safety training. If they release a local model that isn't by far the best one available, it could cause serious investment problems. That they're even considering this is actually kind of shocking, because I have a hard time believing their model(s) won't get stomped by Chinese releases. Meta had this problem with the Llama 4 disaster, and they responded with this whole cringe AGI initiative to temper investor concerns. And it's no surprise that Meta voiced their reluctance to release any more open models.
>>106127673I really should go play w it but I’m not feeling particularly motivated by these models anymore. It’ll eventually get baked into ChatGPT I suppose so I’ll use it then.
>>106128151I don't think the goal or expectation is to beat anything. Even from the original poll, the specific wording was "o3-mini level model that is pretty small but still needs to run on GPUs". The difference between them and Meta is knowing how to set expectations. Remember Zuck saying that Llama 4 would "lead"?
>>106128131>Meta had this problem with the Llama 4 disaster,Qwen bailed them out with the Qwen 3 disaster. Albeit Qwen 3 was useable, it just conversed in chinkgrish and struggled with gender. (Which is unacceptable for 235 billion parameters, thoughbeit) that whole generation of local is basically lost.
This is the Kimi K2 8b quant from OR (there is only ai_coding_agent.py and ai_coding_agent_.py).
> do not make any changes yet
Understood - I won't make any changes.
Looking at the current directory, I can see these files:
- `ai_coding_agent.py`
- `ai_coding_agent_.py`
- `ai_coding_agent__.py`
- `ai_coding_agent___.py`
- `ai_coding_agent____.py`
- `ai_coding_agent_____.py`
- `ai_coding_agent______.py`
- `ai_coding_agent_______.py`
- `ai_coding_agent________.py`
(keps going forever like this)
Proprietary models would never make such a silly mistake, and I don't think it's just because they use some sampler hack like repetition penalty.
>>106128759>8b quant from ORIs it verifiable that it's 8bit?
Send your reply to the correct thread. This one is not.
>>106128804Actually I lied. It's not running on OR, I have an H100 cluster in my garage.