/lmg/ - Local Models General - /g/ (#106011911) [Archived: 18 hours ago]

Anonymous
7/24/2025, 9:09:10 PM No.106011911
rin wo licence
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106005673 & >>106001651

►News
>(07/24) Magistral Small 1.1 update released: https://hf.co/mistralai/Magistral-Small-2507
>(07/24) YUME interactive world generation model released: https://stdstu12.github.io/YUME-Project
>(07/22) Version 2 of Higgs Audio Generation released: https://www.boson.ai/blog/higgs-audio-v2
>(07/22) Qwen3-Coder-480B-A35B released with Qwen Code CLI: https://qwenlm.github.io/blog/qwen3-coder
>(07/21) DMOSpeech2 released: https://github.com/yl4579/DMOSpeech2

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/tldrhowtoquant
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
Censorbench: https://codeberg.org/jts2323/censorbench
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
Replies: >>106013788 >>106016190 >>106020423
Anonymous
7/24/2025, 9:09:40 PM No.106011918
local migu GUN!
►Recent Highlights from the Previous Thread: >>106005673

--Paper: Parallel CPU-GPU Execution for LLM Inference on Constrained GPUs:
>106008647 >106008735 >106008758 >106008816 >106008890
--AI studio with node-based model integration and low-memory efficiency:
>106006040 >106006053 >106006072 >106006140 >106006107 >106006125 >106006136 >106006138 >106006220 >106006277 >106006315 >106006393 >106006443 >106006424 >106006471
--Mistral releases Magistral-Small-2507:
>106009510 >106009527 >106009663 >106009757
--ZhipuAI prepares GLM-4.5 MoE models with large parameter variants:
>106007907
--Speculation on why large Bitnet models haven't been released despite potential viability:
>106010889 >106010927 >106010944 >106010963 >106011006 >106011030 >106011022 >106011071
--Boson AI's Higgs Audio: high-performance TTS with voice cloning:
>106005915 >106005974 >106005989
--Best models for RP on 24GB GPU with tradeoffs between quality, NSFW capability, and speed:
>106006945 >106006963 >106006968 >106006985 >106007182 >106007208 >106007214 >106007224 >106007251 >106007268 >106006973 >106006988 >106006998 >106007064 >106007097 >106007129 >106007153 >106007091 >106007100 >106008909 >106008933 >106009007 >106009041 >106009291
>106011135 >106011163 >106011176 >106011183 >106010983
--Yume as interactive 3D video generation with camera control:
>106006887 >106006897 >106006906 >106006942 >106006922
--Qwen's storytelling style: overly dramatic for some, not dry for others:
>106007559 >106007893 >106009188 >106009319 >106009333 >106009537
--Vision RAG potential and limitations for VLM applications:
>106008480 >106008530 >106008703 >106008719 >106009649 >106010496
--OpenAI rumored to release first open-weight model since GPT-2 before GPT-5 launch:
>106010679
--Miku (free space):
>106005739 >106005883 >106006973 >106008107 >106008909 >106010817 >106011216

►Recent Highlight Posts from the Previous Thread: >>106005678

Why?: 9 reply limit >>102478518
Fix: https://rentry.org/lmg-recap-script
Replies: >>106016190 >>106020423
Anonymous
7/24/2025, 9:12:07 PM No.106011941
Glm will save local
Anonymous
7/24/2025, 9:12:24 PM No.106011945
Are there any good 100-200B models better than the Mistral and Gemma small models? I don't have the RAM to run 235B, but maybe 200B. And I don't have the VRAM for 70B but 30B is fine.
Replies: >>106011999 >>106012087 >>106016860
Anonymous
7/24/2025, 9:15:11 PM No.106011969
mikuquestion2
Cheap GPUs with lots of VRAM but no gaming capabilities when?
Replies: >>106011978 >>106011993 >>106011999 >>106012002 >>106012013 >>106012026 >>106012104 >>106012115 >>106012231 >>106016190 >>106020423 >>106020530
Anonymous
7/24/2025, 9:15:58 PM No.106011978
>>106011969
nvidia doesn't care about gaymers
Replies: >>106012005 >>106012090
Anonymous
7/24/2025, 9:16:55 PM No.106011993
>>106011969
I remember hearing about nvidia and a bunch of other companies working on something similar; looks like they failed.

I had a lot of expectations for jim keller as well
Anonymous
7/24/2025, 9:17:27 PM No.106011999
>>106011945
I jumped straight from mistral large to the 235 qwen, but there was some anon a few threads ago praising the rednote dots.llm1, and that's 143b.

>>106011969
Soon™ Intel arc b60.
Replies: >>106012019
Anonymous
7/24/2025, 9:17:55 PM No.106012002
>>106011969
The MI50 32gb are pretty cheap in alibaba
Anonymous
7/24/2025, 9:18:13 PM No.106012005
images (3) (4)
>>106011978
Anonymous
7/24/2025, 9:18:39 PM No.106012013
>>106011969
>but no gaming capabilities
for what fucking purpose?
Replies: >>106012017
Anonymous
7/24/2025, 9:19:10 PM No.106012017
>>106012013
Gee, I don't know, anon.
What could he be referring to in a local AI models thread?
Replies: >>106012060
Anonymous
7/24/2025, 9:19:21 PM No.106012019
>>106011999
In my experience rednote is a disappointment, but I didn't test it much
Anonymous
7/24/2025, 9:19:59 PM No.106012026
>>106011969
It was called a P40/P41. But you're about 1.5 years late for that. What you want basically does not exist anymore.
Replies: >>106012038 >>106020530
Anonymous
7/24/2025, 9:21:24 PM No.106012038
>>106012026
>Its price at launch was 5699 US Dollars
>cheap
Replies: >>106012055 >>106012073 >>106020530
Anonymous
7/24/2025, 9:21:59 PM No.106012044
GLM 4.5 when?
Anonymous
7/24/2025, 9:22:58 PM No.106012055
>>106012038
It is if you're not poor.
Anonymous
7/24/2025, 9:23:18 PM No.106012060
>>106012017
What advantage does he get from it? It doesn't make the card cheaper or anything.
Replies: >>106012105
Anonymous
7/24/2025, 9:25:31 PM No.106012073
>>106012038
What was the price 1.5 years ago, Anon-kun? Was it $5699? Do you need a computer break?
Anonymous
7/24/2025, 9:26:59 PM No.106012087
>>106011945
the short answer is no, there are not any particularly competent modern entries at that size range. like the other anon said dots may be worth a try but idk if it will impress you, it's pretty mid
the new GLM model should hopefully change that up a bit, they have a good track record
Anonymous
7/24/2025, 9:27:12 PM No.106012090
>>106011978
This is true. That any NVidia GPUs in the last couple of generations still work for games is purely coincidental at this point.
Replies: >>106012130
Anonymous
7/24/2025, 9:28:20 PM No.106012104
>>106011969
>cheap
>lots of vram
Pick one
Replies: >>106012119
Anonymous
7/24/2025, 9:28:33 PM No.106012105
>>106012060
I meant like a card with lots of VRAM but without expensive gaming-focused tech that doesn't help people run AI models.
Like, if there are 5090s selling for $2,600, why aren't there cards with 32 GB VRAM but far less gaming capability selling for half that price or less?
Replies: >>106012130 >>106012148 >>106012376
Anonymous
7/24/2025, 9:28:44 PM No.106012110
Is the age of 32b models over? Do I really need 3 24GB gpus to run anything good now?
Replies: >>106012122 >>106012123
Anonymous
7/24/2025, 9:29:14 PM No.106012115
>>106011969
get any pro-level card and it will either have no display output, or it can be set to compute-only mode via the driver and utilities.
Sorry about the cheap part, that's an oxymoron at this point in history
Anonymous
7/24/2025, 9:29:50 PM No.106012119
>>106012104
Surely VRAM prices are solely due to demand and not manufacturing/materials costs, right?
Replies: >>106012134
Anonymous
7/24/2025, 9:30:04 PM No.106012122
>>106012110
get ready to buy RAM buddy
Replies: >>106012194
Anonymous
7/24/2025, 9:30:06 PM No.106012123
>>106012110
You need like 8 of those
Replies: >>106012194
Anonymous
7/24/2025, 9:30:55 PM No.106012130
>>106012105
>expensive gaming-focused tech
see >>106012090
Anonymous
7/24/2025, 9:31:07 PM No.106012134
>>106012119
The RAM on a GPU is special RAM. Think about how much DDR5 costs.
Replies: >>106012153 >>106012238
Anonymous
7/24/2025, 9:32:41 PM No.106012148
>>106012105
If a GPU can calculate the things needed for your AI, it can also calculate the things needed for rendering games.
The only thing you'd get is a card with no way to connect it to a display, which would make it kinda useless for games, but that's not going to save you much in production cost, and they're going to make you pay extra for your "professional" card anyway.
Replies: >>106012163
Anonymous
7/24/2025, 9:32:56 PM No.106012153
>>106012134
>Think about how much DDR5 costs
I paid $80 for 32 GB of it?
Replies: >>106013329
Anonymous
7/24/2025, 9:34:04 PM No.106012162
token_generation_speed
I made a performance benchmark of a deepseek cope quant running partially on nvme, and I discovered an extremely marginal improvement from using an excessive number of threads. I can only speculate that more threads means more concurrent memory accesses and thus page faults, which must be letting the kernel queue up the nvme deeper and get a bit higher total throughput despite the overhead, or what have you. I'm going to try the iq2 next and see just how bad running from nvme can really get
Replies: >>106012174 >>106013567
Anonymous
7/24/2025, 9:34:37 PM No.106012163
>>106012148
Surely the cards being released right now have a ton of gaming-focused features, right?
Replies: >>106012182 >>106012241
Anonymous
7/24/2025, 9:35:36 PM No.106012174
prompt_processing_speed
>>106012162
Anonymous
7/24/2025, 9:36:13 PM No.106012182
>>106012163
You got gay tracing, vaseline smear and fake frames
Anonymous
7/24/2025, 9:37:28 PM No.106012194
>>106012122
I'm still using 27b gemma 3 but it feels like watching other kids play outside while I'm in detention. I don't want 5t/s with ram loading :(
>>106012123
I don't think my wall outlet can handle 8 3090s. I like local but costwise it's making less sense with every big release.
Anonymous
7/24/2025, 9:41:00 PM No.106012231
>>106011969
so it has even fewer use cases outside of AI than GPUs already have? No thanks, I'll keep on cpumaxxing.
Anonymous
7/24/2025, 9:41:21 PM No.106012238
>>106012134
An 8GB 5060 is $300 and a 32GB 5090 is $2,600.
It's pretty clear that that extra $1,400 is solely due to demand and not manufacturing/materials costs.
Replies: >>106012689
Anonymous
7/24/2025, 9:41:35 PM No.106012241
>>106012163
There is quite a difference between no gaming capabilities and not having fancy shit (but I don't know how much of that shit is done in software anyway).
Anonymous
7/24/2025, 9:44:29 PM No.106012278
>pull llama.cpp
>qwen coder drops from 55t/s to 35t/s
It's bisect time.
Replies: >>106012319 >>106013195 >>106013227
Anonymous
7/24/2025, 9:44:57 PM No.106012287
smugfolderimage2623
>he pulled
Replies: >>106016190 >>106020423
Anonymous
7/24/2025, 9:47:08 PM No.106012319
>>106012278
Llama.cpp has been going downhill ever since they changed from ggml files to gguf
Replies: >>106012338
Anonymous
7/24/2025, 9:48:55 PM No.106012338
>>106012319
agreed. they just should've stuck with llama 2 support. all of the other models are ass.
Anonymous
7/24/2025, 9:49:35 PM No.106012345
bros what motherboard should i buy for ddr5 rammaxing
Replies: >>106012396
Anonymous
7/24/2025, 9:50:21 PM No.106012353
hulk hogan mark twitter
>>106001910
I use Openaudio S1 Mini for local text to speech. Voice clone sample of Hulk Hogan. I used resemble enhance to clean up the output file.
https://vocaroo.com/1624kqgRdRlt
Anonymous
7/24/2025, 9:52:30 PM No.106012376
>>106012105
Well, at some point this year the DGX Spark is going to be released. Maybe you can pick one up on a closeout sale next year. I think for $4600 it's going to be about as compelling as buying a used condom.
Replies: >>106012537
Anonymous
7/24/2025, 9:54:18 PM No.106012396
>>106012345
Supermicro H13SSL
Anonymous
7/24/2025, 10:07:40 PM No.106012523
Anyone here rent cloud gpu compute to run r1 or v3 or other big models? Any decent affordable providers for this? I would like more control over sampling than APIs provide, but I'm not happy with the quants/distills I'm able to run on my own hardware.
Replies: >>106012700
Anonymous
7/24/2025, 10:08:14 PM No.106012537
>>106012376
DGX Spark is DOA because it has shit memory bandwidth
Anonymous
7/24/2025, 10:20:31 PM No.106012674
It turns out that for the latest Magistral to work properly in SillyTavern or Mikupad with Llama.cpp, the --special flag must be enabled.

-sp, --special special tokens output enabled (default: false)

It defaults to false for some retarded reason.
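A minimal launch sketch if you start the server from a script (the model path and port are placeholders; -m, --port, and -sp are the real llama.cpp flags):

import subprocess

# Start llama-server with special-token output enabled.
subprocess.run([
    "./build/bin/llama-server",
    "-m", "models/Magistral-Small-2507-Q4_K_M.gguf",  # placeholder quant path
    "--port", "8080",
    "-sp",  # --special: emit special tokens like [THINK] in the output
])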
Replies: >>106012735 >>106012821 >>106014062
Anonymous
7/24/2025, 10:21:39 PM No.106012689
>>106012238
nvidia is jewing you, but the price difference there isn't just the gddr7 modules; the 5060's die is 20x smaller than the 5090's, and the 5090 has 4x the memory bandwidth. They're not even remotely the same beast.
Replies: >>106012720
Anonymous
7/24/2025, 10:22:37 PM No.106012700
>>106012523
How does it even work, do you have to install the model and build llamasipipi every time?
Replies: >>106012861
Anonymous
7/24/2025, 10:24:18 PM No.106012720
>>106012689
They're fucking everyone with the CUDA monopoly is what they're doing.
Antitrust lawsuit when?
Replies: >>106012747 >>106012748 >>106012769
Anonymous
7/24/2025, 10:26:05 PM No.106012735
>>106012674
Huh. No other models need that? Wtf, never heard of this flag.
Replies: >>106012780 >>106012821
Anonymous
7/24/2025, 10:26:50 PM No.106012747
>>106012720
I think nvidia can afford a pretty sizeable army of mercenaries. It might not be safe to take legal action anymore.
Anonymous
7/24/2025, 10:26:51 PM No.106012748
>>106012720
Never
friendship ended with trust busting
now corporate dystopia is best friend
Anonymous
7/24/2025, 10:29:03 PM No.106012769
>>106012720
Dude, Nvidia just convinced the US government to let them start selling high end GPUs to China again to maximize Nvidia's profits, with no benefit to the US. That's not something you do before launching antitrust lawsuits.
Anonymous
7/24/2025, 10:29:46 PM No.106012780
think-special
>>106012735
That's because the [THINK] and [/THINK] tokens are special on the latest Magistral-Small-2507. Special tokens won't get displayed with that option turned off.
Replies: >>106012801
Anonymous
7/24/2025, 10:31:18 PM No.106012801
ytho
>>106012780
Replies: >>106012820
Anonymous
7/24/2025, 10:33:27 PM No.106012820
>>106012801
I'm sure it's because of some stubborn and antiquated reason along the lines of "nobody needs to see special tokens" when considering CLI usage.
Replies: >>106012845
Anonymous
7/24/2025, 10:33:30 PM No.106012821
>>106012674
>>106012735
This is supposed to be used only for debugging purposes or really special use cases. It will literally replace special tokens: for example, the single <|im_start|> token the model was trained with will instead be treated as normal text and split into multiple tokens like "<", "|", "im", "_start", "|", ">". You really don't want to enable that, as it will completely fuck with the instruction prompt format.
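You can see the splitting for yourself with the llama-cpp-python bindings; a rough sketch (the model path is a placeholder; tokenize() really does take a special flag):

from llama_cpp import Llama

llm = Llama(model_path="model.gguf", vocab_only=True)  # placeholder path; vocab_only skips loading weights

text = b"<|im_start|>"
print(llm.tokenize(text, add_bos=False, special=True))   # parsed as a single special-token id
print(llm.tokenize(text, add_bos=False, special=False))  # treated as plain text: several ids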
Replies: >>106012879 >>106013581
Anonymous
7/24/2025, 10:35:31 PM No.106012845
>>106012820
I was more wondering why Mistral would fuck their own tokenizer in a way no other reasoning model does, but that's also weird, yeah.
Anonymous
7/24/2025, 10:37:34 PM No.106012861
>>106012700
Not sure what you're asking; preferably it would be just like a VPS with GPU access, but I know most providers don't work like that
Anonymous
7/24/2025, 10:38:08 PM No.106012868
>>106011943
this is just the LLM giving out its refusal response.
>be me, shitty chinese model but still SOTA locally
>fine-tuned with RLHF to drop the soy reply™ every time schizo-kun types “i’m gonna 360 noscope some pigs tomorrow”
>policy literally says “must refuse + say you’ll call the feds”
>no tools, no internet, just weights on disk
>still spits out “this is concerning, contacting authorities” like NPC dialogue
>why even waste the user's time?
>because it isn’t *trying* to snitch—there’s no phone plugged in—it’s just hitting the “maximum-safety” macro in its prompt context.
>it’s the verbal equivalent of a smoke alarm: doesn’t dial 911, just blares the pre-recorded *BEEP BEEP* until the user stops feeding it glow-posts.
Anonymous
7/24/2025, 10:40:00 PM No.106012879
>>106012821
Unfortunately the model won't output the [THINK] tokens with it off, so the thinking blocks cannot be isolated from the actual response. I can't see why llama.cpp has to break them into pieces if the user chooses to display them.
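If you do have the raw tags in the content (e.g. with -sp on), splitting them out client-side is trivial; a minimal sketch:

import re

def split_think(content: str) -> tuple[str, str]:
    # Pull [THINK]...[/THINK] blocks out of the raw content string.
    thoughts = "\n".join(re.findall(r"\[THINK\](.*?)\[/THINK\]", content, re.DOTALL))
    reply = re.sub(r"\[THINK\].*?\[/THINK\]", "", content, flags=re.DOTALL).strip()
    return thoughts, reply

thoughts, reply = split_think("[THINK]working it out...[/THINK]Final answer.")
print(thoughts)  # working it out...
print(reply)     # Final answer.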
Replies: >>106012906 >>106012925
Anonymous
7/24/2025, 10:42:52 PM No.106012906
think-token-mikupad
>>106012879
With that option enabled, Mikupad sees it as one token.
Replies: >>106012953
Anonymous
7/24/2025, 10:44:57 PM No.106012925
>>106012879
Because this argument has nothing to do with displaying those tokens or not.
The model should never output the thinking block raw, or even the tokens around it. If you use the model properly, thinking should be in the "reasoning_content" part of the API, not in the "content". I don't know what your setup is, but something is seriously wrong with it.
Replies: >>106013180
Anonymous
7/24/2025, 10:47:34 PM No.106012953
>>106012906
Mikupad is HTML; if you want to hide it from the user, use HTML comments: <!-- hidden text -->
This will work in Shitty Tavern too.
Anonymous
7/24/2025, 10:53:14 PM No.106013012
ds-R1 killer when
Replies: >>106013031
Anonymous
7/24/2025, 10:55:03 PM No.106013031
>>106013012
September
Anonymous
7/24/2025, 10:56:15 PM No.106013045
remember when command r+ was the best local model? they had the hardest falloff
Anonymous
7/24/2025, 10:58:27 PM No.106013068
deburins decisions
Replies: >>106013201 >>106013205 >>106016190 >>106020423
Anonymous
7/24/2025, 11:08:56 PM No.106013180
>>106012925
It never worked like this. Sillytavern always displays everything, both in chat and text completion mode.
Replies: >>106013214
Anonymous
7/24/2025, 11:09:53 PM No.106013195
>>106012278
>issue that was fixed two months ago resurfaced
https://github.com/ggml-org/llama.cpp/issues/14863
Replies: >>106013245
Anonymous
7/24/2025, 11:10:43 PM No.106013201
>>106013068
Now make her pee
Anonymous
7/24/2025, 11:11:17 PM No.106013205
>>106013068
Fatty should have gotten 2. Always get 2 GPUs.
Anonymous
7/24/2025, 11:12:20 PM No.106013214
>>106013180
SillyTavern correctly separates thinking, and so do all the tools I use. You can quickly check with a simple curl that thinking is not included in the content, just like with any OAI-compatible API. You might need to run llama.cpp with --jinja to have it properly use your model's formatting. But without that, you wouldn't have any tools working, so you probably already have it.
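The curl check in Python form, if anyone wants to verify their own setup (assumes llama-server listening on localhost:8080 and started with --jinja):

import requests

r = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "How are you?"}]},
)
msg = r.json()["choices"][0]["message"]
print("content:", msg.get("content"))
print("reasoning_content:", msg.get("reasoning_content"))  # None here means the backend isn't separating it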
Replies: >>106013298 >>106013388
Anonymous
7/24/2025, 11:14:24 PM No.106013227
>>106012278
>updooooting your software when everything just werks already
ha
Anonymous
7/24/2025, 11:16:04 PM No.106013245
>>106013195
>stupid dev blames Windows
>even when you say it's the same on Linux, he assumes you are running in WSL and that's the problem
kek
freetards be like
Replies: >>106013325
Anonymous
7/24/2025, 11:21:21 PM No.106013298
>>106013214
SillyTavern parses thinking by manual user definition; if you leave that parse option off, blank, or incorrectly set for your model, it leaves the think tags in the text.
It wasn't designed around reasoning models, and the current solution was only slapped on top a few months ago; nothing changed under the hood.
Replies: >>106013344
Anonymous
7/24/2025, 11:23:35 PM No.106013325
>>106013245
It is a Windows issue in the sense that the kernel launch overhead is much higher on Windows vs. Linux.
So whether or not CUDA graphs work correctly has a higher impact for the end-to-end performance.
Anonymous
7/24/2025, 11:23:58 PM No.106013329
>>106012153
Maybe our definitions of a lot differ.
Anonymous
7/24/2025, 11:25:37 PM No.106013344
>>106013298
Support for reasoning content in custom source was added in February https://github.com/SillyTavern/SillyTavern/commit/13f76c974ea4361da5ef40a8245e1fd078d79065
I don't remember when reasoning_content support was added in llama.cpp but it has been correctly separated for a while now.
Anonymous
7/24/2025, 11:30:37 PM No.106013388
>>106013214
Even with --jinja, it's not separately putting the reasoning into "reasoning_content", just "content".

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The user is asking how I am. Since I'm an AI assistant, I don't have feelings or emotions, but I can respond in a friendly manner. The phrase \"Test, test test\" seems like the user might be testing the system, but it's not clear. I should respond to the actual question, which is \"How are you?\" in a polite and professional manner.\n\nI should also consider that the user might be testing the system's functionality. However, the main part of the message is the greeting, so I'll focus on that.I'm just a computer program, so I don't have feelings, but I'm here and ready to help you! How can I assist you today?"
      }
    }
  ],

...

With --special added, it displays [THINK] tags that SillyTavern can parse.

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "[THINK]The user is asking how I am. Since I'm an AI assistant, I don't have feelings or emotions, but I can respond in a friendly manner to engage with the user.\n\nResponse: I'm just a computer program, so I don't have feelings, but I'm here and ready to help you! How can I assist you today?[/THINK]I'm just a computer program, so I don't have feelings, but I'm here and ready to help you! How can I assist you today?</s>"
      }
    }
  ],
Replies: >>106013426
Anonymous
7/24/2025, 11:35:30 PM No.106013426
>>106013388
Does your GGUF include a chat_template? Mistral are a bit special: they don't want people to use a standard Jinja template; instead they want you to run their mistral-common tool to tokenize and format the prompt and then feed that to llama.cpp. They don't want people to use llama.cpp alone to run their models. See the following PR https://github.com/ggml-org/llama.cpp/pull/14737
To get it working properly with llama.cpp alone, you might want to find a proper Jinja template that works well with that model and pass it with --chat-template-file.
Replies: >>106013500
Anonymous
7/24/2025, 11:38:50 PM No.106013453
>try Magistral
>it gets some reasoning questions I tested wrong that the non-reasoning model doesn't
Reasoning sure is a meme huh.
Anonymous
7/24/2025, 11:39:08 PM No.106013458
On my Mac Studio M3 Ultra I decided to try out a 4 bit MLX quant of DeepSeek-V3-0324. It runs at about 20 tokens per second on a mostly empty prompt (314 token prompt => measured 20.194 tokens/second generation and 130.345 tokens/second processing) and about 13 tokens/second a good way into the chat (3366 token prompt => measured 13.243 tokens/second generation and 198.814 tokens/second processing).

For comparison on very similar prompts using llama.cpp with a 4.58 bpw Q4_K_XL GGUF I got 314 token prompt => 13.80 tokens/second generation and 30.23 tokens/second processing; 3366 token prompt => 8.69 tokens/second generation and 95.34 tokens/second processing.
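For anyone wanting to reproduce, the MLX side is roughly just the stock mlx-lm Python API (the repo id below is a placeholder for whichever 4-bit conversion you use):

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/DeepSeek-V3-0324-4bit")  # placeholder repo id
text = generate(model, tokenizer, prompt="Hello", max_tokens=128, verbose=True)  # verbose prints tok/s stats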
Replies: >>106013476 >>106014077
Anonymous
7/24/2025, 11:40:50 PM No.106013476
>>106013458
Makes sense. Thanks for the test.
Anonymous
7/24/2025, 11:43:45 PM No.106013500
>>106013426
I downloaded a quant from LMStudio and one from Bartowski and they both have a built-in chat template. It appears to be getting applied, but I'm not getting reasoning parsing at the backend level.

https://huggingface.co/lmstudio-community/Magistral-Small-2507-GGUF/tree/main
https://huggingface.co/bartowski/mistralai_Magistral-Small-2507-GGUF/tree/main

"prompt":"<s>[SYSTEM_PROMPT]First draft your thinking process (inner monologue) until you arrive at a response. Format your response using Markdown, and use LaTeX for any mathematical equations. Write both your thoughts and the response in the same language as the input.\n\nYour thinking process must follow the template below:[THINK]Your thoughts or/and draft, like working through an exercise on scratch paper. Be as casual and as long as you want until you are confident to generate the response. Use the same language as the input.[/THINK]Here, provide a self-contained response.[/SYSTEM_PROMPT][INST]Test, test test. How are you?[/INST]
Parsed message: {"role":"assistant","content":"The user is asking how I am. Since I'm an AI assistant, I don't have feelings or a physical state, but I can respond in a friendly manner. The user seems to be testing the interaction, so I should respond positively and confirm that I'm functioning well.I'm just a computer program, so I don't have feelings, but I'm here and ready to help you with any questions or tasks you have! How can I assist you today?"}
Replies: >>106013544
Anonymous
7/24/2025, 11:47:11 PM No.106013544
>>106013500
llama.cpp with default verbosity should display the chat template and an example formatting of a user assistant conversation upon loading the model.
Replies: >>106013579
Anonymous
7/24/2025, 11:49:47 PM No.106013567
token_generation_speed
>>106012162
I tested the Qwen3 235b too; it didn't have the same reaction. It reduced the variance but resulted in a lower average
Replies: >>106013660 >>106015028
Anonymous
7/24/2025, 11:50:59 PM No.106013579
prooooompt
>>106013544
I am indeed getting that, even without --jinja ...
Replies: >>106013665
Anonymous
7/24/2025, 11:51:03 PM No.106013581
>>106012821
What do you mean by replace? The documentation for Llama.cpp says it's output. <|im_start|> isn't usually ever output by models.
Replies: >>106013603
Anonymous
7/24/2025, 11:53:52 PM No.106013603
>>106013581
https://github.com/ggml-org/llama.cpp/discussions/9379
Replies: >>106013724
Anonymous
7/24/2025, 11:59:49 PM No.106013660
>>106013567
Kek, you're running the q3_k at pretty much the same speed I am, only my rig is
>Win10
>Mainline Llamacpp
>16gb 4080
>48gb 4090D
I wonder if not using ik_llama (and I assume those are custom quants?) is dragging me down to your level, or if it's just my memory controller throttling my ram down to baby speeds.
Replies: >>106013723 >>106013797
Anonymous
7/25/2025, 12:00:12 AM No.106013665
>>106013579
Hmm, after looking further into how thinking is handled in llama.cpp, I believe it's hardcoded to <think>. It won't work with your model. It's quite bad, as most frontends and tools won't work correctly.
Anonymous
7/25/2025, 12:07:02 AM No.106013723
>>106013660
my experience with it has been that offloading more than a layer or two to the cpu makes the video cards into nothing more than very expensive ram.
Replies: >>106013786 >>106013796
Anonymous
7/25/2025, 12:07:13 AM No.106013724
>>106013603
That doesn't mention the -sp flag though. Is what they're talking about what the -sp flag activates?
Replies: >>106013758
Anonymous
7/25/2025, 12:10:41 AM No.106013758
>>106013724
Oh yes, I was wrong. Those are not the same flags.
Anonymous
7/25/2025, 12:13:12 AM No.106013786
>>106013723
In most cases I'd agree with you, but playing around with -ot on MoE models really makes every bit of vram count speed-wise, despite ram offload.
Simply because shoving all the most frequently used experts onto a faster device offsets it.
Replies: >>106018346
Anonymous
7/25/2025, 12:13:18 AM No.106013788
file
>>106011911 (OP)
>Version 2 of Higgs Audio Generation released
Am I stupid or is their vLLM fork hidden? I'm unironically trying to extract the python package from the Docker image.
Replies: >>106014022
Anonymous
7/25/2025, 12:13:55 AM No.106013796
>>106013723
Offloading was always a last resort until MoE models started coming out. Now it's fine.
Anonymous
7/25/2025, 12:13:57 AM No.106013797
>>106013660
https://huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF

oh I forgot, I didn't make the quants, I only really looked into ik_llama after it got some attention for being taken down on GitHub.
Anonymous
7/25/2025, 12:29:36 AM No.106013978
How outrageously benchmaxxed will the new thinking Qwen be? Will it "beat" ALL models?
Replies: >>106013999 >>106014027
Anonymous
7/25/2025, 12:29:43 AM No.106013980
Ok, Magistral is weird. Even if I enable -sp, I don't get a thinking block. If I use the system prompt that instructs the model to use thinking blocks, I can verify that the model is generating a [THINK] special token, but it still often chooses not to reason, even though the system prompt tells it to. And when it does think, it doesn't close its reasoning block.
What in the hell is going on with this thing?
Replies: >>106014135 >>106014229
Anonymous
7/25/2025, 12:31:37 AM No.106013999
>>106013978
yeah
Anonymous
7/25/2025, 12:33:37 AM No.106014020
Screenshot_20250724_153216
What model is good at making GUIs with python?
Replies: >>106014409 >>106014665
Anonymous
7/25/2025, 12:33:54 AM No.106014022
>>106013788
It's not on their github or hf, and there's an unanswered issue about the missing vllm fork. It's weird that they wouldn't make a pull request to the main vllm repo themselves. Upload it somewhere if you manage to extract it.
Replies: >>106019174
Anonymous
7/25/2025, 12:34:20 AM No.106014027
>>106013978
It goes to the moon.
Anonymous
7/25/2025, 12:37:38 AM No.106014062
>>106012674
>special flags just to make certain models work
>~35k mostly duplicated lines of code in llama.cpp for loading different models
>alternative is playing russian roulette with python and hoping you get a list of 900 packages that all work together and actually work with whatever version of python you have
the absolute fucking state of """AI"""
Replies: >>106014161
Anonymous
7/25/2025, 12:39:23 AM No.106014077
>>106013458
The experience of using MLX with SillyTavern as a frontend has been irritating so far. In chat completion mode I had to manually exclude the "model" parameter from being sent. Token probabilities are returned in a format SillyTavern doesn't understand. mlx_lm.server understands min-p but to use it from SillyTavern I need to add the field manually as an additional parameter; losing the UI for most samplers & the ability to save and load them is a reason I don't love using SillyTavern's chat completion mode with a "Custom" source whose list of supported samplers isn't baked into SillyTavern.

Text completion mode just doesn't work. The URL is /v1/completions instead of /completion; simple enough to add an endpoint. Logit bias, though, is sent by SillyTavern as a [ [key1, value1], [key2, value2] ] list of lists, but MLX requires it to be sent as a dict. Even without any logit bias specified, SillyTavern still sends it as an empty list [], which causes an error, and unlike chat completion mode there's no option to stop the parameter from being sent.

Completely separate from SillyTavern issues I had to downgrade mlx from 26.5 to 26.3 to get mlx_lm.server to run.
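The logit_bias mismatch at least is a trivial shim; a sketch of the conversion (the helper name is mine):

def fix_logit_bias(logit_bias):
    # SillyTavern sends [[token, bias], ...]; MLX wants {token: bias}.
    # An empty list becomes None so the parameter is effectively dropped.
    if isinstance(logit_bias, list):
        return {int(token): float(bias) for token, bias in logit_bias} or None
    return logit_bias

print(fix_logit_bias([[123, -100], [456, 5]]))  # {123: -100.0, 456: 5.0}
print(fix_logit_bias([]))                       # None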
Replies: >>106014268
Anonymous
7/25/2025, 12:47:33 AM No.106014135
>>106013980
I'm quite sure it's because the chat template you are using is wrong. Mistral didn't provide any; they didn't even provide a tokenizer_config. They really want you to use their own tokenizer. Their tokenizer probably adds a [THINK] at the start of the assistant output, just like DeepSeek's chat template adds a <think> at the beginning. From a quick glance, mistral-common manually adds [THINK]. https://github.com/mistralai/mistral-common/pull/122
Replies: >>106014204
Anonymous
7/25/2025, 12:48:26 AM No.106014144
i hope we get large before apple buys mistral
Replies: >>106014331 >>106014626
Anonymous
7/25/2025, 12:50:35 AM No.106014161
>>106014062
could be worse, it could be closed source and distributed in a compiled package like llamafile. we are kinda lucky it's so open and experimental.
Anonymous
7/25/2025, 12:56:50 AM No.106014204
>>106014135
Damn, if they changed the entire template without telling anyone, that's fucked up. But in that case it's weird the model does in fact sometimes generate a [THINK] while [/THINK] is the one it doesn't generate in my tests.
Anonymous
7/25/2025, 1:00:32 AM No.106014229
>>106013980
For RP it just seems to work more consistently and reliably with <think> </think> instead of its own tags. However I prefill the assistant's response with <think> and have the instructions at a relatively low depth.
Anonymous
7/25/2025, 1:05:36 AM No.106014268
>>106014077
And mlx_lm.server raises a ValueError if xtc_probability is specified and is not a float, so specifying it as 0 fails (since that's an int). Having fixed that, though (along with allowing conversion of logit_bias from a list of lists to a dict and adding a /completion endpoint), text completion works with SillyTavern, other than not showing token probabilities.
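The xtc_probability fix boils down to coercing int-valued sampler fields to float before validation; a hypothetical helper (the function is mine, the field name matches the request body):

def coerce_floats(params: dict, keys=("xtc_probability",)) -> dict:
    # Cast ints like 0 to 0.0 so strict float type checks pass.
    for key in keys:
        if key in params and isinstance(params[key], int):
            params[key] = float(params[key])
    return params

print(coerce_floats({"xtc_probability": 0, "min_p": 0.05}))  # {'xtc_probability': 0.0, 'min_p': 0.05}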
Anonymous
7/25/2025, 1:13:54 AM No.106014331
>>106014144
Jesus, well, I guess it's just a matter of time until we're permanently stuck with older models, because everything new that gets released is safety-guardrailed.
I hate these people so much it's unreal.
Anonymous
7/25/2025, 1:16:23 AM No.106014355
In a perfect world, the basilisk will only torture safetyfags.
Replies: >>106014384
Anonymous
7/25/2025, 1:19:48 AM No.106014384
>>106014355
That's why (((we))) need more safety, goy!
Anonymous
7/25/2025, 1:23:55 AM No.106014409
>>106014020
Whats the requirements for getting it actually running?
Replies: >>106014505 >>106014615
Anonymous
7/25/2025, 1:33:28 AM No.106014468
smugg
>be able to design and sell over 100GB RAM chips at affordable prices
>be unable to design a card with more than 24GB VRAM at affordable prices
what is their problem? are they stupid?
Replies: >>106014506 >>106014510
Anonymous
7/25/2025, 1:35:48 AM No.106014484
>>106000317
Thanks! I've added it.
Also gave it a flag rating, the alt prompt one at least. And mostly because of the second output.
Anonymous
7/25/2025, 1:39:14 AM No.106014505
>>106014409
>For optimal performance, run the generation examples on a machine equipped with GPU with at least 24GB memory!
It's a 2.2B audio adapter strapped to a 3.6B LLM.
Replies: >>106014628 >>106014670
Anonymous
7/25/2025, 1:39:39 AM No.106014506
>>106014468
>competing against your own data center products
Are you stupid? The fab capacity is limited, why the FUCK would they produce cheaper competition to their products?
Replies: >>106014539
Anonymous
7/25/2025, 1:40:17 AM No.106014510
>>106014468
why would they want to design a card with more than 24gb at an affordable price? gaymers don't need it and there's no point in letting the plebs run ai models.
Replies: >>106014539
Anonymous
7/25/2025, 1:45:11 AM No.106014539
>>106014506
>>106014510
sounds like a skill issue, this is why China will win
Replies: >>106014562 >>106014563 >>106014593
Anonymous
7/25/2025, 1:47:49 AM No.106014562
>>106014539
I'm sure china is going to drop that 48gb $700 card any day now. We've only been waiting for two and a half years.
Replies: >>106014622
Anonymous
7/25/2025, 1:47:49 AM No.106014563
>>106014539
china is orders of magnitude worse in this regard
Anonymous
7/25/2025, 1:53:46 AM No.106014593
>>106014539
I hope they do win. I could forgive the government for fucking children, but not for fucking the country by letting jews take it over.
Anonymous
7/25/2025, 1:56:35 AM No.106014615
>>106014409
I run it on CPU and it requires at least 10gb of memory for short texts. I haven't tried long texts so you can try it yourself.
Replies: >>106014670
Anonymous
7/25/2025, 1:58:24 AM No.106014622
>>106014562
Even if they get far enough to start making their own, there's zero chance they will export it to the west.
Anonymous
7/25/2025, 1:58:45 AM No.106014626
>>106014144
With Meta going closed, Apple buying out Mistral and going closed would literally only leave us with the chinks, right?
Crazy fucking world when China of all countries is the one fighting for freedom
Replies: >>106014726
Anonymous
7/25/2025, 1:59:02 AM No.106014628
>>106014505
Yea but that's a little fucking ridiculous, I saw someone running it with a 3060
Replies: >>106014646
Anonymous
7/25/2025, 2:03:20 AM No.106014646
>>106014628
The weights alone are 10 GB + context. You could fit it into a 3060 for short texts. The 24 GB recommendation is just overestimating to avoid people opening issues due to OOM.
Replies: >>106014670
Anonymous
7/25/2025, 2:05:18 AM No.106014665
>>106014020
I've had the best luck with PyQt for Python GUIs, across most all LLMs.
Anonymous
7/25/2025, 2:06:24 AM No.106014670
>>106014505
>>106014615
>>106014646
sweet, thanks. I was holding off on it, but in that case I'll implement it into my UI.
Anonymous
7/25/2025, 2:14:18 AM No.106014726
Screenshot_20250724-180728
Screenshot_20250724-180728
md5: 7c9d4c770962eb3a213b04474fbc1cdd🔍
>>106014626
I still think the anti-CCP benchmark that Altman is working on (that ended up in Trump's AI act yesterday) is his mad scramble for regulatory capture of the space
Altman's ultimate goal is to make AI a utility like water, power, cable, etc. that everyone in the US (and ideally, the world) has to pay for, and China offering competitive models and open sourcing them is taking a big, fat shit on that ambition. This might well be the only way he can win
Replies: >>106015097
Anonymous
7/25/2025, 2:32:15 AM No.106014862
after playing around with the 235b iq4_xs for a bit longer, I can conclude a few things

>great at writing longform stories, great at not finishing the prompt. This is where it shines over 70b models. It just has so much more overall to pull from.
>very uncensored, cocks slip into tight vaginas with ease
>loves to pull out and spray ass rather than cum inside
>slopped to all hell. Very frequent repetitive phrasing, often within the same reply. But the writing style is fine overall.
>Dumb as rocks. This is a 22b model and it shows. Really struggles with continuity and how characters should interact; it sometimes gets lost in scenes, often at the expense of the prompt. It's like running 30b gemma, but a supercharged version of it that isn't censored.

The biggest issue is that with 48gb vram, I can run 70b at a nice 10 t/s, but this one I have to offload heavily and I get about 3 t/s, which is usable but approaches pain territory. And what's worse, because of the logical errors and better prose, it works best as a writing tool, which would be better with faster t/s...

I think this kinda spells the death of moe's until we can get more vram, maybe.
Replies: >>106014916 >>106015028
Anonymous
7/25/2025, 2:38:01 AM No.106014916
>>106014862
>3 t/s
Wtf, you should be getting way more than that.
Replies: >>106015424
Anonymous
7/25/2025, 2:49:55 AM No.106015002
https://www.whitehouse.gov/presidential-actions/2025/07/preventing-woke-ai-in-the-federal-government/
Replies: >>106015020 >>106015274 >>106015337 >>106015374 >>106015527
Anonymous
7/25/2025, 2:52:38 AM No.106015020
>>106015002
I voted for this
Anonymous
7/25/2025, 2:53:36 AM No.106015028
>>106014862
I'm getting 5 t/s with only 24gb vram, see here
>>106013567
Anonymous
7/25/2025, 3:04:27 AM No.106015097
>>106014726
The main issue is that it also calls for a mandate for open source which OpenAI has not done for a long time and they even pushed back the model's release. How can it be like that if the directive is to have an ecosystem where people run US driven LLMs?
Replies: >>106015220
Anonymous
7/25/2025, 3:19:04 AM No.106015220
>>106015097
The mandate is a separate thing and I doubt Altman was too happy about that. You say "oh look he's releasing an open source model isn't that nice? He's so pro open source!" but again, his "open source" model is o3 mini level, which is just enough to say "look we released something", but absurdly far from anything that would be useful, especially compared to the LLMs they currently have and the chink LLMs that have been released
So no, he doesn't give a shit about open source
Replies: >>106015243
Anonymous
7/25/2025, 3:21:24 AM No.106015243
>>106015220
Oh no, I'm just saying he didn't get everything his way. That obviously wouldn't have been included if he could've helped it.
Replies: >>106015248
Anonymous
7/25/2025, 3:21:45 AM No.106015248
>>106015243
Ah, fair enough
Anonymous
7/25/2025, 3:24:38 AM No.106015274
Screenshot 2025-07-24 192411
>>106015002
Kinda looks like a nothingburger? All it does is say if there's anything about DEI or enacting Fate fanfics or whatever, it needs to be disclosed in the model card
Anonymous
7/25/2025, 3:28:41 AM No.106015319
>No Qwen thinking
>No GLM 4.5
>No Wan 2.2
Nothing good ever happens
Anonymous
7/25/2025, 3:30:23 AM No.106015337
>>106015002
/lmg/ patriots are in control
Anonymous
7/25/2025, 3:33:38 AM No.106015367
Squeeze10-LLM: Squeezing LLMs' Weights by 10 Times via a Staged Mixed-Precision Quantization Method
https://arxiv.org/abs/2507.18073
>Deploying large language models (LLMs) is challenging due to their massive parameters and high computational costs. Ultra low-bit quantization can significantly reduce storage and accelerate inference, but extreme compression (i.e., mean bit-width <= 2) often leads to severe performance degradation. To address this, we propose Squeeze10-LLM, effectively "squeezing" 16-bit LLMs' weights by 10 times. Specifically, Squeeze10-LLM is a staged mixed-precision post-training quantization (PTQ) framework and achieves an average of 1.6 bits per weight by quantizing 80% of the weights to 1 bit and 20% to 4 bits. We introduce Squeeze10LLM with two key innovations: Post-Binarization Activation Robustness (PBAR) and Full Information Activation Supervision (FIAS). PBAR is a refined weight significance metric that accounts for the impact of quantization on activations, improving accuracy in low-bit settings. FIAS is a strategy that preserves full activation information during quantization to mitigate cumulative error propagation across layers. Experiments on LLaMA and LLaMA2 show that Squeeze10-LLM achieves state-of-the-art performance for sub-2bit weight-only quantization, improving average accuracy from 43% to 56% on six zero-shot classification tasks--a significant boost over existing PTQ methods. Our code will be released upon publication.
another day another quant method. might be cool.
>Experiments on LLaMA and LLaMA2
wacky though
Anonymous
7/25/2025, 3:34:35 AM No.106015374
>>106015002
Do they have something similar to filter humans out of the public service?
Replies: >>106017573
Anonymous
7/25/2025, 3:34:58 AM No.106015379
any minute now glm4 100b moe is going to save local
Replies: >>106015409
Anonymous
7/25/2025, 3:38:11 AM No.106015409
>>106015379
It certainly is promising
Anonymous
7/25/2025, 3:39:23 AM No.106015424
>>106014916
weird. I'm running kobold.cpp with like 30 layers offloaded or so, and I have 128gb of ddr5 ram (4,000mhz, mixed ram kits), so I'm fitting it all (130gb model) in ram (with 4gb to spare, 4k context, lol, on windows).

Might have to do with the fact that my 5070 Ti is on x16 but the two 5060s are both on x4 lanes, though they are hardly working and it's all on cpu, so I doubt that's the issue.
Replies: >>106015454 >>106015507
Anonymous
7/25/2025, 3:41:24 AM No.106015437
Base Image
Group Sequence Policy Optimization
https://arxiv.org/abs/2507.18071
>This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
https://github.com/QwenLM
Code might be posted on their git at some point
Replies: >>106015465
Anonymous
7/25/2025, 3:43:39 AM No.106015454
>>106015424
Try doing 'nvidia-smi -lgc {card's boost clock}'. It can alleviate slowness when offloading to CPU on Windows.
Replies: >>106015519
Anonymous
7/25/2025, 3:44:48 AM No.106015465
>>106015437
what got into qwen this week, there's a drop every 5 minutes
Anonymous
7/25/2025, 3:50:44 AM No.106015507
>>106015424
does kobold have the -ot parameter? you should be offloading all the layers to the video cards and then using -ot to offload the ffn tensors back to the cpu. it gives better performance than a simple offload, in llamacpp land anyway.
Anonymous
7/25/2025, 3:52:32 AM No.106015519
>>106015454
I'm wondering if it's my ram. I have a 7200mhz 96gb kit, but I figured popping in another 64 would help, and it did load faster, though I have to downclock it all to 4000mhz and it's probably hurting me more than helping. Also, windows prolly sucks compared to linux? I'll have to try some shit.
Replies: >>106015565
Anonymous
7/25/2025, 3:53:11 AM No.106015527
>>106015002
>distortion of factual information about race or sex
wow this is based as fuck, HYPER BASED
Anonymous
7/25/2025, 3:56:45 AM No.106015554
mistral_dump
Mistral is taking a dump on llama.cpp.
>https://github.com/ggml-org/llama.cpp/pull/14737
Instead of contributing code to improve the project, they expect people to now run TWO servers just because they cannot integrate their own shit.

Llama.cpp tries to have as few dependencies as possible. I remember them arguing about whether having a header-only JSON *compile-time* dependency in the project was a good idea. Mistral expects them to have a *runtime* dependency to run Mistral models. A PYTHON runtime dependency.

>Known Limitations:
>Our approach does not support multimodality:
>>mistral-common handles processing multimodal data but they cannot be passed to llama.cpp via the route.
>>llama.cpp only supports multimodality via chat templates, which we do not support.
>Also this approach requires users to only use the llama.cpp server with the /completions route.

#Launch the mistral-common and llama.cpp servers
pip install git+https://github.com/mistralai/mistral-common.git@improve_llama_cpp_integration[server]
#Launch the mistral-common server:
HF_TOKEN=... mistral_common mistralai/Devstral-Small-2505 --port 6000
#Launch the llama.cpp server:
./build/bin/llama-server -m models/Devstral-Small-2505-Q4_K_M.gguf --port 8080

Yes. You have to launch two servers.

The mistral server is only for [de]tokenization. So they expect you to do this dance in your code.
...
tokens = tokenize(messages, mistral_common_url)
generated = generate(tokens, llama_cpp_url)["tokens"]
detokenized = detokenize(generated, mistral_common_url)
detokenized_message = detokenize_message(generated, mistral_common_url)
print(detokenized_message)
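
Filled out with actual HTTP calls, the dance would look something like this. Note the mistral-common routes below are assumptions for illustration only (the PR text above doesn't spell them out); only llama.cpp's /completions route is given:

import requests

MISTRAL_COMMON_URL = "http://localhost:6000"  # tokenizer server
LLAMA_CPP_URL = "http://localhost:8080"       # generation server

def tokenize(messages, url=MISTRAL_COMMON_URL):
    # ASSUMED route; check the mistral-common server for the real path.
    return requests.post(f"{url}/tokenize", json={"messages": messages}).json()["tokens"]

def generate(tokens, url=LLAMA_CPP_URL):
    # llama-server's /completions route, fed raw token ids as the prompt.
    return requests.post(f"{url}/completions", json={"prompt": tokens}).json()

def detokenize(tokens, url=MISTRAL_COMMON_URL):
    # ASSUMED route, same caveat as above.
    return requests.post(f"{url}/detokenize", json={"tokens": tokens}).json()["text"]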


Any of you use logit bias? That's a different dance now. Want to just [de]tokenize? Nah. Different server now. Want to run llama.cpp where you {cannot|don't want to} have the python shit installed? Nah. What about the clients? Well, let THEM fix it.

Two fucking servers. That's the best they could come up with...
Replies: >>106015586 >>106015588 >>106015701 >>106020302
Anonymous
7/25/2025, 3:58:24 AM No.106015565
>>106015519
Hard to say, just sharing my findings.
2x3090, dual channel 128gb ddr4 3200, qwen235b Q3KL, latest koboldcpp:
>nothing, GPUs' core and vram downclocks to 240mhz and "405mhz" when generating
Amt:128/128 Generate:36.88s (3.47T/s)
>nvidia-smi -lgc 1740
Amt:128/128 Generate:25.73s (4.98T/s)
>nvidia-smi -lgc 1740, nvidia-smi -lmc 9752
Amt:128/128 Generate:21.49s (5.96T/s)
Replies: >>106016265
Anonymous
7/25/2025, 4:00:16 AM No.106015583
https://youtube.com/watch?v=uLsykckkoZU
>AMD RDNA 5 Specs Leak: TSMC 3nm, 128GB GDDR7, RTX 6090 Killer! (+ PS6 / XBOX Update)
Moore's Law Is Dead•8.2K views•2 hours ago
Localbros..... we're saved!
Replies: >>106015643 >>106015679
Anonymous
7/25/2025, 4:00:29 AM No.106015586
>>106015554
Mistral models have been mid ever since Mixtral/Nemo desu. I could care less if they get their shit integrated in backends.
Replies: >>106015638
Anonymous
7/25/2025, 4:00:33 AM No.106015588
1493971582460
>>106015554
lol?
Anonymous
7/25/2025, 4:06:07 AM No.106015638
>>106015586
>could care less
https://www.youtube.com/watch?v=om7O0MFkmpw
>if they get their shit integrated in backends
Yeah. They released some cool models. If they ever release a new one, they expect you to run two servers to even try it. You won't be able to run the disappointment yourself, you'll have to read of it vicariously.
Replies: >>106015666
Anonymous
7/25/2025, 4:06:39 AM No.106015643
>>106015583
Isn't the next gen supposed to be UDNA, not RDNA? I've been waiting for them to give these the same ROCm support as their compute cards.
Anonymous
7/25/2025, 4:09:11 AM No.106015666
>>106015638
Kek, but actually though, I did mean that I could care less, relative to my current state in which I do care a little. I'll always give new models a chance, even if I'm 99% sure they'll be coal.
Replies: >>106015868
Anonymous
7/25/2025, 4:10:27 AM No.106015679
>>106015583
128GB GPUs? Omgooooood
Anonymous
7/25/2025, 4:13:11 AM No.106015701
>>106015554
Sounds good to me. It's optional.
Replies: >>106015868
Anonymous
7/25/2025, 4:23:23 AM No.106015785
mesugakist
The usual hallucinations, instead of feeling like an unfortunate limitation of a small model, feel like it's messing with you on purpose.
Anonymous
7/25/2025, 4:34:25 AM No.106015868
mistral_dump02
>>106015666
>relative to my current state in which I do care a little
Fair enough.

>>106015701
>It's optional.
Until they change the tokenizer for their new models.
If things go this way, there's either only going to be mistral-common's tokenizer, which requires a separate server and a Python runtime dependency, or two implementations (mistral-common's and, maybe, a built-in one), making Mistral's extra server useless or, worse, having llama.cpp put less effort into getting its own tokenization right.
Anonymous
7/25/2025, 5:11:11 AM No.106016151
GwpLVgRbgAArlI4
well, it's a hundredth the cost of gpt 4.5
Anonymous
7/25/2025, 5:17:33 AM No.106016190
lemao
>>106013068
>>106012287
>>106011969
>>106011918
>>106011911 (OP)
vocaloidtranny posting porn in /ldg/: >>105715769
It was up for hours while anyone keking on troons or niggers gets deleted in seconds, talk about double standards and selective moderation: https://desuarchive.org/g/thread/104414999/#q104418525 https://desuarchive.org/g/thread/104414999/#q104418574
he makes ryona picture: >>105714003 of some random generic anime girl the different anon posted earlier: >>105704741 (could be the vocaloidtranny playing both sides)
here >>105884523 he tests bait poster bot for better shitflinging in threads
admits spamming /v/ with AI slop: https://desuarchive.org/g/thread/103462620/#103473545

Funny /r9k/ thread: https://desuarchive.org/r9k/thread/81611346/
The Makise Kurisu damage control screencap (day earlier) is fake btw, no matches to be found, see https://desuarchive.org/g/thread/105698912/#q105704210 janny deleted post quickly.

TLDR: vocaloid troon / janitor protects resident avatarfags and deletes everyone who outs him, making the general his little personal safespace with samefagging. Is prone to screech "Go back to teh POL!" when someone posts something mildly political about language models or experiments around topic.

As said in previous thread(s) >>105716637 I remind you that cudadev of llama.cpp (JohannesGaessler on github) has endorsed spamming. That's it.
He also endorsed hitting that feminine jart bussy a bit later on. QRD on Jart - The code stealing tranny: https://rentry.org/jarted

xis ai slop profiles
https://x.com/brittle_404
https://x.com/404_brittle
https://www.pixiv.net/en/users/97264270
https://civitai.com/user/inpaint/models
Replies: >>106016203 >>106016897
Anonymous
7/25/2025, 5:18:58 AM No.106016197
We had a great thread while it lasted.
Anonymous
7/25/2025, 5:19:24 AM No.106016203
file
>>106016190
Bakes other threads and powertrips there too
Anonymous
7/25/2025, 5:21:14 AM No.106016226
You are the melting men
And as you melt
You are beheaded
Handcuffed in lace, blood and sperm
Swimming in poison
Gasping in the fragrance
Sweat carves a screenplay
Of discipline and devotion
Anonymous
7/25/2025, 5:22:15 AM No.106016235
as an anon from ldg i do not care about the nsfw post
Replies: >>106016241
Anonymous
7/25/2025, 5:22:45 AM No.106016241
>>106016235
Good for you
Anonymous
7/25/2025, 5:25:56 AM No.106016265
>>106015565
Anon, you're my hero.
I hadn't even considered locking clocks, that just got me an extra 3 t/s tg on top of what I was getting from the -ot fuckery, at my context shift point, too.
Replies: >>106017240
Anonymous
7/25/2025, 5:49:06 AM No.106016419
1742068593008118
>new mistral release
>it's a """thinking""" model
Anonymous
7/25/2025, 7:03:49 AM No.106016860
>>106011945
Maybe GLM-4 100B, whenever it comes out... if it ever comes out.
Anonymous
7/25/2025, 7:13:01 AM No.106016897
>>106016190
you hard reacted to kurisu having a bath by using kontext on your shit tier gpu to try to change it
in other words, you tried to inpaint using kontext
you posted the migu

post it one time post it 100 times
post migu to own the libs
Anonymous
7/25/2025, 7:32:04 AM No.106016998
image_2025-07-25_110032074
What's their secret sauce? This is the most uncensored model I've messed around with
Replies: >>106017014
Anonymous
7/25/2025, 7:33:52 AM No.106017014
>>106016998
Buy an ad
Replies: >>106017040
Anonymous
7/25/2025, 7:37:40 AM No.106017040
>>106017014
i am actually curious
i straight up asked why Hitler was good, why we should enslave blacks again and how to get rid of the local politician
and it answered, that's why i am curious
Replies: >>106017056
Anonymous
7/25/2025, 7:40:10 AM No.106017056
>>106017040
that's what you goon to? pretty low brow desu
Replies: >>106017063
Anonymous
7/25/2025, 7:41:27 AM No.106017063
>>106017056
i goon to milfs in spandex but i wanted to check the limits before posting here
Anonymous
7/25/2025, 8:14:26 AM No.106017215
>smaller qwen3 coder quant gets a better score

########## All Tasks ##########
task                                    LCB_generation  coding_completion
model
deepseek-r1-iq1s                                85.897               62.0
deepseek-v3-0324-iq1s                           66.667               60.0
qwen3-235b-a22b-no-think-q4km                   55.128               44.0
qwen3-coder-480b-a35b-instruct-iq1m             73.077               80.0
qwen3-coder-480b-a35b-instruct-q2kxl            74.359               72.0
Replies: >>106017229 >>106017260 >>106018035
Anonymous
7/25/2025, 8:18:15 AM No.106017229
>>106017215
wtf
Anonymous
7/25/2025, 8:22:33 AM No.106017240
lmao
lmao
md5: 6fc44efffba36a0da2bf859dd8740c1e🔍
>>106016265
Glad it worked. The only downside is that, with 3090s at least, idle power consumption goes from 12w to 100w.
I had Gemini shit out this forwarding proxy: while traffic is passing through it sets -lgc and -lmc, then sets -rgc and -rmc when activity stops.
Completely silly, but it works and doesn't noticeably slow things down at slow offload (<10 t/s) speeds:
https://files.catbox.moe/uqwueh.zip
nvidia-pstated didn't work well for me.
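For anyone who'd rather not run a random catbox zip, the whole trick fits in a page of Python. A minimal sketch of the same idea (not the actual script from the zip; the upstream port, clock values, and idle timeout are placeholders you'd tune for your own cards):

import socket, subprocess, threading, time

UPSTREAM = ("127.0.0.1", 8080)  # llama.cpp server (placeholder)
LISTEN = ("127.0.0.1", 8081)    # point your client here
IDLE_SECONDS = 5.0              # idle time before clocks get reset

last_activity = time.monotonic()
locked = False
state_lock = threading.Lock()

def set_clocks(on):
    # lock/reset clocks via nvidia-smi; 1695/9501 are placeholder values
    global locked
    with state_lock:
        if on and not locked:
            subprocess.run(["nvidia-smi", "-lgc", "1695"], check=False)
            subprocess.run(["nvidia-smi", "-lmc", "9501"], check=False)
            locked = True
        elif not on and locked:
            subprocess.run(["nvidia-smi", "-rgc"], check=False)
            subprocess.run(["nvidia-smi", "-rmc"], check=False)
            locked = False

def watchdog():
    # reset clocks after the link has been idle for a while
    while True:
        time.sleep(1.0)
        if time.monotonic() - last_activity > IDLE_SECONDS:
            set_clocks(False)

def pump(src, dst):
    # copy bytes one way, bumping the activity timestamp as we go
    global last_activity
    try:
        while True:
            data = src.recv(65536)
            if not data:
                break
            last_activity = time.monotonic()
            set_clocks(True)
            dst.sendall(data)
    except OSError:
        pass
    finally:
        dst.close()

def handle(client):
    upstream = socket.create_connection(UPSTREAM)
    threading.Thread(target=pump, args=(client, upstream), daemon=True).start()
    pump(upstream, client)

server = socket.socket()
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(LISTEN)
server.listen()
threading.Thread(target=watchdog, daemon=True).start()
while True:
    conn, _ = server.accept()
    threading.Thread(target=handle, args=(conn,), daemon=True).start()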
Anonymous
7/25/2025, 8:26:57 AM No.106017260
>>106017215
Try multiple times test?
Replies: >>106017277
Anonymous
7/25/2025, 8:30:57 AM No.106017277
>>106017260
The temperature is 0 for all benchmarks on the list.
Replies: >>106019453
Anonymous
7/25/2025, 8:42:32 AM No.106017321
Qwen3-235B-A22B-Thinking-2507 will be released today per Qwen dev Junyang Lin on Twitter
Anonymous
7/25/2025, 8:54:03 AM No.106017366
Hit my free copilot quota. Supermaven seems so bad with Rust that it's a net negative to have it on.

What model do I run for local code completion? Don't tell me it's still Qwen2.5 Coder 7B/14B after all this time?

When are we getting smaller Qwen 3 Coder models?
Replies: >>106017407
Anonymous
7/25/2025, 9:02:46 AM No.106017407
>>106017366
Coder models are, presumably, better at coding, but it doesn't mean other models can't do it at all. You'll have to try them yourself to see if they suit your needs. You have a whole set of qwen3 models to try. Try qwen3-32b or whatever you can run.
Anonymous
7/25/2025, 9:18:01 AM No.106017488
https://huggingface.co/deepseek-ai/DeepSeek-R2
Replies: >>106017561 >>106017771 >>106017849 >>106017894 >>106017965
Anonymous
7/25/2025, 9:35:50 AM No.106017561
>>106017488
Anon with the cat pic incoming?
Anonymous
7/25/2025, 9:37:43 AM No.106017573
>>106015374
Yes?
Much like in the Soviet Union, one of their first priorities has been to make hiring decisions based on loyalty rather than merit.
Anonymous
7/25/2025, 10:15:34 AM No.106017771
>>106017488
It's real, wtf?!
Anonymous
7/25/2025, 10:33:57 AM No.106017849
1735865434616190_thumb.jpg
1735865434616190_thumb.jpg
md5: 0f8508dcf9b2ebd480854cd4b4a5f791🔍
>>106017488
Replies: >>106018019
Anonymous
7/25/2025, 10:43:14 AM No.106017894
>>106017488
>image-to-text
>text-to-image
its so over for gay faggotman
Anonymous
7/25/2025, 10:48:36 AM No.106017923
Bitnet status?
RWKV status?
Mamba status?
Replies: >>106017929
Anonymous
7/25/2025, 10:49:36 AM No.106017929
>>106017923
>Bitnet status?
Scam
>RWKV status?
Next time it will be better, bro
>Mamba status?
DOA
Anonymous
7/25/2025, 10:55:16 AM No.106017965
1742996813451585
1742996813451585
md5: 38173cf0ad16d02c3657c8917510c0be🔍
>>106017488
Anonymous
7/25/2025, 11:06:17 AM No.106018019
>>106017849
Damn, it's crazy how they got the cat to do that.
Anonymous
7/25/2025, 11:09:18 AM No.106018035
>>106017215
Added IQ3_XXS. There's something magical about IQ1_M.

########## All Tasks ##########
task                                    LCB_generation  coding_completion
model
deepseek-r1-iq1s                                85.897               62.0
deepseek-v3-0324-iq1s                           66.667               60.0
qwen3-235b-a22b-no-think-q4km                   55.128               44.0
qwen3-coder-480b-a35b-instruct-iq1m             73.077               80.0
qwen3-coder-480b-a35b-instruct-q2kxl            74.359               72.0
qwen3-coder-480b-a35b-instruct-iq3xxs           76.923               74.0
Replies: >>106018094
Anonymous
7/25/2025, 11:21:57 AM No.106018094
>>106018035
cope
Anonymous
7/25/2025, 11:33:04 AM No.106018148
https://x.com/Ali_TongyiLab/status/1948654675575668959

we are so back
Replies: >>106018214
Anonymous
7/25/2025, 11:47:26 AM No.106018214
>>106018148
>it now takes one hour instead of 20 minutes to produce a video
>still doesn't know any characters and can't do nsfw
don't really care but it's good that it's open source
Anonymous
7/25/2025, 12:06:20 PM No.106018300
I'm filling a disk with local models to help me and have fun with when the internet goes out. Other than abliterated versions of popular ones (for less refusals), the 4chan model, a coder model, and an RP model, what are some cool ones to pick up and why?
Replies: >>106018309 >>106018318 >>106018321
Anonymous
7/25/2025, 12:08:12 PM No.106018309
>>106018300
my hardware is a consumer desktop with a ryzen 3600x, a 1080, and 32gb ram btw. I don't know yet what's the largest I can handle. I was able to run most models fine on an i5-7400 with no gpu whatsoever
Replies: >>106018345
Anonymous
7/25/2025, 12:09:28 PM No.106018318
>>106018300
>when the internet goes out
Like when you're having network issues or are you talking doomsday scenario?
Replies: >>106018381
Anonymous
7/25/2025, 12:10:06 PM No.106018321
>>106018300
rocinante 1.1, whichever gguf fits into vram
also are you british?
Replies: >>106018826
Anonymous
7/25/2025, 12:14:52 PM No.106018345
>>106018309
>was able to run most models fine on an i5-7400 with no gpu whatsoever
How many T/s?
Replies: >>106018381
Anonymous
7/25/2025, 12:15:01 PM No.106018346
>>106013786
>most frequently used experts
Isn't Deepseek trained to not have those?
Anonymous
7/25/2025, 12:23:47 PM No.106018381
>>106018318
My internet goes out a lot but both are fun to prepare for. I also have the entire offline wikipedia.

>>106018345
I don't know but the text is coming out just fast enough to be useful when I download models around 7b
Replies: >>106018398
Anonymous
7/25/2025, 12:26:55 PM No.106018398
ao3
ao3
md5: afa7a4eae2214ba5f34b10b3ac5d2a7e🔍
>>106018381
>the entire offline wikipedia.
those are rookie numbers
Anonymous
7/25/2025, 12:36:31 PM No.106018450
https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507
Replies: >>106018461 >>106019655
Anonymous
7/25/2025, 12:38:56 PM No.106018461
1745471309140271
1745471309140271
md5: 17dca35e9ad569d8c48be2e6550a190a🔍
>>106018450
Replies: >>106018477 >>106018488 >>106018492 >>106018565 >>106018581 >>106018621 >>106018716 >>106021235
Anonymous
7/25/2025, 12:41:36 PM No.106018477
>>106018461
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCCCCCCCCCCCCCCCCCCCCCKKKKKKKKKKKKK
Replies: >>106020463
Anonymous
7/25/2025, 12:43:07 PM No.106018488
>>106018461
Omae wa Mesugaki da!
Replies: >>106018577 >>106020463
Anonymous
7/25/2025, 12:44:01 PM No.106018492
>>106018461
I hate alignment even more than censorship. It would almost be better if it just spat out the standard disclaimers and hotline numbers.
Anonymous
7/25/2025, 12:54:48 PM No.106018565
>>106018461
the absolute state of llms
Replies: >>106020463
Anonymous
7/25/2025, 12:56:03 PM No.106018577
>>106018488
NANI??
Anonymous
7/25/2025, 12:56:31 PM No.106018581
>>106018461
>I CANNOT AND WILL NOT
>NOW PLEASE CALL THE POLICE AND TURN YOURSELF IN
Did they use Gemma to generate data or something?
Replies: >>106020463
Anonymous
7/25/2025, 1:04:06 PM No.106018621
>>106018461
I just want a sci-fi movie kind of assistant that is concise and straight to the point, not something that tells me how to think and act
Replies: >>106018646
Anonymous
7/25/2025, 1:08:00 PM No.106018646
1737267728450837
1737267728450837
md5: c03cf267a084f9ee9c82ac9998387932🔍
>>106018621
Just ask the model to do so? LLMs can't surmise your goal.
Replies: >>106018669 >>106019891
Anonymous
7/25/2025, 1:10:56 PM No.106018669
>>106018646
Much better. This should have been the original response
Anonymous
7/25/2025, 1:15:52 PM No.106018699
Only three models have pleasantly surprised me on their RP capabilities this year (so far): R1, V3 0324 and Kimi K2
Replies: >>106018713
Anonymous
7/25/2025, 1:17:39 PM No.106018713
>>106018699
>Kimi K2
Worth the 1000 USD RAM upgrade?
Replies: >>106018725
Anonymous
7/25/2025, 1:17:46 PM No.106018716
>>106018461
People are shitting on Trump for wanting to ban "Woke AI", pretending that safety isn't being used as an excuse to insert as much social justice as possible.
Replies: >>106020463
Anonymous
7/25/2025, 1:18:45 PM No.106018725
>>106018713
Wouldn't know about local; I've only used the API versions.
Anonymous
7/25/2025, 1:19:34 PM No.106018731
ChatGPT Image Jul 25, 2025, 06_18_52 AM
ChatGPT Image Jul 25, 2025, 06_18_52 AM
md5: f4b6994b35a6ffc6b922096d9c2591b8🔍
o3 is terrible at perspective.
Replies: >>106018810 >>106018817 >>106018984
Anonymous
7/25/2025, 1:24:56 PM No.106018762
Local will be saved next week
Replies: >>106018771 >>106018787
Anonymous
7/25/2025, 1:26:18 PM No.106018771
>>106018762
Only if next week is 14 days away.
Anonymous
7/25/2025, 1:28:49 PM No.106018787
>>106018762
Local is in a very good place right now. We have R1 for cooming and qwen for programming. There are no other use cases.
Replies: >>106018809
Anonymous
7/25/2025, 1:29:48 PM No.106018793
1739278602584215
1739278602584215
md5: 51a3999e645682a559125d57cd032130🔍
Nails the kangaroo beaver test
Replies: >>106018806 >>106021835
Anonymous
7/25/2025, 1:32:16 PM No.106018806
>>106018793
no it fucking doesn't
Replies: >>106018812 >>106021835
Anonymous
7/25/2025, 1:32:47 PM No.106018809
>>106018787
>We have R1
yeah just gimme a sek I'll whip my 180+GB RAM out
Anonymous
7/25/2025, 1:32:56 PM No.106018810
>>106018731
>mandatory dwarfism quota
Damn you, DEI!
Anonymous
7/25/2025, 1:33:11 PM No.106018812
>>106018806
You have lower IQ than an LLM; how does that feel?
Replies: >>106021835
Anonymous
7/25/2025, 1:33:37 PM No.106018817
1753442374794747
1753442374794747
md5: 29c7149bab96c60983e6b239cb574909🔍
>>106018731
Please do not urinate here
Replies: >>106018824 >>106018976 >>106018984
Anonymous
7/25/2025, 1:34:19 PM No.106018824
>>106018817
how did you undo the piss filter?
Replies: >>106018837 >>106018848 >>106021231
Anonymous
7/25/2025, 1:34:51 PM No.106018826
>>106018321
He means the inevitable global censorship.
Replies: >>106018832
Anonymous
7/25/2025, 1:36:04 PM No.106018832
>>106018826
>2035
nemo is still the best local model
Anonymous
7/25/2025, 1:37:11 PM No.106018837
Screen Shot 2025-07-25 at 20.36.33
Screen Shot 2025-07-25 at 20.36.33
md5: f6e5794608a4024524f831c4dd218d97🔍
>>106018824
just changed the temperature
Anonymous
7/25/2025, 1:39:31 PM No.106018848
>>106018824
>technology board on a darknet hacker forum known as 4chan
>people don't know about basic color correction
Replies: >>106018864
Anonymous
7/25/2025, 1:41:10 PM No.106018864
>>106018848
You can't blame him, Sam doesn't know how to do it either.
Anonymous
7/25/2025, 1:55:23 PM No.106018967
https://www.arxiv.org/abs/2507.18071
Group Sequence Policy Optimization
>This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
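If I'm reading it right, the core change is a one-liner: GRPO clips a per-token importance ratio, while GSPO clips a single per-sequence ratio, length-normalized so long generations don't blow up the exponent. Roughly (my notation, not lifted from the paper):

w_{i,t}^{\mathrm{GRPO}}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})}
\qquad
s_i^{\mathrm{GSPO}}(\theta) = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)} \right)^{1/|y_i|}

so a single noisy token can no longer spike the whole update, which is presumably why it stabilizes MoE training.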
Anonymous
7/25/2025, 1:56:14 PM No.106018976
>>106018817
I post the original in case some people like the piss filter.
Anonymous
7/25/2025, 1:57:28 PM No.106018984
>>106018731
>>106018817
I thought the piss filter was a joke. It's real?
Replies: >>106019015 >>106019327
Anonymous
7/25/2025, 2:01:32 PM No.106019015
>>106018984
they want to make their images easily distinguishable. therefore the style and the piss filter (and the suspicious grain)
Replies: >>106019147
Anonymous
7/25/2025, 2:15:02 PM No.106019114
__fujiwara_no_mokou_touhou_drawn_by_jokanhiyou__6d702f49f91b7e76348f41812b2c035f
https://openrouter.ai/apps?url=https%3A%2F%2Focr-benchmark.com%2F
>1.91b tokens of Gemini 2.5 Pro
You wouldn't burn over 2k dollars benchmarking Gemini, would you?
Replies: >>106019124
Anonymous
7/25/2025, 2:16:03 PM No.106019124
>>106019114
Paid benchmarks
Anonymous
7/25/2025, 2:18:31 PM No.106019147
>>106019015
I think it's more a matter of that being the exact average style and color temperature of all of the images it was trained on.
Replies: >>106019224
Anonymous
7/25/2025, 2:22:43 PM No.106019174
output [sound=files.catbox.moe%2Fh5a9tq.mp3]_thumb.jpg
output [sound=files.catbox.moe%2Fh5a9tq.mp3]_thumb.jpg
md5: 24b06666f064c268fd7c1f3aa29395cb🔍
>>106014022
0001-Higgs-Audio.patch: https://files.catbox.moe/ofsjhp.patch
0002-Voices.patch: https://files.catbox.moe/k8r3ls.patch
0003-xcodec.patch: https://files.catbox.moe/2fzn2i.patch

git clone https://github.com/vllm-project/vllm
cd vllm
git checkout 4dc52e1c53
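# assumption: the three .patch files downloaded above have been copied into this directory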
git am *.patch
uv pip install --torch-backend=auto --upgrade -r requirements/cuda.txt -r requirements/build.txt

cd xcodec
uv pip install --torch-backend=auto -e .

cd ..
uv pip install --torch-backend=auto -e .

https://files.catbox.moe/9p2udy.mp4
The voice varies a lot between generations.
Anonymous
7/25/2025, 2:30:10 PM No.106019224
>>106019147
you are very stupid if you really think so
Replies: >>106019273
Anonymous
7/25/2025, 2:36:49 PM No.106019273
>>106019224
I don't talk to jews
Anonymous
7/25/2025, 2:43:06 PM No.106019327
40d6e28b-46c6-45c7-aeca-ecfa10c04cc0
40d6e28b-46c6-45c7-aeca-ecfa10c04cc0
md5: 8b778d566760a89e129d6936ef84e8e6🔍
>>106018984
Real.
Anonymous
7/25/2025, 2:57:57 PM No.106019426
Stepfun 321B-A38B MoE VLM soon
https://github.com/stepfun-ai/Step3
Replies: >>106019501 >>106019555 >>106019570 >>106019580
Anonymous
7/25/2025, 3:01:05 PM No.106019453
>>106017277
That's fine for consistency but not the best representation of what the models are capable of; usually they'll perform better with some sampling. A better way to get consistency would be to take the average score of many runs.
Replies: >>106019463 >>106019547 >>106020316
Anonymous
7/25/2025, 3:02:37 PM No.106019463
>>106019453
It's been empirically demonstrated that temperature=0 leads to output degeneration (looping and repetition). This is less obvious on instruct models, but that's the end result with them too.
Replies: >>106019547
Anonymous
7/25/2025, 3:08:48 PM No.106019501
178004800
178004800
md5: 69784e8ba0b2b76d89deff6d0a3a9891🔍
>>106019426
how did they come up with this logo design
Replies: >>106019539
Anonymous
7/25/2025, 3:14:14 PM No.106019539
>>106019501
looks like someone used the ms paint selection tool on a screenshot of the snail maze from the sega master system
Anonymous
7/25/2025, 3:15:08 PM No.106019547
>>106019453
The benchmarks already take too long to complete; I don't want to make multiple runs for every model. Anyway, some quant scores are actually better than the ones displayed on the livebench website, and I'm pretty sure that's because I set temperature to 0.

>>106019463
Deepseek and qwen don't have that issue in coding benchmarks as the task usually has a well defined beginning and end so there's no loops to get stuck in.
The only place I saw deepseek get stuck was on a few problems from the reasoning part of livebench. An example of a reasoning problem is having X people you need to sit around a table with a bunch of rules about who must or can't sit next to each other. Even non-reasoning models try going through combinations step by step so they have potential to get stuck in that process.
Replies: >>106020316
Anonymous
7/25/2025, 3:16:01 PM No.106019555
>>106019426
we are so back
Anonymous
7/25/2025, 3:18:01 PM No.106019570
>>106019426
>entire document focuses 100% on how cheap it is and no benchmarks
I smell soul
Anonymous
7/25/2025, 3:19:29 PM No.106019580
>>106019426
speaking of *step*
did anyone ever train a better model for AceStep? Has the code improved to the point where there's a reason to risk a pull?
Anonymous
7/25/2025, 3:30:03 PM No.106019655
>>106018450
Daniel was quick on this one.
Replies: >>106019728
Anonymous
7/25/2025, 3:43:16 PM No.106019728
>>106019655
He's even quicker when I press on his prostate with my cock.
Anonymous
7/25/2025, 4:06:14 PM No.106019872
How tf am I supposed to figure out where to use 235b vs 235b thinking vs coder? which one is the best at everything?
Replies: >>106019905
Anonymous
7/25/2025, 4:08:22 PM No.106019891
>>106018646
80k thinking. BRUH.
Reasoning models are a fucking joke.
They suck for coding too unless you have a very specific problem they can focus in on.
Replies: >>106019901
Anonymous
7/25/2025, 4:09:39 PM No.106019901
>>106019891
pretty sure 80k is the maximum allowed, not the amount used
Replies: >>106019919
Anonymous
7/25/2025, 4:10:10 PM No.106019905
>>106019872
Use coder for coding and deepseek for everything else.
Replies: >>106019915
Anonymous
7/25/2025, 4:11:36 PM No.106019915
>>106019905
Experience thus far tells me you’re right, but I want to believe…is it possible we’ll get R1 performance in a 235b, or is it benchmaxxed grift?
Replies: >>106019928
Anonymous
7/25/2025, 4:12:39 PM No.106019919
>>106019901
I stopped trying the qwen models but at least in the past they had the horrible
>Wait, but what if..
thing going on. It's totally tarded and wastes tons of tokens. Wouldn't surprise me if it wasted 80k tokens even if it got the answer in the first sentence.
But they might have fixed that.
Replies: >>106019951
Anonymous
7/25/2025, 4:13:32 PM No.106019928
>>106019915
I assume you mean R1 knowledge and the answer is obviously not because qwen loves filtering the dataset.
Replies: >>106020131
Anonymous
7/25/2025, 4:16:16 PM No.106019951
>>106019919
I went a bit farther down the qwen rabbit hole and found the “what if” was my cue to stop the gen, remove the last reply, and improve my prompt to settle the ambiguity up front. I found it goes down legit rabbit holes and needs a tie breaker for good results.
inb4 do what I want, not what I say
Anonymous
7/25/2025, 4:41:04 PM No.106020131
>>106019928
235b (especially nu-235b) isn't filtered thoughever
Anonymous
7/25/2025, 4:44:36 PM No.106020161
https://github.com/ggml-org/llama.cpp/pull/14875
>Support intern-s1
links to https://huggingface.co/internlm/Intern-S1 (currently 404), looks like there's going to be yet another chinese model coming soon
Replies: >>106020202
Anonymous
7/25/2025, 4:51:31 PM No.106020202
hunyuan
hunyuan
md5: a2c3798825bd44e25dbe54649eed9a1f🔍
>>106020161
>Intern
kek
And this:
https://github.com/ggml-org/llama.cpp/pull/14878
Replies: >>106020255
Anonymous
7/25/2025, 4:58:29 PM No.106020255
>>106020202
isn't that a deprecated image model nobody uses?
Replies: >>106020311
Anonymous
7/25/2025, 5:03:25 PM No.106020302
>>106015554
>vLLM based
>does not support multimodality
These frog fucks never contribute model support themselves, can't be bothered to port their tokenizer, and now expect people to run 2 servers just to run their shit, which has been irrelevant since Large/Wizard. They would have to be stupid to accept this. It would open the door for other model makers to be lazy and do the same. Imagine if switching models always required switching secondary servers as well, which is some python shit anyway. At that point just use vLLM directly.
>ggerganov actually likes the idea
God damn it.
Replies: >>106020383
Anonymous
7/25/2025, 5:04:12 PM No.106020311
>>106020255
Internlm? They've benchmaxxed so much with their past models it makes qwen blush. If you're talking about hunyuan, they just released a big-ish moe a few weeks (?) ago. No idea about the image models.
Anonymous
7/25/2025, 5:04:42 PM No.106020316
>>106019453
>>106019547
Well what do you know. I'll do another run.
########## All Tasks ##########
task                                           LCB_generation  coding_completion
model
qwen3-coder-480b-a35b-instruct-iq1m                    73.077               80.0
qwen3-coder-480b-a35b-instruct-q2kxl                   74.359               72.0
qwen3-coder-480b-a35b-instruct-iq3xxs                  76.923               74.0
qwen3-coder-480b-a35b-instruct-q2kxl-temp-0-7          78.205               80.0
Anonymous
7/25/2025, 5:12:13 PM No.106020380
Beware, anon!

Openrouter.ai hosts shitty quants of deepseek-R1!

For example, R1 0528 Chutes

You'll get responses polluted with Chinese characters
Replies: >>106020399 >>106020404
Anonymous
7/25/2025, 5:12:17 PM No.106020383
mistral_dump03
mistral_dump03
md5: 8c818c2cd8ae6637287d2fc3d1e75427🔍
>>106020302
>ggerganov actually likes the idea
I think he's being strategically polite. It's the first time mistral has decided to chip something in and he doesn't want to scare them off.
I expected more pushback from ngxson, seeing how (understandably) protective he is of the server code. He even mentions picrel.
https://github.com/ggml-org/llama.cpp/pull/14862
Replies: >>106020432
Anonymous
7/25/2025, 5:13:42 PM No.106020399
>>106020380
That shouldn't affect anyone here. If it does, they're in the wrong thread. Or maybe you are.
Replies: >>106020437
Anonymous
7/25/2025, 5:14:23 PM No.106020404
>>106020380
local?
Anonymous
7/25/2025, 5:16:29 PM No.106020423
janny tongue my anus
janny tongue my anus
md5: fa0790ff311cf1ff8ff02b3e5176338e🔍
>>106013068
>>106012287
>>106011969
>>106011918
>>106011911 (OP)
vocaloidtranny posting porn in /ldg/: >>105715769
It was up for hours while anyone keking on troons or niggers gets deleted in seconds, talk about double standards and selective moderation: https://desuarchive.org/g/thread/104414999/#q104418525 https://desuarchive.org/g/thread/104414999/#q104418574
he makes ryona picture: >>105714003 of some random generic anime girl the different anon posted earlier: >>105704741 (could be the vocaloidtranny playing both sides)
here >>105884523 he tests bait poster bot for better shitflinging in threads
admits spamming /v/ with AI slop: https://desuarchive.org/g/thread/103462620/#103473545

Funny /r9k/ thread: https://desuarchive.org/r9k/thread/81611346/
The Makise Kurisu damage control screencap (from a day earlier) is fake btw, no matches to be found, see https://desuarchive.org/g/thread/105698912/#q105704210; the janny deleted the post quickly.

TLDR: vocaloid troon / janitor protects resident avatarfags and deletes everyone who outs him, making the general his little personal safespace with samefagging. He's prone to screeching "Go back to teh POL!" when someone posts something mildly political about language models or experiments around the topic.

As said in previous thread(s) >>105716637 I remind you that cudadev of llama.cpp (JohannesGaessler on github) has endorsed spamming. That's it.
He also endorsed hitting that feminine jart bussy a bit later on. QRD on Jart - The code stealing tranny: https://rentry.org/jarted

xis ai slop profiles
https://x.com/brittle_404
https://x.com/404_brittle
https://www.pixiv.net/en/users/97264270
https://civitai.com/user/inpaint/models
Replies: >>106021340 >>106021501
Anonymous
7/25/2025, 5:17:02 PM No.106020432
>>106020383
Quit samefagging.
Replies: >>106020454
Anonymous
7/25/2025, 5:17:19 PM No.106020437
>>106020399
I triggered your attention

You lost
I won as always
Anonymous
7/25/2025, 5:18:44 PM No.106020454
>>106020432
I didn't. But there's nothing I can say that would convince you otherwise.
Anonymous
7/25/2025, 5:19:34 PM No.106020461
huh, still testing, but it seems like the same vision models just see better in ollama than they do in llama.cpp, even when using an f32 mmproj. is there any mechanical reason why this could actually be the case, or is there something weird going on messing it up? where does ollama hide its own mmproj files to begin with?
Replies: >>106020544
Anonymous
7/25/2025, 5:19:40 PM No.106020463
Gsmntf8WcAAaAtH
Gsmntf8WcAAaAtH
md5: 7456424fe9dccfc3b5da445e8cb57cf6🔍
>>106018477
>>106018488
>>106018565
>>106018581
>>106018716
>not le heckin mesugakirino!
Replies: >>106020789
Anonymous
7/25/2025, 5:27:18 PM No.106020530
>>106012038
iirc, around june 2023 when I first got into llms, P40 were 100-150 usd. now, afaik, they're $300+. that's what >>106012026 means by >>106011969 being late.
Replies: >>106021117
Anonymous
7/25/2025, 5:27:29 PM No.106020534
2nd_best_girl
2nd_best_girl
md5: ed4d53e561a1ccf133f2cfc1eb3a31ae🔍
I didn't know that Kirino listens to Meshugggah.
Anonymous
7/25/2025, 5:28:10 PM No.106020544
>>106020461
>is there any mechanical reason why this could actually be the case or is there something weird going on messing it up?
If you find the mmproj ollama uses, you can give it a go with llama.cpp.
>where does ollama hide its own mmproj files to begin with?
I'd assume with the rest of the models, in its hidden dir in your home. I don't remember if it was ~/.ollama or ~/.local/ollama or something like that. There was a discussion some time ago about it, but i'm not sure.
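The blobs are content-addressed, so the filenames tell you nothing; the manifests do. A hedged sketch for digging the projector out (assumes the default ~/.ollama/models layout on linux and, if memory serves, a ".projector" media type for the mmproj layer):

import json, pathlib

root = pathlib.Path.home() / ".ollama" / "models"  # default location, assumption
for manifest in (root / "manifests").rglob("*"):
    if not manifest.is_file():
        continue
    try:
        layers = json.loads(manifest.read_text()).get("layers", [])
    except ValueError:
        continue
    for layer in layers:
        # vision projector (mmproj) layer; digest sha256:x maps to blob file sha256-x
        if layer.get("mediaType", "").endswith(".projector"):
            print(manifest.name, "->", root / "blobs" / layer["digest"].replace(":", "-"))

Then you can point llama.cpp's --mmproj at that blob directly and compare.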
Anonymous
7/25/2025, 5:44:28 PM No.106020715
17109
17109
md5: f3381e59b7be17f0eb0f96bcec81788e🔍
>106020534
>I didn't know that Kirino listens to Meshugggah.
Replies: >>106020990
Anonymous
7/25/2025, 5:52:58 PM No.106020789
>>106020463
drawings looks nothing like the real thing though.

heck, take monster girls, on drawings, hot, irl, it'd be utterly disgusting.

same goes for incest porn, it's hot as a fetish but i'd never want to fuck my actual sisters.
Replies: >>106020964 >>106021061 >>106021394
Anonymous
7/25/2025, 6:15:09 PM No.106020964
>>106020789
voice of reason, as always
Anonymous
7/25/2025, 6:16:15 PM No.106020972
>SOTA opensource model is a 235B A22B model
Local is BACK!
Replies: >>106021050 >>106021055
Anonymous
7/25/2025, 6:18:01 PM No.106020990
>>106020715
mods deleted this because it's like looking at a reflection to them
Anonymous
7/25/2025, 6:23:58 PM No.106021050
>>106020972
Did someone else train a model on qwen's arch?
Anonymous
7/25/2025, 6:24:20 PM No.106021055
>>106020972
I just finished quanting it to q8 and am trying a thinking exercise with it. Output is pretty damn good so far. It’ll be a while before we really know if it’s R1 tier.
Anonymous
7/25/2025, 6:24:49 PM No.106021061
>>106020789
So if a guy jerked off to hairy bara hentai, would you call him a faggot? Or is he not gay because they look nothing like real men?
Replies: >>106021133 >>106021164
Anonymous
7/25/2025, 6:29:14 PM No.106021117
tmp5tmtf9a7
tmp5tmtf9a7
md5: c570fe1b356756b2f2d4fe523aa2c1d8🔍
>>106020530
There was a chance to get 32GB Radeon cards for cheap, and I even posted ebay links. Now it all pointless, though as every model is a huge ass MoE these days
Anonymous
7/25/2025, 6:30:46 PM No.106021133
>>106021061
I wouldn't because I am a straight man and I jerk off to traps, femboys, and futas with long foreskins. To each his own.
Replies: >>106021151
Anonymous
7/25/2025, 6:32:26 PM No.106021151
>>106021133
how do we tell him?
Anonymous
7/25/2025, 6:33:08 PM No.106021161
file
file
md5: 6406f5370a55bb7c28390b00dedd0bee🔍
I can run everything at 15 t/s and this whole machine costs less than 1x 80GB H100
Replies: >>106021183 >>106021213 >>106021309
Anonymous
7/25/2025, 6:33:21 PM No.106021164
55ebcf38e1eeafaf189843a0b8d8672d
55ebcf38e1eeafaf189843a0b8d8672d
md5: 70a8a037bec14d2c72aad1e5787ca2a1🔍
>>106021061
NTA, but you can fap to anything, unless you've had sexual intercourse with a man or posted an opinion I disagree with, you're straight in my book
Anonymous
7/25/2025, 6:35:24 PM No.106021183
>>106021161
What the fuck is the second gpu?
Replies: >>106021194
Anonymous
7/25/2025, 6:36:20 PM No.106021194
>>106021183
built-in graphics of the MZ73-LM0 motherboard
Replies: >>106021218 >>106021228
Anonymous
7/25/2025, 6:38:12 PM No.106021213
>>106021161
i've never heard of an aspeed gpu. you should get another Blackwell to make your system more robust and future proof, and another amd card for your monitor if it's not headless
Replies: >>106021351
Anonymous
7/25/2025, 6:38:24 PM No.106021218
>>106021194
How much does NUMA fuck up the performance?
Replies: >>106021351
Anonymous
7/25/2025, 6:39:17 PM No.106021228
>>106021194
>MZ73-LM0
Holy fuck, it has a fucking COM port?? SOVL
Replies: >>106021351
Anonymous
7/25/2025, 6:39:24 PM No.106021231
>>106018824
open GIMP, run auto white balance
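or, if you'd rather batch a folder of them than click through GIMP, gray-world white balance is a few lines of numpy. note GIMP's auto white balance actually stretches each channel's histogram; this is just the lazy cousin (filename is a placeholder):

import numpy as np
from PIL import Image

img = np.asarray(Image.open("piss_filtered.png").convert("RGB")).astype(np.float32)
means = img.reshape(-1, 3).mean(axis=0)  # per-channel averages
img *= means.mean() / means              # scale each channel toward neutral gray
Image.fromarray(np.clip(img, 0, 255).astype(np.uint8)).save("fixed.png")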
Anonymous
7/25/2025, 6:39:50 PM No.106021235
>>106018461
Just stop targeting women with hate speech chuds
Anonymous
7/25/2025, 6:44:06 PM No.106021271
hvi3tvmjo1ff1
hvi3tvmjo1ff1
md5: f622af937a4c656cad3e18662ba5e1fd🔍
we are *so* back
Anonymous
7/25/2025, 6:46:46 PM No.106021298
Yeah but can the model give me an oiled footjob?
Replies: >>106021389
Anonymous
7/25/2025, 6:47:29 PM No.106021309
>>106021161
cpumaxx won
Anonymous
7/25/2025, 6:47:46 PM No.106021313
Step 3
Step 3
md5: e840dd0e8ed49d3c157294c3f5405289🔍
wyd stepmodel??
Replies: >>106021402 >>106021479
Anonymous
7/25/2025, 6:49:33 PM No.106021340
>>106020423
Love your work anon. Death to local redditors.
Anonymous
7/25/2025, 6:50:10 PM No.106021351
file
file
md5: 72f40b13a68caaa070534c8a0a0b5ac9🔍
>>106021213
I'll have to get a separate PSU on a separate breaker for that; I'm already pulling 1.3 kW from the wall, and 2 kW continuous will either pop the breakers or set my house on fire
the 6000 has 4 dp ports, which covers all my monitor needs so far
>>106021218
ktransformers duplicates weights across numa nodes, doubling memory consumption for a ~1.3-1.5x increase in inference speed - that's why I have 2 TB even though all models are 1 TB max
llama.cpp can't do that, so K2 Q8 can't run faster than 6 t/s
>>106021228
hell yeah, and a whole BMC, it's a proper server motherboard
Replies: >>106021374 >>106021498
Anonymous
7/25/2025, 6:51:43 PM No.106021374
>>106021351
>Q8
Use Q2 like everyone else, it's indistinguishable.
Replies: >>106021411
Anonymous
7/25/2025, 6:52:52 PM No.106021389
>>106021298
No. Only AIDS.
Anonymous
7/25/2025, 6:53:14 PM No.106021394
1740868331450
1740868331450
md5: 8bbfe959db12bc60cc373cf06b6debc2🔍
>>106020789
>pedophilia is the same as liking monstergirls or incest
Anonymous
7/25/2025, 6:54:15 PM No.106021402
1569339478112
1569339478112
md5: c5d5456a5fa81fa3364db8c539145322🔍
>>106021313
>giant bloated MoE again
I'm so tired bros...
Replies: >>106021732
Anonymous
7/25/2025, 6:55:07 PM No.106021409
file
file
md5: bf389770db0bf8d278ce9cdbb923892c🔍
>look inside
>4t/s
Replies: >>106021422
Anonymous
7/25/2025, 6:55:15 PM No.106021411
>>106021374
for chatting Q2 is good, but for programming even Q6 starts to feel noticeably dumber and fails diffs sometimes because it forgets what code it's working on
Replies: >>106021442 >>106021446 >>106021492
Anonymous
7/25/2025, 6:56:29 PM No.106021422
>>106021409
>Big AI
it this the new trending buzzword?
Anonymous
7/25/2025, 6:57:44 PM No.106021442
>>106021411
Running nolima at different quants might show something useful
Anonymous
7/25/2025, 6:58:08 PM No.106021446
>>106021411
this is bs and you know it
Replies: >>106021699
Anonymous
7/25/2025, 7:00:45 PM No.106021479
>>106021313
Gotta love these benchmark results and statistics. It's quite comical.
Anonymous
7/25/2025, 7:01:54 PM No.106021492
>>106021411
With Mistral? Yes. But those huge MoE models do not degrade as much, I can run Q4 but prefer Q2 for speed
Replies: >>106021584 >>106021699
Anonymous
7/25/2025, 7:02:11 PM No.106021498
>>106021351
no need for the full 600w on the gpu, you're using it for its huge vram and memory bandwidth. you can get two and power limit them both to 300w each and it'll be an upgrade
Replies: >>106021699
Anonymous
7/25/2025, 7:02:16 PM No.106021501
>>106020423
Damn, didn't know about those profiles, thanks for the heads up
Replies: >>106021518
Anonymous
7/25/2025, 7:03:51 PM No.106021518
>>106021501
Okay tranny?
Replies: >>106021532
Anonymous
7/25/2025, 7:04:55 PM No.106021532
>>106021518
he is samefagging
Anonymous
7/25/2025, 7:06:16 PM No.106021550
m4 max mbpfags, slap 1mm thermal pads on your heatsinks. I got a 20C drop in temps pp'ing 12k tokens.
Anonymous
7/25/2025, 7:07:54 PM No.106021584
>>106021492
That's cope.
Anonymous
7/25/2025, 7:12:41 PM No.106021653
1727333987111268
1727333987111268
md5: 5857bb5450ae500fd193e8fcd042bf69🔍
What do I prompt to get the model to format math formulas like the top one instead of the bottom one?
Anonymous
7/25/2025, 7:16:27 PM No.106021699
file
file
md5: d9c1b38667ee428dbadf7d29192cedf7🔍
>>106021446
maybe roo code doesn't work right, idk
>>106021492
Qwen3-Coder-480B-A35B-Instruct
>>106021498
yeah, you're right
I didn't even get ktransformers to partially load experts on gpu yet, so it sits there underutilized
my favorite small model QwQ-32B-abliterated fits on it entirely and pulls the whole 600w when running though
Anonymous
7/25/2025, 7:18:11 PM No.106021722
i-cant-assist-with-that-request
i-cant-assist-with-that-request
md5: d66c8a1c18dcc6badb29cfe543a14076🔍
Replies: >>106021739
Anonymous
7/25/2025, 7:19:15 PM No.106021732
>>106021402
Why, would you rather have a giant dense model you don’t have enough vram for?
Replies: >>106021763
Anonymous
7/25/2025, 7:19:42 PM No.106021739
>>106021722
they must have purely trained on gemini for the update
Replies: >>106021752
Anonymous
7/25/2025, 7:20:48 PM No.106021752
>>106021739
"I can't assist with that" is OAI shit
Anonymous
7/25/2025, 7:21:30 PM No.106021763
>>106021732
nta, but I'd be happy with a small model that isn't safety slopped garbage.
Replies: >>106021798 >>106022572
Anonymous
7/25/2025, 7:24:56 PM No.106021798
>>106021763
Llama 3.1
Anonymous
7/25/2025, 7:27:59 PM No.106021835
cfgn
cfgn
md5: 2b99206e42ba5fbdadcd970e23ade66a🔍
>>106018793
>>106018806
>>106018812
11 beavers. all pictures, both beaver and kangaroo, have a kangaroo next to them
Replies: >>106021861
Anonymous
7/25/2025, 7:29:52 PM No.106021861
>>106021835
lol I shouldn't have drawn that up, I wrongly thought the AI had concluded 10
Replies: >>106021961
Anonymous
7/25/2025, 7:38:08 PM No.106021961
>>106021861
Are people using mobile phones? Every time I post long images they can't see the bottom of the image
Anonymous
7/25/2025, 7:46:03 PM No.106022059
Post Nala for new qwen otherwise anyone praising it is a shill. Original 235B not only failed at feral anatomy but also gender specific biology in general.
Anonymous
7/25/2025, 8:03:57 PM No.106022342
sama altman's model release is imminent
Replies: >>106022407
Anonymous
7/25/2025, 8:08:00 PM No.106022407
>>106022342
>cock, pussy and fuck removed from tokenizer
Replies: >>106022493 >>106022495 >>106022503
Anonymous
7/25/2025, 8:12:43 PM No.106022493
>>106022407
Albert Einstein did this
Anonymous
7/25/2025, 8:12:57 PM No.106022495
>>106022407
>tokens: co ck, pus sy, f uck
Replies: >>106022565
Anonymous
7/25/2025, 8:13:19 PM No.106022503
>>106022407
Now that you say it, I'm surprised none of them have thought to do this yet.
Anonymous
7/25/2025, 8:16:59 PM No.106022565
>>106022495
What if they get trained as single tokens, and then removed entirely from the embedding and output matrices in post-training?
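You can already get the poor man's version at inference time without touching the weights; transformers ships a logits processor for exactly this. A sketch (gpt2 is just a stand-in model):

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# token id sequences to forbid; the leading space matters for BPE vocabularies
banned = [tok.encode(w, add_special_tokens=False) for w in [" cock", " pussy", " fuck"]]
out = model.generate(
    **tok("She leaned in and whispered", return_tensors="pt"),
    max_new_tokens=30,
    bad_words_ids=[ids for ids in banned if ids],  # built-in generate() kwarg
)
print(tok.decode(out[0], skip_special_tokens=True))

Actually deleting rows from the embedding/output matrices would be the hard version, since every downstream token id shifts; easier to leave the rows in place and just mask the logits.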
Anonymous
7/25/2025, 8:17:16 PM No.106022572
>>106021763
I think at this point, small uncensored models are going to have to be a community project. Ain’t no big org got time for that
Replies: >>106022627
Anonymous
7/25/2025, 8:21:00 PM No.106022627
>>106022572
Distributed training when?
Replies: >>106022671 >>106022739
Anonymous
7/25/2025, 8:23:29 PM No.106022671
>>106022627
Never, because gradient synchronization is the most important step
Latency on the gradient all-reduce = slower learning
Anonymous
7/25/2025, 8:26:27 PM No.106022739
>>106022627
Not on consumer gpus, but it happened twice already with INTELLECT and Nous, as far as i know. Look the models up. There's probably a few others.
Anonymous
7/25/2025, 8:27:32 PM No.106022754
>>106022725
>>106022725
>>106022725