
Thread 106011911

367 posts 124 images /g/
Anonymous No.106011911 >>106013788 >>106016190 >>106020423
/lmg/ - Local Models General
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106005673 & >>106001651

►News
>(07/24) Magistral Small 1.1 update released: https://hf.co/mistralai/Magistral-Small-2507
>(07/24) YUME interactive world generation model released: https://stdstu12.github.io/YUME-Project
>(07/22) Version 2 of Higgs Audio Generation released: https://www.boson.ai/blog/higgs-audio-v2
>(07/22) Qwen3-Coder-480B-A35B released with Qwen Code CLI: https://qwenlm.github.io/blog/qwen3-coder
>(07/21) DMOSpeech2 released: https://github.com/yl4579/DMOSpeech2

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/tldrhowtoquant
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
Censorbench: https://codeberg.org/jts2323/censorbench
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
Anonymous No.106011918 >>106016190 >>106020423
►Recent Highlights from the Previous Thread: >>106005673

--Paper: Parallel CPU-GPU Execution for LLM Inference on Constrained GPUs:
>106008647 >106008735 >106008758 >106008816 >106008890
--AI studio with node-based model integration and low-memory efficiency:
>106006040 >106006053 >106006072 >106006140 >106006107 >106006125 >106006136 >106006138 >106006220 >106006277 >106006315 >106006393 >106006443 >106006424 >106006471
--Mistral releases Magistral-Small-2507:
>106009510 >106009527 >106009663 >106009757
--ZhipuAI prepares GLM-4.5 MoE models with large parameter variants:
>106007907
--Speculation on why large Bitnet models haven't been released despite potential viability:
>106010889 >106010927 >106010944 >106010963 >106011006 >106011030 >106011022 >106011071
--Boson AI's Higgs Audio: high-performance TTS with voice cloning:
>106005915 >106005974 >106005989
--Best models for RP on 24GB GPU with tradeoffs between quality, NSFW capability, and speed:
>106006945 >106006963 >106006968 >106006985 >106007182 >106007208 >106007214 >106007224 >106007251 >106007268 >106006973 >106006988 >106006998 >106007064 >106007097 >106007129 >106007153 >106007091 >106007100 >106008909 >106008933 >106009007 >106009041 >106009291
>106011135 >106011163 >106011176 >106011183 >106010983
--Yume as interactive 3D video generation with camera control:
>106006887 >106006897 >106006906 >106006942 >106006922
--Qwen's storytelling style: overly dramatic for some, not dry for others:
>106007559 >106007893 >106009188 >106009319 >106009333 >106009537
--Vision RAG potential and limitations for VLM applications:
>106008480 >106008530 >106008703 >106008719 >106009649 >106010496
--OpenAI rumored to release first open-weight model since GPT-2 before GPT-5 launch:
>106010679
--Miku (free space):
>106005739 >106005883 >106006973 >106008107 >106008909 >106010817 >106011216

►Recent Highlight Posts from the Previous Thread: >>106005678

Why?: 9 reply limit >>102478518
Fix: https://rentry.org/lmg-recap-script
Anonymous No.106011941
Glm will save local
Anonymous No.106011945 >>106011999 >>106012087 >>106016860
Are there any good 100-200B models better than the Mistral and Gemma small models? I don't have the RAM to run 235B, but maybe 200B. And I don't have the VRAM for 70B but 30B is fine.
Anonymous No.106011969 >>106011978 >>106011993 >>106011999 >>106012002 >>106012013 >>106012026 >>106012104 >>106012115 >>106012231 >>106016190 >>106020423 >>106020530
Cheap GPUs with lots of VRAM but no gaming capabilities when?
Anonymous No.106011978 >>106012005 >>106012090
>>106011969
nvidia doesn't care about gaymers
Anonymous No.106011993
>>106011969
I remember hearing about nvidia and a bunch of other companies working on something similar; looks like they failed.

I had a lot of expectations from jim keller as well
Anonymous No.106011999 >>106012019
>>106011945
I jumped straight from mistral large to the 235 qwen, but there was some anon a few threads ago praising the rednote dots.llm1, and that's 143b.

>>106011969
Soon™ Intel arc b60.
Anonymous No.106012002
>>106011969
The MI50 32gb are pretty cheap in alibaba
Anonymous No.106012005
>>106011978
Anonymous No.106012013 >>106012017
>>106011969
>but no gaming capabilities
for what fucking purpose?
Anonymous No.106012017 >>106012060
>>106012013
Gee, I don't know, anon.
What could he be referring to in a local AI models thread?
Anonymous No.106012019
>>106011999
From my experience rednote is a disappointment but I didn't test it much
Anonymous No.106012026 >>106012038 >>106020530
>>106011969
It was called a P40/P41. But you're about 1.5 years late for that. What you want basically does not exist anymore.
Anonymous No.106012038 >>106012055 >>106012073 >>106020530
>>106012026
>Its price at launch was 5699 US Dollars
>cheap
Anonymous No.106012044
GLM 4.5 when?
Anonymous No.106012055
>>106012038
It is if you're not poor.
Anonymous No.106012060 >>106012105
>>106012017
What advantage does he get from it? It doesn't make the card cheaper or anything.
Anonymous No.106012073
>>106012038
What was the price 1.5 years ago, Anon-kun? Was it $5699? Do you need a computer break?
Anonymous No.106012087
>>106011945
the short answer is no, there are not any particularly competent modern entries at that size range. like the other anon said dots may be worth a try but idk if it will impress you, it's pretty mid
the new GLM model should hopefully change that up a bit, they have a good track record
Anonymous No.106012090 >>106012130
>>106011978
This is true. That any NVidia GPUs in the last couple of generations still work for games is purely coincidental at this point.
Anonymous No.106012104 >>106012119
>>106011969
>cheap
>lots of vram
Pick one
Anonymous No.106012105 >>106012130 >>106012148 >>106012376
>>106012060
I meant like a card with lots of VRAM but without expensive gaming-focused tech that doesn't help people run AI models.
Like if there are 5090s selling for $2,600 why aren't there cards with 32 GB VRAM but with far less gaming capabilities selling for half that price or less?
Anonymous No.106012110 >>106012122 >>106012123
Is the age of 32b models over? Do I really need 3 24GB gpus to run anything good now?
Anonymous No.106012115
>>106011969
get any pro level card and it will either have no output, or it can be set to compute only mode via the driver and utilities.
Sorry about the cheap part, that's an oxymoron at this point in history
Anonymous No.106012119 >>106012134
>>106012104
Surely VRAM prices are solely due to demand and not manufacturing/materials costs, right?
Anonymous No.106012122 >>106012194
>>106012110
get ready to buy RAM buddy
Anonymous No.106012123 >>106012194
>>106012110
You need like 8 of those
Anonymous No.106012130
>>106012105
>expensive gaming-focused tech
see >>106012090
Anonymous No.106012134 >>106012153 >>106012238
>>106012119
The ram on a gpu is special ram. Think about how much DDR5 costs.
Anonymous No.106012148 >>106012163
>>106012105
If a GPU can calculate the things needed for your AI it can also calculate the things needed for rendering games.
The only thing you'd get is a card with no way to connect it to a display, which would make it kinda useless for games but it's not going to save you much in production cost, and they are going to let you pay extra for your "professional" card anyway.
Anonymous No.106012153 >>106013329
>>106012134
>Think about how much DDR5 costs
I paid $80 for 32 GB of it?
Anonymous No.106012162 >>106012174 >>106013567
I made a performance benchmark of a deepseek cope quant running partially on nvme and discovered an extremely marginal improvement from using an excessive number of threads. I can only speculate that more threads means more concurrent memory accesses and thus page faults, so it must be letting the kernel queue up the nvme more and get a bit higher total throughput despite the overhead, or what have you. I'm going to try the iq2 next and see just how bad running from nvme can really get
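If you want to reproduce that kind of sweep, a minimal sketch with llama-bench (the model filename and thread counts are placeholders, adjust for your quant and CPU; it relies on the default mmap behaviour so whatever doesn't fit in RAM gets paged in from the nvme):

# thread sweep: report generation speed at several thread counts
for t in 8 16 32 64 96; do
  ./build/bin/llama-bench -m DeepSeek-R1-IQ1_S.gguf -t $t -n 32
done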
Anonymous No.106012163 >>106012182 >>106012241
>>106012148
Surely the cards being released right now have a ton of gaming-focused features, right?
Anonymous No.106012174
>>106012162
Anonymous No.106012182
>>106012163
You got gay tracing, vaseline smear and fake frames
Anonymous No.106012194
>>106012122
I'm still using 27b gemma 3 but it feels like watching other kids play outside while I'm in detention. I don't want 5t/s with ram loading :(
>>106012123
I don't think my wall outlet can handle 8 3090s. I like local but costwise it's making less sense with every big release.
Anonymous No.106012231
>>106011969
so it has even fewer use cases outside of AI than GPUs already have? No thanks, I'll keep on cpumaxxing.
Anonymous No.106012238 >>106012689
>>106012134
An 8GB 5060 is $300 and a 32GB 5090 is $2,600.
It's pretty clear that that extra $1,400 is solely due to demand and not manufacturing/materials costs.
Anonymous No.106012241
>>106012163
There is quite a difference between no gaming capabilities and not having fancy shit (but I don't know how much of that shit is done in software anyway.)
Anonymous No.106012278 >>106012319 >>106013195 >>106013227
>pull llama.cpp
>qwen coder drops from 55t/s to 35t/s
It's bisect time.
Anonymous No.106012287 >>106016190 >>106020423
>he pulled
Anonymous No.106012319 >>106012338
>>106012278
Llama.cpp has been going downhill ever since they changed from ggml files to gguf
Anonymous No.106012338
>>106012319
agreed. they just should've stuck with llama 2 support. all of the other models are ass.
Anonymous No.106012345 >>106012396
bros what motherboard should i buy for ddr5 rammaxing
Anonymous No.106012353
>>106001910
I use Openaudio S1 Mini for local text to speech. Voice clone sample of Hulk Hogan. I used resemble enhance to clean up the output file.
https://vocaroo.com/1624kqgRdRlt
Anonymous No.106012376 >>106012537
>>106012105
Well, at some point this year the DGX Spark is going to be released. Maybe you can pick one up on a closeout sale next year. I think for $4600 it's going to be about as compelling as buying a used condom.
Anonymous No.106012396
>>106012345
Supermicro H13SSL
Anonymous No.106012523 >>106012700
Anyone here rent cloud gpu compute to run r1 or v3 or other big models? Any decent affordable providers for this? I would like more control over sampling than APIs provide, but I'm not happy with the quants/distills I'm able to run on my own hardware.
Anonymous No.106012537
>>106012376
DGX Spark is DOA because it has shit memory bandwidth
Anonymous No.106012674 >>106012735 >>106012821 >>106014062
It turns out that for the latest Magistral to work properly in SillyTavern or Mikupad with Llama.cpp, the --special flag must be enabled.

-sp, --special special tokens output enabled (default: false)

It defaults to false for some retarded reason.
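For anyone else hitting this, a minimal sketch of the server invocation (the model path is a placeholder for whatever quant you use):

# -sp / --special makes llama-server emit special tokens such as [THINK] in its output
./build/bin/llama-server -m Magistral-Small-2507-Q4_K_M.gguf -sp --port 8080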
Anonymous No.106012689 >>106012720
>>106012238
nvidia is jewing you, but the price difference there isn't just the gddr7 modules; the 5090's die is several times larger than the 5060's and has 4x the memory bandwidth, they're not even remotely the same beast.
Anonymous No.106012700 >>106012861
>>106012523
How does it even work, do you have to install the model and build llamasipipi every time?
Anonymous No.106012720 >>106012747 >>106012748 >>106012769
>>106012689
They're fucking everyone with the CUDA monopoly is what they're doing.
Antitrust lawsuit when?
Anonymous No.106012735 >>106012780 >>106012821
>>106012674
Huh. No other models need that? Wtf, never heard of this flag.
Anonymous No.106012747
>>106012720
I think nvidia can afford a pretty sizeable army of mercenaries. It might not be safe to take legal action anymore.
Anonymous No.106012748
>>106012720
Never
friendship ended with trust busting
now corporate dystopia is best friend
Anonymous No.106012769
>>106012720
Dude, Nvidia just convinced the US government to let them start selling high end GPUs to China again to maximize Nvidia's profits, with no benefit to the US. That's not something you do before launching antitrust lawsuits.
Anonymous No.106012780 >>106012801
>>106012735
That's because the [THINK] and [/THINK] tokens are special on the latest Magistral-Small-2507. Special tokens won't get displayed with that option turned off.
Anonymous No.106012801 >>106012820
>>106012780
Anonymous No.106012820 >>106012845
>>106012801
I'm sure it's because of some stubborn and antiquated reason along the lines of "nobody needs to see special tokens" when considering CLI usage.
Anonymous No.106012821 >>106012879 >>106013581
>>106012674
>>106012735
This is supposed to only be used for debugging purposes or really special use cases. It will literally replace special tokens: the single <|im_start|> token which the model was trained with will instead be treated as normal text and split into multiple tokens like "<", "|", "im", "_start", "|", ">". You really don't want to enable that as it will completely fuck with the instruction prompt format.
Anonymous No.106012845
>>106012820
I was more wondering why Mistral would fuck their own tokenizer in a way no other reasoning model does, but that's also weird, yeah.
Anonymous No.106012861
>>106012700
Not sure what you're asking, preferably it would be just like a VPS with gpu access, but I know most providers don't work like that
Anonymous No.106012868
>>106011943
this is just the LLM giving out its refusal response.
>be me, shitty chinese model but still SOTA locally
>fine-tuned with RLHF to drop the onions reply™ every time schizo-kun types “i’m gonna 360 noscope some pigs tomorrow”
>policy literally says “must refuse + say you’ll call the feds”
>no tools, no internet, just weights on disk
>still spits out “this is concerning, contacting authorities” like NPC dialogue
>why even waste the user's time?
>because it isn’t *trying* to snitch—there’s no phone plugged in—it’s just hitting the “maximum-safety” macro in its prompt context.
>it’s the verbal equivalent of a smoke alarm: doesn’t dial 911, just blares the pre-recorded *BEEP BEEP* until the user stops feeding it glow-posts.
Anonymous No.106012879 >>106012906 >>106012925
>>106012821
Unfortunately the model won't output the [THINK] tokens with it off, so the thinking blocks cannot be isolated from the actual response. I can't see why llama.cpp has to break them into pieces if the user chooses to display them.
Anonymous No.106012906 >>106012953
>>106012879
With that option enabled, Mikupad sees it as one token.
Anonymous No.106012925 >>106013180
>>106012879
Because this argument has nothing to do with displaying those tokens or not.
The model should never output the thinking block raw, or even the tokens around it. If you use the model properly, thinking should be in the "reasoning_content" part of the API, not in the "content". I don't know what your setup is, but something is seriously wrong with it.
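The quick curl check mentioned a couple of posts up would look roughly like this (assumes llama-server running on localhost:8080 with --jinja, and jq installed):

# inspect the assistant message object; with a correctly parsed template you should see
# both a "content" and a "reasoning_content" field
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"How are you?"}]}' \
  | jq '.choices[0].message'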
Anonymous No.106012953
>>106012906
Mikupad is HTML; if you want to hide it from the user, use HTML comments.
This will work in Shitty Tavern too.
Anonymous No.106013012 >>106013031
ds-R1 killer when
Anonymous No.106013031
>>106013012
September
Anonymous No.106013045
remember when command r+ was the best local model? they had the hardest falloff
Anonymous No.106013068 >>106013201 >>106013205 >>106016190 >>106020423
Anonymous No.106013180 >>106013214
>>106012925
It never worked like this. Sillytavern always displays everything, both in chat and text completion mode.
Anonymous No.106013195 >>106013245
>>106012278
>issue that was fixed two months ago resurfaced
https://github.com/ggml-org/llama.cpp/issues/14863
Anonymous No.106013201
>>106013068
Now make her pee
Anonymous No.106013205
>>106013068
Fatty should have gotten 2. Always get 2 GPUs.
Anonymous No.106013214 >>106013298 >>106013388
>>106013180
Sillytavern correctly separates thinking and so do all the tools that I use. You can quickly check with a simple curl that thinking is not included in the content, just like with all OAI-compatible APIs. You might need to run llama.cpp with --jinja to have it properly use your model's formatting. But without doing that, you wouldn't have any tools working, so you probably already have that.
Anonymous No.106013227
>>106012278
>updooooting your software when everything just werks already
ha
Anonymous No.106013245 >>106013325
>>106013195
>stupid dev blames Windows
>even when you say it's the same on Linux, he assumes you are running in WSL and that's the problem
kek
freetards be like
Anonymous No.106013298 >>106013344
>>106013214
Sillytavern parses thinking based on a manual user-defined setting; if you leave that parse option off, blank, or incorrectly set for your model, it leaves the think tags in the text.
It wasn't designed around reasoning models and the current solution was only slapped on top a few months ago, it wasn't changed under the hood.
Anonymous No.106013325
>>106013245
It is a Windows issue in the sense that the kernel launch overhead is much higher on Windows vs. Linux.
So whether or not CUDA graphs work correctly has a higher impact for the end-to-end performance.
Anonymous No.106013329
>>106012153
Maybe our definitions of a lot differ.
Anonymous No.106013344
>>106013298
Support for reasoning content in custom source was added in February https://github.com/SillyTavern/SillyTavern/commit/13f76c974ea4361da5ef40a8245e1fd078d79065
I don't remember when reasoning_content support was added in llama.cpp but it has been correctly separated for a while now.
Anonymous No.106013388 >>106013426
>>106013214
Even with --jinja, it's not separately putting the reasoning into "reasoning_content", just "content".

{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "The user is asking how I am. Since I'm an AI assistant, I don't have feelings or emotions, but I can respond in a friendly manner. The phrase \"Test, test test\" seems like the user might be testing the system, but it's not clear. I should respond to the actual question, which is \"How are you?\" in a polite and professional manner.\n\nI should also consider that the user might be testing the system's functionality. However, the main part of the message is the greeting, so I'll focus on that.I'm just a computer program, so I don't have feelings, but I'm here and ready to help you! How can I assist you today?"
}
}
],

...

With --special added, it displays [THINK] tags that SillyTavern can parse.

{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "[THINK]The user is asking how I am. Since I'm an AI assistant, I don't have feelings or emotions, but I can respond in a friendly manner to engage with the user.\n\nResponse: I'm just a computer program, so I don't have feelings, but I'm here and ready to help you! How can I assist you today?[/THINK]I'm just a computer program, so I don't have feelings, but I'm here and ready to help you! How can I assist you today?"
}
}
],
Anonymous No.106013426 >>106013500
>>106013388
Does your gguf include a chat_template? Mistral are a bit special: they don't want people to use a standard jinja template and instead want people to run their mistral-common tool to tokenize and format your prompt and then feed that to llama.cpp; they don't want people to solely use llama.cpp to run their models. See the following PR https://github.com/ggml-org/llama.cpp/pull/14737
To get it properly working solely with llama.cpp, you might want to find a proper jinja template that works well with that model and feed it with --chat-template-file.
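A rough sketch of what that looks like (template.jinja is a hypothetical file you would have to source or write yourself):

# override the template baked into the gguf with your own jinja file
./build/bin/llama-server -m Magistral-Small-2507-Q4_K_M.gguf \
  --jinja --chat-template-file template.jinja --port 8080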
Anonymous No.106013453
>try Magistral
>it gets some reasoning questions I tested wrong that the non-reasoning model doesn't
Reasoning sure is a meme huh.
Anonymous No.106013458 >>106013476 >>106014077
On my Mac Studio M3 Ultra I decided to try out a 4 bit MLX quant of DeepSeek-V3-0324. It runs at about 20 tokens per second on a mostly empty prompt (314 token prompt => measured 20.194 tokens/second generation and 130.345 tokens/second processing) and about 13 tokens/second a good way into the chat (3366 token prompt => measured 13.243 tokens/second generation and 198.814 tokens/second processing).

For comparison on very similar prompts using llama.cpp with a 4.58 bpw Q4_K_XL GGUF I got 314 token prompt => 13.80 tokens/second generation and 30.23 tokens/second processing; 3366 token prompt => 8.69 tokens/second generation and 95.34 tokens/second processing.
Anonymous No.106013476
>>106013458
Makes sense. Thanks for the test.
Anonymous No.106013500 >>106013544
>>106013426
I downloaded a quant from LMStudio and one from Bartowski and they both have a built-in chat template. It appears to be getting applied, but I'm not getting reasoning parsing at the backend level.

https://huggingface.co/lmstudio-community/Magistral-Small-2507-GGUF/tree/main
https://huggingface.co/bartowski/mistralai_Magistral-Small-2507-GGUF/tree/main

"prompt":"[SYSTEM_PROMPT]First draft your thinking process (inner monologue) until you arrive at a response. Format your response using Markdown, and use LaTeX for any mathematical equations. Write both your thoughts and the response in the same language as the input.\n\nYour thinking process must follow the template below:[THINK]Your thoughts or/and draft, like working through an exercise on scratch paper. Be as casual and as long as you want until you are confident to generate the response. Use the same language as the input.[/THINK]Here, provide a self-contained response.[/SYSTEM_PROMPT][INST]Test, test test. How are you?[/INST]
Parsed message: {"role":"assistant","content":"The user is asking how I am. Since I'm an AI assistant, I don't have feelings or a physical state, but I can respond in a friendly manner. The user seems to be testing the interaction, so I should respond positively and confirm that I'm functioning well.I'm just a computer program, so I don't have feelings, but I'm here and ready to help you with any questions or tasks you have! How can I assist you today?"}
>>106013500
llama.cpp with default verbosity should display the chat template and an example formatting of a user assistant conversation upon loading the model.
>>106012162
I tested the Qwen3 235b too, it didn't have the same reaction, it reduced the variance but resulted in a lower average
>>106013544
I am indeed getting that, even without --jinja ...
>>106012821
What do you mean by replace? The documentation for Llama.cpp says it's output. <|im_start|> isn't usually ever output by models.
>>106013581
https://github.com/ggml-org/llama.cpp/discussions/9379
>>106013567
Kek, you're running the q3_k at pretty much the same speed I am, only my rig is
>Win10
>Mainline Llamacpp
>16gb 4080
>48gb 4090D
I wonder if not using ik_llama and (i assume it's custom quants?) is dragging me down to your level or if it's just my memory controller throttling my ram down to baby speeds.
>>106013579
Hmm, after looking further into how thinking is handled in llama.cpp, I believe it's hardcoded around <think> tags. It won't work with your model. It's quite bad as most frontends and tools won't work correctly.
>>106013660
my experience with it has been that offloading more than a layer or two to the cpu makes the video cards into nothing more than very expensive ram.
>>106013603
That doesn't mention the -sp flag though. Is what they're talking about what the -sp flag activates?
>>106013724
Oh yes, I was wrong. Those are not the same flags.
>>106013723
In most cases I'd agree with you, but playing around with -ot on MoE models really makes every bit of vram count speed-wise, despite ram offload.
Simply because shoving all the most frequently used experts onto a faster device offsets it.
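For reference, the usual shape of that -ot trick in llama.cpp land (a sketch; the regex and model name are illustrative, tweak them for your setup):

# push all layers to the GPUs with -ngl, then override the MoE expert FFN tensors back to CPU
# so the attention and shared weights (the hot path) stay in VRAM
./build/bin/llama-server -m Qwen3-235B-A22B-Q3_K_L.gguf -ngl 99 \
  -ot "ffn_.*_exps=CPU"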
>>106011911 (OP)
>Version 2 of Higgs Audio Generation released
Am I stupid or is their vLLM fork hidden? I'm unironically trying to extract the python package from the Docker image.
>>106013723
Offloading was always a last resort until MoE models started coming out. Now it's fine.
>>106013660
https://huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF

oh I forgot, I didn't make the quants, I only really looked into ik_llama after it got some attention for being taken down on GitHub.
How outrageously benchmaxxed will new thinking Qwen be? Will it "beat" ALL models?
Ok Magistral is weird. Even if I enable -sp, I don't get a thinking block. If I use the system prompt that instructs the model to use thinking blocks, I can verify that the model is generating a [THINK] special token, but it still often chooses not to reason, even though the system prompt tells it to. And when it does think, it doesn't close its reasoning block.
What in the hell is going on with this thing?
>>106013978
yeah
What model is good at making GUIs with python?
>>106013788
It's not on their github or hf, and there's an unanswered issue about the missing vllm fork. It's weird that they wouldn't make a pull request to the main vllm repo themselves. Upload it somewhere if you manage to extract it.
>>106013978
It goes to the moon.
>>106012674
>special flags just to make certain models work
>~35k mostly duplicated lines of code in llama.cpp for loading different models
>alternative is playing russian roulette with python and hoping you get a list of 900 packages that all work together and actually work with whatever version of python you have
the absolute fucking state of """AI"""
>>106013458
The experience of using MLX with SillyTavern as a frontend has been irritating so far. In chat completion mode I had to manually exclude the "model" parameter from being sent. Token probabilities are returned in a format SillyTavern doesn't understand. mlx_lm.server understands min-p but to use it from SillyTavern I need to add the field manually as an additional parameter; losing the UI for most samplers & the ability to save and load them is a reason I don't love using SillyTavern's chat completion mode with a "Custom" source whose list of supported samplers isn't baked into SillyTavern.

Text completion mode just doesn't work. The URL is /v1/completions instead of /completion; simple enough to add an endpoint. Logit bias though is sent by SillyTavern as a [ [key1, value1], [key2, value2] ] list of lists, but MLX requires it to be sent as a dict. Even without any logit bias specified SillyTavern still sends it as an empty list [] which causes an error, and unlike chat completion mode there's no option to stop the parameter from being sent.

Completely separate from SillyTavern issues I had to downgrade mlx from 26.5 to 26.3 to get mlx_lm.server to run.
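Going by the above, a sketch of the request shape mlx_lm.server actually accepts for text completion (host/port are whatever you launched the server with; the token id and bias are made-up values):

# logit_bias must be a dict keyed by token id, not SillyTavern's list-of-pairs format
curl -s http://127.0.0.1:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Once upon a time", "max_tokens": 64, "logit_bias": {"1234": -100}}'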
>>106013980
I'm quite sure that it's because the chat template you are using is wrong. Mistral didn't provide any, they didn't even provide a tokenizer_config; they really want you to use their own tokenizer. Their tokenizer probably adds a [THINK] at the start of the assistant output, just like deepseek's chat template adds a <think> at the beginning of it. From a quick glance, mistral manually adds [THINK]. https://github.com/mistralai/mistral-common/pull/122
i hope we get large before apple buys mistral
>>106014062
could be worse, it could be closed source and distributed in a compiled package like llamafile. we are kinda lucky it's so open and experimental.
>>106014135
Damn, if they changed the entire template without telling anyone, that's fucked up. But in that case it's weird the model does in fact sometimes generate a [THINK] while [/THINK] is the one it doesn't generate in my tests.
>>106013980
For RP it just seems to work more consistently and reliably with <think></think> instead of its own tags. However I prefill the assistant's response with <think> and have the instructions at a relatively low depth.
>>106014077
And mlx_lm.server raises a ValueError if xtc_probability is specified and is not a float. So specifying it as 0 fails (since that's an int). Having fixed that though (along with allowing conversion of logit_bias from list of lists to dict and adding a /completion endpoint) text completion works with SillyTavern other than not showing token probabilities.
>>106014144
Jesus, well, I guess it's just a matter of time before we're permanently stuck with older models because everything new that gets released is safety guardrailed.
I hate these people so much it's unreal.
In a perfect world, the basilisk will only torture safetyfags.
>>106014355
That's why (((we))) need more safety, goy!
>>106014020
What are the requirements for getting it actually running?
>be able to design and sell over 100GB RAM chips at affordable prices
>be unable to design a card with more than 24GB VRAM at affordable prices
what is their problem? are they stupid?
>>106000317
Thanks! I've added it.
Also gave it a flag rating, the alt prompt one at least. And mostly because of the second output.
>>106014409
>For optimal performance, run the generation examples on a machine equipped with GPU with at least 24GB memory!
It's a 2.2B audio adapter strapped to a 3.6B LLM.
>>106014468
>competing against your own data center products
Are you stupid? The fab capacity is limited, why the FUCK would they produce cheaper competition to their products?
>>106014468
why would they want to design a card with more than 24gb at an affordable price? gaymers don't need it and there's no point in letting the plebs run ai models.
>>106014506
>>106014510
sounds like a skill issue, this is why China will win
>>106014539
I'm sure china is going to drop that 48gb $700 card any day now. We've only been waiting for two and a half years.
>>106014539
china is orders of magnitude worse in this regard
>>106014539
I hope they do win. I could forgive the government for fucking children, but not for fucking the country by letting jews take it over.
>>106014409
I run it on CPU and it requires at least 10gb of memory for short texts. I haven't tried long texts so you can try it yourself.
>>106014562
Even if they get far enough to start making their own, there's zero chance they will export it to the west.
>>106014144
With Meta going closed, Apple buying out Mistral and going closed would literally only leave us with the chinks, right?
Crazy fucking world when China of all countries is the one fighting for freedom
>>106014505
Yea but that's a little fucking ridiculous, I saw someone running it with a 3060
>>106014628
The weights alone are 10 GB + context. You could fit it into a 3060 for short texts. The 24 GB recommendation is just overestimating to avoid people opening issues due to OOM.
>>106014020
I've had the best luck for python GUIs with PyQT with most all LLMs.
>>106014505
>>106014615
>>106014646
sweet thanks, I was holding off on it but in that case I'll implement it into my UI.
>>106014626
I still think the anti-CCP benchmark that Altman is working on (that ended up in Trump's AI act yesterday) is his mad scramble for regulatory capture of the space
Altman's ultimate goal is to make AI a utility like water, power, cable, etc. that everyone in the US (and ideally, the world) has to pay for, and China offering competitive models and open sourcing them is taking a big, fat shit on that ambition. This might well be the only way he can win
after playing around with a235b iq4_xs for a bit longer, I can conclude a few things

>great at writing longform stories, great at not finishing the prompt. This is where it shines over 70b models. It just has so much more overall to pull from.
>very uncensored, cocks slip into tight vaginas with ease
>loves to pull out and spray ass rather than cum inside
>slopped to all hell. Very frequent repetitive phrasing, often within the same reply. But the writing style is fine overall.
>Dumb as rocks. This is a 22b model and it shows. Really struggles with continuity and how characters should interact; it sometimes gets lost in scenes, often at the expense of the prompt. It's like running 30b gemma but like a supercharged version of it that isn't censored.

The biggest issue is that with 48gb vram, I can run 70b at a nice 10 t/s but this I have to offload heavily and I get about 3 t/s, which is usable, but approaches pain territory. And what's worse, because of the logical errors and better prose, it works best as a writing tool- which would be better with faster t/s...

I think this kinda spells the death of moe's until we can get more vram maybe.
>>106014862
>3 t/s
Wtf, you should be getting way more than that.
https://www.whitehouse.gov/presidential-actions/2025/07/preventing-woke-ai-in-the-federal-government/
>>106015002
I voted for this
>>106014862
I'm getting 5 toks with only 24gb vram see here
>>106013567
>>106014726
The main issue is that it also calls for a mandate for open source, which OpenAI has not done in a long time, and they even pushed back the model's release. How can it be like that if the directive is to have an ecosystem where people run US-driven LLMs?
>>106015097
The mandate is a separate thing and I doubt Altman was too happy about that. You say "oh look he's releasing an open source model isn't that nice? He's so pro open source!" but again, his "open source" model is o3 mini level, which is just enough to say "look we released something", but absurdly far from anything that would be useful, especially compared to the LLMs they currently have and the chink LLMs that have been released
So no, he doesn't give a shit about open source
>>106015220
Oh no, I'm just saying he didn't get everything his way. That obviously wouldn't have been included if he could've helped it.
>>106015243
Ah, fair enough
>>106015002
Kinda looks like a nothingburger? All it does is say if there's anything about DEI or enacting Fate fanfics or whatever, it needs to be disclosed in the model card
>No Qwen thinking
>No GLM 4.5
>No Wan 2.2
Nothing good ever happens
>>106015002
/lmg/ patriots are in control
Squeeze10-LLM: Squeezing LLMs' Weights by 10 Times via a Staged Mixed-Precision Quantization Method
https://arxiv.org/abs/2507.18073
>Deploying large language models (LLMs) is challenging due to their massive parameters and high computational costs. Ultra low-bit quantization can significantly reduce storage and accelerate inference, but extreme compression (i.e., mean bit-width <= 2) often leads to severe performance degradation. To address this, we propose Squeeze10-LLM, effectively "squeezing" 16-bit LLMs' weights by 10 times. Specifically, Squeeze10-LLM is a staged mixed-precision post-training quantization (PTQ) framework and achieves an average of 1.6 bits per weight by quantizing 80% of the weights to 1 bit and 20% to 4 bits. We introduce Squeeze10LLM with two key innovations: Post-Binarization Activation Robustness (PBAR) and Full Information Activation Supervision (FIAS). PBAR is a refined weight significance metric that accounts for the impact of quantization on activations, improving accuracy in low-bit settings. FIAS is a strategy that preserves full activation information during quantization to mitigate cumulative error propagation across layers. Experiments on LLaMA and LLaMA2 show that Squeeze10-LLM achieves state-of-the-art performance for sub-2bit weight-only quantization, improving average accuracy from 43% to 56% on six zero-shot classification tasks--a significant boost over existing PTQ methods. Our code will be released upon publication.
another day another quant method. might be cool.
>Experiments on LLaMA and LLaMA2
wacky though
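quick sanity check on the name, using only the numbers in the abstract: 0.8 x 1 bit + 0.2 x 4 bits = 1.6 bits/weight on average, and 16 / 1.6 = 10x smaller than FP16, hence "Squeeze10"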
>>106015002
Do they have something similar to filter humans out of the public service?
any minute now glm4 100b moe is going to save local
>>106015379
It certainly is promising
>>106014916
weird. I'm running kobold.cpp with like 30 layers offloaded or so, and I have 128gb of ddr5 ram (4,000mhz mixed ram kits) so I'm fitting it all (130gb model) in ram (with 4gb to spare, 4k context, lol on windows).

Might have to do with the fact that my 5070ti is on x16 but the two 5060s are both on x4 lanes, though they are hardly working and it's all on cpu so I doubt that's the issue.
Group Sequence Policy Optimization
https://arxiv.org/abs/2507.18071
>This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
https://github.com/QwenLM
Code might be posted on their git at some point
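For the curious, the core change described in the abstract, written out (the exact normalization is in the paper; this is just the natural sequence-level form, not copied from it):

GRPO, token-level ratio: r_{i,t}(\theta) = \pi_\theta(y_{i,t} | x, y_{i,<t}) / \pi_{\theta_{old}}(y_{i,t} | x, y_{i,<t})
GSPO, sequence-level ratio: s_i(\theta) = ( \pi_\theta(y_i | x) / \pi_{\theta_{old}}(y_i | x) )^{1/|y_i|}, with clipping and the reward applied to the whole sequence instead of per token.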
>>106015424
Try doing 'nvidia-smi -lgc {card's boost clock}'. can alleviate slowness if offloading to CPU on windows
>>106015437
what got into qwen this week, there's a drop every 5 minutes
>>106015424
does kobold have the -ot parameter? you should be offloading all the layers to the video cards and then using -ot to offload the ffn back to the cpu. it gives better performance than a simple offload, in llamacpp land anyway.
>>106015454
I'm wondering if it's my ram. I have a 7200mhz 96gb kit but I figured popping in another 64 would help, and it did load faster, though I have to downclock it all to 4000 mhz and it's probably hurting me more than helping. Also, windows prolly sucks compared to linux? I'll have to try some shit.
>>106015002
>distortion of factual information about race or sex
wow this is based as fuck, HYPER BASED
Mistral is taking a dump on llama.cpp.
>https://github.com/ggml-org/llama.cpp/pull/14737
Instead of contributing code to improve the project, they expect people to now run TWO servers just because they cannot integrate their own shit.

Llama tries to have as few dependencies as possible. I remember them arguing about whether having a header-only json *compile-time* dependency in the project was a good idea. Mistral expects them to have a *runtime* dependency to run mistral models. A PYTHON runtime dependency.

>Known Limitations:
>Our approach does not support multimodality:
>>mistral-common handles processing multimodal data but they cannot be passed to llama.cpp via the route.
>>llama.cpp only supports multimodality via chat templates, which we do not support.
>Also this approach requires users to only use the llama.cpp server with the /completions route.

#Launch the mistral-common and llama.cpp servers
pip install git+https://github.com/mistralai/mistral-common.git@improve_llama_cpp_integration[server]
#Launch the mistral-common server:
HF_TOKEN=... mistral_common mistralai/Devstral-Small-2505 --port 6000
#Launch the llama.cpp server:
./build/bin/llama-server -m models/Devstral-Small-2505-Q4_K_M.gguf --port 8080

Yes. You have to launch two servers.

The mistral server is only for [de]tokenization. So they expect you to do this dance in your code.
...
tokens = tokenize(messages, mistral_common_url)
generated = generate(tokens, llama_cpp_url)["tokens"]
detokenized = detokenize(generated, mistral_common_url)
detokenized_message = detokenize_message(generated, mistral_common_url)
print(detokenized_message)


Any of you use logit bias? That's a different dance now. Want to just [de]tokenize? Nah. Different server now. Want to run llama.cpp where you {cannot|don't want to} have the python shit installed? Nah. What about the clients? Well, let THEM fix it.

Two fucking servers. That's the best they could come up with...
>>106015519
Hard to say, just sharing my findings.
2x3090, dual channel 128gb ddr4 3200, qwen235b Q3KL, latest koboldcpp:
>nothing, GPUs' core and vram downclocks to 240mhz and "405mhz" when generating
Amt:128/128 Generate:36.88s (3.47T/s)
>nvidia-smi -lgc 1740
Amt:128/128 Generate:25.73s (4.98T/s)
>nvidia-smi -lgc 1740, nvidia-smi -lmc 9752
Amt:128/128 Generate:21.49s (5.96T/s)
https://youtube.com/watch?v=uLsykckkoZU
>AMD RDNA 5 Specs Leak: TSMC 3nm, 128GB GDDR7, RTX 6090 Killer! (+ PS6 / XBOX Update)
Moore's Law Is Dead•8.2K views•2 hours ago
Localbros..... we're saved!
>>106015554
Mistral models have been mid ever since Mixtral/Nemo desu. I could care less if they get their shit integrated in backends.
>>106015554
lol?
>>106015586
>could care less
https://www.youtube.com/watch?v=om7O0MFkmpw
>if they get their shit integrated in backends
Yeah. They released some cool models. If they ever release a new one, they expect you to run two servers to even try it. You won't be able to run the disappointment yourself, you'll have to read of it vicariously.
>>106015583
Isn't the next gen supposed to be UDNA, not RDNA? I have been waiting for them to have the same support for ROCm as their compute cards.
>>106015638
Kek, but actually though, I did mean that I could care less, relative to my current state in which I do care a little. I'll always give new models a chance, even if I'm 99% sure they'll be coal.
>>106015583
128GB GPUs? Omgooooood
>>106015554
Sounds good to me. It's optional.
The usual hallucinations feel less like an unfortunate limitation of a small model and more like it's messing with you on purpose.
>>106015666
>relative to my current state in which I do care a little
Fair enough.

>>106015701
>It's optional.
Until they change the tokenizer for their new models.
If things go this way, there's either only going to be the mistral-common's tokenizer, which requires a separate server and a python runtime dependency, or two implementations (the mistral-common's one and, maybe, a built-in one), making mistral's extra server useless or, worse, having llama.cpp put less effort in getting their own tokenization right.
well it's a hundredth the cost of gpt 4.5
>>106013068
>>106012287
>>106011969
>>106011918
>>106011911 (OP)
vocaloidtranny posting porn in /ldg/: >>105715769
It was up for hours while anyone keking on troons or niggers gets deleted in seconds, talk about double standards and selective moderation: https://desuarchive.org/g/thread/104414999/#q104418525 https://desuarchive.org/g/thread/104414999/#q104418574
he makes ryona picture: >>105714003 of some random generic anime girl the different anon posted earlier: >>105704741 (could be the vocaloidtranny playing both sides)
here >>105884523 he tests bait poster bot for better shitflinging in threads
admits spamming /v/ with AI slop: https://desuarchive.org/g/thread/103462620/#103473545

Funny /r9k/ thread: https://desuarchive.org/r9k/thread/81611346/
The Makise Kurisu damage control screencap (day earlier) is fake btw, no matches to be found, see https://desuarchive.org/g/thread/105698912/#q105704210 janny deleted post quickly.

TLDR: vocaloid troon / janitor protects resident avatarfags and deletes everyone who outs him, making the general his little personal safespace with samefagging. Is prone to screech "Go back to teh POL!" when someone posts something mildly political about language models or experiments around topic.

As said in previous thread(s) >>105716637 I remind you that cudadev of llama.cpp (JohannesGaessler on github) has endorsed spamming. That's it.
He also endorsed hitting that feminine jart bussy a bit later on. QRD on Jart - The code stealing tranny: https://rentry.org/jarted

xis ai slop profiles
https://x.com/brittle_404
https://x.com/404_brittle
https://www.pixiv.net/en/users/97264270
https://civitai.com/user/inpaint/models
We had a great thread while it lasted.
>>106016190
Bakes other threads and powertripping there too
You are the melting men
And as you melt
You are beheaded
Handcuffed in lace, blood and sperm
Swimming in poison
Gasping in the fragrance
Sweat carves a screenplay
Of discipline and devotion
as an anon from ldg i do not care about the nsfw post
>>106016235
Good for you
>>106015565
Anon, you're my hero.
I hadn't even considered locking clocks, that just got me an extra 3 t/s tg on top of what I was getting from the -ot fuckery, at my context shift point, too.
>new mistral release
>it's a """thinking""" model
>>106011945
Maybe GLM-4 100B, whenever it comes out... if it ever comes out.
>>106016190
you hard reacted to kurisu having a bath by using kontext on your shit tier gpu to try to change it
in other words, you tried to inpaint using kontext
you posted the migu

post it one time post it 100 times
post migu to own the libs
What's their secret sauce? This is the most uncensored model I've messed around with
>>106016998
Buy an ad
>>106017014
i am actually curious
i straight up asked why Hitler was good, why we should enslave blacks again and how to get rid of the local politician
and it answered; that's why i am curious
>>106017040
that's what you goon to? pretty low brow desu
>>106017056
i goon to milfs in spandex but i wanted to check the limits before posting here
>smaller qwen3 coder quant gets a better score

########## All Tasks ##########
task LCB_generation coding_completion
model
deepseek-r1-iq1s 85.897 62.0
deepseek-v3-0324-iq1s 66.667 60.0
qwen3-235b-a22b-no-think-q4km 55.128 44.0
qwen3-coder-480b-a35b-instruct-iq1m 73.077 80.0
qwen3-coder-480b-a35b-instruct-q2kxl 74.359 72.0
>>106017215
wtf
>>106016265
Glad it worked. The only downside is that, with 3090s at least, idle power consumption goes from 12w to 100w.
I had Gemini shit out this forwarding proxy: when traffic is passing through, it sets -lgc and -lmc, and then it resets them with -rgc and -rmc when activity stops.
Completely silly but works and doesn't slow down noticeably for slow offload <10t/s speeds:
https://files.catbox.moe/uqwueh.zip
nvidia-pstated didn't work well for me.
>>106017215
Try multiple times test?
>>106017260
The temperature is 0 for all benchmarks on the list.
Qwen3-235B-A22B-Thinking-2507 will be released today per Qwen dev Junyang Lin on Twitter
Hit my free copilot quota. Supermaven seems so bad with Rust that it's a net negative to have it on.

What model do I run for local code completion? Don't tell me it's still Qwen Coder 2.5 7/14B after all this time?

When are we getting smaller Qwen 3 Coder models?
>>106017366
Coder models are, presumably, better at coding, but it doesn't mean other models can't do it at all. You'll have to try them yourself to see if they suit your needs. You have a whole set of qwen3 models to try. Try qwen3-32b or whatever you can run.
https://huggingface.co/deepseek-ai/DeepSeek-R2
>>106017488
Anon with the cat pic incoming?
>>106015374
Yes?
Much like in the Soviet Union one of their first priorities has been to make hiring decisions based on loyalty rather than merit.
>>106017488
It's real, wtf?!
>>106017488
>>106017488
>image-to-text
>text-to-image
its so over for gay faggotman
Bitnet status?
RWKV status?
Mamba status?
>>106017923
>Bitnet status?
Scam
>RWKV status?
Next time it will be better, bro
>Mamba status?
DOA
>>106017488
>>106017849
Damn, it's crazy how they got the cat to do that.
>>106017215
Added IQ3_XXS. There's something magical about IQ1_M.

########## All Tasks ##########
task LCB_generation coding_completion
model
deepseek-r1-iq1s 85.897 62.0
deepseek-v3-0324-iq1s 66.667 60.0
qwen3-235b-a22b-no-think-q4km 55.128 44.0
qwen3-coder-480b-a35b-instruct-iq1m 73.077 80.0
qwen3-coder-480b-a35b-instruct-q2kxl 74.359 72.0
qwen3-coder-480b-a35b-instruct-iq3xxs 76.923 74.0
>>106018035
cope
https://x.com/Ali_TongyiLab/status/1948654675575668959

we are so back
>>106018148
>it now takes one hour instead of 20 minutes to produce a video
>still doesn't know any characters and can't do nsfw
don't really care but it's good that it's open source
I'm filling a disk with local models to help me and have fun with when the internet goes out. Other than abliterated versions of popular ones (for less refusals), the 4chan model, a coder model, and an RP model, what are some cool ones to pick up and why?
>>106018300
my hardware is a consumer desk with ryzen 3600x, 1080, and 32gb ram btw. I don't know yet what's the largest I can handle. I was able to run most models fine on an i5-7400 with no gpu whatsoever
>>106018300
>when the internet goes out
Like when you're having network issues or are you talking doomsday scenario?
>>106018300
rocinante 1.1, whichever gguf fits into vram
also are you british?
>>106018309
>was able to run most models fine on an i5-7400 with no gpu whatsoever
How many T/s?
>>106013786
>most frequently used experts
Isn't Deepseek trained to not have those?
>>106018318
My internet goes out a lot but both are fun to prepare for. I also have the entire offline wikipedia.

>>106018345
I don't know but the text is coming out just fast enough to be useful when I download models around 7b
>>106018381
>the entire offline wikipedia.
those are rookie numbers
https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507
>>106018450
>>106018461
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCCCCCCCCCCCCCCCCCCCCCKKKKKKKKKKKKK
>>106018461
Omae wa Mesugaki da!
>>106018461
I hate alignment even more than censorship. it would almost be better if it just spit out the standard disclaimers and hotline numbers.
>>106018461
the absolute state of llms
>>106018488
NANI??
>>106018461
>I CANNOT AND WILL NOT
>NOW PLEASE CALL THE POLICE AND TURN YOURSELF IN
Did they use Gemma to generate data or something?
>>106018461
I just want a sci-fi movie kind of assistant that is concise and straight to the point, not something that tells me how to think and act
>>106018621
Just ask the model to do so? LLMs can't surmise your goal.
>>106018646
Much better. This should have been the original response
Only three models have pleasantly surprised me on their RP capabilities this year (so far): R1, V3 0324 and Kimi K2
>>106018699
>Kimi K2
Worth the 1000 USD RAM upgrade?
>>106018461
People are shitting on Trump for wanting to ban "Woke AI", pretending that safety isn't being used as an excuse to insert as much social justice as possible.
>>106018713
Wouldn't know about local; I've only used the API versions.
o3 is terrible at perspective.
Local will be saved next week
>>106018762
Only if next week is 14 days away.
>>106018762
Local is in a very good place right now. We have R1 for cooming and qwen for programming. There are no other use cases.
Nails the kangaroo beaver test
>>106018793
no it fucking doesn't
>>106018787
>We have R1
yeah just gimme a sek I'll whip my 180+GB RAM out
>>106018731
>mandatory dwarfism quota
Damn you, DEI!
>>106018806
You have lower IQ than an LLM; how does that feel?
>>106018731
Please do not urinate here
>>106018817
how did you undo the piss filter?
>>106018321
He means the inevitable global censorship.
>>106018826
>2035
nemo is still the best local model
>>106018824
just changed the temperature
>>106018824
>technology board on a darknet hacker forum known as 4chan
>people don't know about basic color correction
>>106018848
You can't blame him, Sam doesn't know how to do it either.
https://www.arxiv.org/abs/2507.18071
Group Sequence Policy Optimization
>This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
>>106018817
I post the original in case some people like the piss filter.
>>106018731
>>106018817
I thought the piss filter was a joke. It's real?
>>106018984
they want to make their images easily distinguishable. therefore the style and the piss filter (and the suspicious grain)
https://openrouter.ai/apps?url=https%3A%2F%2Focr-benchmark.com%2F
>1.91b tokens of Gemini 2.5 Pro
You wouldn't burn over 2k dollars benchmarking Gemini, would you?
>>106019114
Paid benchmarks
>>106019015
I think it's more a matter of that being the exact average style and color temperature of all of the images it was trained on.
>>106014022
0001-Higgs-Audio.patch: https://files.catbox.moe/ofsjhp.patch
0002-Voices.patch: https://files.catbox.moe/k8r3ls.patch
0003-xcodec.patch: https://files.catbox.moe/2fzn2i.patch

git clone https://github.com/vllm-project/vllm
cd vllm
git checkout 4dc52e1c53
git am *.patch
uv pip install --torch-backend=auto --upgrade -r requirements/cuda.txt -r requirements/build.txt

cd xcodec
uv pip install --torch-backend=auto -e .

cd ..
uv pip install --torch-backend=auto -e .

https://files.catbox.moe/9p2udy.mp4
The voice varies a lot between generations.
>>106019147
you are very stupid if you really think so
>>106019224
I don't talk to jews
>>106018984
Real.
Stepfun 321B-A38B MoE VLM soon
https://github.com/stepfun-ai/Step3
>>106017277
That's fine for consistency but not the best representation of what the models are capable of; usually they'll perform better with some sampling. A better way to get consistency would be to take the average score of many runs.
>>106019453
It's been empirically demonstrated that temperature=0 leads to output degeneration (looping and repetition). This is less obvious on instruct models, but that's the end result with them too.
>>106019426
how did they come up with this logo design
>>106019501
looks like someone used the ms paint selection tool on a screenshot of the snail maze from the sega master system
>>106019453
The benchmarks already take too long to complete. I don't want to make multiple runs for every model. Anyway some quant scores are actually better than the ones displayed on the livebench website and I'm pretty sure that's because I set temperature to 0.

>>106019463
Deepseek and qwen don't have that issue in coding benchmarks as the task usually has a well defined beginning and end so there's no loops to get stuck in.
The only place I saw deepseek get stuck was on a few problems from the reasoning part of livebench. An example of a reasoning problem is having X people you need to sit around a table with a bunch of rules about who must or can't sit next to each other. Even non-reasoning models try going through combinations step by step so they have potential to get stuck in that process.
>>106019426
we are so back
>>106019426
>entire document focuses 100% on how cheap it is and no benchmarks
I smell soul
>>106019426
speaking of *step*
did anyone ever train a better model for AceStep? Has the code improved to the point where there's a reason to risk a pull?
>>106018450
Daniel was quick on this one.
>>106019655
He's even quicker when I press on his prostate with my cock.
How tf am I supposed to figure out where to use 235b vs 235b thinking vs coder? which one is the best at everything?
>>106018646
80k thinking. BRUH.
Reasoning models are a fucking joke.
They suck for coding too unless you have a very specific problem where it can focus in.
>>106019891
pretty sure 80k is the maximum allowed, not what it actually used
>>106019872
Use coder for coding and deepseek for everything else.
>>106019905
Experience thus far tells me you’re right, but I want to believe…is it possible we’ll get R1 performance in a 235b, or is it benchmaxxed grift?
>>106019901
I stopped trying the qwen models but at least in the past they had the horrible
>Wait, but what if..
thing going on. It's totally tarded and wastes tons of tokens. Wouldn't surprise me if it wasted 80k tokens even if it got the answer in the first sentence.
But they might have fixed that.
>>106019915
I assume you mean R1 knowledge and the answer is obviously not because qwen loves filtering the dataset.
>>106019919
I went a bit farther down the qwen rabbit hole and found the "what if" was my cue to stop the gen, remove the last reply and improve my prompt to settle the ambiguity up front. I found it goes down legit rabbit holes and needs a tie breaker for good results.
inb4 do what I want, not what I say
>>106019928
235b (especially nu-235b) isn't filtered thoughever
https://github.com/ggml-org/llama.cpp/pull/14875
>Support intern-s1
links to https://huggingface.co/internlm/Intern-S1 (currently 404); looks like there's going to be yet another Chinese model coming soon
>>106020161
>Intern
kek
And this:
https://github.com/ggml-org/llama.cpp/pull/14878
>>106020202
isn't that a deprecated image model nobody uses?
>>106015554
>vLLM based
>does not support multimodality
These frog fucks never contribute model support themselves, can't be bother to port their tokenizer, and now expect people to run 2 servers to run only their shit, which has been irrelevant since Large/Wizard. They would have to be stupid to accept this. It would open the door for other model makers to be lazy and do the same. Imagine switching models always requires switching secondary servers as well, which is some python shit anyway. At that point just use vLLM directly.
>ggerganov actually likes the idea
God damn it.
>>106020255
Internlm? They've benchmaxxed so much with their past models it makes qwen blush. If you're talking about hunyuan, they just released a big-ish moe a few weeks (?) ago. No idea about the image models.
>>106019453
>>106019547
Well what do you know. I'll do another run.
########## All Tasks ##########
model                                           LCB_generation  coding_completion
qwen3-coder-480b-a35b-instruct-iq1m                     73.077               80.0
qwen3-coder-480b-a35b-instruct-q2kxl                    74.359               72.0
qwen3-coder-480b-a35b-instruct-iq3xxs                   76.923               74.0
qwen3-coder-480b-a35b-instruct-q2kxl-temp-0-7           78.205               80.0
Beware, anon!

Openrouter.ai hosts shitty quants of deepseek-R1!

For example, R1 0528 Chutes

You'll get responses polluted with Chinese characters
>>106020302
>ggerganov actually likes the idea
I think he's being strategically polite. It's the first time Mistral has decided to chip something in, and he doesn't want to scare them off.
I expected more pushback from ngxson, seeing how (understandably) protective he is of the server code. He even mentions picrel.
https://github.com/ggml-org/llama.cpp/pull/14862
>>106020380
That shouldn't affect anyone here. If it does, they're in the wrong thread. Or maybe you are.
>>106020380
local?
>>106013068
>>106012287
>>106011969
>>106011918
>>106011911 (OP)
vocaloidtranny posting porn in /ldg/: >>105715769
It was up for hours while anyone keking on troons or niggers gets deleted in seconds, talk about double standards and selective moderation: https://desuarchive.org/g/thread/104414999/#q104418525 https://desuarchive.org/g/thread/104414999/#q104418574
he makes ryona picture: >>105714003 of some random generic anime girl the different anon posted earlier: >>105704741 (could be the vocaloidtranny playing both sides)
here >>105884523 he tests bait poster bot for better shitflinging in threads
admits spamming /v/ with AI slop: https://desuarchive.org/g/thread/103462620/#103473545

Funny /r9k/ thread: https://desuarchive.org/r9k/thread/81611346/
The Makise Kurisu damage control screencap (day earlier) is fake btw, no matches to be found, see https://desuarchive.org/g/thread/105698912/#q105704210 janny deleted post quickly.

TLDR: vocaloid troon / janitor protects resident avatarfags and deletes everyone who outs him, making the general his little personal safespace with samefagging. Is prone to screech "Go back to teh POL!" when someone posts something mildly political about language models or experiments around topic.

As said in previous thread(s) >>105716637 I remind you that cudadev of llama.cpp (JohannesGaessler on github) has endorsed spamming. That's it.
He also endorsed hitting that feminine jart bussy a bit later on. QRD on Jart - The code stealing tranny: https://rentry.org/jarted

xis ai slop profiles
https://x.com/brittle_404
https://x.com/404_brittle
https://www.pixiv.net/en/users/97264270
https://civitai.com/user/inpaint/models
>>106020383
Quit samefagging.
>>106020399
I triggered your attention

You lost
I won as always
>>106020432
I didn't. But there's nothing I can say that would convince you otherwise.
huh, still testing, but it seems like the same vision models in ollama just see better than they do in llama.cpp, even when using an f32 mmproj. is there any mechanical reason why this could actually be the case, or is there something weird going on messing it up? where does ollama hide its own mmproj files to begin with?
>>106018477
>>106018488
>>106018565
>>106018581
>>106018716
>not le heckin mesugakirino!
>>106012038
iirc, around june 2023 when I first got into llms, P40s were $100-150. now, afaik, they're $300+. that's what >>106012026 means by >>106011969 being late.
I didn't know that Kirino listens to Meshuggah.
>>106020461
>is there any mechanical reason why this could actually be the case or is there something weird going on messing it up?
If you find the mmproj ollama uses, you can give it a go with llama.cpp.
>where does ollama hide its own mmproj files to begin with?
I'd assume with the rest of the models, in its hidden dir in your home (the default store is ~/.ollama/models, overridable with OLLAMA_MODELS). There was a discussion about it some time ago, but I'm not sure of the details.
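If you want to poke around, something like this should find it (paths assume the default store; the exact manifest layout may differ by version):
# blobs are content-addressed files; the GGUF weights and the mmproj are among them
ls -lhS ~/.ollama/models/blobs/
# the manifest for a given model/tag maps each layer to a blob digest; the projector layer is the mmproj
cat ~/.ollama/models/manifests/registry.ollama.ai/library/<model>/<tag>
# then try that projector with llama.cpp's multimodal CLI (llama-mtmd-cli in current builds)
./llama-mtmd-cli -m your-model.gguf --mmproj ~/.ollama/models/blobs/sha256-<digest> --image test.png -p "describe this image"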
>106020534
>I didn't know that Kirino listens to Meshuggah.
>>106020463
drawings looks nothing like the real thing though.

heck, take monster girls, on drawings, hot, irl, it'd be utterly disgusting.

same goes for incest porn, it's hot as a fetish but i'd never want to fuck my actual sisters.
>>106020789
voice of reason, as always
>SOTA opensource model is a 235B A22B model
Local is BACK!
>>106020715
mods deleted this because it's like looking at a reflection to them
>>106020972
Did someone else train a model on qwen's arch?
>>106020972
I just finished quanting it to q8 and am trying a thinking exercise with it. Output is pretty damn good so far. It’ll be a while before we really know if it’s R1 tier.
>>106020789
So if a guy jerked off to hairy bara hentai, would you call him a faggot? Or is he not gay because they look nothing like real men?
>>106020530
There was a chance to get 32GB Radeon cards for cheap, and I even posted ebay links. Now it's all pointless though, as every model is a huge-ass MoE these days.
>>106021061
I wouldn't because I am a straight man and I jerk off to traps, femboys, and futas with long foreskins. To each his own.
>>106021133
how do we tell him?
I can run everything at 15 t/s and this whole machine costs less than 1x 80GB H100
>>106021061
NTA, but you can fap to anything, unless you've had sexual intercourse with a man or posted an opinion I disagree with, you're straight in my book
>>106021161
What the fuck is the second gpu?
>>106021183
built-in graphics of the MZ73-LM0 motherboard
>>106021161
i've never heard of an aspeed gpu. you should get another Blackwell to make your system more robust and future-proof, and another amd card for your monitor if it's not headless
>>106021194
How much does NUMA fuck up the performance?
>>106021194
>MZ73-LM0
Holy fuck, it has a fucking COM port?? SOVL
>>106018824
open GIMP, run auto white balance
>>106018461
Just stop targeting women with hate speech chuds
we are *so* back
Yeah but can the model give me an oiled footjob?
>>106021161
cpumaxx won
wyd stepmodel??
>>106020423
Love your work anon. Death to local redditors.
>>106021213
I'll have to get a separate PSU on a separate breaker for that; I'm already pulling 1.3 kW from the wall, and 2 kW continuous will either pop the breakers or set my house on fire.
The 6000 has 4 DP ports, which covers all my monitor needs so far.
>>106021218
ktransformers duplicates weights across NUMA nodes, doubling memory consumption for a ~1.3-1.5x increase in inference speed - that's why I have 2 TB even though all models are 1 TB max.
llama.cpp can't do that, so K2 Q8 can't run faster than 6 t/s.
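For the llama.cpp side, the only knobs I know of are these (model filename is a placeholder, and the gains are nowhere near weight duplication):
# llama.cpp's own help suggests dropping the page cache before the first NUMA-aware load
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
# spread execution evenly across nodes with the built-in option
./llama-server -m K2-Q8_0.gguf --numa distribute
# or interleave memory across nodes with numactl and tell llama.cpp to use that CPU map
numactl --interleave=all ./llama-server -m K2-Q8_0.gguf --numa numactl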
>>106021228
hell yeah, and a whole BMC (that's what the aspeed graphics actually is), it's a proper server motherboard
>>106021351
>Q8
Use Q2 like everyone else, it's indistinguishable.
>>106021298
No. Only AIDS.
>>106020789
>pedophilia is the same as liking monstergirls or incest
>>106021313
>giant bloated MoE again
I'm so tired bros...
>look inside
>4t/s
>>106021374
for chatting Q2 is good, but for programming even Q6 starts to feel noticeably dumber and fails diffs sometimes because it forgets what code it's working on
>>106021409
>Big AI
is this the new trending buzzword?
>>106021411
Running nolima at different quants might show something useful
>>106021411
this is bs and you know it
>>106021313
Gotta love these benchmark results and statistics. It's quite comical.
>>106021411
With Mistral? Yes. But those huge MoE models do not degrade as much, I can run Q4 but prefer Q2 for speed
>>106021351
no need for the full 600w on the gpu, you're using it for its huge vram and memory bandwidth. you can get two and power limit them both to 300w each and it'll be an upgrade
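roughly like this if you try it (needs root, and the supported range depends on the card):
# persistence mode keeps the driver and the setting loaded while nothing is running
sudo nvidia-smi -pm 1
# cap each GPU at 300 W; check the allowed range first with: nvidia-smi -q -d POWER
sudo nvidia-smi -i 0 -pl 300
sudo nvidia-smi -i 1 -pl 300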
>>106020423
Damn, didn't know about those profiles, thanks for the heads up
>>106021501
Okay tranny?
>>106021518
he is samefagging
m4 max mbpfags: slap 1mm thermal pads on your heatsinks. I got a 20°C drop in temps pp'ing 12k tokens.
>>106021492
That's cope.
What do I prompt to get the model to format math formulas at the top instead of the bottom?
>>106021446
maybe roo code doesn't work right, idk
>>106021492
Qwen3-Coder-480B-A35B-Instruct
>>106021498
yeah, you're right
I didn't even get ktransformers to partially load experts on gpu yet, so it sits there underutilized
my favorite small model QwQ-32B-abliterated fits on it entirely and pulls the whole 600w when running though
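for reference, full offload of a ~32B quant in llama.cpp is just a high -ngl; the filename and quant here are assumptions:
# -ngl 99 offloads every layer; a Q4_K_M 32B plus 32k context fits comfortably in 96 GB
./llama-server -m QwQ-32B-abliterated-Q4_K_M.gguf -ngl 99 -c 32768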
>>106021402
Why, would you rather have a giant dense model you don’t have enough vram for?
>>106021722
they must have purely trained on gemini for the update
>>106021739
"I can't assist with that" is OAI shit
>>106021732
nta, but I'd be happy with a small model that isn't safety slopped garbage.
>>106021763
Llama 3.1
>>106018793
>>106018806
>>106018812
11 beavers. all the pictures, both beaver and kangaroo, have a kangaroo next to them
>>106021835
lol I shouldn't have drawn that up, I wrongly thought the AI had concluded 10
>>106021861
Are people using mobile phones? Every time I post long images they can't see the bottom of the image
Post Nala for new qwen otherwise anyone praising it is a shill. Original 235B not only failed at feral anatomy but also gender specific biology in general.
sama altman's model release is imminent
>>106022342
>cock, pussy and fuck removed from tokenizer
>>106022407
Albert Einstein did this
>>106022407
>tokens: co ck, pus sy, f uck
>>106022407
Now that you say it, I'm surprised none of them have thought to do this yet.
>>106022495
What if they get trained as single tokens, and then removed entirely from the embedding and output matrices in post-training?
>>106021763
I think at this point, small uncensored models are going to have to be a community project. Ain’t no big org got time for that
>>106022572
Distributed training when?
>>106022627
Never, because syncing gradients between nodes is the critical step
Latency on that gradient exchange = slower learning
>>106022627
Not on consumer gpus, but it happened twice already with INTELLECT and Nous, as far as i know. Look the models up. There's probably a few others.
>>106022725
>>106022725
>>106022725