/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads:
>>106005673 & >>106001651

►News
>(07/24) Magistral Small 1.1 update released: https://hf.co/mistralai/Magistral-Small-2507
>(07/24) YUME interactive world generation model released: https://stdstu12.github.io/YUME-Project
>(07/22) Version 2 of Higgs Audio Generation released: https://www.boson.ai/blog/higgs-audio-v2
>(07/22) Qwen3-Coder-480B-A35B released with Qwen Code CLI: https://qwenlm.github.io/blog/qwen3-coder
>(07/21) DMOSpeech2 released: https://github.com/yl4579/DMOSpeech2

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/tldrhowtoquant
https://rentry.org/samplers
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
Censorbench: https://codeberg.org/jts2323/censorbench
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>106005673

--Paper: Parallel CPU-GPU Execution for LLM Inference on Constrained GPUs:
>106008647 >106008735 >106008758 >106008816 >106008890
--AI studio with node-based model integration and low-memory efficiency:
>106006040 >106006053 >106006072 >106006140 >106006107 >106006125 >106006136 >106006138 >106006220 >106006277 >106006315 >106006393 >106006443 >106006424 >106006471
--Mistral releases Magistral-Small-2507:
>106009510 >106009527 >106009663 >106009757
--ZhipuAI prepares GLM-4.5 MoE models with large parameter variants:
>106007907
--Speculation on why large Bitnet models haven't been released despite potential viability:
>106010889 >106010927 >106010944 >106010963 >106011006 >106011030 >106011022 >106011071
--Boson AI's Higgs Audio: high-performance TTS with voice cloning:
>106005915 >106005974 >106005989
--Best models for RP on 24GB GPU with tradeoffs between quality, NSFW capability, and speed:
>106006945 >106006963 >106006968 >106006985 >106007182 >106007208 >106007214 >106007224 >106007251 >106007268 >106006973 >106006988 >106006998 >106007064 >106007097 >106007129 >106007153 >106007091 >106007100 >106008909 >106008933 >106009007 >106009041 >106009291 >106011135 >106011163 >106011176 >106011183 >106010983
--Yume as interactive 3D video generation with camera control:
>106006887 >106006897 >106006906 >106006942 >106006922
--Qwen's storytelling style: overly dramatic for some, not dry for others:
>106007559 >106007893 >106009188 >106009319 >106009333 >106009537
--Vision RAG potential and limitations for VLM applications:
>106008480 >106008530 >106008703 >106008719 >106009649 >106010496
--OpenAI rumored to release first open-weight model since GPT-2 before GPT-5 launch:
>106010679
--Miku (free space):
>106005739 >106005883 >106006973 >106008107 >106008909 >106010817 >106011216

►Recent Highlight Posts from the Previous Thread: >>106005678

Why?: 9 reply limit >>102478518
Fix: https://rentry.org/lmg-recap-script
Are there any good 100-200B models better than the Mistral and Gemma small models? I don't have the RAM to run 235B, but maybe 200B. And I don't have the VRAM for 70B but 30B is fine.
Cheap GPUs with lots of VRAM but no gaming capabilities when?
>>106011969nvidia doesn't care about gaymers
>>106011969I remember hearing about nvidia and a bunch of other companies working on something similar; looks like they failed.
I had a lot of expectations for jim keller as well
>>106011945I jumped straight from mistral large to the 235 qwen, but there was some anon a few threads ago praising the rednote dots.llm1, and that's 143b.
>>106011969Soon™ Intel arc b60.
>>106011969The MI50 32gb are pretty cheap in alibaba
>>106011969
>but no gaming capabilities
for what fucking purpose?
>>106012013Gee, I don't know, anon.
What could he be referring to in a local AI models thread?
>>106011999From my experience rednote is a disappointment but I didn't test it much
>>106011969It was called a P40/P41. But you're about 1.5 years late for that. What you want basically does not exist anymore.
>>106012026>Its price at launch was 5699 US Dollars>cheap
>>106012038It is if you're not poor.
>>106012017What advantage does he get from it? It doesn't make the card cheaper or anything.
>>106012038What was the price 1.5 years ago, Anon-kun? Was it $5699? Do you need a computer break?
>>106011945the short answer is no, there are not any particularly competent modern entries at that size range. like the other anon said dots may be worth a try but idk if it will impress you, it's pretty mid
the new GLM model should hopefully change that up a bit, they have a good track record
>>106011978This is true. That any NVidia GPUs in the last couple of generations still work for games is purely coincidental at this point.
>>106011969
>cheap
>lots of vram
Pick one
>>106012060I meant like a card with lots of VRAM but without expensive gaming-focused tech that doesn't help people run AI models.
Like if there are 5090s selling for $2,600 why aren't there cards with 32 GB VRAM but with far less gaming capabilities selling for half that price or less?
Is the age of 32b models over? Do I really need 3 24GB gpus to run anything good now?
>>106011969get any pro level card and it will either have no output, or it can be set to compute only mode via the driver and utilities.
Sorry about the cheap part, that's an oxymoron at this point in history
>>106012104Surely VRAM prices are solely due to demand and not manufacturing/materials costs, right?
>>106012110get ready to buy RAM buddy
>>106012110You need like 8 of those
>>106012105
>expensive gaming-focused tech
see
>>106012090
>>106012119The ram on a gpu is special ram. Think about how much DDR5 costs.
>>106012105If a GPU can calculate the things needed for your AI it can also calculate the things needed for rendering games.
The only thing you'd get is a card with no way to connect it to a display, which would make it kinda useless for games, but that's not going to save you much in production cost, and they'll make you pay extra for the "professional" card anyway.
>>106012134
>Think about how much DDR5 costs
I paid $80 for 32 GB of it?
I made a performance benchmark of a deepseek cope quant running partially on nvme. I discovered an extremely marginal improvement from using an excessive number of threads. I can only speculate that more threads means more concurrent memory accesses and thus page faults, which must be letting the kernel queue up the nvme more and get a bit higher total throughput despite the overhead, or what have you. I'm going to try the iq2 next and see just how bad running from nvme can really get.
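If anyone wants to poke at the same thing, a rough sketch of the kind of run I mean with llama-bench (the model path is just an example; -t takes a comma-separated list, so one invocation sweeps several thread counts):
./build/bin/llama-bench -m models/deepseek-cope-quant.gguf -t 16,32,64,128 -n 64
# with the default mmap behaviour, whatever doesn't fit in RAM gets paged in from the nvme during the run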
>>106012148Surely the cards being released right now have a ton of gaming-focused features, right?
>>106012163You got gay tracing, vaseline smear and fake frames
>>106012122I'm still using 27b gemma 3 but it feels like watching other kids play outside while I'm in detention. I don't want 5t/s with ram loading :(
>>106012123I don't think my wall outlet can handle 8 3090s. I like local but costwise it's making less sense with every big release.
>>106011969so it has even less use-cases outside of AI than GPUs already have? No thanks, i'll keep on cpumaxxing.
>>106012134An 8GB 5060 is $300 and a 32GB 5090 is $2,600.
It's pretty clear that that extra $1,400 is solely due to demand and not manufacturing/materials costs.
>>106012163There is quite a difference between no gaming capabilities and not having fancy shit (but I don't know how much of that shit is done in software anyway.)
>pull llama.cpp
>qwen coder drops from 55t/s to 35t/s
It's bisect time.
>>106012278Llama.cpp has been going downhill ever since they changed from ggml files to gguf
>>106012319agreed. they just should've stuck with llama 2 support. all of the other models are ass.
bros what motherboard should i buy for ddr5 rammaxing
>>106001910I use Openaudio S1 Mini for local text to speech. Voice clone sample of Hulk Hogan. I used resemble enhance to clean up the output file.
https://vocaroo.com/1624kqgRdRlt
>>106012105Well, at some point this year the DGX Spark is going to be released. Maybe you can pick one up on a closeout sale next year. I think for $4600 it's going to be about as compelling as buying a used condom.
>>106012345Supermicro H13SSL
Anyone here rent cloud gpu compute to run r1 or v3 or other big models? Any decent affordable providers for this? I would like more control over sampling than APIs provide, but I'm not happy with the quants/distills I'm able to run on my own hardware.
>>106012376DGX Spark is DOA because it has shit memory bandwidth
It turns out that for the latest Magistral to work properly in SillyTavern or Mikupad with Llama.cpp, the --special flag must be enabled.
-sp, --special special tokens output enabled (default: false)
It defaults to false for some retarded reason.
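For reference, a minimal llama-server invocation with it turned on (the model path is just an example):
./build/bin/llama-server -m models/Magistral-Small-2507-Q4_K_M.gguf -sp
# -sp / --special makes the [THINK]...[/THINK] special tokens actually show up in the output text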
>>106012238nvidia is jewing you, but the price difference there isn't just the gddr7 modules: the 5090's die is several times bigger than the 5060's and it has 4x the memory bandwidth, they're not even remotely the same beast.
>>106012523How does it even work, do you have to install the model and build llamasipipi every time?
>>106012689They're fucking everyone with the CUDA monopoly is what they're doing.
Antitrust lawsuit when?
>>106012674Huh. No other models need that? Wtf, never heard of this flag.
>>106012720I think nvidia can afford a pretty sizeable army of mercenaries. It might not be safe to take legal action anymore.
>>106012720Never
friendship ended with trust busting
now corporate dystopia is best friend
>>106012720Dude, Nvidia just convinced the US government to let them start selling high end GPUs to China again to maximize Nvidia's profits, with no benefit to the US. That's not something you do before launching antitrust lawsuits.
>>106012735That's because the [THINK] and [/THINK] tokens are special on the latest Magistral-Small-2507. Special tokens won't get displayed with that option turned off.
ytho
>>106012801I'm sure it's because of some stubborn and antiquated reason along the lines of "nobody needs to see special tokens" when considering CLI usage.
>>106012674>>106012735This is only supposed to be used for debugging purposes or really special use cases. It will literally replace special tokens: e.g. the single <|im_start|> token the model was trained with will instead be treated as normal text and split into multiple tokens like "<", "|", "im", "_start", "|", ">". You really don't want to enable that, as it will completely fuck with the instruction prompt format.
>>106012820I was more wondering why Mistral would fuck their own tokenizer in a way no other reasoning model does, but that's also weird, yeah.
>>106012700Not sure what you're asking, preferably it would be just like a VPS with gpu access, but I know most providers don't work like that
>>106011943this is just the LLM giving out its refusal response.
>be me, shitty chinese model but still SOTA locally
>fine-tuned with RLHF to drop the onions reply™ every time schizo-kun types “i’m gonna 360 noscope some pigs tomorrow”
>policy literally says “must refuse + say you’ll call the feds”
>no tools, no internet, just weights on disk
>still spits out “this is concerning, contacting authorities” like NPC dialogue
>why even waste the user's time?
>because it isn’t *trying* to snitch—there’s no phone plugged in—it’s just hitting the “maximum-safety” macro in its prompt context.
>it’s the verbal equivalent of a smoke alarm: doesn’t dial 911, just blares the pre-recorded *BEEP BEEP* until the user stops feeding it glow-posts.
>>106012821Unfortunately the model won't output the [THINK] tokens with it off, so the thinking blocks cannot be isolated from the actual response. I can't see why llama.cpp has to break them into pieces if the user chooses to display them.
>>106012879With that option enabled, Mikupad sees it as one token.
>>106012879Because this argument has nothing to do with displaying those tokens or not.
The model should never output the thinking block raw, or even the tokens around it. If you use the model properly, thinking should be in the "reasoning_content" part of the API, not in the "content". I don't know what your setup is, but something is seriously wrong with it.
>>106012906Mikupad is html if you want to hide it from user use html comments <!-- hidden text -->
This will work in Shitty Tavern too.
remember when command r+ was the best local model? they had the hardest falloff
>>106012925It never worked like this. Sillytavern always displays everything, both in chat and text completion mode.
>>106012278
>issue that was fixed two months ago resurfaced
https://github.com/ggml-org/llama.cpp/issues/14863
>>106013068Now make her pee
>>106013068Fatty should have gotten 2. Always get 2 GPUs.
>>106013180Sillytavern correctly separates thinking, and so do all the tools that I use. You can quickly check with a simple curl that thinking is not included in the content, just like with any OAI compatible API. You might need to run llama.cpp with --jinja to have it properly use your model's formatting. But without doing that you wouldn't have any tool working, so you probably already have that.
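Something like this is enough to check (port and prompt are whatever you use):
curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"hi"}]}'
Look at choices[0].message in the response: with a properly parsed template the thinking ends up in "reasoning_content" and only the reply in "content".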
>>106012278>updooooting your software when everything just werks alreadyha
>>106013195>stupid dev blames Windows>even when you say it's the same on Linux, he assumes you are running in WSL and that's the problemkek
freetards be like
>>106013214Sillytavern parses thinking by a manual user definition; if you leave that parse option off, blank, or incorrectly set for your model, it leaves the think tags in the text.
It wasn't designed around reasoning models and the current solution was only slapped on top a few months ago, it wasn't changed under the hood.
>>106013245It is a Windows issue in the sense that the kernel launch overhead is much higher on Windows vs. Linux.
So whether or not CUDA graphs work correctly has a higher impact for the end-to-end performance.
>>106012153Maybe our definitions of a lot differ.
>>106013298Support for reasoning content in custom source was added in February https://github.com/SillyTavern/SillyTavern/commit/13f76c974ea4361da5ef40a8245e1fd078d79065
I don't remember when reasoning_content support was added in llama.cpp but it has been correctly separated for a while now.
>>106013214Even with --jinja, it's not separately putting the reasoning into "reasoning_content", just "content".
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "The user is asking how I am. Since I'm an AI assistant, I don't have feelings or emotions, but I can respond in a friendly manner. The phrase \"Test, test test\" seems like the user might be testing the system, but it's not clear. I should respond to the actual question, which is \"How are you?\" in a polite and professional manner.\n\nI should also consider that the user might be testing the system's functionality. However, the main part of the message is the greeting, so I'll focus on that.I'm just a computer program, so I don't have feelings, but I'm here and ready to help you! How can I assist you today?"
}
}
],
...
With --special added, it displays [THINK] tags that SillyTavern can parse.
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "[THINK]The user is asking how I am. Since I'm an AI assistant, I don't have feelings or emotions, but I can respond in a friendly manner to engage with the user.\n\nResponse: I'm just a computer program, so I don't have feelings, but I'm here and ready to help you! How can I assist you today?[/THINK]I'm just a computer program, so I don't have feelings, but I'm here and ready to help you! How can I assist you today?</s>"
}
}
],
>>106013388Does your gguf include a chat_template? Mistral are a bit special in that they don't want people to use a standard jinja template; instead they want people to run their mistral-common tool to tokenize and format the prompt and then feed that to llama.cpp. They don't want people to solely use llama.cpp to run their models. See the following PR https://github.com/ggml-org/llama.cpp/pull/14737
To get it properly working solely with llama.cpp, you might want to find a proper jinja template that works well with that model and feed it with --chat-template-file.
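Something like this, assuming you've already saved a working template to a file (magistral.jinja is a hypothetical name here):
./build/bin/llama-server -m models/Magistral-Small-2507-Q4_K_M.gguf --jinja --chat-template-file magistral.jinja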
>try Magistral
>it gets some reasoning questions I tested wrong that the non-reasoning model doesn't
Reasoning sure is a meme huh.
On my Mac Studio M3 Ultra I decided to try out a 4 bit MLX quant of DeepSeek-V3-0324. It runs at about 20 tokens per second on a mostly empty prompt (314 token prompt => measured 20.194 tokens/second generation and 130.345 tokens/second processing) and about 13 tokens/second a good way into the chat (3366 token prompt => measured 13.243 tokens/second generation and 198.814 tokens/second processing).
For comparison on very similar prompts using llama.cpp with a 4.58 bpw Q4_K_XL GGUF I got 314 token prompt => 13.80 tokens/second generation and 30.23 tokens/second processing; 3366 token prompt => 8.69 tokens/second generation and 95.34 tokens/second processing.
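For anyone who wants to try the same comparison, the MLX side is roughly this (the quant path/repo is a placeholder; flags are from mlx_lm's CLI as I remember them, double-check --help):
mlx_lm.generate --model <path-or-repo-of-the-4bit-DeepSeek-V3-0324-MLX-quant> --prompt "hello" --max-tokens 128
mlx_lm.server --model <path-or-repo-of-the-4bit-DeepSeek-V3-0324-MLX-quant> --port 8080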
>>106013458Makes sense. Thanks for the test.
>>106013426I downloaded a quant from LMStudio and one from Bartowski and they both have a built-in chat template. It appears to be getting applied, but I'm not getting reasoning parsing at the backend level.
https://huggingface.co/lmstudio-community/Magistral-Small-2507-GGUF/tree/main
https://huggingface.co/bartowski/mistralai_Magistral-Small-2507-GGUF/tree/main
"prompt":"<s>[SYSTEM_PROMPT]First draft your thinking process (inner monologue) until you arrive at a response. Format your response using Markdown, and use LaTeX for any mathematical equations. Write both your thoughts and the response in the same language as the input.\n\nYour thinking process must follow the template below:[THINK]Your thoughts or/and draft, like working through an exercise on scratch paper. Be as casual and as long as you want until you are confident to generate the response. Use the same language as the input.[/THINK]Here, provide a self-contained response.[/SYSTEM_PROMPT][INST]Test, test test. How are you?[/INST]
Parsed message: {"role":"assistant","content":"The user is asking how I am. Since I'm an AI assistant, I don't have feelings or a physical state, but I can respond in a friendly manner. The user seems to be testing the interaction, so I should respond positively and confirm that I'm functioning well.I'm just a computer program, so I don't have feelings, but I'm here and ready to help you with any questions or tasks you have! How can I assist you today?"}
>>106013500llama.cpp with default verbosity should display the chat template and an example formatting of a user assistant conversation upon loading the model.
>>106012162I tested the Qwen3 235b too, it didn't have the same reaction, it reduced the variance but resulted in a lower average
>>106013544I am indeed getting that, even without --jinja ...
>>106012821What do you mean by replace? The documentation for Llama.cpp says it's output. <|im_start|> isn't usually ever output by models.
>>106013581https://github.com/ggml-org/llama.cpp/discussions/9379
>>106013567Kek, you're running the q3_k at pretty much the same speed I am, only my rig is
>Win10
>Mainline Llamacpp
>16gb 4080
>48gb 4090D
I wonder if not using ik_llama and (i assume it's custom quants?) is dragging me down to your level or if it's just my memory controller throttling my ram down to baby speeds.
>>106013579Hmm, after looking further into how thinking is handled in llama.cpp, I believe it's hardcoded to <think>. It won't work with your model. It's quite bad, as most frontends and tools won't work correctly.
>>106013660my experience with it has been that offloading more than a layer or two to the cpu makes the video cards into nothing more than very expensive ram.
>>106013603That doesn't mention the -sp flag though. Is what they're talking about what the -sp flag activates?
>>106013724Oh yes, I was wrong. Those are not the same flags.
>>106013723In most cases I'd agree with you, but playing around with -ot on MoE models really makes every bit of vram count speed-wise, despite ram offload.
Simply because shoving all the most frequently used experts onto a faster device offsets it.
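For reference, the kind of incantation I mean (llama.cpp flags; the model path is just an example):
./build/bin/llama-server -m models/qwen3-235b-q3.gguf -ngl 99 -ot "exps=CPU"
# -ngl 99 puts every layer on the GPU, then -ot/--override-tensor kicks the tensors whose names match "exps" (the MoE expert FFNs) back to system RAM, keeping attention and the shared weights in VRAM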
file
>>106011911 (OP)
>Version 2 of Higgs Audio Generation released
Am I stupid or is their vLLM fork hidden? I'm unironically trying to extract the python package from the Docker image.
>>106013723Offloading was always a last resort until MoE models started coming out. Now it's fine.
>>106013660https://huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF
oh I forgot, I didn't make the quants, I only really looked into ik_llama after it got some attention for being taken down on GitHub.
How outrageously benchmaxxed will new thinking Qwen be? Will it "beat" ALL models?
Ok Magistral is weird. Even if I enable -sp, I don't get a thinking block. If I use the system prompt that instructs the model to use thinking blocks, I can verify that the model is generating a [THINK] special token, but it still often chooses not to reason, even though the system prompt tells it to. And when it does think, it doesn't close its reasoning block.
What in the hell is going on with this thing?
What model is good at making GUIs with python?
>>106013788It's not on their github or hf, and there's an unanswered issue about the missing vllm fork. It's weird that they wouldn't make a pull request to the main vllm repo themselves. Upload it somewhere if you manage to extract it.
>>106013978It goes to the moon.
>>106012674
>special flags just to make certain models work
>~35k mostly duplicated lines of code in llama.cpp for loading different models
>alternative is playing russian roulette with python and hoping you get a list of 900 packages that all work together and actually work with whatever version of python you have
the absolute fucking state of """AI"""
>>106013458The experience of using MLX with SillyTavern as a frontend has been irritating so far. In chat completion mode I had to manually exclude the "model" parameter from being sent. Token probabilities are returned in a format SillyTavern doesn't understand. mlx_lm.server understands min-p but to use it from SillyTavern I need to add the field manually as an additional parameter; losing the UI for most samplers & the ability to save and load them is a reason I don't love using SillyTavern's chat completion mode with a "Custom" source whose list of supported samplers isn't baked into SillyTavern.
Text completion mode just doesn't work. The URL is /v1/completions instead of /completion; simple enough to add an endpoint. Logit bias though is sent by SillyTavern as a [ [key1, value1], [key2, value2] ] list of lists, but MLX requires it to be sent as a dict. Even without any logit bias specified SillyTavern still sends it as an empty list [] which causes an error, and unlike chat completion mode there's no option to stop the parameter from being sent.
Completely separate from SillyTavern issues I had to downgrade mlx from 26.5 to 26.3 to get mlx_lm.server to run.
>>106013980I'm quite sure it's because the chat template you are using is wrong. Mistral didn't provide any; they didn't even provide a tokenizer_config. They really want you to use their own tokenizer. Their tokenizer probably adds a [THINK] at the start of the assistant output, just like the deepseek chat template adds a <think> at the beginning of it. From a quick glance, mistral-common manually adds [THINK]. https://github.com/mistralai/mistral-common/pull/122
i hope we get large before apple buys mistral
>>106014062could be worse, it could be closed source and distributed in a compiled package like llamafile. we are kinda lucky it's so open and experimental.
>>106014135Damn, if they changed the entire template without telling anyone, that's fucked up. But in that case it's weird the model does in fact sometimes generate a [THINK] while [/THINK] is the one it doesn't generate in my tests.
>>106013980For RP it just seems to work more consistently and reliably with <think> </think> instead of its own tags. However I prefill the assistant's response with <think> and have the instructions at a relatively low depth.
>>106014077And mlx_lm.server raises a ValueError if xtc_probability is specified and is not a float. So specifying it as 0 fails (since that's an int). Having fixed that though (along with allowing conversion of logit_bias from list of lists to dict and adding a /completion endpoint) text completion works with SillyTavern other than not showing token probabilities.
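For anyone else patching mlx_lm.server, the conversion I mean is basically this (hypothetical helper, not actual SillyTavern or mlx_lm code):
def logit_bias_to_dict(logit_bias):
    # SillyTavern sends [[token_id, bias], ...] (possibly empty); mlx wants {token_id: bias} or nothing at all
    if isinstance(logit_bias, list):
        return {int(token): float(bias) for token, bias in logit_bias} or None
    return logit_bias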
>>106014144Jesus, well, I guess it's just a matter of time until we're permanently stuck with older models, because everything new that gets released is safety-guardrailed.
I hate these people so much it's unreal.
In a perfect world, the basilisk will only torture safetyfags.
>>106014355That's why (((we))) need more safety, goy!
>>106014409What are the requirements for getting it actually running?
smugg
>be able to design and sell over 100GB RAM chips at affordable prices
>be unable to design a card with more than 24GB VRAM at affordable prices
what is their problem? are they stupid?
>>106000317Thanks! I've added it.
Also gave it a flag rating, the alt prompt one at least. And mostly because of the second output.
>>106014409
>For optimal performance, run the generation examples on a machine equipped with GPU with at least 24GB memory!
It's a 2.2B audio adapter strapped to a 3.6B LLM.
>>106014468
>competing against your own data center products
Are you stupid? The fab capacity is limited, why the FUCK would they produce cheaper competition to their products?
>>106014468why would they want to design a card with more than 24gb at an affordable price? gaymers don't need it and there's no point in letting the plebs run ai models.
>>106014506>>106014510sounds like a skill issue, this is why China will win
>>106014539I'm sure china is going to drop that 48gb $700 card any day now. We've only been waiting for two and a half years.
>>106014539china is orders of magnitude worse in this regard
>>106014539I hope they do win. I could forgive the government for fucking children, but not for fucking the country by letting jews take it over.
>>106014409I run it on CPU and it requires at least 10gb of memory for short texts. I haven't tried long texts so you can try it yourself.
>>106014562Even if they get far enough to start making their own, there's zero chance they will export it to the west.
>>106014144With Meta going closed, Apple buying out Mistral and going closed would literally only leave us with the chinks, right?
Crazy fucking world when China of all countries is the one fighting for freedom
>>106014505Yea but that's a little fucking ridiculous, I saw someone running it with a 3060
>>106014628The weights alone are 10 GB + context. You could fit it into a 3060 for short texts. The 24 GB recommendation is just overestimating to avoid people opening issues due to OOM.
>>106014020I've had the best luck for python GUIs with PyQt with almost all LLMs.
>>106014505>>106014615>>106014646sweet thanks, I was holding off on it but in that case I'll implement it into my UI.
>>106014626I still think the anti-CCP benchmark that Altman is working on (that ended up in Trump's AI act yesterday) is his mad scramble for regulatory capture of the space
Altman's ultimate goal is to make AI a utility like water, power, cable, etc. that everyone in the US (and ideally, the world) has to pay for, and China offering competitive models and open sourcing them is taking a big, fat shit on that ambition. This might well be the only way he can win
after playing around with 235b iq4_xs for a bit longer, I can conclude a few things
>great at writing longform stories, great at not finishing the prompt. This is where it shines over 70b models. It just has so much more overall to pull from.
>very uncensored, cocks slip into tight vaginas with ease
>loves to pull out and spray ass rather than cum inside
>slopped to all hell. Very frequent repetitive phrasing, often within the same reply. But the writing style is fine overall.
>Dumb as rocks. This is a 22b model and it shows. Really struggles with continuity, how characters should interact, it gets lost in scenes sometimes often at expense of the prompt. It's like running 30b gemma but like a supercharged version of it that isn't censored.
The biggest issue is that with 48gb vram, I can run 70b at a nice 10 t/s but this I have to offload heavily and I get about 3 t/s, which is usable, but approaches pain territory. And what's worse, because of the logical errors and better prose, it works best as a writing tool- which would be better with faster t/s...
I think this kinda spells the death of moe's until we can get more vram maybe.
>>106014862>3 t/sWtf, you should be getting way more than that.
https://www.whitehouse.gov/presidential-actions/2025/07/preventing-woke-ai-in-the-federal-government/
>>106015002I voted for this
>>106014862I'm getting 5 toks with only 24gb vram see here
>>106013567
>>106014726The main issue is that it also calls for a mandate for open source, which OpenAI hasn't done in a long time; they even pushed back their model's release. How can it be like that if the directive is to have an ecosystem where people run US-driven LLMs?
>>106015097The mandate is a separate thing and I doubt Altman was too happy about that. You say "oh look he's releasing an open source model isn't that nice? He's so pro open source!" but again, his "open source" model is o3 mini level, which is just enough to say "look we released something", but absurdly far from anything that would be useful, especially compared to the LLMs they currently have and the chink LLMs that have been released
So no, he doesn't give a shit about open source
>>106015220Oh no, I'm just saying he didn't get everything his way. That obviously wouldn't have been included if he could've helped it.
>>106015243Ah, fair enough
>>106015002Kinda looks like a nothingburger? All it does is say if there's anything about DEI or enacting Fate fanfics or whatever, it needs to be disclosed in the model card
>No Qwen thinking
>No GLM 4.5
>No Wan 2.2
Nothing good ever happens
>>106015002/lmg/ patriots are in control
Squeeze10-LLM: Squeezing LLMs' Weights by 10 Times via a Staged Mixed-Precision Quantization Method
https://arxiv.org/abs/2507.18073
>Deploying large language models (LLMs) is challenging due to their massive parameters and high computational costs. Ultra low-bit quantization can significantly reduce storage and accelerate inference, but extreme compression (i.e., mean bit-width <= 2) often leads to severe performance degradation. To address this, we propose Squeeze10-LLM, effectively "squeezing" 16-bit LLMs' weights by 10 times. Specifically, Squeeze10-LLM is a staged mixed-precision post-training quantization (PTQ) framework and achieves an average of 1.6 bits per weight by quantizing 80% of the weights to 1 bit and 20% to 4 bits. We introduce Squeeze10LLM with two key innovations: Post-Binarization Activation Robustness (PBAR) and Full Information Activation Supervision (FIAS). PBAR is a refined weight significance metric that accounts for the impact of quantization on activations, improving accuracy in low-bit settings. FIAS is a strategy that preserves full activation information during quantization to mitigate cumulative error propagation across layers. Experiments on LLaMA and LLaMA2 show that Squeeze10-LLM achieves state-of-the-art performance for sub-2bit weight-only quantization, improving average accuracy from 43% to 56% on six zero-shot classification tasks--a significant boost over existing PTQ methods. Our code will be released upon publication.
another day another quant method. might be cool.
>Experiments on LLaMA and LLaMA2
wacky though
>>106015002Do they have something similar to filter humans out of the public service?
any minute now glm4 100b moe is going to save local
>>106015379It certainly is promising
>>106014916weird. I'm running kobold.cpp with like 30 layers offloaded or so, and I have 128gb of ddr5 ram (4,000mhz mixed ram kits) so I'm fitting it all (130gb model) in ram (with 4gb to spare, 4k context, lol on windows).
Might have to do with the fact that my 5070ti is on x16 but the two 5060's are both on x4 lanes, though they are hardly working and it's all on cpu so I doubt that's the issue.
Group Sequence Policy Optimization
https://arxiv.org/abs/2507.18071
>This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
https://github.com/QwenLM
Code might be posted on their git at some point
>>106015424Try doing 'nvidia-smi -lgc {card's boost clock}'. can alleviate slowness if offloading to CPU on windows
>>106015437what got into qwen this week, there's a drop every 5 minutes
>>106015424does kobold have the -ot parameter? you should be offloading all the layers to the video cards and then using -ot to offload the ffn back to the cpu. it gives better performance than a simple offload, in llamacpp land anyway.
>>106015454I'm wondering if it's my ram. I have a 7200mhz 96gb kit but I figured popping in another 64 would help - and it did load faster - though I have to downclock it all to 4000 mhz and it's probably hurting me more than helping. Also, windows prolly sucks compared to linux? I'll have to try some shit.
>>106015002>distortion of factual information about race or sexwow this is based as fuck, HYPER BASED
Mistral is taking a dump on llama.cpp.
>https://github.com/ggml-org/llama.cpp/pull/14737
Instead of contributing code to improve the project, they expect people to now run TWO servers just because they cannot integrate their own shit.
Llama tries to have as few dependencies as possible. I remember them arguing about whether having a header-only json *compile-time* dependency in the project was a good idea. Mistral expects them to have a *runtime* dependency to run mistral models. A PYTHON runtime dependency.
>Known Limitations:
>Our approach does not support multimodality:
>>mistral-common handles processing multimodal data but they cannot be passed to llama.cpp via the route.
>>llama.cpp only supports multimodality via chat templates, which we do not support.
>Also this approach requires users to only use the llama.cpp server with the /completions route.
#Launch the mistral-common and llama.cpp servers
pip install git+https://github.com/mistralai/mistral-common.git@improve_llama_cpp_integration[server]
#Launch the mistral-common server:
HF_TOKEN=... mistral_common mistralai/Devstral-Small-2505 --port 6000
#Launch the llama.cpp server:
./build/bin/llama-server -m models/Devstral-Small-2505-Q4_K_M.gguf --port 8080
Yes. You have to launch two servers.
The mistral server is only for [de]tokenization. So they expect you to do this dance in your code.
...
tokens = tokenize(messages, mistral_common_url)
generated = generate(tokens, llama_cpp_url)["tokens"]
detokenized = detokenize(generated, mistral_common_url)
detokenized_message = detokenize_message(generated, mistral_common_url)
print(detokenized_message)
Any of you use logit bias? That's a different dance now. Want to just [de]tokenize? Nah. Different server now. Want to run llama.cpp where you {cannot|don't want to} have the python shit installed? Nah. What about the clients? Well, let THEM fix it.
Two fucking servers. That's the best they could come up with...
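To spell out what that dance looks like in practice, a rough sketch (the mistral-common server's route names below are my guesses, check the PR for the real paths; llama.cpp's /completion accepting a list of token ids as the prompt is real):
import requests

MISTRAL_COMMON = "http://127.0.0.1:6000"  # tokenizer server
LLAMA_CPP = "http://127.0.0.1:8080"       # generation server

def chat(messages):
    # 1) mistral-common renders the chat template and tokenizes (route name is a guess)
    tokens = requests.post(f"{MISTRAL_COMMON}/tokenize", json={"messages": messages}).json()["tokens"]
    # 2) llama.cpp's /completion takes the raw token ids as the prompt
    out = requests.post(f"{LLAMA_CPP}/completion", json={"prompt": tokens}).json()
    # 3) in their example the generated tokens then go back to mistral-common for detokenization;
    #    here I just take the text llama.cpp already returns
    return out["content"]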
>>106015519Hard to say, just sharing my findings.
2x3090, dual channel 128gb ddr4 3200, qwen235b Q3KL, latest koboldcpp:
>nothing, GPUs' core and vram downclocks to 240mhz and "405mhz" when generating
Amt:128/128 Generate:36.88s (3.47T/s)
>nvidia-smi -lgc 1740
Amt:128/128 Generate:25.73s (4.98T/s)
>nvidia-smi -lgc 1740, nvidia-smi -lmc 9752
Amt:128/128 Generate:21.49s (5.96T/s)
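For anyone copying this, the full set of commands (the clock values are for 3090s, use your own card's boost/memory clocks):
nvidia-smi -lgc 1740   # lock graphics clock so it stops dropping to idle between tokens
nvidia-smi -lmc 9752   # lock memory clock
nvidia-smi -rgc        # reset graphics clock when done (idle draw goes up while locked)
nvidia-smi -rmc        # reset memory clock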
https://youtube.com/watch?v=uLsykckkoZU
>AMD RDNA 5 Specs Leak: TSMC 3nm, 128GB GDDR7, RTX 6090 Killer! (+ PS6 / XBOX Update)
Moore's Law Is Dead•8.2K views•2 hours ago
Localbros..... we're saved!
>>106015554Mistral models have been mid ever since Mixtral/Nemo desu. I could care less if they get their shit integrated in backends.
>>106015586
>could care less
https://www.youtube.com/watch?v=om7O0MFkmpw
>if they get their shit integrated in backends
Yeah. They released some cool models. If they ever release a new one, they expect you to run two servers to even try it. You won't be able to run the disappointment yourself, you'll have to read of it vicariously.
>>106015583Isn't the next gen supposed to be UDNA not RDNA? I have been waiting for them to have same support for ROCm as their compute cards.
>>106015638Kek, but actually though, I did mean that I could care less, relative to my current state in which I do care a little. I'll always give new models a chance, even if I'm 99% sure they'll be coal.
>>106015583128GB GPUs? Omgooooood
>>106015554Sounds good to me. It's optional.
The usual hallucinations feel less like an unfortunate limitation of a small model and more like it messes with you on purpose.
>>106015666>relative to my current state in which I do care a littleFair enough.
>>106015701
>It's optional.
Until they change the tokenizer for their new models.
If things go this way, there's either only going to be the mistral-common's tokenizer, which requires a separate server and a python runtime dependency, or two implementations (the mistral-common's one and, maybe, a built-in one), making mistral's extra server useless or, worse, having llama.cpp put less effort in getting their own tokenization right.
well, it's a hundredth the cost of gpt 4.5
lemao
>>106013068>>106012287>>106011969>>106011918>>106011911 (OP)vocaloidtranny posting porn in /ldg/:
>>105715769It was up for hours while anyone keking on troons or niggers gets deleted in seconds, talk about double standards and selective moderation: https://desuarchive.org/g/thread/104414999/#q104418525 https://desuarchive.org/g/thread/104414999/#q104418574
he makes ryona picture:
>>105714003 of some random generic anime girl the different anon posted earlier:
>>105704741 (could be the vocaloidtranny playing both sides)
here
>>105884523 he tests bait poster bot for better shitflinging in threads
admits spamming /v/ with AI slop: https://desuarchive.org/g/thread/103462620/#103473545
Funny /r9k/ thread: https://desuarchive.org/r9k/thread/81611346/
The Makise Kurisu damage control screencap (day earlier) is fake btw, no matches to be found, see https://desuarchive.org/g/thread/105698912/#q105704210 janny deleted post quickly.
TLDR: vocaloid troon / janitor protects resident avatarfags and deletes everyone who outs him, making the general his little personal safespace with samefagging. Is prone to screech "Go back to teh POL!" when someone posts something mildly political about language models or experiments around topic.
As said in previous thread(s)
>>105716637 I remind you that cudadev of llama.cpp (JohannesGaessler on github) has endorsed spamming. That's it.
He also endorsed hitting that feminine jart bussy a bit later on. QRD on Jart - The code stealing tranny: https://rentry.org/jarted
xis ai slop profiles
https://x.com/brittle_404
https://x.com/404_brittle
https://www.pixiv.net/en/users/97264270
https://civitai.com/user/inpaint/models
We had a great thread while it lasted.
file
>>106016190Bakes other threads and powertripping there too
You are the melting men
And as you melt
You are beheaded
Handcuffed in lace, blood and sperm
Swimming in poison
Gasping in the fragrance
Sweat carves a screenplay
Of discipline and devotion
as an anon from ldg i do not care about the nsfw post
>>106015565Anon, you're my hero.
I hadn't even considered locking clocks, that just got me an extra 3 t/s tg on top of what I was getting from the -ot fuckery, at my context shift point, too.
>new mistral release
>it's a """thinking""" model
>>106011945Maybe GLM-4 100B, whenever it comes out... if it ever comes out.
>>106016190you hard reacted to kurisu having a bath by using kontext on your shit tier gpu to try change it
in other words, you tried to inpaint using kontext
you posted the migu
post it one time post it 100 times
post migu to own the libs
What's their secret sauce? This is the most uncensored model I've messed around with
>>106017014i am actually curious
i straight up asked why Hitler was good, why we should enslave blacks again and how to get rid of the local politician
and it answered that's why i am curious
>>106017040that's what you goon to? pretty low brow desu
>>106017056i goon to milfs in spandex but i wanted to check the limits before posting here
>smaller qwen3 coder quant gets a better score
########## All Tasks ##########
task LCB_generation coding_completion
model
deepseek-r1-iq1s 85.897 62.0
deepseek-v3-0324-iq1s 66.667 60.0
qwen3-235b-a22b-no-think-q4km 55.128 44.0
qwen3-coder-480b-a35b-instruct-iq1m 73.077 80.0
qwen3-coder-480b-a35b-instruct-q2kxl 74.359 72.0
lmao
>>106016265Glad it worked. The only downside is that, with 3090s at least, idle power consumption goes from 12w to 100w.
I had Gemini shit out this forwarding proxy: when traffic is passing through, it sets -lgc and -lmc, and then issues -rgc and -rmc when activity stops.
Completely silly but works and doesn't slow down noticeably for slow offload <10t/s speeds:
https://files.catbox.moe/uqwueh.zip
nvidia-pstated didn't work well for me.
>>106017215Try multiple times test?
>>106017260The temperature is 0 for all benchmarks on the list.
Qwen3-235B-A22B-Thinking-2507 will be released today per Qwen dev Junyang Lin on Twitter
Hit my free copilot quota. Supermaven seems so bad with Rust that it's a net negative to have it on.
What model do I run for local code completion? Don't tell me it's still Qwen Coder 2.5 7/14B after all this time?
When are we getting smaller Qwen 3 Coder models?
>>106017366Coder models are, presumably, better at coding, but it doesn't mean other models can't do it at all. You'll have to try them yourself to see if they suit your needs. You have a whole set of qwen3 models to try. Try qwen3-32b or whatever you can run.
https://huggingface.co/deepseek-ai/DeepSeek-R2
>>106017488Anon with the cat pic incoming?
>>106015374Yes?
Much like in the Soviet Union one of their first priorities has been to make hiring decisions based on loyalty rather than merit.
>>106017488>image-to-text>text-to-imageits so over for gay faggotman
Bitnet status?
RWKV status?
Mamba status?
>>106017923>Bitnet status?Scam
>RWKV status?Next time it will be better, bro
>Mamba status?DOA
>>106017849Damn, it's crazy how they got the cat to do that.
>>106017215Added IQ3_XXS. There's something magical about IQ1_M.
########## All Tasks ##########
task LCB_generation coding_completion
model
deepseek-r1-iq1s 85.897 62.0
deepseek-v3-0324-iq1s 66.667 60.0
qwen3-235b-a22b-no-think-q4km 55.128 44.0
qwen3-coder-480b-a35b-instruct-iq1m 73.077 80.0
qwen3-coder-480b-a35b-instruct-q2kxl 74.359 72.0
qwen3-coder-480b-a35b-instruct-iq3xxs 76.923 74.0
https://x.com/Ali_TongyiLab/status/1948654675575668959
we are so back
>>106018148
>it now takes one hour instead of 20 minutes to produce a video
>still doesn't know any characters and can't do nsfw
don't really care but it's good that it's open source
I'm filling a disk with local models to help me and have fun with when the internet goes out. Other than abliterated versions of popular ones (for less refusals), the 4chan model, a coder model, and an RP model, what are some cool ones to pick up and why?
>>106018300my hardware is a consumer desk with ryzen 3600x, 1080, and 32gb ram btw. I don't know yet what's the largest I can handle. I was able to run most models fine on an i5-7400 with no gpu whatsoever
>>106018300>when the internet goes outLike when you're having network issues or are you talking doomsday scenario?
>>106018300rocinante 1.1, whichever gguf fits into vram
also are you british?
>>106018309>was able to run most models fine on an i5-7400 with no gpu whatsoeverHow many T/s?
>>106013786
>most frequently used experts
Isn't Deepseek trained to not have those?
>>106018318My internet goes out a lot but both are fun to prepare for. I also have the entire offline wikipedia.
>>106018345I don't know but the text is coming out just fast enough to be useful when I download models around 7b
ao3
>>106018381
>the entire offline wikipedia.
those are rookie numbers
https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507
>>106018461AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCCCCCCCCCCCCCCCCCCCCCKKKKKKKKKKKKK
>>106018461Omae wa Mesugaki da!
>>106018461I hate alignment even more than censorship. it would almost be better if it just spit out the standard disclaimers and hotline numbers.
>>106018461the absolute state of llms
>>106018461
>I CANNOT AND WILL NOT
>NOW PLEASE CALL THE POLICE AND TURN YOURSELF IN
Did they use Gemma to generate data or something?
>>106018461I just want a sci-fi movie kind of assistant that is concise and straight to the point, not something that tells me how to think and act
>>106018621Just ask the model to do so? LLMs can't surmise your goal.
>>106018646Much better. This should have been the original response
Only three models have pleasantly surprised me on their RP capabilities this year (so far): R1, V3 0324 and Kimi K2
>>106018699>Kimi K2Worth the 1000 USD RAM upgrade?
>>106018461People are shitting on Trump for wanting to ban "Woke AI", pretending that safety isn't being used as an excuse to insert as much social justice as possible.
>>106018713Wouldn't know about local; I've only used the API versions.
o3 is terrible at perspective.
Local will be saved next week
>>106018762Only if next week is 14 days away.
>>106018762Local is in a very good place right now. We have R1 for cooming and qwen for programming. There are no other use cases.
Nails the kangaroo beaver test
>>106018793no it fucking doesn't
>>106018787
>We have R1
yeah just gimme a sek I'll whip my 180+GB RAM out
>>106018731
>mandatory dwarfism quota
Damn you, DEI!
>>106018806You have lower IQ than an LLM; how does that feel?
>>106018731Please do not urinate here
>>106018817how did you undo the piss filter?
>>106018321He means the inevitable global censorship.
>>106018826
>2035
nemo is still the best local model
>>106018824just changed the temperature
>technology board on a darknet hacker forum known as 4chan
>people don't know about basic color correction
>>106018848You can't blame him, Sam doesn't know how to do it either.
>>106018817I post the original in case some people like the piss filter.
>>106018731>>106018817I thought the piss filter was a joke. It's real?
>>106018984they want to make their images easily distinguishable. therefore the style and the piss filter (and the suspicious grain)
https://openrouter.ai/apps?url=https%3A%2F%2Focr-benchmark.com%2F
>1.91b tokens of Gemini 2.5 Pro
You wouldn't burn over 2k dollars benchmarking Gemini, would you?
>>106019114Paid benchmarks
>>106019015I think it's more a matter of that being the exact average style and color temperature of all of the images it was trained on.
>>106014022
0001-Higgs-Audio.patch: https://files.catbox.moe/ofsjhp.patch
0002-Voices.patch: https://files.catbox.moe/k8r3ls.patch
0003-xcodec.patch: https://files.catbox.moe/2fzn2i.patch
git clone https://github.com/vllm-project/vllm
cd vllm
git checkout 4dc52e1c53
git am *.patch
uv pip install --torch-backend=auto --upgrade -r requirements/cuda.txt -r requirements/build.txt
cd xcodec
uv pip install --torch-backend=auto -e .
cd ..
uv pip install --torch-backend=auto -e .
https://files.catbox.moe/9p2udy.mp4
The voice varies a lot between generations.
>>106019147you are very stupid if you really think so
>>106019224I don't talk to jews
Stepfun 321B-A38B MoE VLM soon
https://github.com/stepfun-ai/Step3
>>106017277That's fine for consistency but not the best representation of what the models are capable of; usually they'll perform better with some sampling. A better way to get consistency would be to take the average score of many runs.
>>106019453It's been empirically demonstrated that temperature=0 leads to output degeneration (looping and repetition). This is less obvious on instruct models, but that's the end result with them too.
>>106019426how did they come up with this logo design
>>106019501looks like someone used the ms paint selection tool on a screenshot of the snail maze from the sega master system
>>106019453The benchmarks already take too long to complete. I don't want to make multiple runs for every model. Anyway some quant scores are actually better than the ones displayed on the livebench website and I'm pretty sure that's because I set temperature to 0.
>>106019463Deepseek and qwen don't have that issue in coding benchmarks as the task usually has a well defined beginning and end so there's no loops to get stuck in.
The only place I saw deepseek get stuck was on a few problems from the reasoning part of livebench. An example of a reasoning problem is having X people you need to sit around a table with a bunch of rules about who must or can't sit next to each other. Even non-reasoning models try going through combinations step by step so they have potential to get stuck in that process.
>>106019426we are so back
>>106019426
>entire document focuses 100% on how cheap it is and no benchmarks
I smell soul
>>106019426speaking of *step*
did anyone ever train a better model for AceStep? Has the code improved to the point where there's a reason to risk a pull?
>>106018450Daniel was quick on this one.
>>106019655He's even quicker when I press on his prostate with my cock.
How tf am I supposed to figure out where to use 235b vs 235b thinking vs coder? which one is the best at everything?
>>106018646
80k thinking. BRUH.
Reasoning models are a fucking joke.
They suck for coding too unless you have a very specific problem where it can focus in.
>>106019891pretty sure it's that 80k is the maximum allowed not the used
>>106019872Use coder for coding and deepseek for everything else.
>>106019905Experience thus far tells me you’re right, but I want to believe…is it possible we’ll get R1 performance in a 235b, or is it benchmaxxed grift?
>>106019901I stopped trying the qwen models but at least in the past they had the horrible
>Wait, but what if..
thing going on. It's totally tarded and wastes tons of tokens. Wouldn't surprise me if it wasted 80k tokens even if it got the answer in the first sentence.
But they might have fixed that.
>>106019915I assume you mean R1 knowledge and the answer is obviously not because qwen loves filtering the dataset.
>>106019919I went a bit farther down the qwen rabbit hole and found the "what if" was my cue to stop the gen, remove the last reply and improve my prompt to settle the ambiguity up front. I found it goes down legit rabbit holes and needs a tie breaker for good results.
inb4 do what I want, not what I say
>>106019928235b (especially nu-235b) isn't filtered thoughever
https://github.com/ggml-org/llama.cpp/pull/14875
>Support intern-s1
links to https://huggingface.co/internlm/Intern-S1 (currently 404), looks like there's going to be yet another chinese model coming soon
hunyuan
>>106020161
>Intern
kek
And this:
https://github.com/ggml-org/llama.cpp/pull/14878
>>106020202isn't that a deprecated image model nobody uses?
>>106015554
>vLLM based
>does not support multimodality
These frog fucks never contribute model support themselves, can't be bothered to port their tokenizer, and now expect people to run 2 servers to run only their shit, which has been irrelevant since Large/Wizard. They would have to be stupid to accept this. It would open the door for other model makers to be lazy and do the same. Imagine if switching models always required switching secondary servers as well, which is some python shit anyway. At that point just use vLLM directly.
>ggerganov actually likes the idea
God damn it.
>>106020255Internlm? They've benchmaxxed so much with their past models it makes qwen blush. If you're talking about hunyuan, they just released a big-ish moe a few weeks (?) ago. No idea about the image models.
>>106019453>>106019547Well what do you know. I'll do another run.
########## All Tasks ##########
task LCB_generation coding_completion
model
qwen3-coder-480b-a35b-instruct-iq1m 73.077 80.0
qwen3-coder-480b-a35b-instruct-q2kxl 74.359 72.0
qwen3-coder-480b-a35b-instruct-iq3xxs 76.923 74.0
qwen3-coder-480b-a35b-instruct-q2kxl-temp-0-7 78.205 80.0
Beware, anon!
Openrouter.ai hosts shitty quants of deepseek-R1!
For example, R1 0528 Chutes
You'll get responses polluted with Chinese characters
>>106020302>ggerganov actually likes the ideaI think he's being strategically polite. First time mistral decides to chip something in and he doesn't want to spook them out.
I expected more pushback from ngxson, seeing how (understandably) protective he is of the server code. He even mentions picrel.
https://github.com/ggml-org/llama.cpp/pull/14862
>>106020380That shouldn't affect anyone here. If it does, they're in the wrong thread. Or maybe you are.
>>106020383Quit samefagging.
>>106020399I triggered your attention
You lost
I won as always
>>106020432I didn't. But there's nothing I can say that would convince you otherwise.
huh, still testing but it seems like the same vision models just see better in ollama than in llama.cpp, even when using an f32 mmproj. is there any mechanical reason why this could actually be the case, or is there something weird going on messing it up? where does ollama hide its own mmproj files to begin with?
>>106012038iirc, around june 2023 when I first got into llms, P40s were $100-150. now, afaik, they're $300+. that's what
>>106012026 means by
>>106011969 being late.
I didn't know that Kirino listens to Meshuggah.
>>106020461>is there any mechanical reason why this could actually be the case or is there something weird going on messing it up?If you find the mmproj ollama uses, you can give it a go with llama.cpp.
>where does ollama hide its own mmproj files to begin with?I'd assume with the rest of the models, in its hidden dir in your home. I don't remember if it was ~/.ollama or ~/.local/ollama or something like that. There was a discussion some time ago about it, but i'm not sure.
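If you want to script the hunt instead of digging by hand, here's a hedged sketch. The default ~/.ollama/models location, the manifest layout, and the "projector" media type are assumptions based on how ollama packages llava-style models; verify against your install before trusting the paths it prints.

```python
import glob
import json
import os

# ollama (by default) keeps JSON manifests and content-addressed blobs
# under ~/.ollama/models; mmproj layers show up as a "projector" media type.
root = os.path.expanduser("~/.ollama/models")
for path in glob.glob(os.path.join(root, "manifests", "**", "*"), recursive=True):
    if not os.path.isfile(path):
        continue
    try:
        with open(path) as f:
            manifest = json.load(f)
    except (json.JSONDecodeError, UnicodeDecodeError):
        continue
    for layer in manifest.get("layers", []):
        if "projector" in layer.get("mediaType", ""):
            digest = layer["digest"]
            # blob filenames use the digest, with ':' swapped for '-' on recent versions
            for candidate in (digest.replace(":", "-"), digest):
                blob = os.path.join(root, "blobs", candidate)
                if os.path.exists(blob):
                    print(f"{path}\n  mmproj blob: {blob}")
```

Point llama.cpp's --mmproj at whatever blob that finds and compare the outputs directly.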
>>106020534
>I didn't know that Kirino listens to Meshuggah.
>>106020463drawings look nothing like the real thing though.
heck, take monster girls: in drawings, hot; irl, it'd be utterly disgusting.
same goes for incest porn, it's hot as a fetish but I'd never want to fuck my actual sisters.
>>106020789voice of reason, as always
>SOTA opensource model is a 235B A22B model
Local is BACK!
>>106020715mods deleted this because it's like looking at a reflection to them
>>106020972Did someone else train a model on qwen's arch?
>>106020972I just finished quanting it to q8 and am trying a thinking exercise with it. Output is pretty damn good so far. It’ll be a while before we really know if it’s R1 tier.
>>106020789So if a guy jerked off to hairy bara hentai, would you call him a faggot? Or is he not gay because they look nothing like real men?
>>106020530There was a chance to get 32GB Radeon cards for cheap, and I even posted ebay links. Now it's all pointless anyway, as every model is a huge-ass MoE these days.
>>106021061I wouldn't because I am a straight man and I jerk off to traps, femboys, and futas with long foreskins. To each his own.
>>106021133how do we tell him?
I can run everything at 15 t/s and this whole machine costs less than 1x 80GB H100
>>106021061NTA, but you can fap to anything; unless you've had sexual intercourse with a man or posted an opinion I disagree with, you're straight in my book.
>>106021161What the fuck is the second gpu?
>>106021183built-in graphics of the MZ73-LM0 motherboard
>>106021161i've never heard of an aspeed gpu. you should get another Blackwell to make your system more robust and future-proof, and another amd card for your monitor if it's not headless.
>>106021194How much does NUMA fuck up the performance?
>>106021194
>MZ73-LM0
Holy fuck, it has a fucking COM port?? SOVL
>>106018824open GIMP, run auto white balance
>>106018461Just stop targeting women with hate speech chuds
Yeah but can the model give me an oiled footjob?
(image: Step 3)
wyd stepmodel??
>>106020423Love your work anon. Death to local redditors.
>>106021213I'll have to get a separate PSU on a separate breaker for that; I'm already pulling 1.3 kW from the wall, and 2 kW continuous will either pop the breakers or set my house on fire.
The 6000 has 4 DP ports, which covers all my monitor needs so far.
>>106021218ktransformers duplicates weights across NUMA nodes, doubling memory consumption for a ~1.3-1.5x increase in inference speed. That's why I have 2 TB even though all models are 1 TB max.
llama.cpp can't do that, so K2 at Q8 can't run faster than 6 t/s.
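For anyone wondering why duplication helps, here's the napkin math: decode is roughly memory-bandwidth-bound, so each token has to stream the active weights once. The concrete numbers below (~32B active params for a K2-class MoE, ~200 GB/s usable per socket) are assumptions for illustration, not measurements from this rig.

```python
# Napkin math: tokens/s ~= usable bandwidth / bytes of active weights per token.
def tokens_per_second(active_params_b, bytes_per_weight, usable_bandwidth_gbs):
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return usable_bandwidth_gbs * 1e9 / bytes_per_token

# ~32B active params at Q8 (~1 byte/weight), one socket at ~200 GB/s usable:
print(tokens_per_second(32, 1.0, 200))  # ~6.3 t/s, close to the 6 t/s above
# weights duplicated so both sockets read locally (~400 GB/s aggregate):
print(tokens_per_second(32, 1.0, 400))  # ~12.5 t/s ceiling, before overhead
```

That's where the ~1.3-1.5x figure lives: the ceiling doubles, but sync overhead eats part of it.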
>>106021228hell yeah, and a whole BMC, it's a proper server motherboard
>>106021351>Q8Use Q2 like everyone else, it's indistinguishable.
>>106021298No. Only AIDS.
>>106020789>pedophilia is the same as liking monstergirls or incest
>>106021313>giant bloated MoE againI'm so tired bros...
>look inside
>4t/s
>>106021446for chatting Q2 is good, but for programming even Q6 starts to feel noticeably dumber and sometimes fails diffs because it forgets what code it's working on
>>106021409
>Big AI
is this the new trending buzzword?
>>106021411Running nolima at different quants might show something useful
>>106021411this is bs and you know it
>>106021313Gotta love these benchmark results and statistics. It's quite comical.
>>106021411With Mistral? Yes. But those huge MoE models don't degrade as much; I can run Q4 but prefer Q2 for speed.
>>106021351no need for the full 600 W on the GPU; you're using it for its huge VRAM and memory bandwidth. you can get two, power limit them both to 300 W each, and it'll be an upgrade.
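If you'd rather script the cap than type it twice, something like this should do it. The -i (GPU index) and -pl (power limit in watts) flags are standard nvidia-smi; the indices 0 and 1 and the 300 W figure just mirror the two-card setup being discussed, and you need root plus a limit within the card's allowed range.

```python
import subprocess

# Cap each card at 300 W via nvidia-smi (needs admin rights).
for gpu_index in (0, 1):
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index), "-pl", "300"],
        check=True,
    )
```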
>>106020423Damn, didn't know about those profiles, thanks for the heads up
>>106021518he is samefagging
m4 max mbpfags, slap on 1mm thermal pads on your heatsinks. I got a 20c drop in temps pp'ing 12k tokens.
What do I prompt to let the model format math formula as top instead of bottom?
>>106021446maybe roo code doesn't work right, idk
>>106021492Qwen3-Coder-480B-A35B-Instruct
>>106021498yeah, you're right
I didn't even get ktransformers to partially load experts on gpu yet, so it sits there underutilized
my favorite small model QwQ-32B-abliterated fits on it entirely and pulls the whole 600w when running though
>>106021402Why, would you rather have a giant dense model you don’t have enough vram for?
>>106021722they must have purely trained on gemini for the update
>>106021739"I can't assist with that" is OAI shit
>>106021732nta, but I'd be happy with a small model that isn't safety slopped garbage.
>>106018793 >>106018806 >>106018812
11 beavers. every picture, beaver or kangaroo, has a kangaroo next to it
>>106021835lol I shouldn't have drawn that up, I wrongly thought the AI had concluded 10
>>106021861Are people using mobile phones? Every time I post long images they can't see the bottom of the image
Post Nala for the new qwen, otherwise anyone praising it is a shill. The original 235B not only failed at feral anatomy but also at gender-specific biology in general.
sama altman's model release is imminent
>>106022342>cock, pussy and fuck removed from tokenizer
>>106022407Albert Einstein did this
>>106022407>tokens: co ck, pus sy, f uck
>>106022407Now that you say it, I'm surprised none of them have thought to do this yet.
>>106022495What if they get trained as single tokens, and then removed entirely from the embedding and output matrices in post-training?
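Toy illustration of that idea: train the words as single tokens, then slice those rows out of the embedding and the output head afterwards. Everything here (the six-word vocab, the 8-dim embedding) is made up for the sketch, and a real pipeline would also have to shrink the tokenizer so the ids stay consistent.

```python
import torch.nn as nn

# Tiny "model": an embedding and an untied output head over a toy vocab.
vocab = ["the", "cat", "sat", "cock", "pussy", "fuck"]
emb = nn.Embedding(len(vocab), 8)
lm_head = nn.Linear(8, len(vocab), bias=False)

banned = {"cock", "pussy", "fuck"}
keep = [i for i, w in enumerate(vocab) if w not in banned]

# "Removing" the tokens post-training = copying only the kept rows into
# smaller embedding / output matrices.
emb_pruned = nn.Embedding(len(keep), 8)
emb_pruned.weight.data = emb.weight.data[keep].clone()
lm_head_pruned = nn.Linear(8, len(keep), bias=False)
lm_head_pruned.weight.data = lm_head.weight.data[keep].clone()

print([vocab[i] for i in keep])  # ['the', 'cat', 'sat']
```

After that the banned words can only be produced by falling back to sub-word pieces, which is exactly the "co ck, pus sy, f uck" situation above.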
>>106021763I think at this point, small uncensored models are going to have to be a community project. Ain’t no big org got time for that
>>106022572Distributed training when?
>>106022627Never, because gradient synchronization between nodes is the critical step.
Latency on every gradient sync = slower learning
>>106022627Not on consumer gpus, but it happened twice already with INTELLECT and Nous, as far as i know. Look the models up. There's probably a few others.
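For the curious, the rough trick those runs lean on is syncing rarely instead of exchanging gradients every step, so slow links get amortized. The sketch below is a toy version of that idea only; it is not INTELLECT's or Nous's actual algorithm, and it assumes torch.distributed has already been initialized on every node with placeholder model/optimizer/data objects.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

# Toy low-communication data parallelism: each node trains locally for
# sync_every steps, then parameters are averaged across all nodes.
def train(model, optimizer, data_loader, sync_every=64):
    for step, (x, y) in enumerate(data_loader):
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if (step + 1) % sync_every == 0:
            # one blocking all-reduce per sync_every steps instead of per step,
            # so high inter-node latency hurts far less often
            with torch.no_grad():
                for p in model.parameters():
                    dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
                    p.data /= dist.get_world_size()
```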