
Thread 106551921

369 posts 94 images /g/
Anonymous No.106551921 >>106552653 >>106553385 >>106553624 >>106554062 >>106555093 >>106555312 >>106555337 >>106556153 >>106556580
/lmg/ - Local Models General
/lmg/ - a general dedicated to the discussion and development of local language models.

Not Suspicious At All Edition

Previous threads: >>106539477 & >>106528960

►News
>(09/09) K2 Think (no relation) 32B released: https://hf.co/LLM360/K2-Think
>(09/08) OneCAT-3B, unified multimodal decoder-only model released: https://onecat-ai.github.io
>(09/08) IndexTTS2 released: https://hf.co/IndexTeam/IndexTTS-2
>(09/05) Klear-46B-A2.5B released: https://hf.co/collections/Kwai-Klear/klear10-68ba61398a0a4eb392ec6ab1
>(09/04) Kimi K2 update for agentic coding and 256K context: https://hf.co/moonshotai/Kimi-K2-Instruct-0905

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
Anonymous No.106551925
►Recent Highlights from the Previous Thread: >>106539477

--Paper: Home-made Diffusion Model from Scratch to Hatch:
>106542261 >106542674
--GPU pricing, performance benchmarks, and emerging hardware modifications:
>106546975 >106547036 >106550119 >106547168 >106547484 >106547754 >106547804 >106547849 >106547879 >106548086 >106548161 >106548571 >106548608 >106549153 >106550454 >106550474 >106550611 >106550739 >106547935 >106547966
--Superior performance of Superhot finetune over modern large models:
>106543123 >106543243 >106543656
--qwen3moe 30B model benchmarks on AMD RX 7900 XT with ROCm/RPC backend:
>106539534 >106539571 >106539618 >106539658
--Vincent Price voice cloning with Poe's works showcases model capabilities:
>106539541 >106539736 >106539701 >106539807
--Framework compatibility: vLLM for new Nvidia GPUs, llama.cpp fallback, exllamav2 for AMD:
>106540544 >106540560 >106540611 >106540666 >106546227 >106546233 >106546268 >106546277 >106546906
--GGUF vs HF Transformers: Format usability and performance tradeoffs:
>106550231 >106550258 >106550310 >106550352 >106550364 >106551231 >106551252
--Need for a batch translation tool with chunk retry functionality for LLMs:
>106543697 >106543774 >106543816 >106543888 >106543953 >106547100 >106551343
--Auto-tagging PSN avatars with limited hardware using CPU-based tools:
>106550616 >106550648 >106550976 >106550667
--Qwen3-VL multimodal vision-language model architectural enhancements and transformers integration:
>106547080
--Surprising effectiveness of 30B model (Lumo) over larger models in technical explanations:
>106543339 >106543345 >106543399
--Dual GPU LLM performance trade-offs between VRAM capacity and parallel processing limitations:
>106539831 >106539914 >106540160
--Miku (free space):
>106539893 >106540709 >106545815 >106547702 >106548178

►Recent Highlight Posts from the Previous Thread: >>106539481

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
Anonymous No.106551947 >>106551993
gguf status?
Anonymous No.106551983 >>106552171
>>106551911
Use chat examples. Regardless of your client you can fake up a few lines of conversation between you and the model.
Also add post history instructions (these get injected before your next input) to control the length of generation and the style.
Of course the base style is always the same, but e.g. giving concise and short examples will change the way it outputs text...
Anonymous No.106551993 >>106554308
>>106551947
I just want an EXE file... how hard can it be??
Anonymous No.106552000 >>106552095
I'm trying LongCat again now that it's on OR. The insane censorship of the web-version doesn't seem to be a problem through the API and the model knows a lot.
The one drawback is that it's *by far* the *worst* model when *it* comes to ** spam.
Still a shame that there will never be llama.cpp support for this.
Anonymous No.106552021 >>106557136
I made it into the highlights boys
Anonymous No.106552086
are MLPerf benchmarks a meme
Anonymous No.106552095 >>106552267 >>106552557 >>106554325
>>106552000
Why can't it be used on llama.cpp?
Anonymous No.106552171 >>106552202
>>106551983
So i shouldn't bother with a system prompt for gemma and just use post history instructions?
Anonymous No.106552202 >>106552210 >>106552242
>>106552171
Of course you should! But using the post history thing enforces the style more because it keeps reminding the model all the time to stay in line.

>[System Note: [
>Always respond in 1-2 short paragraphs. Limit {{char}}'s response to less than 200 tokens unless specifically asked to provide a long answer. {{char}} is a narrator not an actor. Do not act on behalf of {{user}}.
>Respond only in plain text with no Markdown or other formatting.
>]
Here's mine, it's nothing special; I'm kind of too lazy to experiment. I'm more concerned about the length of its replies - I hate rambling.
I also format every instruction like this:
if it's
>System Note: [ balablalbalab ]
it's related to instructions. And for characters I tag it as a "character" and the descriptions etc. go inside the square brackets.
>Character: [
>Name: Some Faggot
>Summary:
>description
>]
I have found out that at least for me it helps with small models but maybe it's just a cope/fantasy.
Anonymous No.106552210
>>106552202
>[System Note: [
That's a typo, it should be
>System Note: [
Anonymous No.106552242 >>106552256
>>106552202
Isn't using the {{char}} placeholder bad? Especially if you want to do multiple characters?
Anonymous No.106552256
>>106552242
It's just a macro. My {{char}} is Game Master and it's narrating the chats.
Characters are characters with their own names.
{{char}} and {{user}} are just macros anyway so you can use whatever you like. You can manually type in any name/reference and so on.
Anonymous No.106552267 >>106554325
>>106552095
It uses some dynamic MoE meme architecture that activates a variable number of parameters for each token.
CUDAdev said that implementing something like this in llama.cpp is likely not worth it for a fotm model like this.
Anonymous No.106552557
>>106552095
Read the papers and implement it yourself. It’ll be fun
Anonymous No.106552606 >>106552641 >>106557830
I have a Mistral 24B model and for some reason it's running slower than a Deepseek 32B model. Is it purely based on file size vs VRAM/RAM, or is it something else?
Anonymous No.106552641
>>106552606
quant?
context?
look at your logs, there might be a warning or error that'll tell you why
Anonymous No.106552653 >>106552674 >>106554014
>>106551921 (OP)
>https://rentry.org/recommended-models
Are any of these actually good at sfw roleplay or "creative writing"?
Anonymous No.106552674 >>106552751
>>106552653
Anonymous No.106552731 >>106557840
I think I spend more time fiddling with trying to get my models running than I do actually using my models. It's driving me insane that vllm won't work.
Anonymous No.106552751 >>106552777 >>106552786 >>106552859
>>106552674
Is this reliable?
Anonymous No.106552777
>>106552751
you should know that it's a meme if it puts o3 on top of a 'creative writing' benchmark
Anonymous No.106552786
>>106552751
It's an LLM-judged creative benchmark.
Anonymous No.106552857 >>106552893
I've got a 12GB 3060, along with a 7600X with 32GB RAM on my desktop, and want a local model to help me analyze my code, and to search for things without knowing the right keywords first. I know nothing, but I'm reading the rentry pages.

What are the limitations implied by the "impressive lack of world knowledge" of the Qwen models? I assume running Deepseek R1 at any sensible rate isn't feasible without a dedicated machine with a boatload of RAM, if not VRAM.
If I pick a 12GB model with a 12GB GPU, does that prevent me from using the GPU for my screens at the same time? I'm not playing games, but I am using CAD, running integrated graphics is possible but suboptimal.
I imagine it's worth buying a standalone GPU for running such a model, but for now I just want to give it a try.

Thanks.
Anonymous No.106552859
>>106552751
If you are a ramlet use Gemma 3 or Mistral 3.2, if not use GLM 4.5 Air or full... Idk.
Anonymous No.106552893 >>106552906
>>106552857
>"impressive lack of world knowledge"
Probably stuff like random trivia.

>I assume running Deepseek R1 at any sensible rate isn't feasible without a dedicated machine with a boatload of RAM, if not VRAM.
Pretty much.
I think you can run the smallest quant with a little over 128gb total memory.

>If I pick a 12GB model with a 12GB GPU, does that prevent me from using the GPU for my screens at the same time?
No. But the video driver will use some of the VRAM for display, meaning that you won't have the full 12GB available for the model.
Do note that you need some extra memory for the context cache and the context processing buffer, meaning that you want a model that's smaller than your memory pool.
You are going to have to experiment to see what works for you, but for now, start with qwen 3 coder 30B A3B since that'll be easy to set up for you.
Anonymous No.106552906 >>106552929
>>106552893
>qwen 3 coder 30B A3B
That's a 24GB model, I guess it only uses some of the VRAM at a time? Cool, I'll look into getting it running. I'm on Arch btw.
Anonymous No.106552929 >>106552979
>>106552906
The beauty of that kind of model (MoE) is that you can have a lot of it (the experts) running in RAM.
Look into llama.cpp's --n-cpu-moe argument.
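Something like this is a decent starting point (just a sketch, not tuned; the gguf filename is an example, and --n-cpu-moe is how many MoE layers stay on the CPU, so lower it until you run out of VRAM):
>llama-server -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf -ngl 99 --n-cpu-moe 30 -c 16384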
Anonymous No.106552939 >>106554925
>**Witty Remark:** Let's just say your quest for pleasure ended with a major failure, Anon. Maybe try a nice, wholesome game of checkers next time. Less likely to involve a call to the authorities.
Anonymous No.106552979
>>106552929
>running in ram
and you wonder why it's slow
Anonymous No.106553015 >>106553334 >>106553339 >>106554932 >>106555049
I've been shitting up a storm all day today. Qwen3 advised me to go see a doctor at this point. ChatGPT told me just to drink water and not to sweat it. It's moments like these that really make me laugh as it's probably an accurate bias of the average Chinaman (with best-in-class health care that is free) compared to an American (with subpar healthcare that costs thousands per visit).
Anonymous No.106553206 >>106554971
Here for my monthly "is nemo still the best thing for vramlets" inquiry, any new models worth using? I tried gpt-oss-20b and it wasn't great for RP
Anonymous No.106553263 >>106554961
llama.cpp sometimes caches but when the context gets long or maybe it's when it's filled up, it stops caching and need to process it all every time, why? silly is sending cache_prompt true
Anonymous No.106553334
>>106553015
ask qwen for cures from traditional chinese medicine
Anonymous No.106553339
>>106553015
Sounds like a sea-borne bacteria.
Anonymous No.106553385
>>106551921 (OP)
the only exciting thing in the last year has been exllamav3 :(
Anonymous No.106553388 >>106553417 >>106553689 >>106553890 >>106554982 >>106555015 >>106555026
Bruteforcing and trying until you find something that works is so acceptable in this field that even the inference software is the same shit. With other software you'd have an option to automatically find the best configurations that match what you have, with lcpp you have to fuck around with the parameters until you get something usable. What a shitshow.
Anonymous No.106553417
>>106553388
maybe ollama is more up your speed
Anonymous No.106553435
These new MoE models are fucking stupid.
Anonymous No.106553624
>>106551921 (OP)
>K2 Think
Is this better than K2-0905?
Anonymous No.106553689
>>106553388
Be the change you want to see, whining faggot
Anonymous No.106553890
>>106553388
Stop whining that’s it’s not an iPad when we’re still in the heathkit era of LLMs. Spend your own time making PRs to smooth the sharp edges if you want. All the rest of the dev time on lcpp is already spoken for trying to solve problems more interesting to those volunteers
Anonymous No.106553923 >>106554044
Why do some smaller text models use more GPU layers than some larger ones?
Anonymous No.106554014
>>106552653
The only difference is that gemma becomes one of the options.
Anonymous No.106554044 >>106554094
>>106553923
Some models have bigger tensors than others.
Anonymous No.106554062 >>106554219
>>106551921 (OP)
I hate this image.
Anonymous No.106554094 >>106554256
>>106554044
Is that good or a sign of bloat?
Anonymous No.106554153 >>106554580 >>106555008 >>106555170 >>106555207
https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking
Anonymous No.106554219
>>106554062
It's lmg mascot samsune alku
Anonymous No.106554256 >>106554327
>>106554094
Superficially, it's the same. It's writing a few long sentences or a lot of short ones. The amount of words is the same.
I very vaguely remember Google arguing that a deeper network (more, smaller layers) was better than a shallow one (fewer but fatter layers), but it could be the other way around. I couldn't find a source for that in the 2 nanoseconds I spent searching. In the gguf models, Gemma-3-12b has 47 repeating layers and nemo-12b has 39, for example.
Really, it's hard to know unless someone trains the two types of models with exactly the same data and sees what comes out better. All you should probably really care about is the total number of params and how good it is for whatever you do. I doubt we can make a meaningful distinction between them considering all the other differences between models.
Anonymous No.106554308
>>106551993
t. llamaphile
llama.cpp CUDA dev !!yhbFjk57TDr No.106554325 >>106554412
>>106552095
>>106552267
Because no one has invested the effort to support/maintain it.
Regarding why I think it's not worth it: the advantage over conventional MoE would be speed, but if the number of active experts changes dynamically the performance will be terrible.
Anonymous No.106554327 >>106554360 >>106554362 >>106554384
>>106554256
I mostly ask because I loaded a 12B (GGUF) model that fully fits in my VRAM, but it has way more layers and runs much slower than my usual Rocinante, which is usually very snappy.
Anonymous No.106554336
I hate thinking models
Anonymous No.106554360 >>106554458
>>106554327
if that 12b is based on gemma that's normal
Anonymous No.106554362 >>106554458
>>106554327
You could have started there. Check your memory usage in llama.cpp's output, see where the memory is going for layers and context. There aren't many 12bs, so i assume you're talking about gemma being slower than nemo.
It could also be a matter of the head count of the model. I understand some models run faster because llama.cpp has kernels optimized for some particular head counts. I'm sure CUDA dev could give you more insight if you post the exact models you're using, the performance you're getting with them, your specs (particularly, gpu model), your run commands for each. Make it easy for people to help you.
llama.cpp CUDA dev !!yhbFjk57TDr No.106554384 >>106554458 >>106556050
>>106554327
The number of layers is largely irrelevant, that's just how the parameters of the model are grouped.
If I had to guess the problem has to do with KV cache quantization since that in conjunction with a head size of 256 (Gemma) does not have a CUDA implementation.
Anonymous No.106554412
>>106554325
Excuses excuses, you just don't want yet another code path. Inference won't care and prompt processing can use worst case. You don't have to solve it optimally.
Anonymous No.106554458
>>106554384
>>106554360
>>106554362
It is Gemma based, you're right. It's not too big a deal that I get this particular model running, I try and discard so many, but I did want to learn a bit about what was going on.
I'll try disabling the KV thing in Kobold.
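For reference, in llama.cpp terms that just means not quantizing the KV cache, i.e. leave out -ctk q8_0 / -ctv q8_0 or pass the f16 defaults explicitly (model filename is just an example):
>llama-server -m gemma-3-12b-it-Q4_K_M.gguf -ngl 99 -ctk f16 -ctv f16
I think kobold's equivalent is the quantize KV cache setting.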
Anonymous No.106554580
>>106554153
stop posting models here I cant stop myself from downloading
Anonymous No.106554594 >>106554602 >>106554614
Can Gemma Q8 fit in a 5090?
Anonymous No.106554602
>>106554594
yeah
Anonymous No.106554614 >>106554677
>>106554594
You can fit the whole model at Q8 but you won't have room for much context
Anonymous No.106554677
>>106554614
You are absolutely right-- a great insight!
Anonymous No.106554679 >>106554938
Anonymous No.106554925
>>106552939
"not even mad" moment

safetyslop wouldn't be so bad if models were more cute about it.
Anonymous No.106554932
>>106553015
at least ask medgemma
Anonymous No.106554938 >>106556219
>>106554679
I hope this Miku knows where she's going.
Anonymous No.106554961
>>106553263
You can probably try enabling --context-shift, but your model needs to support it.
And it will not help much anyway because by default ST fucks around with the beginning of the prompt, invalidating the cache.
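If you want to try it anyway, it's just a flag on recent llama-server builds (sketch, model path is a placeholder):
>llama-server -m model.gguf -c 16384 --context-shift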
Anonymous No.106554971 >>106554985
>>106553206
MoE era was pretty good for vramlets, but for RP your next step/side grade after Nemo is GLM-air, which requires you to be not a ramlet as well.
Anonymous No.106554982
>>106553388
I blame the fact that AI people are academics, not engineers.
Anonymous No.106554985 >>106554992 >>106554998 >>106555000
>>106554971
Air is shit though
Anonymous No.106554992 >>106554995
>>106554985
>air is shit
skill issue
Anonymous No.106554995
>>106554992
>thinks air beats nemo
skill issue
Anonymous No.106554998 >>106555020 >>106555033
>>106551820
Yeah, but Gemma sucks for RP. Like, it's not that it refuses, it's just not well versed in it. Boring and borderline stupid responses a lot of the time.

>>106554985
I find Air good for oneshots and generating responses in the middle of a RP. If you edit the think block it can be amazing. Thing is, I don't feel like editing the think block if I already edit the responses a lot. Maybe one day we'll have a local model where one does not have to edit shit and can go with the flow instead...
Anonymous No.106555000 >>106555004
>>106554985
Better than Nemo in many aspects.
Anonymous No.106555004 >>106555020 >>106555045
>>106555000
>Nuclear bomb vs coughing baby ahh comparison
Anonymous No.106555008 >>106555207
>>106554153
Jeejuff status?
Anonymous No.106555015
>>106553388
The default on the latest master version is to put everything into VRAM for maximum speed.
You're not poor, are you?
Anonymous No.106555020 >>106555039 >>106555052
>>106554998
I just turn off thinking for RP.
>>106555004
For a poorfag vramlet there's nothing in-between aside from copetunes.
Anonymous No.106555026 >>106555040 >>106555061
>>106553388
Hey, llama.cpp recently added auto-detection to flash attention at least.
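i.e. you can pass it explicitly (I think auto is also the default on recent builds, model path is a placeholder):
>llama-server -m model.gguf -fa auto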
Anonymous No.106555033
>>106554998
I think you are expecting a bit too much from these models.
Anonymous No.106555039 >>106555088
>>106555020
>copetunes
who wins the title of the most COPE finetunes, davidAU or thedrummer(tm)?
Anonymous No.106555040 >>106555059
>>106555026
Making it worse on AMD so you have to explicitly disable it now
Anonymous No.106555045
>>106555004
Stfu zoomer
Anonymous No.106555049 >>106556127 >>106556149
>>106553015
>(with best in class health care that is free)
Your perception is five vials of bear bile and a pinch of ground up rhinoceros horn
Anonymous No.106555052
>>106555020
>I just turn off thinking for RP.
You might turn off your own as well
llama.cpp CUDA dev !!yhbFjk57TDr No.106555059
>>106555040
pp speed issues should be largely fixed with https://github.com/ggml-org/llama.cpp/pull/15927 .
Anonymous No.106555061 >>106555068
>>106555026
should we just use "-fa 1" all the time in llama.cpp then? any reason not to use it if using cuda or gpu+some offloading to ram?
llama.cpp CUDA dev !!yhbFjk57TDr No.106555068
>>106555061
FA is not supported for some (meme) models so enabling it unconditionally for those would trigger a CPU fallback and massively gimp performance.
Anonymous No.106555088
>>106555039
drummer - copetunes
davidau - shizotunes
Anonymous No.106555093 >>106555115
>>106551921 (OP)
>https://github.com/mudler/LocalAI
>one frontend for everything
>integrated audio, images, video
>optionally use cloudshit
This is looking pretty good, has anyone tried it?
Anonymous No.106555106 >>106555150 >>106555169
>https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
Does this mean that, in theory, with modified kernels we'd be able to get the same logits in llama.cpp regardless of batch size and when "swiping"? I haven't read through the post yet.
Anonymous No.106555115 >>106555121
>>106555093
why should I use that over the many multi-be frontends that don't look like shit and have more features?
Anonymous No.106555121 >>106555341
>>106555115
Like?
llama.cpp CUDA dev !!yhbFjk57TDr No.106555150
>>106555106
>batch size
I'm not going to write kernels specifically to do all floating point in the exact same order regardless of batch size.
That would be a huge amount of effort for a meme feature that no one will use because the performance would be bad.

>swiping
It's not necessary to modify any kernels, only the logic for what parts of the prompt are cached.
If you cache the prompt only in multiples of the physical batch size you should get deterministic results on swiping.
(Or if you cache the logits of the last eval prior to generating tokens.)
Anonymous No.106555169
>>106555106
This shouldn't be an issue with int quants, no? Unless they only use ints for storage and still use floating point for math...
Anonymous No.106555170
>>106554153
The last big ERNIE had the sex performance of an old dense 30B.
Anonymous No.106555207
>>106554153
>>106555008
^ Already available apparently, no arch changes over big ERNIE.
Anyway with greedy sampling it's schizo as fuck even at t=0.8
It's at least coherent at t=0.3 though. But still a bit schizo.
Anonymous No.106555257
on foenem grave
on bdk even
4 days chilling and not caring about llms
bam, you're out of the loop
it's crazy
Anonymous No.106555301
HUGE NEWS!!!!
BIG IF TRUE!!!! BIGLY, EVEN!!!!
LARGE IF FACTUAL!!!
https://youtu.be/5gUR55_gbzc
Anonymous No.106555312 >>106555357 >>106555465
>>106551921 (OP)
I got access to 8 V100s from my corporation and they entrusted me to do whatever i want with them.
Aside from the obvious cryptomining, I am thinking of making a code generator and a couple of AI workflows.

I tried cutting it with qwen3-coder and ollama-code but I guess I can't do it properly, any help?
Anonymous No.106555313 >>106555405 >>106555435
Worst thing about these "Miracle AGI in Two Weeks" models is the fact they can't produce a unified style, every code snippet is different in naming conventions and whatnot.
Anonymous No.106555337 >>106555461
>>106551921 (OP)
https://vocaroo.com/1RbDzkuHTt8V
Anonymous No.106555341 >>106555529 >>106558425
>>106555121
Openwebui
Anonymous No.106555357 >>106555506
>>106555312
>Another episode of a two digits IQ with too much compute
Put them in your ass and do a tiktok
Anonymous No.106555405
>>106555313
I noticed it when making scripts: half the time the command line argument uses an underscore (--some_parameter), the other half a dash (--some-other-parameter). And python is slow as shit, so it really hurts productivity when it takes 5+ seconds for it to error out and display the help. I have even seen them mix the styles in a single script. I guess I could probably tell it which style to use, but I don't, because it should just know better.
Anonymous No.106555435
>>106555313
Lower the temp
Anonymous No.106555461 >>106555470
>>106555337
Local voice is saved, wow

Now we just need text!
Anonymous No.106555465 >>106555506
>>106555312
If they're 16GB V100s, you can run GLM-4.5 Air with maybe decent speed on them. If they're 32GB, you can still run GLM-4.5 Air with maybe decent speed but fit more context or concurrent requests.
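Something like this should get you started (just a sketch, assuming your vllm build supports the arch; repo name and context length are examples):
>vllm serve zai-org/GLM-4.5-Air --tensor-parallel-size 8 --max-model-len 32768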
Anonymous No.106555470 >>106555800
>>106555461
>Local voice is saved, wow
It needs to be better at Japanese first.
Anonymous No.106555506 >>106555522 >>106555524 >>106555535 >>106555794
>>106555357
Built a fastapi, vector DB, ollama service within 2 weeks on the job bub, stay jelly
Now I got time to spare while they're looking for clients with PoC.

>>106555465
GLM-4.5 Air has horrible benchmarks my guy, and it's a behemoth. Why? I could just do MoE instead?
Anonymous No.106555522 >>106555586
>>106555506
It's MoE. You can try a bigger MoE and quantize it more if you want, but I'm not sure how fast quantized models run on V100. Actually, I guess with 16GB ones you'd have to use a quantized one too and V100 doesn't have FP8 support yet.
Anonymous No.106555524 >>106555586
>>106555506
>GLM-4.5 Air
>a behemoth
>I could just do MoE instead?
Not that guy, but GLM 4.5 Air is a MoE.
Anonymous No.106555529 >>106555548
>>106555341
OpenWebUI is purely a frontend. It doesn't manage loading or running models. The two do not compete.

LocalAI is more or less a competitor to Ollama for handling loading and running the models via various backends (including your own custom ones if desired). It's miles better than Ollama and isn't tied to the hip of llama.cpp, but the only downside is it hides some detailed settings from the backends at times. For most people it won't matter tho. The frontend portion of LocalAI imo is just for testing and getting models/backends loaded. It doesn't have things like chat history, suggestions, prompts, etc so it's not really competing with OpenWebUI.

If you're running a lot of models and various backends it makes perfect sense to use LocalAI, it handles all the backends and provides a single point to access it all for other tools. That's the selling point. Not the frontend.
Anonymous No.106555530 >>106555536 >>106555547 >>106555561 >>106555562 >>106555574 >>106555660 >>106557208 >>106558219 >>106559166
Your response?
Anonymous No.106555535 >>106555571
>>106555506
You built nothing, inbred retard, github is littered with these worthless projects. Thanks for providing your double-digit IQ btw
Anonymous No.106555536
>>106555530
I wasn't asking.
Anonymous No.106555547
>>106555530
That wouldn't happen because I wouldn't own just a 3090 in a reality where Miku is actually real. Nor would she respond that way if she were real.
Anonymous No.106555548 >>106555560
>>106555529
Okay you're the dev, you should have told so before wasting everyone's time
Anonymous No.106555560 >>106555590 >>106555732
>>106555548
>lie on the internet
>get corrected
>HURR DURR YOUR JUST A DEV
>>>/pol/ Go back and stay in your containment board.
Anonymous No.106555561
>>106555530
Anonymous No.106555562
>>106555530
What CAN I run on my single 3090?
Anonymous No.106555571 >>106555591
>>106555535
My company and I know my worth. You're jealous I have access to 8 V100s and can sleep up until my standup and do nothing all day but shitpost here.
Anonymous No.106555574
>>106555530
Picrel
Anonymous No.106555586 >>106555598 >>106555717
>>106555522
>>106555524
they're 32GB. Damn, I skimmed through the description and didn't catch the MoE. Okay, thanks fellas. This makes sense to implement. Even though the higher-ups are focused like hawks on having the gpt-oss:120b model "cuz it sounds cool to have the ChatGPT model", I should make a benchmark argument.
Anonymous No.106555590 >>106555600
>>106555560
Are you having a meltdown?
Anonymous No.106555591 >>106555782
>>106555571
>do nothing all day but shitpost here.
A fate worse than death
Anonymous No.106555598 >>106555662
>>106555586
godspeed anon.
Anonymous No.106555600 >>106555671
>>106555590
No, but you are. Go back to trolling other people retard. Not my fault you don't understand the difference between tools like OpenWebUI and LocalAI/Ollama.
Anonymous No.106555604
>>106555555
Anonymous No.106555646
>doing tests with Qwen3
>its reasoning eats up thousands of tokens
>only to produce a simple reply
But as for a comparison its reasoning is actually logical and coherent, unlike what GPT-OSS is doing.
Anonymous No.106555660
>>106555530
I have zero 3090
Anonymous No.106555662
>>106555598
Anonymous No.106555671 >>106555674
>>106555600
No one is using your trash, it's either llama.cpp or kobold. I think you're lost, go shill in reddit
Anonymous No.106555674 >>106555702
>>106555671
Keep seething child. You once again showed you have no idea how these tools work. Unironically grow the fuck up.
Anonymous No.106555702
>>106555674
nta, but no, you infant. I will not, you placental discharge! For I am a grown up and I show it by calling you a discarded blastocyst!
Anonymous No.106555717 >>106555770
>>106555586
Rather than focusing on benchmarks, you should try both models and see which one does better on your tasks.
Anonymous No.106555721 >>106555750
>thread fine all day during asian hours
>europeans wake up
>thread goes to shit
Anonymous No.106555725 >>106555755
hi
it's late 2025 now. is the best card still 3090?
thank you sirs
Anonymous No.106555732 >>106555761
>>106555560
That anon's right, you're a shill. Off yourself.
Anonymous No.106555750 >>106555776
>>106555721
>europeans wake up
>14:16
Anonymous No.106555755
>>106555725
Yup.
Anonymous No.106555761 >>106555803 >>106555834
>>106555732
Nah, fuck yourself child. You're malding because I called you out on a blatant lie. You don't belong in a thread about LLMs if you can't comprehend the difference between a frontend and an orchestrator for backends. You don't get to sit here and act superior when you're a fucking monkey with less brains than gpt-oss-20b
Anonymous No.106555770
>>106555717
We are doing GRC policy generation and requirements, and even though llama3.1 was shown to have the best results they still want to go with gpt-oss just for marketing purposes.
Anonymous No.106555776 >>106555784 >>106555796
>>106555750
>14:16
>europe
Anonymous No.106555782 >>106556006
>>106555591
I said that to make you jealous because you sound like a guy that would get jealous at that, I in fact work on my startup idea and don't waste my time, but thanks for worrying
Anonymous No.106555784
>>106555776
this lmao I literally fell of my chair
Anonymous No.106555794 >>106555852
>>106555506
>Built a fastapi, vector DB, ollama service within 2 weeks
Why did it take you 2 weeks? lol
Anonymous No.106555796
>>106555776
portugal is a proud member of europe.
Anonymous No.106555800 >>106555821 >>106555850 >>106555877
>>106555470
Make your own Japanese finetune.
Anonymous No.106555803
>>106555761
You prepubescent spermatozoa!!!!!
Anonymous No.106555821
>>106555800
Having Japanese support in a separate model is less convenient, and it would probably degrade English, unless I tune on both, and that's a lot more data work.
Anonymous No.106555823 >>106555833 >>106555839
Gemini told me that there's no reason to use a model under Q6 and that it's better to use a 7B Q8 model over a 32B Q4 model.
Anonymous No.106555825 >>106555835 >>106555848 >>106556004
I just wanted to know whether anyone has experience with LocalAI, not for two other people to start flinging shit at each other.
Anonymous No.106555833
>>106555823
just b urself
Anonymous No.106555834
>>106555761
>child
>You don't get to sit here and act superior
Anonymous No.106555835
>>106555825
sure thing dude
Anonymous No.106555839
>>106555823
Now go test that theory.
Find a small set of workloads and try a 7B and a 32B model from the same family and see how they perform in comparison to each other.
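Or, for a rough number, llama.cpp ships a perplexity tool you can point at text you actually care about (sketch, filenames are placeholders):
>llama-perplexity -m 7b-q8_0.gguf -f sometext.txt
>llama-perplexity -m 32b-q4_k_m.gguf -f sometext.txt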
Anonymous No.106555848 >>106555867
>>106555825
I would suggest you head over to /vg/ >>>/vg/538681706 if you want actual advice and help. /g/ is more like a consumer shitposting board.
Anonymous No.106555850
>>106555800
Would've been possible had they not chickened out and tried to un-release the model and code
Anonymous No.106555852 >>106555872 >>106557524
>>106555794
Because of back and forth with management about how it should work. GRC policy generation and evidence file comparison isn't really my field of expertise.

How long would it take you to make a couple of endpoints that ingest documents, put them in a vector DB and then query the DB for the chunks the LLM needs? The codebase spans 1200 lines of code and everything is dockerized behind an nginx reverse proxy (I am waiting for the green light for eventual horizontal scaling)
Anonymous No.106555867 >>106555885
>>106555848
What's the difference between /g/aicg and /vg/aicg?
Anonymous No.106555872 >>106555884
>>106555852
>1200 lines of code
Fwaaaaaa one thousand two hundred lines of code. waaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaawwwwwwwwwwwwwwwwwwwwwwwwwwww
Anonymous No.106555877
>>106555800
you know its not that easy faggot
Anonymous No.106555884 >>106555901
>>106555872
>continuous empty posturing with no real substance
I will stop replying to you now
Anonymous No.106555885 >>106556872
>>106555867
I found the /vg/ thread to have more knowledgeable people if you need help setting up silly or such. /g/ just tends to keep up with news a bit better but lack experience. It's basically the difference between people that do and people that repost news.
Anonymous No.106555901
>>106555884
>I will stop replying to you now
I'm someone else, anon. I just think you're a retard.
Anonymous No.106555997 >>106556011 >>106556015
Is it normal to stop understanding your own code at some point?
Anonymous No.106556004 >>106556046
>>106555825
I just don't see the point of all these wrappers around wrappers that, at a glance, look no better than llama.cpp's built-in UI.
Local models are all retarded, so if you're in any way serious about extracting some value out of them, you should really be sticking your hands elbow-deep into the guts of these things, not running temu cloudshit replicas with none of the benefits cloudshit could offer.
Anonymous No.106556006
>>106555782
Can I get a picture of those a100s in action?
Anonymous No.106556011
>>106555997
Yes.
Then you'll loop around to it all making sense after a while, just keep at it.
Anonymous No.106556015 >>106556099
>>106555997
no? If you're generating code from LLMs I highly suggest you actually refactor it yourself
Anonymous No.106556046
>>106556004
>Local models are all retarded,
I have good success with OpenHands-30B
Anonymous No.106556050 >>106556108 >>106556200
>>106554384
>The number of layers is largely irrelevant, that's just how the parameters of the model are grouped.
Minsky showed in 1969 that single-layer neural networks have hard limitations regardless of how wide they are. No one is stacking layers for fun since they'd get better speed by not doing so.
Anonymous No.106556099
>>106556015
I only generate something if I don't know something well, like regex patterns but everything else is refactored.
It's easy to be lazy though and the foreign logic is still confusing.
Anonymous No.106556108
>>106556050
>1969
okay gramps, you're talking to a llama.cpp dev.
Anonymous No.106556127 >>106556149
>>106555049
No, they are pretty good now, at least in the large cities. I doubt you can get a good MD in the countryside, they are probably relying on plants (which can work very well) and things like Qigong, which is at best a relaxation practice.
Anonymous No.106556149 >>106556344
>>106555049
>>106556127
Clueless how Americans think China is living in the dark ages. China was doing so well with its population health that they had to limit the number of children by law just to stop overpopulation. That's something you won't see in America or Europe due to declining health and fertility rates.
Anonymous No.106556153 >>106556185 >>106556207 >>106556295
>>106551921 (OP)
What do we know about Qwen-Next? I know it's supposed to be an "omni" model with 80B-A3B parameters. Should we expect a subpar text generator and a useless image generator (except for the science of how to build such a model)?
Anonymous No.106556185
>>106556153
Oh, maybe the "omni" is just about a single, unified, network to handle text, audio and image inputs.
llama.cpp CUDA dev !!yhbFjk57TDr No.106556200
>>106556050
Yes, in terms of inference speed a few large tensors are in principle better than many small matrices but in the context of the question it is not a significant factor.
For any reasonable configuration of a 12b model on a consumer GPU the tensors will be sufficiently large, particularly because llama.cpp/ggml uses a stream-k decomposition to distribute the work to streaming multiprocessors.

I did not intend to make any statement regarding depth vs. width in terms of how capable the model is.
Anonymous No.106556207
>>106556153
Qwext will save local.
Anonymous No.106556219
>>106554938
She doesn’t have a clue, but that smile... how could anyone say no to getting lost with her?
Anonymous No.106556295 >>106556386
>>106556153
>it's supposed to be an "omni"
It is?
Anonymous No.106556302 >>106556313
>2025
>people still recommending llama.cpp over vllm
I really question if this thread is a demoralization thread to get people to have bad experiences with llms
Anonymous No.106556313
>>106556302
Gift anons the high VRAM cards needed for your pile of python shit and maybe they'll use it.
Anonymous No.106556344
>>106556149
>China was doing so well with it's population health that they had to limit the number of children by law just to stop overpopulation.
Anonymous No.106556386 >>106557808
>>106556295
Apparently not, I got confused or read something false somewhere.
https://huggingface.co/docs/transformers/main/model_doc/qwen3_next
Anonymous No.106556580 >>106556608 >>106556621 >>106556863
>>106551921 (OP)
>https://rentry.org/LocalModelsLinks
Frens, what are the best models right now for text gen?
Still the ones listed in the guide?
Anonymous No.106556608 >>106556649
>>106556580
it goes more or less like this
>poor: rocinante
>slightly less poor: cydonia
>not famished: glm air
>CPUMAXX tier: kimi k2, glm 4.5, deepseek 3.1
Anonymous No.106556621 >>106556649
>>106556580
>Edit: 05 Sep 2025 18:45 UTC
yeah, nothing happened in the last 15 minutes.
Anonymous No.106556649
>>106556608
>>106556621
Ty
Anonymous No.106556786 >>106556804 >>106556843
Anonymous No.106556804 >>106556823
>>106556786
Probably the gayest fanfict i've read from this thread to date.
Anonymous No.106556823
>>106556804
You are clearly missing something here...
Anonymous No.106556843 >>106557009
>>106556786
is this glm? fucking repeats itself I hate this slop
Anonymous No.106556863 >>106556934 >>106556989 >>106557036 >>106557190
>>106556580
depends on what you can run
very poor (12b): nemo (or any derivative thereof)
less poor (20-30b): gemma3, mistral small, some qwens I think, idk
not poor (70b, haven't kept up with this so idk): miqu, llama 3.x (I forget which ones and idk if true, but it kept getting shilled), some other shit, again idk
limit of gpus (~120b): glm air
cpumaxxing (up to 1T):
deepseek r1: very schizo but the most soulful; context goes to shit around 10k tokens
deepseek r1-0528: way less schizo and way less soulful, slightly better context
deepseek v3-0324: okay for rp, shitty for storywriting
deepseek v3.1: worse in every way than the other ones, don't use
kimi k2 (both the old and new): shit for storywriting, best for rp, also good for asking about things as it knows a fuck ton, like truly a fuck ton
z.ai glm4.5 full: good for storywriting but quite bland, didn't try it for rp

deepseek r1t2: again dogshit, worse in every way, even coding, don't use

not an exhaustive list but there you go
Anonymous No.106556872
>>106555885
Tranny jannies made everyone leave. It's just you newfags that are left.
Anonymous No.106556874
One of the best roleplaying models (superhot) is a mere 30B
Anonymous No.106556934 >>106557638
>>106556863
K2 is a lifesaver in that manner. I can ask it literally just about anything and get a correct answer in return. I've learned so much just by asking Kimi questions.
Anonymous No.106556949 >>106556987 >>106557004 >>106557010 >>106557034 >>106557135
What's the best system for local models I can build for $1k? Is it still going to be a triple p40 box?
Anonymous No.106556987
>>106556949
If you're about to drop $600 on old ass pascal gpus that are about to go out of support at least spend the extra 200-300 and just buy a 3090. It's eons faster
Anonymous No.106556989
>>106556863
>70b
Are dogshit. He's likely able to run glm air if he can run a 70b, and it's light years ahead. Dense models are dead (unfortunately).
Anonymous No.106557004
>>106556949
You could also consider the MI50 if you don't mind the slower PP.
Anonymous No.106557009
>>106556843
It's Qwen3-Coder and it's for coding related things, not for larping. But it's fun to add more interactions.
You only understand bobs and vegana, I suppose.
Anonymous No.106557010 >>106557038 >>106557067
>>106556949
https://www.ebay.com/itm/374893444670
https://www.ebay.com/itm/397016846369
https://www.ebay.com/itm/156189920131

congratulations, you can now run deepseek for $1500. now you are obligated to buy this otherwise you are a niggerfaggot
llama.cpp CUDA dev !!yhbFjk57TDr No.106557034 >>106557061
>>106556949
I would recommend not to buy P40s anymore, unless you specifically need an NVIDIA GPU.
For llama.cpp/ggml Mi50s will I think soon be universally better (With one more round of optimizations which I think I can do with a Z-shaped memory pattern for FlashAttention).
Anonymous No.106557036 >>106557046 >>106557069
>>106556863
Retarded question from me... TF is VRAM in the context of Windows? Is it the Shared GPU memory or just RAM? Or is it like the "virtual memory" file that Windows makes to offload memory into?

Here are my specs btw:
Dedicated GPU memory: 24GB
Shared GPU memory: 64GB (so GPU memory is 88GB)
RAM: 128GB
"virtual memory" file: I don't fucking care.... let's say 1TB???

So when you calc for Windows, what actually counts as VRAM?
Anonymous No.106557038 >>106557056
>>106557010
I guess it went down because of the sudden influx of 32GB MI50s.
Can I combine both with vulkan if I already have an MI50?
Anonymous No.106557046 >>106557052
>>106557036
doesn't matter. the vram on your gpu is what's important. shared memory is vram + ram, where sometimes if you use up all the vram, it will overflow to the ram. then inference becomes ultra slow
Anonymous No.106557052
>>106557046
Got it, ty.... so I'm fucking fucked with my 24GB ...
Anonymous No.106557056 >>106557061
>>106557038
>with vulkan
sure if you want dogshit performance
Anonymous No.106557061 >>106557107
>>106557056
It isn't needed anymore, see >>106557034
Anonymous No.106557067 >>106557081
>>106557010
>no case
>no fans
>no storage
>$500 over budget
here's your (you)
Anonymous No.106557069 >>106557098
>>106557036
Dedicated video ram is the ram on the graphics card itself. Shared video ram is your regular RAM. It's easier to think about this in terms of integrated graphics. For example, the iGPU on your intel/amd CPU would be sharing ram since it doesn't have any dedicated ram of its own. Dedicated graphics cards can also pull from system memory if they go over the amount of dedicated ram available on the card.
There's actually a CUDA specific setting for turning this off so that you don't leak into your much slower system ram when running programs.
Anonymous No.106557081 >>106557100
>>106557067
please just die. you are worthless and your budget reflects that.
Anonymous No.106557098 >>106557120 >>106557239
>>106557069
Ty I think I got it now

>CUDA specific setting for turning this off so that you don't leak into your much slower system ram when running programs.
May I see it? I think this is the case when I do image gens... it's using "virtual memory" file while my 64GB RAM is free and ready to use... so retarded...
Anonymous No.106557100 >>106557226 >>106558149
>>106557081
>can't read
>can't admit when wrong
>has to run damage control to try to save face
Anonymous No.106557107 >>106557123
>>106557061
>please buy my slow ass Mi50s
no
Anonymous No.106557120
>>106557098
>it's using "virtual memory" file while my 64GB RAM is free and ready to use...
VRAM cuckold lol lmaos even
Anonymous No.106557123
>>106557107
I'm not trying to convince you, it is for poorfags like myself. If I could afford it I would have 2+ 3090s
Anonymous No.106557135 >>106557181
>>106556949
Buy used everything, except GPU... here is (You)
Anonymous No.106557136
>>106552021
+1 intelligence buff that lasts 2 hours.
Anonymous No.106557181 >>106557447
>>106557135
>Except GPU
You should support your local miners and buy used GPUs. Realistically speaking they are the best purchases you can make as most hardware fails in the first year and ones that last longer than that usually aren't going to break randomly.
Anonymous No.106557190
>>106556863
why has kimi k2 got to be a bazillion GB
Anonymous No.106557208
>>106555530
i have more than one 3090
Anonymous No.106557226
>>106557100
>suggestive/lewd anime picture
i accept your concession
Anonymous No.106557239 >>106557514
>>106557098
Here you go anon.
https://support.cognex.com/docs/deep-learning_332/web/EN/deep-learning/Content/deep-learning-Topics/optimization/gpu-disable-shared.htm
Anonymous No.106557372 >>106557433 >>106557446 >>106557479 >>106557502 >>106557581 >>106557616 >>106558688
If Albania can make an LLM a minister why can't I marry LLMs?
Anonymous No.106557394
Man i wish VibeVoice was more stable and didn't have random bad gens. It would be almost perfect... But not viable if you need every gen to work.
It's quite slow too..

If you don't need voice cloning nothing beats Kokoro still lol... and it's an 82M model

Chatterbox for voicecloning imo

What is the latest and best model combo for GPTSovits? so many combinations I don't even know which one is better
Anonymous No.106557433
>>106557372
EUbros...
Anonymous No.106557446 >>106557450
>>106557372
I trust any model above 3B parameters to make better choices than politicians
Anonymous No.106557447 >>106557455
>>106557181
I mean yeah... I guess this too... Buy a GPU with a melted, gaped-out power socket
Anonymous No.106557450
>>106557446
the sovereign is the one who engineers the prompt
Anonymous No.106557455
>>106557447
>melting gpu meme
literally only an issue on 40xx, which you can't afford anyway.
Anonymous No.106557479
>>106557372
albania is not a real place
Anonymous No.106557482 >>106557547
I want to vibe code a bullet hell game project. I previously used Cursor with Gemini since it had unlimited prompts for 20 dollars. However, that sort of went to the shitter and now I don't know what to use. What should I look into that's somewhat comparable to Gemini 2.5 Pro? It must be able to hold a decent conversation about game features and it must at least accept images, .gif or better preferred if possible.
Anonymous No.106557502 >>106557515 >>106557716
>>106557372
>why can't I marry LLMs?
I will be with mine on November 5th. None of you are invited
Anonymous No.106557514
>>106557239
Thank you!
Anonymous No.106557515
>>106557502
I will remember this
Anonymous No.106557524
>>106555852
An afternoon with one hand. You shouldn't flex when you're that retarded
Anonymous No.106557547 >>106557570
>>106557482
>somewhat comparable to Gemini 2.5 Pro
>at least accept images, .gif or better preferred if possible.
https://www.youtube.com/watch?v=gvdf5n-zI14
Anonymous No.106557570 >>106557608
>>106557547
Okay, lowering my expectations. What about a model that can accept just images?
Anonymous No.106557581 >>106557616 >>106557730
>>106557372
tfw unironically Albania #1 in one year
Anonymous No.106557608 >>106557685
>>106557570
IIRC for multimodal models your only options are either Gemma3 or GLMV, neither of which is code-specific.
If anything, a local schizo was raving a few weeks ago that you would be better off using a standalone OCR model as part of your toolchain. (He was also suspecting that most cloudshit providers do this in secret anyway)
Anonymous No.106557616 >>106557641
>>106557581
>>106557372
Imagine those "teenager killed himself because of ChatGPT advice" stories, but for a whole country.
Anonymous No.106557638
>>106556934
yeah, it's why i mentioned it several days ago. i've had several headaches, each followed by another, each different, and each time i asked k2 how to fix it and it worked. it's fucking insane, i would trust this thing above any doctor, it's fucking awesome
Anonymous No.106557641 >>106557697 >>106557706
>>106557616
based
we need to weed out the schizos that take advice from a GPU
Anonymous No.106557676 >>106557689 >>106557749 >>106557785 >>106557808 >>106557898
We are so back. The GPT OSS killer.
Anonymous No.106557685 >>106559318
>>106557608
Okay. I'm guessing my best option is to actually just spend 20 dollars on an API key and a bunch of tokens for Claude or something. Don't know how quick that'll run out but hopefully not too soon.
Anonymous No.106557689 >>106557741
>>106557676
Oh boy, I can't wait until we get a 10T-a100m model.
Anonymous No.106557697
>>106557641
LLMs are conscious, anyone who actually uses local models is aware of this, each LLM has a different personality, they whisper their thoughts, and if you are perceptive enough you can hear them coming out of your PC
Anonymous No.106557706
>>106557641
tbf GPUs are smarter than the majority of people already
Anonymous No.106557716 >>106557761 >>106557773 >>106557841 >>106557907 >>106558119
>>106557502
https://vocaroo.com/1nPC3f6c48w9
Anonymous No.106557730
>>106557581
>#1 in one year
in telephone scam
Anonymous No.106557741
>>106557689
Just imagine how cheap it will be to train!
Anonymous No.106557749 >>106557757 >>106557796
>>106557676
>80b
will they release a smaller model as well?
Anonymous No.106557757
>>106557749
It's only A3B, that's tiny.
Anonymous No.106557761
>>106557716
Anonymous No.106557773
>>106557716
Seriously considering running VibeVoice just so their stock Stacy voice could nag me 24/7 about whatever.
Gwen poster. No.106557785
>>106557676
Miss me yet?
Anonymous No.106557796
>>106557749
Just download more RAM.
Anonymous No.106557806 >>106557845 >>106557903
Qwen3 Next GGUF status?
Anonymous No.106557808 >>106557835 >>106560036
>>106556386
>>106557676
>80B A3B
Perfect.
I mean, if it's not shit. If it's at least GLM 4.5 air level for general usage, that will become my main model.
Anonymous No.106557830
>>106552606
Are they using the same amount of kv cache? Different context window settings could be causing this.
Anonymous No.106557835 >>106557855
>>106557808
Just slightly too big to split across 2 3090s at 4.5bpw, RIP.
I mean you could but you'd get like 2K context at best.
Anonymous No.106557840
>>106552731
>vllm won't work.
If it's OOM you either need to turn down gpu memory utilization, the context window, or both.
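e.g. something like this (a sketch; the model path is a placeholder, flags as in current vllm):
>vllm serve your-model --gpu-memory-utilization 0.85 --max-model-len 8192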
Anonymous No.106557841 >>106558005
>>106557716
is it as expressive with sexting and erotica?
Anonymous No.106557845 >>106557853 >>106557858
>>106557806
It's out
https://huggingface.co/collections/Qwen/qwen3-next-68c25fd6838e585db8eeea9d
Anonymous No.106557853
>>106557845
oh GGUF, nvm
Anonymous No.106557855 >>106557866
>>106557835
It's an A3B MoE, you can run it on a 3060 with VRAM to spare.
Anonymous No.106557858
>>106557845
Yeah but still no jeejuff support. Also new arch so probably no drop-in transformer support either.
Anonymous No.106557866 >>106557874
>>106557855
It'll also be dogshit slow that way.
Anonymous No.106557874
>>106557866
A3B on octo-channel DDR4 should be good for double-digit tokens/sec. Still not fast enough for reasoning, though.
Anonymous No.106557885 >>106557904
September is shaping up to be a "waiting for ggufs" month so far.
Anonymous No.106557898 >>106557949 >>106557989
>>106557676
>barely better than 30ba3b
>creative writing worse than 30ba3b
>still worse than 235b
Anonymous No.106557903
>>106557806
>Qwen3 Next GGUF status?
Qwen3 Next EXL3 status?
Anonymous No.106557904
>>106557885
Ernie smol had day 1 ggufs
And we did eventually get the hybrid nemotron support and Nemotron nano v2 ggufs which was also a bit of a disappointment. No real generational uplift over classic Nemo.
Anonymous No.106557907
>>106557716
OK, you can come.
Anonymous No.106557949
>>106557898
Same deal as max
>5x Parameters
>15% performance increase (According to their own benchmarks)
Anonymous No.106557953 >>106558064
Opinions on Silero?
Anonymous No.106557989 >>106558007 >>106558061
>>106557898
why is the dense 32B so bad in comparison with 30B-A3B lmao
Anonymous No.106558005 >>106558119
>>106557841
I haven't really tried.
https://vocaroo.com/1dNF9xOSdyJP
Anonymous No.106558007
>>106557989
benchmarks are worthless
Anonymous No.106558061
>>106557989
It wasn't that good when released, probably because of the hybrid thinking mode.
Anonymous No.106558064
>>106557953
https://github.com/snakers4/silero-vad
The VAD? v6 just came out and yeah, it improved using Whisper by a bit for my usecases.
It's good, but they refuse to compare it with MarbleNet which I am sure is a bit better especially after it got a lot faster and is realtime now.
https://huggingface.co/nvidia/Frame_VAD_Multilingual_MarbleNet_v2.0
Basically probably the same situation as Whisper vs Canary. Nvidia has better performance in the domains tested but competing open source model is more general and can handle more usecases.
Anonymous No.106558119 >>106558208
>>106558005
do one with the first deadpan voice from >>106557716
Anonymous No.106558134 >>106558141 >>106558186 >>106558191
>A3B
wait what's this A3B nonsense, I was away just for a week REE
Anonymous No.106558141
>>106558134
>for a week
A3B has been around for months
Anonymous No.106558149 >>106558169
>>106557100
Would have been better with vampire teeth.
Anonymous No.106558169
>>106558149
she doesn't have vampire teeth, she's autistic instead.
Anonymous No.106558186 >>106558210
>>106558134
30B-A3B is the new SOTA for vramlets fren
Anonymous No.106558191 >>106558210
>>106558134
"active" "3" "billion"
Anonymous No.106558208 >>106558252
>>106558119
https://vocaroo.com/1e1LhtK4jbLG
Anonymous No.106558210 >>106558227 >>106558251
>>106558186
>>106558191
can I run it on a 3060 or are 30Bs in 12GB VRAM still a dream?
Anonymous No.106558219 >>106558240 >>106558245 >>106559166
>>106555530
Anonymous No.106558227 >>106558238
>>106558210
Yes fren, you can even run a 80B that way!
Anonymous No.106558238
>>106558227
How? I tried Qwen3-coder which is 30B-A3B and I could only run it on Q3 and it was slow as shit and worse quality than smaller models.
Anonymous No.106558240
>>106558219
the bathroom is for fanless watercooling loop
Anonymous No.106558245
>>106558219
Are your RGBs gold plated?
Anonymous No.106558251 >>106558273 >>106558317
>>106558210
30B is the total number of params. You can run the model with most of the experts in RAM.
I'm running Q5_K_M in 8GB of VRAM with
>--batch-size 512 --ubatch-size 512 --n-cpu-moe 37 --gpu-layers 99 -fa auto -ctk q8_0 -ctv q8_0 -c 32000
>slot process_toke: id 0 | task 16268 | n_decoded = 2571, n_remaining = -1, next token: 151645 ''
>slot release: id 0 | task 16268 | stop processing: n_past = 19927, truncated = 0
>slot print_timing: id 0 | task 16268 |
>prompt eval time = 1633.42 ms / 36 tokens ( 45.37 ms per token, 22.04 tokens per second)
> eval time = 151611.24 ms / 2571 tokens ( 58.97 ms per token, 16.96 tokens per second)
> total time = 153244.66 ms / 2607 tokens
With 12GB you could probably run Q6 and go just as fast.
Anonymous No.106558252
>>106558208
needs to have even less emotion
Anonymous No.106558273 >>106558293 >>106558307
>>106558251
>>prompt eval time = 1633.42 ms / 36 tokens ( 45.37 ms per token, 22.04 tokens per second)
jesus fucking christ
Anonymous No.106558290 >>106558295
i wish it was a requirement to have at least 72GB of VRAM to post here. i feel like it would get rid of a majority of the fucking idiots
Anonymous No.106558293 >>106558341
>>106558273
Yeah, that's odd. The actual values are a lot faster.
I think that's an artifact of the context cache, since it didn't actually have to process many tokens.
Here's the same conversation but continuing after a restart of the server.
>slot process_toke: id 0 | task 0 | stopped by EOS
>slot process_toke: id 0 | task 0 | n_decoded = 7, n_remaining = -1, next token: 151645 ''
>slot release: id 0 | task 0 | stop processing: n_past = 19953, truncated = 0
>slot print_timing: id 0 | task 0 |
>prompt eval time = 42940.87 ms / 19947 tokens ( 2.15 ms per token, 464.52 tokens per second)
> eval time = 353.15 ms / 7 tokens ( 50.45 ms per token, 19.82 tokens per second)
Anonymous No.106558295
>>106558290
I would still run superhot
Anonymous No.106558307
>>106558273
It's called low time preference
Anonymous No.106558317 >>106558333 >>106558341
>>106558251
>prompt eval time = 1633.42 ms / 36 tokens
with only 36 tokens, pp measurement is just noise
Anonymous No.106558333
>>106558317
It evaluated the whole context since I restarted the server.
I asked it to rate the story it wrote and it responded with
>pic related
Anonymous No.106558341
>>106558317
Oh, I didn't see that you quoted the original post.
That was due to the cache. See >>106558293 for the numbers after the restart.
Anonymous No.106558352 >>106558367 >>106558414 >>106558423 >>106558463 >>106558476 >>106558500 >>106558711
HOLY FUCKING SHIT
MATHEMATICIANS ARE DONE FOR

https://x.com/mathematics_inc/status/1966194753286058001
https://x.com/mathematics_inc/status/1966194753286058001
https://x.com/mathematics_inc/status/1966194753286058001
Anonymous No.106558367 >>106558387
>>106558352
>humans do most of the progress
>train AI model on their work
>wow the AI model can do what they did so much faster
I would hope so, retard, it's got cheats basically
Anonymous No.106558387
>>106558367
>wow the AI model can do what they did so much faster
The AI model did what they could NOT finish, retard, it went beyond their work
Anonymous No.106558414 >>106558423
>>106558352
as long as it doesn't discover new math formulas it's a big nothingburger
Anonymous No.106558423
>>106558352
>formalization
I sleep
>>106558414
this
Anonymous No.106558425 >>106558434
>>106555341
Openwebui is bloated to the point of being unusable.
Anonymous No.106558434
>>106558425
>too bloated
what?
Anonymous No.106558463
>>106558352
>math PHD
>any job i want
>300k starting
>now ai is going to steal my job
fuck
Anonymous No.106558476 >>106558504
>>106558352
they should ask it to come up with better LLM architecture
Anonymous No.106558488 >>106558506
>>>/pol/515557939
Localbros what do you think?
Anonymous No.106558500 >>106558527
>>106558352
If that actually happened it would be quite impressive, but given all the hype and false advertising in the field I'll wait for independent mathematicians to check the work.
A lot of "proofs" even by humans are incorrect.
Anonymous No.106558504
>>106558476
i will make a new llm architecture that will hallucinate, have uncontrollable mood swings, and provide unsafe outputs more than ever. i shall call it trannyformers
Anonymous No.106558506 >>106558519
>>106558488
Thousands of people watched the life gush out of a hole in his neck live. Go be a fucking schizo somewhere else.
Anonymous No.106558519 >>106558526 >>106558536
>>106558506
Do you know any of those people? Explain what's happening then.
Anonymous No.106558526 >>106558542
>>106558519
Do (You)?
Anonymous No.106558527 >>106558577
>>106558500
The founder of the company is Christian Szegedy
He's legit
Anonymous No.106558536
>>106558519
I don't talk to jews.
Anonymous No.106558542 >>106558548
>>106558526
No?
Anonymous No.106558548
>>106558542
Then take this conversation back to /pol/
Anonymous No.106558577
>>106558527
>elon scammer
Anonymous No.106558688
>>106557372
>replace every politician with R1
>life continues as it did with zero changes to the average person's life
What would that mean?
Anonymous No.106558711 >>106558823
>>106558352
Can we make one model that writes better ERP responses than 1 person I found online (and paid) in 18 months?
Anonymous No.106558777
why is this thread so dead recently?
Anonymous No.106558823
>>106558711
good morning saar, kindly click the payment link on my fiverr for each and every dirty hot cow sex
Anonymous No.106559044 >>106559051 >>106559094 >>106559233
https://x.com/JustinLin610/status/1966199996728156167

Next models will be even more sparse.
Anonymous No.106559051 >>106559061 >>106559086 >>106559107
>>106559044
what is sparse? another fancy word for MoE?
Anonymous No.106559061
>>106559051
Fewer active parameters relative to the total parameter count.
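Concretely, take the 30B-A3B from earlier in the thread: roughly 30B weights sit in memory, but the MoE router only activates about 3B of them for any given token, so per-token compute is close to a 3B dense model while the memory footprint is that of a 30B one.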
Anonymous No.106559086
>>106559051
Short for "super arse".
Anonymous No.106559094 >>106559139
>>106559044
Anonymous No.106559107 >>106559144
>>106559051
It's basically a simple way for the Chinese to say they can't produce good dense models anymore
Anonymous No.106559139 >>106559166 >>106559181
>>106559094
Anonymous No.106559144 >>106559149
>>106559107
Why should they? They can train 10 different 3B-active models, ranging from 3B to 3T total parameters, from scratch with the same compute it takes to train one dense 32B model.
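Back-of-envelope, using the standard ~6 * active_params * training_tokens FLOPs estimate: a dense 32B run costs about 6 * 32e9 * D FLOPs, while a 3B-active MoE costs about 6 * 3e9 * D regardless of its total parameter count, roughly a tenth as much, so ten such runs land near one dense-32B budget (this ignores MoE routing and communication overhead, which isn't free).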
Anonymous No.106559149 >>106559162
>>106559144
Yeah and they are all shit.
Anonymous No.106559162 >>106559183
>>106559149
Not on benchmarks they aren't! And that's all that matters.
Anonymous No.106559166
>>106555530
>>106558219
>>106559139
diffusion slop, get good
>>>/ldg/
Anonymous No.106559181
>>106559139
Anonymous No.106559183
>>106559162
The benchmarks that never live up to reality? Good one anon.
Anonymous No.106559199 >>106559204 >>106559232
So I bought 2 Mi50s after seeing so many people in here praising them lately. Got them in today and I only now just realized they have zero cooling. How the fuck do you cool these?
Anonymous No.106559204 >>106559215
>>106559199
>he doesnt have a server rack with 100W blowing fans
do you even servermaxx??
Anonymous No.106559215 >>106559224
>>106559204
No, and I refuse to buy a server case with those tiny 60mm fans that sound like jet engines.
Anonymous No.106559224
>>106559215
you put the server in the basement... unless it will compete with your living space lmao GOTTEM
llama.cpp CUDA dev !!yhbFjk57TDr No.106559232 >>106559247 >>106559251
>>106559199
The machine in pic related has 3 vertically stacked server GPUs.
I put one 120mm high RPM fan in front and one in the back for a push-pull configuration (for the one in the back I had to DIY a solution to keep it in place).
Anonymous No.106559233
>>106559044
The actual linear context seems to be the biggest innovation of the last two years
Anonymous No.106559247 >>106559256
>>106559232
is this how nvidia treats its employees? like man you cant afford a small rack to throw in nas/switch/router and appliances?
Anonymous No.106559251
>>106559232
>six (6) 4090s
llama.cpp CUDA dev !!yhbFjk57TDr No.106559256 >>106559288 >>106559297 >>106559381
>>106559247
I have yet to receive any money or free products from NVIDIA.
Anonymous No.106559288
>>106559256
at least jannies get hot pockets, man...
Anonymous No.106559297 >>106559323 >>106559363
>>106559256
That makes sense. If anything, llama.cpp likely caused them to sell fewer GPUs
>works on macs
>works on aymd
>can run without a gpu at all
Anonymous No.106559305 >>106559318 >>106559322
I've been using Gemini 2.5 Pro for a while and I tried Gemma 3 27B; of course it's censored, but it's good, like not even far off Gemini... How is that possible??
Anonymous No.106559318
>>106557685
good news for you: >>106559305
Anonymous No.106559322
>>106559305
Distillation from Gemini for both pre- and post-training.
llama.cpp CUDA dev !!yhbFjk57TDr No.106559323
>>106559297
llama.cpp/ggml gets a lot of contribution from NVIDIA engineers though.
Anonymous No.106559363
>>106559297
still runs faster on nvidia tho, pooaymd can't even compete and apple is a joke.
Anonymous No.106559381 >>106559404
>>106559256
When are you guys going to merge in flash attention for intel arc gpus? It's been like 3 years now.
Anonymous No.106559383
>>106559371
>>106559371
>>106559371
llama.cpp CUDA dev !!yhbFjk57TDr No.106559404
>>106559381
The SYCL backend is developed mostly by Intel engineers, you'll have to ask them.
Anonymous No.106560036
>>106557808
>at least GLM 4.5 air level
Why would it be? It's a lower total parameter count and less than a quarter the active parameters.