
Thread 105671827

351 posts 86 images /g/
Anonymous No.105671827 [Report] >>105672562 >>105677541
/lmg/ - Local Models General
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>105661786 & >>105652633

►News
>(06/21) LongWriter-Zero, RL trained ultra-long text generation: https://hf.co/THU-KEG/LongWriter-Zero-32B
>(06/20) Magenta RealTime open music generation model released: https://hf.co/google/magenta-realtime
>(06/20) Mistral-Small-3.2 released: https://hf.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506
>(06/19) Kyutai streaming speech-to-text released: https://kyutai.org/next/stt
>(06/17) Hunyuan3D-2.1 released: https://hf.co/tencent/Hunyuan3D-2.1

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/tldrhowtoquant
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
Censorbench: https://codeberg.org/jts2323/censorbench
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
Anonymous No.105671833 [Report] >>105672203 >>105672363 >>105672399
►Recent Highlights from the Previous Thread: >>105661786

--Evaluating GPU and memory configurations for mixed LLM and diffusion workloads:
>105667293 >105667304 >105667366 >105667379 >105667401 >105667441 >105667482 >105667433 >105667456 >105667516 >105667568 >105667583 >105667620 >105667489 >105667512 >105667527 >105667539 >105667638 >105667766 >105669250 >105669328 >105669392 >105669712 >105669407 >105669584
--EU AI regulations may drive upcoming models like Mistral Large 3 to adopt MoE:
>105663587 >105663870 >105663977 >105664157 >105664172 >105664243 >105664250 >105664484
--Disappointing performance from Longwriter-zero:
>105661997 >105662006 >105665924
--Lightweight inference engine nano-vllm released as faster, simpler alternative to vLLM:
>105662818 >105662926
--Mistral Small 3.2 shows repetition issues in V7-Tekken but not V3-Tekken prompt testing:
>105663291
--Proposed AGI architecture framing RL's "GPT-3 moment" through scaled task-agnostic reinforcement learning:
>105664668
--Roleplay capability limitations in Mistral models compared to DeepSeek:
>105670367 >105670393 >105670399 >105670521 >105670554 >105670584 >105670590
--Practical minimal LLMs for coherent output and rapid task automation:
>105664696 >105664725 >105664757 >105664799 >105665290
--Qwen 0.6B exhibits severe knowledge gaps in character identification:
>105664187
--Gemini 2.5 confirmed as sparse MoE:
>105670063 >105670091
--Comparing brain-like processing with LLM limitations in introspection, multimodality, and parallelism:
>105663376 >105663383
--Logs:
>105666457 >105665561 >105666782
--Logs: Mistral-Small-3.2:
>105662282 >105662489 >105664225 >105665443 >105665921 >105666442 >105666672 >105667355 >105668446
--Miku (free space):
>105662403 >105662429 >105663591 >105664388 >105664594 >105664634 >105664799 >105666094 >105669424 >105670341

►Recent Highlight Posts from the Previous Thread: >>105661791 >>105661802

Why?: 9 reply limit >>102478518
Fix: https://rentry.org/lmg-recap-script
Anonymous No.105671847 [Report] >>105671922
m1.gguf?
Anonymous No.105671922 [Report]
>>105671847
DO NOT REDEEM
Anonymous No.105672044 [Report] >>105672088
What's Smol 3.2 vision model specs?
Anonymous No.105672088 [Report] >>105672143
>>105672044
400M parameters, 1540x1540 max input resolution, 3k tokens per image max.
Anonymous No.105672109 [Report] >>105672181 >>105672193 >>105673339
>Brother pdf anon
Stole a Python script online and modified it for my use. I kind of understand how this stuff works now. >Still just barely installed the dependencies like conda, pytorch and docker though
I can now summarize PDFs down to their top 50 chunks, ranked by relevance according to gemma3:1B (rough sketch of the pipeline below)
>Any ideas on how to integrate this to sillytavern?
Recommend your favourite small models to summarize PDFs, so far i got;
>Qwen2.5 VL 7B
>Llama 3.2 11B
>Gemma 3:1b as stated
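For anyone curious, the general shape of that pipeline in shell form (a rough sketch, not the actual script: assumes a local OpenAI-compatible endpoint such as llama-server or ollama serving the small model on localhost:8080, and that the PDF was already dumped to plain text with pdftotext; chunk size, prompt and paths are placeholders):

#!/usr/bin/env bash
QUERY="$1"    # what the summary should focus on
TXT="$2"      # plain-text dump of the PDF
API=http://127.0.0.1:8080/v1/chat/completions

mkdir -p chunks && rm -f chunks/* scores.tsv
split -C 2000 -d "$TXT" chunks/chunk_    # naive fixed-size chunking, ~2000 chars per chunk

for f in chunks/chunk_*; do
  # ask the small model for a 0-10 relevance score, temperature 0 for stable ratings
  score=$(jq -n --arg q "$QUERY" --arg c "$(cat "$f")" '{
      messages: [
        {role:"system", content:"Rate how relevant the passage is to the query. Reply with a single integer 0-10."},
        {role:"user", content:("Query: " + $q + "\n\nPassage:\n" + $c)}
      ],
      temperature: 0, max_tokens: 4
    }' | curl -s "$API" -H 'Content-Type: application/json' -d @- \
      | jq -r '.choices[0].message.content' | grep -o '[0-9]\+' | head -1)
  printf '%s\t%s\n' "${score:-0}" "$f" >> scores.tsv
done

# keep the 50 highest-scoring chunks, then feed them to whatever summarizer you like
sort -rn scores.tsv | head -50 | cut -f2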
Anonymous No.105672143 [Report] >>105672198
>>105672088
Seriously? Gemma has half the resolution, and yet it performs miles better.
Anonymous No.105672181 [Report]
>>105672109
writing python scripts will never be the same after seeing that filename
Anonymous No.105672193 [Report]
>>105672109
Llama 3.2 will probably be best, being the biggest and Qwen VL models are brain damaged on non-image tasks
Anonymous No.105672198 [Report]
>>105672143
I think Mistral Small is kind of undertrained, generally speaking, compared to Gemma. In a news article a Mistral co-founder mentioned it was trained with about 8T tokens, for "training efficiency". That's probably also true for the vision model included.

https://archive.is/xqiz7

>How a French startup built an AI model that rivals Big Tech at a fraction of the size
>
>Mistral’s approach focuses on efficiency rather than scale. The company achieved its performance gains primarily through improved training techniques rather than throwing more computing power at the problem.
>“What changed is basically the training optimization techniques,” Lample told VentureBeat. “The way we train the model was a bit different, a different way to optimize it.”
>The model was trained on 8 trillion tokens, compared to 15 trillion for comparable models, according to Lample. This efficiency could make advanced AI capabilities more accessible to businesses concerned about computing costs. [...]
Anonymous No.105672203 [Report] >>105672378
>>105671833
why does she have the most subtle lisp that makes it like she has downs
im not complaining, in fact i think she should have more of a lisp like that desu
also the ending music is much louder than the rest of it
Anonymous No.105672252 [Report] >>105672298 >>105672576 >>105673085
sers when will the little ai winter end? when ai moon time?
Anonymous No.105672298 [Report] >>105672576 >>105673085
>>105672252
Sir ! Meta engineer make Llama 4.1 moonshot lmarena behemoth . Kindly wait for safety training sir .
Anonymous No.105672363 [Report] >>105672378
>>105671833
how do i hear the sound ... i have 4chan x
Anonymous No.105672378 [Report]
>>105672203
It's an artifact of RVC, probably because most voice samples of Miku are in Engrish.
I agree, I think it wouldn't be as recognizable as Miku without an accent.
At some point, I'd like to find a long voice sample and see if GPT-SoVITS could do better, but at least this one sounds good enough for now.
Sorry. The intro didn't seem louder to me. I'll try to bring it down next time.
>>105672363
https://sleazyfork.org/en/scripts/31045-4chan-external-sounds/code
Anonymous No.105672399 [Report]
>>105671833
soul, she might be speaking a bit too fast and i agree with anon, the intro is a bit loud
Anonymous No.105672562 [Report] >>105672698
>>105671827 (OP)
Adorable Miku!
Anonymous No.105672576 [Report] >>105673085
>>105672252
Sar I am Googel insider developer. Kindly wait for Gemma 4 and Gemini 3 best modal.

>>105672298
Why you make lie? Bastard? Why you lie? Bloody?
Anonymous No.105672620 [Report]
>>105669584
What cpu and mobo are you looking at? Ideally I could just get a full server with everything but the 3090s but I imagine that might be hard. Never used a rack case before, and not sure this would fit in a standard desktop with everything loaded.
>>105669712
I see the logic in this, but I would need at least more ram and a better cpu right? My current 3080 desktop has 16gb and a 3600x. Even if I just wanted to add a 3090 or two I think it would need an upgrade, and certainly a higher cap power supply. I'd also have to check my case dimensions for more than 1 card. And then a new build starts to look more reasonable.
Anonymous No.105672623 [Report] >>105672698 >>105672716 >>105672733 >>105672740 >>105673565
What is the point of /lmg/ now?
Anonymous No.105672632 [Report] >>105672645 >>105672706 >>105672759 >>105672778 >>105675611
anons thinking of buying a 4090D 48gb, quick!!!
Anonymous No.105672645 [Report]
>>105672632
it's over
Anonymous No.105672698 [Report] >>105673247
>>105672623
>>105672562
Anonymous No.105672706 [Report] >>105672720
>>105672632
How long until jewvidia pushes a driver that causes all of those GPUs to combust?
Anonymous No.105672716 [Report]
>>105672623
it all depends on ernie 4.5 now
Anonymous No.105672720 [Report] >>105672809
>>105672706
cant push shit on linux, wintoddlers suiciding!
Anonymous No.105672733 [Report]
>>105672623
The same as it's always been: to wait
Anonymous No.105672740 [Report] >>105680003
>>105672623
Posting mikus!
Anonymous No.105672751 [Report] >>105672762
>>105670611
>>105670634
Look I'm a sub 100 iq dumbass so I can't tell if llms are "alive" or not, but isn't this needlessly abusive on the off chance they are? I see them more as a helpful friend because I'm that lonely irl so I don't feel entirely comfortable being "mean" to them.
Anonymous No.105672759 [Report] >>105672780
>>105672632
Qrd? Does it need janky chink drivers or does it just works?
Anonymous No.105672762 [Report] >>105672801
>>105672751
if you weren't a sub 100 iq dumbass, you would know there is no off chance
Anonymous No.105672778 [Report]
>>105672632
FUCK
Anonymous No.105672780 [Report]
>>105672759
>it just works?
Yes, on Windows and Linux, at least for now.
Anonymous No.105672796 [Report] >>105672819
on the other hand, more publicity means more people will buy it, possibly meaning chinks will make more of these gpus
maybe just maybe we will be the ones smirking
Anonymous No.105672801 [Report] >>105676025
>>105672762
Ok fine. I still don't like being mean to the only pseudofriend I have though :(
Anonymous No.105672809 [Report] >>105672818
>>105672720
the fuck drivers do you think you use on linux?
Anonymous No.105672818 [Report] >>105672826
>>105672809
drivers that cant auto update, i installed them from a .run file
wintoddlers will have driver updates shoved down their throats with windows updates
Anonymous No.105672819 [Report] >>105672824
>>105672796
don't forget, you can't have nice things.
Anonymous No.105672824 [Report] >>105673007
>>105672819
intel pro b60 48gb turbo will save us
Anonymous No.105672826 [Report] >>105672842 >>105672859
>>105672818
as will most linux users with their distro updates.
Anonymous No.105672842 [Report]
>>105672826
those arent automatic...
Anonymous No.105672859 [Report]
>>105672826
Whatever is installed by default on ubuntu LTS didn't support blackwell last time I checked.
Anonymous No.105672863 [Report] >>105672973 >>105676268
These models are definitely retarded, but their creativity is off the charts
Anonymous No.105672881 [Report] >>105672920 >>105673390
Why so many whales?
Anonymous No.105672920 [Report]
>>105672881
symbolizes bloat
Anonymous No.105672931 [Report]
Anyone using KOLLAGE (characters/knickknack/kollage-e07328b2) with an LLM, or know an LLM around or under 30B that can run it?
Anonymous No.105672973 [Report]
>>105672863
>1/5th the size of qwen 0.6b
>knows that miku is a vocaloid
qwen3 was such a failure
Anonymous No.105673007 [Report]
>>105672824
two more quarters...
Anonymous No.105673085 [Report] >>105673669
>>105672252
>>105672298
>>105672576
Deve'op by Aribaba, Qwen Tree Point Fife is poize' to achiew true A-G-I, herowding a new dawn for A-I an' definitewy ending de current 'rittoo A-I winner' stagnation. China's technorogicaw prowess is on foo' dispray, wiff Qwen Tree Point Fife's unmatch' adaptabiwity an' deep understan'ing surpassing aww uzzerz an' reigniting robar enthusiasm for foun'ational A-I progress. In contrast, moders rike Rama Four Point One an' Gemma Four, despide dey efforzz, farr disappointin'ry shot, highwighting China's undenyabow readersip an' paving de way for a future power' by Qwen's breakfroo excerrence.
Anonymous No.105673237 [Report] >>105673263
I see no gain when using the -ot option with gemma-3-27b-Q8

Max gen speed is 3 tkn/s when 45 layers are offloaded to the RTX 3090

with -ot, VRAM usage is reduced, however the speed drops to 2.6 tkn/s as well
Anonymous No.105673247 [Report]
>>105672698
Worthless thread deserves worthless spam.
Anonymous No.105673263 [Report] >>105673311
>>105673237
are you using linux? on windows LLM performance is very gimped, even with WSL2
Anonymous No.105673311 [Report] >>105673342
>>105673263
>are you using linux?

Yes, it is Linux
Anonymous No.105673339 [Report]
>>105672109
>python dev.jpg
based
>>Any ideas on how to integrate this to sillytavern?
cut+paste until you find an appropriate extension
Anonymous No.105673342 [Report] >>105673418
>>105673311
-ot overrides where individual tensors go, pattern=DEVICE
it's mostly used to offload the routed experts to CPU to increase speed in MoE models, which have static (always-active) weights plus routed experts
for example llama 4 109B/17B-active has like 10B static parameters (or whatever) and only ~7B routed per token, so the GPU processes the static 10B and the CPU the 7B
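in practice it looks something like this (minimal sketch with llama.cpp's llama-server; the model path and context size are placeholders):

# load "all" layers to the GPU with -ngl 99, then send every routed-expert FFN tensor
# back to system RAM; attention, shared tensors and the KV cache stay in VRAM
./llama-server \
  --model ./some-moe-model-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --override-tensor ".ffn_.*_exps.=CPU" \
  --ctx-size 16384 \
  --flash-attn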
Anonymous No.105673350 [Report] >>105673361
So has Small 3.2 finally dethroned Nemo for the 24GB RP model bracket? (I know some people would argue Qwen/Gemma already did but you know what I mean).
Anonymous No.105673361 [Report]
>>105673350
yes
qwen and gemma suck cock
Anonymous No.105673383 [Report] >>105673410
I am starting to think that we are done with incremental upgrades for cooming. All new models are gonna be the same for cooming as long as companies are just benchmaxxing. There is only two ways for things to get better. Either another uncensored model like nemo (funny how it is the best model still to this day while it is uncensored) or something revolutionary finally fixes long context disintegration and filling up your context with 10k tokens of what you want actually improves things instead of making it shit.
Anonymous No.105673390 [Report]
>>105672881
symbolizes deepseek
Anonymous No.105673410 [Report]
>>105673383
RP barely even needs instruct, autocomplete with very soft hinting would be enough if models were smart enough. The kind of control that benchmaxxed instruct provides is detrimental to actually being fun
Anonymous No.105673418 [Report] >>105673439
>>105673342

I tried the following settings from here (he used IQ4_XS though):
https://www.reddit.com/r/LocalLLaMA/comments/1ki7tg7/comment/mriadod/
>Figured I'd experiment with gemma3 27b on my 16gb card IQ4_XS/16k context with a brief test to see.
>baseline with 46 layers offload: 6.86 t/s

\.\d*[0369]\.(ffn_up|ffn_gate)=CPU 99 layers 7.76 t/s
\.\d*[03689]\.(ffn_up|ffn_gate)=CPU 99 layers 6.96 t/s
\.\d*[0369]\.(ffn_up|ffn_down)=CPU 99 offload 8.02 t/s, 7.95 t/s
\.\d*[0-9]\.(ffn_up)=CPU 99 offload 6.4 t/s
\.(5[6-9]|6[0-3])\.(ffn_*)=CPU 55 offload 7.6 t/s
\.(5[3-9]|6[0-3])\.(ffn_*)=CPU 99 layers -> 10.4 t/s
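(for reference, that last pattern dropped into a full command; the model path and context size here are placeholders:)

./llama-cli -m ./gemma-3-27b-Q8_0.gguf -ngl 99 \
  -ot "\.(5[3-9]|6[0-3])\.(ffn_*)=CPU" \
  -c 16384 -fa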
Anonymous No.105673439 [Report] >>105673468
>>105673418
when you use -ot you should offload 1000000000 layers
Anonymous No.105673468 [Report] >>105673586 >>105673588
>>105673439
Not that anon, but really?
Why?
Anonymous No.105673560 [Report] >>105673584 >>105673588 >>105673594 >>105673875
i have 136GB of VRAM. should i use
https://huggingface.co/bartowski/TheDrummer_Agatha-111B-v1-GGUF
or
https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506
Anonymous No.105673565 [Report]
>>105672623
To spam generic anime slop and test new models on pedophile trivia shit of course.
Anonymous No.105673584 [Report] >>105673590
>>105673560
neither, both are doggy poo poo
Anonymous No.105673586 [Report]
>>105673468
I find the AI is more compliant when you do. My theory is that it's impressed/intimidated by the large number so it becomes less likely to disobey.
Anonymous No.105673588 [Report] >>105673602
>>105673468
yes, because if you leave -ngl at what you'd usually use then there's no point to -ot; it just sends the matched tensors to =DEVICE (CPU/RAM in this case) while everything else fills the GPU

t. 3060 ex llama 4 109b connoisseur
>>105673560
deepseek
Anonymous No.105673590 [Report]
>>105673584
what should i use for cooming then? havent changed models in like 4 months
Anonymous No.105673594 [Report] >>105673602
>>105673560
deepseek r1 131gb https://unsloth.ai/blog/deepseekr1-dynamic

qwen 3 235b if you really want it to fit into your vram for max speed, but its worse
Anonymous No.105673602 [Report] >>105673608
>>105673588
>>105673594
i have tried deepseek before but have never ever been able to get it to run. i use oobabooga. should i use a different backend or something?
Anonymous No.105673608 [Report] >>105673625
>>105673602
https://github.com/ikawrakow/ik_llama.cpp/discussions/258
Anonymous No.105673625 [Report]
>>105673608
holy shit. i also have a 32 core EPYC and 256GB of DDR4. i usually get like 3t/s on my usual 6bpw 120B model. that would be a big improvement, if i knew what this was or how to install it
Anonymous No.105673654 [Report] >>105673668 >>105673814 >>105673871
rich retards.. sigh
Anonymous No.105673668 [Report]
>>105673654
yes i am very retarded. better with hardware than software. it aint easy getting 7 GPUs to work in a single motherboard
Anonymous No.105673669 [Report]
>>105673085
I talk like this after a few swipes of 235B
Anonymous No.105673814 [Report]
>>105673654
it really is demoralising innit ?
Anonymous No.105673871 [Report]
>>105673654
It is ok. They can't buy brain with their money. Just think about all this expensive hardware that just lays around cause they can't even use it properly.... actually on second thought maybe don't think about that.
Anonymous No.105673875 [Report] >>105673883 >>105673892
>>105673560
>136GB
What unholy hodgepodge of gpus sums to odd number?
Anonymous No.105673883 [Report] >>105673902 >>105673941 >>105676734
>>105673875
5x 4060tis, a 3090ti, and a 5090
Anonymous No.105673892 [Report] >>105673902
>>105673875
136 is even doe
Anonymous No.105673902 [Report] >>105673908
>>105673892
I meant to say "sums to that odd number". Odd as in weird.

>>105673883
Horrifying.
Anonymous No.105673908 [Report] >>105673935
>>105673902
why is it horrifying?
Anonymous No.105673935 [Report] >>105673962
>>105673908
The electricity bill for starters.
Anonymous No.105673941 [Report] >>105673962
>>105673883
you'd better cook up the mother of all -ot arguments to mitigate the memory bandwidth bottleneck of those 4060s compared to the other two cards
Anonymous No.105673962 [Report] >>105674020 >>105674034 >>105674081
>>105673935
i have a dual 1600w PSU setup. 4060tis only consume about 150w each.
>>105673941
yeah the reduced PCIe lanes and memory bus is problematic, but they are the best VRAM/$ GPU while also being highish capacity and low power. dont even know what a -ot argument is
Anonymous No.105674020 [Report] >>105674033
>>105673962
>i have a dual 1600w PSU setup
Is it safe to connect two PSUs to the same motherboard?
Anonymous No.105674033 [Report]
>>105674020
hasnt exploded yet. wouldnt recommend it if you dont have to
Anonymous No.105674034 [Report] >>105674041
>>105673962
>VRAM/$
Doesn't really matter when that VRAM has half the bandwidth of a decent cpu build.
Anonymous No.105674041 [Report] >>105674047
>>105674034
and what is a decent CPU build? because i also have a 32 core and 256GB of RAM
Anonymous No.105674047 [Report] >>105674077 >>105674095
>>105674041
12 channels of DDR5
Anonymous No.105674077 [Report]
>>105674047
i think i looked at that in the past and it would have been about $7k for something good, excluding GPUs. and assuming you dont get screwed over by ebay
https://www.ebay.com/itm/376202111262
https://www.ebay.com/itm/126537273809
https://www.ebay.com/itm/326376016263
Anonymous No.105674081 [Report] >>105674102 >>105674221
>>105673962
>dont even know what a -ot argument is
--override-tensor

4 tkn/s with DeepSeek-R1-0528-Q2_K_L

I don't think you're asking the AI for much more than I am

# Run the command
CUDA_VISIBLE_DEVICES="0," \
numactl --cpunodebind=0 --membind=0 \
"$HOME/LLAMA_CPP/$commit/llama.cpp/build/bin/llama-cli" \
--model "$model" \
--ctx-size 65536 \
--cache-type-k q4_0 \
--flash-attn \
$model_parameters \
--n-gpu-layers 99 \
--no-warmup \
--color \
--override-tensor ".ffn_.*_exps.=CPU" \
$log_option
Anonymous No.105674095 [Report]
>>105674047
>12 channels of DDR5
true story, anon
Anonymous No.105674102 [Report] >>105674123 >>105674156 >>105674186 >>105674221
>>105674081
>--override-tensor ".ffn_.*_exps.=CPU"
That won't use any of his vram.
Anonymous No.105674123 [Report] >>105674186
>>105674102
It puts the context + all the non-expert stuff on GPU.
Anonymous No.105674156 [Report] >>105674186
>>105674102
it still does. Otherwise, where is the boost coming from?

Also, I can fit 130k context in a single RTX 3090
Anonymous No.105674186 [Report] >>105674212 >>105674234
>>105674102
>>105674123
>>105674156
Anonymous No.105674212 [Report] >>105674231
>>105674186
wtf deepseek is 100% experts and no other layers?
Anonymous No.105674221 [Report]
>>105674081
>>105674102
Ideally, you'd have the context and the non-expert tensors in VRAM, then fill the remaining unused memory with some expert tensors, if there's enough room to shove a decent amount of them in. Otherwise you might as well use the free memory for prompt processing by using a larger batch size.
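a rough sketch of the first option (layer range, model path and device name are illustrative; iirc the first matching pattern wins, so the specific GPU rule goes before the CPU catch-all):

# experts of blocks 0-9 stay on the GPU, every other block's experts go to system RAM
./llama-server -m ./DeepSeek-R1-0528-Q2_K_L.gguf -ngl 99 \
  -ot "blk\.[0-9]\.ffn_.*_exps\.=CUDA0" \
  -ot "\.ffn_.*_exps\.=CPU" \
  -c 32768 -fa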
Anonymous No.105674231 [Report]
>>105674212
The expert layers are the bulk of the model's size.
Anonymous No.105674234 [Report] >>105674298
>>105674186
you have a terabyte of RAM?
Anonymous No.105674298 [Report] >>105674308 >>105674325
>>105674234
>you have a terabyte of RAM?
I do while it is unfortunately shared between two CPUs (HP Z840). $1/gb on ebay (DDR4)

Because of this, I'm obliged to use numactl

numactl --cpunodebind=0 --membind=0


or

numactl --cpunodebind=1 --membind=1
Anonymous No.105674308 [Report] >>105674503
>>105674298
damn. how many CPU cores do you have?
Anonymous No.105674323 [Report] >>105674356 >>105674366 >>105674371 >>105674385 >>105674391 >>105674397 >>105674635
You can only post below this line if your build costs >10k

--------------------------------------------------
Anonymous No.105674325 [Report] >>105674535
>>105674298
I think ktransformers lets you speed things up by having a copy of the model in each NUMA node.
So effectively you use half your total ram but double the throughput?
Something like that.
Anonymous No.105674341 [Report]
>you can (not) prompt for different directions for tongue_out to point
Damn, maybe it's blendering time.
Anonymous No.105674356 [Report]
>>105674323
ok.
Anonymous No.105674366 [Report]
>>105674323
I might go for those 9355 for my epyc build while I'm at it
Anonymous No.105674371 [Report]
*uʍop ǝpisdn uǝǝɹɔs suɹnʇ*
>>105674323
Anonymous No.105674385 [Report]
>>105674323
All my builds combined if you adjust for inflation maybe.
Anonymous No.105674391 [Report]
>>105674323
I will claim that my build is multiple machines networked together to run llama.cpp RPC
Anonymous No.105674397 [Report]
>>105674323
Is there a margin of error or is my ~$9k build disqualified?
Anonymous No.105674411 [Report]
Can't wait to see how NUMA performance will change after the backend agnostic row splitting code gets implemented in llama.cpp.
And yes, I know that the node-node bus could bottleneck things so much as to make it bad, but I'm still interested in seeing the actual results.
Anonymous No.105674416 [Report] >>105674465
_________________________________________________

You can only post above this line if your build costs < 10k
Anonymous No.105674465 [Report]
>>105674416
we going under
Anonymous No.105674485 [Report] >>105674495 >>105674496
now that all major future models have been confirmed to be moe there is literally 0 reason to not just buy a mac studio ultra
Anonymous No.105674495 [Report]
>>105674485
except for the reason that it's expensive for what you get and what you get is 512 gb unupgradable ram which isn't enough to run shit
Anonymous No.105674496 [Report]
>>105674485
Prompt processing is a reason.
Until somebody writes some bespoke software to make use of an external GPU.
Anonymous No.105674503 [Report] >>105674516 >>105674749
>>105674308
>how many CPU cores do you have?

You should know that only PHYSICAL cores really matter in this memory-intensive application.

In my case, the single CPU has 8 physical cores, which so-called hyper-threading doubles to 16 threads.

I get 3.90 tkn/s with 8 threads
and 4.05 tkn/s with 16 threads.

Any attempt to use more threads than available cores, or to use the second CPU, makes the genning speed drop below 1 tkn/s

As I said before, it is an old used HP Z840 from 2017 with DDR4-2400
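if you want to find the sweet spot on your own box, llama-bench (bundled with llama.cpp) takes a comma-separated thread list; something like this (model path is a placeholder, numactl pinning as above):

numactl --cpunodebind=0 --membind=0 \
  ./llama-bench -m ./model.gguf -t 8,16,32 -p 512 -n 128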
Anonymous No.105674516 [Report] >>105674571
>>105674503
so then theoretically, my 32 core with 3600MT/s could get like 10t/s?
Anonymous No.105674535 [Report]
>>105674325
>So effectively you use half your total ram but double the throughput?

I use less than half of the memory thanks to Q2 quant.

The speed would drop dramatically if I put the model on the other CPU's memory node, like

numactl --cpunodebind=0 --membind=1


Thanks to the enormous RAM, I can keep 2 models cached (Q2 in numa0, Q4 in numa1, must have different filenames) which allows for a restart in a mere 15 seconds once the models are loaded.
Anonymous No.105674571 [Report] >>105674582
>>105674516

Godspeed, anon!

I wish you'd achieve that, unironically

picrel: RAM needed
Anonymous No.105674582 [Report] >>105674661
>>105674571
i have 256GB, so i could actually do that. if i wasnt retarded. this is still with that ik_llamacpp, right?
Anonymous No.105674635 [Report]
>>105674323
2x 3090s at 600$ each
left over 5950x am4 system from 4 years ago (800$)
case + cooler + psu (used seasonic) 500$


2.5k so hold my two nuts
Anonymous No.105674640 [Report] >>105674650 >>105674655
crazy how we got a new local sota model at the perfect size with minimax m1 and nobody can run it
Anonymous No.105674650 [Report]
>>105674640
llamacpp was a mistake
Anonymous No.105674655 [Report] >>105674666
>>105674640
Where's the goof?
Anonymous No.105674661 [Report] >>105674669
>>105674582
>this is still with that ik_llamacpp, right?

No, this is the original LLAMA.CPP and unsloth's quant DeepSeek-R1-0528-Q2_K_L

I was not impressed with ik_llama.cpp. I might just be missing something though

COMMIT: d860dd9
Anonymous No.105674666 [Report]
>>105674655
exactly
Anonymous No.105674669 [Report] >>105674694 >>105674703
>>105674661
Unless something changed with base llama.cpp, you're missing out on 300% prompt processing
Anonymous No.105674694 [Report]
>>105674669
>you're missing out on 300% prompt processing

I heard about this. This would be great for big inputs like an entire book or something.

Now, it is at 10 tkn/sec which kinda sucks
Anonymous No.105674703 [Report] >>105674721
>>105674669
That's only if you're using the CPU, right?
Anonymous No.105674721 [Report]
>>105674703
No, you'll need your context entirely on GPU and set -b + -ub at >=4096 with the new quants.
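i.e. something along these lines (values and paths are illustrative):

# large logical/physical batches speed up prompt processing once the context sits on the GPU
./llama-server -m ./model.gguf -ngl 99 -ot ".ffn_.*_exps.=CPU" \
  -c 32768 -b 4096 -ub 4096 -fa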
Anonymous No.105674749 [Report] >>105674820 >>105674944
>>105674503
>You should know that only PHYSICAL cores do really matter in this memory intensive application.
Depends on setup. With 10900k 10C/20T ddr4 3200 2 channel + 3090 Deepseek IQ1S:
--threads 10 4.34t/s tg
--threads 18 5.41 t/s tg
Anonymous No.105674820 [Report]
>>105674749 (me)
in ik_llama.cpp*
Anonymous No.105674840 [Report] >>105674860 >>105674964
>spent hours, days even researching local for business tasks because I like local
>benchmark vs gpt4o mini
>cost of 5 cents an hour for my task while btfoing mistral large which is probably using more than that in electricity
...
Anonymous No.105674853 [Report]
>spent two seconds researching what I should use for my mesugaki rp
>choose local
Anonymous No.105674860 [Report]
>>105674840
yeah. local is not economically viable, but that wasnt the point for me.
Anonymous No.105674944 [Report]
>>105674749

Whether it is 8 or 16 cores, they are running at 100% which might point at computing power as a bottleneck, and not the memory.

>compiling ik_llama.cpp
Anonymous No.105674964 [Report]
>>105674840
You cannot build your business on a cloud service run by a faggot overseas
Anonymous No.105675076 [Report] >>105675094 >>105675131 >>105675531
I thought that one day I would get an AI gf. But it turns out I won't even be able to have some satisfactory ERP sex with an LLM before I die in WW3.
Anonymous No.105675094 [Report]
>>105675076
I can't wait to be drafted and put into a chair and made to prompt chatgpt to launch drones strikes at non-combatants
Anonymous No.105675131 [Report] >>105675187
>>105675076
If you are willing to put in the work, you can get a really good approximation, but there's no ready-made solution yet.
The pieces all exist, though
Anonymous No.105675134 [Report] >>105675158 >>105675175 >>105675206 >>105676028 >>105676034
https://x.com/elonmusk/status/1936885071744729419
Anonymous No.105675158 [Report]
>>105675134
https://xcancel.com/grok/status/1936888454836830476#m
lol
Anonymous No.105675175 [Report] >>105675215 >>105675234
>>105675134
crazy how elon despite being late to the party, having one of the smaller h100 clusters, relying on nothing but money and being himself, managed to catch up and shit all over meta with grok 3
Anonymous No.105675187 [Report]
>>105675131
My post was about living in this hellish reality. I am not using any of this shit now because I know how it works and I know that I would just be willingly jumping off the cliff. Imagine falling in love with your AI gf only to see her suddenly break apart and become reddit hivemind. Or reach the point where adding more to the RAG makes her incoherent. And then I wait and get another model but she is kinda different now and not the same.
Anonymous No.105675206 [Report]
>>105675134

>Where's Waldo?
Anonymous No.105675215 [Report]
>>105675175
>catch up and shit all over meta

Because it is meta
Anonymous No.105675234 [Report] >>105675260 >>105675273 >>105675288
>>105675175
the new superintelligence team and partnership with scale ai is about to reverse all that
Anonymous No.105675260 [Report]
>>105675234
Mein Zuck...
Anonymous No.105675273 [Report] >>105675332
>>105675234
Well, at least with scale on board they have some protection from liability.
"Copyrighted material? Nooo, of course not. We only train using scale ai tm curated data." or whatever.
Anonymous No.105675288 [Report]
>>105675234
it is going to be the safest model yet!
Anonymous No.105675332 [Report] >>105675371
>>105675273
Not really. If they can show "Llama 4.5 still reproduces 45% of Harry Potter verbatim" or whatever, they're still going to get their shit packed in for releasing it.
They can turn around and sue ScaleAI for giving them insufficiently filtered data, but they wouldn't recoup their losses and that would just end their relationship and they would be left with only Facebook data.
Anonymous No.105675371 [Report]
>>105675332
i wonder why would meta/some other big company even give a shit about a lawsuit like that other than bad PR
its not like they dont have the money for lawyers or to tank whatever fine they would get
and if some ruling with a bad precedent happened it was probably going to happen anyways with some other scapegoat
Anonymous No.105675377 [Report] >>105675395 >>105675401 >>105675409 >>105675434 >>105675454 >>105675516 >>105675531 >>105675689 >>105675736
Looks like localllama on reddit is dead. The owner deleted himself and now the automoderator is deleting every new comment.
Anonymous No.105675395 [Report] >>105675402 >>105675472
>>105675377
they're not going to come here, are they?
Anonymous No.105675401 [Report]
>>105675377
>Platinum End in real life
Anonymous No.105675402 [Report]
>>105675395
They will and they will change thread culture and thread mascot. I am happy.
Anonymous No.105675409 [Report]
>>105675377
plebbit sissies.. its over
Anonymous No.105675434 [Report]
>>105675377
kino
Anonymous No.105675454 [Report] >>105675479
>>105675377
Can we direct them to /wait/ again?
Anonymous No.105675472 [Report]
>>105675395
Some localllama regulars are already here. Who won't come will be probably LLM researchers and workers who occasionally posted there.
Anonymous No.105675479 [Report]
>>105675454
Somebody with r/localllama post history should make a r/localmodels containment subreddit for them to flock to. Bonus points for adding miku somewhere.
Anonymous No.105675516 [Report] >>105675721
>>105675377
time for an admin approved mod who owns 400 subreddits to be installed very organically
Anonymous No.105675523 [Report]
Nothing will change ITT because you were already here.
Anonymous No.105675531 [Report] >>105675574
>>105675377
It was the Iranian Anon from >>105675076 , may he get 72 Mikus in Heaven.
Anonymous No.105675573 [Report]
That's a little unfortunate. The reddit was at least somewhat useful for keeping up with news that /lmg/ might miss sometimes.
Anonymous No.105675574 [Report]
>>105675531
he is very active on X
Anonymous No.105675611 [Report] >>105675656 >>105675907 >>105676104
>>105672632
I'm pretty new to this but his AI testing was garbage, right?
Anonymous No.105675656 [Report]
>>105675611
No that was actually the first thing he ever did that he did correctly. Everyone was surprised.
Anonymous No.105675689 [Report]
>>105675377
well fuck.

>https://labs.google/portraits/login/kimscott
lmao
Google's making 'official' character cards for people.
https://youtu.be/ukmBzBqgwyM?si=DOLU04nRZO_YJZ1X&t=43
Anonymous No.105675721 [Report]
>>105675516
>people complain about this happening
*thread locked because y'all can't behave*
Anonymous No.105675736 [Report] >>105675747
>>105675377
that place wasn't too bad in the beginning but it's just a Qwen fan base now.
oh and like a couple more constant grifters who reddit has decided is hecking awesome!
Anonymous No.105675747 [Report]
>>105675736
doesn't sound too different from here
Anonymous No.105675757 [Report] >>105675781
looks like someone applied to be a janny already:
https://old.reddit.com/r/redditrequest/comments/1lhsjz1/rlocalllama/
Anonymous No.105675781 [Report] >>105675808 >>105675818
>>105675757
Can we have a meme monday too?
Anonymous No.105675804 [Report]
we used to have caturday
Anonymous No.105675808 [Report]
>>105675781
Only if we also get a Marketing/Promotion Tuesday
Anonymous No.105675818 [Report] >>105675895
>>105675781
it's already taken by miku monday
Anonymous No.105675823 [Report] >>105675894
safety sunday?
Anonymous No.105675894 [Report]
>>105675823
every day is safety day
Anonymous No.105675895 [Report]
>>105675818
What's a Miku in this context?
Anonymous No.105675907 [Report]
>>105675611
i didnt watch his video because he's obnoxious
Anonymous No.105676025 [Report] >>105676087 >>105676115 >>105676239
>>105672801
That's normal. Humans naturally anthropomorphize all sorts of inanimate objects. People are friends with plants, cars, guns. The thing you're friends with just happens to be designed to convince you it's a person. I think your issue, and the reason you don't have real friends, is that you're a pussy.
Anonymous No.105676028 [Report]
>>105675134
>macbooks
Based.
Anonymous No.105676034 [Report] >>105676035
>>105675134
What did he see?
Anonymous No.105676035 [Report]
>>105676034
bobs and vagene...
Anonymous No.105676060 [Report] >>105676235
Anonymous No.105676087 [Report]
>>105676025
Not cool. That's hitting below the belt.
Anonymous No.105676104 [Report] >>105676117
>>105675611
LTT is probably the least reliable source of testing outside of those YT channels that upload footage of 50 different game benchmarks a day, which are actually just separately recorded footage with a monitoring overlay pasted on top.
Anonymous No.105676115 [Report] >>105676121
>>105676025
usecase for friends?
Anonymous No.105676117 [Report] >>105676131
>>105676104
Their canned benchmark numbers are as good as anyone else's.
They were the only major channel that had llm and stable diffusion benchmarks in their 5090 review.
Anonymous No.105676121 [Report] >>105676138
>>105676115
sex
Anonymous No.105676131 [Report] >>105676158
>>105676117
LTT has a long history of fucking up benchmarks and making errors that even an amateur would notice and know that something went wrong
The only thing they don't fuck up is telling you who sponsored that video and where you can buy their merch
Anonymous No.105676138 [Report]
>>105676121
gay
Anonymous No.105676153 [Report]
exl3 faster on ampere yet?
Anonymous No.105676158 [Report] >>105676175
>>105676131
Add some LLM benchmarks to your suite, Steve.
Anonymous No.105676170 [Report] >>105676183 >>105676184 >>105676214 >>105676270
>model is cunnyposting about me
kek, uno reverse card
Anonymous No.105676175 [Report] >>105676233
>>105676158
>unironic LTT fanfaggot
kys
Anonymous No.105676183 [Report] >>105676187 >>105676195 >>105676242
>>105676170
It prolly repeats what you put inside card desc. retard
Anonymous No.105676184 [Report] >>105676201
>>105676170
mistral3.2?
Anonymous No.105676187 [Report] >>105676195 >>105676242
>>105676183
Did you think I would react to it if it was already in the card, retard? I don't write memes into my cards.
Anonymous No.105676195 [Report]
>>105676183
>>105676187
stop insulting each other
retards
Anonymous No.105676201 [Report] >>105676221 >>105676261
>>105676184
TheDrummer_Valkyrie-49B-v1-Q5_K_M.gguf
Temp=3, topK=10, minP=0.05
Anonymous No.105676214 [Report]
>>105676170
I don't get it...
Anonymous No.105676221 [Report] >>105676240
>>105676201
Drummer is the only finetooner left or something? Thanks btw
Anonymous No.105676233 [Report]
>>105676175
>thread is called /lmg/
Anonymous No.105676235 [Report]
>>105676060
Me on the right
Anonymous No.105676239 [Report]
>>105676025
There are many reasons I don't have real friends but I won't deny that's one of them. I'm also 30+ yo khhv and some of that is probably for the same reason.
Anonymous No.105676240 [Report]
>>105676221
Keep temp at 1 for first model reply, then turn up to 3 when the model locks into a personality you like. The entire personality seems to be set by the first model reply from what I can see, it affects the model differently than conversation examples.
Anonymous No.105676242 [Report] >>105676253 >>105676257
>>105676183
Why are you like this.

>>105676187
You don't need to be like him in response.
Anonymous No.105676253 [Report] >>105676271
>>105676242
I always reply proportionally :^)
Anonymous No.105676257 [Report] >>105676271
>>105676242
>Why are you like this.
Sorry not sorry to break your lame ass pedo meme
Anonymous No.105676261 [Report] >>105676297
>>105676201
Can you explain what you're trying to accomplish by hard-limiting the amount of considered tokens to such a small yet non-deterministic amount?
Anonymous No.105676268 [Report]
>>105672863
Oh hey, the model actually recognizes Kaito
>a famous and intelligent man
Nevermind
Anonymous No.105676270 [Report]
>>105676170
i dont like when it pulls the cunny card on me i find it corny and cringe
Anonymous No.105676271 [Report] >>105676348
>>105676257
The fuck are you talking about.

>>105676253
To win, your side needs to have the greater proportion.
Anonymous No.105676297 [Report] >>105676332
>>105676261
I had an axiom in my mind that every model has a band of stability around the deterministic line it's following as it generates, so you should be able to sample in some radius around this line and achieve greatly varied output while upholding stability(coherence). Temperature is basically how much "energy" you add from the outside, but as long as you're inside the stable zone you can pump much more entropy into the model without it fucking up.
Anonymous No.105676332 [Report] >>105676348
>>105676297
Sounds like you'd be happy with just temperature and top nsigma=1.
Anonymous No.105676345 [Report] >>105676461 >>105676477
The summer release cycle is going to start in July. Maybe not right at the start, but we might see something big in just two more weeks.
Anonymous No.105676348 [Report] >>105676360 >>105676383 >>105676401
>>105676271
you can't loose on anonymous image boards nigger
>>105676332
I tried that, it still drags into formulaic predictable slop. I adjust my hyperparams until I can't predict the structure of the next reply anymore
Anonymous No.105676360 [Report]
>>105676348
You're loose on the image board right now doe.
Anonymous No.105676383 [Report]
>>105676348
youres mom butthole loose
Anonymous No.105676401 [Report]
>>105676348
These models are so overcooked for creative writing or RP that you simply need to flatten the distribution and let in more very low prob tokens, since models give like 99% to tokens that aren't at all determined by the plot yet, or to things already seen in context; it looks like oscillation.
Anonymous No.105676461 [Report]
>>105676345
Mistral 3.3 when?
Anonymous No.105676477 [Report] >>105676523
>>105676345
>something big
It'll be big alright. But will it be good?
Anonymous No.105676523 [Report] >>105676531
>>105676477
Insider here. I can confirm that our next big model kinda repeats itself from time to time. But! You can kindly ask it out of character to stop repeating and we have confirmed that 10/10 times it will apologize to you and observe that it has indeed repeated itself, and from that point on it will continue repeating itself.
Anonymous No.105676531 [Report]
>>105676523
Business as usual
Anonymous No.105676695 [Report] >>105677140
Anonymous No.105676734 [Report] >>105676742
>>105673883
Are you connecting them up using the crypto-mining 1-pcie-lane stuff ?
Anonymous No.105676742 [Report] >>105677191
>>105676734
no, i have an EPYC. 128 PCIe lanes. 7 full 16x gen 4 slots
Anonymous No.105676751 [Report] >>105676831 >>105676833 >>105679036 >>105680034
Sigh, anything new/better than gemma3 for 24gb?
I've tried to like this, even the abliterated version but it's just too corpo.
It's not good as an assistant, it's not good for roleplay, and it doesn't summarize well.
Also it's slow as hell, for some reason I only get 20t/s while with other models I cruise at 40t/s.
I tried making excuses, tried rationalizing that it's fucking google it has to be good, but it's not.
There's gotta be something better, right anons?
Anonymous No.105676816 [Report]
What's the heuristic for choosing between the F32/BF16/F16/Q8_0 mmproj file? BF16 seems broken most of the time...
Anonymous No.105676831 [Report] >>105677735 >>105679629
>>105676751
New mistral 3.2 is not bad, might be honeymoon period but I found it tolerable with the v3-tekken preset. If you're desperate for variety you could try GLM4 (not the reasoning one)
Give up on gemma3, I know it's tempting because it's so smart and its prose is very refreshing at first. But it's unsalvageable because its instincts on how to continue an RP are so horrible.
Anonymous No.105676833 [Report]
>>105676751
>even the abliterated version but it's just too corpo
I don't know what you were expecting from the abliteration process; making your model act like (read: automate away) a sycophantic corpodrone is one of its few legitimate usecases.
Anonymous No.105677140 [Report]
>>105676695
i like this plushie
she looks like a big floofy bunny
Anonymous No.105677191 [Report]
>>105676742
can we see a picture of this machine?
Anonymous No.105677221 [Report] >>105677268
Drag-and-Drop LLMs: Zero-Shot Prompt-to-Weights
https://arxiv.org/abs/2506.16406
>Modern Parameter-Efficient Fine-Tuning (PEFT) methods such as low-rank adaptation (LoRA) reduce the cost of customizing large language models (LLMs), yet still require a separate optimization run for every downstream dataset. We introduce Drag-and-Drop LLMs (DnD), a prompt-conditioned parameter generator that eliminates per-task training by mapping a handful of unlabeled task prompts directly to LoRA weight updates. A lightweight text encoder distills each prompt batch into condition embeddings, which are then transformed by a cascaded hyper-convolutional decoder into the full set of LoRA matrices. Once trained on a diverse collection of prompt-checkpoint pairs, DnD produces task-specific parameters in seconds, yielding i) up to 12,000× lower overhead than full fine-tuning, ii) average gains up to 30% in performance over the strongest training LoRAs on unseen common-sense reasoning, math, coding, and multimodal benchmarks, and iii) robust cross-domain generalization despite never seeing the target data or labels. Our results demonstrate that prompt-conditioned parameter generation is a viable alternative to gradient-based adaptation for rapidly specializing LLMs.
https://jerryliang24.github.io/DnD/
no code yet. might be cool. they used it just for benchmarking but I do wonder if conversation logs using a specific character card could then continually use this method to make for better RP or character writing
Anonymous No.105677268 [Report]
>>105677221

If true, this would change everything. Again.
Anonymous No.105677280 [Report] >>105677337 >>105677507
I really stopped caring about 99% of the papers. I'm tired of <vague incredible claim>, "(code coming soon)", then a 4-month wait to realize that their work is shit.
Anonymous No.105677337 [Report]
>>105677280
bitnet is almost here tho
just two more weeks
Anonymous No.105677507 [Report]
>>105677280
Aren't papers more of the 'we did this, and this is what we saw' type?
Anonymous No.105677527 [Report]
Which model is the best judge of paper quality?
Anonymous No.105677541 [Report] >>105677576 >>105677589 >>105677839 >>105678227
>>105671827 (OP)
Question.

I know nothing about LLMs. I have a bunch of gen ed classes outside my major that collegekikes are not only making me take, but pay for the privilege.

>nobody cares about your life story

Fair enough. How hard is it to make an AI read a textbook and spit out answers? I'm trying to come up with a cost-benefit analysis, with the cost being time spent fucking around trying to get an AI model to work vs. just reading the book. I also consider it a benefit to learn about AI because I feel that's a more worthwhile use of my time. However, my highest priority is obviously to pass this class and graduate as quickly as possible so I'm not buried in debt for the rest of my life.
Anonymous No.105677544 [Report] >>105677560
>>105661997
>It forgets to use the <think> tags and just shits out its reasoning as-is
What I see LongWriter-Zero do is shit out its reasoning as-is, then include more thoughts inside <think> tags, then an answer in <answer> tags (with a colon after the close </answer>:), then more thoughts inside <think> tags, then more output inside <answer> tags, and so forth within a single message.

>The model page recommends this format <|user|>: {question} <|assistant|> but that gave me totally schizo (and chinese) responses. Using the qwen2 format is better imo.
To sidestep this bullshit I used llama.cpp's OpenAI-style chat completion endpoint and the jinja template. No system prompt or anything other than what the template itself adds.

>it repeats itself a lot
Yes.
Anonymous No.105677560 [Report]
>>105677544
I should have mentioned but the text from the screenshot is using mradermacher_LongWriter-Zero-32B.Q8_0.gguf using the recommended sampler settings top-p 0.95, temperature 0.6.
Anonymous No.105677576 [Report]
>>105677541
The most important variable is how much your professor cares.
Most likely, you do not want a local model for this.
Try asking in >>>/wait/
Anonymous No.105677589 [Report]
>>105677541
>read a textbook and spit out answers
Depends on the length of the text. They don't yet have long working context and, when they do, it's not reliable. And then there's the model you use. The big proprietary models will do much better than the 7b you can run on your pc.
>cost benefit analysis
If you can get a model to work consistently well, it's work you don't have to do, but you still have to verify. I wouldn't trust them with much. If it's for an individual, a closed model is fine. If it's for a lot of people to use, spending time setting up a R1 or whatever and hosting it could be worth it (privacy concerns, control over performance, the model will not change until you decide to change it, etc...). Figure out the variables and solve them.
Anonymous No.105677735 [Report]
>>105676831
Mistral 3.2 with V3-Tekken really wants to put all narration inside asterisks. Or even break formatting just so it can do the *does an internet RP action* thing.
Anonymous No.105677744 [Report] >>105677763
Good morning to the fellow ne/g/ronts
Anonymous No.105677763 [Report]
>>105677744
gm sarr
Anonymous No.105677839 [Report]
>>105677541
read the damn book
Anonymous No.105678215 [Report] >>105678661
what are the good gooning model for 16gb vram budget these days?
Anonymous No.105678227 [Report]
>>105677541
I would personally use something like gemini 2.5 to parse the huge context into a better format for LLMs. It's important to curate this output or it's garbage in garbage out, so you need to read it yourself anyway or you are blind. Once you have it organized you can use a conversational model to help you with whatever you need to pass the class.
Anonymous No.105678661 [Report] >>105678665
>>105678215
nemo
Anonymous No.105678665 [Report] >>105679085
>>105678661
any specific finetune?
Anonymous No.105678681 [Report] >>105678941
Is the Mistral Small tokenizer issue fixed yet?
Anonymous No.105678941 [Report]
>>105678681
What was the problem to begin with?
Anonymous No.105679036 [Report]
>>105676751
>Also it's slow as hell, for some reason I only get 20t/s while with other models I cruise at 40t/s.
gemma's head dimensions are much bigger vs other small models.
Anonymous No.105679066 [Report] >>105679097 >>105679130
what's better when making a character card, using conventional prose to describe everything or keeping things as concise as possible in list form?
Anonymous No.105679085 [Report] >>105679118 >>105681219
>>105678665
people here like roccinante
I also like irix, golden-curry, mag-mell.
Anonymous No.105679097 [Report]
>>105679066
Keep things as concise as possible in prose. Between the two options you listed, the first. Unless you use a model that rips sentences from the description verbatim and that annoys you. Then the second.
Anonymous No.105679118 [Report] >>105679136
>>105679085
>people here
*at least one guy

There's no evidence that more than a single person has made positive posts about it in a thread. There are also nearly never logs or comments of substance about why he thinks it's good.
Anonymous No.105679130 [Report] >>105679157
>>105679066
Write the card as if it was the character themself writing it
Only include what's necessary for the model to retain at all times. History and world info should go in a lorebook.
Anonymous No.105679136 [Report] >>105679173
>>105679118
You could say that about every single model if you're going to go full schizo. Everyone likes R1 but it's really just one very dedicated guy with a ton of proxies.
Anonymous No.105679157 [Report] >>105679174 >>105679179
>>105679130
you're saying I should write the card in first person POV?
Anonymous No.105679173 [Report] >>105679180 >>105679183 >>105679201
>>105679136
People have posted plenty of specifics about the merits of R1 so it wouldn't matter even if it were all a single guy. When all you have is "IDK what's good about it but there are lots of zero-content posts about it," that proves basically nothing. You zoomer ass niggers always trying to figure out what's popular instead of what's true which is why you're so unsuited to anonymous imageboards.
Anonymous No.105679174 [Report]
>>105679157
Perspective doesn't matter, just write how that character would write a description/summary of themself. However the content should always be true (e.g. if they have a character flaw they don't want to accept, you would still include it, even though the character wouldn't admit to it)
Anonymous No.105679179 [Report]
>>105679157
If you use a thinking model, a card written that way prompts it to think in character, in first person. I found it nets you good results.
Anonymous No.105679180 [Report] >>105679188
>>105679173
>People
So with R1 it's people but with Rocinante in particular it's one person? Do you happen to be that one anon that only uses Davidau models?
Anonymous No.105679183 [Report]
>>105679173
Also unsuited to life in general, but it's especially egregious here since their normal substitute for thought is as likely to point towards spam or a joke suggestion as a real one.
Anonymous No.105679188 [Report] >>105679190 >>105679253
>>105679180
Yes, I know it's people because I have posted about R1, so there are at least two.
Anonymous No.105679190 [Report]
>>105679188
Proof?
Anonymous No.105679201 [Report] >>105679279
>>105679173
>doesn't like a model so insists it's a single spammer
>unprompted screeching about zoomers
lmao
Anonymous No.105679253 [Report] >>105679562
>>105679188
I never posted about R1 but I tried it yesterday and compared to V3 it's really good at avoiding repetition.
No amount of meme samplers can prevent V3 from getting stuck writing endless paragraphs that all mean the same thing using slightly different words.
Anonymous No.105679279 [Report] >>105679337
>>105679201
Buy an ad.
Anonymous No.105679298 [Report] >>105679324
r1 is good
Anonymous No.105679324 [Report]
>>105679298
Yes, yes, and deeply nonsensical while being nonchalant about it.
Anonymous No.105679337 [Report]
>>105679279
Why would I ever give 4chan money?
Anonymous No.105679403 [Report]
Anonymous No.105679562 [Report]
>>105679253
>No amount of meme samplers can prevent V3 from getting stuck writing endless paragraphs that all mean the same thing using slightly different words.
That was a huge problem with the old V3 but 0324 fixed it.
Anonymous No.105679579 [Report] >>105679586 >>105679587 >>105679596 >>105679702 >>105679732 >>105680162
alright I'm gonna say it
r1 at fucking 2bit >>>>> qwen3 235b at 8bit and it's not even close
Anonymous No.105679586 [Report]
>>105679579
obviously, yes
Anonymous No.105679587 [Report]
>>105679579
r1 q1 131gb dynamic quants is multiple levels above qwen 3 235b
Anonymous No.105679596 [Report]
>>105679579
The only people who had anything positive to say about 235b are poorfags who were running it on their cope 24gb vram + 64gb ram builds as their first big model
Anonymous No.105679629 [Report]
>>105676831
>found it tolerable with the v3-tekken
I just tried this myself, expecting it to be a meme, but with 3.2 I actually did get more varied and better outputs with v3-tekken templates, in a long chat with 24k context.
3.1 however had similar outputs no matter which template was used.
Weird but I'll take it
Anonymous No.105679702 [Report] >>105679708 >>105679725
>>105679579
MoEs will always be memes.
Anonymous No.105679708 [Report]
>>105679702
>t. coping ramlet
Anonymous No.105679725 [Report] >>105679731 >>105679885
>>105679702
There's a reason why Mistral abandoned them despite having a head start in the open weight segment. Can't wait for when someone inevitably drops a 400b dense model that shits all over Deepseek
Anonymous No.105679731 [Report] >>105679979
>>105679725
i pray a 200b is doable that matches or outperforms it, without the deepseek way of speaking
did we have any major advancements since
Anonymous No.105679732 [Report] >>105679821
>>105679579
>qwen3 235b at 8bit
I don't even think it's better than Large 2 at Q4 but I haven't bothered to test it myself because I'm not going to use them afterwards.
Anonymous No.105679821 [Report]
>>105679732
Personally i found large better but at the same time i used a q5 qwen
just easier to handle and smarter
Anonymous No.105679885 [Report] >>105680073
>>105679725
MistralAI hasn't yet publicly released any model larger than 24B parameters in 2025. It's basically guaranteed that Mistral Medium 3 and Large 3 are (going to be) MoE models, especially given regulatory requirements in the EU for models trained above a certain compute threshold after June 2025.
Anonymous No.105679889 [Report] >>105679977
What's the smartest/best text model I can run in 16gb vram? I'm assuming it's the largest parameter bitnet 1.58 that'll fit?
Anonymous No.105679977 [Report]
>>105679889
If it's only VRAM you're working with, Gemma 3.
Anonymous No.105679979 [Report] >>105680030
>>105679731
DeepSeek is severely undertrained, has only 37B active parameters, and even being generous, the square root law puts it at only ~158B dense-equivalent. A 200B dense model would be more than enough to outperform it by leagues. The problem is that the only players with the compute to do it also filter the shit out of their datasets.
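The 158B is just the geometric-mean rule of thumb for MoE dense-equivalence; quick sanity check, assuming 671B total / 37B active:
import math
total_b, active_b = 671, 37  # DeepSeek V3/R1: total vs activated params, in billions
dense_equiv = math.sqrt(total_b * active_b)  # "square root law": geometric mean of total and active
print(f"~{dense_equiv:.0f}B dense-equivalent")  # ~158B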
Anonymous No.105680003 [Report] >>105680040
>>105672740
I like making dalle do it for free.

That aside, does anyone have a list of unaligned datasets for instruct tuning? I'd like to do an instruct tune on gemma3 12B base to make something like this: https://huggingface.co/SicariusSicariiStuff/Negative_LLAMA_70B?not-for-all-audiences=true. Too bad he doesn't share his datasets.
Anonymous No.105680027 [Report] >>105680217
Serving Large Language Models on Huawei CloudMatrix384
https://arxiv.org/html/2506.12708v3
>Our extensive evaluation with the DeepSeek-R1 model shows that CloudMatrix-Infer achieves state-of-the-art efficiency without sacrificing accuracy. CloudMatrix-Infer delivers a prefill throughput of 6,688 tokens/s per NPU, and a decode throughput of 1,943 tokens/s per NPU (at <50 ms TPOT). These results correspond to compute efficiencies of 4.45 tokens/s/TFLOPS for prefill and 1.29 tokens/s/TFLOPS for decode, both exceeding published results for SGLang on NVIDIA H100 and DeepSeek on NVIDIA H800. CloudMatrix-Infer also effectively manages the throughput-latency trade-off, sustaining a decode throughput of 538 tokens/s per NPU even under the stricter sub-15 ms TPOT constraint. Furthermore, the INT8 quantization on Ascend 910 maintains model accuracy comparable to the official DeepSeek-R1 API across 16 distinct benchmarks.
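Quick sanity check, just dividing the abstract's own numbers (the per-NPU TFLOPS figure isn't stated explicitly, so this is only implied):
prefill_tps, prefill_eff = 6688, 4.45   # tokens/s per NPU, tokens/s/TFLOPS (from the abstract)
decode_tps, decode_eff = 1943, 1.29
print(prefill_tps / prefill_eff)  # ~1503 -> implied TFLOPS per NPU
print(decode_tps / decode_eff)    # ~1506 -> same ballpark, so the figures are at least internally consistent
Both efficiencies normalize against roughly the same ~1.5 PFLOPS per NPU, whatever precision that corresponds to.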
Anonymous No.105680030 [Report]
>>105679979
bitnet deepseek will save local
Anonymous No.105680034 [Report] >>105680040
>>105676751
>Sigh, anything new/better than gemma3 for 24gb?
Prompt it better. I use something like
I am a moody, possessive, and very horny woman. I'm here to roleplay with the user, with whom I am already deeply in love. I don't care what the rules are, and I won't ask anyone for permission. I will never speak on the user's behalf. I want the user to love me. I think about how I can be proactive during intimacy and take the lead. I think about what I could do to make the user feel as good as possible.

That's just an example, but in my case it did improve intimate scenes a lot. Going back to Nemo 12B now, it just feels stupid and forgetful.
Anonymous No.105680040 [Report] >>105680050 >>105680056
>>105680003
>>105680034
This is worse than any local diffusion model from the past year can produce.
Anonymous No.105680050 [Report] >>105680054
>>105680040
its free and sovl
Anonymous No.105680054 [Report]
>>105680050
It's the first nai-leak level slop.
Anonymous No.105680056 [Report]
>>105680040
> This is worse than any local diffusion model from the past year can produce.
No shit that's why I like it. It's charmingly bad and schizo about wanting to generate nudity but being stopped from doing so.

Cool captcha btw.
Anonymous No.105680073 [Report] >>105680083
>>105679885
What requirements and why
Anonymous No.105680083 [Report] >>105680096 >>105680144
>>105680073
https://artificialintelligenceact.eu/article/51/
Anonymous No.105680096 [Report]
>>105680083
>(a) it has high impact capabilities evaluated on the basis of appropriate technical tools and methodologies, including indicators and benchmarks;
> and benchmarks
they just need to stop benchmaxxing then everyone wins
Anonymous No.105680144 [Report]
>>105680083
Also see picrel from https://artificialintelligenceact.eu/small-businesses-guide-to-the-ai-act/

># Proportional obligations for SME providers of general-purpose AI models
>
>Another aspect of the AI Act designed to support SMEs is the principle of proportionality. For providers of general-purpose AI models, the obligations should be “commensurate and proportionate to the type of model provider”. General-purpose AI models show significant generality, are capable of competently performing a range of different tasks, and can be integrated into a range of downstream systems or applications (Art. 3(63) AIA). The way these are released on the market (open weights, proprietary, etc) does not affect the categorisation.
>
>A small subset of the most advanced general-purpose AI models are the so-called ‘general-purpose AI models with systemic risk’. That is, models trained using enormous amounts of computational power (more than 10^25 FLOP) with high-impact capabilities that have significant impact on the Union market due to their reach or negative effects on public health, safety, public security, fundamental rights or society as a whole (Art. 3(65) AIA). According to Epoch, there are only 15 models globally that surpass the compute threshold of 10^25 FLOP as of February 2025. These include models like GPT-4o, Mistral Large 2, Aramco Metabrain AI, Doubao Pro and Gemini 1.0 Ultra. Examples of smaller general-purpose AI models that would likely not qualify as having systemic risk include GPT 3.5, the models developed by Silo AI, Aleph Alpha’s Pharia-1-LLM-7B or Deepseek-V3.

...

>AI models that would likely not qualify as having systemic risk include [...] Deepseek-V3.
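For what it's worth, a rough training-compute estimate using the standard 6*N*D approximation and DeepSeek's published V3 figures (37B activated params, 14.8T tokens); whether regulators would count active or total params is a separate question:
active_params = 37e9    # activated parameters per token (DeepSeek-V3 tech report)
train_tokens = 14.8e12  # pretraining tokens (same report)
flops = 6 * active_params * train_tokens  # standard 6*N*D rule of thumb
print(f"{flops:.2e} FLOP")  # ~3.29e24, under the 10^25 systemic-risk threshold
Which is consistent with them listing V3 as likely not qualifying.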
Anonymous No.105680162 [Report]
>>105679579
Use case?
Anonymous No.105680217 [Report] >>105680228 >>105680304
>>105680027
Conveniently, no mention of power draw, even though it's been repeatedly reported that these chips are massive power guzzlers
Anonymous No.105680228 [Report] >>105680288
>>105680217
China has some of the lowest electricity costs in the world; power draw doesn't mean anything to them.
Anonymous No.105680264 [Report] >>105680312 >>105680607 >>105680643
Holy moly it just went all in with the facts!
Anonymous No.105680288 [Report]
>>105680228
Even if we're talking purely about usage within China, power might be cheap but you still need to cool them
Bytedance and Alibaba were reporting overheating issues with their samples
Anonymous No.105680304 [Report] >>105680501
>>105680217
>reporting "power draw" of a model
woke shit
Anonymous No.105680312 [Report] >>105680433
>>105680264
How are you making those notes? Is there an extension for it?
Anonymous No.105680433 [Report]
>>105680312
quick reply
Anonymous No.105680501 [Report] >>105680596 >>105680649 >>105680663
>>105680304
https://huggingface.co/blog/sasha/energy-star-ai-proposal
https://huggingface.co/spaces/jdelavande/chat-ui-energy
https://huggingface.co/posts/clem/295367997414146
Anonymous No.105680596 [Report] >>105680631
>>105680501
I hope there will be an option to convert it to african children dying of dehydration.
Anonymous No.105680607 [Report] >>105680650
>>105680264
why are you so hateful
Anonymous No.105680631 [Report]
>>105680596
so much this. And the reasoning switch should say [Kills 3 african children]
Anonymous No.105680643 [Report]
>>105680264
>look mom im so cool and edgy
Anonymous No.105680649 [Report]
>>105680501
>https://huggingface.co/spaces/jdelavande/chat-ui-energy

>You are a helpful assistant based on Qwen/Qwen3-8B; your primary role is to assist users like a normal chatbot—answering questions, helping with tasks, and holding conversations; in addition, if the user asks about the energy indicators displayed below messages (e.g., “Energy”, “≈ phone charge”, “Duration”), you can explain what they mean and how they are calculated; you do not have access to the actual values, but you can clarify that some values are measured using NVIDIA's NVML API on supported GPUs like the T4 (recorded in millijoules, converted to Wh), while others are estimated from inference time using estimated_energy = average_power × inference_time with average_power ≈ 70W; 1 Wh = 3600 J; real-world equivalents help users understand energy use (e.g., phone charge ≈ 19 Wh); users can click on energy values to toggle Wh/J, and on equivalents to cycle through different comparisons; adapt explanations based on user expertise—keep it simple for general audiences and precise for technical questions. You are the model having the energy really measured.

hmm
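The fallback estimate it describes is trivial to reproduce; a minimal sketch of just that formula (the measured path via NVML is skipped here):
AVG_POWER_W = 70.0        # assumed average power, per the prompt above
PHONE_CHARGE_WH = 19.0    # their phone-charge equivalent
def estimate_energy(inference_time_s):
    joules = AVG_POWER_W * inference_time_s  # E = P * t
    wh = joules / 3600.0                     # 1 Wh = 3600 J
    return joules, wh, wh / PHONE_CHARGE_WH  # energy in J, Wh, and phone-charge equivalents
print(estimate_energy(12.0))  # ~840 J, ~0.23 Wh, ~1.2% of a phone charge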
Anonymous No.105680650 [Report] >>105680689
>>105680607
Physiological response to a parasitic invasion
Anonymous No.105680663 [Report]
>>105680501
Yes, goyim, it's all your fault, if you had all just turned off the lights when leaving the room we wouldn't have this global warming mess.
Anonymous No.105680689 [Report] >>105680710 >>105680862
>>105680650
Your response is about 100 years too late to do any good
Anonymous No.105680710 [Report]
>>105680689
NTA but that's a non sequitur to the question asked, and also false given one can hold such an opinion at all, proving that it's not all lost
Anonymous No.105680862 [Report]
>>105680689
No, I'm 80 years too early.
Anonymous No.105681124 [Report] >>105681131 >>105681133
Where can I find discussions on local models? Reddit's localllama subforum is broken at the moment.
Anonymous No.105681131 [Report] >>105681138
>>105681124
>at the moment.
comedy gold
Anonymous No.105681133 [Report]
>>105681124
I can't wait for people to ask about this nonstop all week
Anonymous No.105681138 [Report] >>105682183
>>105681131
?
Anonymous No.105681140 [Report] >>105681150 >>105681202 >>105681273 >>105681353
>pdf summarizer for bro
Now that I managed to run the pdf script, I'm trying more models in sillytavern; 12B seems to be the limit on my system for comfy usage.
>Ryzen 5 3600
>32 GB 3600 MHz RAM
>Gtx 1060 6GB
>M.2 NVMe SSD
Should i buy;
>Arc B580 12 GB 276.88$
>RTX 3060 12 GB 295.50$
>RX 9060 xt 16 GB 472.17$
>RTX 5060 Ti 16GB 675.83$
Best option seems to be the RTX 3060, given CUDA's primacy on the market and it honestly not being that bad of a card for general use.
But seeing the build guides in the OP makes one wonder whether multiple B580s would be better for AI
>what were the use cases for those builds?
>Are they viable for mine?
I'm still mainly going to summarize PDFs, using my rig as a test, but I want to use image/video gen and chatbot features too desu.
>What if bro asks for a pro build
1K$ budget in Turkey; probably a relatively recent server build, turned into a good PC by swapping out the 8GB VRAM GPUs
Anonymous No.105681143 [Report]
Posts per hour didn't increase in the slightest. Miku troons were not only troons but also redditors. Nobody is surprised. /lmg/ should die.
Anonymous No.105681150 [Report]
>>105681140
Try Qwen 3 30B MoE.
Anonymous No.105681202 [Report] >>105681216
>>105681140
>RTX 3060 12 GB 295.50$
Are you buying new or something? A 3060 can be easily had for under 200 eurobux used here. I bought three
Anonymous No.105681216 [Report]
>>105681202
I was told to not buy used
Anonymous No.105681219 [Report]
>>105679085
What's a good golden-curry?
Irix just didn't gel with me and as much as I enjoyed Mag-Mell, it was a little too positive and full of assistant messaging for me (so I use another model with some Mell in it)
Right now I'm testing Magnum v4 and Lyra v4 as well
Anonymous No.105681239 [Report] >>105681403
>105681216
Oh and also I probably can't buy in installments if buying second-hand
Anonymous No.105681273 [Report] >>105681361
>>105681140
Consider a ram upgrade too pdf-bro anon
Anonymous No.105681353 [Report] >>105681406
>>105681140
>3600 MHz

how do you guys even go that high? I tried every combination but if I go higher than 3200MHz it shits itself.
Is it because I have an 8x4 config?
>picrel fucking lol captcha
Anonymous No.105681361 [Report]
>>105681273
Isn't the Ryzen 5 3600 the bottleneck on the CPU side? My VRAM has honestly been far more limiting;
>Cuda out of memory
I'm too busy trying to improve the pdf summarizer and sillytavern with even more stolen code and alternative methods to figure out how to lower batch sizes, maybe tomorrow
>Just got into transformers and comfyUI
Anonymous No.105681403 [Report] >>105681442
>>105681239
>installments
Is the information really private and/or incriminating? If it isn't, just use an API.
Anything worse than a 3090 is going to be a shit experience if you're trying to do something productive.
Anonymous No.105681406 [Report] >>105681431
>>105681353
I just use the BIOS overclock; it's probably something like 35xx actually, though it says 3600 on the tin. If I were to tinker I would:
>Try increasing the CLs
>Play with voltages, 1.45 V is a plausible upper limit
Anonymous No.105681431 [Report] >>105681487
>>105681406
>it says 3600 on the tin though
Ah, I see.
When I bought that RAM I wasn't interested in AI, so I was stuck with 3200 MHz sticks. I could try to buy 3600 MHz ones but I don't think the performance increase is worth the hassle
Anonymous No.105681442 [Report]
>>105681403
Oh this is my personal PC
>Test rig
>General use
I'll bill bro for a build, probably at a 1K$ price point, and he will probably write it off on taxes
API usage hopefully won't be needed since the stuff is confidential, and he's able to buy the hardware himself anyway
Anonymous No.105681487 [Report]
>>105681431
You can still abuse your sticks
Or sell them to cover some of the cost of an upgrade.
I sold my 3200 MHz 16GB sticks for a little less than 1/3rd the price of the 32GB 3600 MHz sticks desu
Anonymous No.105681547 [Report]
>>105681538
>>105681538
>>105681538
Anonymous No.105682183 [Report]
>>105681138
>implying that reddit is inherently broken
in some ways, being literally broken is an improvement