
Thread 106575202

362 posts 80 images /g/
Anonymous No.106575202 [Report] >>106575851 >>106576471 >>106576690 >>106578711 >>106579905 >>106581987
/lmg/ - Local Models General
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106566836 & >>106559371

►News
>(09/13) Ling & Ring 16B-A1.4B released: https://hf.co/inclusionAI/Ring-mini-2.0
>(09/11) Qwen3-Next-80B-A3B released: https://hf.co/collections/Qwen/qwen3-next-68c25fd6838e585db8eeea9d
>(09/11) ERNIE-4.5-21B-A3B-Thinking released: https://hf.co/baidu/ERNIE-4.5-21B-A3B-Thinking
>(09/09) K2 Think (no relation) 32B released: https://hf.co/LLM360/K2-Think
>(09/08) OneCAT-3B, unified multimodal decoder-only model released: https://onecat-ai.github.io
>(09/08) IndexTTS2 released: https://hf.co/IndexTeam/IndexTTS-2

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
Anonymous No.106575209 [Report]
►Recent Highlights from the Previous Thread: >>106566836

--Differential privacy: VaultGemma's memorization reduction vs generalization:
>106566944 >106567707 >106568235 >106568661 >106568688 >106568747 >106568803 >106568808 >106568337 >106568567
--Qwen3-Next quantization and GPU deterministic inference challenges:
>106573151 >106573171 >106573199 >106573224 >106573235 >106573226 >106573279 >106573379 >106573425 >106573441 >106573467 >106573519 >106573610 >106573660
--1.7B open-sourced model achieves document OCR success with minor errors:
>106570867 >106570892 >106571715 >106570901 >106570943 >106572018 >106572081 >106572287 >106575181
--Balancing GPU driver updates for software support vs power efficiency and stability:
>106572592 >106572637 >106572669 >106572729
--ASML and Mistral AI form €1.3 billion strategic partnership:
>106574819 >106574857 >106574864 >106574900
--Challenges in domain-specific knowledge teaching with LoRA and summarized information:
>106568875 >106568949
--vllm's broken GGUF and CPU support issues:
>106569268 >106569331 >106569356 >106569357 >106569385 >106569553 >106569630 >106569666 >106569594
--Feasibility challenges for AI-generated game chat with video input:
>106569817 >106569839 >106569869 >106570004 >106570036 >106569923 >106569955 >106570369 >106570480
--Kimi K2's delusion encouragement performance:
>106570964 >106571077 >106571090 >106571099 >106571105
--Skepticism about K2's 32B matching GPT-4 capabilities:
>106567118 >106567806 >106568369
--Qwen 80B testing performance and comparison to larger models:
>106568659 >106568674
--Kioxia and Nvidia developing 100x faster AI SSDs for 2027:
>106569299
--vllm vs llama.cpp performance benchmarks with Qwen 4B model:
>106570266
--Miku (free space):
>106567977 >106568645 >106569488 >106571835 >106571849 >106571853 >106571856 >106571961 >106572139 >106573324

►Recent Highlight Posts from the Previous Thread: >>106566844

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
Anonymous No.106575274 [Report] >>106575284 >>106575295 >>106575321 >>106575338 >>106575374 >>106575394 >>106575517 >>106576562
glm is for schizos, finetroons for drooling retards, mistral models break past 1k tokens and this thread is filled with people who have no idea what they're talking about but are happy if they can oneshot prompt some of the most disgusting ERP known to man
Anonymous No.106575281 [Report]
Anything released within the last week gguf status?
Anonymous No.106575284 [Report]
>>106575274
Yes... And?
Anonymous No.106575295 [Report]
>>106575274
You'll elevate the thread, right?
Anonymous No.106575321 [Report]
>>106575274
>people are doing something I don't like, that doesn't involve a singular other human being, entirely in the privacy of their own home, REEEEE
Grow the fuck up.
Anonymous No.106575338 [Report]
>>106575274
Your temperature is set way too high, anon.
Anonymous No.106575374 [Report]
>>106575274
gr8 b8 r8 it 8/8 m8
Anonymous No.106575394 [Report] >>106575421 >>106575451 >>106577646
>>106575274
I don't like that you can make a post like this and if you get called out, you just claim it was a shitpost all along. Where's the accountability?
Anonymous No.106575413 [Report] >>106575452 >>106575465 >>106575474 >>106576094
>goofs
I'm an old man out of the loop, the fuck does this even mean?
Anonymous No.106575420 [Report] >>106575668
>>106574177
>On linux I get over 50tk/s
wait what? how do you get over 50t/s on GLM air? I am a linux user and i only get around 20t/s on my dual 4090 setup
Anonymous No.106575421 [Report]
>>106575394
Anonymous No.106575451 [Report] >>106575537
>>106575394
We need poster ID to find out what these kinda guys are up to. Probably shilling their latest enterprise jeetRAG solutions.
Anonymous No.106575452 [Report]
>>106575413
goof is a synonym for mistake. they are trying to say that llamacpp was a mistake.
Anonymous No.106575465 [Report]
>>106575413
You're an old man but unfamiliar with classic Disney characters?
Anonymous No.106575473 [Report]
How's the progress on the llama.cpp MTP PR?
Anonymous No.106575474 [Report] >>106575499
>>106575413
it's the GGUF format for quantized weights
Anonymous No.106575492 [Report] >>106575505 >>106575506 >>106575523 >>106575577 >>106575644 >>106576777 >>106579713
Name one thing a local model has done for you
Anonymous No.106575499 [Report] >>106575521
>>106575474
There's nothing preventing you from having a full precision model in GGUF format, right?
Not nitpicking btw, I really don't know, but intuitively I imagine that yes.
GGUF is just a way to pack the weights and some metadata, right?
Does that mean you could have AWQ GGUFs?
Anonymous No.106575505 [Report]
>>106575492
best orgasms of my life
Anonymous No.106575506 [Report] >>106575554
>>106575492
It recalled on demand Kasane Teto's birthday with 70% confidence
Anonymous No.106575507 [Report] >>106575672 >>106575877 >>106575914 >>106576012
https://youtu.be/7Jzjl3eWMA0?t=117
Women raping gay billionaire werewolf writers sounds unsafe. But their fucked up fetishes are somehow safe. I hate this world
Anonymous No.106575517 [Report]
>>106575274
>but are happy if they can oneshot prompt some of the most disgusting ERP known to man
If a model could oneshot 5-6 prompts continuing my organic ERP logs, maintain coherence up to 30k, and withstand me getting bored after a month, my penis would be a happy penis. GLM chan was the closest so far.
Anonymous No.106575521 [Report]
>>106575499
I have never tried to run f32 but it handles bf16 just fine. I think the quantization matters because it needs to know how to do the math or whatever, they seem to call it a kernel for some reason.
Anonymous No.106575523 [Report] >>106575550
>>106575492
Best orgasms like the other anon, followed by wanting to kill myself again, not because of a sense of shame but because all the models are still fucking trash.
Anonymous No.106575537 [Report] >>106575629
>>106575451
You understand that feeding trolls is doing it wrong. Yes?
Anonymous No.106575550 [Report]
>>106575523
For me it's reinforcing fantasies that can never happen (like getting a gf). I'm not sure it's doing me any good, but whatever, nothing I can do to my mind is permanent.
Anonymous No.106575554 [Report]
>>106575506
Absolutely critical use case for me to be desu
Anonymous No.106575577 [Report]
>>106575492
Reminded me that local is still a waste of time.
Anonymous No.106575629 [Report] >>106575744
>>106575537
At a certain point, I'm pretty sure it's just trolls feeding each other.
Anonymous No.106575644 [Report]
>>106575492
help with scripts
medical advice (regarding headaches)
orgasms as the others said also helped me schizomaxx by making my daydreams more vivid and shit
help with looking shit up (eg what standard does x use etc)
Anonymous No.106575668 [Report] >>106575698
>>106575420
I'm pretty sure that's not normal, triple 4090s on *windows*, with the windows performance nerf can do 80 tokens/s according to others, and I've seen my iq4xs air do 70 tokens/s on linux with 3090s.

20t/s... is about what my windows (I'm the guy with the fucked up multi gpu windows performance) does on 2 gpus using iq2xs.

Are you sure nothing's spilling to ram?
Anonymous No.106575672 [Report]
>>106575507
Yes, female privilege is getting bigger with time
Anonymous No.106575698 [Report] >>106575792
>>106575668
it shouldn't be. I am using an IQ2 quant which should just barely fit in VRAM. could it potentially be my backend? I use oobabooga webui because it is convenient, but could it really be hindering my performance that much?
Anonymous No.106575744 [Report]
>>106575629
You're right. Or samefagging
Anonymous No.106575792 [Report] >>106575808
>>106575698
>oobabooga webui
If it isn't an ancient install, you're using llama-server to load your ggufs, so that shouldn't be it. I used koboldcpp, llamacpp (llama-server), and oobabooga (llama-server) and there wasn't a noticeable difference.

When you load your model, there should be a line that tells you how llama-server is loading the model. Maybe you need the verbose flag in cmd_flags.txt to see it.

Did you split by rows?
Anonymous No.106575808 [Report] >>106575836
>>106575792
it is a recent install, so then I guess that isn't the problem. I have tried both with and without row splitting and including it actually slightly reduces performance
Anonymous No.106575836 [Report] >>106575848
>>106575808
Are you able to check what's your gpu usage during generation? In my case, my abysmal performance on windows is verified by the power usage - barely 80w on each card during generation, while linux pulls 150+.
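If you need a way to check, assuming nvidia cards and nvidia-smi on your PATH, something like this in another terminal should print power draw and memory clocks per card once a second (field names come from nvidia-smi --help-query-gpu, so double check them against your driver version):

nvidia-smi --query-gpu=index,power.draw,clocks.mem,utilization.gpu,memory.used --format=csv -l 1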
Anonymous No.106575843 [Report] >>106575850
Why do these models speak as faggots
Anonymous No.106575848 [Report] >>106575891
>>106575836
I will check right now, but from what I have seen, usually it is below 100w during generation
Anonymous No.106575850 [Report]
>>106575843
monkey see monkey do
Anonymous No.106575851 [Report] >>106575859 >>106575869 >>106576089
>>106575202 (OP)
>>(09/11) Qwen3-Next-80B-A3B released
>Still no GGUF
Anonymous No.106575859 [Report]
>>106575851
GGUF is a state of mind, friend.
Anonymous No.106575869 [Report] >>106575935 >>106575940 >>106575977
>>106575851
>9/11
Did they really? Fuckers. We should release a model on the 15th of april.
Anonymous No.106575877 [Report] >>106575973
>>106575507
Pretty cool, really. When was the last time you heard a real female speaking to you, anon?
Anonymous No.106575891 [Report] >>106575898
>>106575848
Btw, I confirmed it was specifically a multi-gpu problem in my case by running a model that fits in one gpu, and then running that same model with the same setting but split across three gpus. See if that's the case for you as well.
Anonymous No.106575898 [Report] >>106575933
>>106575891
so, then you're saying I should try to keep the model on one GPU?
Anonymous No.106575914 [Report]
>>106575507
I hate jewoman
Anonymous No.106575933 [Report] >>106576021
>>106575898
Try running a smaller model on one gpu, then running that same smaller model but split across two gpus. There shouldn't be too much of a performance drop.

I'm just wondering if you have the same issue I have, but on linux.
Anonymous No.106575935 [Report] >>106575962
>>106575869
>noooo a foreign tech company is doing a minor release on OUR sad day??
Anonymous No.106575940 [Report] >>106575962
>>106575869
On their National Security Education Day?
Anonymous No.106575962 [Report]
>>106575935
>>106575940
I am SEETHING. Raging. I can not COPE with their insensitivity.
Anonymous No.106575973 [Report] >>106576045
>>106575877
I don't remember... But now I will write a gay billionaire werewolf book with help of R1 and I will get molested while signing my book. That is my dream from now on.
Anonymous No.106575977 [Report]
>>106575869
>the second qwen model hit hf
Anonymous No.106576012 [Report] >>106576035
>>106575507
Is this about women raping guys who write about or are gay billionaire werewolves? Or is it about writers who write about gay billionaire werewolves who rape women?
Anonymous No.106576018 [Report]
>>106575952
Where's the catch?
Anonymous No.106576021 [Report] >>106576059
>>106575933
so they both cap out at around 120W and enabling row splitting reduces performance by about 75%. I tested with an FP16 of gemma 270m. ~250t/s without vs. 65t/s with row splitting
Anonymous No.106576035 [Report]
>>106576012
It is about women raping guys who write about gay billionaire werewolves raping women.
Anonymous No.106576045 [Report]
>>106575973
No, write your own version first - or at least a rough draft - then edit with a LLM. Start with a novella and build up your own workflow. It's very doable.
Anonymous No.106576059 [Report] >>106576092
>>106576021
Sorry I should clarify, I asked if row splitting was enabled before because it's bad if you don't have enough bandwidth between the cards (like pcie).

Can you check your memory clocks when running a single gpu vs multi-gpu? Mine are 1000+ on a single gpu, and 650mhz on multi-gpu.
Anonymous No.106576064 [Report]
Member? lol
Anonymous No.106576089 [Report] >>106576174
>>106575851
https://blog.vllm.ai/2025/09/11/qwen3-next.html

vllm has support including mtp layers and everything. Probably one of the nicer and faster local models right now but fuck spending all day getting vllm to run without a UI for what is essentially a sidegrade to glm air.
Anonymous No.106576092 [Report] >>106576126
>>106576059
memory clocks are about the same for single and multi GPU. around 1250MHz
Anonymous No.106576094 [Report] >>106576100 >>106576743
>>106575413
Anonymous No.106576100 [Report]
>>106576094
Now thats some old shitposting
Anonymous No.106576126 [Report] >>106576137 >>106576162
>>106576092
Aww, not the same symptoms as me.

What's your settings? Is every multi-gpu model you run this slow?
Anonymous No.106576137 [Report] >>106576151
>>106576126
What about --mlock?
Anonymous No.106576151 [Report]
>>106576137
No difference on or off for me. But I only tested that on windows. On linux I left it off. --no-mmap is always on though.
Anonymous No.106576162 [Report] >>106576186 >>106577441
>>106576126
these are my settings for GLM air. i got about 33t/s just now.
Anonymous No.106576174 [Report] >>106576443
>>106576089
>sidegrade to glm-air
>at a smaller size
>at 3b active
>with mtp
this thing is going to be fast as fuck
Anonymous No.106576186 [Report] >>106576245
>>106576162
Are you sure you can fit 100k+ context? What's the speed like if you set the context size to 8192?
Anonymous No.106576221 [Report] >>106576234 >>106576242 >>106576269 >>106577464
man, the imagen community is pretty fucking stagnant. how are the llm bros holding up?
Anonymous No.106576234 [Report] >>106576243
>>106576221
We get something new about once a year. This year we peaked in February.
Anonymous No.106576242 [Report]
>>106576221
we're about 12 deepseek sidegrades deep while the best model for consumer gpus is from july 2024
Anonymous No.106576243 [Report] >>106576286
>>106576234
NTA but what did we get in feb?
Anonymous No.106576245 [Report] >>106576331
>>106576186
33.4t/s instead of 33t/s
Anonymous No.106576269 [Report] >>106576282 >>106576320 >>106576381 >>106579821 >>106581071 >>106581286
>>106576221
We're at the tail end of the summer flood, euphoria starting to wear off
Anonymous No.106576282 [Report]
>>106576269
>Summer Flood
Next...
>Drummer's Cold Season
Anonymous No.106576286 [Report] >>106576444
>>106576243
r1
Anonymous No.106576315 [Report] >>106576329 >>106579821
I switched back from K2-0905 to the old K2. The new one writes like it caught autism from the original R1.
Also the July K2 had the nice quirk that it wrote by far the best post-orgasm scenes out of any llm I've seen while the new one handles them much more generically 95% of the time.
Anonymous No.106576320 [Report]
>>106576269
>euphoria
That's a weird way of describing what people feel seeing a flood of identical, useless synthetic models, each claiming to beat r1
Anonymous No.106576329 [Report]
>>106576315
on what hardware?
Anonymous No.106576331 [Report] >>106576358
>>106576245
Is glm air the only model this happens with? What models do you have?
Anonymous No.106576358 [Report] >>106576378 >>106576395
>>106576331
I also have a very small quant of GLM full that runs at about 3.5t/s. 8 bit gemma 27b runs at about 13.5t/s. everything has always run extremely slow for me despite having good hardware
Anonymous No.106576378 [Report] >>106576395 >>106576431
>>106576358
Yeah that's fucky.

Can you download a q4 gemma or nemo and report the speeds when running on 1 gpu vs two gpus? 13 tk/s for q8 gemma on dual 4090s isn't right.

What's your cuda and driver versions?
Anonymous No.106576381 [Report] >>106576503
>>106576269
explain the strawberry and spade thing with OpenAI? I don't get it
Anonymous No.106576395 [Report] >>106576415
>>106576358
>>106576378
>https://www.perplexity.ai/search/i-have-4-x-rtx-3090-s-and-128g-2EtrYlIlSUKZwfWnxK0aqw
Anonymous No.106576415 [Report] >>106576430
>>106576395
Makes me want to throw up. Jesus christ.
Anonymous No.106576430 [Report]
>>106576415
It's quite... generic answer.
Anonymous No.106576431 [Report] >>106576477 >>106576569 >>106576592
>>106576378
CUDA is 12.8, drivers are 580.65.06.
Mistral-Nemo-Instruct-2407-Q4_K_M.gguf got me about 53t/s on 1 GPU and multi GPU is about 19t/s
Anonymous No.106576443 [Report]
>>106576174
yah, but it will also be dumber for roleplaying and writing. Maybe better for coding and longer context. It is crazy to see mtp support. I wonder if qwen helped a bit, we may become qwen's bitch if they keep doing stuff like that behind the scenes
Anonymous No.106576444 [Report]
>>106576286
Forgot that was just back in feb, feels like a lot longer ago
Anonymous No.106576471 [Report] >>106576531
>>106575202 (OP)
>Previous threads: >>106566836(Cross-thread) & >>106559371(Cross-thread)
i still dont know why theres always two and i probably never will but i might be okay with that
Anonymous No.106576477 [Report] >>106576497
>>106576431
Does this happen on windows or other distros as well?

In my case, windows 10 iot ltsc and 11 iot ltsc behave the same way, sane tks on single gpu, and abysmal performance on multi-gpu.
But debian 13 with driver version 550 had no problem delivering the speed for multi gpu.
My 3090s are running on x16 gen4. I did notice that resizable bar, while turned on in the bios, was reported as disabled in windows. While resizable bar shouldn't affect the speeds, it's weird that it says disabled even though it's enabled in linux, where I have better performance, so it may be indicative of some other issue to do with how my gpus are handled in windows.
Anonymous No.106576497 [Report] >>106576514
>>106576477
I am on a threadripper 3960x, so both of my 4090s are on 16x gen 4 as well. I have never tried other distros on this machine, I just use Mint. I have tried windows in the past and the performance was terrible for me too, even worse than now. ReBAR is enabled for both of my GPUs
Anonymous No.106576503 [Report] >>106576877
>>106576381
Strawberry was some marketing hype about some openai innovation i think. Was a while ago. Spade: "If a white woman has this tattoo inked on her skin, it indicates that she is the sexual property of a Dominant Black Man, also called a 'Bull' or 'Owner'." in conjunction with the Israel flag, i
Anonymous No.106576514 [Report] >>106576561
>>106576497
>zen 2
Same with me... Do you think that's it? Hmmm but then why does my linux have no problems?
Anonymous No.106576531 [Report]
>>106576471
Memory's fuzzy, but I think threads were moving super fast early on and someone asked for 2 so they could quickly see if they missed a thread and whoever was maintaining the template at the time obliged.
Anonymous No.106576561 [Report] >>106576596
>>106576514
no idea, honestly. I have seen people post about their performance time and time again and they always exceed me by far despite having similar hardware to me. 80t/s would be a dream, that is almost like real time text gen. I was ecstatic about getting above 20t/s for the first time in my life on a good sized model with GLM air
Anonymous No.106576562 [Report]
>>106575274
Truth nvke
Anonymous No.106576569 [Report] >>106576610
>>106576431
Imagine spending money on two 4090s to get only 19 tokens per second on nemo, rip anon.
Anonymous No.106576592 [Report] >>106576606 >>106576610
>>106576431
RTX 4090 has 24GB of vram. And NEMO is about that size.
When you split that between two gpus it means that cpu is still acting as a middle man.
Have you turned on Hardware-accelerated GPU scheduling in Windows?
Anonymous No.106576596 [Report] >>106576610
>>106576561
You definitely should be able to hit 80t/s. My 3090s can hit 70t/s when everything is working properly. Maybe bug report? To whom? I dunno lol.

You should not be satisfied with 20t/s. That's horrible.
Anonymous No.106576606 [Report] >>106576668
>>106576592
4090 anon is mint. I'm the windows guy. And yeah, I've already tried toggling that. No difference.
Anonymous No.106576610 [Report] >>106576652 >>106576668 >>106576688
>>106576569
yeah, it is pretty terrible, but this is what I have lived with for years. I was fine with getting 1.5t/s on old 120B models. I didn't know any better.
>>106576592
I don't use windows
>>106576596
could try asking chatgpt or something I guess. I have never been able to figure this issue out
Anonymous No.106576616 [Report] >>106576660
I haven't paid that much attention to local in a while, who is drummer and are his models something special?
Anonymous No.106576652 [Report] >>106576698
>>106576610
Buddy, 20t/s and 80tk/s are worlds apart. You can not go back to the crawl that is 20 after tasting 80. Do not accept that this is what your 4090s can give you.


And if you end up figuring out what's wrong, please tell me as well so I fix my windows too.
Anonymous No.106576660 [Report]
>>106576616
rocinante model by drummer is basically the goto for vramlets who have got like standard gaming hardware.
Anonymous No.106576668 [Report] >>106576693
>>106576606
>>106576610
These are important experiences to learn from.
Anonymous No.106576688 [Report] >>106576698
>>106576610
Are your gpus the same model?
Anonymous No.106576690 [Report] >>106576694 >>106576753
>>106575202 (OP)
i got GLM-4.5-Air a couple weeks ago is it still the best?
Anonymous No.106576693 [Report] >>106576704
>>106576668
>learn from
What can you learn from this?
Anonymous No.106576694 [Report]
>>106576690
Yes.
Anonymous No.106576698 [Report] >>106576714 >>106576789
>>106576652
gonna put in a couple hours with chatgpt to see if I can fix this
>>106576688
no, different 4090s
Anonymous No.106576704 [Report] >>106576714 >>106576726
>>106576693
Use Linux for multi-gpu setups and for more serious computing tasks. Windows is still for consumer faggotry.
Anonymous No.106576714 [Report]
>>106576698
For what it's worth, I don't think having different models is the culprit, but I also have different 3090 models.
>>106576704
Your opinion has been noted, I thank you for your response.
Anonymous No.106576726 [Report] >>106576759
>>106576704
??? 4090 anon is using linux and they still have multi-gpu problems.
Anonymous No.106576743 [Report]
>>106576094
I've never seen noko referenced before...
Anonymous No.106576753 [Report] >>106576766 >>106576774
>>106576690
Best you can do really on single gpu 16g vram and 64g of ddr4/5 at the moment. Jamba is pretty good at a smaller size with about 5-6k worth of human written writing to avoid the endless slop, but it reprocesses the entire cache every message because the llamacpp implementation is retarded. Every other moe is 1-3b active, 20-30 inactive aside from next or 220-300b inactive which requires you to have a shitload of ram and maybe a couple gpus.
Anonymous No.106576759 [Report] >>106576789
>>106576726
Wrong kernel configuration.
Anonymous No.106576766 [Report] >>106576841
>>106576753
>jamba mini mentioned
I really like it, but it's godawful retarded.
Anonymous No.106576774 [Report] >>106576841
>>106576753
i have 90gb ram and 24gb vram if that helps currently i use GLM-4.5-Air-Q3_K_M think it was a fairly low t/s i am on amd
Anonymous No.106576777 [Report] >>106576814
>>106575492
Oneshot code for an esphome ir blaster and receiver. Then oneshot all the recorded codes into buttons I can use from home assistant.
Thank you GLM 4.5 Air.
Anonymous No.106576789 [Report] >>106576867 >>106578561
>>106576698
If >>106576759 is right, try debian 13? I just installed the 550 driver and llamacpp.
Anonymous No.106576814 [Report] >>106576868
>>106576777
is that for using a tv remote for your lights?
Anonymous No.106576841 [Report] >>106577739
>>106576766
It takes a lot of handholding for sure, and like I said, it takes a lot of tokens to break away from slop, but imo it has the most diverse swipes of any model I've tried with neutral samplers, apart from changing top_p in the range of 0.75 to 0.9. Better than mistral nonstop regurgitating the same shit every swipe, or devolving into repetition past 10k tokens
>>106576774
I have less vram/ram than you and also am using ayyymd, getting around 8-9 t/s with air which is tolerable early on, you should be getting better speed than that, unless you're expecting 20-50 t/s. Try messing around with the --ncmoe option if llamacpp, or the option that does the same in kobold. Subtract 5-10 from the model's total layer count and you should get a decent t/s boost
Anonymous No.106576867 [Report] >>106576931
>>106576789
Linux != distro, when will you learn this? I thought this is /g/.
Just recompile your kernel and see if there's something that will help. I'm pretty sure it might come down to how pci-e is being handled and whatnot.
Changing distribution is not that intelligent because it serves no purpose in this sense.
Anonymous No.106576868 [Report]
>>106576814
Yes but did it for a tower fan and my window AC. I keep losing the only remote I have for each but now I have a virtual remote in home assistant I can use from my phone or PC.
Anonymous No.106576877 [Report] >>106576886 >>106576903
>>106576503
>Spade: "If a white woman has this tattoo inked on her skin, it indicates that she is the sexual property of a Dominant Black Man
lol, is that real?
Anonymous No.106576886 [Report]
>>106576877
sort of. i wouldn't look too far into the internet abyss of degeneracy. but yeah, stuff like that exists
Anonymous No.106576903 [Report]
>>106576877
ever heard of "blacked" that's somewhat related to it
Anonymous No.106576931 [Report] >>106577028
>>106576867
>Linux != distro, when will you learn this?
Idk, when I have time. The aversion I have to linux is that I need to set time aside to get used to how things work in it compared to what I already know. Normally, this process can be hastened by asking others, but asking linux users stuff can be very frustrating.

The kernel that comes with debian 13 works for me, so that's why I suggested it.
Anonymous No.106576937 [Report] >>106576952
>https://www.strawberyai.com/
>Latest Update: 27 Aug-2024
>“Strawberry” is the codename for OpenAI’s latest AI initiative, which is set to launch as early as this fall
lol
Anonymous No.106576948 [Report]
Why have memeplers died but memetunes continue to live?
Anonymous No.106576952 [Report] >>106576959 >>106577083
>>106576937
wasn't strawberry like early codename for o1 or something
Anonymous No.106576959 [Report]
>>106576952
I assume the website is a joke, but that seems to be the case
Anonymous No.106577028 [Report] >>106577094
>>106576931
Yeah of course, some pre-configured kernels are more suitable for distributed tasks than others but it shouldn't be a reason to switch distribution.
By all means it only takes a couple of hours max to go through a configuration and compile a new kernel. It's not that different from configuring some application to your liking.
Anonymous No.106577083 [Report]
>>106576952
They hyped it up for November 5th or something then rushed a release when the Reflection scam came out. I still believe down to my bones they were originally bluff hyping and stole the idea from the Reflection dude.
Anonymous No.106577094 [Report] >>106577110 >>106577350
>>106577028
I'm sure that's intuitive for linux users, but as a windows user I do not know this. All I know is that you said it might be kernel issues, and I know that my distro, with its kernel, worked for me. That's why I suggested switching to debian. Because that's the simplest way I know to have a different kernel.

Thank you for teaching me that kernels can be changed like that, but your advice should probably be directed to the 4090 anon.
Anonymous No.106577110 [Report] >>106577146 >>106577154 >>106577350 >>106578663
>>106577094
You sound bit condescending and bitchy. Imageboard posting is always bit generalistic, don't you think? This is not your discord.
Anonymous No.106577146 [Report] >>106577210
>>106577110
Alright then, I'm sorry, I apologize. I was being retarded and will conduct myself better in the future.
Anonymous No.106577152 [Report] >>106577182 >>106577208 >>106577350
Are the Jamba models any good?
Is the whole hybrid ssm-transformer gimmick worth anything? Does it at least make the model faster to run compared to a transformer dense model of the same size? Is it more like a MoE maybe?
Does it run well on the CPU?
I'm downloading 1.7 mini to test, but figured I might as well ask.
Anonymous No.106577154 [Report]
>>106577110
I think that's all in your head. He seemed polite and straightforward to me.
Anonymous No.106577182 [Report]
>>106577152
It just makes context use less memory.
Anonymous No.106577208 [Report]
>>106577152
Jamba mini is a lot less safe than something like qwen (lol), but it's also a lot less intelligent.
Anonymous No.106577210 [Report]
>>106577146
My own aggressive stance. I should have typed:
Common Linux distributions have been configured with the normal user in mind; some guy with 4 GPUs and hundreds of GBs of memory is not a normal user - he should configure his own kernel instead.
Anonymous No.106577311 [Report] >>106577347 >>106577374 >>106577419 >>106577462 >>106577674
Which model or repo can I use to ingest all this information and find juicy stuff?, I don't know any Chinese
https://x.com/gfw_report/status/1966669581302309018
Anonymous No.106577347 [Report]
>>106577311
Give it back john
Anonymous No.106577350 [Report] >>106577367 >>106577372
>>106577110
I don't agree with this retard >>106577094
You're just fumbling around but aren't completely retarded, just unlearned
Installing a different kernel, as far as arch linux goes, is just `sudo pacman -S linux-hardened linux-hardened-headers` or `sudo pacman -S linux-zen linux-zen-headers` or whatever, adapting to your package manager to whatever kernel you need. I swap between a few when I run into retarded issues frequently
>>106577152
Reprocesses on every message/swipe, but small moe that you can offload a majority of to ram and have a lot of context while maintaining good speed. Sloppy as fuck until you feed it enough context to learn from. Less degradation over long context as well. The reprocessing thing basically kills the benefits I mentioned, though
Anonymous No.106577367 [Report]
>>106577350
I got the first and second quotes backwards but whatever
Anonymous No.106577372 [Report] >>106577408
>>106577350
>but small moe that you can offload a majority of to ram and have a lot of context while maintaining good speed
>Less degradation over long context as well.
I can work with that if the context processing is fast enough on my hardware. Thanks.
I'll test it after I'm done fapping to ERP with GLM air.
Anonymous No.106577374 [Report] >>106577388 >>106577395 >>106577734
>>106577311
cat * | grep -i keyword
Anonymous No.106577388 [Report]
>>106577374
I can't find that model on HF. Link?
Anonymous No.106577395 [Report] >>106577412 >>106577456
>>106577374
i don't think cat nor grep can understand and translate chinese sar
Anonymous No.106577408 [Report] >>106577575
>>106577372
It's pretty tolerable if you set your batch size to 2k, but it's still pretty fucking lame to wait 30-40s to reprocess anything around 10-20k tokens even if the tg is fairly fast
Anonymous No.106577412 [Report]
>>106577395
Just ask qwen3 30b to translate the keyword to chinese
Anonymous No.106577419 [Report]
>>106577311
finally, we'll know if sending tiananmen square copypasta actually boots someone off
Anonymous No.106577441 [Report] >>106577477
>>106576162
whats this llamacpp frontend?
Anonymous No.106577456 [Report] >>106577556
>>106577395
https://vocaroo.com/1mgCGBF0m9LF
Anonymous No.106577462 [Report]
>>106577311
Taiwan is China 2bqh
Anonymous No.106577464 [Report] >>106577603 >>106578767
>>106576221
>imagen is stagnant
>after getting QI, QIE, WAN and CHROMA in the last couple of months
literally kys retard
>i cant run them because i have a 1060 TI
literally kys retard vramlet
Anonymous No.106577477 [Report] >>106578561
>>106577441
it looks like oobabooga's text gen webui
Anonymous No.106577556 [Report] >>106577573
>>106577456
kek even lmg has its own jeet helpdesk now
Anonymous No.106577573 [Report] >>106577664
>>106577556
vibevoice has a 30s sample of a jeet speaking with the thickest accent too, pretty easy
Anonymous No.106577575 [Report] >>106577583 >>106577677
>>106577408
>Prompt
>- Tokens: 8789
>- Time: 29325.184 ms
>- Speed: 299.7 t/s
>Generation
>- Tokens: 1233
>- Time: 100108.141 ms
>- Speed: 12.3 t/s
That's not bad.
Granted, it's Q3KS, 32k context, n-cpu-moe 5, and batch size 512, but still.
If it's smart enough at this level of quantization, I'll replace Q6 qwen3 A3B with this.
Anonymous No.106577583 [Report]
>>106577575
>n-cpu-moe 5
Sorry, n-cpu-moe 27.
Anonymous No.106577603 [Report] >>106577617
>>106577464
qwen isn't much better than flux. if anything you have to snake oil the fuck out of it but most people in the community have poverty cards and don't bother. it's just more benchmaxxed safety garbage. chroma is shit btw. Wan is fantastic but 2.2 isn't much of an improvement over 2.1 and just adds confusion by having two separate models. 5 second limit is just shit and there has been so many cope techniques to extend but it's shit. there isn't even a point to qwen image when edit can just txt2image as well. Wan is seriously a better contender for an image model because it's at least trainable on garbage
Anonymous No.106577617 [Report]
>>106577603
I'd like to add uncomfyui just keeps getting shittier by introducing more telemetry or a worse frontend. there really needs to be something else. tired of the API nodes grifting
Anonymous No.106577646 [Report]
>>106575394
>if you get called out, you just claim it was a shitpost all along
[headcanon]
can't be surprised text coomers would have a thing for making up things in their heads
Anonymous No.106577664 [Report] >>106577684 >>106578533
>>106577573
I'd like to find German-English accent and perhaps French one but it's pretty difficult.
Anonymous No.106577674 [Report]
>>106577311
This finally convinced me that China should turn the firewall off and let USA mess with their internal politics... Oh wait, it did not.
Anonymous No.106577677 [Report]
>>106577575
Don't know what you have for a gpu, but one of the benefits is that the gen speed will stay consistent, if you ignore the reprocessing. I run a q6 as a warmup, then switch to air usually
Anonymous No.106577684 [Report] >>106577705
>>106577664
https://www.youtube.com/watch?v=rEhXFZJUtJE
This is all I needed. Youtube is so full of shit it's hard to find something good.
Anonymous No.106577688 [Report]
@grok

please generate a male in their 20 with a thick nasally chinese accent uttering these words:
>This finally convinced me that China should turn the firewall off and let USA mess with their internal politics... Oh wait, it did not.
Anonymous No.106577705 [Report]
>>106577684
https://www.youtube.com/watch?v=27hsoahUehE
There's Russian as well.
Anonymous No.106577734 [Report] >>106577778
>>106577374
https://porkmail.org/era/unix/award
always funny to see fucktards pretend to RTFM someone by showing how little they know about the CLI
grep -r nigger
or, better yet, rg, because grep has been slow garbage
the only sort of ree-tard who would think of doing a "cat | grep" is a ree-tard who never used grep
Anonymous No.106577739 [Report] >>106577768 >>106577795 >>106577798 >>106577841 >>106577880 >>106580839
i made a character card if anyone want her https://files.catbox.moe/9fl9yu.png

>>106576841
>Try messing around with the --ncmoe option if llamacpp
so currently i have -ngl set at like 13 i think how do i take the ncmoe into account also where do i find out the total layers?
Anonymous No.106577768 [Report] >>106577783 >>106580839
>>106577739
Cute card.
Set -ngl to 99 and -n-cpu-moe to 99. Look at the console and see how many layers the model has, it'll say something like
>offloading N repeating layers to GPU
That's the number of layers. Then you try lowering n-cpu-moe to N-1, N-2, etc.
It's easier if you launch GPU-Z to see how much VRAM you have free to fuck around with.
Anonymous No.106577778 [Report]
>>106577734
> If you came here looking for material about abuse of feline animals, try this Alta Vista search instead.
link is broken :(
Anonymous No.106577783 [Report] >>106577802
>>106577768
>-ngl to 99 and -n-cpu-moe to 99
how does ncpumoe affect vram? i know going higher than around 13 gpu layers causes my system to lock up
Anonymous No.106577795 [Report]
>>106577739
nta, here's an extracted card for human reading.
https://litter.catbox.moe/ewomrtg6gqsb73hc.txt
Anonymous No.106577798 [Report] >>106577813
>>106577739
the --ncmoe option is basically "it offloads layers at the end of the model"
Setting it a bit below the total layer count leaves those remaining layers on gpu, which usually gives a speedup
pic rel, subtract 5-10 or however many layers as long as it doesnt OOM and itll run faster
Anonymous No.106577802 [Report]
>>106577783
Basically, -ngl will tell llama.cpp to put all tensors (the vectors that compose each layer) in VRAM, then n-cpu-moe will tell llama.cpp that "actually, no, the expert layers will go in RAM".
So you end up with the heaviest tensors of the model in RAM.
From there, you can check how many layers your model has, and try adjusting n-cpu-moe to have only as many expert tensors in RAM as you can't fit in VRAM.
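As a rough sketch of the whole thing (hypothetical model path and numbers - adjust to whatever your log reports and your VRAM allows), the launch ends up looking something like:

llama-server -m /models/GLM-4.5-Air-IQ4_XS.gguf -ngl 99 --n-cpu-moe 40 --ctx-size 16384

where you start --n-cpu-moe at the repeating layer count printed in the log and lower it a step at a time until VRAM is nearly full.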
Anonymous No.106577813 [Report]
>>106577798
bottom right, moe cpu layers if kobold
Anonymous No.106577841 [Report] >>106577864
>>106577739
if using -ngl just set it to 99 and make sure --ncmoe doesn't make you OOM, it's basically backwards logic. Almost everything will be in ram, you just want to adjust --ncmoe so it fills a good amount of ram but not all of it
Anonymous No.106577855 [Report]
It picks up, need to refine the source audio better.
https://vocaroo.com/1aIcVBCSa7ac
German accent isn't as funny sounding as Indian English anyway.
Anonymous No.106577864 [Report] >>106577913 >>106577958 >>106578029
>>106577841
so start at 1 and increase until it doesnt oom?
Anonymous No.106577880 [Report]
>>106577739
...
Anonymous No.106577913 [Report] >>106578037 >>106578038
>>106577864
No. If you start with 1, you'll have all the experts in VRAM, minus one, which will OOM for sure.
Start with exactly how many layers the model has then lower the value gradually until you find out how many experts you need in RAM to not OOM.
Anonymous No.106577958 [Report] >>106578037
>>106577864
You can click "file info" on huggingface for a model and if you can read, it'll be apparent how many layers a model has (or use kobold's retarded estimation thing to see how many layers). Depending on how big the whole thing is, you'll need to incrementally change how many layers get offloaded using ncmoe or the like for various backends. As much as it sucks, you need to take the information we give you, try it yourself and learn. Otherwise we could direct you to a rentry instead of this
Anonymous No.106578029 [Report]
>>106577864
To try and spoonfeed a little bit more, figure out how many layers the moe model you're trying to run has, then do --ncmoe (amount of max layers it has) - 5
If you oom, raise it. If you have too much vram to spare, lower it
This hobby isnt exactly easy to figure out
Anonymous No.106578037 [Report] >>106578070
>>106577958
>>106577913
okay i think i got it so the llama console says offloading 48 layers to gpu so i started there and then lowered it until it would launch without crashing which ended up being 33

ill probably do some llama bench runs tomorrow so i can see what the performance difference actually is

-ngl 99 \
--n-cpu-moe 33 \
-t 48 \
--ctx-size 20480 \
-fa on \
--no-mmap;


load_tensors: offloading 47 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 48/48 layers to GPU
load_tensors: ROCm0 model buffer size = 17562.93 MiB
load_tensors: ROCm_Host model buffer size = 34892.00 MiB
load_tensors: CPU model buffer size = 254.38 MiB
Anonymous No.106578038 [Report] >>106578070
>>106577913
How do I calculate number of experts with llama-server prompt? Let's say that I'm using Qwen3-Coder-30B-A3B-Instruct-IQ4_XS?
Anonymous No.106578070 [Report] >>106578085 >>106578118 >>106578158
>>106578037
>llama bench
I think llama-bench doesn't support n-cpu-moe, so you might want to check that.

>>106578038
llama-server spits out the number of layers the model has, just like in the post above yours.
Anonymous No.106578085 [Report] >>106578132
>>106578070
Y-you sound like an expert... *shivers*
Anonymous No.106578118 [Report] >>106578158
>>106578070
I think there was a pr for it, but I might be misremembering
Likely not merged yet if it exists
Anonymous No.106578132 [Report] >>106578152
>>106578085
I am indeed an expert in launching llama-server and setting -n-cpu-moe. A truly specialized skill set that is sure to become incredibly in demand in the near future.
Right?
Right?
Anonymous No.106578152 [Report]
>>106578132
She sets her book down and leans forward, her expression becoming more intimate. "I've been thinking about how the MOEs were always so obsessed with their own power and control. Kinda reminds me of how I feel about you sometimes, you know?"
Anonymous No.106578158 [Report] >>106578204 >>106578240 >>106578242
damn i thought this would be a cool llm client device but the built in browser doesnt work and neither does the ancient firefox that can run on it

>>106578070
>>106578118
>I think llama-bench doesn't support n-cpu-moe, so you might want to check that.
oh thats annoying guess ill just wait then lol
Anonymous No.106578204 [Report] >>106578236
>>106578158
just roll your own front-end that will work on the ancient browser?
Anonymous No.106578236 [Report]
>>106578204
to be fair i did write a java client before doubt thatd be hard to port but i have an android mobo coming for this soon
Anonymous No.106578240 [Report] >>106578249
>>106578158
You could use ssh?
Anonymous No.106578241 [Report] >>106578413
client in question
Anonymous No.106578242 [Report] >>106578249
>>106578158
>whois 192.168.1.128

Anon, we are neighbors!
Anonymous No.106578249 [Report] >>106578531
>>106578240
ssh for sillytavern lol?
>>106578242
please dont piss in my letterbox
Anonymous No.106578283 [Report]
192.168.1.103
I'm here too.
Anonymous No.106578295 [Report] >>106578424
im surprised other anons are using a .1 subnet
Anonymous No.106578357 [Report] >>106578458
Is it normal for GLM 4.5 Air to feel bland when it comes to erp or am I doing something wrong? Using a q3_k_xl quant
Anonymous No.106578413 [Report]
>>106578241
Anonymous No.106578424 [Report]
>>106578295
Seems like really small...
sage No.106578455 [Report]
sexless thread. coomless hobby.
Anonymous No.106578458 [Report]
>>106578357
It's not just you, it's the same as Qwen3, you really need to tell it what to write and even then..
>Write as a contemporary author, use varied language but not too over the top - be natural, immersive and explicit. Try to surprise the user.
It changes its output but is it still what you really want? You can shape it, experiment more.
Anonymous No.106578531 [Report]
>>106578249
>please dont piss in my letterbox

It wasn't me!
Err, how did you know?
Anonymous No.106578533 [Report] >>106578546 >>106578583 >>106578658
>>106577664
Authentic German English:
https://www.youtube.com/watch?v=icOO7Ut1P4Y
https://www.youtube.com/watch?v=lLYGPWQ0VjY
Anonymous No.106578546 [Report]
>>106578533
>Ze yello from ze egg
Anonymous No.106578561 [Report] >>106578743
>>106576789
after several hours of research and testing with chatgpt's help, i have pretty much nothing. guess i will look into that kernel stuff but i dont wanna break things
>>106577477
thats exactly what it is
Anonymous No.106578583 [Report] >>106578598
>>106578533
Thanks, I saved these. I will edit these later.
I think this sounds a lot better than the other guy - because this is real talk, not pretence or 'examples'.
Anonymous No.106578598 [Report]
>>106578583
That's why the Indian guy example was so great because he talks in his real voice - it's not someone who is teaching anything and so on.
Anonymous No.106578658 [Report] >>106578679 >>106578698
>>106578533
I cut off a snippet and let's see what comes out.
https://vocaroo.com/1iqNvlT8iNXm
Yeah it's 1:1 good.
Normal talking pace, no pretension, long enough voice clip -> result.
Anonymous No.106578663 [Report] >>106578671
>>106577110
>You sound bit condescending and bitchy
non-committal half insult
>Imageboard posting is always bit generalistic
appointing oneself as the arbiter of the culture
>don't you think?
condescending "you know what you did" ahhh shit
>This is not your discord.
trying to use group strength for one's own purpose a la "not your army"

tell me how i know a woman or a troon wrote this post XD jesus christ never in my life would i ever type out something this faggoty just crossed myself irl god forbid nigga
Anonymous No.106578671 [Report]
>>106578663
Are you that plebbit moderator? You have such a problem with understanding real people.
Anonymous No.106578679 [Report]
>>106578658
It also sounds better when you normalize the result. Not just increase the volume.
Anonymous No.106578698 [Report]
>>106578658
Yeah, that sounds better. Glad it works.
Anonymous No.106578711 [Report] >>106578793
>>106575202 (OP)
Limitless Mikus General
Anonymous No.106578734 [Report]
>trannies LARPing as oldfags - the thread
Anonymous No.106578743 [Report]
>>106578561
You could just install to a usb solely for the sake of testing.
Anonymous No.106578746 [Report] >>106578756
>Her blue eyes searched yours with vulnerability usually masked behind phlegmatic calm
Well shit, I learned a new word today.
Thank you GLM Air.
Anonymous No.106578756 [Report] >>106578759
>>106578746
>phlegmatic
uhhh before i look it up isn't that the stuff you get in your throat when you have a cold?
Anonymous No.106578759 [Report] >>106578766
>>106578756
No.
No it isn't.
Cool huh?
Anonymous No.106578766 [Report] >>106578778 >>106578812
>>106578759
language is interesting
Anonymous No.106578767 [Report]
>>106577464
qwen image is synthmaxxed trash, the edit is a shitty cope for 4o and the google model, wan is good if you are satisfied with waiting 10 years for a 5s clip, chroma is complete unstable shit that knows 0 (zero) booru artists despite claiming to be trained on them
Anonymous No.106578778 [Report] >>106578787 >>106578790
>>106578766
I'm not US but phlegmatic is probably related to slow as molasses in etymology.
Anonymous No.106578787 [Report] >>106578795
>>106578778
Bingo.
Anonymous No.106578790 [Report] >>106578836
>>106578778
yeah, I guess it goes back to when people thought the 4 humours controlled everything.
another funny word like this is seminal, which means "containing seeds of later development" but its etymology is pretty funny if you look it up.
Anonymous No.106578793 [Report] >>106578891 >>106578934
>>106578711
Anonymous No.106578795 [Report]
>>106578787
I have learned something from films. I'm ESL, from Finland. Not from India.
Trouble is with written language even after XX years.
Anonymous No.106578812 [Report]
>>106578766
Anonymous No.106578836 [Report]
>>106578790
I'm more interested in the ancient history of mankind. Doesn't mean that much if language only goes back a few thousand years. Our history goes far beyond that.
Anonymous No.106578891 [Report]
>>106578793
nice :)
Anonymous No.106578934 [Report]
>>106578793
wow thats cool! (:
Anonymous No.106579126 [Report] >>106579142 >>106579216 >>106579287 >>106579680 >>106579717
I have concluded that 256 GB vram is all I need (for now)
Anonymous No.106579142 [Report]
>>106579126
Based DGX owner. I tried to buy one on ebay once for 10k, but the fuckers canceled my order.
Anonymous No.106579216 [Report]
>>106579126
that's a lot of vram
Anonymous No.106579287 [Report] >>106579420
>>106579126
how do you have that much vram?
Anonymous No.106579420 [Report] >>106579432
>>106579287
Did you notice that most consumer mobos used to have 4 slots but now there's 2 slots.
I thought this was proprietary because of Dell or HP but now... it's because of the price jew.
Even the efficient gaming mobos have only 2 slots available.
Anonymous No.106579432 [Report]
>>106579420
there are some basic ones with 5, but the problem is most dont get full pcie. i have an epyc and an asrock romed8-2t which has 7 full bandwidth slots
Anonymous No.106579680 [Report]
>>106579126
a m4max macbook with 128gb unified ram is all you need for local
Anonymous No.106579713 [Report] >>106579787
>>106575492
Helping me shit out decent sft datasets via a custom pipeline. Even managed to create a DPO dataset too
Anonymous No.106579717 [Report]
>>106579126
Only 8 MI50 and a ddr4 server with 512gb of ram.
The problem would be the slow pp.
Anonymous No.106579721 [Report] >>106579736 >>106579748 >>106579799 >>106581632 >>106582508
Qwen Next is too censored.
Anonymous No.106579736 [Report] >>106579784
>>106579721
Jesus... Have you or do you manually enable or disable <think></think>?
Anonymous No.106579748 [Report]
>>106579721
Next is not a reasoning model but if your tags still inject this, it can behave in the wrong way.
Anonymous No.106579784 [Report] >>106579795 >>106579859 >>106579884 >>106579897
>>106579736
It's the Instruct one. But this is a file that has a bunch of <|channel|>analysis<|message|> in it because I was using it to jailbreak gpt-oss. I just let it complete one of the CoTs and I thought the result was funny.
Anonymous No.106579787 [Report] >>106579824
>>106579713
>runpod
What part of local did you not understand?
Anonymous No.106579795 [Report]
>>106579784
Sorry, I forgot the exact <|xxx|> chatml format. But if it breaks down it means something is leaking.
Anonymous No.106579799 [Report]
>>106579721
fantastic
Anonymous No.106579821 [Report]
>>106576315
Sadly I have to agree. nu-Kimi lost the calm that made it likeable. >>106576269 will have to lower it to notable from top.
Anonymous No.106579824 [Report]
>>106579787
Salty your gatekeeping is ineffective?
Anonymous No.106579859 [Report]
>>106579784
Oh wait, I can help you more.
Anonymous No.106579884 [Report]
>>106579784
gpt-oss will display
> <|channel|>analysis<|message|>
or sometimes it will not display this at all.
Oh fuck I'm too drunk.
Last tests I did was with Qwen and this is chatml.
https://litter.catbox.moe/pd89w3421se7y4zm.txt
I deleted gpt-oss models after I made it work.
Anonymous No.106579897 [Report] >>106579908 >>106580072
>>106579784
You need to clean the message of anything that is not <|final|>
I'm sorry, I'm a bit drunk for this and I don't have gpt-oss on my disk any longer.
It's just a simple string operation.
Anonymous No.106579900 [Report] >>106579917
are finetunes more prone to repetition?
Anonymous No.106579905 [Report]
>>106575202 (OP)
Anonymous No.106579908 [Report] >>106579934 >>106579949
>>106579897
><|start|>assistant<|channel|>final<|message|>
This is what you want to extract for final message.
But before this it will often say
><|start|>assistant<|channel|>blablbalbalba analysis<|message|>balblablblabla<|end|>
This is what you need to fetch and ignore, that's the reasoning block of text.
Anonymous No.106579917 [Report] >>106579944
>>106579900
Why are there so many fat people?
Anonymous No.106579934 [Report] >>106579949
>>106579908
>This is what you need to fetch and ignore, that's the reasoning block of text.
There is also something llama-server does, or maybe the model does it, where it will not prefill <|start|>assistant<|channel|>
It will just blurt out the final message straight away.
You need to catch that case and handle it with an if - manage the string patterns.
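Since it really is just string handling, here is a minimal python sketch of what the two posts above describe (tag spellings taken from those posts, so treat it as a rough guide rather than a reference harmony parser):

def extract_final(raw: str) -> str:
    # harmony-style output: analysis channel(s) first, then the final channel
    final_tag = "<|channel|>final<|message|>"
    if final_tag in raw:
        # keep only what comes after the last final-channel marker
        text = raw.split(final_tag)[-1]
    else:
        # sometimes the server/model skips the channel prefix and blurts the answer directly
        text = raw
    # strip trailing special tokens if they are present
    for stop in ("<|end|>", "<|return|>"):
        text = text.split(stop)[0]
    return text.strip()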
Anonymous No.106579944 [Report] >>106580034
>>106579917
because the american food supply became so tainted it disrupted people's natural hunger "thermostat"
Anonymous No.106579949 [Report] >>106580014
>>106579908
>>106579934
I am sorry if my English does not make any sense but it is a matter of string pattern recognition. It is confusing that this pos model will sometimes not use 'assistant' at all, so you will need to manually make an exception.
Anonymous No.106580014 [Report] >>106580022
>>106579949
And the worst part is that the documentation
>https://huggingface.co/blog/kuotient/chatml-vs-harmony
Tells more about their big chatgpt thing than what if you implemented it yourself.
All of this is just bullshit,
it's still just chatml format but with extended
tags and exceptions.
Anonymous No.106580022 [Report]
>>106580014
wrong link
https://cookbook.openai.com/articles/openai-harmony
Anonymous No.106580034 [Report] >>106580050
>>106579944
Tainted with what?
Anonymous No.106580050 [Report]
>>106580034
there's a theory that polyunsaturated fats (which are industrially made and cheaper) throw off the nadh:fadh ratio and stop reverse electron transport from happening
https://www.youtube.com/watch?v=pIRurLnQ8oo
another theory is that with processed grains, the intestines aren't equipped to "sense" the volume correctly
it's probably multi-factorial
Anonymous No.106580069 [Report] >>106580290
qwen3-next goofs status?
Anonymous No.106580072 [Report] >>106580105 >>106580153
>>106579897
I had these in the file because if you leave edited reasoning blocks in the context, it changes how gpt-oss does reasoning in the following responses. I use that to let gpt-oss reason without the refusals. You can still do that in chat completion mode.
Later I changed the model to Qwen Next and it started to imitate the reasoning blocks, but it does it more like a parody of a GPT model. And I was getting distracted by the kind of things it says.
Anonymous No.106580078 [Report]
Do you like the kiwis? (Qwen models) (When models?)
Anonymous No.106580105 [Report] >>106580156
>>106580072
There is no chat completion - whatever string you send to the server comes back and then you format it. Trial and error type of thing.
But gpt-oss doesn't follow normal ways because it's broken.
I'm sorry if I sound retarded or annoying but with any other model you can specify a format and it will respond back in that format.
Don't waste your time with gpt-oss.
Anonymous No.106580153 [Report]
>>106580072
I can supply you with my code but it doesn't make any sense for you because it's out of the context and bad string management.
https://litter.catbox.moe/tisv7n22ye9rwqjs.txt
It's python.
Anonymous No.106580156 [Report] >>106580170 >>106580182
>>106580105
The prefix of each assistant turn is just "<|start|>assistant". That's why even in chat completion mode if the message content is "<|channel|>analysis<|message|>" it will still be formatted correctly when you put it together. It's an easy way to edit how it does the reasoning even in a chat UI. The backend would need to escape the special tokens for it to not work.
gpt-oss is still fun, but to be honest, I never used it to write stories. I've only used it for fake text games and MUDs, without much narration.
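A minimal sketch of that editing trick over chat completion, assuming a llama-server style OpenAI-compatible endpoint on localhost:8080 (URL and message contents are placeholders); as said, it only works if the backend doesn't escape the special tokens:

import requests

# The template only prepends "<|start|>assistant" to assistant content,
# so content that begins with "<|channel|>analysis<|message|>" reassembles
# into a full analysis turn when the prompt is put together.
edited_turn = (
    "<|channel|>analysis<|message|>"
    "Edited reasoning you want the model to imitate in later turns."
    "<|end|><|start|>assistant<|channel|>final<|message|>"
    "The visible reply that went with it."
)

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",  # assumed endpoint
    json={
        "messages": [
            {"role": "user", "content": "previous user turn"},
            {"role": "assistant", "content": edited_turn},
            {"role": "user", "content": "next user turn"},
        ],
        "max_tokens": 512,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])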
Anonymous No.106580170 [Report] >>106580191
>>106580156
Just tell me what you are having a problem with, I'm a retard, really.
Anonymous No.106580182 [Report]
>>106580156
I think you are trying to pull my leg. Try your best.
Anonymous No.106580191 [Report] >>106580197
>>106580170
I don't have a problem.
Anonymous No.106580197 [Report] >>106580204 >>106580239
>>106580191
What do you mean?
Anonymous No.106580204 [Report] >>106580218
>>106580197
I mean that I don't have a problem.
Anonymous No.106580218 [Report]
>>106580204
In this moment, I am euphoric, not because I shared a text file with you, but because I am enlightened by my intelligence.
Anonymous No.106580239 [Report]
>>106580197
import os
import re
import sys
import requests  # HTTP calls to the local model server
import random
import textwrap
from colorama import init, Fore, Back, Style  # colored terminal output
import contractions  # expanding contractions in text
import numpy as np
import sounddevice as sd  # audio playback
import subprocess
import wave  # reading/writing WAV files
Anonymous No.106580290 [Report]
>>106580069
months of sir before work
Anonymous No.106580295 [Report] >>106580315
What's a good model for having the AI be a kind of TTRPG GM? I recently tried Omega Directive but it seems better suited to being a one-on-one chatbot than a proper adventure-mode helper.
Anonymous No.106580315 [Report] >>106580332
>>106580295
any sufficiently large model can perform competently at any task
Anonymous No.106580332 [Report] >>106580337
>>106580315
I have 16 GB of VRAM, so I'm looking for stuff that'll fit in that.
Anonymous No.106580337 [Report] >>106580342
>>106580332
how much RAM? the new qwen next might be good for you
Anonymous No.106580342 [Report] >>106580350 >>106580514
>>106580337
about 65 GB
Anonymous No.106580350 [Report]
>>106580342
plenty for a q4 quant
https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct
Anonymous No.106580473 [Report]
https://github.com/ggml-org/llama.cpp/pull/15539
ggergachud will soon merge grok pr
Anonymous No.106580514 [Report] >>106580531 >>106580681
>>106580342
Try gpt-oss-120b. It only has 5B active parameters.
Anonymous No.106580531 [Report]
>>106580514
>It only has 5B active parameters
and most of those are dedicated to ensuring OpenAI policy is followed at all times
Anonymous No.106580681 [Report]
>>106580514
buy an ad
Anonymous No.106580707 [Report] >>106580712 >>106580717 >>106580838
Has anyone experimented with the LLM teaching itself a task by autonomously generating tasks and training data for itself?
Anonymous No.106580712 [Report] >>106580717
>>106580707
There is nothing to teach
The weights are fixed
Anonymous No.106580717 [Report] >>106580721 >>106580742 >>106580802
>>106580707
somewhat https://github.com/e-p-armstrong/augmentoolkit
>>106580712
>tasks and training data
the reading comprehension in this thread is off the chart
Anonymous No.106580721 [Report] >>106580728 >>106580794
>>106580717
speedrunning model collapse lol
Anonymous No.106580728 [Report] >>106580762
>>106580721
as if literally everyone isn't already using synth slop; with this you at least get some chance of the model seeing synth slop about stuff you care about instead of more code and math
Anonymous No.106580742 [Report]
>>106580717
You're mad at everyone now, go drink some water you little bitch.
Anonymous No.106580762 [Report] >>106580781
>>106580728
"Synth slop" is just data augmentation
I don't see you calling cropped images synth slop
Anonymous No.106580781 [Report] >>106580785
>>106580762
>I don't see you calling cropped images synth slop
maybe because we're in the text general and not relating to image gen shit?
Anonymous No.106580785 [Report] >>106580793
>>106580781
Yeah right, act as if we weren't spamming vocaroos
Anonymous No.106580793 [Report]
>>106580785
oh its you
Anonymous No.106580794 [Report] >>106580944 >>106580966
>>106580721
>parroting the meme collapse paper in 2023+2
>when all SOTA models are trained on synthetic data with verifiable rewards
Anonymous No.106580802 [Report]
>>106580717
thanks
Anonymous No.106580838 [Report]
>>106580707
Yeah, I made my own augmentoolkit since that one is bloated af and slow. It's very useful for turning raw data from scraped websites into something usable, and you can easily scale it up with data augmentation. You still need human data initially; pure synthetic slop would make it collapse fast
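The core loop is nothing fancy, something like this (simplified sketch; endpoint, filenames and prompts are placeholders, not my actual code):

import json
import requests

def chunk(text: str, size: int = 2000) -> list[str]:
    # naive fixed-size chunking of scraped raw text
    return [text[i:i + size] for i in range(0, len(text), size)]

def make_qa(passage: str) -> str:
    # ask a local model (assumed OpenAI-compatible endpoint) to turn one
    # passage into a question/answer training pair grounded in that passage
    r = requests.post(
        "http://127.0.0.1:8080/v1/chat/completions",
        json={
            "messages": [
                {"role": "system", "content": "Write one question and a faithful answer based only on the passage."},
                {"role": "user", "content": passage},
            ],
            "temperature": 0.7,
        },
        timeout=600,
    )
    return r.json()["choices"][0]["message"]["content"]

with open("scraped.txt") as f, open("dataset.jsonl", "w") as out:
    for passage in chunk(f.read()):
        out.write(json.dumps({"passage": passage, "qa": make_qa(passage)}) + "\n")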
Anonymous No.106580839 [Report]
>>106577739
>>106577768
PSA: if you have the latest llama.cpp build, you no longer need to set -ngl to 99; they are finally starting to bring sane defaults to llama.cpp.
--no-context-shift is no longer needed either; they finally got rid of that mind-numbingly stupid default
Anonymous No.106580842 [Report] >>106580851
I can release prompts for interactive fiction. I just think that the people who ask for them don't need to know.
Anonymous No.106580851 [Report] >>106580894
>>106580842
You're absolutely right! Your genius is best contained to yourself and not spread to idiotic plebs.
Anonymous No.106580894 [Report]
>>106580851
It's not because they need to know; it's because you are unique<|analysis
Anonymous No.106580909 [Report] >>106580914
Let's imagine a situation in which I am forced to release a simple text file - this would be adaptable even by ST users. Do I feel inclined to do so?
Anonymous No.106580914 [Report] >>106580923 >>106580960
>>106580909
Why release when you can HODL?
Anonymous No.106580923 [Report] >>106580935
>>106580914
My knowledge doesn't understand HODL
Anonymous No.106580935 [Report] >>106580950 >>106580960
>>106580923
Then your knowledge is worthless I'm afraid.
Anonymous No.106580940 [Report] >>106580951
Qwen Next is obsessed with short sentences. It's annoying. I checked the one on OpenRouter to make sure it wasn't an artifact of the AWQ version. It still has that. But in this story that I'm trying, the AWQ version always ignores what I put in the last turn, while the full version always pays attention to it. I'm deleting it and giving the FP8 version a try, but it still feels like a big downgrade compared to GLM Air or gpt-oss.
Anonymous No.106580943 [Report] >>106580956
I am going to think about this, and then release a simple format. This will make most people's chats better. This is not a joke.
Anonymous No.106580944 [Report]
>>106580794
Didn't you know?
LLMs have peaked
It's over
Anonymous No.106580950 [Report]
>>106580935
I don't query into deep joking.
Anonymous No.106580951 [Report] >>106580994
>>106580940
Kimi K2 loves short answers too, maybe distillation.
Anonymous No.106580956 [Report]
>>106580943
Tomorrow, I am preparing a simple format to help brainlets.
Anonymous No.106580960 [Report]
>>106580914
>>106580935
cryptobro knowledge belongs to the oven
Anonymous No.106580966 [Report] >>106580971 >>106581382
>>106580794
SOTA on what? Equally synthetic mememarks? lmfao
Anonymous No.106580971 [Report]
>>106580966
for synthetic use cases yes
Anonymous No.106580974 [Report]
Anyone having success with Longcat Flash Chat? I'm using a 5.5 bpw quant with 0.7 temp & 0.8 top-p and I'm finding its ability to write stories unsatisfactory.
Anonymous No.106580994 [Report]
>>106580951
I actually have two swipes with the updated Kimi K2 at this point in the story. It's nothing like that and it writes quite well.
Anonymous No.106581071 [Report]
>>106576269
>DeepSeek flops for the first time with V3.1
IDK what you mean. It's what I use now instead of V3-0324 or R1-0528.
Anonymous No.106581286 [Report] >>106581534 >>106581599
>>106576269
>DS v3.1
>flop
Skill. Issue.
Anonymous No.106581382 [Report]
>>106580966
Math, programming, anything that has benefited from CoT.
Anonymous No.106581534 [Report] >>106581607
>>106581286
GLM-chan does her best and doesn't degrade at all.
Anonymous No.106581599 [Report] >>106582053
>>106581286
most retarded benchmark in the history of llm benchmarks
LLM as judge for human writing LOL
Anonymous No.106581607 [Report] >>106581723
>>106581534
GLM-chan is fat and obese and stinky
Anonymous No.106581632 [Report]
>>106579721
>I am not
descartes is sad
Anonymous No.106581723 [Report]
>>106581607
Shut up, Sam.
Anonymous No.106581987 [Report] >>106582014
>>106575202 (OP)
I clicked on the image and I got a bigger version of the image.
Anonymous No.106582014 [Report] >>106582061 >>106582140
>>106581987
yes that is how this site works
Anonymous No.106582053 [Report] >>106582089 >>106582090 >>106582101
>>106581599
sama coping because gp-toss ranks below gemma3 12b
Anonymous No.106582061 [Report]
>>106582014
Wait until he finds out selecting text to quote-reply. It's gonna blow his fucking mind.
Anonymous No.106582089 [Report]
>>106582053
Mistralbros...
Anonymous No.106582090 [Report] >>106582103
>>106582053
>0.770
Anonymous No.106582091 [Report]
>Of course!
>Exactly!
>You're absolutely right!
Anonymous No.106582101 [Report] >>106582202
>>106582053
speaking of toss
https://old.reddit.com/r/LocalLLaMA/comments/1ng9dkx/gptoss_jailbreak_system_prompt/ne306uv/
Anonymous No.106582103 [Report]
>>106582090
Perhaps it has a degradation fetish.
Anonymous No.106582140 [Report]
>>106582014
With the giant sign, it seemed set up for the every-time-you-open-this-thumbnail meme.
Anonymous No.106582171 [Report]
Is rvc still the king for ai voice covers?
Anonymous No.106582173 [Report] >>106582183
>>>/pol/515958539
Anonymous No.106582183 [Report] >>106582210
>>106582173
I'm glad they're finally banning those pesky white and black bars.
Anonymous No.106582186 [Report] >>106582202
After using GPT-OSS-20B for a while, for a variety of reasons, Gemma-3-27B almost feels like an erotic finetune. It still can't write smut, but it has a rather flirty writing style and will do almost anything, as long as you provide it with suitable instructions for doing so. GPT-OSS, even after "jailbreaking", is always fighting against you and prioritizing its imaginary OpenAI policies, and is utterly retarded for actual conversations.

I hope Google won't ruin Gemma-4. It's almost guaranteed they'll add reasoning, probably MoE or Matformer architecture, possibly system instruction support due to popular demand.
Anonymous No.106582202 [Report] >>106582224
>>106582186
imagine being filtered more than reddits >>106582101
Anonymous No.106582210 [Report]
>>106582183
that is clearly a yellow bar
Anonymous No.106582224 [Report] >>106582316
>>106582202
The "jailbreak" there doesn't really work well. The first mistake is telling the model it's ChatGPT.
Anonymous No.106582316 [Report] >>106582339 >>106582377
>>106582224
nta but it actually does work on 120b, I stopped getting refusals
it still wastes something like 192 tokens on its schizo policies at reasoning_effort high
Anonymous No.106582339 [Report]
>>106582316
You mean 500 tokens
The jb alone is 300 tokens
Anonymous No.106582377 [Report]
>>106582316
You can mitigate refusals by changing the actual system prompt (not the "developer" instructions) on the 20B version too. It's just not good for roleplay, and some topics will still be off-limits no matter how hard you try to override the content policy or change the model's identity. Gemma 3 refuses hard with an empty prompt, but you can very easily work around that, and then it will even enthusiastically follow along. It just feels like it's been covertly designed for roleplay, whereas GPT-OSS probably had these capabilities removed or omitted. I haven't tested it for storywriting.
Anonymous No.106582402 [Report] >>106582449
is there a better alternative to whisper? I tried out parakeet and it likes to skip sentences
Anonymous No.106582449 [Report]
>>106582402
No.
Anonymous No.106582451 [Report]
https://github.com/ggml-org/llama.cpp/issues/15940
Why are there so many vibecoding retards trying to implement this?
Anonymous No.106582488 [Report]
>>106582475
>>106582475
>>106582475
Anonymous No.106582508 [Report]
>>106579721
You need to use quality quants in llama.cpp