
Thread 106364639

382 posts 70 images /g/
Anonymous No.106364639 >>106364672 >>106364684 >>106364734 >>106364785 >>106364805 >>106367419
/lmg/ - Local Models General
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106358752 & >>106351514

►News
>(08/23) Grok 2 finally released: https://hf.co/xai-org/grok-2
>(08/21) Command A Reasoning released: https://hf.co/CohereLabs/command-a-reasoning-08-2025
>(08/20) ByteDance releases Seed-OSS-36B models: https://github.com/ByteDance-Seed/seed-oss
>(08/19) DeepSeek-V3.1-Base released: https://hf.co/deepseek-ai/DeepSeek-V3.1-Base
>(08/18) Nemotron Nano 2 released: https://research.nvidia.com/labs/adlr/NVIDIA-Nemotron-Nano-2

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
Anonymous No.106364646
►Recent Highlights from the Previous Thread: >>106358752

--Grok-2 release and licensing limitations prevent local use and model distillation:
>106360573 >106360583 >106361215 >106361234 >106361251 >106361491 >106361508 >106361292 >106363242 >106361355 >106361361 >106361382 >106361396 >106361554 >106361358 >106361677
--Achieving ultra-fast local ERP inference with aggressive quantization and high-memory setups:
>106359914 >106359949 >106359984 >106359998 >106359974 >106359978 >106360195 >106360219 >106360243 >106362566 >106362839 >106362874 >106360238 >106360272 >106360289 >106360298 >106362206 >106363781 >106362235 >106359989 >106360005 >106360022 >106360042 >106359993
--Quantization tradeoffs: Q4_K_M often sufficient, but higher quants better if resources allow:
>106363201 >106363258 >106363281 >106363328 >106363371
--Command A Reasoning released with strong safety and competitive performance:
>106358780 >106358832 >106358856
--MoE models require neutral prompting to avoid schizophrenic behavior:
>106359448 >106359923 >106360407
--Timeline chart of LLM evolution from LLaMA2 to projected Chinese dominance era:
>106358892 >106358922 >106358959 >106359070 >106359105 >106359241 >106359351 >106359450 >106359474 >106359780
--Skepticism over Elon's claim that Grok 3 will open-source in six months:
>106362417 >106362439 >106362483 >106362602 >106363177 >106363241 >106363972
--Investigate prompt and tokenization differences causing qwen3-30b looped thinking on llama.cpp:
>106362795 >106362812 >106362823 >106362840
--DeepSeek-V3.1 tradeoffs: better context handling but more autistic behavior:
>106362644 >106362661 >106362676 >106362699
--Lightweight TTS options for low VRAM and fast LLM inference:
>106360462 >106360524 >106360638
--Miku (free space):
>106358887 >106362792

►Recent Highlight Posts from the Previous Thread: >>106358757

Why?: >>102478518 (Dead)
Enable Links: https://rentry.org/lmg-recap-script
Anonymous No.106364672 >>106364684 >>106364755
>>106364639 (OP)
I like this Miku
Anonymous No.106364684 >>106364844
>>106364672
>>106364639 (OP)
Isn't this from the cover art for Heinlein's "Stranger in a Strange Land"?
Anonymous No.106364734
>>106364639 (OP)
>Nemotron Nano 2
Why has no one quantized it? Is it a failure?
Anonymous No.106364755
>>106364672
I like the recap miku better, primarily because of the "o-oe-oo". Love that shit
Anonymous No.106364785 >>106364801 >>106364850 >>106364866 >>106367573
>>106364639 (OP)
Hello, I am new to local LLMs - what is the purpose of them?
I entertain the thought of having a sub 70b LLM, but I just cannot fathom a use-case for it.
Can you guys let me know what they could be used for?
Anonymous No.106364801
>>106364785
Advanced masturbation
Anonymous No.106364805 >>106364847
>>106364639 (OP)
stranger in a strange land sucked
mike sucks
Anonymous No.106364844
>>106364684
Yes... the origin of the word grok.
Anonymous No.106364847 >>106364899
>>106364805
>stranger in a strange land sucked
It went real quick from an interesting premise to an author's barely disguised fetish.

>mike sucks
you suck
Anonymous No.106364850
>>106364785
the 3 Cs of LLMs
>coding
>cooming
>coming up with a third C
Anonymous No.106364855
Anonymous No.106364865
LLMs are useless
Anonymous No.106364866 >>106366501 >>106369275
>>106364785
RLM simulator
My friends don't watch that many movies
Redditoors and letterbox niggers suck ass
X is jeetslop

So i can only talk to LLMs about the movies i like.
I also use coding models for making small games and apps for myself.
Anonymous No.106364876
LLMs are useful
Anonymous No.106364899 >>106364909
>>106364847
That's pretty much every Heinlein book.
>dude dies
>has his brain transplanted into voluptuous female body
>explores his new sexuality
or
>dude has himself cloned with his Y chromosome replaced with another copy of his X
>...
>fucks his clone
Sci-fi at its finest.
Anonymous No.106364900 >>106366046 >>106367183
Anonymous No.106364909
>>106364899
repressed tranny
Anonymous No.106364945 >>106364952 >>106367586
Deepseek coin when? I need funds for gpus
Anonymous No.106364952
>>106364945
Just ask the model what shitcoin to buy
Anonymous No.106365043 >>106365112 >>106366052
audio input when? video input when? image generation when? llama.cpp is lagging behind
Anonymous No.106365112
>>106365043
As long as the underlying modality is LLM tokens, they will all be shit
Anonymous No.106365481
grok 2 killed local
Anonymous No.106365538
i want bytedance seed inside me
Anonymous No.106365569 >>106365576 >>106365583 >>106366515
Is there anything better than GLM-4.5 Air that I can run on a Ryzen 9800 with 64 GB DDR5 along with an 8GB 3060 Ti?
GLM Q3 seems to be the sweet spot so far. It makes both my CPU and GPU work and just barely fits into memory with 32k context, while running at an enjoyable speed.

I think I'm giving up on small 8GB models that fit into VRAM. Any good MoE models other than GLM?
Are there any GLM mixes or variants trained on roleplay slop, or is it too early for that?
Anonymous No.106365576
>>106365569
GLM Air is probably the best thing your rig can run; all the better MoEs are much larger, and all the smaller MoEs aren't very good.
Anonymous No.106365583 >>106365589
>>106365569
arent you satisfied enough with GLM sex?????
Anonymous No.106365589
>>106365583
It's great. But I'm always keeping my eyes open for better models.
Anonymous No.106365592 >>106365613 >>106365670 >>106366493
Has anyone managed to install
>https://github.com/KittenML/KittenTTS
Did
>pip install https://github.com/KittenML/KittenTTS/releases/download/0.1/kittentts-0.1.0-py3-none-any.whl
But it complains about no matching distribution found for misaki (which I have installed anyway). I'm not that technical with python dependencies, and especially wheels are a bit of a mystery to me.
Anonymous No.106365613 >>106365636
>>106365592
Why do you want it anyway
Anonymous No.106365617 >>106365623 >>106365624 >>106365634 >>106367605
why are local llms so fucking wordy? I ask it to check grammar in one sentence and I get a whole fucking essay.
Anonymous No.106365623 >>106365636 >>106365637 >>106365648
>>106365617
You're supposed to use another LLM to summarize it
Anonymous No.106365624 >>106365637
>>106365617
*why are llms
Anonymous No.106365634 >>106365662
>>106365617
Just write JSON instructions for repeated tasks,
that's what I do
>Grammar checks
>One word answers
>Bullet points
>Rp
>story writing
>criticism

etc
Anonymous No.106365635 >>106365774 >>106366445 >>106368339 >>106368366
mistralbros we are so back
Anonymous No.106365636
>>106365623
>>106365613
Get fucked, cretin.
Anonymous No.106365637
>>106365624
>>106365623
is that how all the online llms do it? That's crazy.
Anonymous No.106365648 >>106365659
>>106365623
And then have a third LLM grade its performance
Anonymous No.106365659
>>106365648
fourth llm generates an image for your query
Anonymous No.106365662 >>106365668
>>106365634
so something like
You are a meticulous grammar assistant who extracts grammar mistakes from provided text in structured JSON format. Use the following JSON template: [ { "mistake_id": 1, "mistake_text": "betwen", "corrected": "between" } ] Here is the provided text:

mostly checking if it can build something like languagetool
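For reference, a minimal sketch of how that prompt could be wired up against a local server (assumes a llama.cpp llama-server listening on 127.0.0.1:8080 with its OpenAI-compatible chat endpoint; the port, prompt wording and test sentence are just illustrative):

import json, urllib.request

SYSTEM = ('You are a meticulous grammar assistant who extracts grammar mistakes from provided text. '
          'Reply ONLY with JSON in this format: '
          '[ { "mistake_id": 1, "mistake_text": "betwen", "corrected": "between" } ]')

def check_grammar(text):
    payload = {
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": text},
        ],
        "temperature": 0,  # keep structured output as deterministic as possible
    }
    req = urllib.request.Request(
        "http://127.0.0.1:8080/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)["choices"][0]["message"]["content"]
    return json.loads(reply)  # raises if the model ignored the format

print(check_grammar("There is a difference betwen these two words."))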
Anonymous No.106365668 >>106365718
>>106365662
schizo prompt
llms aren't trained with or
Anonymous No.106365670 >>106366331
>>106365592
this works for me
https://github.com/clowerweb/kitten-tts-web-demo
Anonymous No.106365682 >>106365699
Mistral small 3.2 is unironically better than Gemma for RP
Not just for sex. Gemma is fucking terrible at understanding subtext and struggles with character development: if you define X trait in a character card as a starting point, that character will ALWAYS be defined by X, no matter the context. Also, if you make a character angry it will NEVER fucking calm down; it will act like a hysterical 45 year old HR worker who just saw a 30 year old man kiss an 18 year old woman, until the heat death of the universe.
tl;dr total frog victory
Anonymous No.106365699 >>106365707 >>106365731 >>106368179
>>106365682
mistral small 3.2 gets mogged by Rocinante, it's time to upgrade anon
Anonymous No.106365707
>>106365699
I used to use nemo/rocinante, small is much better, even 3/3.1 were better.
Anonymous No.106365709 >>106365732 >>106366442 >>106366457
Total noob here. For Grok 2 it says

>Use the command below to launch an inference server. This checkpoint is TP=8, so you will need 8 GPUs (each with > 40GB of memory).

To run this thing locally I need, say, eight NVIDIA A100s ($15,000 each)? $120,000 for the GPUs alone?
Anonymous No.106365718 >>106365768
>>106365668
Just copying stuff from Google, no idea how this works. So far it works fine but sometimes fucks up by inserting random bullshit.
Anonymous No.106365731 >>106365739 >>106365741
>>106365699
>user: hello
>char: walks over and grabs your cock, "fuck me, anon. Make me yours"
wow, great erp!! So immersive.

I can name 4 nemo finetroons that beat out rocinante by a mile.
Anonymous No.106365732 >>106365773
>>106365709
did you expect elon musk was using consumer cards or something?
Anonymous No.106365739
>>106365731
go ahead anon humor me with that massive cock of yours
Anonymous No.106365741 >>106365762 >>106366365
>>106365731
Not him, but what are some good non-drummer nemo tunes?
Anonymous No.106365762 >>106365767 >>106365783
>>106365741
Bigger-Personality
Irix-12B-Model_Stock
MN-12B-Mag-Mell-R1
patricide
Golden-Curry
Anonymous No.106365767
>>106365762
>curry mentioned
I LOVE INDIA
Anonymous No.106365768
>>106365718
Well then read a few pages on how these instructions work.
Anonymous No.106365773
>>106365732
No. I just wanted to know whether I, a total noob, understand this correctly. Reading the previous thread it seems so. How sad.
Anonymous No.106365774
>>106365635
Can't wait for them to not release it
Anonymous No.106365783 >>106365794
>>106365762
>frankenmerges
>coomtunes
>model cards recommending temp 1.25 for nemo
Anonymous No.106365784 >>106365787
mistral 2508 bros???????
Anonymous No.106365787
>>106365784
There's a week left in august. But I'd wait at least one extra week afterwards, to be safe.
Anonymous No.106365794 >>106365916
>>106365783
...and yet better than rocinante
Anonymous No.106365797 >>106366725
Quick initial impression: I like the way GLM-4.5 (355B-A32B) writes more than ERNIE-4.5-300B-A47B, in case you were about to download one of these and were deciding which to try first.
Anonymous No.106365812 >>106365815 >>106366182
>have long script ai helped me make
>ask it to help me clean it up, reorganize it
>it nails it
>ask it a question about copying an existing feature to another part
>it explodes
mistral small 24b. it just doesn't want to do it no matter how i ask. dling devstral to try, but that's based on the old mistral 3.1 small. i'm amused it could nail something first pass, then just shit the bed like this
Anonymous No.106365815 >>106365830 >>106365931
>>106365812
Mistral models don't really excel in coding, especially the small ones. You should be using qwen 32b-coder in that size range.
Anonymous No.106365830 >>106365837
>>106365815
yeah i used qwen 2.5 coder in the past and it's pretty good. i need to see what the equiv of 3 is for similar size. i just like mistral small 24b (with thinking) as a smart small model that fits in my vram. it's always been pretty good. but now i just watched it impress me, then shit itself. it's even having trouble with some basic html stuff
Anonymous No.106365837 >>106365847 >>106365862 >>106365868
>>106365830
>i need to see what the equiv of 3 is for similar size.
https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct
Anonymous No.106365847
>>106365837
anon im in love thank you
Anonymous No.106365862
>>106365837
thanks, dling now. how would you compare this to qwen 2.5 coder or older codestral 22b? i was using the 32b qwen
Anonymous No.106365868 >>106365884
>>106365837
In the one test I ever tried Devstral-Small-2507 and 2505 shat on Qwen3-Coder-30B-A3B-Instruct.
Anonymous No.106365884
>>106365868
and?? tell us more
Anonymous No.106365916
>>106365794
Bait used to be believable
Anonymous No.106365931 >>106365947 >>106366387
>>106365815
>You should be using qwen 32b-coder

I see Qwen coder being recommended again and again ad nauseam

is it really that good? Don't tell me "for its size and speed"
Anonymous No.106365947 >>106365973 >>106365979
>>106365931
>Don't tell me "for its size and speed"
Are you a retard or something? If you want to run something locally and you don't have a cpumaxx rig with well over 128GB of memory then yes, it's about the best you can get. If your dad works at nvidia and can buy you a few H100s for your birthday then go run deepsneed locally, or pay for access to one of the big flagship models hosted by corpos.
Anonymous No.106365973 >>106365978
>>106365947
>you don't have a cpumaxx rig

I do have a cpumax rig
So, what's the point?
Anonymous No.106365978
>>106365973
Then you should be capable of doing your own research and testing
Anonymous No.106365979
>>106365947
>or pay for access to one of the big flagship models hosted by corpos

You can use ds for free online
Anonymous No.106366017 >>106366028 >>106366043 >>106366069 >>106366074 >>106366176 >>106366938
The LLM industry is built on a house of cards and it's just a matter of time before it crumbles. Most "best practices" or "recommendations" are a lie. Scaling never actually worked as claimed and if anything made the problem worse. It's going to crash so hard I'm actually worried for the economy.
Anonymous No.106366028
>>106366017
LLMs help me coom and I can do that with models that already exist on my hard drive so I dont care
Anonymous No.106366043
>>106366017
>I'm actually worried

A desperate attempt of attention seeking

You won
Anonymous No.106366046 >>106366126
>>106364900
So GGUF is the sound she makes when the drink goes down the wrong pipe?
Anonymous No.106366052
>>106365043
Lagging behind what?
Anonymous No.106366060
We already reached the endgame(Nemo). What are you still doing here?
Anonymous No.106366069
>>106366017
Consider this: the amount of information that LLMs can store in their weights is limited by their parameter size, around 2~4 bits of information per parameter (in the information-theoretical sense), depending on who you ask. There's no way they're actually learning efficiently from tens of trillions of tokens of data with current large-scale pretraining methods.
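Back-of-envelope version of that, with purely illustrative numbers (the 2~4 bits/param figure is the one above; the corpus size is just a typical modern pretraining budget):

params = 30e9            # a 30B-parameter model
bits_per_param = 3       # middle of the quoted 2~4 bits/param range
train_tokens = 15e12     # ballpark modern pretraining corpus

capacity_bits = params * bits_per_param
print(capacity_bits / 8 / 1e9)       # ~11 GB of storable information
print(capacity_bits / train_tokens)  # ~0.006 bits retained per training token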
Anonymous No.106366074
>>106366017
>It's going to crash so hard I'm actually worried for the economy.
There are so many reasons you should be worried about the economy that LLMs shouldn't even make the top 50, anon. Get some perspective.
Anonymous No.106366126 >>106366168
>>106366046
that drink isn't for her, it's for you.
Anonymous No.106366168 >>106366171
>>106366126
I'm not thirsty.
Anonymous No.106366171
>>106366168
you don't have a choice
Anonymous No.106366176
>>106366017
Just short big tech if you believe so. There's money to be made if you think something will crash.
Anonymous No.106366182
>>106365812
AI labs tunnel vision hard on zero-shot problems instead of multi-turn. That's why. Just start over and make a new prompt for your new problem.
Anonymous No.106366205 >>106366372
Are there any slop scoring models, for data filtering and such? Sure, counting words and n-grams would get you far, but what about deeper stuff? I bet even a classifier model that tells apart male and female prose would be useful.
Anonymous No.106366305
Is there a way to roughly calculate what speed you can expect on a cpu + gpu build for MoE models running -ot exps=cpu?
In particular, what would happen if you upgraded the GPU part of your server with something that's a lot faster like a 5090 or switched out the CPU+RAM part for one of those server boards with lots of channels.
By splitting your model across both RAM and GPU it obviously stops scaling linearly with the bandwidth of either, which is confusing. Is there a bottleneck hidden here somewhere? Like, even if you had a 12-channel DDR5 server, would it not make much of a difference in token generation speed between a 3090 and a 5090, despite the latter having twice the memory bandwidth, because of something retarded like PCI-E bandwidth getting in the way?
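For what it's worth, the crude way I'd estimate it (a sketch only: assumes token generation is purely memory-bandwidth-bound, routed experts stream from system RAM, everything else from VRAM, and ignores compute and PCI-E entirely; every number below is an illustrative guess, not a measurement):

def est_tokens_per_sec(bytes_cpu, bytes_gpu, ram_bw, vram_bw):
    # time per token = bytes streamed from each memory pool / that pool's bandwidth
    return 1.0 / (bytes_cpu / ram_bw + bytes_gpu / vram_bw)

active_bytes = 12e9 * 4.5 / 8       # ~12B active params at ~4.5 bits/weight
print(est_tokens_per_sec(
    bytes_cpu=0.8 * active_bytes,   # guess: 80% of the active weights live in RAM
    bytes_gpu=0.2 * active_bytes,
    ram_bw=80e9,                    # dual-channel DDR5-ish
    vram_bw=936e9,                  # 3090-class GPU
))                                  # ~14 t/s, dominated by the RAM term

If that's roughly right, the RAM term swamps everything for token generation, which would explain why 3090 vs 5090 barely matters there while more memory channels do; prompt processing is compute-bound and is a different story.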
Anonymous No.106366331 >>106366357
>>106365670
Instead of kitten, I implemented a Piper solution for my client. Took 15 minutes to set it up in the python script.
>https://github.com/OHF-Voice/piper1-gpl
It's fun but the voices are so monotonous that it will get pretty dull fast.
Anonymous No.106366345 >>106366528 >>106367123
.
Anonymous No.106366357
>>106366331
To add: there is something more relaxing about being able to hear the voice instead of just staring at the wall of text. Maybe I'm somewhat dyslexic or it's an ergonomic issue...
Anonymous No.106366365 >>106366402
>>106365741
>good
>tunes
lol, lmao even
lay off the troonshine and learn to prompt
Anonymous No.106366372 >>106366414
>>106366205
It won't get you anywhere. We've come far since early kaioken's slop issues, and I barely encounter those early phrases outside of their appropriate context now. But the current iteration of slop is no less annoying. It's the inherent repetitiveness of the patterns that's making the vein pop, not their fotm class.
Anonymous No.106366387
>>106365931
>Don't tell me "for its size and speed"
It's really "for its size and speed" otherwise you would just go paypig for Claude nigger
with that said, all Qwen code models are best in class in their size range. That includes the 480B-A35B if you can run that. Qwen makes models for almost all sizes except the gigawhale range (where DeepSeek and Kimi K2 are only game in town anyway)
Anonymous No.106366402
>>106366365
tongue my anus
Anonymous No.106366414
>>106366372
Maybe what I mean isn't slop but the opposite of show-not-tell; both have a lot in common. So I thought maybe approaches like style clustering could help filter it out, but I don't think there are any embedding models for this.
Anonymous No.106366442
>>106365709
This is all bloat they rushed out to catch up with the big players. Requirements for Grok 3 may be less demanding.
Anonymous No.106366445
>>106365635
Gemini 2.5 is still #1 jai hind
Anonymous No.106366457
>>106365709
If anyone cared to implement this trash fire in llama cpp, the requirements would be lowered with quantization, but yes, these requirements are typical of any unquanted API model and all of us are in fact coping hard running local models quanted because even a small model wouldn't fit our GPUs without quant
desu I would be more interested in an open source release of the mini versions of Grok
Anonymous No.106366465 >>106366471
If one more person criticizes my ability to prompt I'm going to cry
Anonymous No.106366471 >>106366524
>>106366465
promptlet
Anonymous No.106366493 >>106366531 >>106366658
>>106365592
Will the suckers ever learn that conda is an essential tool...


install conda on your computer then run these commands one by one

conda create -n "py310" python=3.10
conda activate py310
pip install https://github.com/KittenML/KittenTTS/releases/download/0.1/kittentts-0.1.0-py3-none-any.whl --no-cache-dir


--no-cache-dir is not needed, but it will force pip to download all stuff instead of using cached data.
When the installation finishes, start python and paste this when you see the python prompt '>>>':

from kittentts import KittenTTS
m = KittenTTS("KittenML/kitten-tts-nano-0.2")

audio = m.generate("This high quality TTS model works without a GPU", voice='expr-voice-2-f')

# available_voices : [ 'expr-voice-2-m', 'expr-voice-2-f', 'expr-voice-3-m', 'expr-voice-3-f', 'expr-voice-4-m', 'expr-voice-4-f', 'expr-voice-5-m', 'expr-voice-5-f' ]

# Save the audio
import soundfile as sf
sf.write('output.wav', audio, 24000)



Look for the file output.wav
Anonymous No.106366501 >>106369364
>>106364866
don't we have /tv/ here or something
>inb4 4chins suck more ass than redditors
prompt issue
Anonymous No.106366515 >>106368591
>>106365569
Give me your llama.cpp args, because somehow I can't even fit a single layer of GLM-Air-chan into my 8G of VRAM; it always complains about being out of memory even if I set -ngl 0 and -c 512
Anonymous No.106366518 >>106367088 >>106367644
Leak please, I want a bingo
Anonymous No.106366524
>>106366471
Anonymous No.106366528
>>106366345
It's so over
Anonymous No.106366531 >>106366637
>>106366493
nta, my conda experience was waiting for 10 hours for its retarded package manager to fail. It doesn't even tell you why it's hanging, fucking disrespectful software. Also venv creates the environment in whatever directory I want; I don't understand why anyone would use anaconda.
Anonymous No.106366559 >>106366660
I use Anubis-70B-v1-IQ4_XS.gguf [llama.cpp] for ERP. Sell me on a better model that works well on two 3090s.
Anonymous No.106366567 >>106366579 >>106366590 >>106366681 >>106370898
===FRIENDLY REMINDER===
DOWNGRADE YOUR CUDA IF YOU ARE USING RTX 3000 SERIES CARDS
NVIDIA IS RUINING PERFORMANCE ON THOSE WITH THE NEW VERSIONS OF CUDA
I accidentally updated torch yesterday from CUDA 12.6 to CUDA 12.8 and it ran like shit and even OOMed. After going back to CUDA 12.6 the issues were gone.
Anonymous No.106366579
>>106366567
Never upgrading unless absolutely necessary chads, where we at? Still at 12.5 here.
Anonymous No.106366585
forever cuda 12.4
Anonymous No.106366590
>>106366567
>old software
>new version
>better, faster
>modern software
>more workarounds, slower, more bugs
Why is this?
Anonymous No.106366637
>>106366531
My experience so far has been rather positive. Python is bloatware, so what.

One way or another you will have to isolate your installation in a separate venv. I had cases where one module installed torch 2.8.0 only for another module to downgrade it to 2.7-something, and the whole thing did not fly because of this.

Python's native venv would do the trick as well
Anonymous No.106366642 >>106366720 >>106366897
https://github.com/LostRuins/koboldcpp/releases/tag/v1.98
Anonymous No.106366658 >>106366659 >>106366691
>>106366493
Thanks, I'll take a look at it.
As fun as python can be for retards I hate the bloat. Install one thing and you'll need 20 extra modules.
Anonymous No.106366659
>>106366658

Godspeed, anon
Anonymous No.106366660
>>106366559
Alpaca.
Anonymous No.106366681
>>106366567
windows doesn't have this problem
Anonymous No.106366691 >>106366716 >>106366741 >>106366775
>>106366658
How is this bloat? Those modules are explicitly required by the module you need. Sometimes, rarely, modules will require more than they actually need (like requiring stuff for training when you just need to infer), but that's not a Python "bloat" problem, just the authors of those modules being careless.
Anonymous No.106366716 >>106366751
>>106366691
>How is this bloat?

if you create a venv, all stuff needed will be copied in there. You'll end up with multiple copies of torch etc.
Anonymous No.106366720 >>106366763
>>106366642
Alright how do I into thinking budget for seed 36b?
Anonymous No.106366725
>>106365797
I agree with this. ERNIE-4.5 isn't that impressive as a text model.
Anonymous No.106366732 >>106366853 >>106366855 >>106366906
https://github.com/ggml-org/llama.cpp/pull/15524
AMD prompt processing speed more than doubling on vulkan for MoE models
Anonymous No.106366741 >>106366751
>>106366691
The only design decision of python that I think is really bad is allowing a >= version specification in the requirements. It's the only way you can get an error under normal circumstances
Anonymous No.106366751 >>106366786 >>106366795
>>106366716
OK, I agree that this can be seen as bloat, but if you make symlinks to already installed stuff instead, edits in one venv can affect another one. Choosing between the two, separation and disk space, I'd choose to spend the disk space.

>>106366741
>different OSes
>different python versions
>also implying author won't just reupload a different thing under the same version
Anonymous No.106366763
>>106366720

It's all in the system prompt, anon. Just add it to your system prompt in this format:

"You are an intelligent assistant with reflective ability. In the process of thinking and reasoning, you need to strictly follow the thinking budget, which is {{thinking_budget}}. That is, you need to complete your thinking within {{thinking_budget}} tokens and start answering the user's questions."
Anonymous No.106366767 >>106366938
Thoughts on JEPA?
Anonymous No.106366775
>>106366691
I meant in general, dumbass.
Anonymous No.106366786
>>106366751
>different OSes
>different python versions
could be useful
>also implying author won't just reupload a different thing under the same version
what kind of idiot would do this
Anonymous No.106366795
>>106366751
>OK I agree that this can be seen as bloat, but if you make symlinks to already installed stuff instead, edits in one venv can affect another one. Choosing between two separation and disk space, I'd choose disk space.

This unironically
Anonymous No.106366836
>i'm a nigbophile
Anonymous No.106366853
>>106366732
so many delicious pulls blue balling me
Anonymous No.106366855
>>106366732
And that's a good thing, since rocm is fucking dead in the water
Anonymous No.106366897 >>106366901
>>106366642
The maintainer wants to fuck shortstack dragon lizards.
Anonymous No.106366901
>>106366897
If you don't then you don't belong here
Anonymous No.106366906 >>106366927 >>106366930
>>106366732
man the radeon vii is so funny, they made a consumer gaming card with insane memory bandwidth, twice as much vram as anyone needed, but disappointing gaming performance. there was literally no good use case for the thing on release, but if they did an experiment like that today they'd have made the perfect local llm card for people who don't want to build a home datacenter
Anonymous No.106366927 >>106366941 >>106366978
>>106366906
Why does AMD manage to suck so much at GPUs (besides the obvious cousin at nvidia) compared to CPUs? They should just sell off the GPU division to someone competent.
Anonymous No.106366930
>>106366906
AMD has since become 100% controlled opposition, so that'll never happen. You think it's a coincidence that AMD drops out of high-end at the same time that nvidia releases their worst generation of cards, ever?
Anonymous No.106366938 >>106367082
>>106366017
>pic related
>>106366767
The currently released JEPA models like V-JEPA 2 are primarily for robotics and still need improvements before they can actually be used for serious tasks. The LANG-JEPA model that the regular Joe would care about and use is not ready and there aren't any regular updates on the status of that model.
Anonymous No.106366941
>>106366927
The US government should buy it
Anonymous No.106366969
Those retards are basically the ollama of quanters. I have no idea what they are doing with this shit, but you can quant all attention layers to at least 5bpw, shared experts to 4.5bpw, and routed experts to 2.1-2.3bpw and get the same file size.
Anonymous No.106366978
>>106366927
>They should just sell off GPU division to someone competent.
It's not native to them. They just bought ATI.
Anonymous No.106367082 >>106367086
>>106366938
lecunny btfo by an auto-regressive llm
Anonymous No.106367086
>>106367082
btw this is a fresh chat
there is no previous context
Anonymous No.106367088 >>106367112
>>106366518
>we are 2/3 through the year
>nobody, not even proprietary, has figured out how to do long(1M+) context properly
it's over
Anonymous No.106367112 >>106367125 >>106367141 >>106367143 >>106367157 >>106367178 >>106367190
>>106367088
Consider the following: There are no 1M token long documents to train on.
Anonymous No.106367123
>>106366345
>2023->2024
Llama1->Miqu
>2024->2025
Miqu->R1

This gives me hope for the future.
Anonymous No.106367125 >>106367136 >>106367392
>>106367112
lotr trilogy is ~500k words so around ~600k tokens
Anonymous No.106367136
>>106367125
nta

count all characters and divide by 4 for English texts.

Will be different for non-English text

The Bible is 3.11m letters. idk if spaces are included
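Plugging that into the rule of thumb above (chars/4 is only a rough English average, so treat this as ballpark):

chars = 3_110_000     # the Bible, per the figure above
print(chars / 4)      # ≈ 780k tokens

So even the entire Bible doesn't get you to 1M tokens, which is kind of the point.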
Anonymous No.106367141 >>106367147 >>106367186
>>106367112
WRONG!
Anonymous No.106367143
>>106367112
Consider the following: Just concat multiple smaller documents together.
Most uses for long context are having it work on many source files from a code base, analyzing lots of web search results, or writing novels.
Anonymous No.106367147 >>106367161
>>106367141
You see those multiple colors? Those are series with multiple books, not a single document.
Anonymous No.106367157 >>106367399
>>106367112
Codebases, anon, codebases! Take for example llama.cpp.
Anonymous No.106367161
>>106367147
The context of the previous book matters for the next book.
Anonymous No.106367178
>>106367112
just give it the linux source code
Anonymous No.106367183
>>106364900
Ha! I can tell that's qwen because the dumb thing takes "can of beer" to be can + beer, ending up somehow with a transparent can showing the beer inside.
Anonymous No.106367186
>>106367141

None of them is a coherent flow of consciousness

Just some nonsensical collection of pseudoscience
Anonymous No.106367190
>>106367112
What about all the IM logs that are totally private?
Anonymous No.106367193 >>106367213 >>106367228
It's up!
https://huggingface.co/collections/sugoitoolkit/sugoillm-68aa6049fbd744558d952925
Anonymous No.106367213
>>106367193
Buy an ad nigger
Anonymous No.106367228
>>106367193
>2 months old
>gguf only
>zero information
>Qwen/Qwen2.5-32B
You can fuck right off
Anonymous No.106367304 >>106367313 >>106367326
K2 reasoner... never...
Anonymous No.106367313
>>106367304
Obsoleted by Sugoi anyway
Anonymous No.106367326 >>106367340 >>106367375 >>106368320
>>106367304
Why would you need a reasoner?
Anonymous No.106367340
>>106367326
Reasoners do creative writing better, RP better and code better.
Anonymous No.106367360 >>106367367
I find it funny when reasoning models realizes that they're wrong when writing the final output
Anonymous No.106367367
>>106367360
me at exams
Anonymous No.106367375
>>106367326
Reasoners benchmaxx better
Anonymous No.106367381 >>106367453
Under 70b bros.

What are you using to goon? I've been messing around with Gemma 3 (27b). I have no idea how, but a few months back I remember it being awful. But now (maybe my card is better, prompts are better, who knows) it seems ridiculously good. Like, actually impressive levels for ERP (good at remembering my card's details, knowing what I want the card to achieve, etc.).

Makes me wanna reconsider Qwen models now (also used to be garbage). I tried out that Qwen 30b MoE instruct one but it was just too inconsistent in quality (speed was 10/10 though)
Anonymous No.106367392 >>106367402
>>106367125
That's three documents.
Anonymous No.106367399 >>106367441
>>106367157
Consist of multiple documents.
Anonymous No.106367400 >>106367421 >>106367463 >>106369024
fuck reasoning
and fuck all the retards who think there's actual reasoning under the hood
absolutely retarded models
Anonymous No.106367402
>>106367392
Document != context
Say I want to research something and the LLM decides to use search; it could return 50 search results (50 documents), but they all have to be in the context
Anonymous No.106367419
>>106364639 (OP)
>the scent of ozone is suddenly chocked out by the sharp tang of ozone
models are devolving.... it's over...
Anonymous No.106367421 >>106367448
>>106367400
I'd probably fail that too. Why does that dog have an extra leg?
Anonymous No.106367441
>>106367399
Documents related to each other. Without one, the others don't function; therefore they should be combined into one context.
Anonymous No.106367448 >>106367481 >>106367626
>>106367421
https://www.bbc.com/news/uk-wales-68017390
and no, you wouldn't "probably fail that"; if you saw that dog you would notice something is fucking wrong with it
the point of this sort of image serving as a test, though, is that because that sorta shit isn't benchmaxxed yet, you can see the true nature of LLMs as retarded pattern matchers in action, and no amount of minutes spent in muh reasoning tokens can fix it
LLMs are absolutely incapable of reasoning. it's all a simulation that works when you test them on something that was benchmaxxed (like all the math problems that are more than 50% of the datasets used to train modern LLMs); when you're trying to make them reason about something they weren't benchmaxxed on.. you see the reasoning for what it truly is
BULLSHIT
Anonymous No.106367453
>>106367381
unironically rocinante
Anonymous No.106367463 >>106367479
>>106367400
Fuck vision in general, can't OCR, can't count, doesn't know any people/characters, and is cucked more than text
Anonymous No.106367464 >>106367550
loli footjobs....
Anonymous No.106367479 >>106367507
>>106367463
Oldschool vision is still good
https://github.com/roboflow/supervision
Anonymous No.106367481 >>106367494
>>106367448
If you give humans the same question, a decent fraction of them will give similar answers. "He told me to think hard about it, so it can't be the obvious answer."
Anonymous No.106367494 >>106367534
>>106367481
Didn't it occur to you that the "think hard" part was only added precisely because the model got it wrong in the previous round?
Anonymous No.106367507
>>106367479
Isn't that an ultralytics wrapper which is already a yolo wrapper?
Anonymous No.106367510
le sigh
they can't do it, no matter how you phrase it
Anonymous No.106367528
btw GPT-5 counts it right but thinks it's an illusion
I 100% believe the reasoning is at fault, not the vision part. LLMs can't reason about things they weren't benchmaxxed to reason about.
Anonymous No.106367534 >>106367549
>>106367494
My default assumption with any partial images is that the missing part runs counter to whatever the poster wants to say.
Not a perfect heuristic, as this case shows, but it tends to be accurate.
Anonymous No.106367549
>>106367534
>the missing part runs counter to whatever the poster wants to say.
I linked to the article that has the original picture. Any nigger can go and test it for themselves if they think I'm bullshitting the prompts. The unbelievable thing is how much faith people put in these bullshit generators. In a just world scam altman would get his comeuppance for the mountain of lies he spouted to scam investments.
Anonymous No.106367550
>>106367464
foot lolipops
Anonymous No.106367557 >>106367595
mr cunn will save us right after he defeats wang in a battle to the death for zucks gpus
Anonymous No.106367573
>>106364785
There is no use. Don't even know why they are released. Why are you here? You like posting proprietary/IP/personal data to someone else that you do not know? You like someone else fidgeting with your settings and not even revealing what they did? Your model + your info = your control. Not your server, not your data.
Anonymous No.106367586
>>106364945
Comput3 has B200s now.
Anonymous No.106367595 >>106367609 >>106367636
>>106367557
>mr cunn
>wang
>zuck
Anonymous No.106367605
>>106365617
Tell it to be brief, simple as.
>If you don't have 10 custom GPT/json files, ngmi lads.
Anonymous No.106367609 >>106367614
>>106367595
Stop doxxing meta GenAI employees
Anonymous No.106367614
>>106367609
You can't dox public figures
Anonymous No.106367626
>>106367448
Yeah, if you saw it in real life. Not just a random picture.
LLMs won't have the kind of human level reasoning needed to truly hit AGI until they start getting slapped into robots with pseudo-senses to experience the real world.
Anonymous No.106367636
>>106367595
Dam Son...
Anonymous No.106367644 >>106367738
>>106366518
Imagine if we ever got a Sonnet 3.5 leak. Man. Of course that would never happen.
Anonymous No.106367738
>>106367644
Sonnet was shit until 3.7 though
Anonymous No.106367777
claude is overrated
I'd rather a Gemini leak
Anonymous No.106367802
dense :<
moe :3
Anonymous No.106367860 >>106367865 >>106367886 >>106367931
Why is Gemma 27b so much slower on my 4090 than other models of similar size (Mistral Small, Qwen, even the old grandpappy Command R)?

I really don't get it
Anonymous No.106367865 >>106367884 >>106368340
>>106367860
someone explained it in the last thread: part of the model has to be run on the cpu because there's no cuda kernel for it
Anonymous No.106367884 >>106368031
>>106367865
oh shit really? Has nobody found a fix? The model is super good besides that
Anonymous No.106367886 >>106367894 >>106368318
>>106367860
In addition to what the other poster said about CUDA support, Gemma also uses a SHITLOAD of memory for context, make sure that it isn't overflowing into your RAM.
Anonymous No.106367894 >>106367918
>>106367886
MAX ram usage peaked out at 11GB (I have 32)
Anonymous No.106367918 >>106367941
>>106367894
Anon, the model alone is about 11GB at Q3. Is it filling up your entire 24GB of VRAM and putting another 11GB in system RAM? If so, that's why it's slow. Any dense model spilling into system ram will have a huge speed penalty.
Anonymous No.106367931 >>106367955
>>106367860
Gemma is a bit slow because it's thicc...
Anonymous No.106367941
>>106367918
VRAM usage didn't cross over 24GB when I fully offloaded it (at Q4 as well) with 14k context. I'm using GLM 32B right now (same size, Q4, also at 16k context now) and I'm getting around 35 t/s
Anonymous No.106367955 >>106368013
>>106367931
She does enjoy her curries
Anonymous No.106368013 >>106368029
>>106367955
I LOVE INDIA!!!!
Anonymous No.106368029
>>106368013
A country so great its biggest export is its own people
Anonymous No.106368031
>>106367884
vibe code a fix anon
Anonymous No.106368116 >>106368141 >>106368262
I found out that Apple did a Mixtral 8x7B finetune earlier this year for a research project

>Recent advances in large language models have demonstrated impressive capabilities in task-oriented applications, yet building emotionally intelligent chatbots that can engage in natural, strategic conversations remains a challenge. We present a novel approach called SAGE that uses latent variables to control long-horizon behavior in dialogue generation. At the core of our method is the State-Action Chain (SAC), which augments standard language model fine-tuning by introducing latent variables that encapsulate emotional states and conversational strategies between dialogue turns. During inference, these variables are generated before each response, enabling coarse-grained control over dialogue progression while maintaining natural interaction patterns. We also introduce a self-improvement pipeline that leverages dialogue tree search, LLM-based reward modeling, and targeted fine-tuning to optimize conversational trajectories. Our experimental results show that models trained with this approach demonstrate improved performance in emotional intelligence metrics while maintaining strong capabilities on LLM benchmarks. The discrete nature of our latent variables facilitates search-based strategies and provides a foundation for future applications of reinforcement learning to dialogue systems, where learning can occur at the state level rather than the token level.
https://arxiv.org/abs/2503.03040
apple/sage-ft-mixtral-8x7b

I only tried the Q2 which is naturally dumb and crazy, but the writing style has a flair unlike anything else. It's worth further testing.
Anonymous No.106368141 >>106368177
>>106368116
>but the writing style has a flair unlike anything else
Just means it has a different slop profile. Mixtrals are incredibly dumb by modern standards, may as well use Nemo.
Anonymous No.106368172 >>106368180 >>106368185 >>106368191 >>106368225
How can I make prompt processing faster on a CPU only server without upgrading hardware?
Anonymous No.106368177
>>106368141
Maybe our local finetuners can make use of Apple's research to improve their tune.
Anonymous No.106368179
>>106365699
glm air at a low quant mogs both
I can't go back to sub 70b models now
Anonymous No.106368180
>>106368172
call api on cloud
Anonymous No.106368185 >>106368221
>>106368172
>prompt processing
>on a CPU
oof
go dig up a 1060 out of the trash or something, it's still going to be faster
Anonymous No.106368191
>>106368172
ngram speculative decoding might help
Anonymous No.106368200
How big does a GPU for prompt processing need to be?
Does it need to be able to hold the model?
Anonymous No.106368221 >>106368233
>>106368185
not them but i do have a 1070ti how much am i gonna get from that
Anonymous No.106368225
>>106368172
Prefix caching on vllm
Anonymous No.106368233
>>106368221
about tree fiddy
Anonymous No.106368262 >>106368519
>>106368116
Is it safe?
Anonymous No.106368275 >>106368305
Mistral Small finetune tier list

Worth using
>TDE
>Magnum Diamond

Decent
>Any of Gryphs ones (Pantheon dude)
>Cydonia (I find them the worst out of the commonly used ones, but people must like em for a reason so I'll mention them. They're just unremarkable to me and seem to handle the things that Drummer proudly highlights about his models worse than the others)

Avoid
>Any Drummer shit
>That Broken Tutu garbage
Anonymous No.106368301 >>106368313 >>106368322 >>106368331 >>106368344
Mistral Small finetune tier list

Worth using
>

Decent
>

Avoid
>Mistral Small finetunes
Anonymous No.106368305
>>106368275
>Any Drummer shit that's not his basic Cydonia model (I tried the R1 Cydonia and it sucked)
Anonymous No.106368313 >>106368341
>>106368301
Name a better small local model for ERP

>inb4 reasoning models
>inb4 Qwen or QWQ

The only one that outright beats it in ERP is Gemma, and those models have far more issues that outweigh the positives.
Anonymous No.106368318 >>106368340 >>106368350
>>106367886
>Gemma also uses a SHITLOAD of memory for context
1/ with iSWA turned on, it uses less than any other model (maybe gpt-oss uses less? it's the only other model I've seen with iSWA and it did use very little for context)
llama.cpp should have that by default (--swa-full is a turn-the-feature-off flag)
2/ of the Gemma 3 family, ironically, 27B uses less VRAM for its context than 12B (attention head size differences), so having a GPU that can fit 27b feels like getting the better deal over running the smaller gemma
Anonymous No.106368320 >>106368465
>>106367326
reasoning can be good for rp consistency and more realistic responses (outside of erp though)
Anonymous No.106368322 >>106368341
>>106368301
Fuck off
Anonymous No.106368331
>>106368301
trvke
Anonymous No.106368339
>>106365635
>proprietary
mistral y u do this
Anonymous No.106368340
>>106368318
this makes more sense to me >>106367865

It's just painfully slow on kobold at least
Anonymous No.106368341 >>106368381
>>106368313
I'm not saying small is bad, I'm saying that the finetunes are garbage. At the best of times they're a placebo and at other times they just cut the model's intelligence in half. It takes very little effort to make it write a sex scene, or any depraved shit you can think of. It doesn't need a finetune.
>>106368322
*smooch*
Anonymous No.106368344
>>106368301
Mistral models worth using:
Anonymous No.106368350 >>106368372
>>106368318
Fucking with SWA means no contextshift, so fuck that.
Anonymous No.106368366 >>106368508
>>106365635
redpill me on medium 2508.

How big's it gonna be (around the 30B range, or 70B, seeing as Large is like 120B)?

I swear Mistral is the only thing keeping achievable LLMs alive at this moment. Now everyone jerks off over 3 t/s q2 MoE garbage that they still need a server to run
Anonymous No.106368372 >>106368395
>>106368350
>means no contextshift
why would you want this cancer? I turn it off
if you run out of context summarize and start anew, or delete some messages in the history
Anonymous No.106368381 >>106368407
>>106368341
See, I don't get it when people say this. I've tried them all, non-finetunes too, and there's a clear difference to me and many others, which is why people even waste time on them.

Such a weird thing to be contrarian over when we can all download them for free and see for ourselves and literally compare outputs between the models.
Anonymous No.106368385 >>106368407
what does it mean when you swipe multiple times and get virtually the exact same fucking response? What samplers do I change to fix this?
Anonymous No.106368395
>>106368372
>why would you want this cancer?
So I don't have to summarize all the fucking time and have small details constantly being lost.
Anonymous No.106368407 >>106368453
>>106368381
Yeah, I've done just that and there's basically none that resulted in a better experience.
>>106368385
Assuming your other sliders are mostly neutral, Temperature
But as you increase it the model becomes gradually more retarded
Anonymous No.106368453 >>106368462 >>106368471 >>106368491
>>106368407
>increasing temperature makes the model more retarded
So these are the geniuses behind the brigade against finetunes.
Anonymous No.106368462 >>106368565
>>106368453
>no argument
as usual
Anonymous No.106368465
>>106368320
It's not. Just compare V3 to R1. R1 loses the plot fast when it's time for some spatial reasoning.
Anonymous No.106368471 >>106368485
>>106368453
>finetrooner doesn't know how the most basic sampler works
Anonymous No.106368485 >>106368498 >>106368505
>>106368471
It adds creativity.
Anonymous No.106368491
>>106368453
>increasing temperature doesn't make the model more retarded
so these are the geniuses defending sloptunes
Anonymous No.106368498 >>106369231
>>106368485
it adds retardation
there's a reason why all recommendations from all model trainers is to use a lower temperature for coding
Anonymous No.106368501 >>106368517
God people have become so fucking awful.
People deserve the inevitable pajeetification of the world.
Anonymous No.106368505
>>106368485
lol
Anonymous No.106368508 >>106368693
>>106368366
Medium is almost certainly another MoE in the 100-200B range
They will also never release it, unless another Miqu happens
Anonymous No.106368515 >>106368559
I did it yay
Anonymous No.106368517
>>106368501
>doesn't quote the posts he thinks are stupid because he's afraid of being called out
coward
Anonymous No.106368519
>>106368262
From its model card
>Not suitable for high-stakes, safety-critical deployment without further evaluation

Also the paper does not say they did anything for safety
Anonymous No.106368538 >>106368547 >>106368548 >>106368571
I just realized fuckhuge MoE's have another huge advantage - they reduce the popularity of finetroons and the population of finetrooners.
Anonymous No.106368547
>>106368538
That's why they keep tuning the same old mistral and qwen models over and over again.
Anonymous No.106368548
>>106368538
extremely based indeed
Anonymous No.106368559 >>106368643
>>106368515
Good job
Anonymous No.106368565 >>106368593 >>106368598
>>106368462
You never presented one.

But go on.

Explain what temperature even does and then directly point the link to "retardation".

Your word, not mine. I'll wait and I will keep reminding you of your post when the goalpost shift inevitably comes. Here's your initial comment:

>Assuming your other sliders are mostly neutral, Temperature
>But as you increase it the model becomes gradually more retarded

You guys have no fucking idea how models work yet you speak with so much authority on them, it's laughable.

>*stubs out his high temperature cigar ashes on your 5ft2 frame*
Anonymous No.106368571 >>106368583 >>106368678
>>106368538
That doesn't make sense. The only people that use finetunes are people that can't run huge models and hope the tune makes the small model better.
Anonymous No.106368583
>>106368571
>listening to people who edge for a week while their 0.25 t/s Q2 MoE generates moon runes for them
lmao
Anonymous No.106368591 >>106368606
>>106366515
Are you using base llama.cpp?
Last time I checked, you need ik_llama for GLM.
https://github.com/ikawrakow/ik_llama.cpp

Can't help you with llama args because I've always used kobold.cpp. I use a fork of kobold.cpp which includes ik_llama's optimizations and model support:
https://github.com/Nexesenex/croco.cpp

The relevant settings in the GUI:
GPU Layers: 999 (just a number high enough to cover every layer)
Use contextshift, fastforwarding, flashattention, quantize KV cache: 1 - q8_0
Override tensors: (this is the key)
.ffn_.*_exps.=CPU
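If you'd rather stay on plain llama.cpp, the rough equivalent of those settings is something like this (a sketch; flag spellings drift between releases, so check llama-server --help, and the model path and context size are placeholders):

./llama-server -m GLM-4.5-Air-Q3_K_S.gguf -ngl 999 -c 16384 -fa -ctk q8_0 -ctv q8_0 -ot ".ffn_.*_exps.=CPU"

The -ot/--override-tensor regex is the important part: it keeps the routed expert tensors in system RAM while everything else goes to the GPU.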
Anonymous No.106368593
>>106368565
>Explain what temperature even does
it controls how random the next token choice is (choice which get filtered by other samplers first like top_k, top_p, min_p and whatever UnIQuE snake oil troonery du jour you prefer)
high temperature makes the less likely picks more likely to happen, and since LLMs are pattern matchers without any real intelligence, the less likely pick can be something incredibly stupid
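To make it concrete, a toy sketch of what the temperature knob does to the token distribution (pure Python, made-up logits for three candidate tokens):

import math

def token_probs(logits, temperature):
    # divide logits by T before the softmax; T > 1 flattens the distribution
    # (unlikely tokens gain probability), T < 1 sharpens it
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 0.0]            # toy logits
print(token_probs(logits, 0.7))     # peaked: the top token dominates
print(token_probs(logits, 1.0))     # the model's native distribution
print(token_probs(logits, 2.0))     # flat: the dumb picks get real probability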
Anonymous No.106368598 >>106368711
>>106368565
Ask your temp=5 model to educate you, I don't care if you stay retarded.
Anonymous No.106368606
>>106368591
>Last time I checked, you need ik_llama for GLM.
That was never the case. In fact, GLM ran like shit on ik_ for the longest time because they had something about their attention fucked up that made GLM run horribly.
Anonymous No.106368625 >>106368635
ik_llama.cpp bros we fucking won
Anonymous No.106368631 >>106368653
ik is a sad meme that is diverging more and more from mainline llama.cpp which in turn will make it more difficult to merge new model and feature support in
this drama whore needs to go back to obscurity where it belongs
Anonymous No.106368635
>>106368625
That's actually pretty dope.
Anonymous No.106368643 >>106368658
>>106368559
I'm tired. I must remember to never argue with "AI". Bullshit technology.
Anonymous No.106368647 >>106368664 >>106368665 >>106368800
Can a 24GB VRAM + 32GB RAM (DDR5) run GLM Air.

Kinda wanna try it out
Anonymous No.106368649 >>106368659
God I can't wait until the Israel/Iran war resumes.
Anonymous No.106368653
>>106368631
maybe once llama.cpp doesn't have shit prompt processing speed on deepseek and other mla models
Anonymous No.106368658
>>106368643
scam altman and dario really convinced much of the world that these things are something they aren't and will never be
the average tech bro is so buckbroken by this shit it's unreal
LLMs are like a religion
Anonymous No.106368659
>>106368649
What does that have to do with LLMs
Anonymous No.106368664 >>106368695
>>106368647
>32gb ram
no, non-retarded quants (q4 and up) only fit on 64gb
Anonymous No.106368665 >>106368675 >>106368695 >>106368735
>>106368647
Q3KM is less than 50gb, so yeah I guess.
It'll be really fucking fast too.
Anonymous No.106368675 >>106368682
>>106368665
>It'll be really fucking fast too.
Not with half of it offloaded to dual channel ram
Anonymous No.106368677
is there a perplexity chart for GLM air, including the trannygram goofs?
Anonymous No.106368678
>>106368571
I would think that when you see the cancerummer release mistral small shittune number 49 you either give up on hobby or buy more ram and run 235B Q2 (it is objectively better than everything drummer shat out). But even in the scenario you presented drummer has a huge problem if he doesn't get any new models to slap his label on. That problem is that people start to notice his grift is worthless. Even the most retarded people will realize finetrooning does nothing after they use like 4-5 finetroons of the same model. The only reason drummer exists is he gets to slap his label on new models that get released and people confuse the base model being different with what drummer did. Btw he should die in a fire.
Anonymous No.106368682 >>106368728
>>106368675
It's a MoE with not that many activated parameters. And in his case almost half of the experts will be in vram.
It'll be fast as fuck.
Anonymous No.106368693
>>106368508
another MoE in the 70b-100b range instead of more small 20b slop or fuckhuge monster models requiring a whole server to run would be nice
too bad nobody cares about the medium tier anymore
Anonymous No.106368695 >>106368710 >>106368751
>>106368664
Would a Q3 be more retarded than the models I usually use (mistral small, nemo)?
>>106368665
Q3KM is 56 GB (my max memory is 56)
Anonymous No.106368710 >>106368721
>>106368695
Q3KS is between 48 and 49 gbs IIRC.
And the last layer of the GGUF is ignored, so that's a couple hundred MBs that don't get loaded into memory.
Anonymous No.106368711 >>106368729 >>106368736
>>106368598
I did it for the keks (topk=20 to keep it somewhat on the rails): https://rentry.org/dpavrps2
enlightening
Anonymous No.106368721
>>106368710
yea that's the one I figured you meant (using that now).

It's 48GB.

Well, i'll test. Really curious to see how it runs (if it runs).
Anonymous No.106368728 >>106368748
>>106368682
"as fuck" seems to translate to approximately 10 t/s or so for moetards
Anonymous No.106368729
>>106368711
High temp + low topk is the way of kings.
Anonymous No.106368735
>>106368665
>really fucking fast
As soon as you reach a decent context size (say 16k) your fast as fuck turns into 1-3tk/s which is excrutiating.
I really do think anyone that tells people to use these big moes either has the patience of a saint or doesn't actually use them.
Anonymous No.106368736
>>106368711
>https://rentry.org/dpavrps2
sounds about as coherent as the finetrooner high temperature enjoyer we were arguing with
maybe his brain is high temperature too
Anonymous No.106368748
>>106368728
That's faster than typical reading speed.
Anonymous No.106368751 >>106368763
>>106368695
consider that you also need VRAM for the context bruh
Anonymous No.106368763
>>106368751
Yea I know, I'm more just curious to try it; if it doesn't work then it's back to my good old sloptunes if need be
Anonymous No.106368765
I used to think top_p (or actually min_p but I got over this meme) at 5% or more is a lot. But if you think about it that means it only works for 1 out of 20 tokens at most.
Anonymous No.106368770 >>106368780 >>106368802 >>106368814
My favorite way to play with OSS models is to fire up a high powered cloud instance for a couple hours and it costs like $10 total. It's cheaper than spending $100k on the machine I'm using, and it's cheaper than spending tokens on model providers.

Is this basic linux system administration too far beyond many people? Does knowing basic ssh make you skilled in 2025?
Anonymous No.106368780
>>106368770
Are you dumb? Just pay for the official API
Anonymous No.106368800 >>106368951
>>106368647
I have the same amount of memory and I can fit iq3_xs on it. Even at that quant it still beats everything else I've used. Maybe some 70Bs come close but I haven't tried enough of them yet.
Anonymous No.106368802 >>106368824
>>106368770
>a couple hours
>$10
>cheap
What? That's like 10 times more expensive than straight up paying for fp8 K2. You're running it on somebody else's computer anyway.
Anonymous No.106368814 >>106368849
>>106368770
>and it's cheaper than spending tokens on model providers.
You might want to check those prices again.
Anonymous No.106368824
>>106368802
>You're running it on somebody else's computer anyway.
Exactly.
At that point, just pay for deepseek's API.
Anonymous No.106368849 >>106368857 >>106368859 >>106368889
>>106368814
What happens when OSS models are good enough for every use, so paying for APIs becomes worthless, and the AI infra buildout has made GPU compute time worth zero because there's too much excess capacity? How does anybody make money in AI after that?
Anonymous No.106368857
>>106368849
As long as "bigger is better" holds true, that will not happen. We have "good enough" compared to 2 years ago, but we still want better
Anonymous No.106368859
>>106368849
Burn GPUs until prices return to profitable levels again
It's what they do with dairy products
Anonymous No.106368863 >>106368891
I tried running Qwen3-30B-A3B-Thinking-2507_Q8 on ollama and llama.cpp (with CLI for minimal overhead) to test which is faster and ollama is 3-4 times faster than llama.cpp, which is unexpected given I've heard llama.cpp is either just as fast or faster. Any ideas? I tried playing with settings a little (number of threads, number of GPU layers, context length, etc.) and I can never get llama.cpp to be as fast as ollama. I'm not sure what I have to tweak exactly. If llama.cpp is faster in theory, I'd like to switch to it, but it's clearly not the case here.
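For anyone else hitting this: the two usual culprits for a gap that big are a llama.cpp build without GPU support, or layers not being offloaded (ollama picks a layer split automatically, while llama-cli/llama-server only offload what you pass via -ngl). A minimal sketch of the same check through the llama-cpp-python bindings, with a placeholder model path:

from llama_cpp import Llama  # assumes a GPU-enabled build of llama-cpp-python

llm = Llama(
    model_path="Qwen3-30B-A3B-Thinking-2507-Q8_0.gguf",  # placeholder path
    n_gpu_layers=-1,   # -1 = offload as many layers as possible
    n_ctx=8192,
)
out = llm("Write a haiku about llamas.", max_tokens=32)
print(out["choices"][0]["text"])

If the tokens/s here roughly matches ollama, the difference was offload settings rather than the engine itself.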
Anonymous No.106368889 >>106368942 >>106369543 >>106369560
>>106368849
Your entire premise is retarded.
Only a small number of hobbyists and researchers use OSS models to begin with. And almost nobody is running models like K2 locally. Average people aren't going to stop paying for Claude and Gemini APIs en masse because they can spend hours in the terminal to get a brain-damaged version running slowly with an OSS model.
Even if API usage declines due to an AI bubble pop, excess compute is the opposite of worthless. That's AWS and Azure's entire business model.
Anonymous No.106368891 >>106369140
>>106368863
Keep using ollama until you decide you no longer want to be a retard when it comes to this hobby.
Anonymous No.106368907 >>106368948
re: high temperature discussion, it reminds me of one of the more interesting model capabilities I have ever seen where llama 405b could deliberately output a completely random token sequence like you would see when using a really high temp (including those placeholder control tokens which it had realistically never seen before, like it was actually randomly sampling) and then at will pull itself out of it and regain perfect coherence. it could do this at perfectly normal temps
I think I have logs of this somewhere but I saw someone vouch for it on xitter before too. it's possible to push other models into random token mode but I've never seen one able to slip so easily in and out of it at will.
Anonymous No.106368942 >>106369018 >>106369250
>>106368889
There are vast industries that AI can't touch yet because they can't exfiltrate data to third parties.
Lawyers and doctors with client and patient info come to mind.
On prem AI solutions are the future, there's a massive market for making local models good.
Would you trust Sam Twinkman with your medical info?
Anonymous No.106368948
>>106368907
Sounds like sentience.
Anonymous No.106368951 >>106369109
>>106368800
When you say exact memory, you mean both VRAM and RAM right? What t/s you getting and at what context? How long can your chats go before the memory becomes an issue (i.e you need to summarize and flush the chat)

Cheers
Anonymous No.106369018
>>106368942
Or the AI companies will successfully lobby for changes to the relevant laws so they can vacuum up that data, too.
Anonymous No.106369024
>>106367400
>machine
>optical illusion
????????
Anonymous No.106369109 >>106369245 >>106370067
>>106368951
Just try it out, anon. Most people in this thread ask a model "how many 'r's in Strawberry", laugh at the result and then go back to Openrouter or an API.
As someone with 24gb vram+64gb ram I don't think it's usable beyond a few k tokens.
Anonymous No.106369140
>>106368891
Everyone's gotta learn somewhere
Anonymous No.106369231 >>106369436
>>106368498
I hope that's bait. You can't possibly be that retarded
Anonymous No.106369245 >>106369293
>>106369109
> I don't think it's usable beyond a few k tokens.
What the hell does that mean lmao
Anonymous No.106369250
>>106368942
There was a video about some doctor using chatGPT to write summaries from patient files...
That was a while ago.
Anonymous No.106369275 >>106369344
>>106364866
which model do you use for that anon?
Anonymous No.106369293 >>106370067
>>106369245
Context, anon. A few thousand tokens worth of filled up context.
Anonymous No.106369302 >>106369347
is the lazy getting started guide in OP still relevant? it's from 2024
Anonymous No.106369344
>>106369275
Mistral.
It's the most knowledgeable model when it comes to Western media.
And if it's lacking anything, you can always feed it a page or 2 with all the details and information
Anonymous No.106369347 >>106369357 >>106369574
>>106369302
Yes nothing changed
Anonymous No.106369357
>>106369347
thank you
Anonymous No.106369364
>>106366501
i miss /bcsg/
Anonymous No.106369436
>>106369231
I can never be as retarded as the high temperature enjoyer
Anonymous No.106369484 >>106369533 >>106369557
When doing math or coding, is there any reason why a sampler that always picks the top option wouldn't be ideal?
Anonymous No.106369533
>>106369484
Sometimes it's wrong there and right on a subsequent try. Just had this happen today.
Anonymous No.106369543
>>106368889
I suspect that OS models will make things difficult for the big Western AI companies short term, but that's more in the sense of other services being able to use those models to offer their services for a lot cheaper than the big boys. Hell, OpenAI and Google have already started doing it, and Anthropic is probably going to have to sooner rather than later
I think there is going to be a point where hardware advances to the point API services becomes antiquated and everyone can and does run everything locally, but that's true endgame for after we've hit the wall, and probably not within the next 10 years at least
Anonymous No.106369557 >>106369719
>>106369484
In theory that would be the correct approach for anything that needs 100% accuracy, assuming the training is on par, but in practice it degrades the model's output
>https://arxiv.org/html/2407.10457v1
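For reference, "always picks the top option" is just argmax over the next-token distribution instead of drawing from it; a toy sketch with made-up logits:

import math, random

logits = [2.0, 1.5, 0.3, -1.0]            # made-up next-token scores
m = max(logits)
probs = [math.exp(x - m) for x in logits]
total = sum(probs)
probs = [p / total for p in probs]

greedy = probs.index(max(probs))                                   # deterministic: always token 0 here
sampled = random.choices(range(len(probs)), weights=probs, k=1)[0] # varies run to run
print(greedy, sampled)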
Anonymous No.106369560 >>106369577
>>106368889
I suspect that OS models will make things difficult for the big Western AI companies short term, but that's more in the sense of other services being able to use those models to offer their services for a lot cheaper than the big boys. Hell, OpenAI and Google have already started cutting their costs to their bone, and Anthropic is probably going to have to sooner rather than later
I think there is going to be a point where hardware advances to the point API services becomes antiquated and everyone can and does run everything locally, but that's true endgame for after we've hit the wall, and probably not within the next 10 years at least
Anonymous No.106369574 >>106369592
>>106369347
one more question, can I still use nemo 12B with an AMD card or do I have to use something else?
Anonymous No.106369577 >>106369664
>>106369560
That's what you hope the endgame will be. The trend I see is towards more always-online thin clients and everything as a subscription service.
Anonymous No.106369592 >>106369610
>>106369574
or if anyone could suggest a similar model which would work with AMD? It's 16gb
Anonymous No.106369610 >>106369617
>>106369592
Why would you think you can't use nemo on an amd card?
Anonymous No.106369617 >>106369653
>>106369610
I'm low IQ and read Nvidia in the name but I just realized it's because they made it. Please excuse me I am stupid. Now I'm trying to figure out how to download the model from HuggingFace. Pay me no mind.
Anonymous No.106369653 >>106369665
>>106369617
This just triggers my feeding fetish. Also if you're just looking to coom, maybe try a finetune instead of just vanilla nemo
Anonymous No.106369664
>>106369577
I do, just because there are too many reasons to think the "one centralized provider" approach isn't going to work out in the long run. It's already debatable how much these companies actually make, and the network strain as more and more applications become AI-focused is unimaginable
I think there could be AI focused applications that are API based too (like now) but those are going to be the products, not the raw materials
Anonymous No.106369665 >>106369696 >>106369718
>>106369653
Yes that is precisely my intention. Any suggestions? And would you like to.. feed me directions towards how I could "download the gguf" I would love you long time indeed.
Anonymous No.106369696 >>106369728
>>106369665
https://huggingface.co/bartowski/TheDrummer_Cydonia-24B-v4.1-GGUF/resolve/main/TheDrummer_Cydonia-24B-v4.1-Q4_K_M.gguf

Is a pretty popular one. But Q4_K_M plus context may be too big for your 16 GB card if you're running entirely off it. Replace Q4_K_M with IQ4_XS in the URL if that's the case.

I don't really use the small models, so I don't know what's good. If you can, try running a medium-sized MoE instead. It'll be slower depending on your hardware, but I think 10 tokens/s is achievable on most people's computers, and that's good enough for that use case.
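If grabbing it through a browser is annoying, the huggingface_hub package can pull it too (repo and filename taken straight from the URL above):

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bartowski/TheDrummer_Cydonia-24B-v4.1-GGUF",
    filename="TheDrummer_Cydonia-24B-v4.1-Q4_K_M.gguf",
)
print(path)  # local cache path to point kobold/llama.cpp at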
Anonymous No.106369718 >>106369728
>>106369665
https://huggingface.co/bartowski/Rocinante-12B-v1.1-GGUF/resolve/main/Rocinante-12B-v1.1-Q8_0.gguf

And that's a nemo based finetune.
Anonymous No.106369719 >>106369762
>>106369557
Did you even read the paper?
>In summary: 1) Greedy decoding generally proves more effective for most tasks. 2) In the case of AlpacaEval, which comprises relatively simpler open-ended creative tasks, sampling tends to generate better responses.
Anonymous No.106369728 >>106369744
>>106369696
>>106369718
<3 thank you oldGOD
I guess I'll try IQ4_XS first.
Is that last one better?
Anonymous No.106369744
>>106369728
The first one is a bit better but will be slower on your machine
Anonymous No.106369762
>>106369719
they have high temperature brains
we must forgive them
Anonymous No.106369768 >>106369848
>it's 2075
>Sam Altman asks investors for another trillion dollars because AGI is two weeks away
>GPU poorfags are still rping on Nemo because nothing better has been developed so far for smaller models
>GPT-43-omni writes a bestseller book, only two shivers down the spine per page
the future is bright
Anonymous No.106369773 >>106369781 >>106369786
why the fuck is kobold trying to connect to the internet?
Anonymous No.106369781
>>106369773
tool calling
Anonymous No.106369786 >>106369791
>>106369773
to upload your logs to the archive we all read
Anonymous No.106369791 >>106369816
>>106369786
STOP
Anonymous No.106369816
>>106369791
>ah ah mistress
anon...
Anonymous No.106369820 >>106369862
ok so I have it loaded in kobold, how do I plug it into sillytavern? you guys are so cool btw, I bet you guys are awesome irl
Anonymous No.106369823 >>106369833 >>106369834
Is there a use case for 2-bit and 1-bit quants?
It was pretty interesting to see a 27B model run on my older laptop with 8 GB RAM and 4 GB VRAM at 0.1 t/s
Anonymous No.106369833 >>106369888
>>106369823
2bit DS is still usable for RP and maybe translations.
Anonymous No.106369834
>>106369823
Bragging on /lmg/
Anonymous No.106369848
>>106369768
The catch - it's not GPT-43, it's GPT-4.3
Reception to 5 was so bad that they never went above 4.9 again
Anonymous No.106369849 >>106369872
bitch who's this adam guy you kept mentioning
Anonymous No.106369861
>>106369841
>>106369841
>>106369841
Anonymous No.106369862 >>106369879
>>106369820
Anonymous No.106369872
>>106369849
Adam Wang
Anonymous No.106369879 >>106369898
>>106369862
connrefused?!?!
love u tho
ill ask grok too
Anonymous No.106369888
>>106369833
>translations
Definitely not. If you just want the gist of things it's okay, I guess.
Anonymous No.106369898 >>106369942
>>106369879
Change llama.cpp to kobold bro
Anonymous No.106369942 >>106369947
>>106369898
ya I did, port was wrong too
we are so back !!!!
Anonymous No.106369947 >>106369957
>>106369942
>we
How many of you are behind that keyboard?
Anonymous No.106369957 >>106369974 >>106370038
IT WORKS
but...
>>106369947
i'm the fed??? you guys tell me to use this model and I'M the fed???
Anonymous No.106369974 >>106369985
>>106369957
Keep in mind most finetuners don't bother with unaligning assistant-related tasks. You'll want to do it in a rp-style way.
Anonymous No.106369985 >>106370005 >>106370151
>>106369974
I have to roleplay with my graphics card to convince it to say nigger?
Anonymous No.106370005
>>106369985
Foreplay is important.
Anonymous No.106370016
this IQ4 XS model runs pretty speedy on my 7900 GRE, any recommendations for something more lewd and less sterile?
Anonymous No.106370038 >>106370047
>>106369957
You don't want something like this, do you anon?
Anonymous No.106370047
>>106370038
plz anon
Anonymous No.106370067
>>106369293
>>106369109
I've loaded it up.

How much should I offload to the GPU and how much to the CPU in kobold?
Anonymous No.106370151
>>106369985
All AI use is roleplaying. Online models are just roleplaying as a really boring character called an AI assistant.
Anonymous No.106370898
>>106366567
What's the point in updating something that works?