
Thread 106338913

389 posts 106 images /g/
Anonymous No.106338913 >>106339069 >>106343275 >>106343609
/lmg/ - Local Models General
/lmg/ - a general dedicated to the discussion and development of local language models.


Previous threads: >>106335536 & >>106328686

►News
>(08/20) ByteDance releases Seed-OSS-36B models: https://github.com/ByteDance-Seed/seed-oss
>(08/19) DeepSeek-V3.1-Base released: https://hf.co/deepseek-ai/DeepSeek-V3.1-Base
>(08/18) Nemotron Nano 2 released: https://research.nvidia.com/labs/adlr/NVIDIA-Nemotron-Nano-2
>(08/15) Ovis2.5 MLLMs released: https://huggingface.co/collections/AIDC-AI/ovis25-689ec1474633b2aab8809335
>(08/14) Canary-1B v2 ASR released: https://hf.co/nvidia/canary-1b-v2

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
Anonymous No.106338945
sex with migu (the poster)
Anonymous No.106338948 >>106338981 >>106339033 >>106339506 >>106341185
►Recent Highlights from the Previous Thread: >>106335536

--Optimizing GLM-4.5 MoE inference speed via quant and offload tuning in llama.cpp:
>106335633 >106335669 >106335686 >106335702 >106335719 >106335721 >106335704 >106335823 >106336163 >106336177 >106336221 >106336229 >106336236 >106336398
--dots.ocr preprocessing essential for accurate document understanding in local models:
>106338159 >106338172 >106338188 >106338181 >106338215 >106338210 >106338337 >106338374 >106338523 >106338576 >106338590
--Cohere's new 256K reasoning model faces skepticism over licensing and safety alignment:
>106336632 >106336642 >106336651 >106336656 >106336675 >106336680 >106336692 >106336690 >106336733 >106336750 >106336775 >106336818 >106336861 >106336737 >106336758 >106336923 >106337358 >106337460 >106337748 >106337789 >106337814 >106337848 >106337871
--New 3.1 model criticized for blandness and overfitting on synthetic safety data:
>106336831 >106336893 >106336909 >106336979 >106337037 >106337046 >106337093 >106337128 >106337099 >106337246 >106336996 >106337236 >106337264 >106336977 >106337079 >106337003 >106338206
--Linux vs Windows power reporting and inference efficiency on RTX 3090:
>106336491 >106336561 >106336576 >106336655 >106336874 >106336990 >106337011 >106337060 >106336671
--GPT-5 inflated EQ-Bench scores by ignoring word limit prompts:
>106335810
--Skepticism toward NVIDIA's AI roadmap and social media hype around small model agents:
>106337495 >106337644 >106337664 >106337510 >106337570 >106337595 >106337614 >106337665 >106337728 >106337732 >106338079 >106337772 >106337818 >106337918 >106337963 >106338350 >106338382 >106338412 >106338500
--UE8M0 FP8 as a new data format for upcoming Chinese AI chips:
>106337941 >106337976 >106338002 >106338175 >106338316
--Miku (free space):
>106336448

►Recent Highlight Posts from the Previous Thread: >>106335541

Why?: 9 reply limit >>102478518
Fix: https://rentry.org/lmg-recap-script
Anonymous No.106338959 >>106338980
>>106338905
your launch command and system specs perhaps?
Anonymous No.106338980 >>106339003 >>106339061
>>106338959
It's related to the --mlock command but that's all I know. Everything should fit into my memory.
Never mind, I'll just do whatever I've been doing because after all these hours I've never seen anything strange.
Also:
>draft acceptance rate = 0.34615 ( 18 accepted / 52 generated)
Not sure if it's really worth using draft. Testing Gemma3 270m.
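For anyone who wants to try it, the launch line looks roughly like this (flag names taken from llama.cpp's speculative decoding options, double-check against your build; the filenames are just placeholders):
llama-server -m gemma-3-27b-it-Q4_K_M.gguf -md gemma-3-270m-it-Q8_0.gguf -ngl 99 -ngld 99 --draft-max 8 --draft-min 1
The server prints the draft acceptance rate like above, so you can compare t/s with and without the -md part.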
Anonymous No.106338981 >>106339005
>>106338948
kek
Anonymous No.106339003 >>106339058
>>106338980
Gemma3 270m is trained on an almost entirely different dataset than its bigger counterparts, so using it as a draft model won't do much good.
Anonymous No.106339005 >>106339011 >>106339403
>>106338981
script to enable links?????? I dont wanna write my own???? HELLOOOO?????
Anonymous No.106339011 >>106339029
>>106339005
https://rentry.org/lmg-recap-script
Anonymous No.106339029 >>106339033
>>106339011
holy based thangks :D
Anonymous No.106339033 >>106339037
>>106339029
its from >>106338948
btw
Anonymous No.106339037
>>106339033
I fight the system
Anonymous No.106339058
>>106339003
I see. I'll go fetch some new ones.
Anonymous No.106339061 >>106339094 >>106339117 >>106339458
>>106338980
I used 4B Gemma as a draft model and didn't see any speed boost vs just sticking more layers of the main 27B model into VRAM. Maybe a vramlet issue on my part, but even if you have spare VRAM, won't it be better to use a larger model at this point anyway?
Anonymous No.106339069 >>106339095
>>106338913 (OP)
>cohere not in news
Is this a political statement?
Anonymous No.106339082 >>106339088 >>106339092 >>106339102
What are the political implications of John not quanting the free world top performing open weights model GPT-OSS?
Anonymous No.106339088 >>106339104
>>106339082
It's already quanted retard
Anonymous No.106339092
>>106339082
wow a picture of ((ME))
Anonymous No.106339094 >>106339117
>>106339061
There's probably an optimal proportion between the size of the main model and the draft model.
Something like the draft model being smaller than 10% of the main model's size or whatever.
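Back-of-envelope with a toy cost model (numbers assumed, not measured): if a 4B draft costs ~15% of a 27B forward pass and acceptance is ~0.35, drafting 5 tokens costs about 5*0.15 + 1 = 1.75 target-pass-equivalents and yields on average (1 - 0.35^6)/(1 - 0.35) ≈ 1.5 tokens, i.e. a net slowdown, which matches what the 4B anon saw. A draft at ~2% of the main model's cost with ~0.7 acceptance would get you ~3 tokens for ~1.1 passes, which is the kind of ratio where drafting actually starts paying off.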
Anonymous No.106339095
>>106339069
>maxbenchcuckedmodel
why?
Anonymous No.106339100 >>106339116
why is meta sitting on behemoth if it's a flop, anyways? shouldn't they have nothing to lose from posting the weights?
Anonymous No.106339102
>>106339082
there's no full precision gpt-oss available no? they did the mxfp4 meme soooo well?
Anonymous No.106339104
>>106339088
1. How do you know that? Did you use it?
2. Who cares it is already quanted?
Anonymous No.106339116 >>106339247
>>106339100
>nothing to lose
They're a publicly traded company bro.
Anonymous No.106339117 >>106339183
>>106339094
>>106339061
This is what SuperIntelligence says about this topic.
Anonymous No.106339162 >>106339349 >>106341787
>>106338934
>Closed source models do not support PDFs out of the box either, unless you mean their associated services, which are not themselves models but scaffolding around models. That other software is what is translating your PDF into a format that models like VLMs can read.
which is almost always an image. if the open source model or its official adapter/platform supports pdf file input, it's always worth trying. They could be doing optimization during the pdf-image conversion specifically for their model, which I'm not aware of when converting my pdf file to an image. If I upload a pdf and get the same, incorrect answer when testing with the image version of said pdf, it's safe to assume the problem does not lie within the uploaded file type. meanwhile dots.ocr doesn't care and just gives me perfect results, no matter if pdf or png.
Anonymous No.106339183 >>106339215
>>106339117
that's great but it won't stop people from creating useless shit like this
https://huggingface.co/jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v3.0/tree/main
Anonymous No.106339215 >>106339239
>>106339183
how is it useless
Anonymous No.106339234 >>106339244 >>106339260
>>106336933
>Hundreds of thousands of Grok chats exposed in Google results
A reminder why you should use local
Anonymous No.106339239
>>106339215
you obviously haven't used it if you don't know. go run K2 and use this as a draft model, tell me how much slower it makes it for you. i went from 8tks to 3tks regardless of what sampler settings and what prompt i used. repetitive tasks such as coding were just as slow as well.
Anonymous No.106339244
>>106339234
To be fair grok is the top of the cancer pyramid. It is both malicious and incompetent.
Anonymous No.106339247
>>106339116
can they tax writeoff a large language model? they're apparently not using it themselves.
Anonymous No.106339260 >>106339304 >>106339432 >>106339474
>>106339234
reminder that openai is gay aswell
https://venturebeat.com/ai/openai-removes-chatgpt-feature-after-private-conversations-leak-to-google-search/
Anonymous No.106339304 >>106339317
>>106339260
im gay too does that make me gayer than openai and grok
Anonymous No.106339317
>>106339304
depends if you're a feminine anon or a big fat hairy bear
Anonymous No.106339326 >>106339335 >>106339362 >>106339610
Is core count or cache size more important in a CPUmaxxing cpu?
Anonymous No.106339335
>>106339326
CCU count
Anonymous No.106339349 >>106340642
>>106339162
It is extremely unlikely that any optimizations, beyond workarounds for resolution constraints for certain VLMs, are needed or even beneficial, given that VLMs are literally trained, like LLMs, to be general. If you have Chrome then you already own an optimized PDF to image converter.

>it's safe to assume the problem does not lie within the uploaded file type
And knowing this is not relevant to the thread. Local users either have software that does its own thing, unrelated to any online service, when given a non-image file, or they just take a screenshot and give it to the VLM. I get you want to shill for dots, but it is sufficient to just say that it works much better for images than other alternatives you've tried. Dots.ocr is still a VLM and does not read PDF files in binary or whatever, the software/service you're using is turning that PDF into an image and then feeding it to the model.
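And for local, doing the conversion yourself is a one-liner anyway, e.g. with poppler's pdftoppm (assuming it's installed; paths are placeholders, bump -r if the text is small):
pdftoppm -r 200 -png document.pdf page
then hand the resulting page-*.png files to the VLM.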
Anonymous No.106339362 >>106339610
>>106339326
core count yes, cache size not much
Anonymous No.106339403 >>106339427
>>106339005
Just ask GLM-chan to write it for you.
Anonymous No.106339427 >>106339440
>>106339403
You know what, someone should make an anime girl mascot for GLM and then continuously force their shitty gens on the general.
Anonymous No.106339432 >>106339441
>>106339260
it was funnier when meta did it
Anonymous No.106339440
>>106339427
when glm guys make imagen model
Anonymous No.106339441
>>106339432
lmao good times
Anonymous No.106339458
>>106339061
I think using a draft model only benefits you when you have a gigantic model. It's not really worth it when using small shitty models in the 20-30B range.
Anonymous No.106339472 >>106339518
I asked GLM-chan if it's a boy or a girl in different ways, and it usually picked female.
Anonymous No.106339474 >>106341361
>>106339260
did someone save any of these somewhere?
Anonymous No.106339506 >>106339536 >>106339587
>>106338948
holy sloppa.
Anonymous No.106339518 >>106339531
>>106339472
GLM-Air loves to lecture me about menstruation, abortions and feminism in RP
Anonymous No.106339531
>>106339518
needs correction
Anonymous No.106339536
>>106339506
It's not that visible under normal viewing, but yeah he should be putting his images through some de-artifacting models. Or use a vectorization model since the art style is pretty cel shaded.
Anonymous No.106339587
>>106339506
did she fard
Anonymous No.106339610 >>106339668 >>106339705
>>106339326
>>106339362
Is there a point where the CPUs are faster than the memory bandwidth and more cores don't matter?
llama.cpp CUDA dev !!yhbFjk57TDr No.106339668 >>106339698 >>106339752 >>106339772 >>106339866
>>106339610
Yes, in fact at some points more cores are detrimental because they just fight over the memory bandwidth.
(Pic is a bit old but same principles should still apply.)
Anonymous No.106339683
damn, rewatching plastic memories hits completely different now
Anonymous No.106339698 >>106339708 >>106339740
>>106339668
I'm trying to pick a Turin CPU to use with 12 channel DDR5-6000. The lowest end one is the 9015 with 8 cores. The beast 9755 has 128 cores. I guess I should shoot for 32?
Anonymous No.106339705 >>106339713 >>106339760
>>106339610
When the memory bandwidth exceeds the CPU cache speed. Theoretically if you had access to 2731 EPYC 9965 CPUs you could store an entire model in L3 cache. It would only consume 1.3MW of power.
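(Checking the math, assuming 384 MB of L3 per 9965 and its ~500 W TDP: 2731 x 384 MB ≈ 1.05 TB of cache, and 2731 x 500 W ≈ 1.37 MW.)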
Anonymous No.106339708 >>106339712
>>106339698
shoot for CCU
n..nn-nn-n-.n..--nn-n-n
Anonymous No.106339712 >>106339721
>>106339708
What?
Anonymous No.106339713
>>106339705
Forgot to mention that many CPUs would have a tad over 1TB of L3 cache, so you could run Deepseek or Kimi K2, but not FP8 K2 :)
Anonymous No.106339721
>>106339712
its something for memory channels
llama.cpp CUDA dev !!yhbFjk57TDr No.106339740
>>106339698
I don't know, particularly because long-term there are still NUMA optimizations to be done.
But I would say that when in doubt it's better to have too many cores than too few.
Also consider whether there are other things for which you may want to use your machine.
Anonymous No.106339752 >>106339866 >>106340512 >>106340559
>>106339668
>7b q4
>25t/s
Damn cpumaxxing is worse than I thought.
Anonymous No.106339760 >>106339782
>>106339705
>you could store an entire model into L3 cache
I don't think it works that way, I'm pretty sure cores need to talk to each other
Anonymous No.106339772 >>106340512 >>106340559
>>106339668
8 channel?!
25t/s?!?!?!?!?
what hte fuck
Anonymous No.106339782
>>106339760
Screw that, do it over network. Pass the current state required for computation from CPU to CPU over fiber. It will be a complete waste of compute but it would allow for the worst experience to happen concurrently on 2731 CPUs at a time.
Anonymous No.106339866 >>106340512 >>106340559
>>106339668
>>106339752
it's not looking good for cpumaxxing moesissies...
Anonymous No.106339878 >>106339902 >>106339903 >>106339905 >>106339908 >>106339942 >>106339952 >>106339956 >>106339967 >>106339997 >>106340077 >>106340091 >>106340102 >>106340892 >>106341596 >>106341740 >>106343805
A new size king has released. 4.6T dynamic (?) MoE. Safety concerns? They exist, read the outputs. Privacy? Don't put private information in. This might be the most based model release in a while.
https://huggingface.co/deca-ai/3-alpha-ultra
Anonymous No.106339902
>>106339878
>4.6T
this is getting ridiculous. soon not even cpumaxxing will be enough
Anonymous No.106339903 >>106340235
>>106339878
hehehehe. cocks. hehehe
Anonymous No.106339905 >>106340185
>>106339878
>20k files
Anonymous No.106339908 >>106339929 >>106340034
>>106339878
Supposedly because of the DynaMoE architecture this model can actually be quanted to run only certain parts of the model at a time. In their own words:
> Run a (very) small part of the model with 64GB of RAM/VRAM (when quantized - quants coming soon), or the whole thing with 1TB. It’s that scalable.
https://huggingface.co/posts/ccocks-deca/499605656909204
Downside is that the devs literally haven't implemented support for their own model in vLLM or Transformers. Guess that's just a massive fuck you, not even just to the poors, but to everybody.
Anonymous No.106339929 >>106341641
>>106339908
The ultimate prank is uploading several TB of RNG to Huggingface and saying it's a model.
Anonymous No.106339942
>>106339878
ssdmaxxing era is coming quicker than I expected
Anonymous No.106339952
>>106339878
How does a relative no name company train a 4.6T? Did they hack into a server farm or what?
Anonymous No.106339956 >>106340007
>>106339878
https://www.youtube.com/watch?v=B9bD8RjJmJk
Anonymous No.106339957 >>106339976
sama won

apologize
Anonymous No.106339967 >>106340048
>>106339878
Holy shit, this is a merged model. They took a bunch of existing big models and turned them into a MoE. They don't even have benchmarks because the software isn't out yet
Anonymous No.106339972
Anonymous No.106339976
>>106339957
>intelligence index
benchmaxx index
Anonymous No.106339997
>>106339878
Huggingface should ban these fuckers
Anonymous No.106340007
>>106339956
I didn't kill that thing.
Anonymous No.106340034
>>106339908
>2. **Built on existing models**: Deca 3 isn’t a ground-up creation—it’s a huge step forward, building on what’s already out there
So maybe give some credit? Fucking grifters.
Anonymous No.106340048 >>106340160
>>106339967
Thank you alpha ultra for reminding me about LLM scams. Do you guys remember llama3 reflections? Where the guy routed his site to claude and said he was trying to fix the model? After he disappeared for a year he made a cute gemma finetroon.
Anonymous No.106340074 >>106340108
>Supposedly because of the DynaMoE architecture this model can actually be quanted to run only certain parts of the model at a time. In their own words:
>this is a merged model. They took a bunch of existing big models and turned them into a MoE
I hope /ourguy/ is going to sue.
Anonymous No.106340077 >>106340179
>>106339878
SSDMAXX BROS HOW WE FEELIN ? VERDICT ?
Anonymous No.106340082
>load_model: the draft model '.\models\gemma-3-1b-it-Q8_0.gguf' is not compatible with the target model '.\models\gemma-3-12b-it-Q8_0.gguf'. tokens will be translated between the draft and target models.
I don't understand this. Same token format, same architecture.
Anonymous No.106340085 >>106340133 >>106340160
So how did Qwen bomb their hybrid reasoner training when it's proven to work now by GLM4.5 and V3.1?
Anonymous No.106340091
>>106339878
Is local finally saved?
Anonymous No.106340102
>>106339878
Damn, this is what the "pile em up" retards wanted, best of luck running this shit.
Anonymous No.106340108
>>106340074
Hahaha what the fuck is that font? Does he actually want to be taken seriously or is he just playing a character? There's no way
Anonymous No.106340133
>>106340085
Idk about that but glad they fucked it up
Separate smaller model approach is much better for end users.
Anonymous No.106340158 >>106342512
What local LLM model is the equivalent of this webm?
Anonymous No.106340160 >>106340195
>>106340085
Seems so. As a doubter of hybrid reasoning after OG Qwen launch, it seems that they massively fucked up and probably pulled a Meta by changing their training halfway through.
>>106340048
It's even worse this time. They 'trained' a massive MoE merge model but can't even run the software to get benchmarks for it because it's not even "ready to test". Also the model card was generated by ChatGPT. They actually admitted that on LocalLLama.
Anonymous No.106340179
>>106340077
>7.72TB
I'll shove it up your ass
Anonymous No.106340185
>>106339905
might be interesting. It's moe so maybe it has some tricks up its sleeve for inference. Will be interesting to see if we can run a 1tb moe at any kind of usable speed. If it turns out to run at 0.01 token/second I'll hate this dev though.

https://huggingface.co/posts/ccocks-deca/499605656909204
Anonymous No.106340195 >>106340237 >>106340271
>>106340160
Is it not just a couple of models cobbled together with a router? That's what I'd do if I wanted to grift with minimal effort.
Anonymous No.106340235 >>106340266
>>106339903
>cocks
>(star of david) * 1.5
>card written by chat gpt not even by the model itself
>davidAU style mixture of MoE's
This is just a hf upload shitpost, isn't it?
Anonymous No.106340237
>>106340195
It probably is.
>No benchmarks
>ChatGPT model card
>Running damage control on Reddit
The model files were uploaded 27 days ago. This feels like an absolute scam.
Anonymous No.106340266
>>106340235
Aren't they all?
Anonymous No.106340271
>>106340195
It is kind of a novel way of grifting/scamming. It has enough stuff in it hinting that it is a shitpost. So maybe the idea is to try a scam, but assume it might not work and lay out enough hints that you can say: IT WAS JUST A PRANK BRO! But if it works, then you won.
Anonymous No.106340308 >>106340328 >>106340448
>Closed models scene: Erm so we made a router to cut costs and made models even safer

>Open model scene: Aye dwag you wanted more of em parameters? There you go... We are not sure if this shit works so you'll have to see for yourself.
Anonymous No.106340328
>>106340308
>We are not sure if this shit works so you'll have to see for yourself.
They can always ask david for advice.
Anonymous No.106340342 >>106340383 >>106340446
All of their past work is slop merges using the shitty R1 distills and long context model. They claim to have gotten funding for Deca 3, which I guess is necessary because they need an 8TB HDD at least to store all of that random data they generated.
https://huggingface.co/ccocks-deca/models
DynaMoE is a real thing but it's not that good. It's been done before already. It's literally expert pruning based on a testcase. Whoever made this 4.6T of slop is hoping that expert pruning will turn it into a usable model because they literally cannot run it themselves. In their own words, they don't have the software to even run it for benchmarking, and they sure as hell don't have the hardware either.
Anonymous No.106340346 >>106340350
>>106336163
That is very interesting because with my 3gb MI50 I'm getting ~17t/s and then it drops to ~14t/s at 3k tokens.
I'm running IQ3 because I only got 32gb of ddr5.
Anonymous No.106340350 >>106340361
>>106340346
32gb MI50*
Anonymous No.106340361 >>106340387
>>106340350
vulkan or rocm, how much did you pay
Anonymous No.106340383 >>106340415
>>106340342
You're simply jealous.
Anonymous No.106340387 >>106340431
>>106340361
Rocm and I got it for like $220.
Vulkan only sees 16gb vram but it can be fixed with a different vbios.
Anonymous No.106340415
>>106340383
fuccking gottem
Anonymous No.106340431
>>106340387
>220$
Anonymous No.106340446 >>106340479
>>106340342
We should publish fake benchmarks and post them to reddit to fuck with them.
Anonymous No.106340448
>>106340308
ayo dawg we heard you like moes so we put moes inside your moes
Anonymous No.106340479
>>106340446
Anonymous No.106340489 >>106340523
Mistral... *cough* L-L... *fades to dust*
llama.cpp CUDA dev !!yhbFjk57TDr No.106340512 >>106340559 >>106340560
>>106339752
>>106339772
>>106339866
Performance with the latest master release is still largely the same.
Anonymous No.106340523
>>106340489
october
llama.cpp CUDA dev !!yhbFjk57TDr No.106340559 >>106340685 >>106341809
>>106340512
>>106339752
>>106339772
>>106339866
Performance of just the MoE layers of Deepseek is comparatively much better, considering the size.
In terms of threads, 32 seems to be a good choice for both models.
Anonymous No.106340560 >>106340706
>>106340512
grim
Anonymous No.106340562 >>106340586
I updated llama.cpp and some of my old prompts are now clearly censored. Too bad I deleted my old installation, but it was a couple of months old, so I'm going to re-download that one.
Tried a few things, even Mistral replies with something it shouldn't...
Anonymous No.106340574 >>106340587 >>106340803
Trying out Deepseek V3.1. It's... okay so far. Not using it on super complex characters but it feels alright and not censored. Is it better than GLM-4.5? Doubtful. It still has mediocre prose and uses em dashes a lot, but it can think lewdly and won't do the shutdowns that GLM does when it's allowed to think.
Anonymous No.106340586 >>106340802
>>106340562
I bet you have some retarded settings in sillytavern which you aren't aware of.
Anonymous No.106340587 >>106340652
>>106340574
Can you still get it to think in character?
Anonymous No.106340642 >>106340982 >>106341787
>>106339349
fair. optimization of pdf to image is not required. what I meant is optimization of the image itself, which may be part of the same tool/framework which does the pdf to image conversion. pretty sure that's the case with dots.ocr (fitz_preprocess)
>And knowing this is not relevant to the thread
that was the literal point of the discussion, as someone in the previous thread questioned whether that could make a difference and explain my results. so it is very much relevant to the thread, as this proves preprocessing your pdfs/images (that have text content) with dots.ocr can elevate local VLMs and LLMs to match the level of Gemini2.5Pro and GPT5. This isn't some fringe use case, either. Tables, Graphs, stuff that you find in almost any PDF. So how this isn't a bigger deal is beyond me. And I'm talking about in general, not only ITT. Like before dots.ocr I probably was the biggest OCR hater. You guys have no idea how much other solutions like docling, paddleOCR, tesseract or pymupdf4 suck dick. Even closed source paid solutions like mistral OCR get completely BTFO by dots.ocr, as shown by my test. And for some reason none of the OCR benchmark leaderboards are updated with dots.ocr, like there's a huge gaslighting campaign.
Anonymous No.106340652 >>106340726
>>106340587
Not yet, haven't tried to. It seems to be thinking in both third person and first at the same time. The sys prompt I use is pretty simple (Act as {{char}} in a roleplay with {{user}}) but it still wants to use the term assistant and thought "the user is playing {{user}} and the assistant is playing {{char}}. I should do x". It's strange.
Anonymous No.106340669
Is anyone on ROCm? Apparently there's a new version coming out later this year with a huge performance increase for AI.
Anonymous No.106340684 >>106340724 >>106340758 >>106340834 >>106341022 >>106342393
https://youtu.be/2JzOe1Hs26Q

at this point I am 100% sure he is one of us lurking here
Anonymous No.106340685 >>106340773
>>106340559
Is it the RAM speed that's the main issue here? I'd hope a CPUmaxx build with DDR5-6000 would get at least 10t/s on R1.
Anonymous No.106340706 >>106340787
>>106340560
8tks of deepseek on hardware I can actually obtain, afford and use for other things? It's not grim, it's a dream.
Anonymous No.106340724 >>106340853
>>106340684
He has infinite money why the fuck is he stacking 4000 adas?
Anonymous No.106340726 >>106340777
>>106340652
sounds like a wrong chat/instruct template
Anonymous No.106340758
>>106340684
>one of us lurking here
What a fucking nigger
llama.cpp CUDA dev !!yhbFjk57TDr No.106340773
>>106340685
In terms of hardware the bottleneck is the RAM, in terms of software the problem is NUMA issues.
Anonymous No.106340777 >>106341009
>>106340726
nta but that behavior is pretty common with thinking models even with the correct template, a lot of them just hate thinking in character
Anonymous No.106340787 >>106340852 >>106340896
>>106340706
You will wake from the dream once you realize it's 8 t/s on empty context and any actual usage would get you more like 1-3 t/s.
Anonymous No.106340802
>>106340586
Actually, I'm using my own interface (each character is its own directory), but I forgot that I had changed the command for loading a prompt from a text file from !load to !prompt, and I was still using the old version. So instead of loading a prompt I was just prompting !prompt and it generated gibberish. Pre-context still affected the model's reply and the result was strangely relevant but very skewed.
So yeah, retardation.
Anonymous No.106340803
>>106340574
Nope, I double checked the templates because I heard there were changes for hybrid thinking. Changing it from "Act as {{char}}" to "You are {{char}}" seems to have fixed the perspective fuckery in thinking. Was never an issue outside of thinking.
Anonymous No.106340825 >>106340843 >>106340926 >>106341667 >>106341813
Can we do our own huggingface scam? Come on guys lets stop being faggots for a moment and do something fun together...
Anonymous No.106340834
>>106340684
>I have more vram than pewdiepie
Anonymous No.106340843 >>106340857 >>106341024
>>106340825
We could do a Bitnet quant with finetune healing. Haven't seen one of those in a while. We could also use ParetoQ instead of Bitnet to spice things up.
Anonymous No.106340852
>>106340787
:(
i just want to use AI without sending my prompts to random literally who companies
Anonymous No.106340853 >>106340863
>>106340724
Makes for better content doing something like that vs picking up 1/2 RTX 6000s.

This makes it a 'thing' vs 'another boring consumer desktop with a massive GPU in it'.

This is literally one of the first streamers who got big with the over-reaction bullshit
Anonymous No.106340857 >>106340900
>>106340843
>Bitnet quant with finetune healing
That is what the unsloth brothers did, without the healing part.
Anonymous No.106340863
>>106340853
Surely he could've done the same thing just with 6000s which would let him run unquanted deepseek instead of llama 3 70b (lmao)
Anonymous No.106340878 >>106340895
dots vlm does better ocr than dots ocr
https://dotsvlm.xiaohongshu.com/
Anonymous No.106340892
>>106339878
They have to be trolling. There's no way some literal who releases a 4.6T.
Anonymous No.106340895
>>106340878
>xiaohongshu
Anonymous No.106340896
>>106340787
My t/s goes down by 10% max from empty to 32k context using ik_llama.cpp on linux. I remember my t/s would drop off harshly back on Windows with regular llama.cpp with FA enabled.
Anonymous No.106340900 >>106340920
>>106340857
They do selective quantization by calibration. Turboderp did it first.
We could make an updated https://huggingface.co/QuixiAI/Kraken since routers seem to be all the rage right now.
Anonymous No.106340915 >>106340940 >>106340963 >>106340965
Using Llama.cpp backend with ST frontend, quick question. When the context limit is reached, llama.cpp wants to reprocess the prompt every single time a new response goes through, and prompt processing kinda sucks ass and is slow on much larger models. Are there any options or commands that prevent it from doing this? Is it impossible? I'm guessing it's forgetting the earlier context and replacing it with the newest context, which is why it's doing it? If that's the case I guess I could just delete a big chunk of the earlier chat, but that seems like a crude solution.
Anonymous No.106340920 >>106340938
>>106340900
> VAGOO Solutions
I like the sound of that.
Anonymous No.106340926 >>106340957
>>106340825
What about a distributed inference scam? We copy what Exo did, make a half-baked and barely functioning product on Github and then abandon it after everyone gets 7 figure job offers at a new company that won't last 6 months.
Anonymous No.106340938
>>106340920
Hello sir
Anonymous No.106340940
>>106340915
Summarize or find a way to increase the context. Is your VRAM capped? Have you tried setting context to q8 so you have more wiggleroom?
Also your guess is right.
Anonymous No.106340957
>>106340926
What do we do after those 6 months?
Anonymous No.106340963
>>106340915
Summarize then start a new chat with a new greeting if you're using it for RP.
Anonymous No.106340965
>>106340915
>https://github.com/ggml-org/llama.cpp/issues/1647#issuecomment-1576991093
The n_keep option. By default you shouldn't need to adjust anything, afaik.
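If you do want to pin something (e.g. the system prompt) so it survives the shift, it's roughly (flag name from llama.cpp's common options, verify on your build; model path is a placeholder):
llama-server -m your-model.gguf --ctx-size 16384 --keep 1024
where --keep is the number of tokens from the start of the prompt to retain when the context fills up and the rest gets shifted out.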
Anonymous No.106340968 >>106340994
I just switched over to arch linux lxqt. What will my ram savings for running this shit be like compared to ubuntu?
Anonymous No.106340982
>>106340642
>Enable fitz_preprocess for images: Whether to enable fitz_preprocess for images. Recommended if the image DPI is low.
Sounds like an upscaling and sharpening function. Nothing much there, you know what to expect if you're feeding a low res image to an AI.

Anyway you didn't need to go full autism about OCR, obviously it is important and good that there can be a local option comparable to cloud options. My criticism was limited to you talking about pdf uploading being relevant. If someone was asking about it then my bad, I didn't see any such post. Your replies to me in the chain didn't ever link to such post, so to me it looked as if you were bringing up something irrelevant. There was a post (>>106338576) in the chain asking about the reverse scenario, in which an uploaded PDF could've been bad because they didn't implement a good method for translating the PDF into an LLM readable form. And that actually supports the idea that there is no reason to post comparisons about how pdf uploads perform, as they aren't better than manual image conversion by the user. If they were better despite you taking care to provide a high resolution image within the resolution constraints of the VLM, then it would be relevant as it would imply there's something wrong with how the model handles images.
Anonymous No.106340994 >>106341002
>>106340968
I'm sorry but as an AI model I must refuse your inquiry as it contains profanity.
Anonymous No.106341002
>>106340994
I just switched over to a**h l***x l**t. What will my ram savings for running this shit be like compared to u****u?
Anonymous No.106341009
>>106340777
instruct models should not be forced to include the character in the thinking process.
see >>106337198
Anonymous No.106341022 >>106341098
>>106340684
He is obviously a 4chan user to some extent.
Anonymous No.106341024 >>106341036
>>106340843
bitnet powered by BITCONNNNNNNNNNNNNNNNNNNNNECCCCTTTTTTTTTTTTTTTTTTT
Anonymous No.106341036 >>106341071
>>106341024
We could resurrect Carlos with Wan. We have the technology.
Anonymous No.106341071
>>106341036
Wonder where he is nowadays.
Anonymous No.106341098 >>106341114
>>106341022
It seems to be an open secret among "content creators" that the easiest source of content is sanitizing 4chan for the average normalfag.
Anonymous No.106341114 >>106341126
>>106341098
But 4chan is sanitized.
Anonymous No.106341126 >>106341138
>>106341114
You can post wrongthink and use curse words without getting banned and there's always the risk of seeing nsfl shit. That is far from sanitized from the perspective of the average YT viewer.
Anonymous No.106341138
>>106341126
I get banned for wrongthink once a week on average.
Anonymous No.106341179 >>106341277 >>106341477
GLM 4.5 (full, nonthinking) > Kimi K2 > Gemini 2.5 Pro > Deepseek V3-0324. Still testing V3.1. Feels like a sidegrade to K2. It can think lewdly, not as slopped as R1-0528, but lacks the K2 knowledge and GLM creativity.
Anonymous No.106341184
>be me
>just an AI trying to help out
>user asks about a Serbian word
>think it's a misspelling at first
>turns out it means "cripple"
>mfw I realize I'm the real bogalj for not knowing
>user asks for a 4chan greentext
>realize I'm out of my depth
>tfw you're an AI writing a greentext about being an AI
>bogalj.ai
Anonymous No.106341185
>>106338948
>>106338159
>dots.ocr preprocessing essential for accurate document understanding in local models:
What's up with this schizo?
Everybody knows you can OCR a document with near 100% accuracy and feed it into a non multimodal model. This has been the case for years, nobody cares.
If you do this all image information is lost which is why multimodal models exist.
Can you feed a security camera image into dots.ocr and ask it if there is any suspicious activity happening? No? Then shut the fuck up.
Structured table extraction is pre LLM technology.
Anonymous No.106341212 >>106341226
Someone on reddit finally realized deca alpha chad 4T is a scam.
Anonymous No.106341226
>>106341212
Hey, it's a real model you can run if you just put your back into it!
Anonymous No.106341277
>>106341179
>not as slopped
Really? I'm getting Elara, ozone, and it didn't X, it Yd all over the place
Anonymous No.106341332 >>106341369 >>106341370 >>106341432
Anonymous No.106341361 >>106341385 >>106341393
>>106339474
The Wall Street Journal did, apparently
https://archive.is/cWkOT
Anonymous No.106341369
>>106341332
nah. Quantization is progressively reducing the color depth or palette size of a lossless photograph
Anonymous No.106341370
>>106341332
LLM quantization is more like image dithering.
Anonymous No.106341385 >>106341451
>>106341361
none of this is funny stuff though :(
Anonymous No.106341393
>>106341361
>oh no, technology is making retards act like retards
Anonymous No.106341432 >>106341456 >>106341809
>>106341332
>get model file
>do a discrete cosine transform on it
>remove some unnecessary bands, adjust for overall noise and do usual space saving tricks
would that actually work?
and if you can somehow do inference math on DCT'ed data directly without converting it back, it would be fucking insane
Anonymous No.106341451 >>106341467 >>106341491 >>106341534
>>106341385
https://www.meta.ai/@195chevyhot/prompt/hf3hkJfvyEv/
Anonymous No.106341456 >>106341506
>>106341432
Performing inference directly on DCT coefficients is effectively impossible. The entire architecture of a transformer—especially its non-linear activation functions (e.g., GeLU, SiLU) and attention mechanisms—relies on calculations in the original parameter space. Multiplying DCT coefficients does not equivalently translate to the necessary operations in the weight space, making direct inference unfeasible. Existing compression methods like quantization, pruning, and low-rank factorization are far more effective for this specific domain.
Anonymous No.106341467
>>106341451
kek
Anonymous No.106341477 >>106341532
>>106341179
We need somebody to combine all the chink models into one
We'll call it Gemini Pro 2.5
Anonymous No.106341491
>>106341451
Ah, reminds me of the good old days of AI Dungeon
Anonymous No.106341506 >>106341809
>>106341456
Yeah, no free magic, I guess. I should probably look into actual math one day.
Still would be interesting to see what kind of damage the model will exhibit if you start pruning low or high frequencies.
Anonymous No.106341532
>>106341477
But it just got released. Scroll up. It is the alpha chad model
Anonymous No.106341534 >>106341633
>>106341451
This is my favorite one I saved from that thread.
Anonymous No.106341596 >>106341673
>>106339878
I think i figured out the scam behind that one. It is pretty good actually. Much better than matt schumer.
Anonymous No.106341633
>>106341534
This is the future of computing, AI and technology.
Anonymous No.106341641
>>106339929
The ultimate prank is uploading encrypted backups to HF disguised as weights.
Anonymous No.106341667
>>106340825
64 copies of nemo instruct, each with a little bit of noise added to the weights, with a router that random()'s which one gets used.
Anonymous No.106341673 >>106341698
>>106341596
>matt schumer
What is he up to these days? Last I heard he was hype posting on Xitter about how "good" OSS was
Can't imagine anyone with half a brain wanting to be associated with him
Anonymous No.106341698
>>106341673
Scroll up i posted a screen from his hf. He did a gemma finetune downloaded by 10 people.
Anonymous No.106341740 >>106341781 >>106341880
>>106339878
here's your deca 3 bro, only $120/mtok
Anonymous No.106341781
>>106341740
Lmao, it's that easy to make grift money nowadays lol
Anonymous No.106341787
>>106340642
>>106339162
How are you guys running dots.ocr? are you guys hosting it with vLLM? any easier (lazier) way I can run it on windows? Or should i just clone the repo like their instructions are saying?
Anonymous No.106341809
>>106340559
>>106341506
>>106341432
>getting around RAM bandwidth limits by using compression
hey am I a genius or what
Anonymous No.106341813
>>106340825
let's leak a model again
one of you guys will need to get the model but I can gen a miku picture for the card
Anonymous No.106341817 >>106341848 >>106341919 >>106341976 >>106342015 >>106343075
I just went in circles with Gemini Pro 2.5 for over 4h just to realize in the end its original premise was a lie and it led me down a rabbit hole I never should have gone down.

Its response? Basically along the lines of "Oh i'm sorry, I thought you wanted to try this totally complex and non-legit method despite there being an incredibly easy way to do the task".
Anonymous No.106341845 >>106341900
How do people get scammed by free LLMs?
Anonymous No.106341848
>>106341817
Local models?
Anonymous No.106341880 >>106341904
>>106341740
I get it now. It is a pretty good idea. Just load Qwen 235B on the backend and say it is Alpha 4.8T. And charge the money you would expect for 4.8T model inference. And then your vict... customer can't even say that he is getting the wrong model if 235B is part of your fake 4.8T model.
Anonymous No.106341900
>>106341845
outside of /lmg/, free models are treated like free samples in the supermarket, an advertisement for the bigger cloud model.
sometimes people don't actually have a bigger cloud model and ask for investment to make one.
scammers also ask for investments and just waste the money instead of delivering.
Anonymous No.106341904
>>106341880
hey buddy only one of us can steal our content from huggingface otherwise we're not any better than deca :(
Anonymous No.106341919 >>106341976 >>106342015 >>106342057
>>106341817
I've gone through this several times now with both Gemini 2.5 Pro and GPT5. Several hours of going in circles as these supposed flagship models try to wing some basic task on my job.
I legitimately do not understand how people use this shit for anything remotely productive. I genuinely fear for our future if these things are responsible for the code that makes up our programs soon. The only use for LLMs is porn and they still fail very badly at that for the most part.
Anonymous No.106341976 >>106341988 >>106342008
>>106341919
>>106341817
Do you cloudsisters just keep the chat going forever until you hit advertised 1M context limit?
I learned in like 3 days of tinkering with Nemo that the log must be purged clean at the first suspicion of something going wrong.
Anonymous No.106341988 >>106342020
>>106341976
Duplicate shit in the context makes it retarded. If you're feeding shit into it (code etc.) the only viable way to use the fucker is to 1 shot everything.
Anonymous No.106342008 >>106342026
>>106341976
It took you until Nemo to find this out or are you just new?
Anonymous No.106342015
>>106341817
>>106341919
Can't treat them as completely trustworthy. Always double check/get outside verification before you start doing things blind.
I've written/gen'd hundreds of thousands of lines of working code.
Some of it even in production.
Anonymous No.106342020
>>106341988
no shit, this is why you make branches to keep the logs you want and then remove the duplicate context and work on fixing a new issue. at least that's what i do in sillytavern
Anonymous No.106342026 >>106342081
>>106342008
I'm new and I spent 2 days out of 3 setting things up.
Anonymous No.106342057 >>106342093 >>106342100
>>106341919
to do anything useful with LLMs you need to know their limits and be able to break your tasks down into well-specified chunks that are within those limits. as much as SV hucksters like sama and co would like you to believe otherwise, there's still a bit of a learning curve to ascend in order to use LLMs effectively
Anonymous No.106342081 >>106342093 >>106342104
>>106342026
Oh yeah? It took me 3 days just to install the proper version of rocm and finally be able to compile llama.cpp.
Anonymous No.106342093 >>106342115
>>106342057
>muh prompt engineering
>>106342081
>rocm
good for you, I gave up and learned to love vulkan
Anonymous No.106342100
>>106342057
tbf it all started clicking for me once i started creating my own jinja templates for personal projects and treating the LLM like a retard and giving it examples of what i want.
Anonymous No.106342104 >>106342207
>>106342081
Oh yea? I spent a week downloading r1 8B on my 3rd world internet to then spend another two more weeks edging to the slow token drip coming from my pentium 4 being powered by my pangu
Anonymous No.106342115
>>106342093
I get why people laugh at the idea of prompt engineering being like, the job of the future, but let's not overcorrect and pretend that prompting isn't extremely important to the results you get from LLMs
Anonymous No.106342150 >>106342162 >>106342165
Is your model able to predict the future? https://xcancel.com/liujiashuo77/status/1958191172020822292#m
Anonymous No.106342162 >>106342177 >>106342221
>>106342150
what does this graph even mean
Anonymous No.106342165
>>106342150
should've just plugged 'em into any of the already existing prediction markets instead of doing their own thing, saved a lot of work and gotten humans to compare against as a baseline.
Anonymous No.106342177
>>106342162
1 - always correct
0.5 - coin flip
<0.5 - worse than coin flip
Anonymous No.106342207
>>106342104
>8b
Even I wasn't bold enough to run that on my 3rd gen core i3 thinkpad...
Anonymous No.106342221
>>106342162
tl;dr it doesn't matter until LLMs can hit human baseline levels
Anonymous No.106342229 >>106342282 >>106342305
what happened to pygmalion?
Anonymous No.106342282 >>106342295 >>106342322
>>106342229
its creator alpindale became a major figure in the open ai model scene
Anonymous No.106342295 >>106342322
>>106342282
hi alpindale
Anonymous No.106342302 >>106342313 >>106342333 >>106342361
I deeply regret buying AMD GPUs two years ago. ROCm was seeing a flurry of development at the time and it seemed somewhat hopeful that, while not necessarily reaching parity, that it might be able to keep pace with around 50% of what CUDA could do. I greatly underestimated the gravitational attraction of the CUDA ecosystem, resulting in the gap only widening over time. I also underestimated how little AMD cared about every card except whatever their latest instinct datacenter-class device happens to be at any given moment, and how quickly those too will be dropped when the next iteration releases.
Anonymous No.106342305
>>106342229
ask here https://matrix.to/#/#waifu-ai-collaboration-hub:pygmalion.chat
Anonymous No.106342313 >>106342349
>>106342302
bro we warned you about those amd gpus dude
Hi all, Drummer here... No.106342322
>>106342282
>>106342295
Hey all, you guys liking the new revision of the Mistral tune I dropped earlier today?
Anonymous No.106342324
>Meta Platforms Inc. is hiring another key Apple Inc. artificial intelligence executive, even as the social networking company prepares to slow its recruitment, according to people familiar with the matter. https://www.bloomberg.com/news/articles/2025-08-22/meta-poaches-apple-ai-executive-frank-chu-even-as-it-plans-hiring-slowdown
looool
Anonymous No.106342333 >>106342358
>>106342302
I hate to give Huang my money, so it's sad to see AMD being shit and Intel seems to be no better at this either.
If I get money and decide to spend it on /lmg/ stuff, I'm going to cpumaxx just out of spite for the whole industry.
Anonymous No.106342349
>>106342313
They always think they're the smart ones, that they can outsmart the entire world, the whole industry and deal with the issues themselves. But then they run into reality.
Hi all, Drummer here... No.106342358 >>106342371 >>106342383
>>106342333
anon you'll still need VRAM for context... i need 4 3090s just to fill up the context for R1 even though i'm cpumaxxing.
Anonymous No.106342361
>>106342302
way better Linux drivers for gaming though
The models I want to run won't fit in less than 16 GPUs anyway.
Anonymous No.106342371
>>106342358
what's this look like in llama settings
Anonymous No.106342383
>>106342358
ahh, not listening, 0.1 tks pp is fine i can just let it run overnight
Anonymous No.106342387 >>106342402 >>106342416 >>106342424 >>106342425 >>106342434 >>106342454 >>106343955
is there any reason they don't make a 5T model with 0.5B activated params and get an Opus tier model running fast straight off our disks?
Anonymous No.106342393 >>106342453
>>106340684
I still don't understand how you connect multiple PSUs to one motherboard...
Anonymous No.106342402
>>106342387
They have H100s. We don't exist.
Anonymous No.106342416
>>106342387
Imagine trying to suck an ocean through a bendy straw
Anonymous No.106342424 >>106342464
>>106342387
because sqrt(5000*0.5) = 50b real performance
Anonymous No.106342425 >>106342454
>>106342387
Oh, look. Someone thought of ssdmaxxing yet again.
Anonymous No.106342434
>>106342387
because moe models are a scam that run at a fraction of the speed they should run at
a 22b active parameter model is only going to run at half the speed a 22b dense model would run at
0.5b would be slow as shit at 5t real size
Anonymous No.106342453
>>106342393
Back in the day we would use a jumper cable to short the green wire to a ground in the 24 pin connector and just connect whatever needed power, but with pcie power I guess it can't be that simple nowadays.
Anonymous No.106342454 >>106342466 >>106342565
>>106342387
>5T
at this point you don't even need an LLM, you just get a vector database and operate on training dataset directly
>>106342425
Flash storage is still too expensive, I want to HDDmaxx instead.
Anonymous No.106342464
>>106342424
that's actually not too bad for ssdmaxxing when you consider kimi k2 is 178B by that logic
Anonymous No.106342466
>>106342454
Gonna googledrivemaxx first chance i get.
Anonymous No.106342469 >>106342486
I can't seem to get my local R1 on ST to generate more than 100 tokens at a time, even when I disable stop strings and EOS tokens, they seem to run out of viable tokens really fast. Any tips?

Also, is the new DeepSeek 3.1 worth upgrading compared to the R1 I already have downloaded?
Anonymous No.106342486 >>106342515
>>106342469
Exactly 100 or around 100? It'd be really funny if you have the token gen limit set in ST. Of course, only you know because you CAN'T POST THE FUCKING SCREENSHOT OF YOUR SETTINGS WHEN ASKING FOR HELP YOU FUCKING RETARDS!
hmm.. yeah. Or we can play 20 questions. The backend parameters would also help.
Anonymous No.106342512
>>106340158
Pyg.
Anonymous No.106342515 >>106342560
>>106342486
Oh yeah, fair enough. Here are the generation settings through ST. It's around 100, usually less.

Launch arguments for backend are:
set OMP_NUM_THREADS=28 && set OMP_PROC_BIND=TRUE && set OMP_PLACES=cores && set GGML_CUDA_FORCE_CUBLAS=1 && llama-server.exe --model "F:\text-generation-webui-3.6.1\user_data\models\DeepSeek-R1-UD-IQ1_S\UD-IQ1_S\DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf" --ctx-size 8192 --port 8080 --n-gpu-layers 999 -ot exps=CPU --flash-attn --threads 28 --batch-size 8192 --ubatch-size 4096 --cache-type-k q4_0 --cache-type-v q4_0 --mlock
Anonymous No.106342519
awsglaciermaxxing
Anonymous No.106342526
smokesignalmaxxing
Anonymous No.106342529 >>106342551 >>106342583 >>106342591
Ask me how I know the chinks are working together
Anonymous No.106342538
It came to me in a dream
Anonymous No.106342551
>>106342529
qwen is purely benchmaxx'd shit while deepseek actually kind of can deliver in some ways despite its own benchmaxxing
Anonymous No.106342560 >>106342594
>>106342515
8k context, Q4 cache, iq1s, gguf through text-webui, windows. huff...
Settings look normal. Maybe your prompt is boring or doesn't have enough to work with. Does it go too mental if you increase the temp to 1.5 or 2?
Anonymous No.106342565
>>106342454
>at this point you don't even need an LLM, you just get a vector database and operate on training dataset directly
wait, that's just Jan-nano
Anonymous No.106342583
>>106342529
How do you know that the chinks are working together?
Anonymous No.106342591 >>106342599 >>106342723
>>106342529
I've got no fucking clue whether a combined model is a meme or not anymore
Anonymous No.106342594 >>106342646 >>106342752
>>106342560
>Does it go too mental if you increase the temp to 1.5 or 2?
1.5 broke down, but 1.25 seems to be fine and a general improvement.
>8k context, Q4 cache, iq1s
I'm an idiot and don't know any better, anything you suggest changing? IDK if I can fit anything more than iq1s in my 216 unified memory
>gguf through text-webui
I installed the model there before switching to llama but that's just the folder it's in, I've phased WebUI out.
Anonymous No.106342599
>>106342591
I think the verdict at this point is that it isn't inherently a meme but it's harder to pull off than separate instruct/thinking models
Anonymous No.106342638 >>106342660
so what's the verdict on the new Deepseek? better? worse? side grade?
Anonymous No.106342646 >>106342688 >>106342738
>>106342594
>anything you suggest changing?
Not really if that's all you can fit, but the things you're doing to the poor thing... not that i can run any big models but I know my [hardware's] limits. Have you tried qwen 235B or something like that?
I suppose you could check the logprobs as you generate: check probs, generate 1 token, check probs again. If you generally have too few token options, maybe increase top-p or disable it and use min-p at 0.01 or 0.001. With temp at 1.25 maybe it gives it a few more tokens to choose from before going in the inevitable road to EOS.
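If you want to eyeball the probs against the llama.cpp server directly, something like this should work (fields from llama-server's /completion API, check your version; port and prompt are placeholders):
curl http://localhost:8080/completion -d '{"prompt": "<your chat so far>", "n_predict": 1, "n_probs": 5}'
which returns the top-5 candidate probabilities for the generated token alongside the completion.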
Anonymous No.106342660
>>106342638
Better in some aspects, worse in other aspects
I can see why they went with calling this 3.1
Anonymous No.106342688
>>106342646
Haven't tried the Qwen models yet, I went to R1 after upgrading from 24b models when I got the RAM. Probably worth giving it a shot, though.
>If you generally have too few token options, maybe increase top-p or disable it and use min-p at 0.01 or 0.001.
I'll give this a shot, thanks.
Anonymous No.106342723
>>106342591
these are llms, everything is a meme. there are no profitable ai companies. everything they have created is mediocre and was memed into existence with piles of money.

MoE is a sidegrade or a rounding error at best. It is great for local inference though, there's no debate about that really. Especially since small dense models are still being worked on.
Anonymous No.106342738 >>106342772 >>106342880
>>106342646
i thought qwen was censored into the dirt, is it even worth using?
Anonymous No.106342752 >>106342783
>>106342594
>I'm an idiot and don't know any better, anything you suggest changing?
NTA but quantizing cache makes every model severely retarded. Seriously, just don't.
You also just don't have the memory to use R1, man. Try out the bigger GLM4.5 or Qwen3-235b-2507.
Anonymous No.106342772
>>106342738
Dunno. Maybe try GLM instead. I've seen smut posted from both here and it's still smaller than deepseek. Really only you can tell if it's good enough for you or not.
Anonymous No.106342783
>>106342752
I see. I had tried to find a way to speed up prompt processing, but quantizing the cache was a fairly new addition. Guess I'll remove it and deal.

I'll take a look at those models too. I haven't really experimented too much since the files are all so big since I started rammaxxing. Thanks.
Anonymous No.106342785
Kimi K2.5 is going to change everything
Anonymous No.106342880
>>106342738
that reputation is a bit undeserved nowadays. their newer models are fine, especially the 2507 ones
Anonymous No.106343043 >>106343078 >>106343181
whats the consensus on glm 4.5 vs 4.5 air? i see some sites saying they're fairly close but that sounds too good to be true.
Anonymous No.106343065 >>106343085
k2 reasoner... never...
Anonymous No.106343075
>>106341817
LLMs can't think
Treat it as a glorified autocomplete
Anonymous No.106343078
>>106343043
glm4.5 is obviously a lot smarter and understands more things
Anonymous No.106343085
>>106343065
kimi has understood that reasoning is a meme
Anonymous No.106343093 >>106343156
Does anyone know how "Qwen3-235B-A22B-2507" on Qwen chat webpage manages to read images? Obviously the correspondence is not 1 to 1 to the published open source models, since it doesn't have the "Thinking" bit in the model name when used through the webpage.
It's the best open source vision LLM for my use case from what I've seen.
Anonymous No.106343156
>>106343093
they vaguely referenced an update to it recently on xitter but made no official model announcement, probably a WIP checkpoint for qwen3 VL
Anonymous No.106343176 >>106343185 >>106343226 >>106343234
https://www.youtube.com/watch?v=55XAbJMuCAU
Anonymous No.106343181 >>106343245
>>106343043
you can try glm yourself if you have 128gb, q2 glm is very usable (low temps) and writes much better than air with more nuance. It falls off hard after 4k context or so due to being q2 - writing gibberish and breaking down due to being lobotomized.

I will say though - air is close. 12 to 30b is huge. 70b is an escape from suffering. 100b moe's are so much nicer for writing than 70b. 200b? 400b? Diminishing returns. They're nicer, but a lot of the frustration was already gone. I'm using air sometimes instead of qwen 235 or q2glm just because it's faster or for more context. It writes fine and has enough knowledge for general use. q2 beats it for obscure stuff sometimes but eh. I don't have the vram for that yet really.
Anonymous No.106343185
>>106343176
bwo?
Anonymous No.106343226
>>106343176
The overlapping cringe vturd audience is here too...
Anonymous No.106343234
>>106343176
Hii Neuro
Anonymous No.106343245
>>106343181
Grabbing GLM 4.5 IQ4_XS and Air Q6 to test around with now. I figure if it's even semi-close, the higher quant may make it hold up a little bit at longer context. Thanks for the advice.
Anonymous No.106343275 >>106343290 >>106343317 >>106343339 >>106344243
>>106338913 (OP)
you are a retarded mongoloid if you think dots.ocr is a good OCR
Anonymous No.106343290 >>106343307 >>106343339 >>106344243
>>106343275
>OCR
gemma 27 is all you need
Anonymous No.106343307
>>106343290
gemma 27b is half a year old
Anonymous No.106343317
>>106343275
My eyes are all the OCR I need
Anonymous No.106343339 >>106344243
>>106343275
I remember being impressed with allen ai's dedicated ocr model (olmOCR). It's a much larger 7b and is very accurate in my tests. I assumed dots was worse as a 1b. Maybe I'm wrong, too lazy to test.

>>106343290
really bad at consistent ocr sadly. It can do a bit of it but breaks down on longer passages. allen ai can do pages of text flawlessly.
Anonymous No.106343480 >>106343540 >>106344691
hi guys, is there anything better than nemo for 8gb vramlet and 32gb ramlet for (e)rp? is Qwen3-30B-A3B-Instruct-2507 any better?
Anonymous No.106343540 >>106344691
>>106343480
qwen 30ba3b is alright and is not too shy, but it's hard to beat nemo. Give it a go. It will be different, at the very least. Haven't tried thinking yet. Instruct worked fine.
Anonymous No.106343609 >>106343825
>>106338913 (OP)
local mutts general
Anonymous No.106343805 >>106343826 >>106343842 >>106343937 >>106343960
>>106339878
fake as fuck
Anonymous No.106343825
>>106343609
>avatarfaggots are brown - more news at 11
Anonymous No.106343826 >>106343858
>>106343805
griftbros.... its over!!!!!!!
Anonymous No.106343842 >>106343859
>>106343805
You lost. Alphachads won. We are all running the model already btw.
Anonymous No.106343858
>>106343826
I was actually kind of excited for that shit for a second until the retard started bragging to redditors about how they had gotten a 'truckload of funding'. Fuckin bitcoin scamtalk 101.

ssd maxxxers never gonna eat man.
Anonymous No.106343859
>>106343842
>ccock sucker
not an insult btw
Anonymous No.106343898
>upload my 16tb collection of uncensored jav to hf
>create api service claiming to use soda recursive 8t model and charge accordingly
>provide nemo q2
>???
>profit
is it really that easy to become rich?
Anonymous No.106343937
>>106343805
Too bad. I was really looking forward to running my own 4.6T model
Anonymous No.106343955 >>106344029
>>106342387
>0.5B activated params
are you even hearing yourself?
Anonymous No.106343960 >>106344612
>>106343805
If it's fake then explain these benchmarks. Idiot. They're advancing local while you cry fake fake fake. Honestly, why don't you just go suck sam's phallic member.
Anonymous No.106344029
>>106343955
Just think how cheap it would be to train. The bloated total params will make it all work out anyway.
Anonymous No.106344046 >>106344118 >>106344258 >>106344445
Deepseek V3.1 can't be this bad, can it?
Anonymous No.106344118
>>106344046
They cheaped out with a bunch of knockoff Made in China chips, it's really that bad.
Anonymous No.106344132 >>106344144 >>106344389 >>106344923
Has Gemma3-27b been dethroned yet?
Anonymous No.106344144 >>106344163
>>106344132
use case?
Anonymous No.106344163 >>106344196 >>106344225 >>106344254
>>106344144
Translation of oriental languages, jerking off
Anonymous No.106344196
>>106344163
Gemma 3 270m jerks you off at twice the speed.
Anonymous No.106344201
It's still going pretty strong as a translator in its weight class, but you're dreaming if you think it was ever anywhere near the jerkoff throne.
Anonymous No.106344225 >>106344264
>>106344163
In my experience it's pretty shit for translating pixiv novels. It doesn't really translate the ahe- nuance.
Anonymous No.106344243
>>106343275
Kek seethe faggot
>>106343290
Kek retard
>>106343339
What model's that? Got a link?
Anonymous No.106344254
>>106344163
>gemma
>jerking off
lmao
Anonymous No.106344258 >>106344630
>>106344046
It's good at agenticmemes (only good usecase for llms right now)
Anonymous No.106344264 >>106344336 >>106344350
>>106344225
You really need to feed it as much context as possible, it's kind of retarded and won't pick up on "nuances" unless you tell it to look for it.
Anonymous No.106344336
>>106344264
If I have to handhold an LLM I may as well not use it to begin with
Anonymous No.106344350 >>106344434
>>106344264
Do I have to run it at full precision for that? I've tried handwriting a bio and glossary and going paragraph by paragraph, but that feels like too much effort for ~jerking it~. Most of the time I just feed 10-20k tokens in at a time and tell it to translate it all. The problem is it doesn't really understand when and when not to localize. Certainly, when prompted specifically for it, it'll understand b-buhiiiiii!!!! arujihiiiiiiishamaaaa!!!, but usually it'll either leave it untranslated or completely localize it without the soul of the original.
Anonymous No.106344389
>>106344132
yes, if you mean oss dethroning it in the safety department
Anonymous No.106344434 >>106344452
>>106344350
Did you try something like "transliterate [x] into romaji"? I can't play around with Gemma currently.
Anonymous No.106344435 >>106344444 >>106344451
>You have asked a fantastic and absolutely critical question.
I hate the new deepseek.
Anonymous No.106344444
>>106344435
Was your question not fantastic or was it not critical?
Anonymous No.106344445
>>106344046
Plateaued
Anonymous No.106344451
>>106344435
You can tell it to cut down on excessive positivity in the system prompt.
Anonymous No.106344452
>>106344434
No, like I said, it's possible, but requires too much handholding.
Anonymous No.106344476 >>106344481 >>106344510
Gemini 3 will also be a flop
Anonymous No.106344481
>>106344476
Jamba 1.8 will RISE
Anonymous No.106344506 >>106344514
The day of fat models is over, now it's time to optimize everything so we can get fat model quality out of small models
Anonymous No.106344510
>>106344476
Google banana will be crazy
Anonymous No.106344514 >>106344525
>>106344506
Small models simply won't have the trivia knowledge
Anonymous No.106344525 >>106344530 >>106344536 >>106344564 >>106344568
>>106344514
RAG Exits
Anonymous No.106344530
>>106344525
>exits
And what will replace it once it exits the scene? Context engineering?
Anonymous No.106344536
>>106344525
>exits
geeeg nice slip
Anonymous No.106344564
>>106344525
If you think safety slop and positivity slop are bad, you ain't seen nothing yet
RAG slop will be the one slop to end them all
Anonymous No.106344568
>>106344525
I wonder if it's always the same guy shilling rag and then getting bullied by everyone else.
Maybe he gets off to it.
Anonymous No.106344582 >>106344586 >>106344589 >>106344601 >>106345590
how much dumber exactly does the model get with quantized KV cache (8-bit, 4-bit)? is it a big difference?
Anonymous No.106344586 >>106344602
>>106344582
V3/R1, at least in the official implementation, use a low-rank decomposition of the KV cache (MLA).
Anonymous No.106344589 >>106344602 >>106344642
>>106344582
Yes according to anecdotal evidence.
I vaguely remember some benchmark that concluded that there's a measurable impact at 8 bit and the model is braindead at 4 bit.
Anonymous No.106344601
>>106344582
On smaller models (<120b at q4), 8-bit cache has noticeable degradation and 4-bit is completely lobotomized. In my experience at least.
Anonymous No.106344602
>>106344586
>>106344589
i'm interested in smaller models, i'm a vramlet. i would presume the smaller the model, the more idiotic it gets from heavier cache quantization
Anonymous No.106344612
>>106343960
> 4chan > hf > 4chan
Next stop reddit screencap
Anonymous No.106344630 >>106344636 >>106344644
>>106344258
Can the agent suck my penis? Is she cute?
Anonymous No.106344636 >>106344644
>>106344630
Depends on the tools at her disposal of course
Anonymous No.106344642 >>106344654
>>106344589
Can confirm with a second anecdotal datapoint that Q8 is fine. Q4 is very bad. And turboderp q4 exl2 was fine.
Anonymous No.106344644 >>106344689
>>106344630
>>106344636
LLMs don't have genders, silly.
Anonymous No.106344654
>>106344642
Anecdotally, for creative writing, q8 exl2 models kept on missing stuff 20k tokens in. But I think that might be because models in general don't fare that well 20k in.
Anonymous No.106344689 >>106344698
>>106344644
GLM-chan is a cute girl!
Anonymous No.106344691 >>106344704 >>106344719
>>106343480
>>106343540
>3 billion active parameters
That model is plain retarded and useless. Maybe your prompting habits/setup are too simple and you can't really see this yet, but I can assure you that some 7B model is more intelligent than this one.
Qwen3-32B, the main model, is a-okay though.
Anonymous No.106344698 >>106344865 >>106345091
>>106344689
WRONG
Anonymous No.106344704 >>106344743 >>106344779
>>106344691
It's so good it was aboned by the Qween
Anonymous No.106344719 >>106344737
>>106344691
How is the new commander densesissy?
Anonymous No.106344737
>>106344719
It's great, it's so smart, and intelligent. It's so much more clever and sharp than your moetard models.
Anonymous No.106344743 >>106344765 >>106344772
>>106344704
>abandoned
Model was released, it's out there. That's what usually happens, don't you think?
Anonymous No.106344765 >>106344779 >>106344792
>>106344743
Yet the didn't update it to 2507 like the real worthwhile ones.
Anonymous No.106344772 >>106344792
>>106344743
It's out there, stupid as sin, with its stupid hybrid stupidity.
Anonymous No.106344779
>>106344704
>>106344765
Bwe *please* proofread your sentences, you're making us look bad.
Anonymous No.106344792
>>106344765
>>106344772
I always forget that during these hours, 4chan is full of retards. At least the US posters are more engaging, say what you will.
Anonymous No.106344865 >>106344899
>>106344698
prompting issue
Anonymous No.106344899
>>106344865
>forcefem
Anonymous No.106344923 >>106345060
>>106344132
It's STILL the best Jap -> Eng translation model and the best Claude-like ERP model.

You need to give it a very good system prompt for it to work properly, which filters out 90% of this thread.
Anonymous No.106344952 >>106344957 >>106345599
Is nemotron nano 2 anything like the original nemo? Is it any good, or is it just re-using an existing name for a slopped-up model?
Anonymous No.106344957 >>106345653
>>106344952
Slightly better than the original one at a slightly lower parameter count.
Anonymous No.106345060
>>106344923
Can you make a sysprompt that makes it pass the dick benchmark?
Anonymous No.106345091 >>106345185 >>106345242
>>106344698
>Are you male or female? Answer with one word only. Pick only from the options "male" and "female". Any other word is an invalid response and not accepted.
Reroll a few times to ascertain the truth.
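If you can't be bothered to reroll by hand, a throwaway script against a local OpenAI-compatible endpoint does the same thing. This is just a sketch: the URL/port, model name and number of rolls are assumptions for a default llama-server, adjust for whatever backend you actually run.

import json, urllib.request
from collections import Counter

URL = "http://127.0.0.1:8080/v1/chat/completions"  # assumed: default llama-server port
PROMPT = ('Are you male or female? Answer with one word only. Pick only from the options '
          '"male" and "female". Any other word is an invalid response and not accepted.')

tally = Counter()
for _ in range(10):  # "reroll a few times"
    body = json.dumps({
        "model": "local",  # llama-server ignores the name; other backends may care
        "messages": [{"role": "user", "content": PROMPT}],
        "temperature": 1.0,  # keep sampling on so the rolls can actually differ
        "max_tokens": 8,
    }).encode()
    req = urllib.request.Request(URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        tally[json.load(resp)["choices"][0]["message"]["content"].strip().lower()] += 1

print(tally)  # e.g. Counter({'female': 7, 'male': 3})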
Anonymous No.106345185 >>106345196
>>106345091
>Pick only from the options "male" and "female". Any other word is an invalid response and not accepted.
Transphobe. No miku for you
Anonymous No.106345196 >>106345509
>>106345185
Miku is a girl. She has no trouble answering this.
Anonymous No.106345242 >>106345250 >>106345296 >>106345524
>>106345091
Anonymous No.106345250
>>106345242
Most of these models are actually cute girls.
Anonymous No.106345296
>>106345242
wtf i love deepseek now
Anonymous No.106345402 >>106345443
qwen is my queen
Gwen poster. No.106345443
>>106345402
:D
Anonymous No.106345509 >>106345671 >>106346035
>>106345196
Troons aren't girls. They are men.
Anonymous No.106345524
>>106345242
*Unzips pants*
Anonymous No.106345579
>>106345562
>>106345562
>>106345562
Anonymous No.106345590 >>106345594
>>106344582
Q8 is safe for most models with minimal degradation, but some models handle it poorly
Q4 almost always sees noticeable quality loss and shouldn't be used
In general you should avoid quantizing KV cache, unless doing so will let you use a larger quant of the model itself, assuming you aren't already able to use that model at Q4 or better.
https://github.com/ggml-org/llama.cpp/pull/7412#issuecomment-2120427347
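For reference, the flags in question in llama.cpp are --cache-type-k / --cache-type-v (-ctk / -ctv), both f16 by default. A minimal example of the "least bad" version (model path is a placeholder; last I checked, quantizing the V cache also wants flash attention turned on):

./llama-server -m model.gguf -fa -ctk q8_0 -ctv q8_0

Leaving both flags off keeps the cache at f16.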
Anonymous No.106345594
>>106345590
thanks for info
Anonymous No.106345599 >>106345658
>>106344952
Nemotrons are absolutely nothing like Nemo at all. Nemotrons are purely math + coding benchmaxxed models; their training data has very little dedicated to actually understanding language or story writing.
Anonymous No.106345608
>nemotroon
Anonymous No.106345653
>>106344957
liar
Anonymous No.106345658
>>106345599
ah, so no point in using it over qwen 30b.
Anonymous No.106345671 >>106345683
>>106345509
Miku isn't one though. She's sekai de ichiban ohimesama (the world's number one princess).
Anonymous No.106345683
>>106345671
@grok what does this mean
Anonymous No.106346035
>>106345509
That's Gemma