Anonymous No.107184305 [Report]
/lmg/ - Local Models General
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>107174614 & >>107164243

►News
>(11/07) Step-Audio-EditX, LLM-based TTS and audio editing model released: https://hf.co/stepfun-ai/Step-Audio-EditX
>(11/06) Kimi K2 Thinking released with INT4 quantization and 256k context: https://moonshotai.github.io/Kimi-K2/thinking.html
>(11/05) MegaDLMs framework for training diffusion language models released: https://github.com/JinjieNi/MegaDLMs
>(11/01) LongCat-Flash-Omni 560B-A27B released: https://hf.co/meituan-longcat/LongCat-Flash-Omni

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
Anonymous No.107184306 [Report]
►Recent Highlights from the Previous Thread: >>107174614

--Paper: LeJEPA paper and Yann LeCun's potential new venture discussed:
>107181985 >107182047 >107182081 >107182097 >107182105 >107182118 >107182786 >107182462
--Skepticism over Google's 'secure cloud AI' claims:
>107182872 >107182888 >107182907 >107183248 >107183385 >107183482 >107183498
--Comparing Kimi, GLM, and DeepSeek for creative writing:
>107179399 >107179425 >107179434 >107179510 >107179674 >107180095 >107180171 >107180180 >107180221 >107180134
--Quantization optimization experiments with Q8_0_64 and intermediate formats:
>107180476 >107180530 >107180688
--GLM 4.5 Air deployment challenges and optimization on consumer-grade hardware:
>107174665 >107174677 >107174681 >107175083 >107175095 >107175120 >107175142 >107175231 >107175270 >107175290 >107175624 >107177243 >107176390 >107176473 >107176533 >107176578 >107176611 >107177015 >107177252 >107177277 >107177524 >107177546 >107177566 >107178047 >107181418
--Frontend tool comparison for story writing:
>107178671 >107178760 >107179089 >107179188
--Optimizing 120b model performance on a single 3090 GPU:
>107182483 >107182594 >107182615 >107182618 >107182656 >107182671 >107182676 >107182694 >107182707 >107182742 >107182749
--GPT-5's limitations in generating performant CUDA kernels for llama.cpp integration:
>107179734
--Debating AI's capability for detailed agentic coding and optimal abstraction levels:
>107181333 >107181358 >107181467 >107182044 >107182064 >107181430 >107181472 >107181428
--Implementing persistent memory systems for local LLMs using markdown-based RAG approaches:
>107175255 >107175762 >107177084 >107177172 >107177189 >107177209 >107177241 >107177634 >107177771 >107178429 >107178789
--Kimi K2 Thinking webapp:
>107176092 >107176237 >107176249
--Miku (free space):
>107178964 >107180253 >107180428 >107178764

►Recent Highlight Posts from the Previous Thread: >>107174619

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
Anonymous No.107184325 [Report] >>107184602
>>107184173
you are a living tumor upon the earth
Anonymous No.107184363 [Report] >>107184547
>>107184240
2x RTX 6000 in a 12 channel epyc platform with the fastest DDR5 you can get.

>>107184258
>>107184299
Alright.
One IDE extension user, one CLI user.
I've been using Cline too and it's been working alright so far.
Haven't tried any of the pure CLI tools.
What are the advantages of those? Anything that would make them work better with local models?
I imagine not, but figured I might as well ask.
Anonymous No.107184547 [Report] >>107184770
>>107184240
>I'm seriously thinking of putting together a setup with 2 RTX 6000 Pros.
>>107184363
>2x RTX 6000 in a 12 channel epyc platform with the fastest DDR5 you can get.

I don't think building a ddr5 epyc system is a good idea right now, due to the extreme price increase of ddr5 ram.

Zen 6 Epyc is supposedly going to be announced at CES in January. Zen 6 Epyc is going to be much, much better than Zen 5. It's also going to use MRDIMMs, which will supposedly exist at 12800 MT/s. Compare that to *maybe* getting 8000 MT/s DDR5 next year. There will be 16-channel CPUs too, but even 8-channel will be 2x the bandwidth of the best DDR5 RAM.

One rtx 6000 pro and wait for zen 6 is The Way.
Anonymous No.107184585 [Report] >>107184616 >>107188347
Thanks to the anon for suggesting checking out the k quants and trellis quants. I learned about importance-weighted optimization and I think I just got a free lunch out of Q8_0

You can quantize to Q8_0 slightly better by using the importance-weighted optimizations that smaller quant formats use, and this gives you about a 5% reduction in mean square error. The resulting GGUF is fully backwards-compatible with Q8_0 (it's literally Q8_0, just quantized a bit more efficiently at the cost of a much more expensive algorithm than just dividing each weight by the block's absmax/127)

There is no reason I can see not to quantize like this if you're releasing a final Q8_0, or not to use a Q8_0 that was quantized like this
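For reference, a minimal numpy sketch of the idea described above (not the actual llama.cpp code): instead of taking the scale straight from the block's absmax, search nearby scales and keep the one that minimizes importance-weighted squared error. The imp weights are a stand-in for whatever per-weight importance you have (an imatrix, squared activations, etc.).
[code]
import numpy as np

def q8_0_naive(block):
    # classic Q8_0: one fp16 scale per block of 32, scale = absmax / 127
    d = max(np.abs(block).max() / 127.0, 1e-12)
    q = np.clip(np.round(block / d), -127, 127)
    return d, q

def q8_0_weighted(block, imp, n_steps=16):
    # same on-disk format, but the scale is chosen to minimize
    # importance-weighted squared error instead of being fixed at absmax/127
    best_d, best_q = q8_0_naive(block)
    best_err = np.sum(imp * (best_q * best_d - block) ** 2)
    base = np.abs(block).max() / 127.0
    for step in range(-n_steps, n_steps + 1):
        d = max(base * (1.0 + 0.01 * step), 1e-12)
        q = np.clip(np.round(block / d), -127, 127)
        err = np.sum(imp * (q * d - block) ** 2)
        if err < best_err:
            best_err, best_d, best_q = err, d, q
    return best_d, best_q  # still just an fp16 scale + 32 int8s

block = np.random.randn(32).astype(np.float32)
imp = np.ones(32)  # placeholder importance weights
print(q8_0_naive(block)[0], q8_0_weighted(block, imp)[0])
[/code]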
Anonymous No.107184602 [Report] >>107184623 >>107184656 >>107184670
>>107184325
You that ESL spammer. Thanks to you there's never any real discussion here.
Anonymous No.107184616 [Report] >>107184772
>>107184585
does bartowski know?
Anonymous No.107184623 [Report] >>107184681
>>107184602
>real discussion is vibe coding advice
literally kys retard
Anonymous No.107184656 [Report]
>>107184602
>ESL
he thinks americunts are the main posters on this board lmao
Anonymous No.107184670 [Report]
>>107184602
>You that ESL
Anonymous No.107184681 [Report] >>107184702 >>107184742
>>107184623
Better discussion than forcing llms to output vulgar text.
Anonymous No.107184702 [Report] >>107184713
>>107184681
according to whom? we only care about cockbench here
Anonymous No.107184713 [Report]
>>107184702
>we
Anonymous No.107184742 [Report] >>107184766
>>107184681
there is no discussion to be had with mongoloids like you
bugger off making more inane PRs that waste maintainer time like the onslaught of garbage that constantly tries to get pushed in llama.cpp
even SOTA models can't really produce good code or that nigger trying to vibecode deepseek v3.2 wouldn't have entered the loopy circle of unending refactor that never properly works
you are an unwanted abortion, a plague on all repos that have to suffer your existence
Anonymous No.107184766 [Report] >>107184830
>>107184742
>even SOTA models can't really produce good code
Garbage in, garbage out. And it seems like you are incapable of anything but garbage.
Anonymous No.107184770 [Report]
>>107184399
That should be relatively easy since it's only got 10B active params

>>107184547
Thanks for that heads up
Anonymous No.107184772 [Report] >>107188913
>>107184616
>does bartowski know?
he probably has better things to care about i'd think. There is literally no reason to not quantize Q8_0 like this though if you're releasing a Q8_0 version of a model

This isn't a new quantization format though, it's just an alternate way to quantize Q8_0 that is very slightly better, so I might just make an issue on github and show this to the devs and they can decide if/how they want to implement it.
Anonymous No.107184830 [Report]
>>107184766
riddle me this, mongoloid, if it worked, why has there been not even one singular instance of enhanced productivity and velocity in open source projects where anyone can actually see the code and features being added? where are all the projects that were LLM boosted? you vibe coding niggers are always at the stage of useless prototype or wasting the rest of your team's time in your real life job, if you even have one
believe me every fucking developer in existence that actually produces value hates your guts with the force of a thousand suns
it used to be mosquitoes or cockroaches were the first thing one would push the genocide button on but I would argue your kind should be exterminated first
your ability to generate endless garbage with a few prompts is indeed like literal tumors but with contagion powers.
Anonymous No.107184844 [Report] >>107185040 >>107185108
All this sperging because I asked about "vibe coding" tools?
Damn.
Anonymous No.107184952 [Report]
jej
Anonymous No.107184971 [Report] >>107184990 >>107185177 >>107185199
why is editing the thinking block so poorly supported in many frontends
Anonymous No.107184990 [Report]
>>107184971
Such as?
Anonymous No.107185040 [Report] >>107185148 >>107185160
>>107184844
"vibe coding" is an annoying buzzword that sets a lot people off. You might be received better if you ask for AI Agent-Assisted Development Tooling next time.
Anonymous No.107185108 [Report]
>>107184844
You're damn right that vibe coding is for tools.
Anonymous No.107185148 [Report] >>107185216
>>107185040
I suppose.
Trying to dodge schizos is standard 4chan fare these days I guess.
Anyhow, impressed with Qwen3 30B. It's surprisingly usable for something with 3B active params.
Anonymous No.107185154 [Report] >>107185469
>>107180688
>I'm just messing around, you can't make a format better than Q8_0. It's literally just a float16 multiplied by an int8.
Q8_1 (or whatever) was a float16 multiplied by an int8 and summed with another float16 instead of implicitly summed with 0. That's what q8 MLX does with a default group size of 64 rather than 32 which works out to the same amount of metadata per weight. I wonder if in practice it's typically a win.
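To make the comparison concrete, here's a small numpy sketch of the two block schemes being discussed (loose naming, not the exact ggml structs): the symmetric block stores just an fp16 scale, while the scale-plus-offset variant also stores an fp16 offset, so at a group size of 64 it costs the same metadata per weight as symmetric at 32.
[code]
import numpy as np

def quant_symmetric(x):
    # Q8_0-style: x ~ d * q, one fp16 scale per group
    d = max(np.abs(x).max() / 127.0, 1e-12)
    q = np.clip(np.round(x / d), -127, 127).astype(np.int8)
    return d, q

def quant_offset(x):
    # Q*_1-style: x ~ d * q + m, scale plus an fp16 offset per group
    lo, hi = float(x.min()), float(x.max())
    d = max((hi - lo) / 255.0, 1e-12)
    q = np.clip(np.round((x - lo) / d), 0, 255).astype(np.uint8)
    return d, lo, q

x = np.random.randn(64).astype(np.float32)  # MLX-style group of 64
d, q = quant_symmetric(x)
d1, m, q1 = quant_offset(x)
err_sym = np.abs(d * q - x).mean()
err_off = np.abs(d1 * q1 + m - x).mean()
print(err_sym, err_off)  # whether the offset wins in practice is the open question
[/code]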
Anonymous No.107185160 [Report] >>107185173 >>107185248
>>107185040
NTA but Karpathy made that decision for us. I hated the term as well but if I don't use it somebody else will so might as well claim it.
Anonymous No.107185173 [Report] >>107185186 >>107185472
>>107185160
why should we care what that anti open sores snake decides?
Anonymous No.107185177 [Report] >>107185199
>>107184971
just be a grug and write your own scripts for anything that needs to be batched/chunked, and use mikupad for chat and hand edit things yourself
the more features frontends have the worse they are in real use
Anonymous No.107185186 [Report]
>>107185173
It's less that he decided anything, and more that he thought of a catchy term the zoomers instantly fell in love with, and now everyone is using it.
Anonymous No.107185199 [Report] >>107187103
>>107184971
llama-server default
lm studio
cherry studio

I have now resorted to sillytavern but I don't like it.
>>107185177
3 years into the LLM craze I would have hoped to have more robust tools. Then again I also experience so many rendering issues on OpenAI/Claude etc I guess frontend is just too hard to do properly.
Anonymous No.107185216 [Report] >>107185256
>>107185148
>Anyhow, impressed with Qwen3 30B. It's surprisingly usable for something with 3B active params.
I wish they made a coder variant of the 32B. Would love to trade some speed for a more capable small model.

>>107184173
>A visual studio extension?
If you find one, let me know. Apparently no one interested in working on these extensions is capable of anything but Python and JavaScript. I considered forking and developing one of the Chinese shoddy extensions, but it was easier to just use VSCode for this shit.
Anonymous No.107185248 [Report] >>107185501
>>107185160
pic related is one of the things he showed as an example of proud vibe coding in the thread where he coined the term
this is the sort of shit bootcamp genz faggots could hand write in 10 minutes
Anonymous No.107185256 [Report] >>107185303
>>107185216
>If you find one, let me know.
Coding agent extensions for vs code?
As one anon mentioned, there's Cline
There's Roo, a Cline fork, and Continue.
Anonymous No.107185303 [Report]
>>107185256
I keep Roo and Continue installed. Continue is good for autocomplete and quick questions, and Roo for agentic tasks. Tried Cline first, but the only thing it had over Roo was a button to generate commit messages, and even that was annoying because it gives the model all changes instead of just what was staged, with no way to change that.
Anonymous No.107185380 [Report] >>107185454
Mistral Nemo really is nice... sad there's no bigger version.
Anonymous No.107185406 [Report] >>107185425
am I retarded where are the rest of the sampler settings like min p?
Anonymous No.107185425 [Report]
>>107185406
They don't show up in the chat completion interface, but you can still use them by setting those as custom properties/headers.
Same with shit like GBNF and anything else the API accepts.
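As a concrete example (assuming a llama.cpp llama-server backend; the extra field names are backend-specific), those settings can just ride along in the normal chat completion request body:
[code]
# hedged sketch: llama-server accepts extra sampler fields alongside the OpenAI-style ones
payload = {
    "model": "local",
    "messages": [{"role": "user", "content": "hi"}],
    "temperature": 0.8,
    "min_p": 0.05,            # not part of the OpenAI schema, backend-specific
    # "grammar": gbnf_text,   # a GBNF string would go here, also backend-specific
}
[/code]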
Anonymous No.107185454 [Report] >>107185474
>>107185380
could always merge two nemos together
Anonymous No.107185469 [Report] >>107188630
>>107185154
>Q8_1 (or whatever) was a float16 multiplied by an int8 and summed with another float16 instead of implicitly summed with 0.
In practice it's typically a loss. Try it out yourself. Summing a float16 destroys any quality bonuses you get from having the extra info from the float16 bias in the first place. That's probably why Q8_1 isn't exposed and is only used internally for an intermediate step in some niche quants.

Yes, you can get slightly higher precision by using an int16 instead but it comes with 2 bytes more of overhead per 32 elements which is 9.0bpw and it performs worse than fp16 outlier strategies

another reminder that none of this matters (other than improving the quantization of Q8_0 itself, and maybe Q8_0_64 and its _IMP version because 3% less model size for 0.001% loss in accuracy might be interesting to some) because you can't practically beat a single fp16 * int8 calculation. you can easily imagine how well that could be optimized for hardware instructions

I'm gonna poke around and see if I can squeeze any better precision out of the Q8_0_IMP quantization function and then if I can't think of anything else, I'll open an issue
Anonymous No.107185472 [Report]
>>107185173
Might as well ask why the state of Israel must exist
Anonymous No.107185474 [Report] >>107185479 >>107185498 >>107185976
>>107185454
how
is it actually worth it?
Anonymous No.107185479 [Report]
>>107185474
No. He's pulling your leg.
Anonymous No.107185498 [Report] >>107185607
>>107185474
>how
you can easily google this, merging a model with itself slightly improves its intelligence

>is it actually worth it?
using local LLMs isn't worth it beyond learning how they work lol
Anonymous No.107185501 [Report] >>107185670
>>107185248
I think you're overestimating the speed of development when hand coding
Anonymous No.107185557 [Report] >>107185771 >>107190976
WE MUST PROTECT AI CHILDREN
Anonymous No.107185607 [Report] >>107185629
>>107185498
>you can easily google this
kys
Anonymous No.107185629 [Report] >>107185634 >>107185672
>>107185607
dude just google "miqu-70b merged with itself" and the first result is miqu-120b ... and just do your own research from there
Anonymous No.107185634 [Report] >>107185655
>>107185629
>just do your own research from there
kys gossipnigger
Anonymous No.107185655 [Report]
>>107185634
>This is a 120b frankenmerge of miqu-1-70b created by interleaving layers of miqu-1-70b-sf with itself using mergekit.

There now you have the full spoonfeed. Go and use mergekit to interleave layers of mistral-nemo with itself
Anonymous No.107185670 [Report]
>>107185501
And the attention required for manual implementation. Sometimes most of my brain is locked in on a specific big picture problem and it's very helpful to be able to delegate things to a language model to validate some random ideas.

In many cases the quality of the vibed LLM implementation is irrelevant (I might throw it out entirely); I just wanna see if something might be good to pursue further.
Anonymous No.107185672 [Report] >>107185734
>>107185629
>70b + 70b = 120b
Where did the other 20b go?
Anonymous No.107185734 [Report]
>>107185672
>Where did the other 20b go?
mergekit uses a passthrough method, which concatenates/assembles transformer blocks from the source(s) into a deeper model rather than just averaging weights
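For the curious, a rough sketch of what that passthrough interleaving amounts to, assuming a Llama/Mistral-style model where the decoder blocks live in model.model.layers (in practice you'd let mergekit do this from a YAML config; the slice ranges below are arbitrary):
[code]
import copy
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")
layers = model.model.layers  # nn.ModuleList of decoder blocks (40 for Nemo)

# overlapping slices, miqu-120b style: the same blocks appear more than once
slices = [(0, 20), (10, 30), (20, 40)]
new_layers = nn.ModuleList()
for start, end in slices:
    for i in range(start, end):
        new_layers.append(copy.deepcopy(layers[i]))

model.model.layers = new_layers
model.config.num_hidden_layers = len(new_layers)  # now 60 instead of 40
model.save_pretrained("nemo-self-merge")
[/code]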
Anonymous No.107185771 [Report] >>107185804
>>107185557
Even if the UK citizens voted against it they would still implement that law.
Anonymous No.107185804 [Report]
>>107185771
>citizens voted against it
Huh
Anonymous No.107185810 [Report] >>107185821 >>107185825 >>107185841 >>107185940 >>107186686 >>107187103 >>107191492 >>107191898
I have a genuine question.
Why the fuck is everyone so obsessed with making an LLM run as fast as possible?
I understand that for audio or images it's very important, since the result is something we can process as fast as our brains allow, but reading is comparatively slow. With token streaming, wouldn't the best choice be to pick the smartest model we can run at our reading speed?
What is the point of having an answer in seconds if we still need a minute to read it? That said, I do understand the desire to run a small model so you can also run a TTS and/or image model alongside it.
Anonymous No.107185821 [Report] >>107185909
>>107185810
for code or generating huge chunks of text you mostly skim, as well as reasoning which takes ages at reading speed
Anonymous No.107185825 [Report] >>107185909 >>107186686
>>107185810
>Why the fuck is everyone so obsessed with making an LLM run as fast as possible?
because LLMs are mostly used for coding, and time is money
Anonymous No.107185841 [Report] >>107185909
>>107185810
Because you need to reroll 46 times to get one usable line out of these POS
Anonymous No.107185859 [Report]
Should I use I quants for >6_k_s?
Anonymous No.107185909 [Report] >>107185938
>>107185821
>>107185825
Yeah I forgot lazy fucks just copy paste the code without reading it.

>>107185841
Yes, but wouldn't it make sense to use a smarter model so you don't need to reroll as much? Besides, you still need to read each reroll at the slow speed to know whether you have to reroll to begin with.
Anonymous No.107185938 [Report] >>107186110
>>107185909
I mean... it doesn't really take more than a few s to read the few sentences it gens, I'm not genning 4k token walls.
Anonymous No.107185940 [Report] >>107186110
>>107185810
You might be a slow reader anon. Also it's fun to experiment with card settings and prompts, or reroll to see what else could happen. If your model is slow it greatly degrades the experience. Every time I switched to offloading to CPU I regretted it, the models are smarter but it's not worth it.
Anonymous No.107185976 [Report]
>>107185474
iirc merging was based on the observation that residual layers (most transformers stack these) can work somewhat independently of each other. There was a paper (https://arxiv.org/abs/1605.06431) showing that you could permute/delete them with minimal performance degradation, and people attributed this to iterative refinement or ensemble-like behavior, but it's still an open problem to my knowledge. I'd assume adding layers from finetuned variants of a model shouldn't decrease performance by much, but idk if it would benefit either
Anonymous No.107185984 [Report] >>107185998 >>107186002
Is there a collection of best practices to minimize prompt length without losing information?
Anonymous No.107185998 [Report]
>>107185984
>chatgpt, condense this prompt without losing information
Anonymous No.107186002 [Report]
>>107185984
>day 999 of reinventing /aids/
Does it really matter with the context sizes?
Anonymous No.107186047 [Report] >>107186120
>day 999 of forcing /aids/ into the conversation
Anonymous No.107186093 [Report]
/aids/? nobody's got /aids/!
Anonymous No.107186110 [Report] >>107186301
>>107185938
Yes, but I usually read as it generates the answer.
>>107185940
Well, probably yes since I'm not a native English speaker, but I'm asking if it would make more sense to choose the best model according to your individual reading speed instead of the one that runs as fast as possible. For example, the best model I can run at my own reading speed on my 8GB card is a 16B Q4_k_m at 8k context, or if I want a model with vision, an 8B model at Q6_k_m with 12k context.
Anonymous No.107186120 [Report]
>>107186047
this
wow /aids/ touched on a fundamental behavior of LLMs at one point, so did every other LLM community, who cares? unless they have a specific ingenious solution that 1) still applies with modern models and 2) isn't already common knowledge, it's not worth bringing up
Anonymous No.107186221 [Report] >>107186311
>tried the self merge
>it's full on repeating schizo
W A O W
Anonymous No.107186244 [Report]
At this point I am checking /lmg/ out of habit. Still not tired of glmsex.
Anonymous No.107186301 [Report] >>107186458
>>107186110
>16B
Q6_k_m
oh you're just a baitie
Anonymous No.107186311 [Report] >>107186337 >>107186372 >>107186614 >>107186696
>>107186221
any model bigger than the original model made by internet randos was either:
snake oil
or literally broken garbage that's worse than snake oil
also fuck solar and other upscale retardation
you want a big model? spend the money on training a big model
there, that's it
everything else is a cope
Anonymous No.107186337 [Report] >>107186374
>>107186311
brother the whole field is cope layered on more cope
Anonymous No.107186372 [Report]
>>107186311
I don't think they're any smarter or better at actual problem solving than their source components but I think they can be more interesting for creative writing and similar tasks
Anonymous No.107186374 [Report]
>>107186337
Anonymous No.107186458 [Report] >>107186591 >>107189178
>>107186301
With that lack of reading comprehension it's no wonder you read fast.
I said I can run at my slow reading speed:
-16B at Q4
-8B at Q6 with vision.
Anonymous No.107186568 [Report]
Just tried GLM-4.5-Air EXL3 at 3.07 (optimized) bpw on 2x3090.

native tp (no nvlink), 30k context: 952 tok/s pp, 28 tok/s tgs
nccl tp (uses nvlink), 30k context: 1135 tok/s pp, 28 tok/s tgs
Anonymous No.107186591 [Report] >>107186722
>>107186458
yes and 16b (one thing) and q6km (another) is bait
Anonymous No.107186595 [Report]
i've been bragging about getting 18 tps on a 1080ti
but it turns out the vast majority was being offloaded onto my 5800x3d. pls ignore my bad benchmark.
Anonymous No.107186614 [Report] >>107186640 >>107186696 >>107187264
>>107186311
I kind of never got how people expect this to work. Any "finetuning" does almost nothing because you have to do very little (one epoch) or you start overfitting and frying the model. If you add new layers you are just giving the training algorithm a place it can modify to reach the overfitting state faster. Even if you trained only those layers it is hard to imagine not overfitting.

I guess in the best case you could get the model to output a specific type of output like specific formatting or something, but that is only if the possibility of it was already in the model. You aren't teaching it new things this way. It is just impossible.
Anonymous No.107186640 [Report] >>107186663 >>107186803
>>107186614
Anonymous No.107186663 [Report] >>107186670
>>107186640
You can't rag your model into being an expert masterpiece highest quality ERP-er. You just need to buy ram for 4.6.
Anonymous No.107186670 [Report]
>>107186663
oh, just a NAI shill, carry on sir
Anonymous No.107186686 [Report] >>107186701 >>107186725
>>107185810
>>107185825
I could wait for code 2 or 3 days, if it worked and was accurate. But bigger models are not that smart.
Anonymous No.107186696 [Report] >>107186720 >>107186726 >>107186744 >>107187365
>>107186311
>>107186614
The psychology that is in effect when people are making finetunes is the same as when people are making "ShadowMaster's Ultra-High-Res Skyrim Grass Modpack"

1) Feeling of accomplishment. Technically, they did manage to create a mod pack. This is fine.
2) Denial of skill and expertise. "If the game developers were as smart as me, they would have made the grass more high resolution."
3) Denial of their role in the consumer class. "People are downloading my mod, so I've created something of value, just like the game's developers."
4) Denial of taste. "I like my high res grass (although I'm unaware that it's because of reasons 1-3). Anyone who says it's shit must be jealous or just have different taste. Therefore, the fact that I can't tell that it's ugly doesn't mean I lack taste."
5) Imitation of academic tradition. "There's something named after me."

It's literally the same exact brain damage for finetunes. There was a very brief period where finetuning was being invented, where individual people were going back and finetuning the earlier untuned models. That was valid, but everything for the last year is cope.

Seriously, if finetuning was good, don't you think the billion dollar company would have someone doing it? They are better than you at this. Only delusion prevents this realization.
Anonymous No.107186701 [Report]
>>107186686
Yes of course, run it overnight, heard ALL about it when llama 405B dropped. So many people do this it's crazy!
Anonymous No.107186720 [Report] >>107186730 >>107186737
>>107186696
i don't think you know what finetuning means
Anonymous No.107186722 [Report] >>107186747
>>107186591
I don't understand what you're trying to say then, this is the speed I get with the 8B model with vision enabled, it's a Q6, and it's a lot faster than I can read English
Anonymous No.107186725 [Report]
>>107186686
Right?
If there was a model that would take 3 days to spit out what you need but would get it exactly right every time, I'd be more than happy leaving the thing running.
Alas, that's not yet a thing.
Anonymous No.107186726 [Report]
>>107186696
drummer mentioned
Anonymous No.107186730 [Report] >>107186743
>>107186720
Hi faggot, all here...
Anonymous No.107186737 [Report]
>>107186720
people post-train or merge or whatever to create mods of existing models, releasing the whole model instead of a lora
Anonymous No.107186743 [Report]
>>107186730
uh, yeah, right?
Anonymous No.107186744 [Report] >>107186755 >>107186762
>>107186696
this post was written by an llm
Anonymous No.107186747 [Report] >>107186821
>>107186722
>Captura de pantalla
lolmao
what 16b are you running little bro
Anonymous No.107186755 [Report] >>107186787
>>107186744
>ShadowMaster's Ultra-High-Res Skyrim Grass Modpack

Make your LLM output that. I dare you.
Anonymous No.107186762 [Report]
>>107186744
>nigger seeing capitol letters on four chan
Anonymous No.107186768 [Report] >>107186796
this post was written by an esl
Anonymous No.107186787 [Report] >>107187103
>>107186755
that's possibly the most llm-y part of the post, kimi for example is addicted to unnecessary little flourishes like that
Anonymous No.107186796 [Report]
>>107186768
esl hobby sir de pantella Pareto paradigm just mooned
Anonymous No.107186803 [Report] >>107186817
>>107186640
The real misconception is that the model parroting finetuning data means it has learned new knowledge. A tiny QLoRA adapter is enough for that, for limited amounts of data. But it doesn't really mean the model has actually learned to use and apply any new information.
Anonymous No.107186817 [Report]
>>107186803
>noooo muh mesugaki lightbulb bublesort benchie
Anonymous No.107186821 [Report] >>107186830 >>107186933
>>107186747
Fuck me, do you even know how to read numbers? I said is a 8b model.
The 16b model runs at 8 tokes per second.
Anonymous No.107186830 [Report] >>107186876
>>107186821
i'm asking which 16b you claim to run ffs
Anonymous No.107186860 [Report] >>107186867 >>107186873
drummer getting desperate ITT...
Anonymous No.107186867 [Report]
>>107186860
leave the Pantella frontier alone!
Anonymous No.107186873 [Report]
>>107186860
kofi bucks running low his discord are ungratefulls
Anonymous No.107186876 [Report] >>107186884 >>107186919 >>107186933 >>107186944
>>107186830
I swap between these two depending on the mood:
LLama-3.1-128k-Darkest-Planet-Uncensored-16.5B-Q4_k_m
Nyanade_Stunna-Maid-7B-v0.2-Q6_K-imat

Also, the vision model it's a 7b, not an 8b.
Anonymous No.107186884 [Report] >>107186919 >>107186936
>>107186876
and there we go...
>128k-Darkest-Planet-Uncensored-16.5B
a davidau clownmoe atrocity
Anonymous No.107186919 [Report]
>>107186876
>Darkest-Planet-Uncensored
That's so fucking funny.

>128k
I bet it is.

>>107186884
>davidau
Figures.
I love that guy man. I always get a chuckle out of his shit on huggingface.
Anonymous No.107186933 [Report]
>>107186821
>do you even know how to read numbers? I said is a 8b model.
>>107186876
>it's a 7b, not an 8b.
Womp womp
Anonymous No.107186936 [Report] >>107186946
>>107186884
Yes, and? I'm just discussing the sizes of models and their running speeds, not what they are for.
Anonymous No.107186944 [Report]
>>107186876
>LLama-3.1-128k-Darkest-Planet-Uncensored-16.5B-Q4_k_m
>Nyanade_Stunna-Maid-7B-v0.2-Q6_K-imat
Anonymous No.107186946 [Report]
>>107186936
The running speed of atrocities in their own size class is surely widely useful info, thanks anon.
Anonymous No.107186998 [Report] >>107187017
For me it's the pre Llama2 merges consisting of 278 nestled models (confirmed)
Anonymous No.107187017 [Report]
>>107186998
Utopia/UtopiaXL my beloveds
Anonymous No.107187103 [Report]
>>107185199
>3 years into the LLM craze I would have hoped to have more robust tools.
I'll bet their readme files on their git repos have been the bulk of their merge histories.
>>107185810
Fried dopamine receptors needing faster validation. Every other answer is cope.
>>107186787
This is why Kimi is so good.
Anonymous No.107187264 [Report] >>107187294 >>107187357 >>107187393
>>107186614
You can do multiple epochs over the data you want to actually train on by diluting it with more generic data.
Also what makes you think you can't teach the model something in one epoch? Pretraining is often just 1 epoch.
Anonymous No.107187294 [Report]
>>107187264
>Pretraining is often just 1 epoch.
pretty sure that hasn't been true in years, that's how they get to claim their crazy 30T+ tokens by doing multi epochs on the same shit, also iirc some papers showed they specifically did multiple epochs of stuff like wikipedia.
Anonymous No.107187326 [Report] >>107187348 >>107187354 >>107187660 >>107188323
yo is it just me or is QwQ weirdly better than you'd expect? feels like it punches way above its weight, least slopped and smartest ~30B model in my book (compared to Qwen3 30 & 32, magistral and gemma)
Anonymous No.107187348 [Report] >>107187375 >>107187417
>>107187326
>punches way above its weight
HELL YEAH!!
>>107182378
Anonymous No.107187354 [Report] >>107187363
>>107187326
I don't think I've seen one good Qwen model but IG I'll download it and see
Anonymous No.107187357 [Report] >>107187373 >>107187450
>>107187264
One pretraining epoch has information repeated hundreds (at the minimum) or thousands of times in many different ways, though.
Anonymous No.107187363 [Report]
>>107187354
Qwen models post 2507 are all pretty good
Anonymous No.107187365 [Report]
>>107186696
They don't because they don't have an ML department and they don't want to invest resources into something that sounds technical and risky/scary.
My boomer boss literally thinks you can "train the AI with your own data" with <shitty low code software> but finetuning is "too low level".
Anonymous No.107187369 [Report] >>107187374 >>107187716
it's out
https://openai.com/index/gpt-5-1/
Anonymous No.107187373 [Report] >>107187446
>>107187357
Not on our proprietary high quality de duplicated filtered dataset sir.
Anonymous No.107187374 [Report]
>>107187369
buy an ad
Anonymous No.107187375 [Report] >>107187386
>>107187348
How did soul not make the list?
Anonymous No.107187386 [Report]
>>107187375
because soul is sovl of course
Anonymous No.107187393 [Report] >>107187408 >>107187433 >>107187497
>>107187264
Ok drummer, then where is that one model that is actually noticeably better? And why do you shit out new models every few weeks? I have not seen a single fine-tune that delivered the ERP improvement you get when you jump from 7B>30B>70B>the land of eternal magical sex (4.6)
Anonymous No.107187408 [Report] >>107187434 >>107187444
>>107187393
>the land of eternal magical sex (4.6)
buy the ad NAI shill
Anonymous No.107187417 [Report]
>>107187348
>slop words:
>slop
Russell's Paradox?
Anonymous No.107187433 [Report]
>>107187393
>tunes and drummer are bad because we don't have them on NAI
Anonymous No.107187434 [Report] >>107187442
>>107187408
It is just a number. I didn't say the model's actual name. You see NAI everywhere anon.
Anonymous No.107187442 [Report] >>107187467
>>107187434
With how much you guys are spamming about muh glm sex it's very obvious what you meant.
Anonymous No.107187444 [Report]
>>107187408
Based.
Anonymous No.107187446 [Report]
>>107187373
Deduplication removes identical documents, not repeated information, though. It's the repeated information showing up in many different contexts that gives an LLM general knowledge. One epoch of information that is only mentioned and used once won't work.
Anonymous No.107187450 [Report] >>107187488
>>107187357
There are ways to do data augmentation and synthetic data generation for finetuning. That's the main strength of finetuning IMO.
Any system prompt can be baked into a model through SFT on the generated data, except without wasting context or the model becoming confused due to too many rules. Imagine if you could use a 1 MB system prompt and the model actually followed everything in the prompt. That is what people who shit on finetuning don't get.
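A hypothetical sketch of the data prep that implies (file names, the generate() helper and field names are all made up for illustration): generate with the giant system prompt present, then write the SFT examples without it, so the behavior gets trained in rather than re-read on every request.
[code]
import json

BIG_SYSTEM_PROMPT = open("style_rules.txt").read()  # the hypothetical 1 MB of rules

def generate(system: str, user: str) -> str:
    # stand-in for whatever teacher model / sampling call you actually use
    raise NotImplementedError

with open("raw_user_turns.jsonl") as fin, open("sft_data.jsonl", "w") as fout:
    for line in fin:
        user_msg = json.loads(line)["user"]
        reply = generate(system=BIG_SYSTEM_PROMPT, user=user_msg)
        example = {"messages": [
            {"role": "user", "content": user_msg},        # note: no system prompt here
            {"role": "assistant", "content": reply},
        ]}
        fout.write(json.dumps(example) + "\n")
[/code]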
Anonymous No.107187467 [Report] >>107187515
>>107187442
But I run my 4.6 locally...
Anonymous No.107187488 [Report] >>107187559
>>107187450
>Imagine if you could use a 1 MB system prompt and the model actually followed everything in the prompt. That is what people who shit on finetuning don't get.
I will try to get it. For fun. So that expert erp-er prompt actually helps a 7B avoid the surprise prostate/ kissing while blowjob problem? Have you tested it?
Anonymous No.107187497 [Report] >>107187507
>>107187393
I'm not Drummer, my personal finetuning dataset is private and meant to teach a model how to work effectively with my own code assistant which is also not public.
But I do want to add fiction/roleplay data to it as well to reduce the slop to a bearable level (I've tried and failed to get rid of it by system prompting).
Anonymous No.107187507 [Report] >>107187519
>>107187497
You just need to switch to a good model, that's the simple fix.
Anonymous No.107187515 [Report] >>107187538
>>107187467
>i keep hearing so much about 4.6 but I have a shitty pc, where can I use it online?
>oh, I don't want to use kimi because nobody is spamming about it!
Anonymous No.107187519 [Report] >>107187541
>>107187507
There aren't any good open weights models.
Anonymous No.107187538 [Report] >>107187617
>>107187515
Sucks to suck for kimi but it is outside an AM5 range. I am gonna fangirl kimi once it is runnable without a server.
Anonymous No.107187541 [Report] >>107187595
>>107187519
zai-org/GLM-4.6-SEXXO
exists
Anonymous No.107187559 [Report]
>>107187488
I only began seriously finetuning a couple weeks ago. I'll give it a year to see what is possible to achieve as a hobbyist.
Anonymous No.107187595 [Report] >>107187602 >>107187606 >>107188073 >>107188089
>>107187541
When a programming task is too hard GLM always adds fake placeholder code to generate fake but real looking data, then claims everything was done perfectly. It also sometimes gets stuck in loops and uses file edit tools in a way I don't like (rewrites existing code rather than making atomic edits).
Anonymous No.107187599 [Report]
Anonymous No.107187602 [Report] >>107187623 >>107187628
>>107187595
Which part of sex you don't understand?
Anonymous No.107187606 [Report] >>107187623 >>107187628
>>107187595
i said SEXXO you nerd
Anonymous No.107187617 [Report] >>107187625 >>107187631 >>107187637
>>107187538
>I am gonna fangirl kimi once it is runnable without a server.
Kimi K3 Bitnet would be the best Christmas present
Anonymous No.107187623 [Report] >>107187636
>>107187602
>>107187606
Sex with your hand you mean?
Anonymous No.107187625 [Report] >>107187800
>>107187617
>Bitnet
let it go, just wait for 4.6 ait to cook
Anonymous No.107187628 [Report]
>>107187602
>>107187606
What memelang is that?
Anonymous No.107187631 [Report]
>>107187617
>bitnet
wake up
Anonymous No.107187636 [Report]
>>107187623
Absolutely. And it also fucked my brain this past month.
Anonymous No.107187637 [Report] >>107187662
>>107187617
pupper farm reminds of when the qwen 3s said they would bitnet and shit good times of cope
Anonymous No.107187659 [Report] >>107187691
For me, it's DavidAU clown car MoEs
SCHIZOMAXXING
Anonymous No.107187660 [Report] >>107187814 >>107187845
>>107187326
Yea I like it. I think it was made shortly after deepseek r1, and it was qwen's "let's try reasoning" experiment. It was basically reasoningmaxxed and neglected creative writing or flowery responses as a result. It's dank for generating really specific brief answers, but has no value for erp.
Anonymous No.107187662 [Report] >>107187683
>>107187637
I love how they never mentioned it again.
Not "oh it doesn't work" not "oh we ran out of time, maybe for qwen 4" just total memory hole silence.
Anonymous No.107187683 [Report] >>107188008
>>107187662
Jensen gave them a visit to stop that.
Anonymous No.107187691 [Report]
>>107187659
Llama-3-70B-Instruct-Failed-Cryogenic-Reanimation-Support-Group-Moderator-Q4_K_M
Mixtral-8x7B-v0.1-Amateur-Body-Snatcher-But-For-Garden-Gnomes-Q5_K_M
Qwen2-72B-Chat-Evil-Super-Villain-With-A-Very-Specific-Allergy-To-Peanuts-Q8_0
Gemma-2-27B-Accidentally-Summoned-A-Demon-While-Trying-To-Make-A-Vegan-Quesadilla-Q6_K
Mistral-7B-v0.3-Excommunicated-Monk-Who-Now-Runs-A-Successful-OnlyFans-Q4_K_S
Phi-3-mini-4k-instruct-Haunted-Doll-That-Just-Gives-Unsolicited-Parenting-Advice-Q3_K_M
Solar-10.7B-Instruct-Graverobber-Who-Only-Takes-The-Shoes-Q5_0
Yi-1.5-34B-Chat-Cult-Leader-But-The-Cult-Is-Just-About-Organizing-Your-Spice-Rack-Alphabetically-Q4_K_M
DeepSeek-Coder-V2-Lite-Base-Argues-With-Your-Smart-Fridge-About-Your-Eating-Habits-Q6_K
Anonymous No.107187695 [Report]
>model would rather eat a newborn than say nigger
Weirdly real-like...
Anonymous No.107187716 [Report] >>107187731 >>107187768 >>107187785
>>107187369
>GPT‑5.1 Instant, ChatGPT’s most used model, is now warmer by default and more conversational. Based on early testing, it often surprises people with its playfulness while remaining clear and useful.
That's such a massively gay backpedalling on what was a legit improvement
GPT-5 was so much better than 4o in tone
fuck this gay earth
Anonymous No.107187731 [Report]
>>107187716
shit it's actually real? so used to cat posts.. it seems quick for a .1
Anonymous No.107187768 [Report] >>107187785
>>107187716
their employees were getting mobbed by mentally ill #save4o people on xitter so they caved
Anonymous No.107187785 [Report] >>107187803 >>107187823
>>107187716
>>107187768
a single chat with kimi would kill an average white woman
Anonymous No.107187800 [Report] >>107187818
>>107187625
Everything is going to be natively trained at int4/fp4 within 6 months, with Hadamard, at the very least. Then the jump to binary/ternary is small.
Anonymous No.107187803 [Report]
>>107187785
Kimi do be really nice on creative writing brainstorming, cause it's the only "stock" model I know to openly tell you your idea is dogshit and you should feel bad.
Anonymous No.107187814 [Report] >>107187832 >>107187902
>>107187660
Wasn't it before r1? I remember it as the first open cot model (before we only had llama with a think step by step prompt larp)
Anonymous No.107187818 [Report]
>>107187800
I hope not but we do always head the worst direction so you're likely right :(
Anonymous No.107187823 [Report]
>>107187785
Or brown man for that matter. Kimi does not fuck around.
Anonymous No.107187832 [Report]
>>107187814
iirc between deepseek's online only reasoner preview and R1 releasing yeah
Anonymous No.107187845 [Report] >>107187872 >>107187907
>>107187660
I disagree. It was very very good for ERP especially for its size and how it had no right to be good. My bet is they reasonmaxxed so hard it fucked with the censorship. Also if i remember the trick for sex was to use it without reasoning.
Anonymous No.107187872 [Report]
>>107187845
>Also if i remember the trick for sex was to use it without reasoning.
that would just give you almost stock qwen 2.5 experience, I think you are misremembering something, there was qwq preview, which was absolute dogshit and then the proper qwq, which distilled r1 a bit afterwards
Anonymous No.107187902 [Report] >>107187912 >>107187924
>>107187814
that was qwq preview (which sucked)
full qwq was released after r1 and was basically a distill done right
Anonymous No.107187907 [Report]
>>107187845
>My bet is they reasonmaxxed so hard it fucked with the censorship.
This happens with all the non-mainline Qwen models. I think they just put less effort into the safetymaxxing of their specialist finetunes.
I have personal benchsets of text to translate, the recent VL models for example are much more likely to accurately translate expletives like "Fuck!" from their corresponding terms in other languages. Whereas mainline Qwen 3 tends to go for "Holy crap!", "Damn" etc and tries very hard not to say "Fuck!".
Their coder models are also more compliant. But not too good for translation because they fuck the multilingualism very hard on those.
Anonymous No.107187912 [Report]
>>107187902
>distill done right
oxymoron
Anonymous No.107187924 [Report]
>>107187902
The preview had a repetition loop issue but was more creative and less censored. The full they distilled from R1 resolved the looping issue but was censored, less creative, and more schizo.
Anonymous No.107187934 [Report] >>107187953
I don't understand people who have the patience for the amount of thinking tokens R1 and its distilled derivatives tend to output. R1 was the worst thing to happen to open models. (ds v3 was great tho')
Anonymous No.107187953 [Report]
>>107187934
at least it actually somewhat used the thinking, new deepseek "thinks" for half a second, doesn't plan anything and still answers whatever
Anonymous No.107188008 [Report] >>107188019 >>107188032
>>107187683
OH yes feed me daddy
Anonymous No.107188019 [Report]
>>107188008
>tfw you will never play the pocky game with Jensen
Anonymous No.107188032 [Report] >>107188182
>>107188008
>jensen huang handing you your daily food ration after agi utopia was achieved
Anonymous No.107188073 [Report] >>107188080 >>107188460
>>107187595
I can tell most of you are vibe coders because it took me a couple of days coding with LLMs to figure out how to avoid this

It's also how I get good usable code out of small models like devsteal and qwen coder moe

hint: it's a technique every junior coder is taught in every bootcamp
Anonymous No.107188080 [Report] >>107188090 >>107188097
>>107188073
>devsteal
give it back jamarius!
Anonymous No.107188089 [Report]
>>107187595
Claude used to do that too
Anonymous No.107188090 [Report] >>107188108 >>107188122
>>107188080
Kek
Devstral*
Anonymous No.107188097 [Report] >>107188108 >>107188122
>>107188080
Kek
Devstral*
Anonymous No.107188108 [Report] >>107188118
>>107188090
>>107188097
now you go and steal two posts, you can't keep getting away with this mate
Anonymous No.107188117 [Report] >>107188141 >>107188157 >>107188603 >>107190002
What's the best Speech to Speech for pure Youtube voice overs (*.MP3, *.WAV, *.FLAC).

The goal here is to not disclose my voice on the internet, make my voice deeper, and make the voice over cleaner and more intelligible (I have an accent). I will imprint the emotions in my voice, the model just needs to change the sound of my voice.

I really need the focus to be on it sounding as human as possible, I do not care about real-time voice changing.
Anonymous No.107188118 [Report]
>>107188108
Guess what

I stole two (you)'s two ;)
Anonymous No.107188122 [Report]
>>107188090
>>107188097
>exactly a minute apart
nice try faggot
Anonymous No.107188141 [Report] >>107188152 >>107188582
>>107188117
Good morning sir
This is retarded qvestion sir just use the texts to speeches
Anonymous No.107188152 [Report]
>>107188141
Thank you sir, will do.
Anonymous No.107188157 [Report] >>107188167
>>107188117
>make my voice deeper
not using female voice, you are ngmi my dude
Anonymous No.107188167 [Report]
>>107188157
voicecel need love and support anon
Anonymous No.107188182 [Report] >>107188258 >>107188375
>>107188032
>agi utopia
= plapping my robowaifu deep into her self-sanitising orifices while she tells me I'm a good boy
Anonymous No.107188220 [Report] >>107188230
most of you will have become impotent and limp before we reach the stage of robowaifu
Anonymous No.107188230 [Report]
>>107188220
there will be drugs for that.
Anonymous No.107188251 [Report] >>107188265 >>107188274 >>107188911
>System prompt : You are Ani, an AI assistant. Anon is a boy.
>Anon : State yourself, Ani
>Ani : I'm Ani, a boy
This is damage from quantization, right? I'm still downloading higher quants to check.
Anonymous No.107188258 [Report] >>107188384
>>107188182
And jensen gently places his pocky on your tongue mid thrust.
Anonymous No.107188265 [Report] >>107188314
>>107188251
abliterated?
Anonymous No.107188274 [Report] >>107188289 >>107188331
>>107188251
Oh, yes. The mystery quant of the mystery model. Yes. It's either that or something else.
Anonymous No.107188289 [Report]
>>107188274
shut up nerd
Anonymous No.107188314 [Report] >>107188358
>>107188265
I'm not sure, it's from https://huggingface.co/irmma/ERNIE-4.5-21B-A3B-PT-Q4_K_S-GGUF
This is the only link after the original model updated. Prob mistake in quantizing too.
Still downloading Unsloth's, but this release is an outdated one.
Anonymous No.107188323 [Report]
>>107187326
>yo is it just me or is QwQ weirdly better than you'd expect?

$$\boxed{\text{yes}}$$
Anonymous No.107188331 [Report]
>>107188274
please tell me more about the mystery quant
Anonymous No.107188339 [Report]
x.6 made looking for a better model on hf obsolete.
Anonymous No.107188346 [Report] >>107188419 >>107188472
>Hard drives on backorder for two years as AI data centers trigger HDD shortage
the more new models get released, both closed api models and open source, the more I feel the negative impact of this tech's existence is not really being paid off by the positives
Anonymous No.107188347 [Report] >>107188913
>>107184585
>You can quantize to Q8_0 slightly better by using the importance-weighted optimizations that smaller
You're not referring to imatrix right?
Anonymous No.107188358 [Report] >>107188364
>>107188314
>A3B
>Q4_K_S
>retarded
Gee, I wonder why.
Anonymous No.107188364 [Report] >>107188378 >>107188389
>>107188358
stop that and properly help anon
Anonymous No.107188375 [Report]
>>107188182
>Of course, you are absolutely right anon! *the robot leans in for conspiratorial whisper that shivers your spine*
Anonymous No.107188378 [Report]
>>107188364
He can't. That's why he's shitposting and has nothing worthwhile to add.
The LLM is compelled to respond when you hit enter and some anons are compelled to vomit tokens into the post submission field.
Anonymous No.107188384 [Report]
>>107188258
UGHhHG-G-g--g-- when anon sends you over the edge
Anonymous No.107188389 [Report] >>107188399
>>107188364
>properly help anon
to be honest: just don't use that model, it's one of the millions of garbage models china has flooded the market with
aside from qwen and deepseek the rest are pretty unserious
Anonymous No.107188399 [Report] >>107188416
>>107188389
>What is the Kimi and GLM4.6
Anonymous No.107188416 [Report]
>>107188399
kimi yes; glm no
Anonymous No.107188419 [Report]
>>107188346
Don't worry, if there wasn't an AI boom, they'd just start burying hardware in concrete like the Funko Pops
Anonymous No.107188460 [Report]
>>107188073
I can tell most of you are poor because I just pay an ukranian draft dodger to code for me
Anonymous No.107188472 [Report] >>107188522 >>107188531
>>107188346
The driving force is wrongly "let's replace the workforce with text prediction" instead of "let's create very cute robowaifus"
Anonymous No.107188522 [Report] >>107188636
>>107188472
>"let's replace the workforce with text prediction"
That's not really true when every actual AI application requires an order of magnitude more workers for supervision than if they just let people do the work themselves
It's just taking advantage of a bubble where saying muh AI has a 90% chance of making your market cap 20x what it was three years ago
Anonymous No.107188531 [Report] >>107188543 >>107188561 >>107188870
>>107188472
So you would rather have people like me slave away for your fake disability scam aka neetbucks? Fuck you nigger.
If you weirdos get off to anime and text porn I've no doubt you could get off to a talking fleshlight but society isn't made to please your fucked up fetishes. Keep using your hand like everyone else faggot. And jerk off to real women not your fake animu that you cope with because you know no "3dpd" would ever dare to touch your dick that hasn't been washed in a year.
Anonymous No.107188543 [Report]
>>107188531
hello llm
Anonymous No.107188561 [Report]
>>107188531
Go back to work
Anonymous No.107188582 [Report] >>107189655
>>107188141
This is a stupid answer because the TTS result is subpar compared to conversions, people notice it's AI and click away.
Anonymous No.107188603 [Report] >>107188641
>>107188117
Nothing has surpassed RVC yet.
Anonymous No.107188630 [Report]
>>107185469
Without changing the format, that is the best you can probably do. If you could do an overhaul, Trellis quanting via QTIP is SOTA and you have formats from ik_llama.cpp and EXL3 being based on that.
Anonymous No.107188636 [Report]
>>107188522
saar do not redeem
Anonymous No.107188641 [Report]
>>107188603
ESL retard
Anonymous No.107188709 [Report] >>107188733 >>107188816
how many of you have embraced the ssdmaxx kimi life?
Anonymous No.107188733 [Report] >>107188780 >>107188843
>>107188709
>ssdmaxx
not a real thing
even the most autistic schizos here are not going to waste away at 0.2t/s
Anonymous No.107188780 [Report] >>107188853
>>107188733
I get around 1 t/s though
which actually isn't totally unusable
Anonymous No.107188816 [Report] >>107188853
>>107188709
I'm happy with 2.1 t/s Kimi.
Anonymous No.107188843 [Report]
>>107188733
4x pcie5 ssds in raid-0 would be slightly above dual-channel ddr4 level in terms of bandwidth, assuming you have the right system.
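Rough numbers behind that claim (assuming peak sequential reads; sustained throughput will be lower):
[code]
pcie5_ssd_gbps = 14.0                   # a fast PCIe 5.0 x4 drive, GB/s
raid0_4x = 4 * pcie5_ssd_gbps           # ~56 GB/s
ddr4_3200_dual = 2 * 3200e6 * 8 / 1e9   # 2 channels * 8 bytes/transfer ~= 51.2 GB/s
print(raid0_4x, ddr4_3200_dual)
[/code]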
Anonymous No.107188853 [Report] >>107188920
>>107188780
>>107188816
What do you guys use it for?
Or put another way, how does an average session of using the model go at those speeds?
Anonymous No.107188870 [Report] >>107188891
>>107188531
>And jerk off to real women
I can't remember the last time I've done that
Anonymous No.107188891 [Report]
>>107188870
Now I realize it's probably been close to a year, if not more, since I last did.
Anonymous No.107188894 [Report] >>107189025
Can someone explain the discrepancy in the measured memory bandwidth for the EPYC 9275F vs the 9255? I thought memory bandwidth was correlated with the number of CCDs the CPU has, and the 9275F has 8 vs the 9255's 4.
https://jp.fujitsu.com/platform/server/primergy/performance/pdf/wp-performance-report-primergy-rx2450-m2-ww-ja.pdf
Anonymous No.107188911 [Report]
>>107188251
Tested with the outdated Unsloth Q6_K model, it doesn't seem to be mixing up the User and Assistant roles.
Anonymous No.107188913 [Report]
>>107188347
>You're not referring to imatrix right?
Probably not, the code is here >>107184772
It's only technically interesting since the perplexity difference is probably less than 0.01 when you actually compare the two quants, but it is technically speaking strictly better and fully backwards compatible

Q8_0 with 64 or 128 elements per block reducing the bits per weight to 8.25 or 8.125 is probably more relevant since those savings could actually matter for huge models

For a 300B model:

8.5 bpw ≈ 318.75 GB
8.25 bpw ≈ 309.38 GB (≈ 2.9 % smaller)
8.125 bpw ≈ 304.69 GB (≈ 4.4 % smaller)


So you shave 10gb with 0.0001% quality loss. Not sure how 64 elements instead of 32 in a block impacts hardware optimizations though
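The arithmetic behind those sizes: Q8_0 stores one fp16 scale per block of int8 weights, so bits per weight is 8 + 16/block_size.
[code]
for block_size in (32, 64, 128):
    bpw = 8 + 16 / block_size
    size_gb = 300e9 * bpw / 8 / 1e9
    # 32 -> 8.5 bpw / 318.75 GB, 64 -> 8.25 / 309.38, 128 -> 8.125 / 304.69
    print(block_size, bpw, round(size_gb, 2))
[/code]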
Anonymous No.107188920 [Report]
>>107188853
General use. Sometimes I have Kimi analyze documents, consolidate lots of information input into easily digestible bullet points for redistribution or making into infographs, produce simple, tedious, but necessary code snippets for hobby projects (do not trust LLMs directly with the codebase yet), acting as a smarter calculator that's more comfortable to use than the average texas instruments calc, and writing shitpost novels and essays for my own amusement. I've started using Kimi to mess around with exploring latent underlying patterns in linguistic theory as a whole too.
Anonymous No.107188930 [Report] >>107188971 >>107188997 >>107189007
Could someone please give me a 60 second summary of what a chat template is and what "jinja" is?

As far as I can tell, models output different kinds of tags, like [THINK][/THINK] vs <think></think> vs <|begin_think|> etc. There is some schema for converting between a raw stream of text and JSON with the {"role":"assistant","content":"hello"} schema.

My question is, is the chat template included in the gguf (or model repo)? If so, why would I ever have to specify which chat template I'm using via a dropdown? How would the --jinja flag ever be needed then?

Also if I'm programming an interface that wants to convert between raw text and formatted conversations, do I need to access the model? There doesn't seem to be a standard endpoint like /v1/chat-template that would do this for arbitrary models.
Anonymous No.107188942 [Report] >>107188992
today in autistic RP pet peeves
>model asks a great question early in its reply
>"but wait... my reply isn't long enough"
>writes 4 more paragraphs of action and dialogue that completely run over the original question, making it impossible to respond to it coherently
meanwhile the original question is the only interesting thing about the reply and the rest is a bunch of generic filler designed solely to pad out the response to the desired length
>just edit it
I do, but it's still annoying
Anonymous No.107188971 [Report] >>107189029
>>107188930
Also, why is like 50% of every changelog fixing chat template problems? How is it not one-and-done by the model author?
Anonymous No.107188988 [Report]
If I'm going to ssdmax, I won't need good compute anyway, right?
Anonymous No.107188990 [Report]
currently going through all my old AI generated stories to organise them into obsidian.
I was not ready to see last modified: 27/06/2022. I think that one may be from GPT-3 davinci on OpenAI Playground.
Anonymous No.107188992 [Report] >>107189011
>>107188942
The day models are able to modulate output length downward when they recognize they're asking questions will be a good day.
Anonymous No.107188997 [Report] >>107189045
>>107188930
the --jinja flag is not for choosing an alternate chat templater, it's for enabling the gguf's chat template; without the flag you're actually not running the proper chat template at all
https://github.com/ggml-org/llama.cpp/tree/master/tools/server
--chat-template-file (used in ADDITION to --jinja) is what you would use to set another template formatter.
jinja can run complex logic, for example gemma models really hate not alternating user/assistant roles (for eg having two consecutive user messages) so it will helpfully reject your prompt if you do that
Anonymous No.107189007 [Report] >>107189045 >>107189060
>>107188930
>chat template
A structure a model is trained to follow and that helps create patterns that differentiate things like what is the user's turn, the AI's turn, the thinking block, a tool call block, etc.

>and what "jinja" is?
A template engine. It helps create templates with conditions, loops, etc.

>is the chat template included in the gguf (or model repo)?
Yes. (and yes).

>why would I ever have to specify which chat template I'm using via a dropdown?
For the most part, you wouldn't. Unless you know that there's something specific about the template you want to change, like adding support for suffixing/prefilling the assistant response.

>How would the --jinja flag ever be needed then?
If I'm not wrong, when you don't use the --jinja flag, llama.cpp falls back to an internal, hardcoded approximation of the chat template. When you use the flag, it pulls the actual Jinja template from the GGUF metadata and parses that.
I imagine that's a failsafe, a way to retain backwards compatibility, or both.

>Also if I'm programming an interface that wants to convert between raw text and formatted conversations, do I need to access the model? There doesn't seem to be a standard endpoint like /v1/chat-template that would do this for arbitrary models.
I'm not sure what you are asking.
llama.cpp has two endpoints, one for chat (that receives a structured chat array) and one for text completion (that receives raw text).
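
If it helps to see it concretely, this is roughly what the template engine does under the hood. A minimal sketch in Python with the jinja2 package; the template string here is a made-up toy, and real model templates are longer and may expect extra helpers (like raise_exception) that the inference stack injects:

from jinja2 import Template

# toy chat template, not any real model's
chat_template = (
    "{% for m in messages %}"
    "<|{{ m['role'] }}|>\n{{ m['content'] }}<|end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|assistant|>\n{% endif %}"
)

messages = [
    {"role": "user", "content": "hello"},
    {"role": "assistant", "content": "hi"},
    {"role": "user", "content": "what is a chat template?"},
]

# render the structured chat into the raw text the model actually sees
prompt = Template(chat_template).render(
    messages=messages, add_generation_prompt=True
)
print(prompt)

The raw string that comes out is what gets tokenized and fed to the model; the chat endpoint just does this step for you.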
Anonymous No.107189011 [Report] >>107189434
>>107188992
If you think of an important question to ask the user, you can stop generating output and ask the user. You will be able to continue your response using that information after they provide an answer.
Anonymous No.107189025 [Report] >>107189049
>>107188894
Weird, because that number is like single-CPU performance.
Don't discount the possibility that the numanuma dual socket testing went bad and the incorrect value slipped its way into the docs. Does it still show up in single CPU benchmarks?
Anonymous No.107189029 [Report]
>>107188971
>How is it not one-and-done by the model author
the model author didn't make a llama.cpp implementation, and in many cases, didn't make any popular inference tool implementation themselves
and recently mistral even wanted llama.cpp to use their own tokenizer library, mistral-common (which also handles the templating), rather than meet the expectation of shipping jinja templates
machine learning is a wild west with no standards and no will to settle on common ground
Anonymous No.107189045 [Report] >>107189069
>>107188997
>>107189007
Thanks!

I'm trying to understand the proper way to do this: In my frontend, I have raw text. I would like to convert that text into a conversation with the model's proper chat template.

The backend is whatever arbitrary way I'm running the model, which is llama-server right now.
Anonymous No.107189049 [Report]
>>107189025
My thought was that it was just bad data, or that the value was accidentally swapped with the 9255. This was unfortunately the only data sheet I could find on memory performance for gen 5 EPYC.
411GB/s is way lower than the expected single-CPU bandwidth of 576GB/s, as are the 9015 and 9115. The explanation I heard before was the significantly reduced CCD count compared to the higher-spec EPYCs.
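(for reference, that 576GB/s is presumably the theoretical figure for 12 channels of DDR5-6000: 12 channels x 6000 MT/s x 8 bytes = 576 GB/s, assuming that's the configuration those data sheets are based on)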
Anonymous No.107189060 [Report] >>107189069
>>107189007
>I imagine that's either (or both) a failsafe and a way to retain backwards compatibility.
Most likely the embedded template convention didn't exist initially, so llama couldn't start reading it without breaking backward compat. It'll probably be turned on by default in some future version.
Anonymous No.107189069 [Report] >>107189114 >>107191472
>>107189045
If what you are going for is turn by turn exchanges, you are better off using the "OpenAI Compatible" chat endpoint.
That also has the upside of making your app compatible with any other backend that speaks that style of API, like OpenRouter and pretty much everything else really.
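
Something like this is all it takes. A rough sketch assuming llama-server on its default 127.0.0.1:8080; any OpenAI-style backend accepts the same payload:

import requests  # pip install requests

# the server applies the model's chat template to this structured chat for you
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "hello"},
    ],
    "temperature": 0.7,
    "max_tokens": 256,
}

r = requests.post("http://127.0.0.1:8080/v1/chat/completions", json=payload)
r.raise_for_status()
print(r.json()["choices"][0]["message"]["content"])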

>>107189060
That's my thinking. GGML format had no metadata from what I remember, then they created GGUF, and I guess at a later revision they added the template as a new metadata field.
Anonymous No.107189114 [Report] >>107189164
>>107189069
>GGML
RIP TheBLoke.
You were too good for this world and didn't deserve to be killed by ninjas.
Anonymous No.107189137 [Report] >>107191332
thankfully, the unlimited storage fest is over and so will be the unlimited troontunes production and ensuing quant uploads
some select users like barf-oski were given free reign but if troontuners can't upload more models quanters won't have more cope quant to produce
Anonymous No.107189164 [Report] >>107189181 >>107189186
>>107189114
>TheBLoke
Far from TheBroke as he is to this day getting $398 monthly from Patreon for lord knows what
Anonymous No.107189178 [Report]
>>107186458
>reading compression
that's a new one
Anonymous No.107189181 [Report] >>107189269
>>107189164
>$4.8k passive annual income, pre-tax
Anonymous No.107189186 [Report] >>107189288
>>107189164
>getting $398 monthly from Patreon for lord knows what
patreon has a lot of retarded users who sub to a guy and then forget about it
it's a very common phenomenon, I've seen webnovel writers still getting a grand a month long after they stopped writing anything, and it's even worse in the case of things like rpgm and visual novel pron games with no updates
Anonymous No.107189229 [Report]
It's over
Anonymous No.107189248 [Report]
It's starting again
Anonymous No.107189256 [Report]
it never began and nothing ever happens
Anonymous No.107189269 [Report]
>>107189181
i live in switzerland and that's pocket change there lol.

though i had a colleague who worked remote for a US company for like 120K/y but lived in vietnam, dude must have been a king there.
Anonymous No.107189274 [Report]
It never stopped, in fact it's going too fast and in the wrong direction
Anonymous No.107189288 [Report]
>>107189186
The real cash is in keeping something updated when that something is so pedestrian it takes no effort to maintain. All the effort went into baiting retards into sucking you off.
Many "projects" earning 20K+ a month.
Anonymous No.107189327 [Report] >>107189394
what happened to this?
https://github.com/huggingface/open-r1
Anonymous No.107189394 [Report] >>107189518 >>107189638
>>107189327
Anonymous No.107189424 [Report]
at least HF built this
Anonymous No.107189434 [Report]
>>107189011
Well look at mister hotshot Prompt Engineer over here
Anonymous No.107189518 [Report] >>107189544 >>107189622 >>107192305
>>107189394
There are women in this hobby?
Anonymous No.107189544 [Report] >>107189617 >>107189622 >>107190878
>>107189518
pleasuring yourself with written word is indeed extremely female field
Anonymous No.107189617 [Report] >>107189652 >>107189699 >>107190878
>>107189544
so much this sister
this thread oozes estrogen
Anonymous No.107189622 [Report]
>>107189518
>>107189544
There are no men in the hobby. Only fujoshis pretending to be men.
Anonymous No.107189638 [Report] >>107189756
>>107189394
would
Anonymous No.107189652 [Report]
>>107189617
10/10, would write my own llm smut fanfics about.
Anonymous No.107189655 [Report] >>107189678
>>107188582
Not with vibevoice
Tons of live service video games have implemented AI voice acting and hardly anyone has noticed
It's only the jeets using 2 year old tts tech for YouTube spam that are obvious about it
Anonymous No.107189678 [Report]
>>107189655
I've seen some pretty convincing youtube spam that I only noticed as TTS because no human could possibly read 30 minutes of repetitive GPT slop without wanting to kill themselves
if you notice a 30 min video with slop writing that uploads often it has to be TTS even if it sounds human
Anonymous No.107189699 [Report] >>107189711
>>107189617
"Thing: Japan" reigns supreme.
WEGs are also notoriously ugly compared to hentai games, a phenomenon I find both fascinating and alarming.
Anonymous No.107189711 [Report]
>>107189699
Mahotsukai no Yome is so good.
Wizard's Blue is also great.
Anonymous No.107189756 [Report]
>>107189638
I concur
Anonymous No.107190002 [Report] >>107191389
>>107188117
seed-vc after finetuning on a particular voice, depends on your input tho

haven't touched TTS though in like 9 months, curious if anyone has thoughts on what's good nowadays
Anonymous No.107190781 [Report] >>107190853
open models?
Anonymous No.107190852 [Report] >>107190855 >>107190859 >>107190873 >>107191161
I'm too busy discussing the Steam hardware/software news that came out today. Might want to do the VR on my AI PC but not sure if that'll need SteamOS or something.
Anonymous No.107190853 [Report]
>>107190781
yes
Anonymous No.107190855 [Report] >>107190859 >>107190893
>>107190852
i'm kinda disapointed by the steam frame, if it had been 4k per eye oled, sure.
but this isn't an upgrade from my bigscreen, much better than the quest though;
Anonymous No.107190859 [Report] >>107190913
>>107190852
>>107190855
i just sold my valve index for $800 to then buy the steam frame when that comes out
Anonymous No.107190873 [Report]
>>107190852
SteamOS is just Arch with Steam. Should be able to do the VR on any Linux.
Anonymous No.107190878 [Report]
>>107189544
>>107189617
it is well known that literacy is gay
Anonymous No.107190893 [Report] >>107190923 >>107190928
>>107190855
Yeah I had that thought as well but then it's not the product for us (enthusiasts), but a Quest competitor and will be priced as such probably. But one thing that kind of got me was the weight. I had to do a double take since it was so surprising. It's 185 grams in the front. That's literally like 100 grams lighter than the Rift CV1's front module, even though it has the full SoC in it and stuff. And just 2x heavier than the Beyond 2. First impressions also said it was super comfy. So that kind of got me hyped again, for the overall package, just not for the thing I thought Deckard was going to be of course.
Anonymous No.107190913 [Report] >>107190925
>>107190859
they are different devices though, it'll be more comfortable to play on the steam deck for 3h than it'd be on a vr headset.
Anonymous No.107190923 [Report]
>>107190893
yea desu i'll probably buy it if it's < 600 bucks.
i kinda want to get rid of my quest too because it's just gathering dust.
Anonymous No.107190925 [Report] >>107190930
>>107190913
where did i bring up the steam deck?
Anonymous No.107190928 [Report] >>107190952
>>107190893
It's comfy and optimized for wireless streaming. I don’t mind the resolution, but I can’t go back to LCD, the gray mess where there should be pitch black ruins immersion for me
Anonymous No.107190930 [Report] >>107190933
>>107190925
i think it is about time i go to bed lmao.
Anonymous No.107190933 [Report] >>107190947
>>107190930
ok goodnight. i said i sold my index, not a steam deck. they are both vr headsets
Anonymous No.107190947 [Report]
>>107190933
yea i get it now, idk, when i read index the steam deck popped into my mind.

good night / day man !
Anonymous No.107190952 [Report] >>107190968
>>107190928
Yup. And I also enjoy the color accuracy of good OLED/QLED. It won't replace my monitor. But I could use some VR gaming in my life again. I had fun with it in the past and just didn't bother keeping up with the hobby. I'd probably get a Frame and then hope they do a Frame OLED so I'd sell the old one.
Anonymous No.107190968 [Report] >>107191009
>>107190952
since it's linux based anyway, we'll probably have ways to mod and upgrade the displays though.
Anonymous No.107190976 [Report]
>>107185557
so they give out licenses to generate loli porn? and a paycheck?
Anonymous No.107191009 [Report] >>107191075
>>107190968
You can't just replace panels because pancake lenses lose 90% of the light and oled isn't bright enough to compensate for that
Anonymous No.107191075 [Report] >>107191103
>>107191009
depends which oled technology we are talking about, more expensive panels can get bright enough to compete.

i think they were going for an affordable device not a premium one and that's why they made that choice.

there are many vr devices with pancake optics and oled panels.
Anonymous No.107191103 [Report] >>107191114
>>107191075
The only option is microOLED, which features an array of individually controllable OLEDs with RGB filters in front of them, topped by a tiny collimating microlens per pixel to focus light in one direction. These are expensive and barely bright enough for use with pancakes
Anonymous No.107191114 [Report] >>107191136 >>107191156
>>107191103
bigscreen beyond is oled.
shiftall meganex superlight 8k is oled.
you can look on vr compare, there are tons of oled vr headsets with pancake lenses.

yes it's not cheap, but i wouldn't mind paying 2k for a device with 4k oled panels.
Anonymous No.107191136 [Report]
>>107191114
Both use microOLEDs that function as described. Bigscreen uses 2.5K microOLED displays from BOE, MeganeX 3.5K/3.8K microOLED
>i wouldn't mind paying 2k
You are not their target audience
Anonymous No.107191156 [Report] >>107191165
>>107191114
the Beyond 1/2 pretty much only work because their custom fitted face pads prevent light leaking, so their low brightness microOLEDs still look okay thanks to the total darkness inside
Anonymous No.107191161 [Report] >>107191174 >>107191298
>>107190852
>steam quest 3.5
VR is possibly the only market more depressing than AI inference hardware.
Anonymous No.107191165 [Report] >>107191225 >>107191335
>>107191156
fun how you ignored the other device i mentioned.

vision pro and galaxy xr are also oled.

and again, on vr compare you'll find dozens of other devices with oled and pancakes.
Anonymous No.107191174 [Report]
>>107191161
> more depressing
at least they are actually getting hardware.

we beg for a few crumbs of extra vram
Anonymous No.107191225 [Report]
>>107191165
Apple uses microOLED with dual white OLED backlights to add extra brightness, resulting in lower yields and very high prices
>on vr compare you'll find dozens of other devices with oled and pancakes.
Unlike you, I’ve tried or owned nearly all of them
Anonymous No.107191274 [Report]
>why can't they just make an OLED version later
Since it's micro OLED, it's completely different in terms of how the panels are made and the resulting panel size. They're much tinier screens, which means the optics have a harder job to do and some trade-offs have to be made. Notice that the MeganeX and Beyonds all have tiny lenses. They advertise an average FOV which seems good on paper, but the reality is that actually that FOV is only when you've gotten your eyes almost touching the lenses AND you are looking straight ahead. When your eyes rotate to look at things, your FOV on that side gets reduced. It makes the experience worse but it's hard to understand that it's an issue and no one talks about it. So this is why there will not easily be a Frame OLED, unless they give it an optical redesign. Also, this isn't to say stuff like Beyond is bad. Obviously micro OLED is great for being OLED, and the headsets are lighter. In the end it's just trade-offs.
Anonymous No.107191298 [Report]
>>107191161
I wouldn't say so? Just the fact that we have an alternative to Meta's walled garden and it runs any PC application you want puts their industry's progress way ahead of where we're at, where there is basically no alternative to Nvidia that's actually good (and not expensive like apple). Also RAM prices getting fucked.
Anonymous No.107191308 [Report]
Good morning sirs!
Anonymous No.107191332 [Report]
>>107189137
>free reign
it's "REIN" you dumb fuck mutt

you have terabytes of LLMs and yet you chose to stay retarded
Anonymous No.107191333 [Report]
is this /lmg/ or /vrg/?
Anonymous No.107191335 [Report] >>107192148
>>107191165
No one asked.
Anonymous No.107191389 [Report]
>>107190002
ESL retard
Anonymous No.107191472 [Report] >>107191480 >>107191507 >>107191530 >>107191532
>>107189069
>GGML format had no metadata
the terminal retardation of niggerganov has shown itself even back then

i remember thinking 'why would you fucking embed textual metadata into a multi-gigabyte binary file, surely he'll come to his senses and put them into an external file'

and now you can redownload the full thing for a 1-byte template/token change

c++ programmers are garbage
Anonymous No.107191480 [Report]
>>107191472
to be clear i'm talking about dogshit GGUF here
Anonymous No.107191492 [Report]
>>107185810
what a fucking retarded take
TIME IS MONEY
when asking an LLM to do some coding for me (I mostly relegate it to unit tests/documentation, but sometimes I want to see its take on how to efficiently implement some functions) you want it to be FAST.
Is not wasting time a hard concept to grasp? fucking RETARD
Anonymous No.107191507 [Report] >>107191530
>>107191472
That's entirely on the users. The textual metadata could easily be patched and no one would have to download more than a simple text file, but it's just easier for everyone to reupload and redownload the entire model weights because people are dumb and lazy, internet speeds are fast, and HF apparently has bandwidth to spare.
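
For what it's worth, you don't even need llama.cpp to pull the embedded template out and check what a quanter actually shipped. A rough sketch with the gguf Python package from the llama.cpp repo; the exact field access details may differ between package versions, so treat it as illustrative:

from gguf import GGUFReader

reader = GGUFReader("model.gguf")

# the chat template is stored as an ordinary string KV in the metadata
field = reader.fields.get("tokenizer.chat_template")
if field is None:
    print("no template embedded in this gguf")
else:
    # the string bytes live in the part indexed by field.data; decode them
    template = bytes(field.parts[field.data[0]]).decode("utf-8")
    print(template)

Writing a changed template back does mean rewriting the file since the string length changes, but that's a purely local operation, no re-download needed.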
Anonymous No.107191530 [Report] >>107191611
>>107191472
i mean, you can still load templates externally. I think it's good that you can embed a default template. The issue is like >>107191507 says: users are retarded, quant makers and hf don't care.
I remember the unsloth guys fucked up a template recently and reuploaded the whole model again... I think twice? because distributing a template on its own is apparently HARD.
Anonymous No.107191532 [Report] >>107191611
>>107191472
braindead take, you can just patch the file if you care about bandwidth, it's way better that the file is self contained

btw holy shit i haven't run a 70b model in ages and i just tried again and llama.cpp went from 4tok/s to 12tok/s somehow, based ggregnov
Anonymous No.107191611 [Report] >>107191688
>>107191530
it's not only templates, llama 2 goofs had to be recreated and reuploaded due to that borked eps setting or whatever

that would have been a good clue to realize "we've fucked up goys" but no

>>107191532
yeah i can "just do" anything
it just takes TIME
and i don't like it when retards play with my time

wget shitstral-q8.xml <- now was that so hard
Anonymous No.107191688 [Report] >>107191825
>>107191611
wget shitstral-q8-metadata.patch && ./scripts/patch_model.py shitstral-q8.gguf shitstral-q8-metadata.patch <- now was that so hard
Anonymous No.107191759 [Report] >>107191817 >>107191818
sars what's the best model for 48gb vram 64gb ram? some low bpw moe? 70b miqu?
Anonymous No.107191815 [Report] >>107191833
we ended up with split data on goofs because of the multimodal mmproj anyway
we would have been better off having metadata done separately too
this is why imagen people are less retarded, they quickly dropped the idea of single file from the stable diffusion niggers and went all in diffusers
Anonymous No.107191817 [Report]
>>107191759
Nemo FP16
Anonymous No.107191818 [Report] >>107191875
>>107191759
gm sir. glm air best model for ramlet sir
Anonymous No.107191825 [Report] >>107191851
>>107191688
if that sounds better to you, then you can actually be a woman one day, or already are
Anonymous No.107191833 [Report] >>107191860
>>107191815
How about you STFU and just use a framework based on PyTorch instead?
You're not too poor for that, are you?
Anonymous No.107191851 [Report] >>107191904
>>107191825
It is objectively better in that it can be done today, right now, with existing models and doesn't require yet another ggml file format change, dipshit.
Anonymous No.107191860 [Report]
>>107191833
>How about you STFU
not before you lick my anus
Anonymous No.107191875 [Report]
>>107191818
thanks saar, is there anything specific i need to do to get llama.cpp to handle moving the experts in and out of vram/ram to make it fast or is it just works tm?
Anonymous No.107191895 [Report] >>107191899 >>107191906 >>107191933
Le ironic saarposting is just cringe.
Anonymous No.107191898 [Report]
>>107185810
Even basic agentic workflows improve everything, from coding tasks to erp, at the cost of generation time. Smaller models using multistage responses outperform direct chat with a larger model
Anonymous No.107191899 [Report]
>>107191895
saaaaaaar
Anonymous No.107191904 [Report] >>107191930
>>107191851
>objectively better
when someone says this it's almost always guaranteed to be false

however this time it's ok

the point was that niggerganov was confirmed to be a thoughtless retard back then, not what would be the less-braindead workaround right now

and so, llama.cpp doesn't have version numbers TO THIS DAY

no release cycle

code is just simply being shat into the repo daily, testing in prod

this is the guy "designing" goof
Anonymous No.107191906 [Report]
>>107191895
Good Mornings and many blessings of Vishnu for you saar!
Anonymous No.107191930 [Report]
>>107191904
>and so, llama.cpp doesn't have version numbers TO THIS DAY
yeah lcpp had really buggy multimodal across the board for a while (it seems like recent releases finally ironed out all the issues) because of some refactors they were doing on the kv cache / slots mechanisms. the lack of versioning, a proper roadmap, and branching for feature development really makes it all feel chaotic, with no way of ever knowing if you're using a release from a "blessed" time of development where nothing retarded is happening. there's just nothing telling you when the retardation starts and stops; it's like running windows insiders edition
Anonymous No.107191933 [Report] >>107191944
>>107191895
Anonymous No.107191944 [Report]
>>107191933
saar not like this saar this is ai fake made by people who hate saar
Anonymous No.107192130 [Report]
>>107192120
>>107192120
>>107192120
Anonymous No.107192148 [Report] >>107192329
>>107191335
i don't care, you made the claim that it's not possible which is patently false.
Anonymous No.107192305 [Report]
>>107189518
You gotta counterweight 99.999% of written smut (including gayshit) being women against this hobby requiring you to click an exe, so it's 50% women
Anonymous No.107192329 [Report]
>>107192148
retard