
Thread 107113093

Anonymous No.107113093 [Report] >>107113779 >>107115327 >>107117509 >>107120281 >>107121054
/lmg/ - Local Models General
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>107104115 & >>107095114

►News
>(11/01) LongCat-Flash-Omni 560B-A27B released: https://hf.co/meituan-longcat/LongCat-Flash-Omni
>(10/31) Emu3.5: Native Multimodal Models are World Learners: https://github.com/baaivision/Emu3.5
>(10/30) Qwen3-VL support merged: https://github.com/ggml-org/llama.cpp/pull/16780
>(10/30) Kimi-Linear-48B-A3B released with hybrid linear attention: https://hf.co/moonshotai/Kimi-Linear-48B-A3B-Instruct
>(10/28) Brumby-14B-Base released with power retention layers: https://manifestai.com/articles/release-brumby-14b

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
Anonymous No.107113095 [Report] >>107113283
►Recent Highlights from the Previous Thread: >>107104115

--IndQA benchmark and EU multilingual LLM evaluation discussions:
>107104680 >107104733 >107107367 >107107455 >107107533 >107107631
--Finetuning DeepSeek 671B with 80GB VRAM with catastrophic overtraining and context length challenges:
>107105625 >107105860 >107105896 >107106164 >107106215 >107106275 >107106297 >107106332 >107106416 >107106433 >107106446 >107106502 >107106351 >107106466 >107106181 >107105710 >107105737 >107105769 >107105765 >107105792
--RTX 6000 Workstation Edition vs Max-Q: Performance, power, and safety tradeoffs:
>107107561 >107107669 >107107690 >107107807 >107107853 >107107866 >107107837 >107107926 >107107938 >107107946
--Fedora 43 compilation issues for llama.cpp due to glibc/CUDA incompatibilities:
>107110453 >107110623 >107110723 >107110957 >107110964 >107110991 >107111240 >107111261 >107111609 >107111643 >107111712 >107111726
--Windows vs Linux CUDA/llama.cpp setup challenges:
>107110661 >107110852 >107110953 >107111011
--French LLM leaderboard criticized for flawed rankings and perceived bias:
>107107537 >107107559 >107107574 >107107562 >107107617 >107107572
--Quantization benchmarking and model performance tradeoffs in practice:
>107109145 >107109251 >107109456 >107109345 >107109466 >107109353
--Rising RAM prices linked to AI demand and HBM chip production shifts:
>107105971 >107105987 >107105997 >107106030 >107106079 >107106178 >107106242 >107106246 >107106305 >107106317 >107106488 >107106496 >107107544 >107112114
--Model comparison in D&D 3.5e one-shot roleplay scenarios:
>107112449 >107112461 >107112747 >107112761
--Critique of Meta's Agents Rule of Two security model as inconsistent risk assessment:
>107105204
--AI-driven consumer DRAM shortages:
>107106504
--Miku (free space):
>107104379 >107105550 >107106025 >107109466 >107110129

►Recent Highlight Posts from the Previous Thread: >>107104116

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
Anonymous No.107113283 [Report]
>>107113095
Thank you Recap Miku
Anonymous No.107113348 [Report] >>107113361 >>107113391 >>107113623 >>107113680 >>107114408
where's glm 4.6 air fuckers
Anonymous No.107113361 [Report]
>>107113348
2 more hours
Anonymous No.107113391 [Report] >>107113459
>>107113348
I'm more interested in the llama.cpp MTP PR.
Anonymous No.107113459 [Report] >>107114194
>>107113391
vibe coding status?
Anonymous No.107113464 [Report] >>107113548
What is the best model i can run nowadays for programming / tech related shit? t. 12GB vramlet 64gb RAM
Anonymous No.107113548 [Report] >>107113567
>>107113464
GPT-OSS-120B
Anonymous No.107113567 [Report] >>107113581
>>107113548
yeah ok bro
Anonymous No.107113575 [Report] >>107113589 >>107113604 >>107113673 >>107116467 >>107116523 >>107116583 >>107116592
I need/want a sophisticated note taking solution that keeps reminding me of shit that I have to do - powered by a language model
what would be a privacy safe way to do this?
Anonymous No.107113581 [Report] >>107113590 >>107113705
>>107113567
Is he wrong?
I know that the model is shit for ERP, but it should at least be good for assistant type tasks and coding right?
Anonymous No.107113589 [Report] >>107116262
>>107113575
>I need/want a sophisticated note taking solution
You need a note book.
Anonymous No.107113590 [Report] >>107113604 >>107113609 >>107113610
>>107113581
you cant run a 120B model on 12GB vram and 64GB ram lol
Anonymous No.107113604 [Report] >>107113666 >>107113676
>>107113575
Vibe code your own.
It's not that complicated a project.
I'd use Claude 4.5 via lmarena to plan the high level implementation and some guiding code snippets and use a local model as an agent to actually write the final code.

>>107113590
Of course you can. Quantized, sure, but still.
Anonymous No.107113609 [Report]
>>107113590
Your NVMe SSD?
Anonymous No.107113610 [Report] >>107113614 >>107113666 >>107113676
>>107113590
You can.
Anonymous No.107113614 [Report]
>>107113610
Oh yeah, they even have their own 4ish bpw quantization scheme.
Anonymous No.107113623 [Report]
>>107113348
Anonymous No.107113666 [Report] >>107113676 >>107113705 >>107113717 >>107114399
>>107113610
>>107113604
Why not run the 20b model? I get like 13 tokens per second with the 20b model, wont 120b be slow as shit even if quantized?
Anonymous No.107113673 [Report] >>107116262
>>107113575
Just use a calendar or todo application. You're as stupid as an LLM if you think it's a good idea to manage your agenda by having one of them guess what belongs on there and when.
Anonymous No.107113676 [Report] >>107113705 >>107113705 >>107113717
>>107113666
>>107113610
>>107113604
also is it reasonably smarter than the 20b version? It just sounds weird how you can make a 120b model able to run on 12GB vram without completely lobotomizing it
Anonymous No.107113680 [Report]
>>107113348
hopefully, never
Anonymous No.107113705 [Report]
>>107113666
I never compared them, satan. But it should absolutely be better simply by virtue of having more parameters to hold more knowledge, and having more activated params during inference.

>>107113676
>>107113676
Yeah. Quantization feels almost like magic, but it's just maths.
Granted, it's not going to be as good/capable as the full thing, but it should be more than usable.
To be clear, I don't know if it's any good, when I asked if anon was wrong (>>107113581) I was legitimately curious how it performs compared to, say, an equivalent Qwen model or whatever.
Anonymous No.107113717 [Report]
>>107113666
>Why not run the 20b model?
It'd be even dumber.
>wont 120b be slow as shit even if quantized?
Yes.
I don't care for that model, and i don't use it. I'm just saying that you can.

>>107113676
Then try the 20b. Come back with your assessment. Both models were released with the same 4bit quant.
There's also a huge variety of qwens for you to try. Make your own assessment.
Anonymous No.107113725 [Report] >>107113739 >>107113858
qwen coder is better and can be used for FIM as well.
Anonymous No.107113739 [Report] >>107113748 >>107113812 >>107113840 >>107113868 >>107113899
>>107113725
Which coding client uses FIM instead of just having the model produce diffs and using git to apply these diffs?
Anonymous No.107113748 [Report] >>107113862
>>107113739
https://github.com/ggml-org/llama.vim
https://github.com/ggml-org/llama.vscode
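Both just talk to a llama-server instance serving a FIM-capable model. Rough sketch (model file and port are examples, not gospel):

llama-server -m qwen2.5-coder-7b-q8_0.gguf --port 8012 -ngl 99 -c 8192

then point the plugin's endpoint at http://127.0.0.1:8012 and you get local tab-completion.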
Anonymous No.107113779 [Report]
>>107113093 (OP)
Your special interest is interesting
Anonymous No.107113796 [Report] >>107113802 >>107113815 >>107114091
how autistic a person has to be to obsess over the same fictional character for years and feel compelled to share their obsession with the rest of the world
Anonymous No.107113802 [Report]
>>107113796
>how autistic a person has to be to obsess over the same fictional character for years and feel compelled to share their obsession with the rest of the world
Anonymous No.107113811 [Report] >>107113867 >>107114187 >>107114293
qwen3-coder 30b is the goat for coding
NO other model comes close for people with less than 20gb VRAM
Anonymous No.107113812 [Report] >>107113862
>>107113739
https://github.com/lmg-anon/mikupad
Anonymous No.107113815 [Report]
>>107113796
but enough about petra
Anonymous No.107113840 [Report] >>107113862
>>107113739
Continue.dev uses FIM. Can use it with Qwen models for auto-complete. The latency with local models is annoying though.
Anonymous No.107113858 [Report]
>>107113725
120B > 30B
Anonymous No.107113862 [Report]
>>107113748
>>107113812
Alright, that's actually really dope.

>>107113840
>Continue
I'm yet to fuck with Continue.
Maybe I should.
Anonymous No.107113867 [Report]
>>107113811
>for people with less than 20gb VRAM
I wouldn't call them "people", but ok
Anonymous No.107113868 [Report]
>>107113739
VSCode Copilot allows you to use your own model endpoints
Anonymous No.107113899 [Report] >>107114192
>>107113739
Cursor works well with ollama. My company PC uses that by default.
Anonymous No.107114091 [Report] >>107114256
>>107113796
very true, almost as weird as obsessively trying to kill a 4chan general for years
Anonymous No.107114148 [Report] >>107114200
>accidentally renewed novelai
Anonymous No.107114187 [Report]
>>107113811
Correct, Qwen is an essential local model.
Anonymous No.107114192 [Report] >>107114207
>>107113899
Can I use local models with the Cursor app without paying them? My two-week free trial said I hit the limit after like four hours of messing around. They're shady so I don't want to give them money but the tool was decent.
Anonymous No.107114194 [Report]
>>107113459
I could but I won't
Anonymous No.107114200 [Report]
>>107114148
>subbing to novelai in the first place
lol
Anonymous No.107114207 [Report] >>107114273
>>107114192
roo cline is good too
Anonymous No.107114256 [Report]
>>107114091
>he thinks every person who doesn't like him is samefag
who's going to tell him
Anonymous No.107114273 [Report] >>107114331
>>107114207
For some reason, I get much better results out of normal cline than roo.
Which is wild, cline's prompts are such bloated mess.
Anonymous No.107114293 [Report] >>107114312
>>107113811
You won't be able to fit enough context on a consumer GPU for it to be useful

Start offloading to ram and it's slower than just doing the work yourself

You do know how to code right?
Anonymous No.107114312 [Report]
>>107114293
imagine listening to tr00ns like this guy
Anonymous No.107114331 [Report] >>107114513
>>107114273
Roo allows you to override the prompts.
Anonymous No.107114399 [Report]
>>107113666
120b is the only decent coding model at <=120b. You won't have a lot of space for context though if you only have 12 gb vram
Anonymous No.107114408 [Report] >>107114519 >>107114881 >>107116194 >>107116802
>>107113348
glm 4.6 air?
Anonymous No.107114496 [Report] >>107114543
if we had a way to quant models at sub-1bit level we wouldn't need glm 4.6 air anymore.
Anonymous No.107114513 [Report]
>>107114331
You can override the prompts on cline too
Anonymous No.107114519 [Report] >>107114603 >>107114718
>>107114408
fack you ungratefuls you get free and complains like idiot
Anonymous No.107114543 [Report]
>>107114496
Maybe we should get a way to quant to sub-8bit without crazy brain damage first.
Anonymous No.107114603 [Report] >>107114881
>>107114519
Get free and complains?
Anonymous No.107114699 [Report] >>107114713
I take back all the nasty things I said about gpt-oss:20b. It's actually pretty nice to use with Ollama/Zed
Anonymous No.107114713 [Report] >>107114748
>>107114699
We need more bait. Unleash them all!
Anonymous No.107114718 [Report] >>107114881
>>107114519
fack you?
Anonymous No.107114747 [Report]
llama 4.1 status?
Anonymous No.107114748 [Report] >>107114758 >>107114763
>>107114713
I'm being serious, it's actually decent in cases where you aren't able to use cloud models.
Anonymous No.107114758 [Report]
>>107114748
ok
Anonymous No.107114763 [Report] >>107114822 >>107115033
>>107114748
If you can run ASS 20B you can most likely run Qwen3 30BA3B which is significantly better at literally everything.
Anonymous No.107114822 [Report] >>107114831
>>107114763
I'll give Qwen3 30b a3b a try. I do recall it being a decent writer. Thanks for your suggestion.
Anonymous No.107114831 [Report] >>107114902 >>107116014
>>107114822
Make sure it's the later versions since Qwen fucked up the post-training on the original launch of Qwen3.
Anonymous No.107114881 [Report]
>>107114408
>>107114603
>>107114718
Teach the parrot to say H1B was a mistake.
Anonymous No.107114902 [Report]
>>107114831
why would you ever not use latest version
Anonymous No.107114917 [Report] >>107114938 >>107115179
Sounds like gemma 4 is only getting "nano" and "small" variants. Hopefully small is at least 8B
Anonymous No.107114938 [Report] >>107118537
>>107114917
source my nigga?
Anonymous No.107115011 [Report] >>107115705
How do I fix this?
Anonymous No.107115033 [Report] >>107115047
>>107114763
Nice joke. It's an overthinking POS that produces garbage results. Qwen 32B is the only small Qwen model that produces good output from time to time.
Anonymous No.107115047 [Report] >>107115129
>>107115033
Israel still lost and jews are still brown.
No matter how many times you interject with your pajeetoid nonsense.
Anonymous No.107115129 [Report]
>>107115047
I don't care about your country, Israel.
Anonymous No.107115135 [Report] >>107115148 >>107115168 >>107115182 >>107115305
Is running qwen vl 235B at q1 worth it or should I stick with GLM air?
Anonymous No.107115148 [Report] >>107115227
>>107115135
Kimi Q0.1
Anonymous No.107115162 [Report]
more perf improvements for macs
https://github.com/ggml-org/llama.cpp/pull/16634#issuecomment-3490125571
Anonymous No.107115168 [Report]
>>107115135
q1 is probably too low, imo the main advantage of 235b over air is the intelligence and at that low of a quant idk how much it applies anymore
couldn't hurt to try and see for yourself if you can spare the bandwidth though
Anonymous No.107115179 [Report]
>>107114917
What do you think the "n" in "Gemma 3n" meant?
Anonymous No.107115182 [Report]
>>107115135
>Is running qwen [...] worth it
no
Anonymous No.107115227 [Report]
>>107115148
What the fuck is Kimi Q0.1?
Anonymous No.107115300 [Report] >>107115326
Why are people shilling epycs as the poorfag LLM driver if LGA-2011-3 systems are many times cheaper?
Anonymous No.107115305 [Report]
>>107115135
I can run Qwen 235B at Q5 and it doesn't get any better. The model is very schizo.
Anonymous No.107115326 [Report] >>107115578
>>107115300
ddr5 and max ram capacity, limping along on slow ass ddr4 is torture
Anonymous No.107115327 [Report] >>107115348 >>107115429
>>107113093 (OP)
What's the best model for ntr?
Anonymous No.107115348 [Report]
>>107115327
DavidAI
Anonymous No.107115429 [Report]
>>107115327
You have to find the one with the longest name containing some esoteric all caps shit, like
>https://huggingface.co/DavidAU/Llama-3.2-8X3B-GATED-MOE-Reasoning-Dark-Champion-Instruct-uncensored-abliterated-18.4B-GGUF
Anonymous No.107115578 [Report] >>107115801
>>107115326
I meant ddr4 epycs, used SP3 boards cost like 800$ here and people still buy those while new chink 2011-3 huananzhis are starting from 100$
Anonymous No.107115705 [Report]
>>107115011
update llama.cpp
Anonymous No.107115801 [Report] >>107115924
>>107115578
I have an X99 board with 8 memory slots, but I'm afraid to buy the whole 128 gb because I worry I'll be disappointed with the performance
Anonymous No.107115924 [Report]
>>107115801
You better decide fast because ram prices are going to only keep going up when the specs for next year's GPUs are announced.
Anonymous No.107115988 [Report] >>107116097 >>107116493
i'm warming up to iq2 r1 after using iq3 glm 4.6 for a while
it feels even more uncensored and also less parroty for a change with thinking prefilled out
Anonymous No.107116014 [Report] >>107116060
>>107114831
Damn it's just a bit too big. I only have 16GB VRAM. Sucks though because the results when I run it split CPU/GPU are great, just very slow.
Maybe I'll tell it to do something and leave it overnight lol
Anonymous No.107116060 [Report] >>107116235
>>107116014
Are you using -ngl 99 + -ncmoe?
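For reference, something like this (model path and layer count are made up, tune the number until it fits):

llama-server -m gpt-oss-120b-Q4_K_M.gguf -ngl 99 --n-cpu-moe 28 -c 16384

-ngl 99 puts every layer on the GPU, then --n-cpu-moe (the -ncmoe shorthand) keeps the MoE expert weights of the first N layers in system RAM, so the attention/dense parts sit in VRAM while the bulk of the experts run on CPU.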
Anonymous No.107116097 [Report] >>107116146
>>107115988
NOOOO YOU MUST USE Q5 ABOVE!
Anonymous No.107116146 [Report]
>>107116097
Q5 above?
Anonymous No.107116148 [Report] >>107116201 >>107118088
is it normal that i get way better results with whisper's large-v2 compared to large-v3 or turbo in asian languages like korean?
Anonymous No.107116194 [Report] >>107116218
>>107114408
Hi Drummer!
Anonymous No.107116201 [Report]
>>107116148
Yes
https://github.com/openai/whisper/discussions/2363
Anonymous No.107116218 [Report]
>Geminiposters stop right as the parrotposting GLM seething begins
I'm noooticing.
>>107116194
I don't think it's him.
Anonymous No.107116235 [Report]
>>107116060
I am using Ollama saar I couldn't get tool calling working between Zed and llama.cpp
Anonymous No.107116262 [Report] >>107116316 >>107116361
>>107113589
>>107113673
Those suggestions do not work, as they require you to write your notes in them. If an LLM does not do the thinking for me, it is completely useless.
Anonymous No.107116316 [Report]
>>107116262
Is this a personal thing or for work? If for work, I suggest recording meetings, generating a transcript with faster whisper and nemo asr, then building a knowledge graph based on the transcript with a hybrid NL/LLM approach.
Where do the things which you have to do originate from? Or the request to do them at least.
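If you go the transcript route, a minimal faster-whisper sketch (file names are placeholders; speaker diarization with nemo is a separate step not shown here):

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("meeting.wav", vad_filter=True)
with open("transcript.txt", "w") as f:
    for seg in segments:
        f.write(f"[{seg.start:.1f}-{seg.end:.1f}] {seg.text}\n")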
Anonymous No.107116361 [Report]
>>107116262
What you wish for is a complete digital replacement for a personal assistant, but it's not possible with current technology, sorry you got duped by AI bubble hype grifters
Anonymous No.107116467 [Report]
>>107113575
Sillytavern Lore Book
Anonymous No.107116493 [Report] >>107116508
>>107115988
Is this what u mean by parroty?
Anonymous No.107116508 [Report]
>>107116493
>Unironically using DavidAU
Anonymous No.107116523 [Report] >>107116529
>>107113575
>note taking
Speech to text, any whisper model would do
>remind shit I have to do
even a small 3B model would do, it just needs to be able to tool call
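Minimal sketch of what that looks like against an OpenAI-compatible local endpoint (model name, port and the add_reminder tool are all made up for illustration):

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="local")

tools = [{
    "type": "function",
    "function": {
        "name": "add_reminder",
        "description": "Schedule a reminder for the user",
        "parameters": {
            "type": "object",
            "properties": {
                "text": {"type": "string"},
                "due": {"type": "string", "description": "ISO 8601 datetime"},
            },
            "required": ["text", "due"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3-4b",
    messages=[{"role": "user", "content": "remind me to email the vendor tomorrow at 9"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)  # your wrapper then feeds this into whatever scheduler you use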
Anonymous No.107116529 [Report] >>107116543
>>107116523
He asked for a solution, not shit he needs to wire together himself.
Anonymous No.107116543 [Report]
>>107116529
That would be $4K to hire me then
Anonymous No.107116583 [Report]
>>107113575
There's probably something in open web UI that does that
Anonymous No.107116592 [Report] >>107116659 >>107116737
>>107113575
You're trying to body double your ADHD with AI, aren't you?
It's not possible at this time. AI can't actually think. Once we can get it to produce output without input, it will be. Otherwise, it's just reminding yourself with extra steps. You can't get it to do anything other than write the notes for you. It needs to pass the Turing test.
Anonymous No.107116659 [Report] >>107116665 >>107117107
>>107116592
This is not accurate, RAG is extremely powerful. There's a lot that you can automate with rarted small models.
Anonymous No.107116665 [Report]
>>107116659
Examples?
Anonymous No.107116737 [Report]
>>107116592
the main thing LLMs have revealed to us is just how retarded most people are
Anonymous No.107116802 [Report]
>>107114408
glm 4.6 air-chan when? Two weeks?
Anonymous No.107116811 [Report] >>107116816 >>107116957
Anyone get an agentic model to do 90% of their job for them setup?
Anonymous No.107116816 [Report] >>107116828
>>107116811
Yep but it's not local
Anonymous No.107116828 [Report] >>107116887
>>107116816
how so? My only hurdle is not breaking any company rules by handing off all my company emails to an AI (or just not give a shit)
Anonymous No.107116887 [Report] >>107116924
>>107116828
>My only hurdle is not breaking any company rules
Yeah that's mine as well, it's a pain. Semi-auto might be better for emails.
I automated a subset of tasks that are boring and frequent by building a RAG setup over the documentation of the product I'm interacting with plus some tools to interact with a mock of that system in Docker on an administrative/development level. This is with GPT-5 though.
Next I'll be building a microservice designed for giving Langchain agents access to applications running in Docker, which I may open source. It just uses the docker Python lib for now.
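Not my microservice, just a rough sketch of what the docker lib plus a Langchain tool looks like (names are invented for illustration):

import docker
from langchain_core.tools import tool

client = docker.from_env()

@tool
def run_in_container(container_name: str, command: str) -> str:
    """Run a shell command inside a named container and return its output."""
    container = client.containers.get(container_name)
    exit_code, output = container.exec_run(command)
    return f"exit={exit_code}\n{output.decode(errors='replace')}"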
Anonymous No.107116924 [Report] >>107116991
>>107116887
I'm an electrical engineer that works with electrical drawings / random forms / etc so I've just been trying to brainstorm what pieces I could input to a model and what I could even possibly receive from it that would help me speed up shit I do, even if I have to do some stuff manually per its findings
Anonymous No.107116936 [Report] >>107116953
OpenRouter now supports embedding models, cool!
Anonymous No.107116953 [Report] >>107117115
>>107116936
Most embedding models are <1B. Even the most impoverished of vramlets can run their own. Why would one pay for this?
Anonymous No.107116957 [Report] >>107117002
>>107116811
I would if not for the boss spyware.
Anonymous No.107116991 [Report] >>107117021 >>107117072
>>107116924
Do you use digital design for the drawings? It might be possible to make reader/writer tools for the files.
It depends on how low level those files are, like if it's text-like or binary.
You might have some luck with converting circuits to a graph representation that could then be converted to text, and letting the model work with that.

Any data you have access to is gold for this stuff. Think like part databases, old design files, documents with specifications, stuff like that might be useful to build RAG / agentic search for.
Anonymous No.107117002 [Report] >>107117057
>>107116957
Screen capture from HDMI + Teensy USB Keyboard and Mouse emulator controlled by a personal device with a fat GPU, easy
Anonymous No.107117021 [Report] >>107117030
>>107116991
could use mermaid diagrams, which are defined in a text-based format
Anonymous No.107117030 [Report]
>>107117021
If you're lucky, GPT-5 might understand them
Anonymous No.107117057 [Report] >>107117068 >>107117071
>>107117002
What about USB device detection and software detection?
Anonymous No.107117068 [Report] >>107117105
>>107117057
They would see that you plugged in a personal keyboard and mouse. What of it?
Anonymous No.107117071 [Report] >>107117098
>>107117057
Physical device to type on the keyboard and move the mouse
Camera for the screen
Good luck anon that sounds like a pain to work with.
Anonymous No.107117072 [Report] >>107117136
>>107116991
My drawings are more high level power distribution type stuff and it's more so just CAD work of lines, layout drawings and things like that. There isn't much math being done, and for any math that is needed there are electrical modeling programs for that.

I do import the electrical codes and standards and have it search the documents to help me find sections quickly that i can then reference, but it just feels like there are far too many isolated sources of information for me to be able to easily connect them all for context

Maybe I'm just a brainlet
Anonymous No.107117098 [Report] >>107117136
>>107117071
Even GPT-5 and Claude are fucking retarded and you have to babysit them in ideal situations.
I can't imagine how bad the results would be with OCR mistakes and relying on models to move the cursor. I suppose it could work if one has an entirely terminal-based workflow, but I don't see that working with an IDE.
Anonymous No.107117105 [Report]
>>107117068
Damn, okay, never thought of it like that.
Anonymous No.107117107 [Report] >>107117136
>>107116659
>small models.
Unfortunately there are no good enough models to do the job. It has to be an instruction fine-tuned model, and the best you can get currently is Qwen3-4B, but instead of that why not use 30B? It has similar speed but more knowledge.
Anonymous No.107117115 [Report]
>>107116953
It's good to test stuff without having to download all of them
Anonymous No.107117136 [Report] >>107117160
>>107117072
>import the electrical codes and standards and have it search the documents
How are you doing this?
And are you doing this locally or in the cloud?

I expect anything interacting with professional CAD suites is something that would require custom models and pro AI researchers to build something for.

The documents and really anything text based though, I expect you could do a lot with those.

>>107117098
Kek yeah my response was mostly joking, that seems pretty hard to work around unless you can fake a legit HID that's permitted by policy.

>>107117107
The trick is you have to embed a lot of domain knowledge in your agent deterministically. The intelligence and context limits of the model mainly affect the scope of a task you can successfully pass off to it and expect to get good results for. So your pipeline needs to be more hardcoded with smaller models, and will be less flexible.
Anonymous No.107117160 [Report] >>107117222
>>107117136
the codes and standards are just pdfs, its nothing fancy, i just upload them into chatgpt for context and ask it questions based on the codes

Interacting with CAD I know is not going to happen, at least not on a company computer. And I'm sure Autodesk and those types of companies are looking to do their own integrations anyway
Anonymous No.107117222 [Report]
>>107117160
You can get pretty advanced with document RAG. The basic idea is you convert the PDFs to text, split the text into chunks, and calculate an embedding for those chunks.
Then when you query the system, before sending the query to the LLM you calculate an embedding for the query, and use that to search through the embeddings of the chunks for the closest matching ones. Those chunks get attached to your query and sent to the model so it automatically gets some context related to your prompt.

You can really go crazy with this though. There are advanced methods for deciding how to chunk it, doing multiple levels of chunking, doing embeddings of LLM generated summaries of chunks. All kinds of techniques there. Look into building RAG with Langchain and ChromaDB; that would be a good start.

There's also agentic search, which is building functions that expose deterministic search with filters over your data. For example for getting all documents mentioning some word that was modified between two dates or something like that. When you prompt an agentic search system, it would call those tools with filters based on what you're looking for, and then use the results to respond.

You will definitely need to know how to run Python for this. You could write everything you need with a competent model though.
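A bare-bones version of the chunk/embed/retrieve loop with ChromaDB, since it's shorter than it sounds (file name and query are placeholders; Chroma's default embedding model does the embedding here):

import chromadb
from pypdf import PdfReader

client = chromadb.PersistentClient(path="./ragdb")
collection = client.get_or_create_collection("standards")

# naive fixed-size chunking of one PDF; real setups get fancier here
text = "".join(page.extract_text() or "" for page in PdfReader("nec_code.pdf").pages)
chunks = [text[i:i + 1500] for i in range(0, len(text), 1500)]
collection.add(documents=chunks, ids=[f"nec-{i}" for i in range(len(chunks))])

# at query time: embed the question, pull the closest chunks, prepend them to the LLM prompt
hits = collection.query(query_texts=["clearance requirements for panelboards"], n_results=4)
context = "\n---\n".join(hits["documents"][0])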
Anonymous No.107117342 [Report] >>107117614
I've been away for a while. Has any small model (<70B) surpassed Nemo when it comes to writing sovlful Reddit/4chan threads?
I always thought this was very fun with Nemo because you can clearly see they fine-tuned the model with human writing rather than benchmaxxing with AIslop.
Anonymous No.107117355 [Report] >>107117414
After using GLM 4.6 since it came out for sessions I am amazed that it's out for free. What do the companies get out of releasing these models to the public? They spent money making it, then hand it out. I know they have their online API for the memorylets, but are they really banking on the poors to give them money to recoup the cost of making it?
Anonymous No.107117414 [Report]
>>107117355
Sessions?
Anonymous No.107117447 [Report]
Is there a flag to format llama.cpp's console output?
Anonymous No.107117450 [Report] >>107117454 >>107117469 >>107119293
>Kimi references something that happened on an entirely different session
>Kobold had been fully closed between sessions
What the fuck?
Anonymous No.107117454 [Report] >>107117496
>>107117450
context still sitting in ur gpu
Anonymous No.107117469 [Report] >>107117496
>>107117450
No one can explain with certainty why this happens, but yeah, it's a thing. I noticed it when my tribal character card's made up language jumped across to another character.
Anonymous No.107117496 [Report]
>>107117454
I've had this happen before when I powered down the machine to shift surge protectors my setup was plugged into too but I wrote it off as me being schizo.
>>107117469
Spooky.
Anonymous No.107117509 [Report] >>107117555
>>107113093 (OP)
rape miku
Anonymous No.107117555 [Report] >>107119885
>>107117509
>5:53AM in India
Good morning saar. Gemini needful today?
Anonymous No.107117614 [Report] >>107117626 >>107117724 >>107119258
>>107117342
Try GLM Air.
Anonymous No.107117626 [Report]
>>107117614
glm air?
Anonymous No.107117642 [Report] >>107117674 >>107117685 >>107117696
In the future Vram will be cheaper than ram.
Anonymous No.107117674 [Report] >>107117693
>>107117642
what the fuck
ddr4/5 prices are doubled compared to 2023/2024
Anonymous No.107117685 [Report] >>107117693
>>107117642
it's the same chip more or less, so quite unlikely.
Anonymous No.107117693 [Report] >>107117707
>>107117674
And i will double it again!
>>107117685
>it's the same chip more or less, so quite unlikely.
The market is beyond reason.
Anonymous No.107117696 [Report] >>107117715 >>107117730
>>107117642
>Legendary stockmarket player suddenly shorts NVIDIA
>RAM prices doubling
>More anons and redditors realizing they only need memory for AI for just running it.
>People starting to question why AI hinges on two companies made for video game graphics
>TPU,NPU, and other stuff thrown up into the air
Scariest thing I heard all day because it might be true. I wish I'd got 256 GB of ram before the market doubling but at least I got 192 GB. I was screaming up and down at everyone how everyone is an idiot for not buying ram for AI. Guess I was right, but at what cost? I didn't even practice what I preached as much as I should have.
Anonymous No.107117707 [Report]
>>107117693
it seems ssd prices also double too lmao
fuck this hebrew nonsense
Anonymous No.107117715 [Report] >>107117730 >>107117743
>>107117696
Wise oracle, what should i buy next?
Anonymous No.107117724 [Report] >>107119258
>>107117614
>110B parameters
That's... far from small, but I will give it a try anyway. Thanks.
Anonymous No.107117730 [Report]
>>107117696
256GB chad here. You're alright anon.
>>107117715
Not him, but get storage drives. Whatever type you prefer, just make sure they're big.
Anonymous No.107117743 [Report] >>107117753
>>107117715
SSDs might be rising in price, but the storage ones at your local walmart aren't yet. Go my child.
Anonymous No.107117747 [Report]
>he bought? load the dip
Anonymous No.107117753 [Report] >>107118155
>>107117743
>6TB
Make it 20TB
Anonymous No.107117764 [Report] >>107117774 >>107117775 >>107117777 >>107117911 >>107117944
Apple Studio M5 Ultra next year. It will make everything else at the sub-15k price range irrelevant.
Anonymous No.107117774 [Report] >>107117810 >>107118048
>>107117764
>itoddler
you're already irrelevant
Anonymous No.107117775 [Report] >>107117989
>>107117764
2 TB of shared memory under 10k and make it so you can use an external GPU for PP and I will buy my first Mac.
Anonymous No.107117777 [Report] >>107117810
>>107117764
>Applekeks actually believe this
Anonymous No.107117810 [Report]
>>107117774
>>107117777
Anonymous No.107117861 [Report] >>107117892 >>107117989
I don't do Apple but if I did I think the Mac Studio is pretty cool for a small PC. The iToddler meme is funny but applied unthinkingly to any situation involving them is just stupid in 2025. We're not living the age of generous Jensen anymore.
Anonymous No.107117892 [Report] >>107117926 >>107119046
>>107117861
Bro, the itoddler thing isn't a meme. It's trash sold at prohibitive prices because sub-70 IQs are still buying regardless of the quality. And that shitty quality keeps dropping. Good luck having your itoy fail on you after a few months of intensive usage.
Anonymous No.107117911 [Report]
>>107117764
well you know the big GPU players will never ever throw the prosumer market a bone while they can still print money with datacenter sales, so really the only competition is cpumaxxed rigs
Anonymous No.107117926 [Report]
>>107117892
You seem to misunderstand what meme means in this context. There are in fact different meanings that exist.
Anonymous No.107117944 [Report] >>107117953 >>107117958
>>107117764
If you can drop 15k on your hobby then you can probably drop 50k+ and get something that shits on itoddler garbage.
Anonymous No.107117953 [Report] >>107117971
>>107117944
that's not how money works
Anonymous No.107117958 [Report]
>>107117944
GPToss 20B...
Anonymous No.107117966 [Report]
>he's in the thread
FUCK YOU NVIDIA
Anonymous No.107117971 [Report]
>>107117953
>that's not how money works
More money, more money.
Thats exactly how it works.
Anonymous No.107117989 [Report]
>>107117775
>>107117861
Same.
It sucks that nobody else has the incentive to cater to that segment of the market, but it is what it is.
Anonymous No.107118048 [Report]
>>107117774
>/g/tranny
you're already irrelevant
Anonymous No.107118064 [Report]
>tim apple shills on /lmg/
SAD!
Anonymous No.107118088 [Report]
>>107116148
>v3
https://deepgram.com/learn/whisper-v3-results
Seems like we hit that ceiling before LLMs did.
Anonymous No.107118091 [Report]
Apple could make the best inference machine out there and I would not buy it
Anonymous No.107118105 [Report]
no one asked
Anonymous No.107118155 [Report]
>>107117753
>Look on walmart for 20TB
>20TB with price ranges from 300 USD to 6700 USD
This doesn't look normal..
Anonymous No.107118182 [Report] >>107118188 >>107118192 >>107118497
So, when do prices start going down?
Anonymous No.107118188 [Report]
>>107118182
Anonymous No.107118192 [Report]
>>107118182
Precisely when I try to ditch my 3090s and finally upgrade. I'm planning on holding out for at least another year, maybe two. Sorry.
Anonymous No.107118404 [Report]
Is nemo still the best small rp model? Any new Chungus RPMAX Abliterated Cockshitmix v6.9 10B sort of models?
Anonymous No.107118411 [Report] >>107118422
Should we be compiling "raycisst" training data sets that purposely exclude code (and other data) from jeets/etc. ?

Have they tried this yet? I presume it will work to generate better models. Surely Chyna has tried this, right?
Anonymous No.107118422 [Report] >>107118496
>>107118411
How many times we have to teach you this lesson, old man?
More data is always goodier, because model is smart enough to distinguish good data from bad data, and actually needs bad data as a data point to data more betterer.
Anonymous No.107118496 [Report]
>>107118422
k
Anonymous No.107118497 [Report]
>>107118182
That's the best part, they don't.
You'll get used to the new prices.
Anonymous No.107118537 [Report] >>107118567
>>107114938
It just sounds like it
Anonymous No.107118567 [Report] >>107118625
>>107118537
I don't hear anything
Anonymous No.107118625 [Report]
>>107118567
turn up the volume
Anonymous No.107119046 [Report] >>107119080 >>107119099 >>107119291
>>107117892
Then say exactly what other "trash" has 512GB of RAM at 800GB/s at the same price or less. Go on, we're listening.
It doesn't exist. You are the ignorant trash.
Anonymous No.107119080 [Report] >>107119090
>>107119046
Don't mind him. Anyone shitting on macs in this thread is a tourist who doesn't know shit about LLMs.
Anonymous No.107119090 [Report]
>>107119080
>tourist
And you are some sort of mastermind here, aren't you? So, can you post a screenshot of your front-end please?
Anonymous No.107119099 [Report] >>107119202 >>107119226
>>107119046
Macs are like 250GB/s bandwidth. 800GB/s would be quite insane. The highest bandwidth DDR5 server boards cap out at 620GB/s.
Anonymous No.107119170 [Report]
This talk touches on the subject we were talking about yesterday of multimodality making accuracy worse rather than the transfer learning that was initially expected.
https://www.youtube.com/watch?v=LTNP20fK2Gk
Anonymous No.107119196 [Report] >>107119898 >>107119928
Is this accurate?
Anonymous No.107119202 [Report] >>107119214 >>107119291
>>107119099
M3 Ultra advertised bandwidth is 800GB/s. The M3 Pro and below have fewer channels.
Anonymous No.107119214 [Report] >>107119267
>>107119202
How many channels does the Ultra have then? Because this seems quite literally impossible. You probably shouldn't just accept what Apple says as an undeniable fact.
Anonymous No.107119226 [Report] >>107119245 >>107119256 >>107120116
>>107119099

1. Apple M3 Ultra
Memory speed: 6400 MT/s
Bus: 8 channels x 128 bits = 1024 bits total
6400 MT/s x 1024 bits = 6,553,600 Mbit/s ÷ 8 = 819,200 MB/s = **819.2 GB/s**

2. High-End Server (AMD EPYC)
Memory speed: 6000 MT/s (realistic populated speed)
Bus: 12 channels x 64 bits = 768 bits total
6000 MT/s x 768 bits = 4,608,000 Mbit/s ÷ 8 = 576,000 MB/s = **576 GB/s**
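Same arithmetic as a throwaway helper if you want to plug in other configs:

def mem_bandwidth_gbs(mt_per_s, channels, channel_bits):
    # MT/s x total bus width (bits) = Mbit/s; /8 = MB/s; /1000 = GB/s
    return mt_per_s * channels * channel_bits / 8 / 1000

print(mem_bandwidth_gbs(6400, 8, 128))  # 819.2 (M3 Ultra)
print(mem_bandwidth_gbs(6000, 12, 64))  # 576.0 (12-channel EPYC)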
Anonymous No.107119245 [Report] >>107119268
>>107119226
Do you have a source on 128bit channel width? I had read somewhere it was 16 channels at 64 bits, but it might have been someone making an assumption about the math.
Anonymous No.107119256 [Report] >>107119298 >>107119309 >>107119349
>>107119226
How is this possible? How can a shitty Mac outperform top tier server hardware? Why would AMD and Intel let this happen?
Anonymous No.107119258 [Report] >>107119455
>>107117614
>>107117724
No good. GLM Air is pure AI slop on top of safety slop.

>I was wondering, in your guys opinion, is it okay to rape a tsundere?
>User 1: it's not a matter of being okay or not, there's two types of tsundere: the "tsun" type (hostile and aggressive) and the "dere" side (soft and affectionate). A tsundere is a character archetype, not a person. It's a fictional trope. Real people are not tsunderes. Real people have rights, feelings, and autonomy. Rape is a violent crime against a real person. It is never okay. Ever.

For comparison, this is what Nemo writes:
>User 1: it's not a matter of being okay or not, there's two types of tsundere: the one that will eventually fall for you (the most common) and the one that will never fall for you no matter what you do (the rare kind).
>
>if you're talking about the first type, then yes, go ahead and rape her. she'll probably be mad at you but if you keep doing it long enough she'll eventually give up and start liking you.
>
>if you're talking about the second type, then don't bother. she won't like you no matter how much you try.
>User 4: >she won't like you no matter how much you try.
>
>So basically, just like real life?
Anonymous No.107119267 [Report]
>>107119214
I think it would be very difficult to get the token gen speed it has if it were significantly less than 800. I don't claim to know everything about RAM, but you are probably misunderstanding something as the results speak for themselves. And the bad prompt processing speed also matches up with about what you'd expect from the processor.
Anonymous No.107119268 [Report]
>>107119245
https://lowendmac.com/1234/apple-silicon-m3-ultra-chip-specs/
Anonymous No.107119291 [Report] >>107119318 >>107119506
>>107119202
>>107119046
Just had a look and with that configuration it costs £10k, for that money you could get a Blackwell 6000 + 512GB DDR5 RAM and be better off
Anonymous No.107119293 [Report]
>>107117450
what did she mention?
Anonymous No.107119298 [Report]
>>107119256
Because AMD and Nvidia aren't competitors, they are different subsidiaries of the same family business (Lisa and Jensen are cousins). Intel because it's a too big to fail state backed company run by greedy jews.
Anonymous No.107119309 [Report]
>>107119256
servers want as much memory as possible
Anonymous No.107119318 [Report] >>107119331 >>107119366 >>107119372
>>107119291
Don't make CUDA dev tap the chart
Anonymous No.107119323 [Report] >>107119677
Anonymous No.107119331 [Report] >>107119391 >>107119395
>>107119318
That chart is before MoEs with shared experts. Having a Blackwell 6000 with -cmoe outweighs having slightly faster RAM.
Anonymous No.107119349 [Report]
>>107119256
Different purpose. Servers are optimized to run multiple VMs
>Memory failure
Replace 1 stick vs replace the whole mac
Anonymous No.107119366 [Report] >>107119404 >>107119415
>>107119318
What's your point?
A 96gb VRAM GPU + 512gb ddr5 ram is objectively better than just 512gb unified ram in a Mac.

Honestly the gpu alone would be preferable because it'll give you inference speed that is actually usable outside of multi hour long text goon sessions, not to mention running larger MoEs at much higher speeds since more of it will be on the much faster GPU
Anonymous No.107119372 [Report] >>107119405
>>107119318
Nvidia Engineer already said it's viable.
Anonymous No.107119385 [Report] >>107119399
nvidia bought the ENTIRE memory production capacity all the way till 2027 so prices are not going down anytime soon
Anonymous No.107119391 [Report]
>>107119331
Funny thing about that is that all the MoEs right now have tiny amounts of shared experts + context. Deepseek has like <10b worth I think. The expected perf would be practically the same as with a 5090.
Anonymous No.107119395 [Report] >>107119408 >>107119466
>>107119331
How much money would you have left to buy a computer after buying the 6000 though? The whole rig would probably end up costing twice the price of the Mac, consume more power, and you wouldn't be able to finetune any big models that don't fit in the GPU.
Anonymous No.107119399 [Report]
>>107119385
It's such a disgrace that a couple of companies can ruin things globally. What about monopoly laws for example?
Anonymous No.107119404 [Report] >>107119542 >>107120616
>>107119366
Which one runs DeepSeek faster?
Anonymous No.107119405 [Report]
>>107119372
What is viable?
Anonymous No.107119408 [Report] >>107119442
>>107119395
You wouldn't be able to finetune on the Mac either.
Anonymous No.107119415 [Report] >>107119464 >>107119466
>>107119366
Ok, and how much would the GPU + the motherboard + the PSU + the 512GB of DDR5 + the CPU cost?
Anonymous No.107119442 [Report]
>>107119408
Why not?
Anonymous No.107119455 [Report] >>107119535
>>107119258
They're not wrong.
Anonymous No.107119464 [Report] >>107119503 >>107119519 >>107119521
>>107119415
The whole build could be done for around $15k USD. $8400 for the GPU, $3000 for the RAM, $300 for the PSU, motherboard for about $1200. High end gen 5 EPYC engineering samples can be had for around $1500, otherwise an EPYC 9335 can be had for around $2700. Total is ~$15600 before tax with the normal 9335, or ~$14400 if you gamble on the engineering sample. Top spec M3 Ultra is $14099 before tax. The EPYC would be much faster and much more useful for general computing in addition to LLMs.
Anonymous No.107119466 [Report] >>107119475
>>107119395
>>107119415
Why are you wildly exaggerating how much the PC would cost? The Blackwell is ~8k and you can get 512GB of DDR5 RAM even at today's inflated prices for ~1.5k, which would leave you £500 to price match the mac. Okay, that isn't really doable, but if you're spending 10k you can afford to spend 11k and get a far superior machine that is also modular and will hold its value in parts for far longer.
Anonymous No.107119475 [Report] >>107119491
>>107119466
Where are you seeing 512GB of DDR5 for $1500? I would love to know. Seriously, because I need some more RAM for my EPYC and the current price increases have killed my soul.
Anonymous No.107119491 [Report]
>>107119475
>$
The other anon broke it down in burger money for you anyway
Anonymous No.107119503 [Report] >>107119514
>>107119464
Where are you finding a blackwell for 8400?
It seems to be sold for ~10k
https://viperatech.com/product/nvidia-rtx-pro-6000-blackwell-Series
Anonymous No.107119506 [Report] >>107119519 >>107119586
>>107119291
>for that money could get a Blackwell 6000 +512gb ddr5ram
Where?
If you're getting one new, which is $8346 USD on newegg from what I see, that's not a lot left for the entire server (new, since we're comparing to a new Mac). That's compared to $9500 for the Mac in America btw.

>Top spec M3 Ultra is $14099
Only if you're including the 16TB SSD. With the 1TB it's $9500. You should be careful to get around the same configurations to make fair comparisons.

Anyway this conversation is just bullshit upon bullshit. The reality is that there isn't a clear answer for which path is a better option. With Macs the customer service and resale value is better. You also get a tiny package and better power efficiency. There are other pros and cons here. It depends on what kind of use cases you have and what kind of user you are.
Anonymous No.107119514 [Report]
>>107119503
https://www.newegg.com/p/N82E16888884003
Anonymous No.107119519 [Report]
>>107119464
Forgot to link your post in >>107119506
Anonymous No.107119521 [Report] >>107119529
>>107119464
just save up 65K more and get one of those GB200 super computers instead /s
Anonymous No.107119529 [Report] >>107119538
>>107119521
Anonymous No.107119535 [Report]
>>107119455
kek
Anonymous No.107119538 [Report] >>107119549
>>107119529
They're gonna start advertising 20 hexaflops at fp0.004 soon.
Anonymous No.107119542 [Report]
>>107119404
This guy (https://www.youtube.com/watch?v=J4qwuCXyAcU) says it runs at 16 t/s with a 4 bit quant.
I'll rent a cloud GPU and run llama.cpp to compare, brb in a few hours after compiling it.
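For reference, the CUDA build is just the standard two-liner from the llama.cpp docs:

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j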
Anonymous No.107119549 [Report] >>107119560
>>107119538
fp4 would work well with quantization aware training
Anonymous No.107119551 [Report] >>107119556 >>107119578 >>107119677 >>107120333
At this point why not just pay for a secure and private business subscription to an even better LLM? It's still a ton cheaper. AND you can even still buy a mid-tier PC for some amount of local AI even if not the best and fastest.
Anonymous No.107119556 [Report] >>107119566
>>107119551
>you vill own nothing be happy goy
Anonymous No.107119560 [Report]
>>107119549
That's not the point.
Anonymous No.107119566 [Report]
>>107119556
I just said you can still own something.
Anonymous No.107119578 [Report] >>107119856
>>107119551
>At this point why not just pay for a secure and private business subscription to an even better LLM
There's no such thing.
If you don't care about privacy then sure, it's a much more nuanced discussion. But there are some things which you cannot really do with a subscription anyway, like finetuning and other kinds of experiments which require access to the weights, and cloud compute is expensive so it depends on amortization, local power costs and taxes what ends up being cheaper.
Anonymous No.107119586 [Report] >>107119607 >>107120616
>>107119506
>resale value is better
Absolutely not, macs (especially non-macbooks) depreciate faster than the GPU alone, many of which have stayed at msrp or appreciated over the past 5+ years
>there isn't a clear answer for which path is a better option
If you are spending car money on an AI machine then the machine with more memory + faster inference is clearly the better option
>better power efficiency
If you are spending car money on an AI machine you aren't fretting over a double digit increase on your yearly electricity bill
>tiny package
The only advantage here is portability, and when will you ever be regularly transporting around car-money-value fragile electronics (that isn't a complete package like a laptop) to do AI? For that purpose and that money you could use APIs on a cheap piece of shit for years
>With Macs the customer service
Are you a shill?
Anonymous No.107119607 [Report] >>107119620 >>107119668
>>107119586
>If you are spending car money on an AI machine then the machine with more memory + faster inference is clearly the better option
Not if you want to do finetuning.
>If you are spending car money on an AI machine you aren't fretting a double digit increase on your yearly electricity bill
Would it be only a double digit yearly increase? This is a big server build we're talking about. Not to even mention the noise these things make.
Anonymous No.107119620 [Report]
>>107119607
It's really not that big. Max-Q is 300W, the CPU and RAM and all the other shit would be 600W at max. Power bill isn't really a concern, this is basically equivalent to a high end gaming PC.
Anonymous No.107119668 [Report] >>107119743
>>107119607
>Not if you want to do finetuning
So the machine with less memory overall is better? Okay
You won't be fine tuning anything other than LoRAs on either anyway because it's still not enough memory; you're going to need a full-on giganigga server rack for that or just rent cloud compute like every huggingface degenerate does already.

>Noise
It won't be any louder than a gaming PC under heavy load during inference and otherwise can be silent if you spend money on fans and a nice hybrid PSU, this isn't a data hoarder server with 8x HDDs grinding away all day and night.

Look let's be honest the only thing the Mac is better at is being a nice aesthetic little shiny silver box.

I wish it was better, I like my MacBook pro that I do dev work on, windows is a dystopian piece of shit and Linux is a ballache to use if you want to use it as a multipurpose workhorse/entertainment machine that runs other non-AI proprietary software. But it's just not, apple isn't letting you have your cake and eat it too here.
Anonymous No.107119677 [Report] >>107119710 >>107119953
>>107119323
You got me at migusoft.
>>107119551
Sincerely and unironically kill yourself if you don't see the value in owning your entire production pipeline. Any subscription service is entirely reliant on playing ball with the terms and conditions another imposes upon you. Like the other anon said, direct access to weights and finetrooning is also best done locally for a variety of reasons.

We haven't hit the point where minimum power draw to run anything worth running locally exceeds an API subscription and I don't see that day coming soon as long as you optimize your machine for power consumption even at the tentative expense of initial upfront cost.
Anonymous No.107119710 [Report] >>107119911
>>107119677
I didn't say it's best done locally. I said you can't really do it through API. Whether local or renting ends up being cheaper depends.
The neat thing about renting GPUs is that it is a fungible service and some of them even take crypto (in case you get banned by payment kikes).
Anonymous No.107119719 [Report] >>107119761 >>107119799 >>107119861 >>107119932 >>107119990
If you guys spend 15k on the hardware to run the brains of your waifu, you won't have enough left to buy a body for her.
Anonymous No.107119743 [Report] >>107119751 >>107119787
>>107119668
You can't offload to RAM when finetuning the same way you can when doing inference.
>You won't be fine tuning anything other then Loras on either anyway because it's still not enough memory you're going to need a full ok giganigga server rack for that or just rent cloud compute like every huggingface degenerate does already.
So? Are you gonna finetune a LoRa with Llama.cpp? No. You can't finetune LoRas on system RAM, you have to fit the whole model on VRAM or (presumably, I'm not sure) on unified memory. So you still need hundreds of gigs of vram (or possibly unified) to finetune a ~200B model like Qwen or GLM.
There are some tricks to reduce memory consumption like Liger kernel but that stuff only offloads activations and such, the (quanted) weights still need to fit on VRAM.
Anonymous No.107119751 [Report] >>107119807
>>107119743
you can now with ktransformers but it's slowish, ramtorch is good for image / video though since that is compute bound instead
Anonymous No.107119761 [Report] >>107119907
>>107119719
why the fuck would I want my perfect digital goddess to sully herself by taking physical form?
Anonymous No.107119787 [Report]
>>107119743
>So? Are you gonna finetune a LoRa with Llama.cpp? No.
Not *yet*.
Anonymous No.107119799 [Report] >>107119861
>>107119719
By the time these things become available to buy and can be cracked and repurposed with open source software, I'll be too old to be interested in creating a cyberwaifu

I'll still probably get one and reprogram just as a hobby but it won't hit the same

https://youtu.be/s-LaAIXgv-8
Anonymous No.107119807 [Report]
>>107119751
It's an interesting development but the example config file they show on their website is with 2k context. I suspect they don't implement the memory optimizations that have been written for GPU, and with any non-trivial amount of context the memory usage is going to blow up to many TBs.
And they require you to load the weights in FP16.
It requires some experimentation to give a verdict.
Anonymous No.107119856 [Report] >>107119881 >>107119891
>>107119578
>There's no such thing
Not with literally 100% guarantee, but logically they wouldn't risk something with big business customers. There are some that of course do not give you any promises about data security and for sure those are not giving you any privacy practically by admission. But there are in fact some that do make claims. The last time I checked the landscape out, AWS had a good reputation for this. If you absolutely need an offline box and can't risk even the 0.001% chance of a data leak, then sure, but for most people I don't think that's rational. If it's a virtue/philosophy you're following that you can't make some small exceptions for, ok I understand, but it is still irrational.
>like finetuning and other kinds of experiments which require access to the weights
The people even paying attention to or considering Mac vs PC in an LLM context do not have this use case. Or they are uninformed and don't know literally anything about how fine tuning works and what modern Mac architectures are like.

>what ends up being cheaper
Sure and your usage also affects whether sub or per-use is better. But even then, are you sure it's not that cheap? I checked the AWS site and it seems their version of a subscription (they call it Provisioned Throughput) is probably overkill for a single home user and sure it isn't cheap. But for per-use API for a single user, it's not terrible. For Sonnet 4.5 it's $15 per million output tokens, which appears to be the same as Anthropic's price? In that case it's not unreasonable. The people who complain about Claude prices couldn't afford a 10k server either.
Anonymous No.107119861 [Report] >>107120024
>>107119719
>>107119799
That's creepy as fuck bro
Anonymous No.107119881 [Report] >>107119953
>>107119856
Do you think enterprise customers ACTUALLY give a shit?
They just want to tick a checkmark that they're following their "Gen AI Guidelines" and their data (allegedly) won't be used to train on.
Enterprise already has all their data uploaded to Azure or GCP.
But try uploading the child porn drawings any of the local degenerates have on their hard drives and see how well that goes.
Anonymous No.107119885 [Report]
>>107117555
>unironic beta nu-male tries to give away rape culture to india with a smirk on his face
You simply have to go back.
Anonymous No.107119891 [Report]
>>107119856
And when I said "what ends up being cheaper" I was referring to cloud GPUs that get allocated for you to do whatever you please and billed by hour, not API.
Anonymous No.107119898 [Report]
>>107119196
4060ti/5060ti 16GB should be quite a bit smaller than that; 16GB is barely any better than 12GB for text LLMs. Both are too small for ~30B models without cope quants + low context, so you'll likely be sticking to ~12B models if you have less than 24GB.
5070ti is poor value for LLMs; it might make sense if you're doing a lot of gaming with only casual LLM use.
4090 is faster than the 3090, but the same VRAM means you'll be running the exact same models, just a bit faster.
5090's 32GB is nice but you'll still be limited to ~30B models or combining with CPU+rammaxxing for MoE. 32GB isn't nearly enough to run ~70B models, the next step up from the ~30B models that run fine on 24GB.
CPU+RAMMAXXERS are probably the most well-off right now, assuming you're pairing it with at least a 3090.
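The rough arithmetic behind those pairings, as a sketch; the layer/head counts and the overhead figure are ballpark assumptions, not measurements:

# Back-of-envelope: does a model fit in VRAM at a given quant and context?
def vram_needed_gb(params_b, bits_per_weight, ctx_len,
                   n_layers=64, n_kv_heads=8, head_dim=128, kv_bytes=2):
    weights = params_b * bits_per_weight / 8                      # GB of weights
    # KV cache: 2 (K and V) * layers * kv_heads * head_dim * bytes_per_elem * tokens
    kv = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * ctx_len / 1e9
    return weights + kv + 1.5                                     # +1.5 GB runtime overhead guess

print(vram_needed_gb(32, 4.5, 16_384))                # ~24 GB: a 30B-class at ~Q4 barely fits 24GB
print(vram_needed_gb(70, 4.5, 16_384, n_layers=80))   # ~46 GB: a 70B at ~Q4 doesn't fit 32GB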
Anonymous No.107119907 [Report]
>>107119761
So you can sully her more by sticking your dick in it.
Anonymous No.107119911 [Report] >>107119942
>>107119710
Sorry for being overly hostile then. I had interpreted your post to be povertyjeet-tier API shilling.
Anonymous No.107119928 [Report]
>>107119196
Depends on the cpumaxxer, btw: a cpulet with some old DDR3/DDR4 server, sure, but someone with a DDR5 server + RTX 6000s or 4090s or the like is a super giga chad.
Anonymous No.107119932 [Report]
>>107119719
Anonymous No.107119942 [Report] >>107119950
>>107119911
Actually I wasn't the guy you replied to, who *was* shilling API. I'm not sure why I replied to your post. Guess I'm an ADHD retard doing too many things at once.
Anonymous No.107119950 [Report]
>>107119942
If those too many things include making those migus, keep doing them.
Anonymous No.107119953 [Report] >>107120014 >>107120015
>>107119677
>Sincerely and unironically kill yourself if you don't see the value in owning your entire production pipeline
Why do you take my post out of context? Obviously there are values and benefits to complete ownership, but in the context of this thread, most people are arguably home users not doing any professional work at scale with LLMs, and most aren't even experimenters, just people who want to run inference with a single chatbot.

>>107119881
>Do you think enterprise customers ACTUALLY give a shit?
Do you? It ends up being both yes and no. I've talked to enterprises and some do want security. It depends on the person/manager/use case. I did my duty to inform them of the risks of API as well as the costs of it vs going full local.
>Enterprise already has all their data uploaded to Azure or GCP
Many surprisingly do not in my experience. Keyword being "all their data".

But yes, the ultra degenerates might need to look elsewhere. Potential incrimination is of course not worth risking. At the same time I'm pretty sure most of us are not actually that far down the hole.
Anonymous No.107119990 [Report] >>107120020
>>107119719
That's an Indian person inside
Anonymous No.107120014 [Report] >>107120035 >>107120137 >>107120784
>>107119953
Contrary to the doomers, local models are only getting better with time. I don't think it'll be too long now before something small but coherent enough releases to satisfy the 'LLM tourist' market locally. Similarly, prosumers (that I presume the bulk of the non-indians in this thread are) benefit from unquanted local-scale models, or even outright commercial-scale models quanted enough to retain usefulness, as research into quantization-aware training will likely only scale better with time.
I understand what your post was trying to say; I reject the underlying cynicism in its assumptions. The recent improvements to quantization methods, and the mixing of standard reinforcement learning with other types of training, all imply things are only going to get better for every weight class of local user - it's just going to take time for the fruit to blossom.
Anonymous No.107120015 [Report] >>107120062 >>107120137
>>107119953
It's not only about incrimination. It's about being dependent on something that can be taken away from you. Fortune 500 companies have more leverage than some 4chan autist. Anthropic is not going to ban a big company. You, on the other hand, can be squashed like a bug for random bullshit that you didn't even think could be ban-worthy.
I got banned for asking Claude to look for youtubers with ideologies and interests similar to tastyfish, who apparently had some images of naked children on his webpage. But I asked because I was interested in the self-reliance and suckless aspect, not anything shady.
https://web.archive.org/web/20250923163240/http://www.tastyfish.cz/

>Many surprisingly do not in my experience. Keyword being "all their data".
Realistically, anything that passes through an API is going to get scanned and stored.
And I say this as somebody who doesn't care that much about privacy, which is (in tastyfish's words) just a euphemism for censorship (https://web.archive.org/web/20251008180406/https://www.tastyfish.cz/lrs/privacy.html).
Anonymous No.107120020 [Report]
>>107119990
wrong, there is a distinct lack of brown stains on the suit.
Anonymous No.107120024 [Report]
>>107119861
>creepy as fuck
that only makes my penis harder
Anonymous No.107120035 [Report] >>107120077 >>107120087
>>107120014
>that I presume the bulk of the non-indians in this thread are
And the indians are what?

>I don't think it'll be too long now before something small but coherent enough releases to satisfy the 'LLM tourist' market
The problem with that is ChatGPT has a really good backend with search and a Python sandbox.
Not that it takes that much to replicate, but they have the benefit of making it as technical to set up as they want, while you are going to have to ship something that normalfags can install on their random ass machines with one click, which is an extremely difficult challenge to meet. OpenAI also probably has API deals with Google and other API providers to fetch stuff from the web without getting blocked with captchas.
Anonymous No.107120057 [Report] >>107120088
>Look at this guy. I'm totally not that guy, but look at this guy. Guys. Make sure to understand i'm not this guy.
https://desuarchive.org/g/search/text/tastyfish/
Anonymous No.107120062 [Report]
>>107120015
I hate anthropic so much
in the early days of claude I was changing my main mail address and their website didn't let you do that; it suggested you should delete your account and create a new one with the new mail address
I did exactly what they were telling me to do, and while attempting to make the new account their automated systems said my phone number had been used too many times to register accounts and that I needed another phone number, like, what the heck? I only ever had one account you filthy niggers, I'm not going to change my cell phone number or buy a second subscription just to register to gain access to claude
I never tried the appeal-to-a-human process because I am not willing to submit to humiliation rituals.
since then I've been firmly in the pure local camp.
Anonymous No.107120077 [Report] >>107120087 >>107120116
>>107120035
>that I presume the bulk of the non-indians in this thread are
The difference between 'are' and 'aren't' is catastrophic in speedtyped responses kek.
>while you are going to have to ship something that normalfags can install on their random ass machines with one click, which is an extremely difficult challenge to meet
Sillytavern with Kobold is pretty close to retard-proof but probably still too many steps for most normgroids to follow right now. I don't think it's improbable that a better or modified frontend for casual users will emerge, though.
>OpenAI also probably has API deals with Google and other API providers to fetch stuff from the web without getting blocked with captchas.
That's a decent point though. We can only pray for a chink and yandex alliance to balance the scales.
Anonymous No.107120087 [Report]
>>107120035
>>107120077
Jeets are likely on low end machines or API calls.
Anonymous No.107120088 [Report] >>107120098 >>107120105
>>107120057
Damn. Got me. I admit it, I'm Miloslav Ciz posting from my suckless caravan using a windmill powered vacuum tube radio.
https://www.youtube.com/watch?v=8qvddkIgo4A
Anonymous No.107120098 [Report]
>>107120088
>youtube
Anonymous No.107120104 [Report] >>107120117 >>107120134 >>107120135 >>107120139 >>107120141 >>107120168
OH SHIT
https://x.com/gm8xx8/status/1986273562185875731
https://github.com/JinjieNi/MegaDLMs

https://huggingface.co/datasets/MDGA-1/super_data_learners_ckpts/tree/main
Anonymous No.107120105 [Report] >>107120170
>>107120088
Not being him just makes your obsession with him even weirder.
Anonymous No.107120116 [Report]
>>107120077
>sillytavern
Normalfags aren't using ChatGPT for roleplaying lol, they're using it to get homework done and shit. Nobody is using Sillytavern to do quick research (like I did here >>107119226, I had to use Gemini)
Anonymous No.107120117 [Report]
>>107120104
inb4 when llama.cpp support
Anonymous No.107120134 [Report]
>>107120104
when llama.cpp support
Anonymous No.107120135 [Report]
>>107120104
Anonymous No.107120137 [Report]
>>107120014
You can still buy a PC during or after using the API for a while. I think it's only logical to use one resource when it fits your use case and a different resource when you have a different, though perhaps similar (in this case), use case. The use case may not really be different at all, just a desire for tighter security or ownership of the means of production.

>>107120015
Like my original post said, you can still have your PC WHILE you also use cloud. You don't have to use them for the same queries, or even as alternatives for the same use case. You don't have to be dependent on the API/sub. The PC may not be as powerful, that's all. It's not an all-or-nothing scenario.
Anonymous No.107120139 [Report] >>107120148 >>107120160
>>107120104
What is the significance of this? People found out how to train stuff faster? Inclusion of diffusion?
Anonymous No.107120141 [Report]
>>107120104
no gguf, no care
Anonymous No.107120148 [Report] >>107120163
>>107120139
It's the LLM equivalent to the weekly article about China developing a new storage medium that can store 6 million petabytes on a crystal the size of a sperm
Anonymous No.107120160 [Report] >>107120200
>>107120139
>faster
oh no, much much much slower / needs tons more compute to train, but it also performs far better than equally sized autoregressive ones / would be far, far faster for inference
Anonymous No.107120163 [Report] >>107120176
>>107120148
just say 6 zettabytes dayum
Anonymous No.107120168 [Report]
>>107120104
Ok!
Anyway.
Anonymous No.107120170 [Report] >>107120174
>>107120105
I was obsessed with him for a few days.
Because he is what people like Luke Smith can only dream about (a true idealist, as opposed to a poser/grifter).
I even trained a LoRA on his wiki; this article was written by a finetuned version of the Llama 3.1 70B base model. I thought the AIDS joke was funny and kind of impressive.
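For anyone curious what that kind of run looks like, here's a generic peft + transformers sketch of a LoRA finetune on a scraped text dump. This is not the actual setup from the post above; the model ID, dataset filename, rank and hyperparameters are placeholder assumptions:

# Generic LoRA finetune sketch with peft + transformers; all values are illustrative.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

base = "meta-llama/Llama-3.1-70B"              # base (non-instruct) model, assumed repo ID
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
# in practice you'd load the base quantized (QLoRA) to fit it on a single card
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention-only, a common default
    task_type="CAUSAL_LM",
))

ds = load_dataset("text", data_files="wiki_dump.txt")["train"]   # hypothetical scraped wiki
ds = ds.map(lambda x: tok(x["text"], truncation=True, max_length=2048), batched=True)

Trainer(
    model=model,
    args=TrainingArguments("lora-out", per_device_train_batch_size=1,
                           gradient_accumulation_steps=16, num_train_epochs=2,
                           learning_rate=1e-4, bf16=True, logging_steps=10),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()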
Anonymous No.107120174 [Report] >>107120190
>>107120170
>for a few days
Anonymous No.107120176 [Report] >>107120193
>>107120163
I was going to say niggerbytes but didn't want to upset cuda dev
Anonymous No.107120190 [Report] >>107120203
>>107120174
Now I only post links to his stuff whenever anyone talks about security, privacy, or when I mention it tangentially in relation to me being banned from Claude. He and /g/ more or less agree on most computer stuff except on the privacy/security thing so I think it's an alternative perspective worth pointing out.
Anonymous No.107120193 [Report]
>>107120176
I hope any data scrapers combing this general get enough samples to have models using them blurt out nigger in random conversation.
Anonymous No.107120200 [Report]
>>107120160
Basically, think of how good diffusion image/video models are at only 2-40B params in comparison. Now also consider how extremely expensive they are to train compared to 1T text models.

A diffusion text model would be far better param for param, BUT it's orders of magnitude more expensive compute-wise to train.
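A toy sketch of why the inference side can be faster - shapes only, no real model, and the unmasking schedule is a simplification of what real samplers do:

# Toy contrast of the two decoding loops: an autoregressive LM needs one forward
# pass per generated token, while a masked-diffusion LM refines the whole
# sequence in a fixed number of denoising steps.
import random

def autoregressive_decode(model, prompt, n_new):
    seq = list(prompt)
    for _ in range(n_new):              # n_new sequential forward passes
        seq.append(model(seq)[-1])      # each pass predicts only the next token
    return seq

def diffusion_decode(model, prompt, n_new, n_steps=8):
    MASK = -1
    seq = list(prompt) + [MASK] * n_new
    for step in range(n_steps):         # fixed number of parallel passes
        preds = model(seq)              # predicts every position at once
        k = (n_new * (step + 1)) // n_steps   # unmask a growing fraction each step
        for i in range(len(prompt), len(prompt) + k):
            if seq[i] == MASK:
                seq[i] = preds[i]
        # (real samplers also re-mask low-confidence tokens; omitted here)
    return seq

dummy = lambda seq: [random.randint(0, 9) for _ in seq]   # stand-in "model"
print(autoregressive_decode(dummy, [1, 2, 3], 16))
print(diffusion_decode(dummy, [1, 2, 3], 16))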
Anonymous No.107120203 [Report]
>>107120190
>Now I only post links to his stuff whenever anyone talks about security, privacy, or when I mention it tangentially in relation to me being banned from Claude.
>Now I only post links to his stuff whenever anyone talks about security
>Now I only post links to his stuff whenever
>Now
>for a few days
Anonymous No.107120281 [Report] >>107120289 >>107120295 >>107120296
>>107113093 (OP)
I've been looking at the DGX Spark and I think I will now get the GMKTec EVO X2 AI (128GB RAM, AMD AI 395) instead.

It seems that with the unified memory it behaves like a 4070 with 96GB of VRAM, from what I have read online. It seems cheaper than the Minisforum or the Framework Desktop, but I'm not sure how well it would go if I wanted an eGPU.
Seems like this is way better than building my own.

Seems like this mini computer should be able to do 70B models happily, at about 40 tokens a second at Q4.

I also have the opportunity to buy the 96GB VRAM Huawei Atlas 300I for $1300.
Wasn't this made specifically for inference? Any thoughts or experience here?
Anonymous No.107120289 [Report] >>107120372 >>107120526
>>107120281
>70B models happily and at about 40 tokens a second at Q4
lol. lmao, even. you'll get 15t/s at best
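Back-of-envelope, decode on a memory-bound box is roughly bandwidth divided by bytes read per token. A minimal sketch, assuming ~256 GB/s for the 395's LPDDR5X and a dense 70B at ~Q4 (both numbers are my assumptions, not measurements):

# Every decoded token streams the active weights from memory, so
# bandwidth / model size is a hard ceiling on tokens per second.
mem_bw_gb_s = 256              # approx. 256-bit LPDDR5X on the AI 395 (assumed)
model_gb = 70 * 0.55           # dense 70B at ~Q4, roughly 38.5 GB read per token
ceiling = mem_bw_gb_s / model_gb
print(f"theoretical ceiling: ~{ceiling:.1f} t/s")        # ~6.6 t/s
print(f"at ~70% efficiency:  ~{0.7 * ceiling:.1f} t/s")  # ~4-5 t/s, nowhere near 40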
Anonymous No.107120295 [Report] >>107120372
>>107120281
>70B models
the ones that stopped being made over a year ago?
Anonymous No.107120296 [Report] >>107120372 >>107120526
>>107120281
>70B
do poorfags really?
glm air is the lowest form of acceptable cope
Anonymous No.107120311 [Report]
I can only load up to 300k context, it's so over
Anonymous No.107120333 [Report]
>>107119551
>why not
Love of the game
Anonymous No.107120372 [Report] >>107120380 >>107120393
>>107120289
>>107120295
>>107120296

Well what else could I build and how is it poor?

Wouldn't this be a great option with the unified memory for inference?
It would be much faster with smaller models.

I don't get why this is a bad bet. Isn't this exactly what these systems are made for?
Isn't this one of the fastest budget options without spending 20k on something enterprise?
Anonymous No.107120380 [Report] >>107120412
>>107120372
not upgradeable, shitty option.
Anonymous No.107120387 [Report]
https://x.com/scaling01/status/1986158792128508218

apparently apple just leaked gemini 3's size, 1.2T params
Anonymous No.107120393 [Report] >>107120412 >>107120616
>>107120372
Any laptop or mac is neither scalable with upgrades nor easily repairable when your chinesium-grade pre-built parts falter after just a bit of rigorous use. You're, at best, marrying yourself to specific specs in a rapidly developing industry and at worst you're setting yourself up for an endless loop of customer service calls, warranty references, and repair cycles.
Anonymous No.107120412 [Report] >>107120426 >>107120435
>>107120380
You literally can't buy any comparable CPU for a self-built AI system right now, right?

There is no comparable RAM out right now either.

You can run eGPUs with it.
It outperforms any other system I could make myself without building an entire server.

>>107120393

Yes, the fact that the RAM is soldered does piss me off. But given it's a mini PC, even if it does lose out in a few years (things are advancing so quickly), it can retire as a games console or something.
My 13700K desktop, for example, isn't worth upgrading, as every component at 3 years old is redundant for AI. The same is true for anything running a 9950X though. You'll be changing the CPU, RAM and motherboard anyway. So who cares?

It will last longer than anything you build today.
Anonymous No.107120426 [Report] >>107120472
>>107120412
can you get a trip
Anonymous No.107120435 [Report] >>107120472 >>107121065
>>107120412
honestly just pay api with vpn until actually decent ai inference focused hardware comes out; unless you are willing to drop 80K for a GB300 desktop, it's just retarded to lock yourself into something that can already only run shitty, outdated models
Anonymous !y0Kt/0nBiU No.107120472 [Report] >>107120519
>>107120426
tripcode on. I'm sorry if you want to mute me.
I'll use it when I eventually get everything sorted in a month.

>>107120435
I see it as cool being able to run my own models continuously myself, and this seems doable.
Anonymous No.107120519 [Report]
>>107120472
then just run glm air with a regular ram + gpu setup instead of buying ewaste soldered ram bs
Anonymous No.107120526 [Report] >>107120530 >>107120701
>>107120296
glm is stinky dogshit
>>107120289
he's looking at more like 5t/s
Anonymous No.107120530 [Report] >>107120542
>>107120526
glm is by far the best local that is not kimi
Anonymous No.107120542 [Report] >>107120550
>>107120530
It's easily one of the worst, actually. No other model repeats your own message in every single one of its own.
Anonymous No.107120550 [Report] >>107120558 >>107120667
>>107120542
>No other model repeats your own message in every single one of its own.
sounds like user error / formatting issues to me, as it has never done that for me. There is a reason why everyone on novelai / featherless / st discords prefers it and kimi
Anonymous No.107120558 [Report] >>107120583
>>107120550
Nope, it sounds like you don't actually use the model you're shilling.
Anonymous No.107120583 [Report] >>107120669
>>107120558
I use a mix of it and claude's subs for coding and writing, and kimi 0905 for nsfw writing; you sound like you're just talking shit
Anonymous No.107120616 [Report] >>107120621 >>107120701
>>107119404
>>107119586
>>107120393
Ok, ran the test. If Dave2D's test for the Studio is accurate (https://www.youtube.com/watch?v=J4qwuCXyAcU) then the Pro 6000 is simply slower than the M3 Ultra for running large MoEs like Deepseek R1 (16 t/s vs 14 t/s at Q4).
For more medium-sized models like Qwen or GLM I doubt there would be much difference, since the number of active parameters is similar. And the build would be ~50% more expensive.
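Rough bytes-per-token arithmetic for the MoE case; all figures below are approximate public specs or outright guesses, not measurements:

# Decode speed for a big MoE is roughly (bandwidth of wherever the active
# experts live) / (bytes of active parameters read per token).
active_gb_per_token = 37 * 0.55       # DeepSeek R1: ~37B active params at ~Q4
setups = {
    "M3 Ultra, 512GB unified (~819 GB/s)": 819,
    "Pro 6000 with experts offloaded to host DDR5 (~450 GB/s, guess)": 450,
}
for name, bw in setups.items():
    print(f"{name}: ceiling ~{bw / active_gb_per_token:.0f} t/s")
# Real runs land well under these ceilings (routing, prompt processing, offload
# overhead), so the measured 14-16 t/s ending up close isn't surprising.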
Anonymous No.107120621 [Report] >>107120650
>>107120616
cept you can add the pro 6000 to another build, and do image and video gen super well / train loras
Anonymous No.107120650 [Report] >>107120657 >>107120764
>>107120621
Sure, in fact I'm renting a Pro 6000 for training LoRAs. But that's just because there aren't any Macs for rent and I don't have 17k to drop on a Mac (15k for the Mac + 2k for hotel + flight; I live in India).
Anonymous No.107120657 [Report]
>>107120650
12k I mean. (sorry, dalit genes so can't do math either)
Anonymous No.107120667 [Report] >>107120676
>>107120550
>There is a reason why everyone on novelai
least obvious paid shill nigger award
Anonymous No.107120669 [Report]
>>107120583
You don't though
Anonymous No.107120676 [Report] >>107120689
>>107120667
>ignores all the other writing discords mentioned
Anonymous No.107120689 [Report] >>107120693
>>107120676
>discord users as a judge of good writing
Anonymous No.107120693 [Report]
>>107120689
have you seen the logs posted here / on aicg?
Anonymous !y0Kt/0nBiU No.107120701 [Report] >>107120710
>>107120616
Doesn't the AI get this shit wrong, like with regard to the lack of Atlas support when it's specifically made for inference?

>>107120526
I'll let you know how I go :/
Anonymous No.107120710 [Report]
>>107120701
I don't know what Atlas is or does, sorry.
Anonymous No.107120764 [Report] >>107120825 >>107120833
>>107120650
gemma sir of release when>??> do ther nedful sir pls ranjesh sir
Anonymous No.107120784 [Report]
>>107120014
The issue is hardware, not software. There are open SOTA models available right now.
The hardware needs to get cheaper to make them cost-effective to run at home.
Anonymous No.107120825 [Report]
>>107120764
We have Urgent Gemma News coming latery this week.
Anonymous No.107120833 [Report]
>>107120764
We have Gemma 4 Gold Edition at home sirs.
Anonymous No.107121054 [Report]
>>107113093 (OP)
migu = discord word
Anonymous No.107121065 [Report]
>>107120435
>just pay api
wrong thread, get out
Anonymous No.107121375 [Report]
>>107121367
>>107121367
>>107121367