More

kcb · 2026-04-07T23:56:10 1775606170

What benefit is there to dropping $50k on GPUs to run this personally besides being a cool enthusiast project?

marcus_holmes · 2026-04-08T01:22:42 1775611362

Why would anyone need more than 640Kb of memory?

kcb · 2026-04-08T02:10:55 1775614255

Exactly the point though. In the 640KB days there was no subscription to ever increasing compute resources as an alternative.

marcus_holmes · 2026-04-08T02:45:09 1775616309

Well, there kinda was - most computing then was done on mainframes. Personal / Micro computers were seen as a hobby or toy that didn't need any "serious" amounts of memory. And then they ate the world and mainframes became sidelined into a specific niche only used by large institutions because legacy.

I can totally see the same happening here; on-device LLMs are a toy, and then they eat the world and everyone has their own personal LLM running on their own device and the cloud LLMs are a niche used by large institutions.

kcb · 2026-04-08T02:55:05 1775616905

The difference is computers post text terminal are latency and throughput dependent to the user. LLMs are not particularly.

marcus_holmes · 2026-04-08T02:59:42 1775617182

Sorry, I don't understand that comment. Can you clarify, please?

kcb · 2026-04-08T03:18:44 1775618324

My point is LLMs aren't more usable if the hardware is in your room versus a few states away. Personal computers still to this day aren't great when the hardware is fully remote.

marcus_holmes · 2026-04-08T03:42:59 1775619779

Agreed. But you couldn't do much on a PC when they launched, at least compared to a mainframe. The hardware was slow, the memory was limited, there was no networking at all, etc. If you wanted to do any actual serious computing, you couldn't do that on a PC. And yet they ate the world.

I can easily see the advantage, even now, of running the LLM locally. As others have said in this topic. I think it'll happen.

edit: thanks for clarifying :)

deminature · 2026-04-08T00:20:15 1775607615

Intel has just released a high VRAM card which allows you to have 128GB of VRAM for $4k. The prices are dropping rapidly. The local models aren't adapted to work on this setup yet, so performance is disappointing. But highly capable local models are becoming increasingly realistic. https://www.youtube.com/watch?v=RcIWhm16ouQ

kcb · 2026-04-08T02:18:00 1775614680

That's 4 32GB GPUs with 600GB/s bw each. This model is not running on that scale GPUs. I think something like 96GB RTX PRO 6000 Blackwells would be the minimum to run a model of this size with performance in the range of subscription models.

acchow · 2026-04-08T04:25:32 1775622332

> I think something like 96GB RTX PRO 6000 Blackwells would be the minimum to run a model of this size with performance in the range of subscription models.

GLM 5.1 has 754B parameters tho. And you still need RAM for context too. You'll want much more than 96GB ram.

blizdiddy · 2026-04-08T00:14:24 1775607264

Is it so hard to project out a couple product cycles? Computers get better. We’ve gone from $50k workstation to commodity hardware before several times

kcb · 2026-04-08T02:06:44 1775614004

Subscription services get all the same benefits from computer hardware getting better. But actually due to scale, batching, resource utilization, they'll always be able to take more advantage of that.

CamperBob2 · 2026-04-08T01:25:06 1775611506

It will run exactly the same tomorrow, and the next day, and the day after that, and 10 years from now. It will be just as smart as the day you downloaded the weights. It won't stop working, exhaust your token quota, or get any worse.

That's a valuable guarantee. So valuable, in fact, that you won't get it from Anthropic, OpenAI, or Google at any price.

kcb · 2026-04-08T03:24:02 1775618642

That's why we all still use our e machines its never obsolete PCs. Works just the same it did 20 years ago, though probably not because I've never heard of hardware that's guaranteed not to fail.

fwipsy · 2026-04-08T01:48:54 1775612934

Agree directionally but you don't need $50k. $5k is plenty, $2-3k arguably the sweet spot.

unlikelytomato · 2026-04-08T01:52:21 1775613141

as a local LLM novice, do you have any recommended reading to bootstrap me on selecting hardware? It has been quite confusing bring a latecomer to this game. Googling yields me a lot of outdated info.

fwipsy · 2026-04-08T03:42:16 1775619736

First answer: If you haven't, give it a shot on whatever you already have. MoE models like Qwen3 and GPT-OSS are good on low-end hardware. My RTX 4060 can run qwen3:30b at a comfortable reading pace even though 2/3 of it spills over into system RAM. Even on an 8-year-old tiny PC with 32gb it's still usable.

Second answer: ask an AI, but prices have risen dramatically since their training cutoff, so be sure to get them to check current prices.

Third answer: I'm not an expert by a long shot, but I like building my own PCs. If I were to upgrade, I would buy one of these:

Framework desktop with 128gb for $3k or mainboard-only for $2700 (could just swap it into my gaming PC.) Or any other Strix Halo (ryzen AI 385 and above) mini PC with 64/96/128gb; more is better of course. Most integrated GPUs are constrained by memory bandwidth. Strix Halo has a wider memory bus and so it's a good way to get lots of high-bandwidth shared system/video RAM for relatively cheap. 380=40%; 385=80%; 395=100% GPU power.

I was also considering doing a much hackier build with 2x Tesla P100s (16gb HBM2 each for about $90 each) in a precision 5820 (cheap with lots of space and power for GPUs.) Total about $500 for 32gb HBM2+32gb system RAM but it's all 10-year-old used parts, need to DIY fan setup for the GPUs, and software support is very spotty. Definitely a tinker project; here there be dragons.

terbo · 2026-04-08T06:57:33 1775631453

Agree on the framework, last week you could get a strix halo for $2700 shipped now it's over $3500, find a deal on a NVME and the framework with the noctua is probably going to be the quietest, some of them are pretty loud and hot.

I run qwen 122b with Claude code and nanoclaw, it's pretty decent but this stuff is nowhere prime time ready, but super fun to tinker with. I have to keep updating drivers and see speed increases and stability being worked on. I can even run much larger models with llama.cpp (--fit on) like qwen 397b and I suppose any larger model like GLM, it's slow but smart.

kcb · 2026-04-08T02:05:34 1775613934

The 4-bit quants are 350GB, what hardware are you talking about?

fwipsy · 2026-04-08T03:27:04 1775618824

qwen3:0.6b is 523mb, what model are you talking about? You seem to have a specific one in mind but the parent comment doesn't mention any.

For a hobby/enthusiast product, and even for some useful local tasks, MoE models run fine on gaming PCs or even older midrange PCs. For dedicated AI hardware I was thinking of Strix Halo - with 128gb is currently $2-3k. None of this will replace a Claude subscription.

kcb · 2026-04-02T18:49:46 1775155786

Nemotron 3 Super was released recently. That's a direct competitor to gpt-oss-120b. https://developer.nvidia.com/blog/introducing-nemotron-3-sup...

evilduck · 2026-04-02T21:16:32 1775164592

In terms of ability, maybe, in terms of speed, it's not even close. Check out the Prompt Processing speeds between them: https://kyuz0.github.io/amd-strix-halo-toolboxes/

gpt-oss-120b is over 600 tokens/s PP for all but one backend.

nemotron-3-super is at best 260 tokens/s PP.

Comparing token generation, it's again like 50 tokens/sec vs 15 tokens/sec

That really bogs down agentic tooling. Something needs to be categorically better to justify halving output speed, not just playing in the margins.

mratsim · 2026-04-02T22:06:28 1775167588

In my case with vLLM on dual RTX Pro 6000

gpt-oss-120b: (unknown prefill), ~175 tok/s generation. I don't remember the prefill speed but it certainly was below 10k

Nemotron-3-Super: 14070 tok/s prefill, ~194.5 tok/s generation. (Tested fresh after reload, no caching, I have a screenshot.)

Nemotron-3-Super using NVFP4 and speculative decoding via MTP 5 tokens at a time as mentioned in Nvidia cookbook: https://docs.nvidia.com/nemotron/nightly/usage-cookbook/Nemo...

coder68 · 2026-04-02T19:36:37 1775158597

I gave it a whirl but was unenthused. I'll try it again, but so far have not really enjoyed any of the nvidia models, though they are best in class for execution speed.

markab21 · 2026-04-02T21:08:57 1775164137

I'll pipe in here as someone working on an agentic harness project using mastra as the harness.

Nemotron3-super is, without question, my favorite model now for my agentic use cases. The closest model I would compare it to, in vibe and feel, is the Qwen family but this thing has an ability to hold attention through complicated (often noisy) agentic environments and I'm sometimes finding myself checking that i'm not on a frontier model.

I now just rent a Dual B6000 on a full-time basis for myself for all my stuff; this is the backbone of my "base" agentic workload, and I only step up to stronger models in rare situations in my pipelines.

The biggest thing with this model, I've found, is just making sure my environment is set up correctly; the temps and templates need to be exactly right. I've had hit-or-miss with OpenRouter. But running this model on a B6000 from Vast with a native NVFP4 model weight from Nvidia, it's really good. (2500 peak tokens/sec on that setup) batching. about 100/s 1-request, 250k context. :)

I can run on a single B6000 up to about 120k context reliably but really this thing SCREAMS on a dual-b6000. (I'm close to just ordering a couple for myself it's working so well).

Good luck .. (Sometimes I feel like I'm the crazy guy in the woods loving this model so much, I'm not sure why more people aren't jumping on it..)

girvo · 2026-04-02T22:30:41 1775169041

> I'm not sure why more people aren't jumping on it

Simple: most of the people you’re talking to aren’t setting these things up. They’re running off the shelf software and setups and calling it a day. They’re not working with custom harnesses or even tweaking temperature or templates, most of them.

pertymcpert · 2026-04-03T05:26:28 1775193988

I’d be very interested in trying it if you could spare the time to write up how to tune it well. If not thanks for the input anyway.

kcb · 2026-03-30T23:22:34 1774912954

Because the initial announcement included none of that... it wasn't addressed at all until the harsh sentiment.

fc417fc802 · 2026-03-31T01:52:36 1774921956

It still hasn't been addressed. They walked back half of their wholly unreasonable position in an attempt to legitimize the other half.

TGower · 2026-03-30T23:25:15 1774913115

Then shouldn't we celebrate the victory, drop it, and move on?

kcb · 2026-03-30T23:41:36 1774914096

Victory is my device and its OS working the same way it always worked and the way it worked when I bought it.

TGower · 2026-03-31T00:00:24 1774915224

Just don't install the OS updates then.

kcb · 2026-03-29T04:19:49 1774757989

Inference isn't really that expensive, its the training of new foundational models that is. With whatever highly optimized setup the big providers are using, they should be able to pack quite a lot of concurrent users onto a deployment of a model. Just think too, it's very possible their use case would be served just fine by a 100B model deployed to a $4,000 DGX Spark.

kcb · 2026-03-26T00:07:19 1774483639

It's glass...

kcb · 2026-03-20T18:03:08 1774029788

Another factor, it's not just GPUs it's the full hardware stack. https://static.tweaktown.com/news/1/1/110521_2_nvidia-update...

kcb · 2026-03-18T23:23:53 1773876233

CUDA has had managed memory that pages between VRAM and system RAM for a decade. Problem is doing so is unusably slow for AI purposes. Seems like an unnecessary layer here.

kcb · 2026-03-09T12:29:51 1773059391

But it has chemicals in it...

RCitronsBroker · 2026-03-09T15:00:24 1773068424

i wouldn’t last a day as a food chemist

kcb · 2026-03-05T17:06:39 1772730399

The 100 class Nvidia chips are targeted at training. With Nvidia buying Groq it will further move in that direction.

kcb · 2026-03-04T13:33:27 1772631207

There's already videos of US/Israeli jets over tehran dropping guided bombs.