For the many DGX Spark and Strix Halo users with 128GB of memory, I believe the ideal model size would probably be a MoE with close to 200B total parameters and a low active count of 3B to 10B.
I would personally love to see a super sparse 200B A3B model, just to see what is possible. These machines don't have a lot of bandwidth, so a low active count is essential to getting good speed, and a high total parameter count gives the model greater capability and knowledge.
It would also be essential to have the Q4 QAT, of course. Then the 200B model weights would take up ~100GB of memory, not including the context.
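Back of the envelope (assuming roughly half a byte per parameter at Q4; real quant formats carry a bit of overhead):

```python
total_params = 200e9
bytes_per_param = 0.5  # Q4 ~ 4 bits/param; assumption, ignores quant overhead
print(f"{total_params * bytes_per_param / 1e9:.0f} GB")  # ~100 GB of weights, context extra
```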
The common 120B size these days leaves a lot of unused memory on the table on these machines.
I would also like the larger models to support audio input, not just the E2B/E4B models. And audio output would be great too!
Following the current rule of thumb for MoE capability, `sqrt(total_params * active_params)`, a 200B-A3B would have the intelligence of a ~24B dense model.
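That is just the geometric mean of total and active:

```python
import math
print(math.sqrt(200 * 3))  # ~24.5, hence the ~24B-dense equivalent
```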
That seems pointless. You can achieve that with a single 24GB graphics card already.
I wonder if it would even hold up at that level, as 3B active is really not a lot to work with. Qwen 3.5 uses 122B-A10B and still is neck and neck with the 27B dense model.
I don't see any value proposition for these little boxes like DGX Spark and Strix Halo. Lots of too-slow RAM to do anything useful except run mergekit. imo you'd have been better off building a desktop computer with two 3090s.
That rule of thumb was invented years ago, and I don’t think it is relevant anymore, despite how frequently it is quoted on Reddit. It is certainly not the "current" rule of thumb.
For the sake of argument, even if we take that old rule of thumb at face value, you can see how the MoE still wins:
- (DGX Spark) 273GB/s of memory bandwidth with 3B active parameters at Q4, i.e. ~1.5GB read per token: 273 / 1.5 = 182 tokens per second as the theoretical maximum.
- (RTX 3090) 936GB/s with 24B parameters at Q4, i.e. ~12GB per token: 936 / 12 = 78 tokens per second. Or 39 tokens per second if you wanted to run at Q8 to maximize the memory usage on the 24GB card.
The "slow" DGX Spark is now more than twice as fast as the RTX 3090, thanks to an appropriate MoE architecture. Even with two RTX 3090s, you would still be slower. All else being equal, I would take 182 tokens per second over 78 any day of the week. Yes, an RTX 5090 would close that gap significantly, but you mentioned RTX 3090s, and I also have an RTX 3090-based AI desktop.
(The above calculation is dramatically oversimplified, but the end result holds, even if the absolute numbers would probably be less for both scenarios. Token generation is fundamentally bandwidth limited with current autoregressive models. Diffusion LLMs could change that.)
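Here is that napkin math as a reusable snippet, under the same oversimplified assumption (bandwidth-bound decode, one full read of the active weights per generated token):

```python
def max_decode_tps(bandwidth_gb_s: float, active_params_b: float, bytes_per_param: float) -> float:
    # Theoretical ceiling: each generated token reads all active weights once
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

print(max_decode_tps(273, 3, 0.5))   # DGX Spark, 200B-A3B at Q4 -> ~182 t/s
print(max_decode_tps(936, 24, 0.5))  # RTX 3090, 24B dense at Q4 -> ~78 t/s
print(max_decode_tps(936, 24, 1.0))  # RTX 3090, 24B dense at Q8 -> ~39 t/s
```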
The mid-size frontier models are rumored to be extremely sparse like that, but 10x larger on both total and active. No one has ever released an open model that sparse for us to try out.
As I said, I want to see what is possible for Google to achieve.
> Qwen 3.5 uses 122B-A10B and still is neck and neck with the 27B dense model.
From what I've seen, having used both, I would anecdotally report that the 122B model is better in ways that aren't reflected in benchmarks, with more inherent knowledge and more adaptability. But, I agree those two models are quite close, and that's why I want to see greater sparsity and greater total parameters: to push the limits and see what happens, for science.
Kimi 2.5 is relatively sparse at 1T/32B; GLM 5 does 744B/40B so only slightly denser. Maybe you could try reducing active expert count on those to artificially increase sparsity, but I'm sure that would impact quality.
Reducing the expert count after training causes catastrophic loss of knowledge and skills. Cerebras does this with their REAP models (although REAP prunes the total set of experts, rather than routing to fewer experts per token), and it can be okay for very specific use cases if you measure which experts your workload actually hits and carefully delete the least used ones. But it doesn't provide any general insight into how a higher-sparsity model would behave if trained that way from scratch.
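To make the mechanics concrete, here is a toy sketch of usage-based expert pruning (my own illustration, not Cerebras's actual REAP method): route a calibration set, count how often each expert gets picked, and keep the most-used ones.

```python
import numpy as np

def prune_experts(router_logits: np.ndarray, keep: int, top_k: int = 2) -> np.ndarray:
    """router_logits: (num_tokens, num_experts) scores from a calibration run.
    Returns the indices of the `keep` most-frequently-routed experts."""
    chosen = np.argsort(router_logits, axis=-1)[:, -top_k:]  # top_k experts per token
    counts = np.bincount(chosen.ravel(), minlength=router_logits.shape[-1])
    return np.sort(np.argsort(counts)[-keep:])  # keep the most-used experts

# e.g. 64 experts, keep the 32 most used on synthetic calibration traffic
print(prune_experts(np.random.randn(10_000, 64), keep=32))
```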
Large MoE models are too heavily bottlenecked on typical discrete GPUs. You end up pushing just a few common/non-shared layers to the GPU and running the MoE part on the CPU, because PCIe transfer bandwidth to a discrete GPU is a killer bottleneck. Platforms with a reasonable amount of unified memory are more balanced despite their lower VRAM bandwidth, and can more easily run even larger models by streaming inactive weights from SSD (though this quickly becomes overkill as you get increasingly bottlenecked by storage bandwidth; at that point you'd be better off with a plain HEDT accessing lots of fast storage in parallel via abundant PCIe lanes).
The value prop for the Nvidia one is simple: playing with CUDA with wide enough RAM at okay enough speeds, then running your actual workload on a server running the same (not really, lol, Blackwell does not mean Blackwell…) architecture.
They're fine-tuning and teaching boxes, not inference boxes. IMO anyway, that's what mine is for.
That Codex one comes from the new `github` plugin, which includes a `github:yeet` skill. There are several ways to disable it: you can disconnect GitHub from Codex entirely, uninstall the plugin, or add this to your config.toml:
```toml
[[skills.config]]
name = "github:yeet"
enabled = false
```
I agree that skill is too opinionated as written, with effects beyond just creating branches.
What's weird is, I never installed any github plugins, or indeed made any customization to Codex other than updating via brew... so I was very confused when this started happening.
From my point of view, Parakeet is not very good at formatting its output, so it would be nice to see a small model focused on producing nicely formatted (and correct) text rather than just chasing the lowest WER score: rewarding the model for inserting logical line breaks, quotation marks, etc.
I wish someone would thoroughly measure prompt processing speeds across the major providers too. Output speeds are useful as well, but they're more commonly measured.
In my use case for small models, I typically generate at most 100 tokens per API call, so prompt processing takes up the majority of the wait time from the user's perspective. I found OAI's models to be quite poor at this and made the switch to Anthropic's API just for that.
I've found Haiku to be pretty fast at PP, but I would be willing to investigate another provider if they offer faster speeds.
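If anyone wants to measure this themselves, time-to-first-token on a long prompt is a decent proxy for PP speed. A minimal sketch with Anthropic's Python SDK (the model name is a placeholder; swap in whatever alias is current):

```python
import time
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
start = time.monotonic()
first_token = None
with client.messages.stream(
    model="claude-3-5-haiku-latest",  # placeholder alias
    max_tokens=100,
    messages=[{"role": "user", "content": "your long prompt here"}],
) as stream:
    for _ in stream.text_stream:
        if first_token is None:
            first_token = time.monotonic()
print(f"time to first token: {first_token - start:.2f}s")  # ~prompt processing
```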
> Siri/iOS-Dictation is truly good when it comes to understanding the speech.
What...? It is terrible, even compared to Whisper Tiny, which was released years ago under an Apache 2.0 license so Apple could have adopted it instantly and integrated it into their devices. The bigger Whisper models are far better, and Parakeet TDT V2 (English) / V3 (Multilingual) are quite impressive and very fast.
I have no idea what would make someone say that iOS dictation is good at understanding speech... it is so bad.
For a company that talks so much about accessibility, it is baffling to me that Apple continues to ship such poor quality speech to text with their devices.
Parakeet is insanely fast and much more accurate, and it doesn't really matter that Whisper requires hacks to work live when those hacks have existed for years and work great. (The Hello Transcribe app on iOS is a great example of how well Whisper can work with live streaming on an iPhone. The smaller models are extremely fast, even with the "hacks".)
Parakeet TDT's architecture is actually a really cool way to boost both the speed and efficiency of real time STT compared to traditional approaches.
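If you want to try it, the NeMo route is only a few lines (a sketch; check the Hugging Face model card for the exact model name and audio requirements):

```python
import nemo.collections.asr as nemo_asr

# Parakeet TDT via NVIDIA NeMo; expects 16kHz mono audio
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
out = model.transcribe(["speech.wav"])
print(out[0])
```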
Terrible relative to everything else that exists today. I have a neutral American accent.
Maybe you just don’t know what you’re missing? Google’s default speech to text is still bad compared to Whisper and Parakeet, but even Google’s is markedly better than Apple’s.
I cannot think of a single speech to text system that I’ve run into in the past 5 years that is less accurate than the one Apple ships.
Sure, Apple’s speech to text is incredible compared to what was on the flip phone I had 20 years ago. Terrible is relative. Much better options exist today, and they’re under very permissive licenses. Apple’s refusal to offer a better, more accessible experience to their users is frustrating when they wouldn’t even have to pay a licensing fee to ship something better. Whisper was released under a permissive license nearly 4 years ago.
Apple also restricts third party keyboards to an absurdly tiny amount of memory, so it isn’t even possible to ship a third party keyboard that provides more accurate on-device speech to text without janky workarounds (requiring the user to open the keyboard's own app first each time).
As someone who tried every TTS in existence a few years ago for some product work, Apple's is so consistently better that we wound up getting a bunch of Apple stuff just for the TTS.
Neutral here means not strongly identifiable as any particular regional American accent. Some people have very strong regional accents, some don’t. It is still clearly an American accent, not British or anything else.
Why on earth would Anthropic commit to interoperability?
That is the company that doesn't interoperate with the standard LLM APIs that OpenAI developed, which everyone else in the industry has adopted and uses. Whether or not OpenAI's APIs are great, they are the standard the industry has settled on.
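Concretely, interoperating with the standard means any OpenAI-compatible provider works with the stock client just by swapping the base URL (a sketch; the URL and model name are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="https://example-provider.com/v1", api_key="...")
resp = client.chat.completions.create(
    model="some-model",  # placeholder
    messages=[{"role": "user", "content": "hello"}],
)
print(resp.choices[0].message.content)
```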
The A18 Pro performs about on par with an M4 in terms of single-threaded performance, and a little better than an M1 in terms of multi-threaded performance.
The MacBook Neo has one of the fastest processors on the market for single threaded tasks, which is what has the most impact on how "fast" a processor feels for day to day usage.
I actually used a netbook when I was in school, it wasn't all that bad.
People think I mentioned my (somewhat) disappointment with the CPU because it is also used in phones, but what I actually meant is that I would be interested in doing some reverse engineering work to contribute to the Asahi Linux project for the M-series chips if this were a cheap way to obtain one.
But I don't really see doing that for the A18, personally, even though I don't doubt it's a good chip!
> I actually used a netbook when I was in school, it wasn't all that bad.
The reputation problem was kind of baked in. Vista launched the same year netbooks did, and even though Vista was a disaster, "runs the latest Windows" is the smell test normal people use for whether something is a real computer.
Netbooks didn't pass.
The storage situation made Windows users miserable anyway. The SSD models had 4-8GiB of flash, and XP alone ate well over half before you'd done anything. So people bought the HDD variant instead: more space, sure, but spinning at 4,200rpm, which wasn't even the slow-but-acceptable 5,400rpm of a normal laptop drive. Then pile the standard bloatware on top of that.
Bear in mind, people chose the HDD version because it ran Vista: the thing that made it a "real" computer. The SSD variant, the one that actually worked, got ignored for exactly that reason.
Run Linux on the SSD variants though, and the thing was actually great.
> I would be interested in doing some reverse engineering work to contribute to the Asahi Linux project for the M-chips if this was a cheap option to attain one.
Why don't you buy a used M1 from eBay? You can probably get one for less than 500 USD.
I used a first-gen eeepc with Linux in college. I didn't have any problems with speed for normal use, though I ssh'd into servers for anything more intensive than running a browser.
I would not say a full year... not even close to a year: GLM-5 is very close to the frontier (https://artificialanalysis.ai/).
Artificial Analysis isn't perfect, but it is an independent third party that actually runs the benchmarks themselves, and they use a wide range of benchmarks. It is a better automated litmus test than any other that I've been able to find in years of watching the development of LLMs.
Benchmarks are always fishy; you need to look at the things you'd actually use the model for in the real world. From that point of view, the SOTA for open models is quite far behind.
If benchmarks are fishy, it seems their bias would be to produce better scores than expected for proprietary models, since they have more incentives to game the benchmarks.
No... benchmarks are not always "fishy." That is just a defense people use when they have nothing else to point to. I already said the benchmarks aren't perfect, but they are much better than claiming vibes are a more objective way to look at things. Yes, you should test for your individual use case, which is itself a benchmark.
As I said, I have been following this stuff closely for many years now. My opinion is not informed just by looking at a single chart, but by a lot of experience. The chart is less fishy than blanket statements about the closed models somehow being way better than the benchmarks show.