For the many DGX Spark and Strix Halo users with 128GB of memory, I believe the ideal model size would probably be a MoE with close to 200B total parameters and a low active count of 3B to 10B.
I would personally love to see a super sparse 200B A3B model, just to see what is possible. These machines don't have a lot of bandwidth, so a low active count is essential to getting good speed, and a high total parameter count gives the model greater capability and knowledge.
It would also be essential to have the Q4 QAT, of course. Then the 200B model weights would take up ~100GB of memory, not including the context.
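Back of the envelope (assuming roughly half a byte per parameter at Q4; real quant formats carry a bit of overhead):

```python
total_params = 200e9
bytes_per_param = 0.5  # Q4 ~ 4 bits/param; assumption, ignores quant overhead
print(f"{total_params * bytes_per_param / 1e9:.0f} GB")  # ~100 GB of weights, context extra
```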
The common 120B size these days leaves a lot of unused memory on the table on these machines.
I would also like the larger models to support audio input, not just the E2B/E4B models. And audio output would be great too!
Following the current rule of thumb for MoE capability, `sqrt(total_params * active_params)`, a 200B-A3B would have the intelligence of a ~24B dense model.
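That is just the geometric mean of total and active:

```python
import math
print(math.sqrt(200 * 3))  # ~24.5, hence the ~24B-dense equivalent
```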
That seems pointless. You can achieve that with a single 24GB graphics card already.
I wonder if it would even hold up at that level, as 3B active is really not a lot to work with. Qwen 3.5 uses 122B-A10B and still is neck and neck with the 27B dense model.
I don't see any value proposition for these little boxes like DGX Spark and Strix Halo. Lots of too-slow RAM to do anything useful except run mergekit. imo you'd have been better off building a desktop computer with two 3090s.
That rule of thumb was invented years ago, and I don’t think it is relevant anymore, despite how frequently it is quoted on Reddit. It is certainly not the "current" rule of thumb.
For the sake of argument, even if we take that old rule of thumb at face value, you can see how the MoE still wins:
- (DGX Spark) 273GB/s of memory bandwidth with 3B active parameters at Q4, i.e. ~1.5GB read per token: 273 / 1.5 = 182 tokens per second as the theoretical maximum.
- (RTX 3090) 936GB/s with 24B parameters at Q4, i.e. ~12GB per token: 936 / 12 = 78 tokens per second. Or 39 tokens per second if you wanted to run at Q8 to maximize the memory usage on the 24GB card.
The "slow" DGX Spark is now more than twice as fast as the RTX 3090, thanks to an appropriate MoE architecture. Even with two RTX 3090s, you would still be slower. All else being equal, I would take 182 tokens per second over 78 any day of the week. Yes, an RTX 5090 would close that gap significantly, but you mentioned RTX 3090s, and I also have an RTX 3090-based AI desktop.
(The above calculation is dramatically oversimplified, but the end result holds, even if the absolute numbers would probably be less for both scenarios. Token generation is fundamentally bandwidth limited with current autoregressive models. Diffusion LLMs could change that.)
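Here is that napkin math as a reusable snippet, under the same oversimplified assumption (bandwidth-bound decode, one full read of the active weights per generated token):

```python
def max_decode_tps(bandwidth_gb_s: float, active_params_b: float, bytes_per_param: float) -> float:
    # Theoretical ceiling: each generated token reads all active weights once
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

print(max_decode_tps(273, 3, 0.5))   # DGX Spark, 200B-A3B at Q4 -> ~182 t/s
print(max_decode_tps(936, 24, 0.5))  # RTX 3090, 24B dense at Q4 -> ~78 t/s
print(max_decode_tps(936, 24, 1.0))  # RTX 3090, 24B dense at Q8 -> ~39 t/s
```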
The mid-size frontier models are rumored to be extremely sparse like that, but 10x larger on both total and active. No one has ever released an open model that sparse for us to try out.
As I said, I want to see what is possible for Google to achieve.
> Qwen 3.5 uses 122B-A10B and still is neck and neck with the 27B dense model.
From what I've seen, having used both, I would anecdotally report that the 122B model is better in ways that aren't reflected in benchmarks, with more inherent knowledge and more adaptability. But, I agree those two models are quite close, and that's why I want to see greater sparsity and greater total parameters: to push the limits and see what happens, for science.
Kimi 2.5 is relatively sparse at 1T/32B; GLM 5 does 744B/40B so only slightly denser. Maybe you could try reducing active expert count on those to artificially increase sparsity, but I'm sure that would impact quality.
Reducing the expert count after training causes catastrophic loss of knowledge and skills. Cerebras does this with their REAP models (although REAP prunes the total set of experts, rather than routing to fewer experts per token), and it can be okay for very specific use cases if you measure which experts your workload actually hits and carefully delete the least used ones. But it doesn't provide any general insight into how a higher-sparsity model would behave if trained that way from scratch.
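To make the mechanics concrete, here is a toy sketch of usage-based expert pruning (my own illustration, not Cerebras's actual REAP method): route a calibration set, count how often each expert gets picked, and keep the most-used ones.

```python
import numpy as np

def prune_experts(router_logits: np.ndarray, keep: int, top_k: int = 2) -> np.ndarray:
    """router_logits: (num_tokens, num_experts) scores from a calibration run.
    Returns the indices of the `keep` most-frequently-routed experts."""
    chosen = np.argsort(router_logits, axis=-1)[:, -top_k:]  # top_k experts per token
    counts = np.bincount(chosen.ravel(), minlength=router_logits.shape[-1])
    return np.sort(np.argsort(counts)[-keep:])  # keep the most-used experts

# e.g. 64 experts, keep the 32 most used on synthetic calibration traffic
print(prune_experts(np.random.randn(10_000, 64), keep=32))
```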
Large MoE models are too heavily bottlenecked on typical discrete GPUs. You end up pushing just a few common/non-shared layers to the GPU and running the MoE part on the CPU, because PCIe transfer bandwidth to a discrete GPU is a killer bottleneck. Platforms with a reasonable amount of unified memory are more balanced despite their lower VRAM bandwidth, and can more easily run even larger models by streaming inactive weights from SSD (though this quickly becomes overkill as you get increasingly bottlenecked by storage bandwidth; at that point you'd be better off with a plain HEDT accessing lots of fast storage in parallel via abundant PCIe lanes).
The value prop for the Nvidia one is simple: playing with CUDA with wide enough RAM at okay enough speeds, then running your actual workload on a server running the same (not really, lol, Blackwell does not mean Blackwell…) architecture.
They're fine-tuning and teaching boxes, not inference boxes. IMO anyway, that's what mine is for.
That Codex one comes from the new `github` plugin, which includes a `github:yeet` skill. There are several ways to disable it: you can disconnect GitHub from Codex entirely, uninstall the plugin, or add this to your config.toml:
```toml
[[skills.config]]
name = "github:yeet"
enabled = false
```
I agree that skill is too opinionated as written, with effects beyond just creating branches.
What's weird is, I never installed any github plugins, or indeed made any customization to Codex other than updating via brew... so I was very confused when this started happening.
From my point of view, Parakeet is not very good at formatting its output, so it would be nice to see a small model focused on producing nicely formatted (and correct) text rather than just chasing the lowest WER score: rewarding the model for inserting logical line breaks, quotation marks, etc.
I wish someone would thoroughly measure prompt processing speeds across the major providers too. Output speeds are useful as well, but they're more commonly measured.
In my use case for small models, I typically generate at most 100 tokens per API call, so prompt processing takes up the majority of the wait time from the user's perspective. I found OAI's models to be quite poor at this and made the switch to Anthropic's API just for that.
I've found Haiku to be pretty fast at PP, but I would be willing to investigate another provider if they offer faster speeds.
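If anyone wants to measure this themselves, time-to-first-token on a long prompt is a decent proxy for PP speed. A minimal sketch with Anthropic's Python SDK (the model name is a placeholder; swap in whatever alias is current):

```python
import time
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
start = time.monotonic()
first_token = None
with client.messages.stream(
    model="claude-3-5-haiku-latest",  # placeholder alias
    max_tokens=100,
    messages=[{"role": "user", "content": "your long prompt here"}],
) as stream:
    for _ in stream.text_stream:
        if first_token is None:
            first_token = time.monotonic()
print(f"time to first token: {first_token - start:.2f}s")  # ~prompt processing
```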
> Siri/iOS-Dictation is truly good when it comes to understanding the speech.
What...? It is terrible, even compared to Whisper Tiny, which was released years ago under an Apache 2.0 license so Apple could have adopted it instantly and integrated it into their devices. The bigger Whisper models are far better, and Parakeet TDT V2 (English) / V3 (Multilingual) are quite impressive and very fast.
I have no idea what would make someone say that iOS dictation is good at understanding speech... it is so bad.
For a company that talks so much about accessibility, it is baffling to me that Apple continues to ship such poor quality speech to text with their devices.
Parakeet is insanely fast and much more accurate, and it doesn't really matter that Whisper requires hacks to work live when those hacks have existed for years and work great. (The Hello Transcribe app on iOS is a great example of how well Whisper can work with live streaming on an iPhone. The smaller models are extremely fast, even with the "hacks".)
Parakeet TDT's architecture is actually a really cool way to boost both the speed and efficiency of real time STT compared to traditional approaches.
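If you want to try it, the NeMo route is only a few lines (a sketch; check the Hugging Face model card for the exact model name and audio requirements):

```python
import nemo.collections.asr as nemo_asr

# Parakeet TDT via NVIDIA NeMo; expects 16kHz mono audio
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
out = model.transcribe(["speech.wav"])
print(out[0])
```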
Terrible relative to everything else that exists today. I have a neutral American accent.
Maybe you just don’t know what you’re missing? Google’s default speech to text is still bad compared to Whisper and Parakeet, but even Google’s is markedly better than Apple’s.
I cannot think of a single speech to text system that I’ve run into in the past 5 years that is less accurate than the one Apple ships.
Sure, Apple’s speech to text is incredible compared to what was on the flip phone I had 20 years ago. Terrible is relative. Much better options exist today, and they’re under very permissive licenses. Apple’s refusal to offer a better, more accessible experience to their users is frustrating when they wouldn’t even have to pay a licensing fee to ship something better. Whisper was released under a permissive license nearly 4 years ago.
Apple also restricts third party keyboards to an absurdly tiny amount of memory, so it isn’t even possible to ship a third party keyboard that provides more accurate on-device speech to text without janky workarounds (requiring the user to open the keyboard's own app first each time).
As someone who tried every TTS in existence a few years ago for some product work, Apple's is so consistently better that we wound up getting a bunch of Apple stuff just for the TTS.
Neutral here means not strongly identifiable as any particular regional American accent. Some people have very strong regional accents, some don’t. It is still clearly an American accent, not British or anything else.
Why on earth would Anthropic commit to interoperability?
That is the company that doesn't interoperate with the standard LLM APIs that OpenAI developed, which everyone else in the industry has adopted and uses. Whether or not OpenAI's APIs are great, they are the standard the industry has settled on.
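Concretely, interoperating with the standard means any OpenAI-compatible provider works with the stock client just by swapping the base URL (a sketch; the URL and model name are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="https://example-provider.com/v1", api_key="...")
resp = client.chat.completions.create(
    model="some-model",  # placeholder
    messages=[{"role": "user", "content": "hello"}],
)
print(resp.choices[0].message.content)
```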
The A18 Pro performs about on par with an M4 in terms of single-threaded performance, and a little better than an M1 in terms of multi-threaded performance.
The MacBook Neo has one of the fastest processors on the market for single threaded tasks, which is what has the most impact on how "fast" a processor feels for day to day usage.
I actually used a netbook when I was in school, it wasn't all that bad.
People think I mentioned my (somewhat) disappointment with the CPU because it is also used in phones, but what I actually meant is that I would be interested in doing some reverse engineering work to contribute to the Asahi Linux project for the M-series chips if this were a cheap way to obtain one.
But I don't really see doing that for the A18, personally, even though I don't doubt it's a good chip!
> I actually used a netbook when I was in school, it wasn't all that bad.
The reputation problem was kind of baked in. Vista launched the same year netbooks did, and even though Vista was a disaster, "runs the latest Windows" is the smell test normal people use for whether something is a real computer.
Netbooks didn't pass.
The storage situation made Windows users miserable anyway. The SSD models had 4-8GiB of flash, and XP alone ate well over half before you'd done anything. So people bought the HDD variant instead: more space, sure, but spinning at 4,200rpm, which wasn't even the slow-but-acceptable 5,400rpm of a normal laptop drive. Then pile the standard bloatware on top of that.
Bear in mind, people chose the HDD version because it ran Vista: the thing that made it a "real" computer. The SSD variant, the one that actually worked, got ignored for exactly that reason.
Run Linux on the SSD variants though, and the thing was actually great.
> I would be interested in doing some reverse engineering work to contribute to the Asahi Linux project for the M-chips if this was a cheap option to attain one.
Why don't you buy a used M1 from eBay? You can probably get one for less than 500 USD.
I used a first-gen eeepc with Linux in college. I didn't have any problems with speed for normal use, though I ssh'd into servers for anything more intensive than running a browser.
I would not say a full year... not even close to a year: GLM-5 is very close to the frontier (https://artificialanalysis.ai/).
Artificial Analysis isn't perfect, but it is an independent third party that actually runs the benchmarks themselves, and they use a wide range of benchmarks. It is a better automated litmus test than any other that I've been able to find in years of watching the development of LLMs.
Benchmarks are always fishy; you need to look at the things you'd actually use the model for in the real world. From that point of view, the SOTA for open models is quite far behind.
If benchmarks are fishy, it seems their bias would be to produce better scores than expected for proprietary models, since they have more incentives to game the benchmarks.
No... benchmarks are not always "fishy." That is just a defense people use when they have nothing else to point to. I already said the benchmarks aren't perfect, but they are much better than claiming vibes are a more objective way to look at things. Yes, you should test for your individual use case, which is itself a benchmark.
As I said, I have been following this stuff closely for many years now. My opinion is not informed just by looking at a single chart, but by a lot of experience. The chart is less fishy than blanket statements about the closed models somehow being way better than the benchmarks show.