
Compute (and lots of it) is absolutely needed for generation - tens of billions of FLOPs per token even for the smaller models (7B) - with compute for larger models scaling proportionally.

Each token requires a forward pass through all transformer layers, involving large matrix multiplications at every step, followed by a final projection to the vocabulary.
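The "tens of billions of FLOPs per token" figure follows from a standard back-of-the-envelope rule: a dense forward pass costs roughly 2 FLOPs per parameter per token (one multiply and one add for each weight in the matrix multiplications). A minimal sketch, where the model sizes are just illustrative:

```python
# Rough rule of thumb: ~2 FLOPs per parameter per generated token
# (one multiply + one add per weight across all the matmuls).
def flops_per_token(n_params: float) -> float:
    return 2 * n_params

for name, n in [("7B", 7e9), ("70B", 70e9)]:
    print(f"{name}: ~{flops_per_token(n) / 1e9:.0f} GFLOPs per token")
# → 7B: ~14 GFLOPs per token
# → 70B: ~140 GFLOPs per token
```

This ignores attention over the KV cache, which adds a context-length-dependent term on top, so treat it as a floor rather than an exact count.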



Obviously I don't mean literally zero compute. The amount of compute needed scales with the number of parameters, but I have yet to use a model that has so many parameters that token generation becomes compute-bound. (Up to 104B for dense models.) During token generation most of the time is spent idle waiting for weights to transfer from memory. The processor is bored out of its mind waiting for more data. Memory bandwidth is the bottleneck.
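The claim is easy to sanity-check with a roofline-style estimate: at batch size 1, every weight must be streamed from memory once per token, so the memory time dwarfs the compute time. A sketch with illustrative numbers (the bandwidth and FLOP figures below are assumptions for a high-end consumer chip, not measurements):

```python
# Sketch: why single-stream decoding is memory-bandwidth bound.
# Assumed hardware figures (illustrative only): ~800 GB/s memory
# bandwidth, ~40 TFLOP/s of compute, 7B model in 16-bit weights.
n_params = 7e9
bytes_per_weight = 2            # fp16/bf16
bandwidth = 800e9               # bytes/s (assumed)
peak_flops = 40e12              # FLOP/s (assumed)

# Every weight is read once per token; ~2 FLOPs are done per weight.
memory_time = n_params * bytes_per_weight / bandwidth
compute_time = 2 * n_params / peak_flops

print(f"memory: {memory_time * 1e3:.1f} ms/token")   # 17.5 ms
print(f"compute: {compute_time * 1e3:.2f} ms/token") # 0.35 ms
```

On these assumed numbers the processor spends roughly 50x longer waiting on memory than computing, which is exactly the "bored out of its mind" situation described above.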


It sounds like you aren’t batching efficiently if you are being bound by memory bandwidth.


That’s right - in the context of Apple silicon and Strix Halo, these use cases don’t involve much batching.



