
Compute (and lots of it) is absolutely needed for generation - tens of billions of FLOPs per token even for the smaller models (7B) - with compute for larger models scaling proportionally.

Each token requires a forward pass through all transformer layers, involving large matrix multiplications at every step, followed by a final projection to the vocabulary.
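The "tens of billions of FLOPs per token" figure follows from a standard back-of-the-envelope rule: a dense forward pass costs roughly 2 FLOPs per parameter per token (one multiply and one add for each weight in the matrix multiplications). A minimal sketch, where the model sizes are just illustrative:

```python
# Rough rule of thumb: ~2 FLOPs per parameter per generated token
# (one multiply + one add per weight across all the matmuls).
def flops_per_token(n_params: float) -> float:
    return 2 * n_params

for name, n in [("7B", 7e9), ("70B", 70e9)]:
    print(f"{name}: ~{flops_per_token(n) / 1e9:.0f} GFLOPs per token")
# → 7B: ~14 GFLOPs per token
# → 70B: ~140 GFLOPs per token
```

This ignores attention over the KV cache, which adds a context-length-dependent term on top, so treat it as a floor rather than an exact count.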



Obviously I don't mean literally zero compute. The amount of compute needed scales with the number of parameters, but I have yet to use a model that has so many parameters that token generation becomes compute-bound. (Up to 104B for dense models.) During token generation most of the time is spent idle waiting for weights to transfer from memory. The processor is bored out of its mind waiting for more data. Memory bandwidth is the bottleneck.
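The claim is easy to sanity-check with a roofline-style estimate: at batch size 1, every weight must be streamed from memory once per token, so the memory time dwarfs the compute time. A sketch with illustrative numbers (the bandwidth and FLOP figures below are assumptions for a high-end consumer chip, not measurements):

```python
# Sketch: why single-stream decoding is memory-bandwidth bound.
# Assumed hardware figures (illustrative only): ~800 GB/s memory
# bandwidth, ~40 TFLOP/s of compute, 7B model in 16-bit weights.
n_params = 7e9
bytes_per_weight = 2            # fp16/bf16
bandwidth = 800e9               # bytes/s (assumed)
peak_flops = 40e12              # FLOP/s (assumed)

# Every weight is read once per token; ~2 FLOPs are done per weight.
memory_time = n_params * bytes_per_weight / bandwidth
compute_time = 2 * n_params / peak_flops

print(f"memory: {memory_time * 1e3:.1f} ms/token")   # 17.5 ms
print(f"compute: {compute_time * 1e3:.2f} ms/token") # 0.35 ms
```

On these assumed numbers the processor spends roughly 50x longer waiting on memory than computing, which is exactly the "bored out of its mind" situation described above.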


It sounds like you aren’t batching efficiently if you are being bound by memory bandwidth.


That’s right - in the context of Apple silicon and Strix Halo, these use cases don’t involve much batching.



