
It doesn't sound to me like it's quite "tokens further in the past get exponentially less attention." What they say is "attention allocation decreases exponentially as the distance between tokens grows." So instead of the cost being quadratic because every pair of tokens gets the same attention, pairs of tokens that are farther apart from each other get exponentially less. What matters is the distance between the pair, not how far they are from the final token.
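To illustrate the distinction (a toy sketch, not the paper's actual mechanism: I'm assuming the decay is applied as a multiplicative factor on pre-softmax attention scores, i.e. additive in log-space):

```python
import numpy as np

def decayed_attention(scores, decay=0.9):
    """Causal softmax attention where the score between tokens i and j
    is damped by decay**|i - j| -- a function of pairwise distance,
    not of distance from the final token."""
    n = scores.shape[0]
    # |i - j| distance matrix between every pair of positions
    dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    # multiplying the post-softmax weight by decay**dist is the same
    # as adding log(decay) * dist to the pre-softmax score
    weights = scores + np.log(decay) * dist
    # causal mask: token i cannot attend to j > i
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    weights = np.where(mask, -np.inf, weights)
    # numerically stable softmax over each row
    w = np.exp(weights - weights.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)
```

With uniform raw scores, each extra step of distance between a pair shrinks its attention weight by the same factor, regardless of where the pair sits in the sequence.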

This seems to me more like a general computational approach than a hand-coded heuristic. David Shapiro claims it's similar to how the brain works, and has a neat analogy for it here: https://www.youtube.com/watch?v=R0wBMDoFkP0



This is intriguing but I don't quite follow - really naive questions, but:

Isn't the final token at some position N?

And given a context size limit Y, when we generate the next token, right now attention covers tokens N - Y through N?

And this proposal gives attention over tokens 0 through N, but the attention decreases exponentially as we approach token 0?



