It doesn't sound to me like it's quite "tokens further in the past get exponentially less attention." What they say is "attention allocation decreases exponentially as the distance between tokens grows." So rather than every pair of tokens getting the same attention, which is what makes the cost quadratic, pairs of tokens that are farther apart from each other get exponentially less. It doesn't matter how far they are from the final token.
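To make the distinction concrete, here is a toy sketch of the two readings. It is only an illustration, not the paper's actual mechanism: the function names, the `base` decay factor, and the idea of a per-pair "weight" are all my own assumptions.

```python
def pairwise_decay(n, base=0.5):
    """Weight for each (i, j) pair depends only on |i - j|,
    i.e. how far the two tokens are from each other (hypothetical toy model)."""
    return [[base ** abs(i - j) for j in range(n)] for i in range(n)]


def recency_decay(n, base=0.5):
    """Weight for token j depends only on its distance from the final token,
    i.e. the 'further in the past gets less' reading (hypothetical toy model)."""
    return [[base ** (n - 1 - j) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    for name, fn in (("pairwise", pairwise_decay), ("recency", recency_decay)):
        print(name)
        for row in fn(4):
            print("  ", row)
```

Under the pairwise reading, two adjacent tokens at the very start of the sequence still get a high weight with each other; under the recency reading they would both be heavily discounted no matter which token is looking at them.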
This seems to me more like a general computational approach than a hand-coded heuristic. David Shapiro claims it's similar to how the brain works, and has a neat analogy for it here: https://www.youtube.com/watch?v=R0wBMDoFkP0