1) Pretty much, it's mathematically equivalent. The only software issues are things like managing dependency versions and in-memory data formats, but Flash Attention 2 is already built into HuggingFace and other popular libraries. Flash Attention 3 probably will be soon, although it requires an H100 (Hopper) GPU to run.
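For example, here's a rough sketch of turning it on through the HuggingFace transformers API (assumes a recent transformers version with the flash-attn package installed; the model id is just a placeholder):

    import torch
    from transformers import AutoModelForCausalLM

    # Placeholder model id; any model with FA2 support works the same way.
    model_id = "meta-llama/Llama-2-7b-hf"

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,                # FA2 needs fp16 or bf16
        attn_implementation="flash_attention_2",  # swap in the FA2 kernels
        device_map="auto",
    )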
2) Flash Attention 2 added support for GQA in past version updates (rough usage sketch below):
https://github.com/Dao-AILab/flash-attention
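A rough sketch of what the GQA path looks like, using the flash_attn_func entry point from the flash-attn package (head counts and shapes are made up for illustration):

    import torch
    from flash_attn import flash_attn_func

    # 32 query heads sharing 8 KV heads; FA2 only requires that the number
    # of query heads be divisible by the number of KV heads.
    # Tensor shapes are (batch, seqlen, nheads, headdim).
    batch, seqlen, head_dim = 2, 1024, 128
    q = torch.randn(batch, seqlen, 32, head_dim, dtype=torch.float16, device="cuda")
    k = torch.randn(batch, seqlen, 8, head_dim, dtype=torch.float16, device="cuda")
    v = torch.randn(batch, seqlen, 8, head_dim, dtype=torch.float16, device="cuda")

    out = flash_attn_func(q, k, v, causal=True)  # (batch, seqlen, 32, head_dim)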
3) They're comparing this implementation of Flash Attention (written in raw CUDA C++) against the Triton implementation of a similar fused-attention algorithm: https://triton-lang.org/main/getting-started/tutorials/06-fu...