I experimented with different optimizations and ended with 128x speedup. The improvement mainly comes from manual SIMD intrinsics, but you can go a long way just by making the code more auto-vectorization friendly as some other comments have mentioned. See:
https://ipthomas.com/blog/2023/07/n-times-faster-than-c-wher...