
The purpose of L1 cache is to avoid long round trips to memory. What you describe is L1 cache doing what it is intended to do. Unfortunately, I do not have your code, so it is not clear to me that it benefits from doing 2 AVX-512 loads per cycle.

I am also not sure what CPU this is. On recent AMD processors at the very least, it should be impossible for FMA throughput fed from L1 cache to be 50 times higher than what system memory bandwidth allows. On the Ryzen 7 9800X3D, for example, a single core is limited to about 64 GB/s from system memory; 50 times that would be 3.2 TB/s, which is roughly 5 times more than can be loaded from L1 cache even with 2 AVX-512 loads per cycle.
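A rough back-of-the-envelope version of that arithmetic (the ~5 GHz core clock is my assumption; the 64 GB/s single-core figure is the one quoted above):

    /* Back-of-the-envelope check: L1 bandwidth with 2 AVX-512 (64-byte) loads
     * per cycle vs. 50x the ~64 GB/s single-core DRAM bandwidth quoted above.
     * The ~5 GHz core clock is an assumption, not a measured value. */
    #include <stdio.h>

    int main(void) {
        double clock_ghz       = 5.0;                    /* assumed clock  */
        double l1_gb_per_s     = 2.0 * 64.0 * clock_ghz; /* ~640 GB/s      */
        double dram_gb_per_s   = 64.0;                   /* quoted figure  */
        double needed_gb_per_s = 50.0 * dram_gb_per_s;   /* 3.2 TB/s       */

        printf("L1 peak:  %.0f GB/s\n", l1_gb_per_s);
        printf("50x DRAM: %.0f GB/s (~%.1fx more than L1 can supply)\n",
               needed_gb_per_s, needed_gb_per_s / l1_gb_per_s);
        return 0;
    }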

I wonder if you are describing some sort of GEMM routine, which is a place where 50 times more FMA throughput is possible if you do things in a clever way. GEMM is somewhat weird: without copying operands to force them into L1 cache it does not run at full speed, and the bandwidth it draws from RAM is always below peak memory bandwidth, even without the memcpy() trick to force things into L1 cache. The exception is when you stuff GEMV into GEMM, where it does become memory bandwidth bound.
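For readers who have not seen it, the "copying" here is the packing step in blocked GEMM. A minimal sketch of the idea (block sizes and layout are illustrative, not anyone's actual kernel):

    /* Minimal sketch of GEMM packing: copy a block of B into a small
     * contiguous buffer so the inner loops reuse it from L1, raising the
     * FMA-to-DRAM-traffic ratio. Block sizes are illustrative, not tuned. */
    #include <stdlib.h>
    #include <string.h>

    #define MB 64   /* rows of A per block     */
    #define NB 64   /* cols of B per block     */
    #define KB 256  /* shared-dimension block  */

    /* C[m x n] += A[m x k] * B[k x n], all row-major */
    static void gemm_blocked(int m, int n, int k,
                             const float *A, const float *B, float *C) {
        float *Bp = malloc((size_t)KB * NB * sizeof *Bp);  /* packed B tile */
        for (int kk = 0; kk < k; kk += KB) {
            int kb = kk + KB < k ? KB : k - kk;
            for (int jj = 0; jj < n; jj += NB) {
                int nb = jj + NB < n ? NB : n - jj;
                /* pack: one pass over B brings the tile into cache once */
                for (int p = 0; p < kb; ++p)
                    memcpy(Bp + (size_t)p * nb,
                           B + (size_t)(kk + p) * n + jj,
                           (size_t)nb * sizeof *Bp);
                for (int ii = 0; ii < m; ii += MB) {
                    int mb = ii + MB < m ? MB : m - ii;
                    /* every element of the packed tile is reused mb times,
                     * so FMAs far outnumber loads from DRAM */
                    for (int i = 0; i < mb; ++i)
                        for (int p = 0; p < kb; ++p) {
                            float a = A[(size_t)(ii + i) * k + kk + p];
                            for (int j = 0; j < nb; ++j)
                                C[(size_t)(ii + i) * n + jj + j] +=
                                    a * Bp[(size_t)p * nb + j];
                        }
                }
            }
        }
        free(Bp);
    }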



The code is unfortunately not (yet) open source. The CPU with the 50x is an SKX Gold (Skylake Xeon), and it is similar on Zen 4. I compute this ratio as #FMA * 4 / total system memory bandwidth. We are indeed not fully memory-bandwidth bound :)
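In case it helps others parse the formula, here is my guess at the units; reading the *4 as 4 bytes per single-precision element is my assumption, not something the poster stated:

    /* Hypothetical reading of "#FMA * 4 / total system memory bandwidth":
     * count each FMA as 4 bytes of single-precision data it would need if
     * nothing were reused, and divide by the DRAM traffic actually observed.
     * Both inputs below are placeholders, not measured numbers. */
    #include <stdio.h>

    int main(void) {
        double fmas_per_second  = 100e9; /* measured FMA rate (placeholder)      */
        double dram_bytes_per_s = 8e9;   /* measured DRAM bandwidth (placeholder) */
        printf("ratio = %.0fx\n", fmas_per_second * 4.0 / dram_bytes_per_s);
        return 0;
    }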


I'd be curious whether you measured the 50x on a single-core implementation, or whether the algorithm is distributed across multiple cores.

I ask because you say the results are similar to Zen 4, which would sort of imply that you run and measure a single-core implementation? In multi-core load-store, Intel loses a lot of bandwidth compared to Zen 3/4/5, since there is a lot of contention due to Intel's cache architecture.
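One generic way to see that contention is to run the same streaming kernel with an increasing thread count and watch how aggregate bandwidth scales. A STREAM-triad-style sketch (not the poster's benchmark; array size and flags are placeholders, compile with something like -O2 -fopenmp):

    /* Run the same streaming triad with 1, 2, 4, ... threads and watch how
     * the aggregate bandwidth scales; with heavy cache/fabric contention it
     * flattens out well below (threads x single-core) bandwidth. */
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1L << 25)   /* 32M doubles = 256 MiB per array, well past the caches */

    int main(void) {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        for (long i = 0; i < N; ++i) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        for (int t = 1; t <= omp_get_max_threads(); t *= 2) {
            double t0 = omp_get_wtime();
            #pragma omp parallel for num_threads(t)
            for (long i = 0; i < N; ++i)
                a[i] = b[i] + 3.0 * c[i];       /* triad: 2 loads + 1 store */
            double secs = omp_get_wtime() - t0;
            printf("%2d threads: %6.1f GB/s\n",
                   t, 3.0 * N * sizeof(double) / 1e9 / secs);
        }
        free(a); free(b); free(c);
        return 0;
    }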



