Sorry, that's exactly what you said and the reason why we are having this discus...

ryao · on Jan 11, 2025

I am a well known OSS developer with hundreds of commits in OpenZFS and many commits in other projects like Gentoo and the Linux kernel. You keep misreading what I wrote and insist that I said something I did not. The issue is your lack of understanding, not mine.

I said that supporting 2 AVX-512 reads per cycle instead of 1 AVX-512 read per cycle does not actually matter very much for performance. You decided that means I said that AVX-512 does not matter. These are very different things.

If you try to use 2 AVX-512 reads per cycle for some workload (e.g. checksumming, GEMV, memcpy, etcetera), then you are going to be memory bandwidth bound such that the code will run no faster than if it did 1 AVX-512 read per cycle. I have written SIMD accelerated code for CPUs and the CPU being able to issue 2 SIMD reads per cycle would make zero difference for performance in all cases where I would want to use it. The only way 2 AVX-512 reads per cycle would be useful would be if system memory could keep up, but it cannot.

janwas · on Jan 11, 2025

I agree server CPUs are underprovisioned for memBW. Each core's share is 2-4 GB/s, whereas each could easily drive 10 GB/s (Intel) or 20+ (AMD).

I also agree "some" (for example low-arithmetic-intensity) workloads will not benefit from a second L1 read port.

But surely there are other workloads, right? If I want to issue one FMA per cycle, streaming from two arrays, doesn't that require maintaining two loads per cycle?

ryao · on Jan 11, 2025

In an ideal situation where your arrays both fit in L1 cache and are in L1 cache, yes. However, in typical real world situations, you will not have them fit in L1 cache and then what will happen after the reads are issued will look like this:

  * Some time passes
  * Load 1 finishes
  * Some time passes
  * Load 2 finishes
  * FMA executes

As we are doing FMA on arrays, this is presumably part of a tight loop. During the first few loop iterations, the CPU core’s memory prefetcher will figure out that you have two linear access patterns and that your code is likely to request the next parts of both arrays. The memory prefetcher will then begin issuing loads before your code does and when the CPU issues a load that has already been issued by the prefetcher, it will begin waiting on the result as if it had issued the load. Internally, the CPU is pipelined, so if it can only issue 1 load per cycle, and there are two loads to be issued, it does not wait for the first load to finish and instead issues the second load on the next cycle. The second load will also begin waiting on a load that was done early by the prefetcher. It does not really matter whether you are issuing the AVX-512 loads in 1 cycle or 2 cycles, because the issue of the loads will occur in the time while we are already waiting for the loads to finish thanks to the prefetcher beginning the loads early.

There is an inherent assumption in this that the loads will finish serially rather than in parallel, and it would seem reasonable to think that the loads will finish in parallel. However, in reality, the loads will finish serially. This is because the hardware is serial. On the 9800X3D, the physical lines connecting the memory to the CPU can only send 128-bits at a time (well, 128-bits that matter for this reasoning; we are ignoring things like transparent ECC that are not relevant for our reasoning). An AVX-512 load needs to wait for 4x 128-bits to be sent over those lines. The result is that even if you issue two AVX-512 reads in a single cycle, one will always finish first and you will still need to wait for the second one.

I realize I did not address L2 cache and L3 cache, but much like system RAM, neither of those will keep up with 2 AVX-512 loads per cycle (or 1 for that matter), so what will happen when things are in L2 or L3 cache will be similar to what happens when loads come from system memory although with less time spent waiting.

It could be that you will end up with the loop finishing a few cycles faster with the 2 AVX-512 read per cycle version (because it could make the memory prefetcher realize the linear access pattern a few cycles faster), but if your loop takes 1 billion cycles to execute, you are not going to notice a savings of a few cycles, which is why I think being able to issue 2 AVX-512 loads instead of 1 in a single cycle does not matter very much.

Does my explanation make sense?

janwas · on Jan 11, 2025

OK, we agree that L1-resident workloads see a benefit. I also agree with your analysis if the loads actually come from memory.

Let's look at a more interesting case. We have a dataset bigger than L3. We touch a small part of it with one kernel. That is now in L1. Next we do a second kernel where each of the loads of this part are L1 hits. With two L1 ports, the latter is now twice as fast.

Even better, we can work on larger parts of the data such that it still fits in L2. Now, we're going to do the above for each L1-sized piece of the L2. Sure, the initial load from L2 isn't happening as fast as 2x64 bytes per cycle. But still, there are many L1 hits and I'm measuring effective FMA throughput that is _50 times_ as high as the memory bandwidth would allow when only streaming from memory. It's simply a matter of arranging for reuse to be possible, which admittedly does not work with single-pass algorithms like a checksum.

Do you find this reasoning convincing?

ryao · on Jan 11, 2025

The purpose of L1 cache is to avoid long round trips to memory. What you describe is L1 cache doing what it is intended to do. Unfortunately, I do not have your code, so it is not clear to me that it benefits from doing 2 AVX-512 loads per cycle.

I am also not sure what CPU this is. On recent AMD processors at the very least, it should be impossible to get FMA throughput that is 50 times higher from L1 cache bandwidth than system memory bandwidth. On the Ryzen 7 9800X3D for example, a single core is limited to about 64GB/sec. 50 times more would be 3.2TB/sec, which is ~5 times faster than possible to load from L1 cache even with 2 AVX-512 loads per cycle.

I wonder if you are describing some sort of GEMM routine, which is a place where 50 times more FMA throughput is possible if you do things in a clever way. GEMM is somewhat weird, since without copying to force things into L1 cache, it does not run at full speed, and memory bandwidth from RAM is always below peak memory bandwidth, even without the memcpy() trick to force things into L1 cache. That excludes the case where you stuff GEMV in GEMM, where it does become memory bandwidth bound.

janwas · on Jan 12, 2025

The code is unfortunately not (yet) open source. The CPU with 50x is an SKX Gold, and it is similar for Zen4. I compute this ratio as #FMA * 4 / total system memory bandwidth. We are indeed not fully memBW bound :)

menaerus · on Jan 14, 2025

I'd be curious if you measured 50x on a single core implementation or is the algorithm distributed to multiple cores?

I ask because you say that the results are similar to Zen4 so this would sorta imply that you run and measure single-core implementation? Intel in multi-core load-store looses a lot of bandwidth when compared to Zen3/4/5 since there's a lot of contention going on due to Intel cache architecture.