> In I/O heavy workloads, the correct thing to do is to do asynchronous I/O
You can't do async I/O on mmap'd memory reads (or the GPU equivalent: cudaMallocManaged).
Today's I/O is looking more-and-more like a memory read/write. As such, your typical "node = node->next" pointer traversals could very well be an incredible string of I/O. That indirection could be in RAM, over SSD (mmap), over RDMA (ethernet pretending to be RAM), or into GPU-RAM (cudaMallocManaged)... with an appropriate PCIe command (possibly atomic-PCIe) to boot.
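To make that concrete, here's a minimal C sketch of the "node = node->next" traversal over an mmap'd file: the loop is syntactically plain memory access, but every dereference can page-fault into a disk read. (The path, node layout, and function name are illustrative; error handling is omitted for brevity.)

```c
// Pointer-chasing over an mmap'd file: each hop looks like a memory
// read, but physically it can be a page fault serviced by disk I/O.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

struct node { long value; long next; };   // next = byte offset in file, -1 = end

// Build a file of `count` chained nodes, mmap it, and walk the chain.
long sum_chain(const char *path, int count) {
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    for (int i = 0; i < count; i++) {
        struct node n = { i, (i + 1 < count) ? (long)((i + 1) * sizeof n) : -1 };
        write(fd, &n, sizeof n);
    }
    size_t len = (size_t)count * sizeof(struct node);
    char *base = mmap(0, len, PROT_READ, MAP_PRIVATE, fd, 0);
    long sum = 0;
    for (long off = 0; off != -1; ) {
        struct node *n = (struct node *)(base + off);  // hidden, potential I/O
        sum += n->value;
        off = n->next;                                 // node = node->next
    }
    munmap(base, len);
    close(fd);
    unlink(path);
    return sum;
}
```

There is no completion event anywhere in that loop for an async API to hook into; the kernel services the faults behind your back.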
Async only works on the simplest of reading/writing patterns.
EDIT: Consider the "persistent kernel" pattern on GPGPU (which starts from a lot of the same thought process you'd have in CPU-land: launch only X wavefronts, where X is the size of the GPU). You primarily communicate with a persistent kernel over RAM / atomics. You don't use the classic CUDA <<<blocks, threads>>>() launch (which admittedly has an async interface)... Instead, you read/write to magic managed-memory locations that will eventually be synced over PCIe to the GPU. This is because the "persistent kernel" is always executing. You launched it at program start; there's no event to async-trigger on. You just pass data to the GPU. The GPU operates on the data at its leisure. Then it eventually returns data to a buffer elsewhere (probably with atomic compare-and-swaps, which traverse PCIe these days, to ensure coherence)
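Here's a minimal CPU-side sketch of that mailbox pattern, with a POSIX thread standing in for the persistent kernel and C11 atomics standing in for the PCIe-synced managed memory. All names are illustrative; on a real GPU the worker loop would live in a CUDA kernel spinning on cudaMallocManaged memory.

```c
// Persistent-worker mailbox: the worker is launched once and is always
// running; the host talks to it purely through shared memory + atomics.
#include <pthread.h>
#include <stdatomic.h>

enum { EMPTY, READY, DONE, QUIT };

struct mailbox {
    _Atomic int state;
    int input;
    int output;
};

static void *persistent_worker(void *arg) {
    struct mailbox *mb = arg;
    for (;;) {
        int s = atomic_load_explicit(&mb->state, memory_order_acquire);
        if (s == QUIT) break;
        if (s != READY) continue;            // spin: there is no event to wait on
        mb->output = mb->input * 2;          // the "kernel" computation
        atomic_store_explicit(&mb->state, DONE, memory_order_release);
    }
    return 0;
}

// Host side: write input, flip the flag, spin until the worker answers.
static int submit(struct mailbox *mb, int x) {
    mb->input = x;
    atomic_store_explicit(&mb->state, READY, memory_order_release);
    while (atomic_load_explicit(&mb->state, memory_order_acquire) != DONE)
        ;                                    // the host spins too
    return mb->output;
}

int mailbox_demo(void) {
    struct mailbox mb = { .state = EMPTY };
    pthread_t t;
    pthread_create(&t, 0, persistent_worker, &mb);
    int a = submit(&mb, 21);
    int b = submit(&mb, 100);
    atomic_store_explicit(&mb.state, QUIT, memory_order_release);
    pthread_join(t, 0);
    return a + b;
}
```

Note that neither side ever blocks in the kernel or waits on an event; all coordination is acquire/release ordering on one flag, which is exactly the shape of host-to-persistent-kernel communication over managed memory.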
I recently spent some time reading the kernel code for io_uring - though only for older revisions (5.4 - 5.7). There I found out that a lot of it is actually implemented on top of existing kernel functions for blocking I/O and polled I/O, and does not replace them to make the system fully asynchronous. Practically that means that with a 5.4 kernel doing lots of disk I/O, you will still have lots of threads blocked on I/O - only in this case those will be threads inside a kernel threadpool instead of in userspace. With 5.7 that model changed, and the threadpool is no longer necessary for read/write operations on sockets. Maybe also for files - but I don't understand the kernel code well enough to confirm or deny that. And things might obviously have changed again in newer kernel versions.
CUDA-streams probably already use io_uring under the hood.
The issue is that a CUDA-stream-based execution model will spend lots of time pulling from the queue and spinning up thread-blocks (sound like a familiar problem?).
Instead, you sidestep this issue with persistent kernels: you launch exactly as many wavefronts as the GPU can hold resident, and then pass data to them through an alternative means.
-------
The future of I/O is going to be atomic operations across PCIe with stronger-and-stronger coherency models. PCIe 3.0 introduced PCIe atomic commands (AtomicOps). PCIe 4.0 strengthened them. PCIe 5.0 is rumored to have even stronger coherency rules, and "coherent grids" are being seriously discussed / implemented in high-speed systems (see Compute Express Link, or CXL).
Any serious I/O (and GPUs will be part of that I/O future) will likely use PCIe atomics in some regard, quite similar to the atomic compare-and-swaps the CPU already performs between its cores.
I/O is becoming indistinguishable from memcpy + atomics.
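As a sketch of what "I/O as memcpy + atomics" looks like in practice: a single-producer/single-consumer ring in shared memory, where the entire "write to the device" is one memcpy followed by one atomic release store (over PCIe, that publishing store would be the atomic command). Names and sizes are illustrative.

```c
// SPSC channel: "I/O" reduced to memcpy + one atomic publish per message.
#include <stdatomic.h>
#include <string.h>

#define SLOTS 8
#define SLOT_BYTES 64

struct channel {
    char data[SLOTS][SLOT_BYTES];
    _Atomic unsigned head;   // advanced by the producer
    _Atomic unsigned tail;   // advanced by the consumer
};

// Producer: returns 0 if the ring is full.
int channel_send(struct channel *ch, const void *msg, size_t len) {
    unsigned head = atomic_load_explicit(&ch->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&ch->tail, memory_order_acquire);
    if (head - tail == SLOTS) return 0;                      // full
    memcpy(ch->data[head % SLOTS], msg, len);                // the "I/O" is a memcpy
    atomic_store_explicit(&ch->head, head + 1,
                          memory_order_release);             // ...plus one atomic publish
    return 1;
}

// Consumer: returns 0 if nothing is pending.
int channel_recv(struct channel *ch, void *msg, size_t len) {
    unsigned tail = atomic_load_explicit(&ch->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&ch->head, memory_order_acquire);
    if (head == tail) return 0;                              // empty
    memcpy(msg, ch->data[tail % SLOTS], len);
    atomic_store_explicit(&ch->tail, tail + 1, memory_order_release);
    return 1;
}

// Round-trip a message through the channel; returns 0 on success.
int channel_demo(void) {
    static struct channel ch;
    char out[SLOT_BYTES];
    if (!channel_send(&ch, "hello", 6)) return -1;
    if (!channel_recv(&ch, out, 6)) return -2;
    if (channel_recv(&ch, out, 6)) return -3;                // ring drained
    return strcmp(out, "hello");
}
```

Replace "consumer" with a device on the other end of a coherent link and nothing in the producer's code changes; that's the convergence being described.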
-------
EDIT: In effect, I'm saying that GPU-programmers are more concerned about GPU-efficiency rather than CPU-efficiency. Sure, the CPU is less efficient spinning on these memory-reads/writes. But the GPU is where the bulk of the calculations are happening. Therefore, it behooves the architect to optimize the GPU (even if the CPU ends up slightly less efficient).
Whatever argument you have about "queues being more efficient" in a hypothetical architecture, the GPU has more calculations to do and more cores to feed (aka: far more to lose to Amdahl's law). That means you want to apply those principles to the GPU first, and then the CPU "makes up for it", so to speak.
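Some back-of-the-envelope Amdahl's-law numbers (illustrative, not measured) show why the wider machine is the one to optimize:

```c
// Amdahl's law: speedup(s, n) = 1 / (s + (1 - s) / n)
// where s is the serial fraction and n is the number of lanes.
double amdahl_speedup(double serial_fraction, double lanes) {
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / lanes);
}
// With the same 1% serial fraction:
//   16-core CPU:    ~13.9x out of 16x available   (~87% of the machine used)
//   10000-lane GPU: ~99x   out of 10000x available (~1% of the machine used)
// The same coordination overhead costs the wide machine vastly more,
// which is why you shave serial cost on the GPU side first.
```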
I was under the impression that PCI-E was perfectly capable of sending notifications from one device to another in a somewhat efficient manner. Having said that, this is not my area of expertise - and I do see that if your main concern is to feed the GPU then blocking a thread might be the optimal solution. I assume that MSI would be too much overhead and might involve some context switching to service the interrupt from the kernel etc to allow for asynchronous completion? Also, is it possible to have overlapping memory regions between a high speed networking card and the input buffer from the GPU, which in effect just means that the CPU just has to tell the GPU to start reading once the network card is done receiving?
Having said that, I don't believe that for most application developers this is a major concern - in cases where you flood the GPU with a firehose of data to compute on, you probably also don't care about what other processes run on the machine, or whether your architectural decisions end up making people's laps uncomfortably hot. I also do not believe that the future of all I/O is just memcpy and atomics - we can already do that today, and it doesn't really bring any speed advantage in the general case. I think the future of I/O is memcpy, atomics, and a good signaling mechanism that can report I/O task completion without costly context switches and with as little extraneous memory allocation as possible. Moreover, the future of consumer computing will probably not rely on PCI-E at all and will instead have the GPU and the CPU share all of their memory. And hey, maybe Nvidia will add some cool ARM companion cores to their biggest chips, slap some DDR5 slots on their cards, and sell self-contained solutions, sidestepping PCI-E entirely - at least for feeding data from the CPU to the GPU.
Recent datacenter network controllers (Mellanox, Marvell) have "GPUDirect"-style capabilities: direct interactions between devices with no CPU involvement. I've also seen FPGA+network boards do that with success. And with libraries like NCCL and 200GbE links, you could almost forget you have CPUs or network links in between.
What I miss is a simple but efficient data queue between CPU and GPU. Everyone's doing manual memory reservation and cudaMemcpy; I want an async send (GPU->CPU) with an MPI- or socket-like interface. I've seen someone posting stuff about io_uring from GPU code, but just bragging, no code.
Having bought Mellanox, and with their BlueField DPU stuff (8 or 16 ARM cores integrated into the NIC), I feel NVIDIA could probably go the way you're describing. I haven't seen any Mellanox/NVIDIA tech convergence yet, though.