SPDK will be able to fully saturate the PCIe bandwidth from a single CPU core here (no secret 6 threads inside the kernel). The drives are your bottleneck so it won't go faster, but it can use a lot less CPU.
But with SPDK you'll be talking to the disk, not to files. If you changed io_uring to read from the disk directly with O_DIRECT, you wouldn't have those extra 6 threads either. SPDK would still be considerably more CPU efficient but not 6x.
DDIO is a pure hardware feature. Software doesn't need to do anything to support it.

Source: SPDK co-creator
For an expanding array in a 64-bit address space, reserving a big region and mmapping it in as you go is usually the top-performing solution by a wide margin. At least on Linux, it is also faster to speculatively map ahead with MAP_POPULATE than to rely on page faults.
And if you find you didn't reserve enough address space, Linux has mremap(), which can grow the reserved region. Or you can map the region in two places at once (the original place and a new, larger place).
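A minimal sketch of that reserve-then-commit pattern (my own illustration, not code from this thread; the sizes are arbitrary):

    /* Reserve a large virtual range up front, then commit pages as the
     * array grows. PROT_NONE + MAP_NORESERVE costs only address space. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define RESERVE_SIZE (64ULL << 30)   /* 64 GiB of address space */
    #define CHUNK_SIZE   (64ULL << 20)   /* commit 64 MiB at a time */

    int main(void)
    {
        char *base = mmap(NULL, RESERVE_SIZE, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (base == MAP_FAILED) { perror("mmap reserve"); return 1; }

        /* Grow: overlay the next chunk with a usable mapping. MAP_POPULATE
         * pre-faults the pages so later stores don't take page faults. */
        size_t committed = 0;
        void *p = mmap(base + committed, CHUNK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_POPULATE,
                       -1, 0);
        if (p == MAP_FAILED) { perror("mmap grow"); return 1; }
        committed += CHUNK_SIZE;

        memset(base, 0xab, committed);   /* the array now owns [base, base+committed) */
        munmap(base, RESERVE_SIZE);
        return 0;
    }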
One place I had issues was rapidly allocating space I needed temporarily and then discarding it.
The space I needed was too large to be added to the heap, so I used mmap. Because of the nature of the processing (mmap, process, yeet the mmap), I put the system under a lot of pressure. Maintaining a set of mapped blocks and reusing them fixed the issue (rough sketch below).
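Something like this minimal sketch (my reconstruction of the idea, not the original code; the fixed block size is an assumption): keep finished blocks on a small free list instead of munmap()ing them, so the kernel rarely has to tear mappings down.

    #include <stddef.h>
    #include <sys/mman.h>

    #define BLOCK_SIZE (256UL << 20)   /* hypothetical 256 MiB scratch blocks */
    #define POOL_MAX   8

    static void *pool[POOL_MAX];
    static int   pool_len;

    /* Hand out a scratch block, reusing an existing mapping when possible. */
    void *block_get(void)
    {
        if (pool_len > 0)
            return pool[--pool_len];
        void *p = mmap(NULL, BLOCK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }

    /* Return a block to the pool; only unmap when the pool is full. */
    void block_put(void *p)
    {
        if (pool_len < POOL_MAX)
            pool[pool_len++] = p;
        else
            munmap(p, BLOCK_SIZE);
    }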
Freeing memory back to the OS with munmap (or similar) involves a TLB shootdown, which is a performance bottleneck almost by definition. That's probably what you ended up experiencing.
Compared to libraries like bgfx and sokol at least, I think there are two key differences.
1) SDL_gpu is a pure C library, heavily focused on extreme portability and zero dependencies. And somehow it's also an order of magnitude less code than the other options. Or at least that's a difference from bgfx; maybe not so much from sokol_gfx.
2) The SDL_gpu approach is a bit lower level. It exposes primitives like command buffers directly to your application (so you can more easily reason about multi-threading), and your software allocates transfer buffers, fills them with data, and kicks off a transfer to GPU memory explicitly rather than that happening behind the scenes. It also spawns no threads - it only takes action in response to function calls. It does take care of hard things such as getting barriers right, and provides the GPU memory allocator, so it is still substantially easier to use than something like Vulkan. But in SDL_gpu it is extremely easy to see the data movements between CPU and GPU (and the memory copies within the CPU), and to observe the asynchronous nature of the GPU work (see the sketch below). I suspect the end result of this will be that people write far more efficient renderers on top of SDL_gpu than they would have on other APIs.
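To make the explicit data-movement point concrete, here is a rough vertex-buffer upload sketch against the SDL3 GPU API as I recall it (treat the exact names and struct fields as approximate; error handling omitted):

    #include <SDL3/SDL.h>

    /* Assumes `device` was created with SDL_CreateGPUDevice() and that
     * `verts`/`verts_size` hold the vertex data to upload. */
    SDL_GPUBuffer *upload_vertices(SDL_GPUDevice *device,
                                   const void *verts, Uint32 verts_size)
    {
        /* GPU-side buffer that will hold the vertex data. */
        SDL_GPUBufferCreateInfo buf_info = {
            .usage = SDL_GPU_BUFFERUSAGE_VERTEX, .size = verts_size
        };
        SDL_GPUBuffer *vbo = SDL_CreateGPUBuffer(device, &buf_info);

        /* CPU-visible staging memory: you allocate it, you fill it. */
        SDL_GPUTransferBufferCreateInfo tb_info = {
            .usage = SDL_GPU_TRANSFERBUFFERUSAGE_UPLOAD, .size = verts_size
        };
        SDL_GPUTransferBuffer *tb = SDL_CreateGPUTransferBuffer(device, &tb_info);

        void *mapped = SDL_MapGPUTransferBuffer(device, tb, false);
        SDL_memcpy(mapped, verts, verts_size);   /* the CPU-side copy is explicit */
        SDL_UnmapGPUTransferBuffer(device, tb);

        /* The CPU->GPU copy is a command you record and submit yourself. */
        SDL_GPUCommandBuffer *cmd = SDL_AcquireGPUCommandBuffer(device);
        SDL_GPUCopyPass *copy = SDL_BeginGPUCopyPass(cmd);
        SDL_UploadToGPUBuffer(copy,
            &(SDL_GPUTransferBufferLocation){ .transfer_buffer = tb, .offset = 0 },
            &(SDL_GPUBufferRegion){ .buffer = vbo, .offset = 0, .size = verts_size },
            false);
        SDL_EndGPUCopyPass(copy);
        SDL_SubmitGPUCommandBuffer(cmd);         /* runs asynchronously on the GPU */

        /* Release `tb` once the upload is known to be complete (omitted here). */
        return vbo;
    }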
We tried to standardize exactly this - eBPF programs offloaded onto the device. The NVMe standard has now standardized a lot of the infrastructure for this, including commands to discover device memory topology, transfer data to/from that memory, and discover and upload programs. But one of the blockers is that eBPF isn't itself standardized. The other blockers are finding vendors ready and willing to build these devices and customers ready to buy them in volume. The extra compute capability will introduce some extra cost.
> The NVMe standard has now standardized a lot of the infrastructure for this, including commands to discover device memory topology, transfer data to/from that memory, and discover and upload programs.
On the other hand, Windows and Linux still cannot just upgrade the firmware on the vast majority of NVMe devices, least of all consumer ones, despite the process being completely and utterly standardized.
I think I remember upgrading the NVMe disk firmware in a work Dell laptop (a Dell Latitude 7390 from 2019) using fwupd some years ago (not more than 3 years ago).
Also, I think I remember fixing (upgrading?) the firmware on a Crucial SSD 5 or 6 years ago using some live Linux system (downloaded off the Crucial website, I think?).
Not sure about Windows, but Linux is getting dramatically better at this.
The eBPF programs are strictly bounded. And they're scoped to their own memory that you have to pre-load from the actual storage with separate commands issued from the CPU (presumably from the kernel driver which is doing access control checks). It's no different than uploading a shader to a GPU. You can burn resources but that's about the extent of the damage you can cause.
I wouldn't want random applications (or web pages) to be able to load eBPF modules in the same way they can send shaders to a GPU through a graphics driver.
I don't get it either, and I'm a maintainer of SPDK which provides multiple implementations of virtualized devices and is frequently used inside DPUs to present storage devices.
If I'm implementing a hardware device anyway, why would I not just use NVMe as the interface? NVMe is superior to virtio-blk in every way that I can think of.
Even for a software device in userspace, why not use a technology like vfio-user to present an NVMe device, or just use vhost-user to present the virtio-blk device?
I've never really been able to get a clear value proposition for vDPA for storage laid out for me. Maybe I'm missing something critical - it's certainly possible.
Yes, I've seen some clearer cases made for networking.
In networking there is no standard for the hardware interface. Every vendor does their own thing. Except many can at least handle virtqueues carrying virtio-net messages for the data path, so some framework like vDPA may make sense (I'd prefer to see a full NIC interface standard emerge instead).
In storage, however, the industry has agreed on NVMe. This is a full standard for control and data plane. All storage products on the market, including DPUs and SmartNICs, just present NVMe devices. So there's no case to be made for vDPA at all. It just isn't necessary.
Yes, I see your point and agree that NVMe can be used for the same purpose.
But several HW vendors have implemented virtio-net devices in their SmartNICs and may find it convenient to support virtio-blk in order to reuse most of those building blocks.
As for vhost-user, it's perfect for VM use cases, but with containers or applications on the host it's not easy to use. Whereas a vDPA device (HW or SW) can easily be attached to the host kernel (using the virtio-vdpa bus) and be managed with the standard virtio-blk driver.
Is it really that much code? I don't know GPU hardware, but the NVMe spec header file in SPDK is around 4k lines[0]. If there's 7 of them and they're twice as complicated each, we're still well under 100k from register map headers. I didn't actually look through Linux to see how big they are, so maybe it is that much more complex.
NVMe is largely the model people here are complaining about. A small kernel shim driver that is talking to a huge firmware code base on the other side of a mailbox interface.
Even on small M.2-style standalone drives, you're looking at code which not only handles the details of flash error correction, wear leveling, garbage collection, etc., but also everything required to manage the thermals, voltage, PCIe link training, etc. of the 2-5 or so microcontrollers embedded in the drive, and possibly an RTOS or two hosting it all.
Never mind fabric-attached (DPU?) NVMe devices which do all that, plus deal with thin provisioning, partitioning, deduplication, device sharing, replication, RAID, etc. These frequently embed a Linux kernel (or an OS of similar complexity) in the control plane.
Pre-commit means before the change is committed to the canonical repo, not before committing locally.
The SPDK project has an elaborate pre-commit review and test system all in public. See https://spdk.io/development . I wouldn't want to work on a project that doesn't have infrastructure like this.
Even mailing lists with patches are really a pre-commit review system, as are GitHub pull requests. Pre-commit testing seems more elusive though.
At 200+ Gbps, the copy from LLC where the packet landed to the userspace buffer dominates the performance profiles on most systems. The TCP processing isn't bad and the expensive parts can often be offloaded. I'd contend that this data copy on the TCP recv path is one of the most important performance issues to solve for the entire industry right now. DDIO let us kick the can way down the road, but it seems like the network speeds have outpaced the CPU speeds so much that it can't save us for much longer.
RDMA is great and similar, but behaves very differently from TCP in the face of network congestion and longer distance traversals. This is essentially trying to get the best parts of TCP and the best parts of RDMA combined.
iWARP maybe, but I don't think you want to offload all of TCP to hardware. You want to leave congestion control and all that to software. I don't entirely know if that's why iWARP isn't very popular, but I suspect that's the reason. You want a software TCP stack that can land the data where you want it directly.
Associating the desire of NVMe vendors to allow users to ship down eBPF programs to run on the device and XRP is a major mistake in the article. XRP has nothing to do with what the NVMe vendors want to do, and XRP is a pure kernel solution that doesn't need any participation from NVMe vendors. I think it's unclear whether XRP even has real value - it certainly may, but I believe the benchmarking in the paper was deeply flawed[1].
I'm not closely tied to what the NVMe vendors want, so you could be right, but I very much doubt you are given that Christoph didn't flag this when he reviewed the article.
Edit: And to be clear, from my understanding of XRP, the device itself calls back into a BPF function in the NVMe driver. That requires some notion of standardization. It's not exactly offloading directly to the storage device, but the storage device still relies on some standardized behavior in the BPF program, such as how divide by zero behaves, which instructions the ISA supports, etc.
I am very closely tied to what the NVMe vendors want, having written the first internal draft of the proposal to the standards body (since that draft many smart people have taken the pen and done a lot of great work).
XRP is unrelated to offloading eBPF to NVMe devices.
Sure, whatever. "Offloading" was perhaps a poor choice of wording, but it is related to the standardization efforts. The NVMe vendors don't want to be calling out to BPF programs in the driver if the runtime semantics are not standardized.
XRP is a regular BPF hook in Linux and requires no additional standardization. The device never "calls out to BPF programs in the driver" - it generates a normal completion interrupt and Linux runs a BPF hook in the completion path. This is no different than other kernel BPF hooks elsewhere and doesn't provide any additional reason or need to standardize BPF.
The article misstated that XRP was a framework used for offloading BPF programs to NVMe devices. That's not correct, and XRP is not one of the emerging use cases for BPF that is driving standardization.
XRP initially looks to be focused on moving work upstream into the kernel, not onto devices.
But absolutely, once you start shipping eBPF into the kernel, people do quickly start asking, "how can we hardware accelerate that?". Having standards would be helpful.
> the device itself calls back into a BPF function in the NVMe driver
This statement contradicts itself. A driver (a “kernel module” in Linux lingo) runs in the kernel, and sure, that driver can call out to BPF or whatever else it desires, but then that isn’t the device doing so, that’s your computer (running your Linux kernel, in turn executing the aforementioned kernel module / driver) doing so.
Restated from another perspective, drivers don’t run on devices. Something may run on devices too, but that’s different, and we’d call that something like “firmware”.
Edit: the intended takeaway being that device manufacturers/designers should have little to gain from BPF being standardized (unless that BPF is being executed on the device, as this now implies an API contract between device and host) — a driver can always declare that the semantics is whatever the Linux kernel does and call it a day.