yup, though that means you're wasting that core's compute; something with green threads, where the language runtime does a cross-core interrupt to submit the syscall and then continues executing other green threads until it gets a user interrupt for syscall completion, would be pretty neat.
Imagine you have a piece of software that runs in an event loop (as many things do). On each loop, queue up all system calls you'd like to perform. At the end of the loop, do one syscall to execute the batch. At the start of the loop, check if anything has completed and continue the operation.
If you're processing a set of sockets and on any given loop N are ready, then with epoll you do N+1 syscalls. With io_uring you do 1. It's independent of N.
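A minimal sketch of that pattern with liburing (the handle_data() callback, buffer size, and per-socket resubmission are illustrative assumptions, not a complete program; error handling and SQ-full checks are omitted):

    #include <liburing.h>
    #include <stdint.h>

    #define BUF_SZ 4096

    void handle_data(int fd, char *buf, int nbytes); /* hypothetical consumer */

    /* One batched event loop: queue recvs, submit them all with a single
     * io_uring_enter() via io_uring_submit_and_wait(), then drain completions. */
    void event_loop(struct io_uring *ring, int *fds, char (*bufs)[BUF_SZ], int n)
    {
        for (int i = 0; i < n; i++) {            /* arm an initial recv per socket */
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
            io_uring_prep_recv(sqe, fds[i], bufs[i], BUF_SZ, 0);
            io_uring_sqe_set_data(sqe, (void *)(intptr_t)i);
        }

        for (;;) {
            io_uring_submit_and_wait(ring, 1);   /* the single syscall per iteration */

            struct io_uring_cqe *cqe;
            unsigned head, seen = 0;
            io_uring_for_each_cqe(ring, head, cqe) {
                int i = (int)(intptr_t)io_uring_cqe_get_data(cqe);
                handle_data(fds[i], bufs[i], cqe->res);

                /* re-queue the recv; it goes out with the next batch */
                struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
                io_uring_prep_recv(sqe, fds[i], bufs[i], BUF_SZ, 0);
                io_uring_sqe_set_data(sqe, (void *)(intptr_t)i);
                seen++;
            }
            io_uring_cq_advance(ring, seen);
        }
    }

However many sockets complete in an iteration, the loop still makes exactly one io_uring_enter() call.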
And the potential impact is huge! Not only are individual syscalls expensive on their own, and increasingly so (afaik) with Spectre and other security mitigations; you also need a thread to execute the call, which costs several KB of memory, adds context switches, and (often overlooked) creating and destroying threads comes with its own syscall overhead.
Now, we’ve had epoll etc., so it’s not novel in that respect. However, what’s truly novel is that it’s universal across syscalls, which makes it almost mechanical to port to a new platform. A lot of intricate questions of high-level API design simply go away and become simpler data-layout questions. (I’m sure there are little devils hiding in the details, but still.)
IOCP certainly was ahead of its time, but it only does the completion batching, not the submission batching. io_uring is significantly better than anything available on Windows right now.
It comes with its own set of challenges. In the integration I've seen, it basically meant that all the latency in the system went into the io_uring_enter() call, which then blocked for far longer than any individual IO operation we've ever seen. Your application might prefer to pause 50 times for 20us (+ syscall overhead) in an event loop iteration instead of a single time for 1ms (+ less syscall overhead), because the latter means some IO will just sit around for 1ms, totally unhandled.
The only way to avoid big latencies on io_uring_enter is to use the submission queue polling mechanism with a background kernel thread, which also has its own set of pros and cons.
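For reference, a minimal sketch of enabling that submission-queue-polling mode (SQPOLL) via liburing; the queue depth and idle timeout below are arbitrary examples, not recommendations:

    #include <liburing.h>

    /* Create a ring with a kernel SQ-polling thread, so SQEs are picked up
     * without an io_uring_enter() call while the thread is awake. */
    int setup_sqpoll_ring(struct io_uring *ring)
    {
        struct io_uring_params p = {0};
        p.flags = IORING_SETUP_SQPOLL;
        p.sq_thread_idle = 2000;  /* kernel thread sleeps after 2000ms idle */
        return io_uring_queue_init_params(256, ring, &p);
    }

With SQPOLL, io_uring_submit() only falls back to io_uring_enter() to wake the kernel thread after it has gone idle; the cost is a kernel thread spinning on your behalf, which is one of the trade-offs mentioned above.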
This sounds abnormal, are you using io_uring_enter in a way that asks it not to return without any cqes?
I don't have much of a feel for this because I am on the "never calling io_uring_enter" plan but I expect I would have found it alarming if it took 1ms while I was using it
For many syscalls, the primary overhead is the transition itself, not the work the kernel does. So doing 50 operations one by one may take, say, 10x as much time as a single call to io_uring_enter for the same work. It really shouldn't be just moving latency around unless you are doing very large data copies (or similar) out of the kernel such that syscall overhead becomes mostly irrelevant. If syscall overhead is irrelevant in your app and you aren't doing an actual asynchronous kernel operation, then you may as well use the regular syscall interface.
There are certainly applications that don't benefit from io_uring, but I suspect these are not the norm.
You need to measure it for your application. A lot of people think "syscalls are expensive" because that's been repeated for years, but often it's actually the work the call does that's expensive, not the syscall overhead.
E.g. a UDP syscall will do a whole lot of route lookups, iptables rule evaluations, potential eBPF program evaluations, copying of data, splitting of packets, etc. I measured this to be far more than 10x the syscall overhead. But your mileage may vary depending on which calls you use.
As for the applications: these lessons were collected in a CDN data plane. There are hardly any applications out there that are more async-IO-intensive.
I've spent essentially the last year trying to find the best way to use io_uring for networking inside the NVMe-oF target in SPDK. Many of my initial attempts were also slower than our heavily optimized epoll version. But now I feel like I'm getting somewhere and I'm starting to see the big gains. I plan to blog a bit about the optimal way to use it, but the key concepts seem to be:
1) create one io_uring per thread (much like you'd create one epoll group per thread)
2) use the provided-buffer mechanism to post a pool of large buffers to an io_uring. Bonus points for the newer ring-based version.
3) always keep a large (128k) async multishot recv posted to every socket in your set (see the sketch after this list)
4) as recvs complete, append the next "segment" of the stream to a per-socket list.
5) parse the protocol stream. As you make it through each segment, return it to the pool*
6) aggressively batch data to be sent. You can only have one outstanding at a time per socket, so make it a big vectored write. Writes are only outstanding until they're confirmed queued in your local kernel, so it is a fairly short time until you can submit more, but it's worth batching into a single larger operation.
* If you need part of the stream to live for an extended period of time, as we do for the payloads in NVMe-oF, build scatter-gather lists that point into the segments of the stream and then maintain a reference count on each segment. Return a segment to the pool when its count drops to zero.
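As a rough illustration of points 2 and 3 (not the actual SPDK code), here's a liburing sketch that registers a ring-mapped pool of provided buffers and arms a multishot recv on one socket; the group id, buffer count, and 128k size are placeholder values, and it assumes a recent kernel and liburing with buffer-ring and multishot-recv support:

    #include <liburing.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define BGID     0            /* provided-buffer group id (arbitrary) */
    #define NBUFS    256          /* number of buffers, power of two */
    #define BUF_SIZE (128 * 1024) /* 128k segments, as in point 3 */

    /* Register a ring-mapped buffer group and fill it with NBUFS buffers. */
    static struct io_uring_buf_ring *setup_buf_pool(struct io_uring *ring,
                                                    char **base_out)
    {
        int ret;
        struct io_uring_buf_ring *br =
            io_uring_setup_buf_ring(ring, NBUFS, BGID, 0, &ret);
        char *base = malloc((size_t)NBUFS * BUF_SIZE);

        for (int i = 0; i < NBUFS; i++)
            io_uring_buf_ring_add(br, base + (size_t)i * BUF_SIZE, BUF_SIZE,
                                  i, io_uring_buf_ring_mask(NBUFS), i);
        io_uring_buf_ring_advance(br, NBUFS);
        *base_out = base;
        return br;
    }

    /* Keep one multishot recv armed per socket; the kernel picks a buffer
     * from group BGID for every completion until the op terminates. */
    static void arm_multishot_recv(struct io_uring *ring, int sockfd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_recv_multishot(sqe, sockfd, NULL, 0, 0);
        sqe->flags |= IOSQE_BUFFER_SELECT;
        sqe->buf_group = BGID;
        io_uring_sqe_set_data(sqe, (void *)(intptr_t)sockfd);
    }

Each completion then carries the chosen buffer id in cqe->flags >> IORING_CQE_BUFFER_SHIFT, which is the segment you append to the per-socket list in step 4 and later hand back to the pool via io_uring_buf_ring_add()/io_uring_buf_ring_advance() once step 5 is done with it.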
Everyone knows the best way to use epoll at this point. Few of us have really figured out io_uring. But that doesn't mean it is slower.
> Few of us have really figured out io_uring. But that doesn't mean it is slower.
seastar.io is a high level framework that I believe has "figured out" io_uring, with additional caveats the framework imposes (which is honestly freeing).
It's also worth noting that io_uring has had at most 10-15 engineer-years' worth of performance tuning vs. the many (?) hundreds of years that epoll has received. I work with Jens, Pavel, and others and can confidently say that low-queue-depth perf parity with epoll is an important goal of the effort.
As an aside, it's great to see high praise from an spdk maintainer. One of the big reasons for doing io_uring in the first place was that it was impossible to compete in terms of performance with total bypass unless you changed the syscall approach.
I'd be very interested to read that blog post. Besides your tips for maximum performance, I'm curious about the minimum you have to do to get a significant improvement. I can easily imagine someone basically using it to poll for readiness like epoll and being disappointed. But if that's enough to benefit, I'd be surprised and intrigued. More likely you need to actually use it to enqueue the op, but folks have struggled with ownership. Is doing that in a not-quite-optimal way (extra copies on the user side) enough? Or do you need to optimize those away? Do you need to do the buffer pooling and/or multishot stuff?
Do fixed buffers help for network I/O? In August 2022 @axboe said "No benefits for fixed buffers with sockets right now, this will change at some point."
And io_uring itself was more directly inspired by NVMe and RDMA, which of course work with these same queues as GFX cards. The original io_uring patch compares itself to SPDK, whose premise is "what if we expose an abstraction for a hardware queue per thread to an application " - basically the same programming model as io_uring. And SPDK was just taking techniques from networking (DPDK) and applying them to storage.
Windows did already have async ("overlapped") IO, and a completion aggregator (IOCP) kind of like io_uring. What Windows didn't have, and the reason they're now adding their own IORing, is the ability to submit batches of operations in a single system call. Batching operations to reduce system calls on the submission side is one of the most important features of io_uring.
The Windows IORing is only storage today, but hopefully becomes a generic system for making batched, async system calls just like on Linux.
There's also now an open source recreation of the original client, written entirely in C#, that actually uses a GPU, so it renders 4k at 250fps instead of 800x600 at 12.5fps. It's a very mature and stable reproduction at this point.
Over 2k active players daily (edit: by active I mean there are over 2k players logged in and playing at any given time). The devs are incredibly active, and there is a lot more to do here than original UO.
I tried playing uo outlands, but what really struck me was the lack of bugs. Tbh my favourite part of playing uo back in the day was exploiting bugs so uo outlands isn't really for me
For me, the killer use case for this is presenting logical volumes to containers. There just has not been an efficient mechanism for a local storage service in one container to serve logical volumes to another container on the same system until this. For VMs there is virtio/vfio-user, but for containers the highest performing option until this was NVMe-oF/TCP loopback.
Basically, you can implement a virtual SAN for containers efficiently with this.
I'd like a built-in iSCSI volume driver for docker, podman, et al. There are third party things (netapp trident[1], etc.) but no generic driver. One would think -- given the ubiquity of SAN boxes populating racks outside of cloud operators -- you could "-v iscsi:<rfc-4173-iscsi-uri>:/mountpoint" a network block device into a container out of the box. I suppose it's difficult to deal with in cross platform way. When you read the golang source for trident you see they're just exec-ing iscsiadm on linux container hosts.
The SPDK project is certainly looking to use this to replace our limited use of NBD, as well as present SPDK block devices as kernel block devices, including devices backed by userspace implementations of iSCSI, NVMe-oF, and various other network protocols.
It's the architectural choices more than the language (one thread per core, async, event loops). But I'm sure the C++ does have some benefit over Java.