Headline is wrong. I/O wasn't the bottleneck, syscalls were the bottleneck.
Stupid question: why can't we get a syscall to load an entire directory into an array of file descriptors (minus an array of paths to ignore), instead of calling open() on every individual file in that directory? Seems like the simplest solution, no?
One aspect of the question is that permissions are mostly checked at open() time, so user code still has to handle a per-file failure result. This was a driving inspiration for the tiny 27-line C virtual machine in https://github.com/c-blake/batch, which lets you, e.g., synthesize a single call that mmaps a whole file (https://github.com/c-blake/batch/blob/64a35b4b35efa8c52afb64...) and which seems like it would also have helped the article author.
It's not the syscalls. There were only 300,000 syscalls made. Entering and exiting the kernel takes 150 cycles on my (rather beefy) Ryzen machine, or about 50ns per call.
Even assuming it took 1µs per mode switch, which would be insane, 300,000 calls × 1µs is only 0.3s of syscall overhead out of the 17s.
It's not obvious to me where the overhead is, but random seeks are still expensive, even on SSDs.
You could use io_uring, but IMO that API is annoying and I remember hitting limitations. One thing you can do with io_uring is use openat (the op, not the syscall) with the directory fd (which you get from the regular openat syscall), so you can asynchronously open and read files; however, you couldn't open directories that way for some reason. There's a chance I'm remembering wrong.
io_uring supports submitting openat requests, which sounds like what you want. Open the dirfd, extract all the names via readdir, and then submit openat SQEs all at once. Admittedly I haven't used the io_uring API myself so I can't speak to edge cases, but it's "on the happy path" as it were.
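For reference, a rough sketch of that flow with liburing (untested; error handling trimmed, and it assumes the directory's regular files fit in one submission queue):

    /* Sketch: enumerate a directory once, then batch-open every regular file
     * via io_uring openat SQEs. Link with -luring. */
    #include <dirent.h>
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        struct dirent **names;
        int n = scandir(".", &names, NULL, alphasort);    /* names persist until free() */
        int dfd = open(".", O_RDONLY | O_DIRECTORY);      /* dirfd for relative openat */

        struct io_uring ring;
        io_uring_queue_init(512, &ring, 0);

        int queued = 0;
        for (int i = 0; i < n; i++) {
            if (names[i]->d_type != DT_REG) continue;     /* regular files only */
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            if (!sqe) break;                              /* queue full; real code submits and retries */
            io_uring_prep_openat(sqe, dfd, names[i]->d_name, O_RDONLY, 0);
            sqe->user_data = i;
            queued++;
        }
        io_uring_submit(&ring);                           /* one io_uring_enter for all the opens */

        for (int i = 0; i < queued; i++) {
            struct io_uring_cqe *cqe;
            io_uring_wait_cqe(&ring, &cqe);
            printf("%s -> %d\n", names[(int)cqe->user_data]->d_name,
                   cqe->res);                             /* new fd, or -errno on failure */
            io_uring_cqe_seen(&ring, cqe);
        }
        io_uring_queue_exit(&ring);
        return 0;
    }

From there you can chain read SQEs on the returned fds, which is where the async part actually starts paying off.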
You have a limit of 1k simultaneous open files per process - not sure what overhead exists in the kernel that made them impose this, but I guess it exists for a reason. You might run into trouble if you open too many files at once (either open() just starts failing, or you run into some internal kernel bottleneck that makes the whole endeavor not so worthwhile).
That's mainly there for historical reasons (the select syscall can only handle fds < 1024); modern programs can just raise their soft limit to the hard limit and not worry about it anymore: https://0pointer.net/blog/file-descriptor-limits.html
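Raising it is only a few lines (sketch, with the caveat from that post that you must then keep such fds away from select()):

    /* Sketch: raise the fd soft limit (RLIMIT_NOFILE) to the hard limit. */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void) {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) == 0) {
            rl.rlim_cur = rl.rlim_max;            /* bump soft limit up to the hard limit */
            if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
                perror("setrlimit");
        }
        printf("fd soft limit: %llu\n", (unsigned long long)rl.rlim_cur);
        return 0;
    }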
>why can't we get a syscall to load an entire directory into an array of file descriptors (minus an array of paths to ignore), instead of calling open() on every individual file in that directory?
You mean like a range of file descriptors you could use if you want to save files in that directory?
What comes closest is scandir [1], which gives you an array of directory entries; their d_type field can often be used to avoid an lstat syscall per file.
Otherwise you can open a dir and pass its fd to openat together with a path relative to that dir, which avoids the kernel re-resolving the full path for each file.
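Roughly like this (sketch; /some/dir is a placeholder):

    /* Sketch: resolve the directory path once, then open each entry
     * relative to its fd with openat. */
    #include <dirent.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int dfd = open("/some/dir", O_RDONLY | O_DIRECTORY);  /* path walked once */
        DIR *d = fdopendir(dfd);                              /* DIR now owns dfd */
        struct dirent *de;

        while ((de = readdir(d)) != NULL) {
            if (de->d_type != DT_REG)
                continue;                                     /* skip ".", "..", subdirs */
            int fd = openat(dirfd(d), de->d_name, O_RDONLY);  /* relative lookup, no re-walk */
            if (fd < 0) { perror(de->d_name); continue; }
            /* ... fstat/read the file here ... */
            close(fd);
        }
        closedir(d);
        return 0;
    }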