Zip with no compression is a nice contender for a container format that shouldn't be slept on. It effectively reduces the I/O while, unlike TAR, allowing direct random access to the files without "extracting" them or seeking through the entire file. This works even via mmap, over HTTP range queries, etc.
You can still get the compression benefits by serving files with Content-Encoding: gzip or whatever. Though the format has built-in compression, you can simply not use it and apply external compression instead, especially over the wire.
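For instance, a minimal Python sketch (the file names are hypothetical):

    import zipfile

    # Pack a tree with no compression (ZIP_STORED), then read one member
    # back; zipfile seeks straight to it via the central directory.
    with zipfile.ZipFile("bundle.zip", "w", compression=zipfile.ZIP_STORED) as zf:
        zf.write("assets/logo.png")
        zf.write("assets/config.json")

    with zipfile.ZipFile("bundle.zip") as zf:
        data = zf.read("assets/config.json")  # no full-archive scan needed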
It's pretty widely used, though often dressed up as something else. JAR files or APK files or whatever.
I think the article's complaint about lacking Unix access rights and metadata is a bit strange. That seems like a feature more than a bug, as I wouldn't expect this to be something that transfers between machines. I don't want to unpack an archive and have to scrutinize it for files with o+rxst permissions, or have their creation date be anything other than when I unpacked them.
Strangely enough, there is a tool out there that gives Zip-like functionality while preserving Tar metadata functionality, that nobody uses. It even has extra archiving functions like binary deltas. dar (Disk ARchive) http://dar.linux.free.fr/
This is how Haiku packages are managed: from the outside it's a single zstd file; internally, all dependencies and files are included in a read-only file. It reduces I/O, reduces file clutter, gives instant install/uninstall, leaves zero chance for the user to corrupt files or dependencies, and makes it easy to switch between versions. The Haiku file system also supports virtual dir mapping, so a stubborn Linux port thinks it's talking to /usr/local/lib when in reality it's part of the zstd file in /system/packages.
Isn't this what is already common in the Python community?
> I don't want to unpack an archive and have to scrutinize it for files with o+rxst permissions, or have their creation date be anything other than when I unpacked them.
I'm the opposite: when I pack and unpack something, I want the files to be identical, including attributes. Why should I throw away all the timestamps just because the files were temporarily in an archive?
ZIP retains timestamps. This makes sense because timestamps are a global concept. Consider them an attribute dependent only on the file itself, similar to the file's name.
Owners and permissions are dependent also on the computer the files are stored on. User "john" might have a different user ID on another computer, or not exist there at all, or be a different John. So there isn't one obvious way to handle this, while there is with timestamps. Archiving tools will have to pick a particular way of handling it, so you need to pick the tool that implements the specific way you want.
It does, but unless the zip archive creator being used makes use of the extensions for high-resolution timestamps, the basic ZIP format retains only old MS-DOS style timestamps (rounded to the closest two seconds). So one may lose some precision in one's timestamps when passing files through a zip archive.
> Why should I throw away all the timestamps, just because the file were temporarily in an archive?
In case anyone is unaware, you don't have to throw away all the timestamps when using "zip with no compression". The metadata for each zipped file includes one timestamp (originally rounded to an even number of seconds in local time).
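A quick way to see the rounding with Python's zipfile (the file names are made up):

    import zipfile

    # The basic DOS timestamp field has 2-second resolution, so an odd
    # seconds value comes back rounded down.
    info = zipfile.ZipInfo("a.txt", date_time=(2024, 5, 17, 12, 30, 59))
    with zipfile.ZipFile("t.zip", "w") as zf:
        zf.writestr(info, b"hello")
    with zipfile.ZipFile("t.zip") as zf:
        print(zf.getinfo("a.txt").date_time)  # (2024, 5, 17, 12, 30, 58)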
I am a big last-modified-timestamp fan and am often discouraged that scp, git, and even many zip utilities don't preserve them (at least by default).
git updates timestamps partly out of necessity, for compatibility with build systems. If it applied the timestamp of when the file was last modified on checkout, most build systems would break when you checked out an older commit.
Gzip will make most line protocols efficient enough that you can do away with needing to write a cryptic one that will just end up being friction every time someone has to triage a production issue. Zstd will do even better.
The real one-two punch is make your parser faster and then spend the CPU cycles on better compression.
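A rough Python illustration of the idea (record contents are invented):

    import gzip, json

    # A verbose, human-readable line protocol, compressed with stock gzip;
    # often competitive with a hand-rolled cryptic binary encoding.
    records = [{"event": "page_view", "user": i, "ts": 1700000000 + i}
               for i in range(10_000)]
    raw = "\n".join(json.dumps(r) for r in records).encode()
    print(len(raw), len(gzip.compress(raw)))  # readable lines, far fewer bytes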
> It's pretty widely used, though often dressed up as something else. JAR files or APK files or whatever.
JAR files generally do/did use compression, though. I imagine you could forgo it, but I didn't see it being done. (But maybe that was specific to the J2ME world where it was more necessary?)
> Zip with no compression is a nice contender for a container format that shouldn't be slept on
SquashFS with zstd compression is used by various container runtimes, and is popular in HPC where filesystems often have high latency. It can be mounted natively or with FUSE, and the decompression overhead is not really felt.
It has an index at the end of the file, yeah, but once you've read that bit, you learn where the contents are located and if compression is disabled, you can e.g. memory map them.
With tar you need to scan the entire file start-to-finish before you know where the data is located, as it's literally a tape archiving format, designed for a storage medium with no random access reads.
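To make that concrete, here's a hedged Python sketch that memory-maps the payload of a stored (uncompressed) member; it parses the 30-byte local file header to find where the data starts:

    import mmap, struct, zipfile

    def stored_member_view(path, name):
        """Zero-copy view of a ZIP_STORED member's bytes."""
        with open(path, "rb") as f:
            info = zipfile.ZipFile(f).getinfo(name)
            assert info.compress_type == zipfile.ZIP_STORED
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        # Local file header: 30 fixed bytes; name/extra lengths sit at offset 26.
        n_name, n_extra = struct.unpack_from("<HH", mm, info.header_offset + 26)
        start = info.header_offset + 30 + n_name + n_extra
        return memoryview(mm)[start:start + info.file_size]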
Honestly, sometimes I just want to mark all files on a Linux system as executable and see what would even break and why. Seriously, why is there a whole bit for something that's essentially 'read permission, but you can also directly execute it from the shell'?
You might be thinking of ar, the classic Unix ARchive that is used for static libraries?
The format used by `ar` is quite simple, somewhat like tar: files glued together, a short header in between, and no index.
Early Unix eventually introduced a program called `ranlib` that generates and appends an index for libraries (also containing extracted symbols) to speed up linking. The index is simply embedded as a file with a special name.
The GNU version of `ar` as well as some later Unix descendants support doing that directly instead.
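For the curious, the header is simple enough to walk in a few lines of Python (a sketch, ignoring the extended-name conventions):

    def list_ar(path):
        """Yield (name, size) for members of a classic `ar` archive."""
        with open(path, "rb") as f:
            assert f.read(8) == b"!<arch>\n"          # global magic
            while header := f.read(60):               # fixed 60-byte member header
                name = header[0:16].rstrip().decode() # 16-byte name field
                size = int(header[48:58])             # 10-byte decimal size
                yield name, size
                f.seek(size + (size & 1), 1)          # members pad to even offsets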
Besides `ar`, as a sibling observed, you might also be thinking of pixz - https://github.com/vasi/pixz , but really any archive format (cpio, etc.) can, in principle, just put a stake in the ground to have its last file be any kind of binary / whatever index / file directory like Zip. Or it could hog a special name like .__META_INF__ instead.
Headline is wrong. I/O wasn't the bottleneck, syscalls were the bottleneck.
Stupid question: why can't we get a syscall to load an entire directory into an array of file descriptors (minus an array of paths to ignore), instead of calling open() on every individual file in that directory? Seems like the simplest solution, no?
One aspect of the question is that "permissions" are mostly regulated at the time of open and user-code should check for failures. This was a driving inspiration for the tiny 27 lines of C virtual machine in https://github.com/c-blake/batch that allows you to, e.g., synthesize a single call that mmaps a whole file https://github.com/c-blake/batch/blob/64a35b4b35efa8c52afb64... which seems like it would have also helped the article author.
You could use io_uring, but IMO that API is annoying and I remember hitting limitations. One thing you could do with io_uring is use openat (the op, not the syscall) with the dir fd (which you get from the syscall), so you can asynchronously open and read files; however, you couldn't open directories for some reason. There's a chance I may be remembering wrong.
io_uring supports submitting openat requests, which sounds like what you want. Open the dirfd, extract all the names via readdir and then submit openat SQEs all at once. Admittedly I have not used the io_uring API myself so I can't speak to edge cases in doing so, but it's "on the happy path" as it were.
You have a limit of 1k simultaneous open files per process - not sure what overhead exists in the kernel that made them impose this, but I guess it exists for a reason. You might run into trouble if you open too many files at once (either the kernel kills your process, or you run into some internal kernel bottleneck that makes the whole endeavor not so worthwhile).
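That limit is the soft RLIMIT_NOFILE, which a process can inspect and raise itself (up to the hard limit) from Python:

    import resource

    # The soft limit is commonly 1024 on Linux; raising it needs no privileges
    # as long as you stay at or below the hard limit.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(soft, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))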
>why can't we get a syscall to load an entire directory into an array of file descriptors (minus an array of paths to ignore), instead of calling open() on every individual file in that directory?
You mean like a range of file descriptors you could use if you want to save files in that directory?
What comes closest is scandir [1], which gives you an iterator of direntries, and can be used to avoid lstat syscalls for each file.
Otherwise you can open a dir and pass its fd to openat together with a relative path to a file, to reduce the kernel overhead of resolving absolute paths for each file.
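A sketch of both together in Python (stdlib only; the directory path is hypothetical):

    import os

    def read_tree(root):
        """Read regular files via dir_fd-relative opens; scandir's cached
        d_type avoids an extra stat syscall per entry."""
        dfd = os.open(root, os.O_RDONLY | os.O_DIRECTORY)
        try:
            with os.scandir(root) as it:
                for entry in it:
                    if entry.is_file(follow_symlinks=False):
                        fd = os.open(entry.name, os.O_RDONLY, dir_fd=dfd)
                        try:
                            st = os.fstat(fd)  # no path resolution at all
                            yield entry.name, os.read(fd, st.st_size)
                        finally:
                            os.close(fd)
        finally:
            os.close(dfd)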
Something that struck me earlier this week while profiling certain workloads: I'd really like a flame graph that includes wall time waiting on IO, be it a database call, filesystem, or other RPC.
For example, our integration test suite on a particular service has become quite slow, but it's not particularly clear where the time is going. I suspect a decent amount of time is being spent talking to postgres, but I'd like a low-touch way to profile this.
There are a few challenges here.
- Off-CPU time has no sampling interrupt with integrated collection of stack traces, so you either instrument a full timeline of threads moving on and off CPU, or periodically walk every thread for its stack trace
- Applications have many idle threads, and waiting for IO is a common threadpool case, so it's more challenging to distinguish a thread waiting on a pool doing delegated IO from idle worker-pool threads
Some solutions:
- I've used Nsight Systems for non-GPU stuff to visualize off-CPU time equally with on-CPU time
- gdb's `thread apply all bt` is slow but does full call-stack walking. In Python, we have py-spy dump for supported interpreters
- Remember that anything you can represent as call stacks and integers can be converted easily to a flamegraph, e.g. taking strace durations by tid (and maybe fd) and aggregating to a flamegraph.
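For instance, a small Python sketch that folds `strace -f -T` output (one syscall per line, duration in angle brackets) into flamegraph.pl's folded format:

    import re, sys
    from collections import defaultdict

    # lines look like: PID  syscall(args...) = ret <duration>
    line_re = re.compile(r"^(\d+)\s+(\w+)\(.*<([\d.]+)>$")

    totals = defaultdict(float)
    for line in sys.stdin:
        m = line_re.match(line.strip())
        if m:
            pid, syscall, dur = m.groups()
            totals[f"pid_{pid};{syscall}"] += float(dur)

    for stack, secs in totals.items():
        print(stack, int(secs * 1e6))  # flamegraph.pl wants integer counts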
See if you can wrap the underlying library call to pg.query or whatever it is with a generic wrapper that logs time in the query function. Should be easy in a dynamic lang.
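A generic sketch in Python (the client object and its query method are hypothetical stand-ins):

    import functools, time

    def timed(fn, sink):
        """Wrap any query function; append (seconds, args) to sink per call."""
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            t0 = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                sink.append((time.perf_counter() - t0, args))
        return wrapper

    samples = []
    # conn.query = timed(conn.query, samples)  # hypothetical pg client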
I still use a 10x-faster lexer, RE2C over flex, because it does so much more at compile time. And on top of that it has a bunch of optimization options for better compilers, like computed gotos.
Of course syscalls suck, slurping the whole file at once always wins, and in this case all files at once.
Kernels suck in general. You don't really need one for high perf and low space.
This would not be surprising at all! An impressive amount of work has gone into making the Linux VFS and filesystem code fast and scalable. I'm well aware that Linux didn't invent the RCU scheme, but it uses variations on RCU liberally to make filesystem operations minimally contentious, and aggressively caches. (I've also learned recently that the Linux VFS abstractions are quite different from BSD/UNIX, and they don't really map to each other. Linux has many structures, like dentries and generic inodes, that map to roughly one structure in BSD/UNIX, the vnode structure. I'm not positive that this has huge performance implications but it does seem like Linux is aggressive at caching dentries which may make a difference.)
That said, I'm certainly no expert on filesystems or OS kernels, so I wouldn't know if Linux would perform faster or slower... But it would be very interesting to see a comparison, possibly even with a hypervisor adding overhead.
I actually ran into this issue building dependency graphs of a golang monorepo. We analyzed the CPU trace and found that the program was doing a lot of GC, so we reduced allocations. This was just noise, though: the runtime was just making use of time spent waiting for I/O, as we had shelled out to `go list` to get a JSON dep graph from the CLI program. That turns out to be slow due to stat calls and reading from disk. We replaced our usage of `go list` with a custom package import graph parser using the std lib parser packages, and instead of reading from disk we give the parser byte blobs from git, also using `git ls-files` to "stat" the files. I don't remember the specifics, but I believe we brought the time to build the dep graph from 30-45s down to 500ms.
Compressing the kernel loads it into RAM faster, even though it still has to execute the decompression operation. Why?
Loading from disk to RAM is a bigger bottleneck than the CPU decompression.
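Back-of-envelope, with made-up but plausible numbers:

    # All throughput figures below are assumptions for illustration only.
    disk_mb_s   = 200      # sequential read from disk
    decomp_mb_s = 1000     # single-core decompression speed
    size_mb     = 12       # uncompressed kernel image
    ratio       = 0.5      # compressed size / original size

    plain      = size_mb / disk_mb_s                                  # 0.060 s
    compressed = size_mb * ratio / disk_mb_s + size_mb / decomp_mb_s  # 0.042 s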
The same applies to algorithms: always find the largest bottleneck in your dependent executions and apply changes there, as the rest of the pipeline waits for it.
Often picking the right algorithm "solves it", but the bottleneck may be something else, like waiting for IO or coordinating across actors (mutexes, if concurrency is done the way it used to be).
That's also part of the counterintuitive take that more concurrency brings more overhead and not necessarily faster execution (a topic widely discussed a few years ago around async concurrency and immutable structures).