Would love to have a database filesystem and be able to use standard Linux tools to query it, make changes, back it up, etc.
For example, having a folder of contacts with each file named after the person and having key/value pairs. Similar to how static site generators use YAML/TOML/JSON.
Open your nearest shell and experiment with this. In my scripts and programs I've stored data just in files many times: simple, transparent and effective. Works everywhere. On today's hardware you can brute-force your way through most searches; unless you have a huge amount of data or a lot of concurrent tasks, it doesn't matter much.
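As a minimal sketch of that approach (the contact layout and field names here are invented, not any standard):

```python
import pathlib

# One file per contact, named after the person; each line is "key: value",
# like the front matter static site generators use.
contacts = pathlib.Path("contacts")
contacts.mkdir(exist_ok=True)
(contacts / "ada-lovelace.txt").write_text("email: ada@example.com\ncity: London\n")

def load(path):
    """Parse a contact file into a dict of key/value pairs."""
    return dict(line.split(": ", 1) for line in path.read_text().splitlines())

# The "query" is a brute-force scan over every file, grep-style.
for path in contacts.glob("*.txt"):
    fields = load(path)
    if fields.get("city") == "London":
        print(path.stem, fields["email"])
```

From the shell, `grep -l 'city: London' contacts/*.txt` does the same query.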
This reminds me a little bit of the filesystem from BeOS (RIP) and Haiku. You can attach any metadata to any file you want, and it has a system to create queries on that metadata.
Very much miss this. I love how they used this technique to build their email client. Emails were just files on the filesystem with a filesystem plugin that made their metadata queryable. So the email client was just a file browser window with some pre-saved queries that would let you find all emails by date, sender, etc.
If you do this, you can add all the files to git and get historical views, changelogs... and if you distribute the "database", you can even know who made what changes!
Along the lines of what I was thinking. Would be really neat! I already do this to an extent, but with a virtual file system it could automatically sort/categorize/etc like a regular database.
This makes me wonder how efficient filesystems are with millions of files in a single directory. Do they create some sort of index? Are there limits to the number of files a directory can hold? Is that what inodes are for? I remember seeing “inode” limits on some VPS I was using a while back.
Indeed, putting too many files in a single directory is inefficient, both for read and write. When writing a file (or changing metadata like permissions), the entire directory inode may have to be rewritten. When searching for a file or opening it, the entire file list for the directory needs to be read (in the worst case where the file is at the end of the list).
When you need to store lots of files on disk, it's a common pattern to spread them out in subdirectories. For instance, instead of `files/2d8af74bcb29ad84`, you would have `files/2d/8a/f74bcb29ad84`.
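A sketch of that sharding scheme (the helper name and parameters are just for illustration):

```python
import pathlib

def sharded_path(root, name, levels=2, width=2):
    """Spread files across subdirectories by peeling off fixed-width prefixes."""
    parts = [name[i * width:(i + 1) * width] for i in range(levels)]
    return pathlib.Path(root, *parts, name[levels * width:])

print(sharded_path("files", "2d8af74bcb29ad84"))
# -> files/2d/8a/f74bcb29ad84; each level holds at most 256 subdirectories
```

With hex names, two 2-character levels cap every intermediate directory at 256 entries, which any filesystem handles comfortably.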
> When writing a file (or changing metadata like permissions), the entire directory inode may have to be rewritten.
That isn't how inode filesystems work -- if you change a file's permissions, it's just an inode update on the file -- not the containing directory.
Even in DOS-type filesystems (FAT/exFAT), it's just a record update in the corresponding dirent for that file.
If you add a new file to a directory, that causes an mtime update on the directory's inode.
The rest is accurate -- many older filesystems have lookup performance that scales poorly with directory size (for DOS filesystems and BSD UFS, you have to do a full directory scan). Also ls defaults to sorting output, which is O(N log N) and can be slow in large directories.
Large directory sizes suck on NFS and parallel file systems like Panasas, Lustre, GPFS, and the like. Python and Rust's Cargo also suck on networked file systems and would be greatly improved by pushing things into a sqlite file.
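A hedged sketch of that idea using Python's built-in sqlite3 module (the schema and names are invented for illustration):

```python
import sqlite3

# One SQLite file instead of thousands of tiny files on a network mount.
con = sqlite3.connect("cache.db")
con.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, data BLOB)")

def put(path, data):
    con.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (path, data))
    con.commit()

def get(path):
    row = con.execute("SELECT data FROM files WHERE path = ?", (path,)).fetchone()
    return row[0] if row else None

put("pkg/__init__.py", b"# package marker\n")
print(get("pkg/__init__.py"))
```

On a networked filesystem this trades a metadata round trip per file for a handful of sequential reads from one file.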
most things suck on networked file systems. I still have trauma from having to use NFS more than a decade ago, on systems that would freeze on boot because the network couldn't be reached or the NFS server was down.
I'd also never put SQLite on NFS. Locking is often broken on NFS, unless you can guarantee a homogeneous environment. I can imagine some Excel guy in marketing launching his SQLite UI on Windows and completely hosing it all.
NFS locking broken... Excel? SQLite UI on Windows? Are you sure you're not confusing SMB with NFS? I've had SMB file caching muck everything up. Not NFS.
That's sometimes true, but it depends a lot on the filesystem used, and sometimes on the options used when creating it. ReiserFS, ext4 and FAT32 won't have the same performance profile.
I think breaking large collections into subdirectories is mainly done to ensure things work even on filesystems that don't deal well with very large directories, and because it makes inspecting the files manually a little more convenient: many file explorers (especially the GUI ones) have trouble with large directories.
What about on the other end? Why not have each hex digit be a directory along a path? Then you have very, very few files per directory at the cost of deeper hierarchy. What's the practical downside?
Directories are just another type of file. So, if you do this, and your filenames are n characters long, you'll end up needing to do n file accesses just to find the file you're looking for. Unless the underlying file system does something to make that particular access pattern fast, well... it's going to be stupidly slow after a certain point.
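To put rough numbers on it, a quick comparison of path depth (and hence component lookups) under each scheme:

```python
name = "2d8af74bcb29ad84"

deep = "/".join(name)                                  # one hex digit per level
shallow = "/".join([name[:2], name[2:4], name[4:]])    # two 2-char levels

print(deep, deep.count("/") + 1)        # 2/d/8/a/... -> 16 components to resolve
print(shallow, shallow.count("/") + 1)  # 2d/8a/f74bcb29ad84 -> 3 components
```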
> Unless the underlying file system does something to make that particular access pattern fast
I don't know exactly how Linux does it. Windows hands off whole paths to the filesystem, so this idea is possible there.
In FreeBSD, there is a generic routine (lookup(9)) that goes component by component, so at each step the filesystem is only asked to resolve a single component to a vnode. I think a clever filesystem implementation (in FreeBSD) could look at the remaining path and kick off asynchronous prefetch... but I am not aware of anything doing this.
There are two modes you can implement for your filesystem: in the so-called 'high level' mode you get the whole path. In the 'low level' mode the filesystem asks you for one piece of the path at a time.
The low level mode seemed faster in my tests, and I think it's also closer to how the Linux kernel works internally?
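For context, here's what the high-level (path-based) mode looks like; a minimal read-only sketch assuming the third-party fusepy bindings and an existing mount point:

```python
import errno, stat, time
from fuse import FUSE, FuseOSError, Operations  # pip install fusepy

DATA = b"hello from userspace\n"

class HelloFS(Operations):
    """High-level mode: every callback receives a full path string."""
    def getattr(self, path, fh=None):
        now = time.time()
        if path == "/":
            return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2,
                        st_atime=now, st_mtime=now, st_ctime=now)
        if path == "/hello":
            return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1, st_size=len(DATA),
                        st_atime=now, st_mtime=now, st_ctime=now)
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", "..", "hello"]

    def read(self, path, size, offset, fh):
        return DATA[offset:offset + size]

if __name__ == "__main__":
    FUSE(HelloFS(), "/tmp/mnt", foreground=True, ro=True)
```

The low-level mode (e.g. the pyfuse3 bindings) instead resolves one (parent inode, name) pair per lookup call, which mirrors the kernel's own component-by-component resolution.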
> Are there limits to the number of files a directory can hold?
Yes, depending on the file system.
For example, ext4 with default settings uses 32-bit hashes in its directory index. Collisions are normally tolerated, but once enough filenames hash to the same value to fill an index block, you can no longer add such files to the directory (ENOSPC error).
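For intuition on how quickly collisions appear, the birthday bound for a 32-bit hash (a back-of-the-envelope estimate, not ext4's exact on-disk scheme):

```python
import math

bits = 32
# Birthday bound: number of random names for a ~50% chance that two share a hash.
n_half = math.sqrt(2 * 2**bits * math.log(2))
print(f"~{n_half:,.0f} entries for a 50% collision chance")  # ~77,000
```

So a directory with millions of entries is all but guaranteed to contain collisions; ext4 tolerates them until the degenerate case above, where a whole index block shares one hash value.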
Because it tries to make people work in a way they don’t want to work. People want to organize their files in folders. They don’t want to tag them all with labels or tags and they don’t want to input all that metadata.
Perhaps they want to tag a few, and it’s useful to have some autodetected metadata but they don’t want to tag them all and they don’t want a gigantic ‘Untagged files’ list. They want folders and they want more than a flat folder list, they want nested folders.
You can implement it, it’s not hard and most modern file systems have all the features you need. But users will hate it and won’t use it the way you want.
You are right that users don't want to do that busy work.
But I am less sure users actually want folders.
Some power-users, sure. But most normal people don't want to deal with folders, either.
For evidence: look at the guy who saves everything on his overflowing desktop.
Any system that allows people to find their stuff, and perhaps make a few annotations, will be good for them.
Google Photos is almost a good example: I don't have to annotate anything, yet I can search for e.g. pictures of snow, or by location.
(I say only 'almost', because while impressive, that system isn't good enough yet to find obscure stuff or to work on contextual cues like 'those pictures I took at home after we came back from shopping sometime in the last few months'.)
That’s all very nice but a filesystem needs to be able to deal with every type of file on the planet. Which means you can’t automatically detect the contents.
And really, a lot of users don’t want an interface that stops them from doing what they want just because someone else dumps all their files on the desktop.
Apple tried this with iCloud and had to go back. Because, while it makes for nice presentation and usability, there are a lot of users it can’t cater for.