May I also mention that N log N is the total cost of compaction for a DB of size N. You don't perform a compaction on every single write. Amortised per write the cost is more like N log(N)/N == log(N).
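To make the amortized figure concrete, here's a toy model of leveled compaction: assume level L holds memtable × fanout^L items (an assumed policy, not LevelDB's exact one) and each item is rewritten once per level it passes through. Total work comes out O(N log N), so per write it's O(log N):

```python
def amortized_rewrites_per_item(n_items, memtable=1_000, fanout=10):
    """Toy leveled-LSM model: an item is rewritten roughly once per
    level, and the number of levels grows logarithmically with N."""
    levels, capacity = 1, memtable * fanout
    while capacity < n_items:
        levels += 1
        capacity *= fanout
    total_work = n_items * levels      # O(N log N) items moved in total
    return total_work / n_items        # O(log N) amortized per write

for n in (10_000, 1_000_000, 100_000_000):
    print(f"{n:>12} items -> ~{amortized_rewrites_per_item(n):.0f} rewrites/item")
```

Note the per-item cost grows by one rewrite each time the DB grows by the fanout factor, which is the log N behavior in amortized form.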
Also, N log(N) is nowhere near exponential. O(2^N) would be exponential, and that's not what you have here.
"Amortised per write" - now you're getting down into the constant factors, which Big-O disregards. But you can't ignore them in real implementations. First, the actual writes carry a 2x constant factor, since you're writing to a WAL in addition to the DB itself.
The original LSM paper claims that writes to Level 0 are free because that's all in-memory. But that's not really true; if you have a stream of incoming writes then everything that goes into Level 0 must eventually be pushed out to Level 1. Buffering doesn't make writes free, it only displaces their occurrence in time.
So you have a rolling merge every M writes. As far as Big-O goes, that's N log(N) / M == N log(N) because Big-O disregards constant factors!
In the context of an implementation like LevelDB, theory and reality diverge even further. Since it chunks data into 2MB files and deletes them during each merge operation, and also writes a bunch of bookkeeping into MANIFEST files and elsewhere, the number of actual I/Os is much higher. There's a lot of wasted activity in allocating and deallocating space - filesystem metadata overhead that's also not transactionally safe.
In LevelDB a single merge reads 26MB and writes 26MB at a time to push 2MB of data from level L to level L+1. So now instead of a single merge op costing only N, it actually costs 13*N. Again, if you're only talking about Big-O complexity you can sweep this under the rug. But in reality it's a huge cost.
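The 13x figure falls out of the file-size ratios: a 2MB file from level L typically overlaps on the order of a dozen files in level L+1, so the merge has to read and rewrite all of them. A back-of-the-envelope sketch, where the overlap count of 12 is an assumption about a bad-case key distribution:

```python
file_size_mb = 2            # LevelDB's target SSTable size
overlapping_files = 12      # assumed overlap in level L+1 (bad case)

read_mb = file_size_mb * (1 + overlapping_files)    # the file + its overlaps
write_mb = read_mb                                  # everything gets rewritten
amplification = write_mb / file_size_mb             # cost per MB actually moved

print(f"read {read_mb}MB, write {write_mb}MB to move {file_size_mb}MB "
      f"-> {amplification:.0f}x write amplification")
```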
Stated another way: assume you want to sustain a user workload writing 20MB/sec, and you don't do any throttling. Level 0 consists of four 1MB files, so it will fill in 1/5th of a second, and then compaction will reduce it by 1MB. After that it will compact continuously, every 1/20th of a second. Sustaining this workload for the first second will thus require 17 compactions to Level 1. Assuming an already-populated Level 1 and worst-case key distribution, that means in 1 second it will trigger compactions that read 238MB and write 238MB to store the incoming 20MB.
Level 1 is only 10MB, so if it was empty it would fill in the first 1/2 second. For the remaining 1/2 second it would trigger 5 more compactions to Level 2, reading 130MB and writing 130MB. If it started out full then this would be 260MB/260MB respectively.
So for a 20MB/sec input workload you would need a disk subsystem capable of sustaining 498MB/sec of reads concurrent with 498MB/sec of writes. And that's for a small DB, with only Levels 0-2 present (under 110MB), and excluding the actual cost of filesystem operations (create/delete/etc.).
That's only the first second of load. For every second after that, you're dumping from Level 0 to Level 1 at 280MB read and 280MB written per second, and from Level 1 to Level 2 at 260/260 as before. 540/540 - so a disk capable of 1080MB/sec of I/O is needed to sustain a 20MB/sec workload. And this is supposed to be HDD-optimized? Write-optimized? O(N log N) - what a laugh.
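Summing up the steady-state figures above (taking the 280 and 260 MB/sec per-level rates as given, and counting a merge's reads and writes separately):

```python
ingest_mb_s = 20                       # user write workload

# Per-level compaction traffic in steady state (reads == writes for a merge):
l0_to_l1 = 280                         # MB/s moved pushing Level 0 -> Level 1
l1_to_l2 = 260                         # MB/s moved pushing Level 1 -> Level 2

reads = writes = l0_to_l1 + l1_to_l2   # 540 MB/s in each direction
total_io = reads + writes              # combined disk bandwidth needed
amplification = total_io / ingest_mb_s

print(f"{total_io}MB/s of disk I/O to sustain {ingest_mb_s}MB/s of input "
      f"({amplification:.0f}x I/O amplification)")
```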
Maybe LSMs in general can be more efficient than this. LevelDB is pretty horrible though.
It only triggers compaction if SST tables have overlapping keys. And if you only write new items, the goleveldb implementation just creates 3.7MB SST tables by default without trying to merge them into bigger chunks (what's the point? they're all sorted and non-overlapping). With a queue-consumption workload it starts merging tombstones with SST tables, and since the tombstones are also in sorted order it doesn't pick up multiple SST tables at a time - it just either completely or partially removes stale SST files. I added some more benchmarks, including packing a queue with 200M messages of 64 bytes each, and consuming those 200M messages. The speed is sustainable. https://github.com/bogdanovich/siberite/blob/master/docs/ben...
In the case of an architecture diagram being handed down from on high, sure. But in the case of product feature whack-a-mole, this doesn't seem like something that bottom-up decision-making could fix.
Recursive CTEs are a cool feature, but it's a shame the syntax is so awful. You really have to work to understand what the query is trying to do.
Other query languages that make recursion much more bearable are SPARQL (on RDF data), Cypher (Neo4j) or Datalog (Datomic, Cascalog, etc). Would love to see more of those query languages being used.
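For concreteness, here's the kind of thing the syntax complaint is about: a transitive-closure query ("all managers above carol") as a recursive CTE, runnable against SQLite. The `reports` table and its data are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE reports(employee TEXT, manager TEXT);
    INSERT INTO reports VALUES
        ('carol', 'bob'), ('bob', 'alice'), ('dave', 'alice');
""")

# Compare with a Cypher one-liner over the same data:
#   MATCH (e {name:'carol'})-[:REPORTS_TO*]->(m) RETURN m
rows = con.execute("""
    WITH RECURSIVE chain(manager) AS (
        SELECT manager FROM reports WHERE employee = 'carol'
        UNION ALL
        SELECT r.manager FROM reports r
        JOIN chain c ON r.employee = c.manager
    )
    SELECT manager FROM chain
""").fetchall()

print([m for (m,) in rows])
```

The CTE works, but you have to mentally separate the base case from the recursive case and track the join back to the CTE's own name - exactly the overhead the graph languages hide.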
Ternary (or even higher n-ary) storage is already widely used: multi-level cell (MLC) SSDs store more than one bit per cell by using multiple voltages (most commonly 4, i.e. 2 bits per cell). I believe Ethernet also uses 3 voltage levels.
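The capacity gain from extra levels is logarithmic, which is why each flash generation only adds one bit per cell:

```python
import math

# Bits stored per cell as a function of distinguishable voltage levels:
for levels, name in [(2, "SLC"), (4, "MLC"), (8, "TLC"), (16, "QLC")]:
    print(f"{name}: {levels} levels -> {math.log2(levels):.0f} bits/cell")
```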
Does anyone know what they mean when they say "grounded in trust"? (Seems odd to imply that other OSes are not trustworthy.) Referring to code signing, perhaps?
Do you also have stats on the high percentile latencies, besides the average? Sounds like your avoidance of compaction pauses ought to lead to lower latency at the high percentiles.