May I also mention that N log N is the total cost of compaction for a DB of size N. You don't perform a compaction on every single write. Amortised per write the cost is more like N log(N)/N == log(N).
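To make the amortized figure concrete, here's a toy model of leveled compaction: assume level L holds memtable × fanout^L items (an assumed policy, not LevelDB's exact one) and each item is rewritten once per level it passes through. Total work comes out O(N log N), so per write it's O(log N):

```python
def amortized_rewrites_per_item(n_items, memtable=1_000, fanout=10):
    """Toy leveled-LSM model: an item is rewritten roughly once per
    level, and the number of levels grows logarithmically with N."""
    levels, capacity = 1, memtable * fanout
    while capacity < n_items:
        levels += 1
        capacity *= fanout
    total_work = n_items * levels      # O(N log N) items moved in total
    return total_work / n_items        # O(log N) amortized per write

for n in (10_000, 1_000_000, 100_000_000):
    print(f"{n:>12} items -> ~{amortized_rewrites_per_item(n):.0f} rewrites/item")
```

Note the per-item cost grows by one rewrite each time the DB grows by the fanout factor, which is the log N behavior in amortized form.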
Also, N log(N) is nowhere near exponential. O(2^N) would be exponential, and that's not what you have here.
"Amortised per write" - now you're getting down into the constant factors, which Big-O disregards. But you can't ignore them in real implementations. First, the actual writes carry a 2x constant factor, since you're writing to a WAL in addition to the DB itself.
The original LSM paper claims that writes to Level 0 are free because that's all in-memory. But that's not really true; if you have a stream of incoming writes then everything that goes into Level 0 must eventually be pushed out to Level 1. Buffering doesn't make writes free, it only displaces their occurrence in time.
So you have a rolling merge every M writes. As far as Big-O goes, that's N log(N) / M == N log(N) because Big-O disregards constant factors!
In the context of an implementation like LevelDB, theory and reality diverge even further. Since it chunks data into 2MB files and deletes them during each merge operation, and also writes a bunch of bookkeeping into MANIFEST files and elsewhere, the number of actual I/Os is much higher. There's a lot of wasted activity in allocating and deallocating space - filesystem metadata overhead that's also not transactionally safe.
In LevelDB a single merge reads 26MB and writes 26MB at a time to push 2MB of data from level L to level L+1. So now instead of a single merge op costing only N, it actually costs 13*N. Again, if you're only talking about Big-O complexity you can sweep this under the rug. But in reality it's a huge cost.
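The 13x figure falls out of the file-size ratios: a 2MB file from level L typically overlaps on the order of a dozen files in level L+1, so the merge has to read and rewrite all of them. A back-of-the-envelope sketch, where the overlap count of 12 is an assumption about a bad-case key distribution:

```python
file_size_mb = 2            # LevelDB's target SSTable size
overlapping_files = 12      # assumed overlap in level L+1 (bad case)

read_mb = file_size_mb * (1 + overlapping_files)    # the file + its overlaps
write_mb = read_mb                                  # everything gets rewritten
amplification = write_mb / file_size_mb             # cost per MB actually moved

print(f"read {read_mb}MB, write {write_mb}MB to move {file_size_mb}MB "
      f"-> {amplification:.0f}x write amplification")
```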
Stated another way: assume you want to sustain a user workload writing 20MB/sec, and you don't do any throttling. Level 0 consists of four 1MB files, so it will fill in 1/5th of a second, and then compaction will reduce it by 1MB. After that it will compact continuously, every 1/20th of a second. Sustaining this workload for the first second will thus require 17 compactions to Level 1. Assuming an already-populated Level 1 and worst-case key distribution, that means in 1 second it will trigger compactions that read 238MB and write 238MB to store the incoming 20MB.
Level 1 is only 10MB, so if it was empty it would fill in the first 1/2 second. For the remaining 1/2 second it would trigger 5 more compactions to Level 2, reading 130MB and writing 130MB. If it started out full then this would be 260MB/260MB respectively.
So for a 20MB/sec input workload you would need a disk subsystem capable of sustaining 498MB/sec of reads concurrent with 498MB/sec of writes. And that's for a small DB, with only Levels 0-2 present (under 110MB), and excluding the actual cost of filesystem operations (create/delete/etc.).
That's only the first second of load. For every second after that, you're dumping from Level 0 to Level 1 at 280MB read and 280MB written per second, and from Level 1 to Level 2 at 260/260 as before. 540/540 - so a disk capable of 1080MB/sec of I/O is needed to sustain a 20MB/sec workload. And this is supposed to be HDD-optimized? Write-optimized? O(N log N) - what a laugh.
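Summing up the steady-state figures above (taking the 280 and 260 MB/sec per-level rates as given, and counting a merge's reads and writes separately):

```python
ingest_mb_s = 20                       # user write workload

# Per-level compaction traffic in steady state (reads == writes for a merge):
l0_to_l1 = 280                         # MB/s moved pushing Level 0 -> Level 1
l1_to_l2 = 260                         # MB/s moved pushing Level 1 -> Level 2

reads = writes = l0_to_l1 + l1_to_l2   # 540 MB/s in each direction
total_io = reads + writes              # combined disk bandwidth needed
amplification = total_io / ingest_mb_s

print(f"{total_io}MB/s of disk I/O to sustain {ingest_mb_s}MB/s of input "
      f"({amplification:.0f}x I/O amplification)")
```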
Maybe LSMs in general can be more efficient than this. LevelDB is pretty horrible though.
It only triggers compaction if SST tables have overlapping keys. And if you only write new items, the goleveldb implementation just creates 3.7MB SST tables by default without trying to merge them into bigger chunks (what's the point? they're all sorted and non-overlapping). With a queue-consumption workload it starts merging tombstones with SST tables, and since the tombstones are also in sorted order it doesn't pick up multiple SST tables at a time - it just either completely or partially removes stale SST files. I added some more benchmarks, including packing a queue with 200M messages of 64 bytes each, and consuming those 200M messages. The speed is sustainable. https://github.com/bogdanovich/siberite/blob/master/docs/ben...
In the case of an architecture diagram being handed down from on high, sure. But in the case of product feature whack-a-mole, this doesn't seem like something that bottom-up decision-making could fix.
Recursive CTEs are a cool feature, but it's a shame the syntax is so awful. You really have to work to understand what the query is trying to do.
Other query languages that make recursion much more bearable are SPARQL (on RDF data), Cypher (Neo4j) or Datalog (Datomic, Cascalog, etc). Would love to see more of those query languages being used.
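For concreteness, here's the kind of thing the syntax complaint is about: a transitive-closure query ("all managers above carol") as a recursive CTE, runnable against SQLite. The `reports` table and its data are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE reports(employee TEXT, manager TEXT);
    INSERT INTO reports VALUES
        ('carol', 'bob'), ('bob', 'alice'), ('dave', 'alice');
""")

# Compare with a Cypher one-liner over the same data:
#   MATCH (e {name:'carol'})-[:REPORTS_TO*]->(m) RETURN m
rows = con.execute("""
    WITH RECURSIVE chain(manager) AS (
        SELECT manager FROM reports WHERE employee = 'carol'
        UNION ALL
        SELECT r.manager FROM reports r
        JOIN chain c ON r.employee = c.manager
    )
    SELECT manager FROM chain
""").fetchall()

print([m for (m,) in rows])
```

The CTE works, but you have to mentally separate the base case from the recursive case and track the join back to the CTE's own name - exactly the overhead the graph languages hide.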
Ternary (or even higher n-ary) storage is already widely used: multi-level cell (MLC) SSDs store more than one bit per cell by using multiple voltages (most commonly 4, i.e. 2 bits per cell). I believe Ethernet also uses 3 voltage levels.
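The capacity gain from extra levels is logarithmic, which is why each flash generation only adds one bit per cell:

```python
import math

# Bits stored per cell as a function of distinguishable voltage levels:
for levels, name in [(2, "SLC"), (4, "MLC"), (8, "TLC"), (16, "QLC")]:
    print(f"{name}: {levels} levels -> {math.log2(levels):.0f} bits/cell")
```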
Does anyone know what they mean when they say "grounded in trust"? (Seems odd to imply that other OSes are not trustworthy.) Referring to code signing, perhaps?
Do you also have stats on the high percentile latencies, besides the average? Sounds like your avoidance of compaction pauses ought to lead to lower latency at the high percentiles.