Hacker News | sghemawat's comments

Hi Kenton! No worries at all. I tend to be quieter than Jeff anyway (less public speaking etc.) and I am happy to not have a dedicated website. :-). -Sanjay


Hey Sanjay, long time no see. Thanks for the note!

But I'm fully aware you wouldn't want a "Sanjay Facts", and that's not the point. ;)


You are both legends. Your original MapReduce paper is what inspired me to work for Google (2006-2009), narrowly dodging a career as a quant on Wall Street.


blink

Legend. Popping up here after his last comment was 13 years ago.


About [1]: Sorry about that; the document you linked to is amazingly stale. tcmalloc has been releasing memory to the system for many years. See, for example, the IncrementalScavenge routine in a version of page_heap.cc from Dec 2008:

https://code.google.com/p/gperftools/source/browse/trunk/src...

One caveat: physical memory and swap space are released, but the process's virtual size will not decrease, since tcmalloc uses madvise(MADV_DONTNEED) to release memory.

About [2]: code using tcmalloc-specific features/symbols is definitely a problem. I would strongly advise against doing that; stick to the libc interfaces instead, for the reason you pointed out.


Strange, the page showed up first when I googled "tcmalloc" and the problem was also present in the version that I was using (at least I think it was). My apologies.

Yeah, regarding [2], that was definitely not my idea.


Not your fault. We just plain forgot to update the documentation, so the freshest available document is a few years out of date.


Wow, I didn't realize madvise could actually discard memory contents and return pages, but it makes sense that it can, because that is a useful feature. Very cool!


We have no such plans. The library is designed so that concurrency control is outside its scope. If I needed it for an application, I might consider using some application level code that looks like:

    static std::mutex mu;  // application-level mutex guarding CAS

    bool CAS(leveldb::DB* db, const std::string& key,
             const std::string& oldvalue, const std::string& newvalue) {
      std::lock_guard<std::mutex> l(mu);
      std::string value;
      leveldb::Status s = db->Get(leveldb::ReadOptions(), key, &value);
      bool result = s.ok() && (value == oldvalue);
      if (result) s = db->Put(leveldb::WriteOptions(), key, newvalue);
      return result && s.ok();
    }
This should not be much slower than any hard-wired CAS we could provide from inside leveldb.


The use of c++0x should be limited to the small AtomicPointer class in port_posix.h. The class uses <cstdatomic> for its implementation. If you can provide a different implementation of "release store" and "acquire load" for your architecture, you should be all set without the need for c++0x.

I suspect that you may find suitable code in the chromium source code. See leveldb's port_chromium.h.


I suspect you need a newer version of gcc/g++ (one that supports the c++0x standard). You can probably download a later gcc package and then use that to build.

  -Jeff

On Wed, May 11, 2011 at 3:37 AM, conglin.deng <conglin.deng@aliyun-inc.com> wrote:

> Hi jeff,
>
> I have checked out the leveldb code from code.google.com, but when I run the make command in the root folder, I encounter an error as below:
>
>     [root@host]$ make
>     g++ -c -DLEVELDB_PLATFORM_POSIX -I. -I./include -std=c++0x -g2 db/db_bench.cc -o db/db_bench.o
>     cc1plus: error: unrecognized command line option "-std=c++0x"
>     make: *** [db/db_bench.o] Error 1
>     [root@host]$ pwd
>     /home/admin/leveldb
>
> Would you please tell me how to build and set up leveldb on a Red Hat 5.4 env? Thanks very much!
>
> Thanks,
> linc


I managed to get an initial patch up here:

http://code.google.com/p/leveldb/issues/detail?id=2


leveldb is a persistent ordered map; bitcask is a persistent hash table (no ordered iteration).

bitcask stores a fixed-size record in memory for every key. So for databases with a large number of keys, it may use too much memory for some applications.

bitcask can guarantee at most one disk seek per lookup, I think. leveldb may have to do a small handful of disk seeks. To clarify: leveldb stores data in a sequence of levels. Each level stores approximately ten times as much data as the level before it. A read needs one disk seek per level. So if 10% of the db fits in memory, leveldb will need to do one seek (for the last level, since all of the earlier levels should end up cached in the OS buffer cache). If 1% fits in memory, leveldb will need two seeks.


I just need to say - it's pretty cool to see Sanjay Ghemawat hanging out on HN.


Also worth noting: bitcask is written in erlang and leveldb in c++.


Technically bitcask is a combination of Erlang and C.


One of the leveldb authors here. TokyoCabinet is something we seriously considered using instead of writing leveldb. TokyoCabinet has great performance usually. I haven't done a careful head-to-head comparison, but it wouldn't surprise me if it was somewhat faster than leveldb for many workloads. Plus TokyoCabinet is more mature, has matching server code etc. and may therefore be a better fit for many projects.

However because of a fundamental difference in data structures (TokyoCabinet uses btrees for ordered storage; leveldb uses log structured merge trees), random write performance (which is important for our needs) is significantly better in leveldb. This part we did measure. IIRC, we could fill TokyoCabinet with a million 100-byte writes in less than two seconds if writing sequentially, but the time ballooned to ~2000 seconds if we wrote randomly. The corresponding slowdown for leveldb is from ~1.5 seconds (sequential) to ~2.5 seconds (random).



Fractal trees should have faster read performance than leveldb, with comparable update performance. There is no open-source implementation of such a structure, though we can hope that www.acunu.com might provide an open-source implementation of their similar Stratified B-tree: http://www.acunu.com/2011/04/log-file-systems-and-ssds-made-...


Being in-kernel, acunu is very far from the lightness of LevelDB.


Is there a more useful technical paper on Fractal Trees? Better yet, is there an open-source implementation of the same?


Cache-oblivious streaming B-trees:

http://supertech.csail.mit.edu/papers/sbtree.pdf

I don't know of any open-source implementation for certain, but I have heard COLAs are used in HBase.


> difference in data structures

Always good to have more tools in one's arsenal in that case. I'll drop one of you a message if I write a Lua binding for it (using LuaJIT+embedded k/v for data services atm).


Have you looked at implementing CAS? It is the one thing missing for me to adopt it as an underlying store for my AvroBase API: http://www.javarants.com/2010/06/30/havrobase-a-searchable-e...


Could you please report on the caching setup for the random read test? Was there any cache (apart from the OS's) for either position/index lookup or payload? Or did you hit the disk without any caching of past operations?

thanks


It is durable from an OS perspective. I started typing in a long explanation, but I think the comment for the "sync" field in the options.h file captures things well:

  struct WriteOptions {
    // If true, the write will be flushed from the operating system
    // buffer cache (by calling WritableFile::Sync()) before the write
    // is considered complete.  If this flag is true, writes will be
    // slower.
    //
    // If this flag is false, and the machine crashes, some recent
    // writes may be lost.  Note that if it is just the process that
    // crashes (i.e., the machine does not reboot), no writes will be
    // lost even if sync==false.
    //
    // In other words, a DB write with sync==false has similar
    // crash semantics as the "write()" system call.  A DB write
    // with sync==true has similar crash semantics to a "write()"
    // system call followed by "fsync()".
    //
    // Default: false
    bool sync;
  };

We don't have much experience with scaling to larger databases yet. There are known problems which will cause a (smallish) constant-factor slowdown in write performance after the database becomes a few (10?) GB in size, but I don't recall the details all that well, and the implementation has changed somewhat since that experiment. I would like to characterize this better and fix things so we can support databases of somewhere between 100 GB and 1 TB well. It just hasn't become a priority yet.

The benchmark numbers on the linked page were from a small, million-entry database that easily fits in the OS buffer cache.


Welcome to HN commenting, Sanjay!

(For those that aren't familiar, his bio is here: http://research.google.com/people/sanjay/index.html)


Writes being lost means what, a trashed file? Or merely an incomplete one?


Merely an incomplete one. Leveldb never writes in place: it always appends to a log file, or merges existing files together to produce new ones. So an OS crash will cause a partially written log record (or a few partially written log records). Leveldb recovery code uses checksums to detect this and will skip the incomplete records.


Thanks!


