Comparisons of read/write ratios have to account for several differences in design and implementation, which makes representative benchmarks difficult.
Several things can make a difference. Databases have subtly different definitions of "durability," so they aren't always performing semantically equivalent operations. Write throughput sometimes scales with the number of clients, and a single client may be unable to saturate the server due to limitations of the client protocol, so single-client benchmarks are misleading. Some databases allow read and write operations to be pipelined; in those implementations, write performance can sometimes exceed read performance.
For open source databases in particular, read and write throughput is significantly throttled by poor storage engine performance, so the ratio of read to write performance is almost arbitrary. That 3:1 ratio isn't a good heuristic, because the absolute values in these cases could be much higher. A better design would offer integer-factor throughput improvements for both reads and writes, but it is difficult to estimate what the ratio "should" be on a given server absent a database engine that can really drive the hardware.
It depends, yes but ... (not discounting any of the above).
One sees a lot of 3:1 in practice due to the replication factor. If you have 3 copies of the data and the client can read from any node, you get roughly 3x the read throughput, while every write still has to reach a quorum of two out of three nodes.
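To make that arithmetic concrete, here's a toy model (illustrative numbers, not measurements from any real system):

```python
# Rough model: N replicas, reads served by any one replica, writes
# acknowledged by a quorum before success. Per-node rates are made up.

def read_capacity(n_replicas, per_node_reads_per_sec):
    # any replica can serve a read, so read capacity scales with N
    return n_replicas * per_node_reads_per_sec

def write_capacity(n_replicas, per_node_writes_per_sec):
    # every write lands on a quorum (and usually all replicas),
    # so adding replicas does not add write capacity
    return per_node_writes_per_sec

reads = read_capacity(3, 10_000)    # 30,000 reads/sec
writes = write_capacity(3, 10_000)  # 10,000 writes/sec
print(reads / writes)               # -> 3.0
```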
To the GP: for a rough swag of what is possible out of given hardware, a combination of FIO and ACT (which measures I/O latency under a fixed load) is a good start.
> Is a 10:1 ratio typical for a storage backed distributed kv store?
In a single-node system, the best way to increase your write throughput is to batch requests over small chunks of time. Ultimately, the amount of writes you can perform per unit time is either bounded by the underlying I/O sequential throughput, or the business constraints regarding maximum allowable request latency. In the most trivial case, you are writing a buffer containing the entire day's work to disk in 1 shot while everyone sleeps. Imagine how fast that could be.
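A minimal sketch of that batching idea (the class and its API are made up, just to show the shape of it):

```python
# Toy single-node batcher: accumulate writes for up to `window` seconds,
# then hand the whole buffer to one flush call. This amortizes the
# per-flush cost (e.g. an fsync) across many requests.

import time

class Batcher:
    def __init__(self, flush, window=0.005):
        self.flush = flush        # callable taking a list of records
        self.window = window      # max seconds a record waits in the buffer
        self.buf = []
        self.deadline = None

    def write(self, record):
        now = time.monotonic()
        if not self.buf:
            self.deadline = now + self.window
        self.buf.append(record)
        if now >= self.deadline:
            self.flush_now()

    def flush_now(self):
        if self.buf:
            self.flush(self.buf)
            self.buf = []

flushed = []
b = Batcher(flushed.append, window=1.0)
b.write("a")
b.write("b")
b.flush_now()
print(flushed)  # -> [['a', 'b']] : two writes, one flush
```

The extreme end of the same trade-off is the "write the whole day's work in one shot" case from the comment above: an enormous window, a single flush.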
A distributed system has all of the same properties, but then you have to put this over a denominator that additionally factors in the number of nodes and the latency between all participants. A single node is always going to give you the most throughput when talking about 1 serial narrative of events wherein any degree of contention is expected.
Typically people use raft for leader election which in turn can coordinate writes. I don't think the writes are being fsync'd in the raft logs here. At least I wouldn't expect that behavior.
For the Raft algorithm to be correct, fsync is required on a majority of nodes; otherwise you are technically not implementing Raft.
The reason is that in Raft, if a node acknowledges to the leader that it wrote something to the log, it must not later accept a different write in the same log position.
This means that if a server reboots with dirty buffered writes that were never flushed, it is supposed to forget everything it knew and rejoin the cluster under a brand-new node ID.
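Roughly, the follower-side invariant looks like this (file name and record format are made up for illustration):

```python
import os

# Sketch of the invariant described above: a follower must not ACK an
# AppendEntries until the entry is durably on disk, otherwise a crash
# can "un-accept" an entry the leader already counted toward a majority.

def append_and_ack(log_path, term, index, payload):
    record = f"{term} {index} {payload}\n".encode()
    # O_APPEND alone only reaches the page cache; a reboot with dirty
    # buffers could lose the entry after we have already ACKed it.
    fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, record)
        os.fsync(fd)  # durability point: only now is an ACK safe
    finally:
        os.close(fd)
    return ("ack", term, index)

print(append_and_ack("/tmp/raft-log-demo", 2, 7, "set x=1"))  # -> ('ack', 2, 7)
```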
Then you're sacrificing consistency guarantees. If less than a majority have committed a write, it could be lost while the cluster still has a quorum up.
Waiting to report success until a majority have committed allows you to make guarantees with a straight face. "It will probably be committed in the near future" is not the same thing.
Not necessarily. As long as your quorums always overlap you can have weaker commit requirements. See the FPaxos paper and some of Heidi Howard’s blogs for more on this.
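For simple counting quorums over N nodes, the condition from the FPaxos paper reduces to the two quorum sizes summing to more than N:

```python
# Flexible Paxos relaxes "majority for everything" to: every leader
# election quorum must intersect every replication quorum. For simple
# counting quorums over n nodes that is just q1 + q2 > n, so with n = 5
# you can commit writes on 2 nodes if elections require 4.

def quorums_safe(n, q1, q2):
    # worst case: the two quorums are chosen as disjointly as possible;
    # they are still guaranteed to overlap iff q1 + q2 > n
    return q1 + q2 > n

print(quorums_safe(5, 3, 3))  # classic majorities -> True
print(quorums_safe(5, 4, 2))  # cheap writes, expensive elections -> True
print(quorums_safe(5, 2, 2))  # quorums can miss each other -> False
```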
It reads like you and GP are talking about two different things. You’re talking about Raft specifics and the GP seems to be talking about a Vertical Paxos like setup where Raft is used for configuration and the data path uses another replication algorithm such as Primary-Backup or Chain Replication.
Hmmm, that sorta makes sense, I guess. Sorta, because Raft is a replication algorithm. If you don't use Raft in the data path, you don't get any of its guarantees.
Most data paths don't need consensus for replication. You can implement ACID transactions on top of many other replication algorithms, for example. Ultimately the choice comes down to read/write ratios: Chain Replication has much better read throughput than Paxos and Raft.
It’s the case of 2F+1 versus F+1. Paxos/Raft offer fault tolerance, whereas other replication algorithms don’t: if you have a node failure in Primary-Backup or Chain Replication, for example, you need to reconfigure before committing more writes. However, in practice and in certain environments, reconfiguration can be faster than or comparable to recovering from a leader failure in Paxos, which requires running Phase 1 again.
Often reads of already-committed data only need to hit one node, but writes need to wait for a majority of nodes to receive and acknowledge them.
I haven't checked the code though so I might be off.
A database without any test harness? While this could be a good toy or PoC, I would never use it in production. Readers should be aware: just because it's on HN doesn't mean it's production-ready.
It uses Raft underneath as well, which in my experience means a bunch of non-determinism and hell for anyone who operates it. The thing is cursed.
Source: several years dealing with vault and consul.
We had some massive problems including complete cluster collapse requiring rebuilds from scratch, eternal leadership elections and occasionally nodes would just entirely stop responding to KV requests causing cascading failures outside. Vault is a massive damage multiplier for these issues plus some other nasty ones like buggy barely supported plugins.
Wouldn't mind knowing what versions you were using/how long ago this was. We haven't seen anything like this in > 3 years. Possible a barely supported plugin caused issues?
Huh. I've been running Consul since 0.7 or so, and the only time any of this happened was my own fault.
Most of the problems I've had with Vault have been around its Terraform provider, which they've improved enough that it's not an issue anymore.
I think the only thing about Raft that folks don't realize is how disk-hungry it gets: if you want fast write performance, you gotta make dang sure all those fsyncs can keep up. Our largest Consul cluster today runs on storage-heavy boxes, as it does ~500Mb/s of writes pretty much 24/7.
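If you want a crude probe of whether a disk can keep up with that, something like this gives a ballpark (a sketch, not a substitute for a real benchmark like fio; numbers vary wildly by device):

```python
import os
import tempfile
import time

# Measure how many small fsync'd appends this machine sustains.
# Every Raft log append pays roughly this cost before it can be ACKed.

def fsyncs_per_second(n=200, size=256):
    payload = b"x" * size
    fd, path = tempfile.mkstemp()
    try:
        start = time.monotonic()
        for _ in range(n):
            os.write(fd, payload)
            os.fsync(fd)
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return n / elapsed

print(f"{fsyncs_per_second():.0f} fsyncs/sec")
```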
I use Consul w/ Vault today instead of the internal storage for Vault just cause Consul has really nice monitoring around some stuff that Vault doesn't (path-based stuff for the most part), I think the internal storage is a really good option for 90% of use-cases.
Not sure why you’re being downvoted, as this is definitely the path forward. The industry and research have both explored the polar opposites: weak consistency protocols on one end, and linearizability and consensus on the other. Understanding your domain and knowing how you can step down to something weaker than consensus will be critical for future applications.
Hmm, how do you have high availability consistent data without a consensus protocol? No matter where in the problem chain you move, you have to eventually solve that problem.
It's less than a few hundred lines of Go that just wraps two other databases (syndtr/goleveldb and ledisdb/ledisdb) with a third library (tidwall/uhaha) that provides a Raft API.
Only if you want to be a drop-in replacement and take advantage of existing compatible libraries. With the recent tea around the official Elasticsearch Python library, it becomes a more interesting question.
One of the child comments made the observation that "this speaks Redis."
Makes me wonder if there is any spec for the Redis commands. I.e., in the same way that SQL defines an interface, but leaves the details up to individual implementations, is there a "Redis" interface that leaves the details up to the implementation?
I've implemented a subset of Redis in the past, and went by their official docs: first the wire protocol[1], then the docs for individual commands such as SET[2]. They also have a test suite, and I extracted the bits that applied to my partial implementation from there.
The only real pitfall was what part of the CONFIG stuff I needed to implement to make popular redis client libs talk to me and/or use the newer protocol features.
The rest was pretty straight forward, just read the docs for a command, implement the stuff, run the test suite, fix any bugs, repeat.
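For anyone curious, the wire format (RESP2) is simple enough to show in a few lines: a client command is an array of bulk strings, and SET replies with the simple string +OK.

```python
# RESP2 encoding: "*<count>\r\n" then "$<len>\r\n<bytes>\r\n" per argument.

def encode_command(*args):
    out = [f"*{len(args)}\r\n".encode()]
    for a in args:
        data = a.encode() if isinstance(a, str) else a
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)

def decode_simple_string(reply):
    assert reply.startswith(b"+") and reply.endswith(b"\r\n")
    return reply[1:-2].decode()

wire = encode_command("SET", "foo", "bar")
print(wire)                              # -> b'*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n'
print(decode_simple_string(b"+OK\r\n"))  # -> OK
```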
As far as I know there is no RFC let alone an ISO standard.
You’d probably want to define two specs, a basic and full. There are several Redis-compatible data stores, but (if memory serves) you’ll find they almost always lack some advanced Redis features, e.g. transactions.
Quasi-related: what are some good hosted alternatives to AWS dynamodb / GCloud Firestore that are a) fast b) affordable at scale c) have a good local dev experience?
A hosted, disk-based, Redis-protocol-compliant store capable of sub-TB datasets would be a dream for me.
Cloudflare Workers KV is really promising, but needs a better local dev story (no stable project to simulate services locally, e.g. cloudworkers). Pricing is reasonable depending on what you interpret "scale" to be.
I am not sure what weight you assign to each of your requirements but DynamoDB has official Docker containers that you can use for local development. I don't find it different than developing against postgres etc. If you have tried it what problems have you encountered that make you wish for a better experience?
That's a good point and I should have been clearer.
I might be off (and probably am), but if I remember correctly Redis persistence is more for disaster recovery: you can create snapshots and recover from them, or replay a log file. That's very different in terms of performance guarantees from persisting the data itself to disk and reading from it.
I was under the impression that's what tools (like this one) and stuff like Ardb try to solve.
It's just that Redis is mostly an in-memory database and if the process is terminated and restarted (for all sorts of reasons) the data can be restored from disk.
So what IceFireDB might be good for is data which would not fit easily into the memory of one node.
I often see projects like this posted on HN, and it's very unclear to me what the actual use case is. Does anyone even end up actually using these things? I guess the developers hope it takes off, and they gain notoriety as 'the guy who made X'?
You're right. Redis will persist either an RDB snapshot or the AOF log, but your whole dataset must fit in memory (the AOF file is replayed to rebuild the dataset in memory on boot).
I'd guess for "I want redis, but more durable clustering" (although I don't quite remember how much redis nowadays offers there itself). Would want a lot more info before trusting it for that though.
Is a 10:1 ratio typical for a storage backed distributed kv store?
Edit: Looks like CockroachDb has roughly a 3:1 ratio, similar for YugabyteDB:
https://www.cockroachlabs.com/docs/stable/performance.html
https://forum.yugabyte.com/t/large-cluster-perf-1-25-nodes/5...
Also ~3:1 for etcd:
https://etcd.io/docs/v3.4/op-guide/performance/