Comparisons of read/write ratios have to account for several differences in design and implementation, which makes representative benchmarks difficult.
Several things can make a difference. Databases have subtly different definitions of "durability," so they aren't always performing semantically equivalent operations. Write throughput sometimes scales with the number of clients, and a single client may be unable to saturate the server due to limitations of the client protocol, so single-client benchmarks are misleading. Some databases allow read and write operations to be pipelined; in those implementations, write performance can sometimes exceed read performance.
For open source databases in particular, read and write throughput is significantly throttled by poor storage engine performance, so the ratio of read to write performance is almost arbitrary. That 3:1 ratio isn't a good heuristic, because the absolute values in these cases could be much higher. A better design would offer integer-factor throughput improvements for both reads and writes, but it is difficult to estimate what the ratio "should" be on a given server absent a database engine that can really drive the hardware.
It depends, yes but ... (not discounting any of the above).
One sees a lot of 3:1 in practice due to the replication factor. If you have 3 copies of the data and the client can read from any node, you get roughly 3x the read throughput, while every write still has to reach a quorum of two out of three nodes.
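To make that arithmetic concrete, here's a toy model (illustrative numbers, not measurements from any real system):

```python
# Rough model: N replicas, reads served by any one replica, writes
# acknowledged by a quorum before success. Per-node rates are made up.

def read_capacity(n_replicas, per_node_reads_per_sec):
    # any replica can serve a read, so read capacity scales with N
    return n_replicas * per_node_reads_per_sec

def write_capacity(n_replicas, per_node_writes_per_sec):
    # every write lands on a quorum (and usually all replicas),
    # so adding replicas does not add write capacity
    return per_node_writes_per_sec

reads = read_capacity(3, 10_000)    # 30,000 reads/sec
writes = write_capacity(3, 10_000)  # 10,000 writes/sec
print(reads / writes)               # -> 3.0
```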
To the GP: for a rough swag of what is possible out of given hardware, a combination of FIO and ACT (which measures I/O latency under a fixed load) is a good start.
> Is a 10:1 ratio typical for a storage backed distributed kv store?
In a single-node system, the best way to increase your write throughput is to batch requests over small chunks of time. Ultimately, the amount of writes you can perform per unit time is either bounded by the underlying I/O sequential throughput, or the business constraints regarding maximum allowable request latency. In the most trivial case, you are writing a buffer containing the entire day's work to disk in 1 shot while everyone sleeps. Imagine how fast that could be.
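A minimal sketch of that batching idea (the class and its API are made up, just to show the shape of it):

```python
# Toy single-node batcher: accumulate writes for up to `window` seconds,
# then hand the whole buffer to one flush call. This amortizes the
# per-flush cost (e.g. an fsync) across many requests.

import time

class Batcher:
    def __init__(self, flush, window=0.005):
        self.flush = flush        # callable taking a list of records
        self.window = window      # max seconds a record waits in the buffer
        self.buf = []
        self.deadline = None

    def write(self, record):
        now = time.monotonic()
        if not self.buf:
            self.deadline = now + self.window
        self.buf.append(record)
        if now >= self.deadline:
            self.flush_now()

    def flush_now(self):
        if self.buf:
            self.flush(self.buf)
            self.buf = []

flushed = []
b = Batcher(flushed.append, window=1.0)
b.write("a")
b.write("b")
b.flush_now()
print(flushed)  # -> [['a', 'b']] : two writes, one flush
```

The extreme end of the same trade-off is the "write the whole day's work in one shot" case from the comment above: an enormous window, a single flush.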
A distributed system has all of the same properties, but then you have to put this over a denominator that additionally factors in the number of nodes and the latency between all participants. A single node is always going to give you the most throughput when talking about 1 serial narrative of events wherein any degree of contention is expected.
Typically people use raft for leader election which in turn can coordinate writes. I don't think the writes are being fsync'd in the raft logs here. At least I wouldn't expect that behavior.
For the Raft algorithm to be correct, fsync is required on a majority of nodes; otherwise you are technically not implementing Raft.
The reason is that in Raft, if a node acknowledges to the leader that it wrote something to the log, it must not later accept a different write in the same log position.
This means that if a server reboots with dirty buffered writes that were never flushed, it is supposed to forget everything it knew and rejoin the cluster under a brand-new node ID.
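Roughly, the follower-side invariant looks like this (file name and record format are made up for illustration):

```python
import os

# Sketch of the invariant described above: a follower must not ACK an
# AppendEntries until the entry is durably on disk, otherwise a crash
# can "un-accept" an entry the leader already counted toward a majority.

def append_and_ack(log_path, term, index, payload):
    record = f"{term} {index} {payload}\n".encode()
    # O_APPEND alone only reaches the page cache; a reboot with dirty
    # buffers could lose the entry after we have already ACKed it.
    fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, record)
        os.fsync(fd)  # durability point: only now is an ACK safe
    finally:
        os.close(fd)
    return ("ack", term, index)

print(append_and_ack("/tmp/raft-log-demo", 2, 7, "set x=1"))  # -> ('ack', 2, 7)
```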
Then you're sacrificing consistency guarantees. If less than a majority have committed a write, it could be lost while the cluster still has a quorum up.
Waiting to report success until a majority have committed allows you to make guarantees with a straight face. "It will probably be committed in the near future" is not the same thing.
Not necessarily. As long as your quorums always overlap you can have weaker commit requirements. See the FPaxos paper and some of Heidi Howard’s blogs for more on this.
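For simple counting quorums over N nodes, the condition from the FPaxos paper reduces to the two quorum sizes summing to more than N:

```python
# Flexible Paxos relaxes "majority for everything" to: every leader
# election quorum must intersect every replication quorum. For simple
# counting quorums over n nodes that is just q1 + q2 > n, so with n = 5
# you can commit writes on 2 nodes if elections require 4.

def quorums_safe(n, q1, q2):
    # worst case: the two quorums are chosen as disjointly as possible;
    # they are still guaranteed to overlap iff q1 + q2 > n
    return q1 + q2 > n

print(quorums_safe(5, 3, 3))  # classic majorities -> True
print(quorums_safe(5, 4, 2))  # cheap writes, expensive elections -> True
print(quorums_safe(5, 2, 2))  # quorums can miss each other -> False
```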
It reads like you and GP are talking about two different things. You’re talking about Raft specifics and the GP seems to be talking about a Vertical Paxos like setup where Raft is used for configuration and the data path uses another replication algorithm such as Primary-Backup or Chain Replication.
Hmmm, that sorta makes sense, I guess. Sorta, because Raft is a replication algorithm. If you don't use Raft in the data path, you don't get any of its guarantees.
Most data paths don't need consensus for replication. You can implement ACID transactions on top of many other replication algorithms, for example. Ultimately the choice comes down to read/write ratios: Chain Replication has much better read throughput than Paxos and Raft.
It’s the case of 2F+1 versus F+1. Paxos/Raft offer fault tolerance, whereas other replication algorithms don’t: if you have a node failure in Primary-Backup or Chain Replication, for example, you need to reconfigure before committing more writes. However, in practice and in certain environments, reconfiguration can be faster than or comparable to recovering from a leader failure in Paxos, which requires running Phase 1 again.
Often reads of already-committed data only need to hit one node, but writes need to wait for a majority of nodes to receive and acknowledge them.
I haven't checked the code though so I might be off.
A database without any test harness? While this could be a good toy or PoC, I would never use it in production. Readers should be aware: just because it's on HN doesn't mean it's production-ready.
It uses Raft underneath as well, which in my experience means a bunch of non-determinism and hell for anyone who operates it. The thing is cursed.
Source: several years dealing with vault and consul.
We had some massive problems including complete cluster collapse requiring rebuilds from scratch, eternal leadership elections and occasionally nodes would just entirely stop responding to KV requests causing cascading failures outside. Vault is a massive damage multiplier for these issues plus some other nasty ones like buggy barely supported plugins.
Wouldn't mind knowing what versions you were using/how long ago this was. We haven't seen anything like this in > 3 years. Possible a barely supported plugin caused issues?
Huh. I've been running Consul since 0.7 or so, and the only time any of this happened was my own fault.
Most of the problems I've had with Vault have been around its Terraform provider, which they've improved enough that it's not an issue anymore.
I think the only thing about Raft that folks don't realize is how disk-hungry it gets: if you want fast write performance, you gotta make dang sure all those fsyncs can keep up. Our largest Consul cluster today runs on storage-heavy boxes, as it does ~500Mb/s of writes pretty much 24/7.
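If you want a crude probe of whether a disk can keep up with that, something like this gives a ballpark (a sketch, not a substitute for a real benchmark like fio; numbers vary wildly by device):

```python
import os
import tempfile
import time

# Measure how many small fsync'd appends this machine sustains.
# Every Raft log append pays roughly this cost before it can be ACKed.

def fsyncs_per_second(n=200, size=256):
    payload = b"x" * size
    fd, path = tempfile.mkstemp()
    try:
        start = time.monotonic()
        for _ in range(n):
            os.write(fd, payload)
            os.fsync(fd)
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return n / elapsed

print(f"{fsyncs_per_second():.0f} fsyncs/sec")
```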
I use Consul w/ Vault today instead of the internal storage for Vault just cause Consul has really nice monitoring around some stuff that Vault doesn't (path-based stuff for the most part), I think the internal storage is a really good option for 90% of use-cases.
Not sure why you’re being downvoted, as this is definitely the path forward. The industry and research have both explored the polar opposites: weak consistency protocols on one end, and linearizability and consensus on the other. Understanding your domain and knowing how you can step down to something weaker than consensus will be critical for future applications.
Hmm, how do you have high availability consistent data without a consensus protocol? No matter where in the problem chain you move, you have to eventually solve that problem.
It's less than a few hundred lines of Go that just wraps two other databases (syndtr/goleveldb and ledisdb/ledisdb) with a third library (tidwall/uhaha) that provides a Raft API.
Only if you want to be a drop-in replacement and take advantage of existing compatible libraries. With the recent tea around the official Elasticsearch Python library, it becomes a more interesting question.
One of the child comments made the observation that "this speaks Redis."
Makes me wonder if there is any spec for the Redis commands. I.e., in the same way that SQL defines an interface, but leaves the details up to individual implementations, is there a "Redis" interface that leaves the details up to the implementation?
I've implemented a subset of Redis in the past, and went by their official docs: first the wire protocol[1], then the docs for individual commands such as SET[2]. They also have a test suite, and I extracted the bits that applied to my partial implementation from there.
The only real pitfall was what part of the CONFIG stuff I needed to implement to make popular redis client libs talk to me and/or use the newer protocol features.
The rest was pretty straight forward, just read the docs for a command, implement the stuff, run the test suite, fix any bugs, repeat.
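For anyone curious, the wire format (RESP2) is simple enough to show in a few lines: a client command is an array of bulk strings, and SET replies with the simple string +OK.

```python
# RESP2 encoding: "*<count>\r\n" then "$<len>\r\n<bytes>\r\n" per argument.

def encode_command(*args):
    out = [f"*{len(args)}\r\n".encode()]
    for a in args:
        data = a.encode() if isinstance(a, str) else a
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)

def decode_simple_string(reply):
    assert reply.startswith(b"+") and reply.endswith(b"\r\n")
    return reply[1:-2].decode()

wire = encode_command("SET", "foo", "bar")
print(wire)                              # -> b'*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n'
print(decode_simple_string(b"+OK\r\n"))  # -> OK
```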
As far as I know there is no RFC let alone an ISO standard.
You’d probably want to define two specs, a basic and full. There are several Redis-compatible data stores, but (if memory serves) you’ll find they almost always lack some advanced Redis features, e.g. transactions.
Quasi-related: what are some good hosted alternatives to AWS dynamodb / GCloud Firestore that are a) fast b) affordable at scale c) have a good local dev experience?
A hosted, disk-based, Redis-protocol-compliant store capable of sub-TB datasets would be a dream for me.
Cloudflare Workers KV is really promising, but needs a better local dev story (no stable project to simulate services locally, e.g. cloudworkers). Pricing is reasonable depending on what you interpret "scale" to be.
I am not sure what weight you assign to each of your requirements but DynamoDB has official Docker containers that you can use for local development. I don't find it different than developing against postgres etc. If you have tried it what problems have you encountered that make you wish for a better experience?
That's a good point and I should have been clearer.
I might be off (and probably am), but if I remember correctly Redis persistence is more for disaster recovery: you can create snapshots and recover from them, or replay a log file. That's very different in terms of performance guarantees from persisting the data itself to disk and reading from it.
I was under the impression that's what tools (like this one) and stuff like Ardb try to solve.
It's just that Redis is mostly an in-memory database and if the process is terminated and restarted (for all sorts of reasons) the data can be restored from disk.
So what IceFireDB might be good for is data which would not fit easily into the memory of one node.
I often see projects like this posted on HN, and it's very unclear to me what the actual use case is. Does anyone even end up actually using these things? I guess the developers hope it takes off, and they gain notoriety as 'the guy who made X'?
You're right. Redis will persist either an RDB snapshot or the AOF log, but your whole dataset must fit in memory (the AOF file is replayed to rebuild the dataset in memory on boot).
I'd guess for "I want redis, but more durable clustering" (although I don't quite remember how much redis nowadays offers there itself). Would want a lot more info before trusting it for that though.
Is a 10:1 ratio typical for a storage backed distributed kv store?
Edit: Looks like CockroachDb has roughly a 3:1 ratio, similar for YugabyteDB:
https://www.cockroachlabs.com/docs/stable/performance.html
https://forum.yugabyte.com/t/large-cluster-perf-1-25-nodes/5...
Also ~3:1 for etcd:
https://etcd.io/docs/v3.4/op-guide/performance/