A review of Ark - A new replication algorithm for TokuMX
TokuMX is a fork of MongoDB designed for high performance. After MongoDB got torn to shreds in a Jepsen test, they set out to fix some of the issues discovered. Their blog post describes a new consensus algorithm they designed, based on Raft, to make MongoDB a CP system.
Our main goal is to modify the election protocol to make TokuMX a true CP system. That is, in the face of network partitions, TokuMX will remain consistent. To do so means ensuring that any write that is successfully acknowledged with majority write concern is never lost in the face of a network partition. This is not currently the case for TokuMX and MongoDB.
I’ll give a brief overview of the system, then outline a few small issues I had.
Mongo/Toku (henceforth Moku) allows the user to to run leader/follower replication. There is a single leader which accepts all writes, writes them to an oplog and then asynchronously applies them to the database. The oplog stores a totally ordered sequence of operations applied to the master. The followers pull the latest changes from this oplog asynchronously, then apply those changes asynchronously to their copy of the database.
If a client specifies a write concern of quorum when writing, Moku will write as normal, but will wait for a quorum of followers to replicate this operation before returning a successful write to the client. This means that the cluster can lose up to half the nodes - 1 without losing a write as it is guaranteed to be on at least one of the remaining machines. When the remaining machines compare to see who is most up to date, the machine that has the highest oplog will be elected leader and writes will continue.
Clients can use “majority” Write Concern on every update, which guarantees that the update will be persisted even if failover happens immediately after the update succeeds and reports majority acknowledgement.
Note that false negatives can still occur in this asynchronous model: an update may be replicated to a majority of replicas and be safe, but the acknowledgement might fail to return to the client. In Ark, just as in Raft, we consider this an acceptable failure mode.
After reading through the paper I was pretty positive. It follows the Raft model pretty closely and there were no holes that I could see in the proof. However the difference between theory and practice are two different beasts. The TokuMX team have a public branch of their algorithm. Verifying the algorithm under actual node failure is an important next step and one that I’d like to see. Time depending, I may look at verifying this myself.
There were two small issues I had with the paper:
- Because secondaries asynchronously apply the oplog to the database, it wasn’t clear if a new leader could begin serving reads and writes before it had finished applying the oplog. Common sense would assume that it would apply the oplog first.
- I think it would be helpful if the paper explicitly stated that it was only applicable for leader-follower replication and that all reads and writes needed to be served against the leader. When reading from a follower, writes will be eventually consistent as they are presently.
The TokuMX team are doing a blog series explaining the paper in laymans terms. However if you’re interested in this I encourage you to have a look at the paper first as it is very readable.