What’s new in Cassandra 0.6.4
I skipped ahead chronologically to get What’s new in Cassandra 0.6.5 written in time for its release. Here I’ll fill in that gap by covering 0.6.4.
- Process digest mismatch re-reads in parallel
- Respect snitch when determining whether to do local read for ConsistencyLevel.ONE
- Add ack to Binary write verb
- Pre-emptively drop requests that cannot be processed within RPCTimeout
Process digest mismatch re-reads in parallel
When Cassandra does a read at a low ConsistencyLevel, it still checks all replicas of the requested data in the background to make sure they agree; if they do not, it will send the most recent version to out-of-date replicas, so that repeated queries will return current data even if the first did not. This is called read repair.
To minimize the network overhead of read repair, only one copy of the actual data is read; other replicas are sent digest queries, and will reply with a hash of the result set rather than the actual data. In the common case, the hashes agree and that is the end of the repair. But when the hashes disagree, we handle that digest mismatch by reading the full result from the digest replicas.
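The digest exchange can be sketched roughly as follows; `digestOf` and `digestsMatch` are illustrative stand-ins, not Cassandra's actual internals, with MD5 used here purely for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Sketch of a digest read: replicas hash their result set, and the
// coordinator compares hashes instead of shipping full data around.
public class DigestRead {
    // Hash a result set; the concrete algorithm is an illustrative choice.
    static byte[] digestOf(String resultSet) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(resultSet.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // MD5 is always available in the JDK
        }
    }

    // True when the data replica's result agrees with a digest reply;
    // false means a mismatch, triggering a full re-read.
    static boolean digestsMatch(String dataResult, byte[] digestReply) {
        return Arrays.equals(digestOf(dataResult), digestReply);
    }

    public static void main(String[] args) {
        byte[] reply = digestOf("row-v1");
        System.out.println(digestsMatch("row-v1", reply)); // true: replicas agree
        System.out.println(digestsMatch("row-v2", reply)); // false: re-read needed
    }
}
```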
This patch parallelizes that re-read for a multiget request covering multiple rows.
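A minimal sketch of the idea, assuming a hypothetical `readFullRow` that fetches full data for one mismatched row (Cassandra's real implementation goes through its messaging service rather than a local thread pool): all re-reads are submitted before any result is awaited, so the rows are fetched concurrently instead of one at a time.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch: on digest mismatch in a multiget, re-read full data for all
// affected rows in parallel rather than serially.
public class ParallelReread {
    // Stand-in for a full data read from a replica (illustrative only).
    static String readFullRow(String key) {
        return "data-for-" + key;
    }

    // Submit every re-read up front, then collect the results.
    static Map<String, String> rereadInParallel(List<String> mismatchedKeys) {
        ExecutorService pool = Executors.newFixedThreadPool(
                Math.max(1, mismatchedKeys.size()));
        Map<String, Future<String>> pending = new LinkedHashMap<>();
        for (String key : mismatchedKeys) {
            pending.put(key, pool.submit(() -> readFullRow(key)));
        }
        Map<String, String> results = new LinkedHashMap<>();
        for (Map.Entry<String, Future<String>> e : pending.entrySet()) {
            try {
                results.put(e.getKey(), e.getValue().get());
            } catch (InterruptedException | ExecutionException ex) {
                throw new RuntimeException(ex);
            }
        }
        pool.shutdown();
        return results;
    }

    public static void main(String[] args) {
        System.out.println(rereadInParallel(Arrays.asList("k1", "k2")));
    }
}
```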
Respect snitch when determining whether to do local read for ConsistencyLevel.ONE
Cassandra decides which replica to send a data request to by consulting a pluggable Snitch interface: the closest replica to the coordinator node (the one the client made the request to) gets a data query; the others get digest queries.
But Read Repair is not Cassandra’s only mechanism for fixing out-of-date replicas: it also has Hinted Handoff and Anti-Entropy. If you’re willing to accept potentially longer inconsistency periods after failure conditions like a machine going down, you can turn Read Repair off to improve throughput in the normal case: instead of performing each read against every replica, just read from one.
But for this to give the most benefit, you want to force each read for a particular row to a single replica, so that we only cache hot data in a single place. This can be achieved with an appropriate Snitch; the last piece of the puzzle was this patch, which removed the incorrect assumption that the local node is always the “closest” for any data it holds a replica of.
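To make the idea concrete, here is a toy snitch-driven selection; the `Snitch` interface and `PrefixSnitch` below are simplified stand-ins for Cassandra's pluggable snitch, not its real API:

```java
import java.util.*;

// Sketch of snitch-driven replica selection: the snitch orders replicas by
// proximity to the coordinator; the closest gets the data read.
public class SnitchSelection {
    interface Snitch {
        List<String> sortByProximity(String coordinator, List<String> replicas);
    }

    // Toy snitch: prefer replicas sharing the coordinator's "datacenter"
    // prefix, e.g. "dc1-node2" counts as close to "dc1-node1".
    static class PrefixSnitch implements Snitch {
        public List<String> sortByProximity(String coordinator, List<String> replicas) {
            String dc = coordinator.split("-")[0];
            List<String> sorted = new ArrayList<>(replicas);
            sorted.sort(Comparator.comparingInt(r -> r.startsWith(dc) ? 0 : 1));
            return sorted;
        }
    }

    // The point of the 0.6.4 fix: ask the snitch which replica is closest,
    // rather than assuming the local node always is when it holds a replica.
    static String pickDataReplica(Snitch snitch, String coordinator, List<String> replicas) {
        return snitch.sortByProximity(coordinator, replicas).get(0);
    }

    public static void main(String[] args) {
        List<String> replicas = Arrays.asList("dc2-node1", "dc1-node2");
        System.out.println(pickDataReplica(new PrefixSnitch(), "dc1-node1", replicas));
    }
}
```

With every coordinator picking the same closest replica for a given row, the hot copy of that row is cached in one place instead of being spread across all replicas.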
Add ack to Binary write verb
Cassandra has a “bulk load” interface called, for historical reasons, the Binary Memtable. It’s important to note that using this API is almost always premature optimization: ordinary writes in Cassandra are very, very fast. The window where the Binary Memtable is useful (you want to offload CPU work from Cassandra to the client, but you are not yet network bound rather than CPU bound) is smaller than most people think.
Originally the binary write verb (command) was fire-and-forget: you told Cassandra to load a row of data and moved on to the next. This is great for firehosing data in but bad for dealing with error conditions like temporary connectivity loss. 0.6.4 adds a reply from the data target so that the source can retry if necessary.
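The retry loop the ack enables can be sketched like this; `Target` and `loadWithRetry` are hypothetical names for illustration, not part of Cassandra's API:

```java
// Sketch of the change: a fire-and-forget bulk write gives the sender no
// way to recover from errors; once the target acks each write, the sender
// can retry any row that was not acknowledged.
public class AckedBulkLoad {
    interface Target {
        boolean write(String row); // true when the write is acked
    }

    // Retry until acked or attempts are exhausted.
    static boolean loadWithRetry(Target target, String row, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (target.write(row)) return true; // ack received
        }
        return false; // caller can report the failure or re-queue the row
    }

    public static void main(String[] args) {
        // Simulated flaky target: fails twice, then acks.
        int[] calls = {0};
        Target flaky = row -> ++calls[0] >= 3;
        System.out.println(loadWithRetry(flaky, "row1", 5)); // succeeds after retries
    }
}
```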
Pre-emptively drop requests that cannot be processed within RPCTimeout
This is a flow control problem: how does Cassandra deal with clients sending requests faster than it can satisfy them? Prior to 0.6.4, Cassandra did not degrade gracefully here: it answered each request in FIFO order with no attempt at throttling, and could even run out of memory in extreme cases from holding onto millions of outstanding requests.
Traditional non-distributed databases deal with this by running one query per client and waiting until it finishes before starting another; if the database is overloaded, queries slow down, often dramatically, but there is no chance of pathological behavior like millions of outstanding requests.
This approach does not work for a distributed system like Cassandra, where a machine could lose connectivity to others after receiving a request but before sending the reply. This is why Cassandra introduced the RPCTimeout configuration directive: if a request from the coordinator (the node your client is connected to) to the node with the necessary data takes longer than that timeout, the request is assumed to have failed and a TimedOutException is sent to the client.
Thus, Cassandra could actually accumulate an arbitrary number of unfinished requests per client. The classic scenario is a server that is I/O bound on a read-heavy workload and gets farther and farther behind.
In a staged architecture like Cassandra’s, the classic way to prevent this problem is backpressure: a full stage will not accept any more requests from earlier stages, which then fill up themselves, until the stage talking to the clients itself stops reading new requests. But on closer examination, this turns out to be a bad fit for Cassandra: if only one node in a cluster is having trouble, we’d like to continue fulfilling requests to the others.
So the approach we took for 0.6.4 (and refined in 0.6.5) was to pre-emptively drop already-timed-out requests without processing them. These are requests that were going to generate a TimedOutException anyway no matter what the answer was, so we drop them and move on to a more recent request.
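A rough sketch of the dropping logic, assuming each queued request records its arrival time (the names here are illustrative, not Cassandra's actual stage code): a worker discards any request whose age already exceeds RPCTimeout, since the client has received a TimedOutException for it regardless of the answer.

```java
import java.util.*;

// Sketch of pre-emptive dropping: requests older than RPCTimeout are
// discarded unprocessed so the server can get to fresher requests.
public class DropExpired {
    static class Request {
        final String id;
        final long enqueuedAtMillis;
        Request(String id, long enqueuedAtMillis) {
            this.id = id;
            this.enqueuedAtMillis = enqueuedAtMillis;
        }
    }

    // Return only the requests still worth processing; drop the rest.
    static List<Request> filterExpired(Deque<Request> queue, long nowMillis,
                                       long rpcTimeoutMillis) {
        List<Request> live = new ArrayList<>();
        while (!queue.isEmpty()) {
            Request r = queue.poll();
            if (nowMillis - r.enqueuedAtMillis < rpcTimeoutMillis) {
                live.add(r); // still answerable within the timeout
            } // else: already timed out at the client; drop and move on
        }
        return live;
    }

    public static void main(String[] args) {
        Deque<Request> q = new ArrayDeque<>();
        q.add(new Request("stale", 0));     // enqueued long ago
        q.add(new Request("fresh", 9_500)); // enqueued recently
        // With a 10s RPCTimeout evaluated at t=10s, only "fresh" survives.
        for (Request r : filterExpired(q, 10_000, 10_000)) System.out.println(r.id);
    }
}
```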