What’s new in Cassandra 0.6.5
0.6.5 is part of the stable 0.6 release series; no API changes were made, but there are important improvements for operations:
- Dynamic Snitch
- Use mlockall via JNA, if present, to prevent Linux from swapping out parts of the JVM
- Page within a single row during hinted handoff
- Faster UUIDType, LongType comparisons
- Log summary of dropped messages instead of spamming log
The full changelog is here.
Cassandra has always been good at dealing with cluster members who are all the way dead, thanks to its failure detector. The dynamic snitch lets us also handle members who are only mostly dead, that is, are still responding but with impaired performance.
The Snitch is the component that tells the rest of Cassandra about how nodes are distributed across your network. Typically, this is done in terms of racks and data centers; Cassandra will then route read requests to the closest replica: a node in the same rack is considered closer than others in the same datacenter, which are closer than nodes in a different datacenter entirely.
The dynamic Snitch incorporates real-time request latency into its closeness metric, and routes requests to nodes that respond the fastest, no matter where they are actually located. This elegantly handles any type of impaired performance, no matter the cause, whether it’s a hardware problem like a failing disk or just a transient condition like a restarted node with a cold cache.
The dynamic snitch augments your existing snitch rather than replacing it entirely; the static snitch is used until there is enough information to usefully route requests dynamically. To enable the dynamic snitch, add
To your JVM options in cassandra.in.sh.
Use mlockall via JNA
Linux aggressively swaps out infrequently used memory to make more room for its file system buffer cache. Unfortunately, modern generational garbage collectors like the JVM’s leave parts of its heap un-touched for relatively large amounts of time, leading Linux to swap it out. When the JVM finally goes to use or GC that memory, swap hell ensues.
Setting swappiness to zero can mitigate this behavior but does not eliminate it entirely. Turning off swap entirely is effective. But to avoid surprising people who don’t know about this behavior, the best solution is to tell Linux not to swap out the JVM, and that is what we do now with mlockall via JNA.
Because of licensing issues, we can’t distribute JNA with Cassandra, so you must manually add it to the Cassandra lib/ directory or otherwise place it on the classpath. If the JNA jar is not present, Cassandra will continue as before.
Page within a single row during hinted handoff
When sending hinted data to a recovered node, 0.6.5 will break sending a large row up into multiple pieces. This is not usually crucial in 0.6, since hint replay is serialized and row sizes are capped at 2 GB; still, it’s nice to avoid hogging the intra-cluster socket — and row size is uncapped in 0.7, so we’re getting ready for that too.
Faster UUIDType, LongType comparisons
These column compararators were rewritten to avoid allocating any objects during the compare.
Log summary of dropped messages instead of spamming log
In the prior stable release, 0.6.4, we updated Cassandra to deal with overload scenarios by dropping requests that have already timed out, rather than attempting to complete them, so that it has a chance to catch up on non-timed-out requests. (Clients will get a TimedOutException, which is what they would have gotten anyway even if the requested ultimately succeeded too late.)
As part of this change, we logged a warning when we dropped a message in this fashion. But when you have thousands of requests per machine per second, and you’re tring to deal with a load spike, logging one warning per request is not optimal (although generating a log message is still much faster than performing the request would have been). In 0.6.5 we changed to logging a summary once per second of any requests that were deliberately dropped to reduce load.
Remove message deserialization stage
Cassandra loosely follows the SEDA design of breaking up requests into “stages” that are connected by work queues. The first stage any inter-node request went into after being read from the socket was the message deserialization stage, which would turn it into a request object and submit it to the appropriate queue, typically the one for the read or write stages.
But, the read and write queues were capped at 4096 pending tasks. If either filled up, the deserialization stage would block for space on that queue to become free. This is fine if reads and writes were your only tasks, but while they are a large majority, there are others, particularly gossip messages that need to be processed by gossip and failure detection. These would also get stuck in the deserializer stage if the read or write queues filled up.
0.6.5 simplifies the stage design, performing deserialization on the threads reading from sockets, and uncapping read and write queues so they never block other tasks. The dropping of already-timed-out messages that is explained above prevents the queues from becoming arbitrarily large.
(I covered this with extra diagram goodness in my State of Cassandra talk at the recent Cassandra Summit.)