The Cassandra annotated changelog: 0.6.2
- make concurrent_reads, concurrent_writes configurable at runtime via JMX
- disable GCInspector on non-Sun JVMs
- expose PhiConvictThreshold [in the configuration file]
- improve default JVM GC options
Make ConcurrentReads, ConcurrentWrites configurable at runtime via JMX
Cassandra exposes both metrics and tuneable parameters via JMX. So this was a straightforward change making it so that these values can be configured at runtime rather than requiring restarting the Cassandra process.
The more interesting part here is what these values represent: these are NOT the number of client connections that Cassandra will accept. Client connections are controlled by Thrift and are “unlimited,” that is, until you run out of sockets or thread stack space. (Why we use thread-per-client on the Thrift side is a subject for a separate post; to summarize, performance is actually quite good and Java nonblocking socket i/o has been buggy as sin until recently.)
ConcurrentReads and ConcurrentWrites are instead the maximum number of threads to allow Cassandra to allocate on the internal read and write stages. Typically, you’ll want to set concurrent writes higher than reads, since the log-structured data format used by Cassandra means writes are essentially CPU-bound, while reads will be i/o bound.
You don’t want to set either too high, though, or contention will become an issue. One cluster I helped troubleshoot had ConcurrentWrites set to 2560; cutting that to 64 improved performance dramatically.
Disable GCInspector on non-Sun JVMs
On the whole, building on the JVM has been a Good Thing for Cassandra, but one of the drawbacks is obviously that application code gets very little control over or even visibility into memory management. (This is one reason our caches are configured in terms of number of rows, rather than an exact memory ceiling.)
One of the less desirable consequences can be what we call a “GC storm,” where when the JVM is not given enough memory to do its job and it gets to the point where it is almost-but-not-quite out of memory, it will usurp almost all the CPU performing GC to collect just enough to keep going a little farther, but not enough to relieve the pressure. This results in the unpleasant situation of the Cassandra process being alive, but not responding quickly enough to actually be useful.
The solution is usually simple (give Cassandra more memory by adjusting the Xmx paramter and/or tell it to cache less agressively), but recognizing the problem can be difficult for people unfamiliar with the JVM, so we added the GCInspector to log garbage collector progress with lines like
INFO [GC inspection] 2010-07-21 01:01:49,661 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 11748 ms, 413673472 reclaimed leaving 9779542600 used; max is 10873667584
But GCInspector is specific to the Sun JVM, so we had to make it optional. This is part of our philosophy: we always want to offer a 100% pure java option for minority platforms like Windows, or other JVMs. But, recognizing that over 90% of Cassandra deployments are on Linux and the Sun JVM, we’re open to making optimizations for that environment.
This refers to the ϕ in The ϕ accrual failure detector, which Cassandra implements to decide when to assume a node that hasn’t participated in Gossip for a while has failed. The default value of 8 will take about 9 seconds for the rest of the cluster to recognize a failed node, but occasionally results in false positives on exceptionally poor networks, and on ec2 in particular can be exacerbated when your Cassandra node is sharing the machine with other VMs.
We recommend increasing PhiConvictThreshold to 10 in environments like this, which will result in a detection time of about 14s but eliminates false positives.
Improve default JVM GC options
Prior to 0.6.2, we used almost exactly the options we inherited when Facebook first open-sourced Cassandra almost two years ago. (The one change we made before 0.6.2 was to remove the almost comically bad “CMSInitiatingOccupancyFraction=1” directive, which essentially says “never ever stop running ConcurrentMarkSweep.”) The parts we kept in 0.6.2 are the directives for what kind of GC to use:
The tuning options we changed from
The main thing you want to avoid in the JVM is filling up the old generation while CMS is in progress, because then it has to stop all the other threads to perform a stop-the-world compaction. The thrust of these changes is to reserve more of the new generation for the survivor spaces (SurvivorRatio), and to keep objects in the survivor space for the duration of one new-generation collection (MaxTenuringThreshold) before assuming it will be long-lived and moving it to the old generation. For more details, see the documentation.
For the brave, another option worth experimenting with is -XX:+UseCompressedOops, which compresses references to 32 bits, improving heap efficiently on heaps up to about 25GB (IBM’s recommendation) or 32GB (Sun’s). Historically CompressedOops has been very unstable on Sun’s JVM with CMS enabled, but post-u21 it works well enough that I know of at least one Cassandra cluster with that enabled.