What’s new in Cassandra 0.6.6
0.6.6 is part of the stable 0.6 release series; no API changes were made, but there are important improvements for operations:
- Configurable IndexInterval
- Add CMSInitiatingOccupancyFraction=75 and UseCMSInitiatingOccupancyOnly to default JVM options
- Document DoConsistencyChecksBoolean option to disable Read Repair
- Use JNA to take snapshots
- Add memtable, cache information to GCInspector logs
- Cache save and load
- Tombstone removal during non-major compactions
The full changelog is here.
Configurable IndexInterval
Instead of b-trees, Cassandra uses a more I/O-efficient design to find row locations: each sstable includes an index file containing the row keys and the position at which each row starts in the data file. At startup, Cassandra reads a sample of that index into memory; by default, every 128th entry. To find a row, Cassandra performs a binary search on the sample, then does just one disk read of the index “block” corresponding to the closest sampled entry.
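As a rough illustration of the lookup described above, here is a minimal Python sketch. It is not Cassandra's actual code: the function names, the in-memory list standing in for the on-disk index file, and the use of plain string ordering for keys are all assumptions made for the example.

```python
from bisect import bisect_right

INDEX_INTERVAL = 128  # default sampling interval: every 128th index entry


def build_sample(index_entries, interval=INDEX_INTERVAL):
    """Keep every `interval`-th key from the sorted index, plus its slot number.

    `index_entries` is a sorted list of (row_key, data_file_position) pairs,
    standing in for the on-disk index file (hypothetical representation).
    """
    keys, slots = [], []
    for slot in range(0, len(index_entries), interval):
        keys.append(index_entries[slot][0])
        slots.append(slot)
    return keys, slots


def find_row(index_entries, sample, key, interval=INDEX_INTERVAL):
    """Return the data-file position of `key`, or None if it is not indexed."""
    keys, slots = sample
    # Binary search the in-memory sample for the closest entry <= key.
    i = bisect_right(keys, key) - 1
    if i < 0:
        return None  # key sorts before the first index entry
    # One contiguous "disk read" of the index block that starts at that entry.
    start = slots[i]
    for entry_key, position in index_entries[start:start + interval]:
        if entry_key == key:
            return position
    return None
```

With 1,000 index entries and the default interval, the sample holds only 8 keys, and any lookup scans at most one block of 128 entries.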
We had a customer with a large amount of cold data and small rows (thus, a higher ratio of index size to data size) who found that too much of their memory was being used by this index sample. We gave them a custom build of 0.6.5 that reduced the sampling rate to every 512th entry, and added this configuration setting.
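The trade-off can be put in numbers with a small sketch. The row count below is purely hypothetical and is not the customer's real figure; the function name is also an assumption for the example.

```python
def sampled_entries(total_rows, interval):
    """Number of index entries held in memory when sampling every `interval`-th row."""
    return -(-total_rows // interval)  # integer ceiling division


# Hypothetical node holding one billion small rows.
ROWS = 1_000_000_000
at_default = sampled_entries(ROWS, 128)  # 7,812,500 sampled keys in memory
at_custom = sampled_entries(ROWS, 512)   # 1,953,125 sampled keys in memory
```

Going from an interval of 128 to 512 cuts the number of sampled keys by a factor of four. The cost is that each lookup may have to scan an index block of up to 512 entries instead of 128.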
Add CMSInitiatingOccupancyFraction=75 and UseCMSInitiatingOccupancyOnly to default JVM options
By default, the JVM tries to estimate when it needs to begin a major garbage collection. It is striking a balance between wasting CPU by performing GC before it is necessary and running out of heap space before the collection can finish, which forces it to fall back to a stop-the-world collection. (For gory details, the term you want to look up is concurrent mode failure.)
For many applications this is fine, but with Cassandra it’s worth spending extra CPU to avoid even a small possibility of being paused for several seconds for a stop-the-world collection. These options tell the JVM to always start a collection when the heap (strictly, the old generation that CMS manages) is 75% full. (This is a reasonable default based on Cassandra deployments, but some workloads may need to begin GC even earlier, especially with relatively small heaps.)
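For reference, the two flags are spelled out below as they would be passed to the java command, together with a toy model of the rule they impose. The flags are real HotSpot options; the function is only an illustration of a fixed threshold and is not how the JVM is implemented.

```python
# The two HotSpot options named above, as passed on the java command line.
JVM_OPTS = [
    "-XX:CMSInitiatingOccupancyFraction=75",
    "-XX:+UseCMSInitiatingOccupancyOnly",
]


def should_start_collection(used_bytes, capacity_bytes, fraction_percent=75):
    """Toy model: start a concurrent collection once occupancy reaches the threshold.

    With UseCMSInitiatingOccupancyOnly set, the fixed percentage is used on its
    own and the JVM's adaptive estimate is ignored.
    """
    return used_bytes * 100 >= capacity_bytes * fraction_percent
```

Lowering the percentage starts collections earlier, which costs more CPU but leaves more headroom for the collection to finish before the heap fills.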
Document DoConsistencyChecksBoolean option to disable Read Repair
Read repair is how Cassandra restores consistency in frequently-accessed data after downtime of one or more replicas. The downside is that it reduces your throughput by a factor equal to the number of replicas, since each request is performed against each replica. If you’re okay with longer periods of stale data at low consistency levels (until anti-entropy repair finishes), you can disable read repair and increase throughput with this option (which has been available, but not documented, since the first Cassandra release).
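The idea behind read repair can be sketched in a few lines of Python. This is a simplified model and not Cassandra's implementation: replicas are plain dictionaries, every replica is read synchronously, and the names `Versioned` and `read_with_repair` are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Versioned:
    value: str
    timestamp: int  # newer writes carry larger timestamps


def read_with_repair(replicas, key):
    """Read `key` from every replica, return the newest value, and repair stale copies.

    `replicas` is a list of dicts mapping key -> Versioned (hypothetical stand-ins
    for replica nodes).
    """
    responses = [replica.get(key) for replica in replicas]
    present = [v for v in responses if v is not None]
    if not present:
        return None
    newest = max(present, key=lambda v: v.timestamp)
    # Push the newest version to any replica that is missing it or holds an older one.
    for replica, seen in zip(replicas, responses):
        if seen is None or seen.timestamp < newest.timestamp:
            replica[key] = newest
    return newest.value
```

The sketch also shows where the throughput cost comes from: a single client read touches every replica, not just one.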
Use JNA to take snapshots
Cassandra uses a log-structured storage design, meaning that updates to a row are appended to new data files (called sstables) rather than overwriting the old row in place. Thus, taking a snapshot is as simple as creating a hard link to all the current sstables, so when they are unlinked by compaction, the snapshot link remains.
We introduced the use of JNA in 0.6.5 to perform OS-specific optimizations. Here, we’re using it to create the hard links; if JNA is not available, Cassandra will fall back to the old method of executing ln (or on Windows, either mklink or fsutil). The drawback of the fallback is that forking a child process from a JVM with a large heap generally requires enabling either swap or overcommit, neither of which is desirable on a Cassandra server.
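To see why hard links make snapshots cheap, here is a minimal sketch using Python's os.link, which makes the link system call directly without spawning a child process. The directory layout and the function name are assumptions for illustration, not a description of Cassandra's on-disk format.

```python
import os


def snapshot(data_dir, snapshot_name):
    """Hard-link every file in `data_dir` into a snapshot subdirectory.

    No data is copied: each link is just a second name for the same file, so
    the snapshot survives when the original name is later unlinked.
    """
    snap_dir = os.path.join(data_dir, "snapshots", snapshot_name)  # illustrative layout
    os.makedirs(snap_dir, exist_ok=True)
    linked = []
    for name in sorted(os.listdir(data_dir)):
        src = os.path.join(data_dir, name)
        if os.path.isfile(src):  # skip the snapshots directory itself
            os.link(src, os.path.join(snap_dir, name))
            linked.append(name)
    return snap_dir, linked
```

After a snapshot, deleting a data file from `data_dir` (as compaction would) leaves the linked copy in the snapshot directory readable, and the disk space is freed only when the last link is removed.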
Add memtable, cache information to GCInspector logs
It turns out that logging information after a garbage collection run is a good way to get bare-bones monitoring information when nothing better is configured. In 0.6.4, we started logging internal Cassandra thread pool statuses; for 0.6.6 we added ColumnFamily latency, memtable, and cache metrics.
Cache save and load
Any system that relies on caching for improved performance can suffer when the cache is cold; in Cassandra’s case, the key cache and row cache had to be rebuilt when the server restarted. The dynamic snitch introduced in 0.6.5 mitigates this by routing around slow nodes, but if we can start the node with a pre-warmed cache, so much the better.
For 0.6.6 we introduced periodic saving of the row and/or key caches to be reloaded at the next restart. This is off by default; update KeyCacheSavePeriodInSeconds and/or RowCacheSavePeriodInSeconds on a per-ColumnFamily basis to enable. You can also manually force cache save from JMX as a one-off, e.g. when preparing to upgrade.
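The save-and-reload cycle can be sketched as follows. This is a generic illustration under stated assumptions and not Cassandra's file format: it assumes only the cached keys are written out and that values are re-read at startup, it uses JSON for simplicity, and the periodic timer is left out. Both function names are hypothetical.

```python
import json
import os


def save_cache_keys(cache, path):
    """Write the keys currently in `cache` to `path`, replacing any older save atomically."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(list(cache.keys()), f)
    os.replace(tmp, path)  # a crash mid-write never leaves a truncated save file


def load_cache(path, read_row):
    """Rebuild a warm cache at startup by re-reading each saved key with `read_row`."""
    if not os.path.exists(path):
        return {}  # nothing saved yet: start cold
    with open(path) as f:
        keys = json.load(f)
    return {key: read_row(key) for key in keys}
```

In a real server, `save_cache_keys` would run on a timer at the configured save period, and `load_cache` would run once during startup before the node begins serving reads.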
Tombstone removal during non-major compactions
Tombstones are the special values written to represent a delete (remember, log-structured storage). Since Cassandra needs to handle updates in any order, we need to make sure that if we drop a tombstone during compaction we’ve looked at all the rows that could contain data that the tombstone should suppress. The easy way to guarantee this is to only drop tombstones during major compactions, that is, compactions of all sstables.
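Now Cassandra takes advantage of the bloom filters that we already use to avoid doing index lookups in sstables that don’t contain any data for a row: during a non-major compaction, if the bloom filters of the sstables left out of the compaction show that none of them contains the row, the tombstone can safely be dropped. Most installations should no longer need to periodically force a major compaction for tombstone purging.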
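A minimal sketch of that check, assuming a toy bloom filter built on SHA-256. The class, its sizing, and the function name are illustrative and are not Cassandra's implementation; the sketch also leaves out any waiting period the real purge applies before a tombstone becomes eligible.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))


def can_drop_tombstone(key, filters_outside_compaction):
    """True if no sstable left out of this compaction could hold data for `key`.

    A false positive only means the tombstone is kept a little longer, so the
    check errs on the safe side.
    """
    return not any(bf.might_contain(key) for bf in filters_outside_compaction)
```

Because a bloom filter never gives a false negative, a "no" from every sstable outside the compaction is enough to know that the tombstone has nothing left to suppress there.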