Performance improvements in Cassandra 1.2
date: December 6, 2012
Cassandra 1.2 adds a number of performance optimizations, particularly for clusters with a large amount of data per node.
Moving internals off-heap
Disk capacities have been increasing. RAM capacities have been increasingly roughly in step. But the JVM's ability to manage a large heap has not kept pace. So as Cassandra clusters deploy more and more data per node, we've been moving storage engine internal structures off-heap, managing them manually in native memory instead.
1.2 moves the two biggest remaining culprits off-heap: compression metadata and per-row bloom filters.
Bloom filters help Cassandra avoid scanning data files that can’t possibly include the rows being queried. They weigh in at 1-2GB per billion rows, depending on how aggressively they are tuned.
Both of these use the existing sstable reference counting with minor tweaking to free native resources when the sstable they are associated with is compacted away.
Column index performance
Cassandra has supported indexes on columns for over two years, but our implementation has been simplistic: when an indexed column was updated, we'd read the old version of that column, mark the old index entry invalid, and add a new index entry.
There are two problems with this approach:
- This needed to be done with a (sharded) row lock, so for heavy insert loads lock contention could be a problem.
- If your rows being updated aren't cached in memory, doing an update will cause a disk seek (to read the old value). This violates our design principle of avoiding random i/o on writes.
I've long been a proponent of having a tightly integrated storage engine in Cassandra, and this is another time we see the benefits of that approach. Starting in 1.2, index updates work as follows:
- Add an index entry for the new column value
- If the old column value was still in the memtable (common for updating a small set of rows repeatedly), remove the old column value
- Otherwise, let the old value get purged by compaction
- If a read sees a stale index entry before compaction purges it, the reader thread will invalidate it
Parallel leveled compaction
Leveled compaction is a big win for update-intensive workloads, but has had one big disadvantage vs the default size-tiered compaction: only one leveled compaction at a time could run at a time per table, no matter how many hard disks or SSDs you had your data spread across. SSD users in particular have been vocal in demanding this feature.
Cassandra 1.2 fixes this, allowing the LCS to run up to concurrent_compactors compactions across different sstable ranges (including multiple compactions within the same level).
Cassandra 1.2 ships with a new default partitioner, the Murmur3Partitioner based on the Murmur3 hash. Cassandra's use of consistent hashing does not require cryptographic hash properties (in particular, collisions are fine), so the older RandomPartitioner's use of MD5 was just a matter of using a convenient function with good distribution built into the JDK. Murmur3 is faster than MD5, but since hashing the partition key is only a small amount of the work Cassandra does to service requests the performance gains in real world workloads are negligible.
Murmur3Partitioner is NOT compatible with RandomPartitioner, so if you're upgrading and using the new cassandra.yaml file, be sure to change the partitioner back to RandomPartitioner. (If you don't, Cassandra will notice that you've picked an incompatible partitioner and refuse to start, so no permanent harm done.)
We've also switched bloom filters from Murmur2 to Murmur3.
Streaming is when one Cassandra node transfers an entire range of data to another, either for bootstrapping new nodes into the cluster or for repair.
When we added compression to Cassandra 1.0 we had to switch back temporarily to a manual data read-uncompress-stream process, which is much less efficient than letting the kernel handle the transfer.
1.2 adds that optimization back in as much as possible: we let the kernel do the transfer whenever we have entire compressed blocks to transfer, which is the common case.
Asynchronous hints delivery
Hinted handoff is where a request coordinator saves updates that it couldn't deliver to a replica, to retry later.
Cassandra 1.2 allows many hints to be delivered to the target replica concurrently, subject to hinted_handoff_throttle_in_kb. This allows recovering replicas to become consistent with the rest of the cluster much faster.
We've blogged previously about optimizing tombstone removal and making Cassandra start up faster.