DataStax Developer Blog

Performance improvements in Cassandra 1.2

By Jonathan Ellis - December 6, 2012

Cassandra 1.2 adds a number of performance optimizations, particularly for clusters with a large amount of data per node.

Moving internals off-heap

Disk capacities have been increasing. RAM capacities have been increasing roughly in step. But the JVM’s ability to manage a large heap has not kept pace. So as Cassandra clusters deploy more and more data per node, we’ve been moving storage engine internal structures off-heap, managing them manually in native memory instead.

1.2 moves the two biggest remaining culprits off-heap: compression metadata and per-row bloom filters.

Compression metadata takes about 1-3GB of memory per TB of compressed data, depending on compression block size and compression ratio. Moving this into native memory is especially important now that compression is enabled by default.

Bloom filters help Cassandra avoid scanning data files that can’t possibly include the rows being queried. They weigh in at 1-2GB per billion rows, depending on how aggressively they are tuned.

Both components reuse the existing sstable reference counting, with minor tweaks to free the native resources when the sstable they are associated with is compacted away.
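
To illustrate the general pattern (a minimal sketch of my own, not Cassandra’s actual classes), an off-heap component pairs natively allocated memory with a reference count and releases the memory once the last reference is dropped:

    import java.nio.ByteBuffer;
    import java.util.concurrent.atomic.AtomicInteger;

    // Minimal sketch of an off-heap structure tied to a reference count.
    // Cassandra's real implementation differs; this just shows the pattern:
    // allocate outside the Java heap, and free explicitly when the owning
    // sstable's reference count drops to zero (e.g. after compaction).
    class OffHeapComponent {
        private final ByteBuffer memory;  // direct (off-heap) allocation
        private final AtomicInteger references = new AtomicInteger(1);

        OffHeapComponent(int sizeInBytes) {
            this.memory = ByteBuffer.allocateDirect(sizeInBytes);
        }

        void reference() {
            references.incrementAndGet();
        }

        void release() {
            if (references.decrementAndGet() == 0) {
                // A direct buffer is reclaimed only when the GC collects it; an
                // implementation using true native allocation (Unsafe, JNA, etc.)
                // would free the memory explicitly at this point.
            }
        }

        long getLong(int index) {
            return memory.getLong(index << 3); // read the index-th 8-byte value
        }
    }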

Column index performance

Cassandra has supported indexes on columns for over two years, but our implementation has been simplistic: when an indexed column was updated, we’d read the old version of that column, mark the old index entry invalid, and add a new index entry.

There are two problems with this approach:

  1. This had to be done under a (sharded) row lock, so lock contention could become a problem under heavy insert loads.
  2. If the rows being updated aren’t cached in memory, each update causes a disk seek to read the old value, which violates our design principle of avoiding random I/O on writes.

I’ve long been a proponent of having a tightly integrated storage engine in Cassandra, and this is another case where we see the benefits of that approach. Starting in 1.2, index updates work as follows (see the sketch after the list):

  1. Add an index entry for the new column value
  2. If the old column value was still in the memtable (common for updating a small set of rows repeatedly), remove the old column value
  3. Otherwise, let the old value get purged by compaction
  4. If a read sees a stale index entry before compaction purges it, the reader thread will invalidate it
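
Here is a simplified, self-contained sketch of that flow, using plain maps in place of the real memtable and index structures; the names are placeholders, not Cassandra’s code:

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Simplified sketch of the 1.2 index update path described above. The key
    // point is the ordering: add the new entry first, remove the old one only
    // if it is still in memory, and otherwise leave the stale entry for
    // compaction (or a later read) to clean up. No read-before-write or row
    // lock is needed on the insert path.
    class IndexUpdateSketch {
        // "memtable": row key -> current column value (in-memory only)
        private final ConcurrentMap<String, String> memtable = new ConcurrentHashMap<>();
        // "index": column value -> row keys that (may) contain it
        private final ConcurrentMap<String, Set<String>> index = new ConcurrentHashMap<>();

        void update(String rowKey, String newValue) {
            // 1. Add an index entry for the new value.
            index.computeIfAbsent(newValue, v -> ConcurrentHashMap.<String>newKeySet()).add(rowKey);

            // 2. If the old value was still in the memtable, drop its index entry now.
            String oldValue = memtable.put(rowKey, newValue);
            if (oldValue != null && !oldValue.equals(newValue)) {
                Set<String> keys = index.get(oldValue);
                if (keys != null)
                    keys.remove(rowKey);
            }
            // 3. If the old value had already been flushed to disk, its stale index
            //    entry simply remains until compaction purges it or a read invalidates it.
        }
    }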

Parallel leveled compaction

Leveled compaction is a big win for update-intensive workloads, but it has had one big disadvantage versus the default size-tiered compaction: only one leveled compaction could run at a time per table, no matter how many hard disks or SSDs your data was spread across. SSD users in particular have been vocal in demanding this feature.

Cassandra 1.2 fixes this, allowing leveled compaction to run up to concurrent_compactors compactions at once across different sstable ranges (including multiple compactions within the same level).
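
Conceptually (my own rough sketch, not the actual LeveledCompactionStrategy code), the change amounts to handing disjoint compaction candidates to a pool bounded by concurrent_compactors instead of running them one at a time:

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Rough sketch: leveled compaction candidates that touch disjoint sstable
    // ranges can run in parallel, bounded by concurrent_compactors. The tasks
    // here are plain Runnables standing in for real compaction tasks.
    class ParallelLeveledCompaction {
        private final ExecutorService pool;

        ParallelLeveledCompaction(int concurrentCompactors) {
            this.pool = Executors.newFixedThreadPool(concurrentCompactors);
        }

        void submit(List<Runnable> disjointCompactionTasks) {
            // Pre-1.2, LCS behaved like a pool of size one per table; now each
            // disjoint sstable range can compact concurrently.
            for (Runnable task : disjointCompactionTasks)
                pool.execute(task);
        }
    }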

Murmur3Partitioner

Cassandra 1.2 ships with a new default partitioner, Murmur3Partitioner, based on the Murmur3 hash. Cassandra’s use of consistent hashing does not require cryptographic hash properties, so RandomPartitioner’s use of MD5 was just a matter of having a convenient function with good distribution built into the JDK. Murmur3 is 3x-5x faster, which translates into overall performance gains of over 10% for index-heavy workloads.
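
For a concrete feel for what the partitioner does, here is a hedged sketch that maps a partition key to a signed 64-bit token. It assumes Guava’s Murmur3 implementation is on the classpath; Cassandra ships its own, so exact token values are not guaranteed to match byte for byte:

    import com.google.common.hash.Hashing;
    import java.nio.charset.StandardCharsets;

    // Illustrative only: derive a signed 64-bit token from a partition key the
    // way a Murmur3-style partitioner does, by taking the first 64 bits of the
    // 128-bit Murmur3 hash.
    class Murmur3TokenSketch {
        static long token(byte[] partitionKey) {
            return Hashing.murmur3_128().hashBytes(partitionKey).asLong();
        }

        public static void main(String[] args) {
            System.out.println(token("user:42".getBytes(StandardCharsets.UTF_8)));
        }
    }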

Murmur3Partitioner is NOT compatible with RandomPartitioner, so if you’re upgrading and using the new cassandra.yaml file, be sure to change the partitioner back to RandomPartitioner. (If you don’t, Cassandra will notice that you’ve picked an incompatible partitioner and refuse to start, so no permanent harm done.)

We’ve also switched bloom filters from Murmur2 to Murmur3.

NIO Streaming

Streaming is when one Cassandra node transfers an entire range of data to another, either for bootstrapping new nodes into the cluster or for repair.

When we added compression in Cassandra 1.0, we had to temporarily switch back to a manual read-uncompress-stream process, which is much less efficient than letting the kernel handle the transfer.

1.2 adds that optimization back in as much as possible: we let the kernel do the transfer whenever we have entire compressed blocks to transfer, which is the common case.
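
The kernel-handled transfer is the classic zero-copy pattern. Below is a sketch of the general technique using plain java.nio; Cassandra’s streaming code layers framing and compression awareness on top of this:

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;

    // Sketch of the zero-copy transfer the post refers to: when an entire
    // compressed block can be sent as-is, FileChannel.transferTo lets the
    // kernel move bytes from the file to the socket without copying them
    // through the JVM heap.
    class ZeroCopyStreaming {
        static void streamBlock(FileChannel sstable, SocketChannel peer,
                                long blockOffset, long blockLength) throws IOException {
            long position = blockOffset;
            long remaining = blockLength;
            while (remaining > 0) {
                long sent = sstable.transferTo(position, remaining, peer);
                position += sent;
                remaining -= sent;
            }
        }
    }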

Asynchronous hints delivery

Hinted handoff is where a request coordinator saves updates that it couldn’t deliver to a replica, to retry later.

Cassandra 1.2 allows many hints to be delivered to the target replica concurrently, subject to hinted_handoff_throttle_in_kb. This allows recovering replicas to become consistent with the rest of the cluster much faster.
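
As a sketch of the delivery pattern (my own simplification: the names are placeholders and the use of Guava’s RateLimiter is an assumption, not the actual implementation), hints can be replayed concurrently while a shared limiter enforces the configured throttle:

    import com.google.common.util.concurrent.RateLimiter;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sketch of concurrent, throttled hint replay. Hints are delivered from a
    // thread pool while a shared rate limiter enforces a bytes-per-second
    // budget derived from hinted_handoff_throttle_in_kb.
    class HintDeliverySketch {
        interface Replica { void send(byte[] mutation); }

        private final ExecutorService deliveryPool = Executors.newFixedThreadPool(8);
        private final RateLimiter throttle; // shared across all hint deliveries

        HintDeliverySketch(int throttleInKb) {
            this.throttle = RateLimiter.create(throttleInKb * 1024.0); // bytes per second
        }

        void deliver(List<byte[]> hints, final Replica target) {
            for (final byte[] hint : hints) {
                deliveryPool.execute(new Runnable() {
                    public void run() {
                        throttle.acquire(Math.max(1, hint.length)); // pay for the bytes sent
                        target.send(hint);                          // replay the stored mutation
                    }
                });
            }
        }
    }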

Others

We’ve blogged previously about optimizing tombstone removal and making Cassandra start up faster.



Comments

  1. David Vanderfeesten says:

    “Compression metadata takes about 20GB of memory per TB of compressed data in previous releases.”
    What is the advised maximum data volume per node in 1.0 (compression enabled/disabled)? And how does this compare to the new 1.2?

  2. Jonathan Ellis says:

    For 1.0 we recommend 300-500GB. For 1.2 we are looking to be able to handle 10x (3-5TB).

  3. Sri says:

    Jonathan, regarding the 3 to 5 TB per node: is that the data volume per node before replication?

  4. Jonathan Ellis says:

    It’s replication-agnostic: 3 raw TB with one replica, or 1 raw TB x 3 replicas = 3 TB per machine, will have identical memory and disk footprints.
