Performance improvements in Cassandra 1.2

By Jonathan Ellis -  December 6, 2012 | 4 Comments

Cassandra 1.2 adds a number of performance optimizations, particularly for clusters with a large amount of data per node.

Moving internals off-heap

Disk capacities have been increasing. RAM capacities have been increasing roughly in step. But the JVM's ability to manage a large heap has not kept pace. So as Cassandra clusters deploy more and more data per node, we've been moving the storage engine's internal structures off-heap, managing them manually in native memory instead.

1.2 moves the two biggest remaining culprits off-heap: compression metadata and per-row bloom filters.

Compression metadata takes about 1-3GB of memory per TB of compressed data, depending on compression block size and compression ratio. Moving this into native memory is especially important now that compression is enabled by default.

Bloom filters help Cassandra avoid scanning data files that can’t possibly include the rows being queried. They weigh in at 1-2GB per billion rows, depending on how aggressively they are tuned.
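As a rough back-of-the-envelope check, the per-unit figures above can be turned into a per-node estimate. The node size and row count below are hypothetical, and the function name is ours, not Cassandra's:

```python
# Rough off-heap memory estimate using the per-unit figures from the text.
# The example node (3 TB compressed data, 2 billion rows) is hypothetical.

def offheap_estimate_gb(compressed_tb, billion_rows,
                        meta_gb_per_tb=2.0,      # midpoint of the 1-3 GB/TB range
                        bloom_gb_per_brow=1.5):  # midpoint of 1-2 GB per billion rows
    compression_meta = compressed_tb * meta_gb_per_tb
    bloom_filters = billion_rows * bloom_gb_per_brow
    return compression_meta + bloom_filters

print(offheap_estimate_gb(3, 2))  # 3 TB data, 2 billion rows -> 9.0 GB off-heap
```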

Both reuse the existing sstable reference counting, with minor tweaks to free the native resources when the sstable they are associated with is compacted away.

Column index performance

Cassandra has supported indexes on columns for over two years, but our implementation has been simplistic: when an indexed column was updated, we'd read the old version of that column, mark the old index entry invalid, and add a new index entry.

There are two problems with this approach:

  1. This needed to be done with a (sharded) row lock, so for heavy insert loads lock contention could be a problem.
  2. If the rows being updated aren't cached in memory, an update causes a disk seek (to read the old value). This violates our design principle of avoiding random I/O on writes.

I've long been a proponent of having a tightly integrated storage engine in Cassandra, and this is another time we see the benefits of that approach. Starting in 1.2, index updates work as follows:

  1. Add an index entry for the new column value
  2. If the old column value was still in the memtable (common for updating a small set of rows repeatedly), remove the old column value
  3. Otherwise, let the old value get purged by compaction
  4. If a read sees a stale index entry before compaction purges it, the reader thread will invalidate it
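The four steps above can be sketched with plain dictionaries. The data structures and names here are simplified stand-ins for illustration, not Cassandra's actual memtable or index code:

```python
# Illustrative sketch of the 1.2 index-update path described above.
memtable = {}   # row_key -> current column value
index = {}      # indexed value -> set of row_keys

def update(row_key, new_value):
    # 1. Add an index entry for the new value (no read of the old value needed).
    index.setdefault(new_value, set()).add(row_key)
    # 2. If the old value is still in the memtable, remove its entry now.
    old = memtable.get(row_key)
    if old is not None and old != new_value:
        index[old].discard(row_key)
    # 3. Otherwise any stale on-disk entry is simply left for compaction to purge.
    memtable[row_key] = new_value

def read_by_value(value):
    # 4. A read that encounters a stale index entry invalidates it.
    hits = set()
    for row_key in list(index.get(value, ())):
        if memtable.get(row_key) == value:
            hits.add(row_key)
        else:
            index[value].discard(row_key)  # stale entry: invalidate on read
    return hits

update("r1", "a")
update("r1", "b")  # overwrite while "a" is still in the memtable
```

After the two updates, a read for "b" finds r1 and a read for "a" finds nothing, because step 2 removed the stale entry without ever touching disk.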

Parallel leveled compaction

Leveled compaction is a big win for update-intensive workloads, but has had one big disadvantage versus the default size-tiered compaction: only one leveled compaction could run at a time per table, no matter how many hard disks or SSDs your data was spread across. SSD users in particular have been vocal in demanding this feature.

Cassandra 1.2 fixes this, allowing leveled compaction to run up to concurrent_compactors compactions in parallel across different sstable ranges (including multiple compactions within the same level).
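The idea can be sketched with a bounded thread pool: compactions over disjoint sstable ranges run in parallel, capped at concurrent_compactors (which is a real cassandra.yaml setting; the task and range names below are invented for illustration):

```python
# Sketch: parallel compactions over disjoint sstable ranges, bounded by
# concurrent_compactors. The compact() body is a placeholder, not real merging.
from concurrent.futures import ThreadPoolExecutor

concurrent_compactors = 4  # mirrors the cassandra.yaml setting of the same name

def compact(sstable_range):
    # stand-in for merging the sstables covering one non-overlapping range
    return f"compacted {sstable_range}"

# Disjoint ranges may come from different levels, or from the same level.
ranges = ["L1:0-100", "L1:100-200", "L2:0-50"]
with ThreadPoolExecutor(max_workers=concurrent_compactors) as pool:
    results = list(pool.map(compact, ranges))
```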

Murmur3Partitioner

Cassandra 1.2 ships with a new default partitioner, the Murmur3Partitioner, based on the Murmur3 hash. Cassandra's use of consistent hashing does not require cryptographic hash properties (in particular, collisions are fine), so the older RandomPartitioner's use of MD5 was simply a matter of reusing a convenient, well-distributed hash built into the JDK. Murmur3 is faster than MD5, but since hashing the partition key is only a small part of the work Cassandra does to service a request, the performance gain in real-world workloads is negligible.
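To make the idea concrete, here is a simplified RandomPartitioner-style token computation: hash the key and reduce it into the ring's token space. This mirrors the concept, not Cassandra's exact token arithmetic, and the function name is ours:

```python
import hashlib

def random_partitioner_token(key: bytes) -> int:
    # RandomPartitioner-style token: the MD5 digest of the partition key,
    # reduced to a non-negative integer in [0, 2**127). Any hash with good
    # distribution works here; cryptographic strength is not needed, which
    # is why Murmur3 (faster, non-cryptographic) is a fine replacement.
    return int.from_bytes(hashlib.md5(key).digest(), "big") % (2 ** 127)

t = random_partitioner_token(b"user:42")  # some token in [0, 2**127)
```

The same key always maps to the same token, which is all consistent hashing requires; swapping the hash function changes every token, which is why the two partitioners are incompatible.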

Murmur3Partitioner is NOT compatible with RandomPartitioner, so if you're upgrading and using the new cassandra.yaml file, be sure to change the partitioner back to RandomPartitioner. (If you don't, Cassandra will notice that you've picked an incompatible partitioner and refuse to start, so no permanent harm done.)

We've also switched bloom filters from Murmur2 to Murmur3.

NIO Streaming

Streaming is when one Cassandra node transfers an entire range of data to another, either for bootstrapping new nodes into the cluster or for repair.

When we added compression in Cassandra 1.0, we had to temporarily fall back to a manual read-uncompress-stream process, which is much less efficient than letting the kernel handle the transfer.

1.2 adds that optimization back in as much as possible: we let the kernel do the transfer whenever we have entire compressed blocks to transfer, which is the common case.
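The decision the text describes can be sketched as splitting a byte range into a whole-block span (eligible for the kernel's zero-copy path) and any partial-block leftovers (which take the manual path). The function name and the block size used below are illustrative:

```python
# Sketch: whole compressed blocks can be handed to the kernel (zero-copy);
# partial blocks at the edges take the manual read-uncompress-stream path.

def split_transfer(start, end, block_size):
    """Return (zero_copy_span, manual_spans) for the byte range [start, end)."""
    first_block = -(-start // block_size) * block_size  # round start up to a block
    last_block = (end // block_size) * block_size       # round end down to a block
    if first_block >= last_block:
        return None, [(start, end)]   # no whole block inside the range
    manual = []
    if start < first_block:
        manual.append((start, first_block))  # leading partial block
    if last_block < end:
        manual.append((last_block, end))     # trailing partial block
    return (first_block, last_block), manual
```

In the common case the range is block-aligned and the manual list is empty, so the entire transfer goes through the kernel.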

Asynchronous hints delivery

Hinted handoff is where a request coordinator saves updates that it couldn't deliver to a replica, to retry later.

Cassandra 1.2 allows many hints to be delivered to the target replica concurrently, subject to hinted_handoff_throttle_in_kb. This allows recovering replicas to become consistent with the rest of the cluster much faster.
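The spirit of concurrent delivery under a shared cap can be sketched with a simple shared budget. hinted_handoff_throttle_in_kb is a real setting; the class below and its budget semantics are a simplified illustration, not Cassandra's implementation:

```python
# Sketch: many hint deliveries share one rate budget, in the spirit of
# hinted_handoff_throttle_in_kb. A real throttle would refill over time.
import threading

class HintThrottle:
    def __init__(self, budget_kb):
        self.budget_kb = budget_kb
        self.sent_kb = 0
        self.lock = threading.Lock()  # many sender threads share the budget

    def try_send(self, hint_kb):
        with self.lock:
            if self.sent_kb + hint_kb > self.budget_kb:
                return False          # over budget: caller retries later
            self.sent_kb += hint_kb
            return True

throttle = HintThrottle(budget_kb=1024)
# Ten 256 KB hints against a 1024 KB budget: only the first four fit.
delivered = sum(1 for _ in range(10) if throttle.try_send(256))
```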

Others

We've blogged previously about optimizing tombstone removal and making Cassandra start up faster.


Comments

  1. David Vanderfeesten says:

    “Compression metadata takes about 20GB of memory per TB of compressed data in previous releases.”
    What is the advised max data volume per node in 1.0 (compression enabled/disabled)? And how does this compare to the new 1.2?

  2. Jonathan Ellis says:

    For 1.0 we recommend 300-500GB. For 1.2 we are looking to be able to handle 10x (3-5TB).

  3. Sri says:

    Jonathan, 3 to 5 TB per node. Is the data volume per node without replication data volume?

  4. Jonathan Ellis says:

    It’s replication-agnostic. 3 raw TB with one replica, or 1 raw TB x 3 replicas = 3 TB per machine will have identical memory and disk footprints.
