Jonathan Ellis

<p>Cassandra 1.2 adds a number of performance optimizations, particularly for clusters with a large amount of data per node.</p>

<h3>Moving internals off-heap</h3>

<p>Disk capacities have been increasing. RAM capacities have been increasing roughly in step. But the JVM's ability to manage a large heap has not kept pace. So as Cassandra clusters deploy more and more data per node, we've been moving&nbsp;<a href="http://2012.nosql-matters.org/cgn/wp-content/uploads/2012/06/Sylvain_Lebresne-Cassandra_Storage_Engine.pdf">storage engine internal structures</a>&nbsp;off-heap,&nbsp;<a href="http://www.slideshare.net/jbellis/dealing-with-jvm-limitations-in-apache-cassandra-fosdem-2012">managing them manually in native memory</a>&nbsp;instead.</p>

<p>1.2 moves the two biggest remaining culprits off-heap: compression metadata and per-row bloom filters.</p>

<p>Compression metadata takes about 1-3GB of memory per TB of compressed data.</p>

<p><a href="http://en.wikipedia.org/wiki/Bloom_filter">Bloom filters</a>&nbsp;help Cassandra avoid scanning data files that can’t possibly include the rows being queried. They weigh in at 1-2GB per billion rows, depending on how aggressively they are tuned.</p>
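<p>To see where a figure like 1-2GB per billion rows comes from, here is a minimal sketch using the textbook sizing formula for an optimally configured Bloom filter. The function name and the false-positive rates are illustrative assumptions; Cassandra's actual sizing logic may round or bucket differently.</p>

```python
import math

def bloom_filter_bytes(n_keys, false_positive_rate):
    """Approximate size of an optimally sized Bloom filter, in bytes.

    Uses the standard formula: bits per key = -ln(p) / (ln 2)^2.
    This is the textbook estimate, not Cassandra's exact allocation.
    """
    bits_per_key = -math.log(false_positive_rate) / (math.log(2) ** 2)
    return math.ceil(n_keys * bits_per_key / 8)

if __name__ == "__main__":
    for p in (0.01, 0.001):
        size_gb = bloom_filter_bytes(10**9, p) / 1e9
        print(f"false-positive rate {p}: about {size_gb:.1f} GB per billion rows")
```

<p>Under these assumptions, a 1% false-positive target costs roughly 1.2GB per billion rows and a 0.1% target roughly 1.8GB, which is in line with the range quoted above.</p>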

<p>Both of these use the existing sstable reference counting with&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-4865">minor tweaking</a>&nbsp;to free native resources when the sstable they are associated with is compacted away.</p>

<h3>Column index performance</h3>

<p>Cassandra has supported indexes on columns for over two years, but our implementation has been simplistic: when an indexed column was updated, we'd read the old version of that column, mark the old index entry invalid, and add a new index entry.</p>

<p>There are two problems with this approach:</p>

<ol>
	<li>This had to be done under a (sharded) row lock, so for heavy insert loads lock contention could be a problem.</li>
	<li>If the rows being updated aren't cached in memory, doing an update will cause a disk seek (to read the old value). This violates our design principle of avoiding random i/o on writes.</li>
</ol>

<p>I've long been a proponent of having a tightly integrated storage engine in Cassandra, and this is another time we see the benefits of that approach. Starting in 1.2, index updates&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-2897">work as follows</a>:</p>

<ol>
	<li>Add an index entry for the new column value</li>
	<li>If the old column value was still in the memtable (common for updating a small set of rows repeatedly), remove the index entry for the old column value</li>
	<li>Otherwise, let the old value get purged by compaction</li>
	<li>If a read sees a stale index entry before compaction purges it, the reader thread will invalidate it</li>
</ol>
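<p>The steps above can be sketched with a toy in-memory model. Everything here (the <tt>ToyIndexedTable</tt> class, its dictionaries, the <tt>flush</tt> method) is hypothetical and only illustrates the write path without a read-before-write, plus the lazy cleanup of stale entries at read time. Compaction-time purging is left out.</p>

```python
from collections import defaultdict

class ToyIndexedTable:
    """Toy model of index maintenance without read-before-write.

    memtable: recent writes held in memory (row key -> column value)
    flushed:  stand-in for data already on disk, never read on writes
    index:    column value -> set of row keys (may contain stale keys)
    """

    def __init__(self):
        self.memtable = {}
        self.flushed = {}
        self.index = defaultdict(set)

    def write(self, key, value):
        # Step 1: always add an index entry for the new value.
        self.index[value].add(key)
        # Step 2: if the old value is still in the memtable, drop its
        # index entry right away. No disk read is needed for this.
        old = self.memtable.get(key)
        if old is not None and old != value:
            self.index[old].discard(key)
        self.memtable[key] = value

    def flush(self):
        # Simulate the memtable being written out to disk.
        self.flushed.update(self.memtable)
        self.memtable.clear()

    def _current_value(self, key):
        if key in self.memtable:
            return self.memtable[key]
        return self.flushed.get(key)

    def query(self, value):
        hits = []
        # sorted() copies the keys, so the set can be mutated safely.
        for key in sorted(self.index[value]):
            if self._current_value(key) == value:
                hits.append(key)
            else:
                # Step 4: the reader invalidates a stale index entry.
                self.index[value].discard(key)
        return hits
```

<p>For example, writing <tt>"blue"</tt> for a key, flushing, and then writing <tt>"green"</tt> leaves a stale <tt>"blue"</tt> entry behind; the next <tt>query("blue")</tt> returns nothing and removes that entry as a side effect.</p>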

<h3>Parallel leveled compaction</h3>

<p><a href="https://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra">Leveled compaction</a>&nbsp;is a big win for&nbsp;<a href="https://www.datastax.com/dev/blog/when-to-use-leveled-compaction">update-intensive workloads</a>, but has had one big disadvantage vs the default size-tiered compaction: only one leveled compaction could run at a time per table, no matter how many hard disks or SSDs you had your data spread across.&nbsp;<a href="http://www.slideshare.net/rbranson/cassandra-and-solid-state-drives">SSD users</a>&nbsp;in particular have been vocal in demanding this feature.</p>

<p>Cassandra 1.2&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-4310">fixes this</a>, allowing the leveled compaction strategy (LCS) to run up to <tt>concurrent_compactors</tt>&nbsp;compactions at once across different sstable ranges (including multiple compactions within the same level).</p>
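<p>One way to picture this is a greedy scheduler that starts a candidate compaction only if none of its sstables are already being compacted. This is a sketch under that assumption, not the actual LCS candidate selection; <tt>pick_concurrent</tt> and the sstable names are made up for illustration.</p>

```python
def pick_concurrent(candidates, concurrent_compactors):
    """Greedily choose compactions that can safely run at the same time.

    candidates: list of sets of sstable names, in priority order.
    Two compactions conflict when they share at least one sstable.
    """
    busy = set()      # sstables already claimed by a chosen compaction
    chosen = []
    for candidate in candidates:
        if len(chosen) >= concurrent_compactors:
            break
        if busy.isdisjoint(candidate):
            chosen.append(candidate)
            busy.update(candidate)
    return chosen

if __name__ == "__main__":
    candidates = [{"a", "b"}, {"b", "c"}, {"d"}, {"e"}]
    # {"b", "c"} is skipped because "b" is already being compacted.
    print(pick_concurrent(candidates, concurrent_compactors=2))
```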

<h3>Murmur3Partitioner</h3>

<p>Cassandra 1.2 ships with a new default partitioner, the&nbsp;<tt>Murmur3Partitioner</tt>, based on the Murmur3 hash. Cassandra's use of consistent hashing does not require cryptographic hash properties (in particular, collisions are fine), so the older&nbsp;<tt>RandomPartitioner</tt>'s use of MD5 was just a matter of using a convenient function with good distribution built into the JDK.&nbsp;<a href="http://code.google.com/p/smhasher/">Murmur3</a>&nbsp;is faster than MD5, but since hashing the partition key is only a small part of the work Cassandra does to service requests, the performance gains in real-world workloads are negligible.</p>
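<p>As a rough illustration of what the partitioner does with a key, here is a sketch of an MD5-based token. It assumes the token is the absolute value of the digest read as a signed big-endian integer; that detail is an assumption about <tt>RandomPartitioner</tt>, not something stated in this post, and the function name is made up. Murmur3 is not in the Python standard library, so only the MD5 side is shown.</p>

```python
import hashlib

def md5_style_token(partition_key: bytes) -> int:
    """Map a partition key to a position on the hash ring.

    Assumption: the token is abs() of the 128-bit MD5 digest interpreted
    as a signed big-endian integer, giving a value in [0, 2**127].
    """
    digest = hashlib.md5(partition_key).digest()
    return abs(int.from_bytes(digest, byteorder="big", signed=True))

if __name__ == "__main__":
    for key in (b"user:1", b"user:2", b"user:3"):
        print(key, md5_style_token(key))
```

<p>Nothing here depends on MD5 being cryptographically strong; any fast hash with good distribution would place keys on the ring equally well, which is the point of the switch.</p>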

<p>Murmur3Partitioner is NOT compatible with RandomPartitioner, so if you're upgrading and using the new&nbsp;<tt>cassandra.yaml</tt>&nbsp;file, be sure to change the partitioner back to RandomPartitioner. (If you don't, Cassandra will notice that you've picked an incompatible partitioner and refuse to start, so no permanent harm done.)</p>

<p>We've also switched bloom filters&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-2975">from Murmur2 to Murmur3</a>.</p>

<h3>NIO Streaming</h3>

<p>Streaming is when one Cassandra node transfers an entire range of data to another, either for bootstrapping new nodes into the cluster or for repair.</p>

<p>When we added compression to Cassandra 1.0 we had to switch back temporarily to a manual data read-uncompress-stream process, which is much less efficient than letting the&nbsp;<a href="http://www.kernel.org/doc/man-pages/online/pages/man2/sendfile.2.html">kernel handle the transfer</a>.</p>

<p>1.2 adds that optimization back in as much as possible: we let the kernel do the transfer&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-4297">whenever we have entire compressed blocks to transfer</a>, which is the common case.</p>
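<p>For readers unfamiliar with kernel-assisted transfer, here is a minimal sketch using Python's <tt>socket.sendfile</tt>, which uses <tt>os.sendfile</tt> where the platform supports it and falls back to ordinary sends otherwise. The helper name and the idea of addressing a compressed block by offset and length are illustrative assumptions; Cassandra itself is written in Java and does not use this code.</p>

```python
import socket
import tempfile

def stream_range(sock, path, offset, length):
    """Send `length` bytes of the file at `path`, starting at `offset`.

    The bytes are not decompressed or otherwise processed in user space;
    where supported, the kernel copies them from the file to the socket.
    Returns the number of bytes sent.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f, offset=offset, count=length)

if __name__ == "__main__":
    # Pretend the middle 200 bytes are one whole compressed block.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"A" * 100 + b"B" * 200 + b"C" * 100)
    sender, receiver = socket.socketpair()
    sent = stream_range(sender, tmp.name, offset=100, length=200)
    print(sent, receiver.recv(4096)[:10])
```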

<h3>Asynchronous hints delivery</h3>

<p>Hinted handoff is where a&nbsp;<a href="https://www.datastax.com/docs/1.1/cluster_architecture/about_client_requests">request coordinator</a>&nbsp;saves updates that it couldn't deliver to a replica, to retry later.</p>

<p>Cassandra 1.2 allows many hints to be delivered to the target replica concurrently, subject to&nbsp;<tt>hinted_handoff_throttle_in_kb</tt>. This allows recovering replicas to become consistent with the rest of the cluster much faster.</p>
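<p>The combination of concurrent delivery and a throughput cap can be sketched as a thread pool whose workers all draw from one throttle. This is an assumption-laden toy: the <tt>Throttle</tt> class, <tt>deliver_hints</tt>, and the <tt>send</tt> callback are hypothetical, and the post does not say whether the real limit is shared or applied per delivery thread. The sketch assumes a shared limit.</p>

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class Throttle:
    """Shared limiter: callers wait so total throughput stays under a cap."""

    def __init__(self, kb_per_sec):
        self.bytes_per_sec = kb_per_sec * 1024
        self.lock = threading.Lock()
        self.next_free = time.monotonic()

    def acquire(self, nbytes):
        # Reserve a time slot proportional to the payload size.
        with self.lock:
            now = time.monotonic()
            start = max(now, self.next_free)
            self.next_free = start + nbytes / self.bytes_per_sec
            wait = start - now
        if wait > 0:
            time.sleep(wait)

def deliver_hints(hints, send, throttle, workers=4):
    """Deliver hints (bytes objects) concurrently, subject to the throttle."""
    def deliver(hint):
        throttle.acquire(len(hint))
        send(hint)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(deliver, hints))

if __name__ == "__main__":
    delivered = []
    deliver_hints([b"x" * 10] * 20, delivered.append, Throttle(kb_per_sec=1024))
    print(len(delivered), "hints delivered")
```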

<h3>Others</h3>

<p>We've blogged previously about&nbsp;optimizing tombstone removal&nbsp;and&nbsp;making Cassandra start up faster.</p>


Performance improvements in Cassandra 1.2
