Jonathan Ellis

<p>The first two minor releases after&nbsp;Cassandra 2.0.0&nbsp;contained many bug fixes, but also some new features and enhancements. For the benefit of those who don't read the&nbsp;<a href="https://github.com/apache/cassandra/blob/cassandra-2.0/CHANGES.txt">CHANGES</a>&nbsp;religiously, let's take a look at some of the highlights.</p>

<h3>Rapid read protection is enabled by default</h3>

<p>I wrote a separate article about how rapid read protection in 2.0.2&nbsp;improves availability and latency. I'll just reproduce one graph to pique your curiosity:</p>
<img alt="Node Death" data-align="center" data-entity-type="file" data-entity-uuid="c744b207-1d00-4d5d-9756-9522e12ea114" src="https://www.datastax.com/sites/default/files/inline-images/5932-node-death-250x268.png" />
<h3>Improved blob support</h3>

<p>Cassandra technically allows column values to be up to 2GB, but it's tuned to deal with much smaller columns by default.&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-5982">CASSANDRA-5982</a>&nbsp;adds a number of improvements to 2.0.1 to deal with larger columns. Most are improvements that require no operator intervention; the exception is the&nbsp;<tt>commitlog_periodic_queue_size</tt>&nbsp;setting, which should be reduced for a blob-heavy workload. (Our tests show that 16*cpucores is a reasonable setting for 1MB blobs, for instance.)</p>

<p>If you need to store blobs larger than 10MB, I recommend splitting them across multiple rows or using a library like Astyanax that&nbsp;<a href="https://github.com/Netflix/astyanax/wiki/Chunked-Object-Store">supports this transparently</a>.</p>
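
<p>As a rough illustration of the "split across multiple rows" approach, the sketch below cuts a blob into fixed-size chunks that could each be written to their own row, keyed by blob id and chunk index. The <tt>BlobChunker</tt> class and its method names are hypothetical, and this is not how Astyanax implements its chunked object store.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: splits a large blob into fixed-size chunks so that each
// chunk can be stored in its own row, e.g. keyed by (blob id, chunk index).
final class BlobChunker {
    static List<byte[]> split(byte[] blob, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        int offset = 0;
        while (offset < blob.length) {
            // The last chunk may be shorter than chunkSize.
            int len = Math.min(chunkSize, blob.length - offset);
            chunks.add(Arrays.copyOfRange(blob, offset, offset + len));
            offset += len;
        }
        return chunks;
    }

    // Reassembles chunks that were read back in chunk-index order.
    static byte[] join(List<byte[]> chunks) {
        int total = 0;
        for (byte[] c : chunks) {
            total += c.length;
        }
        byte[] out = new byte[total];
        int pos = 0;
        for (byte[] c : chunks) {
            System.arraycopy(c, 0, out, pos, c.length);
            pos += c.length;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] blob = new byte[25 * 1024 * 1024];        // a 25MB blob
        List<byte[]> chunks = split(blob, 1024 * 1024);  // 1MB chunks
        System.out.println(chunks.size() + " chunks");
    }
}
```

<p>The 1MB chunk size here is only an example; pick one that suits your workload.</p>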

<h3>Limited support for DISTINCT</h3>

<p>Cassandra still does not support&nbsp;<tt>SELECT DISTINCT</tt>&nbsp;in general, but as a special case we now&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-4536">allow it on partition keys</a>, since that can be done with minimal work by the storage engine. E.g., for the&nbsp;playlist example, I could write&nbsp;<tt>SELECT DISTINCT id FROM playlists</tt>.</p>

<h3>Cleanup performance</h3>

<p>Recall that&nbsp;cleanup&nbsp;refers to purging data that no longer belongs locally after adding new nodes to the cluster. Until now, this has been a fairly slow operation that does a simple sequential scan over all the local data to locate partitions that have been evicted.</p>

<p>Starting in 2.0.1, we leverage the metadata we have for each data file (including first and last partition) to&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-5722">optimize away checking files</a>&nbsp;that only contain data that is still local. Files that only contain non-local data can be dropped without inspection as well, which only leaves files containing a mix. For those we can&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-2524">use the partition index</a>&nbsp;to only scan the data range that is still local (and rewrite it to a new file).</p>
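
<p>The three cases above can be sketched as a small decision function. This is a simplified model, not Cassandra's actual code: tokens are plain <tt>long</tt> values, owned ranges are closed intervals that never wrap around the ring, and the <tt>CleanupPlanner</tt>, <tt>Range</tt>, and <tt>Action</tt> names are made up for illustration.</p>

```java
import java.util.Arrays;
import java.util.List;

// Simplified model (an assumption, not Cassandra's classes): tokens are longs
// and each locally owned range is a closed, non-wrapping interval [left, right].
final class CleanupPlanner {
    enum Action { SKIP, DROP, REWRITE }

    static final class Range {
        final long left;
        final long right;
        Range(long left, long right) { this.left = left; this.right = right; }
    }

    // Decides what cleanup must do with one data file, using only the tokens
    // of its first and last partitions. Assumes owned ranges do not overlap;
    // a file spanning two adjacent ranges is conservatively rewritten.
    static Action plan(long first, long last, List<Range> ownedRanges) {
        boolean intersects = false;
        for (Range r : ownedRanges) {
            if (r.left <= first && last <= r.right) {
                return Action.SKIP;      // every partition is still local
            }
            if (r.left <= last && first <= r.right) {
                intersects = true;       // the file may hold some local data
            }
        }
        // No overlap with any owned range means nothing in the file is local.
        return intersects ? Action.REWRITE : Action.DROP;
    }

    public static void main(String[] args) {
        List<Range> owned = Arrays.asList(new Range(0, 100), new Range(200, 300));
        System.out.println(plan(10, 90, owned));    // SKIP
        System.out.println(plan(110, 190, owned));  // DROP
        System.out.println(plan(50, 150, owned));   // REWRITE
    }
}
```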

<p>Now that cleanup is relatively lightweight, in 2.1 we will go further and&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-5051">clean up automatically</a>&nbsp;after bootstrapping new nodes into the cluster.</p>

<h3>Repair instrumentation</h3>

<p>Repair&nbsp;builds its&nbsp;<a href="http://en.wikipedia.org/wiki/Merkle_tree">Merkle tree</a>&nbsp;with a fixed size of 32,768 leaf nodes (thus, 65,535 total nodes). If the repair covers more partitions than that, multiple partitions will be hashed into a single leaf node, and the repair will be correspondingly less precise: if we repair 1,000,000 partitions, the smallest unit of repair will be about 30 partitions (1,000,000 / 32,768).</p>
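
<p>The arithmetic behind those numbers is simple enough to sketch. The <tt>RepairPrecision</tt> class below is hypothetical and assumes partitions are spread evenly over the leaves, which real data only approximates.</p>

```java
// Hypothetical helper for the precision arithmetic described in the text.
final class RepairPrecision {
    static final int LEAVES = 32_768;  // fixed leaf count (2^15)

    // A full binary tree with n leaves has 2n - 1 nodes in total.
    static long totalNodes(long leaves) {
        return 2 * leaves - 1;
    }

    // Average partitions hashed into one leaf, assuming an even spread.
    static double partitionsPerLeaf(long partitions) {
        return (double) partitions / LEAVES;
    }

    // Largest repair (in partitions) that keeps the average at or below
    // perLeaf partitions per leaf.
    static long maxPartitionsFor(long perLeaf) {
        return perLeaf * LEAVES;
    }

    public static void main(String[] args) {
        System.out.println(totalNodes(LEAVES));            // 65535
        System.out.println(partitionsPerLeaf(1_000_000));
        System.out.println(maxPartitionsFor(1));           // 32768
    }
}
```

<p>Read the other way around, a subrange repair covering at most 32,768 partitions averages no more than one partition per leaf.</p>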

<p>Subrange repair&nbsp;(repairing less than an entire vnode's worth of data at once) allows you to restrict the repair range to retain a desired level of precision; the problem is that Cassandra didn't log enough information about repair operations to tell how much precision you were losing and hence whether subrange repair is worth the trouble.</p>

<p><a href="https://issues.apache.org/jira/browse/CASSANDRA-2698">Starting in 2.0.1</a>, Cassandra logs statistics like this:</p>

<pre>
<tt>
Validated 331 partitions for 64106960-4362-11e3-976a-71a93e2d33ad.  
Partitions per leaf are:

[0..0]: 32437
[1..1]: 331
</tt></pre>

<p>This says there are 32437 leaf nodes with no partitions hashed, and 331 with one partition each. Our precision (for this very small repair) is fine.</p>

<p>Note that these are logged at DEBUG; enable debug logging on&nbsp;<tt>org.apache.cassandra.repair.Validator</tt>&nbsp;to see them.</p>

<h3>ConsistencyLevel.LOCAL_ONE</h3>

<p>Similar to LOCAL_QUORUM, LOCAL_ONE is a&nbsp;consistency level&nbsp;that restricts an operation to the datacenter of the coordinator handling the request.</p>

<p><a href="https://issues.apache.org/jira/browse/CASSANDRA-6202">New in 2.0.2</a>, LOCAL_ONE is now the default for Hadoop reads and is otherwise useful when failing a request is better than going over the WAN to fulfil it.</p>

<h3>Compaction history</h3>

<p>Compaction logs information about merged data files like this:</p>

<pre>
<tt>
Compacted 6 sstables to [
/var/lib/cassandra/data/kstest/demo1/kstest-demo1-jb-1854,
...
/var/lib/cassandra/data/kstest/demo1/kstest-demo1-jb-1867,
].
986,518,120 bytes to 986,518,120 (~100% of original) in 142,917ms = 6.582961MB/s.
3,523,279 total partitions merged to 3,523,279. Partition merge counts were {1:3523279, }
</tt></pre>

<p>Here, I have a table configured to use&nbsp;<a href="https://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra">leveled compaction</a>, but every partition was unique -- no merging took place. This suggests that&nbsp;<a href="https://www.datastax.com/dev/blog/when-to-use-leveled-compaction">size-tiered compaction is a better fit</a>&nbsp;for this table.</p>

<p>What's new in 2.0.2 is not this logging but that we&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-5078">save the results</a>&nbsp;to&nbsp;<tt>system.compaction_history</tt>:</p>

<pre>
<tt>
CREATE TABLE compaction_history (
    id uuid PRIMARY KEY,
    keyspace_name text,
    columnfamily_name text,
    compacted_at timestamp,
    bytes_in bigint,
    bytes_out bigint,
    rows_merged map&lt;int, bigint&gt;,
)
</tt></pre>

<p>This information is available to tools like&nbsp;OpsCenter&nbsp;to help tune compaction.</p>
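
<p>As a sketch of how a tool might interpret these merge counts, the code below computes what fraction of the output partitions were actually merged from more than one data file. It assumes each key is the number of input files a partition appeared in and each value is the number of output partitions with that count, matching the log line above; the <tt>MergeStats</tt> class is hypothetical and is not part of Cassandra or OpsCenter.</p>

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for interpreting "partition merge counts".
// Assumed meaning: key = number of input files a partition appeared in,
// value = number of output partitions with that count.
final class MergeStats {
    // Partitions read from all input files: sum of key * value.
    static long inputPartitions(Map<Integer, Long> mergeCounts) {
        long total = 0;
        for (Map.Entry<Integer, Long> e : mergeCounts.entrySet()) {
            total += e.getKey() * e.getValue();
        }
        return total;
    }

    // Partitions written to the output files: sum of values.
    static long outputPartitions(Map<Integer, Long> mergeCounts) {
        long total = 0;
        for (long count : mergeCounts.values()) {
            total += count;
        }
        return total;
    }

    // Fraction of output partitions that came from two or more input files.
    static double mergedFraction(Map<Integer, Long> mergeCounts) {
        long merged = 0;
        for (Map.Entry<Integer, Long> e : mergeCounts.entrySet()) {
            if (e.getKey() > 1) {
                merged += e.getValue();
            }
        }
        long out = outputPartitions(mergeCounts);
        return out == 0 ? 0.0 : (double) merged / out;
    }

    public static void main(String[] args) {
        Map<Integer, Long> counts = new TreeMap<>();
        counts.put(1, 3_523_279L);  // the merge counts from the log above
        System.out.println(mergedFraction(counts));  // 0.0
    }
}
```

<p>A fraction near zero, as in the example above, means compaction is rewriting data without combining anything.</p>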

<h3>Improved memory use defaults</h3>

<p>2.0.2&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-6059">changes</a>&nbsp;the default&nbsp;<tt>memtable_total_space_in_mb</tt>&nbsp;to 1/4 of the heap (from 1/3) and&nbsp;<tt>write_request_timeout_in_ms</tt>&nbsp;to 2 seconds (from 10).</p>

<p>The primary motivation behind these changes (especially the second) is to make it more difficult to OOM Cassandra with a sudden spike in write activity. Cassandra's&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-685">load shedding</a>&nbsp;discards requests older than the timeout, but ten seconds worth of writes on a large cluster with many coordinators feeding requests can easily consume many GB of memory. Cutting that back by a factor of five will help a lot while still allowing most requests delayed by network hiccups or GC pauses to complete. (The old 10s value was grandfathered in from before we had separate timeouts for reads, writes, and internal management.)</p>
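
<p>A back-of-the-envelope version of that reasoning: the amount of write data that can pile up before load shedding kicks in is roughly the incoming write rate times the payload size times the timeout. The <tt>WriteBacklog</tt> class and the workload figures below (100,000 writes per second of 10KB each) are made up for illustration, and the estimate ignores per-request object overhead.</p>

```java
// Hypothetical estimate of how much write data can queue up before
// requests become old enough to be shed.
final class WriteBacklog {
    static long worstCaseBytes(long writesPerSecond, long bytesPerWrite, long timeoutMs) {
        return writesPerSecond * bytesPerWrite * timeoutMs / 1000;
    }

    public static void main(String[] args) {
        long rate = 100_000;  // writes per second (illustrative)
        long size = 10_240;   // bytes per write (illustrative)
        System.out.println(worstCaseBytes(rate, size, 10_000));  // old 10s timeout
        System.out.println(worstCaseBytes(rate, size, 2_000));   // new 2s timeout
    }
}
```

<p>With these figures the bound drops from about 10GB to about 2GB, the same factor of five as the timeout.</p>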

<p>If you want to be even more aggressive, you could cut the write timeout to 500ms and enable&nbsp;<tt>cross_node_timeout</tt>, which starts load shedding based on when the coordinator starts the request rather than when the replica receives it.</p>
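
<p>The difference between the two modes comes down to which timestamp the replica measures a request's age from. The sketch below is a simplified model with hypothetical names (<tt>LoadShedder</tt>, <tt>shouldDrop</tt>), not Cassandra's internal code. Because the cross-node mode compares timestamps taken on different machines, it only makes sense when node clocks are kept in sync.</p>

```java
// Simplified model of the load-shedding decision on a replica.
final class LoadShedder {
    // Returns true if the request is older than the timeout and should be
    // dropped rather than processed. With crossNodeTimeout enabled, age is
    // measured from the coordinator's start time; otherwise from the moment
    // the replica received the message.
    static boolean shouldDrop(long coordinatorStartMs, long replicaReceivedMs,
                              long nowMs, long timeoutMs, boolean crossNodeTimeout) {
        long startedAtMs = crossNodeTimeout ? coordinatorStartMs : replicaReceivedMs;
        return nowMs - startedAtMs > timeoutMs;
    }

    public static void main(String[] args) {
        // Started at t=0, spent 400ms in transit, then queued on the replica
        // until t=700, with a 500ms write timeout.
        System.out.println(shouldDrop(0, 400, 700, 500, true));   // true
        System.out.println(shouldDrop(0, 400, 700, 500, false));  // false
    }
}
```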

<h3>Workload-aware compaction</h3>

<p>We&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-5515">laid the groundwork for this in 2.0.2</a>&nbsp;but the&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-6109">payoff</a>&nbsp;isn't until 2.0.3 (coming later in November): Cassandra will track which data files are most frequently read and prioritize compacting those. Optionally, it can omit compacting "cold" files entirely, which dramatically improves performance for workloads with billions of archived or seldom-requested rows.</p>
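
<p>To make the idea concrete, here is a minimal sketch of "hot files first, cold files optionally skipped". It is not the 2.0.3 implementation: the <tt>HotFirstSelector</tt> and <tt>DataFile</tt> types, the reads-per-second metric, and the threshold parameter are all assumptions made for illustration.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: order compaction candidates by how often they are
// read, and leave out files that are read less often than a threshold.
final class HotFirstSelector {
    static final class DataFile {
        final String name;
        final double readsPerSecond;
        DataFile(String name, double readsPerSecond) {
            this.name = name;
            this.readsPerSecond = readsPerSecond;
        }
    }

    // A coldThreshold of 0 keeps every file and only changes the ordering.
    static List<DataFile> candidates(List<DataFile> files, double coldThreshold) {
        List<DataFile> result = new ArrayList<>();
        for (DataFile f : files) {
            if (f.readsPerSecond >= coldThreshold) {
                result.add(f);
            }
        }
        // Most frequently read files come first.
        result.sort(Comparator.comparingDouble((DataFile f) -> f.readsPerSecond).reversed());
        return result;
    }

    public static void main(String[] args) {
        List<DataFile> files = Arrays.asList(
            new DataFile("archive", 0.0),
            new DataFile("hot", 50.0),
            new DataFile("warm", 5.0));
        for (DataFile f : candidates(files, 1.0)) {
            System.out.println(f.name);  // prints "hot", then "warm"
        }
    }
}
```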

<h3>CQL-aware SSTableWriter</h3>

<p>SSTableWriter is the API to&nbsp;<a href="https://www.datastax.com/dev/blog/bulk-loading">create raw Cassandra data files</a>&nbsp;locally for bulk load into your cluster. For 2.0.3, we've&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-5894">added the CQLSSTableWriter implementation</a>&nbsp;that allows inserting rows without needing to understand the details of&nbsp;<a href="https://www.datastax.com/dev/blog/thrift-to-cql3">how those map to the underlying storage engine</a>. Usage looks like this:</p>

<pre>
<tt>
    String schema = "CREATE TABLE myKs.myTable ("
                  + "  k int PRIMARY KEY,"
                  + "  v1 text,"
                  + "  v2 int"
                  + ")";
    String insert = "INSERT INTO myKs.myTable (k, v1, v2) VALUES (?, ?, ?)";

    CQLSSTableWriter writer = CQLSSTableWriter.builder()
                                              .inDirectory("path/to/directory")
                                              .forTable(schema)
                                              .using(insert).build();
 
    writer.addRow(0, "test1", 24);
    writer.addRow(1, "test2", null);
    writer.addRow(2, "test3", 42);
 
    writer.close();
</tt></pre>

<h3>Special thanks</h3>

<p>Among the many community members who contributed to these releases, we'd like to give a special shout out to Oleg Anastasyev, Chris Burroughs, Kyle Kingsbury, Sankalp Kohli, and Mikhail Stepura. Thanks for the help!</p>


Cassandra 2.0.1, 2.0.2, and a quick peek at 2.0.3
