Aleksey Yeschenko

<h2>What is Hinted Handoff?</h2>

<p>Hinted handoff&nbsp;is an important part of the Cassandra write path: it reduces the inconsistency window caused by temporary node unavailability. A comprehensive&nbsp;blog post&nbsp;by Jonathan Ellis explains why HH is useful and how modern HH works. In the two years since that post was published, the high-level implementation has remained essentially the same.</p>

<h2>Current Hinted Handoff Implementation</h2>

<p>In Cassandra 2.1 and below, hints are stored in a regular Cassandra table in the local&nbsp;<code>system</code>&nbsp;keyspace -&nbsp;<code>system.hints</code>. Here is its schema:</p>

<pre>
<code>CREATE TABLE system.hints (
    target_id uuid,
    hint_id timeuuid,
    message_version int,
    mutation blob,
    PRIMARY KEY ((target_id), hint_id, message_version)
) WITH COMPACT STORAGE;
</code></pre>

<p><code>target_id</code>&nbsp;- the target node’s unique host id - is the partition key here;&nbsp;<code>hint_id</code>&nbsp;is a unique version 1 (time-based) UUID;&nbsp;<code>message_version</code>&nbsp;stores the Cassandra version used to serialise the mutation; and&nbsp;<code>mutation</code>&nbsp;stores the actual serialised Mutation that couldn’t be delivered to the node - the hint to replay. Partitioning by host id means that all the hints for a particular node belong to a single partition; clustering by the time-based hint id means that new hints get appended to the end of that logical partition. To avoid resurrecting deleted rows during hints replay, every entry has its TTL set to the smallest&nbsp;<code>gc_grace_seconds</code>&nbsp;among the tables touched by the hinted mutation.</p>
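<p>The TTL rule above can be illustrated with a short Python sketch. The table names and <code>gc_grace_seconds</code> values are hypothetical, and the helper <code>hint_ttl_seconds</code> is an illustrative name, not part of Cassandra.</p>

```python
def hint_ttl_seconds(gc_grace_by_table, tables_in_mutation):
    """Return the smallest gc_grace_seconds among the tables a hinted mutation touches."""
    return min(gc_grace_by_table[table] for table in tables_in_mutation)


# Hypothetical tables and settings, for illustration only.
gc_grace = {"ks.users": 864000, "ks.events": 3600, "ks.audit": 86400}

ttl = hint_ttl_seconds(gc_grace, ["ks.users", "ks.events"])
print(ttl)  # 3600
```

<p>Using the minimum means a hint always expires before any of the tombstones it might otherwise outlive can be garbage collected.</p>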

<p>In this implementation, saving a hint for an unresponsive node is as simple as running an&nbsp;<code>INSERT INTO system.hints ..</code>&nbsp;query internally. Replay isn’t difficult, either. To deliver hints to a recovered node, Cassandra scans the partition with&nbsp;<code>target_id</code>&nbsp;= the node’s host id (in a paginated fashion), deserialises each mutation from the&nbsp;<code>mutation</code>&nbsp;blob, sends the mutation to the node, and deletes the delivered hint from the&nbsp;<code>hints</code>&nbsp;table by writing a tombstone. If all the hints have been delivered successfully, we flush and run a major compaction to get rid of the accumulated tombstones and leave no trace of the delivered hints *.</p>

<p>It’s a simple mechanism. It lets us reuse what we already have - our battle-tested storage engine - for hints storage and delivery. It also lets us reuse streaming to move a node’s hints to a different node when we decommission it. But this reusability and simplicity come at a price.</p>

<h2>Hinted Handoff as Queue Anti-Pattern</h2>

<p>If you look at the current storage/replay mechanism carefully, you’ll notice that it’s based on perhaps the worst Cassandra anti-pattern there is - the queue!</p>

<p>You can read my previous&nbsp;blog post&nbsp;for more details on how queues and Cassandra don’t get along well.</p>

<p>There are two ways that hints replay can hit a large number of tombstones and&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-6117">blow up</a>:</p>

<ol>
	<li>During a previous replay for the target node, some of the mutations timed out and delivery was aborted, so the post-delivery compaction was never triggered.</li>
	<li>The target node has been down for a while, and it ended up accumulating a large number of expired hints (remember - all the hints are TTLd, to preserve correctness).</li>
</ol>
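<p>The first scenario can be sketched with a toy model in Python. This is a simplified illustration, not Cassandra’s storage engine: the class <code>HintsPartition</code> and its methods are hypothetical, and clearing the list merely stands in for the flush and major compaction that follow a fully successful replay.</p>

```python
class HintsPartition:
    """Toy model of one system.hints partition, ordered by hint id."""

    def __init__(self):
        self.cells = []  # each cell is [hint_id, mutation, is_tombstone]

    def insert(self, hint_id, mutation):
        self.cells.append([hint_id, mutation, False])

    def replay(self, deliver):
        """Scan from the start, tombstone delivered hints, abort on the first failure.

        Returns (delivered, tombstones_scanned, completed).
        """
        delivered = 0
        tombstones_scanned = 0
        for cell in self.cells:
            if cell[2]:
                tombstones_scanned += 1  # a dead entry still has to be read and skipped
                continue
            if not deliver(cell[1]):
                return delivered, tombstones_scanned, False  # aborted: no compaction
            cell[2] = True  # deleting a hint writes a tombstone, it removes nothing
            delivered += 1
        self.cells.clear()  # stands in for flush + major compaction
        return delivered, tombstones_scanned, True


partition = HintsPartition()
for i in range(1000):
    partition.insert(i, f"mutation-{i}")

# First replay: the target accepts 900 hints, then the next one times out.
outcomes = iter([True] * 900 + [False])
first = partition.replay(lambda mutation: next(outcomes))

# Second replay: the target is healthy, but the scan must first skip 900 tombstones.
second = partition.replay(lambda mutation: True)

print(first)   # (900, 0, False)
print(second)  # (100, 900, True)
```

<p>The second scenario behaves the same way in this model, except that the skipped entries are expired (TTLd) hints rather than tombstones written during an aborted replay.</p>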

<p>You can read the&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-6666">CASSANDRA-6666</a>&nbsp;and&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-6998">CASSANDRA-6998</a>&nbsp;tickets to gain a deeper understanding of both scenarios.</p>

<h2>The 3.0 Way</h2>

<p>In order to fix that and reduce the overall overhead surrounding hints, we’ve decided to completely&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-6230">rewrite the implementation</a>&nbsp;of hints in Cassandra 3.0.</p>

<p>Starting with 3.0, Cassandra will simply store hints in flat files, bypassing the storage engine altogether.</p>

<p>As&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-5409">noticed previously</a>, Cassandra’s storage engine introduces a lot of overhead for something as trivial as storing hints: they are immutable, write-once data that we read once and then discard after replay. We don’t care about the order we write them in, and we ultimately don’t care about partition-level isolation when writing to&nbsp;<code>system.hints</code>.</p>

<p>Storing hints in flat files - à la per-node commit logs - allows us to avoid that overhead:</p>

<ul>
	<li>we no longer need to go through the memtable and the commit log on the&nbsp;<a href="https://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_write_path_c.html">write path</a></li>
	<li>we no longer need to perform IO- and CPU-consuming compaction for hints</li>
	<li>we no longer suffer from contention on huge&nbsp;<code>system.hints</code>&nbsp;partitions when a node is down - which can be a&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-7545">very serious issue</a></li>
</ul>

<p>Saving a hint in Cassandra 3.0 is as trivial as appending the serialised mutation to a file.</p>
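<p>To make the append-only idea concrete, here is a minimal Python sketch. It does not reproduce Cassandra’s actual on-disk hints format, which this post does not describe: the record layout (a 4-byte length, a 4-byte CRC32, then the payload) and the function names <code>append_hint</code> and <code>read_hints</code> are assumptions made for illustration.</p>

```python
import os
import struct
import tempfile
import zlib


def append_hint(path, mutation_bytes):
    """Append one record: 4-byte length, 4-byte CRC32, then the payload (assumed layout)."""
    header = struct.pack(">II", len(mutation_bytes), zlib.crc32(mutation_bytes))
    with open(path, "ab") as f:
        f.write(header + mutation_bytes)


def read_hints(path):
    """Yield payloads in write order, stopping at a truncated or corrupt record."""
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) != 8:
                return  # clean end of file, or a torn header
            length, crc = struct.unpack(">II", header)
            payload = f.read(length)
            if len(payload) != length or zlib.crc32(payload) != crc:
                return  # torn or corrupt record: stop reading
            yield payload


with tempfile.TemporaryDirectory() as d:
    hints_file = os.path.join(d, "hints.bin")
    append_hint(hints_file, b"mutation-1")
    append_hint(hints_file, b"mutation-2")
    replayed = list(read_hints(hints_file))

print(replayed)  # [b'mutation-1', b'mutation-2']
```

<p>Nothing is ever updated in place, so there is no memtable, no compaction, and no tombstone involved in writing or reading a hint in this model.</p>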

<h2>Hints Replay in 3.0</h2>

<p>Having hints in regular flat (segmented) files allows us to simplify and optimise the replay process as well.</p>

<p>Starting with 3.0, replay no longer operates on individual hints, competing for the&nbsp;<code>MUTATION</code>&nbsp;stage with other writes. Instead, we will stream hints in bulk, segment by segment, and let the receiving node apply them locally. After streaming a segment to the target node, the replaying node can simply discard the replayed segment - by removing the file - with no tombstones involved.</p>
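<p>Segment-by-segment replay can be sketched as follows. This is an illustration under stated assumptions rather than the 3.0 implementation: the directory layout, the file names, and the <code>send_segment</code> callback (standing in for streaming a segment to the target node) are all hypothetical.</p>

```python
import os
import tempfile


def replay_segments(hints_dir, send_segment):
    """Send each segment file in name order and delete it once the target accepts it.

    Stops at the first failure so the remaining segments stay on disk for a
    later attempt. Returns the number of segments delivered.
    """
    delivered = 0
    for name in sorted(os.listdir(hints_dir)):
        path = os.path.join(hints_dir, name)
        with open(path, "rb") as f:
            data = f.read()
        if not send_segment(data):
            break
        os.remove(path)  # discarding a replayed segment is just a file delete
        delivered += 1
    return delivered


with tempfile.TemporaryDirectory() as hints_dir:
    for i in range(3):
        with open(os.path.join(hints_dir, f"segment-{i}.hints"), "wb") as f:
            f.write(b"hint" * (i + 1))

    received = []

    def send_segment(data):
        # Hypothetical transport: accept the first two segments, then fail.
        if len(received) == 2:
            return False
        received.append(data)
        return True

    delivered = replay_segments(hints_dir, send_segment)
    remaining = sorted(os.listdir(hints_dir))

print(delivered, remaining)  # 2 ['segment-2.hints']
```

<p>A failed delivery leaves the undelivered segment untouched, so a later replay starts directly at live data instead of scanning past deleted entries.</p>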

<p>* The&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-6998">CASSANDRA-6998</a>&nbsp;fix, included in Cassandra 2.0.11 and 2.1.1, drastically improves the situation around tombstones during replay. Still, only a comprehensive rewrite can deal with the overhead of compaction and properly cure&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-7545">CASSANDRA-7545</a>.</p>


What’s Coming to Cassandra in 3.0: Improved Hint Storage and Delivery
