Brandon Williams

Anti-entropy repair in Cassandra can sometimes be a pain point for those doing deletes in their cluster, since it must be run before gc_grace expires to ensure deleted data is not resurrected. <a href="https://issues.apache.org/jira/browse/CASSANDRA-2034">Reliable hints</a> can go a long way toward alleviating this, but if you lose a node at any point, you'll still need to repair. (It's worth mentioning that if you only delete via TTL, and only inserted with a TTL to begin with, you can skip repair as long as your cluster has synchronized time, which it should for a variety of reasons.)

Repair can be a sore point for a couple of reasons. I'll outline them, and then show you how to avoid them. First, let's recall how repair works. There are two phases to repair, the first of which is building a <a href="http://en.wikipedia.org/wiki/Merkle_tree">Merkle tree</a> of the data. The second is having the replicas compare their trees and then stream the differing ranges to each other as needed.
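To make the two phases concrete, here is a toy sketch in Python. It is not Cassandra's implementation: the bucketing by MD5, the tiny leaf count, and the names `leaf_index`, `build_tree` and `differing_leaves` are all assumptions made for illustration. It only shows the idea of hashing data into a fixed number of leaves and then walking down from the root to find the leaves that disagree.

```python
import hashlib


def leaf_index(key: str, leaves: int) -> int:
    """Assign a key to a leaf bucket (a stand-in for a token range)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % leaves


def build_tree(data: dict, leaves: int = 8) -> list:
    """Phase one: return the tree as a list of levels, leaf hashes first, root last."""
    if leaves < 1 or leaves & (leaves - 1):
        raise ValueError("leaves must be a power of two")
    buckets = [hashlib.sha256() for _ in range(leaves)]
    for key in sorted(data):
        bucket = buckets[leaf_index(key, leaves)]
        bucket.update(key.encode() + b"\0" + data[key].encode() + b"\0")
    level = [b.digest() for b in buckets]
    levels = [level]
    while len(level) > 1:
        # Each parent is the hash of its two children.
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_leaves(tree_a: list, tree_b: list) -> list:
    """Phase two: descend from the root, following only mismatched subtrees."""
    top = len(tree_a) - 1
    if tree_a[top][0] == tree_b[top][0]:
        return []  # roots match, nothing to stream
    suspects = [0]
    for depth in range(top - 1, -1, -1):
        suspects = [child
                    for idx in suspects
                    for child in (2 * idx, 2 * idx + 1)
                    if tree_a[depth][child] != tree_b[depth][child]]
    return suspects
```

Every key that falls into a mismatched leaf would have to be exchanged, even if only one of them actually differs, because the leaf hash is the finest granularity the tree can see.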

The first phase can be intensive on disk I/O. You can mitigate this to some degree with compaction throttling, since this phase is what we call a validation compaction. Sometimes that isn't enough, though, and some people try to mitigate it further by using the -pr (--partitioner-range) option to nodetool repair, which repairs only the primary range for that node. Unfortunately, the other replicas for that range still have to perform the Merkle tree calculation, which means a validation compaction on each of them. This can be a problem, since all the replicas will be doing it at the same time, possibly making them all slow to respond for that portion of your data. Fortunately, there is a way around this: the -snapshot option. It takes a snapshot of your data (recall that snapshots are just hardlinks to existing sstables, exploiting the fact that sstables are immutable, which makes snapshots extremely cheap) and repairs sequentially from the snapshot. This means that for any given replica set, only one replica at a time will be performing the validation compaction, allowing the <a href="https://www.datastax.com/dev/blog/dynamic-snitching-in-cassandra-past-present-and-future">dynamic snitch</a> to maintain performance for your application via the other replicas.
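As a small sketch of how these options combine, the Python helper below assembles a nodetool repair invocation using only the -pr and -snapshot flags described above. The function name `repair_command` and the keyspace name are hypothetical, and the argument order is an assumption; check `nodetool help` on your version before relying on it.

```python
def repair_command(keyspace: str, primary_range: bool = True,
                   snapshot: bool = True) -> list:
    """Build an argv list for a primary-range, snapshot-based repair."""
    cmd = ["nodetool", "repair"]
    if primary_range:
        cmd.append("-pr")        # repair only this node's primary range
    if snapshot:
        cmd.append("-snapshot")  # validate replicas one at a time from a snapshot
    cmd.append(keyspace)         # hypothetical keyspace name supplied by the caller
    return cmd
```

The resulting list can be handed to `subprocess.run` on each node in turn, for example `subprocess.run(repair_command("my_keyspace"), check=True)`.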

There's a possible catch in the second phase of repair, too: overstreaming. This is when only one partition is damaged, but many more end up being sent. It happens because the Merkle trees Cassandra builds don't have infinite resolution, and a resolution high enough for all scenarios would be prohibitive in terms of heap usage, since the tree is held in memory. So Cassandra makes a tradeoff between precision and memory, and currently caps the tree at a fixed size of 32K (2^15) leaves. What this means is that if your node contains a million partitions and one of them is damaged, about 30 partitions are going to be streamed, since that is how many fall into each of the 'leaves' of the tree. Of course, if you have many more partitions per node, the problem gets worse, and can needlessly use a lot of disk space that will eventually have to be compacted away.
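The arithmetic behind that "about 30" figure is easy to check. The sketch below assumes partitions are spread evenly across the 32K leaves, which is an idealization; the function name is made up for illustration.

```python
MERKLE_LEAVES = 2 ** 15  # the 32K leaves mentioned above


def partitions_per_leaf(partitions_in_range: int,
                        leaves: int = MERKLE_LEAVES) -> float:
    """Average number of partitions streamed when a single leaf mismatches."""
    return partitions_in_range / leaves
```

With a million partitions this gives roughly 30.5 partitions per leaf; with a hundred million it is over 3,000, which is why the problem grows with the amount of data per node.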

There is a solution for this problem too, beginning with Cassandra 1.1.11, called <a href="https://issues.apache.org/jira/browse/CASSANDRA-5280">subrange repair</a>. As the name suggests, this allows repairing only a portion of the data that belongs to a node. Since the tree size is fixed, covering a smaller range with the same number of leaves effectively increases the precision. Using the describe_splits call, you can ask for a split containing 32K partitions (if you're running DSE, this is even easier) and then repair it with 100% precision if you so choose, iterating through the entire range incrementally (or even in parallel). This completely eliminates the overstreaming behavior and wastes no disk space. To do this, you pass the tokens you received for the split to the -st (--start-token) and -et (--end-token) options of nodetool repair, respectively. Finally, you can pass the -local (--in-local-dc) option to nodetool repair to only repair within the local datacenter, reducing cross-datacenter transfer.
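Here is a minimal sketch of the iteration, assuming you have already obtained an ordered list of split tokens (for example from describe_splits) by some means not shown here. The helper name `subrange_repair_commands`, the keyspace name, and the placeholder tokens are hypothetical; only the -st, -et and -local flags come from the description above.

```python
def subrange_repair_commands(split_tokens: list, keyspace: str,
                             local_dc: bool = False) -> list:
    """Turn consecutive split tokens into one nodetool repair argv per subrange."""
    commands = []
    # Pair each token with its successor: (t0, t1), (t1, t2), ...
    for start, end in zip(split_tokens, split_tokens[1:]):
        cmd = ["nodetool", "repair", "-st", str(start), "-et", str(end)]
        if local_dc:
            cmd.append("-local")  # stay within the local datacenter
        cmd.append(keyspace)
        commands.append(cmd)
    return commands
```

Each command covers one subrange, so they can be run one after another with `subprocess.run`, or spread across workers if you want to repair in parallel. Once every subrange has been repaired, the result is equivalent to a full repair of the range.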


<center><img alt="Illustration of subrange vs full repair" data-entity-type="file" data-entity-uuid="d727834f-0dde-41e8-996b-ccff79198c5c" src="https://www.datastax.com/sites/default/files/inline-images/subrange.png" /></center>


Above is a diagram illustrating the difference between a full and subrange repair. Node0 shows a full repair of the data in its range, while Node2 shows repairing only a subset of its data. When all the subsets have been repaired it will be equivalent to the full repair.

I hope this article has increased your knowledge of repair and its available options, and that it improves your experience and helps you solve issues in future repair operations on your Cassandra cluster. We hear quite frequently from operations folks and admins that they struggle with knowing exactly when and how often to run repair on their clusters. This being the case, at DataStax we're looking into how to make repair a transparent operation that runs automatically when needed via OpsCenter, with minimal impact on performance.

Advanced repair techniques
