Running node repair.
To understand how repair works and the information described in this topic, please read the blog article Advanced repair techniques.
The nodetool repair command repairs inconsistencies across all of the replicas for a given range of data. Run repair in these situations:
- During normal operation as part of regular, scheduled cluster maintenance unless Cassandra applications perform no deletes.
- During node recovery; for example, when bringing a node back into the cluster after a failure.
- On nodes containing data that is not read frequently.
- To update data on a node that has been down.
Guidelines for running routine node repair include:
- The hard requirement for routine repair frequency is the value of gc_grace_seconds. Run a repair operation at least once on each node within this time period. Following this important guideline ensures that deletes are properly handled in the cluster.
- Use caution when running routine node repair on more than one node at a time and schedule regular repair operations for low-usage hours.
- In systems that seldom delete or overwrite data, you can raise the value of gc_grace with minimal impact to disk space. This allows wider intervals for scheduling repair operations with the nodetool utility.
Repair requires intensive disk I/O. This occurs because of the validation compaction used for building the Merkle tree. To mitigate heavy disk usage:
- Use the nodetool compaction throttling options (setcompactionthroughput and setcompactionthreshold).
- Use nodetool snapshot and sequentially repair from
the snapshot. Recall that snapshots are just hardlinks to existing SSTables, immutable, and
require almost no disk space. This means that for any given replica set, only one replica at a
time performs the validation compaction. This allows the dynamic snitch to maintain performance
for your application via the other replicas.Note: Using the nodetool repair -pr (–partitioner-range) option repairs a specified range to ensure that the node is repaired only in its primary range and not the replicated ranges. The primary range is replicated to other nodes and those nodes perform repair work to ensure that the designated range is fully repaired. Because other replicas for that range have to perform the Merkle tree calculation, there may be a validation compaction. This happens because all the replicas are compacting at the same time and any nodes containing the repaired range (due to topology and the replication factor) may be slow to respond for that portion of the data.
This happens because the Merkle trees don’t have infinite resolution and Cassandra makes a tradeoff between the size and space. Currently, Cassandra uses a fixed depth of 15 for the tree (32K leaf nodes). For a node containing a million partitions with one damaged partition, about 30 partitions are streamed, which is the number that fall into each of the leaves of the tree. Of course, the problem gets worse when more partitions exist per node, and results in a lot of disk space usage and needless compaction.
To mitigate overstreaming, you can use subrange repair (available starting in Cassandra 1.1.11). Subrange repair allows for repairing only a portion of the data belonging to the node. Because the Merkle tree precision is fixed, this effectively increases the overall precision.
To use subrange repair:
- Use the Java describe_splits call to ask for a split containing 32K partitions.
- Iterate throughout the entire range incrementally or in parallel. This completely eliminates the overstreaming behavior and wasted disk usage overhead.
- Pass the tokens you received for the split to the nodetool repair -st (–start-token) and -et (–end-token) options.
- Pass the -local (–in-local-dc) option to nodetool to repair only within the local data center. This reduces the cross data-center transfer load.