Running node repair.
The nodetool repair command repairs inconsistencies across all of the replicas for a given range of data. Run repair in these situations:
- During normal operation as part of regular, scheduled cluster maintenance unless Cassandra applications perform no deletes.
- During node recovery; for example, when bringing a node back into the cluster after a failure.
- On nodes containing data that is not read frequently.
- To update data on a node that has been down.
Guidelines for running routine node repair include:
- The hard requirement for routine repair frequency is the value of gc_grace_seconds. Run a repair operation at least once on each node within this time period. Following this important guideline ensures that deletes are properly handled in the cluster.
- Use caution when running routine node repair on more than one node at a time, and schedule regular repair operations for low-usage hours.
- In systems that seldom delete or overwrite data, you can raise the value of gc_grace_seconds with minimal impact on disk space. This allows wider intervals for scheduling repair operations with the nodetool utility.
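The gc_grace_seconds guideline above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of Cassandra: the function name and the expected_repair_duration parameter are hypothetical, and the 864000-second value is the Cassandra default for gc_grace_seconds (10 days).

```python
from datetime import timedelta

# Cassandra's default gc_grace_seconds (10 days). Tables can override it.
DEFAULT_GC_GRACE_SECONDS = 864_000


def repair_schedule_is_safe(
    repair_interval: timedelta,
    gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS,
    expected_repair_duration: timedelta = timedelta(0),
) -> bool:
    """Return True if every node completes a repair within gc_grace_seconds.

    repair_interval is the time between the start of two consecutive
    repairs on the same node. expected_repair_duration (hypothetical
    parameter) is how long one repair takes to finish.
    """
    worst_case = repair_interval + expected_repair_duration
    return worst_case.total_seconds() < gc_grace_seconds


if __name__ == "__main__":
    # A weekly repair fits inside the default 10-day window.
    print(repair_schedule_is_safe(timedelta(days=7)))
    # A 9-day interval with a 2-day repair does not.
    print(repair_schedule_is_safe(timedelta(days=9), expected_repair_duration=timedelta(days=2)))
```

If you raise gc_grace_seconds on a table, pass the new value as the second argument to see how much wider the repair interval can become.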
Repair requires intensive disk I/O. This occurs because of the validation compaction used for building the Merkle tree. To mitigate heavy disk usage:
- Use the nodetool compaction throttling options (setcompactionthroughput and setcompactionthreshold).
- Use nodetool snapshot and sequentially repair from the snapshot. Recall that snapshots are just hardlinks to existing SSTables, are immutable, and require almost no disk space. Repairing sequentially means that for any given replica set, only one replica at a time performs the validation compaction. This allows the dynamic snitch to maintain performance for your application via the other replicas.
Note: Using the nodetool repair -pr (--partitioner-range) option repairs only the primary range for that node, but the other replicas for that range still have to perform the Merkle tree calculation, which causes a validation compaction. Because all the replicas are compacting at the same time, all the nodes may be slow to respond for that portion of the data.
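The throttling option from the list above can be sketched as a small helper that builds the nodetool command lines before a repair. This is an illustrative sketch only: the function names are hypothetical, the commands are built but not executed, and the throughput value to use depends on your hardware. A value of 0 disables throttling.

```python
from typing import List


def throttle_command(mb_per_sec: int) -> List[str]:
    """Build the nodetool call that caps compaction throughput (MB/s).

    A value of 0 disables throttling.
    """
    if mb_per_sec < 0:
        raise ValueError("throughput must be zero or positive")
    return ["nodetool", "setcompactionthroughput", str(mb_per_sec)]


def primary_range_repair_command() -> List[str]:
    """Build the nodetool call that repairs only this node's primary range."""
    return ["nodetool", "repair", "-pr"]


if __name__ == "__main__":
    # Print the commands in the order they would be run on a node.
    for cmd in (throttle_command(8), primary_range_repair_command()):
        print(" ".join(cmd))
```

To run the commands instead of printing them, pass each list to subprocess.run on a host where nodetool is installed.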
Repair can result in overstreaming. Overstreaming occurs, for example, when a single partition is damaged but many more partitions are streamed.
This happens because the Merkle trees don't have infinite resolution, and Cassandra makes a tradeoff between the size of the tree and its precision. Currently, Cassandra uses a fixed depth of 15 for the tree (32K leaf nodes). For a node containing a million partitions with one damaged partition, about 30 partitions are streamed, which is the number that falls into each leaf of the tree. The problem gets worse when more partitions exist per node, and results in a lot of disk space usage and needless compaction.
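The figures above follow from simple arithmetic: a tree of depth 15 has 2^15 leaves, and the partitions on a node are spread across those leaves. The sketch below reproduces the numbers; the function name is hypothetical and the calculation assumes partitions are distributed evenly over the leaves.

```python
MERKLE_TREE_DEPTH = 15  # fixed depth stated in the text


def partitions_per_leaf(partition_count: int, depth: int = MERKLE_TREE_DEPTH) -> float:
    """Average number of partitions covered by one Merkle tree leaf.

    Assumes partitions are spread evenly across the leaves.
    """
    leaves = 2 ** depth
    return partition_count / leaves


if __name__ == "__main__":
    print(2 ** MERKLE_TREE_DEPTH)                       # 32768 leaves ("32K")
    print(round(partitions_per_leaf(1_000_000), 1))     # 30.5
    print(round(partitions_per_leaf(100_000_000), 1))   # 3051.8
```

With 100 million partitions on a node, one damaged partition causes roughly 3,000 partitions to be streamed, which shows how overstreaming grows with the partition count.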
To mitigate overstreaming, you can use subrange repair (available starting in Cassandra 1.1.11). Subrange repair allows for repairing only a portion of the data belonging to the node. Because the Merkle tree precision is fixed, this effectively increases the overall precision.
To use subrange repair:
- Use the Java describe_splits call to ask for a split containing 32K partitions.
- Iterate through the entire range incrementally or in parallel. This completely eliminates the overstreaming behavior and wasted disk usage overhead.
- Pass the tokens you received for the split to the nodetool repair -st (--start-token) and -et (--end-token) options.
- Pass the -local (--in-local-dc) option to nodetool to repair only within the local datacenter. This reduces the cross data-center transfer load.
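The steps above can be sketched as a script that turns a list of token subranges into nodetool repair invocations. This is a minimal sketch under stated assumptions: it does not call describe_splits, and instead divides one token range into equal-width pieces as a stand-in for the splits that call would return. The function names, the keyspace name, and the example tokens are hypothetical, the range is assumed not to wrap around the ring (start is less than end), and the argument order of the generated command is an assumption to verify against your nodetool version.

```python
from typing import Iterator, List, Tuple


def split_range(start: int, end: int, splits: int) -> Iterator[Tuple[int, int]]:
    """Divide the token range (start, end] into contiguous subranges.

    Stand-in for the splits returned by describe_splits: equal token
    width, not equal partition count. Assumes start < end (no wraparound).
    """
    if splits < 1:
        raise ValueError("splits must be at least 1")
    if start >= end:
        raise ValueError("start must be less than end")
    width = end - start
    for i in range(splits):
        lo = start + width * i // splits
        hi = start + width * (i + 1) // splits
        yield lo, hi


def repair_command(
    keyspace: str, start_token: int, end_token: int, local_dc_only: bool = False
) -> List[str]:
    """Build one subrange repair command using -st, -et and optionally -local."""
    cmd = ["nodetool", "repair", "-st", str(start_token), "-et", str(end_token)]
    if local_dc_only:
        cmd.append("-local")
    cmd.append(keyspace)
    return cmd


if __name__ == "__main__":
    # Hypothetical token range owned by the node being repaired.
    for lo, hi in split_range(0, 1_000_000, 4):
        print(" ".join(repair_command("my_keyspace", lo, hi, local_dc_only=True)))
```

Each generated command covers one subrange, so the commands can be run one after another or distributed across workers, matching the incremental or parallel iteration described in the steps.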