Repair Service in OpsCenter 4.0
Repairing a Cassandra cluster resolves inconsistencies amongst replicas for a given range of data. Periodically running nodetool repair is highly recommended, but generating Merkle trees can get expensive in terms of disk I/O and CPU. This is further complicated by the need to schedule a repair to run at least once within gc_grace_seconds on each node but not on more than one node at a time.
Figuring out how often to repair each node, how to configure a repair, or how to schedule repairs properly is a major pain point of Cassandra maintenance.
As part of the new DataStax Management Services, the Repair Service in OpsCenter 4.0 manages automated, continuous repairs across your cluster with minimal impact.
The Repair Service uses subrange repair to continuously repair small chunks of data that belong to a particular node. The service runs continuously and takes into account both how long an average repair takes and the actual throughput achieved while repairing. The ranges to repair are chosen to have minimal impact on your cluster, and repairs are ordered so that consecutive repairs on the same node are avoided.
In the diagram, each dash represents a subrange. Subrange 1, which is owned by Node A, will get repaired first; then Subrange 2, which is located on the other side of the ring and owned by Node C, will be repaired next. Repairs will continue on different subranges owned by different nodes until the entire cluster is repaired and the repair cycle begins again. More information about how the Repair Service works behind the scenes can be found in this blog post about advanced repair techniques.
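The ordering idea can be illustrated with a minimal sketch. It assumes a simplified ring in which each node owns one contiguous token range, and it uses a plain round-robin rotation across nodes; the function names, token values, and the rotation itself are illustrative assumptions, not the algorithm OpsCenter actually uses to pick subranges.

```python
def split_range(start, end, parts):
    """Split the token range (start, end] into `parts` contiguous subranges."""
    step = (end - start) // parts
    bounds = [start + i * step for i in range(parts)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))


def repair_order(ranges_by_node, parts):
    """Return (node, subrange) pairs, rotating across nodes on every step."""
    per_node = [
        [(node, sub) for sub in split_range(start, end, parts)]
        for node, (start, end) in ranges_by_node.items()
    ]
    # zip(*per_node) takes the i-th subrange of every node; flattening these
    # batches yields an order in which neighbors belong to different nodes
    # (given at least two nodes).
    return [pair for batch in zip(*per_node) for pair in batch]


# Hypothetical three-node ring with made-up token boundaries.
ring = {"A": (0, 100), "B": (100, 200), "C": (200, 300)}
order = repair_order(ring, parts=2)
for node, (start, end) in order:
    print(f"repair node {node}: tokens ({start}, {end}]")
```

In this sketch no two consecutive entries in the order belong to the same node, which mirrors the property described above, although the real service also weighs repair duration and throughput when choosing what to repair next.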
Repair Service Configuration
In the OpsCenter user interface, there is a new Services section to manage DataStax Enterprise Management Services. The Repair Service, which is off by default, can be configured and started here. There is only one configuration option: the time to complete a repair of all data in the cluster. OpsCenter provides an estimate for this time to completion by checking gc_grace_seconds across all column families and calculating 90% of the lowest value. A lower time to complete a repair cycle means that all data will be repaired more quickly, but there may be more of an impact on your cluster.
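As a worked example of that estimate, the sketch below takes 90% of the lowest gc_grace_seconds value. The function name and the column family names are hypothetical, and the input dictionary stands in for values that OpsCenter reads from the cluster schema.

```python
def estimate_cycle_seconds(gc_grace_by_cf, percent=90):
    """Return `percent` of the lowest gc_grace_seconds, in whole seconds."""
    lowest = min(gc_grace_by_cf.values())
    # Integer arithmetic avoids floating-point rounding surprises.
    return lowest * percent // 100


# Hypothetical column families; 864000 seconds (10 days) is Cassandra's default.
gc_grace = {"users": 864000, "events": 432000}
cycle = estimate_cycle_seconds(gc_grace)
print(cycle, "seconds, or", cycle / 86400, "days")
```

Here the lowest value is 432000 seconds (5 days), so the suggested time to completion is 388800 seconds, or 4.5 days.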
Since OpsCenter estimates an initial throughput, the value of compaction_throughput_mb_per_sec also has an effect on the initial cycle of repairs in your cluster. OpsCenter then uses the throughput of actual repairs to determine how many repairs to run in parallel in order to finish in the given time.
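One way to picture the relationship between throughput, target time, and parallelism is the rough model below. It is only a sketch under stated assumptions: the formula, the function name, and the sample numbers are hypothetical and do not describe OpsCenter's internal calculation.

```python
import math


def parallel_repairs_needed(total_mb, mb_per_sec_per_repair, target_seconds):
    """Estimate how many concurrent repairs finish `total_mb` within the target.

    Assumes every repair sustains the same throughput and that throughput
    scales linearly with the number of concurrent repairs.
    """
    serial_seconds = total_mb / mb_per_sec_per_repair
    return max(1, math.ceil(serial_seconds / target_seconds))


# Hypothetical figures: 1 TB of data, 2 MB/s observed per repair,
# and a 4.5-day (388800 second) target for the full cycle.
needed = parallel_repairs_needed(1_000_000, 2, 388_800)
print("concurrent repairs needed:", needed)
```

Under this model a lower observed throughput or a shorter target time raises the number of concurrent repairs, which matches the trade-off described above between a faster repair cycle and a greater impact on the cluster.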
More information on using the Repair Service can be found in the documentation.
The Repair Service runs continuously on your cluster and thus generates a lot of log messages that may not be useful to you if you are watching the OpsCenter logs for another reason. The Repair Service logs to a different, configurable log location. See the Repair Service documentation on configuration for more information.
The Repair Service is part of the DataStax Enterprise Management Services available in OpsCenter 4.0, which you can download here.