How Repair Service works
The Repair Service repairs small chunks of your cluster in the background. The service takes a single parameter, time_to_completion, which is the maximum amount of time allowed to repair the entire cluster once. Typically, you set this to a value lower than your lowest gc_grace_seconds setting (the default for gc_grace_seconds is 10 days), so that every node is repaired before tombstones become eligible for garbage collection. The service may run multiple repairs in parallel, but runs as few as needed to finish within the specified time, and never runs more than one repair in a single replica set at the same time.
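As a rough illustration of the sizing rule above, the following Python sketch derives a time_to_completion value that stays below the lowest gc_grace_seconds across a set of tables. The function name, the table names, and the 90 percent headroom are assumptions made for this example; none of them are OpsCenter settings.

```python
DEFAULT_GC_GRACE_SECONDS = 10 * 24 * 60 * 60  # 10 days, the default mentioned above


def max_time_to_completion(gc_grace_by_table, headroom_percent=90):
    """Return a time_to_completion in seconds below the lowest gc_grace_seconds.

    gc_grace_by_table maps a table name to its gc_grace_seconds value.
    headroom_percent is a hypothetical safety margin, not an OpsCenter option.
    """
    if not gc_grace_by_table:
        raise ValueError("at least one table is required")
    if not 0 < headroom_percent < 100:
        raise ValueError("headroom_percent must be between 1 and 99")
    lowest = min(gc_grace_by_table.values())
    # Integer arithmetic keeps the result exact.
    return lowest * headroom_percent // 100


# Hypothetical tables: one on the default, one with a shorter grace period.
tables = {
    "ks1.users": DEFAULT_GC_GRACE_SECONDS,
    "ks1.events": 5 * 24 * 60 * 60,
}
print(max_time_to_completion(tables))  # 388800 seconds, i.e. 4.5 days
```

The shortest grace period drives the result, so a single table with a low gc_grace_seconds tightens the budget for the whole cluster.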
The current state of the Repair Service is persisted locally on the opscenterd server every five minutes by default. If opscenterd is restarted, the Repair Service resumes where it left off.
If a cluster is data center aware and has keyspaces using SimpleStrategy, the Repair Service fails to start.
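To spot this condition before enabling the service, a check along the following lines may help. It is a minimal sketch that assumes you have already collected each keyspace's replication strategy class into a dictionary; the function names and the datacenter_aware flag are hypothetical and do not correspond to an OpsCenter API.

```python
def simple_strategy_keyspaces(keyspace_strategies):
    """Return the sorted names of keyspaces that use SimpleStrategy.

    keyspace_strategies maps a keyspace name to its strategy class, given
    either as a short name ("SimpleStrategy") or a fully qualified one
    ("org.apache.cassandra.locator.SimpleStrategy").
    """
    return sorted(
        name
        for name, strategy in keyspace_strategies.items()
        if strategy.split(".")[-1] == "SimpleStrategy"
    )


def repair_service_can_start(datacenter_aware, keyspace_strategies):
    """Mirror the rule above: data center aware plus SimpleStrategy blocks startup."""
    return not (datacenter_aware and simple_strategy_keyspaces(keyspace_strategies))


# Hypothetical keyspaces for illustration.
strategies = {
    "app_data": "org.apache.cassandra.locator.NetworkTopologyStrategy",
    "legacy": "SimpleStrategy",
}
print(simple_strategy_keyspaces(strategies))       # ['legacy']
print(repair_service_can_start(True, strategies))  # False
```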
Changes in cluster topology
If the cluster topology changes, the Repair Service stops its current cycle and waits for the ring to stabilize before starting a new cycle. This check occurs every five minutes. The following count as topology changes:
- Nodes moving
- Nodes joining
- Nodes leaving
- Nodes going up/down
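The wait-for-stability behavior described above can be pictured as a small polling loop. This is only a sketch: the get_topology callable, the function name, and the rule that two identical consecutive snapshots mean the ring is stable are assumptions for illustration, not OpsCenter internals. Only the five-minute interval comes from the text.

```python
import time

CHECK_INTERVAL_SECONDS = 5 * 60  # the five-minute check described above


def wait_for_stable_ring(get_topology, interval=CHECK_INTERVAL_SECONDS, sleep=time.sleep):
    """Poll until two consecutive topology snapshots match.

    get_topology is a hypothetical callable returning a comparable snapshot,
    for example a dict of node name to state. Returns the number of snapshots
    taken before the ring was considered stable.
    """
    previous = get_topology()
    checks = 1
    while True:
        sleep(interval)
        current = get_topology()
        checks += 1
        if current == previous:
            return checks
        previous = current


# Simulated snapshots: node n2 finishes joining, then nothing changes.
snapshots = iter([
    {"n1": "up", "n2": "joining"},
    {"n1": "up", "n2": "up"},
    {"n1": "up", "n2": "up"},
])
waits = []  # records the requested sleeps instead of actually sleeping
print(wait_for_stable_ring(lambda: next(snapshots), sleep=waits.append))  # 3
```

Passing the sleep function as a parameter keeps the loop testable without waiting for real time to pass.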
Changes in schemas
- Keyspaces added while the Repair Service is running are not repaired until the current cycle completes and a new cycle starts.
- Column families added to existing keyspaces are repaired immediately during the current cycle of the Repair Service.
- Keyspaces or column families can be removed while the Repair Service is running without causing any issues.
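The three schema rules above can be modeled by fixing the list of keyspaces when a cycle starts while reading column families from the live schema. The sketch below assumes exactly that model; the function name and the data shapes are hypothetical, and the real service may track schema changes differently.

```python
def column_families_to_repair(cycle_keyspaces, current_schema):
    """List (keyspace, column family) pairs still to be repaired in this cycle.

    cycle_keyspaces is the list of keyspaces captured when the cycle started.
    current_schema maps each keyspace that exists now to its column families.
    """
    targets = []
    for keyspace in cycle_keyspaces:
        # A keyspace removed mid-cycle is skipped without error.
        for column_family in current_schema.get(keyspace, ()):
            # Column families added mid-cycle show up here immediately.
            targets.append((keyspace, column_family))
    # Keyspaces added mid-cycle are absent from cycle_keyspaces, so they
    # wait for the next cycle.
    return targets


# Cycle started with ks1 and ks2. Since then ks2 was removed, ks3 was added,
# and the column family "events" was added to ks1.
cycle_keyspaces = ["ks1", "ks2"]
current_schema = {"ks1": ["users", "events"], "ks3": ["logs"]}
print(column_families_to_repair(cycle_keyspaces, current_schema))
# [('ks1', 'users'), ('ks1', 'events')]
```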