OpsCenter 4.0 User Guide

Repair Service error handling

The Repair Service handles the following errors.

Error of a single range repair
When a single range repair fails, the Repair Service skips it temporarily, moves it to the end of the repair queue, and retries it later. If a single range fails ten times, the Repair Service shuts down and fires an alert.
Too many errors in a single run
After a total of 100 errors during a single run, the Repair Service shuts down and fires an alert.
Time-outs
The Repair Service times out a single repair command after one hour. A time-out counts as an error for that repair command, which is moved to the end of the repair queue and retried later.
Too many repairs in parallel
The Repair Service reports an error and shuts down if it has to run too many repairs in parallel. By default, this happens if it estimates that it must run more than one repair at a time within a single replica set to complete on time.
Note: All of these thresholds and time-outs are configurable.
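
The retry and threshold rules described above can be sketched in Python as follows. This is an illustrative model only, not OpsCenter code: the constant names, the `RepairServiceShutdown` exception, and the `repair_fn` callable are all hypothetical, and the order in which the two thresholds are checked is an assumption. Only the default values (ten failures per range, 100 errors per run, a one-hour time-out) come from the text.

```python
from collections import deque

# Default values come from the text above; the names are hypothetical.
SINGLE_RANGE_ERROR_LIMIT = 10   # failures of one range before shutdown
TOTAL_ERROR_LIMIT = 100         # errors in one run before shutdown
REPAIR_TIMEOUT_SECONDS = 3600   # one hour per repair command


class RepairServiceShutdown(Exception):
    """Stands in for 'shut down and fire an alert'."""


def run_repairs(ranges, repair_fn):
    """Repair every range, moving failed ranges to the back of the queue.

    repair_fn(rng, timeout) is a hypothetical callable that returns True on
    success, returns False on failure, and raises TimeoutError on a time-out.
    Ranges must be hashable so that failures can be counted per range.
    """
    queue = deque(ranges)
    errors_per_range = {}
    total_errors = 0
    completed = []

    while queue:
        rng = queue.popleft()
        try:
            ok = repair_fn(rng, REPAIR_TIMEOUT_SECONDS)
        except TimeoutError:
            ok = False  # a time-out counts as an error for that repair

        if ok:
            completed.append(rng)
            continue

        total_errors += 1
        errors_per_range[rng] = errors_per_range.get(rng, 0) + 1
        if errors_per_range[rng] >= SINGLE_RANGE_ERROR_LIMIT:
            raise RepairServiceShutdown(
                f"range {rng!r} failed {errors_per_range[rng]} times")
        if total_errors >= TOTAL_ERROR_LIMIT:
            raise RepairServiceShutdown(f"{total_errors} errors in this run")

        queue.append(rng)  # retry later, after the other queued repairs

    return completed
```

In this model, a range that fails is retried only after every range already in the queue has had its turn, so one problematic range does not block the rest of the run.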