Repair Service error handling
The Repair Service handles the following errors.
- Error of a single range repair
- When a single range repair fails, the Repair Service skips it temporarily, moves it to the end of the repair queue, and retries it later. If the same range fails ten times, the Repair Service shuts down and fires an alert.
- Too many errors in a single run
- After a total of 100 errors during a single run, the Repair Service shuts down and fires an alert.
- Timeout of a single repair command
- The Repair Service times out a single repair command after one hour. The timeout counts as an error for that repair command, which is moved to the end of the repair queue and retried later.
- Too many repairs in parallel
- The Repair Service raises an error and shuts down if it would have to run too many repairs in parallel. By default, this happens when it estimates that it must run more than one repair at a time within a single replica set to finish on time.
Note: All of these thresholds and timeouts are configurable.
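
The requeue-and-threshold behavior described above can be illustrated with a minimal Python sketch. This is not the Repair Service implementation: `RepairPolicy`, `run_repairs`, `RepairServiceShutdown`, and the field names are hypothetical, and only the default values (ten failures per range, 100 errors per run, a one-hour timeout) come from the text. The sketch assumes that a timeout surfaces as a `TimeoutError` raised by the repair function, and it does not model the parallelism check.

```python
from collections import deque
from dataclasses import dataclass


class RepairServiceShutdown(Exception):
    """Stands in for "shut down and fire an alert" in this sketch."""


@dataclass(frozen=True)
class RepairPolicy:
    # Hypothetical field names; the values mirror the defaults described above.
    max_failures_per_range: int = 10
    max_errors_per_run: int = 100
    repair_timeout_seconds: float = 3600.0


def run_repairs(ranges, repair_fn, policy=None):
    """Repair ranges in order, requeueing each failure at the end of the queue.

    repair_fn(range_id, timeout_seconds) returns True on success. It signals
    failure by returning False or by raising TimeoutError.
    """
    policy = policy or RepairPolicy()
    queue = deque(ranges)
    failures = {}      # failure count per range
    total_errors = 0   # failure count for the whole run
    completed = []

    while queue:
        current = queue.popleft()
        try:
            ok = repair_fn(current, policy.repair_timeout_seconds)
        except TimeoutError:
            ok = False  # a timeout counts as an error for this repair

        if ok:
            completed.append(current)
            continue

        total_errors += 1
        failures[current] = failures.get(current, 0) + 1
        if failures[current] >= policy.max_failures_per_range:
            raise RepairServiceShutdown(f"range {current!r} failed too many times")
        if total_errors >= policy.max_errors_per_run:
            raise RepairServiceShutdown("too many errors in a single run")

        queue.append(current)  # skip for now and retry later

    return completed


if __name__ == "__main__":
    attempts = {}

    def flaky_repair(range_id, timeout_seconds):
        # Hypothetical stand-in: range "b" fails on its first two attempts.
        attempts[range_id] = attempts.get(range_id, 0) + 1
        return not (range_id == "b" and attempts[range_id] <= 2)

    print(run_repairs(["a", "b", "c"], flaky_repair))  # ['a', 'c', 'b']
```

In the usage example, range "b" fails twice and is moved to the end of the queue each time, so it completes after "a" and "c". Because both counters are checked on every failure, whichever threshold is reached first stops the run.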