DataStax Developer Blog

Handling Disk Failures In Cassandra 1.2

By Aleksey Yeschenko - October 11, 2012 | 5 Comments

Cassandra is great at handling entire node failures. It’s not just robust, it’s almost indestructible.

But until Cassandra 1.2, a single unavailable disk has had the potential to make the whole replica unresponsive while still technically alive and part of the cluster: memtables become unable to flush, and the node eventually runs out of memory. Commitlog appends can also fail outright if you happen to lose the commitlog disk.

The traditional workaround has been to deploy on RAID 10 volumes, but as Cassandra handles increasingly large data sets, the prospect of sacrificing 50% of raw disk capacity on top of Cassandra’s own replication is becoming unpalatable: with a replication factor of 3 on mirrored volumes, every logical byte occupies six bytes of raw disk.

The upcoming Cassandra 1.2 release (currently in beta) fixes both of these issues by introducing a disk_failure_policy setting that allows you to choose from two policies that deal with disk failure sensibly: best_effort and stop. Here is how these work:

  • stop is the default behavior for new 1.2 installations. Upon encountering a file system error, Cassandra will shut down gossip and Thrift services, leaving the node effectively dead but still inspectable via JMX for troubleshooting.
  • best_effort: Cassandra will do its best in the face of disk errors. If it can’t write to a disk, that disk will be blacklisted for writes and the node will continue writing to its remaining data directories; if it can’t read from a disk, the disk will be marked unreadable and the node will continue serving data from the readable sstables only. Note that this makes it possible to serve stale data when the most recent version was on the now-inaccessible disk and the read is performed at consistency level ONE, so choose this option with care. In exchange, it lets you get the most out of your disks.

An ignore policy also exists for upgrading users: in this mode Cassandra will behave exactly as 1.1 and earlier versions did, logging all file system errors but otherwise ignoring them. DataStax recommends that users opt in to stop or best_effort instead. A minimal cassandra.yaml excerpt showing all three values follows.
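The option name and values below are those described above; the value shown is the default for new 1.2 installations, and the comments summarize each policy:

    # cassandra.yaml (Cassandra 1.2)
    #
    # Policy for handling file system errors:
    #   stop        - shut down gossip and Thrift; the node looks dead to
    #                 the cluster but remains inspectable via JMX
    #   best_effort - blacklist the failed disk and keep serving requests
    #                 from the remaining data directories
    #   ignore      - pre-1.2 behavior: log the error, otherwise ignore it
    disk_failure_policy: stop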

Summary

Starting with version 1.2, Cassandra will be able to react properly to a disk failure, either by stopping the affected node or by blacklisting the failed drive, depending on your availability/consistency requirements. This allows deploying Cassandra nodes with large disk arrays without the overhead of RAID 10.
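To make the dispatch concrete, here is a hypothetical sketch in Java (Cassandra’s implementation language). The class, method, and enum names are illustrative assumptions for this post, not Cassandra’s actual internals:

    // Illustrative sketch only: names and structure are assumptions,
    // not Cassandra's real implementation.
    enum DiskFailurePolicy { IGNORE, STOP, BEST_EFFORT }

    final class DiskFailureHandler {
        private final DiskFailurePolicy policy;

        DiskFailureHandler(DiskFailurePolicy policy) {
            this.policy = policy;
        }

        /** React to a file system error in the given data directory. */
        void onFileSystemError(java.io.File directory, boolean duringRead) {
            switch (policy) {
                case IGNORE:
                    // Pre-1.2 behavior: log the error and carry on.
                    System.err.println("FS error in " + directory + ", ignoring");
                    break;
                case STOP:
                    // The node looks dead to the cluster, but JMX stays
                    // up for troubleshooting.
                    stopGossipAndThriftServices();
                    break;
                case BEST_EFFORT:
                    // Blacklist the directory for writes only, or for both
                    // reads and writes, depending on the failure context,
                    // then keep serving from the remaining directories.
                    if (duringRead)
                        blacklistForReads(directory);
                    blacklistForWrites(directory);
                    break;
            }
        }

        private void stopGossipAndThriftServices() { /* elided */ }
        private void blacklistForReads(java.io.File dir)  { /* elided */ }
        private void blacklistForWrites(java.io.File dir) { /* elided */ }
    }

The point to notice is that only best_effort degrades a single directory rather than the whole node; stop sacrifices the node deliberately, and ignore preserves the risky pre-1.2 behavior.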



Comments

  1. Greg Lindahl says:

    You might find this blog post useful for ideas:

    http://highscalability.com/blog/2012/7/9/data-replication-in-nosql-databases.html

    I don’t think anything in it is really new; it describes a scheme similar to how Google handled their R3 databases from early on.

  2. Sankalp says:

    It would be nice if you could accept writes but forward reads to other replicas. The node could then run a repair in the background to fetch the lost data, and resume serving reads after that.

    This would be useful if you put a spare drive in the node: for the node to go offline, it would have to lose enough disks that it could no longer hold all the data.

  3. Sankalp says:

    How do you guys handle disk failures with memory mapped IO?

  4. Aleksey Yeschenko says:

    Sankalp: in the same way as we deal with other fs-related errors: ignore if disk_failure_policy is ‘ignore’, stop the node if the policy is ‘stop’, and blacklist the directory for either writes or for both reads and writes, depending on the context, if the policy is ‘best_effort’.

  5. Sachin says:

    Hi

    I wanted to understand how and when Cassandra detects that a drive has gone bad. If my disk_failure_policy is set to “stop”, how long will Cassandra remain oblivious to a disk problem before detecting it and then “stopping”?

    Cheers
