DataStax Developer Blog

Linux, Cassandra, and Saturday’s leap second problem

By Jonathan Ellis - July 2, 2012

Saturday’s leap second caused some trouble over the weekend, compounding the damage experienced in the wake of the Amazon Web Services outage that started Friday and continued into Saturday.

The primary symptom of the leap second problem was extremely high system load with no corresponding increase in requests. Particularly unlucky systems crashed outright. Once diagnosed, a reboot or, simpler still, a reset of Linux’s timekeeping was enough to fix the problem (e.g., as root, setting the clock to its own current value with date `date +"%m%d%H%M%C%y.%S"`, where the backticks are shell command substitution); the only difficulty was in determining the cause.
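For readers who want to see the clock reset spelled out, here is a minimal sketch of the same idea as a small script. It assumes a `date` that accepts the MMDDhhmmCCYY.ss positional form (GNU date does). The function names and the DRY_RUN variable are hypothetical and chosen only for illustration; by default the script prints the command instead of running it.

```shell
#!/bin/sh
# Sketch: set the system clock to its own current value.
# Assumption: `date` accepts MMDDhhmmCCYY.ss as a positional argument (GNU date).
# current_stamp, reset_clock and DRY_RUN are illustrative names, not a standard tool.

# Print the current time in the MMDDhhmmCCYY.ss form that `date` accepts.
current_stamp() {
    date +"%m%d%H%M%C%y.%S"
}

# Reset the clock. With DRY_RUN=1 (the default), only print the command.
reset_clock() {
    stamp=$(current_stamp)
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: date $stamp"
    elif [ "$(id -u)" -ne 0 ]; then
        echo "must be root to set the clock" >&2
        return 1
    else
        date "$stamp"
    fi
}

reset_clock
```

Setting DRY_RUN=0 and running the script as root would actually set the clock; the dry-run default is a deliberate guard so that copying the sketch does not change anything by accident.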

Initial reporting often fingered Java or even Cassandra as the culprit, which is a testament to the popularity of these systems on high-traffic web sites. The actual problem was a kind of livelock in the Linux system calls responsible for timers. What made this non-obvious (unless you were one of the unlucky admins whose servers actually crashed) is that tools like top reported the application itself as consuming the CPU; digging deeper to see that the real culprit was misbehaving system calls like futex_wait is beyond the scope of most systems administration.
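One rough first check for this kind of problem is whether the busy CPU time is being spent in user code or in the kernel. The sketch below reads the aggregate `cpu` line of /proc/stat twice and reports the kernel's share of busy time. It is only an illustration under stated assumptions (Linux, the standard /proc/stat field order of user, nice, system, idle); the function names are hypothetical, and it does not identify which system call is responsible.

```shell
#!/bin/sh
# Sketch: compare user vs. system (kernel) CPU time using /proc/stat.
# cpu_user_sys and sys_share are illustrative names.

# Print "user system" jiffies from an aggregate "cpu ..." line.
# Fields after the label are: user nice system idle ...
cpu_user_sys() {
    set -- $1                      # split the line into words
    echo "$(( $2 + $3 )) $4"       # user+nice, then system
}

# Percentage of busy (user+system) time spent in the kernel between two samples.
sys_share() {
    before=$(cpu_user_sys "$1")
    after=$(cpu_user_sys "$2")
    u=$(( ${after% *} - ${before% *} ))
    s=$(( ${after#* } - ${before#* } ))
    total=$(( u + s ))
    [ "$total" -gt 0 ] || { echo 0; return; }
    echo $(( 100 * s / total ))
}

# Sample the live counters one second apart (Linux only).
if [ -r /proc/stat ]; then
    a=$(grep '^cpu ' /proc/stat)
    sleep 1
    b=$(grep '^cpu ' /proc/stat)
    echo "kernel share of busy CPU time: $(sys_share "$a" "$b")%"
fi
```

A persistently high kernel share on a machine whose request rate has not changed would be consistent with the symptom described above, though confirming the cause would still take a tool that attributes time to individual system calls, such as strace with its -c summary option.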

This affected Java systems software like Cassandra, Hadoop, ElasticSearch, and Jetty, as well as non-Java code like MySQL or even client software like Firefox.

A fix for the Linux kernel is in progress as of this writing and will certainly be finished by the next leap second. Related Red Hat knowledge base entries are 154793 and 154713.



Comments

  1. Marcos says:

    I very much doubt that anyone at least reasonably IT-literate has singled out either Cassandra or Java for the weekend outages.
    Nevertheless, that reddit Twitter update prompted a few non-technical folks to point fingers, namely on Twitter and blogs.
    One truly hopes this episode does not become a stain on Cassandra’s reputation.
    Cassandra is perfectly fit for its purpose.
    FCOL, what kind of sysadmin is not aware of the perks of time.h and other great Unix libraries?!
