Linux, Cassandra, and Saturday’s leap second problem
The primary symptom of the leap second problem was extremely high system load with no corresponding increase in request volume. Particularly unlucky systems would crash. Once diagnosed, a simple reboot or an even simpler reset of Linux’s timekeeping (e.g., via date `date +"%m%d%H%M%C%y.%S"`) was enough to fix the problem; the only difficulty was in determining the cause.
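For readers who want to see what that one-liner does, here is a minimal sketch that splits it into two steps: format the current time, then hand that same time back to date to set the clock. The function names and the DRY_RUN variable are hypothetical, introduced only for illustration. Actually setting the clock requires root.

```shell
# Format the current time as MMDDhhmmCCYY.ss, the form `date` accepts
# as an argument when setting the clock.
now_stamp() {
    date +"%m%d%H%M%C%y.%S"
}

# Set the system clock to the time it already shows (requires root).
# DRY_RUN=1 is a convention invented for this sketch: it prints the
# command instead of running it.
reset_clock() {
    stamp="$(now_stamp)"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "date $stamp"
    else
        date "$stamp"
    fi
}
```

Running reset_clock with DRY_RUN=1 set shows the exact command that would be issued, which is a reasonable thing to check before running it as root on a production machine.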
Initial reporting often fingered Java or even Cassandra as the culprit, which is a testament to the popularity of these systems on high-traffic web sites. The actual problem was a kind of livelock in the Linux system calls responsible for timers. What made this non-obvious (if you weren’t one of the unlucky admins whose servers actually crashed) is that tools like top would report that the application in question was consuming the CPU. Digging deeper to see that the culprit was system calls like futex_wait misbehaving is beyond the scope of most systems administration.
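One rough first step in that direction is to check whether the CPU time is being spent in user mode or in the kernel. The sketch below is an illustration, not a diagnostic for this specific bug: it assumes a Linux /proc/stat, and the function names are hypothetical. It compares two samples of the aggregate cpu line and reports the user versus system share of the elapsed CPU time. The summary line in top shows the same split as us and sy.

```shell
# Print the aggregate "cpu" line from /proc/stat (Linux only).
sample_cpu() {
    grep '^cpu ' /proc/stat
}

# Given two "cpu" lines taken some seconds apart, print the share of
# elapsed CPU time spent in user mode (user + nice) and in kernel mode
# (system). Fields 2-9 are user, nice, system, idle, iowait, irq,
# softirq, steal; guest time is already counted inside user.
cpu_split() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= NF; i++) a[i] = $i }
        NR == 2 {
            total = 0
            for (i = 2; i <= 9 && i <= NF; i++) total += $i - a[i]
            user = ($2 - a[2]) + ($3 - a[3])
            sys  = $4 - a[4]
            if (total > 0)
                printf "user=%.1f%% system=%.1f%%\n", 100 * user / total, 100 * sys / total
        }'
}

# Usage on a live machine:
#   before="$(sample_cpu)"; sleep 5; after="$(sample_cpu)"
#   cpu_split "$before" "$after"
```

A system share that dwarfs the user share while request volume is flat suggests the time is going to system calls rather than application code. Identifying which calls still takes a tracing tool.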