Preparing for the Leap Second
NOTE, 12 December 2016: An updated version of this blog post has been made as it pertains to the leap second being added on January 1st, 2017. Thanks to Andy Tolbert for bringing this information up to date.
System administrators are probably already aware of the upcoming leap second in June. From a civil time perspective, June 30th will have an extra second added to the end. From the Linux time system’s point of view, the last second of June 30th will repeat itself: another second with the same timestamp as the previous one is inserted at the end of the day.
Those of you who were using Cassandra -- or running a number of other databases or applications under Linux -- back in 2012 may have had problems when a leap second was added at the end of June. In this blog post, we’ll explain how things have changed since then, what we’ve done to anticipate other problems that may be caused by the leap second, and what you can do to prepare for it.
Livelock in Pre-3.4 Linux Kernel and Pre-7u60 JDK
As explained in Jonathan's 2012 leap-second blog post, many of the failures that occurred in 2012 were caused by a bug in the Linux kernel that caused a livelock in the timer subsystem when the leap second was inserted. Luckily, a fix for that particular problem was applied to the kernel as part of version 3.4.
Determining if Your System is Affected
As an initial assessment, run uname -r to determine the version of the kernel you're running. Kernel versions 3.4 and higher aren’t affected by the bug. For a more comprehensive assessment, and to demonstrate problems that can be caused by the kernel bug, the author of the bug fix wrote two programs that exercise the bug. These are useful diagnostic tools, but they alter the host system's clock, so do not run them on production systems or on any system that contains data you want to keep.
- This program can lock up kernels that still contain the bug.
- This program, run with the -s option, will repeatedly insert leap seconds and check for any timing errors resulting from the insertion.
We've tested both of these programs on Ubuntu images on AWS and verified that they fail on systems with old kernels and succeed on newer ones. You may not see the expected failures on systems running under other forms of virtualization; for instance, we saw different timer-resetting behavior on images running under VirtualBox. If you're a Red Hat Enterprise Linux user with a Red Hat account, Red Hat's lab on the subject may be helpful. It assumes you use RHEL, but if you do, it can determine if your system is susceptible to the livelock without interacting with your system clock.
UPDATE 24 June 2015: If you use RHEL 2.6 or higher, your system may be safe from kernel livelocks even on older kernels. There was a workaround applied to the kernel that prevents the livelock from causing problems, though it does not fix the underlying issue. See this bug report and this update report for more information.
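The initial uname -r assessment can also be scripted. What follows is a minimal Python sketch, not something used in our tests; the function names parse_kernel_version and has_livelock_fix are hypothetical. It only compares the version number against 3.4, so it cannot detect vendor kernels that carry a backported fix or workaround, such as the RHEL case noted above.

```python
import platform
import re


def parse_kernel_version(release):
    """Return (major, minor) from a release string such as '3.13.0-24-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % (release,))
    return int(match.group(1)), int(match.group(2))


def has_livelock_fix(release):
    """True if the version number is 3.4 or higher (hypothetical helper)."""
    return parse_kernel_version(release) >= (3, 4)


if __name__ == "__main__":
    # On Linux, platform.release() reports the same string as `uname -r`.
    release = platform.release()
    status = "3.4 or higher" if has_livelock_fix(release) else "older than 3.4"
    print("kernel %s is %s" % (release, status))
```

A result of "older than 3.4" means the system deserves a closer look, not that it is certain to lock up.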
Java-based applications like Cassandra were particularly affected by this kernel issue because thread parking operations relied on the CLOCK_REALTIME system clock. Recent versions of JDK 7 (7u60+) and all versions of JDK 8 include an enhancement (JDK-6900441) that uses CLOCK_MONOTONIC for these operations instead. In the general case, CLOCK_MONOTONIC is not affected by system time changes, such as the insertion of a leap second.
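The difference between the two kinds of clock can be seen from any language that exposes both. The sketch below is a Python analogy, not a reproduction of the JDK internals: on Linux, time.time() is typically backed by CLOCK_REALTIME and time.monotonic() by CLOCK_MONOTONIC. The function name measure_wait is hypothetical.

```python
import time


def measure_wait(seconds):
    """Sleep, then return the elapsed time as seen by each clock."""
    wall_start = time.time()       # wall clock; can be stepped by NTP or an admin
    mono_start = time.monotonic()  # monotonic clock; never goes backwards
    time.sleep(seconds)
    wall_elapsed = time.time() - wall_start
    mono_elapsed = time.monotonic() - mono_start
    return wall_elapsed, mono_elapsed


if __name__ == "__main__":
    wall, mono = measure_wait(0.1)
    print("wall-clock elapsed: %.3f s" % wall)
    print("monotonic elapsed:  %.3f s" % mono)
```

Under normal conditions the two values agree closely. If the system clock is stepped during the wait, only the wall-clock figure is distorted, which is why timeouts and waits are better measured against the monotonic clock.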
We were able to reproduce kernel lockups using pre-7u60 JDKs on pre-3.4 kernels. We have not yet seen a kernel lockup, even with older kernels, with JDK 7u60 and higher. Still, we strongly discourage relying on a JDK upgrade alone as a workaround -- if you are using a kernel older than 3.4, you are still at risk of a livelock in the kernel.
On newer kernel versions that do not demonstrate this issue, it still may be of value to be at a JDK level greater than or equal to 7u60, as time-sensitive operations will behave more correctly than in older versions.
Timestamp Behavior Over the Leap Second
The timestamps that Cassandra nodes use do not increase monotonically over the inserted leap second. Instead, the last second of the day is repeated, so timestamps generated during the inserted second will appear interleaved with timestamps generated during the previous second. We’ve run some tests on a 4-node cluster on AWS to help anticipate what problems you might encounter when running Cassandra during the leap second.
We simulated a leap second using NTP and inserted values at incrementing keys. Our test logic looked something like:
simple_insert = session.prepare(
    'INSERT INTO test (foo, bar) VALUES (?, ?);')
for i in itertools.count():
    if past_midnight():
        break
    sleep(.5)
    session.execute(simple_insert, [i, i])
result = session.execute("SELECT bar, WRITETIME(bar) FROM test;")
These tests were run with client-side timestamps turned off, so the timestamps were generated by the Cassandra nodes themselves.
When timestamps increase monotonically, as they do most of the time, the values and writetimes selected would increase together. However, when the insertions happen over the leap second, the writes’ timestamps are interleaved:
The values were inserted in increasing order, but their writetimes are in a different order because of the repeated second. During the first instance of 23:59:59, the values 579, 580, and 581 were inserted at the beginning, middle, and end of the second. During the leap second, which is also 23:59:59, 582, 583, and 584 were inserted, also at the beginning, middle, and end of the second. However, since the two seconds are the same second, they appear interleaved with respect to timestamps, as shown above. Because underlying system timestamps are responsible, we see similar results with UPDATE statements, both with and without lightweight transactions.
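The interleaving can be reproduced without a cluster by modeling a clock that steps back one second when the leap second begins. This is a minimal sketch: the values 579 through 584 come from the description above, but the microsecond offsets are illustrative assumptions rather than measurements from our test, and the helper names are hypothetical.

```python
SECOND_US = 1_000_000

# (value, true elapsed microseconds since the first 23:59:59 began).
# The offsets are assumed for illustration only.
writes = [
    (579, 0), (580, 450_000), (581, 900_000),              # first 23:59:59
    (582, 1_050_000), (583, 1_500_000), (584, 1_950_000),  # repeated second
]


def clock_reading(elapsed_us):
    """Model a clock that steps back one second when the leap second begins."""
    return elapsed_us - SECOND_US if elapsed_us >= SECOND_US else elapsed_us


def order_by_writetime(writes):
    """Return values sorted by the timestamp the stepped clock would assign."""
    stamped = [(clock_reading(elapsed), value) for value, elapsed in writes]
    return [value for _, value in sorted(stamped)]


print(order_by_writetime(writes))  # [579, 582, 580, 583, 581, 584]
```

The values were written in increasing order, yet sorting by the modeled writetime alternates between the two seconds.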
For many applications, this interleaved ordering will not affect correct operation. However, if your application requires that values’ writetime order is the same as their wall-clock-time insertion order, you should make sure your strategy for ensuring that property also holds during inserted leap seconds.
Clock Sync Problems Around the Leap Second
Cassandra’s behavior depends on your cluster having well-synced clocks on all your servers. The timestamps on writes and deletes are, in most cases, generated by the coordinator node (though they can also be generated by the client, in the case of drivers like the Python driver that use protocol version 3). Thus, if clocks are out of sync, timestamps on writes that were coordinated by different nodes can be out of order.
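To illustrate why skew matters, here is a deliberately simplified Python model of timestamp-based conflict resolution for a single cell. It is a sketch under stated assumptions, not Cassandra's implementation: the 200 ms skew, the node names, and both helper functions are hypothetical.

```python
def coordinator_timestamp(true_time_ms, clock_offset_ms):
    """Timestamp a coordinator assigns, given the error in its clock."""
    return true_time_ms + clock_offset_ms


def last_write_wins(writes):
    """Pick the value with the highest timestamp (simplified model)."""
    return max(writes, key=lambda w: w[0])[1]


# Node A's clock is accurate; node B's clock runs 200 ms behind (assumed).
first = (coordinator_timestamp(1000, 0), "written first via node A")
second = (coordinator_timestamp(1100, -200), "written second via node B")

print(last_write_wins([first, second]))  # written first via node A
```

The second write happened 100 ms later in real time, but node B stamped it earlier than the first, so the older value survives.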
Ensure that your servers are synchronized with NTP using the same servers. Using external NTP pools carries some risks, however. NTP servers, such as those accessible as part of the ntp.org server pool, can be out of sync with one another or can be misconfigured to add leap seconds at the wrong time, or to not add scheduled leap seconds. If your Cassandra nodes’ NTP clients use external servers directly, their clocks may drift as they independently compensate for upstream inconsistencies. You can avoid these problems by setting up your own NTP pool that will compensate for inconsistencies between upstream servers and provide consistent time to your Cassandra nodes as clients.
Leap Seconds and DataStax Drivers
Like Cassandra, some client drivers are also susceptible to the kernel bug around leap seconds and timestamp generation issues.
Kernel Issue Impact
As the java-driver library runs on the JVM, it could, in theory, be susceptible to the kernel bug encountered in June 2012. In testing on kernel 2.6.35-32 with JDK 7u55, we found that no threads were susceptible to the leap second issue. However, since there may be other activities in an application running the java-driver, we strongly recommend upgrading your kernel to 3.4+ and also considering upgrading your JDK version to 7u60+.
The C++, Python, Ruby, and Node.js drivers were also tested on an older kernel version and did not demonstrate any lockup issues after a leap second was inserted. That said, these tests were not comprehensive, so we still strongly recommend that you consider upgrading to kernel 3.4+.
Leap Seconds and Client Timestamp Implementations
If you are using client timestamps, you may run into issues similar to those described in the ‘Timestamp Behavior Over the Leap Second’ section. In DataStax client drivers, there are three ways to enable client timestamps:
- Appending ‘USING TIMESTAMP timestamp’ to your CQL query.
- Using the ‘set timestamp’ method on a Statement, for example setDefaultTimestamp in the java-driver. This is only available for drivers supporting Cassandra 2.1 running against Cassandra 2.1+ clusters.
- Using a timestamp generator.
With options 1 and 2, generating the client timestamp is left to the user, so how you generate these timestamps determines how an inserted leap second will affect you. If you use a language's default time implementation (for example, System.currentTimeMillis() in Java, time.time() in Python, Time.now() in Ruby, and so on), you will be susceptible to an inserted leap second setting time back one second, so writes will be interleaved.
The java-driver offers two monotonic timestamp generators, AtomicMonotonicTimestampGenerator and ThreadLocalMonotonicTimestampGenerator. These generators compute each timestamp as (System.currentTimeMillis() * 1000) plus a counter that is incremented for every statement created within the current millisecond. When time advances, the counter is reset. However, if time does not advance and the counter reaches 999, it no longer increases, so after 1000 unique statements in a single millisecond, the timestamps used are no longer unique. This typically would not be an issue, except that when time is set back one second on insertion of a leap second, the millisecond basis for the timestamp generator does not increase until the next second. It is not unlikely for 1000 unique statements to be generated in a second, so you might observe this phenomenon during a leap second. When it occurs, the following log message will be emitted:
Sub-millisecond counter overflowed, some query timestamps will not be distinct
The problem introduced here is the possibility of multiple writes to the same key with the same timestamp. In that case, Cassandra cannot determine which write is the most recent and chooses the write with the lexicographically largest value. JAVA-727 has been opened to handle this scenario better. To help reduce the possibility of counter overflow during a leap second, you may want to consider using ThreadLocalMonotonicTimestampGenerator, which uses a separate microsecond counter for each client thread that calls Session#execute or executeAsync. Note that because each thread has its own counter, mutations to the same cell made while the millisecond value does not advance may be interleaved in time order.
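The counter behavior described above can be modeled in a few lines. The following Python sketch is a simplified illustration and not the java-driver's code: the class name MonotonicTimestampSketch and its injectable clock are hypothetical, and holding on to the highest millisecond seen when the clock steps back is an assumption based on the description above.

```python
import threading


class MonotonicTimestampSketch:
    """Simplified model of the generator behavior; not driver code."""

    def __init__(self, clock_ms):
        self._clock_ms = clock_ms  # callable returning current time in ms
        self._lock = threading.Lock()
        self._last_ms = None
        self._counter = 0

    def next_timestamp(self):
        with self._lock:
            now_ms = self._clock_ms()
            if self._last_ms is None or now_ms > self._last_ms:
                # Time advanced: adopt the new millisecond, reset the counter.
                self._last_ms = now_ms
                self._counter = 0
            elif self._counter < 999:
                # Same or earlier millisecond: bump the sub-millisecond counter.
                self._counter += 1
            # Otherwise the counter is saturated and timestamps repeat.
            return self._last_ms * 1000 + self._counter


# A fake clock stuck on one millisecond, standing in for a clock stepped back.
generator = MonotonicTimestampSketch(lambda: 5)
stamps = [generator.next_timestamp() for _ in range(1001)]
print(len(set(stamps)))  # 1000
```

Of the 1001 timestamps requested while the clock does not advance, only 1000 are distinct; the last one repeats its predecessor. A thread-local variant would keep one such counter per thread, which postpones saturation but allows timestamps from different threads to interleave.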
In summary, to prepare for the upcoming leap second in June:
- At a bare minimum, make sure you are running Cassandra and its drivers on kernel version 3.4 or higher. We also recommend using JDK version 7u60 or higher. This should protect you from the livelock problems users experienced in 2012.
- Determine if your application will be affected by out-of-order timestamps during the inserted leap second, and if it will, develop a strategy for preventing any problems.
Many thanks to Andy Tolbert for his contributions to this blog post.
EDITED 24 June 2015: added information about patches to RHEL; thanks to Jeremiah Jordan and James Kavanagh for bringing them to my attention.