Cassandra Automated Upgrade Testing Improvements
Test Infrastructure and Challenges
The Cassandra Test Engineering team has been working to improve testing of the Cassandra upgrade process, and we want to share these changes with you. As you may know, Cassandra's functional test suite, cassandra-dtest, is an open-source Python project on GitHub where much of the Cassandra test automation effort takes place (we are happy to receive pull requests, and it's a great way for Pythonistas to contribute). These tests, among a multitude of others, run on a publicly accessible Jenkins server at cassci.datastax.com. CassCI stands for Cassandra Continuous Integration, but we affectionately refer to it as "cassie".
Testing the upgrade process is essential to making sure nodes can use the improvements in the latest releases and get there with no negative impact on pre-upgrade data. In addition, testing ensures the upgrade process is predictable and repeatable, with minimal impact on availability. Upgrading presents interesting challenges for stateful systems like databases because each new release may require data format changes but must also be capable of interacting with data from previous versions. This backwards compatibility may only be required as a transitional part of the upgrade process, or could be needed indefinitely. For this reason it's important to pay attention to data when testing the upgrade process.
Real World Upgrades
Software test scenarios are often simpler than their real-world counterparts, so let's take a detour into real-world upgrades to get a sense of the procedure and what should be considered. For our purposes, upgrading means taking nodes offline to update their software, bringing them back online, and aligning pre-upgrade data with the current format if necessary. When applicable, these data format changes apply to sstables. Before upgrading, one should have a rollback strategy in place and review the release notes for any special upgrade instructions. Ideally, rolling upgrades should be performed to minimize impact on production systems. A rolling upgrade is one where a single node is upgraded and brought back online before the next node is upgraded (cassandra-dtest covers both rolling and parallel upgrades).
Assuming each keyspace in the cluster has a replication factor of 3, you can continue to serve quorum reads and writes during a rolling upgrade, because a quorum of 2 replicas is still available while one node is down. Consistency of 'ALL' cannot be honored with a node down, so it's worth considering whether your application(s) can tolerate temporary failure. One alternative to 'ALL' is adjusting consistency requirements -- note that reads will return the latest data if (nodes_written + nodes_read) > replication_factor.
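To make the arithmetic concrete, here is a minimal sketch of the quorum size and the overlap rule above. The function names are ours for illustration and are not part of Cassandra or cassandra-dtest.

```python
def quorum(replication_factor):
    """Number of replicas that make up a quorum (a strict majority)."""
    return replication_factor // 2 + 1


def reads_see_latest(nodes_written, nodes_read, replication_factor):
    """True when every set of replicas read must overlap every set written."""
    return nodes_written + nodes_read > replication_factor


rf = 3
q = quorum(rf)                      # 2 replicas when rf is 3
live_replicas = rf - 1              # one replica offline during a rolling upgrade

print(q <= live_replicas)           # True: quorum is still reachable
print(reads_see_latest(q, q, rf))   # True: QUORUM writes plus QUORUM reads overlap
print(reads_see_latest(1, 1, rf))   # False: ONE plus ONE may miss the latest write
```

With a replication factor of 3, writing and reading at quorum touches 2 + 2 = 4 replicas, which is greater than 3, so at least one replica in every read has seen the latest write.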
During the rolling upgrade, as each node is taken offline, load will increase on the other replicas. A vnodes configuration on a well balanced cluster will evenly distribute this load to many nodes, but a single token per node configuration will cause a more significant impact on fewer machines.
After these considerations, an upgrade is started by choosing the first node and running 'nodetool drain'. Drain will write memtables to disk as sstables and cause the node to stop accepting writes. Next the node is stopped, and the software update is applied. The node is then started again. The Cassandra system log should be reviewed for any problems. A final step may include running 'nodetool upgradesstables' to write sstables in an updated format for the new software to use. Minor upgrades (e.g. upgrading from 1.2.8 to 1.2.9) typically do not require sstable changes, but there have been exceptions to this rule in the past.
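The per-node procedure can be sketched roughly as follows. This is an illustration only, not the cassandra-dtest implementation. The stop, update, and start steps depend on how Cassandra was installed, so they are passed in as caller-supplied hooks (hypothetical names), and only the two nodetool commands mentioned above are invoked directly.

```python
import subprocess


def upgrade_node(host, stop, update, start, upgrade_sstables=True,
                 run=subprocess.check_call):
    """Upgrade one node: drain, stop, update, start, optionally rewrite sstables.

    stop, update and start are hypothetical hooks supplied by the caller.
    run executes a command list and defaults to subprocess.check_call.
    """
    # Flush memtables to sstables and stop accepting writes.
    run(["nodetool", "-h", host, "drain"])
    stop(host)
    update(host)
    start(host)
    if upgrade_sstables:
        # Rewrite sstables in the format used by the new version.
        run(["nodetool", "-h", host, "upgradesstables"])


def rolling_upgrade(hosts, **kwargs):
    """Upgrade nodes one at a time so the remaining nodes stay online."""
    for host in hosts:
        upgrade_node(host, **kwargs)
```

The sketch leaves out waiting for the node to finish starting and reviewing the system log, both of which a real procedure needs before moving to the next node.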
Automated upgrade testing is informed by the process above and includes the key steps outlined: drain, shutdown, update, startup, and sstable upgrade. Since we don't have the watchful eye of a human, automated checks run along the way: the tests write data before the upgrades and read it back afterwards, looking for manifestations of bugs. With this lengthy background on upgrades, let's talk about the exciting new changes to the automated tests. To follow along, you can have a look at the upgrade test source code here: https://github.com/riptano/cassandra-dtest/blob/master/upgrade_through_versions_test.py
The high-level improvements have been to simplify the test upgrade procedure, extend it to as many scenarios as practical, and minimize the human intervention needed to keep tests updated. Earlier upgrade tests exclusively used a continuous upgrade process where a single test would perform multiple upgrades. They also relied on a hard-coded version upgrade path. These aspects of testing created a few difficulties. First, it's more difficult to determine the root cause of a problem when many upgrades happen in a single test. Second, tests were not always up to date with information about available versions. Rigidly testing a predefined version path can lead to blind spots in tests and give a false sense of security (if current code is not tested, a problem could be introduced and go undetected). All tests now find new versions automatically instead of using a predefined upgrade path. It should be noted that multi-upgrade scenarios are still useful: there have been bugs in the past where an issue did not become apparent until more than one upgrade had been performed. We've kept some multi-upgrade tests, and these should be a helpful resource going forward.
Single Upgrade Testing and Version Discovery
As an alternative to multi-upgrade testing, we added point-to-point testing where a single upgrade is vetted in each test. These single upgrade tests share the same basic upgrade procedure with their multi-upgrade peers; they just do less per test. This makes test failures easier to isolate and allows individual tests to run faster, which means faster feedback for developers on the version(s) they are interested in. Single upgrade tests now account for the majority of upgrade testing, and have been configured to run across the several version upgrade strategies detailed below. The best part, however, is that tests automatically see a new minor version release and test it without any code changes.
To lower reliance on a hard-coded upgrade path, we want to be able to know at test runtime what Cassandra versions are available and adapt testing to those versions. Basically when a new minor version appears we want to automatically test it without a person needing to remember anything. This is implemented by inspecting the git repository at test runtime to retrieve information about current versions. While this adds a bit to test complexity, the trade-off is having up-to-date version information without much intervention. This version information is found in git tags which are used to launch the test clusters at the relevant points in the code's history. Although the tests still require a simple list of active branches, there's no need to update code when new tags are published for known branches. When a new branch is introduced, we can simply add the branch name and the tests are ready to run. Similarly, if a branch is retired we simply remove the branch name and no other code changes are necessary to eliminate those tests.
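As a rough illustration of the idea, the sketch below picks the newest release tag for each active branch from a list of tag names. It assumes tags named like cassandra-1.2.17 and branches named like cassandra-1.2, and the function name is ours. The actual cassandra-dtest code may differ.

```python
import re

# Assumed tag naming: cassandra-<major>.<minor>.<patch>
TAG_PATTERN = re.compile(r"^cassandra-(\d+)\.(\d+)\.(\d+)$")


def latest_release_tags(tags, branches):
    """Map each active branch name to its newest release tag."""
    latest = {}
    for tag in tags:
        match = TAG_PATTERN.match(tag)
        if not match:
            continue  # skip release candidates and unrelated tags
        major, minor, patch = (int(part) for part in match.groups())
        branch = "cassandra-%d.%d" % (major, minor)
        if branch not in branches:
            continue  # retired or unknown branch
        # Compare patch numbers as integers so that 17 sorts after 9.
        if branch not in latest or patch > latest[branch][0]:
            latest[branch] = (patch, tag)
    return {branch: tag for branch, (patch, tag) in latest.items()}


# In practice the tag list could come from the output of `git tag --list`.
tags = ["cassandra-1.2.9", "cassandra-1.2.17", "cassandra-2.0.0-rc1",
        "cassandra-2.0.7", "unrelated-tag"]
print(latest_release_tags(tags, ["cassandra-1.2", "cassandra-2.0"]))
```

Only the list of branches is maintained by hand here. A newly published tag on a known branch is picked up the next time the tests run.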
Dynamic Test Creation and Strategies
Armed with up-to-date knowledge of available versions, we have to translate this into new tests. Since we don't know the test versions until runtime, we must create the tests at runtime as well (creating tests beforehand would require specific knowledge of the versions available, and would necessitate the manual updates we want to minimize). Luckily, the Cassandra test suite is written in Python, which provides many useful runtime features to help solve this problem. There were a few possible approaches, but we opted for using a base class and dynamically creating subclasses to represent each point-to-point upgrade. The base class captures the basic upgrade logic and the subclasses handle specific upgrades.
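A minimal sketch of this pattern, using Python's built-in type() to build subclasses at runtime, might look like the following. The class and function names are hypothetical, and the test body is a placeholder for the real upgrade logic.

```python
import unittest


class UpgradeTestBase(unittest.TestCase):
    """Shared upgrade logic; subclasses only supply the two versions."""
    start_version = None
    end_version = None

    def test_upgrade(self):
        if self.start_version is None:
            self.skipTest("base class has no versions to test")
        # Placeholder: a real test would start a cluster on start_version,
        # write data, upgrade to end_version, and read the data back.
        self.assertNotEqual(self.start_version, self.end_version)


def make_upgrade_test(start_version, end_version):
    """Create a TestCase subclass for one point-to-point upgrade."""
    name = "TestUpgrade_%s_to_%s" % (start_version, end_version)
    name = name.replace(".", "_").replace("-", "_")
    return type(name, (UpgradeTestBase,), {
        "start_version": start_version,
        "end_version": end_version,
    })


# Versions would be discovered at runtime; this pair is a placeholder.
for start, end in [("1.2.17", "2.0.7")]:
    cls = make_upgrade_test(start, end)
    # Expose the class at module level so a test runner can collect it.
    globals()[cls.__name__] = cls
```

Because each generated class has its own name, a failure report points directly at the version pair that broke.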
With the basic test template in place as a base class, we can programmatically generate test classes using some straightforward upgrade strategies:
- Release to same branch: upgrades from the last release on a branch, to the latest commit of that same branch. For example: upgrading from 1.2.17 to the most recent commit on the 1.2 series. Since each commit is a potential new version, we want to know that upgrading will work with the latest changes.
- Release to next branch: upgrades from the last release on a branch, to the latest commit of the next branch. For example: upgrading from 1.2.17 to the most recent commit on the 2.0 series. This is analogous to the strategy above, but for upgrading to the next series.
- Branch to next release: upgrades from the latest commit on a branch, to the latest release on the next branch. For example: upgrading from the most recent commit on the 1.2 series to the 2.0.7 release. This helps protect against potential bugs in the start version.
- Branch to next branch: upgrades from the latest commit on a branch, to the latest commit on the next branch. For example: upgrading from the most recent commit on the 1.2 series to the most recent commit on the 2.0 series. This helps protect against potential bugs in both the start version and the end version.
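The four strategies above can be expressed as a small generator over the ordered list of branches. This is a sketch under our own naming. A branch name used as a start or end point stands for the latest commit on that branch.

```python
def upgrade_pairs(branches, latest_release):
    """Yield (strategy, start, end) for every point-to-point upgrade.

    branches: branch names ordered oldest to newest.
    latest_release: maps each branch name to its newest release tag.
    A branch name as start or end means "latest commit on that branch".
    """
    for index, branch in enumerate(branches):
        release = latest_release[branch]
        yield ("release_to_same_branch", release, branch)
        if index + 1 < len(branches):
            next_branch = branches[index + 1]
            yield ("release_to_next_branch", release, next_branch)
            yield ("branch_to_next_release", branch, latest_release[next_branch])
            yield ("branch_to_next_branch", branch, next_branch)


branches = ["cassandra-1.2", "cassandra-2.0"]
releases = {"cassandra-1.2": "cassandra-1.2.17",
            "cassandra-2.0": "cassandra-2.0.7"}
for strategy, start, end in upgrade_pairs(branches, releases):
    print(strategy, start, "->", end)
```

The newest branch has no next branch, so it only produces a release-to-same-branch pair. Adding or retiring a branch changes the generated pairs without touching the strategy logic.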
That's about all there is to share about the current state of upgrade testing. If you made it this far and found this post interesting, you should know that DataStax is hiring (you might be interested in this posting in particular). Don't forget you can find the entire cassandra-dtest suite on GitHub (did I mention we like pull requests?), and you can see the latest test results at cassci.datastax.com.