DataStax Developer Blog

Ideology and Testing of a Resilient Driver

By Joaquin Casares - May 20, 2013

Recently, DataStax released version 1.0.0 of our Java Driver for Apache Cassandra. For the Java Driver, there was a large focus on ensuring that optimizations were handled gracefully, that data was successfully written and verified against a Cassandra instance, and, most importantly, that the driver could handle real-world cluster changes and issues as intended.

We began testing our work in this area with our load balancing policies. The Java Driver uses the LoadBalancingPolicy interface and provides three load balancing policies by default: RoundRobinPolicy (the default), TokenAwarePolicy, and DCAwareRoundRobinPolicy.
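As a quick illustration of how a policy is chosen (a minimal sketch; the contact point is a placeholder, and the policy is set once, when the Cluster is built):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.policies.RoundRobinPolicy;

    public class PolicyExample {
        public static void main(String[] args) {
            // RoundRobinPolicy is the default, but any LoadBalancingPolicy
            // can be supplied explicitly when the Cluster is built.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")  // placeholder contact point
                    .withLoadBalancingPolicy(new RoundRobinPolicy())
                    .build();

            Session session = cluster.connect();
            // ... issue queries through the session ...
            cluster.shutdown();
        }
    }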

RoundRobinPolicy

For the RoundRobinPolicy, testing was fairly straightforward. We not only tested that subsequent requests were spread across all active nodes, but also that the active node list stayed accurate and was updated after topological changes. This can be done without worrying about where the data ultimately lives, since Cassandra uses coordinator nodes when servicing requests.

The client can connect to any cluster member, which acts as the coordinator for the requests it receives. If a request fails, or the acting coordinator node dies, a subsequent request using the RoundRobinPolicy will choose another peer node as the coordinator and should complete successfully, provided the specified ConsistencyLevel can still be met.
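One way to sanity-check the host list the policy round-robins over is through the cluster metadata. A minimal sketch, assuming an already-built Cluster instance named cluster:

    import com.datastax.driver.core.Host;

    // Every host reported here is a peer the policy may pick as coordinator.
    for (Host host : cluster.getMetadata().getAllHosts()) {
        System.out.println(host.getAddress()
                + " (datacenter: " + host.getDatacenter()
                + ", rack: " + host.getRack() + ")");
    }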

TokenAwarePolicy

The TokenAwarePolicy is a more interesting case, since this policy also requires a child policy that dictates which nodes are local and which are remote, as far as the driver is concerned. For a single datacenter, the TokenAwarePolicy chooses a replica of the requested data as the coordinator, in hopes of cutting down latency by avoiding the typical coordinator-to-replica hop.
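Wrapping a child policy looks roughly like this (a sketch, assuming it runs inside a main method; here the child is the default RoundRobinPolicy, and the contact point is a placeholder):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.policies.RoundRobinPolicy;
    import com.datastax.driver.core.policies.TokenAwarePolicy;

    // TokenAwarePolicy routes a request to a replica for its partition key
    // whenever it can; the wrapped child policy orders the remaining hosts.
    Cluster cluster = Cluster.builder()
            .addContactPoint("127.0.0.1")  // placeholder contact point
            .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
            .build();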

To test this policy we tried multiple setups with different child policies as well as multiple datacenters, some of which were programmed to disappear altogether, simulating full datacenter outages. Although we did find a few possible minor issues in this area, fail-over worked as expected. There may still be room to optimize fail-over further, but two things are certain: token-aware routing works, and fail-over scenarios continued without a hiccup.

Cassandra’s token handling is managed internally, abstracted away from the user, in order to provide a simple and robust environment for both operations and development teams. It’s this same simple and robust consistent hashing that allows the TokenAwarePolicy to optimize requests by contacting the ideal node. And of course, if this ever fails, the specified child policy takes over as the active LoadBalancingPolicy while the search continues for a replica that is both alive and responsive.

DCAwareRoundRobinPolicy

The DCAwareRoundRobinPolicy takes the same fundamentals as the RoundRobinPolicy but introduces support for multiple datacenters. To test it we followed suit with full datacenter outage cases and fail-over assertions, much like our TokenAwarePolicy tests. Although this may not initially seem highly beneficial, the case where datacenters exist in different parts of the world highlights why this is the best policy when dealing with remote datacenters.

Let’s take the use case of a Cassandra cluster that is split across three Amazon regions: US-East, US-West, and EU-West, with a keyspace replication strategy of {'US-East': 3, 'US-West': 1, 'EU-West': 1}, where US-West and EU-West primarily serve as backup or fail-over datacenters for the given keyspace.
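In CQL terms, such a keyspace might be created along these lines (a sketch, assuming it runs inside a main method; the keyspace name is a placeholder, and the datacenter names must match what your snitch reports):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    Cluster cluster = Cluster.builder()
            .addContactPoint("127.0.0.1")  // placeholder contact point
            .build();
    Session session = cluster.connect();

    // Three replicas in US-East; one each in US-West and EU-West.
    session.execute(
            "CREATE KEYSPACE my_keyspace WITH replication = "
            + "{'class': 'NetworkTopologyStrategy', "
            + "'US-East': 3, 'US-West': 1, 'EU-West': 1}");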

In the ideal situation, you wouldn’t want to spread your requests across a WAN since a climb in latency will be apparent for requests that have to cross the Atlantic Ocean. Instead, you would want to constrain your application to contact only the nodes on the LAN to cut down on this latency.

This does not mean that data will only stay in US-East for this specific replication strategy, but as far as your driver is concerned, US-East is the primary datacenter. In the background, every write is replicated to the other datacenters without any developer interaction. (Although, from an operations standpoint, routine repairs are still required to guarantee consistency.)

Even if datacenters are not physically separated by great distances, it’s best to choose a single datacenter and keep writing to just that one. This ensures that all requests made to the cluster, for a given session, work over the same data. This is also how fail-over works in the DCAwareRoundRobinPolicy: if US-East ever falls over, you’re guaranteed to choose a single datacenter as the interim primary until your primary datacenter returns, thus providing a higher level of consistency at a lower latency. The DCAwareRoundRobinPolicy takes all of this into consideration so that even new users can store data efficiently without worrying too much about “how” it’s being stored, since this has already been implemented and provided as part of the driver.
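Pinning the driver to US-East as its local datacenter might look like this (a sketch, assuming it runs inside a main method; the contact point is a placeholder, and the datacenter name must match the name your snitch reports):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;

    // Requests round-robin over the US-East nodes only; the other
    // datacenters still receive replicated data, they just are not
    // contacted directly by this client.
    Cluster cluster = Cluster.builder()
            .addContactPoint("10.1.1.1")  // placeholder contact point in US-East
            .withLoadBalancingPolicy(new DCAwareRoundRobinPolicy("US-East"))
            .build();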

What’s Next?

For our next major release, we’ve already begun our Jenkins integration, which will ensure that our test suite, which already has a runtime of more than an hour, is run daily against our master branch. This will ensure continuous stability as both our test code and our product code grow.

We will also be building long-running duration tests to show that the driver remains accurate, performant, and stable. Even though our collection of 81 tests, covering 76% of the code, takes a full hour to validate, these tests all run on small, disposable clusters exercising very specific scenarios. When we start testing our driver in 72+ hour duration tests, in conjunction with simulated chaotic environments, we will continue to ensure that data is always valid on both sides of the pipe and that our fault-tolerant database has a fault-tolerant client driver running on the new binary protocol.

Do keep an eye out for updates around our client drivers as we continue to grow our documentation, our examples, and our great team.

And if all that’s not enough, you can always meet with us face-to-face at the Cassandra Summit 2013 “Birds of a Feather” and “Stump the Experts” sessions!


