Get in the Ring with Cassandra and EC2
date: March 17, 2014
This is the first in a series of posts about making good choices for hosting Cassandra on the Amazon Web Services (AWS) environment. So many operations are moving to the cloud, and AWS is the common choice for many of you. We will cover how to choose the right Elastic Compute Cloud (EC2) instances and then scale out your cluster, address whether an Amazon Machine Image (AMI) is right for your operations, elaborate on some tricky details, and finally, discuss disaster planning and prevention and how to recover from a disaster if the worst happens.
This series of posts should serve as a general advice and best practices guide specifically for the Amazon Web Services (AWS) environment. Liberties were taken in some cases to describe local hardware as well as other clouds, but were generally constrained to information that was highly relevant to Elastic Compute Cloud (EC2) deployments.
Some anecdotal asides can be found in this document to highlight outlined recommendations. We didn’t want to further complicate this document with extraneous detail, but we also felt these details were highly relevant since they are more common than expected, which is why they’ve been included in indented paragraphs.
Choosing the Right Instance Size (in EC2)
For Apache Cassandra installations in Amazon Web Services' (AWS) Elastic Compute Cloud (EC2), an m1.large is the smallest instance that we generally recommend running on. m1.small or m1.medium may work just long enough to start Cassandra and place sample data, but their small vCPU count, along with their small memory allocation and lighter network capacity place further restriction on all the Cassandra requirements. When you add the issue of multi-tenancy of these small devices, the complications become exponential. The reason for choosing Cassandra as a database is to keep the database layer simple. Don’t purposefully complicate things with this choice.
We’ve had customers in the past that developed, tested, and went into production with m1.small instances, before the advent of m1.medium. They began getting unique stack traces of exceptions that didn’t make sense. After being unable to find any possible fix for 2 weeks, we finally drew the line and had them try migrating off to m1.larges even though upper management pushed back. The only way we got cooperation was admitting it was Cassandra’s fault and that the m1.large upgrade was purely to save their system as a workaround. Sure enough, as soon as the upgrade happened, all issues went away. We never had another discussion around instance sizing and we never saw those stacktraces or issues again. m1.small may seem like a good choice early in development, but it will never be a good choice for Cassandra.
With the m1.large instance type, nodes start pretty comfortably, and run a development load without major issues on the cluster. By choosing a node of this size, you're guaranteed to have fully functioning nodes, rather than nodes that exhibit intermittent errors under no/low load. Don’t borrow trouble, don’t spend time debugging failing nodes because of instance size. Focus on your project, not the technology running it.
We generally recommend m1.xlarge for low-end production usage, but understand that m1.large will be sufficient in some cases.
We had one customer who went almost two years with approximately 20 m1.large instances in production - before the stress on the machine’s Java Virtual Machine (JVM) was too much during garbage collection (GC) periods. They moved their data to around half the number of m1.xlarge instances, while keeping the same replication factor. The process went smoothly and their throughput and latency got better since the memory usage was in check on the larger nodes. They were able to complete the move in less than a week with zero downtime, because they swapped out nodes by bootstrapping new nodes, and then removing the old nodes after the new ones were in place. There was nothing special needed, just Cassandra as usual.
So, feel free to run with m1.large instances, but remember that once you have more than 8 nodes, you should likely swap them for half as many m1.xlarge instances. An added benefit is that this generally doesn’t incur any additional cost.
If you have more money to spend and want lower latencies, c3.2xlarge would be the next best upgrade. With 2x vCPUs, approximately 4x EC2 Compute Units, and 2x80 SSDs, the c3.2xlarge will be more performant than the m1.xlarge. Do keep in mind, however, that there is a 25% cost increase with only 10% of the disk space of an m1.xlarge. This means that you will be expanding the cluster horizontally at a higher instance cost, as you need to keep up with an increase in data. Only use this option when your data size isn’t too large and latency is more important than cost.
The largest recommended node size for Cassandra is a i2.2xlarge. This is three times more expensive than a c3.2xlarge instance, but comes with 4x the memory and 10x the disk capacity. The instance store is also SSDs, an important feature. All other attributes are about the same as the c3.2xlarge instance type. Large enterprise customers with massive amounts of data typically chose this instance type.
Even though the above information defines AWS EC2 choices, the same recommendations translate to other cloud providers. An important attribute to note is that 8 GB of memory is the minimum recommended configuration for low-end production clusters (although 16 GB is recommended). The largest JVM heap size should be restricted to 8 GB, because of JVM garbage collection limitations, so any additional memory is used by Cassandra’s in-memory architecture and file caching. If you have an application that makes use of the additional memory, you may need a memory-optimized machine.
Another attribute to consider is the number of CPUs. Multiple CPUs are also important if heavy write loads are expected, but ultimately Cassandra is limited by disk contention. One of the most critical attributes to include in your cluster instances are SSDs, if possible. SSDs are preferred over "rotational rust" to reduce disk contention, but cost may ultimately be the deciding factor. If SSDs are out of your price range, expanding a cluster horizontally increases your cluster’s overall CPU count linearly as well, increasing your write throughput as well as increasing your read throughput and disk capacity.
Scaling the Right Way
Now that you've chosen the correct instance size to match your spending limit and replication topology, you need to understand how to grow the cluster when the time comes to expand it. With traditional systems, scaling upwards with faster machines, bigger disks and more memory is the norm. With Cassandra, scaling horizontally on commodity machines is recommended for a number of reasons.
With Cassandra, the innate fault-tolerance lies within the replication topology and concepts. Having a single node is a single point of failure (SPOF), whereas having multiple nodes can eliminate this issue. This is a very basic and fundamental concept: having more Cassandra nodes implies having higher fault-tolerance. Considering the reverse approach: having larger machines with more data implies less fault-tolerance. Finding a balance between machine load and number of machines isn't trivial, but it’s also not a blocking issue when getting your new cluster up and running.
In the beginning, three nodes with a replication factor (RF) of 3, is recommended. This allows you to read and write at a consistency level of QUORUM if you find strong consistency to be necessary, as this ensures that n/2+1 nodes (2 nodes, if using an RF=3) will accept the requests and provide for non-stale data. When you need to add more nodes, just keep the same replication factor, so that you don’t to re-engineer your requests.
Along with scaling horizontally, Cassandra makes use of the additional CPU cores by allowing more writes to be thrown at the cluster for faster processing. Adding additional nodes also implies having more memory to hold the bloom filters that will ultimately lead to less on-disk seeks and lower read latencies.
The actual method of scaling is as easy as writing about it: change your cassandra.yaml to set auto_bootstrap to True, point the seed_list to the existing seed_list that exists within the cluster, set the listen_address and, perhaps, broadcast_address to the appropriate addresses of the new node. Then start the new Cassandra nodes 2 minutes apart. With non-vnode machines, you'll also need to set the inital_token value to ensure it balances the ring without token conflicts; with vnodes, this is not necessary. Upon bootstrapping, the nodes join the cluster and start accepting new writes, but not reads, while the data they will be responsible for is streamed to them from new nodes. Later, cleanup can be called on each node to remove stale data from non-responsible nodes; it is not immediately required.
In the rare instance where scaling hardware upwards, rather than horizontally, is necessary, the same steps above are required. The only additional step is removing the old nodes with the decommission operation, one at a time to lessen impact of the operations on the cluster. When a decommission is called on a node, the node streams all data that it is responsible for to the new nodes that are taking over responsibility.
Performance through Scaling
Adrian Cockcroft wrote an insightful, detailed post about multi-tenancy issues. The details in the article may not be relevant to all cloud providers, but the concepts remain the same.This article is relevant to anyone running in a cloud environments who is trying to understand why tests in cloud environments are inconsistent.
A few anti-patterns can also be found here.
Ultimately, the takeaway for Cassandra is that scaling will be linear. Another report that goes a bit more in-depth with testing and comparisons is available here. In the latter you'll notice that Cassandra is linearly scalable while other systems are not.
One question that used to come up all the time in Support channels was one about trying to guess the throughput of the cluster, given so many machines. This is a complicated question, since different workloads experience different throughputs. Suggested methodology is to run a sample workload against 3 nodes, looking for the point that throughput maxes out and latency becomes an issue for your application. Running the same workload against 6 nodes will give you a second interpolation point. Since Cassandra scales linearly, finding the linear function is quite easy and will be valid in the early stages of your project.
Once the cluster workload begins to reach performance limits, you'll be able to easily spot them, especially if they're still in range of the initial calculations. Saving these metrics before expanding will prove useful later for making further well-informed scaling predictions.
I would like to take this time to note that we’re unsure if Cassandra is infinitely scalable, since it’s impossible to verify this through tests. We are, however, certain that it scales linearly up to 1,000 nodes. We have seen issues on clusters past the 100 node threshold, but once we patched up a few Cassandra internals and made tools better handle this size of information we haven’t seen many issues in this range of cluster sizes. We have confidence that clusters between 100 and 1,000 nodes nodes should work as expected. As larger Cassandra clusters are tested in both development and production, we’ll be able to confidently verify if this linear scalability continues to hold true.