First, there are three broad categories of storage: instance store ("ephemeral"), EBS (Elastic Block Store), and S3.
We'll come back to instance store in a minute. Let's look at EBS and S3.
EBS is mountable storage; it can be mounted as a device to an EC2 instance. Multiple EBS "drives" can be mounted to one EC2 instance, and they can then be striped and/or mirrored into a larger volume using software RAID. It is also network-attached storage, so higher latency than local disk can be expected. There are ways to mitigate the latency, as I'll discuss shortly.
EBS can only be mounted to an EC2 instance in the same availability zone. This characteristic is what gives S3 a role - S3 can be accessed from anywhere. S3 is a web-based storage service that is replicated across a region, making it available in different availability zones. It is possible to copy S3 data from one region to another, so in theory, that data could be available across regions. However, it is not mountable, so its primary use is to store files that must persist, such as snapshots. It is important to keep in mind that it is a storage service, not a device.
Whereas EBS will have no consistency delays, S3 operates under eventual consistency, presumably because of the replication. Both EBS and S3 are persistent, meaning that they exist independent of an EC2 instance. EBS can be thought of as an external hard drive, while S3 is more akin to Dropbox. A brief mention of Glacier is apt here; Glacier is S3 for archived files. It takes 3 to 5 hours to "thaw" previously "frozen" files, but at a cheaper storage price than S3. However, retrieval rates should also be calculated and considered before formalizing the upload and time to download process.
EBS is further refined into two categories: Standard and Provisioned IOPS (PIOPS). Standard is as described above. Note that, since multiple EBS drives can be attached to a single EC2 instance, those drives can be striped to improve performance. EBS-optimized is an important setting on your instance that refers to how the instance will attach your EBS device, but let's discuss that a little further down.
Because resources like EBS are shared resources, there will be variable performance, depending on the competing processes from various EC2 instances. In the past, AWS customers raised the issue of wanting more reliability about the "slice of the pie" they would get. Provisioned IOPS is the resulting solution. This EBS category is tunable and predictable. The desired number of IOPS is selected, per volume, and that rate is guaranteed. Note that the PIOPS guaranteed rate is 90% of what you're paying for 99.9% of the time. This is important, since it changes the way you do math on RAIDing your devices. Multiply (speed * 0.9) * numOfDevices, not speed * numOfDevices to calculate the capacity that you need.
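To make that arithmetic concrete, here is a minimal sketch of the calculation. The function names and the example numbers are hypothetical, and it assumes that striping scales IOPS linearly across volumes, which is an idealization. The only figure taken from the text is the 90% guaranteed rate.

```python
# Fraction of the provisioned rate that PIOPS guarantees, per the text (90%).
GUARANTEED_PERCENT = 90


def guaranteed_iops(provisioned_iops_per_volume, num_volumes):
    """Guaranteed IOPS for a stripe: (speed * 0.9) * numOfDevices.

    Assumes ideal linear scaling across striped volumes.
    """
    return provisioned_iops_per_volume * num_volumes * GUARANTEED_PERCENT / 100


def volumes_needed(target_iops, provisioned_iops_per_volume):
    """Smallest number of volumes whose guaranteed rate meets the target."""
    # Integer ceiling division avoids floating-point rounding surprises.
    numerator = target_iops * 100
    denominator = provisioned_iops_per_volume * GUARANTEED_PERCENT
    return -(-numerator // denominator)


if __name__ == "__main__":
    # Four hypothetical volumes provisioned at 1000 IOPS each.
    print(guaranteed_iops(1000, 4))   # naive math would say 4000
    # Volumes required to be sure of 4000 IOPS at 1000 IOPS per volume.
    print(volumes_needed(4000, 1000))
```

The point of planning against the guaranteed figure rather than the provisioned one is that a stripe sized with the naive formula can fall short of its target even while AWS is meeting its stated guarantee.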
Coming back to EBS-optimized, realize that if you have more IOPS, you probably need more pipe to push those bits through. EBS-optimized can speed up EBS performance by providing more network bandwidth between the EC2 instance and the EBS network-attached storage. It is important to recognize that if PIOPS is chosen, EBS-optimized should also be selected, so that you aren't trying to cram your PIOPS through the standard shared network pipe. This is especially important for Cassandra, as its processes tend to be quite "chatty". It should be mentioned, for completeness' sake, that you can also select EBS-optimized for standard EBS.
It is important to realize that networking is different on each EC2 instance size. Writes will be faster on an m1.xlarge than an m1.large, not because of CPU, but because of network.
All in all, EBS-optimized is an option that you should investigate yourself. It is not necessarily intended to boost performance, but only to lessen the variability in throughput, using dedicated pipes for the EBS instances, rather than one per EC2 instance. There are certainly cases where normal EBS has been observed to outperform EBS-optimized, especially in smaller EC2 instance types.
Amazon is rather cagey about how these improvements are implemented, so it is difficult to speculate about how they get these improved options. Suffice it to say, you have to pay for the improved performance.
Let's go back to the instance store, or ephemeral storage. This storage is physically attached, making its performance better than EBS for Cassandra clusters. However, it does evaporate when the EC2 instance is shut down. Also, AWS makes a distinction between physically-attached and direct-attached, which I'll address in a moment. I believe that physically-attached storage is still a shared resource within a rack, whereas direct-attached storage is only available to the EC2 instance that it is attached to.
The ephemeral storage is either rotating disk or SSD. In another blog post, I'll discuss the various AWS machines and their relative characteristics. Ephemeral storage can also be RAID configured to improve performance (the main thing that Cassandra users are trying to improve). The latency of ephemeral storage is considerably lower than EBS, and for this reason, most Cassandra clusters built in AWS should use ephemeral storage rather than EBS.
A puzzling question I had at the end of this research was: is EBS using rotating disks or SSDs? The answer is that AWS doesn't really publish information about that, and ultimately it doesn't matter which type they are using. If you wish to increase the performance of EBS storage, then EBS-optimized with PIOPS is the only way to guarantee a certain level of performance. Whether AWS is achieving that using SSDs or additional/faster network cards is immaterial.
Finally, AWS has started offering beefier machines that have "direct-attached" storage. My surmise is that this is not a shared resource like instance store, and so the performance can be not only faster but also more reliable. Even if you run PIOPS with EBS-optimized, you may pay more for throughput that you ultimately won't use. Since Cassandra is scaled to use nodes that match throughput, the added complexity of network scaling will amount to wasted money. EBS devices also introduce a single point of failure (SPOF), something that Cassandra prides itself on not having as part of its design. Certainly, schemes to spread EBS devices amongst multiple availability zones could alleviate this problem, but why choose this in the first place?
So really, unless you want to add more complexity for your operations team, choose EC2 instances with ephemeral or direct-attached storage, and size your instances appropriately. DataStax publishes a guide to scaling your cloud instances in this handy Reference Architecture. Scaling out your application horizontally, when necessary, is another way to extend your performance, while adding redundancy for free. You can explore the various EC2 instance types yourself here.
As an aside, there are interesting arguments made by Adrian Cockcroft, formerly of Netflix, about the performance issues that arise from multi-tenancy, i.e., different applications running on EC2 instances within the same rack. See his great blog post about this, written a while back but still relevant. You'll be glad you took the time to read it!
- Adrian Cockcroft: http://perfcap.blogspot.com/2011/03/understanding-and-using-amazon-ebs.html