Apache Cassandra™ 2.0

Planning an Amazon EC2 cluster

Use the DataStax AMI for clusters in a single availability zone.

For EC2 clusters that span multiple regions and availability zones, install Cassandra on your EC2 instances as described in Installing Cassandra Debian packages, and then configure the cluster as a multiple data center cluster using the EC2MultiRegionSnitch.
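
As a rough illustration of the multiple-region setup described above, the per-node settings in cassandra.yaml might look like the following sketch. All IP addresses are placeholders, not values from this guide. The fragment assumes the usual pairing for this snitch: the private IP as listen_address and the public IP as broadcast_address. In the configuration file the snitch value is spelled Ec2MultiRegionSnitch.

```yaml
# cassandra.yaml fragment for a single node (sketch; every IP address is a placeholder)
endpoint_snitch: Ec2MultiRegionSnitch

# Private IP of this instance, used for traffic inside the region.
listen_address: 10.0.0.12

# Public IP of this instance, used for traffic between regions.
broadcast_address: 203.0.113.12

seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # Public IPs of the seed nodes; this example assumes one seed in each of two regions.
          - seeds: "203.0.113.12,198.51.100.7"
```

With this snitch, each region is treated as a data center and each availability zone as a rack, so keyspace replication settings would refer to the region names.
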
Note: OpsCenter provides several useful features for adding nodes and clusters.

Use only AMIs from a trusted source. Random AMIs pose a security risk and may perform slower than expected because of the way the install is configured for EC2. The following are examples of trusted AMIs:

For production Cassandra clusters on EC2, use these guidelines to choose an instance type:
  • Development and light production: m1.large
  • Moderate production: m1.xlarge
  • SSD production with light data: c3.2xlarge
  • Largest heavy production: m3.2xlarge (PV) or i2.2xlarge (HVM)

EBS volumes are not recommended

EBS volumes are not recommended for Cassandra data storage volumes for the following reasons:

  • EBS volumes contend directly for network throughput with standard packets. This means that EBS throughput is likely to fail if you saturate a network link.
  • EBS volumes have unreliable performance. I/O performance can be exceptionally slow, causing the system to backlog reads and writes until the entire cluster becomes unresponsive.
  • Adding capacity by increasing the number of EBS volumes per host does not scale. You can easily surpass the ability of the system to keep effective buffer caches and concurrently serve requests for all of the data it is responsible for managing.

For more information and graphs related to ephemeral versus EBS performance, see the blog article Systematic Look at EC2 I/O.

Storage recommendations for Cassandra 1.2 and later

Cassandra 1.2 and later support JBOD (just a bunch of disks). JBOD excels at tolerating partial failures of your disk array. Configure how a node responds to a disk failure using the disk_failure_policy setting in the cassandra.yaml file. Additional information is available in the Handling Disk Failures In Cassandra 1.2 blog.

Note: Cassandra JBOD support allows you to use standard disks. However, RAID 0 may provide better throughput because it stripes blocks across the devices, so writes go to the disks in parallel instead of serially to a single disk.
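
The JBOD configuration described above can be sketched as the following cassandra.yaml fragment. The mount points are hypothetical, and best_effort is shown only as one possible choice; the policy you want depends on how the node should behave after a disk failure.

```yaml
# cassandra.yaml fragment (sketch; the mount points are hypothetical)

# JBOD: list one data directory per physical disk.
data_file_directories:
    - /mnt/disk1/cassandra/data
    - /mnt/disk2/cassandra/data

# How the node reacts when a disk fails:
#   stop        - shut down gossip and Thrift, leaving the node effectively dead
#   best_effort - stop using the failed disk and serve requests from the remaining data
#   ignore      - ignore fatal errors and let requests fail, as before Cassandra 1.2
disk_failure_policy: best_effort
```

Note that with best_effort a node may return obsolete data at consistency level ONE, because some SSTables are no longer readable.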

Storage recommendations for the DataStax AMI and Cassandra 1.1 and earlier

Configure the ephemeral disks as a RAID 0 volume, and put both the data directory and the commit log on that volume. This has proved to be better in practice than putting the commit log on the root volume (which is also a shared resource). For more data redundancy, consider deploying your Cassandra cluster across multiple availability zones or using EBS volumes to store your Cassandra backup files.
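
As a minimal sketch of the layout described above, the relevant cassandra.yaml settings could point both directories at the same RAID 0 volume. The /raid0 mount point and the directory names are assumptions for illustration; use whatever mount point your RAID 0 array actually has.

```yaml
# cassandra.yaml fragment (sketch; /raid0 is an assumed mount point for the RAID 0 array)

# Data files on the striped ephemeral volume.
data_file_directories:
    - /raid0/cassandra/data

# Commit log on the same volume rather than on the root volume.
commitlog_directory: /raid0/cassandra/commitlog
```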