Apache Cassandra 1.1 Documentation

Planning a Cassandra Cluster Deployment

This document corresponds to an earlier product version. Make sure you are using the documentation version that corresponds to your product version.

Latest Cassandra documentation | Earlier Cassandra documentation

When planning a Cassandra cluster deployment, you should have a good idea of the initial volume of data you plan to store and a good estimate of the typical application workload. After reading this section, it is recommended that you read Anti-patterns in Cassandra.

Selecting Hardware for Enterprise Implementations

As with any application, choosing appropriate hardware depends on selecting the right balance of the following resources: memory, CPU, disks, number of nodes, and network.

Memory

The more memory a Cassandra node has, the better its read performance. More RAM allows for larger cache sizes and reduces disk I/O for reads. More RAM also allows memory tables (memtables) to hold more recently written data. Larger memtables mean fewer SSTables are flushed to disk and fewer files must be scanned during a read. The ideal amount of RAM depends on the anticipated size of your hot data.

  • For dedicated hardware, the optimal price-performance sweet spot is 16GB to 64GB; the minimum is 8GB.
  • For virtual environments, the optimal range may be 8GB to 16GB; the minimum is 4GB.
  • For testing light workloads, Cassandra can run on a virtual machine as small as 256MB.
  • Java heap space should be set to a maximum of 8GB or half of your total RAM, whichever is lower. (A larger heap causes longer, more intense garbage collection pauses.)
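
A minimal sketch of this heap-sizing rule (Python; the function name and sample RAM values are illustrative, while the 8GB cap and half-of-RAM figures come from the guideline above):

# Heap-sizing rule: at most 8GB, or half of total RAM, whichever is lower.
def recommended_heap_gb(total_ram_gb):
    return min(8, total_ram_gb / 2.0)

print(recommended_heap_gb(16))  # 8.0 -> 8GB heap on a 16GB node
print(recommended_heap_gb(12))  # 6.0 -> 6GB heap on a 12GB node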

CPU

Insert-heavy workloads are CPU-bound in Cassandra before becoming memory-bound. Cassandra is highly concurrent and uses as many CPU cores as available.

  • For dedicated hardware, 8-core processors are the current price-performance sweet spot.
  • For virtual environments, consider using a provider that allows CPU bursting, such as Rackspace Cloud Servers.

Disk

Disk requirements depend heavily on your workload, so it is important to understand how Cassandra uses the disk. Cassandra writes data to disk for two purposes:

  • All data is written to the commit log for durability.
  • When thresholds are reached, Cassandra periodically flushes in-memory data structures (memtables) to SSTable data files for persistent storage of column family data.

Commit logs receive every write made to a Cassandra node, but are only read during node start up. Commit logs are purged after the corresponding data is flushed. Conversely, SSTable (data file) writes occur asynchronously and are read during client look-ups. Additionally, SSTables are periodically compacted. Compaction improves performance by merging and rewriting data and discarding old data. However, during compaction (or node repair), disk utilization and data directory volume can substantially increase. For this reason, DataStax recommends leaving an adequate amount of free disk space available on a node (50% [worst case] for tiered compaction, 10% for leveled compaction).
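
As a rough illustration of these free-space recommendations (a sketch; the volume size is an illustrative assumption, and the 50% and 10% figures are the recommendations above):

# Maximum recommended data load per data volume, given the compaction headroom above:
# size-tiered compaction may need up to ~50% free space (worst case), leveled ~10%.
def max_recommended_load_gb(volume_size_gb, strategy="SizeTieredCompactionStrategy"):
    headroom = 0.50 if strategy.startswith("SizeTiered") else 0.10
    return volume_size_gb * (1 - headroom)

print(max_recommended_load_gb(1000))                               # 500.0 GB
print(max_recommended_load_gb(1000, "LeveledCompactionStrategy"))  # 900.0 GB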

Recommendations:

  • When choosing disks, consider both capacity (how much data you plan to store) and I/O (the write/read throughput rate). Most workloads are best served by using less expensive SATA disks and scaling disk capacity and I/O by adding more nodes (with more RAM).

  • Solid-state drives (SSDs) are the recommended choice for Cassandra. Cassandra's sequential, streaming write patterns minimize the undesirable effects of write amplification associated with SSDs. This means that Cassandra deployments can take advantage of inexpensive consumer-grade SSDs.

  • Ideally Cassandra needs at least two disks (if using HDDs), one for the commit log and the other for the data directories. At a minimum the commit log should be on its own partition.

  • Commit log disk - this disk does not need to be large, but it should be fast enough to receive all of your writes as appends (sequential I/O).

  • Data disks - use one or more disks and make sure they are large enough for the data volume and fast enough to both satisfy reads that are not cached in memory and to keep up with compaction.

  • RAID - compaction can temporarily require up to 100% of the in-use disk space on a single data directory volume. This means that when you approach 50% of disk capacity, you should use RAID 0 or RAID 10 for your data directory volumes. RAID also helps smooth out I/O hotspots within a single SSTable.

    • Use RAID 0 if disk capacity is a bottleneck and rely on Cassandra's replication capabilities for disk failure tolerance. If you lose a disk on a node, you can recover lost data through Cassandra's built-in repair.
    • Use RAID 10 to avoid large repair operations after a single disk failure, or if you have disk capacity to spare.
    • Generally RAID is not needed for the commit log disk because replication adequately prevents data loss. If you need the extra redundancy, use RAID 1.
  • Extended file systems - On ext2 or ext3, the maximum file size is 2TB even using a 64-bit kernel. On ext4 it is 16TB.

    Because Cassandra can use almost half your disk space for a single file, use XFS when raiding large disks together, particularly if using a 32-bit kernel. XFS file size limits are 16TB max on a 32-bit kernel, and essentially unlimited on 64-bit.

Number of Nodes

The amount of data on each disk in the array isn't as important as the total size per node. Using a greater number of smaller nodes is better than using fewer larger nodes because of potential bottlenecks on larger nodes during compaction.

Network

Since Cassandra is a distributed data store, it puts load on the network to handle read/write requests and replication of data across nodes. Be sure to choose reliable, redundant network interfaces and make sure that your network can handle traffic between nodes without bottlenecks.

  • Recommended bandwidth is 1000 Mbit/s (Gigabit) or greater.
  • Bind the inter-node communication interface (listen_address) to a specific NIC (Network Interface Card).
  • Bind the client (Thrift RPC) interface (rpc_address) to another NIC.

Cassandra is efficient at routing requests to replicas that are geographically closest to the coordinator node handling the request. Cassandra will pick a replica in the same rack if possible, and will choose replicas located in the same data center over replicas in a remote data center.

Firewall

If using a firewall, make sure that nodes within a cluster can reach each other on the required ports. See Configuring Firewall Port Access for the list of ports.

Note

Generally, when you have firewalls between machines, it is difficult to run JMX across a network and maintain security. This is because JMX connects on port 7199, handshakes, and then uses any port within the 1024+ range. Instead, use SSH to execute commands remotely and connect to JMX locally, or use DataStax OpsCenter.

Planning an Amazon EC2 Cluster

DataStax provides an Amazon Machine Image (AMI) to allow you to quickly deploy a multi-node Cassandra cluster on Amazon EC2. The DataStax AMI initializes all nodes in one availability zone using the SimpleSnitch. If you want an EC2 cluster that spans multiple regions and availability zones, do not use the DataStax AMI. Instead, install Cassandra on your EC2 instances as described in Installing Cassandra Debian Packages, and then configure the cluster as a multiple data center cluster.

Use the following guidelines when setting up your cluster:

EBS volumes are not recommended for Cassandra data volumes. Their network performance and disk I/O are not good fits for Cassandra for the following reasons:

  • EBS volumes contend directly for network throughput with standard packets. This means that EBS throughput is likely to fail if you saturate a network link.

  • EBS volumes have unreliable performance. I/O performance can be exceptionally slow, causing the system to backload reads and writes until the entire cluster becomes unresponsive.

  • Adding capacity by increasing the number of EBS volumes per host does not scale. You can easily surpass the ability of the system to keep effective buffer caches and concurrently serve requests for all of the data it is responsible for managing.

    For more information and graphs related to ephemeral versus EBS performance, see the blog article at http://blog.scalyr.com/2012/10/16/a-systematic-look-at-ec2-io/.

Calculating Usable Disk Capacity

To calculate how much data your Cassandra nodes can hold, calculate the usable disk capacity per node and then multiply that by the number of nodes in your cluster. Remember that in a production cluster, you will typically have your commit log and data directories on different disks. This calculation is for estimating the usable capacity of the data volume.

Start with the raw capacity of the physical disks:

raw_capacity = disk_size * number_of_disks

Account for file system formatting overhead (roughly 10 percent) and the RAID level you are using. For example, if using RAID-10, the calculation would be:

(raw_capacity * 0.9) / 2 = formatted_disk_space

During normal operations, Cassandra routinely requires disk capacity for compaction and repair operations. For optimal performance and cluster health, DataStax recommends that you do not fill your disks to capacity, but run at 50-80 percent capacity. With this in mind, calculate the usable disk space as follows (example below uses 50%):

formatted_disk_space * 0.5 = usable_disk_space
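
The same arithmetic as a short worked example (Python; the disk size and disk count are made-up illustration values, while the 10% formatting overhead, RAID 10 divisor, and 50% headroom come from the formulas above):

# Usable capacity per node, following the formulas above.
disk_size_gb    = 2000                              # raw size of each physical disk (illustrative)
number_of_disks = 4
raw_capacity    = disk_size_gb * number_of_disks    # 8000 GB

formatted_disk_space = (raw_capacity * 0.9) / 2     # 10% formatting overhead, RAID 10 -> 3600 GB
usable_disk_space    = formatted_disk_space * 0.5   # 50-80% fill target; using 50% -> 1800 GB

print(raw_capacity, formatted_disk_space, usable_disk_space)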

Calculating User Data Size

As with all data storage systems, the size of your raw data will be larger once it is loaded into Cassandra due to storage overhead. On average, raw data will be about 2 times larger on disk after it is loaded into the database, but could be much smaller or larger depending on the characteristics of your data and column families. The calculations in this section account for data persisted to disk, not for data stored in memory.

  • Column Overhead - Every column in Cassandra incurs 15 bytes of overhead. Since each row in a column family can have different column names as well as differing numbers of columns, metadata is stored for each column. For counter columns and expiring columns, add an additional 8 bytes (23 bytes column overhead). So the total size of a regular column is:

    total_column_size = column_name_size + column_value_size + 15
    
  • Row Overhead - Just like columns, every row also incurs some overhead when stored on disk. Every row in Cassandra incurs 23 bytes of overhead.

  • Primary Key Index - Every column family also maintains a primary index of its row keys. Primary index overhead becomes more significant when you have lots of skinny rows. Sizing of the primary row key index can be estimated as follows (in bytes):

    primary_key_index = number_of_rows * (32 + average_key_size)
    
  • Replication Overhead - The replication factor obviously plays a role in how much disk capacity is used. For a replication factor of 1, there is no overhead for replicas (as only one copy of your data is stored in the cluster). If replication factor is greater than 1, then your total data storage requirement will include replication overhead.

    replication_overhead = total_data_size * (replication_factor - 1)
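
Putting these sizing formulas together (a sketch; the row count, column counts, sizes, and replication factor are made-up illustration values, and treating total_data_size as column data plus row overhead plus the primary key index is an assumption):

# Rough on-disk size estimate using the overhead figures above.
number_of_rows     = 1000000
columns_per_row    = 10
column_name_size   = 10     # bytes (illustrative)
column_value_size  = 100    # bytes (illustrative)
average_key_size   = 20     # bytes (illustrative)
replication_factor = 3

total_column_size = column_name_size + column_value_size + 15   # 15 bytes regular column overhead
row_size          = columns_per_row * total_column_size + 23    # 23 bytes row overhead
primary_key_index = number_of_rows * (32 + average_key_size)

total_data_size      = number_of_rows * row_size + primary_key_index
replication_overhead = total_data_size * (replication_factor - 1)

print(total_data_size + replication_overhead)  # total bytes stored across the cluster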
    

Choosing Node Configuration Options

A major part of planning your Cassandra cluster deployment is understanding and setting the various node configuration properties. These properties are set in the cassandra.yaml configuration file. Each node should be correctly configured before starting it for the first time:

  • Storage settings: By default, a node is configured to store the data it manages in the /var/lib/cassandra directory. In a production cluster deployment, you should change the commitlog_directory to a different disk drive from the data_file_directories.
  • Gossip settings: The gossip-related settings control a node's participation in a cluster and how the node is known to the cluster.
  • Purging gossip state on a node: Gossip information is also persisted locally by each node to use immediately when a node restarts. You may want to purge gossip history on node restart for various reasons, such as when the node's IP address has changed.
  • Partitioner settings: You must set the partitioner type and assign each node an initial_token value (see the token calculation sketch after this list).
  • Snitch settings: You need to configure a snitch when you create a cluster. The snitch is responsible for knowing the location of nodes within your network topology and distributing replicas by grouping machines into data centers and racks.
  • Keyspace replication settings: When you create a keyspace, you must define the replica placement strategy and the number of replicas you want.
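
For example, initial_token values for the RandomPartitioner (token range 0 to 2**127) can be generated by dividing the range evenly across the nodes in the cluster; a minimal sketch (the four-node cluster size is an illustrative assumption):

# Evenly spaced initial_token values for the RandomPartitioner (token range 0 .. 2**127).
def calculate_tokens(number_of_nodes):
    return [i * (2 ** 127 // number_of_nodes) for i in range(number_of_nodes)]

for node, token in enumerate(calculate_tokens(4)):
    print("node %d: initial_token = %d" % (node, token))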