Apache Cassandra 0.8 Documentation

Planning a Cassandra Cluster Deployment

This document corresponds to an earlier product version. Make sure you are using the documentation that corresponds to the version of Cassandra you are running.


When planning a Cassandra cluster deployment, you should first have a good idea of the initial volume of data you plan to store, as well as what your typical application workload will be.

Selecting Hardware

As with any application, choosing appropriate hardware depends on selecting the right balance of the following resources: Memory, CPU, Disk and Network.

Memory

The more memory a Cassandra node has, the better read performance will be. More RAM allows for larger cache sizes, reducing disk I/O for reads. More RAM also allows for larger memory tables (memtables) to hold the most recently written data. Larger memtables result in fewer SSTables being flushed to disk and fewer files to scan during a read. The minimum that should be considered for a production deployment is 4GB, with 8GB-16GB being the most common, and many production clusters using 32GB or more per node. The ideal amount of RAM depends on the anticipated size of your hot data.

CPU

Insert-heavy workloads will actually be CPU-bound in Cassandra before being memory-bound. Cassandra is highly concurrent and will benefit from many CPU cores. Currently, nodes with 8 CPU cores typically provide the best price/performance ratio. On virtualized machines, consider using a cloud provider that allows CPU bursting.

Disk

When choosing disks, you should consider both capacity (how much data you plan to store) and I/O (the write/read throughput rate). Most workloads are best served by using less expensive SATA disks and scaling disk capacity and I/O by adding more nodes (with a lot of RAM).

Solid-state drives (SSDs) are also a valid alternative for Cassandra. Cassandra’s sequential, streaming write patterns minimize the undesirable effects of write amplification associated with SSDs.

Cassandra persists data to disk for two different purposes: it constantly appends to a commit log for write durability, and periodically flushes in-memory data structures to SSTable data files for persistent storage of column family data. It is highly recommended to use a different disk device for the commit log than for the SSTables. The commit log does not need much capacity, but throughput should be enough to accommodate your expected write load. Data directories should be large enough to house all of your data, but with enough throughput to handle your expected (non-cached) read load and the disk I/O required by flushing and compaction.

During compaction and node repair, disk utilization can temporarily double in your data directory volume. For this reason, DataStax recommends only using 50 percent of usable disk capacity on a node.

For disk fault tolerance and data redundancy, there are two reasonable approaches:

  • Use RAID0 on your data volume and rely on Cassandra replication for disk failure tolerance. If you lose a disk on a node, you can recover lost data through Cassandra’s built-in repair.
  • Use RAID10 to avoid large repair operations after a single disk failure.

If you have disk capacity to spare, DataStax recommends using RAID10. If disk capacity is a bottleneck for your workload, use RAID0.
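
As a rough sketch, software RAID can be assembled on Linux with mdadm before starting Cassandra. The device names, file system, and mount point below are illustrative assumptions; substitute values appropriate for your hardware:

# Hypothetical example: four data disks assembled into a single RAID0 array
mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
# For RAID10, use --level=10 instead (usable capacity is halved)
mkfs.ext4 /dev/md0
mount /dev/md0 /var/lib/cassandra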

Network

Since Cassandra is a distributed data store, it puts load on the network to handle read/write requests and replication of data across nodes. You want to choose reliable, redundant network interfaces and make sure that your network can handle traffic between nodes without bottlenecks.

Cassandra is efficient at routing requests to replicas that are geographically closest to the coordinator node handling the request. Cassandra will pick a replica in the same rack if possible, and will choose replicas located in the same data center over replicas in a remote data center.

Cassandra uses the following ports. If using a firewall, make sure that nodes within a cluster can reach each other on these ports (an example firewall configuration follows the table):

Port Description
7000 Cassandra intra-node communication port
9160 Thrift client port
7199 JMX monitoring port (8080 in prior releases)
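
For example, on a Linux node protected by iptables, rules along the following lines would allow other cluster nodes (assumed here to be on the hypothetical 10.1.1.0/24 subnet) to reach these ports; adjust the source addresses for your own network and client hosts:

# Hypothetical rules; adjust the source subnet for your cluster and clients
iptables -A INPUT -p tcp -s 10.1.1.0/24 --dport 7000 -j ACCEPT   # intra-node
iptables -A INPUT -p tcp -s 10.1.1.0/24 --dport 9160 -j ACCEPT   # Thrift clients
iptables -A INPUT -p tcp -s 10.1.1.0/24 --dport 7199 -j ACCEPT   # JMX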

Planning an Amazon EC2 Cluster

Cassandra clusters can be deployed on cloud infrastructures such as Amazon EC2.

For production Cassandra clusters on EC2, use Large (L) or Extra Large (XL) instances with local storage. RAID0 the ephemeral disks, and put both the data directory and the commit log on that volume. This has proved to be better in practice than putting the commit log on the root volume (which is also a shared resource). For data redundancy, consider deploying your Cassandra cluster across multiple availability zones or using EBS volumes to store your Cassandra backup files.

EBS volumes are not recommended for Cassandra data volumes - their network performance and disk I/O are not good fits for Cassandra for the following reasons:

  • EBS volumes contend directly for network throughput with standard packets. This means that EBS throughput is likely to fail if you saturate a network link.
  • EBS volumes have unreliable performance. I/O performance can be exceptionally slow, causing the system to backload reads and writes until the entire cluster becomes unresponsive.
  • Adding capacity by increasing the number of EBS volumes per host does not scale. You can easily surpass the ability of the system to keep effective buffer caches and concurrently serve requests for all of the data it is responsible for managing.

DataStax provides an Amazon Machine Image (AMI) to allow you to quickly deploy a multi-node Cassandra cluster on Amazon EC2. The DataStax AMI initializes all nodes in one availability zone using the SimpleSnitch.

If you want an EC2 cluster that spans multiple regions and availability zones, do not use the DataStax AMI. Instead, initialize your EC2 instances for each Cassandra node and then configure the cluster as a multi data center cluster.

Capacity Planning

The estimates in this section can be used as guidelines for planning the size of your Cassandra cluster based on the data you plan to store. To estimate capacity, you should first have a good understanding of the sizing of the raw data you plan to store, and how you plan to model your data in Cassandra (number of column families, rows, columns per row, and so on).

Calculating Usable Disk Capacity

To calculate how much data your Cassandra nodes can hold, calculate the usable disk capacity per node and then multiply that by the number of nodes in your cluster. Remember that in a production cluster, you will typically have your commit log and data directories on different disks. This calculation is for estimating the usable capacity of the data volume.

Start with the raw capacity of the physical disks:

    raw_capacity = disk_size * number_of_disks

Account for file system formatting overhead (roughly 10 percent) and the RAID level you are using. For example, if using RAID-10, the calculation would be:

    (raw_capacity * 0.9) / 2 = formatted_disk_space

During normal operations, Cassandra routinely requires disk capacity for compaction and repair operations. For optimal performance and cluster health, DataStax recommends that you do not fill your disks to capacity, but run at 50 percent capacity or lower. With this in mind, calculate the usable disk space as follows:

    formatted_disk_space * 0.5 = usable_disk_space
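
As a worked example, suppose each node has four 500GB disks in its data volume, configured as RAID10:

    raw_capacity = 500GB * 4 = 2000GB
    formatted_disk_space = (2000GB * 0.9) / 2 = 900GB
    usable_disk_space = 900GB * 0.5 = 450GB

Each such node can therefore comfortably hold about 450GB of on-disk data, and a ten-node cluster roughly 4.5TB (including replicas).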

Calculating User Data Size

As with all data storage systems, the size of your raw data will be larger once it is loaded into Cassandra due to storage overhead. On average, raw data is about two times larger on disk after it is loaded into the database, but it could be much smaller or larger depending on the characteristics of your data and column families. The calculations in this section account for data persisted to disk, not for data stored in memory. A worked example follows the list of overhead factors below.

  • Column Overhead - Every column in Cassandra incurs 15 bytes of overhead. Since each row in a column family can have different column names as well as differing numbers of columns, metadata is stored for each column. For counter columns and expiring columns, add an additional 8 bytes (23 bytes column overhead). So the total size of a regular column is:

    total_column_size = column_name_size + column_value_size + 15
    
  • Row Overhead - Just like columns, every row also incurs some overhead when stored on disk. Every row in Cassandra incurs 23 bytes of overhead.

  • Primary Key Index - Every column family also maintains a primary index of its row keys. Primary index overhead becomes more significant when you have lots of skinny rows. Sizing of the primary row key index can be estimated as follows (in bytes):

    primary_key_index = number_of_rows * (32 + average_key_size)
    
  • Replication Overhead - The replication factor plays a significant role in how much disk capacity is used. For a replication factor of 1, there is no overhead for replicas (as only one copy of your data is stored in the cluster). If the replication factor is greater than 1, then your total data storage requirement will include replication overhead.

    replication_overhead = total_data_size * (replication_factor - 1)
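
As a worked example, consider a hypothetical column family with 1 million rows, each containing 10 regular columns with 10-byte names and 100-byte values, 20-byte row keys, and a replication factor of 3:

    total_column_size = 10 + 100 + 15 = 125 bytes
    row_size = (125 * 10) + 23 = 1273 bytes
    column_family_data = 1273 * 1,000,000 = ~1.3GB
    primary_key_index = 1,000,000 * (32 + 20) = ~52MB
    replication_overhead = ~1.3GB * (3 - 1) = ~2.6GB
    total_cluster_footprint = ~3.9GB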
    

Choosing Node Configuration Options

A major part of planning your Cassandra cluster deployment is understanding and setting the various node configuration properties. This section explains the various configuration decisions that need to be made before deploying a Cassandra cluster, be it a single-node, multi-node, or multi-data center cluster.

The properties mentioned in this section are set in the cassandra.yaml configuration file. Each node should be correctly configured before it is started for the first time.

Storage Settings

By default, a node is configured to store the data it manages in /var/lib/cassandra. In a production cluster deployment, you should change the commitlog_directory so it is on a different disk device than the data_file_directories.
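
For example, a production node might separate the two as follows in cassandra.yaml (the mount points shown are hypothetical; use whatever devices you have dedicated to each purpose):

# Hypothetical mount points; the commit log volume is a separate physical device
data_file_directories:
    - /data/cassandra/data
commitlog_directory: /commitlog/cassandra/commitlog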

Gossip Settings

The gossip settings control a node's participation in a cluster and how the node is known to the cluster. An example cassandra.yaml excerpt follows the table below.

Property Description
cluster_name Name of the cluster that this node is joining. Should be the same for every node in the cluster.
listen_address The IP address or hostname that other Cassandra nodes will use to connect to this node. Should be changed from localhost to the public address for the host.
seeds A comma-delimited list of node IP addresses used to bootstrap the gossip process. Every node should have the same list of seeds. In multi data center clusters, the seed list should include a node from each data center.
storage_port The intra-node communication port (default is 7000). Should be the same for every node in the cluster.
initial_token The initial token is used to determine the range of data this node is responsible for.
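
Putting these together, the gossip-related portion of cassandra.yaml for one node might look like the following sketch (the cluster name, addresses, and token are hypothetical, and the exact seeds syntax should follow the sample cassandra.yaml shipped with your version):

cluster_name: 'MyCluster'
listen_address: 110.82.155.1
seeds:
    - 110.82.155.1
    - 110.82.156.2
storage_port: 7000
initial_token: 0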

Purging Gossip State on a Node

Gossip information is also persisted locally by each node so that it can be used immediately on the next restart, without having to wait for gossip communication. To clear gossip history on node restart (for example, if node IP addresses have changed), add the following line to the cassandra-env.sh file. This file is located in /usr/share/cassandra or in $CASSANDRA_HOME/conf, depending on your type of installation.

-Dcassandra.load_ring_state=false
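
In practice, cassandra-env.sh builds up a JVM_OPTS variable, so the option is typically added in this form:

JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"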

Partitioner Settings

When you deploy a Cassandra cluster, you need to make sure that each node is responsible for roughly an equal amount of data; this is also known as load balancing. You achieve this by configuring the partitioner for each node and correctly assigning each node an initial_token value.

DataStax strongly recommends using the RandomPartitioner (the default) for all cluster deployments. Assuming use of this partitioner, each node in the cluster is assigned a token that represents a hash value within the range of 0 to 2**127.

For clusters where all nodes are in a single data center, you can calculate tokens by dividing the range by the total number of nodes in the cluster. In multi-data center deployments, tokens should be calculated such that each data center is individually load balanced as well. See Calculating Tokens for the different approaches to generating tokens for nodes in single and multi-data center clusters.
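
For example, evenly spaced tokens for a four-node, single data center cluster are calculated as initial_token = node_number * (2**127 / 4):

    node 0: 0
    node 1: 42535295865117307932921825928971026432
    node 2: 85070591730234615865843651857942052864
    node 3: 127605887595351923798765477786913079296

Each value is then assigned to the corresponding node's initial_token property in cassandra.yaml.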

Snitch Settings

The snitch is responsible for knowing the location of nodes within your network topology. This affects where replicas are placed as well as how requests are routed between replicas. The endpoint_snitch property configures the snitch for a node. All nodes should have the exact same snitch configuration.

For a single data center (or single node) cluster, using the default SimpleSnitch is usually sufficient. However, if you plan to expand your cluster at a later time to multiple racks and data centers, it will be easier if you choose a rack and data center aware snitch from the start. All snitches are compatible with all replica placement strategies.
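
The snitch is set with the fully qualified class name in cassandra.yaml. For example, to use the rack and data center aware PropertyFileSnitch described below:

# Default (single data center):
# endpoint_snitch: org.apache.cassandra.locator.SimpleSnitch
endpoint_snitch: org.apache.cassandra.locator.PropertyFileSnitch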

Configuring the PropertyFileSnitch

The PropertyFileSnitch requires you to define network details for each node in the cluster in a cassandra-topology.properties configuration file. A sample of this file is located in /etc/cassandra/conf/cassandra-topology.properties in packaged installations or $CASSANDRA_HOME/conf/cassandra-topology.properties in binary installations.

Every node in the cluster should be described in this file, and this file should be exactly the same on every node in the cluster if you are using the PropertyFileSnitch.

For example, supposing you had non-uniform IPs and two physical data centers with two racks in each, and a third logical data center for replicating analytics data:

# Data Center One

175.56.12.105=DC1:RAC1
175.50.13.200=DC1:RAC1
175.54.35.197=DC1:RAC1

120.53.24.101=DC1:RAC2
120.55.16.200=DC1:RAC2
120.57.102.103=DC1:RAC2

# Data Center Two

110.56.12.120=DC2:RAC1
110.50.13.201=DC2:RAC1
110.54.35.184=DC2:RAC1

50.33.23.120=DC2:RAC2
50.45.14.220=DC2:RAC2
50.17.10.203=DC2:RAC2

# Analytics Replication Group

172.106.12.120=DC3:RAC1
172.106.12.121=DC3:RAC1
172.106.12.122=DC3:RAC1

# default for unknown nodes
default=DC3:RAC1

If using this snitch, you can define your data center and rack names to be whatever you want. Just make sure the data center names you define in the cassandra-topology.properties file correlate to the names of your data centers in your keyspace strategy_options.

Choosing Keyspace Replication Options

When you create a keyspace, you must define the replica placement strategy and the number of replicas you want. DataStax recommends always choosing NetworkTopologyStrategy for both single and multi-data center clusters. It is as easy to use as SimpleStrategy and allows for expansion to multiple data centers in the future, should that become useful. It is much easier to configure the most flexible replication strategy up front than to reconfigure replication after you have already loaded data into your cluster.

NetworkTopologyStrategy takes as options the number of replicas you want per data center. Even for single data center (or single node) clusters, you can use this replica placement strategy and just define the number of replicas for one data center. For example (using cassandra-cli):

[default@unknown] CREATE KEYSPACE test
WITH placement_strategy = 'NetworkTopologyStrategy'
AND strategy_options=[{us-east:6}];

Or for a multi-data center cluster:

[default@unknown] CREATE KEYSPACE test
WITH placement_strategy = 'NetworkTopologyStrategy'
AND strategy_options=[{DC1:6,DC2:6,DC3:3}];

When declaring the keyspace strategy_options, what you name your data centers depends on the snitch you have chosen for your cluster. The data center names must correlate to the snitch you are using in order for replicas to be placed in the correct location.

As a general rule, the number of replicas should not exceed the number of nodes in a replication group. However, it is possible to increase the number of replicas, and then add the desired number of nodes afterwards. When the replication factor exceeds the number of nodes, writes will be rejected, but reads will still be served as long as the desired consistency level can be met.
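
If you do increase replication on an existing keyspace, the strategy_options can be updated with cassandra-cli. As a sketch, assuming the same keyspace and data center names used above:

[default@unknown] UPDATE KEYSPACE test
WITH strategy_options=[{DC1:6,DC2:6,DC3:3}];

After changing the replication factor, run a repair on each affected node so that existing data is copied to the new replicas.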