Apache Cassandra 1.2 Documentation

About data distribution and replication

In Cassandra, data distribution and replication go together. This is because Cassandra is designed as a peer-to-peer system that makes copies of the data and distributes the copies among a group of nodes. Data is organized by table and identified by a row key called a primary key. The primary key determines which node the data is stored on. Copies of rows are called replicas. When data is first written, it is also referred to as a replica.

When your create a cluster, you must specify the following:

  • Virtual nodes: Assigns data ownership to physical machines.
  • Partitioner: Partitions the data across the cluster.
  • Replication strategy: Determines the replicas for each row of data.
  • Snitch: Defines the topology information that the replication strategy uses to place replicas.

Consistent hashing

This section provides more detail about how the consistent hashing mechanism distributes data across a cluster in Cassandra. Consistent hashing partitions data based on the primary key. For example, if you have the following data:

jim age: 36 car: camaro gender: M
carol age: 37 car: bmw gender: F
johnny age: 12 gender: M  
suzy age: 10 gender: F  

Cassandra assigns a hash value to each primary key:

Primary key Murmur3 hash value
jim -2245462676723223822
carol 7723358927203680754
johnny -6723372854036780875
suzy 1168604627387940318

Each node in the cluster is responsible for a range of data based on the hash value:

Node Murmur3 start range Murmur3 end range
A -9223372036854775808 -4611686018427387903
B -4611686018427387904 -1
C 0 4611686018427387903
D 4611686018427387904 9223372036854775807

Cassandra places the data on each node according to the value of the primary key and the range that the node is responsible for. For example, in a four node cluster, the data in this example is distributed as follows:

Node Start range End range Primary key Hash value
A -9223372036854775808 -4611686018427387903 johnny -6723372854036780875
B -4611686018427387904 -1 jim -2245462676723223822
C 0 4611686018427387903 suzy 1168604627387940318
D 4611686018427387904 9223372036854775807 carol 7723358927203680754

How data is distributed across a cluster (using virtual nodes)

Prior to version 1.2, you had to calculate and assign a single token to each node in a cluster. Each token determined the node's position in the ring and its portion of data according to its hash value. Although the design of consistent hashing used prior to version 1.2 (compared to other distribution designs), allowed moving a single node's worth of data when adding or removing nodes from the cluster, it still required substantial effort to do so.

Starting in version 1.2, Cassandra changes this paradigm from one token and range per node to many tokens per node. The new paradigm is called virtual nodes. Virtual nodes allow each node to own a large number of small ranges distributed throughout the cluster. Virtual nodes also use consistent hashing to distribute data but using them doesn't require token generation and assignment.


../../_images/vnodes_compare.png

The top portion of the graphic shows a cluster without virtual nodes. In this paradigm, each node is assigned a single token that represents a location in the ring. Each node stores data determined by mapping the row key to a token value within a range from the previous node to its assigned value. Each node also contains copies of each row from other nodes in the cluster. For example, range E replicates to nodes 5, 6, and 1. Notice that a node owns exactly one contiguous range in the ring space.

The bottom portion of the graphic shows a ring with virtual nodes. Within a cluster, virtual nodes are randomly selected and non-contiguous. The placement of a row is determined by the hash of the row key within many smaller ranges belonging to each node.

Using virtual nodes

Virtual nodes simplify many tasks in Cassandra:

  • You no longer have to calculate and assign tokens to each node.
  • Rebalancing a cluster is no longer necessary when adding or removing nodes. When a node joins the cluster, it assumes responsibility for an even portion of data from the other nodes in the cluster. If a node fails, the load is spread evenly across other nodes in the cluster.
  • Rebuilding a dead node is faster because it involves every other node in the cluster and because data is sent to the replacement node incrementally instead of waiting until the end of the validation phase.
  • Improves the use of heterogeneous machines in a cluster. You can assign a proportional number of virtual nodes to smaller and larger machines.

For more information, see the article Virtual nodes in Cassandra 1.2.

To set up virtual nodes:

Set the number of tokens on each node in your cluster with the num_tokens parameter in the cassandra.yaml file. The recommended value is 256. Do not set the initial_token parameter.

Generally when all nodes have equal hardware capability, they should have the same number of virtual nodes. If the hardware capabilities vary among the nodes in your cluster, assign a proportional number of virtual nodes to the larger machines. For example, you could designate your older machines to use 128 virtual nodes and your new machines (that are twice as powerful) with 256 virtual nodes.

About data replication

Cassandra stores replicas on multiple nodes to ensure reliability and fault tolerance. A replication strategy determines the nodes where replicas are placed.

The total number of replicas across the cluster is referred to as the replication factor. A replication factor of 1 means that there is only one copy of each row on one node. A replication factor of 2 means two copies of each row, where each copy is on a different node. All replicas are equally important; there is no primary or master replica. As a general rule, the replication factor should not exceed the number of nodes in the cluster. However, you can increase the replication factor and then add the desired number of nodes later. When replication factor exceeds the number of nodes, writes are rejected, but reads are served as long as the desired consistency level can be met.

Two replication strategies are available:

  • SimpleStrategy: Use for a single data center only. If you ever intend more than one data center, use the NetworkTopologyStrategy.
  • NetworkTopologyStrategy: Highly recommended for most deployments because it is much easier to expand to multiple data centers when required by future expansion.

SimpleStrategy

Use only for a single data center. SimpleStrategy places the first replica on a node determined by the partitioner. Additional replicas are placed on the next nodes clockwise in the ring without considering topology (rack or data center location).

NetworkTopologyStrategy

Use NetworkTopologyStrategy when you have (or plan to have) your cluster deployed across multiple data centers. This strategy specify how many replicas you want in each data center.

NetworkTopologyStrategy places replicas in the same data center by walking the ring clockwise until reaching the first node in another rack. NetworkTopologyStrategy attempts to place replicas on distinct racks because nodes in the same rack (or similar physical grouping) often fail at the same time due to power, cooling, or network issues.

When deciding how many replicas to configure in each data center, the two primary considerations are (1) being able to satisfy reads locally, without incurring cross data-center latency, and (2) failure scenarios. The two most common ways to configure multiple data center clusters are:

  • Two replicas in each data center: This configuration tolerates the failure of a single node per replication group and still allows local reads at a consistency level of ONE.
  • Three replicas in each data center: This configuration tolerates either the failure of a one node per replication group at a strong consistency level of LOCAL_QUORUM or multiple node failures per data center using consistency level ONE.

Asymmetrical replication groupings are also possible. For example, you can have three replicas per data center to serve real-time application requests and use a single replica for running analytics.

Choosing keyspace replication options

To set the replication strategy for a keyspace, see CREATE KEYSPACE.

When you use NetworkToplogyStrategy, during creation of the keyspace strategy_options, you use the data center names defined for the snitch used by the cluster. To place replicas in the correct location, Cassandra requires a keyspace definition that uses the snitch-aware data center names. For example, if the cluster uses the PropertyFileSnitch, create the keyspace using the user-defined data center and rack names in the cassandra-topologies.properties file. If the cluster uses the EC2Snitch, create the keyspace using EC2 data center and rack names.