Apache Cassandra 0.7 Documentation

Clustering

Tokens, Partitioners, and the Ring

Each Cassandra server is assigned a unique token. When the tokens from these servers are taken as a set and sorted, they form a range that determines the storage location of individual rows by mapping each row key to a token value. Each server is responsible for the range of tokens that includes its own token and extends back to, but does not include, the previous server’s token. The server with the lowest token wraps around to also cover the range above the server with the highest token. This range mapping is commonly referred to as “the ring”.
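As a rough illustration of the range mapping described above, the following Python sketch looks up the node responsible for a given token. The node names and the small token values are made up for readability, and owner_of is a hypothetical helper, not part of Cassandra.

```python
from bisect import bisect_left

# Hypothetical three-node ring: token -> node name (tokens kept small for readability).
RING = {10: "node-a", 40: "node-b", 80: "node-c"}


def owner_of(key_token, ring=RING):
    """Return the node whose range (previous token, own token] contains key_token."""
    tokens = sorted(ring)
    # Position of the first node token that is >= the key's token.
    index = bisect_left(tokens, key_token)
    # Past the highest token: wrap around to the node with the lowest token.
    if index == len(tokens):
        index = 0
    return ring[tokens[index]]


print(owner_of(25))  # falls between tokens 10 and 40
print(owner_of(95))  # above the highest token, so it wraps around
```

A sorted token list with a binary search is used here only because it keeps the wrap-around case explicit; it is not meant to mirror Cassandra's internal data structures.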

Partitioners

Partitioners decide where a key maps onto the ring.

RandomPartitioner

RandomPartitioner takes the MD5 hash of keys and uses the hash as the location in the ring. This causes keys to be evenly distributed around the entire ring.
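The hashing step can be sketched in Python as follows. This is an approximation under a stated assumption: the 16-byte MD5 digest is read as a signed big-endian integer and its absolute value is used as the token, which keeps tokens within 0 to 2**127. The function name random_partitioner_token is hypothetical.

```python
import hashlib


def random_partitioner_token(key: bytes) -> int:
    """Approximate a RandomPartitioner token for a row key.

    Assumption: the MD5 digest is interpreted as a signed big-endian
    integer and the absolute value is taken.
    """
    digest = hashlib.md5(key).digest()
    return abs(int.from_bytes(digest, byteorder="big", signed=True))


# Similar keys end up at unrelated positions on the ring.
for key in (b"alice", b"bob", b"carol"):
    print(key.decode(), random_partitioner_token(key))
```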

OrderPreservingPartitioner

OrderPreservingPartitioner does not use a hash to determine a key’s placement in the ring, but instead compares the key itself (as a byte array) to the tokens in the ring to determine which node is responsible for that key. This partitioner uses UTF-8 validation for keys.

Unlike RandomPartitioner, this partitioner allows you to efficiently use get_range() to get a contiguous range of keys.

However, a serious drawback of this partitioner is that it leads to an unbalanced ring, so node tokens must constantly be adjusted to keep the load equal. Because there are other ways to achieve the nice get_range() behavior, such as Using Column Families as Indexes, RandomPartitioner should always be preferred unless an order preserving partitioner is absolutely needed.
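To make the balance problem concrete, the sketch below places the same 1000 keys on a four-node ring twice: once by comparing raw key bytes against node tokens, and once by hashing. The key names, the node tokens, and the helper functions are all hypothetical, and the MD5-to-token conversion is an assumption (absolute value of the signed digest).

```python
import hashlib
from bisect import bisect_left
from collections import Counter


def owner_index(token, node_tokens):
    """Index of the node owning `token`; wraps to node 0 past the highest token."""
    return bisect_left(node_tokens, token) % len(node_tokens)


def md5_token(key: bytes) -> int:
    # Assumption: the token is the absolute value of the signed MD5 digest.
    return abs(int.from_bytes(hashlib.md5(key).digest(), "big", signed=True))


# Hypothetical workload: 1000 keys that all share the prefix "user".
keys = [("user%04d" % i).encode("utf-8") for i in range(1000)]

# Order-preserving style: node tokens are themselves keys, compared as bytes.
ordered_tokens = [b"g", b"n", b"u", b"z"]
ordered_load = Counter(owner_index(k, ordered_tokens) for k in keys)

# Hash style: four evenly spaced tokens in the 0..2**127 space.
hashed_tokens = [i * 2**127 // 4 for i in range(4)]
hashed_load = Counter(owner_index(md5_token(k), hashed_tokens) for k in keys)

print("ordered:", dict(ordered_load))
print("hashed: ", dict(hashed_load))
```

With byte-ordered placement every key sorts between "u" and "z", so a single node holds the whole data set, while hashing spreads the same keys over all four nodes.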

ByteOrderedPartitioner

This partitioner supports keys with arbitrary content, and orders keys by their byte value. Because it does not enforce UTF-8 keys, ByteOrderedPartitioner may be preferred over OrderPreservingPartitioner in deployments where storing non-UTF-8 bytes in keys is a requirement.

CollatingOrderPreservingPartitioner

Similar to OrderPreservingPartitioner, but compares keys using en_US collation rules instead of native byte ordering.

Replication

Replication of data in Cassandra is controlled by the replication_factor setting in the keyspace definition. The default value of 1 means that data is written to only one node in the ring – essentially, replication is disabled by default.

Choosing a higher value for the replication factor creates data redundancy for failover. With the replication factor set to 3, for example, write operations are committed to three separate nodes, so two copies of a record remain if one of the three replicas fails. This also means that each node acts as a replica for three separate key ranges.

Note

The replication factor reflects the total number of replicas. A replication factor of 1 does not mean there is an additional copy of the data; it means there is one copy in total, and no data redundancy at all.

The actual placement of replicas in the cluster is determined by the Replica Placement Strategies. If your cluster spans multiple data centers, selecting a placement strategy is especially important to your overall replication setup.

When choosing the replication factor for a keyspace, consider the number carefully; Changing the Replication Factor is a non-trivial task that can temporarily affect the performance of your cluster.

Replica Placement Strategies

For keyspaces where the replication factor is higher than 1, the replica placement strategy determines which nodes contain the additional copy, or copies, of data. You can define the replica placement strategy when creating a keyspace in the Cassandra CLI. The default value is org.apache.cassandra.locator.SimpleStrategy.

SimpleStrategy can be efficient for a cluster located within a single data center. For clusters spanning multiple data centers, DataStax recommends using one of the “rack-aware” strategies – NetworkTopologyStrategy or OldNetworkTopologyStrategy.

Placement Strategy          Description
SimpleStrategy              Returns the nodes that are next to each other on the ring
NetworkTopologyStrategy     Allows you to configure the number of replicas per data center as specified in the strategy_options. Replicas are placed on different racks within each data center, if possible.
OldNetworkTopologyStrategy  Places one replica in a different data center while placing the others on different racks in the current data center. Optimized for exactly three replicas across two datacenters.

A cluster using one of the rack-aware replica placement strategies must also use one of the rack-aware Snitches.
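The SimpleStrategy behavior can be sketched in a few lines of Python. This assumes that the additional replicas are simply the next nodes clockwise on the ring, with nodes numbered in token order; both function names are hypothetical, and rack-aware placement is not modeled.

```python
def simple_strategy_replicas(primary_index, node_count, replication_factor):
    """Indexes of the nodes holding a row whose primary replica is primary_index.

    Nodes are numbered in ring (token) order. Assumption: the additional
    replicas are the next nodes clockwise on the ring.
    """
    copies = min(replication_factor, node_count)
    return [(primary_index + step) % node_count for step in range(copies)]


def ranges_held_by(node_index, node_count, replication_factor):
    """Primary ranges (identified by owner index) that node_index stores a copy of."""
    return [
        owner
        for owner in range(node_count)
        if node_index in simple_strategy_replicas(owner, node_count, replication_factor)
    ]


# Six-node ring, replication factor 3.
print(simple_strategy_replicas(4, 6, 3))  # wraps past the end of the ring
print(ranges_held_by(0, 6, 3))            # node 0 holds copies of three ranges
```

The second function shows why, with a replication factor of 3, each node acts as a replica for three separate key ranges.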

Changing the Replication Factor

Increasing the replication factor for a cluster is a two-part process, with each part having several steps as defined below.

The first part involves the following for each server in the cluster:

  1. Edit the configuration file to contain the desired value (see replication_factor for more information).
  2. Restart Cassandra

The second part, also done to each node in the cluster, involves running nodetool repair against each keyspace individually.

Note that repair can be an intensive process depending on the size of the data so it’s advisable to wait until one repair task finishes on a node before moving on to another.
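One way to keep repairs strictly sequential is to generate the commands up front and run them one at a time. The sketch below builds one nodetool repair command per node and keyspace, using the same -h and -p options shown in the nodetool examples in this document. The host addresses and keyspace name are placeholders, and both helper functions are hypothetical.

```python
import subprocess


def repair_commands(hosts, keyspaces, jmx_port=8080):
    """Build one `nodetool repair` command per (host, keyspace) pair."""
    return [
        ["nodetool", "-h", host, "-p", str(jmx_port), "repair", keyspace]
        for host in hosts
        for keyspace in keyspaces
    ]


def run_sequentially(commands, dry_run=True):
    """Run commands one at a time; each must finish before the next starts."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


# Placeholder hosts and keyspace; dry_run=True only prints the commands.
run_sequentially(repair_commands(["10.196.14.48", "10.196.14.239"], ["Keyspace1"]))
```

Pass dry_run=False only on a machine where nodetool is on the PATH and can reach the nodes.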

Snitches

Snitches give Cassandra information about the network topology of the cluster. Cassandra uses this information to route inter-node requests as efficiently as possible within the confines of the Replica Placement Strategies. Snitches do not affect client requests.

Cassandra comes with three snitches out of the box: SimpleSnitch, RackInferringSnitch and PropertyFileSnitch. Each of these snitches uses a different method to determine the location of nodes in the network and the most efficient way to route requests.

  • SimpleSnitch (org.apache.cassandra.locator.SimpleSnitch)

    SimpleSnitch can be efficient for locating nodes in clusters limited to a single data center. This snitch is configured by default.

  • RackInferringSnitch (org.apache.cassandra.locator.RackInferringSnitch)

    RackInferringSnitch extrapolates the topology of the network by analyzing IP addresses. This snitch assumes that the second octet identifies the data center where a node is located, and the third octet identifies the rack.

  • PropertyFileSnitch (org.apache.cassandra.locator.PropertyFileSnitch)

    PropertyFileSnitch determines the location of nodes by referring to a user-defined description of the network details located in the property file cassandra-topology.properties. Your installation contains an example properties file for PropertyFileSnitch in $CASSANDRA_HOME/conf/cassandra-topology.properties.
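The octet convention used by RackInferringSnitch can be illustrated with a short Python sketch. The function infer_location is hypothetical and only mirrors the rule described above (second octet for the data center, third octet for the rack); the addresses are the ones used in the ring examples later in this document.

```python
def infer_location(ip_address):
    """Return (data_center, rack) taken from the 2nd and 3rd octets of an IPv4 address."""
    octets = ip_address.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted IPv4 address: %r" % ip_address)
    return octets[1], octets[2]


for address in ("10.196.14.48", "10.196.14.239", "10.194.171.160"):
    data_center, rack = infer_location(address)
    print("%s -> data center %s, rack %s" % (address, data_center, rack))
```

Under this rule the first two addresses share a data center and a rack, while the third is placed in a different data center.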

To configure the snitch for Cassandra, set the desired value for endpoint_snitch. For example, to configure RackInferringSnitch, edit cassandra.yaml as follows:

endpoint_snitch: org.apache.cassandra.locator.RackInferringSnitch

Note

With NetworkTopologyStrategy or OldNetworkTopologyStrategy, you must use one of the rack-aware snitches – either RackInferringSnitch or PropertyFileSnitch.

Dynamic Snitching

By default, all snitches are wrapped in a dynamic snitch layer that monitors read latency and, when possible, routes requests away from poorly performing nodes. Dynamic snitching is recommended for most deployments.

The performance threshold and intervals for dynamic snitching can be controlled using these parameters in cassandra.yaml:

dynamic_snitch_update_interval_in_ms
Sets the interval to calculate read latency. Defaults to 100.
dynamic_snitch_reset_interval_in_ms
Sets the interval to reset all host scores and allow a bad node to recover. Defaults to 60000.
dynamic_snitch_badness_threshold
Sets a performance threshold for dynamically routing requests away from a node. Defaults to 0.0. A value of 0.2 means Cassandra would continue to prefer the static snitch values until the host was 20% worse than the fastest.
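The effect of dynamic_snitch_badness_threshold can be sketched as a simple comparison. This is an illustration of the rule stated above, not Cassandra's actual scoring code: the scores are hypothetical latency figures where lower is better, and choose_replica is a made-up helper.

```python
def choose_replica(preferred, scores, badness_threshold=0.0):
    """Pick a replica given hypothetical latency scores (lower is better).

    `preferred` is the host the static snitch would choose. It is kept
    unless its score exceeds the best score by more than the threshold.
    """
    fastest = min(scores, key=scores.get)
    limit = scores[fastest] * (1.0 + badness_threshold)
    if scores[preferred] > limit:
        return fastest
    return preferred


scores = {"10.196.14.48": 1.15, "10.196.14.239": 1.00}
print(choose_replica("10.196.14.48", scores, badness_threshold=0.2))  # 15% worse: kept
print(choose_replica("10.196.14.48", scores, badness_threshold=0.0))  # any gap: rerouted
```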

Adding Capacity

You can add capacity to a Cassandra cluster by autobootstrapping new nodes. Adding more nodes than the existing number is not supported in a single operation, and must be performed in stages.

Increasing the number of nodes by an integer factor – such as doubling, tripling, or quadrupling the number of nodes – is operationally less complicated than increasing the number unevenly, as it does not require moving tokens on existing nodes. This is because the new token ranges evenly divide the existing token ranges, and the existing nodes can keep their initial token values while new nodes assume half (or a third, or fourth) of the range.

However, any expansion of the cluster requires you to calculate new initial_token values for all nodes, and requires performance-intensive cleanup operations to remove unused keys. DataStax recommends scheduling these operations for low-use hours.

Calculating Tokens

Whenever you expand capacity, calculate the correct token values for all nodes in the expanded cluster. To determine the correct initial token values for the cluster, divide 2 to the 127th power by the total number of nodes, enumerate the nodes starting with zero, then multiply each node’s number by the quotient. The Cassandra Wiki provides a Python program to calculate new tokens for the nodes. A Python script like the following, when run from a command line, will prompt you for the number of nodes and print the initial token values:

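#!/usr/bin/env python3
# Print evenly spaced initial_token values for a cluster of a given size.
import sys

if len(sys.argv) > 1:
    num = int(sys.argv[1])
else:
    num = int(input("How many nodes are in your cluster? "))

for i in range(num):
    # Integer division keeps each token an exact whole number.
    print('node %d: %d' % (i, i * (2**127) // num))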

Once you have calculated these values, you can provide the appropriate value for each node. For the new nodes, you will use the calculated value to set initial_token in cassandra.yaml. For existing nodes, you will use nodetool move <new_token>. This is required in order to balance the load evenly.

In scenarios where you are doubling the number of nodes, half of the tokens calculated should already belong to existing nodes. Make sure you assign the new, unused tokens to the new nodes.
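The following Python sketch separates the tokens of an expanded cluster into those already held by existing nodes and those that remain for the new nodes. It assumes the existing nodes were assigned evenly spaced tokens using the formula described above; the function names are hypothetical.

```python
def initial_tokens(node_count):
    """Evenly spaced tokens: node i gets i * 2**127 // node_count."""
    return [i * 2**127 // node_count for i in range(node_count)]


def split_tokens(existing_count, expanded_count):
    """Return (kept, unused): tokens already held by existing nodes, and the rest."""
    existing = set(initial_tokens(existing_count))
    expanded = initial_tokens(expanded_count)
    kept = [token for token in expanded if token in existing]
    unused = [token for token in expanded if token not in existing]
    return kept, unused


kept, unused = split_tokens(3, 6)
print("already assigned:", kept)
print("for new nodes:   ", unused)
```

When the cluster grows by a non-integer factor, for example from three to four nodes, only the token 0 is kept and every other existing node needs a new token.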

To add nodes to a Cassandra cluster

  1. For all new nodes, edit cassandra.yaml as described in the Getting Started tutorial, specifying required network addresses and the cluster’s seed list.
  2. In cassandra.yaml for each new node, enable autobootstrapping by setting auto_bootstrap: true (default is false).

Note

An autobootstrapping node cannot have itself in the list of seeds nor can it contain an initial_token already claimed by another node. To add new seeds, autobootstrap the nodes first, and then configure them as seeds.

  3. In cassandra.yaml for the new nodes, provide the appropriate values for initial_token.
  4. Start the new nodes in staggered fashion, allowing at least two minutes between each node startup for the gossip protocol to perform important inter-node communication. You can monitor the startup and data streaming process to its completion using nodetool netstats.
  5. After the new nodes are fully bootstrapped, run nodetool move <new_token> on each existing node, one node at a time, where <new_token> is the value you calculated for the node. Only the first node in the ring, whose token value is zero, does not need to be moved.

Note

If you are increasing the cluster by an integer factor (such as doubling it), you can skip the above step for moving tokens. This assumes that you have calculated all tokens for the cluster and have specified the new tokens in cassandra.yaml on the new nodes.

  6. Run nodetool cleanup on each of the previously existing nodes to remove the keys no longer belonging to those nodes. This operation is as disk-intensive as a major compaction, so run only one cleanup command at a time. Cleanup may be safely postponed for low-usage hours.
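The move and cleanup steps for the existing nodes can be planned with a small script. The sketch below only builds the nodetool command lines, in the order they would be run; the host addresses are placeholders, the JMX port follows the examples in this document, and existing_node_commands is a hypothetical helper.

```python
def existing_node_commands(existing, jmx_port=8080):
    """Build the nodetool commands to run on existing nodes, in order.

    `existing` maps host -> (current_token, target_token). Moves come first,
    followed by one cleanup per existing node. Nodes whose token does not
    change are not moved.
    """
    def nodetool(host):
        return ["nodetool", "-h", host, "-p", str(jmx_port)]

    moves = [
        nodetool(host) + ["move", str(target)]
        for host, (current, target) in existing.items()
        if current != target
    ]
    cleanups = [nodetool(host) + ["cleanup"] for host in existing]
    return moves + cleanups


# Hypothetical two-node cluster growing to three nodes.
plan = existing_node_commands({
    "10.196.14.48": (0, 0),
    "10.196.14.239": (2**126, 2**127 // 3),
})
for command in plan:
    print(" ".join(command))
```

Run the resulting commands one at a time, waiting for each to complete before starting the next.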

Replacing a Dead Node

Node failures can be handled by bringing up a replacement node in one of two ways: autobootstrap a node with a new IP address, or reuse the IP address and token of the previous host and run nodetool repair. Which of these processes to choose depends on your clients’ consistency levels and your tolerance for stale or missing data. In discussing these two approaches, we will use the nodes in the following ring as an example:

$ nodetool -h localhost -p 8080 ring
Address        Status   State   Load        Owns    Range                                      Ring
                                                    95315431979199388464207182617231204396
10.194.171.160 Down     Normal  ?           39.98   61078635599166706937511052402724559481     |<--|
10.196.14.48   Up       Normal  3.16 KB     30.01   78197033789183047700859117509977881938     |   |
10.196.14.239  Up       Normal  3.16 KB     30.01   95315431979199388464207182617231204396     |-->|

First Approach: Autobootstrap with same IP

During the autobootstrap process the node will not receive reads until bootstrapping is complete. Further, the ring will go down to two nodes when the dead node is removed:

$ nodetool -h localhost -p 8080 ring
Address       Status   State    Load       Owns    Range                                      Ring
                                                   95315431979199388464207182617231204396
10.196.14.48  Up       Normal  3.16 KB     30.01   78197033789183047700859117509977881938     |<--|
10.196.14.239 Up       Normal  3.16 KB     30.01   95315431979199388464207182617231204396     |-->|

Once the node has completed bootstrapping, the ring will again show three nodes:

$ nodetool -h localhost -p 8080 ring
Address        Status   State   Load        Owns    Range                                      Ring
                                                    95315431979199388464207182617231204396
10.196.14.48   Up       Normal  3.16 KB     30.01   78197033789183047700859117509977881938     |<--|
10.194.171.160 Up       Normal  495 bytes   01.75   86756232884191218082533150063604543167     |   |
10.196.14.239  Up       Normal  3.16 KB     30.01   95315431979199388464207182617231204396     |-->|

This is the safest approach if the client read consistency level is frequently ONE. The downside of this approach is that we have relied on the cluster to auto-balance the load and our replacement node now has a different token.

Second Approach: New Node With Manual Token Selection

If you mostly perform QUORUM reads, can tolerate empty results being returned, or want to maintain the token assignments on the ring, this is the preferred approach. The following ring will be used in this example (the same state as in the first example):

$ nodetool -h localhost -p 8080 ring
Address        Status   State   Load        Owns    Range                                      Ring
                                                    95315431979199388464207182617231204396
10.194.171.160 Down     Normal  ?           39.98   61078635599166706937511052402724559481     |<--|
10.196.14.48   Up       Normal  3.16 KB     30.01   78197033789183047700859117509977881938     |   |
10.196.14.239  Up       Normal  3.16 KB     30.01   95315431979199388464207182617231204396     |-->|

First, pick a token for the new node that is one less than the token of the dead node, and start the new node. Given the above ring, our value for initial_token will be:

86756232884191218082533150063604543166

Once autobootstrap is complete, the ring should now have four nodes, three active and one down:

$ nodetool -h localhost -p 8080 ring
Address        Status   State   Load        Owns    Range                                      Ring
                                                    95315431979199388464207182617231204396
10.196.14.48   Up       Normal  3.16 KB     30.01   78197033789183047700859117509977881938     |<--|
10.203.30.139  Up       Normal  495 bytes   01.75   86756232884191218082533150063604543166     |   |
10.194.171.160 Down     Normal  ?           01.75   86756232884191218082533150063604543167     |   |
10.196.14.239  Up       Normal  3.16 KB     30.01   95315431979199388464207182617231204396     |-->|

Remove the dead node from the ring with nodetool removetoken using the down node’s token:

$ nodetool -h 10.196.14.239 -p 8080 removetoken 86756232884191218082533150063604543167

Verify the integrity of the ring:

$ nodetool -h 10.196.14.239 -p 8080 ring
Address       Status   State   Load       Owns    Range                                      Ring
                                                  95315431979199388464207182617231204396
10.196.14.48  Up       Normal  3.16 KB    30.01   78197033789183047700859117509977881938     |<--|
10.203.30.139 Up       Normal  495 bytes  01.75   86756232884191218082533150063604543166     |   |
10.196.14.239 Up       Normal  3.16 KB    30.01   95315431979199388464207182617231204396     |-->|

Last, run nodetool repair for each keyspace against the next node on the ring:

$ nodetool -h 10.196.14.239 -p 8080 repair Keyspace1
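The second approach can be summarized as a short planning script. The sketch below derives the replacement node's initial_token from the dead node's token and builds the nodetool commands used above. It only prints the commands; replacement_plan is a hypothetical helper, and the token, host, and keyspace are taken from the example in this section.

```python
def replacement_plan(dead_token, contact_host, keyspaces, jmx_port=8080):
    """Return (initial_token, commands) for replacing a dead node.

    The new node takes the dead node's token minus one. The commands are
    meant to be run against a live node once the new node has finished
    bootstrapping.
    """
    nodetool = ["nodetool", "-h", contact_host, "-p", str(jmx_port)]
    commands = [nodetool + ["removetoken", str(dead_token)], nodetool + ["ring"]]
    commands += [nodetool + ["repair", keyspace] for keyspace in keyspaces]
    return dead_token - 1, commands


token, commands = replacement_plan(
    86756232884191218082533150063604543167, "10.196.14.239", ["Keyspace1"]
)
print("initial_token:", token)
for command in commands:
    print(" ".join(command))
```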