Apache Cassandra 0.6 Documentation

Clustering

Tokens, Partitioners, and the Ring

Cassandra servers are each assigned a unique token. When the token from these servers are taken as a set and sorted, they create a range which is used to determine the storage location of individual rows by mapping their keys to a token value. Individual Cassandra servers are responsible for the the set of tokens that includes their own and expands to the previous server’s token. The server with the lowest token from the sort wraps to include the range above the server with the highest token. This range mapping is commonly referred to as “the ring”.

Partitioners

Partitioners decide where a key maps onto the ring.

RandomPartitioner

RandomPartitioner takes the MD5 hash of keys and uses the hash as the location in the ring. This causes keys to be evenly distributed around the entire ring.

OrderPreservingPartitioner

OrderPreservingPartitioner does not use a hash to determine a key’s placement in the ring, but instead compares the key itself (as a byte array) to the tokens in the ring to determine which node is responsible for that key.

Unlike RandomPartitioner, this partitioner allows you to efficiently use get_range() to get a contiguous range of keys.

However, a serious drawback of this partitioner is that it leads to an unbalanced ring which, so node tokens must constantly be adjusted to keep the load equal. Because there are other ways to acheive the nice get_range() behavior, such as Using Column Families as Indexes, RandomPartitioner should always be preferred unless an order preserving partitioner is absolutely needed.

CollatingOrderPreservingPartitioner

Similar to OrderPreservingPartitioner, but compares keys using EN,US rules instead of native byte ordering.

Replication Strategies

Placement Algorithms

Placement Strategy Description
RackUnawareStrategy Returns the nodes that are next to each other on the ring
RackAwareStrategy Places one replica in a different data center while placing the others on different racks in the current data center
DataCenterShardStrategy Divides up the placements such that some datacenters get more duplicates than others. For example, if the replication factor was 6, and there were 3 datacenters, the datacenter replication factors could be 3,2,1 so 3 replicas would go in the first datacenter, 2 replicas in the next, 1 replica in the last

Changing Strategies

Increasing ReplicationFactor

Increasing the ReplicationFactor for a cluster is a two part process with each part having several steps as defined below.

The first part involves the following for each server in the cluster:

  1. Edit the configuration file to contain the desired value (see ReplicationFactor for more information).
  2. Restart Cassandra

The second part, also done to each node in the cluster, involves running nodetool repair against each keyspace individually.

Note that repair can be an intensive process depending on the size of the data so it’s advisable to wait until one repair task finishes on a node before moving on to another.

Snitches

Snitches give Cassandra information about the network topology of the cluster. Cassandra uses this information to route requests as efficiently as possible within the confines of the Replication Strategies. Cassandra comes with two snitches out of the box: EndpointSnitch and DynamicEndpointSnitch.

EndpointSnitch

EndpointSnitch assumes rack and datacenter info is encoded in the IP octets. If the 3rd octet is the same they are in the same rack; if the 2nd octet is the same they are in the same datacenter.

DynamicEndpointSnitch

DynamicEndpointSnitch is more sophisticated in that it sorts endpoints by latency with an adapted phi failure detector.

Adding Capacity

AutoBootstrap

AutoBootstrap is the easiest way to add capacity to a Cassandra cluster. It is a seemingly straightforward operation on a small number of nodes. However, one problem many new users overlook is the imbalance in cluster load that can be caused by continually autoboostrapping in new nodes. Take the following three node cluster as an example:

Address        Status    Load       Range                                     Ring
                                    95315431979199388464207182617231204396
10.196.14.48   Up        340 MB     78197033789183047700859117509977881938    |<--|
10.194.171.160 Up        328 MB     86756232884191218082533150063604543167    |   |
10.196.14.239  Up        334 MB     95315431979199388464207182617231204396    |-->|

Autobootstrapping in one node will introduce an inbalance because the most heavily loaded node, the first in the list, will see its load cut in half while the other two nodes remain at the same level as autobootstrap pulls data from the most heavily loaded node only.

To avoid this issue and still rely on the convenience of autobootstrap, doubling the size of the cluster is the best approach. This will split the load of each node evenly. Alternatively, consider the options in Balancing: InitialToken and Node Movement.

Note

An auto bootstrapping node cannot have itself in the list of seeds nor can it contain an InitialToken already claimed by another node.

Balancing: InitialToken and Node Movement

InitialToken may be used to specify the location in the ring where a node joins. When starting a new cluster, manually specifying the tokens for each node will yield the most balanced ring.

If RandomPartitioner is used, the optimal tokens can be calculated using the following python function:

def tokens(num_nodes):
    for i in range(1, num_nodes + 1):
        print (i * (2 ** 127 - 1) / num_nodes)

After the initial cluster has been created, it must be rebalanced when new nodes are added to the ring to distribute the load equally. After adding a node to the ring, use nodetool move <nodetool-move to shift the other nodes in the ring until the loads are balanced. Keep in mind that nodetool move can be a fairly intensive operation for the node that loses part of its range to a moved node.

Cleanup

After a node has lost part of its range to a newly joined node, it will not automatically remove the data that it no longer needs to keep. To have this data removed, run nodetool cleanup on all nodes that have either lost part of their primary range or part of a range that they replicate.

Replacing a Dead Node

Node failures can be handled by bringing up a replacement node in one of two ways: auto bootstrap with a new IP Address or use the same IP Address and token as the previous host and nodetool repair. Which of these processes to choose depends on your client’s consistency levels and your tolerance for stale or missing data. In discussing these two approaches, we will use the nodes in the following ring as an example:

$ nodetool -h localhost -p 8080 ring
Address       Status   Load       Range                                     Ring
                                  95315431979199388464207182617231204396
10.194.171.160Down     3.72 KB    61078635599166706937511052402724559481    |<--|
10.196.14.48  Up       3.16 KB    78197033789183047700859117509977881938    |   |
10.196.14.239 Up       3.16 KB    95315431979199388464207182617231204396    |-->|

First Approach: Autoboostrap with same IP

During the auto bootstrap process the node will not receive reads until bootstraping is complete. Further, the ring will go down to two nodes when the dead node is removed:

$ nodetool -h localhost -p 8080 ring
Address       Status    Load       Range                                     Ring
                                   95315431979199388464207182617231204396
10.196.14.48  Up        3.16 KB    78197033789183047700859117509977881938    |<--|
10.196.14.239 Up        3.16 KB    95315431979199388464207182617231204396    |-->|

Once the node has completed bootstrapping, the ring will again show three nodes:

$ nodetool -h localhost -p 8080 ring
Address       Status    Load       Range                                     Ring
                                   95315431979199388464207182617231204396
10.196.14.48  Up        3.16 KB    78197033789183047700859117509977881938    |<--|
10.194.171.160Up        495 bytes  86756232884191218082533150063604543167    |   |
10.196.14.239 Up        3.16 KB    95315431979199388464207182617231204396    |-->|

This is the safest approach if the client read consistency level is frequently ONE. The downside of this approach, is that we have relied on the cluster to auto-balance the load and our replacement node now has a different token.

Second Approach: New Node With Manual Token Selection

If you do more QUORUM reads or you can withstand empty results being returned, or you want to maintain the token assignments on the ring, this is should be the preferred approach. The following ring will be used in this example (The same status from the first example):

$ nodetool -h localhost -p 8080 ring
Address       Status    Load       Range                                     Ring
                                   95315431979199388464207182617231204396
10.196.14.48  Up        3.16 KB    78197033789183047700859117509977881938    |<--|
10.194.171.160Up        495 bytes  86756232884191218082533150063604543167    |   |
10.196.14.239 Up        3.16 KB    95315431979199388464207182617231204396    |-->|

First, pick a token for your new node which is -1 from the token of the dead node and start the new node. Given the above ring, our value for InitialToken will be:

86756232884191218082533150063604543166

Once autoboostrap is complete, the ring should now have four nodes, three active and one down:

$ nodetool -h localhost -p 8080 ring
Address       Status    Load       Range                                     Ring
                                   95315431979199388464207182617231204396
10.196.14.48  Up        3.16 KB    78197033789183047700859117509977881938    |<--|
10.203.30.139 Up        495 bytes  86756232884191218082533150063604543166    |   |
10.194.171.160Down      495 bytes  86756232884191218082533150063604543167    |   |
10.196.14.239 Up        3.16 KB    95315431979199388464207182617231204396    |-->|

Now remove the dead node from the ring with nodetool removetoken using the down node’s token:

$ nodetool -h 10.196.14.239 -p 8080 removetoken 86756232884191218082533150063604543167

Verify the integrity of the ring:

$ nodetool -h 10.196.14.239 -p 8080 ring
Address       Status    Load       Range                                     Ring
                                   95315431979199388464207182617231204396
10.196.14.48  Up        3.16 KB    78197033789183047700859117509977881938    |<--|
10.203.30.139 Up        495 bytes  86756232884191218082533150063604543166    |   |
10.196.14.239 Up        3.16 KB    95315431979199388464207182617231204396    |-->|

Now run nodetool repair for each keyspace against the next node on the ring. The repair should happen quickly as it will only need one key.

$ nodetool -h 10.196.14.239 -p 8080 repair Keyspace1