Cassandra servers are each assigned a unique token. When the tokens from these servers are taken as a set and sorted, they create a range that is used to determine the storage location of individual rows by mapping their keys to a token value. Each Cassandra server is responsible for the set of tokens that includes its own and extends back to the previous server’s token. The server with the lowest token in the sort wraps around to also cover the range above the server with the highest token. This range mapping is commonly referred to as “the ring”.
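The lookup described above can be sketched in a few lines of Python. This is only an illustration, not Cassandra's implementation: the function name `owner_for_token` is hypothetical, the addresses and tokens are borrowed from the example ring used later in this document, and treating a node's own token as the inclusive upper bound of its range is an assumption.

```python
from bisect import bisect_left

def owner_for_token(ring, token):
    """Return the node responsible for `token`.

    `ring` is a list of (token, node) pairs sorted by token. Assumption:
    a node owns everything above the previous node's token up to and
    including its own token.
    """
    tokens = [t for t, _ in ring]
    # First node whose token is greater than or equal to the lookup token.
    index = bisect_left(tokens, token)
    if index == len(ring):
        # Above the highest token: wrap around to the lowest-token node.
        index = 0
    return ring[index][1]

# Hypothetical ring built from the example addresses in this document.
ring = sorted([
    (78197033789183047700859117509977881938, "10.196.14.48"),
    (86756232884191218082533150063604543167, "10.194.171.160"),
    (95315431979199388464207182617231204396, "10.196.14.239"),
])
```

Calling `owner_for_token(ring, t)` with a token one above the highest node token returns the lowest-token node, which is the wrap-around behavior that makes the range a ring.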
Partitioners decide where a key maps onto the ring.
RandomPartitioner takes the MD5 hash of keys and uses the hash as the location in the ring. This causes keys to be evenly distributed around the entire ring.
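As a rough illustration of hashing a key onto the ring, the sketch below derives a token from an MD5 digest. The function name `random_partitioner_token` is hypothetical, and interpreting the digest as a signed big-endian integer and taking its absolute value is an assumption made here so that the result stays within 0 to 2**127; it is not presented as the exact conversion Cassandra performs.

```python
import hashlib

def random_partitioner_token(key: bytes) -> int:
    """Map a row key to a ring position using its MD5 hash (sketch)."""
    digest = hashlib.md5(key).digest()  # 16 bytes, i.e. 128 bits
    # Assumption: read the digest as a signed integer and drop the sign,
    # which keeps the token in the range 0..2**127.
    return abs(int.from_bytes(digest, byteorder="big", signed=True))
```

Because MD5 output is effectively uniform, keys that are adjacent in sort order (for example `b"user:1"` and `b"user:2"`) generally land at unrelated positions on the ring, which is what spreads the load evenly.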
OrderPreservingPartitioner does not use a hash to determine a key’s placement in the ring, but instead compares the key itself (as a byte array) to the tokens in the ring to determine which node is responsible for that key.
Unlike RandomPartitioner, this partitioner allows you to efficiently use get_range() to get a contiguous range of keys.
However, a serious drawback of this partitioner is that it leads to an unbalanced ring, so node tokens must constantly be adjusted to keep the load equal. Because there are other ways to achieve the nice get_range() behavior, such as Using Column Families as Indexes, RandomPartitioner should always be preferred unless an order preserving partitioner is absolutely needed.
CollatingOrderPreservingPartitioner is similar to OrderPreservingPartitioner, but compares keys using en_US collation rules instead of native byte ordering.
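To see why byte-ordered placement tends to unbalance the ring, consider the following sketch. The tokens `b"h"`, `b"p"`, `b"z"` and the `user:` key prefix are made up for illustration, and `owner_index` is a hypothetical helper, not a Cassandra API.

```python
from bisect import bisect_left
from collections import Counter

def owner_index(tokens, key):
    """Index of the node owning `key` when keys are compared as raw bytes."""
    # bisect_left finds the first token >= key; the modulo wraps keys
    # above the highest token back to the first node.
    return bisect_left(tokens, key) % len(tokens)

# Hypothetical three-node ring with byte-string tokens.
tokens = [b"h", b"p", b"z"]

# Keys sharing a common prefix sort next to each other...
keys = [b"user:%d" % i for i in range(100)]

# ...so they all fall into the same node's range.
counts = Counter(owner_index(tokens, k) for k in keys)
```

Here every key falls between `b"p"` and `b"z"`, so `counts` shows a single node holding all 100 rows. The same property that makes contiguous range scans efficient is what concentrates the load.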
| Placement Strategy | Description |
|---|---|
| RackUnawareStrategy | Returns the nodes that are next to each other on the ring |
| RackAwareStrategy | Places one replica in a different data center while placing the others on different racks in the current data center |
| DataCenterShardStrategy | Divides up the placements such that some datacenters get more duplicates than others. For example, if the replication factor was 6, and there were 3 datacenters, the datacenter replication factors could be 3,2,1 so 3 replicas would go in the first datacenter, 2 replicas in the next, 1 replica in the last |
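The simplest of these strategies can be sketched as follows. This is a minimal illustration of "nodes that are next to each other on the ring"; the function name `rack_unaware_replicas` and the choice to walk forward from the primary node are assumptions, not taken from the Cassandra source.

```python
def rack_unaware_replicas(ring_nodes, primary_index, replication_factor):
    """Pick replicas as consecutive nodes on the ring (sketch).

    `ring_nodes` is the list of nodes in token order and `primary_index`
    is the position of the node that owns the key's token.
    """
    # Never return more replicas than there are nodes.
    count = min(replication_factor, len(ring_nodes))
    # Walk forward from the primary, wrapping at the end of the ring.
    return [ring_nodes[(primary_index + i) % len(ring_nodes)]
            for i in range(count)]
```

For a four-node ring `["A", "B", "C", "D"]` with the primary at index 3 and a replication factor of 3, the sketch selects `D`, `A`, and `B`.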
Increasing the ReplicationFactor for a cluster is a two-part process, with each part having several steps as defined below.
The first part involves the following for each server in the cluster:
The second part, also done to each node in the cluster, involves running nodetool repair against each keyspace individually.
Note that repair can be an intensive process depending on the size of the data so it’s advisable to wait until one repair task finishes on a node before moving on to another.
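One way to organize the second part is to generate the repair commands up front and then run them strictly one at a time. The helper below is a hypothetical sketch that only builds command strings; the `-h`, `-p 8080`, and `repair <keyspace>` form mirrors the nodetool invocations shown elsewhere in this document, and the host and keyspace names are placeholders.

```python
def repair_commands(hosts, keyspaces, port=8080):
    """Build one nodetool repair command per (host, keyspace) pair.

    The commands are ordered so that every keyspace on a host is listed
    before moving on to the next host.
    """
    return [
        f"nodetool -h {host} -p {port} repair {keyspace}"
        for host in hosts
        for keyspace in keyspaces
    ]

# Placeholder inputs for illustration only.
commands = repair_commands(
    ["10.196.14.48", "10.196.14.239"],
    ["Keyspace1"],
)
```

Running the resulting commands sequentially, waiting for each to finish before starting the next, follows the advice above about not overlapping repair tasks.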
Snitches give Cassandra information about the network topology of the cluster. Cassandra uses this information to route requests as efficiently as possible within the confines of the Replication Strategies. Cassandra comes with two snitches out of the box: EndpointSnitch and DynamicEndpointSnitch.
EndpointSnitch assumes rack and datacenter info is encoded in the IP octets. If the 3rd octet is the same they are in the same rack; if the 2nd octet is the same they are in the same datacenter.
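The octet rule can be expressed directly. The sketch below applies it literally as stated above, comparing only the octet in question; the function names are hypothetical and it handles dotted IPv4 strings only.

```python
def octet(address, position):
    """Return the 1-based `position`-th octet of a dotted IPv4 address."""
    return address.split(".")[position - 1]

def same_datacenter(a, b):
    # Same 2nd octet means same datacenter.
    return octet(a, 2) == octet(b, 2)

def same_rack(a, b):
    # Same 3rd octet means same rack (only that octet is compared).
    return octet(a, 3) == octet(b, 3)
```

With the addresses from the example ring in this document, `10.196.14.48` and `10.196.14.239` share both a datacenter and a rack, while `10.194.171.160` is in a different datacenter from either.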
DynamicEndpointSnitch is more sophisticated in that it sorts endpoints by latency with an adapted phi failure detector.
AutoBootstrap is the easiest way to add capacity to a Cassandra cluster. It is a seemingly straightforward operation on a small number of nodes. However, one problem many new users overlook is the imbalance in cluster load that can be caused by continually autobootstrapping in new nodes. Take the following three node cluster as an example:
Address Status Load Range Ring
95315431979199388464207182617231204396
10.196.14.48 Up 340 MB 78197033789183047700859117509977881938 |<--|
10.194.171.160 Up 328 MB 86756232884191218082533150063604543167 | |
10.196.14.239 Up 334 MB 95315431979199388464207182617231204396 |-->|
Autobootstrapping in one node will introduce an imbalance: because autobootstrap pulls data only from the most heavily loaded node, that node (the first in the list) will see its load cut in half while the other two nodes remain at the same level.
To avoid this issue and still rely on the convenience of autobootstrap, doubling the size of the cluster is the best approach. This will split the load of each node evenly. Alternatively, consider the options in Balancing: InitialToken and Node Movement.
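The difference between adding one node and doubling the cluster can be illustrated with a simplified model. The sketch assumes a new node takes exactly half of the load of the node it splits, which is an idealization, and both function names are hypothetical.

```python
def bootstrap_one(loads):
    """Add one node by splitting the most heavily loaded node in half."""
    loads = list(loads)
    heaviest = loads.index(max(loads))
    half = loads[heaviest] / 2
    loads[heaviest] = half
    # The new node sits next to the node it split and takes the other half.
    loads.insert(heaviest, half)
    return loads

def double_cluster(loads):
    """Add one new node per existing node, halving every node's load."""
    return [half for load in loads for half in (load / 2, load / 2)]

# Loads in MB from the three node example above.
after_one = bootstrap_one([340, 328, 334])
after_doubling = double_cluster([340, 328, 334])
```

In this model `after_one` leaves two nodes at 170 MB next to nodes at 328 MB and 334 MB, while `after_doubling` keeps all six nodes between 164 MB and 170 MB.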
Note
An auto bootstrapping node cannot have itself in the list of seeds nor can it contain an InitialToken already claimed by another node.
InitialToken may be used to specify the location in the ring where a node joins. When starting a new cluster, manually specifying the tokens for each node will yield the most balanced ring.
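If RandomPartitioner is used, the optimal tokens can be calculated using the following Python function:
def tokens(num_nodes):
    # Print one evenly spaced token per node across the token range.
    for i in range(1, num_nodes + 1):
        # Integer division keeps each token an exact whole number.
        print(i * (2 ** 127 - 1) // num_nodes)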
After the initial cluster has been created, it must be rebalanced whenever new nodes are added to the ring so that the load stays evenly distributed. After adding a node to the ring, use nodetool move to shift the other nodes in the ring until the loads are balanced. Keep in mind that nodetool move can be a fairly intensive operation for the node that loses part of its range to a moved node.
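When planning such a rebalance with RandomPartitioner, it can help to compare the current tokens against the evenly spaced targets for the new cluster size. The sketch below reuses the token formula from this document; `balanced_tokens` and `planned_moves` are hypothetical helper names, and pairing tokens by sorted position is just one simple way to plan the moves.

```python
def balanced_tokens(num_nodes):
    """Evenly spaced tokens for a RandomPartitioner ring of `num_nodes`."""
    return [i * (2 ** 127 - 1) // num_nodes for i in range(1, num_nodes + 1)]

def planned_moves(current_tokens):
    """Pair each current token with its balanced target, in ring order.

    Returns (old_token, new_token) pairs only for nodes whose token
    would have to change.
    """
    targets = balanced_tokens(len(current_tokens))
    return [(old, new)
            for old, new in zip(sorted(current_tokens), targets)
            if old != new]
```

An empty result from `planned_moves` means the ring is already balanced; otherwise each pair identifies a node that is a candidate for nodetool move.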
After a node has lost part of its range to a newly joined node, it will not automatically remove the data that it no longer needs to keep. To have this data removed, run nodetool cleanup on all nodes that have either lost part of their primary range or part of a range that they replicate.
Node failures can be handled by bringing up a replacement node in one of two ways: autobootstrap the replacement with a new IP address, or bring it up with the same IP address and token as the previous host and then run nodetool repair. Which of these processes to choose depends on your clients’ consistency levels and your tolerance for stale or missing data. In discussing these two approaches, we will use the nodes in the following ring as an example:
$ nodetool -h localhost -p 8080 ring
Address Status Load Range Ring
95315431979199388464207182617231204396
10.194.171.160Down 3.72 KB 61078635599166706937511052402724559481 |<--|
10.196.14.48 Up 3.16 KB 78197033789183047700859117509977881938 | |
10.196.14.239 Up 3.16 KB 95315431979199388464207182617231204396 |-->|
During the autobootstrap process, the node will not receive reads until bootstrapping is complete. Further, the ring will go down to two nodes when the dead node is removed:
$ nodetool -h localhost -p 8080 ring
Address Status Load Range Ring
95315431979199388464207182617231204396
10.196.14.48 Up 3.16 KB 78197033789183047700859117509977881938 |<--|
10.196.14.239 Up 3.16 KB 95315431979199388464207182617231204396 |-->|
Once the node has completed bootstrapping, the ring will again show three nodes:
$ nodetool -h localhost -p 8080 ring
Address Status Load Range Ring
95315431979199388464207182617231204396
10.196.14.48 Up 3.16 KB 78197033789183047700859117509977881938 |<--|
10.194.171.160Up 495 bytes 86756232884191218082533150063604543167 | |
10.196.14.239 Up 3.16 KB 95315431979199388464207182617231204396 |-->|
This is the safest approach if the client read consistency level is frequently ONE. The downside of this approach is that we have relied on the cluster to auto-balance the load, and our replacement node now has a different token.
If you do more QUORUM reads, can withstand empty results being returned, or want to maintain the token assignments on the ring, this should be the preferred approach. The following ring will be used in this example (the same status as in the first example):
$ nodetool -h localhost -p 8080 ring
Address Status Load Range Ring
95315431979199388464207182617231204396
10.196.14.48 Up 3.16 KB 78197033789183047700859117509977881938 |<--|
10.194.171.160Up 495 bytes 86756232884191218082533150063604543167 | |
10.196.14.239 Up 3.16 KB 95315431979199388464207182617231204396 |-->|
First, pick a token for your new node that is one less than the token of the dead node, and start the new node. Given the above ring, our value for InitialToken will be:
86756232884191218082533150063604543166
Once autobootstrap is complete, the ring should have four nodes, three active and one down:
$ nodetool -h localhost -p 8080 ring
Address Status Load Range Ring
95315431979199388464207182617231204396
10.196.14.48 Up 3.16 KB 78197033789183047700859117509977881938 |<--|
10.203.30.139 Up 495 bytes 86756232884191218082533150063604543166 | |
10.194.171.160Down 495 bytes 86756232884191218082533150063604543167 | |
10.196.14.239 Up 3.16 KB 95315431979199388464207182617231204396 |-->|
Now remove the dead node from the ring with nodetool removetoken using the down node’s token:
$ nodetool -h 10.196.14.239 -p 8080 removetoken 86756232884191218082533150063604543167
Verify the integrity of the ring:
$ nodetool -h 10.196.14.239 -p 8080 ring
Address Status Load Range Ring
95315431979199388464207182617231204396
10.196.14.48 Up 3.16 KB 78197033789183047700859117509977881938 |<--|
10.203.30.139 Up 495 bytes 86756232884191218082533150063604543166 | |
10.196.14.239 Up 3.16 KB 95315431979199388464207182617231204396 |-->|
Now run nodetool repair for each keyspace against the next node on the ring. The repair should happen quickly as it will only need one key.
$ nodetool -h 10.196.14.239 -p 8080 repair Keyspace1