Cassandra servers are each assigned a unique token. When the tokens from these servers are taken as a set and sorted, they create a range which is used to determine the storage location of individual rows by mapping their keys to a token value. Each Cassandra server is responsible for the set of tokens that includes its own and extends back to the previous server’s token. The server with the lowest token from the sort wraps around to include the range above the server with the highest token. This range mapping is commonly referred to as “the ring”.
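The ownership rule described above can be sketched in a few lines of Python. This is only an illustration, not Cassandra's implementation: the function name `owner_for_token`, the node names, and the small token values are all hypothetical, and each node is assumed to own the range from just above the previous node's token up to and including its own.

```python
from bisect import bisect_left

def owner_for_token(ring, token):
    """Return the name of the node that owns `token`.

    `ring` is a list of (node_token, node_name) pairs sorted by token.
    A node owns the range (previous node's token, its own token].
    """
    tokens = [t for t, _ in ring]
    # Index of the first node whose token is >= the key's token.
    idx = bisect_left(tokens, token)
    # Past the highest token: wrap around to the lowest-token node.
    if idx == len(ring):
        idx = 0
    return ring[idx][1]

# Hypothetical three-node ring with small tokens for readability.
ring = [(0, "node-a"), (100, "node-b"), (200, "node-c")]
print(owner_for_token(ring, 42))    # node-b
print(owner_for_token(ring, 100))   # node-b
print(owner_for_token(ring, 250))   # node-a (wraps around)
```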
Partitioners decide where a key maps onto the ring.
RandomPartitioner takes the MD5 hash of keys and uses the hash as the location in the ring. This causes keys to be evenly distributed around the entire ring.
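As a rough sketch of the idea, the following Python derives a ring position from a key's MD5 hash. It assumes the 16-byte digest is read as a signed big-endian integer and its absolute value is used as the token; the function name `random_partitioner_token` is hypothetical, and the exact conversion used by a given Cassandra release may differ.

```python
import hashlib

def random_partitioner_token(key: bytes) -> int:
    """Map a row key to a token in the range 0..2**127 via MD5.

    Assumption: the digest is interpreted as a signed big-endian
    integer and the absolute value is taken.
    """
    digest = hashlib.md5(key).digest()
    return abs(int.from_bytes(digest, byteorder="big", signed=True))

# Keys that sort next to each other generally hash to unrelated tokens.
for key in (b"user:1", b"user:2", b"user:3"):
    print(key, random_partitioner_token(key))
```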
OrderPreservingPartitioner does not use a hash to determine a key’s placement in the ring, but instead compares the key itself (as a byte array) to the tokens in the ring to determine which node is responsible for that key. This partitioner uses UTF-8 validation for keys.
Unlike RandomPartitioner, this partitioner allows you to efficiently use get_range() to get a contiguous range of keys.
However, a serious drawback of this partitioner is that it leads to an unbalanced ring, so node tokens must constantly be adjusted to keep the load equal. Because there are other ways to achieve the nice get_range() behavior, such as Using Column Families as Indexes, RandomPartitioner should always be preferred unless an order-preserving partitioner is absolutely needed.
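The reason an order-preserving layout makes range reads cheap can be illustrated with a sorted list of keys: a range query becomes a single contiguous slice. The sketch below is plain Python rather than a Cassandra client call; `key_range` and the sample keys are hypothetical.

```python
from bisect import bisect_left, bisect_right

def key_range(sorted_keys, start, end):
    """Return all keys k with start <= k <= end from a sorted key list."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_right(sorted_keys, end)
    # Because the keys are stored in order, the result is one contiguous slice.
    return sorted_keys[lo:hi]

keys = sorted([b"video:2", b"user:bob", b"user:alice", b"video:1", b"user:carol"])
print(key_range(keys, b"user:", b"user:~"))
```

With hashed placement, the same three keys would be scattered across the ring, so a contiguous scan of this kind is not available.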
ByteOrderedPartitioner supports keys with arbitrary content, and orders keys by their byte value. Because it does not enforce UTF-8 keys, ByteOrderedPartitioner may be preferred over OrderPreservingPartitioner in deployments where storing keys that are not valid UTF-8 is a requirement.
CollatingOrderPreservingPartitioner is similar to OrderPreservingPartitioner, but compares keys using EN,US collation rules instead of native byte ordering.
Replication of data in Cassandra is controlled by the replication_factor setting in the keyspace definition. The default value of 1 means that data is written to only one node in the ring – essentially, replication is disabled by default.
Choosing a higher value for the replication factor creates data redundancy for failover. With the replication factor set to 3, for example, write operations are committed to three separate nodes. This leaves two copies of a record available if one of the three replicas fails. It also means that each node acts as a replica for three separate key ranges.
Note
The replication factor reflects the total number of replicas. A replication factor of 1 does not mean there is an additional copy of the data; it means there is one copy in total, and no data redundancy at all.
The actual placement of replicas in the cluster is determined by the Replica Placement Strategies. If your cluster spans multiple data centers, selecting a placement strategy is especially important to your overall replication setup.
When choosing the replication factor for a keyspace, consider the number carefully; Changing the Replication Factor is a non-trivial task that can temporarily affect the performance of your cluster.
For keyspaces where the replication factor is higher than 1, the replica placement strategy determines which nodes contain the additional copy, or copies, of data. You can define the replica placement strategy when creating a keyspace in the Cassandra CLI. The default value is org.apache.cassandra.locator.SimpleStrategy.
SimpleStrategy can be efficient for a cluster located within a single data center. For clusters spanning multiple data centers, DataStax recommends using one of the “rack-aware” strategies – NetworkTopologyStrategy or OldNetworkTopologyStrategy.
| Placement Strategy | Description |
|---|---|
| SimpleStrategy | Returns the nodes that are next to each other on the ring |
| NetworkTopologyStrategy | Allows you to configure the number of replicas per data center as specified in the strategy_options. Replicas are placed on different racks within each data center, if possible. |
| OldNetworkTopologyStrategy | Places one replica in a different data center while placing the others on different racks in the current data center. Optimized for exactly three replicas across two datacenters. |
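To make the SimpleStrategy row concrete, here is a minimal sketch of placing replicas on consecutive nodes of the ring. It is an illustration under stated assumptions, not Cassandra's code: `simple_strategy_replicas`, the node names, and the tokens are hypothetical, and the first replica is assumed to be the node that owns the key's token.

```python
from bisect import bisect_left

def simple_strategy_replicas(ring, token, replication_factor):
    """Return the nodes holding replicas for `token`.

    `ring` is a list of (node_token, node_name) pairs sorted by token.
    The first replica is the owner of the token; the rest are the
    nodes that follow it on the ring, wrapping around at the end.
    """
    tokens = [t for t, _ in ring]
    start = bisect_left(tokens, token) % len(ring)
    # A cluster cannot hold more distinct replicas than it has nodes.
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]

ring = [(0, "node-a"), (100, "node-b"), (200, "node-c"), (300, "node-d")]
print(simple_strategy_replicas(ring, 150, 3))  # ['node-c', 'node-d', 'node-a']
```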
A cluster using one of the rack-aware replica placement strategies must also use one of the rack-aware Snitches.
Increasing the ReplicationFactor for a cluster is a two-part process, with each part having several steps as defined below.
The first part involves the following for each server in the cluster:
The second part, also done to each node in the cluster, involves running nodetool repair against each keyspace individually.
Note that repair can be an intensive process depending on the size of the data so it’s advisable to wait until one repair task finishes on a node before moving on to another.
Snitches give Cassandra information about the network topology of the cluster. Cassandra uses this information to route inter-node requests as efficiently as possible within the confines of the Replica Placement Strategies. Snitches do not affect client requests.
Cassandra comes with three snitches out of the box: SimpleSnitch, RackInferringSnitch and PropertyFileSnitch. Each of these snitches uses a different method to determine the location of nodes in the network and the most efficient way to route requests.
SimpleSnitch can be efficient for locating nodes in clusters limited to a single data center. This snitch is configured by default.
RackInferringSnitch extrapolates the topology of the network by analyzing IP addresses. This snitch assumes that the second octet identifies the data center where a node is located, and the third octet identifies the rack.
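The octet convention can be sketched as follows. The function `infer_location` is a hypothetical helper written for illustration only; it simply splits a dotted IPv4 address and returns the second and third octets as the data center and rack identifiers.

```python
def infer_location(ip_address: str):
    """Return (data_center, rack) inferred from an IPv4 address.

    Follows the convention described above: the second octet names
    the data center and the third octet names the rack.
    """
    octets = ip_address.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted IPv4 address: %r" % ip_address)
    return octets[1], octets[2]

print(infer_location("10.196.14.48"))    # ('196', '14')
print(infer_location("10.196.14.239"))   # ('196', '14'): same data center and rack
print(infer_location("10.203.30.139"))   # ('203', '30')
```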
PropertyFileSnitch determines the location of nodes by referring to a user-defined description of the network details located in the property file cassandra-topology.properties. Your installation contains an example properties file for PropertyFileSnitch in $CASSANDRA_HOME/conf/cassandra-topology.properties.
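As an illustration of what such a topology description conveys, the sketch below parses a few sample lines and looks up a node's location. It assumes entries of the form address=datacenter:rack plus an optional default entry; check the example file shipped with your installation for the authoritative format. The helper names, data center names, and rack names here are hypothetical.

```python
SAMPLE = """\
# Hypothetical topology entries: address=datacenter:rack
10.196.14.48=DC1:RAC1
10.196.14.239=DC1:RAC2
10.203.30.139=DC2:RAC1
default=DC1:RAC1
"""

def parse_topology(text):
    """Parse address=datacenter:rack lines into a dict."""
    topology = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        host, _, location = line.partition("=")
        dc, _, rack = location.partition(":")
        topology[host.strip()] = (dc.strip(), rack.strip())
    return topology

def locate(topology, address):
    """Return (datacenter, rack) for a node, falling back to the default entry."""
    return topology.get(address, topology.get("default"))

topology = parse_topology(SAMPLE)
print(locate(topology, "10.203.30.139"))  # ('DC2', 'RAC1')
print(locate(topology, "10.0.0.1"))       # ('DC1', 'RAC1') via the default entry
```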
To configure the snitch for Cassandra, set the desired value for endpoint_snitch. For example, to configure RackInferringSnitch, edit cassandra.yaml as follows:
endpoint_snitch: org.apache.cassandra.locator.RackInferringSnitch
Note
With NetworkTopologyStrategy or OldNetworkTopologyStrategy, you must use one of the rack-aware snitches – either RackInferringSnitch or PropertyFileSnitch.
By default, all snitches are wrapped in a dynamic snitch layer that monitors read latency and, when possible, routes requests away from poorly performing nodes. Dynamic snitching is recommended for use in most deployments.
The performance threshold and intervals for dynamic snitching can be controlled using these parameters in cassandra.yaml:
You can add capacity to a Cassandra cluster by autobootstrapping new nodes. Adding more nodes than the existing number is not supported in a single operation, and must be performed in stages.
Increasing the number of nodes by an integer factor – such as doubling, tripling, or quadrupling the number of nodes – is operationally less complicated than increasing the number unevenly, as it does not require moving tokens on existing nodes. This is because the new token ranges evenly divide the existing token ranges, and the existing nodes can keep their initial token values while new nodes assume half (or a third, or fourth) of the range.
However, any expansion of the cluster requires you to calculate new initial_token values for all nodes, and requires performance-intensive cleanup operations to remove unused keys. DataStax recommends scheduling these operations for low-use hours.
Whenever you expand capacity, calculate the correct token values for all nodes in the expanded cluster. To determine the correct initial token values for the cluster, divide 2 to the 127th power by the total number of nodes, number the nodes starting with zero, then multiply each node’s number by the quotient. The Cassandra Wiki provides a Python program to calculate new tokens for the nodes. A Python script like the following, when run from a command line, will prompt you for the number of nodes (unless you pass it as an argument) and will print the initial token values:
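#!/usr/bin/env python3
import sys

# Number of nodes: first command-line argument, or prompt if absent.
if len(sys.argv) > 1:
    num = int(sys.argv[1])
else:
    num = int(input("How many nodes are in your cluster? "))

# Space the tokens evenly across the range 0 .. 2**127.
# Integer division (//) keeps the result an exact integer.
for i in range(num):
    print('node %d: %d' % (i, i * (2**127) // num))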
Once you have calculated these values, you can provide the appropriate value for each node. For the new nodes, you will use the calculated value to set initial_token in cassandra.yaml. For existing nodes, you will use nodetool move <new_token>. This is required in order to balance the load evenly.
In scenarios where you are doubling the number of nodes, half of the tokens calculated should already belong to existing nodes. Make sure you assign the new, unused tokens to the new nodes.
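A short sketch can help confirm which of the calculated tokens are new when growing by an integer factor. It uses the same formula as above (node number times 2 to the 127th power, divided by the node count); the helper names `initial_tokens` and `tokens_for_new_nodes` are hypothetical.

```python
def initial_tokens(num_nodes):
    """Evenly spaced tokens for a cluster of `num_nodes` nodes."""
    return [i * (2**127) // num_nodes for i in range(num_nodes)]

def tokens_for_new_nodes(old_count, new_count):
    """Tokens of the expanded cluster that no existing node already holds."""
    existing = set(initial_tokens(old_count))
    return [t for t in initial_tokens(new_count) if t not in existing]

# Doubling a three-node cluster: three of the six tokens are already in use.
for token in tokens_for_new_nodes(3, 6):
    print(token)
```

When the growth is not by an integer factor (for example, three nodes to four), most of the calculated tokens differ from the existing ones, which is why the existing nodes then need to be moved.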
To add nodes to a Cassandra cluster
Note
An autobootstrapping node cannot have itself in the list of seeds nor can it contain an initial_token already claimed by another node. To add new seeds, autobootstrap the nodes first, and then configure them as seeds.
Note
If you are increasing the cluster by an integer factor (such as doubling it), you can skip the above step for moving tokens. This assumes that you have calculated all tokens for the cluster and have specified the new tokens in cassandra.yaml on the new nodes.
Node failures can be handled by bringing up a replacement node in one of two ways: autobootstrap the replacement with a new IP address, or reuse the IP address and token of the failed host and run nodetool repair. Which of these processes to choose depends on your client’s consistency levels and your tolerance for stale or missing data. In discussing these two approaches, we will use the nodes in the following ring as an example:
$ nodetool -h localhost -p 8080 ring
Address Status State Load Owns Range Ring
95315431979199388464207182617231204396
10.194.171.160 Down Normal ? 39.98 61078635599166706937511052402724559481 |<--|
10.196.14.48 Up Normal 3.16 KB 30.01 78197033789183047700859117509977881938 | |
10.196.14.239 Up Normal 3.16 KB 30.01 95315431979199388464207182617231204396 |-->|
During the autobootstrap process the node will not receive reads until bootstrapping is complete. Further, the ring will go down to two nodes when the dead node is removed:
$ nodetool -h localhost -p 8080 ring
Address Status State Load Owns Range Ring
95315431979199388464207182617231204396
10.196.14.48 Up Normal 3.16 KB 30.01 78197033789183047700859117509977881938 |<--|
10.196.14.239 Up Normal 3.16 KB 30.01 95315431979199388464207182617231204396 |-->|
Once the node has completed bootstrapping, the ring will again show three nodes:
$ nodetool -h localhost -p 8080 ring
Address Status State Load Owns Range Ring
95315431979199388464207182617231204396
10.196.14.48 Up Normal 3.16 KB 30.01 78197033789183047700859117509977881938 |<--|
10.194.171.160 Up Normal 495 bytes 01.75 86756232884191218082533150063604543167 | |
10.196.14.239 Up Normal 3.16 KB 30.01 95315431979199388464207182617231204396 |-->|
This is the safest approach if the client read consistency level is frequently ONE. The downside of this approach is that we have relied on the cluster to auto-balance the load and our replacement node now has a different token.
If you mostly perform QUORUM reads, can tolerate empty results being returned, or want to maintain the token assignments on the ring, this is the preferred approach. The following ring will be used in this example (the same status as in the first example):
$ nodetool -h localhost -p 8080 ring
Address Status State Load Owns Range Ring
95315431979199388464207182617231204396
10.194.171.160 Down Normal ? 39.98 61078635599166706937511052402724559481 |<--|
10.196.14.48 Up Normal 3.16 KB 30.01 78197033789183047700859117509977881938 | |
10.196.14.239 Up Normal 3.16 KB 30.01 95315431979199388464207182617231204396 |-->|
First, pick a token for your new node that is one less than the token of the dead node, and start the new node. Given the above ring, our value for initial_token will be:
86756232884191218082533150063604543166
Once autobootstrap is complete, the ring should have four nodes, three active and one down:
$ nodetool -h localhost -p 8080 ring
Address Status State Load Owns Range Ring
95315431979199388464207182617231204396
10.196.14.48 Up Normal 3.16 KB 30.01 78197033789183047700859117509977881938 |<--|
10.203.30.139 Up Normal 495 bytes 01.75 86756232884191218082533150063604543166 | |
10.194.171.160 Down Normal ? 01.75 86756232884191218082533150063604543167 | |
10.196.14.239 Up Normal 3.16 KB 30.01 95315431979199388464207182617231204396 |-->|
Remove the dead node from the ring with nodetool removetoken using the down node’s token:
$ nodetool -h 10.196.14.239 -p 8080 removetoken 86756232884191218082533150063604543167
Verify the integrity of the ring:
$ nodetool -h 10.196.14.239 -p 8080 ring
Address Status State Load Owns Range Ring
95315431979199388464207182617231204396
10.196.14.48 Up Normal 3.16 KB 30.01 78197033789183047700859117509977881938 |<--|
10.203.30.139 Up Normal 495 bytes 01.75 86756232884191218082533150063604543166 | |
10.196.14.239 Up Normal 3.16 KB 30.01 95315431979199388464207182617231204396 |-->|
Finally, run nodetool repair for each keyspace against the next node on the ring:
$ nodetool -h 10.196.14.239 -p 8080 repair Keyspace1