Apache Cassandra 0.8 Documentation

Managing a Cassandra Cluster

This section discusses routine management and maintenance tasks.

Running Routine Node Repair

The nodetool repair command repairs inconsistencies across all of the replicas for a given range of data. Repair should be run at regular intervals during normal operations, as well as during node recovery scenarios (bringing a node back into the cluster after a failure).

Unless Cassandra applications perform no deletes at all, production clusters must schedule repair to run periodically on all nodes. The hard requirement for repair frequency is the value used for gc_grace_seconds: make sure you run a repair operation at least once on each node within this time period. Following this guideline ensures that deletes are properly handled in the cluster. Otherwise, a replica that missed a delete can bring the deleted data back once the delete marker (tombstone) has been garbage collected on the other replicas.

Note

Repair is an expensive operation in both disk and CPU consumption. Use caution when running node repair on more than one node at a time, and schedule regular repair operations for low-usage hours.

In systems that seldom delete or overwrite data, it is possible to raise the value of gc_grace_seconds at a minimal cost in extra disk space used. This allows wider intervals for scheduling repair operations with the nodetool utility.
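
The scheduling rule above reduces to a simple deadline check. The following is a minimal sketch, not part of Cassandra: the function names are hypothetical, and the gc_grace_seconds value of 864000 (10 days) is an assumption that should be replaced with the value configured for your column families.

```python
from datetime import datetime, timedelta

# Assumed value; use the gc_grace_seconds configured for your column families.
GC_GRACE_SECONDS = 864000


def repair_deadline(last_repair, gc_grace_seconds=GC_GRACE_SECONDS):
    """Latest time by which the next repair of a node must have run."""
    return last_repair + timedelta(seconds=gc_grace_seconds)


def repair_overdue(last_repair, now, gc_grace_seconds=GC_GRACE_SECONDS):
    """True if the node has gone a full gc_grace_seconds without a repair."""
    return now >= repair_deadline(last_repair, gc_grace_seconds)


if __name__ == "__main__":
    last = datetime(2011, 6, 1)
    print(repair_deadline(last))
    print(repair_overdue(last, datetime(2011, 6, 12)))
```

A check like this could be run per node from a monitoring job. In practice you would schedule repairs well inside the deadline rather than at it.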

Adding Capacity to an Existing Cluster

Cassandra allows you to add capacity to a cluster by introducing new nodes to the cluster in stages. When a new node joins an existing cluster, it needs to know:

  • Its position in the ring and the range of data it is responsible for. This is determined by the settings of initial_token and auto_bootstrap when the node first starts up.
  • The nodes it should contact to learn about the cluster and establish the gossip process. This is determined by the setting of seeds when the node first starts up.
  • The name of the cluster it is joining and how the node should be addressed within the cluster. See Node and Cluster Initialization Properties in cassandra.yaml.
  • Any other non-default settings in cassandra.yaml on your existing cluster. Make the same changes on the new node before it is started.

Calculating Tokens For the New Nodes

When you add a node to a cluster, it needs to know its position in the ring. There are a few different approaches for calculating tokens for new nodes:

  • Add capacity by doubling the cluster size. Adding capacity by doubling (or tripling or quadrupling) the number of nodes is operationally less complicated when assigning tokens. Existing nodes can keep their existing token assignments, and new nodes are assigned tokens that bisect (or trisect) the existing token ranges. For example, when you generate tokens for 6 nodes, three of the generated token values will be the same as if you generated for 3 nodes. You just need to determine the token values that are already in use, and assign the newly calculated token values to the newly added nodes.
  • Recalculate new tokens for all nodes and move nodes around the ring. If doubling the cluster size is not feasible, and you need to increase capacity by a non-uniform number of nodes, you will have to recalculate tokens for the entire cluster. Existing nodes will have to have their new tokens assigned using nodetool move. After all nodes have been restarted with their new token assignments, run a nodetool cleanup in order to remove unused keys on all nodes. These operations are resource intensive and should be planned for low-usage times.
  • Add one node at a time and automatically assign a token with auto bootstrap. When a node is started with auto_bootstrap set to true and initial_token left empty, Cassandra will split the token range of the heaviest loaded node and place the new node into the ring at that token position. Note that this approach will probably not result in a perfectly balanced ring, but it will alleviate hot spots.
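
The doubling approach can be illustrated with a short calculation. This is a minimal sketch under stated assumptions: the helper names tokens and bisecting_tokens are hypothetical, and the token space is taken to be 0 to 2**127, as in the token generator program in this section. Each new token is placed at the midpoint of an existing range, including the range that wraps around the end of the ring.

```python
RING_SIZE = 2 ** 127  # size of the token space assumed in this section


def tokens(node_count):
    """Evenly spaced tokens for a ring of node_count nodes."""
    return [RING_SIZE // node_count * x for x in range(node_count)]


def bisecting_tokens(existing_tokens):
    """One new token at the midpoint of every existing token range."""
    ordered = sorted(existing_tokens)
    new_tokens = []
    for index, start in enumerate(ordered):
        end = ordered[(index + 1) % len(ordered)]
        # The last range wraps around the ring, so measure its width modulo
        # the ring size; a single-node ring owns the whole token space.
        width = (end - start) % RING_SIZE or RING_SIZE
        new_tokens.append((start + width // 2) % RING_SIZE)
    return new_tokens


if __name__ == "__main__":
    existing = tokens(3)
    for token in bisecting_tokens(existing):
        print(token)
```

Because of integer rounding, a midpoint can differ by one from the value produced by generating tokens for the doubled ring size directly. A difference of one token is negligible for balancing purposes.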

To calculate tokens for expansion:

  1. Create a new file for the move token generator program. For example:

    vi move_tokens.py

  2. Paste the following Python program into this file:

#! /usr/bin/python
# Prints the token changes needed to grow a ring from <current_size> to
# <new_size> nodes. Runs under both Python 2 and Python 3.
import sys

if len(sys.argv) != 3:
    print("usage: move_tokens.py <current_size> <new_size>")
    sys.exit(1)

CURRENT_NODES = int(sys.argv[1])
GOING_TO_NODES = int(sys.argv[2])

# Size of the token space used for the calculation.
RING_SIZE = 2**127

def tokens(n):
    # Evenly spaced tokens for an n-node ring. Floor division (//) keeps
    # the result an integer on both Python 2 and Python 3.
    return [RING_SIZE // n * x for x in range(n)]

def print_layout(nodes):
    for i, n in enumerate(nodes, 1):
        print("%d:\t%d" % (i, n))

current = tokens(CURRENT_NODES)
go = tokens(GOING_TO_NODES)
print("Existing Tokens: ")
print_layout(current)
print("")
print("")
print("New Tokens: ")
print_layout(go)

print("")
print("")
i = 0          # index into the existing tokens
j = 0          # index into the new tokens
pending = []   # "new node" lines held until the next existing node is placed
while j != len(go) and i != len(current):
    cur = current[i]
    new = go[j]

    if new == cur:
        # The existing node already sits on one of the new tokens.
        for line in pending:
            print(line)
        pending = []
        print("[%d] Old Node %d stays at %d" % (j+1, i+1, cur))
        i += 1
    elif new > cur:
        # The existing node moves forward to the next new token.
        for line in pending:
            print(line)
        pending = []
        print("[%d] Old Node %d moves to %d" % (j+1, i+1, new))
        i += 1
    else:
        # The new token lies before the existing node: a new node goes here.
        pending.append("[%d] New Node added at %d" % (j+1, new))
    j += 1

for line in pending:
    print(line)

# Any new tokens left after all existing nodes are placed need new nodes.
while j != len(go):
    print("[%d] Add new node %d at %d" % (j+1, j+1, go[j]))
    j += 1

  3. Make it executable:

    chmod +x move_tokens.py

  4. Run the program and supply the number of nodes in your current cluster and the number of nodes in your expanded cluster. For example, if expanding from a 6 node cluster to a 9 node cluster (only the final part of the output is shown here):

    ./move_tokens.py 6 9

[1] Old Node 1 stays at 0
[2] New Node added at 18904575940052136859076367079542678414
[3] Old Node 2 moves to 37809151880104273718152734159085356828
[4] Old Node 3 stays at 56713727820156410577229101238628035242
[5] New Node added at 75618303760208547436305468318170713656
[6] Old Node 4 moves to 94522879700260684295381835397713392070
[7] Old Node 5 stays at 113427455640312821154458202477256070484
[8] New Node added at 132332031580364958013534569556798748898
[9] Old Node 6 moves to 151236607520417094872610936636341427312

Based on the above example output, you would first install and initialize the new nodes at token positions 2, 5, and 8. Then you would run nodetool move for the existing nodes that end up at positions 3, 6, and 9. The nodes at positions 1, 4, and 7 keep their original token assignments. After all nodes are up with their new tokens, run nodetool cleanup on each node in the cluster.
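
The grouping described above (which positions are added, moved, or left alone) can also be computed directly. This is a minimal sketch that follows the same comparison logic as the token generator program. The function name plan_expansion and the dictionary keys are hypothetical, and the sketch assumes the new cluster is at least as large as the current one.

```python
RING_SIZE = 2 ** 127  # size of the token space assumed in this section


def tokens(node_count):
    """Evenly spaced tokens for a ring of node_count nodes."""
    return [RING_SIZE // node_count * x for x in range(node_count)]


def plan_expansion(current_size, new_size):
    """Group the new ring positions (1-based) by the action they require."""
    current = tokens(current_size)
    plan = {"stay": [], "move": [], "add": []}
    i = 0  # index of the next existing node that has not been placed yet
    for position, token in enumerate(tokens(new_size), start=1):
        if i < len(current) and token == current[i]:
            plan["stay"].append(position)   # existing node keeps its token
            i += 1
        elif i < len(current) and token > current[i]:
            plan["move"].append(position)   # existing node moves forward
            i += 1
        else:
            plan["add"].append(position)    # position is filled by a new node
    return plan


if __name__ == "__main__":
    print(plan_expansion(6, 9))
```

Working from the grouped positions makes the order of operations explicit: bootstrap the "add" positions first, then move the "move" positions one node at a time, then run cleanup.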

Adding Nodes to a Cluster

  1. Install Cassandra on the new node, but do not start it.
  2. Calculate tokens based on the expansion strategy you are using. If you want a new node to automatically pick a token range during auto bootstrap, you can skip this step.
  3. In the Node and Cluster Configuration (cassandra.yaml) file, set auto_bootstrap to true. Set the other Node and Cluster Initialization Properties accordingly. Set initial_token according to your token calculations (or leave it unset to auto bootstrap the node into the ring by splitting the token range of the heaviest loaded node).
  4. Start Cassandra on the new node. Allow a few minutes between node initializations. You can monitor the startup and data streaming process to its completion using nodetool netstats.
  5. After the new nodes are fully bootstrapped, assign the new tokens to the existing nodes that require them, one node at a time. For each such node, first edit initial_token in the Node and Cluster Configuration (cassandra.yaml) file, then run nodetool move <new_token>.
  6. After all nodes have their new tokens assigned, run nodetool cleanup on each of the existing nodes to remove the keys no longer belonging to those nodes. Wait for cleanup to complete on one node before doing the next. Cleanup may be safely postponed for low-usage hours.
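
The ordering of the last two steps can be made explicit by building the command list up front. This is a minimal sketch, not a supported tool: the function name and the host addresses are hypothetical, and it only constructs the nodetool command lines without running them. Each command is meant to finish before the next one starts.

```python
def expansion_commands(moves, cleanup_hosts):
    """Return nodetool command lines in the order they should be run.

    moves is a list of (host, new_token) pairs for existing nodes that
    need a new token; cleanup_hosts lists the existing nodes to clean up.
    """
    commands = []
    # First move the existing nodes, one node at a time.
    for host, token in moves:
        commands.append(["nodetool", "-h", host, "move", str(token)])
    # Only after every move is done, clean up each existing node in turn.
    for host in cleanup_hosts:
        commands.append(["nodetool", "-h", host, "cleanup"])
    return commands


if __name__ == "__main__":
    # Hypothetical addresses, for illustration only.
    planned = expansion_commands(
        moves=[("10.0.0.2", 37809151880104273718152734159085356828)],
        cleanup_hosts=["10.0.0.1", "10.0.0.2"],
    )
    for command in planned:
        print(" ".join(command))
```

Each entry could be passed to subprocess.check_call so that a failed move stops the sequence before any cleanup begins.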

Changing the Replication Factor

Increasing the replication factor increases the total number of copies of keyspace data stored in a Cassandra cluster.

  1. Update each keyspace in the cluster and change its replication strategy options. For example, to update the number of replicas in the Cassandra CLI when using the SimpleStrategy replica placement strategy:

    [default@unknown] UPDATE KEYSPACE demo
    WITH strategy_options = [{replication_factor:3}];
    

    Or if using NetworkTopologyStrategy:

    [default@unknown] UPDATE KEYSPACE demo
    WITH strategy_options = [{datacenter1:6,datacenter2:6}];
    
  2. On each node in the cluster, run nodetool repair for each keyspace that was updated. Wait until repair completes on a node before moving to the next node.

Replacing a Dead Node

Node failures can be handled by bringing up a replacement node in one of two ways: autobootstrap with a new IP address, or use the same IP address and token as the previous host and then run nodetool repair. Which of these processes to choose depends on your client’s consistency levels and your tolerance for stale or missing data. In discussing these two approaches, we will use the nodes in the following ring as an example:

$ nodetool -h localhost -p 8080 ring
Address        Status   State   Load        Owns    Range                                      Ring
                                                    95315431979199388464207182617231204396
10.194.171.160 Down     Normal  ?           39.98   61078635599166706937511052402724559481     |<--|
10.196.14.48   Up       Normal  3.16 KB     30.01   78197033789183047700859117509977881938     |   |
10.196.14.239  Up       Normal  3.16 KB     30.01   95315431979199388464207182617231204396     |-->|

First Approach: Autobootstrap with same IP

During the autobootstrap process the node will not receive reads until bootstrapping is complete. Further, the ring will go down to two nodes when the dead node is removed:

$ nodetool -h localhost -p 8080 ring
Address       Status   State    Load       Owns    Range                                      Ring
                                                   95315431979199388464207182617231204396
10.196.14.48  Up       Normal  3.16 KB     30.01   78197033789183047700859117509977881938     |<--|
10.196.14.239 Up       Normal  3.16 KB     30.01   95315431979199388464207182617231204396     |-->|

Once the node has completed bootstrapping, the ring will again show three nodes:

$ nodetool -h localhost -p 8080 ring
Address        Status   State   Load        Owns    Range                                      Ring
                                                    95315431979199388464207182617231204396
10.196.14.48   Up       Normal  3.16 KB     30.01   78197033789183047700859117509977881938     |<--|
10.194.171.160 Up       Normal  495 bytes   01.75   86756232884191218082533150063604543167     |   |
10.196.14.239  Up       Normal  3.16 KB     30.01   95315431979199388464207182617231204396     |-->|

This is the safest approach if the client read consistency level is frequently ONE. The downside of this approach is that we have relied on the cluster to auto-balance the load and our replacement node now has a different token.

Second Approach: New Node With Manual Token Selection

If most of your reads use QUORUM, if you can tolerate empty results being returned, or if you want to maintain the token assignments on the ring, this is the preferred approach. The following ring will be used in this example (the same starting state as in the first example):

$ nodetool -h localhost -p 8080 ring
Address        Status   State   Load        Owns    Range                                      Ring
                                                    95315431979199388464207182617231204396
10.194.171.160 Down     Normal  ?           39.98   61078635599166706937511052402724559481     |<--|
10.196.14.48   Up       Normal  3.16 KB     30.01   78197033789183047700859117509977881938     |   |
10.196.14.239  Up       Normal  3.16 KB     30.01   95315431979199388464207182617231204396     |-->|

First, pick a token for your new node that is one less than the token of the dead node, and start the new node. Given the above ring, our value for initial_token will be:

86756232884191218082533150063604543166
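
Tokens are too large for ordinary 64-bit or floating-point arithmetic, so subtracting one by hand or in a spreadsheet is error-prone. The following minimal sketch does the subtraction with Python's arbitrary-precision integers. The function name replacement_token is hypothetical.

```python
def replacement_token(dead_node_token):
    """Token one position before the dead node's token, as a decimal string."""
    # Python integers are arbitrary precision, so the 127-bit value stays exact.
    return str(int(dead_node_token) - 1)


if __name__ == "__main__":
    print(replacement_token("86756232884191218082533150063604543167"))
```

The result is returned as a string so it can be written to initial_token without any reformatting.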

Once autobootstrap is complete, the ring should now have four nodes, three active and one down:

$ nodetool -h localhost -p 8080 ring
Address        Status   State   Load        Owns    Range                                      Ring
                                                    95315431979199388464207182617231204396
10.196.14.48   Up       Normal  3.16 KB     30.01   78197033789183047700859117509977881938     |<--|
10.203.30.139  Up       Normal  495 bytes   01.75   86756232884191218082533150063604543166     |   |
10.194.171.160 Down     Normal  ?           01.75   86756232884191218082533150063604543167     |   |
10.196.14.239  Up       Normal  3.16 KB     30.01   95315431979199388464207182617231204396     |-->|

Remove the dead node from the ring with nodetool removetoken using the down node’s token:

$ nodetool -h 10.196.14.239 -p 8080 removetoken 86756232884191218082533150063604543167

Verify the integrity of the ring:

$ nodetool -h 10.196.14.239 -p 8080 ring
Address       Status   State   Load       Owns    Range                                      Ring
                                                  95315431979199388464207182617231204396
10.196.14.48  Up       Normal  3.16 KB    30.01   78197033789183047700859117509977881938     |<--|
10.203.30.139 Up       Normal  495 bytes  01.75   86756232884191218082533150063604543166     |   |
10.196.14.239 Up       Normal  3.16 KB    30.01   95315431979199388464207182617231204396     |-->|
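
Checking the ring by eye works for three nodes but becomes tedious on larger clusters. The following minimal sketch scans captured nodetool ring output for nodes whose status is Down. It assumes the column layout shown in the listings in this section (address first, status second); the function name down_nodes is hypothetical, and other Cassandra versions may format the output differently.

```python
def down_nodes(ring_output):
    """Return the addresses of all nodes reported as Down."""
    down = []
    for line in ring_output.splitlines():
        fields = line.split()
        # Node rows have an address followed by a status of Up or Down;
        # the header row and the token-only row are skipped.
        if len(fields) >= 3 and fields[1] in ("Up", "Down"):
            if fields[1] == "Down":
                down.append(fields[0])
    return down


SAMPLE = """\
Address        Status   State   Load        Owns    Range                                      Ring
                                                    95315431979199388464207182617231204396
10.196.14.48   Up       Normal  3.16 KB     30.01   78197033789183047700859117509977881938     |<--|
10.203.30.139  Up       Normal  495 bytes   01.75   86756232884191218082533150063604543166     |   |
10.194.171.160 Down     Normal  ?           01.75   86756232884191218082533150063604543167     |   |
10.196.14.239  Up       Normal  3.16 KB     30.01   95315431979199388464207182617231204396     |-->|
"""

if __name__ == "__main__":
    print(down_nodes(SAMPLE))
```

An empty result after removetoken indicates that no dead node is left in the ring.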

Finally, run nodetool repair for each keyspace against the next node on the ring:

$ nodetool -h 10.196.14.239 -p 8080 repair Keyspace1