To initialize a Cassandra cluster (whether a single-node, multi-node, or multi-data center cluster), first configure the Node and Cluster Initialization Properties in each node’s cassandra.yaml configuration file, and then start the nodes one at a time, beginning with the seed node(s).
For more guidance on choosing the right configuration properties for your needs, see Choosing Node Configuration Options.
Cassandra is intended to be run on multiple nodes; however, you may want to start with a single-node cluster for evaluation purposes. To start Cassandra on a single node:
Set the following required properties in the cassandra.yaml file:
cluster_name: 'MyClusterName'
initial_token: 0
(optional) The following properties are already correctly configured for a single node instance of Cassandra. However, if you plan on expanding to more nodes after your single-node evaluation, setting these correctly the first time you start the node is recommended.
seeds: <IP of node>
listen_address: <IP of node>
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch | PropertyFileSnitch
Start Cassandra on the node. If you installed using the RPM or Debian packages, you can start the service as follows (as root):
# service cassandra start
If you installed using the binary tarball, you can start Cassandra as a stand-alone process as follows:
$ cd $CASSANDRA_HOME
$ sh bin/cassandra -f
To correctly configure a multi-node or multi-data center cluster you must determine the following information:
This information will be used to configure the Node and Cluster Initialization Properties in the cassandra.yaml configuration file on each node in the cluster. Each node should be correctly configured before starting up the cluster, one node at a time (starting with the seed nodes).
For example, suppose you are configuring a 6 node cluster spanning 2 racks in a single data center. The nodes have the following IPs, and one node per rack will serve as a seed:
The cassandra.yaml files for each node would then have the following modified property settings.
cluster_name: 'MyDemoCluster'
initial_token: 0
seed_provider:
    - seeds: "184.108.40.206,220.127.116.11"
listen_address: 18.104.22.168
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch

cluster_name: 'MyDemoCluster'
initial_token: 28356863910078205288614550619314017621
seed_provider:
    - seeds: "22.214.171.124,126.96.36.199"
listen_address: 188.8.131.52
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch

cluster_name: 'MyDemoCluster'
initial_token: 56713727820156410577229101238628035242
seed_provider:
    - seeds: "184.108.40.206,220.127.116.11"
listen_address: 18.104.22.168
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch

cluster_name: 'MyDemoCluster'
initial_token: 85070591730234615865843651857942052864
seed_provider:
    - seeds: "22.214.171.124,126.96.36.199"
listen_address: 188.8.131.52
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch

cluster_name: 'MyDemoCluster'
initial_token: 113427455640312821154458202477256070485
seed_provider:
    - seeds: "184.108.40.206,220.127.116.11"
listen_address: 18.104.22.168
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch

cluster_name: 'MyDemoCluster'
initial_token: 141784319550391026443072753096570088106
seed_provider:
    - seeds: "22.214.171.124,126.96.36.199"
listen_address: 188.8.131.52
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch
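Because the per-node settings differ only in initial_token and listen_address, you may prefer to generate them rather than type them by hand. The following is a minimal sketch, not part of Cassandra: the function name render_node_settings and the 192.0.2.x addresses (a range reserved for documentation) are assumptions for illustration, and the output mirrors the abbreviated property listing above rather than a complete cassandra.yaml file.

```python
def render_node_settings(cluster_name, addresses, seeds,
                         snitch="RackInferringSnitch"):
    """Return one abbreviated cassandra.yaml fragment per node.

    Hypothetical helper: tokens are spaced evenly over the 0..2**127
    RandomPartitioner range, in the order the addresses are given.
    """
    count = len(addresses)
    fragments = []
    for index, address in enumerate(addresses):
        token = index * (2 ** 127) // count
        lines = [
            "cluster_name: '%s'" % cluster_name,
            "initial_token: %d" % token,
            "seed_provider:",
            '    - seeds: "%s"' % ",".join(seeds),
            "listen_address: %s" % address,
            "rpc_address: 0.0.0.0",
            "endpoint_snitch: %s" % snitch,
        ]
        fragments.append("\n".join(lines))
    return fragments


# Example with placeholder addresses (assumed, not from a real cluster).
nodes = ["192.0.2.%d" % n for n in range(1, 7)]
for fragment in render_node_settings("MyDemoCluster", nodes,
                                     seeds=["192.0.2.1", "192.0.2.4"]):
    print(fragment)
    print()
```

Each printed fragment still has to be merged into that node's own cassandra.yaml file; the sketch only prints text and does not write any files.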
Tokens are used to assign a range of data to a particular node. Assuming you are using the RandomPartitioner (the default partitioner), the approaches described in this section will ensure even data distribution.
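To make the idea of a token range concrete, the sketch below looks up which node is responsible for a given ring position. It assumes the usual ring convention that a node with token T is responsible for the range from the previous node's token (exclusive) up to T (inclusive), wrapping around at the end of the ring. The names key_token and owner are hypothetical, and the MD5-based key_token is a simplified stand-in that may not reproduce the exact value Cassandra computes for a key.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # upper bound of the RandomPartitioner token range


def key_token(key):
    """Map a row key to a ring position (simplified, for illustration only)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def owner(node_tokens, token):
    """Return the token of the node whose range contains `token`."""
    ordered = sorted(node_tokens)
    # First node token that is >= the requested position ...
    index = bisect.bisect_left(ordered, token)
    # ... or wrap around to the lowest token if none is.
    return ordered[index % len(ordered)]


tokens = [i * RING_SIZE // 6 for i in range(6)]
print(owner(tokens, key_token("some-row-key")))
```

With evenly spaced tokens, each of the six nodes is responsible for one sixth of the ring, which is what produces the even data distribution.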
If you have multiple racks in a single data center or a multiple data center cluster, you can use the same formula for calculating the tokens. However, you should assign the tokens to nodes in alternating racks. For example: rack1, rack2, rack3, rack1, rack2, rack3, and so on. Be sure to have the same number of nodes in each rack.
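The rack alternation described above can be sketched as follows. This is an illustration only: assign_tokens_by_rack and the node names are hypothetical, and the function simply pairs evenly spaced tokens with nodes taken from each rack in rotation.

```python
def assign_tokens_by_rack(racks):
    """Pair evenly spaced tokens with nodes taken from racks in rotation.

    `racks` is a list of lists of node names, one inner list per rack.
    """
    sizes = {len(nodes) for nodes in racks}
    if len(sizes) != 1:
        raise ValueError("every rack must have the same number of nodes")
    # zip(*racks) yields (rack1[0], rack2[0], ...), then (rack1[1], rack2[1], ...)
    ring_order = [node for group in zip(*racks) for node in group]
    total = len(ring_order)
    return [(node, i * (2 ** 127) // total)
            for i, node in enumerate(ring_order)]


for node, token in assign_tokens_by_rack([["r1-a", "r1-b", "r1-c"],
                                          ["r2-a", "r2-b", "r2-c"]]):
    print(node, token)
```

Neighbouring positions on the ring then always belong to different racks.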
Create a new file named tokengentool for your token generator program.
Paste the following Python program into this file:
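#!/usr/bin/env python3
# Print evenly spaced RandomPartitioner tokens for a cluster of `num` nodes.
import sys

if len(sys.argv) > 1:
    num = int(sys.argv[1])  # node count given on the command line
else:
    num = int(input("How many nodes are in your cluster? "))

for i in range(num):
    # Integer division keeps each token an exact integer in the 0..2**127 range.
    print('token %d: %d' % (i, i * (2**127) // num))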
Save and close the file and make it executable:
chmod +x tokengentool
Run the script:
./tokengentool
When prompted, enter the total number of nodes in your cluster:
How many nodes are in your cluster? 6
token 0: 0
token 1: 28356863910078205288614550619314017621
token 2: 56713727820156410577229101238628035242
token 3: 85070591730234615865843651857942052864
token 4: 113427455640312821154458202477256070485
token 5: 141784319550391026443072753096570088106
On each node, edit the cassandra.yaml file and enter its corresponding token value in the initial_token property.
In multi-data center deployments, replica placement is calculated per data center using the NetworkTopologyStrategy replica placement strategy. In each data center (or replication group), the first replica for a particular row is determined by the token value assigned to a node. Additional replicas in the same data center are placed by walking the ring clockwise until the first node in another rack is reached.
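The walk around the ring can be illustrated with a simplified model. The sketch below is not Cassandra's implementation: place_replicas, the tuple layout of the ring, and the fallback to nodes in an already used rack when a data center has fewer racks than replicas are all assumptions made for illustration.

```python
def place_replicas(ring, row_token, data_center, replica_count):
    """Pick replica nodes for one data center by walking the ring clockwise.

    `ring` is a list of (token, data_center, rack, node_name) tuples.
    """
    ordered = sorted(ring)
    # Start at the first node whose token is >= row_token, wrapping to index 0.
    start = next((i for i, entry in enumerate(ordered)
                  if entry[0] >= row_token), 0)
    walk = ordered[start:] + ordered[:start]
    local = [entry for entry in walk if entry[1] == data_center]

    replicas, racks_used, skipped = [], set(), []
    for _token, _dc, rack, node in local:
        if len(replicas) == replica_count:
            break
        if rack in racks_used:
            skipped.append(node)  # same rack as an earlier replica
        else:
            replicas.append(node)
            racks_used.add(rack)
    # Assumption: with fewer racks than replicas, reuse the skipped nodes.
    for node in skipped:
        if len(replicas) == replica_count:
            break
        replicas.append(node)
    return replicas


ring = [(0, "DC1", "r1", "n0"), (10, "DC1", "r1", "n1"),
        (20, "DC1", "r2", "n2"), (30, "DC2", "r1", "m0"),
        (40, "DC1", "r2", "n3")]
print(place_replicas(ring, 5, "DC1", 2))
```

In this toy ring, a row with token 5 in DC1 gets its first replica on n1 and its second on n2, the next DC1 node clockwise that sits in a different rack.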
If you do not calculate partitioner tokens so that the data ranges are evenly distributed for each data center, you could end up with uneven data distribution within a data center. The goal is to ensure that the nodes for each data center are evenly dispersed around the ring, or to calculate tokens for each replication group individually (without conflicting token assignments).
One way to avoid uneven distribution is to calculate tokens for all nodes in the cluster, and then alternate the token assignments so that the nodes for each data center are evenly dispersed around the ring.
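A minimal sketch of this alternating assignment follows. The function name alternate_tokens and the node names are hypothetical, and the sketch assumes every data center has the same number of nodes.

```python
def alternate_tokens(dc_nodes):
    """Assign cluster-wide tokens to data centers in rotation.

    `dc_nodes` maps a data center name to its list of node names.
    Returns a dict of node name -> (data center, token).
    """
    names = sorted(dc_nodes)
    per_dc = len(dc_nodes[names[0]])
    if any(len(dc_nodes[name]) != per_dc for name in names):
        raise ValueError("this sketch assumes equally sized data centers")
    total = per_dc * len(names)
    # Tokens for all nodes in the cluster, evenly spaced around the ring.
    tokens = [i * (2 ** 127) // total for i in range(total)]
    assignment = {}
    for position, token in enumerate(tokens):
        dc = names[position % len(names)]            # DC1, DC2, DC1, DC2, ...
        node = dc_nodes[dc][position // len(names)]
        assignment[node] = (dc, token)
    return assignment


layout = {"DC1": ["dc1-n0", "dc1-n1", "dc1-n2"],
          "DC2": ["dc2-n0", "dc2-n1", "dc2-n2"]}
for node, (dc, token) in sorted(alternate_tokens(layout).items()):
    print(node, dc, token)
```

Within each data center, the resulting tokens are still evenly spaced, because every data center receives every Nth token of the cluster-wide list.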
Another way to assign tokens in a multi data center cluster is to generate tokens for the nodes in one data center, and then offset those token numbers by 1 for all nodes in the next data center, by 2 for the nodes in the next data center, and so on. This approach is good if you are adding a data center to an established cluster, or if your data centers do not have the same number of nodes.
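The offset approach can be sketched as follows. The function name offset_tokens and the data center names are hypothetical; each data center gets its own evenly spaced tokens, shifted by 0 for the first data center, 1 for the second, and so on.

```python
def offset_tokens(dc_sizes):
    """Generate per-data-center tokens, offset to avoid conflicts.

    `dc_sizes` is a list of (data_center_name, node_count) pairs, in the
    order in which the data centers should receive offsets 0, 1, 2, ...
    """
    result = {}
    for offset, (name, count) in enumerate(dc_sizes):
        result[name] = [i * (2 ** 127) // count + offset
                        for i in range(count)]
    return result


tokens = offset_tokens([("DC1", 3), ("DC2", 2)])
for name in ("DC1", "DC2"):
    print(name, tokens[name])

# Tokens must be unique across the whole cluster, so check before using them.
combined = [t for values in tokens.values() for t in values]
assert len(combined) == len(set(combined))
```

Because each data center is calculated separately, the node counts do not have to match, which is why this approach suits adding a data center to an established cluster.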
After you have installed and configured Cassandra on all nodes, you are ready to start your cluster. On initial start-up, each node must be started one at a time, starting with your seed nodes.
Packaged installations include startup scripts for running Cassandra as a service. Binary packages do not.
You can start the Cassandra Java server process as follows:
$ cd $CASSANDRA_HOME
$ sh bin/cassandra -f
To stop the Cassandra process, find the Cassandra Java process ID (PID), and then kill -9 the process using its PID number. For example:
$ ps ax | grep java
$ kill -9 1539
Packaged installations provide startup scripts in /etc/init.d for starting Cassandra as a service. The service runs as the cassandra user. You must have root or sudo permissions to start or stop services.
To start the Cassandra service (as root):
# service cassandra start
To stop the Cassandra service (as root):
# service cassandra stop
On Enterprise Linux systems, the Cassandra service runs as a java process. On Debian systems, the Cassandra service runs as a jsvc process.