Before you can start DataStax Enterprise (DSE), whether on a single-node or multi-node cluster, there are a few Cassandra configuration properties you must set on each node in the cluster. These are set in the cassandra.yaml file (located in /etc/dse/cassandra in packaged installations or $DSE_HOME/resources/cassandra/conf in binary distributions).
Note
These instructions apply only to single data center clusters. For information about configuring clusters with multiple data centers, see Initializing a Multi-Node or Multi-Data Center Cluster.
In DataStax Enterprise, the term data center refers to a grouping of nodes. You should configure these data centers by type of node: Cassandra and Analytics.
Before you start a multi-node DSE cluster, you must determine the following:
A name for your cluster.
How many total nodes your DSE cluster will have.
The internal IP addresses of each node.
The token for each node (see Generating Tokens).
If you are deploying a mixed-workload DSE Cluster, make sure to alternate token assignments between Cassandra nodes and Analytics nodes so that replicas are evenly balanced.
Which nodes will serve as the seed nodes.
If you are configuring a mixed-workload cluster, you should have at least one seed node for each side (the Cassandra real-time side and the Hadoop Analytics side).
If you intend to run a mixed-workload cluster, determine which nodes will serve which purpose.
If you have firewalls enabled on the machines that you plan to use for your cluster, make sure that nodes within a cluster can reach each other. See Configuring Firewall Port Access.
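For example, suppose you are starting a 6-node mixed-workload cluster with 3 Analytics nodes and 3 Cassandra nodes. The nodes have the following IPs:
node0 110.82.155.0 (Cassandra seed)
node1 110.82.155.1 (Cassandra)
node2 110.82.155.2 (Cassandra)
node3 110.82.155.3 (Analytics seed)
node4 110.82.155.4 (Analytics)
node5 110.82.155.5 (Analytics)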
The cassandra.yaml file for each node would have the following modified property settings. Note that in a mixed-workload cluster, the token placement alternates between Cassandra and Analytics nodes, which ensures even distribution of replicas on both sides of the cluster.
Also note that in the seeds list, the seed node for the Analytics side of the cluster is listed first.
Node0
cluster_name: 'DSECluster'
initial_token: 0
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "110.82.155.3,110.82.155.0"
listen_address: 110.82.155.0
rpc_address: 0.0.0.0
Node1
cluster_name: 'DSECluster'
initial_token: 56713727820156410577229101238628035242
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "110.82.155.3,110.82.155.0"
listen_address: 110.82.155.1
rpc_address: 0.0.0.0
Node2
cluster_name: 'DSECluster'
initial_token: 113427455640312821154458202477256070485
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "110.82.155.3,110.82.155.0"
listen_address: 110.82.155.2
rpc_address: 0.0.0.0
Node3
cluster_name: 'DSECluster'
initial_token: 28356863910078205288614550619314017621
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "110.82.155.3,110.82.155.0"
listen_address: 110.82.155.3
rpc_address: 0.0.0.0
Node4
cluster_name: 'DSECluster'
initial_token: 85070591730234615865843651857942052864
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "110.82.155.3,110.82.155.0"
listen_address: 110.82.155.4
rpc_address: 0.0.0.0
Node5
cluster_name: 'DSECluster'
initial_token: 141784319550391026443072753096570088106
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "110.82.155.3,110.82.155.0"
listen_address: 110.82.155.5
rpc_address: 0.0.0.0
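The six configurations above differ only in initial_token and listen_address. As a minimal sketch (not part of DSE), the following Python snippet renders the same settings from a list of address and token pairs copied from the example. The names NODES, TEMPLATE, and render_settings are hypothetical, and the output is meant to be pasted into cassandra.yaml by hand rather than written to it automatically.

```python
# Hypothetical helper: renders the per-node settings from the example above.
# Seed list copied from the example: Analytics seed first, then Cassandra seed.
SEEDS = "110.82.155.3,110.82.155.0"

# (listen_address, initial_token) pairs copied from Node0 through Node5.
NODES = [
    ("110.82.155.0", 0),
    ("110.82.155.1", 56713727820156410577229101238628035242),
    ("110.82.155.2", 113427455640312821154458202477256070485),
    ("110.82.155.3", 28356863910078205288614550619314017621),
    ("110.82.155.4", 85070591730234615865843651857942052864),
    ("110.82.155.5", 141784319550391026443072753096570088106),
]

TEMPLATE = """cluster_name: '{cluster}'
initial_token: {token}
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "{seeds}"
listen_address: {address}
rpc_address: 0.0.0.0
"""


def render_settings(address, token, cluster="DSECluster", seeds=SEEDS):
    """Return the modified cassandra.yaml properties for one node as text."""
    return TEMPLATE.format(cluster=cluster, token=token,
                           seeds=seeds, address=address)


if __name__ == "__main__":
    for index, (address, token) in enumerate(NODES):
        print("# Node%d" % index)
        print(render_settings(address, token))
```

Every node receives the same cluster_name and seeds values; only the two per-node values change.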
Tokens assign a range of data to a particular node. Assuming you are using the RandomPartitioner, spacing the tokens evenly across the token range (0 to 2**127), as the following procedure does, ensures even data distribution.
Create a new file for your token generator program:
vi tokengentool
Paste the following Python program into this file:
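#!/usr/bin/env python3
# Print evenly spaced RandomPartitioner tokens, one per node.
import sys

if len(sys.argv) > 1:
    # Node count given as the first command-line argument
    num = int(sys.argv[1])
else:
    num = int(input("How many nodes are in your cluster? "))

for i in range(num):
    # Integer division keeps each token an exact whole number
    print('node %d: %d' % (i, i * (2**127) // num))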
Save and close the file and make it executable:
chmod +x tokengentool
Run the script:
./tokengentool
When prompted, enter the total number of nodes in your cluster:
How many nodes are in your cluster? 6
node 0: 0
node 1: 28356863910078205288614550619314017621
node 2: 56713727820156410577229101238628035242
node 3: 85070591730234615865843651857942052864
node 4: 113427455640312821154458202477256070485
node 5: 141784319550391026443072753096570088106
On each node, edit the cassandra.yaml file and enter its corresponding token value in the initial_token property.
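For a mixed-workload cluster, the generated tokens must be handed out so that Cassandra and Analytics nodes alternate around the ring. The sketch below shows one way to compute that assignment. It assumes, as the seed list and token order in the earlier example suggest, that 110.82.155.0 through 110.82.155.2 are the Cassandra nodes and 110.82.155.3 through 110.82.155.5 are the Analytics nodes, and that both sides have the same number of nodes. The function names are hypothetical.

```python
def generate_tokens(num_nodes):
    """Evenly spaced RandomPartitioner tokens (same formula as tokengentool)."""
    return [i * (2 ** 127) // num_nodes for i in range(num_nodes)]


def alternate_tokens(cassandra_nodes, analytics_nodes):
    """Map each node to a token so the two node types alternate on the ring.

    Assumption: both lists have the same length.
    """
    if len(cassandra_nodes) != len(analytics_nodes):
        raise ValueError("this sketch expects equal numbers of each node type")
    tokens = generate_tokens(len(cassandra_nodes) + len(analytics_nodes))
    assignment = {}
    pairs = zip(cassandra_nodes, analytics_nodes)
    for pair_index, (cassandra, analytics) in enumerate(pairs):
        # Even ring positions go to Cassandra nodes, odd ones to Analytics.
        assignment[cassandra] = tokens[2 * pair_index]
        assignment[analytics] = tokens[2 * pair_index + 1]
    return assignment


if __name__ == "__main__":
    cassandra = ["110.82.155.0", "110.82.155.1", "110.82.155.2"]
    analytics = ["110.82.155.3", "110.82.155.4", "110.82.155.5"]
    for node, token in sorted(alternate_tokens(cassandra, analytics).items()):
        print("%s initial_token: %d" % (node, token))
```

With these inputs, the assignment matches the initial_token values shown for Node0 through Node5 in the example configuration.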
After you have installed and configured DSE on one or more nodes, you are ready to start your cluster, beginning with the seed nodes. In a mixed-workload DSE cluster, you must start the Analytics seed node first.
Packaged installations include startup scripts for running DSE as a service. Binary packages do not.
If you have a firewall running on the nodes in your Cassandra or DataStax Enterprise cluster, you must open up the following ports to allow communication between the nodes, including certain Cassandra ports. If this isn’t done, when you start Cassandra (or Hadoop in DataStax Enterprise) on a node, the node will act as a standalone database server rather than joining the database cluster.
| Port | Rule Type | Description |
|---|---|---|
| 7000 | Custom TCP Rule | Cassandra inter-node port (source is the current security group) |
| 9160 | Custom TCP Rule | Cassandra client port |
| 8012 | Custom TCP Rule | Hadoop Job Tracker client port |
| 9290 | Custom TCP Rule | Hadoop Job Tracker Thrift port (source is the current security group) |
| 10000 | Custom TCP Rule | Hive Thrift Server port (for JDBC Hive access) |
| 50030 | Custom TCP Rule | Hadoop Job Tracker website port |
| 50060 | Custom TCP Rule | Hadoop Task Tracker website port |
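To confirm from one node that these ports are reachable on another, a plain TCP connection attempt is often enough. The following is a minimal sketch using only the Python standard library. The port list is copied from the table above, while the names DSE_PORTS, port_is_open, and check_node are hypothetical. A port only reports as open if the corresponding service is already running on the target node, so a False result can mean either a blocked port or a stopped service.

```python
import socket

# Ports and descriptions copied from the table above.
DSE_PORTS = {
    7000: "Cassandra inter-node port",
    9160: "Cassandra client port",
    8012: "Hadoop Job Tracker client port",
    9290: "Hadoop Job Tracker Thrift port",
    10000: "Hive Thrift Server port",
    50030: "Hadoop Job Tracker website port",
    50060: "Hadoop Task Tracker website port",
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_node(host, ports=DSE_PORTS):
    """Return a dict mapping each port to True (reachable) or False."""
    return {port: port_is_open(host, port) for port in sorted(ports)}


if __name__ == "__main__":
    # Example target taken from the sample cluster; replace with a real node.
    for port, is_open in check_node("110.82.155.0").items():
        status = "open" if is_open else "closed or filtered"
        print("%5d  %-35s %s" % (port, DSE_PORTS[port], status))
```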
If you are running a mixed-workload cluster, determine which nodes to start as Cassandra nodes and which to start as Analytics nodes. Start the seed nodes first (the Analytics seed node, followed by the Cassandra seed node), then start the remaining nodes in the cluster one at a time.
On an Analytics node:
dse cassandra -t
On a Cassandra node:
dse cassandra
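The start order described above can be sketched as a small planning function: Analytics seed nodes first, Cassandra seed nodes next, then everything else, each paired with the command for its node type. The dictionary layout and the name start_order are hypothetical, and the node roles in the usage example are inferred from the sample configuration rather than stated by DSE. The function only produces a plan and does not run any command.

```python
def start_order(nodes):
    """Return (address, command) pairs in the recommended start order.

    Each node is a dict with keys 'address', 'role' ('analytics' or
    'cassandra'), and 'seed' (bool). This layout is an assumption made
    for the sketch.
    """
    def rank(node):
        if node["seed"]:
            # Analytics seeds start before Cassandra seeds.
            return 0 if node["role"] == "analytics" else 1
        return 2  # non-seed nodes start last, in their listed order

    commands = {"analytics": "dse cassandra -t", "cassandra": "dse cassandra"}
    ordered = sorted(nodes, key=rank)  # sorted() is stable
    return [(node["address"], commands[node["role"]]) for node in ordered]


if __name__ == "__main__":
    cluster = [
        {"address": "110.82.155.0", "role": "cassandra", "seed": True},
        {"address": "110.82.155.1", "role": "cassandra", "seed": False},
        {"address": "110.82.155.2", "role": "cassandra", "seed": False},
        {"address": "110.82.155.3", "role": "analytics", "seed": True},
        {"address": "110.82.155.4", "role": "analytics", "seed": False},
        {"address": "110.82.155.5", "role": "analytics", "seed": False},
    ]
    for address, command in start_order(cluster):
        print("%s: %s" % (address, command))
```

Starting the nodes one at a time, and waiting for each to come up before moving on, remains a manual step.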
Packaged installations provide startup scripts in /etc/init.d for starting DSE as a service. Before starting DSE as a service on an Analytics node, you must first configure the service to start the Hadoop Job Tracker and Task Tracker services as well.
Note
For mixed-workload clusters, nodes that are Cassandra-only can simply start the DSE service (skip step 1).
1. Create the file /etc/default/dse, and add the following line as the contents of this file:
HADOOP_ENABLED=1
2. Start the DSE service:
sudo service dse start
Note
On Enterprise Linux systems, the DSE service runs as a Java process. On Debian systems, the DSE service runs as a jsvc process.