DataStax Enterprise 1.0 Documentation

Configuring and Initializing a DataStax Enterprise Cluster

This document corresponds to an earlier product version. Make sure you are using the documentation that corresponds to your version.

Before you can start DataStax Enterprise (DSE), whether on a single-node or multi-node cluster, you must set a few Cassandra configuration properties on each node in the cluster. These are set in the cassandra.yaml file (located in /etc/dse/cassandra in packaged installations or $DSE_HOME/resources/cassandra/conf in binary distributions).

Note

These instructions apply only to single data center clusters. For information about configuring clusters with multiple data centers, see Initializing a Multi-Node or Multi-Data Center Cluster.

In DataStax Enterprise, the term data center refers to a grouping of nodes. You should group nodes into data centers by node type: Cassandra and Analytics.

Initializing a Multi-Node DataStax Enterprise Cluster

Before you start a multi-node DSE cluster you must determine the following:

  • A name for your cluster.

  • How many total nodes your DSE cluster will have.

  • The internal IP addresses of each node.

  • The token for each node (see Generating Tokens).

    If you are deploying a mixed-workload DSE Cluster, make sure to alternate token assignments between Cassandra nodes and Analytics nodes so that replicas are evenly balanced.

  • Which nodes will serve as the seed nodes.

    If you are configuring a mixed-workload cluster, you should have at least one seed node for each side (the Cassandra real-time side and the Hadoop Analytics side).

  • If you intend to run a mixed-workload cluster, determine which nodes will serve which purpose.

  • If you have firewalls enabled on the machines that you plan to use for your cluster, make sure that nodes within the cluster can reach each other. See Configuring Firewall Port Access.

For example, suppose you are starting a 6-node mixed-workload cluster with 3 Analytics nodes and 3 Cassandra nodes. The nodes have the following IPs:

  • node0 (Cassandra seed) 110.82.155.0
  • node1 (Cassandra) 110.82.155.1
  • node2 (Cassandra) 110.82.155.2
  • node3 (Analytics seed) 110.82.155.3
  • node4 (Analytics) 110.82.155.4
  • node5 (Analytics) 110.82.155.5

The cassandra.yaml file for each node would have the following modified property settings. Note that in a mixed-workload cluster, the token placement alternates between Cassandra and Analytics nodes. This ensures even distribution of replicas on both sides of the cluster. For example:

  • node 0: 0
  • node 3: 28356863910078205288614550619314017621
  • node 1: 56713727820156410577229101238628035242
  • node 4: 85070591730234615865843651857942052864
  • node 2: 113427455640312821154458202477256070485
  • node 5: 141784319550391026443072753096570088106

Also note that in the seeds list, the seed node for the Analytics side of the cluster is listed first.
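
The alternating token placement listed above can be reproduced with a short script. The following is a minimal sketch and not part of DSE: the function name alternate_tokens and the node labels are assumptions made for illustration, and it assumes an equal number of Cassandra and Analytics nodes, as in this example. It spaces the tokens evenly over the RandomPartitioner range and hands them out to the two node types in turn.

```python
def alternate_tokens(cassandra_nodes, analytics_nodes):
    """Return (node, token) pairs in ring order, alternating workloads.

    Assumes both lists have the same length, as in the 6-node example.
    """
    if len(cassandra_nodes) != len(analytics_nodes):
        raise ValueError("this sketch assumes equal numbers of each node type")
    total = len(cassandra_nodes) + len(analytics_nodes)
    # Interleave: first Cassandra node, first Analytics node, second Cassandra node, ...
    ring_order = [n for pair in zip(cassandra_nodes, analytics_nodes) for n in pair]
    # Evenly spaced tokens over 0..2**127; integer division keeps exact values.
    return [(node, i * (2**127) // total) for i, node in enumerate(ring_order)]


if __name__ == "__main__":
    pairs = alternate_tokens(["node0", "node1", "node2"],
                             ["node3", "node4", "node5"])
    for node, token in pairs:
        print("%s: %d" % (node, token))
```

Run with the six example nodes, this prints the nodes in the order node0, node3, node1, node4, node2, node5, with the same token values as the list above.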

Node0

cluster_name: 'DSECluster'
initial_token: 0
seed_provider:
   - class_name: org.apache.cassandra.locator.SimpleSeedProvider
     parameters:
         - seeds: "110.82.155.3,110.82.155.0"
listen_address: 110.82.155.0
rpc_address: 0.0.0.0

Node1

cluster_name: 'DSECluster'
initial_token: 56713727820156410577229101238628035242
seed_provider:
   - class_name: org.apache.cassandra.locator.SimpleSeedProvider
     parameters:
         - seeds: "110.82.155.3,110.82.155.0"
listen_address: 110.82.155.1
rpc_address: 0.0.0.0

Node2

cluster_name: 'DSECluster'
initial_token: 113427455640312821154458202477256070485
seed_provider:
   - class_name: org.apache.cassandra.locator.SimpleSeedProvider
     parameters:
         - seeds: "110.82.155.3,110.82.155.0"
listen_address: 110.82.155.2
rpc_address: 0.0.0.0

Node3

cluster_name: 'DSECluster'
initial_token: 28356863910078205288614550619314017621
seed_provider:
   - class_name: org.apache.cassandra.locator.SimpleSeedProvider
     parameters:
         - seeds: "110.82.155.3,110.82.155.0"
listen_address: 110.82.155.3
rpc_address: 0.0.0.0

Node4

cluster_name: 'DSECluster'
initial_token: 85070591730234615865843651857942052864
seed_provider:
   - class_name: org.apache.cassandra.locator.SimpleSeedProvider
     parameters:
         - seeds: "110.82.155.3,110.82.155.0"
listen_address: 110.82.155.4
rpc_address: 0.0.0.0

Node5

cluster_name: 'DSECluster'
initial_token: 141784319550391026443072753096570088106
seed_provider:
   - class_name: org.apache.cassandra.locator.SimpleSeedProvider
     parameters:
         - seeds: "110.82.155.3,110.82.155.0"
listen_address: 110.82.155.5
rpc_address: 0.0.0.0
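
Because the six files differ only in initial_token and listen_address, the settings can also be generated rather than typed by hand. The following is a minimal sketch under that assumption. The names TEMPLATE and render_settings are hypothetical, and the output is only the set of modified properties, not a complete cassandra.yaml file.

```python
TEMPLATE = """cluster_name: '{cluster}'
initial_token: {token}
seed_provider:
   - class_name: org.apache.cassandra.locator.SimpleSeedProvider
     parameters:
         - seeds: "{seeds}"
listen_address: {address}
rpc_address: 0.0.0.0
"""


def render_settings(cluster, token, seeds, address):
    """Return the modified cassandra.yaml properties for one node as text."""
    return TEMPLATE.format(cluster=cluster, token=token,
                           seeds=",".join(seeds), address=address)


if __name__ == "__main__":
    # Analytics seed first, then the Cassandra seed, as in the example above.
    seeds = ["110.82.155.3", "110.82.155.0"]
    print(render_settings("DSECluster",
                          56713727820156410577229101238628035242,
                          seeds, "110.82.155.1"))
```

The usage shown prints the settings for Node1. The values would still need to be merged into each node's existing cassandra.yaml file.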

Generating Tokens

Tokens are used to assign a range of data to a particular node. Assuming you are using the RandomPartitioner, spacing the tokens evenly across the range 0 to 2**127, as the following program does, ensures even data distribution.

  1. Create a new file for your token generator program:

    vi tokengentool
    
  2. Paste the following Python program into this file:

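    #!/usr/bin/env python3
    # Print evenly spaced RandomPartitioner tokens (range 0 to 2**127).
    import sys

    if len(sys.argv) > 1:
        num = int(sys.argv[1])
    else:
        num = int(input("How many nodes are in your cluster? "))
    for i in range(num):
        # Integer division keeps the full precision of these large values.
        print('node %d: %d' % (i, i * (2**127) // num))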
    
  3. Save and close the file and make it executable:

    chmod +x tokengentool
    
  4. Run the script:

    ./tokengentool
    
  5. When prompted, enter the total number of nodes in your cluster:

    How many nodes are in your cluster? 6
    node 0: 0
    node 1: 28356863910078205288614550619314017621
    node 2: 56713727820156410577229101238628035242
    node 3: 85070591730234615865843651857942052864
    node 4: 113427455640312821154458202477256070485
    node 5: 141784319550391026443072753096570088106
    
  6. On each node, edit the cassandra.yaml file and enter its corresponding token value in the initial_token property.
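
Step 6 can be scripted when there are many nodes. The following is a minimal sketch, not a DSE tool: the function name set_initial_token is hypothetical, and it assumes the file text already contains an initial_token line (possibly with an empty value). It works on the file contents as a string, so reading and writing the file is left to the caller.

```python
import re


def set_initial_token(yaml_text, token):
    """Return yaml_text with the initial_token property set to token.

    Assumes the text already contains an initial_token line, possibly empty.
    """
    pattern = re.compile(r"^initial_token:.*$", re.MULTILINE)
    if not pattern.search(yaml_text):
        raise ValueError("no initial_token property found")
    # Replace only the first match and leave every other line untouched.
    return pattern.sub("initial_token: %d" % token, yaml_text, count=1)


if __name__ == "__main__":
    sample = ("cluster_name: 'DSECluster'\n"
              "initial_token:\n"
              "listen_address: 110.82.155.1\n")
    print(set_initial_token(sample, 56713727820156410577229101238628035242))
```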

Starting a DataStax Enterprise Cluster

After you have installed and configured DSE on one or more nodes, you are ready to start your cluster, beginning with the seed nodes. In a mixed-workload DSE cluster, you must start the Analytics seed node first.

Packaged installations include startup scripts for running DSE as a service. Binary packages do not.

Configuring Firewall Port Access

If you have a firewall running on the nodes in your Cassandra or DataStax Enterprise cluster, you must open up the following ports to allow communication between the nodes, including certain Cassandra ports. If this isn’t done, when you start Cassandra (or Hadoop in DataStax Enterprise) on a node, the node will act as a standalone database server rather than joining the database cluster.

Port    Rule Type         Description
7000    Custom TCP Rule   Cassandra intra-node port (source is the current security group)
9160    Custom TCP Rule   Cassandra client port
8012    Custom TCP Rule   Hadoop Job Tracker client port
9290    Custom TCP Rule   Hadoop Job Tracker Thrift port (source is the current security group)
10000   Custom TCP Rule   Hive Thrift Server port (for JDBC Hive access)
50030   Custom TCP Rule   Hadoop Job Tracker website port
50060   Custom TCP Rule   Hadoop Task Tracker website port
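
One way to confirm that the ports listed above are reachable from another node is a plain TCP connection test. The following is a minimal sketch using only the Python standard library. The names DSE_PORTS and port_is_open are hypothetical. A failed connection cannot distinguish a firewall block from a service that is simply not running, so treat the result as a hint rather than proof.

```python
import socket
import sys

# Ports from the table above.
DSE_PORTS = [7000, 9160, 8012, 9290, 10000, 50030, 50060]


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Pass one or more node addresses on the command line.
    for host in sys.argv[1:]:
        for port in DSE_PORTS:
            state = "reachable" if port_is_open(host, port) else "blocked or not listening"
            print("%s:%d %s" % (host, port, state))
```

For example, saving the sketch under a name of your choice and running it with the argument 110.82.155.0 from node1 would test the ports on node0.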

Starting DataStax Enterprise as a Stand-Alone Process

If you are running a mixed-workload cluster, determine which nodes to start as Cassandra nodes and which nodes to start as Analytics nodes. Start the seed nodes first (the Analytics seed node, followed by the Cassandra seed node), then start the remaining nodes in the cluster one at a time.

On an Analytics node:

dse cassandra -t

On a Cassandra node:

dse cassandra
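
The start order and the per-node command can be worked out from a node list. The following is a minimal sketch: the function name startup_plan and the tuple layout are assumptions made for illustration, and it only computes the plan without running any command.

```python
def startup_plan(nodes):
    """Return (name, command) pairs in the recommended start order.

    nodes is a list of (name, workload, is_seed) tuples, where workload is
    "analytics" or "cassandra".
    """
    def rank(node):
        _, workload, is_seed = node
        if is_seed and workload == "analytics":
            return 0  # the Analytics seed node starts first
        if is_seed:
            return 1  # then the Cassandra seed node
        return 2      # then the remaining nodes, one at a time

    commands = {"analytics": "dse cassandra -t", "cassandra": "dse cassandra"}
    # sorted() is stable, so non-seed nodes keep their input order.
    return [(name, commands[workload])
            for name, workload, _ in sorted(nodes, key=rank)]


if __name__ == "__main__":
    example = [
        ("node0", "cassandra", True), ("node1", "cassandra", False),
        ("node2", "cassandra", False), ("node3", "analytics", True),
        ("node4", "analytics", False), ("node5", "analytics", False),
    ]
    for name, command in startup_plan(example):
        print("%s: %s" % (name, command))
```

For the 6-node example cluster, the plan starts node3 with dse cassandra -t, then node0 with dse cassandra, and then the remaining nodes in their listed order.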

Starting DataStax Enterprise as a Service

Packaged installations provide startup scripts in /etc/init.d for starting DSE as a service. Before starting DSE as a service on an Analytics node, you must first configure the service to start the Hadoop Job Tracker and Task Tracker services as well.

Note

For mixed-workload clusters, nodes that are Cassandra-only can simply start the DSE service (skip step 1).

  1. Create the file /etc/default/dse, and add the following line as the contents of this file:

    HADOOP_ENABLED=1
    
  2. Start the DSE service:

    sudo service dse start
    

Note

On Enterprise Linux systems, the DSE service runs as a java process. On Debian systems, the DSE service runs as a jsvc process.