DataStax Enterprise 2.0 Documentation

Configuring and Initializing a DataStax Enterprise Cluster

Before you can start DataStax Enterprise (DSE) on either a single or multi-node cluster, there are a few Cassandra configuration properties you must set on each node in the cluster. You set these properties in the cassandra.yaml file (located in /etc/dse/cassandra in packaged installations or <install_location>/resources/cassandra/conf in binary distributions).

Note

These instructions apply only to single data center clusters. For information about configuring clusters with multiple data centers, see Configuring Multiple Data Centers Quick Start.

In DataStax Enterprise, the term data center refers to a grouping of nodes. You should configure these data centers by type of node: Cassandra, Analytics, and Search.

Initializing a Multi-Node DataStax Enterprise Cluster

Before starting a multi-node DSE cluster, you must determine the following:

  • A name for your cluster.
  • How many total nodes your DSE cluster will have.
  • The internal IP addresses of each node.
  • The token for each node (see Generating Tokens). If you are deploying a mixed-workload DSE Cluster, make sure to alternate token assignments between Cassandra nodes and Analytics nodes so that replicas are evenly balanced.
  • Which nodes will serve as the seed nodes. You need at least one seed node per data center for Cassandra and Hadoop nodes. Solr nodes don't require a seed node.
  • If you intend to run a mixed-workload cluster, determine which nodes will serve which purpose.
  • If you have a firewall enabled on the machines that you plan to use for your cluster, make sure that nodes within a cluster can reach each other. See Configuring Firewall Port Access.
  • If you want to use Solr, you must create a data center for Solr nodes and all nodes in that data center must also be running with Solr enabled. The default DseSimpleSnitch does this for you automatically.

To determine token assignments:

For example, suppose you are starting an 8-node mixed-workload cluster with 3 Analytics nodes, 3 Cassandra nodes, and 2 Search nodes. The nodes have the following IPs:

  • node0 (Cassandra seed) 110.82.155.0
  • node1 (Cassandra) 110.82.155.1
  • node2 (Cassandra) 110.82.155.2
  • node3 (Analytics seed) 110.82.155.3
  • node4 (Analytics) 110.82.155.4
  • node5 (Analytics) 110.82.155.5
  • node6 (Search) 110.82.155.6
  • node7 (Search) 110.82.155.7

To assign tokens in a multi data-center cluster, you generate tokens for the nodes in one data center, and then offset those token numbers by 1 for all nodes in the next data center, by 2 for the nodes in the next data center, and so on (larger increments are allowed, such as 10 or 50).

Because the number of nodes is not the same in each data center, you need to run the Token Generating Tool twice. The first run generates the tokens for the Cassandra data center. The second run generates tokens for the Search data center. For the Analytics data center, you offset the tokens generated by the first run. In this example, the tokens are incremented by 10. For the Search (Solr) data center, you use the tokens generated by the second run and then increment the token of the first Search node by 20, because tokens 0 and 10 are already assigned.

Node     Token                                      Offset   Type
Token Generation - First Run
node 0   0                                          NA       Cassandra seed
node 1   56713727820156410577229101238628035242     NA       Cassandra
node 2   113427455640312821154458202477256070485    NA       Cassandra
node 3   10                                         10       Analytics seed
node 4   56713727820156410577229101238628035252     10       Analytics
node 5   113427455640312821154458202477256070495    10       Analytics
Token Generation - Second Run
node 6   20 (offset twice)                          20       Search
node 7   85070591730234615865843651857942052864     NA       Search
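
The arithmetic behind these token values can be reproduced with a short sketch. It assumes the same formula as the token generating tool shown later in this document, and it assumes that only the first Search token is shifted, as the text describes. The helper name generate_tokens is hypothetical and is not part of DSE.

```python
# Sketch only: generate_tokens is a hypothetical helper, not a DSE tool.
RING_SIZE = 2 ** 127  # token range used by the token generating tool


def generate_tokens(node_count):
    """Evenly spaced tokens for one data center."""
    return [i * RING_SIZE // node_count for i in range(node_count)]


first_run = generate_tokens(3)   # Cassandra data center (3 nodes)
second_run = generate_tokens(2)  # Search data center (2 nodes)

cassandra_tokens = first_run
# Analytics reuses the first run, shifted by 10 so that no token collides.
analytics_tokens = [token + 10 for token in first_run]
# Search uses the second run; only its first token (0) is shifted, by 20.
search_tokens = [second_run[0] + 20] + second_run[1:]

for name, tokens in [("Cassandra", cassandra_tokens),
                     ("Analytics", analytics_tokens),
                     ("Search", search_tokens)]:
    for token in tokens:
        print("%-9s %d" % (name, token))
```

A larger offset step (such as 50) works the same way, as long as every token in the cluster stays unique.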

Since this is a mixed-workload cluster, the token placement alternates between Cassandra, Analytics, and Search nodes, which keeps replicas evenly distributed around the ring. The tokens are listed below in ring order, followed by the modified property settings in the cassandra.yaml file for each node.

  • node 0: 0 (Cassandra seed)
  • node 3: 10 (Analytics seed)
  • node 6: 20 (Search)
  • node 1: 56713727820156410577229101238628035242 (Cassandra)
  • node 4: 56713727820156410577229101238628035252 (Analytics)
  • node 7: 85070591730234615865843651857942052864 (Search)
  • node 2: 113427455640312821154458202477256070485 (Cassandra)
  • node 5: 113427455640312821154458202477256070495 (Analytics)
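
As a quick sanity check, the ring order in this list can be derived by sorting the nodes by token. The sketch below copies the token values from the list; the dictionary and variable names are illustrative only.

```python
# Sketch only: tokens copied from the list above; names are just labels.
TOKENS = {
    "node0": 0,
    "node1": 56713727820156410577229101238628035242,
    "node2": 113427455640312821154458202477256070485,
    "node3": 10,
    "node4": 56713727820156410577229101238628035252,
    "node5": 113427455640312821154458202477256070495,
    "node6": 20,
    "node7": 85070591730234615865843651857942052864,
}

# Nodes sorted by token give the order in which they sit on the ring.
ring_order = sorted(TOKENS, key=TOKENS.get)

# Every node needs its own token, so duplicates indicate a mistake.
has_duplicates = len(set(TOKENS.values())) != len(TOKENS)

print(" -> ".join(ring_order))
print("duplicate tokens: %s" % has_duplicates)
```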

Node0

cluster_name: 'DSECluster'
initial_token: 0
seed_provider:
   - class_name: org.apache.cassandra.locator.SimpleSeedProvider
     parameters:
        - seeds: "110.82.155.3,110.82.155.0"
listen_address: 110.82.155.0
rpc_address: 0.0.0.0

Node1

cluster_name: 'DSECluster'
initial_token: 56713727820156410577229101238628035242
seed_provider:
   - class_name: org.apache.cassandra.locator.SimpleSeedProvider
     parameters:
        - seeds: "110.82.155.3,110.82.155.0"
listen_address: 110.82.155.1
rpc_address: 0.0.0.0

Node2

cluster_name: 'DSECluster'
initial_token: 113427455640312821154458202477256070485
seed_provider:
   - class_name: org.apache.cassandra.locator.SimpleSeedProvider
     parameters:
        - seeds: "110.82.155.3,110.82.155.0"
listen_address: 110.82.155.2
rpc_address: 0.0.0.0

Node3

cluster_name: 'DSECluster'
initial_token: 10
seed_provider:
   - class_name: org.apache.cassandra.locator.SimpleSeedProvider
     parameters:
        - seeds: "110.82.155.3,110.82.155.0"
listen_address: 110.82.155.3
rpc_address: 0.0.0.0

Node4

cluster_name: 'DSECluster'
initial_token: 56713727820156410577229101238628035252
seed_provider:
   - class_name: org.apache.cassandra.locator.SimpleSeedProvider
     parameters:
        - seeds: "110.82.155.3,110.82.155.0"
listen_address: 110.82.155.4
rpc_address: 0.0.0.0

Node5

cluster_name: 'DSECluster'
initial_token: 113427455640312821154458202477256070495
seed_provider:
   - class_name: org.apache.cassandra.locator.SimpleSeedProvider
     parameters:
        - seeds: "110.82.155.3,110.82.155.0"
listen_address: 110.82.155.5
rpc_address: 0.0.0.0

Node6

cluster_name: 'DSECluster'
initial_token: 20
seed_provider:
   - class_name: org.apache.cassandra.locator.SimpleSeedProvider
     parameters:
        - seeds: "110.82.155.3,110.82.155.0"
listen_address: 110.82.155.6
rpc_address: 0.0.0.0

Node7

cluster_name: 'DSECluster'
initial_token: 85070591730234615865843651857942052864
seed_provider:
   - class_name: org.apache.cassandra.locator.SimpleSeedProvider
     parameters:
        - seeds: "110.82.155.3,110.82.155.0"
listen_address: 110.82.155.7
rpc_address: 0.0.0.0
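
Because the per-node settings differ only in initial_token and listen_address, they can be generated from a template. The following is a minimal sketch under that assumption; render_settings is a hypothetical helper (not a DSE tool), and the template indents the seed_provider entry so that it parses as valid YAML.

```python
# Sketch only: render_settings is a hypothetical helper, not a DSE tool.
SEEDS = "110.82.155.3,110.82.155.0"  # Analytics seed and Cassandra seed

TEMPLATE = """cluster_name: '{cluster}'
initial_token: {token}
seed_provider:
   - class_name: org.apache.cassandra.locator.SimpleSeedProvider
     parameters:
        - seeds: "{seeds}"
listen_address: {address}
rpc_address: 0.0.0.0
"""


def render_settings(token, address, cluster="DSECluster", seeds=SEEDS):
    """Return the modified cassandra.yaml properties for one node."""
    return TEMPLATE.format(cluster=cluster, token=token,
                           seeds=seeds, address=address)


# Example: the settings for node3 (Analytics seed).
print(render_settings(10, "110.82.155.3"))
```

The output is only the fragment of properties discussed here; the remaining cassandra.yaml properties are left untouched.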

Generating Tokens

Tokens are used to assign a range of data to a particular node within a data center. Assuming you are using the RandomPartitioner, spacing the tokens evenly across the token range (0 to 2**127), as the steps below do, ensures even data distribution. For a multi data-center cluster, generate the tokens for the nodes in one data center, and then offset those token numbers by 1 for all nodes in the next data center, by 2 for the nodes in the next data center, and so on. (Instead of using single digits, you might want to offset the token number by a larger value, such as 10 or 50.)

Note

The following steps illustrate token generation for the above example.

To create tokens:

  1. Create a new file for your token generator program:

    vi tokengentool
    
  2. Paste the following Python program into this file:

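    #! /usr/bin/python
    # Print evenly spaced tokens (0 to 2**127) for the given number of nodes.
    import sys
    if len(sys.argv) > 1:
        num = int(sys.argv[1])
    else:
        try:
            prompt = raw_input  # Python 2
        except NameError:
            prompt = input      # Python 3
        num = int(prompt("How many nodes are in your cluster? "))
    for i in range(num):
        # Floor division keeps each token an exact integer.
        print('node %d: %d' % (i, i * (2**127) // num))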
    
  3. Save and close the file and make it executable:

    chmod +x tokengentool
    
  4. Run the script:

    ./tokengentool
    
  5. When prompted, enter the total number of nodes in your Cassandra data center:

    How many nodes are in your cluster? 3
    
    node 0: 0
    node 1: 56713727820156410577229101238628035242
    node 2: 113427455640312821154458202477256070485
    
  6. Run the tool again for two nodes (Solr data center):

    How many nodes are in your cluster? 2
    
    node 0: 0
    node 1: 85070591730234615865843651857942052864
    
  7. On each node, edit the cassandra.yaml file and enter its corresponding token value in the initial_token property.

Configuring Firewall Port Access

If you have a firewall running on the nodes in your Cassandra or DataStax Enterprise cluster, you must open the following ports, including certain Cassandra ports, to allow communication between the nodes. Otherwise, when you start Cassandra (or Hadoop in DataStax Enterprise) on a node, the node acts as a standalone database server rather than joining the database cluster.

Port     Description
Public Facing Ports
22       SSH (default)
DataStax Enterprise Specific
8012     Hadoop Job Tracker client port
8983     Solr port and Demo applications website port (Portfolio, Search, Search log)
50030    Hadoop Job Tracker website port
50060    Hadoop Task Tracker website port
OpsCenter Specific
8888     OpsCenter website port
Intranode Ports
Cassandra Specific
1024+    JMX reconnection/loopback ports
7000     Cassandra intra-node port
7199     Cassandra JMX monitoring port
9160     Cassandra client port
DataStax Enterprise Specific
9290     Hadoop Job Tracker Thrift port
OpsCenter Specific
50031    OpsCenter HTTP proxy for Job Tracker
61620    OpsCenter intra-node monitoring port
61621    OpsCenter agent ports
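
One way to confirm that a firewall is not blocking a port is to attempt a TCP connection from another node. The sketch below uses only the Python standard library; port_is_open and unreachable_ports are hypothetical helper names, and the default port list is taken from the Cassandra Specific rows of the table. A port also shows as unreachable if the service behind it is not running, so this check is meaningful only after the node has been started.

```python
import socket

# Sketch only: both helpers are hypothetical names, not DSE or Cassandra tools.
# Ports taken from the "Cassandra Specific" rows of the table above.
CASSANDRA_PORTS = [7000, 7199, 9160]


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:  # refused, unreachable, or timed out
        return False
    conn.close()
    return True


def unreachable_ports(host, ports=CASSANDRA_PORTS):
    """List the ports on host that could not be reached from this machine."""
    return [port for port in ports if not port_is_open(host, port)]
```

For example, calling unreachable_ports("110.82.155.0") from node1 would return an empty list if the seed node is running and all three ports are reachable.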

Starting a DataStax Enterprise Cluster

After you have installed and configured DSE on one or more nodes, you are ready to start your cluster, beginning with the seed nodes. In a mixed-workload DSE cluster, you must start the Analytics seed node first.

Packaged installations include startup scripts for running DSE as a service. Binary packages do not.

Note

When Cassandra loads, you may notice a message that MX4J will not load and that mx4j-tools.jar is not in the classpath. You can ignore this message. MX4J provides an HTML and HTTP interface to JMX and is not necessary to run Cassandra. DataStax recommends using OpsCenter, which has more monitoring capabilities than MX4J.

Starting DataStax Enterprise as a Stand-Alone Process

If running a mixed-workload cluster, determine which nodes to start as Analytics, Cassandra, and Search nodes. Start the seed nodes first (the Analytics seed node, followed by the Cassandra seed node), then start the remaining nodes in the cluster one at a time. For additional information, see Configuring Multiple Data Centers Quick Start.
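
The start order described above can be expressed as a simple sort. This is a sketch only: the node list mirrors the 8-node example earlier in this document, and start_order is a hypothetical helper rather than anything shipped with DSE.

```python
# Sketch only: (name, type, is_seed) tuples mirror the 8-node example.
NODES = [
    ("node0", "Cassandra", True),
    ("node1", "Cassandra", False),
    ("node2", "Cassandra", False),
    ("node3", "Analytics", True),
    ("node4", "Analytics", False),
    ("node5", "Analytics", False),
    ("node6", "Search", False),
    ("node7", "Search", False),
]


def start_order(nodes):
    """Analytics seeds first, then other seeds, then the remaining nodes."""
    def rank(node):
        _name, node_type, is_seed = node
        if is_seed and node_type == "Analytics":
            return 0
        if is_seed:
            return 1
        return 2
    # sorted() is stable, so nodes with the same rank keep their list order.
    return sorted(nodes, key=rank)


for name, node_type, _is_seed in start_order(NODES):
    print("%s (%s)" % (name, node_type))
```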

To start DataStax Enterprise as a stand-alone process:

  • Analytics node: dse cassandra -t

  • Cassandra node: dse cassandra

  • Solr node: dse cassandra -s

  • To check that your ring is up and running (from the install directory):

    $ bin/nodetool ring -h localhost
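
When scripting the startup, the commands listed above can be looked up by node type. The mapping below is a minimal sketch; start_command is a hypothetical helper, and the flags are exactly those shown in the list (-t for Analytics, -s for Solr, none for Cassandra).

```python
# Sketch only: start_command is a hypothetical helper, not part of DSE.
# "Search" refers to a Solr node.
START_FLAGS = {
    "Analytics": ["-t"],
    "Cassandra": [],
    "Search": ["-s"],
}


def start_command(node_type):
    """Return the stand-alone start command as an argument list."""
    return ["dse", "cassandra"] + START_FLAGS[node_type]


for node_type in ("Analytics", "Cassandra", "Search"):
    print("%-9s %s" % (node_type, " ".join(start_command(node_type))))
```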
    

Starting DataStax Enterprise as a Service

Packaged installations provide startup scripts in /etc/init.d for starting DSE as a service.

For mixed-workload clusters, nodes that are Cassandra-only can simply start the DSE service (skip step 1).

To start DataStax Enterprise as a service:

  1. Create the /etc/default/dse file, and then add the appropriate line to this file, depending on the type of node you want:

    • HADOOP_ENABLED=1 - Designates the node as a DSE Analytics node and starts the Hadoop Job Tracker and Task Tracker services.
    • SOLR_ENABLED=1 - Starts the node as a DSE Search node. See Getting Started with DSE Search.

    Note

    Using the SOLR_ENABLED and HADOOP_ENABLED options together to enable both search and Hadoop analytics on the same node is recommended only for development. In production environments, each node should be used for only one or the other.

  2. Start the DSE service:

    sudo service dse start
    
  3. To check if your cluster is up and running:

    nodetool ring -h localhost
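
The choice made in step 1 can be summarized as a lookup from node type to the line that goes into /etc/default/dse. This is a sketch under the assumption of one workload per node, as recommended for production; default_file_lines is a hypothetical helper name.

```python
# Sketch only: default_file_lines is a hypothetical helper, not part of DSE.
def default_file_lines(node_type):
    """Return the lines to place in /etc/default/dse for one node type."""
    if node_type == "Analytics":
        return ["HADOOP_ENABLED=1"]
    if node_type == "Search":
        return ["SOLR_ENABLED=1"]
    if node_type == "Cassandra":
        return []  # Cassandra-only nodes skip step 1
    raise ValueError("unknown node type: %r" % node_type)


for node_type in ("Analytics", "Search", "Cassandra"):
    print("%-9s %s" % (node_type, default_file_lines(node_type)))
```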
    

On RHEL and CentOS, the DSE service runs as a java process. On Debian systems, the DSE service runs as a jsvc process.