How to Set Up and Monitor a Multi-Node Cassandra and Hadoop Cluster on Linux
How would you like to easily stand up a scalable, distributed database system that manages big data in real time with Apache Cassandra and also analyzes that same data with Apache Hadoop? If your answer is “yes”, good, because that’s what I’ll show you how to do in this article.
For this particular tutorial, I’ll be using four CentOS 6.0 Linux machines that I’ve provisioned on Rackspace, each with 2GB of RAM and 80GB of disk space. If you’re not using CentOS, don’t worry; the process below works on any Linux platform (e.g. Red Hat, Ubuntu, Debian).
For this exercise, we’ll designate that two of our nodes will manage real-time data with Cassandra, and the other two will be assigned analytic operations with Hadoop. I’ll also be using DataStax OpsCenter to manage and monitor the Cassandra and Hadoop cluster.
Before you install the Apache Cassandra/Hadoop software, there are a few prerequisites that you’ll want to make sure are on your boxes before you begin. Specifically, they include:
- Java 1.6 or higher
- Python 2.6 or higher
You can check these prerequisites on your Linux boxes from the command line by viewing each tool’s reported version.
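For example, a small shell loop can check both at once (this assumes `java` and `python` are the binary names on your PATH; adjust if your distribution uses different names):

```shell
# Print each prerequisite's version, or a warning if it's missing.
for tool in "java -version" "python -V"; do
  name=${tool%% *}
  if command -v "$name" >/dev/null 2>&1; then
    $tool 2>&1 | head -n 1    # both tools print their version string
  else
    echo "WARNING: $name not found on PATH"
  fi
done
```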
If you don’t have these prerequisites on your machines, follow the steps in the install sections of the DataStax documentation for installing them.
To improve performance, you can also install JNA, which helps with memory management. JNA is optional, but if you’d like to install it, see the DataStax documentation for the few required steps.
Also, if you have a firewall enabled, be sure that you open ports so that all your machines have access to one another. If you don’t do this, when you start Cassandra or Hadoop on a node, it will act as a standalone database server rather than as a database cluster.
The ports you need to open for Cassandra are 7000 and 9160. For Hadoop, you need to open 8012, 9290, 50030, and 50031. For OpsCenter, you need to open 7199, 8888, 61620, and 61621.
For each port you need to open, you can use an iptables command similar to this:
iptables -A INPUT -p tcp --dport 7000 -j ACCEPT
You’ll want to enter a command like the above on all machines that make up your cluster and for all ports that need to be opened.
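Rather than typing ten separate commands on each machine, a small loop can print all the rules for review first; you can then pipe its output to `sh` as root to actually apply them (this is just a sketch that combines the Cassandra, Hadoop, and OpsCenter ports listed above):

```shell
# Print an ACCEPT rule for every port the cluster needs:
# Cassandra (7000, 9160), Hadoop (8012, 9290, 50030, 50031),
# OpsCenter (7199, 8888, 61620, 61621).
for port in 7000 9160 8012 9290 50030 50031 7199 8888 61620 61621; do
  echo "iptables -A INPUT -p tcp --dport $port -j ACCEPT"
done
```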
Also, a quick note on CentOS and RHEL systems: their iptables configuration oftentimes contains a universal REJECT all anywhere rule that you’ll want to remove. If you don’t, then even if you open all the ports listed above, your machine will still universally reject incoming requests.
Download the Software
Next we need to download the Apache Cassandra and Hadoop software. For this article, I’ll be using the DataStax Enterprise Edition bundle, which contains the latest and most stable Apache Cassandra and Hadoop versions, along with the Cassandra Query Language (CQL) utility, a sample database and application, and the enterprise edition of DataStax OpsCenter, which is the tool you’ll want to use for managing and monitoring your Cassandra/Hadoop cluster.
For Linux, DataStax makes available rpm, deb, and tar downloads. For this tutorial, I’ll be using the tar download files.
You can register and obtain the DataStax Enterprise Edition from the downloads page. There’s no time bomb or trial license date for DataStax Enterprise if you use it for development/testing purposes (production deployments do require a subscription license).
Once you get an assigned userid/password for the DataStax Enterprise Edition software, you can also get the download file via the wget command in this fashion:
wget http://[username]:[password]@downloads.datastax.com/enterprise/dse-[version number]-bin.tar.gz
Unpack the Download File
Next, on each of your machines, move the tar file you’ve downloaded to the directory where you’d like to install things and unpack it (e.g. with tar -xzf).
Generate the Tokens for the Cluster
Cassandra automatically distributes data evenly across all the machines in a cluster via a hashing algorithm applied to incoming data. Because DataStax Enterprise is built on top of Cassandra, the same is done for data utilized by Hadoop.
All you have to do is assign a token to each machine in the cluster, which is a numerical identifier that determines the machine’s position in the cluster and the range of data each machine is responsible for.
DataStax supplies a small Python script that makes generating the tokens very easy. To use it:
- Create a new file on one of your Linux machines using your favorite editor
- Copy and paste the following code into it:
#!/usr/bin/python
import sys

if len(sys.argv) > 1:
    num = int(sys.argv[1])
else:
    num = int(raw_input("How many nodes are in your cluster? "))

for i in range(0, num):
    print 'token %d: %d' % (i, (i * (2 ** 127) / num))
- Save and name your file (e.g. tokentool), and be sure to set its permissions so you can execute it (chmod +x tokentool)
Then execute the new script, input the number of nodes you intend to use for your new Cassandra/Hadoop cluster, and the script will output the tokens for each node of your new cluster.
Record the tokens because we’ll need them for the next step. For the cluster I’m building for this exercise, I’ll assign the above tokens as follows:
- Node 1: Cassandra: token 0
- Node 2: Hadoop: token 42535295865117307932921825928971026432
- Node 3: Cassandra: token 85070591730234615865843651857942052864
- Node 4: Hadoop: token 127605887595351923798765477786913079296
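If you’d like to sanity-check the token math, it’s just i * 2**127 / num_nodes for each node i, i.e. evenly spaced points on Cassandra’s ring. A couple of lines of Python reproduce the four-node list above:

```python
# Recompute the tokens for a 4-node cluster; each token is an
# evenly spaced point on the 2**127-wide ring.
num_nodes = 4
for i in range(num_nodes):
    print('token %d: %d' % (i, (i * 2 ** 127) // num_nodes))
```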
Next, Set a Few Parameters
There are a few startup parameters you’ll want to set for your new Cassandra/Hadoop cluster. Some are mandatory and others are optional. All are contained in the cassandra.yaml file, which is found in the /conf subdirectory directly under the directory where you unpacked the DataStax Enterprise Edition server package.
The mandatory parameters you’ll need to set are:
initial_token: using the token list you generated, input a token for each machine in your cluster.
listen_address: Set to the IP address of the machine.
seeds: When a node first starts, it contacts a seed node to bootstrap the gossip communication process. The seed node designation has no purpose other than bootstrapping new nodes joining the cluster. Seed nodes are not a single point of failure and you can input multiple seed nodes if you’d like. For your new Cassandra/Hadoop cluster, input the IP address or machine name of your first Hadoop node followed by the IP/name of your first Cassandra node separated by a comma:
seeds: "ip address of Hadoop seed node, ip address of cassandra seed node"
Optional parameters you can set are:
cluster_name: is (surprise!) the name of your cluster. You can leave “Test Cluster” in if you’d like or give it a more meaningful name.
data_file_directories, commitlog_directory, saved_caches_directory: Like many other database systems, Cassandra has data and log files. And like other databases, you’ll get better performance if you separate the data and log files so that they’re on different drives/devices. This isn’t necessary if you’re just getting your feet wet with Cassandra, but you’ll want to create separate directories for these if/when you go into production, and include those directories here. If you leave them blank, Cassandra will just use the /var/lib directory for everything.
Of course, there are other parameters you can tweak in the cassandra.yaml file for performance tuning, etc., but you’re done for now. Save your changes to the cassandra.yaml file once you’re finished.
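Putting it together, the edited section of cassandra.yaml for Node 1 in this example might look something like this sketch (the IP addresses here are placeholders; substitute your own machines’ addresses):

```yaml
cluster_name: 'My DSE Cluster'   # optional; defaults to 'Test Cluster'
initial_token: 0                 # Node 1's token from the generated list
listen_address: 10.0.0.1         # this machine's own IP (placeholder)
seeds: "10.0.0.2,10.0.0.1"       # Hadoop seed first, then Cassandra seed
```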
Start your Cluster
You’re now ready to start up your Cassandra and Hadoop cluster. Start your Hadoop seed node first by going to the [installation directory]/bin directory and executing ./dse cassandra -t.
Then go to your Cassandra seed node and execute ./dse cassandra. With the seed nodes up, you can start your other Hadoop and Cassandra nodes in the same fashion.
Give the nodes a few seconds to establish themselves in the ring, then check the status of your cluster by using the nodetool utility in the [installation directory]/resources/cassandra/bin directory (on any node in your cluster) and entering the command:
./nodetool -h localhost ring
You should see the nodes that comprise your cluster along with their status and some other pieces of information.
Notice how two of my nodes are labeled Cassandra nodes for real-time workloads, and two are labeled Analytics nodes for Hadoop/batch analytic workloads.
One of the great things about using Cassandra and Hadoop with DataStax Enterprise is that it provides automatic workload separation between your Cassandra and Hadoop nodes. Your Cassandra real-time work won’t compete for data or compute resources on your Hadoop nodes, and your Hadoop MapReduce, Hive, and Pig tasks won’t compete for the data or compute resources on your Cassandra nodes.
All data in the cluster is kept in sync via Cassandra’s replication feature, so you don’t have to ETL any data between nodes. For example, if you create and populate a Hive table on your Hadoop nodes, that object is automatically copied for you and made available in a column family on your Cassandra nodes.
You also get a fault-tolerant Hadoop implementation with DataStax Enterprise, so you won’t have the single points of failure found in standard Hadoop installs. Your Hadoop Job Tracker will run on your Hadoop seed node, but you can move it to another Hadoop node in the cluster via the dsetool utility that comes packaged with your install.
That was pretty easy, wasn’t it? Download and unpack a single tar file, set a couple of config parameters, start your nodes, and you’ve got a full-blown, scalable, fault-tolerant Cassandra/Hadoop cluster that’s ready for work.
Let’s now install and configure the management and monitoring tool you’ll use to maintain your new cluster.
Installing DataStax OpsCenter
DataStax OpsCenter is a visual management and monitoring solution for Apache Cassandra that lets you easily see what’s going on in your database cluster, manage objects, and more. There are two editions of DataStax OpsCenter: a free community edition and a paid enterprise edition. The paid version is what you’ll need to manage and monitor both Cassandra and Hadoop, but there’s no need to pay for it until you move your cluster into production status.
With DataStax OpsCenter, you’ll install the main OpsCenter service on one of your nodes and agents on every node. You can choose any node for the OpsCenter service; I’ll just use my first node.
Download the OpsCenter Enterprise tar file onto one of your machines and unpack it (e.g. with tar -xzf).
Go to the OpsCenter installation directory, then to its bin directory, and enter the command ./setup.py (in OpsCenter 1.3 and earlier, this script was called create-agent-tarball). It sets up SSL for OpsCenter and does a few other housekeeping chores.
Next, go to the conf directory and edit the opscenterd.conf file. You need to change the interface parameter in the file from 127.0.0.1 (which limits access to just the localhost machine) to 0.0.0.0 so you can get to OpsCenter from any machine.
Then, start DataStax OpsCenter up as a background process by running its startup script in the bin directory.
Now, you have the main OpsCenter service running on one of your nodes. Next, we have to install the OpsCenter agent on each node in your new cluster.
For every node in your cluster, follow these instructions:
- Make a directory on each machine for the OpsCenter agent
- The setup script for OpsCenter created a file called agent.tar.gz on your first machine. Copy that file (e.g. with scp) to all the machines in your cluster and put it into the new directory you just created
- Unzip the agent.tar.gz file
- Next, you need to point the agent at the main OpsCenter machine (the one you just installed the primary OpsCenter service on) and have it monitor the current machine you are on. This is done by executing the setup script in the bin subdirectory of the agent directory, passing it first the machine name or IP where the main OpsCenter service is running, followed by the local machine’s name or IP:
./setup main-opscenter-machine-name-or-ip local-machine-name-or-ip
- Start up the agent by executing its startup script in the agent’s bin directory
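Those per-node steps can be scripted once you’re comfortable with them. This sketch only prints the rollout commands for review rather than running them; the node IPs, the install path, and the OpsCenter machine’s IP are all placeholders for your own values:

```shell
# Print the per-node agent rollout commands (placeholders throughout).
OPSCENTER_IP="10.0.0.1"            # machine running the main OpsCenter service
AGENT_DIR="/opt/opscenter-agent"   # directory created for the agent on each node
for node in 10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4; do
  echo "scp agent.tar.gz root@$node:$AGENT_DIR/"
  echo "ssh root@$node 'cd $AGENT_DIR && tar -xzf agent.tar.gz && agent/bin/setup $OPSCENTER_IP $node'"
done
```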
Lastly, DataStax OpsCenter uses the iostat command on Linux to obtain disk I/O metrics. On my new Rackspace boxes, I didn’t have iostat installed, so I installed it by issuing the command: yum install sysstat.
You can now invoke the DataStax OpsCenter dashboard from any Google Chrome or Firefox web browser by typing in the following on the browser’s address bar:
http://[IP address of the OpsCenter service machine]:8888/opscenter/index.html
Congratulations: you’ve got a multi-node Cassandra and Hadoop cluster up and running, along with your visual management and monitoring solution.
We’ve reached the end of this short article on how to stand up and monitor a multi-node Cassandra and Hadoop cluster on Linux. Look for more articles on setting up Cassandra on Amazon EC2 and other platforms in the future. In the meantime, to download DataStax Enterprise Edition and start working with Cassandra and Hadoop today, please visit the downloads page.