DataStax Enterprise 3.1 Documentation

Getting Started with Analytics and Hadoop in DataStax Enterprise

In DataStax Enterprise, you can run analytics on your Cassandra data through the platform's built-in Hadoop integration. The Hadoop component in DataStax Enterprise is not a full Hadoop distribution; rather, it enables analytics to run across DataStax Enterprise's distributed, shared-nothing architecture. Instead of using the Hadoop Distributed File System (HDFS), DataStax Enterprise uses Cassandra File System (CassandraFS) keyspaces as the underlying storage layer. This provides replication and data location awareness, and takes full advantage of Cassandra's peer-to-peer architecture.
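
Because the storage layer is CassandraFS rather than HDFS, the familiar hadoop fs commands still work when run through the dse wrapper, as the tutorial below demonstrates. For example, on a running analytics node in a tarball installation, you can list the root of the CassandraFS:

bin/dse hadoop fs -ls /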

DataStax Enterprise supports running analytics on Cassandra data with the following Hadoop components:

  • MapReduce
  • Hive for running HiveQL queries on Cassandra data
  • Pig for exploring very large data sets
  • Apache Mahout for machine learning applications

Before starting an analytics/Hadoop node on a production cluster or data center, it is important to disable the virtual node configuration. You can skip this step if you only want to run the Hadoop getting started tutorial.

Disabling virtual nodes

DataStax recommends using virtual nodes only on data centers running Cassandra real-time workloads. You should disable virtual nodes on data centers running either Hadoop or Solr workloads.

To disable virtual nodes:

  1. In the cassandra.yaml file, set num_tokens to 1.

    num_tokens: 1
    
  2. Uncomment the initial_token property and set it to 1 or to the value of a generated token for a multi-node cluster.
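
    For example, on a single node the relevant cassandra.yaml settings would look like this (for a multi-node cluster, substitute a generated token for the initial_token value):

    num_tokens: 1
    initial_token: 1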

Starting and stopping a DSE Analytics node

The way you start up a DSE Analytics node depends on the type of installation: tarball or packaged.

Tarball installation

From the install directory, use this command to start the analytics node:

bin/dse cassandra -t

The analytics node starts up.
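
To confirm that the node is running as an analytics node, you can inspect the ring with the dsetool utility; in DataStax Enterprise the ring output typically includes a workload column, which should show Analytics for this node:

bin/dsetool ring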

From the install directory, use this command to stop the analytics node:

bin/dse cassandra-stop

Check that the dse process has stopped:

ps auwx | grep dse

If the dse process has stopped, the output shows little more than the grep command itself, for example:

jdoe  12390 0.0 0.0  2432768   620 s000  R+ 2:17PM   0:00.00 grep dse

If the output indicates that the dse process is not stopped, rerun the cassandra-stop command using the process ID (PID) from the top of the output.

bin/dse cassandra-stop <PID>

Packaged installation

  1. Enable Hadoop mode by setting this option in /etc/default/dse: HADOOP_ENABLED=1

  2. Start the dse service using this command:

    sudo service dse start
    

    The analytics node starts up.

To stop the analytics node, use this command:

sudo service dse stop

Hadoop getting started tutorial

In this tutorial, you download a text file containing a State of the Union speech and run a classic MapReduce job that counts the words in the file and creates a sorted list of word/count pairs as output. The mapper and reducer are provided in a JAR file. Download the State of the Union speech now.

This tutorial assumes you started an analytics node on Linux. Also, the tutorial assumes you have permission to perform Hadoop and other DataStax Enterprise operations, for example, that you preface commands with sudo if necessary.

Setup

  1. Unzip the downloaded obama.txt.zip file into a directory of your choice on your file system.

    This file will be the input for the MapReduce job.

  2. Create a directory in the CassandraFS for the input file using the dse command version of the familiar hadoop fs command.

    cd <dse-install>
    bin/dse hadoop fs -mkdir /user/hadoop/wordcount/input
    
  3. Copy the input file that you downloaded to the CassandraFS.

    bin/dse hadoop fs -copyFromLocal
      <path>/obama.txt
      /user/hadoop/wordcount/input
    
  4. Check the version number of the hadoop-examples-<version>.jar. On tarball installations, the JAR is in the DataStax Enterprise resources directory. On packaged and AMI installations, the JAR is in the /usr/dse/hadoop directory.

  5. Get usage information about how to run the MapReduce job from the jar file. For example:

    bin/dse hadoop jar /<install_location>/resources/hadoop/hadoop-examples-1.0.4.8.jar
      wordcount
    

    The output is:

    2013-10-02 12:40:16.983 java[9505:1703] Unable to load realm info from SCDynamicStore
    Usage: wordcount <in> <out>
    

    If you see the SCDynamicStore message, you can safely ignore it.

  6. Run the Hadoop word count example in the JAR.

    bin/dse hadoop jar
      /<install_location>/resources/hadoop/hadoop-examples-1.0.4.8.jar wordcount
      /user/hadoop/wordcount/input
      /user/hadoop/wordcount/output
    

    The output is:

    13/10/02 12:40:36 INFO input.FileInputFormat: Total input paths to process : 0
    13/10/02 12:40:36 INFO mapred.JobClient: Running job: job_201310020848_0002
    13/10/02 12:40:37 INFO mapred.JobClient:  map 0% reduce 0%
    . . .
    13/10/02 12:40:55 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=19164
    13/10/02 12:40:55 INFO mapred.JobClient:   Map-Reduce Framework
    . . .
    
  7. List the contents of the output directory on the CassandraFS.

    bin/dse hadoop fs -ls /user/hadoop/wordcount/output
    

    The output looks something like this:

    Found 3 items
    -rwxrwxrwx   1 root wheel      0 2013-10-02 12:58 /user/hadoop/wordcount/output/_SUCCESS
    drwxrwxrwx   - root wheel      0 2013-10-02 12:57 /user/hadoop/wordcount/output/_logs
    -rwxrwxrwx   1 root wheel  24528 2013-10-02 12:58 /user/hadoop/wordcount/output/part-r-00000
    
  8. Using the output file name from the directory listing, get more information about the output file using the dsetool utility.

    bin/dsetool checkcfs /user/hadoop/wordcount/output/part-r-00000
    

    The output is:

    Path: cfs://127.0.0.1/user/hadoop/wordcount/output/part-r-00000
      INode header:
        File type: FILE
        User: root
        Group: wheel
        Permissions: rwxrwxrwx (777)
        Block size: 67108864
        Compressed: true
        First save: true
        Modification time: Wed Oct 02 12:58:05 PDT 2013
      INode:
        Block count: 1
        Blocks:                               subblocks   length   start     end
          (B) f2fa9d90-2b9c-11e3-9ccb-73ded3cb6170:   1    24528       0   24528
              f3030200-2b9c-11e3-9ccb-73ded3cb6170:        24528       0   24528
        Block locations:
        f2fa9d90-2b9c-11e3-9ccb-73ded3cb6170: [localhost]
      Data:
        All data blocks ok.
    
  9. Finally, look at the output of the MapReduce job, the list of word/count pairs, using the dse version of the familiar hadoop fs -cat command.

    bin/dse hadoop fs -cat /user/hadoop/wordcount/output/part-r-00000
    

    The output is:

    "D."  1
    "Don't  1
    "I  4
    . . .
    
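
If you want to rerun the job, first remove the output directory (or specify a different output path); Hadoop does not write into an existing output directory. For example:

bin/dse hadoop fs -rmr /user/hadoop/wordcount/output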

Hadoop demos

You can run these Hadoop demos in DataStax Enterprise:

  • Portfolio Manager demo: Demonstrates a hybrid workflow using DataStax Enterprise.
  • Hive example: Shows how to use Hive to access data in Cassandra.
  • Mahout Demo: Demonstrates Mahout support in DataStax Enterprise by determining which entries in the sample input data file remain statistically in control and which do not.
  • Pig Demo: Creates a Pig relation, performs a MapReduce job, and stores the results in a Cassandra table.
  • Sqoop Demo: Migrates data from a MySQL database containing information from the North American Numbering Plan.

Setting the replication factor

The default replication factor for the HiveMetaStore, dse_system, cfs, and cfs_archive system keyspaces is 1. A replication factor of 1 is suitable for development and testing on a single node, but not for a production environment. For production clusters, increase the replication factor to at least 2; the higher replication factor makes the cluster resilient to single-node failures. For example:

ALTER KEYSPACE cfs
  WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3};
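
After increasing the replication factor, run repair on the affected keyspaces so that existing data is copied to the additional replicas. A minimal example (the location of the nodetool utility depends on your installation type):

nodetool repair cfs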

Configuration for running jobs on a remote cluster

This information is intended for advanced users.

To connect to external addresses:

  1. Make sure that hostname resolution for the remote cluster nodes works properly on the local host.

  2. Copy the dse-core-default.xml and dse-mapred-default.xml files from any working remote cluster node to your local Hadoop conf directory.

  3. Run the job with dse hadoop.

  4. If you need to override the job tracker (JT) location, or if DataStax Enterprise cannot detect it automatically, define the HADOOP_JT environment variable before running the job:

    HADOOP_JT=<jobtracker host>:<jobtracker port> dse hadoop jar ....
    
  5. If you need to connect to many different remote clusters from the same host:

    1. Before starting the job, copy each remote cluster's full Hadoop conf directory to its own location on the local node.
    2. Select the appropriate location by defining HADOOP_CONF_DIR.
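
    For example, a hypothetical invocation that points dse hadoop at one of the copied configuration directories (the directory and JAR names here are placeholders):

    HADOOP_CONF_DIR=/opt/remote-clusterA-conf dse hadoop jar my-job.jar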