DataStax Enterprise 2.1 Documentation

Getting Started with Hadoop in DataStax Enterprise

In DataStax Enterprise, Hadoop is continuously available for analytics workloads. DataStax Enterprise is 100% compatible with Hadoop. Instead of using the Hadoop Distributed File System (HDFS), DataStax Enterprise uses Cassandra File System (CassandraFS) keyspaces for the underlying storage layer. This design provides all of the benefits of HDFS, such as replication and data location awareness, along with the benefits of the Cassandra peer-to-peer architecture. DataStax Enterprise fully supports MapReduce, Hive, and Pig. DataStax Enterprise 2.1 and later also support Apache Mahout for machine learning applications.

To run a sample application that uses some Hadoop features, see Running the Portfolio Manager Demo Application.

Starting DataStax Enterprise Hadoop

Assuming an analytics node is running, use the following command to start Hadoop:

dse hadoop fs <args>

where the available <args> are described in the HDFS File System Shell Guide on the Apache Hadoop web site.

For example:

dse hadoop fs -help
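A few other common file system operations might look like the following. This is a minimal sketch rather than an official recipe: the cfs wrapper is a hypothetical helper, and the /user/demo directory and notes.txt file are made-up names. The -mkdir, -put, -ls, and -cat options are standard Hadoop file system shell options. The wrapper echoes each command and executes it only when the dse launcher is found on the PATH, so the script can be read as a dry run on a machine without DataStax Enterprise.

```shell
# Hypothetical helper: echo the command, then run it only if the
# dse launcher is available on PATH (otherwise act as a dry run).
cfs() {
    echo "+ dse hadoop fs $*"
    if command -v dse >/dev/null 2>&1; then
        dse hadoop fs "$@"
    fi
}

# Illustrative paths; assumes an analytics node is already running.
cfs -mkdir /user/demo              # create a directory in CassandraFS
cfs -put notes.txt /user/demo/     # copy a local file into that directory
cfs -ls /user/demo                 # list the directory contents
cfs -cat /user/demo/notes.txt      # print the file to standard output
```

Each line corresponds to a plain dse hadoop fs <args> invocation, so the wrapper can be dropped once the commands behave as expected.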

For information on starting an analytics node, see Starting a DataStax Enterprise Cluster.

For information on starting Hive, Pig, or using Hadoop, see:

Divergence from Apache Hadoop for Advanced Users

To enable Hadoop to connect to external addresses, do one of the following:

  • In the core-site.xml file, change the property fs.default.name from file:/// to cfs:<listen_address>:<rpc_port>.

    This eliminates the need to specify the IP address or hostname for MapReduce jobs and all other calls to Hadoop. The core-site.xml file is in one of the following locations, depending on the installation type:

    Packaged installations: /etc/dse/hadoop

    Binary installations: /<install_location>/resources/hadoop/conf

  • Or pass the property as an embedded parameter on the command line:

    dse hadoop fs -Dfs.default.name="cfs:<listen_address>:<rpc_port>" -ls /
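Before or after editing core-site.xml, it can help to check which value fs.default.name currently holds. The following is a minimal sketch and not part of DataStax Enterprise: show_default_fs is a hypothetical helper name, it assumes the usual Hadoop layout in which the <name> and <value> elements sit on consecutive lines, and it relies on the -A option of grep, which GNU and BSD grep provide.

```shell
# Hypothetical helper: print the fs.default.name entry from core-site.xml.
# The first argument is the path to the file; it defaults to the
# packaged-installation location.
show_default_fs() {
    conf_file="${1:-/etc/dse/hadoop/core-site.xml}"
    # Print the matching <name> line and the two lines after it,
    # which normally hold the <value> element and the closing tag.
    grep -A 2 '<name>fs.default.name</name>' "$conf_file"
}
```

Called with no argument, the helper reads the packaged-installation file. For a binary installation, pass the path to core-site.xml under the resources/hadoop/conf directory of the install location as the first argument.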