DataStax Enterprise 2.2 Documentation

About Hadoop

This documentation corresponds to an earlier product version. Make sure you are viewing the documentation that corresponds to your version.

In DataStax Enterprise, Hadoop is continuously available for analytics workloads. DataStax Enterprise is 100% compatible with Apache Hadoop. Instead of using the Hadoop Distributed File System (HDFS), DataStax Enterprise uses Cassandra File System (CassandraFS) keyspaces for the underlying storage layer. This provides the benefits of HDFS, such as replication and data location awareness, with the added benefits of the Cassandra peer-to-peer architecture.
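
Because CassandraFS is Hadoop compatible, standard Hadoop file system commands operate on it unchanged. For example, assuming a running analytics node, the following command lists the CassandraFS root directory:

dse hadoop fs -ls /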

DataStax Enterprise fully supports Hadoop and the tools that run on top of it, including MapReduce, Hive, Pig, Mahout, and Sqoop.

Starting DataStax Enterprise Hadoop

Assuming an analytics node is running, use the following command to invoke the Hadoop file system shell:

dse hadoop fs <args>

where the available <args> are described in the HDFS File System Shell Guide on the Apache Hadoop web site.

For example:

dse hadoop fs -help
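
As a fuller sketch, the following session creates a directory in CassandraFS, copies a local file into it, and lists the result; the directory and file names are placeholders:

dse hadoop fs -mkdir /user/demo
dse hadoop fs -put mydata.txt /user/demo/
dse hadoop fs -ls /user/demo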

For information on starting an analytics node, see Starting and Stopping DataStax Enterprise.

For information on starting Hive or Pig and on using Hadoop, see the Hive and Pig sections of this documentation.

Hadoop Demos

After starting Hadoop, run these demos for an introduction to Hadoop in DataStax Enterprise:

  • Portfolio Manager Demo: Demonstrates a hybrid workflow using DataStax Enterprise.
  • Hive Demo: Demonstrates using Hive to access data in Cassandra.
  • Mahout Demo: Demonstrates Mahout support in DataStax Enterprise by determining which entries in the sample input data remain statistically in control and which do not.
  • Pig Demo: Demonstrates creating a Pig relation, running a simple MapReduce job, and storing the results back into CassandraFS or into a Cassandra column family.
  • Sqoop Demo: Demonstrates migrating data from a MySQL database containing North American Numbering Plan data.

Setting the Replication Factor

The default replication factor for DataStax Enterprise system keyspaces, such as cfs, is 1. A replication factor of 1 is suitable for development and testing on a single node, but not for a production environment. For production, increase the replication factor to at least 2 so that the cluster can tolerate single-node failures. For example, the following cassandra-cli command sets the replication factor for the cfs keyspace to 3 in the Analytics data center:

[default@unknown] UPDATE KEYSPACE cfs
   WITH placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
   AND strategy_options={Analytics:3};
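
After increasing the replication factor of an existing keyspace, run a repair on each affected node so that existing data is copied to the new replicas. A minimal sketch, assuming nodetool is on your path and is run against the local node:

nodetool -h localhost repair cfs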

For more information, see Changing Replication Settings.

Connecting to External Addresses

This information is intended for advanced users.

To enable Hadoop to connect to external addresses, use one of the following methods:

  • In the core-site.xml file, change the fs.default.name property from file:/// to cfs://<listen_address>:<rpc_port>, as shown in the sketch after this list.

    This eliminates the need to specify the IP address or hostname for MapReduce jobs and all other calls to Hadoop. The core-site.xml file is located in:

    Packaged installations: /etc/dse/hadoop

    Binary installations: /<install_location>/resources/hadoop/conf

  • Alternatively, pass the setting to each command as a generic Hadoop option:

    dse hadoop fs -Dfs.default.name="cfs://<listen_address>:<rpc_port>" -ls /
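
For reference, the first method corresponds to an edit like the following in core-site.xml. This is a sketch only; the address and port shown are placeholders for your node's listen_address and rpc_port (9160 is the default rpc_port):

<property>
  <name>fs.default.name</name>
  <!-- placeholder values; substitute your node's listen_address and rpc_port -->
  <value>cfs://10.1.1.1:9160</value>
</property>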