DataStax Enterprise 3.0 Documentation

Configuring Hadoop

In DataStax Enterprise, Hadoop is continuously available for analytics workloads. DataStax Enterprise is 100% compatible with Apache Hadoop. Instead of using the Hadoop Distributed File System (HDFS), DataStax Enterprise uses Cassandra File System (CassandraFS) keyspaces as the underlying storage layer. This design provides all of the benefits of HDFS, such as replication and data location awareness, along with the benefits of the Cassandra peer-to-peer architecture.

DataStax Enterprise fully supports:

Starting DataStax Enterprise Hadoop

Assuming an analytics node is running, use the following command to start Hadoop:

dse hadoop fs <args>

where the available <args> are described in the HDFS File System Shell Guide on the Apache Hadoop web site.

For example:

dse hadoop fs -help
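As a minimal sketch of everyday use, the script below runs a few standard Hadoop file system shell options (-mkdir, -put, -ls, -cat) through dse hadoop fs. The function name cfs_demo, the DSE_CMD variable, the /demo directory, and the local.txt file are hypothetical names chosen for illustration. By default the script only prints the commands it would run; setting DSE_CMD=dse on a running analytics node is assumed to execute them.

```shell
#!/bin/sh
# Sketch only: common file system operations through `dse hadoop fs`.
# DSE_CMD defaults to a dry run that echoes each command instead of
# running it. Set DSE_CMD=dse on an analytics node to execute for real.
DSE_CMD="${DSE_CMD:-echo dse}"

# cfs_demo DIR FILE
#   DIR  - hypothetical target directory in CassandraFS
#   FILE - hypothetical local file to copy into DIR
cfs_demo() {
  dir="$1"
  file="${2:-local.txt}"
  $DSE_CMD hadoop fs -mkdir "$dir"                      # create the directory
  $DSE_CMD hadoop fs -put "$file" "$dir/"               # copy the local file in
  $DSE_CMD hadoop fs -ls "$dir"                         # list the directory
  $DSE_CMD hadoop fs -cat "$dir/$(basename "$file")"    # print the file contents
}

cfs_demo /demo local.txt
```

Keeping the dry run as the default makes it possible to review the exact commands before they touch the cluster.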

For information on starting an analytics node, see Starting and stopping DataStax Enterprise.

For information on starting Hive, Pig, or using Hadoop, see:

Hadoop demos

After starting Hadoop, run these demos for a good introduction to Hadoop solutions:

  • Portfolio Manager Demo: Demonstrates a hybrid workflow using DataStax Enterprise.
  • Hive Demo: Demonstrates using Hive to access data in Cassandra.
  • Mahout Demo: Demonstrates Mahout support in DataStax Enterprise by determining which entries in the sample input data file remained statistically in control and which did not.
  • Pig Demo: Creates a Pig relation, performs a simple MapReduce job, and puts the results back into CassandraFS or into a Cassandra column family.
  • Sqoop Demo: Migrates data from a MySQL database containing information from the North American Numbering Plan.

Setting the replication factor

The default replication factor for system keyspaces is 1. This setting is suitable for development and testing on a single node, but not for a production environment. For production, increase the replication factor to at least 2 so that the cluster can tolerate a single-node failure. For example, the following statement sets the replication factor of the cfs keyspace to 3 in the Analytics data center:

[default@unknown] UPDATE KEYSPACE cfs
   WITH placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
   AND strategy_options={Analytics:3};

For more information, see Changing replication settings.
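When the same change is needed for more than one keyspace or data center, it can help to generate the statement instead of retyping it. The following sketch prints an UPDATE KEYSPACE statement in the same form as the example above. The function name update_keyspace_stmt is hypothetical, and the script only prints text; it does not connect to the cluster or run anything in the Cassandra CLI.

```shell
#!/bin/sh
# Sketch only: print the Cassandra CLI statement that sets a keyspace's
# replication factor in one data center. Nothing is executed here.
# Arguments: keyspace name, data center name, replication factor.
update_keyspace_stmt() {
  keyspace="$1"
  datacenter="$2"
  rf="$3"
  cat <<EOF
UPDATE KEYSPACE ${keyspace}
   WITH placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
   AND strategy_options={${datacenter}:${rf}};
EOF
}

# Same values as the documented example: keyspace cfs, data center Analytics, factor 3.
update_keyspace_stmt cfs Analytics 3
```

The printed statement can then be reviewed and entered at the CLI prompt.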