In DataStax Enterprise, Hadoop is continuously available for analytics workloads. DataStax Enterprise is 100% compatible with Apache Hadoop. Instead of the Hadoop Distributed File System (HDFS), DataStax Enterprise uses Cassandra File System (CassandraFS) keyspaces as the underlying storage layer. This provides the benefits of HDFS, such as replication and data location awareness, along with the benefits of the Cassandra peer-to-peer architecture.
DataStax Enterprise fully supports:
Assuming an analytics node is running, use the following command to start Hadoop:
dse hadoop fs <args>
where the available <args> are described in the HDFS File System Shell Guide on the Apache Hadoop web site.
For example:
dse hadoop fs -help
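As a rough sketch of how the file system shell might be used against CassandraFS, the commands below create a directory, copy a local file into it, list it, and print the file. The path /user/demo and the file notes.txt are hypothetical, and the small run helper with its DRY_RUN switch is an assumption added here so the commands can be previewed without a running analytics node; it is not part of DataStax Enterprise. The -mkdir, -put, -ls, and -cat options are standard Hadoop file system shell options.

```shell
# DRY_RUN=1 (the default in this sketch) prints each command instead of running it.
# Set DRY_RUN=0 on a machine where an analytics node is running.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# /user/demo and notes.txt are placeholder names.
run dse hadoop fs -mkdir /user/demo
run dse hadoop fs -put notes.txt /user/demo/
run dse hadoop fs -ls /user/demo
run dse hadoop fs -cat /user/demo/notes.txt
```

Without the helper, each line is simply the dse hadoop fs command followed by its arguments.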
For information on starting an analytics node, see Starting and Stopping DataStax Enterprise.
For information on starting Hive, Pig, or using Hadoop, see:
After starting Hadoop, run these demos for a good introduction to Hadoop solutions:
The default replication factor for system keyspaces is 1. This setting is suitable for development and testing on a single node, but not for a production environment. For production, increase the replication factor to at least 2 so that the cluster tolerates the failure of a single node. For example:
[default@unknown] UPDATE KEYSPACE cfs
WITH placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
AND strategy_options={Analytics:3};
For more information, see Changing Replication Settings.
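One possible way to apply the statement above without an interactive session is to save it to a file and pass the file to cassandra-cli. This is a minimal sketch: the temporary file and the localhost host name are assumptions, and the script only prints the command when cassandra-cli is not on the PATH.

```shell
# Write the cassandra-cli statement to a temporary file.
CLI_FILE="$(mktemp)"
cat > "$CLI_FILE" <<'EOF'
UPDATE KEYSPACE cfs
  WITH placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
  AND strategy_options = {Analytics:3};
EOF

# Run the file if cassandra-cli is available; otherwise show the command only.
# "localhost" is an assumed host; replace it with the address of a running node.
if command -v cassandra-cli >/dev/null 2>&1; then
  cassandra-cli -h localhost -f "$CLI_FILE"
else
  echo "cassandra-cli -h localhost -f $CLI_FILE"
fi
```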
This information is intended for advanced users.
To enable Hadoop to connect to external addresses:
In the core-site.xml file, change the property fs.default.name from file:/// to cfs:<listen_address>:<rpc_port>.
This eliminates the need to specify the IP address or hostname for MapReduce jobs and all other calls to Hadoop. The core-site.xml file is in one of the following locations, depending on the installation type:
Packaged installations: /etc/dse/hadoop
Binary installations: /<install_location>/resources/hadoop/conf
Alternatively, pass the property as a parameter on the command line:
dse hadoop fs -Dfs.default.name="cfs:<listen_address>:<rpc_port>" -ls /
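To make the two options concrete, the sketch below builds the CFS address from a listen address and RPC port, prints the property element that would go inside the configuration element of core-site.xml, and prints the equivalent one-off command. The values 10.0.0.5 and 9160 are example assumptions (9160 is the usual Cassandra rpc_port default); substitute the values from your own node. The address follows the form shown above.

```shell
# Example values only; use your node's listen_address and rpc_port.
LISTEN_ADDRESS="10.0.0.5"
RPC_PORT="9160"
CFS_URI="cfs:${LISTEN_ADDRESS}:${RPC_PORT}"

# Property element to place inside <configuration> in core-site.xml.
cat <<EOF
<property>
  <name>fs.default.name</name>
  <value>${CFS_URI}</value>
</property>
EOF

# Equivalent per-command override, printed here rather than executed.
echo "dse hadoop fs -Dfs.default.name=\"${CFS_URI}\" -ls /"
```

Editing core-site.xml applies the setting to every Hadoop call, while the -D form affects only the single command it is passed to.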