Hadoop is a software framework for distributed processing of large data sets using MapReduce programs. DataStax Enterprise (DSE) works with external Hadoop systems in a bring your own Hadoop (BYOH) model. Use BYOH when you want to run DSE with a separate Hadoop cluster from a different vendor. Supported vendors are:
- Hadoop 2.x data warehouse implementations: Cloudera 4.5, 4.6, and 5.0.x
- Hortonworks 1.3.3 and 2.0.x
You can use Hadoop in one of the following modes:
- External Hadoop
Uses the Hadoop distribution provided by Cloudera (CDH) or Hortonworks (HDP).
- Internal Hadoop
Uses the DSE Hadoop integrated with DataStax Enterprise.
For legacy purposes, DataStax Enterprise 4.5 includes DSE Hadoop 1.0.4 with built-in Hadoop trackers.
BYOH provides the following features:
- Bi-directional data movement between Cassandra in DataStax Enterprise and the Hadoop Distributed File System (HDFS)
- Hive queries against Cassandra data in DataStax Enterprise
- Data combination (joins) between Cassandra and HDFS data
- ODBC access to Cassandra data through Hive
| Component | DSE Integrated Hadoop Owner | BYOH Owner | DSE Interaction |
|---|---|---|---|
| Job tracker | DSE Cluster | Hadoop Cluster | Optional |
| Task tracker | DSE Cluster | Hadoop Cluster | Co-located with BYOH nodes |
| Pig | Distributed with DSE | Distribution chosen by operator | Can launch from task trackers |
| Hive | Distributed with DSE | Distribution chosen by operator | Can launch from task trackers |
BYOH installation and configuration overview
The procedure for installing and configuring DataStax Enterprise for BYOH is straightforward. First, ensure that you meet the prerequisites. Next, install DataStax Enterprise on all nodes in the Cloudera or Hortonworks cluster and on additional nodes outside the Hadoop cluster. Install several Cloudera or Hortonworks components on the additional nodes and deploy those nodes in a virtual BYOH data center. Finally, configure the DataStax Enterprise BYOH environment variables on each node in the BYOH data center to point to the Hadoop cluster, as shown in the following diagram:
DataStax Enterprise runs only on BYOH nodes, and uses Hadoop components to integrate BYOH and Hadoop. You never start up the DataStax Enterprise installations on the Hadoop cluster.
In a typical Hadoop cluster, Task Tracker and Data Node services run on each node. A Job Tracker service running on one of the master nodes coordinates MapReduce jobs between the Task Trackers, which pull data locally from the Data Node on the same machine. In newer versions of Hadoop that use YARN, Node Manager services replace Task Trackers and the Resource Manager service replaces the Job Tracker.
Because both generations of Hadoop are supported, this document uses the following terms:
- Task Tracker: means Task Tracker or Node Manager.
- Job Tracker: means Job Tracker or Resource Manager.
A MapReduce service runs on each BYOH node along with optional MapReduce, Hive, and Pig clients. To take advantage of the performance benefits offered by Cassandra, BYOH handles frequently accessed hot data. The Hadoop cluster handles cold data that is accessed less frequently or rarely. You design the MapReduce application to store output in Cassandra or Hadoop.
The following diagram shows the data flow of a job in a BYOH data center. The Job Tracker/Resource Manager (JT/RM) receives MapReduce input from the client application. The JT/RM sends a MapReduce job request to the Task Trackers/Node Managers (TT/NM) and to the optional MapReduce, Hive, and Pig clients. The data is written to Cassandra and the results are sent back to the client.
BYOH clients submit Hive jobs to the Hadoop Job Tracker, or to the Resource Manager in the case of YARN. If Cassandra is the source of the data, the Job Tracker evaluates the job, and the ColumnFamilyInputFormat creates input splits and assigns tasks to the Task Trackers running on the Cassandra nodes, which gives the tasks local data access. The Hadoop job runs until the output phase.
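The locality idea in the preceding paragraph can be illustrated with a small, self-contained Python sketch. This is a conceptual model only, not the ColumnFamilyInputFormat implementation: the function `assign_splits`, the split identifiers, and the host names are all hypothetical. It assumes each input split knows which hosts hold a replica of its data, and it prefers a Task Tracker on one of those hosts.

```python
from collections import defaultdict


def assign_splits(splits, task_trackers):
    """Assign each input split to a task tracker, preferring a local replica.

    splits: list of (split_id, replica_hosts) tuples (hypothetical shape).
    task_trackers: list of host names that run a task tracker.
    Returns a dict mapping host -> list of split ids.
    """
    trackers = set(task_trackers)
    load = {host: 0 for host in task_trackers}
    assignment = defaultdict(list)
    for split_id, replica_hosts in splits:
        # Hosts that both hold the data and run a task tracker.
        local = [h for h in replica_hosts if h in trackers]
        # Fall back to any tracker when no replica host runs one.
        candidates = local if local else list(task_trackers)
        # Pick the least-loaded candidate; break ties by host name.
        host = min(candidates, key=lambda h: (load[h], h))
        assignment[host].append(split_id)
        load[host] += 1
    return dict(assignment)


splits = [
    ("range-0", ["node1", "node2"]),
    ("range-1", ["node2", "node3"]),
    ("range-2", ["node3", "node1"]),
    ("range-3", ["node9"]),  # no tracker on the replica host: remote read
]
result = assign_splits(splits, ["node1", "node2", "node3"])
print(result)
```

In this toy run, the first three splits land on a host that stores their data, and only the last one is read remotely. A real scheduler weighs more factors, such as free task slots and rack topology.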
During the output phase, if Cassandra is the target of the output, the HiveCqlOutputFormat writes the data back into Cassandra from the various reducers. During the reduce step, if data is written back to Cassandra, locality is not a concern and data is written normally into the cluster. This pattern is the same for Hadoop in general. When results are spilled to disk, each reducer writes its partial results to a separate file. When results are written to HDFS, the data is written back from each of the reducers.
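The per-reducer output pattern described above can be sketched locally in Python. This is a toy model that writes to a temporary directory rather than to HDFS or Cassandra, and it is not how HiveCqlOutputFormat works internally. The helper names `partition` and `run_reduce_phase` are hypothetical; the `part-r-NNNNN` file names follow the convention Hadoop uses for reducer output files.

```python
import os
import tempfile
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    # crc32 is stable across runs, unlike the built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_reduce_phase(mapped_pairs, num_reducers, out_dir):
    """Sum values per key and write one partial-result file per reducer."""
    buckets = defaultdict(lambda: defaultdict(int))
    for key, value in mapped_pairs:
        buckets[partition(key, num_reducers)][key] += value
    paths = []
    for r in range(num_reducers):
        path = os.path.join(out_dir, f"part-r-{r:05d}")
        with open(path, "w", encoding="utf-8") as f:
            for key in sorted(buckets[r]):
                f.write(f"{key}\t{buckets[r][key]}\n")
        paths.append(path)
    return paths


pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
with tempfile.TemporaryDirectory() as out_dir:
    paths = run_reduce_phase(pairs, 2, out_dir)
    file_names = [os.path.basename(p) for p in paths]
    merged = {}
    for p in paths:
        with open(p, encoding="utf-8") as f:
            for line in f:
                k, v = line.rstrip("\n").split("\t")
                merged[k] = int(v)
print(file_names, merged)
```

Each key is routed to exactly one reducer, so reading all part files together recovers the complete result, while any single file holds only that reducer's share.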
Intermediate MapReduce files are stored on the local disk or in temporary HDFS tables, depending on configuration, but never in the Cassandra File System (CFS). Using the BYOH model, Hadoop MapReduce jobs can access Cassandra as a data source and write results back to Cassandra or Hadoop.