In the Cassandra source distribution, the contrib/word_count example shows how to configure a Hadoop Job with Cassandra’s ConfigHelper so that MapReduce runs over the data in a particular column family on a Cassandra cluster. The WordCount example is basic, but it has enough moving parts to show the key aspects of integrating Cassandra and Hadoop.
The most important of these integration points is that the Cassandra Hadoop implementations provide the Hadoop JobTracker with splits in a way that preserves data locality within the cluster. In particular, when a Hadoop TaskTracker runs on the same servers as the Cassandra nodes, each task can be scheduled on a node that holds a replica of the data it processes. This can substantially reduce processing time, because each Cassandra node mostly receives queries for data it already stores, avoiding the overhead of moving rows between nodes over the network.
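The locality idea can be illustrated with a small, self-contained sketch. It is a toy model rather than the scheduling logic of Hadoop or Cassandra: the Split class, the assign method, and the host addresses are all hypothetical names introduced for illustration. Each split carries the hosts that hold replicas of its token range, and a task is placed on a tracker that is also a replica whenever one exists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of locality-aware task placement; not Hadoop's or Cassandra's actual code. */
class LocalitySketch {

    /** Hypothetical stand-in for an input split: a token range plus the hosts holding its replicas. */
    static final class Split {
        final String startToken;
        final String endToken;
        final List<String> replicaHosts;

        Split(String startToken, String endToken, List<String> replicaHosts) {
            this.startToken = startToken;
            this.endToken = endToken;
            this.replicaHosts = replicaHosts;
        }
    }

    /**
     * Returns, for each split in order, the tracker host chosen to process it.
     * A tracker that is also a replica of the split is preferred (a local read);
     * otherwise trackers are used round robin (a remote read).
     * Assumes trackerHosts is not empty.
     */
    static List<String> assign(List<Split> splits, List<String> trackerHosts) {
        List<String> chosenHosts = new ArrayList<>();
        int fallback = 0;
        for (Split split : splits) {
            String chosen = null;
            for (String host : split.replicaHosts) {
                if (trackerHosts.contains(host)) {
                    chosen = host;
                    break;
                }
            }
            if (chosen == null) {
                // No tracker is co-located with a replica, so the read will be remote.
                chosen = trackerHosts.get(fallback % trackerHosts.size());
                fallback++;
            }
            chosenHosts.add(chosen);
        }
        return chosenHosts;
    }

    public static void main(String[] args) {
        List<Split> splits = Arrays.asList(
            new Split("0", "100", Arrays.asList("10.0.0.1", "10.0.0.2")),
            new Split("100", "200", Arrays.asList("10.0.0.2", "10.0.0.3")));
        List<String> trackers = Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.3");
        List<String> chosen = assign(splits, trackers);
        for (int i = 0; i < splits.size(); i++) {
            Split s = splits.get(i);
            System.out.println(s.startToken + ".." + s.endToken + " -> " + chosen.get(i));
        }
    }
}
```

When every Cassandra node also runs a TaskTracker, the fallback branch in this model is never taken, which is the situation the paragraph above describes.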
For some additional details behind the integration points, the implementations of the Hadoop classes in Cassandra are listed below with a brief explanation of each. These classes can all be found in the org.apache.cassandra.hadoop package in the Cassandra source distribution.
ColumnFamilyInputFormat: Implements Hadoop’s InputFormat, defining how the rows retrieved from Cassandra will be split for the JobTracker.
ColumnFamilyRecordReader: Implements Hadoop’s RecordReader to break the column family into key/value pairs for input to the Mapper.
ColumnFamilySplit: Implements Hadoop’s InputSplit to hold the locations (i.e. replicas) of a token range. ColumnFamilyInputFormat assembles these splits as input for ColumnFamilyRecordReader.
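To show how the classes described above fit together, here is a minimal sketch of the data flow in a word count: splits are read row by row, each row becomes a key/value pair, a map step emits words, and a reduce step sums them. It uses only the Java standard library and does not call the Hadoop or Cassandra APIs. The names readSplit, mapRow, and countWords, the column name "text", and the sample rows are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of the split -> record reader -> mapper -> reducer flow; not the real Hadoop interfaces. */
class WordCountFlowSketch {

    /** Stand-in for a record reader: yields (row key, columns) pairs for the rows of one split. */
    static Iterator<Map.Entry<String, Map<String, String>>> readSplit(
            Map<String, Map<String, String>> rowsInSplit) {
        return rowsInSplit.entrySet().iterator();
    }

    /** Map step: emits each whitespace-separated word found in the named column of one row. */
    static List<String> mapRow(Map<String, String> columns, String columnName) {
        List<String> words = new ArrayList<>();
        String text = columns.get(columnName);
        if (text == null) {
            return words; // the row has no such column
        }
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    /** Drives the flow over all splits and performs the reduce step by summing per-word counts. */
    static Map<String, Integer> countWords(
            List<Map<String, Map<String, String>>> splits, String columnName) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, Map<String, String>> split : splits) {
            Iterator<Map.Entry<String, Map<String, String>>> reader = readSplit(split);
            while (reader.hasNext()) {
                Map.Entry<String, Map<String, String>> row = reader.next();
                for (String word : mapRow(row.getValue(), columnName)) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Builds a one-row split holding a single "text" column (hypothetical sample data). */
    static Map<String, Map<String, String>> singleRowSplit(String rowKey, String text) {
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("text", text);
        Map<String, Map<String, String>> split = new LinkedHashMap<>();
        split.put(rowKey, columns);
        return split;
    }

    public static void main(String[] args) {
        List<Map<String, Map<String, String>>> splits = Arrays.asList(
            singleRowSplit("row1", "cassandra and hadoop"),
            singleRowSplit("row2", "hadoop splits"));
        System.out.println(countWords(splits, "text"));
    }
}
```

In the real integration, the iteration inside countWords is carried out by Hadoop itself, with one task per split, while Cassandra's classes supply the splits and the row-level key/value pairs.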