DataStax Developer Blog

What’s new in Cassandra 0.7: Hadoop output to Cassandra

By Brandon Williams - December 14, 2010

Overview

In Cassandra 0.6, we created a ColumnFamilyInputFormat, allowing you to read data stored in Cassandra from a Hadoop MapReduce job. However, you had to write the output from these jobs to a local file or HDFS, or manually connect to Cassandra in your reducer to store the results there. In 0.7, we’ve extended this with a ColumnFamilyOutputFormat, allowing you to write the results back into Cassandra without writing that code yourself. Our example in ‘contrib/word_count’ has also been extended to demonstrate this functionality. Let’s take a look at it in action.
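Conceptually, the word count job’s map phase emits a (word, 1) pair for each token in a column value, and the reduce phase sums those pairs per word. Here is a minimal plain-Java sketch of that logic (illustrative only; the real demo implements it as Hadoop Mapper and Reducer classes in WordCount.java):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // "map" phase: emit one (word, 1) pair per whitespace-separated token
    static List<Map.Entry<String, Integer>> map(String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : text.trim().split("\\s+"))
            pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        return pairs;
    }

    // "reduce" phase: sum the 1s for each distinct word
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs)
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        // prints {word1=2, word2=1}
        System.out.println(reduce(map("word1 word2 word1")));
    }
}
```

With the input and output formats in place, Hadoop handles reading the column values in and writing the summed counts back out; the job itself only contains this counting logic.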

To follow this, you’ll first need to start a Cassandra 0.7 instance with the default configuration.

Build the demo


$ cd contrib/word_count
contrib/word_count$ ant

When ant finishes, you should see output like this:


jar:
    [mkdir] Created dir: /srv/cassandra/contrib/word_count/build/classes/META-INF
    [jar] Building jar: /srv/cassandra/contrib/word_count/build/word_count.jar

BUILD SUCCESSFUL

Insert the test data


contrib/word_count$ bin/word_count_setup
10/12/06 19:42:48 INFO WordCountSetup: added text1
10/12/06 19:42:48 INFO WordCountSetup: added text2
10/12/06 19:42:48 INFO WordCountSetup: added text3

This will insert a column under ‘key0’ named ‘text1’ whose value is ‘word1’, a column under ‘key0’ named ‘text2’ whose value is ‘word1 word2’, and then 1000 rows named key1..key1000, each with a column named ‘text3’ whose value is ‘word1’.
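To make the expected output concrete, this small plain-Java sketch (not part of the demo) tallies the words stored under each column name, which is exactly the grouping the job will produce, one output row per source column name:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ExpectedCounts {
    // Tally words across all values stored under one column name
    static Map<String, Integer> count(List<String> values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String value : values)
            for (String word : value.split("\\s+"))
                counts.merge(word, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        // The inserted test data, keyed by column name
        Map<String, List<String>> valuesByColumn = new LinkedHashMap<>();
        valuesByColumn.put("text1", List.of("word1"));                   // one column under key0
        valuesByColumn.put("text2", List.of("word1 word2"));             // one column under key0
        valuesByColumn.put("text3", Collections.nCopies(1000, "word1")); // rows key1..key1000

        for (Map.Entry<String, List<String>> entry : valuesByColumn.entrySet())
            System.out.println(entry.getKey() + " -> " + count(entry.getValue()));
        // prints:
        // text1 -> {word1=1}
        // text2 -> {word1=1, word2=1}
        // text3 -> {word1=1000}
    }
}
```

Keep these numbers in mind; we will see them again when we examine the results in cassandra-cli.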

Run the job


contrib/word_count$ bin/word_count

There will be lots of output while the job runs, but at the end you should see something similar to:


10/12/06 20:14:24 INFO mapred.JobClient: map 100% reduce 100%
10/12/06 20:14:24 INFO mapred.JobClient: Job complete: job_local_0004
10/12/06 20:14:24 INFO mapred.JobClient: Counters: 12
10/12/06 20:14:24 INFO mapred.JobClient: FileSystemCounters
10/12/06 20:14:24 INFO mapred.JobClient: FILE_BYTES_READ=186890530
10/12/06 20:14:24 INFO mapred.JobClient: FILE_BYTES_WRITTEN=188478656
10/12/06 20:14:24 INFO mapred.JobClient: Map-Reduce Framework
10/12/06 20:14:24 INFO mapred.JobClient: Reduce input groups=1
10/12/06 20:14:24 INFO mapred.JobClient: Combine output records=0
10/12/06 20:14:24 INFO mapred.JobClient: Map input records=1000
10/12/06 20:14:24 INFO mapred.JobClient: Reduce shuffle bytes=0
10/12/06 20:14:24 INFO mapred.JobClient: Reduce output records=1
10/12/06 20:14:24 INFO mapred.JobClient: Spilled Records=2000
10/12/06 20:14:24 INFO mapred.JobClient: Map output bytes=10000
10/12/06 20:14:24 INFO mapred.JobClient: Combine input records=0
10/12/06 20:14:24 INFO mapred.JobClient: Map output records=1000
10/12/06 20:14:24 INFO mapred.JobClient: Reduce input records=1000

Examining WordCount.java, we can see just how easy it is to use these formats together:


Job job = new Job(getConf(), "wordcount");
...
job.setInputFormatClass(ColumnFamilyInputFormat.class);
job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
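Beyond the two setter calls above, the formats need to know which keyspace and column families to talk to. A sketch of that wiring, adapted from the 0.7 contrib demo’s use of ConfigHelper (treat the exact method names and signatures as approximate and check WordCount.java for the authoritative version):

```java
// Fragment of job setup, not compilable standalone: assumes the Hadoop
// and Cassandra 0.7 jars on the classpath, and that KEYSPACE,
// INPUT_COLUMN_FAMILY, OUTPUT_COLUMN_FAMILY, and columnName are defined.
Configuration conf = job.getConfiguration();

// Where to read from and where to write to
ConfigHelper.setColumnFamily(conf, KEYSPACE, INPUT_COLUMN_FAMILY);
ConfigHelper.setOutputColumnFamily(conf, KEYSPACE, OUTPUT_COLUMN_FAMILY);

// Tell the input format which column(s) of each row to read
SlicePredicate predicate = new SlicePredicate()
        .setColumn_names(Arrays.asList(ByteBuffer.wrap(columnName.getBytes())));
ConfigHelper.setSlicePredicate(conf, predicate);
```

The point is that all of the Cassandra-specific plumbing lives in configuration; the Mapper and Reducer themselves stay ordinary Hadoop code.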

Examining results

Now that our job has completed, let’s use cassandra-cli to view the results. The word count demo will write to a keyspace named ‘wordcount’ and the output column family is ‘output_words’.


contrib/word_count$ cd ../..
$ bin/cassandra-cli
Welcome to cassandra CLI.

Type 'help;' or '?' for help. Type 'quit;' or 'exit;' to quit.
[default@unknown] connect localhost/9160;
Connected to: "Test Cluster" on localhost/9160
[default@unknown] use wordcount;
Authenticated to keyspace: wordcount

Now that we’re connected to the keyspace, let’s take a look at our output:


[default@wordcount] list output_words;
Using default limit of 100
-------------------
RowKey: text2
=> (column=word1, value=1, timestamp=1291666461685000)
=> (column=word2, value=1, timestamp=1291666461685000)
-------------------
RowKey: text3
=> (column=word1, value=1000, timestamp=1291666464152000)
-------------------
RowKey: text1
=> (column=word1, value=1, timestamp=1291666459823000)

3 Rows Returned.
[default@wordcount]

As you can see, these counts match the test data inserted during the word count setup. If you want to write your own jobs, WordCount.java in the contrib/word_count/src directory is a good guide to follow. If you prefer writing jobs using Pig, you can already read data from Cassandra; support for writing output back to Cassandra from Pig is planned as well.
