Has anyone had any luck using Cassandra as an input source for map/reduce (AWS EMR) jobs using the Python mrjob toolkit?
I am using the distribution from apache-cassandra-1.1.9-bin.tar.gz . The input format is the ColumnFamilyInputFormat from apache-cassandra-1.1.9/lib/apache-cassandra-1.1.9.jar .
I have configured the job to use the input format by
* Setting self.HADOOP_INPUT_FORMAT = 'org.apache.cassandra.hadoop.ColumnFamilyInputFormat'
* Defining configuration via jobconf and variables 'cassandra.(input.keyspace, input.columnfamily, input.predicate, ...)'
The job seems to find the input format class successfully.
The problem is that mrjob seems to require that some input path be passed to the job. For the standard input formats, the input path points to a local file (that is uploaded to S3), or an S3 object, etc. What should the input path be to indicate that it should pull data from the Cassandra cluster using configuration defined in the jobconf? Is there some notation like cassandra://... that should be used here? Between code in runner.py and emr.py, mrjob enforces that some input path be defined, or it defaults to stdin. Even if I make changes to the code to allow an empty list of inputs, the framework does not seem to delegate to the Cassandra input format.
Has anyone been able to get this to work? What steps am I missing?
thanks-
patrick
