Big Analytics with R, Cassandra, and Hive
The R project is taking over the data world. With a plethora of algorithms at your fingertips, it’s not hard to see why R is such a powerful data analysis tool. Out of college, I was fortunate enough to work at Bell Labs with some of the original developers of what was then the S engine, and I even managed to write a few CRAN packages. In fact, the ROracle package now ships with Oracle’s Big Data Appliance (who would have ever imagined!).
There has been a lot of work recently to integrate Hadoop with R by writing map/reduce jobs in R. Most of the data scientists I’ve spoken to don’t really want this; they want ways to get data into R and to use data sampling and other estimation techniques (for example, Hive sampling). This post shows how you can interact with Cassandra from R, both directly over JDBC and through the Cassandra Hive driver.
Reading data from Cassandra with JDBC
The prerequisites are the RJDBC package, Cassandra >= 1.0, and the Cassandra-JDBC driver. In the demo I’ve put the driver in the same directory as the Cassandra jars.
The example code assumes you have run through the Portfolio Manager Demo that comes with DSC/DSE.
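As a rough sketch of the JDBC route: the driver class name (`org.apache.cassandra.cql.jdbc.CassandraDriver`), the `localhost:9160` endpoint, the `PortfolioDemo` keyspace, and the `StockHist` column family are assumptions based on a default demo install, and the helper names and jar path are placeholders made up for illustration. Only `JDBC`, `dbConnect`, `dbGetQuery`, and `dbDisconnect` are real RJDBC/DBI calls.

```r
# Sketch only: helper names, host, port, keyspace and paths are illustrative assumptions.

# Build a Cassandra JDBC URL such as "jdbc:cassandra://localhost:9160/PortfolioDemo".
cassandra_jdbc_url <- function(host = "localhost", port = 9160, keyspace = "PortfolioDemo") {
  sprintf("jdbc:cassandra://%s:%d/%s", host, as.integer(port), keyspace)
}

# Open a connection through RJDBC. jar_dir must contain the Cassandra jars and
# the Cassandra-JDBC driver jar (requires the RJDBC package and a running node).
cassandra_connect <- function(jar_dir, url = cassandra_jdbc_url()) {
  jars <- list.files(jar_dir, pattern = "jar$", full.names = TRUE)
  drv <- RJDBC::JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver", jars)
  DBI::dbConnect(drv, url)
}

# Example usage (not run here; needs a live Cassandra node):
# conn <- cassandra_connect("/path/to/cassandra/lib")
# res  <- DBI::dbGetQuery(conn, "select * from StockHist limit 10")
# str(res)
# DBI::dbDisconnect(conn)
```

The result of `dbGetQuery` is an ordinary data frame, so from that point on everything in R works on it as usual.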
Alternatively, there is a new RCassandra package, which looks nice too.
R, Cassandra, and Hive
For accessing Hive and Cassandra from R, I will be using DataStax Enterprise.
First, start up the Hive server: dse hive --service hiveserver
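With the Hive server running, the same RJDBC pattern should apply. This is a hedged sketch, not a tested recipe: it assumes the HiveServer JDBC driver class `org.apache.hadoop.hive.jdbc.HiveDriver`, the usual default endpoint `localhost:10000`, and a jar directory of your own. The helper names and the table name `my_keyspace.my_table` are hypothetical. The sampling helper uses Hive's `TABLESAMPLE(BUCKET x OUT OF y ON rand())` clause so that only a fraction of the rows is pulled into R.

```r
# Sketch only: driver class, URL, jar directory, and table name are assumptions.

# Build a HiveServer JDBC URL such as "jdbc:hive://localhost:10000/default".
hive_jdbc_url <- function(host = "localhost", port = 10000, database = "default") {
  sprintf("jdbc:hive://%s:%d/%s", host, as.integer(port), database)
}

# Build a HiveQL query that reads roughly 1/out_of of a table by bucketing
# rows on rand(), so only a sample is returned.
hive_sample_query <- function(table, bucket = 1, out_of = 10) {
  sprintf("SELECT * FROM %s TABLESAMPLE(BUCKET %d OUT OF %d ON rand()) s",
          table, as.integer(bucket), as.integer(out_of))
}

# Connect through RJDBC. jar_dir must contain the Hive JDBC jars and their dependencies.
hive_connect <- function(jar_dir, url = hive_jdbc_url()) {
  jars <- list.files(jar_dir, pattern = "jar$", full.names = TRUE)
  drv <- RJDBC::JDBC("org.apache.hadoop.hive.jdbc.HiveDriver", jars)
  DBI::dbConnect(drv, url)
}

# Example usage (not run here; needs the Hive server started as described above):
# conn   <- hive_connect("/path/to/hive/lib")
# sample <- DBI::dbGetQuery(conn, hive_sample_query("my_keyspace.my_table"))
# summary(sample)
```

Sampling on the Hive side keeps the data frame small enough to fit in memory, which is the point of going through Hive rather than reading everything over JDBC.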
Conclusion
I hope this post has shown how simple it is to access your Cassandra data from R, and how powerful that becomes once you combine it with the hundreds of statistical methods the community has contributed.



great! that’s really handy! thanks!
Hi Jake,
Your 2nd example works for me – thanks a ton – using R with cassandra helps quite a bit.
Your 1st example (using DSE 2.0 with Cassandra 1.0.9, 1 Hadoop node, 1 Cassandra node; I tested the JDBC driver and it works fine): I run into issues. After the dbGetQuery call I get an R error message like:
Error in .valueClassTest(standardGeneric("fetch"), "data.frame", "fetch") :
invalid value from generic function ‘fetch’, class “NULL”, expected “data.frame”
Any hints?
oh – and RCassandra works as well, but does not allow for CQL queries – which is (or would be) great about RJDBC + Cassandra JDBC.
oh wonderful. couldn’t get RJDBC to load the cassandra driver, no matter what classpaths/jre/etc I messed with. back to smacking RCassandra around
@OJ The correct drivers are included with DSE2.1.
@Aleks what did you try? What error do you get?
1. Revo-R and Cassandra are installed and running.
2. Able to execute CQL query in CQL-shell successfully.
3. Able to load RJDBC driver successfully in R-shell.
4. Able to connect to Cassandra successfully by firing the command dbConnect() in R-shell, i.e. connection creation & integration has been successful.
5. The only remaining problem is that the CQL query does not execute in the R shell through the DBI command dbGetQuery(). Running the command throws the following error:
Error in .verify.JDBC.result(s, "Unable to execute JDBC statement ", statement) :
Unable to execute JDBC statement SELECT * FROM StockHist Limit 10; (org.apache.cassandra.thrift.Cassandra$Client.prepare_cql_query(Ljava/nio/ByteBuffer;Lorg/apache/cassandra/thrift/Compression;)Lorg/apache/cassandra/thrift/CqlPreparedResult;)
Checked the Thrift process/daemon; it’s up and running.
Tried using Datastax-CassandraV-2. instead of V1.0.9, but the problem still persists. No clue why this is happening.