DataStax Developer Blog

Big Analytics with R, Cassandra, and Hive

By Jake Luciani - May 11, 2012

The R project is taking over the data world. With a plethora of algorithms at your fingertips, it’s not hard to see why R is such a powerful data analysis tool. Out of college, I was fortunate enough to work with some of the original developers of what was then the S engine at Bell Labs, and I even managed to write a few CRAN packages. In fact, the ROracle package now ships with Oracle’s Big Data Appliance (who would have ever imagined!).

There has been a lot of recent work to integrate Hadoop with R by writing map/reduce jobs in R. Most of the data scientists I’ve spoken to don’t really want this; they want ways to get data into R and to use data sampling and other estimation techniques (for example, Hive sampling). This post shows how you can interact with Cassandra from R, both directly and through the Cassandra Hive driver.

Reading data from Cassandra with JDBC

The prerequisites are the RJDBC package, Cassandra >= 1.0, and the Cassandra-JDBC driver. In the demo I’ve put the driver in the same directory as the Cassandra jars.

The example code assumes you have run through the Portfolio Manager Demo that comes with DSC/DSE.
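As a minimal sketch of the JDBC route (not necessarily the exact code from the demo), the helpers below build the connection URL, collect the jars for the classpath, and open a connection with RJDBC. The driver class name and URL format match what readers quote in the comments; the host, port 9160, the PortfolioDemo keyspace, the Stocks column family, and the jar directory are all assumptions you will need to adapt to your installation.

```r
# Sketch: connect to Cassandra from R through the Cassandra-JDBC driver.
# Assumed values (not confirmed by the post): host, port 9160, the
# PortfolioDemo keyspace, and the directory that holds the jars.

# Build the JDBC URL in the form jdbc:cassandra://host:port/keyspace.
cassandra_url <- function(host = "localhost", port = 9160,
                          keyspace = "PortfolioDemo") {
  sprintf("jdbc:cassandra://%s:%d/%s", host, as.integer(port), keyspace)
}

# Collect every jar in a directory so the driver and its dependencies
# all end up on the Java classpath.
jar_classpath <- function(lib_dir) {
  list.files(lib_dir, pattern = "\\.jar$", full.names = TRUE)
}

# Open a connection. RJDBC is only touched when this function is called,
# so the helpers above work without a Java setup.
cassandra_connect <- function(lib_dir, url = cassandra_url()) {
  jars <- jar_classpath(lib_dir)
  if (length(jars) == 0) stop("no jar files found in ", lib_dir)
  drv <- RJDBC::JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver", jars)
  DBI::dbConnect(drv, url)
}

# Example use (needs a running cluster; path and table are hypothetical):
#   con    <- cassandra_connect("/path/to/cassandra/lib")
#   stocks <- DBI::dbGetQuery(con, "select * from Stocks limit 10")
#   summary(stocks)
```

Passing the whole lib directory as the classpath, rather than the driver jar alone, is a deliberate choice: a "class not found" error from `JDBC()` usually means the driver or one of its dependencies is missing from the classpath.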

Alternatively, there is a new RCassandra package, which looks nice too.

R, Cassandra, and Hive

For accessing Hive and Cassandra from R, I will be using DataStax Enterprise.

First, start up the Hive server: dse hive --service hiveserver
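Once the Hive server is running, R can reach it through RJDBC in much the same way as Cassandra. The sketch below assumes the server speaks the original HiveServer protocol (driver class org.apache.hadoop.hive.jdbc.HiveDriver, jdbc:hive:// URLs) on the default port 10000, and that the Hive and Hadoop jars live in directories you supply. The paths and the table name in the usage comment are hypothetical.

```r
# Sketch: query Hive from R over JDBC.
# Assumed values (not confirmed by the post): localhost:10000, the
# "default" database, and the directories that hold the Hive/Hadoop jars.

# Build the JDBC URL in the form jdbc:hive://host:port/database.
hive_url <- function(host = "localhost", port = 10000, database = "default") {
  sprintf("jdbc:hive://%s:%d/%s", host, as.integer(port), database)
}

# Gather jars from several directories (for example the Hive lib
# directory and the Hadoop lib directory) into one classpath vector.
hive_classpath <- function(lib_dirs) {
  unlist(lapply(lib_dirs, list.files, pattern = "\\.jar$", full.names = TRUE))
}

# Open a connection; RJDBC is loaded only when this is called.
hive_connect <- function(lib_dirs, url = hive_url()) {
  jars <- hive_classpath(lib_dirs)
  if (length(jars) == 0) stop("no jar files found in the given directories")
  drv <- RJDBC::JDBC("org.apache.hadoop.hive.jdbc.HiveDriver", jars)
  DBI::dbConnect(drv, url)
}

# Example use (needs the Hive server started above; names are hypothetical):
#   con <- hive_connect(c("/path/to/hive/lib", "/path/to/hadoop/lib"))
#   res <- DBI::dbGetQuery(con, "select * from some_table limit 100")
#   summary(res)
```

The result comes back as an ordinary data frame, so anything in base R or CRAN can be applied to it directly.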

Conclusion

I hope this post has shown how simple it is to access your Cassandra data from R, and why pairing that data with the hundreds of statistical methods the community has added is so powerful.



Comments

  1. Mehmet says:

    Hi

    I have a problem connecting to Cassandra from R.

    After the command below, the terminal just waits and never finishes. Could you suggest something?

    thanks a lot

    > casscon <- dbConnect(cassdrv, "jdbc:cassandra://10.200.1.11:8888/ResultlyData")
    log4j:WARN No appenders could be found for logger (org.apache.cassandra.cql.jdbc.CassandraDriver).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

  2. Osman says:

    How can I test whether the Cassandra-JDBC driver in the first example is properly loaded? On my system the connect call gives an error, but I am not sure whether it is due to the driver or not.

  3. Philipp says:

    Here is my try:

    > library(RJDBC)
    Loading required package: DBI
    Loading required package: rJava
    > cassdrv <- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver",
    + "./cassandra-jdbc-1.2.5.jar")
    Error in .jfindClass(as.character(driverClass)[1]) : class not found

    Installed:
    JDK 1.8
    R 3.1.0

    Any help?

    Thanks !

    1. kaushal says:

      JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver",
      list.files("/home/xyz/cassandra/lib/", pattern="jar$", full.names=T))

      Also copy the cassandra-jdbc or cassandra-all jar file into the cassandra/lib directory.
