Jake Luciani

<p>The&nbsp;<a href="http://r-project.org/">R project</a>&nbsp;is taking over the data world. With a plethora of&nbsp;<a href="http://cran.r-project.org/web/packages/available_packages_by_name.html">algorithms at your fingertips</a>,&nbsp;it's not hard to see why R is such a powerful data analysis tool. Out of college, I was fortunate enough to work at Bell Labs with some of the original developers of what was then the&nbsp;<a href="http://en.wikipedia.org/wiki/S_(programming_language)">S-Engine</a>,&nbsp;and even managed to&nbsp;<a href="http://cran.r-project.org/web/packages/RSvgDevice/index.html">write</a>&nbsp;a&nbsp;<a href="http://cran.r-project.org/web/packages/ROracle/index.html">few</a>&nbsp;CRAN packages. In fact, the&nbsp;<a href="http://cran.r-project.org/web/packages/ROracle/index.html">ROracle</a>&nbsp;package is now shipped with Oracle's big data appliance (who would have ever imagined!).</p>

<p>There has been a lot of work recently to integrate&nbsp;<a href="http://www.rhipe.org/">Hadoop with R</a>&nbsp;by means of writing map/reduce jobs in R. Most of the data scientists I've spoken to don't really want this; they want ways to get data into R and then use data sampling and other estimation techniques (for example,&nbsp;<a href="https://cwiki.apache.org/Hive/languagemanual-sampling.html">Hive sampling</a>). This post will show how you can interact with Cassandra from R, both directly over JDBC and through the Cassandra Hive driver.</p>

<h4>Reading data from Cassandra with JDBC</h4>

<p>The prerequisites are the&nbsp;<a href="http://cran.r-project.org/web/packages/RJDBC/index.html">RJDBC package</a>, Cassandra &gt;= 1.0, and the&nbsp;<a href="http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/downloads/list">Cassandra-JDBC driver</a>. In the demo I've put the driver in the same directory as the Cassandra jars.</p>
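
<p>Because the driver is loaded from a directory of jars, a wrong or empty directory tends to surface later as a confusing class-loading error. As a minimal sketch, a small helper can collect the jars and fail early. The name <code>find_jars</code> is hypothetical and not part of RJDBC, and the path in the usage comment is a placeholder.</p>

<pre><code class="language-r"># Hypothetical helper (not part of RJDBC): collect every jar in a directory
# so the result can be passed as the classPath argument of JDBC().
find_jars &lt;- function(dir) {
  jars &lt;- list.files(dir, pattern = "\\.jar$", full.names = TRUE)
  # Stop early instead of letting the driver lookup fail later.
  if (length(jars) == 0) {
    stop("No jar files found in: ", dir)
  }
  jars
}

# Usage, assuming the placeholder path is replaced with your Cassandra lib directory:
# cassdrv &lt;- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver",
#                 find_jars("/path/to/cassandra/lib"))
</code></pre>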

<p>The example code assumes you have run through the&nbsp;<a href="https://docs.datastax.com/en/index.html#running-the-portfolio-manager-demo-application">Portfolio Manager Demo</a>&nbsp;that comes with DSC/DSE.</p>

<p>&nbsp;</p>

<table data-tab-size="8">
	<tbody>
		<tr>
			<td data-line-number="1" id="file-gistfile1-r-L1">&nbsp;</td>
			<td id="file-gistfile1-r-LC1">#Load RJDBC</td>
		</tr>
		<tr>
			<td data-line-number="2" id="file-gistfile1-r-L2">&nbsp;</td>
			<td id="file-gistfile1-r-LC2">library(RJDBC)</td>
		</tr>
		<tr>
			<td data-line-number="3" id="file-gistfile1-r-L3">&nbsp;</td>
			<td id="file-gistfile1-r-LC3">&nbsp;</td>
		</tr>
		<tr>
			<td data-line-number="4" id="file-gistfile1-r-L4">&nbsp;</td>
			<td id="file-gistfile1-r-LC4">#Load in the Cassandra-JDBC driver</td>
		</tr>
		<tr>
			<td data-line-number="5" id="file-gistfile1-r-L5">&nbsp;</td>
			<td id="file-gistfile1-r-LC5">cassdrv &lt;- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver",</td>
		</tr>
		<tr>
			<td data-line-number="6" id="file-gistfile1-r-L6">&nbsp;</td>
			<td id="file-gistfile1-r-LC6">list.files("/Users/jake/workspace/bdp/resources/cassandra/lib/",pattern="jar$",full.names=T))</td>
		</tr>
		<tr>
			<td data-line-number="7" id="file-gistfile1-r-L7">&nbsp;</td>
			<td id="file-gistfile1-r-LC7">&nbsp;</td>
		</tr>
		<tr>
			<td data-line-number="8" id="file-gistfile1-r-L8">&nbsp;</td>
			<td id="file-gistfile1-r-LC8">#Connect to Cassandra node and Keyspace</td>
		</tr>
		<tr>
			<td data-line-number="9" id="file-gistfile1-r-L9">&nbsp;</td>
			<td id="file-gistfile1-r-LC9">casscon &lt;- dbConnect(cassdrv, "jdbc:cassandra://localhost:9160/PortfolioDemo")</td>
		</tr>
		<tr>
			<td data-line-number="10" id="file-gistfile1-r-L10">&nbsp;</td>
			<td id="file-gistfile1-r-LC10">&nbsp;</td>
		</tr>
		<tr>
			<td data-line-number="11" id="file-gistfile1-r-L11">&nbsp;</td>
			<td id="file-gistfile1-r-LC11">#Query timeseries data</td>
		</tr>
		<tr>
			<td data-line-number="12" id="file-gistfile1-r-L12">&nbsp;</td>
			<td id="file-gistfile1-r-LC12">res &lt;- dbGetQuery(casscon, "select * from StockHist limit 10")</td>
		</tr>
		<tr>
			<td data-line-number="13" id="file-gistfile1-r-L13">&nbsp;</td>
			<td id="file-gistfile1-r-LC13">&nbsp;</td>
		</tr>
		<tr>
			<td data-line-number="14" id="file-gistfile1-r-L14">&nbsp;</td>
			<td id="file-gistfile1-r-LC14">#Transpose</td>
		</tr>
		<tr>
			<td data-line-number="15" id="file-gistfile1-r-L15">&nbsp;</td>
			<td id="file-gistfile1-r-LC15">tres &lt;- t(res[2:10])</td>
		</tr>
		<tr>
			<td data-line-number="16" id="file-gistfile1-r-L16">&nbsp;</td>
			<td id="file-gistfile1-r-LC16">&nbsp;</td>
		</tr>
		<tr>
			<td data-line-number="17" id="file-gistfile1-r-L17">&nbsp;</td>
			<td id="file-gistfile1-r-LC17">#Plot</td>
		</tr>
		<tr>
			<td data-line-number="18" id="file-gistfile1-r-L18">&nbsp;</td>
			<td id="file-gistfile1-r-LC18">boxplot(tres,names=res$KEY,col=topo.colors(length(res$KEY)))</td>
		</tr>
		<tr>
			<td data-line-number="19" id="file-gistfile1-r-L19">&nbsp;</td>
			<td id="file-gistfile1-r-LC19">title("BoxPlot of 10 Stock Price Histories")</td>
		</tr>
	</tbody>
</table>


<p>&nbsp;</p>

<p><img alt="BoxPlot of 10 Stock Price Histories" data-entity-type="file" data-entity-uuid="a45d6438-58f6-4960-8333-5998cbe21a99" src="https://www.datastax.com/sites/default/files/inline-images/Screen-Shot-2012-05-11-at-11.38.33-AM-300x284.png" /></p>
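
<p>To see what the transpose step does without a running cluster, here is a minimal sketch that uses a mock data frame in place of the query result. The tickers and prices are invented, and it assumes the real result has the row key in the first column (<code>KEY</code>) followed by one numeric column per date.</p>

<pre><code class="language-r"># Mock stand-in for the result of dbGetQuery(); tickers and prices are made up.
set.seed(42)
res &lt;- data.frame(KEY = c("AAA", "BBB", "CCC"),
                  matrix(runif(3 * 9, min = 10, max = 100), nrow = 3),
                  stringsAsFactors = FALSE)

# Columns 2..10 hold the prices. After t(), each column is one ticker
# and each row is one date, which is the layout boxplot() expects.
tres &lt;- t(res[2:10])
dim(tres)

# One box per ticker, labelled with the row key.
boxplot(tres, names = res$KEY, col = topo.colors(length(res$KEY)))
title("BoxPlot of 3 Mock Price Histories")
</code></pre>

<p>Given a matrix, <code>boxplot</code> draws one box per column, which is why the rows (one per stock) have to become columns first.</p>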

<p>Alternatively, there is a new&nbsp;<a href="http://cran.r-project.org/web/packages/RCassandra/index.html">RCassandra</a>&nbsp;package, which looks nice too.</p>

<h4>R, Cassandra, and Hive</h4>

<p>For accessing Hive and Cassandra from R, I will be using&nbsp;<a href="https://www.datastax.com/products/datastax-enterprise">DataStax Enterprise</a>.</p>

<p>First, start up the Hive server:&nbsp;<strong>dse hive --service hiveserver</strong></p>

<p>&nbsp;</p>

<table data-tab-size="8">
	<tbody>
		<tr>
			<td data-line-number="1" id="file-gistfile1-r-L1">&nbsp;</td>
			<td id="file-gistfile1-r-LC1">#Load RJDBC</td>
		</tr>
		<tr>
			<td data-line-number="2" id="file-gistfile1-r-L2">&nbsp;</td>
			<td id="file-gistfile1-r-LC2">library(RJDBC)</td>
		</tr>
		<tr>
			<td data-line-number="3" id="file-gistfile1-r-L3">&nbsp;</td>
			<td id="file-gistfile1-r-LC3">&nbsp;</td>
		</tr>
		<tr>
			<td data-line-number="4" id="file-gistfile1-r-L4">&nbsp;</td>
			<td id="file-gistfile1-r-LC4">#Load Hive JDBC driver</td>
		</tr>
		<tr>
			<td data-line-number="5" id="file-gistfile1-r-L5">&nbsp;</td>
			<td id="file-gistfile1-r-LC5">hivedrv &lt;- JDBC("org.apache.hadoop.hive.jdbc.HiveDriver",</td>
		</tr>
		<tr>
			<td data-line-number="6" id="file-gistfile1-r-L6">&nbsp;</td>
			<td id="file-gistfile1-r-LC6">c(list.files("/Users/jake/workspace/bdp/resources/hadoop",pattern="jar$",full.names=T),</td>
		</tr>
		<tr>
			<td data-line-number="7" id="file-gistfile1-r-L7">&nbsp;</td>
			<td id="file-gistfile1-r-LC7">list.files("/Users/jake/workspace/bdp/resources/hive/lib",pattern="jar$",full.names=T)))</td>
		</tr>
		<tr>
			<td data-line-number="8" id="file-gistfile1-r-L8">&nbsp;</td>
			<td id="file-gistfile1-r-LC8">&nbsp;</td>
		</tr>
		<tr>
			<td data-line-number="9" id="file-gistfile1-r-L9">&nbsp;</td>
			<td id="file-gistfile1-r-LC9">#Connect to Hive service</td>
		</tr>
		<tr>
			<td data-line-number="10" id="file-gistfile1-r-L10">&nbsp;</td>
			<td id="file-gistfile1-r-LC10">hivecon &lt;- dbConnect(hivedrv, "jdbc:hive://localhost:10000/default")</td>
		</tr>
		<tr>
			<td data-line-number="11" id="file-gistfile1-r-L11">&nbsp;</td>
			<td id="file-gistfile1-r-LC11">&nbsp;</td>
		</tr>
		<tr>
			<td data-line-number="12" id="file-gistfile1-r-L12">&nbsp;</td>
			<td id="file-gistfile1-r-LC12">#Create Hive table mapping to Cassandra ColumnFamily</td>
		</tr>
		<tr>
			<td data-line-number="13" id="file-gistfile1-r-L13">&nbsp;</td>
			<td id="file-gistfile1-r-LC13">tmp &lt;- dbSendQuery(hivecon,"create external table StockHist(row_key string, column_name string, value double)</td>
		</tr>
		<tr>
			<td data-line-number="14" id="file-gistfile1-r-L14">&nbsp;</td>
			<td id="file-gistfile1-r-LC14">STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'</td>
		</tr>
		<tr>
			<td data-line-number="15" id="file-gistfile1-r-L15">&nbsp;</td>
			<td id="file-gistfile1-r-LC15">WITH SERDEPROPERTIES ('cassandra.ks.name' = 'PortfolioDemo')")</td>
		</tr>
		<tr>
			<td data-line-number="16" id="file-gistfile1-r-L16">&nbsp;</td>
			<td id="file-gistfile1-r-LC16">&nbsp;</td>
		</tr>
		<tr>
			<td data-line-number="17" id="file-gistfile1-r-L17">&nbsp;</td>
			<td id="file-gistfile1-r-LC17">#Run Hive Query to get returns</td>
		</tr>
		<tr>
			<td data-line-number="18" id="file-gistfile1-r-L18">&nbsp;</td>
			<td id="file-gistfile1-r-LC18">hres &lt;- dbGetQuery(hivecon,"select a.row_key ticker, AVG((b.value - a.value)) ret</td>
		</tr>
		<tr>
			<td data-line-number="19" id="file-gistfile1-r-L19">&nbsp;</td>
			<td id="file-gistfile1-r-LC19">from StockHist a JOIN StockHist b on</td>
		</tr>
		<tr>
			<td data-line-number="20" id="file-gistfile1-r-L20">&nbsp;</td>
			<td id="file-gistfile1-r-LC20">(a.row_key = b.row_key AND date_add(a.column_name,10) = b.column_name)</td>
		</tr>
		<tr>
			<td data-line-number="21" id="file-gistfile1-r-L21">&nbsp;</td>
			<td id="file-gistfile1-r-LC21">group by a.row_key order by ret")</td>
		</tr>
		<tr>
			<td data-line-number="22" id="file-gistfile1-r-L22">&nbsp;</td>
			<td id="file-gistfile1-r-LC22">&nbsp;</td>
		</tr>
		<tr>
			<td data-line-number="23" id="file-gistfile1-r-L23">&nbsp;</td>
			<td id="file-gistfile1-r-LC23">#Plot</td>
		</tr>
		<tr>
			<td data-line-number="24" id="file-gistfile1-r-L24">&nbsp;</td>
			<td id="file-gistfile1-r-LC24">barplot(hres[,2],names.arg=hres[,1],col = topo.colors(length(hres[,2])), border = NA)</td>
		</tr>
		<tr>
			<td data-line-number="25" id="file-gistfile1-r-L25">&nbsp;</td>
			<td id="file-gistfile1-r-LC25">title("Avg 10 Day Return for all Stocks")</td>
		</tr>
	</tbody>
</table>


<p>&nbsp;</p>

<p><img alt="Avg 10 Day Return for all Stocks" data-entity-type="file" data-entity-uuid="7312f8de-b495-4d2a-9d83-cfedd19cc566" src="https://www.datastax.com/sites/default/files/inline-images/Screen-Shot-2012-05-11-at-1.57.14-PM-300x285.png" /></p>
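
<p>The Hive query is a self-join: each price is paired with the price for the same ticker 10 days later, and the differences are averaged per ticker. As a rough illustration, the same logic can be sketched in base R on mock data. The tickers and prices are invented, and the dates are <code>Date</code> values here rather than the strings stored in the Hive table.</p>

<pre><code class="language-r"># Mock long-format data shaped like the Hive table: one row per (ticker, date).
dates &lt;- seq(as.Date("2012-01-01"), by = "day", length.out = 30)
hist &lt;- rbind(
  data.frame(row_key = "AAA", column_name = dates, value = 100 + 1:30,
             stringsAsFactors = FALSE),
  data.frame(row_key = "BBB", column_name = dates, value = 50 - 0.5 * (1:30),
             stringsAsFactors = FALSE)
)

# Self-join: match each row (a) with the row for the same ticker 10 days later (b),
# mirroring date_add(a.column_name, 10) = b.column_name.
a &lt;- hist
a$join_date &lt;- a$column_name + 10
joined &lt;- merge(a, hist,
                by.x = c("row_key", "join_date"),
                by.y = c("row_key", "column_name"),
                suffixes = c(".a", ".b"))

# Average price change per ticker, sorted ascending like "order by ret".
ret &lt;- aggregate(joined$value.b - joined$value.a,
                 by = list(ticker = joined$row_key), FUN = mean)
names(ret)[2] &lt;- "ret"
ret &lt;- ret[order(ret$ret), ]
</code></pre>

<p>With this mock data, <code>ret</code> holds -5 for BBB and 10 for AAA. Running the join in Hive instead keeps the heavy lifting on the cluster, so only the small per-ticker summary is pulled into R.</p>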

<h4>Conclusion</h4>

<p>I hope this post has shown how simple it is to access your Cassandra data from R, and why pairing that data with the hundreds of statistical methods the community has contributed is so powerful.</p>



Big Analytics with R, Cassandra, and Hive
