Artem Aliev

<p><strong>Update May 21, 2015:</strong>&nbsp;DataStax Enterprise (DSE) version 4.7 was just released this week and includes&nbsp;<a href="http://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/spark/sparkMLlibDemo.html">official support for the Apache Spark™ MLlib integration</a>. In prior DSE releases (4.5 and 4.6), it was an experimental feature (available for those who want to learn and experiment with it, but not yet recommended for use in production).</p>

<p>DataStax Enterprise (DSE) 4.5 can now perform in-memory analytics using integrated Apache Spark™. Spark has proven performance on batch and interactive analytics, and it also supports streaming from external sources, making it a powerful real-time analytics platform. Starting a Spark cluster is as simple as editing one line in the DSE config file or starting DSE with the <code>dse cassandra -k</code> command. See the full installation and development documentation&nbsp;<a href="https://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkTOC.html">here</a>.</p>

<p>Apache Spark™ is written in Scala but also supports other languages such as Python and Java. For those apprehensive about Scala, here is a good&nbsp;<a href="https://www.datastax.com/dev/blog/accessing-cassandra-from-spark-in-java" title="Accessing Cassandra from Spark in Java">article</a>&nbsp;about accessing Apache Cassandra™ using the Spark Cassandra Connector Java API. Data scientists who are more familiar with languages such as R, MATLAB, SAS or Octave will likely be more comfortable with Scala than with Java. Data analysis tools like R and MATLAB provide an interactive shell to work with data and are usually bundled with a comprehensive set of machine learning algorithms and libraries, but these tools usually have scalability and performance bottlenecks. For users who need performance beyond what a single machine can offer, a Spark cluster is hard to beat. The Spark interactive shell (based on the Scala shell) looks similar to the R shell and makes getting started with Spark easy. There are a number of Scala tutorials available to quickly get up to speed with the Scala syntax.</p>

<p>Spark contains a number of libraries for&nbsp;<a href="https://spark.apache.org/docs/0.9.2/streaming-programming-guide.html">data streaming</a>,&nbsp;<a href="https://spark.apache.org/docs/0.9.2/graphx-programming-guide.html">graphs</a>&nbsp;and&nbsp;<a href="https://spark.apache.org/docs/0.9.2/mllib-guide.html">machine learning</a>.</p>

<p>In this article we’ll focus on&nbsp;<a href="http://spark.apache.org/docs/0.9.1/mllib-guide.html" title="Spark MLlib">Spark MLlib</a>, a module for machine learning which contains the following algorithms:</p>

<ul>
	<li><a href="http://spark.apache.org/docs/0.9.1/mllib-guide.html#binary-classification">Classification</a>&nbsp;(SVM, LogisticRegression, NaiveBayes)</li>
	<li><a href="http://spark.apache.org/docs/0.9.1/mllib-guide.html#clustering">Clustering</a>&nbsp;(K-means)</li>
	<li><a href="http://spark.apache.org/docs/0.9.1/mllib-guide.html#linear-regression">Linear Regression</a></li>
	<li><a href="http://spark.apache.org/docs/0.9.1/mllib-guide.html#collaborative-filtering">Collaborative Filtering</a></li>
</ul>

<p>MLlib is being rapidly developed, and many new algorithms are being added. I’d like to show how to do advanced analytics with Spark and Cassandra by solving a classic machine learning task: I will build a classifier for the Iris flower data set using the Naive Bayes algorithm.</p>

<h2>Prepare and Save the Data Set to Apache Cassandra™</h2>

<p>The&nbsp;<a href="http://en.wikipedia.org/wiki/Iris_flower_data_set" title="Iris data set">Iris flower data set</a>&nbsp;is one of the most commonly used data sets in machine learning tutorials. It consists of 50 samples from each of three species of Iris, with 4 features measured from each sample. This is not a “Big Data” example, but it demonstrates the fundamental techniques that can be used at scale. You can generate more data if needed. In this example we will show how to store data in Apache Cassandra™, load the data back into Spark, and train our model. This process will build a Naive Bayes classifier that names a flower based on the 4 feature measurements.</p>

<p>Normally you already have data in Cassandra to analyze, so this part of the article is optional. You can put data directly into Cassandra through cqlsh or with any Cassandra driver. To be consistent, I will use the Spark connector features to do it from the Spark shell.</p>

<p>First, put the data set on a shared file system that is accessible from all cluster nodes. CFS is a natural choice, and it's available out of the box in DSE. I also skip the CSV header for parsing convenience.</p>

<pre>
wget http://www.heatonresearch.com/dload/data/iris.csv
tail -n +2 iris.csv | dse hadoop fs -put - iris.csv</pre>

<p>Start the interactive Spark shell. Everything else will be done in the shell.</p>

<pre>
dse spark</pre>

<pre>
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 0.9.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_65)
Type in expressions to have them evaluated.
Type :help for more information.
Creating SparkContext...
Created spark context..
Spark context available as sc.
Type in expressions to have them evaluated.
Type :help for more information.
scala&gt;</pre>

<p>DSE Spark creates a preconfigured&nbsp;<code>SparkContext</code>&nbsp;in the variable&nbsp;<code>sc</code>&nbsp;that allows Spark to connect to Cassandra through the&nbsp;<a href="https://github.com/datastax/spark-cassandra-connector">Spark Cassandra Connector</a>. Together with the connector, it provides methods to load Cassandra data into Spark RDDs and to save RDDs back to Cassandra. A Resilient Distributed Dataset (RDD) is the primary Spark abstraction for storing a distributed data set. It supports a wide set of&nbsp;<a href="http://spark.apache.org/docs/0.9.1/scala-programming-guide.html#rdd-operations">filter/head/tail/map/reduce/cogroup…</a>&nbsp;operations. All operations are parallel and lazy: they are executed only when you call a Spark output action such as collect() or saveAs..() to get the final result.</p>
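<p>The idea of lazy evaluation can be tried without a cluster. The following is only an analogy, a minimal sketch that uses a plain Scala <code>Iterator</code> instead of an RDD; the names <code>evaluated</code>, <code>pipeline</code>, <code>countBefore</code> and <code>countAfter</code> are illustrative and not part of the original example. As with RDD transformations, defining the pipeline computes nothing until a result is requested.</p>

```scala
// Analogy only: a Scala Iterator is not an RDD, but it defers work in a similar way.
var evaluated = 0

// Defining the pipeline does not run the map or the filter yet.
val pipeline = Iterator.range(1, 11)
  .map { x => evaluated += 1; x * 2 }
  .filter(_ > 10)

val countBefore = evaluated   // still 0: nothing has been computed

// Forcing the result plays the role of a Spark output action such as collect().
val result = pipeline.toList

val countAfter = evaluated    // 10: every element was mapped exactly once
```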

<p>Before we continue, it is very useful to define a case class that wraps the data. The connector methods will transparently map this class to and from Cassandra rows.</p>

<pre>
case class Iris(
    id: java.util.UUID,
    sepal_l: Double,
    sepal_w: Double,
    petal_l: Double,
    petal_w: Double,
    species: String
)</pre>

<p>The “id” field is not in the original data set, but a unique key is needed to store data in Cassandra.</p>

<p>Load the data from the file:</p>

<pre>
val data = sc.textFile("iris.csv")</pre>

<p>Parse the data and generate a random id for each Iris object:</p>

<pre>
val parsed = data.filter(!_.isEmpty).map { row =&gt;
    val splitted = row.split(",")
    val Array(sl, sw, pl, pw) = splitted.slice(0, 4).map(_.toDouble)
    Iris(java.util.UUID.randomUUID(), sl, sw, pl, pw, splitted(4))
}</pre>
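<p>The parsing logic itself does not depend on Spark, so it can be tried on an ordinary Scala collection first. The following is a minimal sketch under that assumption: <code>parseLine</code>, <code>sampleLines</code> and <code>parsedLocally</code> are hypothetical names introduced for illustration, and the sample lines simply follow the format of the rows printed in this article.</p>

```scala
import java.util.UUID

// Same shape as the Iris case class used in this article.
case class Iris(id: UUID, sepal_l: Double, sepal_w: Double,
                petal_l: Double, petal_w: Double, species: String)

// Hypothetical helper: turns one CSV line into an Iris record.
def parseLine(row: String): Iris = {
  val fields = row.split(",")
  // The first four columns are the numeric measurements.
  val Array(sl, sw, pl, pw) = fields.slice(0, 4).map(_.toDouble)
  // The fifth column is the species name; the id is generated randomly.
  Iris(UUID.randomUUID(), sl, sw, pl, pw, fields(4))
}

// Sample lines in the iris.csv format (header removed), including an empty line.
val sampleLines = Seq(
  "5.1,3.5,1.4,0.2,Iris-setosa",
  "",
  "4.9,3.0,1.4,0.2,Iris-setosa"
)

// Skip empty lines, as the Spark version does, then parse the rest.
val parsedLocally = sampleLines.filter(!_.isEmpty).map(parseLine)
```

<p>A line with fewer than five columns or a non-numeric measurement would make this sketch throw an exception, so real data may need an extra validation step before parsing.</p>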

<p>Let’s print a couple of rows to verify our collection:</p>

<pre>
parsed.take(2).foreach(println)</pre>

<pre>
Iris(3799e309-a6dc-4e0c-b319-7bfcb93040c2,5.1,3.5,1.4,0.2,Iris-setosa)
Iris(e14cbb0b-14e7-40bd-a950-a9265594f1f5,4.9,3.0,1.4,0.2,Iris-setosa)</pre>

<p>The data is now ready to be processed by Spark or saved to Cassandra. I will store it in Cassandra and then load it back.</p>

<p>The Spark Cassandra Connector allows us to execute custom CQL queries, so it can be used to create a Cassandra keyspace and table.</p>

<pre>
import com.datastax.spark.connector.cql.CassandraConnector

CassandraConnector(sc.getConf).withSessionDo { session =&gt;
    session.execute("CREATE KEYSPACE IF NOT EXISTS test WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }")
    session.execute("""CREATE TABLE IF NOT EXISTS
        test.iris (
            id uuid primary key,
            sepal_l double,
            sepal_w double,
            petal_l double,
            petal_w double,
            species text
        )
    """)
}</pre>

<p>Finally, save the data to Cassandra:</p>

<pre>
parsed.saveToCassandra("test", "iris")</pre>

<h2>Load Data from Apache Cassandra™</h2>

<p>Loading the data is very simple:</p>

<pre>
val data = sc.cassandraTable[Iris]("test", "iris").cache()</pre>

<p>The cache() function will cause the table to be cached in memory and speed up future operations.</p>

<h2>Prepare Data for MLlib</h2>

<p>MLlib works with LabeledPoint objects, each consisting of a label (a double value that marks a class) and an array of double features. We therefore need to define a mapping from flower name to index and back. The code below selects all ‘species’ values, takes the distinct ones, indexes them and builds a map, then builds the reverse map.</p>

<pre>
val class2id = data.map(_.species).distinct.collect.zipWithIndex.map{case (k,v)=&gt;(k, v.toDouble)}.toMap
val id2class = class2id.map(_.swap)</pre>
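<p>The same indexing can be tried on a local array to see what the two maps contain. This is a minimal sketch: the hypothetical <code>species</code> array stands in for the result of <code>data.map(_.species).distinct.collect</code>, and its ordering is an assumption made for illustration.</p>

```scala
// Stand-in for the distinct species names collected from the cluster.
val species = Array("Iris-setosa", "Iris-versicolor", "Iris-virginica")

// Pair each name with its position and convert the position to a Double label.
val class2id: Map[String, Double] =
  species.zipWithIndex.map { case (name, index) => (name, index.toDouble) }.toMap

// Reverse lookup: from numeric label back to the species name.
val id2class: Map[Double, String] = class2id.map(_.swap)
```

<p>On a real cluster the order in which distinct values come back is not guaranteed, so the numeric id assigned to each species may differ from this sketch. That is why predictions should always be translated back through the reverse map rather than by assuming a fixed numbering.</p>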

<p>Map the Iris data to LabeledPoint objects:</p>

<pre>
import org.apache.spark.mllib.regression.LabeledPoint
val parsedData = data.map { i =&gt; LabeledPoint(class2id(i.species), Array(i.petal_l,i.petal_w,i.sepal_l,i.sepal_w)) }</pre>

<h2>Work with MLlib predictors</h2>

<p>Train the Naive Bayes classifier:</p>

<pre>
import org.apache.spark.mllib.classification.NaiveBayes
val model = NaiveBayes.train(parsedData)</pre>

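<p>Training is done, and we can now recognize irises by passing four measurements in the same order used to build the feature array: petal length, petal width, sepal length, sepal width.</p>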

<pre>
model.predict(Array(5, 1.5, 6.4, 3.2))</pre>

<pre>
res6: Double = 2.0</pre>

<p>Or more readable:</p>

<pre>
id2class(model.predict(Array(5, 1.5, 6.4, 3.2)))</pre>

<pre>
res7: String = Iris-versicolor</pre>

<h2>To be continued...</h2>

<p>I’d like to stop at this point. A data scientist will ask a lot of questions here:</p>

<ul>
	<li>How can we split data into training and test sets?</li>
	<li>How can we save a model and deploy it to production?</li>
	<li>How can we measure the quality of the model?</li>
	<li>How can we tune the training algorithms?</li>
	<li>How can we chart the data?</li>
</ul>

<p>And other good questions. Stay tuned for Part II.</p>


Interactive Advanced Analytics with DSE and Spark MLlib
