Amanda Moran

So you want to experiment with Apache Cassandra and Apache Spark to do some machine learning. Awesome! The usual downside is that you need to create a cluster, or borrow someone else's, before you can experiment. But what if I told you there is a way to install everything you need on one node, even on your laptop (if you are using Linux or Mac)? The steps outlined below will install:

<ul>
	<li>Apache Cassandra</li>
	<li>Apache Spark</li>
	<li>Apache Cassandra - Apache Spark Connector</li>
	<li>PySpark</li>
	<li>Jupyter Notebooks</li>
	<li>Cassandra Python Driver</li>
</ul>

Note: No set of install instructions works in every case, because each environment is different. Hopefully this works for you (as it did for me!), but if not, use it as a guide. Also, feel free to reach out and add comments on what worked for you!

<img 75="" alt="Cassandra Jupyter Python Apache Spark Logo" data-entity-type="file" data-entity-uuid="4c9fad28-9c4e-457c-95ea-da98133c2a26" src="https://www.datastax.com/sites/default/files/inline-images/install.png" />

<h2>Installing Apache Cassandra</h2>

<img alt="Cassandra Logo" data-entity-type="file" data-entity-uuid="654d213d-8ffd-46d9-b487-8ea8fc2f7cf4" src="https://www.datastax.com/sites/default/files/inline-images/cassandralogo.png" />

<h3>Download bits</h3>

<a href="http://cassandra.apache.org/download/">http://cassandra.apache.org/download/</a>

Untar and Start

<a href="http://cassandra.apache.org/doc/latest/getting_started/installing.html">http://cassandra.apache.org/doc/latest/getting_started/installing.html</a>

<code>tar -xzvf apache-cassandra-x.x.x-bin.tar.gz</code>

<code>./apache-cassandra-x.x.x/bin/cassandra &nbsp; # This will start Cassandra</code>

You might want to add the full path of <code>apache-cassandra-x.x.x/bin</code> to your <code>PATH</code>, but this is not required.

This guide uses all of the default settings. For more information about non-default configurations, review the Apache Cassandra documentation.
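Before moving on, it can help to confirm that Cassandra actually came up. The sketch below is one hedged way to do that from Python: it only checks whether something accepts TCP connections on port 9042, the default CQL native transport port, which is assumed here because this guide keeps all defaults. The function name <code>cassandra_is_listening</code> is made up for this example, and a successful connection does not prove the node is fully ready to serve queries.

```python
import socket


def cassandra_is_listening(host="127.0.0.1", port=9042, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    9042 is Cassandra's default native transport (CQL) port; change it if
    your cassandra.yaml uses a different value.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("Cassandra reachable:", cassandra_is_listening())
```

Cassandra can take a little while to start, so if this prints <code>False</code> right after launch, wait a few seconds and try again.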

<h3>Create a Keyspace and Table with CQLSH</h3>

We will use this keyspace and table later to validate the connection between Apache Cassandra and Apache Spark.

<code>./apache-cassandra-x.x.x/bin/cqlsh</code> 
<code>CREATE KEYSPACE IF NOT EXISTS test</code> 
<code>&nbsp;&nbsp; WITH REPLICATION =</code> 
<code>&nbsp;&nbsp; { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };</code> 
<code>USE test;</code> 
<code>CREATE TABLE IF NOT EXISTS testing123 (id int, name text, city text, PRIMARY KEY (id));</code> 
<code>INSERT INTO testing123 (id, name, city) VALUES (1, 'Amanda', 'Bay Area');</code> 
<code>INSERT INTO testing123 (id, name, city) VALUES (2, 'Toby', 'NYC');</code>

<img alt="CQLSH Code" data-entity-type="file" data-entity-uuid="0b8d506f-7575-4330-a42d-202ba1f1f40a" src="https://www.datastax.com/sites/default/files/inline-images/cqlshCode.png" />

<h2>Install Apache Spark in Standalone Mode</h2>

<img alt="Apache Spark Logo" data-entity-type="file" data-entity-uuid="d26bb4b3-82ae-449a-a17b-6d874fd6e44e" src="https://www.datastax.com/sites/default/files/inline-images/sparklogo.png" />

Download bits:

<a href="https://spark.apache.org/downloads.html">https://spark.apache.org/downloads.html</a>

Install:

<a href="https://spark.apache.org/docs/latest/spark-standalone.html#installing-spark-standalone-to-a-cluster">https://spark.apache.org/docs/latest/spark-standalone.html#installing-spark-standalone-to-a-cluster</a>

<code>tar -xzvf spark-x.x.x-bin-hadoopx.x.tgz</code>

Before starting Spark do the following:

<code>export SPARK_HOME="/path/to/spark-x.x.x-bin-hadoopx.x"</code> 
<code>cd $SPARK_HOME/conf</code> 
<code>vim spark-defaults.conf</code> 
<code># Add the following line to set spark.jars.packages</code> 
<code>spark.jars.packages &nbsp; &nbsp; com.datastax.spark:spark-cassandra-connector_2.11:2.3.2</code>

<img alt="Spark Configuration" data-entity-type="file" data-entity-uuid="98a8a7da-fd9d-4688-8360-8fdf6e0846af" src="https://www.datastax.com/sites/default/files/inline-images/sparkConf.png" />
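The <code>spark.jars.packages</code> value is a Maven coordinate: the <code>_2.11</code> suffix is the Scala version the connector was built for, and <code>2.3.2</code> is the connector version. Both need to line up with the Spark build you downloaded. As a rough sketch, the configuration step could also be scripted in Python. The helper names <code>connector_coordinate</code> and <code>ensure_packages_line</code> are hypothetical, invented for this example, and the script assumes a simple whitespace-separated <code>spark-defaults.conf</code>.

```python
from pathlib import Path


def connector_coordinate(scala_version, connector_version):
    """Build the Maven coordinate for the Spark Cassandra Connector."""
    return (
        "com.datastax.spark:spark-cassandra-connector_"
        f"{scala_version}:{connector_version}"
    )


def ensure_packages_line(conf_path, coordinate):
    """Append a spark.jars.packages line unless the file already sets one.

    Returns True if a line was written, False if the key was already present.
    """
    path = Path(conf_path)
    text = path.read_text() if path.exists() else ""
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0] == "spark.jars.packages":
            return False
    # Keep the new entry on its own line even if the file lacks a final newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with path.open("a") as conf:
        conf.write(f"{prefix}spark.jars.packages {coordinate}\n")
    return True
```

For example, <code>ensure_packages_line("spark-defaults.conf", connector_coordinate("2.11", "2.3.2"))</code> writes the same line shown above and does nothing on a second run.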

Start Spark In Standalone Mode

<code>cd spark-x.x.x-bin-hadoopx.x/</code> 
<code>./sbin/start-master.sh</code>

<h2>Information about the Apache Spark Connector</h2>

The Apache Cassandra and Apache Spark Connector moves data back and forth between Apache Cassandra and Apache Spark so that you can apply the power of Apache Spark to that data. The connector should be co-located with Apache Cassandra and Apache Spark, all on the same node. It reads data from Apache Cassandra by known token range and pages that data into the Spark Executor. Under the hood, the connector uses the DataStax Java driver to move data between Apache Cassandra and Apache Spark. More information can be found here: <a href="https://databricks.com/session/spark-and-cassandra-2-fast-2-furious">https://databricks.com/session/spark-and-cassandra-2-fast-2-furious</a>

<img alt="Query With Token Bounds" data-entity-type="file" data-entity-uuid="fa47deeb-e5c4-4aa2-8fe3-a3ffeae35174" src="https://www.datastax.com/sites/default/files/inline-images/Screen%20Shot%202019-03-15%20at%2010.46.34%20AM.png" />
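To make the idea of reading by token range a bit more concrete, here is a simplified illustration in Python. It is not the connector's actual implementation: it just splits the full token ring into equal contiguous ranges and builds one bounded CQL query per range. It assumes the default Murmur3Partitioner, whose tokens run from -2^63 to 2^63 - 1, and the function names are made up for this sketch.

```python
MIN_TOKEN = -(2 ** 63)      # lowest Murmur3Partitioner token
MAX_TOKEN = 2 ** 63 - 1     # highest Murmur3Partitioner token


def split_token_ring(num_splits):
    """Split the whole ring into num_splits contiguous (start, end] ranges."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    total = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + (total * i) // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def token_range_queries(keyspace, table, partition_key, num_splits):
    """Build one SELECT per token range, bounded with token() predicates.

    The first range starts strictly above MIN_TOKEN; this sketch ignores
    that edge case.
    """
    return [
        f"SELECT * FROM {keyspace}.{table} "
        f"WHERE token({partition_key}) > {start} "
        f"AND token({partition_key}) <= {end}"
        for start, end in split_token_ring(num_splits)
    ]


if __name__ == "__main__":
    for query in token_range_queries("test", "testing123", "id", 4):
        print(query)
```

Each query touches only one slice of the ring, which is what lets separate Spark tasks read separate slices in parallel. The real connector sizes its splits from the cluster's own token metadata rather than cutting the ring evenly.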

Note: This guide works only with PySpark, where only DataFrames are available.

<a href="https://github.com/datastax/spark-cassandra-connector/blob/master/doc/15_python.md">https://github.com/datastax/spark-cassandra-connector/blob/master/doc/15...</a>

<a href="https://spark-packages.org/package/datastax/spark-cassandra-connector">https://spark-packages.org/package/datastax/spark-cassandra-connector</a>


First, test the connection using the keyspace and table we created above.

<code>$SPARK_HOME/bin/pyspark</code> 
<code># Create a DataFrame from the table that we created above</code> 
<code>spark.read.format('org.apache.spark.sql.cassandra').options(table='testing123', keyspace='test').load().show()</code>

<img alt="Spark Connector" data-entity-type="file" data-entity-uuid="a61ce661-2f97-496a-8988-406680630fad" src="https://www.datastax.com/sites/default/files/inline-images/sparkConnection.png" />

<h3>Install Jupyter Notebooks with pip</h3>

Reference: <a href="https://jupyter.org/install">https://jupyter.org/install</a>

<code>python -m pip install --upgrade pip</code>

<code>python -m pip install jupyter</code>

Start Jupyter with PySpark

<code>cd spark-2.3.0-bin-hadoop2.7 
export PYSPARK_DRIVER_PYTHON=jupyter</code> 
<code>export PYSPARK_DRIVER_PYTHON_OPTS='notebook'</code> 
<code>SPARK_LOCAL_IP=127.0.0.1 ./bin/pyspark</code>

These commands will launch Jupyter Notebooks on <code>localhost:8888</code>. The downside is that Jupyter starts in the Spark directory, so you won't be able to navigate to any existing notebooks; just copy them into this directory. It's not the best solution, but it will do to be able to use all these pieces together!

<img alt="Local Host" data-entity-type="file" data-entity-uuid="215c3198-a43f-4320-a130-969fafcc7464" src="https://www.datastax.com/sites/default/files/inline-images/Screen%20Shot%202019-03-15%20at%2010.50.52%20AM.png" />
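If you would rather not retype the exports each time, the same environment can be set up from a small Python launcher. This is only a sketch under a few assumptions: the function names are hypothetical, and the optional <code>notebook_dir</code> argument relies on the <code>--notebook-dir</code> option of the <code>jupyter notebook</code> command, which may be one way around the "existing notebooks" downside mentioned above. I have not tested it against every Jupyter version.

```python
import os
import subprocess


def jupyter_pyspark_env(base_env=None, local_ip="127.0.0.1", notebook_dir=None):
    """Return a copy of the environment with the PySpark/Jupyter variables set."""
    env = dict(os.environ if base_env is None else base_env)
    opts = "notebook"
    if notebook_dir is not None:
        # Assumption: jupyter notebook accepts --notebook-dir.
        # Paths containing spaces are not handled in this sketch.
        opts += f" --notebook-dir={notebook_dir}"
    env["PYSPARK_DRIVER_PYTHON"] = "jupyter"
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = opts
    env["SPARK_LOCAL_IP"] = local_ip
    return env


def launch_notebook(spark_home, notebook_dir=None):
    """Run bin/pyspark from spark_home with the environment above.

    Blocks until the notebook server exits.
    """
    pyspark = os.path.join(spark_home, "bin", "pyspark")
    env = jupyter_pyspark_env(notebook_dir=notebook_dir)
    return subprocess.run([pyspark], env=env, cwd=spark_home)
```

Calling <code>launch_notebook("spark-2.3.0-bin-hadoop2.7")</code> from the directory that contains the Spark folder should behave like the shell commands above.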

<h2>Install Apache Cassandra Python Driver</h2>

<code>pip install cassandra-driver</code>

<h3>Create a New Notebook</h3>

Import Packages

<code>import cassandra 
import pyspark</code>

Connect to Cluster

<code>from cassandra.cluster import Cluster 
cluster = Cluster(['127.0.0.1'])</code> 
<code>session = cluster.connect()</code>
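With the session in hand, you can also sanity-check the table directly through the Python driver before involving Spark at all. The helper below is a minimal sketch: <code>fetch_people</code> is a name invented for this example, and it assumes the driver's default row factory, which returns rows whose columns can be read as attributes.

```python
def fetch_people(session):
    """Read every row of test.testing123 and return sorted (id, name, city) tuples.

    `session` is expected to be the object returned by cluster.connect().
    """
    rows = session.execute("SELECT id, name, city FROM test.testing123")
    return sorted((row.id, row.name, row.city) for row in rows)


# In the notebook, after running the connection cell above:
#     fetch_people(session)
```

With the two rows inserted earlier, this should return the entries for Amanda and Toby.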

Create a SparkSession and load the DataFrame from the Apache Cassandra table. Verify that the transfer has occurred by printing the number of rows in the DataFrame. We should see “2”.

<code>from pyspark.sql import SparkSession</code> 
<code>spark = SparkSession.builder.appName('demo').master("local").getOrCreate()</code> 
<code>df = spark.read.format("org.apache.spark.sql.cassandra").options(table="testing123", keyspace="test").load()</code> 
<code>print("Table Row Count: ")</code> 
<code>print(df.count())</code>
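Since the same <code>spark.read</code> chain gets typed every time you load a table, you may want to wrap it in a small helper once your notebooks grow. This is just a convenience sketch, and <code>read_cassandra_table</code> is a hypothetical name. It makes exactly the same calls as the cell above and takes the SparkSession as an argument.

```python
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def read_cassandra_table(spark, keyspace, table):
    """Load a Cassandra table as a DataFrame through the connector.

    `spark` is an existing SparkSession with the connector package loaded.
    """
    return (
        spark.read.format(CASSANDRA_FORMAT)
        .options(keyspace=keyspace, table=table)
        .load()
    )


# In the notebook:
#     df = read_cassandra_table(spark, "test", "testing123")
#     print(df.count())
```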

<h2><img alt="Jupyter Testing" data-entity-type="file" data-entity-uuid="0a054aef-c7df-482e-b2b6-c5049f90e0cf" src="https://www.datastax.com/sites/default/files/inline-images/Screen%20Shot%202019-03-15%20at%2011.00.15%20AM.png" /></h2>

<h2>Conclusion</h2>

TADA!

There you have it! You now have Apache Cassandra, Apache Spark, the Apache Cassandra - Apache Spark Connector, PySpark, the Cassandra Python driver, and Jupyter all installed on one node (or local instance!). Congratulations! Enjoy exploring your data!

A few notebooks you might enjoy!

<a href="https://github.com/amandamoran/wineAndChocolate">https://github.com/amandamoran/wineAndChocolate</a>

<a href="https://github.com/amandamoran/pydata">https://github.com/amandamoran/pydata</a>

Reference: <a href="https://medium.com/explore-artificial-intelligence/downloading-spark-and-getting-started-with-python-notebooks-jupyter-locally-on-a-single-computer-98a76236f8c1">https://medium.com/explore-artificial-intelligence/downloading-spark-and...</a>

Simplifying Installing Apache Cassandra, Apache Spark and Jupyter
