Amanda Moran

<p>Part 2 of this blog series will focus on how to get DataStax Enterprise Analytics with Apache Cassandra™ and Apache Spark™, Jupyter Notebooks, and all the required Python package dependencies set up via Docker.</p>

<p><img alt="Docker Logos" data-entity-type="file" data-entity-uuid="b845977b-ddf0-4ec2-ba94-8922a9670270" src="https://www.datastax.com/sites/default/files/inline-images/alllogosDocker.png" /></p>

<h2>What Problem Are We Trying to Solve?</h2>

<p>The question of our time: "What movie should I actually see?" Wouldn't it be great if you could ask 1 million people this question? Wouldn't it be great if you could automate this process? And wouldn't it be great if you didn't have to go through all the installation steps detailed in Part 1 of this series?</p>

<p>Data analytics doesn't have to be complicated, and neither does the setup!</p>

<p>To do this we can use the power of Big Data and a combination of technologies: DataStax Enterprise Analytics with Apache Spark™ and Apache Cassandra™, Apache Spark™ Machine Learning Libraries, Python, PySpark, Twitter Tweets, the Twitter Developer API, Jupyter notebooks, Pandas, the Python package Pattern, and Docker!</p>

<h2>How Are We Going to Solve It?</h2>

<p>In the <a href="/blog/2018/09/when-rotten-tomatoes-isnt-enough-twitter-sentiment-analysis-dse-part-1">Part 1</a> blog entry on this topic we installed everything locally from the DSE binary tar file, but now we will simplify the process by using Docker and a previously created image.</p>

<h2>How to Get Started</h2>

<h3>Requirements</h3>

<ul>
	<li>Docker</li>
	<li>Download or clone this repo: <a href="https://github.com/amandamoran/pydata">https://github.com/amandamoran/pydata</a>
	<ul>
		<li>Note: This repo also includes notebooks that use CSV files, in case you would like to get started with the notebook but do not wish to create a Twitter Dev API account.</li>
	</ul>
	</li>
</ul>

<h4>Overview</h4>

<ul>
	<li>Install Docker</li>
	<li>Download DataStax Docker Image</li>
	<li>Open Jupyter</li>
</ul>

<h4>Install Docker</h4>

<ul>
	<li>Find the correct Docker Community Edition for your operating system: <a href="https://store.docker.com/search?type=edition&amp;offering=community">https://store.docker.com/search?type=edition&amp;offering=community</a></li>
	<li>Create a login to download</li>
	<li>Download Docker</li>
</ul>

<p>&nbsp;</p>

<p><img alt="Download Docker" data-entity-type="file" data-entity-uuid="7dbb79a5-610b-4e86-adf3-1b3c8432a9a0" src="https://www.datastax.com/sites/default/files/inline-images/downloaddocker_0.png" /></p>

<h3>Configure Docker Memory Settings</h3>

<ul>
	<li>Allow for 5 GB of Memory per container</li>
	<li>Docker -&gt; Preferences -&gt; Advanced -&gt; Memory</li>
</ul>
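
<p>As a quick sanity check on the setting above, the memory visible to Docker can be read back from <code>docker info</code>. The following is a minimal Python sketch and is not part of the pydata repo: the helper names <code>bytes_to_gib</code> and <code>docker_memory_bytes</code> are made up for illustration, and it assumes the <code>docker</code> CLI is on your PATH and the Docker daemon is running.</p>

```python
import subprocess

GIB = 1024 ** 3  # bytes in one gibibyte


def bytes_to_gib(num_bytes):
    """Convert a byte count to GiB, rounded to two decimals."""
    return round(num_bytes / GIB, 2)


def docker_memory_bytes():
    """Ask the Docker daemon how much memory it can use, in bytes.

    Assumes the `docker` CLI is installed and the daemon is running.
    """
    result = subprocess.run(
        ["docker", "info", "--format", "{{.MemTotal}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())
```

<p>With Docker running, <code>bytes_to_gib(docker_memory_bytes())</code> should return a figure close to the value chosen in the preferences; the reported number may differ slightly from the configured one.</p>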

<p>&nbsp;</p>

<p><img alt="Docker Memory Settings" data-entity-type="file" data-entity-uuid="f3e455bf-4bbf-4057-aa6c-cb16da60ccd5" src="https://www.datastax.com/sites/default/files/inline-images/memorydocker.png" /></p>

<h3>Download DSE/Jupyter Images</h3>

<ul>
	<li>cd <strong>YourDownloadPath</strong>/pydata</li>
	<li>docker-compose up -d&nbsp;
	<ul>
		<li>This will take about 6 minutes (depending on your connection speed)</li>
		<li>This will start DataStax Enterprise (which includes Apache Spark™) and Jupyter notebooks</li>
		<li>The command must be run in the same directory as the docker-compose.yaml file (this file holds all the configuration and information on how to download and deploy these containers)</li>
	</ul>
	</li>
</ul>
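
<p>Because <code>docker-compose up -d</code> only works from the directory that holds docker-compose.yaml, it can help to check for that file before starting anything. Here is a small Python sketch of the two steps above, offered as one possible wrapper rather than anything shipped with the repo. The function names are hypothetical, and it assumes the <code>docker-compose</code> command is installed.</p>

```python
import subprocess
from pathlib import Path

COMPOSE_FILE = "docker-compose.yaml"  # file name as given in this post


def compose_up_command(repo_dir):
    """Return (command, working_dir) for starting the containers.

    Raises FileNotFoundError if repo_dir does not contain the compose file,
    which is the usual reason `docker-compose up` fails to find a config.
    """
    repo = Path(repo_dir).expanduser().resolve()
    if not (repo / COMPOSE_FILE).is_file():
        raise FileNotFoundError(f"{COMPOSE_FILE} not found in {repo}")
    return ["docker-compose", "up", "-d"], repo


def start_containers(repo_dir):
    """Run `docker-compose up -d` from inside the cloned repo."""
    command, working_dir = compose_up_command(repo_dir)
    subprocess.run(command, cwd=working_dir, check=True)
```

<p>For example, <code>start_containers("~/Downloads/pydata")</code> would be the equivalent of the two commands above, where the path is only a placeholder for wherever you cloned the repo.</p>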

<h3>Open Jupyter</h3>

<ul>
	<li>Once the download and startup are complete, find the login token in the Jupyter logs:</li>
	<li>docker logs pydata_jupyter_1</li>
	<li>Log in with that token</li>
</ul>
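
<p>Picking the token out of the log output by eye is easy to get wrong, so here is a hedged Python sketch that does it with a regular expression. It is an illustration only: the helper names are invented, the container name <code>pydata_jupyter_1</code> and the <code>127.0.0.1:8889</code> address are taken from this post, and the pattern assumes the token appears in the logs as <code>?token=</code> followed by hexadecimal characters.</p>

```python
import re
import subprocess

# Assumption: the token shows up in the logs as "?token=<hex digits>".
TOKEN_RE = re.compile(r"[?&]token=([0-9a-f]+)")


def extract_token(log_text):
    """Return the last login token found in log_text, or None if absent."""
    matches = TOKEN_RE.findall(log_text)
    return matches[-1] if matches else None


def login_url(token, host="127.0.0.1", port=8889):
    """Build the browser URL in the same shape as the example in this post."""
    return f"http://{host}:{port}/?token={token}"


def jupyter_logs(container="pydata_jupyter_1"):
    """Fetch the container logs; the token line may be on either stream."""
    result = subprocess.run(
        ["docker", "logs", container],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout + result.stderr
```

<p>With the containers running, <code>login_url(extract_token(jupyter_logs()))</code> gives the address to paste into your browser. If <code>extract_token</code> returns <code>None</code>, Jupyter has probably not finished starting yet.</p>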

<p>&nbsp;</p>

<p><img alt="Open Jupyter" data-entity-type="file" data-entity-uuid="1ac31d4d-b609-4ee8-ae95-f86e0890a6a5" src="https://www.datastax.com/sites/default/files/inline-images/jupyterterm.png" /></p>

<ul>
	<li>Example: <a href="http://127.0.0.1:8889/?token=dcd21bc3a1c1331c6c61d51fb5a9d64c72fca7f4b2a6000e">http://127.0.0.1:8889/?token=dcd21bc3a1c1331c6c61d51fb5a9d64c72fca7f4b2a6000e</a></li>
	<li>Navigate to the notebooks directory</li>
	<li>Open When Rotten Tomatoes isn’t Enough CSV.ipynb if you want to play with the notebook without setting up the Twitter API</li>
</ul>

<p><img alt="Notebook" data-entity-type="file" data-entity-uuid="693aa4d6-f423-457c-8d72-979093630656" src="https://www.datastax.com/sites/default/files/inline-images/Screen%20Shot%202018-10-30%20at%203.22.12%20PM.png" /></p>

<h3>Congrats, you did it!</h3>

<h2>What's Next:</h2>

<p>Explore the notebook! Play with removing different stop words, and change the confidence intervals! Data science is about exploring.</p>
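
<p>To get a feel for what changing the stop words does before touching the notebook, here is a tiny plain-Python sketch. It is not the notebook's code (which Part 3 walks through); the word list and the function name <code>remove_stop_words</code> are assumptions chosen purely for illustration.</p>

```python
# Illustrative stop word list; the notebook's actual list may differ.
DEFAULT_STOP_WORDS = frozenset(
    {"a", "an", "the", "is", "it", "this", "to", "and", "of", "i"}
)


def remove_stop_words(text, stop_words=DEFAULT_STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in stop_words]
```

<p>Adding a word that appears in almost every tweet about a film, for example <code>remove_stop_words(tweet, DEFAULT_STOP_WORDS | {"movie"})</code>, shows how the remaining tokens change as the list grows.</p>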

<h2>Stay Tuned for Part 3</h2>

<p>Stay tuned for the 3rd and final part of this series that will walk through each cell in the notebook!</p>

<p>Want even more information about how to deploy DSE Docker containers? Check out this excellent blog by Kathryn Erickson: <a href="https://academy.datastax.com/content/docker-tutorial">Docker Tutorial</a>.</p>


When Rotten Tomatoes Isn't Enough: Twitter Sentiment Analysis with DSE Part 2
