Hamilton Tran

<h3>Introduction</h3>

<p>Solr and Hadoop are two major open source technologies that we have integrated into DataStax Enterprise (DSE) on top of Cassandra. For those just joining us, Solr provides full-text search, and Hadoop provides a distributed file system and processes large datasets via MapReduce. Traditionally, if you wanted to run MapReduce over some data and also search that same data, you would have to ETL it into your Solr cluster, with all the pitfalls of keeping the two clusters in sync. The beauty of DataStax Enterprise is that, with the right replication settings, you can search and run MapReduce jobs over the same dataset with ease. In this example I'll be using a modified dataset from a survey by The Pew Research Center about Facebook habits and attitudes.</p>

<h3>Environment</h3>

<p>This demonstration was run on my EC2 cluster: two m1.large instances running Ubuntu 12.04 with a binary install of DSE 3.0.4.</p>

<p><img alt="ring_output" src="https://www.datastax.com/wp-content/uploads/2013/12/ring_output-700x72.png" /></p>

<p>The cluster has been set up with two virtual datacenters (DCs): an Analytics DC with a node running Hadoop, and a Solr DC with a node running Solr.</p>

<h3>Files</h3>

<p>To begin, we need to get the survey file: Omnibus_Dec_2012_csv<br />
I've modified this survey file from the original by removing many of the columns. Our primary focus will be two columns, <strong>pial1a</strong> and <strong>pial4vb</strong>, which map to these two questions:</p>

<pre>
PIAL1A&nbsp;&nbsp;&nbsp; As I read the following list of items, please tell me if you happen to have each one, or not.&nbsp; Do you have...<b> [INSERT ITEMS IN ORDER]</b>?
a.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; A handheld device made primarily for e-book reading, such as a Nook or Kindle e-reader
1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Yes
2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; No
8&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <b>(DO NOT READ)</b> Don’t know
9&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <b>(DO NOT READ)</b> Refused

PIAL4vb&nbsp;&nbsp;&nbsp; What made you decide to stop using Facebook?</pre>

<p>Second, we need to create a Solr schema file so that DSE Solr understands how to import and index the data and store it in Cassandra. Copy and paste the following into a file called answers_schema.xml. The schema will be mirrored in DSE by a Cassandra table.</p>

<pre>
&lt;?xml version="1.0" encoding="UTF-8" ?&gt;
&lt;schema name="datatypes_test" version="1.0"&gt;
&lt;types&gt;
    &lt;fieldType name="text" class="solr.TextField"&gt;
        &lt;analyzer&gt;
        &lt;tokenizer class="solr.StandardTokenizerFactory"/&gt;
        &lt;/analyzer&gt;
    &lt;/fieldType&gt;
    &lt;fieldType name="long" class="solr.LongField" multiValued="false"/&gt;
    &lt;fieldType name="int" class="solr.IntField" multiValued="false"/&gt;
  &lt;/types&gt;
  &lt;fields&gt;
    &lt;field name="psraid" type="long" indexed="true" stored="true"/&gt;
    &lt;field name="pial1a" type="int" indexed="true" stored="true"/&gt;
    &lt;field name="pial1b" type="int" indexed="true" stored="true"/&gt;
    &lt;field name="pial1c" type="int" indexed="true" stored="true"/&gt;
    &lt;field name="pial1d" type="int" indexed="true" stored="true"/&gt;
    &lt;field name="pial4vb" type="text" indexed="true" stored="true"/&gt;
    &lt;field name="pial7vb" type="text" indexed="true" stored="true"/&gt;
  &lt;/fields&gt;
  &lt;defaultSearchField&gt;pial4vb&lt;/defaultSearchField&gt;
  &lt;uniqueKey&gt;psraid&lt;/uniqueKey&gt;
&lt;/schema&gt;</pre>

<p>Lastly, we are going to use the solrconfig.xml from the Wikipedia demo that ships with DataStax Enterprise.</p>

<pre>
cp dse/demos/wikipedia/solrconfig.xml .</pre>

<h3>Solr</h3>

<p>First, we create the keyspace that will store our survey data, setting the replication strategy and options so that the data is available in both the Solr DC and the Analytics DC. By default, DSE Solr would store data only in the Solr DC.</p>

<pre>
$ cqlsh
Connected to blog at localhost:9160.
[cqlsh 2.2.0 | Cassandra 1.1.9.8 | CQL spec 2.0.0 | Thrift protocol 19.33.0]
Use HELP for help.
cqlsh&gt; create KEYSPACE answers WITH strategy_class = 'NetworkTopologyStrategy' and strategy_options:Solr=1 and strategy_options:Analytics=1;</pre>
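<p>If you script your cluster setup, the statement above can be generated from a map of datacenter names to replica counts. The following is only a sketch: the helper name create_keyspace_cql is made up for this post, and it emits the older CQL 2-style syntax shown in the cqlsh session rather than anything newer.</p>

```python
def create_keyspace_cql(keyspace, replicas_per_dc):
    """Build a CQL 2-style CREATE KEYSPACE statement like the one above.

    replicas_per_dc maps a datacenter name to its replica count.
    (Hypothetical helper, not part of DSE or any driver.)
    """
    if not replicas_per_dc:
        raise ValueError("at least one datacenter is required")
    # One strategy_options entry per datacenter, joined with 'and'.
    options = " and ".join(
        f"strategy_options:{dc}={count}" for dc, count in replicas_per_dc.items()
    )
    return (
        f"create KEYSPACE {keyspace} "
        f"WITH strategy_class = 'NetworkTopologyStrategy' and {options};"
    )


# One replica in each virtual datacenter, as in this demo.
statement = create_keyspace_cql("answers", {"Solr": 1, "Analytics": 1})
print(statement)
```

<p>Leaving Analytics out of the map would reproduce the default behaviour described above, where the data lives only in the Solr DC and Hadoop has nothing local to read.</p>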

<p>Now we can upload the solrconfig.xml and answers_schema.xml files to DSE Solr. This process automatically creates a column family named fbsurvey in the answers keyspace, along with the columns and the appropriate metadata.</p>

<pre>
$ curl http://localhost:8983/solr/resource/answers.fbsurvey/solrconfig.xml --data-binary @solrconfig.xml -H 'Content-type:text/xml; charset=utf-8'
SUCCESS

$ curl http://localhost:8983/solr/resource/answers.fbsurvey/schema.xml --data-binary @answers_schema.xml -H 'Content-type:text/xml; charset=utf-8'
SUCCESS

$ curl "http://localhost:8983/solr/admin/cores?action=CREATE&amp;name=answers.fbsurvey"

&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;response&gt;
&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;1612&lt;/int&gt;&lt;/lst&gt;
&lt;/response&gt;</pre>
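<p>The same three calls can be prepared from Python with only the standard library. This is a rough sketch, assuming the host, port, and core name from the curl commands above; the function names are invented for illustration, and the requests are only built here, not sent.</p>

```python
from urllib.request import Request

SOLR_BASE = "http://localhost:8983/solr"  # same host and port as the curl calls
CORE = "answers.fbsurvey"                 # keyspace.column_family


def resource_upload(resource_name, body):
    """Build (but do not send) a POST that uploads one Solr resource file.

    body is the raw bytes of the XML file, like curl's --data-binary.
    """
    return Request(
        f"{SOLR_BASE}/resource/{CORE}/{resource_name}",
        data=body,
        headers={"Content-type": "text/xml; charset=utf-8"},
        method="POST",
    )


def core_create():
    """Build the GET request that asks Solr to create the core."""
    return Request(f"{SOLR_BASE}/admin/cores?action=CREATE&name={CORE}")


# Placeholder body for illustration; in practice read answers_schema.xml.
schema_request = resource_upload("schema.xml", b"<schema/>")
print(schema_request.get_method(), schema_request.full_url)
print(core_create().full_url)
```

<p>Passing each request to urllib.request.urlopen would send it. The order matters in the same way as with curl: upload solrconfig.xml and schema.xml first, then create the core.</p>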

<p>Now we can upload the survey CSV data and have Solr process it and store it back into Cassandra. A quick count shows the number of records, and selecting a row confirms that the data transferred over.</p>

<pre>
$ curl http://localhost:8983/solr/answers.fbsurvey/update --data-binary @Omnibus_Dec_2012_csv.csv -H 'Content-Type:application/csv; charset=utf-8'

&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;response&gt;
&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;2553&lt;/int&gt;&lt;/lst&gt;
&lt;/response&gt;

$ cqlsh
Connected to blog at localhost:9160.
[cqlsh 2.2.0 | Cassandra 1.1.9.8 | CQL spec 2.0.0 | Thrift protocol 19.33.0]
Use HELP for help.
cqlsh&gt; select count(*) from answers.fbsurvey;
 count
-------
 1006

cqlsh&gt; select * from answers.fbsurvey limit 1;
 KEY | _docBoost | pial1a | pial1b | pial1c | pial1d | pial4vb | pial7vb
--------+-----------+--------+--------+--------+--------+---------------------------------------------------------------+---------
 201734 | 1.0 | 2 | 2 | 1 | 1 | WASNT INTERESTED. TAKING ME AWAY FROM SOCIAL LIFE AND FAMILY. |</pre>

<p>Now we can search using Solr's HTTP API and find out how many people mentioned COMPUTER or FAMILY in their response to why they stopped using Facebook.<br />
The query I'm using here has some added parameters: indent=true indents the response for us, and fl limits the output to the two columns I'm interested in looking at, psraid (the ID) and pial4vb (the person's response).</p>

<pre>
$ curl "http://localhost:8983/solr/answers.fbsurvey/select/?q=pial4vb:(COMPUTER%20OR%20FAMILY)&amp;indent=true&amp;fl=psraid,pial4vb"

&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;response&gt;
&lt;result name="response" numFound="3" start="0"&gt;
  &lt;doc&gt;
    &lt;long name="psraid"&gt;102113&lt;/long&gt;
    &lt;arr name="pial4vb"&gt;
      &lt;str&gt;NO COMPUTER&lt;/str&gt;
    &lt;/arr&gt;&lt;/doc&gt;
  &lt;doc&gt;
    &lt;long name="psraid"&gt;201382&lt;/long&gt;
    &lt;arr name="pial4vb"&gt;
      &lt;str&gt;NO COMPUTER&lt;/str&gt;
    &lt;/arr&gt;&lt;/doc&gt;
  &lt;doc&gt;
    &lt;long name="psraid"&gt;201734&lt;/long&gt;
    &lt;arr name="pial4vb"&gt;
      &lt;str&gt;WASNT INTERESTED. TAKING ME AWAY FROM SOCIAL LIFE AND FAMILY.&lt;/str&gt;
    &lt;/arr&gt;&lt;/doc&gt;
&lt;/result&gt;
&lt;/response&gt;</pre>

<p>No computer? Ouch.</p>
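<p>If you want to consume these results from a program rather than read the XML by eye, something like the following could work. It is a minimal sketch using only the Python standard library: select_url and parse_response are names made up for this post, the URL assumes the same localhost endpoint as the curl call, and the sample XML is a compacted copy of the response shown above. Note that urlencode escapes the query differently from the hand-written URL (for example, spaces become +), which should be equivalent to Solr.</p>

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def select_url(core, query, fields):
    """Build a Solr select URL with an indented response and a field list."""
    params = urlencode({"q": query, "indent": "true", "fl": ",".join(fields)})
    return f"http://localhost:8983/solr/{core}/select/?{params}"


def parse_response(xml_text):
    """Return (numFound, [(psraid, answer), ...]) from a Solr XML response."""
    result = ET.fromstring(xml_text).find("result")
    docs = []
    for doc in result.findall("doc"):
        psraid = int(doc.find("long[@name='psraid']").text)
        # pial4vb comes back as an array holding a single string.
        answer = doc.find("arr[@name='pial4vb']/str").text
        docs.append((psraid, answer))
    return int(result.get("numFound")), docs


# Compacted copy of the response shown earlier in this post.
SAMPLE = """<response>
<result name="response" numFound="3" start="0">
  <doc><long name="psraid">102113</long>
    <arr name="pial4vb"><str>NO COMPUTER</str></arr></doc>
  <doc><long name="psraid">201382</long>
    <arr name="pial4vb"><str>NO COMPUTER</str></arr></doc>
  <doc><long name="psraid">201734</long>
    <arr name="pial4vb"><str>WASNT INTERESTED. TAKING ME AWAY FROM SOCIAL LIFE AND FAMILY.</str></arr></doc>
</result>
</response>"""

url = select_url("answers.fbsurvey", "pial4vb:(COMPUTER OR FAMILY)", ["psraid", "pial4vb"])
found, docs = parse_response(SAMPLE)
print(url)
print(found, [psraid for psraid, _ in docs])
```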

<h3>Hadoop</h3>

<p>Now we hop over to our Hadoop node so we can run some MapReduce jobs over the data that we've imported via Solr. In this example we will use Hive, whose SQL-like syntax will be familiar to many of you and makes MapReduce easy to use. We can reference the data in Cassandra by using the name of the keyspace as our database and the name of the column family as our table, in SQL parlance. Let's see who answered yes to owning an e-reader and gave a substantial response (longer than 20 characters) as to why they don't use Facebook anymore.</p>

<pre>
$ dse hive
Logging initialized using configuration in file:/home/automaton/dse/resources/hive/conf/hive-log4j.properties
 Hive history file=/tmp/automaton/hive_job_log_automaton_201307161535_1235802152.txt
 hive&gt; use answers;
 hive&gt; select row_key from fbsurvey where pial1a=1 and length(pial4vb) &gt; 20;
...
Ended Job = job_201307151605_0036
MapReduce Jobs Launched: 
Job 0: Map: 3 Cumulative CPU: 3.23 sec HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 230 msec
OK
101596
100454
102582
100223
100822
100161
200933
101334
200495
100032
200694
Time taken: 22.862 seconds
hive&gt;</pre>
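<p>For readers less familiar with SQL, the WHERE clause in that query amounts to a simple per-row filter. The sketch below restates it in plain Python over a handful of in-memory rows. It is only an illustration of the predicate, not how Hive or DSE executes the job, and apart from the first row (taken from the cqlsh output earlier) the rows are made up.</p>

```python
def ereader_owners_with_long_answers(rows, min_length=20):
    """Mirror the Hive WHERE clause: pial1a = 1 and length(pial4vb) > 20.

    A missing answer (None) never passes the length test, which matches
    how a NULL comparison behaves in the Hive query.
    """
    return [
        row["row_key"]
        for row in rows
        if row["pial1a"] == 1 and len(row["pial4vb"] or "") > min_length
    ]


rows = [
    # From the cqlsh output earlier: pial1a is 2 (no e-reader), so it is filtered out.
    {"row_key": 201734, "pial1a": 2,
     "pial4vb": "WASNT INTERESTED. TAKING ME AWAY FROM SOCIAL LIFE AND FAMILY."},
    # The rows below are invented for illustration.
    {"row_key": 1, "pial1a": 1, "pial4vb": "TOO MUCH DRAMA AND NOT ENOUGH TIME"},
    {"row_key": 2, "pial1a": 1, "pial4vb": "NO COMPUTER"},
    {"row_key": 3, "pial1a": 1, "pial4vb": None},
]

matches = ereader_owners_with_long_answers(rows)
print(matches)
```

<p>Only the invented row with key 1 passes both conditions: it has pial1a equal to 1 and an answer longer than 20 characters.</p>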

<h3>Summary</h3>

<p>This example is just the tip of the iceberg of what you can do with Cassandra, Solr, and Hadoop. In DataStax Enterprise, your data can be used however you see fit without having to wait for or worry about ETL. I glossed over a lot of concepts about how Hadoop and Solr tie into Cassandra in this demonstration, so if you want to know more, continue on to the additional reading. If you want to try DataStax Enterprise yourself, download it from this link.</p>

<p>Additional Reading<br />
DataStax Enterprise Hadoop</p>

<p>DataStax Enterprise Search</p>


Look Ma! No ETL!
