Technology | July 16, 2013

Look Ma! No ETL!

Hamilton Tran

Introduction

Solr and Hadoop are two big open source technologies that we have integrated into DataStax Enterprise on top of Cassandra. For those just joining us, Solr provides full-text search, and Hadoop provides a distributed file system and the ability to process large datasets via MapReduce. In the traditional world, if you wanted to run MapReduce over some data and also search that same data, you would have to ETL the data over to your Solr cluster, with all the pitfalls of trying to keep two clusters in sync. The beauty of DataStax Enterprise is that with the right replication settings you can search and run MapReduce over the same dataset with ease. In this example I'll be using a modified dataset from a survey done by The Pew Research Center about Facebook habits and attitudes.

Environment

This demonstration was run on an EC2 cluster: two m1.large Ubuntu 12.04 instances with a binary install of DSE 3.0.4.

[Figure: ring output for the cluster]

The cluster has been set up with two virtual datacenters, or DCs: an Analytics DC with a node running Hadoop, and a Solr DC with a node running Solr.
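
For reference, here's roughly how each node was started. This is a sketch assuming a binary tarball install, so your paths may differ; with DSE's default DseSimpleSnitch, each node is placed in a datacenter named after its workload, which is where the Analytics and Solr DC names come from.

# On the Analytics node: start DSE with the Hadoop job and task trackers enabled
$ dse cassandra -t

# On the Solr node: start DSE with Solr enabled
$ dse cassandra -s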

Files

To begin we need to get the survey file: Omnibus_Dec_2012_csv
I've modified this survey file from the original by removing many of the columns; our primary focus will be two columns, pial1a and pial4vb, which map to these two questions:

PIAL1A    As I read the following list of items, please tell me if you happen to have each one, or not.  Do you have... [INSERT ITEMS IN ORDER]?
a.        A handheld device made primarily for e-book reading, such as a Nook or Kindle e-reader
1        Yes
2        No
8        (DO NOT READ) Don’t know
9        (DO NOT READ) Refused

PIAL4vb    What made you decide to stop using Facebook?

Second, we need to create a Solr schema file so that DSE Solr understands how to import, index, and store the data in Cassandra. Copy and paste the following into a file called answers_schema.xml. This schema tells Solr how to index our documents, and will be mirrored in DSE by a Cassandra table.

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="datatypes_test" version="1.0">
  <types>
    <fieldType name="text" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="long" class="solr.LongField" multiValued="false"/>
    <fieldType name="int" class="solr.IntField" multiValued="false"/>
  </types>
  <fields>
    <field name="psraid" type="long" indexed="true" stored="true"/>
    <field name="pial1a" type="int" indexed="true" stored="true"/>
    <field name="pial1b" type="int" indexed="true" stored="true"/>
    <field name="pial1c" type="int" indexed="true" stored="true"/>
    <field name="pial1d" type="int" indexed="true" stored="true"/>
    <field name="pial4vb" type="text" indexed="true" stored="true"/>
    <field name="pial7vb" type="text" indexed="true" stored="true"/>
  </fields>
  <defaultSearchField>pial4vb</defaultSearchField>
  <uniqueKey>psraid</uniqueKey>
</schema>

And lastly, we'll reuse the solrconfig.xml from the wikipedia demo that ships with DataStax Enterprise.

cp dse/demos/wikipedia/solrconfig.xml .

Solr

First we'll create the keyspace to store our survey data, setting the replication strategy and options so that the data is available in both the Solr DC and the Analytics DC. By default, DSE Solr would store data only in the Solr DC.

$ cqlsh
Connected to blog at localhost:9160.
[cqlsh 2.2.0 | Cassandra 1.1.9.8 | CQL spec 2.0.0 | Thrift protocol 19.33.0]
Use HELP for help.
cqlsh> create KEYSPACE answers WITH strategy_class = 'NetworkTopologyStrategy' and strategy_options:Solr=1 and strategy_options:Analytics=1;
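
To double-check that the keyspace picked up both datacenters, you can describe it from cqlsh (output omitted here; it should echo back strategy_options with Solr=1 and Analytics=1):

cqlsh> describe keyspace answers;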

Now we can upload the solrconfig and answers_schema XML files to DSE Solr. This will automatically create a column family named fbsurvey under the answers keyspace, along with the columns and the appropriate metadata.

$ curl http://localhost:8983/solr/resource/answers.fbsurvey/solrconfig.xml --data-binary @solrconfig.xml -H 'Content-type:text/xml; charset=utf-8'
SUCCESS

$ curl http://localhost:8983/solr/resource/answers.fbsurvey/schema.xml --data-binary @answers_schema.xml -H 'Content-type:text/xml; charset=utf-8'
SUCCESS

$ curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=answers.fbsurvey"

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1612</int></lst>
</response>

Now we can upload the survey CSV data and have Solr process it and store it back into Cassandra. We can do a quick count to see the number of records and check that the data transferred over.

$ curl http://localhost:8983/solr/answers.fbsurvey/update --data-binary @Omnibus_Dec_2012_csv.csv -H 'Content-Type:application/csv; charset=utf-8'

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">2553</int></lst>
</response>
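
A note on the file format: Solr's CSV update handler maps columns to schema fields using the file's header row, so the first line must name the fields exactly as they appear in answers_schema.xml. A hypothetical first two lines (the actual survey values will differ):

psraid,pial1a,pial1b,pial1c,pial1d,pial4vb,pial7vb
100001,1,2,2,1,,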

$ cqlsh
Connected to blog at localhost:9160.
[cqlsh 2.2.0 | Cassandra 1.1.9.8 | CQL spec 2.0.0 | Thrift protocol 19.33.0]
Use HELP for help.
cqlsh> select count(*) from answers.fbsurvey;
 count
-------
 1006

cqlsh> select * from answers.fbsurvey limit 1;
 KEY    | _docBoost | pial1a | pial1b | pial1c | pial1d | pial4vb                                                       | pial7vb
--------+-----------+--------+--------+--------+--------+---------------------------------------------------------------+---------
 201734 |       1.0 |      2 |      2 |      1 |      1 | WASNT INTERESTED. TAKING ME AWAY FROM SOCIAL LIFE AND FAMILY. |

Now we can search using Solr's HTTP API and find out how many people mentioned a COMPUTER or FAMILY in their response to why they stopped using Facebook.
The query I'm using here has some added parameters: indent=true pretty-prints the response, and fl restricts the output to the two fields I'm interested in, psraid (the id) and pial4vb (the person's response).

$ curl "http://localhost:8983/solr/answers.fbsurvey/select/?q=pial4vb:(COMPUTER%20OR%20FAMILY)&indent=true&fl=psraid,pial4vb"

<?xml version="1.0" encoding="UTF-8"?>
<response>
<result name="response" numFound="3" start="0">
  <doc>
    <long name="psraid">102113</long>
    <arr name="pial4vb">
      <str>NO COMPUTER</str>
    </arr></doc>
  <doc>
    <long name="psraid">201382</long>
    <arr name="pial4vb">
      <str>NO COMPUTER</str>
    </arr></doc>
  <doc>
    <long name="psraid">201734</long>
    <arr name="pial4vb">
      <str>WASNT INTERESTED. TAKING ME AWAY FROM SOCIAL LIFE AND FAMILY.</str>
    </arr></doc>
</result>
</response>

No computer? Ouch.
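
Since Solr is handling the indexing, we also get features like faceting for free. As a quick sketch, a query like this would return per-value counts for pial1a (e-reader ownership) straight from Solr, with rows=0 suppressing the individual documents (actual counts omitted here):

$ curl "http://localhost:8983/solr/answers.fbsurvey/select/?q=*:*&rows=0&facet=true&facet.field=pial1a&indent=true"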

Hadoop

Now we hop over to our Hadoop node so we can run some MapReduce jobs over the data we imported via Solr. In this example we'll use Hive, which offers a SQL-like syntax that many of you will be familiar with and makes MapReduce easy to use. We can reference the data in Cassandra by using the keyspace name as our database and the column family name as our table, in SQL parlance. Let's see who answered yes to owning an e-reader and gave a substantial response as to why they don't use Facebook anymore.

$ dse hive
Logging initialized using configuration in file:/home/automaton/dse/resources/hive/conf/hive-log4j.properties
Hive history file=/tmp/automaton/hive_job_log_automaton_201307161535_1235802152.txt
hive> use answers;
hive> select row_key from fbsurvey where pial1a=1 and length(pial4vb) > 20;
...
Ended Job = job_201307151605_0036
MapReduce Jobs Launched: 
Job 0: Map: 3 Cumulative CPU: 3.23 sec HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 230 msec
OK
101596
100454
102582
100223
100822
100161
200933
101334
200495
100032
200694
Time taken: 22.862 seconds
hive>
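
Hive handles aggregations as well as filters. As a quick sketch, grouping the same table by pial1a would tally respondents per answer code (1 = Yes, 2 = No, 8 = Don't know, 9 = Refused):

hive> select pial1a, count(*) from fbsurvey group by pial1a;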

Summary

This example is just the tip of the iceberg of what you can do with Cassandra, Solr, and Hadoop. In DataStax Enterprise your data can be used however you see fit, without having to wait for or worry about ETL. I glossed over a lot of concepts about how Hadoop and Solr tie into Cassandra in this demonstration, but if you want to know more, continue on to the additional reading. If you want to try DataStax Enterprise yourself, download it here.

Additional Reading
DataStax Enterprise Hadoop

DataStax Enterprise Search
