DataStax Blog

Why Hadoop and Solr in DataStax Enterprise?

By Robin Schumacher -  June 18, 2013 | 3 Comments

At our recent Cassandra Summit, I had the privilege of speaking with a number of customers who are using DataStax Enterprise for their always-on, line-of-business (LOB) applications. In addition to using Cassandra for their real-time transactional data, many customers are using our Hadoop component for batch analytics and Solr option for enterprise search needs.

Those new to DataStax often wonder why we include Hadoop and Solr in our enterprise NoSQL platform (along with our production-certified version of Cassandra), and especially question what our intentions are with Hadoop. I’m often asked, “Are you trying to be a rival to Cloudera or Hortonworks?”

My answer? Not at all.

Let me explain why we include both Hadoop and Solr in DataStax Enterprise and how we relate to the Clouderas and Hortonworks of the world.

A Look Back

The legacy RDBMS world contains different database platforms for different types of workloads. On the one side, you have databases like Oracle, SQL Server, MySQL and others aimed at servicing LOB applications. That data is then fed into data warehouses for analysis, where vendors such as Teradata and various columnar databases live. This ‘workload divide’ is no doubt very familiar to you:

[Figure: legacy RDBMS divide]

Even though these workloads are separated for various reasons, they share common needs and functionality. The LOB user needs to perform certain analytic operations on the data in their transactional systems, so LOB databases support things such as ROLAP and other analytic functions (e.g. rank, pivot, partition by, windowing) to meet that need. Further, the LOB user wants to search their data, so databases such as Oracle and SQL Server have included a full-text search option for a long time.

On the data warehouse side, these same needs exist. Of course warehouse users need analytic functionality and, naturally, they also want to search their data. This being the case, data warehouse vendors include such functionality.

[Figure: legacy RDBMS divide details]

So what we’ve had in the past is an application/workload division that shares common functional requirements and is supported by different data management vendors. Let’s now look at what’s happening in the present.

Fast Forward to Today

Today’s modern LOB applications have new data requirements that have proven to exceed the capabilities offered by legacy RDBMSs. Such apps need a more flexible data model for handling all types of data, and they have data coming in from multiple locations at high speed. They must support reading and writing data in multiple locations at the same time (including multiple data centers and cloud availability zones). They can never go down, and they must scale so that the app is future-proof from a capacity standpoint.

Enter NoSQL and Apache Cassandra. Cassandra was architected from the ground up for servicing modern LOB applications that have these types of requirements, which is why it’s replacing databases like Oracle at a rapid clip.

On the data analysis side, Hadoop has disrupted the legacy data warehouse vendors with its ability to handle all types of data structures and to provide the type of massively parallel processing (MPP) offered by legacy data warehouse databases, but more cost-efficiently, through the use of commodity hardware and open source software.

So although the types of technology (NoSQL and Hadoop) and vendor names have changed, the same application/workload divide exists in today’s big data world where LOB and data warehouse applications are concerned:

[Figure: NoSQL/Hadoop divide]

Something else that’s carried over from yesterday’s RDBMS world to today’s modern applications is the need to perform analytics and search in both LOB and data warehouse applications. However, the types of technology utilized for both have changed along with the core data management platforms.

For analytics, tools such as MapReduce, Hive, Pig, and Mahout are used. For search, technologies like Solr and ElasticSearch are utilized, which are far superior in many ways to the full-text search options found in RDBMSs.

So the details of today’s application divide look something like this:

[Figure: NoSQL/Hadoop divide details]

Something Borrowed, Something New

So on the one hand, some things remain the same between legacy RDBMS and today’s NoSQL applications. In the same way you would never run your LOB applications on something like Teradata, you don’t run them on Hadoop.

In addition, you certainly still need to perform analytics and search on both LOB systems and data warehouses, so that functionality needs to be present on the NoSQL side of the house as well as the Hadoop data warehouse side (which is why you’ve seen some of the Hadoop vendors now include Solr in their technology mix).

But on the other hand, because the data requirements of modern applications are different today, the types of technology and data management vendors have also changed, on both the LOB and data warehouse sides. The technology used for analytics and search in modern LOB and data warehouse systems has changed along with them.

This is why we implement Hadoop and Solr with Cassandra in DataStax Enterprise. DataStax Enterprise is directly targeted at today’s LOB applications that need to run analytics and search on their real-time NoSQL databases. We’re happy to leave the modern Hadoop data warehouses for Cloudera and Hortonworks to service.

Customers such as eBay (who replaced Oracle in a number of places with DataStax Enterprise) use our Hadoop analytics to analyze many things, such as bidder/seller interaction and product popularity. Other customers, like HealthCare Anytime, use analytics to understand unstructured data such as doctors’ notes so they can properly bill back Medicare and Medicaid. All of this is done on their real-time Cassandra data that’s transparently fed into DataStax Enterprise Hadoop nodes.

An example of a customer using our Solr integration is Datafiniti, which acts as a search engine for data. They ingest large volumes of data into Cassandra and then allow their customers to easily search that real-time data via Solr so they can find exactly what information they want to buy.

Hopefully these customer examples, along with the background information above, help explain why we have Hadoop and Solr integrated into our enterprise NoSQL platform with Cassandra and how we relate to the major Hadoop vendors. To learn more about DataStax Enterprise, please visit our white papers, documentation, and download pages.



Comments

  1. Bhaskar says:

    Hi,
    We are planning to set up Cassandra for one of our clients; they already have a Hortonworks Hadoop cluster.
    Does Cassandra support an existing Hortonworks Hadoop installation?

    Thanks
    Bhaskar

  2. Robin Schumacher says:

    You can connect to and query Cassandra data via Hortonworks (HW), but HW is not supported as a fully integrated piece inside of DSE.

    1. Bhaskar says:

      If I am correct, DSE doesn’t fully support any Hadoop distribution other than DataStax’s own?
