DataStax Developer Blog

How Much Faster is Spark than Hadoop in DataStax Enterprise?

By Robin Schumacher - July 10, 2014

Way back in version 1.0 of DataStax Enterprise (DSE), we supplied built-in capabilities to run batch analytics on Cassandra data with Hadoop MapReduce, Hive, Pig, and Mahout. With our recent 4.5 release of DSE, we integrated Spark into the platform, which lets you run near real-time analytics on Cassandra.

How much faster is Spark than traditional Hadoop for real-time query use cases on Cassandra data? Our internal benchmarks have shown a wide range of speed differences between the two (all favoring Spark). Such tests give a general idea of the benefits of Spark, but we wanted to give you an easy way to test things out for yourself.

New Internet-of-Things Demo in DSE

DSE already ships with a number of bundled demos that showcase various features of the platform, such as enterprise search. Because many customers use DSE and Cassandra for Internet of Things (IoT)/sensor applications, a couple of our talented engineers decided to create a new IoT demo application that lets you easily compare the new Spark analytics with the batch Hadoop analytics DSE has had for a while now.

You can install DSE 4.5 and run the new demo – which simulates a weather sensor collection system and analytics application – on your laptop or on multiple nodes in a database cluster. Just follow the simple setup and configuration instructions in our online documentation and then launch the Web-based application interface.

The home page of the demo application contains an introduction to the app and usage instructions. The Near Real-Time Reports section lets you graphically view key weather analytics for a particular geographic region using Spark:

[Figure: weather demo graph from the Near Real-Time Reports section]

The Sample Live Queries section of the app lets you compare Spark vs. traditional Hadoop Hive query response times by just selecting which engine you want to process your request. For example, running the first query twice through Spark produces the following results on my semi-old Mac:

[Figure: first Spark query run]

[Figure: second Spark query run]

As can be seen, the first Spark run hits disk, while the second utilizes cached data for a faster response time. By comparison, the same query on standard Hive takes quite a bit longer to complete, with subsequent Hive query runs producing no substantial reduction in response times:

[Figure: Hive query run]

In this simple comparison, Spark is nearly 6x faster than Hive when reading from disk and 16x faster when using memory and cached data.
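
The ratios quoted above are simply the Hive response time divided by the Spark response time. Here is a minimal Python sketch of that arithmetic, which you could reuse with timings from your own runs. The timings in it are hypothetical placeholders chosen for illustration, not the measurements from the demo screenshots, and the helper name `speedup` is likewise an assumption:

```python
def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline_seconds / candidate_seconds


# Hypothetical response times in seconds (NOT the values from the demo).
hive_run = 48.0          # same query through Hive
spark_cold_run = 8.0     # first Spark run, reading from disk
spark_cached_run = 3.0   # second Spark run, served from cached data

cold_gain = speedup(hive_run, spark_cold_run)
cached_gain = speedup(hive_run, spark_cached_run)

print(f"Spark (disk):   {cold_gain:.1f}x faster than Hive")
print(f"Spark (cached): {cached_gain:.1f}x faster than Hive")
```

Replace the three placeholder values with the response times the demo reports on your own hardware to get comparable figures.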

You can also run custom queries, rather than the canned sample live queries, on the Custom Live Queries page by selecting the criteria you’d like, clicking “Recalculate Query,” and submitting the query:

[Figure: weather demo Custom Live Queries page]

Lastly, the BYOH Live Queries page can optionally be used to test and benchmark our new integration with external Hadoop platforms from Cloudera and Hortonworks.

Conclusions

Spark serves as a nice complement to Cassandra for running analytics on operational data. How much faster will Spark be for you on Cassandra than standard Hadoop? Download DataStax Enterprise 4.5 today and give our new Spark integration a try on your own database cluster.

Note that you can control how much demo data is loaded into Cassandra (vs. the defaults used in our standard setup scripts); see the README.dev.md file that’s in the weather sensors demo directory for instructions. For more information on running Spark analytics on Cassandra in DSE, see our online documentation.


