Providing more value with DataStax-Spark integration

By Pavan Venkatesh -  July 1, 2014 | 6 Comments

DataStax Enterprise (DSE) 4.5 provides integration with Apache Spark™ and Shark, further enhancing DSE’s analytical capabilities. Everything works out of the box with the new integration, allowing you to easily run Spark or Shark analytical queries directly on Apache Cassandra™ data.

Note: This blog post was written for DSE 4.5, which shipped with Apache Spark™ 0.9.1. If you run a different version of DSE, please refer to the DataStax documentation for that version.

Apache Spark™ is an in-memory computing framework that runs much faster than Apache Hadoop™ MapReduce on data stored in Apache Cassandra™, thereby improving query response times and decreasing decision latency. The basic abstraction in Spark is the Resilient Distributed Dataset (RDD), a collection of objects distributed across several machines. RDDs let programmers perform in-memory computations on large clusters in a fault-tolerant manner (machine failures are handled automatically). Spark can also cache data in memory and access it repeatedly, as opposed to reading from and writing to disk every time.
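
To make the caching and fault-tolerance idea concrete, here is a toy, pure-Python sketch. It is not Spark's API: `MiniRDD`, `read_from_store` and every method name below are hypothetical, chosen only to show how a dataset can remember the recipe that built it, keep a copy in memory, and rebuild that copy if it is lost.

```python
from typing import Callable, List, Optional


class MiniRDD:
    """Toy stand-in for an RDD: partitions plus a recipe to rebuild them."""

    def __init__(self, compute: Callable[[], List[list]]):
        self._compute = compute            # lineage: how to (re)build the partitions
        self._cached: Optional[List[list]] = None
        self._cache_enabled = False

    def map(self, fn: Callable) -> "MiniRDD":
        # Lazy: nothing runs until collect() asks for the data.
        return MiniRDD(lambda: [[fn(x) for x in part] for part in self.partitions()])

    def cache(self) -> "MiniRDD":
        # Ask for the partitions to be kept in memory once computed.
        self._cache_enabled = True
        return self

    def partitions(self) -> List[list]:
        if self._cached is not None:
            return self._cached            # served from memory
        result = self._compute()           # replay the lineage
        if self._cache_enabled:
            self._cached = result
        return result

    def collect(self) -> list:
        return [x for part in self.partitions() for x in part]

    def evict(self) -> None:
        """Simulate losing the in-memory copy, e.g. after a machine failure."""
        self._cached = None


reads = {"count": 0}


def read_from_store() -> List[list]:
    # Hypothetical slow read from the underlying data store.
    reads["count"] += 1
    return [[1, 2], [3, 4]]


squares = MiniRDD(read_from_store).map(lambda x: x * x).cache()
first = squares.collect()    # reads the store once
second = squares.collect()   # answered from the cached copy
squares.evict()
third = squares.collect()    # rebuilt from the lineage, so the store is read again
```

In this sketch the second `collect()` never touches the store, and the third one recovers the same result after the cached copy is dropped. Real RDDs track lineage per partition across a cluster; the toy version only mirrors the idea.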

Shark is a SQL-like tool, fully compatible with Apache Hive™, that can run queries on Cassandra data faster than Apache Hive. Shark builds on a fault-tolerant distributed memory abstraction. When caching is enabled, Shark can also persist the results of a query in memory (stored in columnar fashion), which leads to faster query response times.
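
As a rough illustration of why a columnar in-memory layout suits analytical queries, the plain-Python sketch below pivots rows into one list per column. It is not Shark's implementation; `to_columns` and the sample field names are made up for this example.

```python
from typing import Any, Dict, List


def to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot row dictionaries into one list per column.

    Assumes every row has the same keys.
    """
    columns: Dict[str, List[Any]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Hypothetical query result held in memory.
rows = [
    {"page": "/home", "clicks": 12},
    {"page": "/docs", "clicks": 30},
    {"page": "/blog", "clicks": 8},
]
cached = to_columns(rows)

# An aggregate over one column only touches that column's list.
total_clicks = sum(cached["clicks"])  # 50
```

Because each column is stored contiguously, a query that aggregates `clicks` does not need to walk the `page` values at all.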

The new Apache Spark™ integration and analytics option opens up many additional use cases for DataStax Enterprise, including time-sensitive applications such as click prediction, spam filtering, sensor data processing and fraud detection.

Additionally, you can combine Spark’s in-memory analytics with DSE’s OLTP in-memory option and keep your OLTP and analytics operations fully contained in memory for the ultimate speed and performance.

Let’s quickly walk through these new analytical capabilities in DSE 4.5 and see how easy they are to enable.

Built-In Workload Isolation and Management

DataStax Enterprise 4.5 extends workload isolation capabilities to Apache Spark™ and Shark analytics, so real-time, search and analytic workloads/nodes do not compete for data or compute resources. Everything is managed automatically by DSE without any user intervention. In the OpsCenter UI, Cassandra, Search and Analytics (Spark) nodes appear in different logical rings (DCs), and their workloads are isolated as well.

High Availability Built-In

DSE offers a built-in high availability solution for Spark: a reserved Spark Master/job tracker can take over if the original Spark Master process fails (for example, due to a node failure). This ensures continuity of your analytic operations without any need to run your queries again.
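
The takeover rule can be pictured with a small Python sketch. This is only a conceptual model, not how DSE implements failover; `MasterPair` and the node names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class MasterPair:
    primary: str   # node currently designated as Spark Master
    reserve: str   # node designated to take over

    def active(self, alive: Dict[str, bool]) -> Optional[str]:
        """Return the node that should act as master, or None if both are down."""
        for node in (self.primary, self.reserve):
            if alive.get(node, False):
                return node
        return None


masters = MasterPair(primary="analytics-1", reserve="analytics-2")

# Both nodes healthy: the primary stays in charge.
normal = masters.active({"analytics-1": True, "analytics-2": True})

# Primary down: the reserve takes over.
failover = masters.active({"analytics-1": False, "analytics-2": True})
```

Here `normal` is `"analytics-1"` and `failover` is `"analytics-2"`, mirroring the primary and reserved roles described above.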

Built-In Enterprise Security

Many enterprises have strict security and compliance regulations to ensure that only authenticated users can access corporate database systems or run analytical queries on them. DSE’s existing security mechanisms are extended to analytics: only specified users can access Cassandra data and run analytical queries.

Native Integration

DSE provides a native integration for Spark (under the Apache 2.0 license) using DataStax’s Cassandra driver for Java (2.0). The API is very simple and supports automatic conversion of types between Cassandra and Scala (including collections). It also allows user-defined classes to represent Cassandra rows.
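
The idea of mapping rows onto user-defined classes with automatic type conversion can be sketched in a few lines. The example below is plain Python for illustration only; it is not the DSE Scala API, and `SensorReading`, `row_to_object` and the column names are assumptions made up for this sketch.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Type, TypeVar, get_type_hints

T = TypeVar("T")


@dataclass
class SensorReading:
    """Hypothetical user-defined class representing one row."""
    sensor_id: str
    temperature: float
    tags: list


def row_to_object(cls: Type[T], row: Dict[str, Any]) -> T:
    """Build an instance of cls from a row, converting each column
    to the type declared on the matching field."""
    hints = get_type_hints(cls)
    kwargs = {f.name: hints[f.name](row[f.name]) for f in fields(cls)}
    return cls(**kwargs)


# A row as it might arrive from a driver: loosely typed values.
raw_row = {"sensor_id": "s-1", "temperature": "21.5", "tags": ("roof", "north")}
reading = row_to_object(SensorReading, raw_row)
```

After the call, `reading.temperature` is the float `21.5` and `reading.tags` is a list, so application code works with typed objects instead of raw rows. This simple version only handles field types that are callable converters such as `str`, `float` and `list`.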

Expert Support

The DSE integration with Apache Spark™ and Shark comes with 24/7/365 expert support from DataStax.

On the whole, DataStax Enterprise provides a complete, robust and first-class integration for Apache Spark™ and Shark that further enhances DSE’s analytics capabilities, so you can run very fast ad-hoc queries on Apache Cassandra™ data and make quicker decisions.

A Quick Tutorial

Apache Spark™ and Shark are tightly integrated and packaged in DSE, so everything works out of the box. Simply install DSE 4.5 using our new installer or through the regular installation process.

Apache Spark™ is enabled by default when DSE 4.5 is started in “Analytics” mode.

The following walks you through enabling Apache Spark™ and Shark in DSE. The instructions below assume you’ve already installed DSE nodes and formed a cluster (in this case, a 4-node cluster).

  1. Start DataStax Enterprise with the following commands:

    1. pavan@ip-10-172-150-198:~/dse-4.5.0/bin$ ./dse cassandra (starts in Cassandra mode; not necessary for this demo, shown only to illustrate workload isolation)

    2. pavan@ip-10-170-130-198:~/dse-4.5.0/bin$ ./dse cassandra -s (starts in Search mode; not necessary for this demo, shown only to illustrate workload isolation)

    3. pavan@ip-10-170-106-27:~/dse-4.5.0/bin$ ./dse cassandra -k (starts Spark trackers in Analytics mode; the Spark Master runs on this node)

    4. pavan@ip-10-173-166-28:~/dse-4.5.0/bin$ ./dse cassandra -k (starts Spark trackers in Analytics mode)

A best practice is to always have Analytic nodes deployed in a separate Datacenter (or virtual DC/different availability zones).

  2. Run “./dsetool nodetool status” to check that everything is running:

  3. Let's focus on the two nodes where Analytics (with Apache Spark™) is enabled.

      1. In order to achieve high availability for Spark Master, you need to set up a reserved Spark Master. This is easily done by running the following command:

        pavan@ip-10-170-106-27:~/dse-4.5.0/bin$  ./dsetool movejt

        The output should say:
        Setting 'reserve' JT to point to /

      2. The output of “./dsetool ring” should look like this:

    In that output, a reserved tracker (RT) is assigned to the other Analytics node and takes over if the original Spark Master (JT) fails.

  4. To launch the Spark interactive shell, run the following command:

        pavan@ip-10-170-106-27:~/dse-4.5.0/bin$ ./dse spark


  5. To launch Shark, run the following command:

        pavan@ip-10-172-166-28:~/dse-4.5.0/bin$ ./dse shark


  6. Please check our online docs for a sample demo on how to run Spark and Shark queries on Cassandra data.

DSE-Spark also comes with a web UI to track:

  • Total cores and Memory used
  • Number of applications running
  • Workers - ID, address, state, cores, memory
  • Applications - ID, cores, memory per node, submitted time, user, state, duration
  • Logs (stdout, stderr)

You can enjoy the benefits of faster analytics on Apache Cassandra™ data now by downloading DSE 4.5.



  1. Deepak Nulu says:

    An earlier press release about the DataStax-Databricks partnership stated that “Partnership will deliver open source code back to the Apache Spark and Apache Cassandra communities”. So is this integration coming to the open-source version of Cassandra as well? Thanks.

  2. DuyHai DOAN says:


    The Scala driver to use Spark with Cassandra is already available as open sourced code here:

  3. Deepak Nulu says:

    @DuyHai, thanks for the link.

  4. Pavan Venkatesh says:

    Thanks DuyHai for providing the link. The blog also contains the link (click on “native integration”).

  5. Pavan Venkatesh says:

    I’ve used “./dse cassandra -k” twice (step 1) to start Spark on two nodes. The second Spark node is used for fail-over (step 3.1) in case the first Spark node fails.

    1. Shikha says:

      I'm trying to start a Hadoop + Spark node, but I’m not able to start the Spark master. Spark fails with the error: SparkMaster: Caused by: Failed to bind to : /54.xx.xx.xx:7077.

      ERROR [SPARK-MASTER-INIT] 2014-07-03 05:12:58,238 (line 104) SparkMaster threw exception in state RUNNING:
      java.lang.Exception: Script execution failed with exit code = 1
      at com.datastax.bdp.spark.AbstractSparkRunner.runService(
      at com.datastax.bdp.spark.AbstractSparkRunner.runService(

      Anyone have any clue?


