Providing more value with DataStax-Spark integration
DataStax Enterprise (DSE) 4.5 provides integration with Spark/Shark, further enhancing DSE’s analytical capabilities. The integration works out of the box, allowing you to run Spark/Shark analytical queries directly on Cassandra data.
Spark is an in-memory computing framework that can run much faster than Hadoop’s MapReduce on data stored in Cassandra, which improves query response times and shortens decision latency. The basic abstraction in Spark is the Resilient Distributed Dataset (RDD), a collection of objects distributed across several machines. RDDs let programmers perform in-memory computations on large clusters in a fault-tolerant manner, automatically recovering from machine failures. Spark can also cache data in memory and access it repeatedly, instead of reading from and writing to disk every time.
Shark is a SQL-like tool that is fully compatible with Apache Hive and can run queries on Cassandra data faster than Hive. Shark builds on a fault-tolerant distributed memory abstraction. When caching is enabled, Shark can also persist the results of a query in memory, stored in columnar fashion, which leads to faster query response times.
The new Spark integration and analytics option opens up many additional use cases for DataStax Enterprise, including time-sensitive applications such as click prediction, spam filtering, sensor data processing and fraud detection.
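To make the RDD idea more concrete, here is a minimal single-process sketch in plain Python. It is not Spark's API: the ToyRDD class and all of its methods are hypothetical names invented for illustration. It only mimics three behaviours described above: transformations are recorded lazily as lineage, cached partitions are reused, and a lost partition is rebuilt from its lineage.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Hypothetical stand-in for an RDD: partitions plus the recipe to rebuild them."""

    def __init__(self, compute: Callable[[int], List], num_partitions: int):
        self._compute = compute              # lineage: how to (re)build partition i
        self.num_partitions = num_partitions
        self._cache: Optional[dict] = None   # stays None until cache() is called
        self.computations = 0                # how many times a partition was built

    @classmethod
    def from_partitions(cls, partitions) -> "ToyRDD":
        data = [list(p) for p in partitions]
        return cls(lambda i: list(data[i]), len(data))

    def map(self, fn) -> "ToyRDD":
        # Lazy: nothing is computed here, the step is only added to the lineage.
        parent = self
        return ToyRDD(lambda i: [fn(x) for x in parent.partition(i)],
                      self.num_partitions)

    def cache(self) -> "ToyRDD":
        # Ask for computed partitions to be kept in memory from now on.
        self._cache = {}
        return self

    def partition(self, i: int) -> List:
        if self._cache is not None and i in self._cache:
            return self._cache[i]            # served from memory
        self.computations += 1
        result = self._compute(i)            # rebuilt from lineage
        if self._cache is not None:
            self._cache[i] = result
        return result

    def lose_partition(self, i: int) -> None:
        # Simulate a machine failure that drops one cached partition.
        if self._cache is not None:
            self._cache.pop(i, None)

    def collect(self) -> List:
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


base = ToyRDD.from_partitions([[1, 2, 3], [4, 5, 6]])
squares = base.map(lambda x: x * x).cache()

first = squares.collect()      # builds both partitions
second = squares.collect()     # both partitions come from the cache
squares.lose_partition(1)      # pretend one machine went down
third = squares.collect()      # only the lost partition is rebuilt
```

In this toy, squares.computations ends at 3: two builds for the first collect, none for the second, and one rebuild after the simulated failure. A real cluster distributes the partitions across machines, which this sketch does not attempt.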
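As a rough illustration of why a columnar in-memory layout helps, the following plain-Python sketch pivots a small result set into one list per column, so that a later aggregate touches only the column it needs. This is not Shark code; the to_columnar helper and the sample rows are assumptions made up for the example.

```python
from typing import Dict, List


def to_columnar(rows: List[dict]) -> Dict[str, list]:
    """Pivot row dicts into one list per column (assumes every row has the same keys)."""
    columns: Dict[str, list] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Hypothetical query result, as rows.
rows = [
    {"user": "alice", "clicks": 3},
    {"user": "bob", "clicks": 5},
    {"user": "carol", "clicks": 2},
]

cached = to_columnar(rows)               # done once, then kept in memory
total_clicks = sum(cached["clicks"])     # scans a single column, not whole rows
```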
Additionally, you can combine Spark’s in-memory analytics with DSE’s OLTP in-memory option and keep your OLTP and analytics operations fully contained in memory for the ultimate speed and performance.
Let’s quickly walk through these new analytical capabilities in DSE 4.5 and see how easy it is to enable and use Spark/Shark in DSE.
Built-In Workload Isolation and Management
DataStax Enterprise 4.5 extends its workload isolation capabilities to Spark/Shark analytics, so that real-time, search and analytic nodes do not compete for data or compute resources. DSE manages this automatically, without any user intervention. As seen in the OpsCenter UI, Cassandra, Search and Analytics (Spark) nodes are placed in different logical rings (DCs), which keeps their workloads isolated.
High Availability Built-In
DSE offers a built-in high availability solution for Spark: a reserved Spark Master (job tracker) can take over if the original Spark Master process fails, for example because of a node failure. This ensures continuity of your analytic operations without any need to run your queries again.
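The takeover behaviour can be pictured with a small plain-Python model. It says nothing about how DSE implements failover internally: the ToyCluster class, its elect method and the node names are all hypothetical, and the sketch only captures the rule "use the master if it is alive, otherwise promote the reserve".

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class ToyCluster:
    """Hypothetical model of a Spark Master with one reserved stand-by."""
    master: str
    reserve: Optional[str]

    def elect(self, alive: Set[str]) -> Optional[str]:
        """Return the node that should act as master, promoting the reserve if needed."""
        if self.master in alive:
            return self.master
        if self.reserve is not None and self.reserve in alive:
            # The original master is gone: the reserve takes over.
            self.master, self.reserve = self.reserve, None
            return self.master
        return None  # no master and no usable reserve


cluster = ToyCluster(master="node-a", reserve="node-b")

before = cluster.elect({"node-a", "node-b"})   # master is healthy
after = cluster.elect({"node-b"})              # master failed, reserve promoted
```

After the promotion the model has no reserve left, which mirrors why you would assign a new reserve once the failed node is back.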
Built-In Enterprise Security
Many enterprises have strict security and compliance regulations that ensure only authenticated users can access corporate database systems or run analytical queries on them. The existing security mechanisms of DSE are extended to Spark/Shark analytics, so only specified users can access Cassandra data and run analytical queries.
DSE provides a native integration for Spark (under Apache 2.0 license) using DataStax’s Cassandra driver for Java (2.0). DSE-Spark’s API is very simple and supports automatic conversion of types between Cassandra and Scala (including collections). It also allows user-defined classes for representing Cassandra rows.
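The idea of representing rows with user-defined classes, with values converted to the declared field types, can be sketched as follows. The real DSE-Spark API is Scala-based and is not shown here; this plain-Python analogue uses a hypothetical SensorReading class, a hypothetical from_row helper and a made-up row to illustrate the concept only.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SensorReading:
    """Hypothetical user-defined class standing in for one table row."""
    sensor_id: str
    temperature: float
    tags: frozenset


def from_row(cls, row: dict):
    """Build cls from a row dict, converting each value to the field's declared type."""
    # f.type is the annotation itself (str, float, frozenset), used here as a converter.
    kwargs = {f.name: f.type(row[f.name]) for f in fields(cls)}
    return cls(**kwargs)


# Made-up row: a textual number and a list that should become a set-like collection.
raw_row = {"sensor_id": "s-1", "temperature": "21.5", "tags": ["indoor", "lab"]}
reading = from_row(SensorReading, raw_row)
```

Using the annotations as converters keeps the sketch short, but it only works for types that can be called with a single value; a fuller mapping layer would need explicit converters per column type.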
The Spark/Shark integration in DSE comes with 24/7/365 expert support from DataStax.
Overall, DataStax Enterprise provides a complete, robust and first-class integration for Spark/Shark that further enhances DSE’s analytics capabilities, so you can run very fast ad hoc queries on Cassandra data and make quicker decisions.
A Quick Tutorial
Spark is enabled by default when DSE 4.5 is started in “Analytics” mode.
The following walks you through enabling Spark and Shark in DSE. The instructions below assume you’ve already installed DSE nodes and formed a cluster (in this case, a 4-node cluster).
Start DataStax Enterprise with the following commands:
pavan@ip-10-172-150-198:~/dse-4.5.0/bin$ ./dse cassandra (starts in Cassandra mode; not needed for this demo, shown only to illustrate workload isolation)
pavan@ip-10-170-130-198:~/dse-4.5.0/bin$ ./dse cassandra -s (starts in Search mode; not needed for this demo, shown only to illustrate workload isolation)
pavan@ip-10-170-106-27:~/dse-4.5.0/bin$ ./dse cassandra -k (starts Spark trackers in Analytics mode; the Spark Master runs on this node)
pavan@ip-10-173-166-28:~/dse-4.5.0/bin$ ./dse cassandra -k (starts Spark trackers in Analytics mode)
A best practice is to always have Analytic nodes deployed in a separate Datacenter (or virtual DC/different availability zones).
Run “./dsetool nodetool status” to check that everything is running:
Let’s focus on the two nodes where Analytics/Spark is enabled.
In order to achieve high availability for Spark Master, you need to set up a reserved Spark Master. This is easily done by running the following command:
pavan@ip-10-170-106-27:~/dse-4.5.0/bin$ ./dsetool movejt 10.172.166.28
The output should say:
Setting ‘reserve’ JT to point to /10.172.166.28
The output of “./dsetool ring” should look like this:
As seen from the image above, a reserved tracker (RT) is assigned to the other Analytics node and would take over in case the original Spark Master (JT) fails.
To launch the Spark interactive shell, run the following command:
pavan@ip-10-170-106-27:~/dse-4.5.0/bin$ ./dse spark
To launch Shark, run the following command:
pavan@ip-10-172-166-28:~/dse-4.5.0/bin$ ./dse shark
Please check our online docs for a sample demo on how to run Spark/Shark queries on Cassandra data.
DSE-Spark also comes with a web UI to track:
- Total cores and memory used
- Number of applications running
- Workers – ID, address, state, cores, memory
- Applications – ID, cores, memory per node, submitted time, user, state, duration
- Logs (stdout, stderr)
You can enjoy the benefits of faster analytics on Cassandra data now by downloading DSE 4.5.