What’s New in DataStax Enterprise Analytics 5.1

By Brian Hess, April 24, 2017

We are extremely excited to introduce the new operational analytics features now available in DataStax Enterprise (DSE) 5.1, powered by the best distribution of Apache Cassandra™. A common premise of product development is to invest in the areas your users will always care about. For DSE Analytics, this commonly comes down to performance, security, and ease of use.

You’ll see those qualities across most of the DSE Analytics improvements in 5.1, including:

  • Major updates to open-source components
  • Significant performance improvements
  • Production-ready distributed file system
  • Analysis of tabular or graph data – or both together
  • Improvements to deployment options, security, and authorization

Without further ado, let’s dive into all things new in DSE Analytics 5.1.

Upgraded Apache Spark™

DataStax Enterprise 5.1 includes a major upgrade to the underlying operational analytics engine: Apache Spark 2.0. Spark 2.0 brings notable improvements, especially in the areas of:

  • Spark SQL
  • APIs and usability
  • Operational enhancements

The enhancements to Spark SQL deliver SQL 2003 support, greatly expanding its ability to run complex SQL (including all 99 queries of the TPC-DS suite). In addition, considerable effort has gone into enhancing the Spark optimizer, codenamed Catalyst, which sits beneath both the SQL engine and the more general DataFrame engine, resulting in a 2-10x performance improvement from a technique called “whole stage code generation”.
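To give a rough feel for what whole stage code generation is about, the sketch below contrasts a chain of per-operator iterators with a single fused loop over the same toy query. It is plain Python for illustration only, not the code Spark generates, and every function and field name in it is hypothetical.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, int]

# --- Iterator model: each operator is a separate step pulling rows from its child ---

def scan(rows: List[Row]) -> Iterator[Row]:
    for row in rows:
        yield row

def filter_op(rows: Iterable[Row], min_qty: int) -> Iterator[Row]:
    for row in rows:
        if row["qty"] >= min_qty:
            yield row

def project_op(rows: Iterable[Row]) -> Iterator[int]:
    for row in rows:
        yield row["qty"] * row["price"]

def revenue_iterator_model(rows: List[Row], min_qty: int) -> int:
    # Every row crosses three generator boundaries before it is summed.
    return sum(project_op(filter_op(scan(rows), min_qty)))

# --- Fused model: the whole pipeline collapsed into one loop ---

def revenue_fused(rows: List[Row], min_qty: int) -> int:
    total = 0
    for row in rows:
        if row["qty"] >= min_qty:
            total += row["qty"] * row["price"]
    return total

# Hypothetical sample data.
SALES: List[Row] = [
    {"qty": 1, "price": 10},
    {"qty": 5, "price": 3},
    {"qty": 7, "price": 2},
]
```

Both functions compute the same answer; the fused version simply avoids handing each row from one operator to the next, which is the general idea behind generating one function for a whole query stage.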

Beyond SQL, Spark 2.0 also expands the capabilities of Spark’s R integration. DataStax Enterprise 5.1 now includes support for SparkR, giving users of the popular statistical programming language R access to Spark. Spark 2.0 introduced R-based user-defined functions (UDFs), greatly increasing the capabilities of SparkR and bringing it much closer to the other Spark languages. Using R with DSE is as simple as installing R and running “dse sparkr”; see the documentation for more details.

Lastly, the machine learning library, MLlib, has also seen some serious improvements, and there is a new construct called Datasets.

3x Operational Analytics Read Performance

DSE 5.1 includes an exciting new feature called Continuous Paging, which is unique to the DataStax Enterprise database. Continuous Paging is an optimization specifically for Cassandra queries that scan a large portion of the data, a core operation underlying most analytical queries. Because this core operation is faster, most Spark queries against data stored in DSE’s database benefit. We have tested this in a number of scenarios (selecting all columns or only some columns, with or without a clustering-column predicate) and we see a 2.5-3.5x performance improvement.

In DataStax Enterprise, the Spark Cassandra Connector has been enhanced to take advantage of Continuous Paging automatically, seamlessly unleashing the potential 3x performance improvement.
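As a toy illustration of why a scan-heavy read path might benefit, the sketch below counts client requests for a large scan under two models: one request per page versus a single request after which pages are streamed. This is an assumption-laden simplification (it assumes the gain comes from not waiting on a request for every page); it is not the DSE protocol, and all names are hypothetical.

```python
from typing import Iterator, List, Tuple

def stream_pages(rows: List[int], page_size: int) -> Iterator[List[int]]:
    """Yield consecutive pages of at most page_size rows."""
    for start in range(0, len(rows), page_size):
        yield rows[start:start + page_size]

def scan_request_per_page(rows: List[int], page_size: int) -> Tuple[List[int], int]:
    """Client asks for one page, waits for it, then asks for the next."""
    received: List[int] = []
    requests = 0
    offset = 0
    while offset < len(rows):
        requests += 1  # one client request (and one wait) per page
        page = rows[offset:offset + page_size]
        received.extend(page)
        offset += len(page)
    return received, requests

def scan_streamed(rows: List[int], page_size: int) -> Tuple[List[int], int]:
    """Client sends a single request; pages then arrive back to back."""
    received: List[int] = []
    requests = 1  # the only client request in this model
    for page in stream_pages(rows, page_size):
        received.extend(page)
    return received, requests
```

For a 10,000-row scan with 500-row pages, the first model issues 20 requests and the second issues 1, while both return identical rows. The measured 2.5-3.5x figure above comes from the real system, not from this model.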

An improved DataStax Enterprise File System (DSEFS)

In DSE 5.0 we introduced a distributed file system, named DSEFS, which was then focused on supporting the file system needs of Spark Streaming, such as checkpointing or write-ahead-logging. In DSE 5.1 we enhanced DSEFS to support generic distributed file system use cases in support of operational analytics and as a drop-in replacement for HDFS.

For a quick review, DSEFS improves on the design of the Cassandra File System (CFS), which has shipped with DSE since version 1.0.  While this fit some of the needs back then, DSEFS is a new approach that increases performance and is less impactful to production systems. The main difference is that unlike CFS, DSEFS stores data blocks outside of Cassandra tables, which has a number of attractive consequences.

The overhead of Cassandra operations on the data blocks, from compactions and from writing to the commit log, is removed, as is the overhead of delete operations. Furthermore, DSEFS can scale to denser storage per node than CFS.

DSEFS stores the metadata inside Cassandra tables, making the metadata fault-tolerant with no master or leader node to worry about. This is a huge improvement over HDFS, which depends on a NameNode to keep the metadata in memory. Additionally, there is no need for HDFS’s SecondaryNameNode and no dependency on ZooKeeper.
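The split between metadata and data blocks can be pictured with a small toy model. In the sketch below, a dictionary stands in for the metadata tables and per-node dictionaries stand in for block storage kept outside the database. The block size, the round-robin placement, and every name are assumptions made for illustration; this is not how DSEFS is actually implemented.

```python
import uuid
from typing import AbstractSet, Dict, List

BLOCK_SIZE = 4  # bytes; deliberately tiny so the example splits into several blocks

class ToyFileSystem:
    def __init__(self, nodes: List[str], replication: int = 2) -> None:
        self.nodes = nodes
        self.replication = min(replication, len(nodes))
        # Stand-in for metadata tables: path -> ordered list of block records.
        self.metadata: Dict[str, List[dict]] = {}
        # Stand-in for block files stored on each node, outside the metadata store.
        self.block_stores: Dict[str, Dict[str, bytes]] = {n: {} for n in nodes}

    def write(self, path: str, data: bytes) -> None:
        records = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block_id = uuid.uuid4().hex
            index = offset // BLOCK_SIZE
            # Hypothetical placement: consecutive nodes, round-robin.
            holders = [self.nodes[(index + r) % len(self.nodes)]
                       for r in range(self.replication)]
            for node in holders:
                self.block_stores[node][block_id] = data[offset:offset + BLOCK_SIZE]
            records.append({"block_id": block_id, "nodes": holders})
        self.metadata[path] = records

    def read(self, path: str, down: AbstractSet[str] = frozenset()) -> bytes:
        out = bytearray()
        for record in self.metadata[path]:
            # Use the first replica that lives on a node that is still up.
            node = next(n for n in record["nodes"] if n not in down)
            out += self.block_stores[node][record["block_id"]]
        return bytes(out)
```

With three nodes and two replicas per block, a file written through `write` can still be read back with any single node marked as down, because the metadata records every replica location and no single node holds the only copy.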

DSEFS supports the HDFS API, so applications and frameworks that support HDFS – including Spark – work seamlessly with DSEFS. There is also support for WebHDFS, which allows a REST API to interact with DSEFS. Put simply, DSEFS is more fault-tolerant and simpler to deploy than HDFS.
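For readers unfamiliar with WebHDFS, its REST calls follow the pattern `/webhdfs/v1/<path>?op=<OPERATION>`. The helper below only builds such URLs; it does not contact a server. The host name, the port 5598, and the idea that a given operation is available on a particular DSEFS version are assumptions here, so check them against your own deployment.

```python
from urllib.parse import quote, urlencode

def webhdfs_url(host: str, port: int, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS-style URL for an absolute file system path."""
    query = urlencode({"op": op, **params})
    # quote() leaves "/" intact and escapes characters such as spaces.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"

# Hypothetical host and port, for illustration only.
open_url = webhdfs_url("node1", 5598, "/data/events.csv", "OPEN")
list_url = webhdfs_url("node1", 5598, "/data", "LISTSTATUS")
```

Here `open_url` is `http://node1:5598/webhdfs/v1/data/events.csv?op=OPEN`, which any HTTP client could then issue as a GET request.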

As an improvement over DSE 5.0, DSEFS user authentication and Linux-like file system permissions properly restrict access, which is a need in an enterprise environment. The DSEFS command supports management operations such as identifying under-replicated blocks due to failures and monitoring available space on each DSEFS node.
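“Linux-like” permissions mean the familiar owner/group/other read, write, and execute bits. The sketch below shows how such a check is conventionally evaluated. It is a generic illustration of the permission model with hypothetical names, not DSEFS code.

```python
import stat
from dataclasses import dataclass
from typing import AbstractSet

@dataclass(frozen=True)
class Entry:
    owner: str
    group: str
    mode: int  # permission bits, e.g. 0o640

# For each access type: the (owner, group, other) bit to test.
_BITS = {
    "r": (stat.S_IRUSR, stat.S_IRGRP, stat.S_IROTH),
    "w": (stat.S_IWUSR, stat.S_IWGRP, stat.S_IWOTH),
    "x": (stat.S_IXUSR, stat.S_IXGRP, stat.S_IXOTH),
}

def can_access(entry: Entry, user: str, groups: AbstractSet[str], want: str) -> bool:
    """Return True if `user` may perform `want` ('r', 'w' or 'x') on `entry`."""
    owner_bit, group_bit, other_bit = _BITS[want]
    if user == entry.owner:
        return bool(entry.mode & owner_bit)   # owner class applies
    if entry.group in groups:
        return bool(entry.mode & group_bit)   # group class applies
    return bool(entry.mode & other_bit)       # everyone else

# Hypothetical file: owner may read and write, group may read, others get nothing.
report = Entry(owner="alice", group="analysts", mode=0o640)
```

With mode 640, `alice` can write the file, a member of `analysts` can read but not write it, and a user outside the group cannot read it at all.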

Graph Analytics via DseGraphFrames

For DSE Graph, DSE 5.1 includes a new graph analytics capability to perform analysis of graphs using a new DSE-specific API, called DseGraphFrames (full blog post coming soon). Based on the open-source GraphFrames project, users can use Spark via Scala to read DSE Graph data into Spark and perform Spark DataFrame operations and graph algorithms using the Spark engine. In addition to analyzing DSE Graph data, other data can be combined with graph data into a complete multi-model analytical workflow.

A GraphFrame is a construct in Spark consisting of a DataFrame for vertices and another DataFrame for edges between those vertices. By representing the data this way, DataFrame-type operations on vertices and edges are simple and efficient, such as group counts by type of vertex or selecting only certain edges based on a property. Additionally, users will be able to query or manipulate the graph data using SQL, including joins and aggregations. The GraphFrames project also includes some popular graph algorithms written in a Spark-optimized way (e.g., PageRank and triangle counting) and motif-finding to identify structural patterns in a graph.
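The vertex-table-plus-edge-table idea is easy to picture without Spark. The sketch below uses plain Python lists in place of DataFrames, with the `id`, `src`, and `dst` column names that GraphFrames uses; the data, the other columns, and the helper functions are hypothetical and only illustrate the kinds of operations described above.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Stand-ins for the vertex and edge DataFrames (hypothetical sample data).
VERTICES: List[Dict] = [
    {"id": "a", "label": "person", "name": "Alice"},
    {"id": "b", "label": "person", "name": "Bob"},
    {"id": "c", "label": "person", "name": "Carol"},
    {"id": "p", "label": "product", "name": "Laptop"},
]
EDGES: List[Dict] = [
    {"src": "a", "dst": "b", "label": "knows", "since": 2012},
    {"src": "b", "dst": "a", "label": "knows", "since": 2013},
    {"src": "b", "dst": "c", "label": "knows", "since": 2016},
    {"src": "c", "dst": "p", "label": "bought", "since": 2017},
]

def count_by_label(vertices: List[Dict]) -> Counter:
    """Group count on the type of vertex."""
    return Counter(v["label"] for v in vertices)

def edges_where(edges: List[Dict], predicate: Callable[[Dict], bool]) -> List[Dict]:
    """Select only the edges whose properties satisfy the predicate."""
    return [e for e in edges if predicate(e)]

def mutual_pairs(edges: List[Dict]) -> List[Tuple[str, str]]:
    """A tiny motif search: pairs (x, y) with edges x->y and y->x."""
    present = {(e["src"], e["dst"]) for e in edges}
    return sorted((s, d) for (s, d) in present if s < d and (d, s) in present)
```

On this data, `count_by_label(VERTICES)` reports three `person` vertices and one `product`, `edges_where(EDGES, lambda e: e["since"] >= 2016)` keeps two edges, and `mutual_pairs(EDGES)` finds the single reciprocal pair `("a", "b")`.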

In addition, the DseGraphFrame, which extends the GraphFrame construct in a DSE-specific way, will enable editing the DSE Graph to add vertices, modify vertex or edge properties, and delete vertices or edges. Furthermore, the vertex and edge DataFrames can be registered in the Spark SQL catalog and queried using Spark SQL, including over ODBC/JDBC via the Spark SQL Thriftserver. DseGraphFrames can also be used to bulk export and bulk import graphs, to DSEFS for example.
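Once vertices and edges are visible as tables, graph questions become ordinary SQL joins and aggregations. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for a SQL engine; it is not Spark SQL, and the table names, columns, and data are hypothetical.

```python
import sqlite3
from typing import List, Tuple

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vertices (id TEXT PRIMARY KEY, label TEXT, name TEXT);
CREATE TABLE edges (src TEXT, dst TEXT, label TEXT);
INSERT INTO vertices VALUES
    ('a', 'person', 'Alice'),
    ('b', 'person', 'Bob'),
    ('p', 'product', 'Laptop');
INSERT INTO edges VALUES
    ('a', 'b', 'knows'),
    ('a', 'p', 'bought'),
    ('b', 'p', 'bought');
""")

def buyers_per_product(db: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Join edges to their destination vertex and aggregate per product."""
    return db.execute("""
        SELECT v.name, COUNT(*) AS buyers
        FROM edges e
        JOIN vertices v ON v.id = e.dst
        WHERE e.label = 'bought'
        GROUP BY v.name
        ORDER BY v.name
    """).fetchall()

def who_bought_what(db: sqlite3.Connection) -> List[Tuple[str, str]]:
    """Join the edge table to the vertex table twice: once per endpoint."""
    return db.execute("""
        SELECT buyer.name, item.name
        FROM edges e
        JOIN vertices buyer ON buyer.id = e.src
        JOIN vertices item ON item.id = e.dst
        WHERE e.label = 'bought'
        ORDER BY buyer.name
    """).fetchall()
```

The same shape of query would apply to vertex and edge tables registered in a SQL catalog, whatever client is used to submit it.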

Improved Resource Manager for a more flexible deployment

In DSE 5.1, we have yet again improved the built-in resource manager for Spark jobs. This enhancement has two related aspects: security and authorization. First, all communication between the Spark processes (driver, master, worker, and executor) is conducted over secure channels, and every Spark job has its own encrypted channels. This insulates each Spark job from all others on the system and provides end-to-end encrypted communication for every Spark job.

Additionally, the ability to submit and run Spark jobs is now a GRANTable privilege in DataStax Enterprise, limiting who can and cannot run Spark jobs on a cluster. Moreover, in situations where there are multiple Analytics data centers in a DSE cluster, administrators can limit which users can run queries on which data center’s Spark resources.
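The authorization rule described here boils down to a lookup: does this role hold the submit privilege, either everywhere or for this particular data center? The sketch below models only that decision. It is a conceptual illustration with hypothetical names; it does not show the actual CQL GRANT syntax or how DSE stores permissions.

```python
from typing import Set, Tuple

ANY_DC = "*"  # hypothetical marker meaning "all Analytics data centers"

class SparkSubmitAcl:
    """Toy record of which roles may submit Spark jobs, and where."""

    def __init__(self) -> None:
        self._grants: Set[Tuple[str, str]] = set()

    def grant(self, role: str, datacenter: str = ANY_DC) -> None:
        self._grants.add((role, datacenter))

    def revoke(self, role: str, datacenter: str = ANY_DC) -> None:
        self._grants.discard((role, datacenter))

    def can_submit(self, role: str, datacenter: str) -> bool:
        # Allowed if granted everywhere, or granted for this data center.
        return (role, ANY_DC) in self._grants or (role, datacenter) in self._grants

acl = SparkSubmitAcl()
acl.grant("etl_admin")                    # may run jobs in any data center
acl.grant("analyst", "analytics_dc_1")    # limited to one data center
```

In this model `analyst` can submit to `analytics_dc_1` but not to `analytics_dc_2`, while a role with no grant cannot submit anywhere.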

One result of the improvements to the Spark resource manager is greater flexibility in how you can add analytics processing to the data in DSE. One common deployment approach is to have every node in the data center be both a Cassandra data node and a Spark processing node.  

This allows Spark processing to read the data locally, and it was the only deployment mode of DSE Analytics until now.

Our two new deployment options have separate nodes for Cassandra data and for Spark processing. This allows the number of Spark nodes to differ from the number of Cassandra nodes, and allows the number of Spark nodes to increase or decrease without moving any Cassandra data. The trade-off is that the flexibility gained comes at the cost of remote data reads.

The first scenario is to create a DSE data center to which no Cassandra data is replicated (e.g., by configuring NetworkTopologyStrategy accordingly). These nodes will still use Cassandra for Spark resource management (Spark Master selection, intermediate state, etc.), but will read data from a different data center in the DSE cluster. One benefit of this approach is that the set of DSE users in the system is the same since it is the same DSE cluster. Additionally, this data center can be spun up, spun down, or resized dynamically without having to move any Cassandra data.
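With NetworkTopologyStrategy, a data center receives replicas only if it is listed in the keyspace's replication options, so leaving the Spark-only data center out keeps it free of data. The sketch below builds such a statement and works out whether reads from a given data center would be local or remote. The keyspace and data center names are hypothetical, and the helper functions are illustrative rather than part of any DSE tool.

```python
from typing import Dict

def create_keyspace_cql(name: str, replicas_per_dc: Dict[str, int]) -> str:
    """Build a CREATE KEYSPACE statement listing only data centers that hold data."""
    dc_part = "".join(
        f", '{dc}': {rf}" for dc, rf in replicas_per_dc.items() if rf > 0
    )
    return (
        f"CREATE KEYSPACE {name} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy'{dc_part}}};"
    )

def read_locality(spark_dc: str, replicas_per_dc: Dict[str, int]) -> str:
    """Spark in a data center with no replicas must read from another data center."""
    return "local" if replicas_per_dc.get(spark_dc, 0) > 0 else "remote"

# Hypothetical layout: data lives in dc_data only; dc_spark runs Spark and holds no replicas.
REPLICATION = {"dc_data": 3}
statement = create_keyspace_cql("sales", REPLICATION)
```

Here `statement` is `CREATE KEYSPACE sales WITH replication = {'class': 'NetworkTopologyStrategy', 'dc_data': 3};`, and `read_locality("dc_spark", REPLICATION)` returns `"remote"`, which is the trade-off this deployment option accepts.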

 

The second option is to create a separate external DSE cluster (cluster A) running the same version of DSE as the “data” DSE cluster (cluster B). This “Spark” DSE cluster would use its local Cassandra resources to manage Spark resources, but would read data remotely from the “data” DSE cluster. Again, the main benefit is the flexibility of sizing, and resizing, this “Spark” DSE cluster independently of the “data” DSE cluster. However, unlike the data center scenario above, the DSE users are not shared, since this is a separate DSE cluster. Using LDAP in both clusters can simplify this scenario.
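One plausible way to point a job on the “Spark” cluster at the “data” cluster is the Spark Cassandra Connector's `spark.cassandra.connection.host` property, passed with spark-submit's `--conf` flag. The sketch below only assembles such a command line; the address, jar, and class name are hypothetical, and whether additional settings (authentication, encryption) are needed depends on your clusters.

```python
from typing import Dict, List

def spark_submit_args(app_jar: str, main_class: str, conf: Dict[str, str]) -> List[str]:
    """Assemble a `dse spark-submit` command line as an argument list."""
    args = ["dse", "spark-submit", "--class", main_class]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app_jar)
    return args

# Hypothetical contact point in the "data" cluster (cluster B).
REMOTE_DATA_CONF = {"spark.cassandra.connection.host": "10.0.1.10"}

command = spark_submit_args("report.jar", "com.example.Report", REMOTE_DATA_CONF)
```

Joining `command` with spaces gives `dse spark-submit --class com.example.Report --conf spark.cassandra.connection.host=10.0.1.10 report.jar`, to be run from a node in the “Spark” cluster.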

And Away We Go!

These new additions and enhancements to DSE Analytics deliver additional speed, security, flexibility, and simplicity to the DSE platform. The improvements to the manageability of DSEFS, to Spark resource management, and to overall performance will make Spark analysis on DSE scream. In addition, with a production-ready distributed file system in DSEFS, increased flexibility in processing DSE Graph data with DseGraphFrames, and new APIs such as SparkR, the sky’s the limit as to what you can do.

You can download DSE 5.1 now and read through our updated docs for more information.


© 2017 DataStax, All Rights Reserved. DataStax, Titan, and TitanDB are registered trademarks of DataStax, Inc. and its subsidiaries in the United States and/or other countries.
Apache Cassandra, Apache, Tomcat, Lucene, Solr, Hadoop, Spark, TinkerPop, and Cassandra are trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries.