What’s New for DataStax Enterprise Analytics 6
An always-on, distributed cloud database needs non-stop analytics to manage workflows, derive insights, and enable analysis for business applications and business analysts. With that in mind, we focused DSE Analytics in this release on non-stop availability and ease of use for operational analytics workloads, which resulted in several significant new features:
- AlwaysOn SQL, a highly available, enterprise-grade service to support production analytical SQL applications
- DataStax Enterprise (DSE)-specific enhancements to the Spark Catalyst optimizer, including automatic use of DSE Search indices so that Spark operations pick up the performance benefits of DSE Search
- Upgrade of the DSE Analytics engine to Apache Spark™ 2.2, including Structured Streaming to enable improved streaming analytics for DSE
Building on the Goodness of DSE 5.1
Before diving into the new features of DSE 6, it’s worth highlighting a few items from DSE 5.1 that DSE 6 builds on. The most notable is the performance gain we introduced with Continuous Paging: up to a 3x improvement for scans from the DSE database. This greatly accelerates operational analytic workloads on data in DSE.
Another big enhancement was the general availability of DSEFS, a continuously available, HDFS-compatible, distributed file system that integrates seamlessly with Spark and can scale to 40TB per node. DSEFS not only provides checkpointing for Spark Streaming applications but also supports general use cases, including data reception, lambda-architecture-type data flows, and scan-heavy operational analysis.
Also worth calling out are the improvements to DSE’s Spark Resource Manager. The Resource Manager in DSE has been highly available for several versions, but DSE 5.1 significantly improved its fault tolerance, security, and ease of use. In DSE 5.1, all nodes in the DSE datacenter can accept a Spark job submission, and all communications (client to cluster, within the cluster, and so on) are protected by encryption.
Introducing DSE AlwaysOn SQL
Given the performance improvements in DSE 5.1, we turned our focus in DSE 6 to making analytic queries significantly simpler for developers, and we wanted to ensure these improvements addressed the issues with running analytics in production.
One ubiquitous API for analysis is good old SQL. It’s been around for a long time, and an extensive industry of tools, applications, and expertise has built up around doing data analysis and data flows with SQL. It’s important to remember that although Spark is built in Scala, using it does not require Scala skills. Spark has always included an SQL component, and DSE Analytics inherits that benefit, too.
The Spark community has put a large amount of effort into making a strong SQL engine, but it has largely avoided addressing what it takes to build an enterprise-class ODBC/JDBC service that can be put into production. Such a service needs to be highly available and simple to use, and it needs to implement production-ready security measures for networking, user authentication, and user authorization.
This is what AlwaysOn SQL was designed and built to achieve. It’s a production-ready ODBC/JDBC service that provides SQL access to data in DSE, both the database and DSEFS. This allows ODBC/JDBC analytical applications to be put into production, worry-free. This service will automatically restart in the face of failures, and cached datasets will also be refreshed automatically. Client applications will connect seamlessly to the service without having to know the details of where in the data center the service is running.
Authentication and authorization of SQL users will occur via the same users managed within DSE Advanced Security, removing the need for extra, error-prone security setup steps. Queries against the underlying DSE database will be executed as the authenticated SQL user, providing fine-grained security to the data.
More details on AlwaysOn SQL will follow in a later blog post, so please keep a lookout for that.
Improved Spark Analytic Engine
DSE 6 includes a number of improvements “under the hood” to support not just AlwaysOn SQL but also general Spark applications. DataStax has invested in a number of areas to improve the performance of these Spark applications in DSE Analytics.
First, DSE 6 upgrades the DSE Analytics engine to Apache Spark 2.2. This landmark release sees the graduation of the new Structured Streaming component from an “experimental” feature to a full, first-class citizen. Aside from Structured Streaming, Spark 2.2 focuses more on usability, stability, and polish. To support the new Structured Streaming API, DSE 6 includes a new Structured Streaming sink exclusively for DSE, enabling simple, efficient, and robust streaming of data into DSE from Apache Kafka, file systems, or other sources.
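To make the streaming sink more concrete, here is a minimal PySpark-style sketch of writing a stream into a DSE table. It rests on assumptions this post does not confirm: that the sink is addressed through the Cassandra data source name `org.apache.spark.sql.cassandra` and takes `keyspace` and `table` options. The helper name `write_stream_to_dse` and its arguments are hypothetical.

```python
def write_stream_to_dse(stream_df, keyspace, table, checkpoint_dir):
    """Start a Structured Streaming query that appends stream_df to a DSE table.

    stream_df is expected to be a streaming DataFrame, for example one
    created with spark.readStream. Only methods of Spark's DataStreamWriter
    are called, so the function itself needs no imports.
    """
    return (
        stream_df.writeStream
        # Assumption: data source name under which the DSE sink is registered.
        .format("org.apache.spark.sql.cassandra")
        .option("keyspace", keyspace)
        .option("table", table)
        # Checkpointing lets the query resume where it left off after a failure.
        .option("checkpointLocation", checkpoint_dir)
        .outputMode("append")
        .start()
    )
```

Given a streaming DataFrame named `events`, a call might look like `write_stream_to_dse(events, "analytics", "events_by_day", "dsefs:///checkpoints/events")`, where the keyspace, table, and DSEFS path are placeholders. DSEFS is a natural home for the checkpoint directory, since it already serves that role for Spark Streaming applications.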
DSE Analytics also now automatically leverages any DSE Search indices and pushes the DSE Search query down so that the search engine can perform it efficiently. This allows not just free-text search but also Boolean predicates to be evaluated efficiently by the Lucene-based engine, which is well suited to processing these queries. In some cases, namely when the query would return a large portion of the data, it is less efficient to fetch the data via the index than to simply scan the whole data set and let Spark evaluate the predicates.
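The trade-off between index and scan can be sketched as a simple selectivity check. This is an illustration only, not DSE's actual logic: the function name, its inputs, and the 25% cut-off are all assumptions made for the example.

```python
def choose_read_path(matching_rows, total_rows, max_index_fraction=0.25):
    """Return "search_index" or "full_scan" for a predicate.

    matching_rows: estimated number of rows the predicate selects
                   (hypothetical input, e.g. from a count query).
    total_rows:    estimated number of rows in the table.
    max_index_fraction: assumed cut-off, not a documented DSE value.
    """
    if total_rows <= 0:
        # Nothing is known about the table size; fall back to scanning.
        return "full_scan"
    selectivity = matching_rows / total_rows
    return "search_index" if selectivity <= max_index_fraction else "full_scan"


print(choose_read_path(1_000, 1_000_000))    # search_index: very selective predicate
print(choose_read_path(800_000, 1_000_000))  # full_scan: predicate matches most rows
```

The point of the sketch is that the decision needs only a cheap estimate of how many rows match, not the rows themselves.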
DSE Analytics will actually query the DSE Search component to determine how much data would be returned, and then automatically decide which approach is more efficient for the query.
DSE Analytics also introduces a new join strategy exclusively for DSE, the Direct Join. Those of you familiar with Spark Streaming with Apache Cassandra™ already know the joinWithCassandraTable method, which performs lookups to join with data in the database. The new direct join is the Dataset analogue of that RDD method. Moreover, the direct join will be chosen automatically in situations where it is the preferred approach to joining with DSE database data. The direct join has a clear use in Spark Streaming applications, but it is equally useful for batch applications.
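To show why a lookup-style join can beat a scan-based join, here is a conceptual model in plain Python. It is not the DSE or Spark API: a dict stands in for a DSE table keyed by its partition key, and the names `scan_join` and `direct_join` are made up for illustration.

```python
def scan_join(left_rows, table, key):
    """Join by reading every row of the table, then matching on key."""
    wanted = {}
    for row in left_rows:
        wanted.setdefault(row[key], []).append(row)
    joined, rows_read = [], 0
    for table_key, table_row in table.items():  # full table scan
        rows_read += 1
        for row in wanted.get(table_key, []):
            joined.append({**row, **table_row})
    return joined, rows_read


def direct_join(left_rows, table, key):
    """Join by looking up only the keys that appear on the left side."""
    joined, rows_read = [], 0
    for row in left_rows:
        table_row = table.get(row[key])  # single-key lookup
        if table_row is not None:
            rows_read += 1
            joined.append({**row, **table_row})
    return joined, rows_read


# A large "table" and a small batch of incoming events.
users = {i: {"name": "user-%d" % i} for i in range(10000)}
events = [
    {"user_id": 7, "action": "login"},
    {"user_id": 42, "action": "purchase"},
    {"user_id": 9001, "action": "logout"},
]

_, scanned = scan_join(events, users, "user_id")
_, looked_up = direct_join(events, users, "user_id")
print(scanned, looked_up)  # 10000 3
```

Both functions return the same joined rows here, but the scan touches every row of the table while the lookup version touches only the three rows it needs. When the left side is small relative to the table, as with a streaming micro-batch, that difference dominates.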
Finally, DSE GraphFrames, introduced in DSE 5.1, has been expanded to provide even more capabilities and improved performance in DSE 6. Additionally, all graphs in DSE Graph will automatically appear in the Spark SQL catalog as vertex and edge tables. This enables simple access to basic SQL and Dataset operations on the table representation of graph data, including via ODBC/JDBC and AlwaysOn SQL.
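As a rough illustration of what a table view of graph data makes possible, the sketch below runs an ordinary SQL join over a vertex table and an edge table. It uses an in-memory SQLite database purely as a stand-in for the Spark SQL catalog, and the table names (`people_vertices`, `people_edges`) and column names (`id`, `src`, `dst`, `label`) are assumptions for the example, not the names DSE generates.

```python
import sqlite3

# SQLite stands in for the Spark SQL catalog in this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people_vertices (id TEXT PRIMARY KEY, label TEXT, name TEXT);
    CREATE TABLE people_edges (src TEXT, dst TEXT, label TEXT);
    INSERT INTO people_vertices VALUES
        ('p1', 'person', 'Ada'),
        ('p2', 'person', 'Grace'),
        ('p3', 'person', 'Edsger');
    INSERT INTO people_edges VALUES
        ('p1', 'p2', 'knows'),
        ('p1', 'p3', 'knows'),
        ('p2', 'p3', 'knows');
""")

# Who knows whom? A plain SQL join over the vertex and edge tables.
rows = conn.execute("""
    SELECT a.name, b.name
    FROM people_edges e
    JOIN people_vertices a ON a.id = e.src
    JOIN people_vertices b ON b.id = e.dst
    WHERE e.label = 'knows'
    ORDER BY a.name, b.name
""").fetchall()

for source, target in rows:
    print(source, "knows", target)
```

Because the query is plain SQL, the same kind of statement could in principle be issued from any ODBC/JDBC tool once the graph tables are visible in the catalog.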
And Away We Go!
These new enhancements to DSE Analytics deliver improved simplicity, reliability, flexibility, and performance to the DSE platform. AlwaysOn SQL brings production-ready, enterprise-grade ODBC/JDBC to DSE Analytics, enabling a large ecosystem of tools, applications, and expertise. DSE Analytics 6 continues the enhancements and optimizations to DSE’s Spark engine that customers have come to expect from DataStax.