Always On Analytics
Highly available end-user analytics through Spark SQL via ODBC/JDBC.
Deliver real-time analytics at scale with no single point of failure.
With DataStax Enterprise (DSE) 6 we have introduced AlwaysOn SQL, a new, highly available, secure SQL service for analytical queries. This service allows BI, ETL, and other tools to connect via the standard ODBC and JDBC protocols to analyze data in DSE, including tabular data from the database, graph data from DSE Graph, and files stored in DSEFS. The service is not just highly available, a necessary requirement for production, but also fully secure, with security integrated with the rest of DSE.

Why Do We Need It?

SQL is a standard that has been around for over 40 years and has grown to be the de facto language for querying data. Many newer database vendors have based their own query languages, such as the Cassandra Query Language (CQL) for Apache Cassandra™, on SQL. Many of these languages, including CQL, support a subset of the SQL standard, since some of the standard is at odds with the goals and designs of these newer databases. That said, SQL remains an important tool for data analysis, and an entire ecosystem of tools and expertise has grown around it over the past 40 years.

ODBC and JDBC are the two main programming interfaces for submitting SQL to a database and retrieving results. These APIs enable SQL-based tools, such as BI tools, to support many databases at once by rallying around a single API against which to design. Beyond tooling, SQL is a common language of choice for data analysis in general. Most data and business analysts are familiar with SQL and interact with data through it daily. While Apache Spark™ has other interfaces, such as Scala and Python, many data analysis tasks do not require the heavy weight of a Scala or Python application and can be accomplished easily with SQL. Analysts want a simple way to ask questions of the data and receive results, without having to program in Spark.

So, what does a SQL service need to provide to the end user? It must be 100% reliable, meaning it must be always on and must not depend on manual intervention to handle failures. It should use the common interfaces that exist, namely ODBC and JDBC, and thereby integrate easily with existing tools and let users apply their existing knowledge. It should offer robust security, including authentication and authorization, integrated with the rest of the data platform and leveraging the same users, permissions, and user management. And it should be fully integrated into the database: it should not feel like a bolt-on addition, but should be managed like the rest of the platform.

A Little History

DSE has shipped with SQL access to data for many releases: first via Hive and Hiveserver, then Shark and Sharkserver, and finally with Spark and the Spark SQL Thriftserver, which is what ships in DSE 5.1. However, none of these solutions was worry-free in production environments.

In DSE 5.1, the role of ODBC/JDBC service for DSE was handled by the Spark SQL Thriftserver, the same approach taken by open-source Apache Spark and the various Spark vendors. However, the Spark SQL Thriftserver leaves a bit to be desired in terms of being enterprise-class.

First, security in the Spark SQL Thriftserver has some notable shortcomings. While you can enable LDAP authentication for the Spark SQL Thriftserver, this is not integrated with DSE and does not support the variety of authentication schemes that DSE supports. More importantly, the Spark SQL Thriftserver runs as a single, long-running Spark application, and as such it connects to the DSE database as a single user. That is, once any user has authenticated to the Spark SQL Thriftserver, it queries the underlying DSE database as that one service user, regardless of who authenticated. This is a limitation of the open-source Spark SQL Thriftserver and the Spark Cassandra Connector, and it is clearly sub-par for security-conscious customers.

Second, the Spark SQL Thriftserver is not highly available. It must be started manually, and it is left to the administrator to keep an eye on it. If it stops for some reason, the administrator has to restart the service, potentially on another node in the case of node failure. This lack of high availability is prohibitive for production use. Furthermore, if the service is restarted, analytical SQL applications need to update their connections to point at whichever machine the service was restarted on.

Finally, the Spark SQL Thriftserver supports caching tables in memory for even faster query times. However, if the service stops, intentionally or not, the cached data is not automatically re-cached for the user. And even if the user issues a manual re-cache operation, there is no guarantee that the underlying data set hasn't changed in the meantime, resulting in a different cached data set and thus different results for the same queries.

AlwaysOn SQL

AlwaysOn SQL was designed with these issues in mind and directly addresses the shortcomings of previous approaches.

High Availability

The first thing we want to ensure is that this service is highly available and does not require any manual monitoring or intervention. AlwaysOn SQL can be enabled to start when DSE starts, without any manual steps. When it is enabled, the DSE nodes in the data center elect a node to serve as the connection point for AlwaysOn SQL, and the service is started there. If the service stops, either on purpose or due to a failure, the DSE nodes elect a new leader, and that node starts the service and acts as the new connection point.

How will client applications know which node is serving as the AlwaysOn SQL connection point? An API call can return the active connection point via the "dse client-tool" command-line tool, CQL, or a REST call to any of the nodes in the data center. Furthermore, we worked with Magnitude Software to enhance the ODBC and JDBC drivers for Spark SQL to leverage this REST API, so that clients are transparently connected to the active AlwaysOn SQL service node.
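To make this concrete, here is a minimal sketch of connecting from Scala with the open-source Hive JDBC driver, which speaks the same Thrift protocol family as Spark SQL services. The host, port, credentials, and table are assumptions for illustration only; in practice, the enhanced Magnitude drivers locate the active node for you instead of using a hard-coded address.

```scala
import java.sql.DriverManager

object AlwaysOnSqlClient {
  def main(args: Array[String]): Unit = {
    // Older driver versions need explicit registration; recent ones self-register.
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    // Illustrative host, port, and credentials -- the enhanced ODBC/JDBC
    // drivers discover the active AlwaysOn SQL node via the REST API
    // described above, so hard-coding a host is for demonstration only.
    val url = "jdbc:hive2://dse-node1.example.com:10000"
    val conn = DriverManager.getConnection(url, "analyst", "analyst-password")
    try {
      val stmt = conn.createStatement()
      // A hypothetical table store.orders, queried with plain SQL.
      val rs = stmt.executeQuery(
        "SELECT city, count(*) AS orders FROM store.orders GROUP BY city")
      while (rs.next()) {
        println(s"${rs.getString("city")}: ${rs.getLong("orders")}")
      }
    } finally {
      conn.close()
    }
  }
}
```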
One challenge with running Spark applications is ensuring that the necessary resources are available; otherwise, your application will queue and wait for other applications to finish. In DSE 5.1 we introduced the concept of workpools: subsets of computing resources that users can be authorized to submit applications to. For AlwaysOn SQL, we leverage this construct to create a dedicated workpool, ensuring that resources are always available when AlwaysOn SQL needs to start or restart.

Another challenge is how to provide a consistent cache experience in the face of service restarts. To handle this case, we introduced new SQL syntax for creating a persistent cache. When you create this new cache, a snapshot of the data from the database is saved to DSEFS, DSE's continuously available distributed file system, and then pulled into memory. When the AlwaysOn SQL service is restarted, the cache is repopulated from the DSEFS snapshot, guaranteeing that the cache holds the same data as before. This allows applications to reliably query the cached data without having to worry about whether the service has been restarted or whether the data is consistent from query to query.
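As a sketch of what cache management looks like over this interface: the statement below is standard Spark SQL caching, which is not persistent across restarts. DSE's persistent cache is created with its own dedicated syntax (see the DSE documentation) and adds the DSEFS snapshot behavior described above. Connection details and table names are illustrative.

```scala
import java.sql.DriverManager

// Standard (non-persistent) Spark SQL caching over JDBC. DSE's persistent
// cache variant additionally snapshots the cached data to DSEFS, so a
// restarted service repopulates the cache with identical data.
val conn = DriverManager.getConnection(
  "jdbc:hive2://dse-node1.example.com:10000", "analyst", "analyst-password")
val stmt = conn.createStatement()
stmt.execute(
  """CACHE TABLE hot_orders AS
    |SELECT * FROM store.orders WHERE order_date >= '2018-01-01'""".stripMargin)

// Subsequent queries are served from the in-memory copy.
val rs = stmt.executeQuery(
  "SELECT city, count(*) AS orders FROM hot_orders GROUP BY city")
```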
Advanced Security

Production applications require strict security, which is why AlwaysOn SQL was built to integrate with DSE's Advanced Security features. The users of AlwaysOn SQL are managed alongside the users of the DSE database, via CQL, with access to the service as another grantable privilege. When a user authenticates via ODBC/JDBC, AlwaysOn SQL uses DSE to determine permissions and grant access to the service. Users do not need additional usernames or passwords to use AlwaysOn SQL, and, since AlwaysOn SQL leverages DSE Advanced Security, the various Unified Authentication schemes are supported.

Authentication is one component of security; the other is authorization. Unlike the Spark SQL Thriftserver, which queries the DSE database as a single user regardless of the authenticated user, AlwaysOn SQL always queries the underlying database as the authenticated user. AlwaysOn SQL uses DSE's Proxy Execution feature to query database tables as the authenticated user, ensuring that the end user only has access to tables and data they have permission to read.

All the Data

AlwaysOn SQL provides SQL access to more than just DSE database tables, which are automatically registered in the SQL catalog for analysis. DSE Graph data is also available via SQL in the form of vertex tables and edge tables, just as in DseGraphFrames; these tables are likewise registered automatically in the SQL catalog. Data stored in DSEFS can be manually registered as external tables in AlwaysOn SQL. This provides a simple way to load data from DSEFS into DSE tables using only SQL, over ODBC or JDBC, and similarly to export data from DSE tables and graphs to DSEFS. External data sources can also be registered with AlwaysOn SQL, allowing simple data migration from legacy systems to DSE.
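As an illustration of the DSEFS-to-database flow, here is a small sketch using standard Spark SQL DDL over the same JDBC interface. The dsefs:// path, file format, and table names are hypothetical; see the DSE documentation for the supported data sources and options.

```scala
import java.sql.DriverManager

// Register a CSV file stored in DSEFS as an external table, then copy it
// into a DSE database table -- all via SQL over ODBC/JDBC, with no Spark
// programming required.
val conn = DriverManager.getConnection(
  "jdbc:hive2://dse-node1.example.com:10000", "analyst", "analyst-password")
val stmt = conn.createStatement()

// An external table backed by a (hypothetical) file in DSEFS.
stmt.execute(
  """CREATE TABLE staged_orders
    |USING csv
    |OPTIONS (path 'dsefs:///staging/orders.csv', header 'true')""".stripMargin)

// Load from the DSEFS-backed external table into a DSE database table.
stmt.execute("INSERT INTO store.orders SELECT * FROM staged_orders")
```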
Production-Ready SQL for Analytics

AlwaysOn SQL represents an important step in providing always-on SQL access to data, in support of business analysts and the tools they use. By enabling SQL users on DSE, we invite a broad group of developers to analyze data in DSE without having to climb the learning curve of Spark programming, including (potentially) learning a new programming language. But beyond enabling tools and developers, AlwaysOn SQL was designed from the ground up to support production deployments in enterprise environments, with availability and security as its two core tenets. You can download DSE 6 now and read through our updated documentation for more information. For a deeper look, see the white paper SQL Support in DataStax Enterprise.

An always-on, distributed cloud database needs non-stop analytics to manage workflows, derive insights, and enable analysis for business applications and business analysts. With that in mind, for this release we focused our attention for DSE Analytics on non-stop availability and ease of use for operational analytics workloads, resulting in some significant and impactful new features:

- AlwaysOn SQL, a highly available, enterprise-grade service to support production analytical SQL applications
- DataStax Enterprise (DSE)-specific enhancements to the Spark Catalyst optimizer, including automatic use of DSE Search indices to bring the performance benefits of DSE Search to Spark operations
- An upgrade of the DSE Analytics engine to Apache Spark™ 2.2, including Structured Streaming to enable improved streaming analytics for DSE

Building on the Goodness of DSE 5.1

Before diving into the new features of DSE 6, it's worth highlighting a few items from DSE 5.1 that DSE 6 builds on. The most notable of these is the performance gain we introduced via Continuous Paging: up to a 3x improvement for scans from the DSE database, which greatly accelerates operational analytic workloads on data in DSE. Another big enhancement was the general availability of DSEFS, a continuously available, HDFS-compatible, distributed file system that integrates seamlessly with Spark and is capable of scaling to 40TB per node. DSEFS provides not just checkpointing for Spark Streaming applications but also supports general use cases, including data reception, lambda-architecture-style data flows, and scan-heavy operational analysis. Also worth calling out are the improvements to DSE's Spark Resource Manager. The Resource Manager in DSE has been highly available for several versions, but DSE 5.1 brought significant improvements to its fault tolerance, security, and ease of use. In DSE 5.1, all nodes in the DSE data center can accept a Spark job submission, and all communications, whether from client to cluster or within the cluster, are protected by encryption.

Introducing DSE AlwaysOn SQL

Given the performance improvements in DSE 5.1, we turned our focus in DSE 6 to making the developer experience with analytic queries significantly simpler, and we wanted to ensure these improvements addressed the issues with running analytics in production. One ubiquitous API for analysis is good old SQL. It has been around for a long time, and a complete and extensive industry of tools, applications, and expertise has built up around doing data analysis and data flows with SQL. It's important to remember that, despite Spark being built in Scala, using Spark does not require Scala skills. Spark has always included a SQL component, and DSE Analytics inherits that benefit, too.

The Spark community has put a large amount of effort into making a strong SQL engine, but it has largely avoided addressing what it takes to build an enterprise-class ODBC/JDBC service that can be put into production. Such a service needs to be highly available and simple to use, and it must implement production-ready security measures for networking, user authentication, and user authorization. This is what AlwaysOn SQL was designed and built to achieve: a production-ready ODBC/JDBC service that provides SQL access to data in DSE, both the database and DSEFS, allowing ODBC/JDBC analytical applications to be put into production worry-free. The service automatically restarts in the face of failures, and cached datasets are refreshed automatically as well. Client applications connect seamlessly to the service without having to know the details of where in the data center the service is running. Authentication and authorization of SQL users occur via the same users managed within DSE Advanced Security, removing the need for extra, error-prone security setup steps. Queries against the underlying DSE database are executed as the authenticated SQL user, providing fine-grained security over the data. More details on this are coming in a later blog post, so please keep an eye out for it.

Improved Spark Analytic Engine

DSE 6 includes a number of improvements under the hood to support not just AlwaysOn SQL but also general Spark applications, and DataStax has invested in several areas to improve the performance of Spark applications in DSE Analytics.

First, DSE 6 upgrades the DSE Analytics engine to Apache Spark 2.2. This landmark release sees the new Structured Streaming component graduate from an experimental feature to a full, first-class citizen. Beyond Structured Streaming, Spark 2.2 focuses on usability, stability, and polish. To support the new Structured Streaming API, DSE 6 includes a new Structured Streaming sink exclusively for DSE, enabling simple, efficient, and robust streaming of data into DSE from Apache Kafka, file systems, or other sources.
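Here is a minimal sketch of what such a pipeline can look like, assuming a Kafka topic named events and a pre-created table ks.events (key text PRIMARY KEY, value text). The sink is addressed through the Spark Cassandra Connector's data source name, and the broker address, topic, keyspace, table, and checkpoint path are all illustrative; consult the DSE documentation for the exact sink options.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("stream-to-dse").getOrCreate()

// Read a stream of records from a (hypothetical) Kafka topic.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka1.example.com:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

// Write each micro-batch into the DSE database. The checkpoint in DSEFS
// lets the query recover exactly where it left off after a failure.
val query = events.writeStream
  .format("org.apache.spark.sql.cassandra")
  .option("checkpointLocation", "dsefs:///checkpoints/events")
  .option("keyspace", "ks")
  .option("table", "events")
  .start()

query.awaitTermination()
```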
DSE Analytics also now automatically leverages any DSE Search indices, pushing the query down to let the search engine execute it efficiently. This allows not just free-text searches but also Boolean predicates to be evaluated efficiently by the Lucene-based engine, which is well suited to such queries. In some cases, namely when a query would return a large portion of the data, it is less efficient to fetch the data via the index and more efficient to simply scan the whole data set and let Spark evaluate the predicates. DSE Analytics will actually query the DSE Search component to determine how much data would be returned, and decide accordingly, and automatically, which approach is more efficient for that query.

DSE Analytics also introduces a new join strategy exclusively for DSE: the Direct Join. Those of you familiar with Spark Streaming with Apache Cassandra™ already know the joinWithCassandraTable method for joining against lookup data in the database. The new direct join is the Dataset analogue of that RDD method. Moreover, the direct join is chosen automatically in situations where it is the preferred approach to joining with DSE database data. The direct join has clear uses for Spark Streaming applications, but it is equally useful for batch applications as well.
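To illustrate, the following sketch joins a small set of keys against a Cassandra-backed DataFrame on its partition key, which is the shape of query the direct join targets. The keyspace, table, and key values are hypothetical; per the above, the optimizer chooses the direct join automatically when it estimates that approach to be cheaper than a full scan.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("direct-join").getOrCreate()
import spark.implicits._

// A DataFrame over a (hypothetical) table ks.users, partition key user_id.
val users = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "users"))
  .load()

// A small driving set of keys, e.g. from a stream batch or a query result.
val recentIds = Seq("u-1001", "u-1002", "u-1003").toDF("user_id")

// An equality join on the partition key: eligible for the direct join,
// which turns the join into targeted lookups instead of a full table scan.
// The RDD analogue is recentIdsRdd.joinWithCassandraTable("ks", "users").
val joined = recentIds.join(users, Seq("user_id"))
joined.show()
```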
Finally, DSE GraphFrames, introduced in DSE 5.1, has been expanded to provide even more capabilities and improved performance in DSE 6. Additionally, all graphs in DSE Graph now automatically appear in the Spark SQL catalog as vertex and edge tables. This enables simple access to basic SQL and Dataset operations on the table representation of graph data, including via ODBC/JDBC and AlwaysOn SQL.

And Away We Go!

These new enhancements to DSE Analytics deliver improved simplicity, reliability, flexibility, and performance to the DSE platform. AlwaysOn SQL brings production-ready, enterprise-grade ODBC/JDBC to DSE Analytics, enabling a large ecosystem of tools, applications, and expertise. DSE Analytics 6 continues the enhancements and optimizations to DSE's Spark engine that customers have come to expect from DataStax. You can download DSE 6 now and read through our updated documentation for more information.
Built on a production-certified version of Apache Spark™ and with integrated search and graph capabilities, DSE Analytics provides highly available, production-ready analytics that enable enterprises to securely build instantly responsive, contextual, always-on applications and to generate ad hoc reports. Read this white paper to learn about the specific features and capabilities of DSE Analytics, and why DSE Analytics is designed for the Right-Now Enterprise.
There are going to be so many amazing talks this year at DataStax Accelerate, but one I'm particularly excited about is "To Spark or Not to Spark" by Russell Spitzer. I feel like this talk is the perfect pairing with my talk, "Lightening A Spark with Machine Learning". While my talk will focus on the practical "what and how" of machine learning with Apache Spark™ and Apache Cassandra™, Russell's will focus mostly on the "why". Russell will cover the best and most accessible use cases for distributed analytics with Spark and DataStax Analytics, and he will also discuss advanced use cases using Spark's streaming service. I will be sitting in the front row to make sure I get to ask the first question: "How is Spark Streaming different from Apache Kafka™?!"

Russell will also be covering some of the "what and how" that my talk will not, such as using Spark to load data, modify tables, and move data from cluster to cluster. These are topics I am frequently asked about, so I am excited to finally get my own questions answered! I cannot wait to hear this talk, and I think it's the perfect pairing with mine, like peanut butter and jelly! And of course the answer is: "TO SPARK!" Come check out both our talks at Accelerate!