Introducing AlwaysOn SQL for DSE Analytics
date: May 3, 2018
This service will allow BI, ETL, and other tools to connect via the standard ODBC and JDBC protocols to analyze data in DSE, including tabular data from the database, graph data from DSE Graph, and files stored in DSEFS. This service is not just highly available, a necessary requirement for production, but also fully secure, with security integrated with the rest of DSE.
Why do We Need It?
SQL is a standard that has been around for over 40 years and has grown to be the de-facto language for querying data. Many new database vendors have based their own query languages, such as the Cassandra Query Language (CQL) for Apache CassandraTM and many others, on SQL. Many of these languages, including CQL, support a subset of the SQL standard, since some of the standard is at odds with the goals and designs of these new databases. That said, SQL remains an important tool for data analysis, and an entire ecosystem of tools and expertise has grown around it over the past 40 years.
ODBC and JDBC are the two main programming interfaces to submit SQL to a database and retrieve results. These APIs enable SQL-based tools, such as BI tools, to support many databases at once, rallying around a single API against which to design.
In addition to tooling, SQL is a common language of choice for data analysis in general. Most data and business analysts are familiar with SQL and interact daily with data via SQL. While Apache SparkTM has other interfaces, such as Scala and Python, many data analysis tasks do not require the heavy weight of a Scala or Python application and can be easily accomplished with SQL. Analysts are looking for a simple way to ask questions of the data and receive results, without having to program in Spark.
So, what does an SQL service need to provide to the end user?
- It must be 100% reliable, meaning it must be always on and not depend on manual intervention to handle failures.
- It should leverage the common interfaces that exist, namely ODBC and JDBC, and thereby easily integrate with existing tools and allow users to leverage their existing knowledge.
- It should offer robust security, including authentication and authorization, and this security should be integrated with the rest of the data platform, leveraging the same users, permissions, and user management.
- It should be fully integrated into the database. It should not feel like a bolt-on addition but integrated and managed like the rest of the platform.
A Little History
DSE has shipped with SQL access to data for many releases, first via Hive and Hiveserver, then Shark and Sharkserver, and finally with Spark and the Spark SQL Thriftserver, which is what is available in DSE 5.1. However, neither of these solutions were worry-free in production environments.
In DSE 5.1, the role of ODBC/JDBC service for DSE was handled by the Spark SQL Thriftserver. This is the same approach taken by open-source Apache Spark and the various Spark vendors. However, the Spark SQL Thriftserver leaves a bit to be desired, in terms of being enterprise-class.
First, security in the Spark SQL Thriftserver has some notable shortcomings. While it’s true that you can enable LDAP authentication for the Spark SQL Thriftserver, this is not integrated with DSE, and does not support the variety of authentication schemes that DSE supports. More importantly, the Spark SQL Thriftserver runs as a single, long-running Spark application, and as such it connects to the DSE database as a single user. That is, once any user has authenticated to the Spark SQL Thriftserver, it will query the underlying DSE database as a single user regardless of which user was authenticated. This is a limitation of the open source Spark SQL Thriftserver and the Spark Cassandra Connector, and it’s clearly sub-par for security-conscious customers.
Second, the Spark SQL Thriftserver is not highly available. It must be manually started, and it’s left to the administrator to keep an eye on it. If it stops for some reason, the administrator has to restart the service again, potentially on another node (in case of node failure). This lack of high availability is prohibitive to putting it into production. Furthermore, if the service is restarted, then analytical SQL applications need to update their connection to whichever machine the service was restarted on.
The Spark SQL Thriftserver supports caching tables in memory to support even faster query times. However, if the service stops, intentionally or unintentionally, then the cached data is not automatically re-cached for the user. Moreover, if the user did issue a manual re-cache operation, when it took the data out of the data source this second time there is no guarantee that the underlying data set hasn’t changed, resulting in a different cached data set and thus different results for queries.
AlwaysOn SQL was designed with these issues in mind, and is extremely conscious of the shortcomings of previous approaches.
The first thing we want to ensure is that this service is highly available and does not require any manual monitoring or intervention. AlwaysOn SQL is enabled to start when DSE starts without any manual steps. When it’s enabled, the DSE nodes in the data center will elect a node to serve as the connection point for AlwaysOn SQL, and the service will be started. If the service stops, either on purpose or due to a failure, the DSE nodes will elect a new leader and that node will start the service and act as the new connection point.
How will client applications know which node is serving as the AlwaysOn SQL connection point? There’s an API call that can return the active connection point via the “dse client-tool” command-line tool, CQL, or a REST call to any of the nodes in the data center. Furthermore, we worked with Magnitude Software to enhance the ODBC and JDBC drivers for Spark SQL to seamlessly leverage this REST API so that clients are transparently connected to the active AlwaysOn SQL service node.
One challenge with running Spark applications is ensuring that the necessary resources are available, otherwise your application will queue and wait for other applications to finish. In DSE 5.1 we introduced the concept of Workpools, subsets of computing resources that users can be authorized to submit applications to use. For AlwaysOn SQL, we leverage this construct to create a Workpool just for AlwaysOn SQL, ensuring that the resources are always available when AlwaysOn SQL needs to start or restart.
Another challenge is how to give a good, consistent cache experience in the face of service restarts. To handle this case, we introduced new SQL syntax to support taking a persistent cache. When you create this new cache, a snapshot of the data from the database will be saved to DSEFS, DSE’s continuously available distributed file system, and then pulled into memory. Therefore, when the AlwaysOn SQL service is restarted, the cache can be repopulated from the DSEFS snapshot, ensuring the same data in the cache as before. This allows applications to reliably query the cached data without having to worry if the service has been restarted or if the data is not consistent from query to query.
Production applications require strict security, which is why AlwaysOn SQL was built to integrate with DSE’s Advanced Security features. The users for AlwaysOn SQL are managed along with the users of the DSE database, via CQL as another GRANT-able privilege. When a user authenticates via ODBC/JDBC, AlwaysOn SQL uses DSE to determine permissions and grant access to the service. Users do not need additional usernames or passwords to use AlwaysOn SQL, and, since AlwaysOn SQL leverages DSE Advanced Security, the various Unified
Authentication schemes are supported. Authentication is one component to security; the other is authorization. Unlike the Spark SQL Thriftserver, which queries the DSE database as a single user regardless of the authenticated user, AlwaysOn SQL always queries the underlying database as the authenticated user. AlwaysOn SQL uses DSE’s Proxy Execution feature to properly query database tables as the authenticated user, thereby ensuring the end user only has access to tables and data they have permissions to read.
All the Data
AlwaysOn SQL provides SQL access not just to DSE database tables, which are automatically registered in the SQL catalog for analysis. DSE Graph data is also available via SQL in the form of vertex tables and edge tables, just like in DseGraphFrames. These tables are also automatically registered in the SQL catalog.
Data stored in DSEFS can be manually registered as external tables in AlwaysOn SQL. This allows for a simple way to load data from DSEFS into DSE tables via SQL - and ODBC or JDBC - only, and similarly for exporting data from DSE tables and graphs to DSEFS. External data sources can also be registered with AlwaysOn SQL, allowing simple data migration from legacy systems to DSE.
Production-Ready SQL for Analytics
AlwaysOn SQL represents an important step to providing always-on SQL access to data, in support of business analysts and the tools they use. By enabling SQL users on DSE, we invite a broad group of developers to analyze data in DSE without having to climb the learning curve of Spark programming, including (potentially) learning a new programming language. But beyond enabling tools and developers, AlwaysOn SQL was designed from the ground up to support production deployments in enterprise environments, focusing on availability and security as the two core tenets.