Since the Open Database Connectivity (ODBC) is such a prolific framework to enable Business Intelligence (BI) tools, DataStax provides a set of ODBC drivers to enable these use cases. While the decision to support the API was an easy product decision, what hasn’t always been easy is how to use the different drivers among the systems for the various BI and ETL tasks. This blog is aimed to keep you on the green-path of proper usage and use cases for these drivers.
How We Got Here
In 2015, DataStax released a new CQL ODBC driver that was made publicly available for Apache Cassandra™ and DataStax Enterprise (DSE). Later, DataStax also provided a SQL ODBC driver for Apache Spark™. Both drivers can be used independently with DataStax Enterprise. With the release of DSE 6, DataStax is providing updated Spark ODBC drivers that take advantage of this functionality.
Apache Cassandra and ODBC
It’s a well-known fact that Cassandra is purpose-built for use cases with scale and high-availability requirements. For those strengths, we also acknowledge that we, as users, are limited on how we can query our data to maximize these core values from Cassandra.
While functional support for an API like ODBC is trivial, you can’t just simply change the API and expect purpose-built databases to become general purpose or even operate in the same way a relational database does. By now, most Cassandra users know that there are limitations on how data can be queried, not just through query syntax but also by how the internal mechanisms of the system work. For example, scanning all the partitions of a table is a Cassandra “anti-pattern” and the performance typically will not meet your expectations.
For all but the most straightforward queries, the performance when leveraging the ODBC driver for CQL is mediocre at best. This is because the driver runs on the client machine and handles all the operations not supported by CQL, such as joins or rollups. These type of queries typically result in out-of-memory errors.
Which ODBC Driver Should I Use and How?
Both of the two publically available ODBC drivers can be used with DSE, but only one works with Cassandra. The two drivers are meant for different DSE components, one for the Cassandra piece, one for Spark.
So, what’s the right way to effectively utilize BI and ETL tools against these components and products? Simple: adhere to the strengths of the technology that’s being connected to and what’s available.
When using the Cassandra ODBC driver for the BI use case of importing and exporting data, the general best practices for CQL access patterns still apply. The best results will come from sticking to these CQL read and write fundamentals while using the Cassandra ODBC driver for this use case. It’s essentially sticking to the best practices of using pure CQL. While this approach may sound limiting, it’s the best path to success because of how Cassandra is built.
For more complicated reads, aggregations, joins, and ETL, DataStax recommends using Apache Spark, which can mean deploying the SQL engine alongside Cassandra. The read-heavy requirements that cannot impose limits on query ability should simply always use Spark and the Spark ODBC driver. This far more capable approach uses the chosen SQL engine of the DSE platform as shown by one of our latest blogs on using Tableau against DSE.
While writes may not be as efficient as the CQL driver for ETL use cases, the read capabilities should outweigh the write performance needs—meaning, if you can’t query or transform the data that you want with CQL, use Spark to get the job done and pay the write penalty.
The same usage recommendations would apply to DataStax Basic, as well, as there is no Spark with Basic. DSE is a more capable data platform versus niche database because it is built with both Cassandra and Spark. There will be a strong preference in using DSE over DataStax Basic for BI needs, not just because of Spark but also because of enhancements such as AlwaysOn SQL, a new, highly-available, secure SQL service in DSE 6 and the custom optimizations in our Spark engine. Admittedly, while DSE has more functionality, it also has a bit more configuration complexity, since we still need to leverage both CQL ODBC and Spark ODBC independently, as described.
Usage Reference Table
As DSE extends the capabilities of CQL with things like search-enabled CQL and implicitly leveraged indexes, more flexibility will be available for the CQL ODBC driver. For now, using the CQL ODBC driver outside of this recommended CQL usage is premature. As we move to unify the platform, there will be convergence to one ODBC/JDBC driver for DataStax, whether it’s basic or enterprise. Download the drivers today from the DataStax downloads page.