Manikandan Srinivasan

For years, a common criticism of NoSQL databases was that you couldn’t run join queries like those possible in an RDBMS. While this is true for some <a href="https://www.datastax.com/products/datastax-enterprise">NoSQL databases</a>, we thought it would be helpful to remind Apache Cassandra™ users that join operations are now possible with Cassandra.

There are a couple of ways that you can join tables together in Cassandra and query them:

<ol>
	<li>
	Use Apache Spark’s SparkSQL™ with Cassandra (either open source or in DataStax Enterprise - DSE).
	</li>
	<li>
	Use DataStax provided ODBC connectors with Cassandra and DSE.
	</li>
</ol>

In this post, we’ll first illustrate how to perform SQL joins [1] on Cassandra tables using SparkSQL, and then look at how to use DataStax’s ODBC connector to create join queries [2] that can feed dashboards in BI software like Tableau [3].

<h2>Creating Join Queries Using Spark and Cassandra</h2>

While you can create your own combined Cassandra and Spark clusters using open source, it’s a lot easier to use DSE, as it bundles and certifies Spark with Cassandra as part of its analytics package. To use Spark in DSE, you simply start one or more nodes in analytics mode, and you’re ready to roll.

DSE ships with a Weather Application Demo that shows how DSE Analytics works. We’ll use a couple of the objects in that demo to illustrate how to perform a simple join operation. For more details on how to set up the demo, and to view the much more complicated join queries used in the application, please refer to our online <a href="https://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkWSDemo.html">documentation</a>.

The tables used in this example have the following structures:

<img alt="" src="https://lh6.googleusercontent.com/X-MWgZfBsY4nyK9zBainPPxGynFwbuqeHZjfx0MLf1Pg1L7Z4Jl8paJDfyJzYHwv7S7SXD04NqYRdOLbourkCvS7nge1KrPZR-zeblwxIDF-RL7DpQHLJQbcHGahqpB4owP9Qco" />

To create a join query, we first start one or more DSE nodes in analytics mode by executing:

<img alt="" src="https://lh5.googleusercontent.com/ACoqu_09WhcgnlAJ5LBWxJFbTROSxiAj34awtR1ciefmTA4bo9aB-AXNW3DIi7WrM3Mth0veKP2xcyb4_V2vNQG0jhMCH21a34-5L9ShtsgUUzyoBt7i9LmcePnlev7m8fw89Ac" />

We then indicate what keyspace to use, which in this case is the “weathercql” keyspace:

<img alt="" src="https://lh5.googleusercontent.com/gXqdycliWsFoQJtcBXM8jCUtVDkBAo1Bf0V2zFHX9ElTKIN_yPzNox0QQoJ8QC7oaiN2PCGFbxHn27uXm64rcdEse5gWM8yn0lFknqh60fyr2ItJ0w8mW7ja9BwQw9jvlRzsJQM" />

Creating a join operation with SparkSQL involves using the following syntax:

<img alt="" src="https://lh6.googleusercontent.com/fJa4i_7Kf42BMafaJl6mgr8cM1AT3rTqH3H1Nv9sFyZaK3L-xnEY2rQ-cGhSCX4cjklwPSW313GbDaJEVs1CialFnxeOSPk5C2ggIqcLOimlVMqqmgF1qHrJHMZBE_y8sqrQgr0" />

For this example, we’ll join data from the monthly and station tables, store the results in a SparkSQL CassandraSQLContext RDD (resilient distributed dataset) called “results”, and then iterate through and print the results:

<img alt="" src="https://lh4.googleusercontent.com/yRi3zI95OfuCOzUX7wPfDeFkgxhJo86Khn6ZOvy_5G3JuNnqw5hHl0BWcJCIUDg0a93jvSwWOWDR4UQG4cM2elNYcr6YsSMDkn5GSv5TKlPmyIc1RKaDJvuNH-TaHwqWZF0C1uw" />
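Because the query above is shown only as a screenshot, here is a rough, plain-Python sketch of what such an equi-join computes. It does not use Spark or Cassandra. The column names (`stationid`, `location`, `month`, `mean_temp`), the sample rows, and the `hash_join` helper are all hypothetical stand-ins for the demo's station and monthly tables.

```python
from collections import defaultdict

# Hypothetical sample rows; the real demo tables have different columns.
station = [
    {"stationid": "S1", "location": "Austin"},
    {"stationid": "S2", "location": "Denver"},
]
monthly = [
    {"stationid": "S1", "month": 1, "mean_temp": 11.5},
    {"stationid": "S1", "month": 2, "mean_temp": 13.0},
    {"stationid": "S2", "month": 1, "mean_temp": -0.5},
    {"stationid": "S9", "month": 1, "mean_temp": 4.0},  # no matching station
]


def hash_join(left, right, key):
    """Inner equi-join: index `right` by key, then probe it with each `left` row."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    joined = []
    for lrow in left:
        # Rows in `left` without a match in `right` are dropped (inner join).
        for rrow in index.get(lrow[key], []):
            joined.append({**rrow, **lrow})
    return joined


# Join, then iterate through and print the results, as in the walkthrough.
results = hash_join(monthly, station, "stationid")
for row in results:
    print(row["location"], row["month"], row["mean_temp"])
```

In a real cluster the join engine does this work across many nodes, which is why footnote [1] on cost applies.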

<h2>Creating Join Queries with Cassandra and ODBC</h2>

You can create join queries on Cassandra data outside of Spark by using DataStax’s free ODBC driver (we also supply an ODBC driver for Spark). This means that any developer/DBA/BI/ETL tool that has ODBC connectivity can connect to and query data in Cassandra.

It is important to note that join operations done with the current ODBC driver should not involve large tables as the performance may not be acceptable for most queries that target big clusters.

Let’s take a look at how this works with Tableau, one of the most popular BI tools on the market. The steps below show a simple way to execute a freehand SQL join query using Tableau [3] and DataStax Enterprise 4.6.

1. First, we create an ODBC connection/datasource (Fig 1) to DataStax Enterprise.

<img alt="" src="https://lh4.googleusercontent.com/z8H_0dNCoWalqHVrdVhxn8Ov9s6FD7IMqAa_n43GXwW1TBsY4iNLEzn3G4Zlzq6ll-5tt9H0pWgjatJEKq6LpOMplFGUhHhi-X4XW3iPkt_BaHj-mu0h6yWVCvRqwbz3tmPAse8" />

Fig 1: DataStax Cassandra ODBC Connector

2. Next, we open Tableau [3] and connect to Cassandra (Fig 2) using the ODBC connection created in the previous step.

<img alt="" src="https://lh6.googleusercontent.com/hja2DWaPO9L2fhOmAuSIA8jc3B8ppzwddOfno_6eI6mOD5HZw-qsK5cyZ37jypV9uoC8ObYOZOE0oofAaNukL1u7Bes-zkqRo21R6Mjz5Qug_fdEASz_Kiv7YZkQeEBMRr2oRAA" />

Fig 2: Connecting to Cassandra using ODBC

3. Then, we code our join query using Tableau’s Custom SQL Query [4] editor (Fig 3) to create a dashboard that displays the join query’s results (Fig 4).

<img alt="" src="https://lh5.googleusercontent.com/4eqpS1NF0eDanEewdVHRSlT6_NOkCk97Nsa7z6nWxKZO_pKq2dYNATs82wFcBcnLhW2yrqTnsyIMyH-AS1depKpTREJ_OHKS34Olwm0QuuAjdsfoJqB1HtMa91wEfxYVkupS9t8" />

Fig 3: Custom SQL against DataStax Enterprise

<img alt="" src="https://lh3.googleusercontent.com/eUFVVuFUk1Jb4-jkx_GHvrDbMuh7Zn-HgYo_ob5hxV2RPt6d410tRQCm0m2HzjqckNpOtqFdy8_Deb88o7uckiSe3xM6BXk0hku6y079N2LaV-Mrz9e2Dq343R5y3kTQMR1Xib8" />

Fig 4: Tableau Dashboard
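The same kind of freehand join can also be issued from code rather than from a BI tool. The sketch below is a minimal illustration, not the query from Fig 3: the SQL text, table columns, and the `fetch_join` and `make_stand_in` helpers are hypothetical, and an in-memory SQLite database stands in for the ODBC datasource so that the example runs anywhere.

```python
import sqlite3

# Hypothetical join query; table and column names are illustrative only.
JOIN_SQL = """
SELECT s.location, m.month, m.mean_temp
FROM monthly AS m
JOIN station AS s ON m.stationid = s.stationid
ORDER BY s.location, m.month
"""


def fetch_join(conn, sql=JOIN_SQL):
    """Run a join through any DB-API 2.0 connection and return all rows."""
    cur = conn.cursor()
    try:
        cur.execute(sql)
        return cur.fetchall()
    finally:
        cur.close()


def make_stand_in():
    """Build an in-memory SQLite database that mimics the two tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE station (stationid TEXT PRIMARY KEY, location TEXT)")
    conn.execute("CREATE TABLE monthly (stationid TEXT, month INTEGER, mean_temp REAL)")
    conn.executemany("INSERT INTO station VALUES (?, ?)",
                     [("S1", "Austin"), ("S2", "Denver")])
    conn.executemany("INSERT INTO monthly VALUES (?, ?, ?)",
                     [("S1", 1, 11.5), ("S1", 2, 13.0), ("S2", 1, -0.5)])
    return conn


rows = fetch_join(make_stand_in())
for location, month, mean_temp in rows:
    print(location, month, mean_temp)
```

If the third-party `pyodbc` package is installed and a DSN has been configured as in Fig 1, a connection obtained from `pyodbc.connect("DSN=...")` (with your own DSN name filled in) is also a DB-API connection and could be passed to `fetch_join` in place of the stand-in. The SQL dialect the driver accepts may differ from this sketch, so treat the query text as a starting point only.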

<h2>And Coming Soon...Joins on Steroids with Graph!</h2>

Back in February of this year, we announced our acquisition of Aurelius, the company behind the Titan open source graph database. If you understand what a graph database can do, then you know that one thing it does very well is traverse multiple relationships between vertices (entities in the RDBMS world), without any need to create indexes or materialized views to work around the join performance inefficiencies of an RDBMS.

In short, a graph database represents the ultimate in joins where ease of use and performance are concerned. Coming soon in DSE will be DSE Graph, which will provide just this type of capability along with multi-model support in DSE.

As an example of how graph can dramatically reduce the complexity of join operations, the comparison below shows a sample RDBMS join query for a recommendation engine application on the left vs. how the exact same query is handled in a graph database (Fig 5). Big difference, wouldn't you say?

<img alt="" src="https://lh5.googleusercontent.com/PJr9ju3oaOur7mLDGd9NsEFfd2708CAkshsQp_yjxilRYGK4Bqzsccg0vniFv4PH9ny8747GWomDgPZnBdfGRA-2sGSZvJhHwKfrrd-jifPl__JzW0GHeVBnIgeUE3daaVBgsmM" />

Fig 5: RDBMS Join Query vs. Graph Database Join Query

<h2>Conclusion</h2>

Joining Cassandra tables together with SQL-styled queries can be carried out in multiple ways today, with each method being easy to use and code. For more information on creating joins on Cassandra data, please refer to the online documentation <a href="https://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkSqlSupportedSyntax.html">here</a> and <a href="https://www.datastax.com/documentation/developer/odbc-driver/doc/odbcdriver/aboutODBC.html">here</a>. You can find downloads of DSE and our ODBC drivers on our <a href="https://www.datastax.com/download">downloads page</a>.

[1] Customers need to consider the costs of creating such ad-hoc queries against distributed databases.

[2] Customers need to run thorough query performance assessments when using this option.

[3] Customers can use any Business Intelligence or ETL solution that supports standard JDBC/ODBC.

[4] Custom SQL is for illustrative purposes only. You can use any methodology supported by your BI vendor to create Reports or Dashboards. ETL jobs can be created in a similar method using your favorite ETL tool.


How to do Joins in Apache Cassandra™ and DataStax Enterprise
