Tableau + Spark + Cassandra

By Victor Coustenoble -  January 11, 2016 | 5 Comments

This article is a simple tutorial explaining how to connect Tableau Software to Apache Cassandra™ via Apache Spark™

Note: This blog post was written targeting DSE 4.8 which included Apache Spark™ 1.4.1. Please refer to the DataStax documentation for your specific version of DSE if different.

logos
This tutorial explains how to create a simple Tableau Software dashboard based on Cassandra data. The tutorial uses the Spark ODBC driver to integrate Cassandra and Apache Spark. Data and step-by-step instructions for installation and setup of the demo are provided.

1/ Apache Cassandra and DataStax Enterprise

First you need to install a Cassandra cluster and an Apache Spark™ cluster connected with the DataStax Spark Cassandra connector. A very simple way to do that is to use DataStax Enterprise (DSE), it’s free for development or test and it contains Apache Cassandra and Apache Spark already linked.

You can download DataStax Enterprise from https://academy.datastax.com/downloads and find installation instructions here http://docs.datastax.com/en/getting_started/doc/getting_started/installDSE.html.

After the installation is complete, start your DSE Cassandra cluster (it can be a single node) with Apache Spark™ enabled with the command line "dse cassandra -k".

2/ Spark Thrift JDBC/ODBC Server

The Spark SQL Thrift server is a JDBC/ODBC server allowing JDBC and ODBC interfaces for client connections like Tableau to Spark (and then to Cassandra). See here for more details http://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/spark/sparkSqlThriftServer.html.

Start the Spark Thrift JDBC/ODBC server with the command line "dse start-spark-sql-thriftserver".

You should see a new SparkSQL application running here http://127.0.0.1:4040/ from the Spark UI manager http://127.0.0.1:7080/.

The IP address is the address of your Spark Master node. You may need to replace 127.0.0.1 with your instance IP address if you are not running Spark cluster or DSE locally. With DSE, you can run the command "dsetool sparkmaster" to find your Spark Master node IP.

sparkui

Note that to connect Tableau Software to Apache Cassandra we would have been able to connect directly via the DataStax ODBC driver. But in this case all computations, joins, aggregates are done on the client side, so it's not efficient and risky for large dataset. On the contrary, with Spark jobs everything is done on the server side and on a distributed manner.

3/ Demo Data

Create the 3 demo tables, you can find all data and the script to create CQL schemas and to load tables here : https://drive.google.com/drive/u/1/folders/0BwpBQmtj50DFaU5jWTJtM1pleUU.

When you have downloaded everything, run the script “ScriptCQL.sh” to create schemas and load data (cqlsh must be in your path or download everything into the cqlsh directory). A keyspace named ks_music with 3 tables albums, performers, countries is created.
schema1 devcenter

4/ ODBC Driver

Download and install the Databricks ODBC driver for Spark from https://databricks.com/spark/odbc-driver-download or from https://academy.datastax.com/downloads/download-drivers.

No specific parameter is need, the default installation is ok. The Mac version can be found only on the Databricks Web site.

5/ Tableau Software

Open Tableau and connect to the Apache Spark server with following settings from the Connect panel:

1

The server IP is the ip address of sparksql thriftserver which may also change depending of your installation. You may also change authentication settings depending of your configuration.

6/ Cassandra Connection

Then you should be able to see all Apache Cassandra keyspaces (named Schema in Tableau interface) and tables (click enter in Schema and Table inputs to see all available Cassandra keyspaces and tables).

Drag and drop albums and performers tables from the ks_music keyspace.

Change the inner join clause with right columns from the 2 tables, Performer from albums table and Name from performers table (click on the blue part of the link between the 2 tables to be able to edit this inner join).

2
Keep a “Live” connection ! Don’t use “Extract” because otherwise all your data will be loaded into Tableau.

3

Update Now” to see a sample of data returned.

7/ Tableau Dashboard

Go to the Tableau worksheet “Sheet 1” and start a simple dashboard.

Convert Year column (from albums table) to Discrete type (click at the right of the Year column to do that from a menu).

3.5
Add Year (from albums table) as Rows, Gender (from performers table) as Columns and Number of Records as the measure.

4
And with the “Show Me” option, convert your table into a stacked bars chart.

5
Done, you have created your first tableau dashboard on live Cassandra data !

8/ SparkSQL and SQL Queries

Finally you can check SQL queries generated on the fly and pass to SparkSQL from the Spark UI http://127.0.0.1:4040/sql/ (SQL tab of the SparkSQL UI).

This shows SparkSQL processes and all SQL queries generated by Tableau Software and executed on top of Apache Cassandra data through the Spark Cassandra connector.

6

Additional links









DataStax has many ways for you to advance in your career and knowledge.

You can take free classes, get certified, or read one of our many white papers.



register for classes

get certified

DBA's Guide to NoSQL







Comments

  1. Barrie Byron says:

    “The demonstration is done through the Spark ODBC driver and so it demonstrates also the Cassandra and Apache Spark integration.” >> “The tutorial uses the Spark ODBC driver to integrate Cassandra and Apache Spark.”
    “All installations and setups are explained and data are provided to create such demo step-by-step.” >> perhaps we can rewrite to avoiding the plural data/datum conundrum “Data and step-by-step instructions for installation and setup of the demo are provided.”
    1/ “After the installation, start your DSE Cassandra cluster (it can be a single node) with Spark enable with the simple command line “dse cassandra -k. >> gently suggest that we limit the use of the word simple, so it’s omitted here
    “After the installation is complete, start your DSE Cassandra cluster (it can be a single node) with Spark enabled with the command line “dse cassandra -k”.
    4/ “A keyspace named ks_music with 3 tables albums, performers, countries are created.” >> “is created”
    5/ Victor, in addition to the visual Tableu Connect panel, can we list the specific required settings for Spark SQL? Do we want to mention auth ? (or any other user access requirements?)
    8/ “This show SparkSQL processes … ” >> “This shows SparkSQL processes …”

  2. Dori says:

    Hi,

    I’m trying to set up a connection between (spark+cassandra) and Tableau but I’m facing some issues.
    I’m running DataStax Entreprise on a Ubuntu VM and my tableau desktop is on my Windows host machine.

    How can I access to my VM?
    Thanks

  3. Quang says:

    I cannot get this works.

    On step 1, I used DSE 5.0

    On step 2, I successfully get the spark application org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running.

    On step 3, I made some demo data.

    On step 4, downloaded and installed SimbaSharkODBC64 on windows OS, but when I tried to create a new ODBC Data Source with that Simba driver, it keeps producing the error:

    Driver Version: V0.1.1.0002

    Running connectivity tests…

    Attempting connection
    Failed to establish connection
    SQLSTATE: HY000[Simba][SharkODBC] (22) Error from ThriftHiveClient: No more data to read.

    TESTS COMPLETED WITH ERROR.

    And of course on step 5, When creating Tableau Spark SQL connection, it keeps getting error.

    Detailed Error Message:

    Unable to connect to the ODBC Data Source. Check that the necessary drivers are installed and that the connection properties are valid.
    Unable to connect to the server “192.168.1.10”. Check that the server is running and that you have access privileges to the requested database.

  4. RKS says:

    Hi Everyone,

    Is there any to connect DSE Graph with Tableau ?

    1. vincent poncet says:

      Hi RKS,

      Not yet as of DSE 5.0
      That something we are working on.
      Please contact your DataStax Enterprise Sales Manager to get more details.

Comments

Your email address will not be published. Required fields are marked *




Subscribe for newsletter:

Tel. +1 (408) 933-3120 sales@datastax.com Offices France Germany

DataStax Enterprise is powered by the best distribution of Apache Cassandra™.

© 2017 DataStax, All Rights Reserved. DataStax, Titan, and TitanDB are registered trademark of DataStax, Inc. and its subsidiaries in the United States and/or other countries.
Apache Cassandra, Apache, Tomcat, Lucene, Solr, Hadoop, Spark, TinkerPop, and Cassandra are trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries.