Apache Spark! Darling of the Big Data world and the easiest entry point into Machine Learning. It's fast, it's cool, and it's hip. So many great things, but should you be using it in your stack?
Usually when introducing developers to new technology, we like to jump on board and start integrating, but it's important to realize that there is a cost to adding any technology to your stack. Sometimes there may even be better solutions with a lower maintenance cost that should be investigated first.
Before I dive in: if this topic is of great interest to you, my talk at DataStax Accelerate will take a closer look at this question, but to whet your appetite, let's examine the question of whether to use Spark from fifty thousand feet.
In general, the benefits of Spark are:
- In-memory data processing
- Easy-to-use interfaces
- A SQL API
- Streaming and batch frameworks
- Compatibility with dozens, if not hundreds, of data sources
In general, the drawbacks are:
- Hard to debug when things go wrong
- A distributed system, so more difficult failure modes
- Requires a new area of expertise at your company or business
So with this in mind, how do we decide whether using Spark is correct? I like to ask myself a few questions; if I answer several of them with a yes, then Spark is definitely going to be a part of my solution.
- Is the data I'm working with already distributed? (Stored in DSEFS, HDFS, S3, Kafka …)
- Do I have to move it to a variety of other formats or databases? (CSV, Oracle, MySQL, Parquet)
- Am I going to have multiple applications which require this functionality?
- Do I need both batch and streaming capabilities?
- Is my data very large?
- Do I need a JDBC endpoint for data which does not natively have a JDBC endpoint?
- Do I already have Spark as a part of my deployment?
If you can't answer yes to several of these questions, chances are there is a simpler solution to your problem, one that will be much cheaper for your organization to design and maintain.
Suppose you have several users who need to use Tableau to generate reports on an ad hoc basis. The contents of these reports are not known in advance and will most likely not align well with your Cassandra data model. You also have reports that another application stores as Parquet files on S3, and this data is also required to build your new reports.
In this case, we can answer yes to several of the above questions. We need a JDBC endpoint to support Tableau; for this, the AlwaysOn SQL Service (AOSS) included in DSE is a great option, although the non-HA OSS Spark Thrift Server on which it is built would be a reasonable substitute. In addition, we can access both Parquet and Cassandra data at the same time through the AOSS with a minimum of fuss. What's more, our Parquet data is already distributed in S3, so regardless of the approach, we'll be doing a fair amount of data transfer over the wire. With these yes answers, we can see we have a pretty good fit.
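As a sketch of what this cross-datasource access can look like (the bucket path, keyspace, table names, and columns here are all hypothetical examples, not a prescribed schema), a query sent to the AOSS JDBC endpoint can expose the S3 Parquet data as a view and join it against a Cassandra table in a single statement:

```sql
-- Register the S3 Parquet data as a temporary view
-- (the path is a made-up example).
CREATE OR REPLACE TEMPORARY VIEW report_archive
USING parquet
OPTIONS (path "s3a://reports-bucket/archive/");

-- Join it against a Cassandra table visible in the SQL catalog
-- (hypothetical keyspace "sales", table "orders").
SELECT o.customer_id,
       o.order_total,
       r.report_quarter
FROM   sales.orders o
JOIN   report_archive r
  ON   o.customer_id = r.customer_id;
```

Tableau would typically generate queries like this for you behind the JDBC connection; the point is that both sources look like ordinary tables to the SQL layer.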
A counterexample would be something like the following.
Imagine that you are booting up your cluster for the first time and want to load some data from CSV files located on the hard drive of a single machine. This is a one-time operation and will probably never need to be done again.
Here we probably should not use Spark. Setting up the platform and becoming comfortable with the framework for a single operation is overkill, and the time between uses would probably mean relearning Spark each time a load is needed. Instead, we would likely benefit from the DSE Bulk Loader tool. It is built specifically for use cases like this: our data is not distributed, and we have a one-time load to perform. Instead of having to learn an entire distributed framework, we can run a single local Java application.
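For a sense of scale, a one-time load with the DSE Bulk Loader is a single command rather than a Spark job. This is a rough sketch; the file path, keyspace, and table names are made up for illustration:

```shell
# Load a local CSV into a Cassandra table with DSBulk.
# Path, keyspace, and table names are hypothetical examples.
dsbulk load \
  -url /data/initial_load.csv \
  -k my_keyspace \
  -t my_table \
  -header true
```

Compare that to standing up a Spark cluster, writing and packaging a job, and learning to operate the framework, all for a load you run once.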
Of course, this isn't the full scope of possible Spark and Cassandra use cases, so if you want to learn more, be sure to visit me at DataStax Accelerate, where we'll talk about these use cases and more! You'll learn when it's appropriate To Spark or Not To Spark. And if you do decide to bring Spark into your stack, DataStax provides a simple and easy-to-use connector that links it directly to Cassandra.