Jaroslaw Grabowski

<p>This blog post was written for DataStax Enterprise 5.1.0. Refer to the&nbsp;<a href="https://docs.datastax.com/" rel="noopener" target="_blank">DataStax documentation</a>&nbsp;for your specific version of DSE.</p>

<p>Compiling and executing Apache Spark™ applications with custom dependencies can be a challenging task. Spark beginners can feel overwhelmed by the number of different solutions to this problem. Diversity of library versions, the number of different build tools and finally the build techniques, such as assembling fat JARs and dependency shading, can cause a headache.</p>

<p>In this blog post, we shed light on how to manage compile-time and runtime dependencies of a Spark Application that is compiled and executed against DataStax Enterprise (DSE) or open source Apache Spark (OSS).</p>

<p>Along the way, we use a set of predefined bootstrap projects that you can adopt as a starting point for developing a new Spark Application. These examples all focus on connecting to a DataStax Enterprise or Apache Cassandra® system, and on reading from and writing to it.</p>

<h3>Quick Glossary:</h3>

<p>Spark Driver: A user application that contains a&nbsp;<a href="https://spark.apache.org/docs/latest/cluster-overview.html" rel="noopener" target="_blank">Spark Context</a>.<br />
Spark Context: A Scala class that functions as the control mechanism for distributed work.<br />
Spark Executor: A remote Java Virtual Machine (JVM) that performs work as orchestrated by the Spark Driver.<br />
Runtime classpath: A list of all dependencies available during execution (in an execution environment such as an Apache Spark cluster). It's important to note that the runtime classpath of the Spark Driver is not necessarily identical to the runtime classpath of the Spark Executor.<br />
Compile classpath: A full list of all dependencies available during compilation (specified with build tool syntax in a build file).</p>

<h1>Choose language and build tool</h1>

<p>First, clone the DataStax repository&nbsp;<a href="https://github.com/datastax/SparkBuildExamples" rel="noopener" target="_blank">https://github.com/datastax/SparkBuildExamples</a>, which provides the code that you are going to work with. The cloned directory contains Spark Application bootstrap projects for Java and Scala, and for the most frequently used build tools:</p>

<ul>
	<li>Scala Build Tool (sbt)</li>
	<li>Apache Maven™</li>
	<li>Gradle</li>
</ul>

<p>In the context of managing dependencies for the Spark Application, these build tools are equivalent. It is up to you to select the language and build tool that best fits you and your team.</p>

<p>For each build tool, the way the application is built is defined with declarative syntax embedded in files in the application’s directory:</p>

<ul>
	<li>Sbt:&nbsp;build.sbt</li>
	<li>Apache Maven:&nbsp;pom.xml</li>
	<li>Gradle:&nbsp;build.gradle</li>
</ul>

<p>From now on, we refer to those files as build files.</p>

<h1>Choose execution environment</h1>

<p>Two different execution environments are supported in the repository: DSE and OSS.</p>

<h2>DSE</h2>

<p>If you are planning to execute your Spark Application on a DSE cluster, use the&nbsp;dse&nbsp;bootstrap project which greatly simplifies dependency management.</p>

<p>It leverages the&nbsp;dse-spark-dependencies&nbsp;library which instructs a build tool to include all dependency JAR files that are distributed with DSE and are available in the DSE cluster runtime classpath. These JAR files include Apache Spark JARs and their dependencies, Apache Cassandra JARs, Spark Cassandra Connector JAR, and many others. Everything that is needed to build your bootstrap Spark Application is supplied by the&nbsp;dse-spark-dependencies&nbsp;dependency. To view the list of all&nbsp;dse-spark-dependencies&nbsp;dependencies, visit our&nbsp;<a href="https://repo.datastax.com/public-repos/com/datastax/dse/dse-spark-dependencies/" rel="noopener" target="_blank">public repo</a>&nbsp;and inspect the pom files that are relevant to your DSE cluster version.</p>

<p>An example of a DSE&nbsp;build.sbt:</p>

<pre>
libraryDependencies += "com.datastax.dse" % "dse-spark-dependencies" % "5.1.1" % "provided"</pre>

<p>Using this managed dependency automatically matches your compile-time dependencies with the DSE dependencies on the runtime classpath. As long as the declared version matches your cluster, this avoids dependency version conflicts and unresolved dependencies in the execution environment.</p>
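
<p>For context, a fuller DSE build file might look like the following minimal sketch. The project name, version, and Scala version are inferred from the JAR path shown in the “Build” section, and the resolver name is an assumption for illustration; the repository URL is the root of the public repo linked above. The actual bootstrap build file may differ.</p>

<pre>
// build.sbt (sketch; name, version, scalaVersion, and resolver name are illustrative assumptions)
name := "writeRead"
version := "0.1"
scalaVersion := "2.11.8"

// Assumed resolver: root of the DataStax public repository
resolvers += "DataStax Repo" at "https://repo.datastax.com/public-repos/"

// "provided": on the compile classpath, but supplied by the DSE cluster at runtime
libraryDependencies += "com.datastax.dse" % "dse-spark-dependencies" % "5.1.1" % "provided"</pre>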

<p><b>Note: The DSE version must match the one in your cluster. See the “Execution environment versions” section for details.</b></p>

<p><b>DSE project templates are built with&nbsp;sbt&nbsp;0.13.13 or later. If you see unresolved dependency errors, update&nbsp;sbt&nbsp;and then clear the&nbsp;Ivy&nbsp;cache (with the&nbsp;rm -r ~/.ivy2/cache/com.datastax.dse/dse-spark-dependencies/&nbsp;command).</b></p>

<h2>OSS</h2>

<p>If you are planning to execute your Spark Application on an open source Apache Spark cluster, use the&nbsp;oss&nbsp;bootstrap project. For the&nbsp;oss&nbsp;bootstrap project, all compilation classpath dependencies must be manually specified in build files.</p>

<p>An example of an OSS&nbsp;build.sbt:</p>

<pre>
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "provided",
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
"org.apache.spark" %% "spark-hive" % sparkVersion % "provided",
"com.datastax.spark" %% "spark-cassandra-connector" % connectorVersion % "provided"
)</pre>

<p>For OSS, you must specify these four dependencies for the compilation classpath.</p>

<p>During execution, the Spark runtime classpath already contains the org.apache.spark.* dependencies.&nbsp;<a href="https://github.com/datastax/spark-cassandra-connector" rel="noopener" target="_blank">The DataStax spark-cassandra-connector</a>&nbsp;doesn’t exist in the Spark cluster by default, so all we need to do is add it as an extra dependency. The most common way to include this additional dependency is to use the&nbsp;--packages&nbsp;argument of the&nbsp;spark-submit&nbsp;command. An example of&nbsp;--packages&nbsp;usage is shown in the “Execute” section below.</p>

<p>The Apache Spark version in the build file must match the Spark version in your Spark cluster. See the next section for details.</p>

<h1>Execution environment versions</h1>

<p>It is possible that your DSE or OSS cluster version is different from the one specified in the bootstrap project.</p>

<h2>DSE</h2>

<p>If you are a DSE user, check out the SparkBuildExamples version that matches your DSE cluster version, for example:</p>

<pre>
git checkout &lt;DSE_version&gt;
# example: git checkout 5.0.6</pre>

<p>If you are a DSE 4.8.x user, check out 4.8.13 or a newer 4.8.x version.</p>

<h2>OSS</h2>

<p>If you are planning to execute your application against a Spark cluster whose version differs from the one specified in a bootstrap project build file, adjust all dependency versions listed there. Fortunately, the main component versions are variables. See the examples below and adjust the values according to your needs.</p>

<h3>Sbt</h3>

<pre>
val sparkVersion = "2.0.2"
val connectorVersion = "2.0.0"</pre>

<p>&nbsp;</p>

<h3>Maven</h3>

<pre>
&lt;properties&gt;
  &lt;spark.version&gt;2.0.2&lt;/spark.version&gt;
  &lt;connector.version&gt;2.0.0&lt;/connector.version&gt;
&lt;/properties&gt;</pre>

<p>&nbsp;</p>

<h3>Gradle</h3>

<pre>
def sparkVersion = "2.0.2"
def connectorVersion = "2.0.0"</pre>

<p>Let’s say that your Spark cluster runs version 1.5.1. The&nbsp;<a href="https://github.com/datastax/spark-cassandra-connector#version-compatibility" rel="noopener" target="_blank">version compatibility table</a>&nbsp;lists the compatible Apache Cassandra and Spark Cassandra Connector versions. In this example, the Apache Spark 1.5.1 cluster is compatible with Spark Cassandra Connector 1.5.x, and the newest one is 1.5.2 (the newest versions can be found on the&nbsp;<a href="https://github.com/datastax/spark-cassandra-connector/releases" rel="noopener" target="_blank">Releases page</a>). Adjust the variables accordingly and you are good to go!</p>
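
<p>Applied to the sbt build file, the adjustment for this example could look like the sketch below. The version numbers are the ones named in the example; verify them against the compatibility table for your own cluster.</p>

<pre>
// build.sbt (sketch): targeting an Apache Spark 1.5.1 cluster
val sparkVersion = "1.5.1"
val connectorVersion = "1.5.2"</pre>

<p>Keep in mind that the connector version passed to&nbsp;--packages&nbsp;at submission time should change in the same way. Spark 1.x distributions are typically built against Scala 2.10, so the project's Scala version and the connector artifact suffix may need to change as well.</p>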

<h1>Build</h1>

<p>The build command differs for each build tool. The bootstrap projects can be built with the following commands.</p>

<h2>Sbt</h2>

<pre>
sbt clean assembly
# produces jar in path: target/scala-2.11/writeRead-assembly-0.1.jar</pre>
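
<p>The&nbsp;assembly&nbsp;task used above is provided by the sbt-assembly plugin, so the project needs a plugin declaration. A minimal sketch, assuming the conventional&nbsp;project/plugins.sbt&nbsp;location; the plugin version shown is an assumption, and the bootstrap project may pin a different one:</p>

<pre>
// project/plugins.sbt (sketch; the version is an illustrative assumption)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")</pre>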

<h2>Maven</h2>

<pre>
mvn clean package
# produces jar in path: target/writeRead-0.1.jar</pre>

<h2>Gradle</h2>

<pre>
gradle clean shadowJar
# produces jar in path: build/libs/writeRead-0.1.jar</pre>

<h1>Execute</h1>

<p>The&nbsp;spark-submit&nbsp;command differs between environments. In a DSE environment, the command is simplified to autodetect parameters like&nbsp;--master. In addition, various other Apache Cassandra and DSE specific parameters are added to the default SparkConf. Use the following commands to execute the JAR that you built. Refer to the Spark&nbsp;<a href="http://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit" rel="noopener" target="_blank">docs</a>&nbsp;for details about the&nbsp;spark-submit&nbsp;command.</p>

<h2>DSE</h2>

<pre>
dse spark-submit --class com.datastax.spark.example.WriteRead &lt;path_to_produced_jar&gt;</pre>

<h2>OSS</h2>

<pre>
spark-submit --conf spark.cassandra.connection.host=&lt;cassandra_host&gt; --class com.datastax.spark.example.WriteRead --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0 --master &lt;master_url&gt; &lt;path_to_produced_jar&gt;</pre>

<p>Note the usage of&nbsp;--packages&nbsp;to include the spark-cassandra-connector on the runtime classpath for all application JVMs.</p>

<h1>Provide additional dependencies</h1>

<p>Now that you have successfully built and executed this simple application, it’s time to see how extra dependencies can be added to your Spark Application.</p>

<p>Let’s say your application grows with time and you need to incorporate an external dependency to add functionality to it. For the sake of this example, let the new dependency be&nbsp;commons-math3.</p>

<p>To supply this dependency to the compilation classpath, we must provide proper configuration entries in build files.</p>

<p>There are two ways to provide additional dependencies to the runtime classpath: assembling, or manually providing all dependencies with the&nbsp;spark-submit&nbsp;command.</p>

<h2>Assembly</h2>

<p>Assembling is a way of directly including dependency classes in the resulting JAR file (sometimes called a fat JAR or uber JAR) as if these dependency classes were developed along with your application. When the user code is shipped to Apache Spark Executors, these dependency classes are included in the application JAR on the runtime classpath. To see an example, uncomment the following sections in any of your build files.</p>

<h3>Sbt</h3>

<pre>
libraryDependencies += "org.apache.commons" % "commons-math3" % "3.6.1"</pre>

<p>&nbsp;</p>

<h3>Maven</h3>

<pre>
&lt;dependency&gt;
  &lt;groupId&gt;org.apache.commons&lt;/groupId&gt;
  &lt;artifactId&gt;commons-math3&lt;/artifactId&gt;
  &lt;version&gt;3.6.1&lt;/version&gt;
&lt;/dependency&gt;</pre>

<p>&nbsp;</p>

<h3>Gradle</h3>

<pre>
assembly "org.apache.commons:commons-math3:3.6.1"</pre>

<p>Now you can use&nbsp;commons-math3&nbsp;classes in your application code. When your development is finished, you can create a JAR file using the build command and submit it without any modifications to the&nbsp;spark-submit&nbsp;command. If you are curious to see where the additional dependency is, use any archive application to open the produced JAR to see that&nbsp;commons-math3&nbsp;classes are included.</p>

<p>When assembling, you might run into conflicts where multiple JARs attempt to include a file with the same filename but different contents. There are several solutions to this problem. The most common are removing one of the conflicting dependencies and shading (which is described later in this blog post). If all else fails, most plugins offer a variety of other merge strategies for handling these situations. For example, see the sbt-assembly&nbsp;<a href="https://github.com/sbt/sbt-assembly#merge-strategy" rel="noopener" target="_blank">merge strategy</a>&nbsp;documentation.</p>
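
<p>As an illustration of a custom merge strategy, the following sketch uses the sbt-assembly 0.14.x setting syntax. The matched paths are hypothetical stand-ins for whatever files actually conflict in your build, and the chosen strategies are assumptions rather than recommendations from the bootstrap projects.</p>

<pre>
// build.sbt (sketch, assuming sbt-assembly 0.14.x; paths are illustrative)
assemblyMergeStrategy in assembly := {
  // Keep the first copy of any conflicting HTML file
  case PathList(ps @ _*) if ps.last.endsWith(".html") =&gt; MergeStrategy.first
  // Concatenate configuration defaults contributed by several libraries
  case "reference.conf" =&gt; MergeStrategy.concat
  // Fall back to the plugin's default strategy for everything else
  case other =&gt;
    val defaultStrategy = (assemblyMergeStrategy in assembly).value
    defaultStrategy(other)
}</pre>

<p>Delegating to the default strategy in the last case keeps the plugin's built-in handling for every path that you do not explicitly override.</p>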

<h2>Manually adding JARs to the runtime classpath</h2>

<p>If you don’t want to assemble a fat JAR (maybe the additional dependencies produced a 100MB JAR file and you consider this size unusable), use an alternate way to provide additional dependencies to the runtime classpath.</p>

<p>Mark some of the dependencies with the&nbsp;provided&nbsp;keyword to exclude them from the assembly JAR.</p>

<h3>Sbt</h3>

<pre>
libraryDependencies += "org.apache.commons" % "commons-math3" % "3.6.1" % "provided"</pre>

<p>&nbsp;</p>

<h3>Maven</h3>

<pre>
&lt;dependency&gt;
  &lt;groupId&gt;org.apache.commons&lt;/groupId&gt;
  &lt;artifactId&gt;commons-math3&lt;/artifactId&gt;
  &lt;version&gt;3.6.1&lt;/version&gt;
  &lt;scope&gt;provided&lt;/scope&gt;
&lt;/dependency&gt;</pre>

<p>&nbsp;</p>

<h3>Gradle</h3>

<pre>
provided "org.apache.commons:commons-math3:3.6.1"</pre>

<p>After building a JAR, manually specify the additional dependencies with the&nbsp;spark-submit&nbsp;command during application submission. Add a&nbsp;--packages&nbsp;argument to the&nbsp;spark-submit&nbsp;command, or extend the existing one. Note that multiple dependencies are separated by commas. For example:</p>

<pre>
--packages org.apache.commons:commons-math3:3.6.1,com.datastax.spark:spark-cassandra-connector_2.11:2.0.0</pre>

<h1>User dependencies conflicting with Spark dependencies</h1>

<p>What if you want to use a different version of a dependency than the version that is present in the execution environment?</p>

<p>For example, a Spark cluster already has&nbsp;commons-csv&nbsp;in its runtime classpath and the developer needs a different version in their application. Maybe the version bundled with Spark is old and doesn’t contain all the needed functionality. Maybe the newer version is not backward compatible and would break Spark Application execution if it replaced the bundled one.</p>

<p>This is a common problem and there is a solution: shading.</p>

<h2>Shading</h2>

<p>Shading is a build technique where dependency classes are packaged with the application JAR file (as in assembling), but the package structure of these classes is additionally altered. This process happens at build time and is transparent to the developer. Shading substitutes all references to the dependency in a Spark Application with functionally identical classes located in different packages. For example, the class&nbsp;org.apache.commons.csv.CSVParser&nbsp;for the Spark Application becomes&nbsp;shaded.org.apache.commons.csv.CSVParser.</p>

<p>To see shading in action, uncomment the following sections in the build file of your choice. This embeds the old&nbsp;commons-csv&nbsp;in the resulting JAR, but with the package prefix “shaded” prepended.</p>

<h3>Sbt</h3>

<pre>
libraryDependencies += "org.apache.commons" % "commons-csv" % "1.0"</pre>

<p>and</p>

<pre>
assemblyShadeRules in assembly := Seq( 
 ShadeRule.rename("org.apache.commons.csv.**" -&gt; "shaded.org.apache.commons.csv.@1").inAll 
)</pre>

<h3>Maven</h3>

<pre>
&lt;dependency&gt;
  &lt;groupId&gt;org.apache.commons&lt;/groupId&gt;
  &lt;artifactId&gt;commons-csv&lt;/artifactId&gt;
  &lt;version&gt;1.0&lt;/version&gt;
&lt;/dependency&gt;</pre>

<p>and</p>

<pre>
&lt;relocations&gt;
  &lt;relocation&gt;
    &lt;pattern&gt;org.apache.commons.csv&lt;/pattern&gt;
    &lt;shadedPattern&gt;shaded.org.apache.commons.csv&lt;/shadedPattern&gt;
  &lt;/relocation&gt;
&lt;/relocations&gt;</pre>

<p>&nbsp;</p>

<h3>Gradle</h3>

<pre>
assembly "org.apache.commons:commons-csv:1.0"</pre>

<p>and</p>

<pre>
shadowJar {
  relocate 'org.apache.commons.csv', 'shaded.org.apache.commons.csv'
}</pre>

<p>After building the JAR, you can look into its contents and see that commons-csv is embedded in the&nbsp;shaded&nbsp;directory.</p>

<h1>Summary</h1>

<p>In this article, you learned how to manage compile-time and runtime dependencies of a simple Apache Spark application that connects to an Apache Cassandra database by using the Spark Cassandra Connector. You learned how Scala and Java projects are structured with the sbt, Gradle, and Maven build tools. You also learned different ways to provide additional dependencies and how to resolve dependency conflicts with shading.</p>
