Classloading in DSE Analytics
DataStax Enterprise Analytic nodes are powered by Apache Hadoop running on top of Apache Cassandra. Both Hadoop and Cassandra come with their own sets of library dependencies. Additionally, Hadoop runs user code, which often needs to load its own set of dependencies. As long as each module uses a distinct set of dependencies, there are no problems. However, when two modules running within the same JVM need different versions of the same library, a library conflict arises. In this blog post I briefly describe how DSE internally manages library dependencies and our current approach to resolving library conflicts. I'll also reveal a few details of the internal DSE classloader architecture.
The Global Classpath Problem
Early versions of DSE, before version 3.0, put all module dependencies on a common classpath, which the dse-in.sh script saved into the CLASSPATH variable. The script was executed whenever a dse command was invoked, e.g. dse hadoop. The DSE and Cassandra dependencies came first on the classpath, then Hadoop, then Hive, then Pig, and so on. Therefore Cassandra dependencies shadowed whatever came with Hadoop. To avoid the (fortunately rare) version conflicts, we overrode versions in one module or another, usually forcing both to use the newer version of the library. For example, Hadoop worked fine with newer versions of commons-lang and commons-codec. So far we were lucky.
Unfortunately, Hadoop needs to load user M/R jobs, each supplied as a jar containing the main program, plus additional dependent jars. Of course, we have no control over which external jars DSE users might want to load. As long as you were not using any of the jars included in the DSE distribution, you were fine. Everything also went smoothly if your job required a library included in DSE and agreed on the version. However, if you needed a newer version of, let's say, Guava, then simply specifying your own Guava jar as a job dependency did not work, because your jar would be shadowed by the DSE one, which came earlier on the classpath. Strange linkage errors would occur. There was a workaround, though: you could replace the jar in the dse (or cassandra) lib directory with your newer version and hope that DSE still worked with it. Most of the time it did, but it certainly wasn't a perfect, general solution.
Additionally, when working on integrating Cassandra 1.2 into DSE 3.1, we hit an incompatibility between Hive's ANTLR runtime and Cassandra's ANTLR runtime. This time the version difference was too large for both to share the same version. Either Cassandra worked or Hive did, but not both. Something had to be done.
Classloaders to the Rescue
Java classloaders are a common way of resolving library version conflicts in multitenant applications. Many application servers use this approach to separate classes loaded from different web contexts. It works because two classes with exactly the same qualified name are treated as separate classes if, and only if, they are loaded by separate classloaders. This way two different applications can use two versions of the same class in a shared JVM without causing any trouble. This story could probably end here by saying "so we used one classloader to load the Cassandra jars and another to load the Hadoop jars, and the problem was solved", but it is not that simple. You've probably heard scary stories about classloader-related bugs, like the "cannot cast class X to class X" error, or the "class not found" problems even though the class is right where it should be. I remember endless hours of debugging an application container to find out what I was doing wrong. Classloaders are one of those ideas that are great and simple on paper but can get pretty hairy in the implementation. Let's take a quick look at how Java does classloading to better understand where those problems might stem from.
Simply put, a classloader is an object responsible for loading a Java class given its full name. Every Java application has the system classloader, which can be set by the java.system.class.loader property. This classloader is the default one, and it receives the requests for loading the main class and its dependencies. Of course, it is perfectly fine to instantiate other classloaders manually and use them to load some subset of classes. If one class needs to load another, dependent class, the JVM uses the same classloader that loaded the requesting class. Classloaders usually form a hierarchy, where a classloader at a lower level can delegate a classloading request to its parent classloader. The recommended, standard order of loading classes is parent-first, which means the child classloader first tries to load a class with its parent classloader, and only if this fails does it try to load the class itself (this is the default behavior of the ClassLoader class). Commonly, child classloaders have separate classpaths and are dedicated to loading the private classes of modules, while the system classloader is used to load classes common to all the modules.
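The parent-first order described above can be observed with a small tracing loader. This is only an illustrative sketch: TracingLoader, its labels and the trace helper are made-up names, and the loaders have no classpath of their own, so they can only show who gets asked and in what order.

```java
import java.util.ArrayList;
import java.util.List;

class ParentFirstDemo {
    /** Hypothetical loader that records each step; it has no classpath of its own. */
    static class TracingLoader extends ClassLoader {
        private final String label;
        private final List<String> trace;

        TracingLoader(String label, ClassLoader parent, List<String> trace) {
            super(parent);
            this.label = label;
            this.trace = trace;
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            trace.add(label + ": asked");
            // The default implementation asks the parent first and
            // calls findClass only if the parent could not load the class.
            return super.loadClass(name, resolve);
        }

        @Override
        protected Class<?> findClass(String name) throws ClassNotFoundException {
            trace.add(label + ": tries itself");
            throw new ClassNotFoundException(name); // nothing to offer in this sketch
        }
    }

    /** Loads className through a child -> parent -> system chain and returns the recorded steps. */
    static List<String> trace(String className) {
        List<String> trace = new ArrayList<>();
        ClassLoader parent = new TracingLoader("parent", ClassLoader.getSystemClassLoader(), trace);
        ClassLoader child = new TracingLoader("child", parent, trace);
        try {
            child.loadClass(className);
            trace.add("loaded");
        } catch (ClassNotFoundException e) {
            trace.add("not found");
        }
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(trace("java.lang.String"));
        System.out.println(trace("no.such.Clazz"));
    }
}
```

For a class the parents can supply, neither tracing loader ever reaches its own findClass. For a missing class, each loader gets its turn only after everything above it has failed, starting with the one closest to the root.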
The parent-first classloading order described above is sufficient to implement isolated modules, but in DSE the problem is harder. Cassandra and Hadoop are not isolated. They communicate with each other. Cassandra must be able to load and reference Hadoop classes, and Hadoop M/R jobs must be able to use Cassandra classes. If we loaded all of Cassandra with one child classloader and all of Hadoop with another child classloader, Cassandra would not find Hadoop classes and vice versa. Moreover, some objects are passed between Cassandra and Hadoop, and their classes must be loaded by a single classloader to avoid ClassCastExceptions.
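To see why objects crossing a module boundary need a single defining classloader, here is a minimal sketch in which two loaders each define the same class on their own. IsolatingLoader and Payload are hypothetical names, and the sketch assumes the compiled class file is readable as a resource from the parent classloader (true for classes compiled to disk) and a Java 9+ runtime for readAllBytes.

```java
import java.io.IOException;
import java.io.InputStream;

class IsolationDemo {
    /** Stand-in for a class whose instances travel between two modules. */
    public static class Payload {}

    /** Hypothetical loader that defines Payload itself instead of asking its parent. */
    static class IsolatingLoader extends ClassLoader {
        private static final String TARGET = Payload.class.getName();

        IsolatingLoader() {
            super(IsolationDemo.class.getClassLoader());
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (!name.equals(TARGET)) {
                return super.loadClass(name, resolve); // everything else stays parent-first
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    byte[] bytes = readBytes(name);
                    c = defineClass(name, bytes, 0, bytes.length);
                }
                return c;
            }
        }

        /** Reads the class file bytes through the parent's resource lookup. */
        private byte[] readBytes(String name) throws ClassNotFoundException {
            String path = name.replace('.', '/') + ".class";
            try (InputStream in = getParent().getResourceAsStream(path)) {
                if (in == null) throw new ClassNotFoundException(name);
                return in.readAllBytes(); // Java 9+
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    /** Loads Payload through a fresh isolating loader. */
    static Class<?> loadIsolated() {
        try {
            return new IsolatingLoader().loadClass(Payload.class.getName());
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        Class<?> a = loadIsolated();
        Class<?> b = loadIsolated();
        Object fromA = a.getDeclaredConstructor().newInstance();
        System.out.println("same name:       " + a.getName().equals(b.getName()));
        System.out.println("same class:      " + (a == b));
        System.out.println("b accepts fromA: " + b.isInstance(fromA));
    }
}
```

Both Class objects report the same qualified name, yet the JVM treats them as unrelated types, so an instance created from one cannot be cast to the other. Any class that is shared between modules therefore has to come from exactly one loader.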
In DSE 3.0 we introduced custom classloaders employing a non-standard classloading delegation scheme. The system (root) classloader can delegate a classloading request down the tree to one of the child classloaders. For example, the Cassandra classloader can indirectly use the Hadoop classloader, and the Hadoop classloader can indirectly use the Cassandra classloader. The system classloader has a list of rules that determine which classloader should be invoked, depending on the class name pattern and the classloader that first received the request. The most important rules are:
- load all the java and logging classes with the system classloader;
- load all "org.apache.cassandra.**" classes with the Cassandra classloader;
- load all "org.apache.hadoop.**" and "org.apache.pig.**" classes with the Hadoop classloader;
- load all the other classes with the same classloader that initially received the request.
Those rules ensure that the Hadoop classloader loads not only the Hadoop classes, but also all of their non-Cassandra dependencies. Therefore Cassandra and Hadoop can each use a different version of the ANTLR runtime without any conflict. Likewise, you can use whatever jars you wish with your M/R jobs: Cassandra dependencies will never get in the way of your jars, nor will your jars get in the way of Cassandra.
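The routing rules listed above can be pictured as a small pure function. This is a simplified sketch, not the actual DSE implementation: the Loader enum, the route method and the logging package prefixes are assumptions made for illustration, since the list above only says "logging classes".

```java
class DelegationRules {
    enum Loader { SYSTEM, CASSANDRA, HADOOP }

    // Assumed logging packages; the rule list does not name them.
    private static final String[] LOGGING_PREFIXES = { "org.slf4j.", "org.apache.log4j." };

    /** Picks the loader for className, given the loader that first received the request. */
    static Loader route(String className, Loader initiating) {
        if (className.startsWith("java.")) {
            return Loader.SYSTEM;
        }
        for (String prefix : LOGGING_PREFIXES) {
            if (className.startsWith(prefix)) {
                return Loader.SYSTEM;
            }
        }
        if (className.startsWith("org.apache.cassandra.")) {
            return Loader.CASSANDRA;
        }
        if (className.startsWith("org.apache.hadoop.") || className.startsWith("org.apache.pig.")) {
            return Loader.HADOOP;
        }
        // Everything else stays with whoever asked first,
        // so each module resolves its own copy of third-party libraries.
        return initiating;
    }

    public static void main(String[] args) {
        System.out.println(route("org.apache.cassandra.db.Keyspace", Loader.HADOOP));
        System.out.println(route("org.antlr.runtime.Lexer", Loader.HADOOP));
        System.out.println(route("org.antlr.runtime.Lexer", Loader.CASSANDRA));
    }
}
```

The last rule is what keeps the two ANTLR runtimes apart: the same class name resolves to the Hadoop loader or the Cassandra loader depending on who asked.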
Java 6 ClassLoader Locking
Some time after releasing DSE with the new classloading subsystem enabled for Hadoop clients, including Pig, Hive, Mahout and Sqoop, we received a few notifications about rare deadlocks happening inside of the system classloader or one of the child classloaders. Fortunately those problems have been solved now and the recent versions of DSE 3.0, 3.1 and 3.2 do not suffer from them.
In Java 6, the ClassLoader#loadClass method is synchronized. To our surprise, overriding this method with a non-synchronized one does not remove the synchronization. The JVM locks the whole ClassLoader object right before invoking loadClass. Not only does the JVM take this lock without being asked to, but when the method is not explicitly synchronized in the code, the JVM does not report it in the stack trace:
"pool-1-thread-2":
    at java.lang.ClassLoader.loadClass(ClassLoader.java:406)
    - waiting to lock (a com.datastax.bdp.loader.DseClientClassLoader)
    at com.datastax.bdp.loader.SystemClassLoader.loadClassDirectly(SystemClassLoader.java:139)
    at com.datastax.bdp.loader.SystemClassLoader.tryLoadClass(SystemClassLoader.java:117)
    at com.datastax.bdp.loader.SystemClassLoader.loadClass(SystemClassLoader.java:84)
    at com.datastax.bdp.loader.ModuleClassLoader.loadClass(ModuleClassLoader.java:48)
    at com.datastax.bdp.loader.ModuleClassLoader.loadClass(ModuleClassLoader.java:34)
    at org.apache.derby.iapi.services.loader.ClassInspector.accessible(Unknown Source)
    at org.apache.derby.impl.sql.compile.QueryTreeNode.verifyClassExist(Unknown Source)
    ...
    at org.apache.derby.impl.jdbc.EmbedConnection.prepareStatement(Unknown Source)
    - locked (a org.apache.derby.impl.jdbc.EmbedConnection40)
    at org.apache.derby.impl.jdbc.EmbedConnection.prepareStatement(Unknown Source)
    ...
Because of this hidden loadClass locking, a deadlock can occur when the Cassandra classloader receives a request and delegates it to the Hadoop classloader while, at the same time, the Hadoop classloader receives a request and delegates it to the Cassandra classloader. We solved this problem by running delegated requests on a separate thread pool. Instead of calling the delegated classloader's loadClass directly, the system classloader schedules the classloading request for asynchronous execution and invokes wait() on the classloader that initially received the request. Calling wait() releases the classloader lock. Therefore, when loading a single class, only one classloader is kept locked at a time and a deadlock is not possible. Once the target classloader finishes loading, it passes the resulting class object to the waiting classloader and notifies it.
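The hand-off described above can be modeled without real classloaders. The following is a toy sketch, not the DSE code: Module stands in for a classloader whose monitor is held during loadClass, "loading" just builds a string, and all names (Module, Request, crossLoad) are invented for the example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class HandOffDemo {
    // Daemon worker threads, so the pool never keeps the JVM alive.
    static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    static class Request {
        String result;
        boolean done;
    }

    /** Stand-in for a module classloader that is locked while it loads. */
    static class Module {
        final String name;
        Module peer;

        Module(String name) { this.name = name; }

        /** synchronized models the JVM locking the loader around loadClass. */
        synchronized String load(String cls) {
            if (!cls.startsWith(peer.name + ".")) {
                return cls + "@" + name; // "load" it locally
            }
            return delegate(peer, cls);
        }

        /** Runs the peer's load on the pool and waits, releasing this monitor meanwhile. */
        private String delegate(Module target, String cls) {
            Request req = new Request();
            POOL.execute(() -> {
                String loaded = null;
                try {
                    loaded = target.load(cls); // the worker holds only the target's lock
                } finally {
                    synchronized (this) {      // the target's lock is already released here
                        req.result = loaded;
                        req.done = true;
                        notifyAll();
                    }
                }
            });
            try {
                while (!req.done) {
                    wait(); // releases this module's monitor until notified
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for " + cls, e);
            }
            return req.result;
        }
    }

    /** Two threads request classes owned by the opposite module at the same time. */
    static String[] crossLoad() {
        Module cassandra = new Module("cassandra");
        Module hadoop = new Module("hadoop");
        cassandra.peer = hadoop;
        hadoop.peer = cassandra;
        String[] out = new String[2];
        Thread a = new Thread(() -> out[0] = cassandra.load("hadoop.Mapper"));
        Thread b = new Thread(() -> out[1] = hadoop.load("cassandra.Table"));
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return out;
    }

    public static void main(String[] args) {
        String[] loaded = crossLoad();
        System.out.println(loaded[0]);
        System.out.println(loaded[1]);
    }
}
```

If delegate called target.load(cls) directly, each thread would hold its own module's monitor while waiting for the other's, which is the deadlock described above. With the hand-off, no thread ever holds two module locks at once.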
Summary and Limitations
The new classloading mechanism introduced in DSE 3.0 for Hadoop clients makes it possible to hide DSE and Cassandra dependencies from your M/R jobs. However, you still have to be careful with Hadoop dependencies. Keep in mind that your jobs share the classpath with Hadoop and that by default they are loaded before Hadoop. It is possible to break Hadoop by supplying a conflicting dependency that shadows a correct Hadoop class. If you want your classes to be loaded after the Hadoop classes, set the mapreduce.user.classpath.first Hadoop property to false.