Tuning DSE Hadoop MapReduce
date: October 7, 2013
Okay, so you've worked through the DSE Analytics demos and are ready to run your own Map/Reduce jobs with Apache Hadoop™. But how do you make sure your configuration is tuned for the best performance? Which of the myriad of Hadoop configuration settings should you consider changing and to what values should they be set?
In this post we'll take a look at some of the configuration settings that can help you get better performance with your Map/Reduce jobs and see how DSE even helps do some of that work for you!
Note: This blog post was written targeting DSE 3.1. Please refer to the DataStax documentation for your specific version of DSE if different.
DSE Analytics Introduction
One of the many cool features of DataStax Enterprise is the analytics capabilities provided with the Apache Hadoop™ integration. You get the analytic functionality of the Apache Hadoop ecosystem (MapReduce, Hive, Pig, Mahout) together with the Cassandra File System (CFS) and all the benefits of Apache Cassandra's scalability, reliability and simplified operational management.
Check your Cassandra configuration
Before turning our attention to Hadoop parameters, it's important to make sure we're starting with a well-configured Cassandra cluster. Let's take a look at some basic settings affecting your operating system, JVM memory and disk.
First, check the O/S configuration on each of your cluster nodes against the recommended production settings listed in the DataStax documentation. These settings will ensure that your servers have correct resource limit settings, JVM version, JNA, swap disabled, etc and will be the foundation for a well-tuned system.
Next, review your Apache Cassandra™ Java Heap settings. While Map/Reduce tasks run outside the Cassandra JVM, each analytics-enabled node will include additional JobTracker and/or TaskTracker threads and CFS operations can increase memtable usage for column-families in the CFS keyspace. As always, monitor your Cassandra JVM memory usage and adjust accordingly if low memory conditions arise. It's also important to take into consideration that your Map/Reduce tasks will be running as separate JVM processes and you'll need to keep some memory in reserve for them as well.
Lastly, check that the Commit Log is located on a separate disk from the Data directories. This is a standard recommendation made all the more pertinent because large hadoop jobs will only increase the load on the I/O subsystem.
Ensuring that your Apache Cassandra™ cluster is configured properly will provide a good base environment for running Hadoop jobs and avoid time spent later to diagnose and resolve issued caused by improper
Tuning Map Reduce
Comprehensive Hadoop Map/Reduce configuration and tuning is a complex subject and there are too many configurable parameters in Apache Hadoop™ to make a complete list here. Instead, we'll cover a few parameters that can help improve the performance of your DSE Hadoop Map/Reduce jobs.
Settings in core-site.xml
The core-site.xml file contains site-specific configuration Hadoop configuration settings.
dfs.block.size - per file block size
This is the default block size, in bytes, for new files created in the distributed file system. The larger the size of data being processed, the larger the data block size should be. The number of map tasks depends on the input data size and the block size; a larger block size will result in fewer map tasks. A larger block size will also reduce the number of partitions stored in the CFS 'sblocks' column family resulting in a smaller footprint for partition-dependent structures, e.g. key cache, bloom filter, index.
Default value: 64Mb (67108864)
Recommended value: 128Mb (134217728) or 256Mb (268435456)
Settings in mapred-site.xml
The mapred-site.xml file contains site-specific configuration Map/Reduce parameter settings.
This is the local directory where map/reduce tasks store intermediate data files.
Map/reduce tasks process a lot of data and are I/O intensive, so the location for this data should be isolated as much as possible, especially from Cassandra commit log and data directories. A list of directories can be specified to further spread the I/O load.
Recommended value: Set to a separate physical device(s) from Cassandra data file and commit log directories.
This setting determines when to start running reduce tasks and is specified as a percentage of completed map tasks.
With the default value, when 5% of the map tasks have completed, then reduce tasks will start to copy and shuffle intermediate output; however the actual reduce won't start until the mappers are all completed, so a low value tends to occupy reducer slots. Using higher values decreases the overlap between mappers and reduces and allows reducers to wait less before starting.
Default value: 5% (0.05)
Recommended value: 80% (0.80) or higher
Even with the few settings listed above, determining optimal values and setting them on each analytics node takes time and effort.
DSE 3.1 and later now provides some help with that task by inspecting your system and setting auto-tuned default values for the following configuration settings:
These settings are written to dse-mapred-default.xml and can be overridden by values set in mapred-site.xml.
Let's take a look at each of these configuration settings and how auto-tuning arrives at default values.
mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum
These are the maximum number of map/reduce tasks permitted to execute simultaneously per node.
If your map/reduce tasks are not CPU intensive, try setting this value equal to the number of cores on each node as long as there is sufficient memory in the system to support the total number of map and reduce tasks together with memory needed for the O/S and Cassandra.
Recommended value: #CPU cores-1
Auto-tuned value: #CPU cores-1, as long as there is enough memory to support them.
This parameter is used to pass any Java options to the map/reduce child processes, and we will use this to set the maximum Java heap size for each map/reduce task. The amount of memory use as an I/O buffer (io.sort.mb below) is closely related to this setting as that buffer is contained the map/reduce task java heap area.
The default size of 256Mb can be too small for many map/reduce tasks and can lead to java OutOfMemoryErrors. You want to increase this value to avoid OOM errors and provide additional memory for I/O operations (see io.sort.mb) but at the same time stay within the constraints of the total available system memory. If the value is too small you risk OutOfMemeoryErrors in your map/reduce tasks and/or excessive intermediate disk spills. A value that is too large wastes memory resources and limits the total number of map/reduce tasks that can be supported concurrently.
The auto-tuning default value is determined by estimating the amount of memory available for hadoop and then dividing that value by the total number of map/reduce tasks (mapred.tasktracker.map.tasks.maximum + mapred.tasktracker.reduce.tasks.maximum). The amount of memory available for hadoop map/reduce tasks is:
Hadoop Memory = System Memory - Cassandra Memory - 128Mb JVM overhead - 256Mb O/S overhead
Default: 256Mb (-Xmx256m)
Recommend value: 512Mb or 1024MB (-Xmx512m or -Xmx1024m) depending on the amount of available system memory.
Auto-tuned value: Max(256Mb, Hadoop Memory / (max#MapTasks + max#ReduceTasks), where Hadoop Memory is as defined above.
This sets the size of memory buffer used during sort operations. This buffer is contained within the map/reduce task's JVM heap as defined above in mapred.child.java.opts. If this buffer size is too small for the amount of input data, it can lead to intermediate spills to disk and which will later need to be read and merged. Increasing this value will reduce or eliminate the number of intermediate spills going to disk and reduce the overall I/O load on your system.
Default value: 100 Mb
Recommended value: Use 1/4 to 1/2 of the map/reduce task Java heap size setting (in mapred.child.java.opts).
Auto-tuned value: 1/2 of the map/reduce Java heap size
This value sets the number of input files that are merged at once by map/reduce tasks. The higher this value is set means the fewer the number of passes needed to merge the map spills and therefore less disk I/O.
Values up to 100 are commonly recommended for larger data sets; however, since data from all files being merged at once go into the same memory buffer sized by io.sort.mb, increasing io.sort.factor alone will decrease the chunk size (io.sort.mb / io.sort.factor) read from each file. As the chunk size becomes smaller more I/O requests are need to read the data, so try to keep the chunk size around 10Mb (this does not apply to SSD's).
Default value: 10
Recommended value: io.sort.mb / 10
Auto-tuned value: io.sort.mb / 10