Introducing DataStax Bulk Loader
date: May 8, 2018
With DataStax Enterprise (DSE) 6 we’ve introduced one of the most requested features for DSE, a bulk loading (and unloading) utility called the DataStax Bulk Loader (abbreviated as DSBulk). Loading data into a database shouldn’t be something someone spends a lot of time thinking about or doing. There should just be a tool that is simple, fast, robust, and flexible. And that’s what DSBulk is! It’s as much as 4x faster than loading data via COPY commands and community tools.
What’s the Problem?
There are a number of common use cases that need to either put data into the database in bulk, or extract data out of the database in bulk. Here are just a few:
- A new user just installed DSE and wants to load a large data set they’re familiar with so they can explore DSE.
- In order to migrate an application to use DSE, the existing data needs to be moved from the legacy system into DSE.
- In order to make an offline copy of the database, the data in DSE needs to be exported to local files.
- A production workflow has data arrive from disparate locations into a file system on one machine, which then needs to be loaded into DSE.
What all of these cases have in common is that they are trying to take a file, or set of files, on their local machine and efficiently load them into DSE. Or, they are trying to do the reverse, efficiently unload data in DSE onto the local machine. The goal is simple: The tool should be:
- Easy to use.
- Able to support common incoming data formats.
- Able to support multiple field formats, such as dates and times.
- Able to support advanced security configurations.
- Able to gracefully handle badly parsed data and database insertion errors.
- Able to report on the status and completion of loading tasks, including summary statistics (such as the load rate).
- Efficient and fast.
For all the things it can do for data ingestion, it should be able support the analogous features for data extraction.
DataStax Bulk Loader
The first thing to point out, right up front, is that while bulk loading is by far the most dominant use case, DataStax Bulk Loader was also designed to equally cover the bulk unloading use case. A core tenet of the design is that for every loading option there is an analogous unloading option, and vice versa, resulting in a very easy-to-understand tool that can be applied to a variety of use cases. DSBulk is a command-line tool for both Linux and Windows environments. It’s included in DataStax Enterprise and DataStax Basic, and also available as a standalone download. It supports all DSE Advanced Security features, including Kerberos and password authentication, as well as SSL security. It loads data up to 4x faster than cqlsh’s COPY command, and faster than existing community tools.
Figure 1: Load Performance of DSBulk vs Cqlsh DSBulk is designed to load files as they’re presented into existing database tables. That is, DSBulk uses existing tables and will not create new tables. To create tables in DSE, other tools should be used, such as the CQL shell tool, cqlsh. Additionally, DSBulk does not perform any data transformations, except for interpreting input data, such as converting a string representation of a date into a CQL data type. Operations such as converting strings to lowercase, arithmetic on input columns, or filtering out rows based on some criteria, are not supported. To accomplish these tasks, users should leverage other tools, such as DSE Analytics or scripting languages like Bash, Perl, or Python. DSBulk can accept data on STDIN, so users can pipe the output from the transformation step directly into DSBulk for loading. Analogously, DSBulk can export data to STDOUT, which can be piped into a transformation step directly, as well. DSBulk supports input data in JSON or delimited format from a single file or a directory of files (you can even provide a pattern for the filename, such as “/path/to/load/input*.csv”). In both cases, there are a number of input parsing options that can be set, enabling input in a wide variety of formats. These include setting the delimiter (e.g., comma), the date format string, the decimal delimiter (e.g., to support European formats), the string equivalent of NULL, etc. Users can also specify the mapping from input fields to database columns in a flexible manner. But DSBulk goes beyond simply loading the data in files. The user can set a time-to-live on the data being inserted, or add the current timestamp to the data as a column of input. Advanced users can provide specific CQL to be used for the insert, allowing for complex loading scenarios. This is even more useful for taking data out, as the user can specify a subset of the data, rows or columns, to be unloaded. As previously mentioned, DSBulk supports loading data to DSE and DataStax Basic, and supports unloading data from DSE, DataStax Basic, and CQL-driver-compatible sources such as open-source Apache CassandraTM. One note with respect to the other options that exist to load data into DSE: another approach to loading data would be to leverage DSE Analytics’ integrated Apache SparkTM. The benefit here is that multiple processes will be loading data in parallel to DSE. The only requirement in this situation is that the data needs to be in a distributed file system, such as DSEFS or Amazon S3. If the data does reside in those locations, DSE Analytics may be a more efficient approach. But for data that exists in one place, such as on a laptop or a server in the data center, DSBulk is the right choice.
Simple Task, Robust Solution
Data loading is not something that a user should have to spend a lot of time thinking about, and with DSBulk you no longer have to. It’s an efficient, flexible, easy-to-use tool that excels at the simple task of loading and unloading data. DSBulk is available as a standalone download or as part of DSE and DataStax Basic. You can download it from the DataStax download page and read the documentation for more information.