Apache Cassandra 1.1 Documentation

Cassandra Bulk Loader

This document corresponds to an earlier product version. Make sure you are using the documentation that corresponds to your version.


The sstableloader tool lets you bulk load external data into a cluster, load existing SSTables into another cluster with a different number of nodes or a different replication strategy, and restore snapshots.

About sstableloader

The sstableloader tool streams a set of SSTable data files to a live cluster. It does not simply copy the set of SSTables to every node, but transfers the relevant part of the data to each node, conforming to the replication strategy of the cluster. The column family into which the data is loaded does not need to be empty.

Because sstableloader uses Cassandra gossip, make sure that the cassandra.yaml configuration file is in the classpath and set to communicate with the cluster. At least one node of the cluster must be configured as seed. If necessary, properly configure the following properties: listen_address, storage_port, rpc_address, and rpc_port.
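Before running the loader, it can help to confirm which values the cassandra.yaml on the classpath actually contains. The following is a minimal sketch, not part of Cassandra: the function name show_loader_settings and the CASSANDRA_CONF variable are assumptions made for illustration, and the pattern assumes the properties are written one per line as in the stock file.

```shell
#!/bin/sh
# Sketch: print the settings that sstableloader depends on for gossip.
# CASSANDRA_CONF is a hypothetical variable naming the directory that
# holds cassandra.yaml; pass an explicit path to override it.
show_loader_settings() {
    yaml="${1:-${CASSANDRA_CONF:-conf}/cassandra.yaml}"
    # Match the four properties named above plus the seeds entry.
    # Commented-out lines (starting with '#') are skipped.
    grep -E '^[[:space:]]*(- )?(listen_address|storage_port|rpc_address|rpc_port|seeds):' "$yaml"
}

# Example call (the path is an assumption):
#   show_loader_settings /etc/cassandra/cassandra.yaml
```

If a property is missing from the output, it is either commented out or absent from the file.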

If you use sstableloader to load external data, you must first generate SSTables. If you use DataStax Enterprise, you can use Sqoop to migrate your data; if you use Cassandra, follow the procedure described in the Using the Cassandra Bulk Loader blog post. Before loading the data, you must define the schema of the column families with the CLI, Thrift, or CQL.

To get the best throughput from SSTable loading, you can use multiple instances of sstableloader to stream across multiple machines. No hard limit exists on the number of SSTables that sstableloader can load at the same time, so you can add loaders until you see no further improvement.
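One way to drive several loaders at once is to start one sstableloader instance per loader machine from a control host. The sketch below assumes passwordless ssh to each loader machine, that sstableloader is on the remote PATH, and that the SSTable directory already exists at the same path on every machine. The function name run_on_loaders, the REMOTE variable, and all host names are hypothetical.

```shell
#!/bin/sh
# Sketch: start one sstableloader per loader machine and wait for all.
# Usage: run_on_loaders <initial_host> <dir_path> <loader_host>...
# REMOTE is a hypothetical override for the remote-execution command
# (defaults to ssh); setting REMOTE=echo gives a dry run.
run_on_loaders() {
    seed="$1"
    dir="$2"
    shift 2
    pids=""
    for host in "$@"; do
        # Each loader machine streams its own copy of <dir_path>.
        ${REMOTE:-ssh} "$host" "sstableloader -d $seed $dir" &
        pids="$pids $!"
    done
    status=0
    for pid in $pids; do
        # Report failure if any single loader failed.
        wait "$pid" || status=1
    done
    return $status
}

# Example call (all names are assumptions):
#   run_on_loaders 10.0.0.1 Keyspace1/Standard1/ loaderA loaderB
```

Adding another host to the argument list adds another loader, which makes it easy to keep adding machines until throughput stops improving.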

If you use sstableloader on the same machine as the Cassandra node, you can't use the same network interface as the Cassandra node. However, you can use the JMX > StorageService > bulkload() call from that node. This method takes the absolute path to the directory where the SSTables are located, and loads them just as sstableloader does. However, because the node is both source and destination for the streaming, it increases the load on that node. This means that you should load data from machines that are not Cassandra nodes when loading into a live cluster.

Using sstableloader

In binary installations, sstableloader is located in the <install_location>/bin directory.

The sstableloader tool bulk loads the SSTables found in the directory <dir_path> to the configured cluster. The parent directory of <dir_path> is used as the keyspace name. For example, to load an SSTable named Standard1-he-1-Data.db into keyspace Keyspace1, the files Keyspace1-Standard1-he-1-Data.db and Keyspace1-Standard1-he-1-Index.db must be in a directory called Keyspace1/Standard1/.

sstableloader [options] <dir_path>


$ ls -1 Keyspace1/Standard1/
Keyspace1-Standard1-he-1-Data.db
Keyspace1-Standard1-he-1-Index.db
$ <path_to_install>/bin/sstableloader -d localhost <keyspace>/<dir_name>/

where <dir_name> is the directory containing the SSTables. Only the -Data and -Index components are required; -Statistics and -Filter are ignored.
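The directory layout described above can be prepared with a small helper. This is a sketch under assumptions: the function name stage_sstables and its argument order are invented for illustration, and it assumes the source files follow the <keyspace>-<column_family>-... naming shown in the example.

```shell
#!/bin/sh
# Sketch: copy the required SSTable components into <keyspace>/<cf>/.
# Usage: stage_sstables <source_dir> <keyspace> <column_family> [dest_root]
stage_sstables() {
    src="$1"
    keyspace="$2"
    cf="$3"
    dest="${4:-.}"
    target="$dest/$keyspace/$cf"
    mkdir -p "$target" || return 1
    # Only the -Data and -Index components are needed by the loader.
    for f in "$src/$keyspace-$cf"-*-Data.db "$src/$keyspace-$cf"-*-Index.db; do
        [ -e "$f" ] || continue   # skip patterns that matched nothing
        cp "$f" "$target/" || return 1
    done
    # Print the path to pass to sstableloader as <dir_path>.
    printf '%s\n' "$target/"
}

# Example call (paths are assumptions):
#   stage_sstables /backups/snapshot1 Keyspace1 Standard1 /tmp/load
```

The printed path can then be passed to sstableloader as <dir_path>.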

The sstableloader has the following options:

Option                         Description
-d, --nodes <initial hosts>    Connect to comma separated list of hosts for initial ring information.
--debug                        Display stack traces.
-h, --help                     Display help.
-i, --ignore <NODES>           Do not stream to this comma separated list of nodes.
--no-progress                  Do not display progress.
-p, --port <rpc port>          RPC port (default 9160).
-t, --throttle <throttle>      Throttle speed in Mbits (default unlimited).
-v, --verbose                  Verbose output.

Kerberos authentication options (Available only in DataStax Enterprise 3.0.1)
-pr, --principal               Kerberos principal. (Optional, not required if you have run kinit.)
-k, --keytab                   Keytab location. (Optional, not required if you have run kinit.)

SSL encryption options (Available only in DataStax Enterprise 3.0.1)
--ssl-keystore                 SSL keystore location.
--ssl-keystore-password        SSL keystore password.
--ssl-keystore-type            SSL keystore type.
--ssl-truststore               SSL truststore location.
--ssl-truststore-password      SSL truststore password.
--ssl-truststore-type          SSL truststore type.
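As an illustration of how several of these options combine, the sketch below wraps a throttled load that skips selected nodes. Only flags listed in the table are used. The function name load_throttled, the SSTABLELOADER variable, and the addresses in the example call are assumptions.

```shell
#!/bin/sh
# Sketch: throttled bulk load that avoids streaming to some nodes.
# SSTABLELOADER is a hypothetical variable for the loader's location;
# it defaults to finding sstableloader on the PATH.
SSTABLELOADER="${SSTABLELOADER:-sstableloader}"

# Usage: load_throttled <initial_hosts> <ignored_nodes> <mbits> <dir_path>
load_throttled() {
    # -d: hosts used to obtain initial ring information
    # -i: nodes that should not receive streamed data
    # -t: throttle speed in Mbits
    "$SSTABLELOADER" -d "$1" -i "$2" -t "$3" --no-progress "$4"
}

# Example call (addresses are assumptions):
#   load_throttled 10.0.0.1,10.0.0.2 10.0.0.9 100 Keyspace1/Standard1/
```

Suppressing progress output with --no-progress is a choice made here for scripted runs; drop it for interactive use.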