Sylvain Lebresne

Bulk loading data in Cassandra has historically been difficult. Although Cassandra has had the BinaryMemtable interface from the very beginning, BinaryMemtable is hard to use and provides a relatively minor throughput improvement over normal client writes.

Cassandra 0.8.1 introduces a new tool to solve this problem:&nbsp;<tt>sstableloader</tt>.

<h2>Using&nbsp;<tt>sstableloader</tt></h2>

For the most up-to-date information, see the&nbsp;<a href="https://docs.datastax.com/en/archived/cassandra/2.1/">DataStax Community Documentation</a>.

<h3>Overview</h3>

<tt>sstableloader</tt>&nbsp;is a tool that, given a set of&nbsp;<a href="http://wiki.apache.org/cassandra/MemtableSSTable">sstable</a>&nbsp;data files, streams them to a live cluster. It does&nbsp;not&nbsp;simply copy the set of sstables to every node, but only transfers the relevant part of the data to each, conforming to the replication strategy of the cluster.
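To make "the relevant part of the data" more concrete, here is a minimal sketch of how a key can be mapped to the nodes that should receive it. It is not Cassandra's code: the class <tt>RingSketch</tt> and its methods are hypothetical, and it assumes one token per node, an MD5-based token in the spirit of <tt>RandomPartitioner</tt>, and replicas placed on the next nodes clockwise on the ring.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical illustration only: a toy token ring, not Cassandra's implementation.
public class RingSketch {
    // token -> node address, kept sorted by token
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    public void addNode(BigInteger token, String address) {
        ring.put(token, address);
    }

    // MD5-based token (assumption: similar in spirit to RandomPartitioner)
    public static BigInteger token(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return new BigInteger(md5.digest(key.getBytes(StandardCharsets.UTF_8))).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // The first replica is the node with the smallest token >= t (wrapping around);
    // the remaining replicas are the next nodes clockwise on the ring.
    public List<String> replicasFor(BigInteger t, int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        if (ring.isEmpty()) {
            return replicas;
        }
        int wanted = Math.min(replicationFactor, ring.size());
        BigInteger start = ring.ceilingKey(t);
        if (start == null) {
            start = ring.firstKey(); // wrap around the ring
        }
        for (String node : ring.tailMap(start, true).values()) {
            if (replicas.size() < wanted) replicas.add(node);
        }
        for (String node : ring.headMap(start, false).values()) {
            if (replicas.size() < wanted) replicas.add(node);
        }
        return replicas;
    }
}
```

With such a mapping, a loader only has to send each row to the nodes returned for that row's token, rather than sending every sstable to every node.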

There are two primary use cases for this new tool:

<ul>
	<li>Bulk loading external data into a cluster: for this you will have to first generate sstables for the data to load, as we will see later in this post.</li>
	<li>Loading pre-existing sstables, typically snapshots, into another cluster with different node counts or replication strategy.</li>
</ul>

&nbsp;

<h3>Example</h3>

Let us start with the second use case to demonstrate how&nbsp;<tt>sstableloader</tt>&nbsp;is used. For that, consider the following scenario: you have a one-node test cluster populated with data that you want to transfer into another, multi-node cluster.

A brute-force solution would be to copy all the sstables of the source node to every node in the multi-node destination cluster, restart each node, and then run&nbsp;<tt>nodetool cleanup</tt>&nbsp;on them. This works, but is obviously inefficient, especially if the destination cluster has a lot of nodes.

With&nbsp;<tt>sstableloader</tt>, you first need the sstables to be in a directory whose name is the name of the keyspace of the sstables. This is how they will be stored in either the main data directory, or a snapshot. Then, assuming&nbsp;<tt>sstableloader</tt>&nbsp;is configured to talk to your multi-node cluster: 
 
<code>$ ls TestKeyspace/ 
TestCF-g-1-Data.db TestCF-g-2-Data.db TestCF-g-3-Data.db 
TestCF-g-1-Index.db TestCF-g-2-Index.db TestCF-g-3-Index.db 
$ sstableloader TestKeyspace 
Starting client (and waiting 30 seconds for gossip) ... 
Streaming revelant part of TestKeyspace/TestCF-g-1-Data.db TestKeyspace/TestCF-g-2-Data.db TestKeyspace/TestCF-g-3-Data.db to [/127.0.0.1, /127.0.0.2, /127.0.0.3]</code>

<code>progress: [/127.0.0.1 3/3 (100)] [/127.0.0.2 3/3 (100)] [/127.0.0.3 3/3 (100)] [total: 100 - 24MB/s (avg: 18MB/s)] 
Waiting for targets to rebuild indexes ...</code>

<h3>Configuration</h3>

To learn the topology of the cluster, the number of nodes, which ranges of keys each node is responsible for, the schema, etc.,&nbsp;<tt>sstableloader</tt>&nbsp;uses the Cassandra gossip subsystem. It thus requires a directory containing a&nbsp;<tt>cassandra.yaml</tt>&nbsp;configuration file in the classpath. (If you use sstableloader from the Cassandra source tree, the&nbsp;<tt>cassandra.yaml</tt>&nbsp;file in&nbsp;<tt>conf</tt>&nbsp;will be used.)

In this config file, the&nbsp;<tt>listen_address</tt>,&nbsp;<tt>storage_port</tt>,&nbsp;<tt>rpc_address</tt>&nbsp;and&nbsp;<tt>rpc_port</tt>&nbsp;should be set correctly to communicate with the cluster, and at least one node of the cluster you want to load data in should be configured as&nbsp;<tt>seed</tt>. The rest is ignored for the purposes of&nbsp;<tt>sstableloader</tt>.

Because&nbsp;<tt>sstableloader</tt>&nbsp;uses gossip to communicate with other nodes, if it is launched on the same machine as a Cassandra node, it will need to use a different network interface than that node. But if you want to load data from a Cassandra node, there is a simpler solution: you can use the&nbsp;<tt>JMX-&gt;StorageService-&gt;bulkload()</tt>&nbsp;call from said node.

This method simply takes the absolute path to the directory where the sstables to load are, and it will load those as&nbsp;<tt>sstableloader</tt>&nbsp;would. However, since the node on which the call is made will be both source and destination for the streaming, this will put more load on that particular node, so we advise loading data from machines that are not Cassandra nodes when loading into a live cluster.
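The JMX call can be made from jconsole, or from a small client. The following is a minimal sketch of such a client, under several assumptions that you should check against your own node (for example in jconsole): that the MBean is registered as <tt>org.apache.cassandra.db:type=StorageService</tt>, that the operation is exposed under the name <tt>bulkLoad</tt> with a single <tt>String</tt> argument, and that JMX is reachable without authentication. The class name <tt>JmxBulkLoad</tt> and its helper methods are hypothetical.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical client: invokes the bulk load operation on a node over JMX.
public class JmxBulkLoad {
    // Standard JMX-over-RMI URL for a given host and JMX port.
    public static String serviceUrl(String host, int port) {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
    }

    public static void bulkLoad(String host, int port, String sstableDirectory) throws Exception {
        JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(serviceUrl(host, port)));
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Assumption: MBean name and operation name; verify them on your node.
            ObjectName storageService = new ObjectName("org.apache.cassandra.db:type=StorageService");
            mbs.invoke(storageService,
                       "bulkLoad",
                       new Object[] { sstableDirectory },          // absolute path on the node
                       new String[] { String.class.getName() });   // operation signature
        } finally {
            connector.close();
        }
    }

    // Usage: java JmxBulkLoad <host> <jmx_port> <absolute_path_to_keyspace_directory>
    public static void main(String[] args) throws Exception {
        bulkLoad(args[0], Integer.parseInt(args[1]), args[2]);
    }
}
```

Note that the directory path is resolved on the Cassandra node itself, not on the machine running this client.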

Note that the schema for the column families to be loaded should be defined beforehand, using your preferred method: CLI, thrift or CQL.

<h3>Other considerations</h3>

<ul>
	<li>There is no requirement that the column family into which data is loaded be empty. More generally, it is perfectly reasonable to load data into a live, active cluster.</li>
	<li>To get the best throughput out of the sstable loading, you will want to parallelize the creation of sstables to stream across multiple machines. There is no hard limit on the number of sstable loaders that can run at the same time, so you can add additional loaders until you see no further improvement.</li>
	<li>At the time of this writing,&nbsp;<tt>sstableloader</tt>&nbsp;does not handle failure very well. In particular, if a node it is sending to dies, it will get stuck (a progress indicator is displayed, so you will be able to tell when that happens and check whether one of your nodes is indeed dead). Until this is fixed, if that happens, you will have to stop the loader and relaunch it. If you know that the transfer has successfully ended on some of the other nodes, you can use the&nbsp;<tt>-i</tt>&nbsp;flag to skip those nodes during the retry.</li>
</ul>

<h2>Bulk-loading external data: a complete example</h2>

<h3>The setup</h3>

If you want to bulk-load external data that is not in sstable form using&nbsp;<tt>sstableloader</tt>, you will have to first generate sstables. To do so, the simplest solution is the new Java class&nbsp;<tt>SSTableSimpleUnsortedWriter</tt>&nbsp;introduced in Cassandra 0.8.2. To demonstrate how it is used, let us consider the example of bulk-loading "user profile" data from a csv file. More precisely, we consider a csv file of the following form: 
 
<code># uuid, firstname, lastname, password, age, email</code>

<code>5bd8c586-ae44-11e0-97b8-0026b0ea8cd0, Alice, Smith, asmi1975, 32, alice.smith@mail.com 
4bd8cb58-ae44-12e0-a2b8-0026b0ed9cd1, Bob, Miller, af3!df8, 28, bob.miller@mail.com 
1ce7cb58-ae44-12e0-a2b8-0026b0ad21ab, Carol, White, cw1845?, 49, c.white@mail.com 
...</code>
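Before anything can be written to an sstable, each line of this file has to be turned into typed values. Here is a minimal sketch of such a parser; the class <tt>CsvEntry</tt> and its field names are hypothetical and simply mirror the header above. It assumes that no field contains a comma, which would not hold for arbitrary passwords in a real data set.

```java
import java.util.UUID;

// Hypothetical holder for one parsed csv line:
// uuid, firstname, lastname, password, age, email
public class CsvEntry {
    public final UUID key;
    public final String firstname;
    public final String lastname;
    public final String password;
    public final long age;
    public final String email;

    private CsvEntry(UUID key, String firstname, String lastname,
                     String password, long age, String email) {
        this.key = key;
        this.firstname = firstname;
        this.lastname = lastname;
        this.password = password;
        this.age = age;
        this.email = email;
    }

    // Returns the parsed entry, or null for comments, blank lines and malformed lines.
    public static CsvEntry parse(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return null;
        }
        String[] columns = trimmed.split(",");
        if (columns.length != 6) {
            return null;
        }
        try {
            return new CsvEntry(UUID.fromString(columns[0].trim()),
                                columns[1].trim(),
                                columns[2].trim(),
                                columns[3].trim(),
                                Long.parseLong(columns[4].trim()),
                                columns[5].trim());
        } catch (IllegalArgumentException e) {
            // Covers both a malformed uuid and a non-numeric age.
            return null;
        }
    }
}
```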

From this csv, we want to populate two column families that can be created (using the CLI) with: 
 
<code>create keyspace Demo; 
use Demo; 
create column family Users 
&nbsp;&nbsp;&nbsp;&nbsp;with key_validation_class=LexicalUUIDType 
&nbsp;&nbsp;&nbsp;&nbsp;and comparator=AsciiType 
&nbsp;&nbsp;&nbsp;&nbsp;and column_metadata=[ 
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{ column_name: 'firstname', validation_class: AsciiType }, 
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{ column_name: 'lastname', validation_class: AsciiType }, 
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{ column_name: 'password', validation_class: AsciiType }, 
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{ column_name: 'age', validation_class: LongType, index_type: KEYS }, 
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{ column_name: 'email', validation_class: AsciiType }];</code>

<code>create column family Logins 
&nbsp;&nbsp;&nbsp;&nbsp;with key_validation_class=AsciiType 
&nbsp;&nbsp;&nbsp;&nbsp;and comparator=AsciiType 
&nbsp;&nbsp;&nbsp;&nbsp;and column_metadata=[ 
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{ column_name: 'password', validation_class: AsciiType }, 
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{ column_name: 'uuid', validation_class: LexicalUUIDType }];</code> 
 
In other words, the column family&nbsp;<tt>Users</tt>&nbsp;will contain user profiles: the key is a uuid identifying the user, the columns are the user properties. We also added a secondary index on the 'age' property, mainly to show that this is supported by the bulk-loading process.

The second column family,&nbsp;<tt>Logins</tt>, maps each user's&nbsp;<tt>email</tt>&nbsp;(this example assumes that user emails are unique) to the user's&nbsp;<tt>password</tt>&nbsp;and&nbsp;<tt>identifier</tt>&nbsp;(the&nbsp;<tt>uuid</tt>&nbsp;column). This is the column family that would typically be queried when a user logs in: it lets you check the credentials and find the identifier needed to retrieve the profile data. (A possibly simpler and better design would be a secondary index on the&nbsp;<tt>email</tt>&nbsp;column of&nbsp;<tt>Users</tt>; we don't do this here in order to show how to load multiple column families together.)

<h3>Creating sstables</h3>

A complete Java example of how to create the relevant sstables from the csv file using the&nbsp;<tt>SSTableSimpleUnsortedWriter</tt>&nbsp;class can be found&nbsp;<a href="/sites/default/files/content/blog/past_blogs/past_dev_blogs/DataImportExample.java">here</a>.

To compile this file, the Cassandra jar (&gt;= 0.8.2) needs to be in the classpath (<tt>javac -cp &lt;path_to&gt;/apache-cassandra-0.8.2.jar DataImportExample.java</tt>). To run it, the Cassandra jar needs to be present, as well as the jars of the libraries used by Cassandra (those in the&nbsp;<tt>lib/</tt>&nbsp;directory of the Cassandra source tree). Valid&nbsp;<tt>cassandra.yaml</tt>&nbsp;and&nbsp;<tt>log4j</tt>&nbsp;configuration files should also be accessible; typically, this means the&nbsp;<tt>conf/</tt>&nbsp;directory of the Cassandra source tree should be in the classpath--see&nbsp;<a href="/sites/default/files/content/blog/past_blogs/past_dev_blogs/DataImport.txt">here</a>&nbsp;for a typical launch script that sets all those. As of 0.8.2, you will need to set the&nbsp;<tt>data_file_directories</tt>&nbsp;and&nbsp;<tt>commitlog_directory</tt>&nbsp;directives in said&nbsp;<tt>cassandra.yaml</tt>&nbsp;to accessible directories, but&nbsp;not&nbsp;ones of an existing Cassandra node. (This will be&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-2953">fixed in 0.8.3</a>, but in the meantime using&nbsp;<tt>/tmp</tt>&nbsp;for both is a good idea.) The only useful property you need to set up for&nbsp;<tt>SSTableSimpleUnsortedWriter</tt>&nbsp;is the partitioner you want to use.

Let us run through the important parts of this example:

<ul>
	<li>Creation of the sstable writers: 
	 
	<code>SSTableSimpleUnsortedWriter usersWriter = new SSTableSimpleUnsortedWriter( 
	&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;directory, 
	&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;keyspace, 
	&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"Users", 
	&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;AsciiType.instance, 
	&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;null, 
	&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;64); 
	SSTableSimpleUnsortedWriter loginWriter = new SSTableSimpleUnsortedWriter( 
	&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;directory, 
	&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;keyspace, 
	&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"Logins", 
	&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;AsciiType.instance, 
	&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;null, 
	&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;64);</code> 
	 
	The&nbsp;<tt>directory</tt>&nbsp;and&nbsp;<tt>keyspace</tt>&nbsp;parameters are the directory in which to write the sstables (a Java&nbsp;<tt>File</tt>) and the keyspace of the column families (a&nbsp;<tt>String</tt>), respectively. Next come the column family name, the comparator and the sub-columns comparator--here, we don't use super columns so the sub-columns comparator is&nbsp;<tt>null</tt>.

	&nbsp;

	The last parameter is a "buffer" size: sstables need to have rows sorted according to the partitioner. For&nbsp;<tt>RandomPartitioner</tt>, this means that rows should be ordered by the MD5 of their key. Since there is no chance data will come in that order,&nbsp;<tt>SSTableSimpleUnsortedWriter</tt>&nbsp;buffers whatever input it gets in memory and "flushes" everything into one sstable once the buffer is full. The buffer size is in MB (here 64MB) and actually corresponds to serialized space. That is, the resulting sstables will be approximately 64MB in size. Note that the "live" size on the Java heap can be larger, so setting this parameter too large is not advisable, and in any case there is little performance advantage to using a very high value.
	</li>
	<li>Populate with each csv entry: 
	 
	<code>for (...each csv entry...) 
	{ 
	&nbsp;&nbsp;&nbsp;&nbsp;ByteBuffer uuid = ByteBuffer.wrap(decompose(entry.key)); 
	&nbsp;&nbsp;&nbsp;&nbsp;usersWriter.newRow(uuid); 
	&nbsp;&nbsp;&nbsp;&nbsp;usersWriter.addColumn(bytes("firstname"), bytes(entry.firstname), timestamp); 
	&nbsp;&nbsp;&nbsp;&nbsp;usersWriter.addColumn(bytes("lastname"), bytes(entry.lastname), timestamp); 
	&nbsp;&nbsp;&nbsp;&nbsp;usersWriter.addColumn(bytes("password"), bytes(entry.password), timestamp); 
	&nbsp;&nbsp;&nbsp;&nbsp;usersWriter.addColumn(bytes("age"), bytes(entry.age), timestamp); 
	&nbsp;&nbsp;&nbsp;&nbsp;usersWriter.addColumn(bytes("email"), bytes(entry.email), timestamp); 
	 
	 
	&nbsp;&nbsp;&nbsp;&nbsp;loginWriter.newRow(bytes(entry.email)); 
	&nbsp;&nbsp;&nbsp;&nbsp;loginWriter.addColumn(bytes("password"), bytes(entry.password), timestamp); 
	&nbsp;&nbsp;&nbsp;&nbsp;loginWriter.addColumn(bytes("uuid"), uuid, timestamp); 
	} 
	usersWriter.close(); 
	loginWriter.close();</code> 
	 
	In this excerpt,&nbsp;<tt>entry</tt>&nbsp;is a parsed csv entry. Each call to&nbsp;<tt>newRow()</tt>&nbsp;starts a new row that is populated with the columns added by&nbsp;<tt>addColumn()</tt>. Though not demonstrated here, it is equally simple to add super, expiring or counter columns; the exact API is described&nbsp;<a href="https://svn.apache.org/viewvc/cassandra/tags/cassandra-0.8.2/src/java/org/apache/cassandra/io/sstable/AbstractSSTableSimpleWriter.java?view=co">here</a>.
	&nbsp;

	Note that the order of additions of rows and of columns inside rows does not matter. It is also possible to "restart" a row multiple times or to add the same column multiple times, in which case the usual conflict resolution rules between columns apply.

	Finally, each writer should be closed, otherwise the resulting sstables will not be complete.
	</li>
</ul>
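To illustrate the buffering behaviour described above, here is a simplified, self-contained sketch of the idea. It is not the code of&nbsp;<tt>SSTableSimpleUnsortedWriter</tt>: the class <tt>UnsortedBufferSketch</tt> is hypothetical, sizes are counted in characters rather than serialized bytes, "flushing" only records the row keys in token order instead of writing files, and a repeated column simply overwrites the previous value instead of applying timestamp-based conflict resolution.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: buffer rows in any order, emit them sorted by MD5 token.
public class UnsortedBufferSketch {
    private static final class Row {
        final String key;
        final TreeMap<String, String> columns = new TreeMap<>(); // sorted by column name
        Row(String key) { this.key = key; }
    }

    private final long maxBufferedSize;
    private long bufferedSize = 0;
    private final TreeMap<BigInteger, Row> buffer = new TreeMap<>(); // sorted by token
    private final List<List<String>> sstables = new ArrayList<>();   // stands in for files on disk
    private Row currentRow;

    public UnsortedBufferSketch(long maxBufferedSize) {
        this.maxBufferedSize = maxBufferedSize;
    }

    public static BigInteger token(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return new BigInteger(md5.digest(key.getBytes(StandardCharsets.UTF_8))).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // The buffer is only checked (and possibly flushed) when a row is started.
    public void newRow(String key) {
        if (!buffer.isEmpty() && bufferedSize >= maxBufferedSize) {
            flush();
        }
        BigInteger token = token(key);
        currentRow = buffer.get(token); // "restarting" a buffered row reuses it
        if (currentRow == null) {
            currentRow = new Row(key);
            buffer.put(token, currentRow);
        }
    }

    public void addColumn(String name, String value) {
        if (currentRow == null) {
            throw new IllegalStateException("call newRow() first");
        }
        currentRow.columns.put(name, value);
        bufferedSize += name.length() + value.length();
    }

    // Without close(), whatever is still buffered would never be written out.
    public void close() {
        if (!buffer.isEmpty()) {
            flush();
        }
    }

    public List<List<String>> sstables() {
        return sstables;
    }

    private void flush() {
        List<String> keysInTokenOrder = new ArrayList<>();
        for (Row row : buffer.values()) {
            keysInTokenOrder.add(row.key);
        }
        sstables.add(keysInTokenOrder);
        buffer.clear();
        bufferedSize = 0;
        currentRow = null;
    }
}
```

The sketch shows why rows can be added in any order and why the last writer call must be&nbsp;<tt>close()</tt>: sorting happens in the buffer, and only a flush turns the buffer into an sstable.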

Once compiled and run with a csv file as argument, this example program will create sstables in the&nbsp;<tt>Demo</tt>&nbsp;directory. Those sstables can then be loaded into a live cluster using&nbsp;<tt>sstableloader</tt>&nbsp;as described in the previous section:&nbsp;<tt>sstableloader Demo/</tt>.

<h3>Other considerations</h3>

<ul>
	<li><tt>SSTableSimpleUnsortedWriter</tt>&nbsp;never flushes to disk between two calls of&nbsp;<tt>newRow()</tt>. As a consequence, all data inserted between two of those calls must fit in memory. If you have a huge row for which this does not hold, you can call&nbsp;<tt>newRow()</tt>&nbsp;regularly, using the same row key, to avoid buffering everything.</li>
	<li>The methods of the simple writer expect ByteBuffers for the row key, column name and column value. Converting data to bytes is your responsibility; this is the&nbsp;raison d'être&nbsp;of the&nbsp;<tt>bytes()</tt>&nbsp;method in the example above.</li>
</ul>
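As an illustration of that last point, here is a minimal sketch of what such conversion helpers could look like using only the JDK. These are stand-ins written for this post, not the helpers used in the linked example: the class name <tt>ByteBufferHelpers</tt> is hypothetical, and the sketch assumes that text columns hold ASCII data, that <tt>LongType</tt> values are 8-byte big-endian longs, and that a UUID is serialized as its 16 raw bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helpers converting Java values to the ByteBuffers the writer expects.
public class ByteBufferHelpers {
    // Text: for pure ASCII input, UTF-8 produces the same bytes as ASCII.
    public static ByteBuffer bytes(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    // Long: 8 bytes, big-endian (ByteBuffer's default byte order).
    public static ByteBuffer bytes(long n) {
        ByteBuffer buffer = ByteBuffer.allocate(8);
        buffer.putLong(0, n); // absolute put: position stays at 0, ready to be read
        return buffer;
    }

    // UUID: most significant 8 bytes followed by least significant 8 bytes.
    public static ByteBuffer bytes(UUID uuid) {
        ByteBuffer buffer = ByteBuffer.allocate(16);
        buffer.putLong(0, uuid.getMostSignificantBits());
        buffer.putLong(8, uuid.getLeastSignificantBits());
        return buffer;
    }
}
```

Whatever helpers you use, the bytes must match the validation classes declared in the schema; for instance, the&nbsp;<tt>age</tt>&nbsp;column above is declared as&nbsp;<tt>LongType</tt>, so its value should be written as a long and not as the string "32".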

Using the Cassandra Bulk Loader
