
Using the Cassandra Bulk Loader, Updated

By Yuki Morishita - September 26, 2014 | 29 Comments


We introduced sstableloader back in 0.8.1 to bulk load data into Cassandra.
When it was first introduced, we wrote a blog post about its usage, along with how to generate SSTables to bulk load.

Now that Cassandra 2.1.0 has been released, bulk loading has evolved since that old blog post.
Let's see how the changes make our lives easier than before.

What's changed?

The specific changes are:

  • sstableloader no longer participates in gossip membership to get schema and ring information. Instead, it just contacts one of the nodes in the cluster and asks for them. This allows you to bulk load from the same machine where Cassandra is running, since sstableloader no longer listens on the same port as Cassandra.
  • Internally, the streaming protocol has been redesigned, so you can stream data more efficiently than before.
  • A new CQLSSTableWriter was introduced (CASSANDRA-5894). You can now create SSTables using familiar CQL.

In the old post, we showed two scenarios where sstableloader is used. Let's see how the changes work in those scenarios.
I use Apache Cassandra 2.1.0 throughout these examples, from the cluster itself to running sstableloader.

Example 1 - Loading existing SSTables

Usage of sstableloader has not changed much, but because it has to contact a node to get the schema for the SSTables being loaded, you now have to specify the address(es) of the node(s) with the -d option.

So, for example, bulk loading existing SSTables to a cluster looks like this:

$ bin/sstableloader -d 127.0.0.1 ~/Keyspace1/Standard1-cb5e6f30458811e49349511b628b066f
Established connection to initial hosts
Opening sstables and calculating sections to stream
Streaming relevant part of /data/Keyspace1/Standard1-cb5e6f30458811e49349511b628b066f/Keyspace1-Standard1-ka-6-Data.db /data/Keyspace1/Standard1-cb5e6f30458811e49349511b628b066f/Keyspace1-Standard1-ka-5-Data.db to [/127.0.0.1, /127.0.0.2, /127.0.0.3]
progress: [/127.0.0.1]0:2/2 100% [/127.0.0.2]0:2/2 100% [/127.0.0.3]0:2/2 100% total: 100% 0  MB/s(avg: 5 MB/s)
Summary statistics:
   Connections per host:         : 1
   Total files transferred:      : 6
   Total bytes transferred:      : 98802914
   Total duration (ms):          : 9455
   Average transfer rate (MB/s): : 5
   Peak transfer rate (MB/s):    : 11

As you can see, some stats are printed out after the bulk load.

Example 2 - Loading external data

In the old post, we showed an example that creates SSTables from a CSV file using UnsortedSimpleSSTableWriter and then loads them into a Cassandra cluster with sstableloader.
The schema there was created with Thrift, and it had a simple, flat table structure.

For this updated post, let's try a more complex scenario with the new CQLSSTableWriter.
We will use real data from Yahoo! Finance to load historical stock prices in a time-series manner.

Schema definition

If we take a look at the CSV file for Yahoo! (YHOO), it has seven fields:

Date,Open,High,Low,Close,Volume,Adj Close
2014-09-25,39.56,39.80,38.82,38.95,35859400,38.95
...

Let's use the ticker symbol as our partition key, and the 'Date' field as the clustering key.
So the schema looks like this:

CREATE TABLE historical_prices (
    ticker ascii,
    date timestamp,
    open decimal,
    high decimal,
    low decimal,
    close decimal,
    volume bigint,
    adj_close decimal,
    PRIMARY KEY (ticker, date)
) WITH CLUSTERING ORDER BY (date DESC);

We use CLUSTERING ORDER BY so that we can query the most recent data easily.
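
Because rows within each partition are stored newest-first, querying the latest prices needs no ORDER BY clause. A minimal sketch of such a query:

SELECT date, close
FROM historical_prices
WHERE ticker = 'YHOO'
LIMIT 10;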

Generating SSTables using CQLSSTableWriter

How do you bulk load data into such a schema? If you choose UnsortedSimpleSSTableWriter as we did in the old post, you have to manually construct each cell of complex types to fit your CQL3 schema. This requires deep knowledge of how CQL3 works internally.
Enter CQLSSTableWriter.

All you need is the DDL for the table you want to bulk load, and an INSERT statement to put data into it.
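
For reference, the SCHEMA and INSERT_STMT constants used below could be plain CQL strings along the following lines. This is a sketch, not the exact code from the post: the keyspace name quotes is an assumption for illustration, and note that CQLSSTableWriter expects a keyspace-qualified table name here.

// Hypothetical constants for this example; the keyspace name "quotes"
// is an assumption, not taken from the original post.
static final String SCHEMA =
        "CREATE TABLE quotes.historical_prices ("
      + "  ticker ascii,"
      + "  date timestamp,"
      + "  open decimal,"
      + "  high decimal,"
      + "  low decimal,"
      + "  close decimal,"
      + "  volume bigint,"
      + "  adj_close decimal,"
      + "  PRIMARY KEY (ticker, date)"
      + ") WITH CLUSTERING ORDER BY (date DESC)";

static final String INSERT_STMT =
        "INSERT INTO quotes.historical_prices "
      + "(ticker, date, open, high, low, close, volume, adj_close) "
      + "VALUES (?, ?, ?, ?, ?, ?, ?, ?)";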

// Prepare SSTable writer
CQLSSTableWriter.Builder builder = CQLSSTableWriter.builder();
// Set the output directory
builder.inDirectory(outputDir)
       // Set the target schema
       .forTable(SCHEMA)
       // Set the CQL statement used to put data
       .using(INSERT_STMT)
       // Set the partitioner if needed;
       // the default is Murmur3Partitioner, so set this only if you use a different one
       .withPartitioner(new Murmur3Partitioner());
CQLSSTableWriter writer = builder.build();
 
// ...snip... 
 
while ((line = csvReader.read()) != null)
{
    // We use Java types here based on 
    // https://www.datastax.com/drivers/java/2.0/com/datastax/driver/core/DataType.Name.html#asJavaClass%28%29 
    writer.addRow(ticker,
                  DATE_FORMAT.parse(line.get(0)),
                  new BigDecimal(line.get(1)),
                  new BigDecimal(line.get(2)),
                  new BigDecimal(line.get(3)),
                  new BigDecimal(line.get(4)),
                  Long.parseLong(line.get(5)),
                  new BigDecimal(line.get(6)));
}
writer.close();
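
The parts snipped above are just CSV-reading boilerplate. For completeness, here is a minimal sketch of what they might look like, assuming the super-csv library for parsing and a hard-coded input file name; neither is prescribed by the original example.

// Hypothetical setup; the file name and the CSV library are assumptions.
// Requires: java.io.BufferedReader, java.io.FileReader,
// java.text.SimpleDateFormat, java.util.List,
// org.supercsv.io.CsvListReader, org.supercsv.prefs.CsvPreference.
String ticker = "YHOO";
SimpleDateFormat DATE_FORMAT = new SimpleDateFormat("yyyy-MM-dd");
CsvListReader csvReader = new CsvListReader(
        new BufferedReader(new FileReader("YHOO.csv")),
        CsvPreference.STANDARD_PREFERENCE);
csvReader.getHeader(true); // skip the CSV header line
List<String> line;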

You can see the complete example on my GitHub.

After generating SSTables, you can just use sstableloader to load them into the target cluster, as described before.
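
For instance, assuming the writer's output directory follows the keyspace/table layout that sstableloader expects (the path below is illustrative):

$ bin/sstableloader -d 127.0.0.1 /path/to/quotes/historical_prices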

There are still some limitations in CQLSSTableWriter: for example, you cannot use it from multiple threads in parallel, and user-defined types are not supported yet.
But we keep improving it, so stay tuned to the Apache JIRA.

Wrap up

Generating SSTables and bulk loading have been improved over the past releases. There are many new features available to make your life easier.
Start experimenting yourself today!














Comments

  1. Gary says:

    I built my own following the example, and it generates the SSTables and we can load them, no problem. But for some reason the fields in the loaded table in C* are in the wrong order. We've triple-checked the order of everything, and it all seems correct.

    1. Yuki Morishita says:

      > But for some reason the fields in the loaded table in C* are in the wrong order.

      Fields in CQL tables are sorted in a certain order internally, so it can differ from the order you defined in CREATE TABLE.

  2. Pierre says:

    It looks a lot simpler than the previous version. The question is, when should we use bulk data loading? Can we use bulk loading in production on an already-filled table, or is it just for testing purposes? If yes: I have to insert 1k to 100k rows at a time; is it worth using sstableloader in this case, or should I just stick with INSERT INTO?

    1. Yuki Morishita says:

      > when should we use bulk data loading?

      It can be used for things like migrating from another data source, restoring SSTables from a backup, etc.

      > Can we use bulk loading in production on an already-filled table

      Yes. Loaded SSTables are visible after the bulk load succeeds.

      > I have to insert 1k to 100k rows at a time; is it worth using sstableloader in this case, or should I just stick with INSERT INTO?

      Bulk loading is much faster than CQL inserts if you want to put in 100k rows.

      1. Pierre says:

        >> Can we use bulk loading in production on an already-filled table

        > Yes. Loaded SSTables are visible after the bulk load succeeds.

        Sorry, maybe I phrased my question badly:

        I have a cluster with an already-filled table with a lot of records.

        I have new data to insert, ~100k rows, which may contain some rows that are going to override previous records (e.g. same primary key).

        Can I use sstableloader to put the new data in this non-empty table, or should I stick to INSERT INTO?

        1. Yuki Morishita says:

          > Can I use sstableloader to put the new data in this non-empty table, or should I stick to INSERT INTO?

          What you have to keep in mind is that each piece of data has a timestamp, and if it is newer than the existing data, it overrides it.

          CQLSSTableWriter sets the timestamp to the time when you execute “writer.addRow”, and INSERT INTO does the same by default.
          You can set any timestamp using “INSERT INTO … USING TIMESTAMP” in both cases.

          Remember, it is the time when you write the SSTable, not when you stream it using sstableloader.
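
          For example, a write with an explicit timestamp (in microseconds since the epoch; the value below is an arbitrary illustration) looks like this:

          INSERT INTO historical_prices (ticker, date, close)
          VALUES ('YHOO', '2014-09-25', 38.95)
          USING TIMESTAMP 1411689600000000;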

          1. Andreas says:

            When migrating historical data, there will be a large difference between the log date (when the row was originally written in another system) and the internal timestamp.

            If rows are never updated or deleted, does it matter at all that the timestamps will be skewed?

            Wouldn't this be a reasonable extension? To create historical timestamps if needed?

  3. Pierre says:

    I gave the API a try, and I have two questions. First, I have a memory leak due to threads created when I use the forTable() function of the builder:

    System.setProperty("cassandra.config","file:///var/tmp/cassandra.properties");
    System.setProperty("cassandra.storagedir","/var/tmp/cassandra-tmp");
    File tmpDirBotLogs = new File("/var/tmp/sst-" + UUID.randomUUID());
    tmpDirBotLogs.mkdirs();
    CQLSSTableWriter.Builder builder = CQLSSTableWriter.builder();

    builder
    .inDirectory(tmpDirBotLogs)
    .forTable(SCHEMA_BOT_LOGS)
    ;

    They are created by static code in ClientState.java (DatabaseDescriptor.getAuthenticator().protectedResources(), I think). What are their roles, and how do I close them?

    Second question: after I have added my rows and closed the writer, I have my SSTable files ready to be sent to the cluster. You suggest using sstableloader, but I would like to know how to send them directly from Java code.

  4. Pierre says:

    I looked at BulkLoad.java, and it was easy to set up the few lines of code to upload my created SSTable from my application:


    ExtClient client = new ExtClient(
            new HashSet<>(Arrays.asList(InetAddress.getByName("127.0.0.1"))));
    SSTableLoader loader = new SSTableLoader(
            new File("/var/tmp/keyspace/sst-4b456c09-2d21-4112-be00-babfa6356e62/"),
            client,
            new OutputHandler.LogOutput());
    loader.stream().get();

    I tested on localhost, and it was EXTREMELY fast.

    However :

    Again I face the unstoppable-threads problem: after calling loader.stream() it creates a bunch of threads, but I don't have a handle on them, so they can't be stopped when they are no longer used. Moreover, if I need to call loader.stream() multiple times, it will create a new connection and multiple new threads each time, so we again have a memory leak (nearly the same one).

    Some suggestions: we need a cleaner API, something like a Client or a connection manager/pool we can reuse across the SSTable builder (CQLSSTableWriter.builder()) AND the SSTable loader (because both need a connection to the cluster), with a shutdown method on it so we can stop the created threads. This could also facilitate the implementation of multithreaded/parallelized uploads of SSTables to the cluster.

    That's really promising. I really prefer this way of inserting data; it's so fast. I used to use LOAD DATA INFILE with MySQL instead of INSERT statements when I had a lot of data. Here we have the same performance gain with Cassandra.

  5. Pierre says:

    The threads error was due to the configuration not being in client mode by default. I solved the problem with this:

    static {
        org.apache.cassandra.config.Config.setClientMode(true);
    }

    Now I no longer need to provide a conf file, and the server threads (metrics and co.) aren't started anymore.

    So forget everything about thread pools/memory leaks and whatnot; when it's configured in client mode, it works like a charm.

  6. Rajnish says:

    Hi, thanks for sharing this tutorial; it is very useful and worked for me. I have a doubt about what happens after sstableloader returns the “Summary statistics”.

    I have checked in OpsCenter that it is still doing compaction, and it is taking very long to do so. Can we do something to improve the performance of this?

  7. Emmanuel says:

    Doesn't work on DSE 4.6: writer.addRow throws a NullPointerException. Changing the dependency to cassandra-2.0.10 makes it work.

    1. Yuki Morishita says:

      We had this problem in open source Cassandra, but it has been fixed and will be released in the next version, 2.0.14 (https://issues.apache.org/jira/browse/CASSANDRA-8808).

      1. Emmanuel says:

        Will wait for it, thanks. The stack trace here was, however, a bit different from the one in the 8808 JIRA.


        Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:408)
        at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:393)
        at org.apache.cassandra.db.Keyspace.initCf(Keyspace.java:312)
        at org.apache.cassandra.db.Keyspace.<init>(Keyspace.java:269)
        at org.apache.cassandra.db.Keyspace.open(Keyspace.java:111)
        at org.apache.cassandra.db.Keyspace.open(Keyspace.java:89)
        at org.apache.cassandra.cql3.statements.UpdateStatement.addUpdateForKey(UpdateStatement.java:109)
        at org.apache.cassandra.io.sstable.CQLSSTableWriter.rawAddRow(CQLSSTableWriter.java:218)
        at org.apache.cassandra.io.sstable.CQLSSTableWriter.addRow(CQLSSTableWriter.java:138)
        at org.apache.cassandra.io.sstable.CQLSSTableWriter.addRow(CQLSSTableWriter.java:113)
        at BulkLoad.main(BulkLoad.java:128)
        Caused by: java.lang.NullPointerException
        at org.apache.cassandra.db.Directories.<init>(Directories.java:77)
        ... 11 more

        1. Yuki Morishita says:

          The cause is the same: accessing Keyspace requires cassandra.yaml to be loaded, which is not what we expect when using CQLSSTableWriter, hence the NPE. CASSANDRA-8808 fixes it.

  8. Bing W says:

    Just curious – have you thought about integrating bulk loading into CQL like most other conventional RDBMSs? Or even like Oracle's SQL*Loader, which is a separate command-line tool. Or is that already being planned/worked on? Thanks, Bing

  9. Prasad K says:

    The GitHub code uses org.apache.cassandra.* – is there a Javadoc for the Cassandra packages, so that I can browse the available classes and their functions? Also curious whether the custom-developed libraries integrate with the DataStax drivers (I am looking at calling functions like NOW(), DATEOF(), etc. and passing user-defined types from a Java driver to Cassandra).

  10. Matt says:

    I converted a CSV file to an sstable and when I try the sstableloader I am getting this error.

    java.lang.NumberFormatException: For input string: “TOX.txt”

    I have run the conversion a couple of times, always clearing my data folder, and it seems to complain about different file names each time.

    Any ideas what is wrong?

    1. Matt says:

      Here is my full stack trace:

      [datastax@localhost ~]$ sstableloader -d localhost datawh/line_items
      Established connection to initial hosts
      Opening sstables and calculating sections to stream
      For input string: “TOC.txt”
      java.lang.NumberFormatException: For input string: “TOC.txt”
      at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
      at java.lang.Integer.parseInt(Integer.java:492)
      at java.lang.Integer.parseInt(Integer.java:527)
      at org.apache.cassandra.io.sstable.Descriptor.fromFilename(Descriptor.java:276)
      at org.apache.cassandra.io.sstable.Descriptor.fromFilename(Descriptor.java:235)
      at org.apache.cassandra.io.sstable.Component.fromFilename(Component.java:120)
      at org.apache.cassandra.io.sstable.SSTable.tryComponentFromFilename(SSTable.java:160)
      at org.apache.cassandra.io.sstable.SSTableLoader$1.accept(SSTableLoader.java:84)
      at java.io.File.list(File.java:1155)
      at org.apache.cassandra.io.sstable.SSTableLoader.openSSTables(SSTableLoader.java:78)
      at org.apache.cassandra.io.sstable.SSTableLoader.stream(SSTableLoader.java:162)
      at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:106)
      [datastax@localhost ~]$

      1. Matt says:

        Might be an issue with Cassandra 2.2.0. The files it was creating did not have the keyspace and table name as part of the SSTable file names.

        The files looked like this: la-1-big-Data.db
        They should look like this: datawh-line_items-la-1-Data.db

        I ran this on all of the files that were created to rename them appropriately:

        String fileName = file.getName();
        int second = fileName.indexOf("-", 3);
        int third = fileName.indexOf("-", second + 1);
        String newFileName = KEYSPACE + "-" + tableName + "-"
                + fileName.substring(0, second) + fileName.substring(third);
        File newFile = new File(DEFAULT_OUTPUT_COPY_DIR + File.separator + KEYSPACE
                + File.separator + tableName + File.separator + newFileName);

        try {
            Files.copy(file, newFile);
        } catch (IOException e) {
            e.printStackTrace();
        }

        This could also just be a compatibility issue between Cassandra 2.2.0, which I am running locally, and 2.1.8, which I am running on my server.

        1. Yuki Morishita says:

          The file name format changed in 2.2, and it no longer contains the keyspace name and table name. Notice that “la” indicates the SSTable version.

          You cannot stream a 2.2-generated table to 2.1.8.

  11. saihareesh says:

    I have a live production Cassandra cluster.
    Can I use it for concurrent reads and writes while streaming of SSTables is happening on the same cluster?
    Will there be any problem or performance issue in doing so? Please suggest.

  12. Paul Weiss says:

    Is it possible to call sstableloader from Java instead of using the command-line program? I have a process that uses CQLSSTableWriter and generates the SSTable files, but I am looking for an end-to-end process that bulk loads without any manual intervention.

    Ideally would like to avoid forking another process so I can properly check for errors.

    Thanks

  13. Sam T says:

    I had a couple of questions regarding the write directory. The code is writing to ./data – so of course this is a relative path.
    1. How do we find the actual /data directory of a Cassandra host (I couldn’t figure this out).

    2. Can I write to ‘any directory’ and just load to the target Cassandra host/cluster using the SSTableLoader?

    Thanks
    Sam

    1. Sam T says:

      I figured this out: you can write anywhere on the system and simply load to the running Cassandra host.

  14. Sam T says:

    Trying to run the code: it works fine on my local Cassandra (2.2.8).
    But when I run it on DataStax Enterprise 4.8.11 (likely on Cassandra 2.1), I get this error:
    java.lang.IllegalArgumentException: Unknown key space
    at org.apache.cassandra.io.sstable.CQLSSTableWriter$Builder.getStatement

    It is able to create the /data/keyspace/tablename directories. Based on the documentation, this exception occurs if the directory doesn't exist or is not writable; that does not seem to be the case here.
    Also, again, the same code works fine on local Cassandra.

  15. Aram says:

    Noob question: how and where do I download the correct jar file that I need in order to be able to “import org.apache.cassandra.io.sstable.CQLSSTableWriter;”?

  16. Akshay Moharil says:

    Can we use sstableloader to load snapshots in a partially upgraded cluster? Two of my nodes are on DSE 4.8 and one node is on DSE 4.5.

  17. bruceliang says:

    I use 4 nodes to bulk load SSTables into Cassandra, but on one of the nodes the CPU sometimes sits at almost 100%, held by Cassandra, and the load speed becomes very slow; after a long time the speed recovers to normal. Why?
    I use Cassandra 2.1.14.
