DataStax Developer Blog

Ways to Move Data To/From DataStax Enterprise and Cassandra

By Robin Schumacher -  November 26, 2012 | 1 Comment

One of the most asked questions we get at DataStax is, “How can I move data from other sources to DataStax Enterprise and Cassandra, and vice versa?” I thought I’d quickly outline the top three options that you have available.

COPY command

Cassandra 1.1 and higher supplies the COPY command, which mirrors the one the PostgreSQL RDBMS uses for file import/export. The utility runs in Cassandra's CQL shell and allows flat file data to be loaded into Cassandra, as well as table data to be written out to OS files (nearly all RDBMSs have similar unload utilities). A variety of file formats and delimiters are supported, including comma-separated values (CSV), tabs, and more, with CSV being the default.

The syntax for the COPY command is the following:

COPY <column family name> [ ( column [, ...] ) ]   FROM ( '<filename>' | STDIN )   [ WITH <option>='value' [AND ...] ];

COPY <column family name> [ ( column [, ...] ) ]   TO ( '<filename>' | STDOUT )   [ WITH <option>='value' [AND ...] ];

Below are simple examples of the COPY command in action:

cqlsh> SELECT * FROM airplanes;   
name          | mach | manufacturer | year 
--------------+------+--------------+------  
P38-Lightning |  0.7 |     Lockheed | 1937

cqlsh> COPY airplanes (name, mach, year, manufacturer) TO 'temp.csv';
1 rows exported in 0.004 seconds. 
cqlsh> TRUNCATE airplanes;  

cqlsh> COPY airplanes (name, manufacturer, year, mach) FROM 'temp.csv';
1 rows imported in 0.087 seconds.
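COPY also accepts WITH options for non-default formats. As a sketch (the DELIMITER and HEADER options are the ones documented for cqlsh; verify the exact option names against your version's documentation), a tab-delimited export that includes a header row might look like this:

cqlsh> COPY airplanes (name, mach, year, manufacturer) TO 'temp.tsv'
       WITH DELIMITER='\t' AND HEADER=true;

The same options can be supplied on the FROM side so that cqlsh parses the incoming file with the matching delimiter and skips the header row.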

See our online documentation for more information about the COPY command.

Sqoop

DataStax Enterprise Edition 2.0 and higher includes support for Sqoop, a tool designed to transfer data between an RDBMS and Hadoop. We modified Sqoop so you can not only transfer data from an RDBMS to a Hadoop node in a DataStax Enterprise cluster, but also move data directly into Cassandra.

I wrote a short tutorial you can reference on how to use Sqoop to move data from MySQL into Cassandra; the same steps can be adapted for any other RDBMS. There's also a Sqoop demo in the /demos directory of a DataStax Enterprise download/installation.
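As a rough sketch of what such an import looks like (the connection string, database, table, keyspace, and key names below are hypothetical, and the Cassandra-specific flags follow the DataStax Enterprise Sqoop demo, so check the documentation for your release):

$ dse sqoop import --connect jdbc:mysql://127.0.0.1/mydb \
      --username myuser --table mytable \
      --cassandra-keyspace mykeyspace \
      --cassandra-column-family mytable_cf \
      --cassandra-row-key id \
      --cassandra-thrift-host 127.0.0.1 \
      --cassandra-create-schema

The standard Sqoop arguments (--connect, --username, --table) select the source rows, while the --cassandra-* flags added in DataStax Enterprise direct the output into a Cassandra column family instead of HDFS.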

ETL Tools

If you need more sophistication applied to a data movement situation (i.e., more than just extract and load), you can use any number of extract-transform-load (ETL) solutions that now support Cassandra. These tools provide transformation routines that let you manipulate source data in virtually any way you need and then load it into a Cassandra target. They also supply many other features, such as visual, point-and-click interfaces, scheduling engines, and more.

Happily, many ETL vendors who support Cassandra supply community editions of their products that are free and able to solve many different use cases. Enterprise editions are also available that supply many other compelling features that serious enterprise data users need.

You can freely download and try ETL tools from Jaspersoft, Pentaho, and Talend that all work with DataStax Enterprise and community Cassandra.

Conclusion

The good news is that you have multiple methods for moving data from almost any database source system into DataStax Enterprise and Cassandra. If there's another ETL tool or similar product you're using that doesn't support DataStax Enterprise and/or Cassandra, please contact us and we'll see what we can do to change that.

If you haven’t done so already, download a copy of DataStax Enterprise Edition, which contains a production-ready version of Cassandra, along with Hadoop for analytics and Solr for enterprise search. It’s completely free to use for as long as you like in your development environments.

Comments

  1. Amit says:

    Hello Robin,

    Thanks for this information. We are looking at Cassandra to be used as a cache that loads up data from our own storage or external data sources. The external data sources complicate the matter, as we may not know which keys would be relevant for our clients on a particular day. So we need to load the data into Cassandra by contacting the external data source when the key is not present or a time interval has passed since the previous load.
    Is there any mechanism to do this?
