DataStax Developer Blog

Simple data importing and exporting with Cassandra

By Paul Cannon -  July 28, 2012 | 4 Comments


There are a number of ways to ingest preexisting data into a Cassandra cluster. The venerable and low-level BinaryMemtable interface was used in the early days, but it was quite difficult to use, and it’s not even an option anymore. The json2sstable and sstableloader tools are very efficient, but setting up the procedure involves careful configuration and network considerations. The BulkOutputFormat is helpful for streaming output data from Hadoop into Cassandra (it uses sstableloader underneath).

While these tools are powerful and can deal with enormous amounts of data, sometimes you want to minimize your own time, not the time needed for the job to complete. Maybe you don’t have that much data (say, a few million rows or less), so even a slow import will be done quickly enough, or maybe you just don’t want to bother learning what exactly an “sstable” is. For this niche, the COPY FROM cqlsh command has recently been introduced, and will be in Cassandra 1.1.3. There is also a corresponding COPY TO command, for exporting data from Cassandra.

The COPY FROM and COPY TO commands are somewhat based on the equivalent COPY commands in PostgreSQL. Like those commands, they work on CSV data, and include a few options to specify exactly what flavor of CSV is in use (CSV is an extremely common format, but there’s no single standard).

How to use

The syntax looks like this:

COPY <tablename> [ ( column [, ...] ) ]
     FROM ( '<filename>' | STDIN )
     [ WITH <option>='value' [AND ...] ];

COPY <tablename> [ ( column [, ...] ) ]
     TO ( '<filename>' | STDOUT )
     [ WITH <option>='value' [AND ...] ];

When importing data (COPY FROM), the default case is that every row in the CSV input is expected to contain the same number of columns, which is the same as the number of columns in the Cassandra table metadata. The fields will be assigned in the respective order. If you want to apply your input data to a particular set of columns, you can specify the column names in parentheses after the table name. The CSV input is specified either by a file path or by the keyword “STDIN”. When giving input on stdin, you can signal the end of the CSV data with a backslash and dot (“\.”) on a line by themselves. COPY FROM does not truncate the table before importing the new data; it adds to the preexisting data.
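For example, the field order in the CSV must track the column list. One quick way to sanity-check a file before importing it is to parse it with Python's csv module (the table and column names here are hypothetical, just for illustration):

```python
import csv
import io

# Rows intended for a hypothetical "COPY log.parts (part_id, location) FROM ...":
# each record must supply values in the same order as the column list,
# part_id first and location second.
data = "F3-19-B,engine\nASA-I-1192-1,infirmary\n"

rows = list(csv.reader(io.StringIO(data)))

# Verify every record has exactly as many fields as the column list
# before handing the file to COPY FROM.
assert all(len(row) == 2 for row in rows)
```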

When exporting data (COPY TO), the default is to output all columns from the Cassandra table metadata, in the order defined. If you only want to export a particular set of columns, you can specify the column names in parentheses after the table name. The destination for the output CSV is specified either by a file path or by the keyword “STDOUT”.

CSV style

The default flavor of CSV understood by the COPY command separates fields by commas (,), separates records with line separators, quotes field values in double-quotes (") when necessary to avoid ambiguity, escapes literal double-quotes with a backslash inside a double-quoted string ("\""), and does not have a “header” record on the first line naming the columns.

The options available for the CSV syntax are:

  • DELIMITER (default: ,). The character used to separate fields
  • QUOTE (default: "). Quoting character used to quote field values when necessary
  • ESCAPE (default: \). Character used to escape literal uses of the QUOTE character
  • HEADER (default: false). Whether the first row of the CSV is a header naming the columns. When true, COPY TO outputs a valid header line, and COPY FROM skips the first line; COPY FROM does not actually use the header names for anything.
  • ENCODING (default: utf8). Used only by COPY TO, when outputting unicode strings
  • NULL (default: empty string). Used only by COPY TO to represent a value that is not present. COPY FROM does not currently have any concept of null values, but that may change after CASSANDRA-3783 is resolved.
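As a rough approximation, the default flavor described above corresponds to the following dialect settings in Python's csv module (this mapping is my own sketch of the format, not something the COPY implementation itself uses):

```python
import csv
import io

# Approximate COPY's default CSV flavor: comma delimiter, double-quote
# quoting, and backslash-escaped quotes (doublequote=False tells the csv
# module to escape quote characters rather than double them).
opts = dict(delimiter=",", quotechar='"', escapechar="\\", doublequote=False)

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n", **opts)
writer.writerow(["CT-A-2", "armory", 'Callahan full-bore auto-lock "Vera"'])

line = buf.getvalue()
# The quoted field comes out with backslash-escaped quotes, matching the
# flavor shown in Example 1 below; reading it back recovers the original.
parsed = next(csv.reader(io.StringIO(line), **opts))
```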

A few examples ought to help make these various use cases clearer.

Example 1

Entering data directly during an interactive cqlsh session. Default CSV flavor. The [bracketed] bits are prompts output by cqlsh when in an interactive session, not part of the data:

cqlsh> COPY log.parts FROM STDIN;
[Use \. on a line by itself to end input]
[copy] "F3-14-2,14-5",engine,compression coil
[copy] F3-19-B,engine,trace compression block
[copy] F3-01-1,engine,radion accelerator core
[copy] "F3-0R-A,0F-G",engine,aft alternator
[copy] ASA-I-1192-1,infirmary,extractor
[copy] ASA-I-9288-T,infirmary,adrenaline syringe
[copy] CT-A-2,armory,"Callahan full-bore auto-lock \"Vera\""
[copy] \.
7 rows imported in 0.072 seconds.
cqlsh>

Example 2

Outputting data with a header row, using the | character to separate fields, single-quote characters to quote values, and doubled-single-quotes to represent literal single-quotes in the output (so, just like CQL string literals):

cqlsh> COPY log.chatlogs (ts, content, other) TO './chatlog.dat'
   ... WITH DELIMITER = '|' AND QUOTE = '''' AND ESCAPE = '''' AND NULL = '<null>';
3 rows exported in 0.005 seconds.
cqlsh> exit;
~/datadir$ cat chatlog.dat
17271650717.223|'Thank you for the wine. It''s very... fresh.'|<null>
17264147043.011|'We''ve got some local color happening. Your grand entrance would not go amiss right now.'|<null>
17268884092.729|'It''s a real burn, being right so often.'|<null>
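A file exported with this flavor can be read back outside of cqlsh as well. Here is a sketch using Python's csv module; since QUOTE and ESCAPE are the same character in this example, the csv module models the escaping as doubled quote characters:

```python
import csv
import io

# One line of the exported file above: pipe-delimited, single-quoted,
# with doubled single-quotes standing for literal single-quotes.
data = "17271650717.223|'Thank you for the wine. It''s very... fresh.'|<null>\n"

reader = csv.reader(io.StringIO(data), delimiter="|", quotechar="'",
                    doublequote=True)
ts, content, other = next(reader)

# The NULL placeholder chosen in the COPY command has to be mapped back
# to a real null by hand.
other = None if other == "<null>" else other
```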

Edit 2012-07-30 16:42 +0000 – noted BinaryMemtable is no longer available



Comments

  1. Vitaly Chirkov says:

    Currently, COPY TO uses SELECT with LIMIT 99999999. You can easily get “Request did not complete within rpc_timeout.” against a large data set.

  2. Ng says:

    There is no way to extract data out of Cassandra other than paying for an ETL tool. I tried different ways, like the COPY command and CQL queries; all of them time out regardless of changing the timeout parameter in cassandra.yaml. Most people say you cannot query the data in Cassandra without a ‘where’ clause. This is a big restriction to me.

    1. Tyler Hobbs says:

      This is mostly a current limitation of cqlsh. In the near future we’ll add query paging support to cqlsh, which will allow it to handle large queries and large COPY commands.

      You can always use a standard driver to query the data, or use a hadoop job, if needed.

  3. Kasper says:

    Be warned though – COPY will cause you to lose any precision beyond seconds if you attempt to export data using COPY on timestamp columns…

    That’s really sucky when you, like in my case, need to export a few thousand rows that use timestamps as part of their PK.
