Simple data importing and exporting with Cassandra

By Paul Cannon, July 28, 2012


There are a number of ways to ingest preexisting data into a Cassandra cluster. The venerable and low-level BinaryMemtable interface was used in the early days, but it was quite difficult to use, and it’s not even an option anymore. The json2sstable and sstableloader tools are very efficient, but setting up the procedure involves careful configuration and network considerations. The BulkOutputFormat is helpful for streaming output data from Hadoop into Cassandra (it uses sstableloader underneath).

While these tools are powerful and can deal with enormous amounts of data, sometimes you want to minimize your own time, not the time needed for the job to complete. Maybe you don’t have that much data (say, a few million rows or less), so even a slow import will be done quickly enough, or maybe you just don’t want to bother learning what exactly an “sstable” is. For this niche, the COPY FROM cqlsh command has recently been introduced, and will be in Cassandra 1.1.3. There is also a corresponding COPY TO command, for exporting data from Cassandra.

The COPY FROM and COPY TO commands are somewhat based on the equivalent COPY commands in PostgreSQL. Like those commands, they work on CSV data, and include a few options to specify exactly what flavor of CSV is in use (CSV is an extremely common format, but there’s no single standard).

How to use

The syntax looks like this:

COPY <tablename> [ ( column [, ...] ) ]
     FROM ( '<filename>' | STDIN )
     [ WITH <option>='value' [AND ...] ];

COPY <tablename> [ ( column [, ...] ) ]
     TO ( '<filename>' | STDOUT )
     [ WITH <option>='value' [AND ...] ];

When importing data (COPY FROM), the default expectation is that every row in the CSV input contains the same number of fields as there are columns in the Cassandra table metadata, and the fields are assigned to the columns in the order defined. If you want to apply your input data to a particular set of columns, you can specify the column names in parentheses after the table name. The CSV input is specified either by a file path or by the keyword STDIN. When giving input on stdin, you can signal the end of the CSV data with a backslash and dot ("\.") on a line by itself. COPY FROM does not truncate the table before importing the new data; it adds to the preexisting data.
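For instance, a minimal import from a file might look like the following sketch (the parts table, its column names, and the file path are hypothetical, chosen only for illustration):

cqlsh> COPY parts (id, location, description) FROM '/home/me/parts.csv';

Any columns of the table not listed in the parentheses are simply left unset for the imported rows.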

When exporting data (COPY TO), the default is to output all columns from the Cassandra table metadata, in the order defined. If you only want to export a particular set of columns, you can specify the column names in parentheses after the table name. The destination for the output CSV is specified either by a file path or by the keyword STDOUT.
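Conversely, exporting just a subset of columns to a file might look like this sketch (again, the table name and path are hypothetical):

cqlsh> COPY parts (id, description) TO '/home/me/parts_subset.csv';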

CSV style

The default flavor of CSV understood by the COPY command separates fields with commas (,), separates records with line separators (newlines), quotes field values in double-quotes ("") when necessary to avoid ambiguity, escapes literal double-quotes with a backslash inside a double-quoted string ("\""), and does not expect a “header” record on the first line naming the columns.

The options available for the CSV syntax are:

  • DELIMITER (default: ,). The character used to separate fields
  • QUOTE (default: "). Quoting character used to quote field values when necessary
  • ESCAPE (default: \). Character used to escape literal uses of the QUOTE character
  • HEADER (default: false). Whether the first line of the CSV is a header naming the columns. When true, COPY TO outputs a valid header line, and COPY FROM skips the first line; COPY FROM does not otherwise use the header for anything
  • ENCODING (default: utf8). Used only by COPY TO, when outputting unicode strings
  • NULL (default: empty string). Used only by COPY TO to represent a value that is not present. COPY FROM does not currently have any concept of null values, but that may change after CASSANDRA-3783 is resolved.
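Putting a few of these options together, a sketch of importing a pipe-delimited file that carries a header line might look like the following (the table name and file path are hypothetical; the option spellings follow the WITH syntax shown above):

cqlsh> COPY parts FROM '/home/me/parts.psv'
   ... WITH DELIMITER = '|' AND HEADER = 'true';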

A few examples ought to help make these various use cases clearer.

Example 1

Entering data directly during an interactive cqlsh session, after issuing a COPY ... FROM STDIN command, using the default CSV flavor. The [bracketed] bits are prompts output by cqlsh when in an interactive session, not part of the data:

[Use \. on a line by itself to end input]
[copy] "F3-14-2,14-5",engine,compression coil
[copy] F3-19-B,engine,trace compression block
[copy] F3-01-1,engine,radion accelerator core
[copy] "F3-0R-A,0F-G",engine,aft alternator
[copy] ASA-I-1192-1,infirmary,extractor
[copy] ASA-I-9288-T,infirmary,adrenaline syringe
[copy] CT-A-2,armory,"Callahan full-bore auto-lock \"Vera\""
[copy] \.
7 rows imported in 0.072 seconds.

Example 2

Outputting data with a header row, using the | character to separate fields, single-quote characters to quote values, and doubled-single-quotes to represent literal single-quotes in the output (so, just like CQL string literals):

cqlsh> COPY log.chatlogs (ts, content, other) TO './chatlog.dat'
   ... WITH DELIMITER = '|' AND QUOTE = '''' AND ESCAPE = '''' AND NULL = '<null>';
3 rows exported in 0.005 seconds.
cqlsh> exit;
~/datadir$ cat chatlog.dat
17271650717.223|'Thank you for the wine. It''s very... fresh.'|<null>
17264147043.011|'We''ve got some local color happening. Your grand entrance would not go amiss right now.'|<null>
17268884092.729|'It''s a real burn, being right so often.'|<null>

Edit 2012-07-30 16:42 +0000 – noted BinaryMemtable is no longer available


  1. Vitaly Chirkov says:

    Currently, COPY TO uses SELECT with LIMIT 99999999. You can easily get “Request did not complete within rpc_timeout.” against a large data set.

  2. Ng says:

    There is no way to extract data out of Cassandra other than paying for an ETL tool. I tried different ways, like the COPY command or a CQL query, and all of them time out regardless of changing the timeout parameter in cassandra.yaml. Most people say you cannot query the data in Cassandra without a ‘where’ clause. This is a big restriction to me.

    1. Tyler Hobbs says:

      This is mostly a current limitation of cqlsh. In the near future we’ll add query paging support to cqlsh, which will allow it to handle large queries and large COPY commands.

      You can always use a standard driver to query the data, or use a Hadoop job, if needed.

    2. Rajesh says:

      Why don’t you pipe the output to an output stream in JS and keep picking up the data from the stream?

  3. Kasper says:

    Be warned though: COPY will cause you to lose any precision beyond seconds if you attempt to export timestamp columns…

    That’s really sucky when you, like in my case, need to export a few thousand rows that use timestamps as part of their PK.

  4. I have built a simple tool using the DataStax Java library for exporting a Cassandra table with millions of rows to CSV.

  5. Bas de Bruijn says:

    The second example states that a header row will be used, but it is not actually used in the example.

    This blog post has been helpful, thanks.

  6. Abhinandan says:

    How do I create a dump file from a Cassandra database in Excel sheet format?

  7. Akhil says:

    Thanks Firman Gautama, it is working well !!!

  8. Parveen Verma says:

    But it is not allowing me to import data without creating the same columns as in the CSV file. Which sounds very weird.

  9. Eugene says:

    COPY <tablename> [ ( column [, ...] ) ]
    doesn’t work in a Windows environment. It gives the following error:
    Error starting import process:

    Can't pickle <type 'thread.lock'>: it's not found as thread.lock
    can only join a started process
    cqlsh:core2> Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "C:\Program Files\DataStax Community\python\lib\multiprocessing\", line 373, in main
    File "C:\Program Files\DataStax Community\python\lib\multiprocessing\", line 482, in prepare
    file, path_name, etc = imp.find_module(main_name, dirs)
    ImportError: No module named cqlsh

  10. Ashish says:

    Can you tell me the way to export CQL data to Excel? I am using PuTTY for CQL login.

    1. Aaron Ploetz says:

      COPY will not export to an Excel format. However, you can export it as a CSV, which Excel is capable of reading.

  11. Aditya Prasad Konda says:

    I have defined 5 columns in Cassandra, say (emp_name, emp_dept, emp_location, emp_bonus, emp_hike).

    The 4th and 5th columns are float type.

    When I’m inserting a record which has only three values, say (‘paul’, ‘sales’, ‘Detroit’, , ),

    I’m getting this exception: Failed to import 3638 rows: ParseError – could not convert string to float: , given up without retries

    How do I insert empty values into a float column?

  12. s says:

    Can we copy data into a single row?

  13. wanttolearn1 says:

    How can I script all table structures (with cqlsh)?
    I want to make another copy of the structure somewhere else.

  14. Pawel Hadam says:

    I have INT and TIMESTAMP columns and I cannot COPY empty (null) values into those columns from a CSV file.

    My table is:

    CREATE TABLE IF NOT EXISTS tytandb.test1

    ID INT,

    My CSV file is:


    My command is:


    And I am receiving this error:

    Failed to import 1 rows: ParseError – invalid literal for int() with base 10: ”, given up without retries

    With VARCHAR it’s OK and I can copy NULL values into VARCHAR columns.

    Any ideas why it does not work with INT/TIMESTAMP?

  15. mahi says:

    thanks for the post,

    Can we import/copy multiple files into a Cassandra table when the files and the table share the same column names?

  16. Kames says:

    Hi, can you tell me how I can load/import a JSON file into Cassandra?

  17. Ed says:

    Limit the number of rows with Linux pipes, i.e. to get just 1000 lines of your table, do:

    cqlsh `hostname` -u user -p user-password -e "copy keyspace.table to STDOUT with header=true AND delimiter='|';" | head -1003 > /tmp/tt2 ; cat -n /tmp/tt2

  18. Prashanth Karumbaiah says:

    We have migrated a mainframe VSAM dataset to Cassandra. It has 7 million rows. I now need to build CQL or something to extract data and process it, using a performance-oriented approach. My filter condition is as below:

    SORT FIELDS=(4,12,CH,A)
    and so on… I have about 100 conditions like the one above.

    What would be the best approach to do this? Do you have any solution/approach/recommendation?


