The purpose of this article is to describe how to improve performance when importing csv data into Cassandra via cqlsh COPY FROM. If you are not familiar with cqlsh COPY, you can read the documentation here. There is also a previous blog article describing recent changes in cqlsh COPY.
The practical steps described here follow from this companion blog entry, which describes the reasoning behind these techniques.
By default cqlsh ships with an embedded Python driver that is not built with C extensions. In order to increase COPY performance, the Python driver should be installed with the Cython and libev C extensions:
- after installing the dependencies described here,
- install the driver by following the instructions that are available here.
Once the driver is installed, cqlsh must be told to ignore the embedded driver and to use the installed driver instead. This can be done by setting the CQLSH_NO_BUNDLED environment variable to any value, for example on Linux:
Using the benchmark described here, an improvement of approximately 60% was achieved by using a driver compiled with c extensions.
You can also compile with Cython the clqsh copy module. Whilst the driver has been optimized for Cython and using a driver with Cython extensions makes a significant difference, compiling the clqsh copy module (a standard Python module) may give approximately a 5% performance boost but not much further. Should you wish to compile the copy module with Cython, first of all you need to install Cython if not already available:
pip install cython
Then in the cassandra installation folder go to pylib and compile the module with the following command:
python setup.py build_ext --inplace
This command will create pylib/cqlshlib/copyutil.c and copyutil.so. Should you need to revert back to pure Python interpreted code, simply delete these two files.
On Linux, it is advantageous to increase CPU time slices by setting the CPU scheduling type to SCHED_BATCH. To do so, launch cqlsh with chrt as follows:
chrt -b 0 cassandra_install_dir/bin/cqlsh
Changing CPU scheduling may boost performance by an additional 5%.
Parameters that affect performance
There are six COPY FROM parameters that you can experiment with, to optimize performance for a specific workload.
This is the number of worker processes, and its default value is the number of cores on the machine minus one, capped at 16. You can observe the CPU idle time via dstat, dstat -lvrn 10, and if CPU idle time is present using the default number of worker processes, you can increase it.
This is the number of rows sent from the feeder process (which reads data from files) to worker processes; depending on the data-set average row size, it may be advantageous to increase the value for this parameter.
For each chunk, a worker process will batch by ring position as long as there are at least min batch size entries for each position, otherwise it will batch by replica. The advantage of batching by ring position is that any replica for that position can process the batch. Depending on the size of the chunk, the number of nodes in the cluster and the number of VNODES for each node, this value may need adjusting:
- the larger the chunk size, the easier it is to find batching opportunities;
- the larger the number of tokens (VNODES x NUM_NODES), the harder it is to find batching opportunities.
If batching by replica, worker processes will split batches when they reach this size. Increasing this value is useful if you notice timeouts, that is if the cluster has almost reached capacity. From observation, it is useful to increase batch size in this case. However, a batch size that is too large may cause warnings and eventual rejection. Two parameters control this behavior in cassandra.yaml:
This is the speed in rows per second at which the feeder process sends data to the worker processes. Normally, there is no need to change this value, unless a rate greater than 100k rows per second may be achievable.
This parameter disables prepared statements. With prepared statements enabled, worker processes will not only parse csv data, they will also convert csv fields into types that correspond to the CQL column type, that is the string “3”, for example, will be converted to an integer with value 3 and so forth. Python is not efficient at this type of processing, especially with complex data types such as datetime values or composite types (collections or user types). Disabling prepared statements results in the Cassandra process performing the type parsing instead. However, it also forces the Cassandra process to compile each CQL batch statement, resulting in considerable pressure on the Cassandra process. It is recommended that prepared statements should remain enabled in most cases, unless cluster overload is not a concern.
For further background information on why these parameters affect performance, please refer to the companion blog entry.