Cassandra Data Loading: 8 Tips for Loading Data into Astra DB
The DataStax Bulk Loader tool (dsbulk) is a command line tool for loading and unloading data from Cassandra and Astra DB. In this blog, we’ll expand on the documentation we provide for dsbulk with eight tips from the DataStax engineering team to help you optimize the data loading process.
If you haven’t installed dsbulk yet, you can set up the tool using the following commands:
curl -OL https://downloads.datastax.com/dsbulk/dsbulk-1.8.0.tar.gz
Then, unpack the downloaded distribution:
tar -xzvf dsbulk-1.8.0.tar.gz
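As a minimal sketch of the next step, you can put the launcher on your PATH and confirm that the shell finds it. This assumes the archive unpacks into a dsbulk-1.8.0 directory in the current folder with the launcher under bin/; adjust the path if your layout differs.

```shell
# Assumption: the archive unpacked into ./dsbulk-1.8.0 with the launcher in bin/.
export PATH="$PWD/dsbulk-1.8.0/bin:$PATH"

# Confirm the launcher is found before starting a load.
if command -v dsbulk >/dev/null 2>&1; then
  dsbulk --version
else
  echo "dsbulk not found on PATH; check the unpack location" >&2
fi
```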
To learn more about dsbulk setup, take a look at our documentation.
Tip #1: Run the DSBulk Loader on a virtual machine
While running your migration, we recommend using a virtual machine (VM) in the same region as your database to decrease latency and increase throughput (number of rows you can load per second).
DSBulk can be easily installed on a VM using the installation commands above. We strongly recommend using a virtual machine instead of running DSBulk directly on your laptop.
Tip #2: Load data directly from AWS S3 or Google Cloud Storage
For data that doesn’t fit on a single machine’s hard drive, or even just to leverage the convenience of cloud object storage, dsbulk can load large amounts of data directly from AWS S3 or Cloud Storage on Google Cloud Platform (GCP).
Load a single CSV file hosted on GCP by passing dsbulk a file URL:
dsbulk load -url https://storage.googleapis.com/bucket/filename.csv -k ks -t table -b ~/scb.zip -u client_id -p client_secret
Load multiple CSVs hosted on GCP by passing dsbulk a file that lists their URLs:
dsbulk load --connector.csv.urlfile https://storage.googleapis.com/bucket/files.csv -k ks -t table -b ~/scb.zip -u client_id -p client_secret
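For illustration, here is a rough sketch of what such a list can look like: a plain UTF-8 text file with one URL per line. The bucket and object names are placeholders, and writing the list to a local file called urls.txt is an assumption made for this example rather than a requirement.

```shell
# Write a list of CSV locations, one URL per line.
# The bucket and file names are placeholders, not real objects.
cat > urls.txt <<'EOF'
https://storage.googleapis.com/bucket/part-0001.csv
https://storage.googleapis.com/bucket/part-0002.csv
https://storage.googleapis.com/bucket/part-0003.csv
EOF

# Count the entries as a quick sanity check before loading.
url_count=$(grep -c '^https://' urls.txt)
echo "urls.txt lists $url_count files"

# Then point dsbulk at the list, using the same flags as the command above:
# dsbulk load --connector.csv.urlfile urls.txt -k ks -t table -b ~/scb.zip -u client_id -p client_secret
```

Checking the list before a long-running load is cheap, and it catches an empty or truncated file early.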
Tip #3: The DSBulk Loader works well with Astra DB
To connect to Astra DB you need a Secure Connect Bundle (SCB) and an application token. You can download the SCB and obtain your application token from the DataStax Astra DB web console.
dsbulk connects to Astra DB when you pass your SCB to the -b flag, your client ID to the -u flag, and your client secret to the -p flag.
Tip #4: Dealing with rate limits
Tip #5: DSBulk tool pooling options
Tip #6: Tuning DSBulk
Performance tuning is about understanding the bottlenecks in a system and removing them to improve performance. What is performance? In the case of bulk loading we optimize for throughput (as opposed to latency) because the goal is to get as much data into the system as fast as possible. This is different from a traditional Cassandra operational environment where we might optimize for query latencies.
For a deeper dive into the relationship between latency and throughput (under concurrency) take a moment to review Little’s Law.
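Little’s Law says that the average number of requests in flight equals throughput multiplied by average latency. Here is a rough worked example. The target rate and latency are made-up numbers for illustration, and the calculation assumes one row per request.

```shell
# Little's Law: in_flight = throughput * latency
target_rows_per_sec=10000   # illustrative target, not a measured value
avg_latency_ms=20           # illustrative average write latency

# Integer arithmetic: rows per second * milliseconds / 1000 ms per second
in_flight=$(( target_rows_per_sec * avg_latency_ms / 1000 ))
echo "Sustaining $target_rows_per_sec rows/s at $avg_latency_ms ms needs about $in_flight concurrent requests"
```

Read the other way around: if concurrency is capped and latency doubles, throughput halves. That is why rising latency without rising throughput signals a saturated database.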
In practice, as we push data faster with DSBulk (the client), we may see latencies increase on Astra DB (the server). If we don’t, that’s a sign that the database still has plenty of capacity and that we can keep increasing the rate in DSBulk. If, on the other hand, latencies are increasing without a corresponding increase in throughput, you may have to wait for your database to autoscale or open a support request to get better performance.
DSBulk throughput can be controlled with a few different flags:
All three of these flags control the same thing (target client throughput). They just do so by three different means. So remember to pick only ONE. The documentation recommends tuning maxConcurrentQueries because it is technically the most efficient. However, we find that maxPerSecond is easier for users to understand, so we recommend it for almost all scenarios.
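One way to apply this is to start with a conservative cap and raise it step by step while latencies stay flat. The sketch below only prints the commands it would run. The helper name dsbulk_cmd_at_rate and the three rates are hypothetical, chosen for illustration, and the connection flags are the placeholders used earlier in this post.

```shell
# Hypothetical helper: print a load command capped at a given rate.
# Only maxPerSecond is set; the other throughput flags keep their defaults.
dsbulk_cmd_at_rate() {
  echo "dsbulk load -url https://storage.googleapis.com/bucket/filename.csv -k ks -t table -b ~/scb.zip -u client_id -p client_secret --dsbulk.executor.maxPerSecond $1"
}

# Start conservatively, then raise the cap while latencies stay flat.
for rate in 2500 5000 10000; do
  dsbulk_cmd_at_rate "$rate"
done
```

Each step is a separate run, so in practice you would try a rate on a sample of the data, check the latencies, and only then move to the next step.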
To keep a closer eye on the client-side latencies, use the -report-rate flag. You can also watch the database-side latencies in your Astra DB Health tab.
Tip #7: Handling Errors
If your bulk load is pushing the system to its limits, you may want to configure error and retry thresholds so that your job doesn’t simply stop when it hits too many errors. Note that DSBulk records any failed inserts in its logs directory, so you can re-process the missed rows in a subsequent run.
Set the maximum number of errors DSBulk tolerates before stopping the process with --dsbulk.log.maxErrors, and the maximum number of times a failed request is retried before the row is counted as an error with --driver.advanced.retry-policy.max-retries.
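To find what failed after a run, it helps to locate the most recent operation directory. The sketch below assumes the default layout, where each load gets its own timestamped subdirectory named LOAD_... under ./logs. Both that layout and the helper name latest_load_dir are assumptions for illustration, so check them against your own logs directory.

```shell
# Hypothetical helper: print the most recent load operation directory
# under a DSBulk log directory (assumed layout: <logdir>/LOAD_<timestamp>).
latest_load_dir() {
  latest=""
  # The glob expands in sorted order, so the last match has the newest timestamp.
  for d in "${1:-logs}"/LOAD_*; do
    if [ -d "$d" ]; then
      latest=$d
    fi
  done
  echo "$latest"
}

op_dir=$(latest_load_dir logs)
if [ -n "$op_dir" ]; then
  echo "Most recent load operation: $op_dir"
  ls -1 "$op_dir"   # inspect the error logs and rejected records here
else
  echo "No load operation directories found under logs/"
fi
```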
Tip #8: Onboarding engineers
Need additional help with your data load? No problem. We’ve got a team of engineers working round the clock, five days a week. Click the chat icon on the bottom right corner of the Astra portal to start a chat and get immediate help from an engineer. All you’ve got to do is let them know the amount of data and the deadline to upload it.
The Final Command
Here’s what your command might look like with all the options set:
dsbulk load -url https://storage.googleapis.com/bucket/filename.csv -k ks -t table -b ~/scb.zip -u client_id -p client_secret --driver.advanced.connection.pool.local.size 16 --dsbulk.executor.maxPerSecond 10000 --dsbulk.log.maxErrors 100 --driver.advanced.retry-policy.max-retries 3 --report-rate 10
Loading very large datasets onto Astra DB can be a breeze if you follow the best practices in this article. We hope you find these helpful.
If you prefer to learn about DSBulk via video, check out this quick overview from Steven Smith.
Need additional help loading your data into Cassandra or Astra? Reach out to us at email@example.com.