TechnologyApril 9, 2019

DataStax Bulk Loader Pt. 2 — More Loading

Brian Hess
Brian Hess
DataStax Bulk Loader Pt. 2 — More Loading
$ dsbulk load -url /tmp/dsbulkblog/iris_with_comment.csv -k dsbulkblog -t iris_with_id -comment "#"
$ cqlsh -e "TRUNCATE dsbulkblog.iris_with_id;"
$ dsbulk load -url /tmp/dsbulkblog/iris_with_nulls.csv -k dsbulkblog -t iris_with_id
id  | petal_length | petal_width | sepal_length | sepal_width | species
-----+--------------+-------------+--------------+-------------+---------  
(3 rows)
$ dsbulk load -url /tmp/dsbulkblog/iris_with_null_string.csv -k dsbulkblog -t iris_with_id
$ cqlsh -e "SELECT * FROM dsbulkblog.iris_with_id LIMIT 3;"
id  | petal_length | petal_width | sepal_length | sepal_width | species
-----+--------------+-------------+--------------+-------------+---------  
$ cqlsh -e "TRUNCATE dsbulkblog.iris_with_id;"
$ dsbulk load -url /tmp/dsbulkblog/iris_with_null_string.csv -k dsbulkblog -t iris_with_id -nullStrings "NULL"
$ cqlsh -e "SELECT * FROM dsbulkblog.iris_with_id LIMIT 3;"
id  | petal_length | petal_width | sepal_length | sepal_width | species
-----+--------------+-------------+--------------+-------------+---------  
$ dsbulk load -url /tmp/dsbulkblog/president_birthdates.psv -k dsbulkblog -t president_birthdates -delim "|"
$ dsbulk load -url /tmp/logs/LOAD_20190314-164545-280529/mapping.bad -k dsbulkblog -t president_birthdates -delim "|" -header false -m "president,birthdate_string,birthdate" --codec.date "y.M.d"
$ dsbulk load -url /tmp/logs/LOAD_20190314-164545-280529/mapping.bad -k dsbulkblog -t president_birthdates -delim "|" -header false -m `head -1 /tmp/dsbulkblog/president_birthdates.psv | sed 's/|/,/g'` --codec.date "y.M.d"
$ dsbulk load -url /tmp/dsbulkblog/president_birthdates.psv -delim "|" -query "INSERT INTO dsbulkblog.president_birthdates(president,birthdate,birthdate_string) VALUES (:president, CurrentDate(), :birthdate_string)"
$ dsbulk load -url /tmp/dsbulkblog/president_birthdates.psv -k dsbulkblog -t president_birthdates -delim "|" -m "president=president, CurrentDate()=birthdate, birthdate_string=birthdate_string"
"Baseball Teams","[\"Boston Red Sox\", \"Atlanta Braves\", \"San Diego Padres\", \"Washington Senators\"]"
$ dsbulk load -url /tmp/dsbulkblog/sportsteams.csv -k dsbulkblog -t categories_list
$ dsbulk load -url /tmp/dsbulkblog/sportsteams.csv -k dsbulkblog -t categories_set
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id --schema.queryTtl 3600
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id --schema.queryTimestamp "2018-01-01T12:34:56+00:00"
$ cat /tmp/dsbulkblog/iris_no_header.csv | awk -F, '{printf("%s,%d\n", $0, $6+10000);}' | dsbulk load -k dsbulkblog -t iris_with_id -m "sepal_length,sepal_width,petal_length,petal_width,species,id,__ttl" -header false
$ cqlsh -e "SELECT id, species, Ttl(species) FROM dsbulkblog.iris_with_id LIMIT 5"
id | species         | ttl(species)
-----+-----------------+--------------  
(5 rows)
$ cat /tmp/dsbulkblog/iris_no_header.csv | awk -F, 'BEGIN{printf("sepal_length,sepal_width,petal_length,petal_width,species,id,ttl_to_use\n");} {printf("%s,%d\n", $0, $6+10000);}' | dsbulk load -query "INSERT INTO dsbulkblog.iris_with_id(sepal_length,sepal_width,petal_length,petal_width,species,id) VALUES (:sepal_length,:sepal_width,:petal_length,:petal_width,:species,:id) USING TTL :ttl_to_use"
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -dryRun true
total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms | batches  
150 |      0 | 0 | 0.00 |   0.00 | 0.00 | 0.00 |   0.00 | 1.00
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id --executor.maxPerSecond 10
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id --executor.maxInFlight 3
Discover more
Data ProcessingDataStax Bulk Loader
Share

One-stop Data API for Production GenAI

Astra DB gives JavaScript developers a complete data API and out-of-the-box integrations that make it easier to build production RAG apps with high relevancy and low latency.