Technology | March 26, 2019

DataStax Bulk Loader Pt. 1 — Introduction and Loading

Brian Hess
$ cqlsh -e "SELECT COUNT(*) FROM dsbulkblog.iris_with_id;"
(1 rows)
cqlsh -e "CREATE KEYSPACE dsbulkblog WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};"
cqlsh -e "CREATE TABLE dsbulkblog.iris_with_id (id int PRIMARY KEY, petal_length double, petal_width double, sepal_length double, sepal_width double, species text);"
cqlsh -e "CREATE TABLE dsbulkblog.categories_set(category TEXT PRIMARY KEY, examples SET<text>);" </text>
<text>cqlsh -e "CREATE TABLE dsbulkblog.categories_list(category TEXT PRIMARY KEY, examples LIST<text>);" </text></text>
<text><text>cqlsh -e "CREATE TABLE dsbulkblog.president_birthdates(president TEXT PRIMARY KEY, birthdate_string TEXT, birthdate DATE);" </text></text>
The simplest load points dsbulk at a CSV file and names the target keyspace and table; the column names in the file's header line are matched to the table's columns:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id
total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms | batches
  150 |      0 |    329 | 0.02 |   0.05 |  9.84 | 39.58 |  45.88 |    1.00

All 150 rows loaded, which we can confirm in cqlsh:

$ cqlsh -e "SELECT COUNT(*) FROM dsbulkblog.iris_with_id;"

 count
-------
   150

(1 rows)
dsbulk also reads from stdin, so the input can be transformed in flight. Here sed strips the "Iris-" prefix from the species names before loading, leaving just the three bare species names (setosa, versicolor, virginica) in the table:

$ cat /tmp/dsbulkblog/iris.csv | sed 's/Iris-//g' | dsbulk load -k dsbulkblog -t iris_with_id
If the file has no header, one can be injected on the way in; here awk prepends the header line before piping to dsbulk:

$ cat /tmp/dsbulkblog/iris_no_header.csv | awk 'BEGIN{print "sepal_length,sepal_width,petal_length,petal_width,species,id"} {print $0}' | dsbulk load -k dsbulkblog -t iris_with_id
The -url parameter is not limited to local files; it also accepts a URL:

$ dsbulk load -k dsbulkblog -t iris_with_id -url https://gist.github.com/brianmhess/8864cf0cb0ce9ea1fd64e579e9f41100/raw/522a1f564a381d7eacf6955c490bb6331d4369b2/iris.csv
$ mkdir /tmp/dsbulkblog/iris
$ cp /tmp/dsbulkblog/iris.csv /tmp/dsbulkblog/iris
-url can also point at a directory, in which case dsbulk loads every matching file it finds there:

$ dsbulk load -url /tmp/dsbulkblog/iris -k dsbulkblog -t iris_with_id
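By default only the directory itself is scanned. A sketch of descending into subdirectories as well, assuming the connector.csv.recursive setting (not used in the original post):

$ dsbulk load -url /tmp/dsbulkblog/iris -k dsbulkblog -t iris_with_id --connector.csv.recursive true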
If the directory contains files that should not be loaded — say we drop a pipe-delimited file in with the CSVs — we can restrict which file names are considered:

$ cp /tmp/dsbulkblog/president_birthdates.psv /tmp/dsbulkblog/iris
$ dsbulk load -url /tmp/dsbulkblog/iris -k dsbulkblog -t iris_with_id --connector.csv.fileNamePattern "**/*.csv"
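That stray president_birthdates.psv file could be loaded into its own table by overriding the delimiter. A minimal sketch, assuming the pipe-delimited columns line up with the table's columns and setting aside date parsing, which a later post in this series covers:

$ dsbulk load -url /tmp/dsbulkblog/president_birthdates.psv -k dsbulkblog -t president_birthdates -delim "|"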
Mappings. So far the file's header has supplied the field-to-column mapping. Recall the header of iris.csv:

sepal_length,sepal_width,petal_length,petal_width,species,id
The mapping can also be given explicitly with -m, as field = column pairs:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -m "sepal_length = sepal_length, sepal_width = sepal_width, petal_length = petal_length, petal_width = petal_width, species = species, id = id"
Since every field name here matches its column name, the same thing can be said with the shorthand:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -m "*=*"
A field can be excluded from the load by prefixing it with a minus sign; here everything but species is loaded:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -m "* = -species"
Multiple exclusions go in brackets:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -m "* = [-species, -petal_length]"
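Note that earlier examples already populated these columns, and skipping a field only leaves it out of the INSERT; it does not erase values already stored. To actually see the effect of the exclusions, a hypothetical check on a freshly emptied table (not in the original post):

$ cqlsh -e "TRUNCATE dsbulkblog.iris_with_id;"
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -m "* = [-species, -petal_length]"
$ cqlsh -e "SELECT id, petal_length, species FROM dsbulkblog.iris_with_id LIMIT 3;"

The skipped petal_length and species columns should come back null.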
When the input has no header, fields are addressed by zero-based index instead:

$ dsbulk load -url /tmp/dsbulkblog/iris_no_header.csv -k dsbulkblog -t iris_with_id -header false -m "0=sepal_length, 1=sepal_width, 2=petal_length, 3=petal_width, 4=species, 5=id"
The mapping can also be a simple ordered list of columns, assigned to the input fields in order:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -m "sepal_length,sepal_width,petal_length,petal_width,species,id"
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -m "field1=sepal_length,field2=sepal_width,field3=petal_length,field4=petal_width,field5=species,field6=id"
Index-based mappings can even be used on a file that does have a header, if we tell dsbulk not to treat the first line as one — though the header line itself will then be read as data:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -header false -m "0=sepal_length,1=sepal_width,2=petal_length,3=petal_width,4=species,5=id"
To avoid that, skip the first record with -skipRecords:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -m "sepal_length,sepal_width,petal_length,petal_width,species,id" -header false -skipRecords 1
-skipRecords can skip any number of lines; skipping 101 (the header plus the first 100 data rows) leaves 50 rows to load:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -m "sepal_length,sepal_width,petal_length,petal_width,species,id" -header false -skipRecords 101
total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms | batches
   50 |      0 |    122 | 0.01 |   0.05 | 10.13 | 23.33 |  23.33 |    1.00
-maxRecords bounds the load from the other end, stopping after the given number of records:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -maxRecords 20
total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms | batches
   20 |      0 |     53 | 0.00 |   0.05 |  5.86 | 12.26 |  12.26 |    1.00
The two options combine to load a window of the file; with the header in play this time, this skips the first 100 data records and loads the next 20:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -skipRecords 100 -maxRecords 20
Bad input. When records cannot be converted to the target column types, dsbulk sets them aside rather than failing the whole job on the first problem. For example, a mis-specified mapping that lands the species string in the integer id column rejects every record, and the operation aborts once it passes the default limit of 100 errors:

total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms | batches
  102 |    101 |      0 | 0.00 |   0.00 |  0.00 |  0.00 |   0.00 |    0.00

Rejected records can be found in the following file(s): mapping.bad
Each rejected record is detailed in mapping-errors.log, with the resource, position, source line, and the full parsing error:

Resource: file:/tmp/dsbulkblog/iris.csv
Position: 129
Source: 6.4,2.8,5.6,2.1,Iris-virginica,128\u000a
java.lang.IllegalArgumentException: Could not parse 'Iris-virginica'; accepted formats are: a valid number (e.g. '1234.56'), a valid Java numeric format (e.g. '-123.45e6'), a valid date-time pattern (e.g. '2019-03-14T16:25:41.267Z'), or a valid boolean word
        at com.datastax.dsbulk.engine.internal.codecs.util.CodecUtils.parseNumber(CodecUtils.java:119)
        at com.datastax.dsbulk.engine.internal.codecs.string.StringToNumberCodec.parseNumber(StringToNumberCodec.java:72)
        at com.datastax.dsbulk.engine.internal.codecs.string.StringToIntegerCodec.externalToInternal(StringToIntegerCodec.java:55)
        at com.datastax.dsbulk.engine.internal.codecs.string.StringToIntegerCodec.externalToInternal(StringToIntegerCodec.java:26)
        at com.datastax.dsbulk.engine.internal.codecs.ConvertingCodec.serialize(ConvertingCodec.java:50)
        Suppressed: java.time.format.DateTimeParseException: Text 'Iris-virginica' could not be parsed at index 0
                at java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:1949)
                at java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1819)
                at com.datastax.dsbulk.engine.internal.codecs.util.SimpleTemporalFormat.parse(SimpleTemporalFormat.java:41)
                at com.datastax.dsbulk.engine.internal.codecs.util.ZonedTemporalFormat.parse(ZonedTemporalFormat.java:45)
                at com.datastax.dsbulk.engine.internal.codecs.util.CodecUtils.parseNumber(CodecUtils.java:106)
                Suppressed: java.lang.NumberFormatException: For input string: "Iris-virginica"
                        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
                        at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
                        at java.lang.Double.parseDouble(Double.java:538)
                        at java.lang.Double.valueOf(Double.java:502)
                        at com.datastax.dsbulk.engine.internal.codecs.util.CodecUtils.parseNumber(CodecUtils.java:101)
                        Suppressed: java.lang.NumberFormatException: null
                                at java.math.BigDecimal.<init>(BigDecimal.java:494)
                                at java.math.BigDecimal.<init>(BigDecimal.java:383)
                                at java.math.BigDecimal.<init>(BigDecimal.java:806)
                                at com.datastax.dsbulk.engine.internal.codecs.util.CodecUtils.parseNumber(CodecUtils.java:96)
                                at com.datastax.dsbulk.engine.internal.codecs.string.StringToNumberCodec.parseNumber(StringToNumberCodec.java:72)
                                Suppressed: java.text.ParseException: Invalid number format: Iris-virginica
                                        at com.datastax.dsbulk.engine.internal.codecs.util.CodecUtils.parseNumber(CodecUtils.java:247)
                                        at com.datastax.dsbulk.engine.internal.codecs.util.CodecUtils.parseNumber(CodecUtils.java:92)
                                        at com.datastax.dsbulk.engine.internal.codecs.string.StringToNumberCodec.parseNumber(StringToNumberCodec.java:72)
                                        at com.datastax.dsbulk.engine.internal.codecs.string.StringToIntegerCodec.externalToInternal(StringToIntegerCodec.java:55)
                                        at com.datastax.dsbulk.engine.internal.codecs.string.StringToIntegerCodec.externalToInternal(StringToIntegerCodec.java:26)
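All of these artifacts land in the operation's log directory, which dsbulk prints at the start of each run. A hypothetical listing (the operation.log name is an assumption; mapping.bad, mapping-errors.log, and positions.txt are named in the output above):

$ ls /tmp/logs/LOAD_20190314-162539-255317
mapping-errors.log  mapping.bad  operation.log  positions.txt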
Because mapping.bad holds just the rejected records, we can fix the mapping and reload only those:

$ dsbulk load -url /tmp/logs/LOAD_20190314-162539-255317/mapping.bad -k dsbulkblog -t iris_with_id -header false -m "0=sepal_length,1=sepal_width,2=petal_length,3=petal_width,4=species,5=id"
Operation directory: /tmp/logs/LOAD_20190314-163033-590156

total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms | batches
  101 |      0 |    231 | 0.01 |   0.05 |  8.50 | 17.17 |  18.22 |    1.00

Operation LOAD_20190314-163033-590156 completed successfully in 0 seconds.
Last processed positions can be found in positions.txt
Missing fields. If the input is missing a field that the mapping expects — here awk drops the sepal_length column — the records are rejected:

$ cat /tmp/dsbulkblog/iris.csv | awk -F, '{printf("%s,%s,%s,%s,%s\n", $2, $3, $4, $5, $6)}' | dsbulk load -k dsbulkblog -t iris_with_id
Operation directory: /tmp/logs/LOAD_20190314-163121-694645
At least 1 record does not match the provided schema.mapping or schema.query. Please check that the connector configuration and the schema configuration are correct.
Operation LOAD_20190314-163121-694645 aborted: Too many errors, the maximum allowed is 100.

total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms | batches
  102 |    101 |      0 | 0.00 |   0.00 |  0.00 |  0.00 |   0.00 |    0.00

Rejected records can be found in the following file(s): mapping.bad
Errors are detailed in the following file(s): mapping-errors.log
Last processed positions can be found in positions.txt
If loading without that column is acceptable, --schema.allowMissingFields permits it, and the absent column is simply left unset:

$ cat /tmp/dsbulkblog/iris.csv | awk -F, '{printf("%s,%s,%s,%s,%s\n", $2, $3, $4, $5, $6)}' | dsbulk load -k dsbulkblog -t iris_with_id --schema.allowMissingFields true
Extra fields. The reverse case — input with more fields than the table has columns — is allowed by default, and the extra field is simply ignored:

$ cat /tmp/dsbulkblog/iris.csv | awk -F, '{printf("%s,%s,%s,%s,%s,%s,extra\n", $1, $2, $3, $4, $5, $6)}' | dsbulk load -k dsbulkblog -t iris_with_id
To reject such records instead, disable --schema.allowExtraFields:

$ cat /tmp/dsbulkblog/iris.csv | awk -F, '{printf("%s,%s,%s,%s,%s,%s,extra\n", $1, $2, $3, $4, $5, $6)}' | dsbulk load -k dsbulkblog -t iris_with_id --schema.allowExtraFields false
Operation directory: /tmp/logs/LOAD_20190314-163305-346514
At least 1 record does not match the provided schema.mapping or schema.query. Please check that the connector configuration and the schema configuration are correct.
Operation LOAD_20190314-163305-346514 aborted: Too many errors, the maximum allowed is 100.

total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms | batches
  102 |    101 |      0 | 0.00 |   0.00 |  0.00 |  0.00 |   0.00 |    0.00

Rejected records can be found in the following file(s): mapping.bad
Errors are detailed in the following file(s): mapping-errors.log
Last processed positions can be found in positions.txt
Whitespace. By default, leading whitespace in a field is preserved; if we pad the species names with spaces, the padded values are what get stored:

$ cat /tmp/dsbulkblog/iris.csv | sed "s/Iris/  Iris/g" | dsbulk load -k dsbulkblog -t iris_with_id
The load succeeds, and the species values are now stored with their leading spaces intact (e.g. "  Iris-setosa").
To trim it on the way in, enable --connector.csv.ignoreLeadingWhitespaces (there is a matching option for trailing whitespace):

$ cat /tmp/dsbulkblog/iris.csv | sed "s/Iris/ Iris/g" | dsbulk load -k dsbulkblog -t iris_with_id --connector.csv.ignoreLeadingWhitespaces true
This time the stored species values have no leading whitespace.
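One way to inspect exactly what was stored is a quick unload to stdout — a sketch using dsbulk's unload mode, which a later post in this series covers in detail:

$ dsbulk unload -query "SELECT species FROM dsbulkblog.iris_with_id LIMIT 3"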
Providing the query. Instead of a keyspace and table, dsbulk will accept the INSERT statement itself, with named bind variables matching the input fields:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -query "INSERT INTO dsbulkblog.iris_with_id(id,petal_width,petal_length,sepal_width,sepal_length,species) VALUES (:id, :petal_width, :petal_length, :sepal_width, :sepal_length, :species)"
Positional placeholders work, too; without a header, the input fields bind to the question marks in order:

$ dsbulk load -url /tmp/dsbulkblog/iris_no_header.csv -query "INSERT INTO dsbulkblog.iris_with_id(petal_width,petal_length,sepal_width,sepal_length,species,id) VALUES (?,?,?,?,?,?)" -header false
The query can also hard-code values; here every row gets the same species:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -query "INSERT INTO dsbulkblog.iris_with_id(id,petal_width,petal_length,sepal_width,sepal_length,species) VALUES (:id, :petal_width, :petal_length, :sepal_width, :sepal_length, 'some kind of iris')"
total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms | batches
   10 |      0 |     32 | 0.00 |   0.00 |  4.59 |  5.96 |   5.96 |    1.00
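The query need not be an INSERT. A sketch of the same load expressed as an UPDATE statement, assuming dsbulk accepts UPDATE in -query the same way (this example is not from the original post):

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -query "UPDATE dsbulkblog.iris_with_id SET petal_width = :petal_width, petal_length = :petal_length, sepal_width = :sepal_width, sepal_length = :sepal_length, species = :species WHERE id = :id"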
$ cqlsh -e "SELECT * FROM dsbulkblog.iris_with_id WHERE id IN (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)"
(6 rows)