Technology•March 26, 2019

DataStax Bulk Loader Pt. 1 — Introduction and Loading

Brian Hess

$ cqlsh -e "SELECT COUNT(*) FROM dsbulkblog.iris_with_id;"

(1 rows)

cqlsh -e "CREATE KEYSPACE dsbulkblog WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};"

cqlsh -e "CREATE TABLE dsbulkblog.iris_with_id (id int PRIMARY KEY, petal_length double, petal_width double, sepal_length double, sepal_width double, species text);"

cqlsh -e "CREATE TABLE dsbulkblog.categories_set(category TEXT PRIMARY KEY, examples SET<text>);" </text>

<text>cqlsh -e "CREATE TABLE dsbulkblog.categories_list(category TEXT PRIMARY KEY, examples LIST<text>);" </text></text>

<text><text>cqlsh -e "CREATE TABLE dsbulkblog.president_birthdates(president TEXT PRIMARY KEY, birthdate_string TEXT, birthdate DATE);" </text></text>

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id

total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms | batches

150 | 0 | 329 | 0.02 | 0.05 | 9.84 | 39.58 | 45.88 | 1.00

$ cat /tmp/dsbulkblog/iris.csv | sed 's/Iris-//g' | dsbulk load -k dsbulkblog -t iris_with_id

(3 rows)

$ cat /tmp/dsbulkblog/iris_no_header.csv | awk 'BEGIN{print "sepal_length,sepal_width,petal_length,petal_width,species,id"} {print $0}' | dsbulk load -k dsbulkblog -t iris_with_id

$ dsbulk load -k dsbulkblog -t iris_with_id -url https://gist.github.com/brianmhess/8864cf0cb0ce9ea1fd64e579e9f41100/raw/522a1f564a381d7eacf6955c490bb6331d4369b2/iris.csv

$ mkdir /tmp/dsbulkblog/iris $ cp /tmp/dsbulkblog/iris.csv /tmp/dsbulkblog/iris

$ dsbulk load -url /tmp/dsbulkblog/iris -k dsbulkblog -t iris_with_id

$ cp /tmp/dsbulkblog/president_birthdates.psv /tmp/dsbulkblog/iris

$ dsbulk load -url /tmp/dsbulkblog/iris -k dsbulkblog -t iris_with_id --connector.csv.fileNamePattern "**/*.csv"

sepal_length,sepal_width,petal_length,petal_width,species,id

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -m "sepal_length = sepal_length, sepal_width = sepal_width, petal_length = petal_length, petal_width = petal_width, species = species, id = id"

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -m "sepal_length = sepal_length, sepal_width = sepal_width, petal_length = petal_length, petal_width = petal_width, species = species, id = id"

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -m "*=*"

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -m "* = -species"

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -m "* = [-species, -petal_length]"

$ dsbulk load -url /tmp/dsbulkblog/iris_no_header.csv -k dsbulkblog -t iris_with_id -header false -m "0=sepal_length, 1=sepal_width, 2=petal_length, 3=petal_width, 4=species, 5=id"

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -m "sepal_length,sepal_width,petal_length,petal_width,species,id"

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -m "field1=sepal_length,field2=sepal_width,field3=petal_length,field4=petal_width,field5=species,field6=id"

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -header false -m "0=sepal_length,1=sepal_width,2=petal_length,3=petal_width,4=species,5=id"

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -m "sepal_length,sepal_width,petal_length,petal_width,species,id" -header false -skipRecords 1

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -m "sepal_length,sepal_width,petal_length,petal_width,species,id" -header false -skipRecords 101

total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms | batches

50 |      0 | 122 | 0.01 |   0.05 | 10.13 | 23.33 |  23.33 | 1.00

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -maxRecords 20

total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms | batches 20 | 0 | 53 | 0.00 | 0.05 | 5.86 | 12.26 | 12.26 | 1.00

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -skipRecords 100 -maxRecords 20

total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms | batches

102 |    101 |  0 | 0.00 |  0.00 | 0.00 |  0.00 | 0.00 |   0.00 Rejected records can be found in the following file(s): mapping.bad

Resource: file:/tmp/dsbulkblog/iris.csv

Position: 129

Source: 6.4,2.8,5.6,2.1,Iris-virginica,128\u000a

java.lang.IllegalArgumentException: Could not parse 'Iris-virginica'; accepted formats are: a valid number (e.g. '1234.56'), a valid Java numeric format (e.g. '-123.45e6'), a valid date-time pattern (e.g. '2019-03-14T16:25:41.267Z'), or a valid boolean word        at com.datastax.dsbulk.engine.internal.codecs.util.CodecUtils.parseNumber(CodecUtils.java:119)        at com.datastax.dsbulk.engine.internal.codecs.string.StringToNumberCodec.parseNumber(StringToNumberCodec.java:72)        at com.datastax.dsbulk.engine.internal.codecs.string.StringToIntegerCodec.externalToInternal(StringToIntegerCodec.java:55)        at com.datastax.dsbulk.engine.internal.codecs.string.StringToIntegerCodec.externalToInternal(StringToIntegerCodec.java:26)        at com.datastax.dsbulk.engine.internal.codecs.ConvertingCodec.serialize(ConvertingCodec.java:50)        Suppressed: java.time.format.DateTimeParseException: Text 'Iris-virginica' could not be parsed at index 0                at java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:1949)                at java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1819)                at com.datastax.dsbulk.engine.internal.codecs.util.SimpleTemporalFormat.parse(SimpleTemporalFormat.java:41)                at com.datastax.dsbulk.engine.internal.codecs.util.ZonedTemporalFormat.parse(ZonedTemporalFormat.java:45)                at com.datastax.dsbulk.engine.internal.codecs.util.CodecUtils.parseNumber(CodecUtils.java:106)                Suppressed: java.lang.NumberFormatException: For input string: "Iris-virginica"                        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)                        at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)                        at java.lang.Double.parseDouble(Double.java:538)                        at java.lang.Double.valueOf(Double.java:502)                        at com.datastax.dsbulk.engine.internal.codecs.util.CodecUtils.parseNumber(CodecUtils.java:101)                        Suppressed: java.lang.NumberFormatException: null                                at java.math.BigDecimal.<init>(BigDecimal.java:494)                                at java.math.BigDecimal.<init>(BigDecimal.java:383)                                at java.math.BigDecimal.<init>(BigDecimal.java:806)                                at com.datastax.dsbulk.engine.internal.codecs.util.CodecUtils.parseNumber(CodecUtils.java:96)                                at com.datastax.dsbulk.engine.internal.codecs.string.StringToNumberCodec.parseNumber(StringToNumberCodec.java:72)                                Suppressed: java.text.ParseException: Invalid number format: Iris-virginica                                        at com.datastax.dsbulk.engine.internal.codecs.util.CodecUtils.parseNumber(CodecUtils.java:247)                                        at com.datastax.dsbulk.engine.internal.codecs.util.CodecUtils.parseNumber(CodecUtils.java:92)                                        at com.datastax.dsbulk.engine.internal.codecs.string.StringToNumberCodec.parseNumber(StringToNumberCodec.java:72)                                        at com.datastax.dsbulk.engine.internal.codecs.string.StringToIntegerCodec.externalToInternal(StringToIntegerCodec.java:55)                                        at com.datastax.dsbulk.engine.internal.codecs.string.StringToIntegerCodec.externalToInternal(StringToIntegerCodec.java:26) </init></init></init>

$ dsbulk load -url /tmp/logs/LOAD_20190314-162539-255317/mapping.bad -k dsbulkblog -t iris_with_id -header false -m "0=sepal_length,1=sepal_width,2=petal_length,3=petal_width,4=species, 5=id"

Operation directory: /tmp/logs/LOAD_20190314-163033-590156 total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms | batches

101 |      0 | 231 | 0.01 |   0.05 | 8.50 | 17.17 |  18.22 | 1.00

Operation LOAD_20190314-163033-590156 completed successfully in 0 seconds.

Last processed positions can be found in positions.txt

$ cat /tmp/dsbulkblog/iris.csv | awk -F, '{printf("%s,%s,%s,%s,%s\n", $2, $3, $4, $5, $6)}' | dsbulk load -k dsbulkblog -t iris_with_id

Operation directory: /tmp/logs/LOAD_20190314-163121-694645

At least 1 record does not match the provided schema.mapping or schema.query. Please check that the connector configuration and the schema configuration are correct.

Operation LOAD_20190314-163121-694645 aborted: Too many errors, the maximum allowed is 100. total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms | batches

102 |    101 |  0 | 0.00 |  0.00 | 0.00 |  0.00 | 0.00 |   0.00

Rejected records can be found in the following file(s): mapping.bad

Errors are detailed in the following file(s): mapping-errors.log

Last processed positions can be found in positions.txt

--schema.allowMissingFields:

$ cat /tmp/dsbulkblog/iris.csv | awk -F, '{printf("%s,%s,%s,%s,%s\n", $2, $3, $4, $5, $6)}' | dsbulk load -k dsbulkblog -t iris_with_id --schema.allowMissingFields true

$ cat /tmp/dsbulkblog/iris.csv | awk -F, '{printf("%s,%s,%s,%s,%s,%s,extra\n", $1, $2, $3, $4, $5, $6)}' | dsbulk load -k dsbulkblog -t iris_with_id

$ cat /tmp/dsbulkblog/iris.csv | awk -F, '{printf("%s,%s,%s,%s,%s,%s,extra\n", $1, $2, $3, $4, $5, $6)}' | dsbulk load -k dsbulkblog -t iris_with_id --schema.allowExtraFields false

Operation directory: /tmp/logs/LOAD_20190314-163305-346514

At least 1 record does not match the provided schema.mapping or schema.query. Please check that the connector configuration and the schema configuration are correct.

Operation LOAD_20190314-163305-346514 aborted: Too many errors, the maximum allowed is 100. total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms | batches

102 |    101 |  0 | 0.00 |  0.00 | 0.00 |  0.00 | 0.00 |   0.00

Rejected records can be found in the following file(s): mapping.bad

Errors are detailed in the following file(s): mapping-errors.log

Last processed positions can be found in positions.txt

$ cat /tmp/dsbulkblog/iris.csv | sed "s/Iris/  Iris/g" | dsbulk load -k dsbulkblog -t iris_with_id

total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms

150 | 0 | 392 | 0.01 | 0.02 | 9.49 | 41.68 | 41.68

Operation UNLOAD_20190321-144112-777452 completed successfully in 0 seconds.

species

$ cat /tmp/dsbulkblog/iris.csv | sed "s/Iris/ Iris/g" | dsbulk load -k dsbulkblog -t iris_with_id --connector.csv.ignoreLeadingWhitespaces true

total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms

150 |      0 | 416 | 0.01 |   0.01 | 8.16 | 36.96 |  36.96

Operation UNLOAD_20190321-144510-244786 completed successfully in 0 seconds.

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -query "INSERT INTO dsbulkblog.iris_with_id(id,petal_width,petal_length,sepal_width,sepal_length,species) VALUES (:id, :petal_width, :petal_length, :sepal_width, :sepal_length, :species)"

$ dsbulk load -url /tmp/dsbulkblog/iris_no_header.csv -query "INSERT INTO dsbulkblog.iris_with_id(petal_width,petal_length,sepal_width,sepal_length,species,id) VALUES (?,?,?,?,?,?)" -header false

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -query "INSERT INTO dsbulkblog.iris_with_id(id,petal_width,petal_length,sepal_width,sepal_length,species) VALUES (:id, :petal_width, :petal_length, :sepal_width, :sepal_length, 'some kind of iris')"

total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms | batches

10 |      0 | 32 | 0.00 |   0.00 | 4.59 | 5.96 |   5.96 | 1.00

$ cqlsh -e "SELECT * FROM dsbulkblog.iris_with_id WHERE id IN (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)"

(6 rows)

Discover more

Data ProcessingDataStax Bulk Loader

More Technology

View All

Technology • April 25, 2024

How to Build a Crystal Image Search App with Vector Search

Technology • April 18, 2024

Knowledge Graphs for RAG without a GraphDB

Technology • April 17, 2024

How Winweb Built its AI Assistant with DataStax Astra DB and LangChain

Technology • April 16, 2024

Vercel + Astra DB: Get Data into Your GenAI Apps Fast

One-stop Data API for Production GenAI

Astra DB gives JavaScript developers a complete data API and out-of-the-box integrations that make it easier to build production RAG apps with high relevancy and low latency.

Learn More

Get Started for Free

DataStax Bulk Loader Pt. 1 — Introduction and Loading

Brian Hess

Discover more

Share

Share

Setup

Loading Examples

Example 1: Simple

Example 2: Loading from stdin

Example 2.2: Loading from a URL

Example 2.3: Loading a directory of files

Example 2.4: Loading only some files in a directory

Example 3: Mappings

Example 3.1: Mapping by index

Example 3.2: Mapping as a list

Example 3.3: Skipping records

Example 3.4: Loading only some records

Example 3.5: Bad mapping or bad parsing

Example 3.6: Missing fields and extra fields

Example 3.7: Whitespace

Example 4: Providing the Query

Example 4.1: Providing the query itself

Example 4.2: Providing the query with positional mapping

Example 4.3: Providing a query with constant values