Technology | July 8, 2019

DataStax Bulk Loader Pt. 5 — Counting

Brian Hess

The first three blog posts in this series dealt with data loading (here, here, and here), and the fourth blog post (here) dealt with data unloading. This blog post covers dsbulk's count mode.


New in dsbulk version 1.1.0 is the ability to count the data in a table. This is a common task once folks load data, to ensure that the data was loaded correctly. It could be that a load failed midway and they would like to see how far it got. It could be that the primary key was not what was desired, and instead of unique inserts the load performed overwrites.

Example 23: Simple Counting

Let’s start with a simple count of the data:

$ dsbulk count -k dsbulkblog -t iris_with_id

Or

$ dsbulk count -k dsbulkblog -t iris_with_id --stats.modes global

Or

$ dsbulk count -k dsbulkblog -t iris_with_id -stats global

These all produce the same output:

Operation directory: /tmp/logs/COUNT_20190314-171517-238903.
total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms
  150 |      0 |    400 | 0.00 |   0.00 | 18.68 | 18.74 |  18.74
Operation COUNT_20190314-171517-238903 completed successfully in 0 seconds.
150

Example 24: Counting without other information printed

We can remove the extraneous reporting information by reducing the verbosity via:

$ dsbulk count -k dsbulkblog -t iris_with_id --log.verbosity 0

Which produces just:

150
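With nothing but the bare count on stdout, the result is easy to script against, for example to check a load against its source file. Here is a minimal sketch of that sanity check (the `csv_data_rows` helper is my own name, and in practice you would capture dsbulk's output with something like `subprocess`; the in-memory sample below just stands in for iris.csv, which has a header row):

```python
import io

def csv_data_rows(f):
    """Count data rows in a CSV file object, skipping the header line."""
    return sum(1 for _ in f) - 1

# In practice, compare this number against the bare count printed by
# `dsbulk count ... --log.verbosity 0`.
sample = io.StringIO("id,species\n0,Iris-setosa\n1,Iris-virginica\n")
print(csv_data_rows(sample))  # 2
```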

Example 25: Counting by host

There are a few different ways to group up the counts. The first is to count the rows per host:

$ dsbulk count -k dsbulkblog -t iris_with_id --log.verbosity 0 --stats.modes hosts

I did this example on my local machine, so there is only one host. The output for me was:

/127.0.0.1:9042 150 100.00

The first column is the host, the second is the count, and the third is the percentage of the total.
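Since the per-host output is just whitespace-separated columns, it is straightforward to pull into a script. A small sketch (the `parse_host_counts` helper is my own; the sample line is taken from the run above):

```python
def parse_host_counts(output):
    """Parse `--stats.modes hosts` lines of the form: host count percent."""
    rows = {}
    for line in output.strip().splitlines():
        host, count, pct = line.split()
        rows[host] = (int(count), float(pct))
    return rows

counts = parse_host_counts("/127.0.0.1:9042 150 100.00\n")
print(counts["/127.0.0.1:9042"])  # (150, 100.0)
```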

Example 26: Counting by range

Sometimes it’s important to understand the size by the token ranges in the system. To do this, we use the ranges mode:

$ dsbulk count -k dsbulkblog -t iris_with_id --log.verbosity 0 --stats.modes ranges

Again, on a single-node cluster the output is a little simple:

-9223372036854775808 -9223372036854775808 150 100.00

Here we have the beginning token of the range, the ending token of the range (identical here because a single range wraps all the way around the ring), the count, and the percentage.
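For context, the Murmur3 partitioner's token space runs from -2^63 through 2^63 - 1, and with one node the single range covers the whole ring, which is why the start and end tokens above match. As a sketch of how that token space can be carved into contiguous subranges (the sort of split dsbulk performs internally for parallel reads; the function name is my own):

```python
MIN_TOKEN = -2**63       # smallest Murmur3 token
MAX_TOKEN = 2**63 - 1    # largest Murmur3 token

def split_token_ring(n):
    """Split the Murmur3 token space into n contiguous (start, end] ranges."""
    span = (MAX_TOKEN - MIN_TOKEN) // n
    bounds = [MIN_TOKEN + i * span for i in range(n)] + [MAX_TOKEN]
    return list(zip(bounds, bounds[1:]))

for start, end in split_token_ring(4):
    print(start, end)
```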

Example 27: Counting the largest partitions

The partitions option will count the largest partitions for a table, in terms of number of rows per partition. To do this, we need to create a table with a clustering column. Let’s do that with the Iris data set using:

$ cqlsh -e "CREATE TABLE dsbulkblog.iris_clustered(id INT, petal_length DOUBLE, petal_width DOUBLE, sepal_length DOUBLE, sepal_width DOUBLE, species TEXT, PRIMARY KEY ((species), id))"

We can load the iris.csv data into it with:

$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_clustered

Now we can count the largest partitions with:

$ dsbulk count -k dsbulkblog -t iris_clustered --log.verbosity 0 --stats.modes partitions

Which produces this output:

'Iris-virginica' 50 33.33
'Iris-versicolor' 50 33.33
'Iris-setosa' 50 33.33

The first column is the partition key value, the second is the count, and the third is the percentage of the total.

Example 28: Counting by ranges and hosts

The --stats.modes parameter takes a list of modes, for example:

$ dsbulk count -k dsbulkblog -t iris_with_id --log.verbosity 0 --stats.modes ranges,hosts

Which produces this output:

Total rows per host:
/127.0.0.1:9042 150 100.00
Total rows per token range:
-9223372036854775808 -9223372036854775808 150 100.00

Example 29: Counting with a predicate

Sometimes you want to filter and count only some of the records. To do this, we can use --schema.query (or -query) to supply a full SELECT statement with the desired predicates, as if we were going to unload the data; dsbulk will instead count the results.

For example, if we only wanted to count the rows where petal_width = 2, we could use:

$ dsbulk count -query "SELECT id FROM dsbulkblog.iris_with_id WHERE petal_width = 2 ALLOW FILTERING"

Which would produce the following output:

Operation directory: /tmp/logs/COUNT_20190314-171916-543786.
total | failed | rows/s |  mb/s | kb/row |  p50ms |  p99ms | p999ms
    6 |      0 |     18 |  0.00 |   0.00 | 130.81 | 131.07 | 131.07
Operation COUNT_20190314-171916-543786 completed successfully in 0 seconds.
6

We can add parallelism to this query by doing:

$ dsbulk count -query "SELECT id FROM dsbulkblog.iris_with_id WHERE petal_width = 2 AND Token(id) > :start AND Token(id) <= :end ALLOW FILTERING"
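The :start and :end placeholders are what let dsbulk fan the query out: it substitutes the boundaries of each token range and runs the resulting statements concurrently. A rough illustration of that expansion (the template, function name, and even-split strategy are my own simplifications, not dsbulk's actual internals):

```python
MIN_TOKEN = -2**63       # smallest Murmur3 token
MAX_TOKEN = 2**63 - 1    # largest Murmur3 token

TEMPLATE = ("SELECT id FROM dsbulkblog.iris_with_id "
            "WHERE petal_width = 2 AND Token(id) > {start} AND Token(id) <= {end} "
            "ALLOW FILTERING")

def token_range_queries(n):
    """Expand the :start/:end placeholders into n concrete range queries."""
    span = (MAX_TOKEN - MIN_TOKEN) // n
    bounds = [MIN_TOKEN + i * span for i in range(n)] + [MAX_TOKEN]
    return [TEMPLATE.format(start=s, end=e) for s, e in zip(bounds, bounds[1:])]

# Eight disjoint token-range queries that together cover the whole ring.
print(len(token_range_queries(8)))  # 8
```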

To download the DataStax Bulk Loader, click here.

For an intro to unloading, read the previous Bulk Loader blog here.

For DataStax Bulk Loader Part 6 on Examples for Loading From Other Locations, go here.
