Brian Hess

The first three blog posts in this series dealt with data loading (<a href="/blog/2019/03/datastax-bulk-loader-introduction-and-loading">here</a>, <a href="/blog/2019/04/datastax-bulk-loader-more-loading">here</a>, and <a href="/blog/2019/04/datastax-bulk-loader-common-settings">here</a>), and the fourth blog post (<a href="/blog/2019/06/datastax-bulk-loader-unloading">here</a>) dealt with data unloading. This post covers the count mode for dsbulk.

<hr />
Something new in dsbulk version 1.1.0 is the ability to count the data in a table. This is a common task that folks do once they load data, to ensure that the data was loaded correctly. It could be that a load failed midway and they would like to see how far it got. It could also be that the primary key was not what was desired, so that some inserts overwrote earlier rows instead of adding new ones.
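One way to anticipate the overwrite case is to compare the number of records in the source file with the number of distinct primary-key values it contains. The following is a minimal sketch, assuming a CSV file with a header row; the function name <code>expected_row_count</code> and the sample data are hypothetical and only for illustration:

```python
import csv
import io


def expected_row_count(csv_text, key_columns):
    """Return (total_records, distinct_keys) for CSV text with a header row.

    Records that share the same primary-key values overwrite each other on
    load, so distinct_keys is the count the table should report afterwards.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    total = 0
    keys = set()
    for record in reader:
        total += 1
        keys.add(tuple(record[col] for col in key_columns))
    return total, len(keys)


# Hypothetical sample: two records share id 1, so one overwrites the other.
sample = "id,species\n0,Iris-setosa\n1,Iris-setosa\n1,Iris-virginica\n"
total, distinct = expected_row_count(sample, ["id"])
print(total, distinct)
```

If the two numbers differ, a count that is lower than the number of records in the file is expected rather than a sign of a failed load.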

<h2>Example 23: Simple Counting</h2>

Let’s start with a simple count of the data:

<code>$ dsbulk count -k dsbulkblog -t iris_with_id </code>

Or

<code>$ dsbulk count -k dsbulkblog -t iris_with_id --stats.modes global </code>

Or

<code>$ dsbulk count -k dsbulkblog -t iris_with_id -stats global </code>

These all produce the same output:

<code>Operation directory: /tmp/logs/COUNT_20190314-171517-238903. </code>

<code>total | failed | rows/s | mb/s | kb/row | p50 ms | p99ms | p999ms &nbsp;&nbsp;</code>

<code>&nbsp; &nbsp;150 |&nbsp; &nbsp; &nbsp; 0 | 400 | 0.00 | &nbsp; 0.00 | 18.68 | 18.74 |&nbsp; 18.74 </code>

<code>Operation COUNT_20190314-171517-238903 completed successfully in 0 seconds. </code>

<code>150 </code>

<h2>Example 24: Counting without other information printed</h2>

We can remove the extraneous reporting information by reducing the verbosity via:

<code>$ dsbulk count -k dsbulkblog -t iris_with_id --log.verbosity 0 </code>

Which produces just:

<code>150 </code>

<h2>Example 25: Counting by host</h2>

There are a few different ways to group the counts. The first is to count the rows per host:

<code>$ dsbulk count -k dsbulkblog -t iris_with_id --log.verbosity 0 --stats.mode hosts </code>

I did this example on my local machine, so there is only one host. The output for me was:

<code>/127.0.0.1:9042 150 100.00 </code>

The first column is the host, the second is the count, and the third is the percentage of the total.
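If the counts need to be consumed by a script, the grouped output can be parsed by splitting each line from the right. This is a sketch that assumes the whitespace-separated layout shown in this post (a label, then a count, then a percentage); the function name <code>parse_grouped_counts</code> is hypothetical:

```python
def parse_grouped_counts(output):
    """Parse lines of the form '<label> <count> <percentage>'.

    Splitting from the right keeps a label that contains spaces intact.
    Lines that do not end in two numeric fields (for example, section
    headers) are skipped.
    """
    rows = []
    for line in output.splitlines():
        parts = line.strip().rsplit(None, 2)
        if len(parts) != 3:
            continue
        label, count, pct = parts
        try:
            rows.append((label, int(count), float(pct)))
        except ValueError:
            continue
    return rows


# Sample text copied from the output format shown in this post.
sample = "Total rows per host:\n/127.0.0.1:9042 150 100.00\n"
print(parse_grouped_counts(sample))
```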

<h2>Example 26: Counting by range</h2>

Sometimes it’s important to understand how the rows are distributed across the token ranges in the system. To do this, we use the <code>ranges</code> mode:

<code>$ dsbulk count -k dsbulkblog -t iris_with_id --log.verbosity 0 --stats.mode ranges </code>

Again, on a single-node cluster the output is not very interesting:

<code>-9223372036854775808 -9223372036854775808 150 100.00 </code>

Here we have the beginning token of the range, the ending token of the range (again, a little strange as we have only one range), the count, and the percentage.
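To see why a range whose start and end are the same token can still hold every row, it helps to compute how much of the token ring a range covers. The sketch below assumes the Murmur3 partitioner (tokens from -2^63 to 2^63 - 1) and the convention that a range with both ends at the minimum token denotes the whole ring, which matches the output above; <code>ring_fraction</code> is a hypothetical helper:

```python
MIN_TOKEN = -2 ** 63   # smallest token, assuming the Murmur3 partitioner
RING_SIZE = 2 ** 64    # total number of tokens on the ring


def ring_fraction(start, end):
    """Fraction of the token ring covered by the range (start, end].

    Assumed convention: start == end == MIN_TOKEN is the full ring, and any
    other start == end is an empty range. A range with end < start wraps
    around the end of the ring.
    """
    if start == end:
        return 1.0 if start == MIN_TOKEN else 0.0
    return ((end - start) % RING_SIZE) / RING_SIZE


print(ring_fraction(-9223372036854775808, -9223372036854775808))
print(ring_fraction(-9223372036854775808, 0))
```

Comparing this fraction with the percentage of rows reported for a range gives a rough sense of whether the data is spread evenly over the ring.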

<h2>Example 27: Counting the largest partitions</h2>

The <code>partitions</code> option will count the largest partitions for a table, in terms of number of rows per partition. To do this, we need to create a table with a clustering column. Let’s do that with the Iris data set using:

<code>$ cqlsh -e "CREATE TABLE dsbulkblog.iris_clustered(id INT, petal_length DOUBLE, petal_width DOUBLE, sepal_length DOUBLE, sepal_width DOUBLE, species TEXT, PRIMARY KEY ((species), id))" </code>

We can load the <code>iris.csv</code> data into it with:

<code>$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_clustered </code>

Now we can count the largest partitions with:

<code>$ dsbulk count -k dsbulkblog -t iris_clustered --log.verbosity 0 --stats.mode partitions </code>

Which produces this output:

<code>'Iris-virginica' 50 33.33 </code>

<code>'Iris-versicolor' 50 33.33 </code>

<code>'Iris-setosa' 50 33.33 </code>

The first column is the partition key, the second is the number of rows in that partition, and the third is the percentage of the total.
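As a cross-check, the same kind of per-partition breakdown can be computed from the source file before loading. This is a sketch under stated assumptions: the CSV has a header row, every record has a unique primary key (so no overwrites), and the function name <code>largest_partitions</code> and its <code>top</code> parameter are hypothetical:

```python
import csv
import io
from collections import Counter


def largest_partitions(csv_text, partition_key, top=10):
    """Count records per partition-key value.

    Returns (key, count, percentage) tuples, largest partition first.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(record[partition_key] for record in reader)
    total = sum(counts.values())
    return [
        (key, n, round(100.0 * n / total, 2))
        for key, n in counts.most_common(top)
    ]


# Hypothetical sample with two partitions of different sizes.
sample = "id,species\n0,Iris-setosa\n1,Iris-setosa\n2,Iris-virginica\n"
for key, n, pct in largest_partitions(sample, "species"):
    print(key, n, pct)
```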

<h2>Example 28: Counting by ranges and hosts</h2>

The <code>--stats.mode</code> parameter takes a comma-separated list of modes, for example:

<code>$ dsbulk count -k dsbulkblog -t iris_with_id --log.verbosity 0 --stats.mode ranges,hosts </code>

Which produces this output:

<code>Total rows per host: </code>

<code>/127.0.0.1:9042 150 100.00 </code>

<code>Total rows per token range: </code>

<code>-9223372036854775808 -9223372036854775808 150 100.00 </code>

<h2>Example 29: Counting with a predicate</h2>

Sometimes you want to filter and only count some of the records. To do this, we can use the <code>--schema.query</code> parameter (or <code>-query</code>) to specify the predicates. We supply the full SELECT statement as if we were going to unload the data, but dsbulk will instead count the results.

For example, if we only wanted to count the rows where <code>petal_width = 2</code>, we could use:

<code>$ dsbulk count -query "SELECT id FROM dsbulkblog.iris_with_id WHERE petal_width = 2 ALLOW FILTERING" </code>

Which would produce the following output:

<code>Operation directory: /tmp/logs/COUNT_20190314-171916-543786. </code>

<code>total | failed | rows/s | mb/s | kb/row | p50 ms |&nbsp; p99ms | p999ms &nbsp;&nbsp;&nbsp;&nbsp;</code>

<code>&nbsp; &nbsp;6 |&nbsp; &nbsp; &nbsp; 0 | 18 | 0.00 | &nbsp; 0.00 | 130.81 | 131.07 | 131.07 </code>

<code>Operation COUNT_20190314-171916-543786 completed successfully in 0 seconds. </code>

<code>6 </code>

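We can add parallelism to this query by restricting it to a token range with the <code>:start</code> and <code>:end</code> placeholders, which lets dsbulk split the work across token ranges:

<code>$ dsbulk count -query "SELECT id FROM dsbulkblog.iris_with_id WHERE petal_width = 2 AND token(id) &gt; :start AND token(id) &lt;= :end ALLOW FILTERING" </code>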
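The idea behind the token-range form of the query is that the ring can be cut into contiguous ranges and the query run once per range, with the range bounds bound to the two placeholders. The sketch below only illustrates that idea by cutting the ring into equal pieces; it is not dsbulk's actual splitting logic, it assumes the Murmur3 partitioner, and <code>split_ring</code> is a hypothetical helper:

```python
MIN_TOKEN = -2 ** 63      # smallest token, assuming the Murmur3 partitioner
MAX_TOKEN = 2 ** 63 - 1   # largest token


def split_ring(parts):
    """Split the token ring into `parts` contiguous (start, end] ranges.

    The ranges are half-open, so the minimum token itself is not covered;
    this sketch assumes no row hashes to that token.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + span * i // parts for i in range(parts)]
    bounds.append(MAX_TOKEN)
    return list(zip(bounds[:-1], bounds[1:]))


# Each pair could be bound to :start and :end in one copy of the query.
for start, end in split_ring(4):
    print(start, end)
```

Because each range ends exactly where the next one begins, every row falls into one range only, so the per-range counts can simply be added together.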

<hr />
To download the DataStax Bulk Loader, click&nbsp;<a href="https://downloads.datastax.com/#bulk-loader" target="_blank">here</a>.

For an intro to unloading, read the previous Bulk Loader blog&nbsp;<a href="https://www.datastax.com/blog/2019/06/datastax-bulk-loader-unloading">here</a>.

For DataStax Bulk Loader Part 6 on&nbsp;Examples for Loading From Other Locations, go <a href="https://www.datastax.com/blog/2019/12/datastax-bulk-loader-examples-loading-other-locations">here</a>.

<img alt="dsbulk DataStax Enterprise" data-entity-type="file" data-entity-uuid="2b7b7f65-7a00-4335-a6ca-8d99d8541f92" src="https://www.datastax.com/sites/default/files/inline-images/dsbulk_3.png" />

DataStax Bulk Loader Pt. 5 — Counting
