Brian Hess joins the show to explain why the bulk loader is a vital tool for a distributed database, the history of bulk loaders for Apache Cassandra, and the virtues of the new DSBulk.

Highlights!

0:15 - Jeff welcomes Brian Hess to the show and discusses the scalability of sweater vest clusters

1:09 - Why bulk loading is a capability that people just assume exists for all databases

2:29 - Existing tools for bulk loading for Cassandra / DSE include: 1) the cqlsh COPY TO / FROM command - which doesn’t scale or handle errors well

3:31 - 2) Cassandra’s sstableloader can be used to load data but isn’t really a bulk loader.

4:21 - 3) People have also used Spark and the DSE Spark Connector to load data

4:53 - 4) Brian wrote his own “Cassandra loader” open source project using CQL

6:32 - Introducing DSBulk, a brand new bulk loader which builds on lessons learned from Cassandra loader

7:25 - Features include loading/unloading from JSON or CSV, number/data formats, and security

8:41 - Supported transformations include the now() function, not case management. The tool operates via stdin/stdout so that you can chain results with tools like sed and awk

9:52 - Unloading features include column selection and filtering / limiting

11:24 - What makes DSBulk a superior tool: 1) high performance (4x faster than cqlsh COPY)

13:12 - 2) Error handling including the ability to isolate errors and continue, and a dry run mode

16:08 - 3) Ease of use and configurability

16:42 - Challenging parts of building the driver were handling some offbeat use cases, getting the user experience right, and prioritizing features for this first release 

20:02 - DSBulk is a distinct tool from the DSE Graph Loader - at least for now

22:13 - Brian’s shout outs to the DSBulk team

Speakers

Brian Hess

Strategic Solution Engineer at DataStax

Jeffrey Carpenter

Director of Developer Advocacy at DataStax