Improved Cassandra 2.1 Stress Tool: Benchmark Any Schema – Part 1

By Jake Luciani -  July 31, 2014 | 34 Comments

This is the first part of a two-part series on the new stress tool.

Background

The choices you make when data modeling your application can make a big difference in how it performs. How many times have you seen database benchmarks that look impressive in an article, only to leave you disappointed and confused when you try them with your own data and schema?  Getting a proper understanding of how a database scales, and capacity planning your application, requires significant effort in load testing.  Most importantly, understanding the tradeoffs you are making in your data model and settings requires multiple iterations before you get it right.

To make testing data models simpler, we have extended the cassandra-stress tool in Cassandra 2.1 to support stress testing arbitrary CQL tables and arbitrary queries on those tables.  We think it will be a very useful tool for users who want to quickly see how a schema will perform.  And it will help us on the Cassandra team diagnose and fix performance problems and other issues from a single tool.  Although this tool ships with the Cassandra 2.1 release, it also works against Cassandra 2.0 clusters.

In this post I’ll explain how to create a CQL-based stress profile and how to execute it. Finally, I’ll cover some of the current limitations.

The new stress YAML profile

cassandra-stress now supports a YAML-based profile that lets you define your specific schema, with whatever compaction strategy, cache settings, and types you wish, without having to write a custom tool.

The YAML file is split into a few sections:

  1. DDL – for defining your schema
  2. Column Distributions – for defining the shape and size of each column globally and within each partition
  3. Insert Distributions – for defining how the data is written during the stress test
  4. DML – for defining how the data is queried during the stress test
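In terms of actual YAML keys, these four sections map onto a handful of top-level entries. The outline below is a sketch based on the key names used by the cqlstress-example.yaml that ships with Cassandra 2.1; the values are placeholders and get filled in section by section below.

keyspace: ...            # 1. DDL: keyspace and table definitions
keyspace_definition: ...
table: ...
table_definition: ...

columnspec:              # 2. Column distributions
  - ...

insert:                  # 3. Insert distributions
  ...

queries:                 # 4. DML: named queries to stress
  ...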

To help explain the file, let’s define one that models a simple app holding blog posts for multiple websites, with the posts ordered in reverse chronological order.

DDL

The DDL section is straightforward: just define the keyspace and table information.  If the schema is not yet defined, the stress tool will create it the first time you run stress with this profile.  If you have already created the schema separately, then you only need to define the keyspace and table names.

[Screenshot: DDL section of the blogpost.yaml profile]
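For our blog-post example, the DDL section looks roughly like this. This is a sketch modeled on the cqlstress-example.yaml that ships with Cassandra 2.1, so the exact keyspace name, replication settings, and table options here are assumptions rather than the definitive profile:

keyspace: stresscql

keyspace_definition: |
  CREATE KEYSPACE stresscql WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

table: blogposts

table_definition: |
  CREATE TABLE blogposts (
        domain text,
        published_date timeuuid,
        url text,
        author text,
        title text,
        body text,
        PRIMARY KEY(domain, published_date)
  ) WITH CLUSTERING ORDER BY (published_date DESC)
    AND compaction = { 'class':'LeveledCompactionStrategy' }
    AND comment = 'A table to hold blog posts'

Note that CLUSTERING ORDER BY (published_date DESC) is what gives us the reverse chronological ordering of posts within a domain.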

Column Distributions

Next, the ‘columnspec’ section describes the distributions to use for each column.  These distributions model the size of the data in the column, the number of unique values, and the clustering of those values within a given partition.  They are used to auto-generate data that “looks” like what you would see in reality.  The actual data is garbage, but it is reproducible and procedurally generated.

The possible distributions are:

  • EXP(min..max) – an exponential distribution over the range [min..max]
  • EXTREME(min..max, shape) – an extreme value (Weibull) distribution over the range [min..max]
  • GAUSSIAN(min..max, stdvrng) – a Gaussian/normal distribution, where mean = (min+max)/2 and stdev = (mean-min)/stdvrng
  • GAUSSIAN(min..max, mean, stdev) – a Gaussian/normal distribution with an explicitly defined mean and stdev
  • UNIFORM(min..max) – a uniform distribution over the range [min..max]
  • FIXED(val) – a fixed distribution that always returns the same value

NOTE: If you prefix a distribution with ~, it is inverted.

For each column you can specify (note the defaults):

  • Size distribution – Defines the distribution of sizes for text, blob, set, and list types (default UNIFORM(4..8))
  • Population distribution – Defines the distribution of unique values for the column (default UNIFORM(1..100B))
  • Cluster distribution – Defines the distribution of the number of clustering prefixes within a given partition (default FIXED(1))

In our example it makes sense to size the fields according to their real-world limits: blog posts tend to have large bodies, and most blogs have at most a thousand posts.

[Screenshot: columnspec section of the blogpost.yaml profile]
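A sketch of the columnspec for this table, in the same style as cqlstress-example.yaml. The exact distributions are assumptions chosen to match the sizing rationale above; any column not listed keeps the defaults:

columnspec:
  - name: domain
    size: gaussian(5..100)       # domain names are fairly short
    population: uniform(1..10M)  # up to 10M distinct domains

  - name: published_date
    cluster: fixed(1000)         # at most a thousand posts per domain

  - name: url
    size: uniform(30..300)

  - name: title
    size: gaussian(10..200)      # titles stay under ~200 chars

  - name: author
    size: uniform(5..20)         # author names are short

  - name: body
    size: gaussian(100..5000)    # post bodies are large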

 

Insert Distributions

The insert section lets you specify how data is inserted during stress.  This gets a little tricky to think about, but it’s pretty straightforward once you grasp it.

For each insert operation you can specify the following distributions/ratios:

  • Partition distribution
    • The number of partitions to update per batch (default FIXED(1))
  • Select distribution ratio
    • The ratio of rows each partition should insert as a proportion of the total possible rows for the partition (as defined by the clustering distribution columns). default FIXED(1)/1
  • Batch type
    • The type of CQL batch to use. Either LOGGED/UNLOGGED (default LOGGED)

In our example it makes sense to insert only a single blog post at a time, to a single domain.

[Screenshot: insert section of the blogpost.yaml profile]
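A sketch of the insert section for this model (the ratios and batch type are assumptions consistent with the description above):

insert:
  partitions: fixed(1)      # update a single domain (partition) per batch
  select: fixed(1)/1000     # insert 1 of the up to 1000 possible rows per operation
  batchtype: UNLOGGED       # single-partition batches don't need the batch log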

DML

You can specify any CQL queries on the table by naming them under the ‘queries’ section.

The ‘fields’ field specifies whether the bind variables should be picked from the same row or across all rows in the partition.

In our example we may want to see how it performs to fetch the most recent post for a domain, as well as the meta-information of the previous 10 posts to show in a timeline view.

[Screenshot: queries section of the blogpost.yaml profile]
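The two queries exercised below, singlepost and timeline, would be defined along these lines (the exact CQL is a sketch; ‘samerow’ binds all the variables from the same generated row):

queries:
  singlepost:
    cql: select * from blogposts where domain = ? LIMIT 1
    fields: samerow
  timeline:
    cql: select url, title, published_date from blogposts where domain = ? LIMIT 10
    fields: samerow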

Putting it all together

Now that we have our profile, we can run it with the following commands.  The complete YAML and results are located here.

Inserts:

./bin/cassandra-stress user profile=./blogpost.yaml ops\(insert=1\)

Without any other options, stress will run our inserts starting with 4 threads and increasing the thread count until it reaches a limit. All inserts are done over the native transport with prepared statements. The full list of cassandra-stress features is listed under the help command.
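If you would rather pin the run down than let stress ramp threads automatically, the standard option groups apply to the user command as well. Something along these lines fixes the operation count, thread count, and contact points (the node addresses here are placeholders):

./bin/cassandra-stress user profile=./blogpost.yaml ops\(insert=1\) n=1000000 -rate threads=50 -node 192.168.1.101,192.168.1.102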

On my laptop this was ~8,500 inserts/s with 401 threads. This is significantly slower than the default stress, but we don’t expect it to be as fast since this is >1 KB per insert.

Queries:

./bin/cassandra-stress user profile=blogpost.yaml ops\(singlepost=1\)

Reading a single post yields ~7,000 queries/sec.

./bin/cassandra-stress user profile=./blogpost.yaml ops\(timeline=1\)

Reading a timeline yields ~7,000 queries/sec, but ~25,000 CQL rows/sec, since there are multiple rows per domain.

Mixed:

./bin/cassandra-stress user profile=./blogpost.yaml ops\(singlepost=2,timeline=1,insert=1\)

We can also run many types of queries and inserts at once.  This syntax sends three queries for every one insert.

Other YAML examples

Cassandra 2.1 comes with three sample YAML files in the tools directory with more advanced examples:

https://github.com/apache/cassandra/tree/cassandra-2.1/tools

Limitations/improvements

The new stress tool covers a lot of use cases, but there are some things it can’t do.  We plan to address these in future releases:

  • Doesn’t support map types or user-defined types
  • Indexes must be manually added to your tables

Some of the features we wish to add are:

  • Random sentence generation, instead of random strings, to more accurately test the effect of compression.
  • More control over read and write patterns, like querying only the most recently added partitions.

To be continued…

This post covered some of the basics of the new stress tool; in the next post we will cover a more advanced example.



Comments

  1. giacomo says:

    Well… what can I say? This is just great! Thanks!

  2. Andrew Tolbert says:

    This is *amazing*. Quite an advancement over the existing cassandra stress tool (which is very useful as is). Really nice way to test your data models without having to write a script to write/read data. I’m super excited to play with this. Thanks!

  3. Ma says:

    Do you have a sample yaml file that I can use to only run select statements?

    Based on the github post it says “Remember that you must perform inserts before performing reads or range slices.” What if my keyspace already has data in it?

    Thanks

    1. Jake Luciani says:

      In order for the system to find data it must know what the possible data values are. So you can’t read existing data; the data must be generated by the stress tool.

      1. Ma says:

        Thanks, that makes sense. I was getting output, but for the Partitions column I was getting 0…probably indicating something is wrong?

        Is the partitions column essentially # of records being read/written to?

        Also, I have multiple tables per keyspace, but if I am reading only from one table do I need to include the entire keyspace definition including all tables, or only the one I’m stress testing against?

        Thanks

        1. Jake Luciani says:

          Partitions are the primary key of the table. The keyspace definition is optional if it already exists; just specify the name.

          1. Ma says:

            I notice the inserts are all garbage text…is there a way I can add my own content during insert?

            I see this:

            name
            ———-
            yx^4
            T}
            }$<1oP

            Can I add something like this:

            name
            ———-
            abcdefg
            abcdefg
            abcdefg

            Thanks!

  4. Ra says:

    In GitHub, I see an example file query1.txt which was generated by:
    ./bin/cassandra-stress user profile=blogpost.yaml ops\(singlepost=1\)

    So in this case inserts are not done before reads. Can you please explain how the tool worked in this scenario.

    You can take a look at the file I am talking about here:
    https://gist.github.com/tjake/fb166a659e8fe4c8d4a3

    1. Jake Luciani says:

      The insert command was run first… followed by the query

      1. Sebastian says:

        How do you make sure that generated read queries only query data that is actually in the database? Referring to the example posted here: can you be sure your generated queries only look for domain values that are actually in the database? If you cannot be sure, are queries for non-existing data included in the statistics?

        1. Sebastián Estévez says:

          @Sebastián, if you use the same seed, you can generate the same “random” values in sequence.

  5. Roger says:

    Is it possible to parameterize INSERT statements?

    E.g. if I have a columnspec with name ‘username’ and I want to use that variable in my insert statement

    Like:
    INSERT INTO account(username) values() IF NOT EXISTS;

  6. sam says:

    How do i integrate the new stress tool into my production cluster which runs dse 4.5??

  7. sai says:

    How do i integrate the new stress-test tool with dse-4.5 in our production cluster?

  8. Vega says:

    You used to be able to do -b for batch and I just don’t understand how that translates into 2.1. Can someone give an example of how to insert in batches of 1000?

  9. Suraj says:

    We use a tool called RowGen (from IRI) to generate our ‘big’ test data from DDL or metadata we define. We connect to DBs in Eclipse to connect to source/target tables and build flat-files for loads, et al. Has anyone built stress/benchmark data in this environment with RowGen, and do you have any pointers?

  10. Pawan says:

    Hi,

    I ended up getting an error:

    “Application does not allow arbitrary arguments:write,yaml=cqlstress-example.yaml”

    command used:
    cassandra-stress write -schema yaml=cqlstress-example.yaml

    Can you please help

  11. infomaven says:

    I’m getting the following error when I try to run the commands from this article. How to fix this?

    infomav:tools nwhit8$ ./bin/cassandra-stress user profile=blogpost.yaml ops\(singlepost=1\)

    Error: Could not find or load main class org.apache.cassandra.stress.Stress
    infomav:tools:tools infomav:tools$

    1. infomaven says:

      Project was built from source using ant v1.9.4

  12. infomaven says:

    I found the problem: I did not use the correct command in Ant. I was using *build* instead of *release*.

    You can delete my comment.

  13. vaibhav misra says:

    Are you planning to come up with the 2nd part of this series soon or will we have to suffice with just this one?

  14. Abhishek Patel says:

    I am trying to run the following command:

    $ sh cassandra-stress counter_write profile=../epcountertest.yaml

    but this gives me the following error

    Invalid parameter profile=../epcountertest.yaml

    I am using cassandra-2.1.8
    Please help.

    1. ml says:

      cassandra-stress counter_write user profile=../epcountertest.yaml

      try putting in user in front of profile

  15. Barsha says:

    I am getting error InvalidRequestException(why:Invalid version for TimeUUID type.) on executing this.
    It works if I drop the timeuuid column. What should I do to include timeuuid columns?

  16. csea says:

    Is there any official document that describes the method in detail? There are still some things I don’t understand.

  17. kant says:

    How to define multiple table definitions with multiple column specs within the same keyspace in the yaml file?

    1. Shuabham Baldava says:

      Hi Kant,

      I have the same problem. Did you get any solution for this ?

      Thanks in advance.

  18. Shuabham Baldava says:

    How to define multiple table definitions with multiple column specs within the same keyspace in the yaml file?

  19. Ankit Saran says:

    I get a No Host Available exception when I run the command: ./bin/cassandra-stress user profile=./blogpost.yaml ops\(insert=1\).

    Can you please help with the same?

    1. mehnaaz says:

      this is probably happening because the version of Cassandra is different from the one you are running stress against.

  20. Oded says:

    How do I define multiple table definitions and multiple column specs for cassandra-stress profile file?

    You can’t.

    http://stackoverflow.com/questions/34889485/how-do-i-define-multiple-table-definitions-and-multiple-column-specs-for-cassand

  21. Joseph Wang says:

    where do we download the tool?

  22. Miraj says:

    I am trying to evaluate a data model with wide rows. What’s happening is cassandra-stress is not generating enough unique values for clustering columns and hence, instead of inserting new cells, it starts rewriting to the old cells.

    name: column1
    size: fixed(32)
    population: gaussian(1..10000000)
    cluster: fixed(40)

    ^ Above is one of the four columns in composite clustering key. From the specs, it should generate 40 unique values for column1 for one partition. It is hardly generating 5-6. Same issue with other three columns.

    Is this a limitation of cassandra-stress or is something wrong with the yaml?

  23. Mick says:

    Jake,
    The following text

    “The ratio of rows each partition should insert as a proportion of the total possible rows for the partition (as defined by the clustering distribution columns). default FIXED(1)/1”

    I believe is inaccurate (or ambiguous).
    cassandra-stress has no way of keeping track of the total rows within a partition over a cassandra-stress invocation.

    It would be more accurate to say:

    The ratio of rows each partition should insert as a proportion of the total possible rows for the partition within a request (as defined by the clustering distribution columns). default FIXED(1)/1
