Jonathan Ellis

Cassandra long ago moved beyond its original design of limiting reads to primary key lookups.

This additional power does bring with it some additional complexity. To make it easier to understand what Cassandra is doing under the hood, we've added request tracing to Cassandra 1.2.

<h3>Tracing an insert</h3>

As a simple example, consider this example table:

&nbsp;

<pre>
<code>CREATE TABLE test (
 a int PRIMARY KEY,
 b text
);

INSERT INTO test (a, b) VALUES (1, 'example');
INSERT INTO test (a, b) VALUES (2, 'example');
INSERT INTO test (a, b) VALUES (3, 'example');
INSERT INTO test (a, b) VALUES (4, 'example');
INSERT INTO test (a, b) VALUES (5, 'example');
INSERT INTO test (a, b) VALUES (6, 'example');
INSERT INTO test (a, b) VALUES (7, 'example');
INSERT INTO test (a, b) VALUES (8, 'example');
INSERT INTO test (a, b) VALUES (9, 'example');
INSERT INTO test (a, b) VALUES (10, 'example');
</code></pre>

&nbsp;

Now we'll enable tracing and insert a row. (These numbers are from a&nbsp;<a href="https://github.com/pcmanus/ccm">ccm</a>&nbsp;cluster with all three nodes on my macbook air on a cold JVM.)

&nbsp;

<pre>
<code>cqlsh&gt; tracing on;
Now tracing requests.

cqlsh:foo&gt; INSERT INTO test (a, b) VALUES (1, 'example');
Tracing session: 4ad36250-1eb4-11e2-0000-fe8ebeead9f9

 activity | timestamp | source | source_elapsed
-------------------------------------+--------------+-----------+----------------
 execute_cql3_query | 00:02:37,015 | 127.0.0.1 | 0
 Parsing statement | 00:02:37,015 | 127.0.0.1 | 81
 Preparing statement | 00:02:37,015 | 127.0.0.1 | 273
 Determining replicas for mutation | 00:02:37,015 | 127.0.0.1 | 540
 Sending message to /127.0.0.2 | 00:02:37,015 | 127.0.0.1 | 779

 Messsage received from /127.0.0.1 | 00:02:37,016 | 127.0.0.2 | 63
 Applying mutation | 00:02:37,016 | 127.0.0.2 | 220
 Acquiring switchLock | 00:02:37,016 | 127.0.0.2 | 250
 Appending to commitlog | 00:02:37,016 | 127.0.0.2 | 277
 Adding to memtable | 00:02:37,016 | 127.0.0.2 | 378
 Enqueuing response to /127.0.0.1 | 00:02:37,016 | 127.0.0.2 | 710
 Sending message to /127.0.0.1 | 00:02:37,016 | 127.0.0.2 | 888

 Messsage received from /127.0.0.2 | 00:02:37,017 | 127.0.0.1 | 2334
 Processing response from /127.0.0.2 | 00:02:37,017 | 127.0.0.1 | 2550
 Request complete | 00:02:37,017 | 127.0.0.1 | 2581
</code></pre>

&nbsp;

You can see that there are three distinct stages to a simple insert like this:

<ol>
	<li>The coordinator figures out which node(s) this row should be replicated to</li>
	<li>The replica (in yellow) appends the row to the commitlog, then adds it to the&nbsp;<a href="http://www.slideshare.net/pcmanus/fosdem-2012">memtable</a></li>
	<li>The coordinator receives a confirmation from the replica and tells the client that the request was successful</li>
</ol>

<h3>Tracing a sequential scan</h3>

Let's look at a more complicated example. Here I've pre-inserted ten rows like the above one.

&nbsp;

<pre>
<code>cqlsh:foo&gt; SELECT * FROM test;

 activity | timestamp | source | source_elapsed
------------------------------------------------------------------------------------------------+--------------+-----------+----------------
 execute_cql3_query | 16:36:52,749 | 127.0.0.1 | 0
 Parsing statement | 16:36:52,749 | 127.0.0.1 | 56
 Peparing statement | 16:36:52,749 | 127.0.0.1 | 259
 Determining replicas to query | 16:36:52,749 | 127.0.0.1 | 540

 Executing seq scan across 0 sstables for [min(-9223372036854775808), max(0)] | 16:36:52,750 | 127.0.0.1 | 1522
 Scanned 6 rows and matched 6 | 16:36:52,751 | 127.0.0.1 | 2062

 Sending message to /127.0.0.2 | 16:36:52,751 | 127.0.0.1 | 2292

 Message received from /127.0.0.1 | 16:36:52,751 | 127.0.0.2 | 37
 Executing seq scan across 0 sstables for (max(0), max(3074457345618258602)] | 16:36:52,752 | 127.0.0.2 | 663
 Scanned 2 rows and matched 2 | 16:36:52,755 | 127.0.0.2 | 3952
 Enqueuing response to /127.0.0.1 | 16:36:52,755 | 127.0.0.2 | 4001
 Sending message to /127.0.0.1 | 16:36:52,756 | 127.0.0.2 | 4366

 Message received from /127.0.0.2 | 16:36:52,757 | 127.0.0.1 | 8019
 Processing response from /127.0.0.2 | 16:36:52,757 | 127.0.0.1 | 8544
 Sending message to /127.0.0.3 | 16:36:52,758 | 127.0.0.1 | 9117

 Message received from /127.0.0.1 | 16:36:52,759 | 127.0.0.3 | 36
 Executing seq scan across 0 sstables for (max(3074457345618258602), max(6148914691236517204)] | 16:36:52,764 | 127.0.0.3 | 4891
 Scanned 1 rows and matched 1 | 16:36:52,765 | 127.0.0.3 | 6418
 Enqueuing response to /127.0.0.1 | 16:36:52,765 | 127.0.0.3 | 6460
 Sending message to /127.0.0.1 | 16:36:52,765 | 127.0.0.3 | 6874

 Message received from /127.0.0.3 | 16:36:52,766 | 127.0.0.1 | 17470
 Processing response from /127.0.0.3 | 16:36:52,766 | 127.0.0.1 | 17703
 Executing seq scan across 0 sstables for (max(6148914691236517204), min(-9223372036854775808)] | 16:36:52,767 | 127.0.0.1 | 18473

 Scanned 1 rows and matched 1 | 16:36:52,767 | 127.0.0.1 | 18681

 Request complete | 16:36:52,768 | 127.0.0.1 | 19293
</code></pre>

&nbsp;

This is substantially longer, since we're doing a sequential scan across the whole cluster:

<ol>
	<li>The coordinator sets up the replicas to query</li>
	<li>(blue) The first replica queries is the coordinator, which has 6 rows</li>
	<li>(yellow) 127.0.0.2 has 2 rows</li>
	<li>(green)&gt; 127.0.0.3 has 1 row</li>
	<li>(blue)&gt; A second scan (of a different replication range) on the coordinator finds 1 more row</li>
</ol>

(<a href="https://issues.apache.org/jira/browse/CASSANDRA-4858">CASSANDRA-4858</a>&nbsp;is open to merge the two queries against the coordinator to a single one.)

<h3>Tracing an indexed query</h3>

Now I'd like to examine using tracing to diagnose performance problems. Consider a simple table of users:

&nbsp;

<pre>
<code>CREATE TABLE users (
 id uuid PRIMARY KEY,
 firstname text,
 lastname text
);

CREATE INDEX ON users (firstname);
</code></pre>

&nbsp;

Now I've inserted a user named Bob Higginbigham, and 99,999 other users named Bob Smith.

&nbsp;

<pre>
<code> activity | timestamp | source | source_elapsed
--------------------------------------------------------------+--------------+-----------+----------------
 execute_cql3_query | 17:04:36,417 | 127.0.0.1 | 0
 Parsing statement | 17:04:36,417 | 127.0.0.1 | 57
 Peparing statement | 17:04:36,418 | 127.0.0.1 | 682
 Determining replicas to query | 17:04:36,424 | 127.0.0.1 | 6623
 [... removed: the coordinator scans itself... ]
 Sending message to /127.0.0.2 | 17:04:46,433 | 127.0.0.1 | 105485

 Message received from /127.0.0.1 | 17:04:46,439 | 127.0.0.2 | 29
Executing indexed scan for (max(0), max(3074457345618258602)] | 17:04:46,441 | 127.0.0.2 | 1844
Executing single-partition query on users.users_firstname_idx | 17:04:46,442 | 127.0.0.2 | 2692
 Acquiring sstable references | 17:04:46,442 | 127.0.0.2 | 2712
 Merging memtable contents | 17:04:46,442 | 127.0.0.2 | 2740
 Scanned 32137 rows and matched 1 | 17:04:57,243 | 127.0.0.2 | 112842
 Enqueuing response to /127.0.0.1 | 17:04:57,243 | 127.0.0.2 | 112861
 Sending message to /127.0.0.1 | 17:04:57,243 | 127.0.0.2 | 112938

 [... removed: the coordinator scands 127.0.0.3 ...]
 Request complete | 17:04:46,458 | 127.0.0.1 | 291378
</code></pre>

&nbsp;

I've edited this trace to focus on the replica that owns Bob Higginbigham.

Note how Cassandra has to scan all 30,000+ rows (all 100,000 including the other machines) to find the Bob we're looking for, since we only have an index on firstname. Moral: Cassandra isn't black magic, you still need to create appropriate indexes to get the performance you want.

<h3>Tracing a queue antipattern</h3>

Here's a more subtle one.

It is tempting to use Cassandra as a durable queue, e.g.

&nbsp;

<pre>
<code>CREATE TABLE queues (
 id text,
 created_at timeuuid,
 value blob,
 PRIMARY KEY (id, created_at)
);
</code></pre>

&nbsp;

This relies on&nbsp;<a href="https://www.datastax.com/dev/blog/schema-in-cassandra-1-1">Cassandra's clustering</a>&nbsp;within a partition (where the partition is the queue id) to order queue entries by creation time. Then grabbing the most recent queue entry is just&nbsp;<tt>SELECT FROM queues WHERE id = 'myqueue' ORDER BY created_at LIMIT 1</tt>.

Here's what that looks like after creating and removing 100,000 entries:

&nbsp;

<pre>
<code> activity | timestamp | source | source_elapsed
------------------------------------------+--------------+-----------+----------------
 execute_cql3_query | 19:31:05,650 | 127.0.0.1 | 0
 Sending message to /127.0.0.3 | 19:31:05,651 | 127.0.0.1 | 541

 Messsage received from /127.0.0.1 | 19:31:05,651 | 127.0.0.3 | 39
 Executing single-partition query | 19:31:05,652 | 127.0.0.3 | 943
 Acquiring sstable references | 19:31:05,652 | 127.0.0.3 | 973
 Merging memtable contents | 19:31:05,652 | 127.0.0.3 | 1020
 Merging data from memtables and sstables | 19:31:05,652 | 127.0.0.3 | 1081
 Read 1 live cells and 100000 tombstoned | 19:31:05,686 | 127.0.0.3 | 35072
 Enqueuing response to /127.0.0.1 | 19:31:05,687 | 127.0.0.3 | 35220
 Sending message to /127.0.0.1 | 19:31:05,687 | 127.0.0.3 | 35314

 Messsage received from /127.0.0.3 | 19:31:05,687 | 127.0.0.1 | 36908
 Processing response from /127.0.0.3 | 19:31:05,688 | 127.0.0.1 | 37650
 Request complete | 19:31:05,688 | 127.0.0.1 | 38047
</code></pre>

&nbsp;

Take a look in the middle here: "Read 1 live cells and 100000 tombstoned."

Because Cassandra uses a&nbsp;<a href="http://www.slideshare.net/pcmanus/fosdem-2012">log-structured storage engine</a>, deletes do not immediately remove all traces of a row. Instead, Cassandra writes a deletion marker called a tombstone that supresses the old data until it can be compacted away.

So what we see in this trace is Cassandra having to read past all the older, deleted entries, before it gets to one that is still alive -- a sign that you need to rethink your data model.

<h3>Availability</h3>

Tracing is part of Cassandra 1.2. Beta 2 is&nbsp;<a href="http://cassandra.apache.org/download/">ready for testing</a>; we expect the final release before the end of the year.

Request tracing in Cassandra 1.2

Jonathan EllisTechnology

Share

Share

Tracing an insert

Tracing a sequential scan

Tracing an indexed query

Tracing a queue antipattern

Availability

More Technology

How to Build a Crystal Image Search App with Vector Search

Knowledge Graphs for RAG without a GraphDB

How Winweb Built its AI Assistant with DataStax Astra DB and LangChain

Vercel + Astra DB: Get Data into Your GenAI Apps Fast

One-stop Data API for Production GenAI