DataStax Developer Blog

Cassandra anti-patterns: Queues and queue-like datasets

By Aleksey Yeschenko -  April 26, 2013 | 7 Comments

Deletes in Cassandra

Cassandra uses a log-structured storage engine. Because of this, deletes do not remove the rows and columns immediately and in-place. Instead, Cassandra writes a special marker, called a tombstone, indicating that a row, column, or range of columns was deleted. These tombstones are kept for at least the period of time defined by the gc_grace_seconds per-table setting. Only then can a tombstone be permanently discarded by compaction.

This scheme allows for very fast deletes (and writes in general), but it’s not free: aside from the obvious RAM/disk overhead of tombstones, you might have to pay a certain price when reading data back if you haven’t modelled your data well.

Specifically, tombstones will bite you if you do lots of deletes (especially column-level deletes) and later perform slice queries on rows with a lot of tombstones.
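The tombstone lifecycle described above can be sketched as a toy model. This is illustrative Python only, not Cassandra's actual storage format; the function names and the flat-log representation are my own assumptions:

```python
# Toy model of log-structured deletes: a delete appends a tombstone
# marker instead of removing data in place. A tombstone survives
# compaction until gc_grace_seconds have elapsed since the delete.
GC_GRACE_SECONDS = 864000  # Cassandra's default: 10 days

def delete_column(log, row, column, now):
    """Append a tombstone; the original value is not touched."""
    log.append((row, column, "TOMBSTONE", now))

def compact(log, now):
    """Drop shadowed values, and drop tombstones older than gc_grace."""
    tombstoned = {(r, c) for (r, c, v, t) in log if v == "TOMBSTONE"}
    kept = []
    for (r, c, v, t) in log:
        if v == "TOMBSTONE":
            if now - t < GC_GRACE_SECONDS:   # still within grace period
                kept.append((r, c, v, t))
        elif (r, c) not in tombstoned:       # value not deleted
            kept.append((r, c, v, t))
    return kept

log = [("row1", "a", "x", 0), ("row1", "b", "y", 0)]
delete_column(log, "row1", "a", now=100)
# Right after the delete, the tombstone itself survives compaction:
assert len(compact(log, now=200)) == 2  # column "b" plus the tombstone
# Once gc_grace has elapsed, compaction can discard it for good:
assert len(compact(log, now=100 + GC_GRACE_SECONDS + 1)) == 1
```

Note that until compaction finally discards them, tombstones occupy space and, as the rest of the post shows, slow down reads that have to scan past them.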

Symptoms of a wrong data model

To illustrate this scenario, let’s consider the most extreme case – using Cassandra as a durable queue, a known anti-pattern, e.g.

CREATE TABLE queues (
    name text,
    enqueued_at timeuuid,
    payload blob,
    PRIMARY KEY (name, enqueued_at)
);

Having enqueued 10000 10-byte messages and then dequeued 9999 of them, one by one, let’s peek at the last remaining message using cqlsh with TRACING ON:

SELECT enqueued_at, payload
  FROM queues
 WHERE name = 'queue-1'
 LIMIT 1;

activity                                   | source    | elapsed
-------------------------------------------+-----------+--------
                        execute_cql3_query | 127.0.0.3 |       0
                         Parsing statement | 127.0.0.3 |      48
                       Preparing statement | 127.0.0.3 |     362
             Sending message to /127.0.0.1 | 127.0.0.3 |     718
          Message received from /127.0.0.3 | 127.0.0.1 |      42
Executing single-partition query on queues | 127.0.0.1 |     145
              Acquiring sstable references | 127.0.0.1 |     158
                 Merging memtable contents | 127.0.0.1 |     189
Merging data from memtables and 0 sstables | 127.0.0.1 |     235
    Read 1 live and 19998 tombstoned cells | 127.0.0.1 |  251102
          Enqueuing response to /127.0.0.3 | 127.0.0.1 |  252976
             Sending message to /127.0.0.3 | 127.0.0.1 |  253052
          Message received from /127.0.0.1 | 127.0.0.3 |  324314
       Processing response from /127.0.0.1 | 127.0.0.3 |  324535
                          Request complete | 127.0.0.3 |  324812

Now even though the whole row was still in memory, the request took more than 300 milliseconds (all the numbers are from a 3-node ccm cluster running on a 2012 MacBook Air).

Why did the query take so long to complete?

A slice query will keep reading columns until one of the following conditions is met (assuming regular, non-reversed order):

  • the specified limit of live columns has been read
  • a column beyond the finish column has been read (if specified)
  • all columns in the row have been read
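The stop conditions above can be sketched as a simple read loop. This is an illustrative model, not Cassandra internals; the function and variable names are my own:

```python
# Sketch of the slice-read loop: the reader walks cells in sorted order,
# collecting tombstones in memory, and can only stop once it has found
# `limit` live cells, passed the finish column, or exhausted the row.
def slice_query(cells, limit, start=None, finish=None):
    live, tombstones_scanned = [], 0
    for name, value in cells:                 # cells sorted by name
        if start is not None and name <= start:
            continue                          # skip cells below the start column
        if finish is not None and name > finish:
            break                             # past the finish column
        if value is None:                     # None models a tombstone
            tombstones_scanned += 1
        else:
            live.append((name, value))
            if len(live) == limit:
                break                         # enough live cells read
    return live, tombstones_scanned

# 9999 consumed (tombstoned) messages followed by one live message:
row = [(i, None) for i in range(9999)] + [(9999, "payload")]
live, dead = slice_query(row, limit=1)
assert live == [(9999, "payload")] and dead == 9999
# With a start column just below the live entry, no tombstones are scanned:
live, dead = slice_query(row, limit=1, start=9998)
assert live == [(9999, "payload")] and dead == 0
```

The second call previews the "start column" mitigation discussed later in the post: the scan begins near the live data and never touches the tombstones at all.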

In the previous scenario Cassandra had to read 9999 tombstones (and create 9999 DeletedColumn objects) before it could get to the only live entry. And all the collected tombstones 1) were consuming heap and 2) had to be serialised and sent back to the coordinator node along with the single live column.

For comparison, it took less than 1 millisecond for the same query to complete when no column-level tombstones were involved.

The queue example might be extreme, but you’ll see the same behaviour when performing slice queries on any row with lots of deleted columns. Also, expiring columns, while more subtle, are going to have the same effect on slice queries once they expire and become tombstones.

Potential workarounds

If you are seeing this pattern (have to read past many deleted columns before getting to the live ones), chances are that you got your data model wrong and must fix it.

For example, consider partitioning data with a heavy churn rate into separate rows and deleting entire rows when you no longer need them. Alternatively, partition it into separate tables and truncate them when they aren’t needed anymore.

In other words, if you use column-level deletes (or expiring columns) heavily and also need to perform slice queries over that data, try grouping columns with close ‘expiration date’ together and getting rid of them in a single move.
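One common way to group columns with a close 'expiration date' is to derive the partition key from a coarse time bucket, so that a whole bucket can be dropped with a single partition-level delete once it has been fully consumed. The bucketing scheme below is an illustrative assumption, not a prescription from the post:

```python
from datetime import datetime, timezone

# Derive the partition key from a time bucket, so data with a similar
# lifetime lands in the same partition and can be deleted in one move.
BUCKET_SECONDS = 3600  # one partition per hour of data (tunable)

def bucket_key(queue_name: str, ts: datetime) -> str:
    bucket = int(ts.timestamp()) // BUCKET_SECONDS
    return f"{queue_name}:{bucket}"

ts1 = datetime(2013, 4, 26, 10, 15, tzinfo=timezone.utc)
ts2 = datetime(2013, 4, 26, 10, 45, tzinfo=timezone.utc)
ts3 = datetime(2013, 4, 26, 11, 5, tzinfo=timezone.utc)
# Messages from the same hour land in the same partition...
assert bucket_key("queue-1", ts1) == bucket_key("queue-1", ts2)
# ...while the next hour starts a fresh one, deletable as a unit.
assert bucket_key("queue-1", ts1) != bucket_key("queue-1", ts3)
```

A whole-partition delete (or a table truncate) produces a single marker instead of one tombstone per column, which is exactly what keeps slice queries from drowning in tombstones.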

When you know where your live columns begin

Note that it’s possible to improve on this hypothetical queue scenario. Specifically, when it knows what the last consumed entry was, a consumer can specify the start column and thus somewhat mitigate the effect of tombstones by not having to 1) start scanning at the beginning of the row or 2) collect and keep all the irrelevant tombstones in memory.

To show what I mean, let’s modify the original example by using the previously consumed entry’s key as the start column for the query, i.e.

SELECT enqueued_at, payload
  FROM queues
 WHERE name = 'queue-1'
   AND enqueued_at > 9d1cb818-9d7a-11b6-96ba-60c5470cbf0e
 LIMIT 1;

activity                                   | source    | elapsed
-------------------------------------------+-----------+--------
                        execute_cql3_query | 127.0.0.3 |       0
                         Parsing statement | 127.0.0.3 |      45
                       Preparing statement | 127.0.0.3 |     329
             Sending message to /127.0.0.1 | 127.0.0.3 |     965
          Message received from /127.0.0.3 | 127.0.0.1 |      34
Executing single-partition query on queues | 127.0.0.1 |     339
              Acquiring sstable references | 127.0.0.1 |     355
                 Merging memtable contents | 127.0.0.1 |     461
 Partition index lookup over for sstable 3 | 127.0.0.1 |    1122
Merging data from memtables and 1 sstables | 127.0.0.1 |    2268
        Read 1 live and 0 tombstoned cells | 127.0.0.1 |    4404
          Enqueuing response to /127.0.0.3 | 127.0.0.1 |    4492
             Sending message to /127.0.0.3 | 127.0.0.1 |    4606
          Message received from /127.0.0.1 | 127.0.0.3 |    6109
       Processing response from /127.0.0.1 | 127.0.0.3 |    6608
                          Request complete | 127.0.0.3 |    6901

Despite reading from disk this time, the complete request took only 7 milliseconds. Specifying a start column allowed Cassandra to start scanning the row close to the actual live column and to skip collecting the tombstones altogether. The difference grows larger as the row gets bigger.

Summary

  • Lots of deleted columns (also expiring columns) and slice queries don’t play well together. If you observe this pattern in your cluster, you should correct your data model.
  • If you know where your live data begins, hint Cassandra with a start column to reduce the scan times and the number of tombstones collected.
  • Do not use Cassandra to implement a durable queue.


Comments

  1. It’s interesting someone from DataStax comes out with a post that comes across as stating that Cassandra isn’t suitable for any transient storage need where the data is held in Cassandra only for a limited time until it’s removed. That doesn’t only apply to things like queues (a lot of data structures look like queues) but also to transient analytics data containers where data is held for a few hours or days and then evicted due to the data being simply worthless in the context.

    I’d be interested to hear whether you indeed mean to make a statement in an absolute fashion that data scenarios where the stored information is of transient nature are a mismatch for Cassandra, at all.

    I would suggest that the way you’re approaching building a queue here is a fairly naïve approach. Doing a range query over a queue and clipping that off after the first result just isn’t how brokers (that usually front queues) do that sort of a job. A broker will usually have some notion of what’s next in the sequence and thus be able to do much more targeted queries, down to a single record if the storage strategy were to choose monotonic sequence numbers.

    So as a follow up I’d be keenly interested in how Cassandra behaves when you treat it ‘like a queue’ where it gets a turnover of 5+ TB of analytics data per, say, 24h where each record has a 24h TTL. That would be about the effects of deletions.

  2. Aleksey Yeschenko says:

    I’d be interested to hear whether you indeed mean to make a statement in an absolute fashion that data scenarios where the stored information is of transient nature are a mismatch for Cassandra, at all.

    No I don’t. If it were true, we wouldn’t have bothered with supporting expiring columns. Deletes aren’t bad by themselves, you only get issues when high churn rate meets inefficient access patterns.

    I would suggest that the way you’re approaching building a queue here is a fairly naïve approach.

    This is the whole point. It’s an illustration.

    So as a follow up I’d be keenly interested in how Cassandra behaves when you treat it ‘like a queue’ where it gets a turnover of 5+ TB of analytics data per, say, 24h where each record has a 24h TTL. That would be about the effects of deletions.

    It’s going to behave just fine, or terribly, depending on your particular queries.

  3. Thank you. I’m still puzzled why you’d summarize the post by saying not to build a queue, while the point here seems to be that it’s more than feasible if you’re doing it right.

  4. DuyHai DOAN says:

    And what’s about setting gc_grace_seconds = 0 and using LeveledCompaction ?

  5. Mayur Patel says:

    In the ‘workarounds’ section, it’s indicated that a full row deletion and/or truncates can help to minimize the tombstone buildup. I’m not sure why this is the case. In the example, it seems that full row deletions are contributing to the problem (as opposed to column deletion with TTL or some such). I’m not clear at all why truncates would avoid the problem — moreover would truncates not result in a buildup of unnecessary snapshots?

  6. Jason says:

    When you mention that this is a well known anti-pattern, you don’t mention where this has been mentioned before. In looking at the Cassandra anti-patterns, I did not see the durable queue mentioned. It would be a good idea to also mention a way to do a durable queue with Cassandra such as with time series data. Right now this post is sending mixed messages.

    1. Aleksey Yeschenko says:

      Not a mixed message. Cassandra is good at time series data, and used a lot for that use case. Time series patterns, however, rarely do deletions, so everything works fine.

      It’s queues (when you do slice queries *and* deletions) that are an anti-pattern and shouldn’t be modeled with Cassandra.
