Jonathan Ellis

<h3>Wasteful Bloom filter allocation</h3>

<p>Compaction is the process whereby Cassandra&nbsp;<a href="https://wiki.apache.org/cassandra/MemtableSSTable">merges its log-structured data files</a>&nbsp;to evict obsolete or deleted rows. These data files (sstables) are composed of&nbsp;<a href="https://legacy-datastax-corporate.pantheonsite.io/documentation/cassandra/2.0/webhelp/cassandra/dml/dml_about_reads_c.html">several components</a>&nbsp;to make reads efficient.</p>

<p>The first component that gets&nbsp;<a href="https://legacy-datastax-corporate.pantheonsite.io/documentation/cassandra/2.0/webhelp/cassandra/dml/dml_about_reads_c.html">consulted on a read</a>&nbsp;is the Bloom filter. A&nbsp;<a href="http://en.wikipedia.org/wiki/Bloom_filter">Bloom filter</a>&nbsp;is a probabilistic set that takes just a few bits per key stored, and is thus much more memory-efficient than actually storing the partition keys themselves. The bloom filter takes 1-2GB of memory per billion partitions. (By default, Cassandra&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-5029">uses a smaller bloom filter for Leveled compaction</a>&nbsp;since the&nbsp;<a href="https://legacy-datastax-corporate.pantheonsite.io/dev/blog/leveled-compaction-in-apache-cassandra">leveling</a>&nbsp;means we expect to consult the bloom filter less often for sstables that don't contain the partition in question.)</p>
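<p>As a rough sanity check on that figure, the standard sizing formula for an optimally configured Bloom filter gives the bits needed per key from a target false-positive rate. The sketch below is illustrative only: the function names and the 1% and 10% targets are assumptions chosen for the example, not values read from Cassandra's code.</p>

```python
import math


def bloom_bits_per_key(false_positive_rate):
    """Bits per key for an optimally sized Bloom filter: -ln(p) / (ln 2)^2."""
    return -math.log(false_positive_rate) / (math.log(2) ** 2)


def bloom_size_gb(num_keys, false_positive_rate):
    """Total filter size in decimal gigabytes for num_keys keys."""
    total_bits = num_keys * bloom_bits_per_key(false_positive_rate)
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    # Hypothetical targets: a tighter filter and a looser (smaller) one.
    for target in (0.01, 0.1):
        print(target, round(bloom_bits_per_key(target), 2),
              round(bloom_size_gb(1_000_000_000, target), 2))
```

<p>At a 1% target this works out to roughly 9.6 bits per key, or about 1.2GB per billion keys, which is consistent with the range quoted above. A looser target needs proportionally fewer bits.</p>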

<p>Our first big concern was&nbsp;<a href="https://legacy-datastax-corporate.pantheonsite.io/dev/blog/performance-improvements-in-cassandra-1-2">moving this memory off-heap</a>&nbsp;to support larger data sets. With that done, we're looking at other ways to improve this.</p>

<p>One big gain would be avoiding unnecessary worst-case bloom filter allocations. That is, given two initial sstables that don't overlap at all, the result of the compaction could be this:</p>
<img alt="bloom filter" data-align="center" data-entity-type="file" data-entity-uuid="25c8dfc7-2172-4337-bf36-0dd38313dc84" src="https://www.datastax.com/sites/default/files/inline-images/Screen-Shot-2014-01-27-at-8.50.42-PM-700x446.png" />
<p>or, it could be this if they overlap entirely:</p>
<img alt="overlapping bloom filter" data-align="center" data-entity-type="file" data-entity-uuid="c9012f94-cb24-49ed-82dd-84af004e6eee" src="https://www.datastax.com/sites/default/files/inline-images/Screen-Shot-2014-01-27-at-8.51.39-PM-700x449.png" />
<p>Or, it could be anywhere in between.</p>

<p>Because bloom filters are not resizable, we need to pre-allocate them at the start of the compaction, but at that point we don't know how much the sstables being compacted overlap. Since bloom filter performance deteriorates dramatically when over-filled, we allocate our bloom filters to be large enough even if the sstables do not overlap at all. This means that if they do overlap (which they should if compaction is doing a good job picking candidates), then we waste space -- up to 100% per sstable compacted:</p>
<img alt="bloom filter overlap" data-align="center" data-entity-type="file" data-entity-uuid="ba7d7896-0c89-4b39-9489-6d41b5ad5f36" src="https://www.datastax.com/sites/default/files/inline-images/Screen-Shot-2014-01-27-at-8.52.33-PM.png" />
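<p>To put a number on "deteriorates dramatically", the sketch below uses the textbook approximation for a Bloom filter's false-positive rate, (1 - e^(-kn/m))^k, and compares a filter at its design load with the same filter holding twice as many keys. The helper names and the 1% design target are assumptions made for illustration; this is not Cassandra's sizing code.</p>

```python
import math


def size_filter(expected_keys, target_fp):
    """Return (bits, hash_count) for an optimally sized Bloom filter."""
    bits = math.ceil(-expected_keys * math.log(target_fp) / math.log(2) ** 2)
    hashes = max(1, round(bits / expected_keys * math.log(2)))
    return bits, hashes


def false_positive_rate(bits, hashes, keys):
    """Approximate false-positive rate after inserting `keys` keys."""
    return (1.0 - math.exp(-hashes * keys / bits)) ** hashes


if __name__ == "__main__":
    # Hypothetical filter sized for one million keys at a 1% target.
    bits, hashes = size_filter(1_000_000, 0.01)
    print("at design load:", false_positive_rate(bits, hashes, 1_000_000))
    print("at 2x load:    ", false_positive_rate(bits, hashes, 2_000_000))
```

<p>Under these assumptions, doubling the number of keys pushes the false-positive rate from about 1% to well over 10%, which is why sizing for the worst case is the safe choice when the overlap is unknown.</p>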
<h3>Accurate estimates with HyperLogLog</h3>

<p>To solve this problem, we need a fairly accurate estimate of how much the compacting sstables overlap, before we start merging them. This is a good fit for a class of algorithms called cardinality estimation. For instance, given a set of uniformly distributed random numbers,&nbsp;<a href="http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinality-Estimation">we can estimate how many are unique by tracking the smallest number in the set</a>:</p>

<blockquote>
<p>If the maximum possible value is m, and the smallest value we find is x, we can then estimate there to be about m/x unique values in the total set. For instance, if we scan a dataset of numbers between 0 and 1, and find that the smallest value in the set is 0.01, it's reasonable to assume there are roughly 100 unique values in the set; any more and we would expect to see a smaller minimum value.</p>
</blockquote>
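<p>The quoted idea fits in a few lines. This is a minimal sketch with a hypothetical function name; it assumes the values are spread uniformly over (0, max_value], and a single minimum gives a very noisy estimate in practice.</p>

```python
def estimate_unique(values, max_value=1.0):
    """Estimate the number of distinct values as max_value / smallest value seen."""
    smallest = min(values)
    if smallest <= 0:
        raise ValueError("values must be strictly positive")
    return max_value / smallest


if __name__ == "__main__":
    # Duplicates do not change the minimum, so they do not change the estimate.
    print(estimate_unique([0.42, 0.01, 0.87, 0.01]))
```

<p>Because only the minimum matters, repeated values are ignored for free, which is what makes this a cardinality estimate rather than a plain count.</p>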

<p>The actual algorithm&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-5906">used in Cassandra 2.1</a>&nbsp;is called HyperLogLog,&nbsp;<a href="https://github.com/addthis/stream-lib">as implemented in Java by AddThis</a>. (Technically, this implements a variant called HyperLogLog++.) The details are out of scope for this post, but I can highly recommend&nbsp;<a href="http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinality-Estimation">Damn Cool Algorithms' explanation</a>.</p>

<p>Crucially, and unlike the simplistic min-tracking example, HyperLogLog lets us combine two cardinality estimates to get an estimate of the union of the sets they summarize, which is exactly what we need for estimating how many elements will be in the merged bloom filter. Experimental results show that we save about 40% of the bloom filter overhead by getting this more accurate count, although this will be highly workload-dependent.</p>
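<p>The merge property can be illustrated with a toy HyperLogLog. To be clear about assumptions: this is not stream-lib's API and not the HyperLogLog++ variant, just a small sketch of the original algorithm with a hypothetical class name, SHA-1 as a stand-in hash, and 4096 registers. Merging two sketches takes the register-wise maximum, which yields exactly the sketch that would have been built from the union of the inputs.</p>

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Use the first 64 bits of SHA-1 as a stand-in hash function.
        digest = hashlib.sha1(item.encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)               # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Return a sketch of the union: register-wise maximum."""
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b)
                            for a, b in zip(self.registers, other.registers)]
        return merged

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    # Two hypothetical sstables sharing a third of their partition keys.
    a, b = HyperLogLog(), HyperLogLog()
    for i in range(15000):
        a.add(f"key-{i}")
    for i in range(10000, 25000):
        b.add(f"key-{i}")
    print("worst-case allocation:", 15000 + 15000)
    print("estimated merged keys:", round(a.merge(b).cardinality()))
```

<p>In this made-up example the worst-case allocation assumes 30,000 keys while the true union holds 25,000, so sizing the merged bloom filter from the union estimate would avoid most of the over-allocation.</p>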

<h3>Future work</h3>

<p>Potentially even more useful would be using cardinality estimation to pick better compaction candidates. Instead of blindly merging sstables of a similar size a la SizeTieredCompactionStrategy:</p>
<img alt="size tiered compaction strategy" data-align="center" data-entity-type="file" data-entity-uuid="589e7b26-a67c-4a5d-a25c-1b95f6e98e46" src="https://www.datastax.com/sites/default/files/inline-images/Screen-Shot-2014-01-27-at-9.00.11-PM-700x290.png" />
<p>we could merge the candidates that overlap most, which would be a big improvement both for overwrite-heavy and append-mostly workloads:</p>
<img alt="merge the overlap" data-align="center" data-entity-type="file" data-entity-uuid="de3a2971-70ac-4cfb-8874-f51ea520e2c4" src="https://www.datastax.com/sites/default/files/inline-images/Screen-Shot-2014-01-27-at-9.01.08-PM-700x279.png" />
<p>Unfortunately, the HyperLogLog estimators we use for sizing bloom filters are large enough (~10KB per sstable) that keeping them on-heap permanently would lose a lot of our recent gains, and keeping them off-heap would require re-engineering stream-lib. While that isn't a deal breaker, we may be able to do better by using&nbsp;<a href="http://en.wikipedia.org/wiki/MinHash">minhash</a>&nbsp;to approximate the&nbsp;<a href="http://en.wikipedia.org/wiki/Jaccard_index">Jaccard similarity coefficient</a>&nbsp;rather than estimating merged cardinality directly with HyperLogLog. Follow&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-6474">CASSANDRA-6474</a>&nbsp;for the gory details, but you can expect to see similarity-based compaction in Cassandra later this year.</p>
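<p>For a feel of how minhash approximates Jaccard similarity, here is a minimal sketch. Everything in it is an assumption made for illustration (the function names, seeded SHA-1 as the hash family, 128 hashes per signature); it is not the design discussed in CASSANDRA-6474. Each signature keeps the minimum hash value per seed, and the fraction of positions where two signatures agree estimates the Jaccard coefficient of the underlying key sets.</p>

```python
import hashlib


def _hash64(seed, item):
    """Seeded 64-bit hash built from SHA-1 (a stand-in hash family)."""
    digest = hashlib.sha1(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=128):
    """One minimum hash value per seed, computed over all items."""
    return [min(_hash64(seed, item) for item in items)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    # Hypothetical key sets: true Jaccard similarity is 200 / 600 = 1/3.
    sstable_a = set(range(0, 400))
    sstable_b = set(range(200, 600))
    sig_a = minhash_signature(sstable_a)
    sig_b = minhash_signature(sstable_b)
    print("estimated similarity:", estimate_jaccard(sig_a, sig_b))
```

<p>A signature of 128 64-bit values occupies 1KB, so in this sketch the summary is about a tenth of the ~10KB figure above, at the cost of a noisier estimate. A compaction strategy could then rank candidate pairs by estimated similarity and merge the most similar ones first.</p>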


Improving compaction in Cassandra with cardinality estimation
