DateTieredCompactionStrategy: Compaction for Time Series Data

By Marcus Eriksson -  November 19, 2014 | 13 Comments

This post will introduce DateTieredCompactionStrategy (DTCS) which is a new compaction strategy that ships with Cassandra 2.0.11, 2.1.1 and later. It aims to keep data written at the same time in the same SSTables to improve performance of time series-like workloads. DTCS was contributed by Björn Hegerfors at Spotify.

Background

Reading data in Cassandra

When a write comes in to Cassandra it writes that data to a commit log and stores it in a Memtable in-memory on the node. Once the memtable is full, it is flushed to an SSTable on disk. Once an SSTable is flushed, it never changes. When a read is executed, data from the Memtable is merged with relevant data from the SSTables on disk. Reducing the amount of reading we need to do from disk is extremely important since that is slow.

A very simplified view of the on-disk format is that we have a bunch of partition keys and each of those partition keys can have many cells. The cells have names and are stored sorted, smallest cell name first, biggest cell name last. For example:

dtcs_blog_2

To be able to reduce the number of SSTables Cassandra needs to look at during reads, metadata is stored with each SSTable. For example, it stores the min and max cell names in the SSTable (more details in CASSANDRA-5514). This is then used during reads, for example in the SSTable above, we could avoid reading from the SSTable if we query “all cells bigger than ‘j’ for partition key ‘id1’” since we know that that cell name can not be in the SSTable. We would have to read from it if we queried for “all cells bigger than ‘e’ for partition key ‘id1‘” though since the metadata is global for the SSTable (min cell name would be ‘a’ and max cell name would be ‘f’ for this SSTable).

For more background on how cells become rows etc, read this blog post.

Compaction

The main purposes of compaction are to reclaim disk space used (removing overwrites and deleted data) and to make sure that we can serve reads from as few SSTables as possible. They are of course related – if you reduce the amount of duplicate data, you are bound to hit fewer SSTables. What we are going to focus on here though is the case when we do no overwrites and only do deletes by TTL, meaning the main purpose of compaction is to make sure we serve reads from as few SSTables as possible and that we drop data that has expired.

Time series data

One common use case for Cassandra is time series data where you might have a table like this:

CREATE TABLE temperature (
   weatherstation_id text,
   event_time timestamp,
   temperature text,
   PRIMARY KEY (weatherstation_id,event_time)
);

And you insert new temperature readings at various timestamps. This gets laid out on disk in an SSTable like this (again, simplified):

dtcs_blog_3

Where event_time always increases (time between 2 different temperature readings will increase), which means that we don’t do any overwrites. Note that the global (for all partitions) biggest and smallest cell names in this SSTable will be approximately the same as the min and max timestamps for the data in this SSTable. Also note that it does not matter if we track event times as actual time or something else, as long as they sort after the cells that were inserted earlier  (timeuuids or sequence ids for example). We could also put a TTL on the temperature readings, meaning we would only keep the most recent data.

Historically compaction has not cared much about how old the data is when selecting SSTables to compact together – this means that we could mix new and old data in the same SSTables which causes a few problems for time series data:

  • We need to look in more SSTables for the most common query types (“give me the last day of data”) since all SSTables could contain data from the last day – we can’t use the optimization from CASSANDRA-5514 where we exclude SSTables based on the max/min cell names in an SSTable.
  • Dropping expired data gets difficult – since columns will expire at different times in the SSTable, we can’t use the optimizations from CASSANDRA-5228 where we are able to drop an entire SSTable if all data in the SSTable is expired.

DateTieredCompactionStrategy

The basic idea of DTCS is to group SSTables in windows based on how old the data is in the SSTable. We then only do compaction within these windows to make sure that we don’t mix new and old data.

The size of the compaction windows is configurable. The base_time_seconds option sets the size of the initial window and defaults to 1 hour. This means the data that was written in the last hour will be in that first window, and will be compacted with data from the same window. Any reads that want to read data from the latest hour will only have to look at SSTables that were compacted or flushed in that window. The older windows then get larger and larger, and finally we have max_sstable_age_days which is when we don’t compact the SSTables anymore.

dtcs_blog

In this example you can see that the time windows move as time passes. The SSTables are sorted by the age of the oldest data (min timestamp) in them, and the most recent data is on the right. In 1 above, none of the time windows contain 4 SSTables (configurable via min_threshold), meaning no compaction will be started. As time passes, we end up in state 2, where one of the time windows now contains 4 SSTables, and we start a compaction of the SSTables in that window. This creates a new sstable that has the same min/max timestamps as the total min/max timestamps of the compacted SSTables (unless the timestamps were from a tombstone that was dropped). In 3 we see time move again and we end up with 4 SSTables in a window which will be compacted together. This explanation is simplified, for a detailed post on how this works, check out this blog post: https://labs.spotify.com/2014/12/18/date-tiered-compaction/

Dropping expired data

If the data is inserted with TTL, and that TTL is the same for all inserts, all data in an SSTable will expire at approximately the same time since TTL is relative to the timestamp of the data. This makes it possible to optimize the way we drop tombstones – once all data in the SSTable is expired, lets drop the entire SSTable. See CASSANDRA-5228 for more details about how this works.

Out of order writes

Some times you might want to insert old data – for example if you need to backfill some data into the system, this would of course mess up the timestamps in the SSTables. There are a few options to solve this;

  • Block all live clients writing data, write the old data, flush SSTables and enable the live clients again.
  • Generate SSTables with the old data offline, using CQLSSTableWriter for example and then load those SSTables into the cluster.

Two other things can generate out of order data;

  • Hints – hints will send old data to your node, but only data that is max_hint_window_in_ms old – you could align this with your base_time_seconds to make sure you don’t get too big time spans in your flushed SSTables.
  • Repair – repair can send really old data to your node, but this is also fine, since you run DTCS on all your nodes in the cluster and repair will stream parts of SSTables, all the SSTables streamed will have short time spans, meaning DTCS will put them in the windows and compact them up if needed.

Note that there will almost always be some out of order writes due to clients not having the exact same times, this can generate a small time-overlap between the SSTables. See CASSANDRA-8243 for more details about this.

Switching to DateTieredCompactionStrategy

To change a table to use DTCS, you do this:

ALTER TABLE <table> WITH compaction = {'class': 'DateTieredCompactionStrategy', 'timestamp_resolution':'<resolution>', 'base_time_seconds':'3600', 'max_sstable_age_days':'365'};

The configuration parameters are optional:

  • timestamp_resolution: Defines what timestamp resolution you use when inserting data, defaults to ‘MICROSECONDS’ since that is the default when using CQL, but older clients might use milliseconds for example. You can use any resolution that Java TimeUnit accepts. Note that unless you do mutations with USING TIMESTAMP (or the driver equivalent) you should not change this!
  • base_time_seconds: This is the size of the first window, defaults to 3600 seconds (1 hour). The rest of the windows will be min_threshold (default 4) times the size of the previous window.
  • max_sstable_age_days: This is the cut-off when SSTables wont be compacted anymore, if they only contain data that is older than this value, they will not be included in compactions. This value should be set to some point where you won’t (frequently) read any data. In a monitoring system for example, you might only very rarely read data that is older than one year. This avoids write amplification by not recompacting data that you never read. Defaults to 365 days.

Note that DateTieredCompactionStrategy is new and some things might change in the future, report any issues you find on Cassandra Jira.

For all the implementation details of this compaction strategy, go read CASSANDRA-6602.

It is possible to try out new compaction strategies without switching for the entire cluster – check out this blog post for more information about that.

Summary

DateTieredCompactionStrategy is a compaction strategy specifically written for time series-like data, where data is mostly appended to existing partitions. The problem with leveled- and size-tiered compaction is that they don’t care about when the data was written, meaning they mix new and old data, and one characteristic of time series workloads is that you mostly want to read the most recent data. This forces Cassandra to read from many SSTables. Leveled compaction can give great read performance, but with a big write amplification cost – we will need to recompact data a lot. DTCS makes this better by only compacting together SSTables that contain data with timestamps that are close to each other, meaning for a query that requests the most recent data, we can greatly reduce the number of SSTables touched during a read. It also limits write amplification by having an option to stop compacting data that is old and rarely read.



Comments

  1. chetan conikee says:

    Please correct the syntactic error on this blog article

    replace ‘=’ with ‘:’

    ALTER TABLE TABLE WITH compaction = {‘class’: ‘DateTieredCompactionStrategy’, ‘timestamp_resolution’:’MICROSECONDS’, ‘base_time_seconds’:’3600′, ‘max_sstable_age_days’:’365′};

    1. Marcus Eriksson says:

      fixed, thanks

  2. Thanks for a great article on what appears to be a very useful feature. I’m wondering if you could please clarify a few points.

    It’s clear that a ‘min timestamp’ per SSTable is used to group SSTables for compaction. Is this timestamp the minimum WRITETIME of columns in the file, or does it somehow come from the timestamps that are part of a key (i.e., the event_time in the weather-station example above)? I’m guessing the former, due to simplicity.

    If the ‘min timestamp’ comes from the WRITETIME, then I’m guessing an out-of-order scenario (e.x., processing a data backlog for a lagged weather-station) would just increase the key range for some partitions in a file but won’t otherwise affect DTCS compaction so it wouldn’t be necessary to halt realtime writes during this time. Does this sound correct or am I missing something?

    Any help is much appreciated.

    1. Marcus Eriksson says:

      Yes, min timestamp is the actual minimum timestamp of the sstable

      Problem is if you insert old data with new timestamps, we increase the span of min/max cell names in the sstable meaning we have to include more sstables when doing reads. If you do out-of-order inserts you should use the “USING TIMESTAMP” clause to make sure that the data maps to the time it was created. But then you will need to block actual writes, otherwise the sstable min/max timestamps will be wrong and the sstable will be included in old compactions.

  3. Tomasz says:

    Hi,

    DTCS looks really promising, we’ve been waiting for sth like this.

    Is it possible to get min/max cell names (values of clustering column) from the given sstable file? sstablemetadata tool doesn’t provide such information.

  4. Bogdan Batranut says:

    Hello,

    Nice feature with a lot of potential, but I think that in my case I have hit a border case.

    I understand that the compaction is done according to writeTime() value. Well if we have a time series but the values are collections, then how will the compaction work? I see that writeTime is not available for collections or for parts of the primary key. I just assume that cassandra does not store timestamps for primary keys or collections. What will happen when you have a table like this :
    id : text
    ts : timestamp
    values: collection
    pk (id, ts)

    How will the compaction work?

    I see that in Cassandra 2.1.x user-defined types will be available. If one defines a type that contains a collection, how will the compaction be affected?

    Can you please write some more info on the topic and for some border cases?

    Thank you very much.

  5. kk says:

    Really new to Cassandra . So i have some basic doubts. Is there any way to set max_sstable_age_days to hour base means compaction limited to 3 hour

    1. james turner says:

      you can set it to a decimal value (e.g. 3hr = 0.125 days)

  6. Andreas says:

    How can I write SSTables with old timestamps using CQLSSTableWriter?

    I’ve tried and the timestamp is created as System.currentTimeMillis() * 1000 internally so unless you set the time on the computer creating the SSTable I can’t see how this can be done.

  7. Suresh says:

    I have a use case where I need to retain the data for 7 years, is DTCS the way to go? I assume in this case the older SSTables may grow big in size and wondering how would the query efficiency be if we would like to query couple of years back data. We also want to use TTL option to retain for 7 years. Any suggestions?

  8. bedinsky says:

    Which are disk space requirements for DTCS ?
    STC requires at least as much free disk space for compaction as the size of the largest column family. LCS needs much less space for compaction, only 10 * sstable_size_in_mb.
    Is DTCS like STC or worse ? Does it depends on max_sstable_age_days ?

  9. Hemalatha says:

    Need a clarification on base_time_seconds, the blog says:
    “This is the size of the first window, defaults to 3600 seconds (1 hour). The rest of the windows will be min_threshold (default 4) times the size of the previous window.”

    With default value of 3600 i.e., 1hr for base_time_seconds does it mean first compaction triggers at 1st hour, next at 4,16, 64 hours and so on?

    Can someone please throw some light on this.

Comments

Your email address will not be published. Required fields are marked *




Subscribe for newsletter:

© 2017 DataStax, All rights reserved. Tel. +1 (408) 933-3120 sales@datastax.com Offices

DataStax is a registered trademark of DataStax, Inc. and its subsidiaries in the United States and/or other countries.
Apache Cassandra, Apache, Solr, Spark, TinkerPop, and Cassandra are trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries.