DataStax Developer Blog

Leveled Compaction in Apache Cassandra

By Jonathan Ellis -  October 10, 2011 | 14 Comments

Introduction to Compaction

Cassandra’s log-structured storage engine enables its amazing performance and features like application-transparent compression by turning all updates into data files called sstables that are written sequentially to disk. No update-in-place is done (that would require random i/o); instead, a new version of the columns being inserted or updated is written to the next sstable.

Figure 1: adding sstables with size-tiered compaction

Thus, over time, many versions of a row may exist in different sstables. Each of these versions may have different sets of columns. If the sstables were allowed to accumulate, reading a row could require many seeks in different files to return a result.

To prevent read speed from deteriorating, compaction runs in the background, merging sstables together. (This can also be done performantly, without random i/o, because rows are sorted by primary key within each sstable.)
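
To make that merge concrete, here is a minimal Python sketch (illustrative only, not Cassandra’s actual code) that models each sstable as a list of (row_key, timestamp, columns) entries sorted by row_key. Because every input is already sorted, the merge is one sequential pass:

    import heapq
    import itertools

    def merge_sstables(*sstables):
        # Each sstable is a list of (row_key, timestamp, columns) entries
        # sorted by row_key; heapq.merge performs a sequential k-way merge.
        merged = []
        entries = heapq.merge(*sstables, key=lambda e: e[0])
        for key, versions in itertools.groupby(entries, key=lambda e: e[0]):
            # Keep the newest version of each row. Real compaction
            # reconciles column-by-column and purges tombstones; this
            # simplification reconciles whole rows by timestamp.
            merged.append(max(versions, key=lambda e: e[1]))
        return merged

    old = [("a", 1, {"name": "x"}), ("c", 1, {"name": "y"})]
    new = [("a", 2, {"name": "z"})]
    print(merge_sstables(old, new))  # [('a', 2, ...), ('c', 1, ...)]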

Tiered Compaction

Figure 2: sstables under size-tiered compaction after many inserts

Cassandra’s size-tiered compaction strategy is very similar to the one described in Google’s Bigtable paper: when enough similar-sized sstables are present (four by default), Cassandra will merge them. In figure 1, each green box represents an sstable, and the arrow represents compaction. As new sstables are created, nothing happens at first. Once there are four, they are compacted together, and again when we have another four. Figure 2 shows what we expect much later, when second-tier sstables have been combined to create a third tier, third-tier sstables combined to create a fourth, and so on.
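
As a rough sketch of the tiering rule (the real strategy’s bucketing details differ; the power-of-4 buckets below are invented for illustration), size-tiered compaction can be pictured as grouping sstables of similar size and compacting any group that reaches the threshold:

    import math
    from collections import defaultdict

    def pick_compaction_bucket(sstable_sizes_mb, threshold=4):
        # Group sstables into buckets of roughly similar size.
        buckets = defaultdict(list)
        for size in sstable_sizes_mb:
            buckets[int(math.log(max(size, 1), 4))].append(size)
        for members in buckets.values():
            if len(members) >= threshold:
                return members  # compact these together
        return None  # nothing to do yet

    print(pick_compaction_bucket([5, 5, 6, 5, 20, 80]))  # -> [5, 5, 6, 5]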


There are three problems with size-tiered compaction in update-heavy workloads:

  • Performance can be inconsistent because there are no guarantees as to how many sstables a row may be spread across: in the worst case, we could have columns from a given row in each sstable.
  • A substantial amount of space can be wasted since there is no guarantee as to how quickly obsolete columns will be merged out of existence; this is particularly noticeable when there is a high ratio of deletes.
  • Space can also be a problem as sstables grow larger from repeated compactions, since an obsolete sstable cannot be removed until the merged sstable is completely written.  In the worst case of a single set of large sstables with no obsolete rows to remove, Cassandra would need as much free space as is used by the sstables being compacted, into which to write the merged one.

Leveled Compaction

Figure 3: adding sstables under leveled compaction

Cassandra 1.0 introduces the Leveled Compaction Strategy, based on LevelDB from the Chromium team at Google.

Leveled compaction creates sstables of a fixed, relatively small size (5MB by default in Cassandra’s implementation), which are grouped into “levels.” Within each level, sstables are guaranteed to be non-overlapping. Each level is ten times as large as the previous.

In figure 3, new sstables are added to the first level, L0, and immediately compacted with the sstables in L1 (blue). When L1 fills up, extra sstables are promoted to L2 (violet). Subsequent sstables generated in L1 will be compacted with the sstables in L2 with which they overlap. As more data is added, leveled compaction results in a situation like the one shown in figure 4.
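
The “with which they overlap” rule is easy to picture if each sstable is reduced to its (first_key, last_key) range. The following sketch (a hypothetical model, not Cassandra’s code) selects the next-level sstables a promoted sstable must be merged with; since ranges within a level never overlap each other, the result is a small contiguous run:

    def overlapping(candidate, next_level):
        # An sstable is modeled as a (first_key, last_key) pair; two
        # sstables overlap when their key ranges intersect.
        first, last = candidate
        return [s for s in next_level if s[0] <= last and s[1] >= first]

    l1_sstable = ("f", "m")
    l2 = [("a", "e"), ("d", "h"), ("j", "p"), ("q", "z")]
    print(overlapping(l1_sstable, l2))  # -> [('d', 'h'), ('j', 'p')]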

Figure 4: sstables under leveled compaction after many inserts

This solves the above problems with tiered compaction:

  • Leveled compaction guarantees that 90% of all reads will be satisfied from a single sstable (assuming nearly-uniform row size). Worst case is bounded at the total number of levels — e.g., 7 for 10TB of data (see the worked check after this list).
  • At most 10% of space will be wasted by obsolete rows.
  • Only enough space for 10x the sstable size needs to be reserved for temporary use by compaction.
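
The “7 levels for 10TB” figure in the first point can be verified with a quick back-of-the-envelope calculation, assuming 5MB sstables and each level holding ten times the previous one (L1 = 50MB, L2 = 500MB, and so on):

    sstable_mb = 5
    data_mb = 10 * 1024 * 1024  # 10TB
    level = capacity_mb = 0
    while capacity_mb < data_mb:
        level += 1
        capacity_mb += sstable_mb * 10 ** level  # level N holds 10**N sstables
    print(level)  # -> 7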

Leveled compaction can be enabled by creating (or updating) a ColumnFamily with the compaction_strategy option set to LeveledCompactionStrategy. When updating an existing column family, reads and writes can continue as usual while leveling of existing sstables is performed in the background.
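
For example, with cassandra-cli (the column family name “users” is illustrative):

    update column family users with compaction_strategy='LeveledCompactionStrategy';

The target sstable size can also be tuned via the sstable_size_in_mb entry in compaction_strategy_options; it defaults to 5.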

Considerations

Because leveled compaction makes the above guarantees, it does roughly twice as much i/o as size-tiered compaction. For primarily insert-oriented workloads, this extra i/o does not pay off in terms of the above benefits, since there are few obsolete row versions involved.

Leveled compaction ignores the concurrent_compactors setting. Concurrent compaction is designed to avoid tiered compaction’s problem of a backlog of small compaction sets becoming blocked temporarily while the compaction system is busy with a large set. Leveled compaction does not have this problem, since all compaction sets are roughly the same size. Leveled compaction does honor the multithreaded_compaction setting, which allows using one thread per sstable to speed up compaction. However, most compaction tuning will still involve using compaction_throughput_mb_per_sec (default: 16) to throttle compaction back.
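
Both of those knobs live in cassandra.yaml; a brief excerpt (16 is the default mentioned above, and multithreaded compaction is off by default):

    compaction_throughput_mb_per_sec: 16   # throttles total compaction i/o
    multithreaded_compaction: false        # when true, one thread per sstable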

Comments

  1. Mohit Anchlia says:

    Is it a correct interpretation that every level can have any number of sstables? For example, say keys 1-100:

    Level-0 – All new updates
    Level-1 – sstable 1 with keys 1-25 and sstable 2 with keys 25-50
    Level-2 – sstable 1 with keys 50-75 and sstable 2 with keys 75-100

    New sstables within each level would be generated depending on the updates occurring on the old keys?

    Or did I get it completely wrong :)

  2. Jonathan Ellis says:

    In Cassandra’s implementation, L1 gets 50MB (~10 sstables of data), L2 gets 500MB/100 sstables, L3 gets 5GB, etc. New data is flushed to L0 and compacted up from there. The leveldb implementation notes at http://leveldb.googlecode.com/svn/trunk/doc/impl.html go into more detail.

  3. Samarth Gahire says:

    Only enough space for 10x the sstable size needs to be reserved for temporary use by compaction.

    As per my understanding, only 10% of disk space needs to be free for a major compaction.

    Can you please elaborate?

  4. liang feng says:

    In http://leveldb.googlecode.com/svn/trunk/doc/impl.html you can find a rule: “We also switch to a new output file when the key range of the current output file has grown enough to overlap more than ten level-(L+2) files. This last rule ensures that a later compaction of a level-(L+1) file will not pick up too much data from level-(L+2).”
    I think “10x the sstable size” may come from this rule.

  5. manu zhang says:

    “Leveled compaction guarantees that 90% of all reads will be satisfied from a single sstable (assuming nearly-uniform row size). ”
    I wonder how this is calculated.

  6. Jonathan Ellis says:

    Each level is allowed to grow to 10x the size of the previous level, so (slightly less than) 90% of the data will be in the largest level. Remember that within a level, data is guaranteed not to overlap across sstables, or put another way, a given row will be in at most one sstable.

  7. Gopinath says:

    Hi Jonathan,

    Few clarifications needed :

    1. Since the sstables are 5 MB each here, are the bloom filter and index files maintained for each sstable?

    If that is the case, it would take time to scan through all the filters and indexes, right? How does it work here?

    2. “Extra sstables are promoted to L2 (violet)” in the above document:

    Is that something like sstables 10 (newest) through 1 (oldest) sitting in L1, and when a new sstable (11) is to be compacted from L0 to L1, sstable (1) from L1 is moved to L2 and compacted with the L2 sstables, while the new sstable (11) is compacted into L1?

    3. If the sstable size is set to 5 MB and a row is larger than 5 MB, how does that work out?

    4. During compaction, do all the sstables in a level that require changes participate whenever an incoming sstable carries new updates for them? And once the compaction is done, are new sstables created with the changes and the old ones discarded? (In an update-heavy environment.)

    Thanks
    Gopinath

  8. Jonathan Ellis says:

    All metadata is maintained for sstables compacted by LCS, although the default bloom filter accuracy is reduced starting in 1.2. We use an IntervalTree to cut down the number of sstables that need to be checked.

    SSTables retain the level they are created in. If a level is full, new sstables resulting from compacting it will be placed in the next-higher level.

    An SSTable will contain a minimum of one row.

    New SSTables are always flushed to L0 and compact “upwards” from there.

  9. Gopinath says:

    Thanks Much Jonathan :)

    I have one more point of confusion; I saw this in the cassandra11.pdf doc:

    sstable_size_in_mb 5 : Sets the file size for leveled SSTables. A compaction is triggered when unleveled SSTables (newly flushed SSTable files in Level 0) exceeds 4 * sstable_size_in_mb.

    Does this mean that until 4 sstables are flushed to L0, they are not compacted and promoted to L1?

    I thought that a memtable flush to L0 would immediately get compacted with the sstables in L1.

    Please clarify, thanks so much

    Thanks
    Gopinath

  10. Jonathan Ellis says:

    A compaction will run whenever there is more data in Level N > 0 than desired (sstable_size_in_mb * 10**level), or whenever there is anything in L0. In practice, you will often get multiple tables flushed to L0 while it is busy compacting higher levels (especially before parallel compactions are allowed in 1.2), but in theory a single L0 sstable can definitely be compacted with L1.

  11. Gopinath says:

    Thanks Much Jonathan Ellis for your clear and quick explanation

  12. Oleg Dulin says:

    I don’t understand the 10x requirement for temporary compaction space under leveled compaction. If my total CF size is 500G, does that mean I will need 5 terabytes? That makes no sense. Say my highest-level sstable will be 100G. Does that mean I need a terabyte for temporary compactions? That’s worse than size-tiered. Can you please clarify?

  13. Jonathan Ellis says:

    Leveled compaction restricts sstable size to 5MB or a single row, whichever is larger.

  14. Yang, Wei says:

    Hi Jonathan,
    I have some questions about LeveledCompaction after reading the related code in Cassandra 1.2:
    1. In the case where we get candidates from level L (L >= 1) and pick up just one sstable (since we didn’t find overlapping ones in level L+1), why not just promote that one instead of doing a one-sstable compaction?
    2. In our situation, each node has 12 disks (1.6T capacity), each with 500G of data right now, and the memtable size of cf A is 1G. How should we tune sstable_size_in_mb? What about a 4G size?
