Leveled Compaction in Apache Cassandra
date: October 10, 2011
Introduction to Compaction
Cassandra's log-structured storage engine enables Cassandra's amazing performance and features like application-transparent compression by turning all updates into data files called sstables that are written sequentially to disk. No update-in-place is done (because that would require doing random i/o); instead, a new version of the columns being inserted or updated is written in the next sstable.
Figure 1: adding sstables with size tiered compaction.
Thus, over time, many versions of a row may exist in different sstables. Each of these versions may have different sets of columns. If the sstables were allowed to accumulate, reading a row could require many seeks in different files to return a result.
To prevent read speed from deteriorating, compaction runs in the background, merging sstables together. (This can also be done performantly, without random i/o, because rows are sorted by primary key within in each sstable.)
Figure 2: sstables under size-tiered compaction after many inserts
Cassandra's size-tiered compaction stragety is very similar to the one described in Google's Bigtable paper: when enough similar-sized sstables are present (four by default), Cassandra will merge them. In figure 1, each green box represents an sstable, and the arrow represents compaction. As new sstables are created, nothing happens at first. Once there are four, they are compacted together, and again when we have another four. Figure 2 shows what we expect much later, when the second-tier sstables have been combined to create third-tier, third tier combined to create fourth, and so on.
There are three problems with size-tiered compaction in update-heavy workloads:
- Performance can be inconsistent because there are no guarantees as to how many sstables a row may be spread across: in the worst case, we could have columns from a given row in each sstable.
- A substantial amount of space can be wasted since there is no guarantee as to how quickly obsolete columns will be merged out of existance; this is particularly noticeable when there is a high ratio of deletes.
- Space can also be a problem as sstables grow larger from repeated compactions, since an obsolete sstable cannot be removed until the merged sstable is completely written. In the worst case of a single set of large sstable with no obsolete rows to remove, Cassandra would need 100% as much free space as is used by the sstables being compacted, into which to write the merged one.
Figure 3: adding sstables under leveled compaction.
Cassandra 1.0 introduces the Leveled Compaction Strategy, based on LevelDB from the Chromium team at Google. Leveled compaction creates sstables of a fixed, relatively small size (5MB by default in Cassandra's implementation), that are grouped into "levels." Within each level, sstables are guaranteed to be non-overlapping. Each level is ten times as large as the previous.
In figure 3, new sstables are added to the first level, L0, and immediately compacted with the sstables in L1 (blue). When L1 fills up, extra sstables are promoted to L2 (violet). Subsequent sstables generated in L1 will be compacted with the sstables in L2 with which they overlap. As more data is added, leveled compaction results in a situation like the one shown in figure 4.
Figure 4: sstables under leveled compaction after many inserts.
This solves the above problems with tiered compaction:
- Leveled compaction guarantees that 90% of all reads will be satisfied from a single sstable (assuming nearly-uniform row size). Worst case is bounded at the total number of levels -- e.g., 7 for 10TB of data.
- At most 10% of space will be wasted by obsolete rows.
- Only enough space for 10x the sstable size needs to be reserved for temporary use by compaction.
Leveled compaction can be enabled by creating (or updating) a ColumnFamily with the compaction_strategy option set to LeveledCompactionStrategy. When updating an existing column family, reads and writes can continue as usual while leveling of existing sstables is performed in the background.
Because leveled compaction makes the above guarantees, it does roughly twice as much i/o as size-tiered compaction. For primarily insert-oriented workloads, this extra i/o does not pay off in terms of the above benefits, since there are few obsolete row versions involved.
Leveled compaction ignores the concurrent_compactors setting. Concurrent compaction is designed to avoid tiered compaction's problem of a backlog of small compaction sets becoming blocked temporarily while the compaction system is busy with a large set.
Leveled compaction does not have this problem, since all compaction sets are roughly the same size. Leveled compaction does honor the multithreaded_compaction setting, which allows using one thread per sstable to speed up compaction. However, most compaction tuning will still involve using compaction_throughput_mb_per_sec (default: 16) to throttle compaction back.