Cassandra’s log-structured storage engine enables Cassandra’s amazing performance and features like application-transparent compression by turning all updates into data files called sstables that are written sequentially to disk. No update-in-place is done (because that would require doing random i/o); instead, a new version of the columns being inserted or updated is written in the next sstable.
Thus, over time, many versions of a row may exist in different sstables. Each of these versions may have different sets of columns. If the sstables were allowed to accumulate, reading a row could require many seeks in different files to return a result.
To prevent read speed from deteriorating, compaction runs in the background, merging sstables together. (This can also be done performantly, without random i/o, because rows are sorted by primary key within in each sstable.)
Cassandra’s size-tiered compaction stragety is very similar to the one described in Google’s Bigtable paper: when enough similar-sized sstables are present (four by default), Cassandra will merge them. In figure 1, each green box represents an sstable, and the arrow represents compaction. As new sstables are created, nothing happens at first. Once there are four, they are compacted together, and again when we have another four. Figure 2 shows what we expect much later, when the second-tier sstables have been combined to create third-tier, third tier combined to create fourth, and so on.
There are three problems with size-tiered compaction in update-heavy workloads:
Cassandra 1.0 introduces the Leveled Compaction Strategy, based on LevelDB from the Chromium team at Google.
Leveled compaction creates sstables of a fixed, relatively small size (5MB by default in Cassandra’s implementation), that are grouped into “levels.” Within each level, sstables are guaranteed to be non-overlapping. Each level is ten times as large as the previous.
In figure 3, new sstables are added to the first level, L0, and immediately compacted with the sstables in L1 (blue). When L1 fills up, extra sstables are promoted to L2 (violet). Subsequent sstables generated in L1 will be compacted with the sstables in L2 with which they overlap. As more data is added, leveled compaction results in a situation like the one shown in figure 4.
This solves the above problems with tiered compaction:
Leveled compaction can be enabled by creating (or updating) a ColumnFamily with the compaction_strategy option set to LeveledCompactionStrategy. When updating an existing column family, reads and writes can continue as usual while leveling of existing sstables is performed in the background.
Because leveled compaction makes the above guarantees, it does roughly twice as much i/o as size-tiered compaction. For primarily insert-oriented workloads, this extra i/o does not pay off in terms of the above benefits, since there are few obsolete row versions involved.
Leveled compaction ignores the concurrent_compactors setting. Concurrent compaction is designed to avoid tiered compaction’s problem of a backlog of small compaction sets becoming blocked temporarily while the compaction system is busy with a large set. Leveled compaction does not have this problem, since all compaction sets are roughly the same size. Leveled compaction does honor the multithreaded_compaction setting, which allows using one thread per sstable to speed up compaction. However, most compaction tuning will still involve using compaction_throughput_mb_per_sec (default: 16) to throttle compaction back.
Is it correct interpretation that every level can have any no. of ssTables? For eg: say key 1-100
Level-0 – All new updates
Level-1 – sstable 1 with keys 1-25 and sstable 2 with keys 25-50
Level-3 – sstable 1 with keys 50-75 and sstable 2 with keys 75-100
New sstables within each level would be generated depending on the updates occurring on the old keys?
Or did I get completely wrong
In Cassandra’s implementaiton, L1 gets 50MB (~10 sstables of data), L2 gets 500MB/100 sstables, L3 gets 5GB, etc. New data is flushed to L0 and compacted up from there. The leveldb implementation notes at http://leveldb.googlecode.com/svn/trunk/doc/impl.html go into more detail.
Only enough space for 10x the sstable size needs to be reserved for temporary use by compaction.
As per my perception only 10% of disk space need to be free for the major compaction.
Can you please elaborate?
In http://leveldb.googlecode.com/svn/trunk/doc/impl.html ,you can find a rule:”We also switch to a new output file when the key range of the current output file has grown enough to overlap more then ten level-(L+2) files. This last rule ensures that a later compaction of a level-(L+1) file will not pick up too much data from level-(L+2).”
I think “10x the sstable size” may come from this rule.