The write path to compaction
Cassandra delivers high availability for writes through its data replication strategy: it duplicates data on multiple peer nodes to ensure reliability and fault tolerance. Relational databases, by contrast, typically structure tables to keep data duplication to a minimum, and the relational database server has to do additional work to ensure data integrity across the tables. In Cassandra, maintaining integrity between related tables is not an issue because Cassandra tables are not related. As a result, Cassandra usually performs better on writes than relational databases. Data on the write path passes through these stages:
- Logging data in the commit log
- Writing data to the memtable
- Flushing data from the memtable
- Storing data on disk in SSTables
Logging writes and memtable storage
When a write occurs, Cassandra stores the data in an in-memory structure, the memtable, and also appends the write to the commit log on disk, providing configurable durability. The commit log receives every write made to a Cassandra node; these durable writes survive even after hardware failure. The memtable is a write-back cache of data partitions that Cassandra looks up by key. The more a table is used, the larger its memtable needs to be. Cassandra can dynamically allocate the right amount of memory for the memtable, or you can manage the amount of memory yourself. Unlike a write-through cache, the memtable stores writes until it reaches a limit, and is then flushed.
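The two-step write described above can be illustrated with a minimal sketch. This is not Cassandra's actual implementation; the class and field names are invented for illustration, with a Python list standing in for the on-disk commit log file.

```python
# Sketch of the write path: append to a durable commit log first,
# then store the write in the in-memory memtable keyed by partition key.

class WritePath:
    def __init__(self):
        self.commit_log = []   # append-only log; a list stands in for a disk file
        self.memtable = {}     # write-back cache: partition key -> {column: (value, timestamp)}

    def write(self, partition_key, column, value, timestamp):
        # 1. Append to the commit log, so the write survives a crash.
        self.commit_log.append((partition_key, column, value, timestamp))
        # 2. Store in the memtable; a later write to the same column
        #    replaces the earlier one in memory.
        row = self.memtable.setdefault(partition_key, {})
        row[column] = (value, timestamp)

wp = WritePath()
wp.write("user:1", "name", "alice", 100)
wp.write("user:1", "name", "alicia", 200)
# The commit log keeps both appends; the memtable keeps only the newest value.
```

Note that the commit log retains every append, which is what makes commit log replay after a restart possible, while the memtable holds only the current in-memory state.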
Flushing data from the memtable
When memtable contents exceed a configurable threshold, the memtable data, including its indexes, is put in a queue to be flushed to disk. You can configure the length of the queue by changing memtable_flush_queue_size in cassandra.yaml. If the data to be flushed exceeds the queue size, Cassandra blocks writes. You can manually flush data from the memtable using the nodetool flush command. Typically, flushing the memtable before restarting a node is recommended to reduce commit log replay time. To flush the data, Cassandra sorts memtables by partition key and then writes the data to disk sequentially. The process is extremely fast because it involves only a commit log append and a sequential write.
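A small sketch of the flush step, under the simplifying assumption that a memtable is a dict keyed by partition key. The function name and the tuple representation of the SSTable are invented for illustration; the sort-then-write-sequentially behavior is what the text describes.

```python
# Sketch of a memtable flush: sort by partition key, then emit the
# partitions in order, as a stand-in for a sequential disk write.

def flush(memtable):
    """Return an immutable, key-sorted sequence of (partition_key, row)
    pairs, standing in for an on-disk SSTable."""
    sstable = sorted(memtable.items())   # sort by partition key
    return tuple(sstable)                # immutable once flushed, like an SSTable

memtable = {"k3": {"c": 1}, "k1": {"c": 2}, "k2": {"c": 3}}
sstable = flush(memtable)
# Partitions come out in key order (k1, k2, k3), so writing them to disk
# in this order is purely sequential I/O.
```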
Storing data on disk in SSTables
The memtable data is flushed to SSTables on disk using sequential I/O. Data in the commit log is purged after its corresponding data in the memtable is flushed to the SSTable.
Memtables and SSTables are maintained per table. SSTables are immutable, not written to again after the memtable is flushed. Consequently, a partition is typically stored across multiple SSTable files.
For each SSTable, Cassandra creates these in-memory structures:
- A partition index: a list of partition keys and the start position of rows in the data file.
- A partition summary: a subset of the partition index. By default, 1 partition key out of every 128 is sampled.
Periodic compaction is essential to a healthy Cassandra database because Cassandra does not insert/update in place. In-place updates necessitate a random I/O operation, which is not performant. As inserts/updates occur, instead of overwriting the rows, Cassandra writes a new timestamped version of the inserted or updated data in another SSTable. You manage the accumulation of SSTables on disk manually or automatically using compaction.
Cassandra also does not delete in place. Instead, Cassandra marks data to be deleted using a tombstone. Data marked with a tombstone exists for a configured time period defined by the gc_grace_seconds value set on the table. You also manage the deletion of data from SSTables using compaction. This diagram depicts the compaction process:
Compaction merges the data in each SSTable by partition key, selecting the latest version of the data for storage based on its timestamp. After evicting tombstones and removing deleted data, columns, and rows, the compaction process sorts the SSTable data and consolidates it into a single SSTable.
Data input to SSTables is sorted to prevent random I/O during SSTable consolidation. After compaction, Cassandra uses the new consolidated SSTable instead of multiple old SSTables, fulfilling read requests more efficiently than before compaction. The old SSTable files are deleted as soon as any pending reads finish using the files. Disk space occupied by old SSTables becomes available for reuse.
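The merge-by-timestamp and tombstone-eviction behavior described above can be sketched as follows. This is an illustrative model, not Cassandra's implementation: SSTables are modeled as dicts of `key -> (value, timestamp)`, `None` stands in for a tombstone, and the `gc_grace_seconds` default of 864000 (ten days) is Cassandra's documented table default.

```python
# Sketch of compaction: merge several SSTables, keep the version with the
# latest timestamp per key, and evict tombstones older than gc_grace_seconds.

def compact(sstables, now, gc_grace_seconds=864000):
    merged = {}
    for sstable in sstables:
        for key, (value, timestamp) in sstable.items():
            # Latest timestamp wins, regardless of which SSTable it came from.
            if key not in merged or timestamp > merged[key][1]:
                merged[key] = (value, timestamp)
    # Drop tombstones (value None) whose grace period has expired,
    # and return the result sorted by partition key.
    return {
        k: v for k, v in sorted(merged.items())
        if not (v[0] is None and now - v[1] > gc_grace_seconds)
    }

old = {"a": ("1", 100), "b": ("2", 100)}
new = {"a": ("9", 200), "b": (None, 200)}   # None marks a tombstone for "b"
result = compact([old, new], now=2_000_000)
# "a" keeps its newest value; the expired tombstone for "b" is evicted.
```

Because the inputs are already sorted by partition key, a real implementation can stream this merge without random I/O, which is the point the surrounding text makes.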
Although no random I/O occurs, compaction can still be a fairly heavyweight operation. During compaction, there is a temporary spike in disk space usage and disk I/O because the old and new SSTables coexist. To minimize deteriorating read speed, compaction runs in the background. To further reduce its impact, Cassandra:
- Throttles compaction I/O to compaction_throughput_mb_per_sec (default 16MB/s).
- Requests that the operating system pull newly compacted partitions into the page cache when the key cache indicates that the compacted partition is hot for recent reads.
You can configure two types of compaction to run periodically: SizeTieredCompactionStrategy and LeveledCompactionStrategy.
SizeTieredCompactionStrategy is designed for write-intensive workloads, and LeveledCompactionStrategy for read-intensive workloads. You can manually start compaction using the nodetool compact command.
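To illustrate the idea behind SizeTieredCompactionStrategy, here is a hedged sketch of its core heuristic: group SSTables of similar size into buckets, then compact a bucket once it holds enough tables. The function name, the ratio, and the bucketing rule are simplified assumptions for illustration, not Cassandra's actual parameters.

```python
# Sketch of size-tiered bucketing: SSTables of similar size are grouped
# together, since merging similarly sized tables keeps write amplification low.

def bucket_by_size(sstable_sizes, ratio=1.5):
    """Group sorted sizes so each bucket spans at most a `ratio` factor
    relative to its smallest member (illustrative rule, not Cassandra's)."""
    buckets = []
    for size in sorted(sstable_sizes):
        if buckets and size <= buckets[-1][0] * ratio:
            buckets[-1].append(size)   # close enough in size: same tier
        else:
            buckets.append([size])     # too large: start a new tier
    return buckets

sizes = [10, 11, 12, 100, 110, 1000]
buckets = bucket_by_size(sizes)
# Similar-sized tables land together: [10, 11, 12], [100, 110], [1000]
```

LeveledCompactionStrategy instead organizes SSTables into fixed-size levels, trading more compaction I/O on writes for fewer SSTables consulted per read, which is why the text pairs it with read-intensive workloads.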