Apache Cassandra 1.2 Documentation

About writes

Cassandra delivers high availability for writes through its data replication strategy. Cassandra duplicates data on multiple peer nodes to ensure reliability and fault tolerance. Relational databases, on the other hand, typically structure tables to keep data duplication to a minimum, so the relational database server has to do additional work to ensure data integrity across the tables. In Cassandra, maintaining integrity between related tables is not an issue because Cassandra tables are not related to one another. Usually, Cassandra performs better on writes than relational databases.

About the write path

When a write occurs, Cassandra stores the data in a structure in memory, the memtable, and also appends writes to the commit log on disk, providing configurable durability.

The commit log receives every write made to a Cassandra node, and these durable writes survive permanently even if the node loses power.

The more a table is used, the larger its memtable needs to be. Cassandra can dynamically allocate the right amount of memory for the memtable, or you can manage the amount of memory used yourself. When memtable contents exceed a configurable threshold, the memtable data, which includes secondary indexes, is put in a queue to be flushed to disk. You can configure the length of the queue by changing memtable_flush_queue_size in the cassandra.yaml file. If the data to be flushed exceeds the queue size, Cassandra blocks writes. The memtable data is flushed to SSTables on disk using sequential I/O. Data in the commit log is purged after its corresponding data in the memtable is flushed to the SSTable.
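
The write path described above can be sketched as a toy in-memory model. This is only an illustration, not Cassandra's implementation: the class name ToyNode and the flush_threshold parameter are hypothetical, the threshold is counted in rows rather than bytes, and the flush queue and write blocking are left out.

```python
class ToyNode:
    """Toy model of one node's write path (illustrative only)."""

    def __init__(self, flush_threshold=2):
        # Hypothetical threshold: maximum number of rows kept in the memtable.
        self.flush_threshold = flush_threshold
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # row key -> {column name: value}
        self.sstables = []     # immutable, sorted snapshots of flushed memtables

    def write(self, row_key, columns):
        # 1. Append the write to the commit log.
        self.commit_log.append((row_key, dict(columns)))
        # 2. Apply the write to the memtable.
        self.memtable.setdefault(row_key, {}).update(columns)
        # 3. Flush once the memtable exceeds the threshold.
        if len(self.memtable) > self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.memtable:
            return
        # Write rows in sorted order; tuples stand in for an immutable file.
        sstable = tuple(
            (key, tuple(sorted(cols.items())))
            for key, cols in sorted(self.memtable.items())
        )
        self.sstables.append(sstable)
        # Flushed data no longer needs its commit log entries.
        self.memtable = {}
        self.commit_log = []
```

Writing three distinct rows to ToyNode(flush_threshold=2) triggers one flush, which empties both the memtable and the commit log. A later write to one of those rows, followed by another flush, leaves that row split across two SSTables.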


[Figure: the write process (../../_images/write-process_12.png)]

Memtables and SSTables are maintained per table. SSTables are immutable: they are not written to again after the memtable is flushed. Consequently, a row is typically stored across multiple SSTable files.

For each SSTable, Cassandra creates these in-memory structures:

  • Primary index - A list of row keys and the start position of rows in the data file.
  • Index summary - A subset of the primary index. By default 1 row key out of every 128 is sampled.
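
To make the relationship between the two structures concrete, here is a minimal sketch that samples 1 row key out of every 128 from a sorted primary index and uses the sample to narrow a lookup. The function names and the list-of-tuples layout are assumptions made for illustration and do not reflect Cassandra's actual data structures.

```python
from bisect import bisect_right


def build_summary(primary_index, interval=128):
    """Sample every `interval`-th entry of a sorted primary index.

    primary_index is a list of (row_key, data_file_position) pairs.
    Each summary entry is (row_key, offset_into_primary_index).
    """
    return [
        (key, i)
        for i, (key, _pos) in enumerate(primary_index)
        if i % interval == 0
    ]


def lookup(primary_index, summary, row_key, interval=128):
    """Find a row's data file position, scanning at most `interval` entries."""
    sampled_keys = [key for key, _ in summary]
    # Last sampled key that is <= row_key marks where the scan begins.
    slot = bisect_right(sampled_keys, row_key) - 1
    if slot < 0:
        return None  # row_key sorts before every key in this index
    start = summary[slot][1]
    for key, position in primary_index[start:start + interval]:
        if key == row_key:
            return position
    return None
```

With 1,000 row keys in the primary index, the summary holds only 8 entries, and a lookup scans at most 128 primary index entries instead of all 1,000.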

How Cassandra stores data

In the memtable, data is organized in sorted order by row key.

For efficiency, Cassandra does not repeat the names of the columns in memory or in the SSTable. For example, suppose the following writes occur:

write (k1, c1:v1)
write (k2, c1:v1 c2:v2)
write (k1, c1:v4 c3:v3 c2:v2)

In the memtable, Cassandra stores this data after receiving the writes:

k1 c1:v4 c2:v2 c3:v3
k2 c1:v1 c2:v2

In the commit log on disk, Cassandra stores this data after receiving the writes:

k1, c1:v1
k2, c1:v1 c2:v2
k1, c1:v4 c3:v3 c2:v2

In the SSTable on disk, Cassandra stores this data after flushing the memtable:

k1 c1:v4 c2:v2 c3:v3
k2 c1:v1 c2:v2

[Figure: flushing the memtable (../../_images/flush_memtable_12.png)]
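
The three listings above can be reproduced with a short sketch. It assumes, for illustration only, that a later write to the same column replaces the earlier value and that rows and columns are sorted by name on flush. The function names replay and flush are hypothetical.

```python
def replay(writes):
    """Apply writes in arrival order; return (commit_log, memtable)."""
    commit_log = []
    memtable = {}
    for row_key, columns in writes:
        # The commit log keeps every write, in the order received.
        commit_log.append((row_key, columns))
        # The memtable keeps one entry per row; newer column values win.
        memtable.setdefault(row_key, {}).update(columns)
    return commit_log, memtable


def flush(memtable):
    """Return SSTable-like contents: rows sorted by key, columns by name."""
    return [(key, sorted(memtable[key].items())) for key in sorted(memtable)]


writes = [
    ("k1", {"c1": "v1"}),
    ("k2", {"c1": "v1", "c2": "v2"}),
    ("k1", {"c1": "v4", "c3": "v3", "c2": "v2"}),
]
commit_log, memtable = replay(writes)
sstable = flush(memtable)
```

Here commit_log still holds all three writes in arrival order, while sstable holds one merged entry per row key: k1 with c1:v4, c2:v2, c3:v3, and k2 with c1:v1, c2:v2.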

About secondary index updates

To update secondary indexes, Cassandra appends data to the commit log, updates the memtable, and updates the secondary indexes. Writing to a table that has a secondary index involves more work than writing to a table without one, but the update process has been improved in Cassandra 1.2: the need for a synchronization lock to prevent concurrency issues under heavy insert loads has been removed.

When a column is updated, the secondary index is updated. If the old column value was still in the memtable, which typically occurs when updating a small set of rows repeatedly, Cassandra removes the index entry; otherwise, the old entry remains to be purged by compaction. If a read sees a stale index entry before compaction purges it, the reader thread invalidates it.
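
The index maintenance rule above can be modeled with a small sketch. It is a simplified, single-column toy, not Cassandra's implementation: the class ToyIndexedTable and its methods are hypothetical, compaction is not modeled, and flushed data is represented as a plain dictionary.

```python
from collections import defaultdict


class ToyIndexedTable:
    """Toy table with one indexed column (illustrative only)."""

    def __init__(self):
        self.memtable = {}              # row key -> value, not yet flushed
        self.flushed = {}               # row key -> value, already on disk
        self.index = defaultdict(set)   # value -> set of row keys

    def write(self, row_key, value):
        old = self.memtable.get(row_key)
        # Old value still in the memtable: remove its index entry now.
        if old is not None and old != value:
            self.index[old].discard(row_key)
        # Otherwise any old entry is left behind as a stale entry.
        self.memtable[row_key] = value
        self.index[value].add(row_key)

    def flush(self):
        self.flushed.update(self.memtable)
        self.memtable = {}

    def current_value(self, row_key):
        if row_key in self.memtable:
            return self.memtable[row_key]
        return self.flushed.get(row_key)

    def read_by_value(self, value):
        live = []
        for row_key in sorted(self.index[value]):
            if self.current_value(row_key) == value:
                live.append(row_key)
            else:
                # The reader found a stale entry and invalidates it.
                self.index[value].discard(row_key)
        return live
```

Updating a row twice before a flush leaves no stale entry, while updating a row after its old value has been flushed leaves a stale entry that the next indexed read removes.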

As with relational databases, keeping indexes up to date is not free, so unnecessary indexes should be avoided.