Updates to Cassandra’s Commit Log in 2.2
date: June 19, 2015
The commit log implementation of Cassandra was overhauled in 2.1, bringing multithreaded and memory-mapped writing and reducing log overheads for dramatically improved throughput. In the next version we have improved upon this by adding compression and addressing some unexpected read traffic caused by the commit log.
Segment reuse and performance
Since version 1.1 a feature of the commit log infrastructure in Cassandra has been the ability to reuse segments. This is done in order to reduce fragmentation on the logging drive -- a number of commitlog segments will be kept reserved by the database for overwriting after the data they contain has been flushed, which means that most of the time the commit log will not need to allocate new space in order to write. This does not eliminate all fragmentation that can be caused by the log, as it will continue writing after its space quota has been reached while memtable flushes are in progress, and afterwards it will release the overallocated space. Still, since less space is allocated and freed, there is a lower chance of introducing fragmentation on the drive.
However, with the introduction of memory-mapped log writes in 2.1, during isolated commit log testing we saw that in some scenarios the commit log would read out the contents of the obsolete logs before overwriting them. This behaviour is not very surprising, as the first access to memory-mapped pages, be it a read or write, causes the page to be fetched into memory by the operating system, but is entirely unacceptable from the point of view of Cassandra's performance.
As this problem is caused by the combination of segment reuse and memory-mapped access, we had two possible solutions:
- stop using memory-mapped writes
- stop reusing segments
The former approach would solve most of the instances of this problem, but reads would still be performed by the operating system whenever a partially written page needs to be flushed to disk, to be able to combine the new data with the non-overwritten contents in the rest of the page. This will happen unfortunately often for the batch sync strategy, and bring with it additional latency for the most latency-sensitive method of using the log, thus it isn't a very good solution. Reusing segments, on the other hand, has very limited benefit that is realistically only relevant to spinning disk drives, and only when a dedicated drive is not available for the commit log; with the increasing usage of solid-state drives it makes little sense to prioritize this particular scenario. Dropping reuse ensures that the operating system knows the space written to by the log does not contain old data, which allows it to optimize away all fetches that could otherwise be necessary.
For Cassandra 2.2 we chose to no longer reuse segments, which should result in reduced page cache pressure by the commit log and improved real-world log performance.
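The fresh-segment approach can be sketched roughly as follows. This is a minimal illustration in Python (chosen for brevity; Cassandra itself is written in Java), and the function names, segment size and length-prefixed record layout are assumptions made for the example, not Cassandra's actual code or on-disk format. The point it shows is that each segment is a newly created file, so the mapped pages have no old contents that would need to be fetched before they are overwritten.

```python
import mmap
import os
import struct
import tempfile

SEGMENT_SIZE = 1 << 20  # 1 MiB, an illustrative size only


def create_fresh_segment(path, size=SEGMENT_SIZE):
    """Create a brand-new segment file and map it into memory.

    O_EXCL makes the call fail if the file already exists, so an old
    segment can never be reused by accident.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL)
    try:
        os.ftruncate(fd, size)      # extend the empty file to its full size
        return mmap.mmap(fd, size)  # the mapping keeps its own handle
    finally:
        os.close(fd)


def append_records(segment, offset, records):
    """Write length-prefixed records starting at offset; return the new offset."""
    for payload in records:
        end = offset + 4 + len(payload)
        if end > len(segment):
            raise ValueError("segment full")
        segment[offset:offset + 4] = struct.pack(">I", len(payload))
        segment[offset + 4:end] = payload
        offset = end
    return offset
```

A caller would create a segment, append to it, call `segment.flush()` at each sync point, and delete the file once its data has been flushed from the memtables, instead of keeping it around for overwriting.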
Compression
A new feature of the commit log in Cassandra 2.2 is support for compression. Compression reduces the disk transfer and space requirements of the log at the expense of additional CPU processing. As it is primarily aimed at reducing resource usage by the log, the implementation does not attempt to utilize all processing resources to maximize its throughput on fast drives. Compression can still result in performance gains even in this case, due to the decrease in cache space, disk traffic and overhead caused by the log, as well as reduced memtable flush frequency. On spinning disks, when paired with a fast compressor, it can usually outperform uncompressed writes, additionally increasing log throughput and decreasing latency.
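To make the trade-off concrete, here is a minimal sketch of compressing a block of log data before it is written. It uses Python's standard `zlib` module (a Deflate implementation) because it ships with the language. The frame layout and function names are assumptions made for illustration and do not reflect Cassandra's actual segment format.

```python
import struct
import zlib

# Hypothetical frame header: compressed length, then uncompressed length.
HEADER = struct.Struct(">II")


def compress_section(raw, level=1):
    """Compress one block of log data and prefix it with its lengths.

    A low compression level favours speed over ratio, in the same spirit
    as choosing a fast compressor for the log.
    """
    compressed = zlib.compress(raw, level)
    return HEADER.pack(len(compressed), len(raw)) + compressed


def decompress_section(frame):
    """Recover the original bytes from a frame built by compress_section."""
    compressed_len, raw_len = HEADER.unpack_from(frame, 0)
    start = HEADER.size
    raw = zlib.decompress(frame[start:start + compressed_len])
    if len(raw) != raw_len:
        raise ValueError("section length mismatch")
    return raw
```

Fewer bytes reach the disk for each sync, at the cost of the CPU time spent in `compress_section`; replay pays the corresponding cost in `decompress_section`.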
The compression is configured using the commitlog_compression parameter in cassandra.yaml, which accepts a compressor class and its parameters. LZ4, Snappy and Deflate compressors are provided with Cassandra, and pluggable ICompressor implementations are supported. We recommend sticking with LZ4Compressor, which provides reasonable compression rates and high throughput. For write-heavy workloads it may be preferable to set the sync period to several hundred milliseconds or use a smaller segment size to make sure compression times are short.
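As an illustration, a cassandra.yaml fragment along these lines would enable LZ4 compression with a shorter sync period. The numeric values are assumptions chosen to match the advice above, not tested recommendations, and the sync and segment-size settings shown are the standard commit log options rather than anything specific to compression.

```yaml
# Compress the commit log with LZ4; omitting this block leaves the log uncompressed.
commitlog_compression:
  - class_name: LZ4Compressor

# Illustrative values only: a sync period of a few hundred milliseconds
# and a smaller segment size keep each compression pass short.
commitlog_sync: periodic
commitlog_sync_period_in_ms: 500
commitlog_segment_size_in_mb: 16
```

Changes to these settings take effect after the node is restarted.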
In our testing, compression resulted in a 6-12% improvement in write performance. It also smoothed out the pauses under heavy writes when Cassandra had to wait for the commit log disk to catch up.