DataStax Developer Blog

New in DataStax Enterprise 2.1: Improved Compaction Strategy to Boost Analytical Processing Performance

By Piotr Kołaczkowski - June 18, 2012

DataStax Enterprise provides a powerful replacement for the Hadoop File System: the Cassandra File System (CFS). CFS is a layer on top of Cassandra dedicated to storing big files directly in the column-family based store. This architecture has several important advantages. First, it greatly simplifies maintenance of DSE: there is no need to install any separate components dedicated to storage. If you have installed Cassandra, you already have a distributed file system ready to keep your files. Second, you automatically get all the benefits of Cassandra’s distributed nature: excellent scalability, automatic failover, read/write-anywhere capability, support for multiple data centers, and many, many more.
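Because CFS implements the standard Hadoop FileSystem interface, existing Hadoop client code can talk to it simply by pointing at a cfs:// URI. The following minimal sketch illustrates the idea; the host name dse-node1 and the paths are placeholders, and it assumes the DSE client jars (which register the cfs:// scheme) are on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical node address; in DSE the cfs:// scheme is served by the Cassandra nodes.
        conf.set("fs.default.name", "cfs://dse-node1/");

        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into CFS, then list the target directory.
        fs.copyFromLocalFile(new Path("/tmp/input.csv"), new Path("/user/demo/input.csv"));
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}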

However, storing very big files in a distributed database can be a bit of a challenge. The initial solution of storing file system blocks in vanilla Cassandra worked just fine and provided the great performance and scalability you know from earlier versions of DSE, but after many experiments we found we could improve it even further. Cassandra comes with a pluggable compaction strategy mechanism. Compaction serves two main purposes: iteratively reordering the data so it can be accessed faster, and getting rid of obsolete data, i.e. overwritten or deleted rows, in order to save space. Out of the box, Cassandra offers two compaction strategies: the Size-Tiered Compaction Strategy (STCS), which was used by default to compact data in the CFS keyspace, and the optional Leveled Compaction Strategy (LCS).
Although compaction is performed in the background, it consumes CPU time and I/O bandwidth. Therefore, it is worth optimizing it as much as possible.

CFS Compaction Strategy

DSE 2.1 provides a new compaction strategy designed specifically for CFS keyspaces. Files in CFS have two important properties that allow for a better compaction strategy than STCS, which is optimized for the general case. First, files in CFS are stored as lists of immutable data blocks. Immutability means it is not possible to change a data block once it has been saved to the filesystem. The Hadoop File System API does not allow for updating file contents; the only permitted modification is deleting a block. Block immutability guarantees that each block is present in at most one SSTable at a time. Second, blocks are large: 128 MB or often larger. This means a few tens of thousands of blocks are enough to store a terabyte of data on a single node. Additionally, a single SSTable contains at most a few blocks, and accessing a random block does not incur a huge relative overhead, as it would if blocks stored only a kilobyte each. These properties lead to the following conclusions:

  • We don’t need to compact to keep the number of SSTables low. It will be low anyway because of the large size of the data blocks. Besides, searching for a key in an SSTable with only a few blocks is crazy fast. We measured that checking a few thousand CFS SSTables for the existence of a key can be done within single milliseconds, thanks to Cassandra’s Bloom filters.
  • If we never compact, how do we get rid of deleted blocks? Instead of deleting rows (blocks) while compacting, we can delete the whole SSTable once we are sure all blocks in it have been deleted. It is easy to check whether a block can be deleted, because thanks to immutability there can be only one copy of the block. Compared to compaction, this can be treated as “zero cost” deletion. For a typical CFS application running Hadoop jobs, this happens very often.
  • If we cannot delete the whole SSTable because some blocks are still alive, but lots of other dead blocks are taking up precious space, we compact that single SSTable by copying the live blocks into a new SSTable and dropping the old one.

The CFS compaction strategy works as follows. When a new SSTable is added to the store (when you save a new file or append to a file), it is indexed, and all row (block) keys can be listed very quickly by scanning the index. The index also contains information about row sizes. If you deleted any files, the new SSTable contains not only the new blocks, but also the tombstones for the deleted blocks. For each tombstone, the SSTable with the corresponding data block is found and the block is marked as deleted. By looking at the SSTable index file and the list of removed keys, it can be estimated very quickly how much live data each SSTable contains and what the ratio of the SSTable size to the live data size is. This ratio is called the fragmentation factor. If the fragmentation factor ever exceeds the max_sstable_fragmentation limit, which defaults to 5.0, the SSTable immediately becomes a candidate for compaction or dropping. This way, SSTables are removed as soon as their data becomes dead, without first having to copy dead row data over and over again from one SSTable to another until it “meets” the tombstone row.
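To make the bookkeeping concrete, here is a rough sketch of the decision logic described above. It is not the actual CFSCompactionStrategy source; the class and field names are invented for illustration, and only the fragmentation-factor arithmetic and the 5.0 default mirror the description in this post.

/** Illustrative model of the per-SSTable bookkeeping described above (not the DSE source code). */
class SSTableStats {
    long totalBytes;   // size of the SSTable on disk
    long liveBytes;    // estimated size of the rows (blocks) that have not been deleted

    double fragmentationFactor() {
        // Ratio of SSTable size to live data size; it grows as blocks are deleted.
        return liveBytes == 0 ? Double.POSITIVE_INFINITY : (double) totalBytes / liveBytes;
    }
}

class CfsCompactionSketch {
    static final double MAX_SSTABLE_FRAGMENTATION = 5.0;   // the default limit mentioned above

    enum Action { KEEP, DROP_WHOLE_SSTABLE, REWRITE_LIVE_BLOCKS }

    /** Decide what to do with an SSTable after newly arrived tombstones have been applied. */
    static Action decide(SSTableStats sstable) {
        if (sstable.liveBytes == 0) {
            // Every block is dead: the whole SSTable can be dropped at "zero cost".
            return Action.DROP_WHOLE_SSTABLE;
        }
        if (sstable.fragmentationFactor() > MAX_SSTABLE_FRAGMENTATION) {
            // Too much dead space: copy only the live blocks into a new SSTable.
            return Action.REWRITE_LIVE_BLOCKS;
        }
        return Action.KEEP;   // mostly live data, nothing to do yet
    }
}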

Configuration and Usage

If you create a fresh database, you don’t need to do anything: all CFS keyspaces are configured with the new compaction strategy. If you move from an earlier version of DSE, you have to change the compaction strategy of the sblocks column family in the CFS keyspace using the following commands:

USE cfs;
UPDATE COLUMN FAMILY sblocks WITH compaction_strategy='com.datastax.bdp.hadoop.cfs.compaction.CFSCompactionStrategy';

To set the max_sstable_fragmentation parameter, use the following:

USE cfs;
UPDATE COLUMN FAMILY sblocks WITH compaction_strategy='com.datastax.bdp.hadoop.cfs.compaction.CFSCompactionStrategy'
AND compaction_strategy_options='[{max_sstable_fragmentation: 2.0}]';

Performance

The Hadoop distribution comes with a set of benchmarks and examples, which are also included in the DSE distribution. We used the TestDFSIO benchmark, designed for stress testing a distributed file system, to compare the performance of DSE 2.0 with DSE 2.1. We were particularly interested in write performance and required storage size, because these are the factors most affected by compaction. We used a 4-node cluster with the following configuration for each node:

Processor: 2 x Intel Xeon E5620 @ 2.40GHz (8 physical cores, 16 threads)
Memory: 32 GB RAM
Storage: 8 HDD x 512 GB @ 10000 RPM
Network: Gigabit Ethernet

For each DSE distribution, we ran TestDFSIO 6 times in a row. The TestDFSIO write benchmark writes a number of files to the distributed file system and measures the throughput. It was configured to write 1000 files of 1 GB each, which makes a total of 1 TB.
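For readers who want to reproduce the experiment, the sketch below shows the TestDFSIO options corresponding to the configuration above (-write, -nrFiles, and -fileSize in MB). TestDFSIO is normally launched with hadoop jar from the command line; invoking it from a small Java driver, as shown here, is only for illustration and assumes org.apache.hadoop.fs.TestDFSIO from the bundled Hadoop test jar is on the classpath.

import org.apache.hadoop.fs.TestDFSIO;

public class RunTestDfsIoWrite {
    public static void main(String[] args) throws Exception {
        // Write 1000 files of 1000 MB (~1 GB) each, roughly 1 TB in total.
        TestDFSIO.main(new String[] { "-write", "-nrFiles", "1000", "-fileSize", "1000" });

        // A subsequent run with "-clean" removes the files created above:
        // TestDFSIO.main(new String[] { "-clean" });
    }
}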

The benchmark results are illustrated in the charts below. The average throughput was better for DSE 2.1 than for DSE 2.0. This is caused mainly by the fact that the new compaction strategy puts less strain on the I/O subsystem. We observed that Cassandra in DSE 2.0 was still compacting for some time after the test finished, contrary to DSE 2.1, which became idle immediately.

Before each run, the files created by previous runs were deleted by running the TestDFSIO program with the -clean option. After each run, we measured how much data was actually stored in the database. The numbers differed greatly in favor of DSE 2.1. With DSE 2.1, the amount of data in the database remained stable regardless of how many times the experiment was repeated. The new compaction code had no problems discarding SSTables with dead data, while DSE 2.0 accumulated dead data from several runs. If you repeatedly run chained Hadoop jobs that create a lot of intermediate, temporary data, using DSE 2.1 with the new compaction strategy is a huge win.



Comments

  1. Anand says:

    Are you planning on making this strategy available to community Cassandra?

  2. Piotr Kołaczkowski says:

    No, currently we are not. This strategy is dedicated to CFS, which is a component present only in the DataStax Enterprise edition. This is not a general-purpose strategy: it wouldn’t work properly if the assumptions described in the post (about the immutability and size of rows) are not met.

  3. Harry deng says:

    Can we use this strategy in Brisk?

  4. David Strauss says:

    This would actually be extremely useful to us in the community releases because we do *exactly* the same thing as CFS in our hash-indexed content storage.
