DataStax Developer Blog

What's new in Cassandra 1.0: Compression

By Pavel Yaskevich - September 19, 2011 | 15 Comments

Cassandra 1.0 introduces support for data compression on a per-ColumnFamily basis, one of the most-requested features since the project started. Compression maximizes the storage capacity of your Cassandra nodes by reducing the volume of data on disk. In addition to the space-saving benefits, compression also reduces disk I/O, particularly for read-dominated workloads.

Compression benefits

Besides reducing data size, compression typically improves both read and write performance. Cassandra can quickly find the location of a row in the SSTable index and decompress only the relevant row chunks. This means compression improves read performance not only by allowing a larger data set to fit in memory, but also for workloads where the hot data set does not fit into memory.

Unlike in traditional databases, write performance is not negatively impacted by compression in Cassandra. Writes on compressed tables can in fact show up to a 10 percent performance improvement. In traditional relational databases, writes require overwrites to existing data files on disk. This means that the database has to locate the relevant pages on disk, decompress them, overwrite the relevant data, and then compress them again – an expensive operation in both CPU cycles and disk I/O.

Because Cassandra SSTable data files are immutable (they are not written to again after they have been flushed to disk), there is no recompression cycle necessary in order to process writes. SSTables are only compressed once, when they are written to disk.

Overall, we are seeing the following results from enabling compression, depending on the data characteristics:

  • 2x-4x reduction in data size
  • 25-35% performance improvement on reads
  • 5-10% performance improvement on writes

When to use compression

Compression is best suited for ColumnFamilies where there are many rows, with each row having the same columns, or at least many columns in common. For example, a ColumnFamily containing user data such as username, email, etc., would be a good candidate for compression. The more similar the data across rows, the greater the compression ratio will be, and the larger the gain in read performance.

Compression is not as good a fit for ColumnFamilies where each row has a different set of columns, or where there are just a few very wide rows. Dynamic column families like these will not yield good compression ratios.

Configuring compression on a ColumnFamily

When you create or update a column family, you can choose to make it a compressed column family by specifying the following storage properties:

  • compression_options: this is a container property for setting compression options on a column family. The compression_options property contains the following options:
    • sstable_compression: specifies the compression algorithm to use when compressing SSTable files. Cassandra supports two built-in compression classes: SnappyCompressor (Snappy compression library) and DeflateCompressor (Java zip implementation). Snappy compression offers faster compression/decompression while the Java zip compression offers better compression ratios. Choosing the right one depends on your requirements for space savings over read performance. For read-heavy workloads, Snappy compression is recommended. Developers can also implement custom compression classes using the org.apache.cassandra.io.compress.ICompressor interface.
    • chunk_length_kb: sets the compression chunk size in kilobytes. The default value (64) is a good middle ground for compressing column families with either wide rows or skinny rows. With wide rows, it allows reading a 64kb slice of column data without decompressing the entire row. For skinny rows, although you may still end up decompressing more data than requested, it is a good trade-off between maximizing the compression ratio and minimizing the overhead of decompressing more data than is needed to access a requested row. The compression chunk size can be adjusted to account for read/write access patterns (how much data is typically requested at once) and the average size of rows in the column family; a sketch of an alternative setting follows this list.
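
For instance, a column family holding rarely-read archival data might favor compression ratio over decompression speed. The statement below is only an illustrative sketch: the archived_events column family and the 256kb chunk size are hypothetical choices, not part of the example later in this post.

[default@demo] UPDATE COLUMN FAMILY archived_events
WITH compression_options = {sstable_compression: DeflateCompressor, chunk_length_kb: 256};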

You can enable compression when you create a new column family, or update an existing column family to add compression later on. When you add compression to an existing column family, existing SSTables on disk are not compressed immediately. Any new SSTables that are created will be compressed, and any existing SSTables will be compressed during the normal Cassandra compaction process. (If necessary, you can force existing SSTables to be rewritten and compressed right away by using nodetool scrub; an example follows the CREATE statement below.)

For example, to create a new column family with compression enabled using the Cassandra CLI, you would do the following:


[default@demo] CREATE COLUMN FAMILY users
WITH key_validation_class = UTF8Type
AND comparator = UTF8Type
AND column_metadata = [
  {column_name: name, validation_class: UTF8Type},
  {column_name: email, validation_class: UTF8Type},
  {column_name: state, validation_class: UTF8Type},
  {column_name: gender, validation_class: UTF8Type},
  {column_name: birth_year, validation_class: LongType}
]
AND compression_options={sstable_compression:SnappyCompressor, chunk_length_kb:64};
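
If the users column family had been created without compression, you could add it afterwards and, rather than waiting for normal compaction, force the SSTables already on disk to be rewritten in compressed form. A minimal sketch, assuming the demo keyspace from the prompt above and a node reachable on localhost:

[default@demo] UPDATE COLUMN FAMILY users
WITH compression_options = {sstable_compression: SnappyCompressor, chunk_length_kb: 64};

$ nodetool -h localhost scrub demo users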

Conclusion

Compression in Cassandra 1.0 is an easy way to reduce storage volume requirements while increasing performance. Compression can be easily added to existing ColumnFamilies after an upgrade, and the implementation allows power users to tweak chunk sizes for maximum benefit.

Comments

  1. Nitish Korla says:

    I understand how avoiding recompression can help Cassandra achieve faster writes than traditional databases, but how can compressed writes be up to 10% faster than uncompressed writes? Is the write compression done while data is being written to the commitlog/memtable, or when memtables are flushed to disk?

  2. Pavel Yaskevich says:

    Compression is done when memtables are flushed to disk. This means less I/O, which can result in a 5-10% performance increase.

  3. Samarth Gahire says:

    How do I revert if I don't want to use compression anymore?

  4. Pavel Yaskevich says:

    The easiest way would be to use the CLI and just set `compression_options` to `null` on your ColumnFamily if you don’t want compression anymore.

  5. Samarth Gahire says:

    I am using the SSTable writer to create the SSTables and loading them using the sstableloader utility. Is there any way I can compress them before loading, or while generating them?

  6. Pavel Yaskevich says:

    I think when you stream those SSTables to the node using the “sstableloader” utility it will compress them for you. Please check; I don’t see any other option.

  7. Samarth Gahire says:

    Thank you so much, Pavel. When I load the SSTables without setting compression it took just 5 minutes for the 8 GB of data, but when I set compression and load the same data it took 1 hour and 45 minutes.
    Am I doing something wrong? Or is that just how things are in Cassandra 1.0.0?

  8. Pavel Yaskevich says:

    Compression makes that process CPU-bound; it might not be a good option to use on that server if you are getting such a slow import rate.

  9. Samarth Gahire says:

    I have tested compression with 8 GB of data and it shows no difference in data size whether compression is used or not.
    After compaction the size of the SSTables is the same in both cases. There is also an extra compression info file of about 550 MB, so on my disk:
    data size with compression used > size without compression
    Can you please elaborate?

  10. Pavel Yaskevich says:

    Can you please make sure that your ColumnFamily meets the criteria listed in the “When to use compression” section?

  11. Samarth Gahire says:

    “Compression is best suited for ColumnFamilies where there are many rows, with each row having the same columns”
    That is the criterion you mentioned, and our data meets it exactly: each row has the same 23 columns, and there are 32 million such rows, 8 GB of data.
    Can you please explain?

  12. Pavel Yaskevich says:

    I can see only one explanation here – the column blocks (column:value) could not be compressed well, probably because of the values. Have you tried setting a bigger “chunk_length_kb”?

  13. Samarth Gahire says:

    Hey Pavel, thank you so much. With a chunk length of 128 kb I am able to compress the SSTables now. It helped a lot.

    I would also like to let you know one thing: we are using the SSTable generation API and the sstableloader utility. As soon as a new Cassandra version is released, I test it for SSTable generation and loading and measure the time taken by both processes. Up to Cassandra 0.8.7 there was no significant change in time taken, but in all Cassandra 1.0.x releases I have seen 3-4x degraded performance in generation and 2x degraded performance in loading. Because of this we are not upgrading Cassandra to the latest version; since we are processing terabytes of data every day, the time taken is very important for us. I hope that you will look into this or convey my message to the Cassandra developers, or let me know how I can contact them.

  14. Pavel Yaskevich says:

    Hi Samarth, you are welcome! I think the best (fastest) way would be to create an issue at https://issues.apache.org/jira/browse/CASSANDRA and describe the problem you are having.

  15. Samarth Gahire says:

    Hey Pavel,
    Thanks for the suggestion. I have created an issue in the Cassandra JIRA and I can see a positive response from the Cassandra developers.
