Apache Cassandra 1.0 Documentation

Tuning Cassandra

This document corresponds to an earlier product version. Make sure you are using the documentation that corresponds to your product version.


Effective tuning depends not only on the types of operations your cluster performs most frequently, but also on the shape of the data itself. For example, Cassandra's memtables have overhead for index structures on top of the actual data they store. If the size of the values stored in the columns is small compared to the number of columns and rows themselves (sometimes called skinny rows), this overhead can be substantial. Thus, the optimal tuning for this type of data is quite different from the optimal tuning for a small number of columns with more data per column (fat rows).

Tuning the Cache

Cassandra's built-in key and row caches can provide very efficient data caching. Some Cassandra production deployments have leveraged Cassandra's caching features to the point where dedicated caching tools such as memcached could be completely replaced. Such deployments not only remove a redundant layer from the stack, but they also achieve the fundamental efficiency of strengthening caching functionality in the lower tier where the data is already being stored. Among other advantages, this means that caching never needs to be restarted in a completely cold state.

Cache tuning should be done using small, incremental adjustments and then monitoring the effects of each change. See Monitoring and Adjusting Cache Performance for more information about monitoring tuning changes to a column family cache. With proper tuning, key cache hit rates of 85% or better are possible. Row caching, when feasible, can save the system from performing any disk seeks at all when fetching a cached row. Whenever growth in the read load begins to impact your cache hit rates, you can add capacity to quickly restore optimal levels of caching.

How Caching Works

If both row and key caches are configured, the row cache will return results whenever possible. In the case of a row cache miss, the key cache may still provide a hit, assuming that it holds a larger number of keys than the row cache holds rows.

If a read operation hits the row cache, the entire requested row is returned without a disk seek. If a row is not in the row cache, but is present in the key cache, the key cache is used to find the exact location of the row on disk in the SSTable. If a row is not in the key cache, the read operation will populate the key cache after accessing the row on disk so subsequent reads of the row can benefit. Each hit on a key cache can save one disk seek per SSTable.

Configuring the Column Family Key Cache

The key cache holds the location of row keys in memory on a per-column family basis. High levels of key caching are recommended for most production scenarios. Turning this level up can optimize reads (after the cache warms) when there is a large number of rows that are accessed frequently.

The caching of 200,000 row keys is enabled by default. This can be adjusted by setting keys_cached on a column family. For example, using Cassandra CLI:

[default@demo] UPDATE COLUMN FAMILY users WITH keys_cached=205000;

Key cache performance can be monitored by using nodetool cfstats and examining the reported 'Key cache hit rate'. See also Monitoring and Adjusting Cache Performance for more information about monitoring tuning changes to a column family key cache.
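For example, a quick check from the command line might look like this (the host name, column family, and figures are illustrative, and the output is abbreviated to the relevant lines):

nodetool -h localhost cfstats

    Column Family: users
    Key cache capacity: 205000
    Key cache size: 201317
    Key cache hit rate: 0.913

A hit rate close to 1.0 means most reads are locating row keys in memory; a rate well below the 85% guideline above suggests increasing keys_cached or adding capacity.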

Configuring the Column Family Row Cache

The row cache holds the entire contents of the row in memory. In cases where rows are large or frequently modified/removed, row caching can actually be detrimental to performance. For this reason, row caching is disabled by default.

Row cache should remain disabled for column families with large rows or high write:read ratios. In these situations, row cache can very quickly consume a large amount of available memory. Note also that, when a row cache is operating efficiently, it keeps Java garbage collection processes very active.

Row caching is best for workloads that access a small subset of the overall rows, and within those rows, all or most of the columns are returned. For this use case a row cache keeps the most accessed rows hot in memory, and can have substantial performance benefits.

To enable row cache on a column family, set rows_cached to the desired number of rows.
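For example, using Cassandra CLI (the column family name and row count are illustrative):

[default@demo] UPDATE COLUMN FAMILY users WITH rows_cached=10000;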

Row cache performance can be monitored by using nodetool cfstats and examining the reported 'Row cache hit rate'. See also Monitoring and Adjusting Cache Performance for more information about monitoring tuning changes to a column family row cache.

Data Modeling Considerations for Cache Tuning

If your requirements permit it, a data model that logically separates heavily-read data into discrete column families can help optimize caching. Column families with relatively small, narrow rows lend themselves to highly efficient row caching. By the same token, it can make sense to separately store lower-demand data, or data with extremely long rows, in a column family with minimal caching, if any.

Row caching in such contexts brings the most benefit when access patterns follow a normal (Gaussian) distribution: when the most frequently requested keys cluster in this way, cache hit rates tend to increase. If you have particularly hot rows in your data model, row caching can bring significant performance improvements.

Hardware and OS Considerations for Cache Tuning

Deploying a large number of Cassandra nodes under a relatively light load per node will maximize the fundamental benefit from key and row caches.

A less obvious but very important consideration is the OS page cache. Modern operating systems maintain page caches for frequently accessed data and are very efficient at keeping this data in memory. Even after a row has been released from the Java Virtual Machine's memory, it can remain in the OS page cache, especially if the data is requested repeatedly or if no other requested data replaces it.

If your requirements allow you to lower JVM heap size and memtable sizes to leave memory for OS page caching, then do so. Ultimately, through gradual adjustments, you should achieve the desired balance between these three demands on available memory: heap, memtables, and caching.

Estimating Cache Sizes

nodetool cfstats can be used to get the necessary information for estimating actual cache sizes.

To estimate the key cache size for a single column family:

key cache size = (average('Key size in bytes') + 64) * 'Key cache size' * 10^-12

To estimate the row cache size:

row cache size = (average('Row cache size in bytes') + 64) * 'Row cache size' * 10^-12
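As a rough worked example of the per-entry cost (the figures are illustrative): with an average key size of 48 bytes and 200,000 cached keys, the key cache holds approximately (48 + 64) * 200,000 = 22,400,000 bytes of keys plus overhead, or a little over 21 MB.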

The row cache for a column family is stored in native memory by default rather than using the JVM heap.

Tuning Write Performance (Memtables)

A memtable is a column-family-specific, in-memory data structure that can best be described as a write-back cache. Memtables are flushed to disk, creating SSTables, whenever one of the configurable thresholds is exceeded.

Effectively tuning memtable thresholds depends on your data as much as your write load. Memtable thresholds are configured per node using the cassandra.yaml properties memtable_total_space_in_mb and commitlog_total_space_in_mb.
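For example, a node could carry settings along the following lines in cassandra.yaml (the values are illustrative, not recommendations):

memtable_total_space_in_mb: 2048
commitlog_total_space_in_mb: 4096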

You should increase memtable throughput if:

  1. Your write load includes a high volume of updates on a smaller set of data
  2. You have a steady stream of continuous writes (this leads to more efficient compaction)

Note that increasing memory allocations for memtables takes memory away from caching and other internal Cassandra structures, so tune carefully and in small increments.

Tuning Java Heap Size

Because Cassandra is a database, it spends significant time interacting with the operating system's I/O infrastructure through the JVM, so a well-tuned Java heap size is important. Cassandra's default configuration opens the JVM with a heap size that is based on the total amount of system memory:

System Memory       Heap Size
Less than 2GB       1/2 of system memory
2GB to 4GB          1GB
Greater than 4GB    1/4 of system memory, but not more than 8GB
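Under these defaults, for example, a node with 16GB of system memory starts with a 4GB heap, while a node with 64GB of system memory is still capped at an 8GB heap.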

General Guidelines

Many users new to Cassandra are tempted to turn up Java heap size too high, which consumes the majority of the underlying system's RAM. In most cases, increasing the Java heap size is actually detrimental for these reasons:

  • The capability of Java 6 to gracefully handle garbage collection diminishes quickly for heaps above 8GB.
  • Modern operating systems maintain the OS page cache for frequently accessed data and are very good at keeping this data in memory, but an oversized Java heap leaves the page cache too little memory to do its job.

If you have more than 2GB of system memory, which is typical, keep the size of the Java heap relatively small to allow more memory for the page cache. To change a JVM setting, modify the cassandra-env.sh file.
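For example, the heap can be pinned to an explicit size by uncommenting and editing the following variables in cassandra-env.sh (the values shown are illustrative):

MAX_HEAP_SIZE="4G"
HEAP_NEWSIZE="400M"

When these variables are left commented out, the heap is sized automatically according to the table above.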

Because MapReduce runs outside of Cassandra's JVM, changes to Cassandra's JVM settings do not affect Hadoop operations directly.

Tuning Java Garbage Collection

Cassandra's GCInspector class will log information about garbage collection whenever a garbage collection takes longer than 200ms. If garbage collections are occurring frequently and are taking a moderate length of time to complete (such as ConcurrentMarkSweep taking a few seconds), this is an indication that there is a lot of garbage collection pressure on the JVM; this needs to be addressed by adding nodes, lowering cache sizes, or adjusting the JVM options regarding garbage collection.
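For reference, garbage collection behavior is governed by the JVM_OPTS entries in cassandra-env.sh, which configure the CMS collector with options along these lines (treat the exact values as illustrative of the stock file rather than recommended changes):

JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"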

Tuning Compaction

During normal operations, numerous SSTables may be created on disk for a given column family. Compaction is the process of merging multiple SSTables into one consolidated SSTable. Additionally, the compaction process merges keys, combines columns, discards tombstones and creates a new index in the merged SSTable.

Choosing a Column Family Compaction Strategy

Tuning compaction involves first choosing the right compaction strategy for each column family based on its access patterns. As of Cassandra 1.0, there are two choices of compaction strategies:

  • SizeTieredCompactionStrategy - This is the default compaction strategy for a column family, and prior to Cassandra 1.0, the only compaction strategy available. This strategy is best suited for column families with insert-mostly workloads that are not read as frequently. This strategy also requires closer monitoring of disk utilization because (as a worst case scenario) a column family can temporarily double in size while a compaction is in progress.
  • LeveledCompactionStrategy - This is a new compaction strategy introduced in Cassandra 1.0. This compaction strategy is based on (but not an exact implementation of) Google's leveldb. This strategy is best suited for column families with read-heavy workloads that also have frequent updates to existing rows. When using this strategy, you want to keep an eye on read latency performance for the column family. If a node cannot keep up with the write workload and pending compactions are piling up, then read performance will degrade for a longer period of time.

Setting the Compaction Strategy on a Column Family

You can set the compaction strategy on a column family by setting the compaction_strategy attribute. For example, to update a column family to use the leveled compaction strategy using Cassandra CLI:

[default@demo] UPDATE COLUMN FAMILY users WITH compaction_strategy=LeveledCompactionStrategy
AND compaction_strategy_options={sstable_size_in_mb: 10};

Tuning Options for Size-Tiered Compaction

For column families that use size-tiered compaction (the default), the frequency and scope of minor compactions is controlled by the column family attributes min_compaction_threshold and max_compaction_threshold.

These attributes set thresholds for the number of similar-sized SSTables that can accumulate before a minor compaction is triggered. With the default values (min_compaction_threshold = 4, max_compaction_threshold = 32), a minor compaction may begin any time after four SSTables are created on disk for a column family, and must begin before 32 SSTables accumulate.

You can tune these values per column family. For example, using Cassandra CLI:

[default@demo] UPDATE COLUMN FAMILY users WITH max_compaction_threshold = 20;

Note

Administrators can also initiate a major compaction through nodetool compact, which merges all SSTables into one. Though major compaction can free disk space used by accumulated SSTables, it temporarily doubles disk space usage while it runs and is I/O and CPU intensive. Also, once you run a major compaction, automatic minor compactions are no longer triggered frequently, forcing you to manually run major compactions on a routine basis. So while read performance will be good immediately following a major compaction, it will continually degrade until the next major compaction is manually invoked. For this reason, major compaction is NOT recommended by DataStax.
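Should a major compaction still be required, it can be limited to a single keyspace and column family (the names below are illustrative):

nodetool -h localhost compact demo users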