Effective tuning depends not only on the types of operations your cluster performs most frequently, but also on the shape of the data itself. For example, Cassandra's memtables have overhead for index structures on top of the actual data they store. If the size of the values stored in the columns is small compared to the number of columns and rows themselves (sometimes called skinny rows), this overhead can be substantial. Thus, the optimal tuning for this type of data is quite different than the optimal tuning for a small numbers of columns with more data (fat rows).
Cassandra's built-in key and row caches can provide very efficient data caching. Some Cassandra production deployments have leveraged Cassandra's caching features to the point where dedicated caching tools such as memcached could be completely replaced. Such deployments not only remove a redundant layer from the stack, but they also achieve the fundamental efficiency of strengthening caching functionality in the lower tier where the data is already being stored. Among other advantages, this means that caching never needs to be restarted in a completely cold state.
Cache tuning should be done using small, incremental adjustments and then monitoring the effects of each change. See Monitoring and Adjusting Cache Performance for more information about monitoring tuning changes to a column family cache. With proper tuning, key cache hit rates of 85% or better are possible. Row caching, when feasible, can save the system from performing any disk seeks at all when fetching a cached row. Whenever growth in the read load begins to impact your cache hit rates, you can add capacity to quickly restore optimal levels of caching.
If both row and key caches are configured, the row cache will return results whenever possible. In the case of a row cache miss, the key cache may still provide a hit, assuming that it holds a larger number of keys than the row cache.
If a read operation hits the row cache, the entire requested row is returned without a disk seek. If a row is not in the row cache, but is present in the key cache, the key cache is used to find the exact location of the row on disk in the SSTable. If a row is not in the key cache, the read operation will populate the key cache after accessing the row on disk so subsequent reads of the row can benefit. Each hit on a key cache can save one disk seek per SSTable.
The key cache holds the location of row keys in memory on a per-column family basis. High levels of key caching are recommended for most production scenarios. Turning this level up can optimize reads (after the cache warms) when there is a large number of rows that are accessed frequently.
The caching of 200,000 row keys is enabled by default. This can be adjusted by setting keys_cached on a column family. For example, using Cassandra CLI:
[default@demo] UPDATE COLUMN FAMILY users WITH keys_cached=205000;
Key cache performance can be monitored by using nodetool cfstats and examining the reported 'Key cache hit rate'. See also Monitoring and Adjusting Cache Performance for more information about monitoring tuning changes to a column family key cache.
The row cache holds the entire contents of the row in memory. In cases where rows are large or frequently modified/removed, row caching can actually be detrimental to performance. For this reason, row caching is disabled by default.
Row cache should remain disabled for column families with large rows or high write:read ratios. In these situations, row cache can very quickly consume a large amount of available memory. Note also that, when a row cache is operating efficiently, it keeps Java garbage compaction processes very active.
Row caching is best for workloads that access a small subset of the overall rows, and within those rows, all or most of the columns are returned. For this use case a row cache keeps the most accessed rows hot in memory, and can have substantial performance benefits.
To enable row cache on a column family, set rows_cached to the desired number of rows.
Row cache performance can be monitored by using nodetool cfstats and examining the reported 'Row cache hit rate'. See also Monitoring and Adjusting Cache Performance for more information about monitoring tuning changes to a column family key cache.
If your requirements permit it, a data model that logically separates heavily-read data into discrete column families can help optimize caching. Column families with relatively small, narrow rows lend themselves to highly efficient row caching. By the same token, it can make sense to separately store lower-demand data, or data with extremely long rows, in a column family with minimal caching, if any.
Row caching in such contexts brings the most benefit when access patterns follow a normal (Gaussian) distribution. When the keys most frequently requested follow such patterns, cache hit rates tend to increase. If you have particularly hot rows in your data model, row caching can bring significant performance improvements.
Deploying a large number of Cassandra nodes under a relatively light load per node will maximize the fundamental benefit from key and row caches.
A less obvious but very important consideration is the OS page cache. Modern operating systems maintain page caches for frequently accessed data and are very efficient at keeping this data in memory. Even after a row is released in the Java Virtual Machine memory, it can be kept in the OS page cache, especially if the data is requested repeatedly or no other requested data replaces it.
If your requirements allow you to lower JVM heap size and memtable sizes to leave memory for OS page caching, then do so. Ultimately, through gradual adjustments, you should achieve the desired balance between these three demands on available memory: heap, memtables, and caching.
nodetool cfstats can be used to get the necessary information for estimating actual cache sizes.
To estimate the key cache size for a single column family:
key cache size = (average('Key size in bytes') + 64)) * 'Key cache size' * 10-12
To estimate the row cache size:
row cache size = (average 'Row cache size in bytes' + 64) * 'Row cache size' * 10-12
The row cache for a column family is stored in native memory by default rather than using the JVM heap.
A memtable is a column family specific, in memory data structure that can be easily described as a write-back cache. Memtables are flushed to disk, creating SSTables whenever one of the configurable thresholds has been exceeded.
Effectively tuning memtable thresholds depends on your data as much as your write load. Memtable thresholds are configured per node using the cassandra.yaml properties: memtable_throughput_in_mb and commitlog_total_space_in_mb.
You should increase memtable throughput if:
Note that increasing memory allocations for memtables takes memory away from caching and other internal Cassandra structures, so tune carefully and in small increments.
Because Cassandra is a database, it spends significant time interacting with the operating system's I/O infrastructure through the JVM, so a well-tuned Java heap size is important. Cassandra's default configuration opens the JVM with a heap size that is based on the total amount of system memory:
| System Memory | Heap Size |
|---|---|
| Less than 2GB | 1/2 of system memory |
| 2GB to 4GB | 1GB |
| Greater than 4GB | 1/4 system memory, but not more than 8GB |
General Guidelines
Many users new to Cassandra are tempted to turn up Java heap size too high, which consumes the majority of the underlying system's RAM. In most cases, increasing the Java heap size is actually detrimental for these reasons:
If you have more than 2GB of system memory, which is typical, keep the size of the Java heap relatively small to allow more memory for the page cache. To change a JVM setting, modify the cassandra-env.sh file.
Because MapReduce runs outside the JVM, changes to the JVM do not affect Hadoop operations directly.
Cassandra's GCInspector class will log information about garbage collection whenever a garbage collection takes longer than 200ms. If garbage collections are occurring frequently and are taking a moderate length of time to complete (such as ConcurrentMarkSweep taking a few seconds), this is an indication that there is a lot of garbage collection pressure on the JVM; this needs to be addressed by adding nodes, lowering cache sizes, or adjusting the JVM options regarding garbage collection.
During normal operations, numerous SSTables may be created on disk for a given column family. Compaction is the process of merging multiple SSTables into one consolidated SSTable. Additionally, the compaction process merges keys, combines columns, discards tombstones and creates a new index in the merged SSTable.
Tuning compaction involves first choosing the right compaction strategy for each column family based on its access patterns. As of Cassandra 1.0, there are two choices of compaction strategies:
You can set the compaction strategy on a column family by setting the compaction_strategy attribute. For example, to update a column family to use the leveled compaction strategy using Cassandra CLI:
[default@demo] UPDATE COLUMN FAMILY users WITH compaction_strategy=LeveledCompactionStrategy
AND compaction_strategy_options={sstable_size_in_mb: 10};
For column families that use size-tiered compaction (the default), the frequency and scope of minor compactions is controlled by the following column family attributes:
These parameters set thresholds for the number of similar-sized SSTables that can accumulate before a minor compaction is triggered. With the default values, a minor compaction may begin any time after four SSTables are created on disk for a column family, and must begin before 32 SSTables accumulate.
You can tune these values per column family. For example, using Cassandra CLI:
[default@demo] UPDATE COLUMN FAMILY users WITH max_compaction_threshold = 20;
Note
Administrators can also initiate a major compaction through nodetool compact, which merges all SSTables into one. Though major compaction can free disk space used by accumulated SSTables, during runtime it temporarily doubles disk space usage and is I/O and CPU intensive. Also, once you run a major compaction, automatic minor compactions are no longer triggered frequently forcing you to manually run major compactions on a routine basis. So while read performance will be good immediately following a major compaction, it will continually degrade until the next major compaction is manually invoked. For this reason, major compaction is NOT recommended by DataStax.