Effective tuning depends not only on the types of operations your cluster performs most frequently, but also on the shape of the data itself. For example, Cassandra’s memtables have overhead for index structures on top of the actual data they store. If the size of the values stored in the columns is small compared to the number of columns and rows themselves (sometimes called skinny rows), this overhead can be substantial. Thus, the optimal tuning for this type of data is quite different than the optimal tuning for a small numbers of columns with more data (fat rows).
Cassandra’s built-in key and row caches can provide very efficient data caching. Some Cassandra production deployments have leveraged Cassandra’s caching features to the point where dedicated caching tools such as memcached could be completely replaced. Such deployments not only remove a redundant layer from the stack, but they also achieve the fundamental efficiency of strengthening caching functionality in the lower tier where the data is already being stored. Among other advantages, this means that caching never needs to be restarted in a completely cold state.
Cache tuning should be done using small, incremental adjustments and then monitoring the effects of each change. See Monitoring and Adjusting Cache Performance for more information about monitoring tuning changes to a column family cache. With proper tuning, key cache hit rates of 85% or better are possible. Row caching, when feasible, can save the system from performing any disk seeks at all when fetching a cached row. Whenever growth in the read load begins to impact your cache hit rates, you can add capacity to quickly restore optimal levels of caching.
If both row and key caches are configured, the row cache will return results whenever possible. In the case of a row cache miss, the key cache may still provide a hit, assuming that it holds a larger number of keys than the row cache.
If a read operation hits the row cache, the entire requested row is returned without a disk seek. If a row is not in the row cache, but is present in the key cache, the key cache is used to find the exact location of the row on disk in the SSTable. If a row is not in the key cache, the read operation will populate the key cache after accessing the row on disk so subsequent reads of the row can benefit. Each hit on a key cache can save one disk seek per SSTable.
The key cache holds the location of row keys in memory on a per-column family basis. High levels of key caching are recommended for most production scenarios. Turning this level up can optimize reads (after the cache warms) when there is a large number of rows that are accessed frequently.
The caching of 200,000 row keys is enabled by default. This can be adjusted by setting keys_cached on a column family. For example, using Cassandra CLI:
[default@demo] UPDATE COLUMN FAMILY users WITH keys_cached=205000;
Key cache performance can be monitored by using nodetool cfstats and examining the reported ‘Key cache hit rate’. See also Monitoring and Adjusting Cache Performance for more information about monitoring tuning changes to a column family key cache.
The row cache holds the entire contents of the row in memory. In cases where rows are large or frequently modified/removed, row caching can actually be detrimental to performance. For this reason, row caching is disabled by default.
Row cache should remain disabled for column families with large rows or high write:read ratios. In these situations, row cache can very quickly consume a large amount of available memory. Note also that, when a row cache is operating efficiently, it keeps Java garbage compaction processes very active.
Row caching is best for workloads that access a small subset of the overall rows, and within those rows, all or most of the columns are returned. For this use case a row cache keeps the most accessed rows hot in memory, and can have substantial performance benefits.
To enable row cache on a column family, set rows_cached to the desired number of rows. To enable off-heap row caching, set row_cache_provider to SerializingCacheProvider. For example, using Cassandra CLI:
[default@demo] UPDATE COLUMN FAMILY users WITH rows_cached=205000
AND row_cache_provider='SerializingCacheProvider';
Row cache performance can be monitored by using nodetool cfstats and examining the reported ‘Row cache hit rate’. See also Monitoring and Adjusting Cache Performance for more information about monitoring tuning changes to a column family key cache.
If your requirements permit it, a data model that logically separates heavily-read data into discrete column families can help optimize caching. Column families with relatively small, narrow rows lend themselves to highly efficient row caching. By the same token, it can make sense to separately store lower-demand data, or data with extremely long rows, in a column family with minimal caching, if any.
Row caching in such contexts brings the most benefit when access patterns follow a normal (Gaussian) distribution. When the keys most frequently requested follow such patterns, cache hit rates tend to increase. If you have particularly hot rows in your data model, row caching can bring significant performance improvements.
Deploying a large number of Cassandra nodes under a relatively light load per node will maximize the fundamental benefit from key and row caches.
A less obvious but very important consideration is the OS page cache. Modern operating systems maintain page caches for frequently accessed data and are very efficient at keeping this data in memory. Even after a row is released in the Java Virtual Machine memory, it can be kept in the OS page cache – especially if the data is requested repeatedly, or no other requested data replaces it.
If your requirements allow you to lower JVM heap size and memtable sizes to leave memory for OS page caching, then do so. Ultimately, through gradual adjustments, you should achieve the desired balance between these three demands on available memory: heap, memtables, and caching.
nodetool cfstats can be used to get the necessary information for estimating actual cache sizes.
To estimate the key cache size, multiply the reported ‘Key cache size’ for each column family by 10-12 to account for the difference in reported Java heap usage and actual size. Then, add the results for all column families.
To estimate the row cache size, multiply the reported ‘Row cache size’ for each column family (which is the number of rows in the cache) by the average size of each row. Then multiply that figure by 10-12 and sum the results over all column families.
A memtable is a column family specific, in memory data structure that can be easily described as a write-back cache. Memtables are flushed to disk, creating SSTables whenever one of the configurable thresholds has been exceeded.
Effectively tuning memtable thresholds depends on your data as much as your write load. Memtable thresholds are configured primarily by memtable_throughput_in_mb and memtable_operations_in_millions.
You should increase memtable throughput if:
Look instead at adjusting MemtableObjectCountInMillions if, as previously described, you have large numbers of skinny rows. Memtable flushes should be tuned using this value instead to avoid consuming too much memory with metadata.
Note that any upwards adjustment of memtable thresholds will take memory away from caching and other internal Cassandra structures, so tune carefully and in small increments.
Cassandra’s default configuration opens the JVM with a heap size of half of the available system memory. Many users new to Cassandra are tempted to turn this value up immediately to consume the majority of the underlying system’s RAM. Doing so in most cases is actually detrimental. The reason for this is that Cassandra, being essentially a database, spends a lot of time interacting with the operating system’s I/O infrastructure (via the JVM of course). Modern operating systems maintain disk caches for frequently accessed data and are very good at keeping this data in memory. Regardless of how much RAM your hardware has, you should keep the JVM heap size constrained by the following formula and allow the operating system’s file cache to do the rest:
(memtable_total_space_in_mb) + 1GB + (cache_size_estimate)
Cassandra’s GCInspector will log information about garbage collection whenever a garbage collection takes longer than 200ms. If garbage collections are occurring frequently and are taking a moderate length of time to complete (such as ConcurrentMarkSweep taking a few seconds), this is an indication that there is a lot of garbage collection pressure on the JVM; this needs to be addressed by adding nodes, lowering cache sizes, or adjusting GC options in the JVM.
During normal operations, numerous SSTables may be created on disk for a given column family. Compaction is the process of merging multiple SSTables into one. Additionally, the compaction process merges keys, combines columns, discards tombstones and creates a new index in the merged SSTable.
Minor compaction in Cassandra is performed regularly and automatically for each column family. The frequency and scope of minor compactions is controlled by the following column family attributes:
These parameters set thresholds for the number of similar-sized SSTables that can accumulate before a minor compaction is triggered. With the default values, a minor compaction may begin any time after four SSTables are created on disk for a column family, and must begin before 32 SSTables accumulate.
You can tune these values per column family. For example, using Cassandra CLI:
[default@demo] UPDATE COLUMN FAMILY users WITH max_compaction_threshold = 20;
Note
Administrators can also initiate a major compaction through nodetool, which merges all SSTables into one. Though major compaction can free disk space used by accumulated SSTables, during runtime it temporarily doubles disk space usage and is CPU intensive. Also, once you run a major compaction, automatic minor compactions are no longer triggered frequently forcing you to manually run major compactions on a routine basis. Major compaction is NOT recommended by DataStax.