Cassandra has offered built-in key and row caches for a long time. As of Cassandra 1.1, cache tuning was completely revamped to make caches easier to use effectively. Caching is integrated with the database. One advantage of this integration is that Cassandra distributes cache data around the cluster for you. When a node goes down, the client can read from another cached replica of the data. The integrated architecture also facilitates troubleshooting because there is no separate caching tier, and cached data matches what’s in the database exactly. The integrated cache also solves the cold start problem: Cassandra periodically saves the cache to disk and reads its contents back in when a node restarts, so you never have to start with a cold cache.
The key cache is a cache of the primary key index for a Cassandra table. Using the key cache instead of relying on the OS page cache saves CPU time and memory. However, enabling just the key cache results in disk (or OS page cache) activity to actually read the requested data rows.
The row cache is similar to a traditional cache like memcached: when a row is accessed, the entire row is pulled into memory (merging from multiple SSTables if necessary) and cached so that further reads against that row can be satisfied without hitting disk at all.
Typically, you enable either the key or row cache for a table. The main exception is for archive tables that are infrequently read. You should disable caching entirely for archive tables.
In the cassandra.yaml file, set these main caching options:
Unlike earlier Cassandra versions, cache sizes do not need to be specified per table. Just set caching to all, keys_only, rows_only, or none. For example, this CQL statement creates a table with caching set to all:
CREATE TABLE users (
  userid text PRIMARY KEY,
  first_name text,
  last_name text
)
WITH caching = 'all';
Other caching options set in the cassandra.yaml are:
You can set the row_cache_provider to one of these options:
SerializingCacheProvider is the default and more memory-efficient option, between five and ten times more efficient for applications that are not blob-intensive. Using ConcurrentLinkedHashCacheProvider makes sense for use with update-heavy workloads because it updates data in place. SerializingCacheProvider, on the other hand, invalidates cached rows on update, and therefore, might not perform as well with update-heavy workloads.
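The difference in update handling can be illustrated with a toy model. This is only a sketch of the two behaviors described above, not Cassandra's implementation: the class names InPlaceRowCache and InvalidatingRowCache and their methods are hypothetical, and plain dictionaries stand in for cached rows.

```python
class InPlaceRowCache:
    """Hypothetical stand-in for a cache that updates cached rows in place."""

    def __init__(self):
        self.rows = {}  # row key -> dict of column name -> value

    def put(self, key, row):
        # Cache a copy of the row after it has been read.
        self.rows[key] = dict(row)

    def get(self, key):
        # Return the cached row, or None on a cache miss.
        return self.rows.get(key)

    def on_update(self, key, changes):
        # A write to a cached row is applied to the cached copy,
        # so the row stays cached and the next read is still a hit.
        if key in self.rows:
            self.rows[key].update(changes)


class InvalidatingRowCache(InPlaceRowCache):
    """Hypothetical stand-in for a cache that invalidates rows on update."""

    def on_update(self, key, changes):
        # A write to a cached row evicts it, so the next read is a
        # miss and must go back to the SSTables.
        self.rows.pop(key, None)


in_place = InPlaceRowCache()
invalidating = InvalidatingRowCache()
for cache in (in_place, invalidating):
    cache.put("user1", {"first_name": "Ann"})
    cache.on_update("user1", {"last_name": "Lee"})
```

After the update, a read of user1 is still a hit in the in-place model but a miss in the invalidating model, which is why a workload that updates the same rows repeatedly might see a lower row cache hit rate with the invalidating behavior.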
Enable the key and row caches at the table level using CQL to set the caching option. Set the caching parameter to enable or disable caching on the keys, the rows, or both for a table.
When both row and key caches are configured, the row cache returns results whenever possible. In the event of a row cache miss, the key cache might still provide a hit that makes the disk seek much more efficient. This diagram depicts two read operations on a table with both caches already populated.
One read operation hits the row cache, returning the requested row without a disk seek. The other read operation requests a row that is not present in the row cache but is present in the key cache. After accessing the row in the SSTable, the system returns the data and populates the row cache with this read operation.
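The two read operations described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Cassandra's actual read path: the Table class, its attributes, and the single in-memory list standing in for an SSTable are all hypothetical.

```python
class Table:
    """Toy model of a table with both a row cache and a key cache."""

    def __init__(self, sstable):
        # sstable: list of (key, row) pairs standing in for on-disk data.
        self.sstable = sstable
        # Stand-in for the on-disk primary key index: key -> position.
        self.index = {key: pos for pos, (key, _) in enumerate(sstable)}
        self.key_cache = {}     # key -> position of the row in the SSTable
        self.row_cache = {}     # key -> entire row
        self.index_lookups = 0  # simulated reads of the on-disk index
        self.data_reads = 0     # simulated reads of row data from disk

    def read(self, key):
        # 1. A row cache hit returns the row without touching disk.
        if key in self.row_cache:
            return self.row_cache[key], "row cache hit"

        # 2. On a row cache miss, the key cache may still supply the
        #    row's position, which avoids the index lookup.
        if key in self.key_cache:
            pos = self.key_cache[key]
            source = "key cache hit"
        else:
            self.index_lookups += 1
            pos = self.index[key]  # raises KeyError for an unknown key
            self.key_cache[key] = pos
            source = "cache miss"

        # 3. Read the row data, then populate the row cache.
        self.data_reads += 1
        _, row = self.sstable[pos]
        self.row_cache[key] = row
        return row, source


table = Table([("alice", {"first_name": "Alice"}),
               ("bob", {"first_name": "Bob"})])
table.row_cache["alice"] = {"first_name": "Alice"}  # already in the row cache
table.key_cache["bob"] = 1                          # only in the key cache

first = table.read("alice")  # served from the row cache
second = table.read("bob")   # key cache hit, then one data read
```

In this model the read of alice performs no data read, while the read of bob skips the index lookup, performs one data read, and leaves bob in the row cache for later reads.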
Some tips for efficient cache use are:
Cassandra's memtables have overhead for index structures on top of the actual data they store. If the size of the values stored in the heavily-read columns is small compared to the number of columns and rows themselves, this overhead can be substantial. Rows having this type of data do not lend themselves to efficient row caching.
Make changes to cache options in small, incremental adjustments, then monitor the effects of each change using DataStax OpsCenter (http://www.datastax.com/products/opscenter). In the event of high memory consumption, consider tuning data caches.