Apache Cassandra 1.2 Documentation

Enabling and configuring data caches

Cassandra has long offered built-in key and row caches. As of Cassandra 1.1, cache tuning was completely revamped to make caches easier to use effectively. Caching is integrated with the database. One advantage of this integration is that Cassandra distributes cache data around the cluster for you: when a node goes down, the client can read from another cached replica of the data. The integrated architecture also facilitates troubleshooting because there is no separate caching tier, and cached data matches what’s in the database exactly. The integrated cache also solves the cold start problem: Cassandra periodically saves the cache to disk and reads its contents back in on restart, so you never have to start with a cold cache.

About the key cache

The key cache is a cache of the primary key index for a Cassandra table. Using the key cache instead of relying on the OS page cache saves CPU time and memory. However, enabling just the key cache results in disk (or OS page cache) activity to actually read the requested data rows.

About the row cache

The row cache is similar to a traditional cache like memcached: when a row is accessed, the entire row is pulled into memory (merging from multiple SSTables if necessary) and cached so that further reads against that row can be satisfied without hitting disk at all.

Typically, you enable either the key or row cache for a table. The main exception is for archive tables that are infrequently read. You should disable caching entirely for archive tables.
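
As a minimal sketch of the archive-table case, the statement below turns caching off for an existing table. The table name archive_events is hypothetical; the caching value 'none' is one of the four settings this document lists.

```cql
-- Hypothetical archive table that is rarely read:
-- disable both the key cache and the row cache for it.
ALTER TABLE archive_events
  WITH caching = 'none';
```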

Configuring caches

Set the main caching options in the cassandra.yaml file.

Unlike earlier Cassandra versions, cache sizes do not need to be specified per table. At the table level, just set caching to all, keys_only, rows_only, or none. For example, this CQL statement creates a table with caching set to all:

CREATE TABLE users (
  userid text PRIMARY KEY,
  first_name text,
  last_name text
)
WITH caching = 'all';

Other caching options are also set in the cassandra.yaml file. One of them is row_cache_provider, which you can set to one of these options:

  • SerializingCacheProvider
  • ConcurrentLinkedHashCacheProvider

SerializingCacheProvider is the default and more memory-efficient option, between five and ten times more efficient for applications that are not blob-intensive. Using ConcurrentLinkedHashCacheProvider makes sense for use with update-heavy workloads because it updates data in place. SerializingCacheProvider, on the other hand, invalidates cached rows on update, and therefore, might not perform as well with update-heavy workloads.
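
For orientation, a cassandra.yaml fragment covering node-wide cache settings might look like the sketch below. Only row_cache_provider and its two values come from this page. The other option names are recalled from the cassandra.yaml file shipped with Cassandra 1.2, and every value is illustrative rather than a recommendation, so check the names and defaults against the file in your own installation.

```yaml
# Illustrative values only; verify option names and defaults
# against the cassandra.yaml shipped with your release.
key_cache_size_in_mb: 100      # total capacity of all key caches on the node
key_cache_save_period: 14400   # seconds between saves of the key cache to disk
row_cache_size_in_mb: 256      # total capacity of all row caches on the node
row_cache_save_period: 0       # seconds between saves of the row cache; 0 turns saving off
row_cache_provider: SerializingCacheProvider   # or ConcurrentLinkedHashCacheProvider
```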

Enable the key and row caches at the table level by setting the caching option in CQL. The caching parameter enables or disables caching of keys, rows, or both for a table.
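
A short sketch of changing the setting on an existing table, using the users table from the earlier example. Each statement stands alone, and you would run only the one that matches the behavior you want.

```cql
-- Cache only the primary key index entries for this table.
ALTER TABLE users WITH caching = 'keys_only';

-- Cache only whole rows for this table.
ALTER TABLE users WITH caching = 'rows_only';
```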

How caching works

When both row and key caches are configured, the row cache returns results whenever possible. In the event of a row cache miss, the key cache might still provide a hit that makes the disk seek much more efficient. This diagram depicts two read operations on a table with both caches already populated.

[Diagram: how caching works (how-cache-works_12.png)]

One read operation hits the row cache, returning the requested row without a disk seek. The other read operation requests a row that is not present in the row cache but is present in the key cache. After accessing the row in the SSTable, the system returns the data and populates the row cache with this read operation.

Tips for efficient cache use

Some tips for efficient cache use are:

  • Store lower-demand data or data with extremely long rows in a table with minimal or no caching.
  • Deploy a large number of Cassandra nodes under a relatively light load per node.
  • Logically separate heavily-read data into discrete tables.

Cassandra's memtables have overhead for index structures on top of the actual data they store. If the size of the values stored in the heavily-read columns is small compared to the number of columns and rows themselves, this overhead can be substantial. Rows having this type of data do not lend themselves to efficient row caching.
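
The tips about separating data can be sketched with two tables that carry different caching settings. The table and column names here (user_profiles, user_activity_log, and their columns) are hypothetical and chosen only for illustration. The first table holds small, heavily read rows, and the second holds long, lower-demand rows.

```cql
-- Heavily read, compact rows: cache both keys and rows.
CREATE TABLE user_profiles (
  userid text PRIMARY KEY,
  display_name text,
  email text
)
WITH caching = 'all';

-- Long, lower-demand rows: keep them out of the row cache.
CREATE TABLE user_activity_log (
  userid text,
  event_time timestamp,
  event_detail text,
  PRIMARY KEY (userid, event_time)
)
WITH caching = 'keys_only';
```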

Monitoring and adjusting cache performance

Make changes to cache options in small, incremental adjustments, then monitor the effects of each change using DataStax OpsCenter (http://www.datastax.com/products/opscenter). In the event of high memory consumption, consider tuning data caches.