Apache Cassandra 1.0 Documentation

Monitoring a Cassandra Cluster

This document corresponds to an earlier product version. Make sure you are using the documentation that corresponds to your version.

Understanding the performance characteristics of your Cassandra cluster is critical to diagnosing issues and planning capacity.

Cassandra exposes a number of statistics and management operations via Java Management Extensions (JMX), a Java technology that supplies tools for managing and monitoring Java applications and services. Any statistic or operation that a Java application exposes as an MBean can be monitored or manipulated using JMX.

During normal operation, Cassandra outputs information and statistics that you can monitor using JMX-compliant tools such as JConsole, the Cassandra nodetool utility, or the DataStax OpsCenter management console. With the same tools, you can perform certain administrative commands and operations, such as flushing caches or running a repair.

Monitoring Using DataStax OpsCenter

DataStax OpsCenter is a graphical user interface for monitoring and administering all nodes in a Cassandra cluster from one centralized console. DataStax OpsCenter is bundled with DataStax support offerings, or you can register for a free version licensed for development or non-production use.

OpsCenter provides a graphical representation of performance trends in a summary view that is hard to obtain with other monitoring tools. The GUI provides views for different time periods as well as the capability to drill down on single data points. Both real-time and historical performance data for a Cassandra or Brisk cluster are available in OpsCenter. OpsCenter metrics are captured and stored within Cassandra.


[Figure: OpsCenter performance view (opsc_perf_view.png)]

The performance metrics viewed within OpsCenter can be customized according to your monitoring needs. Administrators can also perform routine node administration tasks from OpsCenter. Metrics within OpsCenter are divided into three general categories: column family metrics, cluster metrics, and OS metrics. For many of the available metrics, you can choose to view aggregated cluster-wide information, or view information on a per-node basis.


[Figure: OpsCenter metric options (opsc_metric_options.png)]

Monitoring Using nodetool

The nodetool utility is a command-line interface for monitoring Cassandra and performing routine database operations. It is included in the Cassandra distribution and is typically run directly from an operational Cassandra node.

The nodetool utility supports the most important JMX metrics and operations, and includes other useful commands for Cassandra administration. A common use is the ring command, which prints a quick summary of the ring and its general state of health. For example:

# nodetool -h localhost -p 7199 ring
Address        Status   State   Load        Owns    Range                                      Ring
                                                    95315431979199388464207182617231204396
10.194.171.160 Down     Normal  ?           39.98   61078635599166706937511052402724559481     |<--|
10.196.14.48   Up       Normal  3.16 KB     30.01   78197033789183047700859117509977881938     |   |
10.196.14.239  Up       Normal  3.16 KB     30.01   95315431979199388464207182617231204396     |-->|
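The ring output above can also be checked by a script. The following is a minimal Python sketch, assuming the column layout shown in the sample (address, then status, then state); the function names parse_ring and down_nodes are illustrative and not part of Cassandra or nodetool.

```python
# Sample text copied from the `nodetool ring` output shown above.
RING_OUTPUT = """\
Address        Status   State   Load        Owns    Range                                      Ring
                                                    95315431979199388464207182617231204396
10.194.171.160 Down     Normal  ?           39.98   61078635599166706937511052402724559481     |<--|
10.196.14.48   Up       Normal  3.16 KB     30.01   78197033789183047700859117509977881938     |   |
10.196.14.239  Up       Normal  3.16 KB     30.01   95315431979199388464207182617231204396     |-->|
"""


def parse_ring(text):
    """Return a list of (address, status, state) tuples, one per node row."""
    nodes = []
    for line in text.splitlines():
        fields = line.split()
        # Node rows have an address followed by Up or Down. The header row
        # and the row holding only a token do not match this shape.
        if len(fields) >= 3 and fields[1] in ("Up", "Down"):
            nodes.append((fields[0], fields[1], fields[2]))
    return nodes


def down_nodes(text):
    """Return the addresses of all nodes reported as Down."""
    return [addr for addr, status, _ in parse_ring(text) if status == "Down"]


if __name__ == "__main__":
    print(down_nodes(RING_OUTPUT))  # prints ['10.194.171.160']
```

In practice the text would come from running the nodetool command itself (for example through Python's subprocess module) rather than from a hard-coded string. The layout of the ring output may differ between Cassandra versions, so treat the parsing rule as an assumption to verify against your own output.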

The nodetool utility provides commands for viewing detailed column family metrics, server metrics, and compaction statistics. Commands are also available for important operations such as decommissioning a node, running repair, and moving partitioning tokens.

Monitoring Using JConsole

JConsole is a JMX-compliant tool for monitoring Java applications such as Cassandra. It is included with Sun JDK 5.0 and higher. JConsole consumes the JMX metrics and operations exposed by Cassandra and displays them in a well-organized GUI. For each node monitored, JConsole provides these six separate tab views:

  • Overview - Displays overview information about the Java VM and monitored values.
  • Memory - Displays information about memory use.
  • Threads - Displays information about thread use.
  • Classes - Displays information about class loading.
  • VM Summary - Displays information about the Java Virtual Machine (VM).
  • MBeans - Displays information about MBeans.

The Overview and Memory tabs contain information that is very useful for Cassandra developers. The Memory tab allows you to compare heap and non-heap memory usage, and provides a control to immediately perform Java garbage collection.

For specific Cassandra metrics and operations, the most important area of JConsole is the MBeans tab. This tab lists the following Cassandra MBeans:

  • org.apache.cassandra.db - Includes caching, column family metrics, and compaction.
  • org.apache.cassandra.internal - Internal server operations such as gossip and hinted handoff.
  • org.apache.cassandra.net - Inter-node communication including FailureDetector, MessagingService and StreamingService.
  • org.apache.cassandra.request - Tasks related to read, write, and replication operations.

When you select an MBean in the tree, its MBeanInfo and MBean Descriptor are both displayed on the right, and any attributes, operations or notifications appear in the tree below it. For example, selecting and expanding the org.apache.cassandra.db MBean to view available actions for a column family results in a display like the following:


[Figure: JConsole MBeans tab showing column family options (jconsole_cf_options.png)]

If you choose to monitor Cassandra using JConsole, keep in mind that JConsole consumes a significant amount of system resources. For this reason, DataStax recommends running JConsole on a remote machine rather than on the same host as a Cassandra node.

Compaction Metrics

Monitoring compaction performance is an important aspect of knowing when to add capacity to your cluster. The following attributes are exposed through CompactionManagerMBean:

Attribute               Description
CompletedTasks          Number of completed compactions since the last start of this Cassandra instance
PendingTasks            Number of estimated tasks remaining to perform
ColumnFamilyInProgress  ColumnFamily currently being compacted. null if no compactions are in progress.
BytesTotalInProgress    Total number of data bytes (index and filter are not included) being compacted. null if no compactions are in progress.
BytesCompacted          The progress of the current compaction. null if no compactions are in progress.
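As an illustration of how the two byte counters relate, the sketch below derives a completion percentage from BytesCompacted and BytesTotalInProgress. It assumes the values have already been read from CompactionManagerMBean by some JMX client and that a JMX null arrives in Python as None; the function name compaction_progress is hypothetical.

```python
def compaction_progress(bytes_compacted, bytes_total_in_progress):
    """Return the percent complete of the current compaction.

    Returns None when no compaction is in progress, which the MBean
    reports as null (represented here as None), or when the total is zero.
    """
    if bytes_compacted is None or bytes_total_in_progress is None:
        return None
    if bytes_total_in_progress == 0:
        return None
    return 100.0 * bytes_compacted / bytes_total_in_progress


if __name__ == "__main__":
    print(compaction_progress(250, 1000))   # prints 25.0
    print(compaction_progress(None, None))  # prints None
```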

Thread Pool Statistics

Cassandra maintains distinct thread pools for different stages of execution. Each of these thread pools provides statistics on the number of tasks that are active, pending, and completed. A rising trend in the pending tasks column is an excellent indicator of the need to add capacity. Once a baseline is established, configure alarms for any increase past normal in the pending tasks column. See below for details on each thread pool (this list can also be obtained from the command line using nodetool tpstats).

Thread Pool             Description
AE_SERVICE_STAGE        Shows anti-entropy tasks
CONSISTENCY-MANAGER     Handles the background consistency checks if they were triggered from the client's consistency level
FLUSH-SORTER-POOL       Sorts flushes that have been submitted
FLUSH-WRITER-POOL       Writes the sorted flushes
GOSSIP_STAGE            Activity of the Gossip protocol on the ring
LB-OPERATIONS           The number of load balancing operations
LB-TARGET               Used by nodes leaving the ring
MEMTABLE-POST-FLUSHER   Memtable flushes that are waiting to be written to the commit log.
MESSAGE-STREAMING-POOL  Streaming operations. Usually triggered by bootstrapping or decommissioning nodes.
MIGRATION_STAGE         Tasks resulting from the call of system_* methods in the API that have modified the schema
MISC_STAGE
MUTATION_STAGE          API calls that are modifying data
READ_STAGE              API calls that have read data
RESPONSE_STAGE          Response tasks from other nodes to message streaming from this node
STREAM_STAGE            Stream tasks from this node
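The baseline-and-alarm approach described above could be sketched as follows. This is an assumption-laden example: the sample text is made up for illustration (the numbers are not real measurements, and the exact column layout and pool names printed by nodetool tpstats depend on the Cassandra version), and parse_tpstats and pools_over_baseline are hypothetical helper names.

```python
# Hypothetical tpstats-style text: pool name, then active, pending,
# and completed task counts. Values are invented for illustration.
SAMPLE_TPSTATS = """\
Pool Name                    Active   Pending      Completed
READ_STAGE                        2        45          10234
MUTATION_STAGE                    1         0          88120
GOSSIP_STAGE                      0         0           5120
"""


def parse_tpstats(text):
    """Map each pool name to its active, pending, and completed counts."""
    pools = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            active, pending, completed = (int(f) for f in fields[1:4])
        except ValueError:
            continue  # header or other non-data row
        pools[fields[0]] = {
            "active": active,
            "pending": pending,
            "completed": completed,
        }
    return pools


def pools_over_baseline(pools, baseline, default_limit=0):
    """Return the sorted names of pools whose pending count exceeds the baseline."""
    return sorted(
        name
        for name, stats in pools.items()
        if stats["pending"] > baseline.get(name, default_limit)
    )


if __name__ == "__main__":
    baseline = {"READ_STAGE": 10, "MUTATION_STAGE": 10}
    print(pools_over_baseline(parse_tpstats(SAMPLE_TPSTATS), baseline))
    # prints ['READ_STAGE']
```

The baseline dictionary stands in for whatever normal pending levels you have observed on your own cluster; pools without an entry fall back to default_limit.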

Read/Write Latency Metrics

Cassandra tracks the latency (averages and totals) of read, write, and slicing operations at the server level through StorageProxyMBean.

ColumnFamily Statistics

For individual column families, ColumnFamilyStoreMBean provides the same general latency attributes as StorageProxyMBean. Unlike StorageProxyMBean, ColumnFamilyStoreMBean has a number of other statistics that are important to monitor for performance trends. The most important of these are listed below:

Attribute                 Description
MemtableDataSize          The total size consumed by this column family's data (not including metadata)
MemtableColumnsCount      The total number of columns present in the memtable (across all keys)
MemtableSwitchCount       How many times the memtable has been flushed out
RecentReadLatencyMicros   The average read latency since the last call to this bean
RecentWriteLatencyMicros  The average write latency since the last call to this bean
LiveSSTableCount          The number of live SSTables for this ColumnFamily

The first three Memtable attributes are discussed in detail on the Tuning Cassandra page.

The recent read latency and write latency counters are important in making sure that operations are happening in a consistent manner. If these counters start to increase after a period of staying flat, it is probably an indication of a need to add cluster capacity.
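One simple way to detect a latency counter that rises after staying flat is to compare the mean of the most recent samples against the mean of an earlier baseline window. The sketch below assumes you have already collected periodic readings (for example, of RecentReadLatencyMicros) into a list; the function name latency_rising, the window sizes, and the 1.5x tolerance are illustrative choices, not values recommended by the source.

```python
from statistics import mean


def latency_rising(samples, baseline_len=5, recent_len=3, tolerance=1.5):
    """Return True when recent latency clearly exceeds the baseline.

    The baseline is the mean of the first `baseline_len` samples and the
    recent value is the mean of the last `recent_len` samples. Returns
    False when there are too few samples to fill both windows.
    """
    if len(samples) < baseline_len + recent_len:
        return False
    baseline = mean(samples[:baseline_len])
    recent = mean(samples[-recent_len:])
    return recent > tolerance * baseline


if __name__ == "__main__":
    flat = [200, 200, 200, 200, 200, 200, 200, 200]
    rising = [200, 210, 190, 205, 195, 400, 450, 500]
    print(latency_rising(flat))    # prints False
    print(latency_rising(rising))  # prints True
```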

LiveSSTableCount can be monitored with a threshold to ensure that the number of SSTables for a given ColumnFamily does not become too great.

Monitoring and Adjusting Cache Performance

Careful, incremental monitoring of cache changes is the best way to maximize benefit from Cassandra's built-in caching features. Adjustments that increase cache hit rate are likely to use more system resources, such as memory. After making changes to the cache configuration, it is best to monitor Cassandra as a whole for unintended impact on the system.

For each node and each column family, you can view cache hit rate, cache size, and number of hits by expanding org.apache.cassandra.db in the MBeans tab. For example:


[Figure: JConsole cache hit rate attributes (jconsole_hitrate.png)]

Monitor new cache settings not only for hit rate, but also to make sure that memtables and heap size still have sufficient memory for other operations. If you cannot maintain the desired key cache hit rate of 85% or better, add nodes to the system and re-test until you can meet your caching requirements.

Row cache is disabled by default. Caching large rows can very quickly consume memory, so increase the row cache size carefully and in small increments. If row cache hit rates cannot be tuned to above 30%, it may make more sense to leave row caching disabled.
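The two thresholds mentioned above (85% for the key cache and 30% for the row cache) can be expressed as a small check. This sketch assumes you have obtained hit and request counts for a cache from JMX; the function and constant names are hypothetical.

```python
KEY_CACHE_TARGET = 0.85   # desired key cache hit rate from the text above
ROW_CACHE_MINIMUM = 0.30  # row cache hit rate below which caching may not pay off


def hit_rate(hits, requests):
    """Return hits / requests, or None when there have been no requests."""
    if requests == 0:
        return None
    return hits / requests


def key_cache_meets_target(hits, requests):
    """True when the key cache hit rate is 85% or better."""
    rate = hit_rate(hits, requests)
    return rate is not None and rate >= KEY_CACHE_TARGET


def row_cache_worth_keeping(hits, requests):
    """True when the row cache hit rate is above 30%."""
    rate = hit_rate(hits, requests)
    return rate is not None and rate > ROW_CACHE_MINIMUM


if __name__ == "__main__":
    print(key_cache_meets_target(900, 1000))   # prints True
    print(row_cache_worth_keeping(250, 1000))  # prints False
```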