Understanding the performance characteristics of your cluster is critical to correctly planning capacity requirements and to diagnosing issues.
To support this, Cassandra exposes a number of attributes and management operations via JMX. Knowing what these attributes mean and spending some time exploring them as you develop will make baselining, monitoring and tuning your Cassandra cluster significantly easier.
To get the most out of the available output, you should use a remote monitoring tool that supports JMX queries and has the ability to capture and store statistics over time.
Monitoring compaction performance is an important aspect of knowing when to add capacity to your cluster. The attributes exposed through CompactionManagerMBean are listed below:
| Attribute | Description |
|---|---|
| CompletedTasks | Number of completed compactions since the last start of this Cassandra instance |
| PendingTasks | Number of estimated tasks remaining to perform |
| ColumnFamilyInProgress | ColumnFamily currently being compacted. null if no compactions are in progress. |
| BytesTotalInProgress | Total number of data bytes (index and filter are not included) being compacted. null if no compactions are in progress. |
| BytesCompacted | The progress of the current compaction. null if no compactions are in progress. |
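As a rough illustration of how the two byte counters above can be combined, the sketch below turns them into a completion fraction. It assumes the values have already been read from CompactionManagerMBean by whatever JMX client you use; the function name `compaction_progress` is hypothetical and not part of Cassandra.

```python
from typing import Optional


def compaction_progress(bytes_compacted: Optional[int],
                        bytes_total: Optional[int]) -> Optional[float]:
    """Return the completed fraction (0.0 to 1.0) of the current compaction.

    Both arguments are assumed to be the BytesCompacted and
    BytesTotalInProgress attributes, which are null (None here) when no
    compaction is in progress.
    """
    if bytes_compacted is None or bytes_total is None:
        return None  # no compaction is running
    if bytes_total <= 0:
        return None  # nothing meaningful to divide by
    # Clamp in case the two attributes were sampled at slightly different times.
    return min(bytes_compacted / bytes_total, 1.0)
```

For example, `compaction_progress(50, 200)` returns `0.25`, and `compaction_progress(None, None)` returns `None`.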
Cassandra maintains distinct thread pools for different stages of execution. Each of these thread pools provides statistics on the number of tasks that are active, pending and completed. A rising trend in the pending tasks column is an excellent indicator of the need to add capacity. Once a baseline is established, configure alarms for any increase in pending tasks beyond the normal range. See below for details on each thread pool (this list can also be obtained from the command line using nodetool tpstats).
| Thread Pool | Description |
|---|---|
| AE_SERVICE_STAGE | Shows anti-entropy tasks |
| CONSISTENCY-MANAGER | Handles the background consistency checks if they were triggered from the client’s consistency level |
| FLUSH-SORTER-POOL | Sorts flushes that have been submitted |
| FLUSH-WRITER-POOL | Writes the sorted flushes |
| GOSSIP_STAGE | Activity of the Gossip protocol on the ring |
| LB-OPERATIONS | The number of load balancing operations |
| LB-TARGET | Used by nodes leaving the ring |
| MEMTABLE-POST-FLUSHER | Memtable flushes that are waiting to be written to the commit log. |
| MESSAGE-STREAMING-POOL | Streaming operations. Usually triggered by bootstrapping or decommissioning nodes. |
| MIGRATION_STAGE | Tasks resulting from the call of system_* methods in the API that have modified the schema |
| MISC_STAGE | |
| MUTATION_STAGE | API calls that are modifying data |
| READ_STAGE | API calls that have read data |
| RESPONSE_STAGE | Response tasks from other nodes to message streaming from this node |
| STREAM_STAGE | Stream tasks from this node |
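The baseline-and-alarm approach described above can be sketched as follows. This is a minimal example that assumes the text output of nodetool tpstats has a pool name followed by Active, Pending and Completed columns; the layout may differ between versions, so check it against your own output. The function names, the `slack` parameter and the baseline values are hypothetical.

```python
from typing import Dict, List


def parse_tpstats(text: str) -> Dict[str, Dict[str, int]]:
    """Parse tpstats-style text into {pool: {active, pending, completed}}.

    Assumed row layout: <pool name> <active> <pending> <completed> ...
    Header rows and any other lines are skipped.
    """
    pools: Dict[str, Dict[str, int]] = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row has a name followed by at least three integer columns.
        if len(fields) < 4 or not all(f.isdigit() for f in fields[1:4]):
            continue
        active, pending, completed = (int(f) for f in fields[1:4])
        pools[fields[0]] = {
            "active": active,
            "pending": pending,
            "completed": completed,
        }
    return pools


def pools_over_baseline(pools: Dict[str, Dict[str, int]],
                        baseline: Dict[str, int],
                        slack: int = 0) -> List[str]:
    """Return the pools whose pending count exceeds baseline plus slack.

    Pools missing from the baseline are treated as having a baseline of 0.
    """
    return sorted(
        name for name, stats in pools.items()
        if stats["pending"] > baseline.get(name, 0) + slack
    )
```

In practice the text would come from running nodetool tpstats on a schedule, and the baseline would be derived from pending counts observed during normal operation rather than hard-coded.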
Cassandra tracks latency (averages and totals) of read, write and slicing operations at the server level through StorageProxyMBean.
For individual column families, ColumnFamilyStoreMBean provides the same general latency attributes as StorageProxyMBean. Unlike StorageProxyMBean, ColumnFamilyStoreMBean has a number of other statistics that are important to monitor for performance trends. The most important of these are listed below:
| Attribute | Description |
|---|---|
| MemtableDataSize | The total size consumed by this column family’s data (not including metadata) |
| MemtableColumnsCount | Returns the total number of columns present in the memtable (across all keys) |
| MemtableSwitchCount | How many times the memtable has been flushed out |
| RecentReadLatencyMicros | The average read latency since the last call to this bean |
| RecentWriteLatencyMicros | The average write latency since the last call to this bean |
| LiveSSTableCount | The number of live SSTables for this ColumnFamily |
The first three memtable attributes are discussed in detail on the Tuning page.
The recent read latency and write latency counters are important for confirming that operations complete in a consistent amount of time. If these counters start to climb after a period of staying flat, the cluster probably needs more capacity.
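One way to detect a rise after a flat period is to compare the mean of the most recent samples with the mean of the samples just before them. The sketch below does this for a series of recent latency readings collected at a fixed interval. The function name, window sizes and the 1.5 ratio are illustrative assumptions, not recommended values.

```python
from statistics import mean
from typing import Sequence


def latency_is_rising(samples: Sequence[float],
                      baseline_size: int = 10,
                      recent_size: int = 3,
                      ratio: float = 1.5) -> bool:
    """Return True if recent latency is well above the preceding baseline.

    samples: latency readings in microseconds, oldest first, taken at a
    fixed polling interval.
    """
    if len(samples) < baseline_size + recent_size:
        return False  # not enough history to judge
    # Baseline window: the readings immediately before the recent window.
    baseline = mean(samples[-(baseline_size + recent_size):-recent_size])
    recent = mean(samples[-recent_size:])
    return baseline > 0 and recent > baseline * ratio
```

Because the recent counters report the average since the previous call to the bean, a single poller reading at a steady interval keeps each sample comparable with the others.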
LiveSSTableCount can be monitored against a threshold to ensure that the number of SSTables for a given ColumnFamily does not grow too large; a read may need to consult several SSTables, so a steadily growing count tends to slow reads.
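A threshold check on LiveSSTableCount might look like the sketch below. It only alerts when the count has stayed above the threshold for several consecutive samples, on the assumption that the count rises briefly between flushes and compactions during normal operation. The function name and the `sustained` parameter are hypothetical, and a suitable threshold depends on your workload.

```python
from typing import Sequence


def sstable_count_too_high(history: Sequence[int],
                           threshold: int,
                           sustained: int = 3) -> bool:
    """Return True if the last `sustained` samples all exceed `threshold`.

    history: LiveSSTableCount readings for one ColumnFamily, oldest first.
    """
    if sustained <= 0 or len(history) < sustained:
        return False  # not enough samples to call it sustained
    return all(count > threshold for count in history[-sustained:])
```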