DataStax OpsCenter Documentation

Definitions of alert metrics

From the Alerts area of OpsCenter Enterprise Edition, you can configure alert thresholds for a number of Cassandra cluster-wide, column family, and operating system metrics. This proactive monitoring feature is available only in OpsCenter Enterprise Edition.

Commonly watched alert metrics

OpsCenter provides the capability to configure alerts for the following most commonly watched Cassandra and system metrics.

Metric Definition
Node Down When a node is not responding to requests, it is marked as down.
Write Requests The number of write requests per second. Monitoring the number of writes over a given time period can give you an idea of system write workload and usage patterns.
Write Request Latency The response time (in milliseconds) for successful write operations. The time period starts when a node receives a client write request, and ends when the node responds back to the client.
Read Requests The number of read requests per second. Monitoring the number of reads over a given time period can give you an idea of system read workload and usage patterns.
Read Request Latency The response time (in milliseconds) for successful read operations. The time period starts when a node receives a client read request, and ends when the node responds back to the client.
CPU Usage The percentage of time that the CPU was busy, which is calculated by subtracting the percentage of time the CPU was idle from 100 percent.
Load Load is a measure of the amount of work that a computer system performs. An idle computer has a load number of 0 and each process using or waiting for CPU time increments the load number by 1.
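
The CPU Usage and Load definitions above reduce to simple arithmetic. The following is a minimal sketch of that arithmetic only; the function names are hypothetical and are not part of OpsCenter.

```python
def cpu_usage_percent(idle_percent: float) -> float:
    """CPU Usage as defined above: 100 percent minus the idle percentage."""
    if not 0.0 <= idle_percent <= 100.0:
        raise ValueError("idle_percent must be between 0 and 100")
    return 100.0 - idle_percent


def load_number(processes_using_or_waiting_for_cpu: int) -> int:
    """Load as defined above: 0 when idle, plus 1 for each process
    that is using or waiting for CPU time."""
    if processes_using_or_waiting_for_cpu < 0:
        raise ValueError("process count cannot be negative")
    return processes_using_or_waiting_for_cpu


if __name__ == "__main__":
    # A CPU that was idle 85 percent of the time was busy 15 percent of the time.
    print(cpu_usage_percent(85.0))
    # Three processes using or waiting for the CPU give a load number of 3.
    print(load_number(3))
```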

Advanced Cassandra alert metrics

OpsCenter provides the ability to configure alerts for the following Cassandra metrics. These metrics are aggregated across all nodes in the cluster.

Metric Definition
Heap Max The maximum amount of shared memory allocated to the JVM heap for Cassandra processes.
Heap Used The amount of shared memory in use by the JVM heap for Cassandra processes.
JVM CMS Collection Count The number of concurrent mark-sweep (CMS) garbage collections performed by the JVM per second.
JVM ParNew Collection Count The number of parallel new-generation garbage collections performed by the JVM per second.
JVM CMS Collection Time The time spent collecting CMS garbage in milliseconds per second (ms/sec).
JVM ParNew Collection Time The time spent performing ParNew garbage collections in ms/sec.
Data Size The size of column family data (in gigabytes) that has been loaded/inserted into Cassandra, including any storage overhead and system metadata.
Compactions Pending The number of compaction operations that are queued and waiting for system resources in order to run. The optimal number of pending compactions is 0 (or at most a very small number). A value greater than 0 indicates that read operations are in I/O contention with compaction operations, which usually manifests itself as declining read performance.
Total Bytes Compacted The amount of SSTable data compacted, in bytes per second.
Total Compactions The number of compactions (minor or major) performed per second.
Flush Sorter Tasks Pending The flush sorter process performs the first step in the overall process of flushing memtables to disk as SSTables. The optimal number of pending flushes is 0 (or at most a very small number).
Flushes Pending The flush process flushes memtables to disk as SSTables. This metric shows the number of memtables queued for the flush process. The optimal number of pending flushes is 0 (or at most a very small number).
Gossip Tasks Pending Cassandra uses a protocol called gossip to discover location and state information about the other nodes participating in a Cassandra cluster. In Cassandra, the gossip process runs once per second on each node and exchanges state messages with up to three other nodes in the cluster. Gossip tasks pending shows the number of gossip messages and acknowledgments queued and waiting to be sent or received. The optimal number of pending gossip tasks is 0 (or at most a very small number).
Hinted Handoff Pending While a node is offline, other nodes in the cluster will save hints about rows that were updated during the time the node was unavailable. When a node comes back online, its corresponding replicas will begin streaming the missed writes to the node to catch it up. The hinted handoff pending metric tracks the number of hints that are queued and waiting to be delivered once a failed node is back online again. High numbers of pending hints are commonly seen when a node is brought back online after some down time. Viewing this metric can help you determine when the recovering node has been made consistent again.
Internal Responses Pending The number of pending tasks from various internal tasks such as nodes joining and leaving the cluster.
Manual Repair Tasks Pending The number of operations still to be completed when you run anti-entropy repair on a node. It will only show values greater than 0 when a repair is in progress. It is not unusual to see a large number of pending tasks when a repair is running, but you should see the number of tasks progressively decreasing.
Memtable Post Flushers Pending The memtable post flush process performs the final step in the overall process of flushing memtables to disk as SSTables. The optimal number of pending flushes is 0 (or at most a very small number).
Migrations Pending The number of pending tasks from system methods that have modified the schema. Schema updates have to be propagated to all nodes, so pending tasks for this metric can manifest in schema disagreement errors.
Misc. Tasks Pending The number of pending tasks from other miscellaneous operations that are not run frequently.
Read Requests Pending The number of read requests that have arrived into the cluster but are waiting to be handled. During low or moderate read load, you should see 0 pending read operations (or at most a very low number).
Read Repair Tasks Pending The number of read repair operations that are queued and waiting for system resources in order to run. The optimal number of pending read repairs is 0 (or at most a very small number). A value greater than 0 indicates that read repair operations are in I/O contention with other operations.
Replicate on Write Tasks Pending When an insert or update to a row is written, the affected row is replicated to all other nodes that manage a replica for that row. This is called the ReplicateOnWriteStage. This metric tracks the pending tasks related to this stage of the write process. During low or moderate write load, you should see 0 pending replicate on write tasks (or at most a very low number).
Request Responses Pending Streaming of data between nodes happens during operations such as bootstrap and decommission when one node sends large numbers of rows to another node. The metric tracks the progress of the streamed rows from the receiving node.
Streams Pending Streaming of data between nodes happens during operations such as bootstrap and decommission when one node sends large numbers of rows to another node. The metric tracks the progress of the streamed rows from the sending node.
Write Requests Pending The number of write requests that have arrived into the cluster but are waiting to be handled. During low or moderate write load, you should see 0 pending write operations (or at most a very low number).
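
When choosing thresholds for these metrics, it can help to convert the raw values into percentages and to list which pending-task metrics are above their ideal value of 0. The following is a minimal sketch under stated assumptions: the function names, the `tolerance` parameter, and the sample values are hypothetical and do not reflect an OpsCenter API.

```python
def gc_time_percent(ms_per_sec: float) -> float:
    """Convert a collection time in ms/sec into the percentage of
    wall-clock time spent collecting (1000 ms per second, so divide by 10)."""
    return ms_per_sec / 10.0


def heap_used_percent(heap_used: float, heap_max: float) -> float:
    """Heap Used as a percentage of Heap Max (both in the same unit)."""
    if heap_max <= 0:
        raise ValueError("heap_max must be positive")
    return heap_used / heap_max * 100.0


def backlogged_stages(pending: dict, tolerance: int = 0) -> list:
    """Return the names of pending-task metrics whose value exceeds the
    tolerance. The definitions above give 0 (or a very small number)
    as the optimal value for most of these metrics."""
    return sorted(name for name, count in pending.items() if count > tolerance)


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    sample = {
        "Compactions Pending": 12,
        "Flushes Pending": 0,
        "Gossip Tasks Pending": 1,
    }
    print(gc_time_percent(50.0))          # 50 ms/sec of CMS collection time
    print(heap_used_percent(2.0, 8.0))    # 2 GB used of an 8 GB heap
    print(backlogged_stages(sample, tolerance=1))
```

A nonzero tolerance reflects the "at most a very small number" wording in the definitions. What counts as small depends on the cluster and workload.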

Advanced column family alert metrics

OpsCenter provides the capability to configure alerts for the following column family metrics. Column family metrics provide a granular level of detail for certain Cassandra metrics as they relate to a particular column family.

Metric Definition
Local Writes The write load on a column family measured in operations per second. This metric includes all writes to a given column family, including write requests forwarded from other nodes.
Local Write Latency The response time in milliseconds for successful write operations on a column family. The time period starts when nodes receive a write request, and ends when nodes respond.
Local Reads The read load on a column family measured in operations per second. This metric includes all reads to a given column family, including read requests forwarded from other nodes.
Local Read Latency The response time in microseconds for successful read operations on a column family. The time period starts when a node receives a read request, and ends when the node responds.
CF: KeyCache Hits The number of read requests that resulted in the requested row key being found in the key cache.
CF: KeyCache Requests The total number of read requests on the row key cache.
CF: KeyCache Hit Rate The key cache hit rate indicates the effectiveness of the key cache for a given column family by giving the percentage of cache requests that resulted in a cache hit.
CF: RowCache Hits The number of read requests that resulted in the read being satisfied from the row cache.
CF: RowCache Requests The total number of read requests on the row cache.
CF: RowCache Hit Rate The row cache hit rate indicates the effectiveness of the row cache for a given column family by giving the percentage of cache requests that resulted in a cache hit.
Live Disk Used The current size of live SSTables for a column family. It is expected that SSTable size will grow over time with your write load, as compaction processes continue doubling the size of SSTables. Using this metric together with SSTable count, you can monitor the current state of compaction for a given column family.
Total Disk Used The current size of the data directories for the column family including space not reclaimed by obsolete objects.
SSTable Count The current number of SSTables for a column family. When column family memtables are persisted to disk as SSTables, this metric increases to the configured maximum before the compaction cycle is repeated. Using this metric together with live disk used, you can monitor the current state of compaction for a given column family.
Pending Reads/Writes The number of pending reads and writes on a column family. Pending operations are an indication that Cassandra is not keeping up with the workload. A value of zero indicates healthy throughput.
CF: Bloom Filter Space Used The size of the bloom filter files on disk.
CF: Bloom Filter False Positives The absolute number of false positives. A false positive occurs when the bloom filter reports that a row exists, but the row does not actually exist.
CF: Bloom Filter False Positive Ratio The fraction of all bloom filter checks resulting in a false positive.
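
The hit rate and false positive ratio above are derived from the corresponding counters. The following is a minimal sketch of that arithmetic; the function names are hypothetical, and returning 0 when there have been no requests or checks is an assumption made here to avoid division by zero.

```python
def cache_hit_rate_percent(hits: int, requests: int) -> float:
    """Percentage of cache requests that resulted in a cache hit.
    Applies equally to the key cache and the row cache counters."""
    if requests == 0:
        return 0.0  # assumption: report 0 when the cache has not been queried
    return 100.0 * hits / requests


def bloom_false_positive_ratio(false_positives: int, checks: int) -> float:
    """Fraction of all bloom filter checks that resulted in a false positive."""
    if checks == 0:
        return 0.0  # assumption: report 0 when no checks have been made
    return false_positives / checks


if __name__ == "__main__":
    # 90 KeyCache Hits out of 100 KeyCache Requests.
    print(cache_hit_rate_percent(90, 100))
    # 5 false positives out of 1000 bloom filter checks.
    print(bloom_false_positive_ratio(5, 1000))
```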

Advanced system alert metrics

OpsCenter provides the capability to configure alerts for operating system metrics on Linux, Windows, and Mac OSX.

As with any database system, Cassandra performance greatly depends on underlying systems on which it is running. To configure advanced system metric alerts, you should first have an understanding of the baseline performance of your hardware and the averages of these system metrics when the system is handling a typical workload.
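
One way to turn a baseline into a starting threshold is to take the average of a metric under typical workload and add a margin for normal variation. The following is a minimal sketch of that idea. The "mean plus k standard deviations" rule, the function name, and the sample values are assumptions chosen for illustration, not an OpsCenter recommendation.

```python
import statistics


def suggest_threshold(baseline_samples, k=3.0):
    """Suggest an alert threshold from samples collected under typical load.

    Uses mean + k * sample standard deviation. This rule is an assumption;
    pick whatever margin suits your hardware and workload.
    """
    if not baseline_samples:
        raise ValueError("at least one baseline sample is required")
    mean = statistics.mean(baseline_samples)
    # The sample standard deviation needs at least two data points.
    spread = statistics.stdev(baseline_samples) if len(baseline_samples) > 1 else 0.0
    return mean + k * spread


if __name__ == "__main__":
    # Hypothetical CPU usage percentages sampled during a typical workload.
    cpu_usage_samples = [40, 50, 60]
    print(suggest_threshold(cpu_usage_samples, k=2.0))
```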

Linux metrics

On Linux, you can configure alerts on memory, CPU, and disk events.

Memory metrics on Linux

Metric Definition
Memory Free System memory that is not being used.
Memory Used System memory used by application processes.
Memory Buffered System memory used for caching file system metadata and tracking in-flight pages.
Memory Shared System memory that is accessible to CPUs.
Memory Cached System memory used by the OS disk cache.

CPU metrics on Linux

Metric Definition
Idle Percentage of time the CPU is idle.
Iowait Percentage of time the CPU is idle and there is a pending disk I/O request.
Nice Percentage of time spent processing prioritized tasks. Niced tasks are also counted in system and user time.
Steal Percentage of time a virtual CPU waits for a real CPU while the hypervisor services another virtual processor.
System Percentage of time allocated to system processes.
User Percentage of time allocated to user processes.

Disk metrics on Linux

Metric Definition
Disk Usage Percentage of disk space Cassandra uses at a given time.
Free Disk Space Available disk space in GB.
Used Disk Space Used disk space in GB.
Disk Read Throughput Average disk throughput for read operations in megabytes per second. Exceptionally high disk throughput values may indicate I/O contention.
Disk Write Throughput Average disk throughput for write operations in megabytes per second.
Disk Read Rate Average disk speed for read operations.
Disk Write Rate Average disk speed for write operations.
Disk Latency Average time consumed by disk seeks in milliseconds.
Disk Request Size Average size in sectors of requests issued to the disk.
Disk Queue Size Average number of requests queued due to disk latency.
Disk Utilization Percentage of CPU time consumed by disk I/O.

Windows metrics

On Windows, you can configure alerts on memory, CPU, and disk events.

Memory metrics on Windows

Metric Definition
Available Memory Physical memory that is not being used.
Pool Nonpaged Physical memory that stores the kernel and other system data structures.
Pool Paged Resident Physical memory allocated to unused objects that can be written to disk to free memory for reuse.
System Cache Resident Physical pages of operating system code in the file system cache.

CPU metrics on Windows

Metric Definition
Idle Percentage of time the CPU is idle.
Privileged Percentage of time the CPU spends executing kernel commands.
User Percentage of time allocated to user processes.

Disk metrics on Windows

Metric Definition
Disk Usage Percentage of disk space Cassandra uses at a given time.
Free Disk Space Available disk space in GB.
Used Disk Space Used disk space in GB.
Disk Read Throughput Average disk throughput for read operations in megabytes per second. Exceptionally high disk throughput values may indicate I/O contention.
Disk Write Throughput Average disk throughput for write operations in megabytes per second.
Disk Read Rate Average disk speed for read operations.
Disk Write Rate Average disk speed for write operations.
Disk Latency Average time consumed by disk seeks in milliseconds.
Disk Request Size Average size of requests in KB issued to the disk.
Disk Queue Size Average number of requests queued due to disk latency.
Disk Utilization Percentage of CPU time consumed by disk I/O.

Mac OSX metrics

On Mac OSX, you can configure alerts on memory, CPU, and disk events.

Memory metrics on Mac OSX

Metric Definition
Free Memory System memory that is not being used.
Used Memory System memory that is being used by application processes.

CPU metrics on Mac OSX

Metric Definition
Idle Percentage of time the CPU is idle.
System Percentage of time allocated to system processes.
User Percentage of time allocated to user processes.

Disk metrics on Mac OSX

Metric Definition
Disk Usage Percentage of disk space Cassandra uses at a given time.
Free Space Available disk space in GB.
Used Disk Space Used disk space in GB.
Disk Throughput Average disk throughput for read/write operations in megabytes per second. Exceptionally high disk throughput values may indicate I/O contention.