DataStax Developer Blog

Pluggable metrics reporting in Cassandra 2.0.2

By Jonathan Ellis -  November 5, 2013 | 5 Comments

Guest post by Chris Burroughs

Starting in 1.1, Apache Cassandra began exposing its already bountiful internal metrics using the popular Metrics library, and the number of metrics has been greatly expanded in 1.2 and beyond. There are now metrics for cache size, hit rate, client request latency, thread pool status, per-column-family statistics, and other operational measurements.

You could always write custom Java code to ship these metrics to a system like Graphite or Ganglia for storage and graphing. Starting in 2.0.2, pluggable Metrics reporter support is built in.
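Under the hood, a Graphite reporter simply writes lines of Graphite's plaintext protocol (`path value timestamp`) to the server's line receiver port (2003 by default). A minimal Python sketch of that behavior (the helper names and the one-connection-per-metric structure here are illustrative, not how the real reporter is implemented):

```python
import socket
import time

def graphite_line(path, value, timestamp=None):
    # Graphite's plaintext protocol: one "path value timestamp" line per metric
    ts = int(timestamp if timestamp is not None else time.time())
    return "%s %s %d\n" % (path, value, ts)

def send_metric(host, path, value, port=2003):
    # Hypothetical helper: opens a TCP connection per metric for clarity;
    # a real reporter batches many lines over a single connection.
    with socket.create_connection((host, port)) as sock:
        sock.sendall(graphite_line(path, value).encode("ascii"))
```

The built-in reporters handle all of this for you; the sketch is only meant to show what ends up on the wire.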

Setup

  1. Grab your favorite reporter jar (such as metrics-graphite) and add it to the server's lib directory.
  2. Create a configuration file for the reporters, following the sample.
  3. Start the server with -Dcassandra.metricsReporterConfigFile=yourCoolFile.yaml.
  4. Happy graphing!

A config file that sends some basic metrics to a single local Graphite server once a minute might look like this:

graphite:
  -
    period: 60
    timeunit: 'SECONDS'
    hosts:
     - host: 'graphite-server.domain.local'
       port: 2003
    predicate:
      color: "white"
      useQualifiedName: true
      patterns:
        - "^org.apache.cassandra.metrics.Cache.+"
        - "^org.apache.cassandra.metrics.ClientRequest.+"
        - "^org.apache.cassandra.metrics.Storage.+"
        - "^org.apache.cassandra.metrics.ThreadPools.+"

You can specifically include or exclude groups of metrics. For example, detailed per-column-family metrics might be useful on a cluster with a single column family, while on a cluster with hundreds of column families you might prefer to exclude them to avoid overwhelming the graphing system. See the metrics-reporter-config library for all of the configuration details.
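For instance, flipping the predicate's color to `black` turns the pattern list into an exclusion list. A sketch that drops per-column-family metrics while reporting everything else (assuming the same config layout as the example above) might look like:

```yaml
predicate:
  color: "black"        # exclude metrics matching the patterns below
  useQualifiedName: true
  patterns:
    - "^org.apache.cassandra.metrics.ColumnFamily.+"
```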

Example Graphs

Data Load

[Graph: data load across the cluster]

A simple stacked graph showing the amount of data stored in a cluster growing over the past month. Presumably, new nodes will eventually be required if the trend continues.

Read Latency

[Graph: coordinator (client) latency vs. local read latency]

Troubleshooting 95th-percentile latency on a node where clients detected erratic behavior. The top blue line is coordinator latency; the bottom line is the latency for satisfying read requests within this node's range. The lack of correlation between the two implies that the problem causing the large blue spikes lies elsewhere in the cluster, not with the coordinator that happens to be receiving client requests (or, at the least, that the problem is not local I/O).

Cache Size

[Graph: row cache entries and growth rate]

For a newly bootstrapped node, this shows both the number of entries in the RowCache and a Graphite-calculated derivative showing its growth rate.
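The Graphite side of such a graph is just the raw metric target plus Graphite's built-in `derivative()` function; a sketch of the two targets (the exact metric path is an assumption based on the predicate patterns above) might be:

```
org.apache.cassandra.metrics.Cache.Entries.RowCache
derivative(org.apache.cassandra.metrics.Cache.Entries.RowCache)
```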



Comments

  1. Justin Sweeney says:

    Interesting post. The graphs of metrics coming out of this look promising. Did you notice any impact on Cassandra when reporting out these metrics to something like graphite?

  2. Chris Burroughs says:

    Reporters do their thing in separate threads that only wake up on the order of seconds. Their overhead should be minimal (even less than actually calculating the metrics in the first place).

  3. Peter Fales says:

    In the “Read Latency” graph, which metrics allow you to distinguish between “coordinator latency” and “requests within this node’s range?”

  4. Tom says:

What would I need to do to use a reporter other than the four that are supported by metrics-reporter-config (console, csv, ganglia, and graphite)?

  5. Shekh says:

metrics-reporter-config does not support a prefix parameter. Any plan to add this?
