DataStax Developer Blog

A Look at the DataStax Enterprise 4.5 Performance Service

By Robin Schumacher -  July 7, 2014 | 0 Comments

When I was an RDBMS DBA, I practically lived in the performance diagnostic interfaces of the databases I managed. Oracle’s V$ views, SQL Server dynamic management views, and others were my second home and I knew them like the back of my hand.

When I arrived at DataStax about three years ago, one of the first things I did was check out what kind of performance info I could get out of Cassandra.  The good news is Cassandra makes a lot of raw metrics available through a JMX implementation. The bad news is the data is fairly simple and not organized in ways that grease the skids for easy performance analysis.

Although tools like DataStax OpsCenter do a great job of visually representing the JMX data in a very helpful manner, I along with many of our customers have wanted a better command-line accessible interface into Cassandra performance information – one that provides the type of diagnostic organization found in RDBMS’s like Oracle and SQL Server.

With DSE 4.5, I’m happy to say that’s now available.

Performance Service Basics

Starting in DataStax Enterprise (DSE) 3.2, we began the rollout of our DataStax Management Services, which are designed to transparently automate many aspects of database maintenance and management. The first two services that came out in 3.2 ensure that data is consistent across a cluster (automated repair) and that handling capacity planning activities – things like historical trend analysis and forecasting of future resource needs – are easy to do. These services can either be managed via the command line or visually through DataStax OpsCenter.

The new DSE 4.5 Performance Service is the next service in our suite of automatic management services, and as its name implies, it’s all about helping you tune and optimize DSE for fast performance.

The service maintains an intelligent performance diagnostic data dictionary (about 30 tables in 4.5) that is CQL-based, which means you can easily query performance data from any CQL-enabled tool (e.g. cqlsh, DataStax DevCenter, etc.) and get both high-level and detailed views of how well a cluster is running. You can also automatically capture the worst performing statements in a cluster and know exactly what commands and users are responsible for consuming the most resources.

Enabling the Performance Service

The Performance Service is not enabled by default in DSE, so to use it you’ll need to edit the dse.yaml file and adjust a number of configuration options that control what kind of statistics are collected. Currently, there are seven categories of metrics that the Performance Service can collect:

  • System level – supplies general memory, network, and thread pool statistics.
  • Cluster level – provides metrics at the cluster, data center, and node level.
  • Database level – provides drill down metrics at the keyspace, table, and table-per-node level.
  • Table histogram level – delivers histogram metrics for tables being accessed.
  • Object I/O level – supplies metrics concerning ‘hot objects’; data on what objects are being accessed the most.
  • User level – provides metrics concerning user activity, ‘top users’ (those consuming the most resources on the cluster) and more.
  • Statement level – captures queries that exceed a certain response time threshold along with all their relevant metrics.

For all categories except the slow query collection, you change the “enabled” option for each category from ‘false’ to ‘true’ and optionally adjust a refresh rate parameter that controls how frequently the objects are updated with new information (default is every 10 seconds).

To capture slow queries, you uncomment the cql_slow_log_threshold_ms parameter, optionally adjust it for the response time threshold desired (default is 100ms) as well as another parameter that controls how often data is purged from the table.

You can enable only those categories of metrics that you care most about, or all metrics if desired, with the understanding that more overhead on the system may be involved.

Querying Performance Service Tables

Once you’ve enabled one or more of the diagnostic collection categories and restarted the cluster, you can query the tables with CQL. All diagnostic tables are stored in the dse_perf keyspace.

Note that some of the tables have many columns (e.g. cluster_snapshot, table_snapshot, etc.) and so if you don’t use a tool like DevCenter that has a flexible data grid for query output to query the performance tables, but instead use cqlsh, you may want to set the utility to display one line per column so the displayed metrics are easier to read:

cqlsh:dse_perf> use dse_perf;
cqlsh:dse_perf> expand on;
cqlsh:dse_perf> select * from object_io;
 
@ Row 1
---------------+--------------------------
node_ip       | 127.0.0.1
keyspace_name | weatherdb
table_name    | weather_station
last_activity | 2014-06-24 08:24:11-0400
memory_only   | False
read_latency  | 104
total_reads   | 6
total_writes  | 33
write_latency | 51
 
@ Row 2
---------------+--------------------------
node_ip       | 127.0.0.1
keyspace_name | weatherdb
table_name    | weather_station_by_day
last_activity | 2014-06-24 08:24:01-0400
memory_only   | False
read_latency  | 0
total_reads   | 0
total_writes  | 30
write_latency | 45
 
(2 rows)

 

A Note on Data Persistence

While most of the performance tables that pertain to system, cluster, and database level statistics will always contain data, the diagnostic tables that contain user and object-related activity can be very transient, in that only the most recent user activity will be retained and available for query. Because user connections oftentimes quickly come and go, the data in these tables will only show current user actions on a cluster, but will not display historical user activity data.

Slow Query Tracing

As previously mentioned, you can easily find the needle-in-the-haystack “slow” queries that run on your cluster by enabling the slow log collection parameter, and adjusting it for whatever response time threshold you’d like along with how long you want the collected data to persist in the table. Once done, all you have to do is query the node_slow_log table find the poorest performing statements on your system:

cqlsh:dse_perf> select * from node_slow_log;
@ Row 1
-------------+-------------------------------------------------------------------------
node_ip     | 127.0.0.1
date        | 2014-06-24 00:00:00-0400
start_time  | c45fa7c0-fb9c-11e3-b5b2-e76149b7842c
commands    | select * from weather_station where station_id=2 and date=''2014-01-01'';
duration    | 183
parameters  | null
source_ip   | 127.0.0.1
table_names | weather_station
username    | anonymous

Conclusions

For more specifics on the Performance Service and a complete listing of all the diagnostic tables available and their columns, see the documentation for DSE 4.5.

Without question, the Performance Service will be a living, breathing feature that we will continue to expand and refine with your help, so if you haven’t already, download DSE 4.5, give the Performance Service a try, and let us know what enhancements you’d like to see via our forums on Stackoverflow, through our Support team, or by contacting us directly.



Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>