DataStax Developer Blog

DataStax Enterprise Testing on Google Compute Engine

By Quentin Conner -  December 9, 2013 | 7 Comments

DataStax is proud to be partnering with Google on the recently announced general availability of Google Compute Engine.

We briefly described our test of DataStax Enterprise in a guest post on the Google Cloud Platform Blog. Here are more details on the test we ran.

We tested three scenarios, all with positive outcomes:

  1. Operational stability of 100 nodes spread across two physical zones
  2. Reconnecting persistent disks to an instance after rebooting/failure
  3. Disk performance under load

Summary of Findings

100 node dual zone scenario

After installing DSE 3.2, we performed common administrative activities on a 100 node cluster split across data centers (separate GCE zones). We installed DSE OpsCenter 4.0 to observe and monitor the cluster. Following the shakeout, we let the cluster continue loading data at a moderate rate [1] for a 72-hour longevity test. This helped us establish operational confidence and verify trouble-free operation.

This configuration was meant to exercise common system administrator activities such as building up a cluster, loading it with data, and adding a second zone. We did not intend to quantify performance with the 100 node configuration. Rather, we wanted to qualitatively evaluate the functionality under everyday conditions.

This configuration, with a continuous, repeating workload, served as the basis for our 72-hour longevity test, designed to verify that the platform provides trouble-free operation. We chose a single client worker to generate a transaction arrival rate that created enough load to exercise, but not tax, the system. [1]

Our experience was trouble-free and DSE “just worked”. Our 72-hour longevity test completed without issue. Dual zones effortlessly streamed data to accomplish replication. OpsCenter monitored the whole operation with green indicators across the board.

Persistent Disk Scenario

One of the advantages of the GCE platform is its use of persistent disks. When an instance is terminated, the data persists and the disk can be re-connected to a new instance. This gives great flexibility to Cassandra users. For example, you can upgrade a node to a higher CPU/memory tier without re-replicating the data, or recover from the loss of a node without having to stream all of its data from other nodes in the cluster.
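
The mechanics of moving a disk between instances are handled by the GCE tooling. As a rough sketch using today's gcloud CLI (the instance and disk names dse-node-07, dse-node-07b and dse-data-07 are hypothetical):

# Detach the data disk from the failed or retired instance, then attach it
# to the replacement instance in the same zone.
gcloud compute instances detach-disk dse-node-07 --disk dse-data-07 --zone us-central2-a
gcloud compute instances attach-disk dse-node-07b --disk dse-data-07 \
    --zone us-central2-a --device-name dse-data-07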

We tested this by terminating a node, creating a new one, and re-connecting the disk to the new node. We did this within the hinted handoff window. When the new node re-joined the ring, it was operational without having to run a repair. Since the new node had a different IP address, we did need to adjust some configuration and remove the old node from the cluster, but the data did not need to be streamed.
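
Because the replacement node keeps the old node's data but gets a new IP address, the Cassandra-side steps were roughly: mount the re-attached disk where Cassandra expects its data, update the addresses in cassandra.yaml, start DSE, and then remove the old, dead node by its host ID. A sketch, with hypothetical device name, IP, path and host ID:

# On the replacement node: mount the re-attached persistent disk at the
# Cassandra data location (adjust paths to match your install).
sudo mount /dev/disk/by-id/google-dse-data-07 /var/lib/cassandra

# Point listen_address (and rpc_address, if set explicitly) at the new IP,
# then start DSE so the node rejoins the ring with its existing data.
sudo sed -i 's/^listen_address:.*/listen_address: 10.240.10.57/' /etc/dse/cassandra/cassandra.yaml
sudo service dse start

# From any live node, remove the old, unreachable node by the Host ID shown
# as a DN entry in nodetool status.
./bin/nodetool status
./bin/nodetool removenode 4d5c8fd2-a909-4f09-9fd5-f2c57d11a30e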

Disk Performance Under Load Scenario

For this test we created a three node cluster and generated load that approached the limit of the disk throughput. During the test we captured CF histogram data as well as low-level disk latency data. We were very interested to see how consistent the latency would be on Persistent Disks. Our tests showed a good latency distribution as long as our load did not exceed the throughput threshold (which varies in GCE with the size of your disk); 90% of writes completed in under 8 ms. The key to consistent latency in GCE is sizing your cluster so that each node stays within the throughput limits.
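
A simple way to watch this during a load test is to sample the data disk with iostat and keep an eye on the wMB/s and %util columns; sustained 100% utilization means the Persistent Disk is the bottleneck. In our tests the data disk appeared as sdb:

# Extended disk statistics every 10 seconds, filtered to the data disk.
iostat -mx 10 | grep --line-buffered '^sdb'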

Test Details

100 node dual zone tests

Persistent Disk and GCE Instance settings

  • 100 instances split between two Zones: us-central2-a and us-central2-b
  • Machine type: n1-standard-2 (2 vCPU, 7.5 GB memory)
  • O/S image: debian-7-wheezy-v20131014
  • Persistent disk: 2000 GB

DSE / Cassandra settings

  • 2 data centers (50 nodes each)
  • 256 Vnodes
  • GossipingPropertyFileSnitch
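
With GossipingPropertyFileSnitch, each node advertises its data center and rack from cassandra-rackdc.properties. A sketch of how the two zones map onto DC1 and DC2 (the file path assumes a DSE package install; the DC names must match those used for replication):

# /etc/dse/cassandra/cassandra-rackdc.properties on nodes in us-central2-a
dc=DC1
rack=RAC1

# /etc/dse/cassandra/cassandra-rackdc.properties on nodes in us-central2-b
dc=DC2
rack=RAC1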

In this scenario we used cassandra-stress to generate 50 million unique records and insert them into the 100 node cluster. The cluster was configured with two data centers (each in a separate GCE zone), so the column families, each configured with three replicas per data center, get replicated across both data centers.


./bin/nodetool status | grep UN | awk '{print $2}' > /tmp/nodelist

./resources/cassandra/tools/bin/cassandra-stress -K 100 -t 50 \
  -R org.apache.cassandra.locator.NetworkTopologyStrategy \
  --num-keys=50000000 --columns=20 -D /tmp/nodelist -O DC1:3,DC2:3 \
  --operation=INSERT
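
For reference, the -R and -O options above amount to creating the stress keyspace with NetworkTopologyStrategy and three replicas in each data center, roughly equivalent to this CQL (a sketch only; the cqlsh path may differ in your install):

echo "CREATE KEYSPACE \"Keyspace1\" WITH replication =
  {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};" | cqlsh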

CF Histogram output after stress-write

Graph generated from the output of:

./bin/nodetool cfhistograms "Keyspace1" "Standard1"

[Figure: write latency histogram (percent of writes by latency) for the 100 node cluster]

3 node, 1 test client stress write tests

Persistent Disk and Instance settings

  • 3 nodes
  • Machine type: n1-standard-8 (8 vCPU, 30 GB memory)
  • O/S image: debian-7-wheezy-v20131014
  • Persistent disk: 3000 GB

DSE / Cassandra settings

  • 256 Vnodes
  • DseSimpleSnitch
  • concurrent_writes: 64 (2x default)
  • commitlog_total_space_in_mb: 4096 (4x default)
  • memtable_flush_writers: 4 (4x default)
  • trickle_fsync: true
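
These are standard cassandra.yaml settings. A quick way to confirm them on each node (the path assumes a DSE package install):

grep -E 'concurrent_writes|commitlog_total_space_in_mb|memtable_flush_writers|trickle_fsync' \
    /etc/dse/cassandra/cassandra.yaml
# concurrent_writes: 64
# commitlog_total_space_in_mb: 4096
# memtable_flush_writers: 4
# trickle_fsync: true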

In this scenario we used one client and a replication factor of three to generate large quantities of write I/O. We generated 10 million records from a single node using the cassandra-stress utility that comes with DataStax Enterprise. The test client ran on one of the same nodes that were also serving DSE. This probably limited the absolute performance, but it was convenient and served well enough to fully load the Persistent Disks.

In this test there was only one data center and we used RF=3 on the Cassandra keyspace. This approach puts a copy of each row of data on each of the three nodes.


# Run on only one node
./bin/nodetool status | grep UN | awk '{print $2}' > /tmp/nodelist
./resources/cassandra/tools/bin/cassandra-stress -D /tmp/nodelist \
  --replication-factor 3 --consistency-level quorum \
  --num-keys 10000000 -K 100 -t 100 --columns=50 --operation=INSERT
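
With RF=3 on a three-node ring, every row should be replicated to all three nodes. A quick spot check after the load (the key value shown is hypothetical):

./bin/nodetool getendpoints "Keyspace1" "Standard1" 0
# Expect all three node IPs to be listed for any key.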

This graph shows write operations per second: the number of writes completed by the ring each second.

[Figure: write operations per second during the 3 node, 10M-record stress write test (3 TB Persistent Disk)]

This graph shows median latency, a figure of merit indicating how much time it takes to satisfy a write request (in milliseconds).

[Figure: median write latency in milliseconds during the 3 node, 10M-record stress write test (3 TB Persistent Disk)]

CF Histogram output after stress-write

First, a graph generated from the output of:

./bin/nodetool cfhistograms "Keyspace1" "Standard1"



[Figure: write latency histogram from nodetool cfhistograms for the 3 node cluster (3 TB Persistent Disk)]

Second, a graph generated from the Google PD backend:

[Figure: sample disk latency histogram from the Google Persistent Disk backend (3 node cluster, 3 TB Persistent Disk)]

Footnotes:

1. For those interested in the details, we ran the client at 60 inserts/sec/node (6,000 records per second for the cluster).



Comments

  1. Tom Green says:

    Article mentions:
    “We did not intend to quantify performance with the 100 node configuration. Rather, we wanted to qualitatively evaluate the functionality under everyday conditions.”

    WHY?
    The test is useless if it's not quantitative.
    Look at a similar test from MapR ….

    1. Quentin Conner says:

      Thanks for taking a look at our work. Since this was our first time using GCE we wanted to walk before we ran. The 100 node test got us comfortable that the basic functions (provisioning, networking, firewalls) worked. This was intentionally not a quantitative test. No use clocking the race car on the track if you can’t get it out of the garage, so to speak. It is worth mentioning that we have done this same test with other cloud providers only to find out basic functionality was missing (like the ability to route traffic between regions). So to us there is value in some qualitative testing. Our goal in the performance test was not to find the limit of cluster size, but to characterize persistent disks under a heavy Cassandra load. This is why we chose a smaller cluster for the second test.

  2. Linda Wu says:

    Why was the 100 node test conducted with 20 columns and the 3 node test conducted with 50 columns?

    Was there a reason for the inconsistent runs?

    1. Quentin Conner says:

      Good question. We did not set out to compare the results of the two tests.
      One tester used 20 columns and the other used 50. I’d say we made the choice mostly out of habit.

      Choosing the number of columns for cassandra-stress INSERT operations is essentially a lever to control the volume of data written per row. The default of 5 columns generates roughly 500 bytes per row, 20 columns about 2,000 bytes, and 50 columns about 5,000 bytes (octets) per row.

  3. Stan Hu says:

    Thanks for the benchmarks. This is a good first step towards answering the question whether Cassandra can run on GCE. A few questions:

    1) Were these benchmarks done with the recently-announced Google Persistent Disks?

    2) Are you putting the commit log on the same persistent disk as the data directory? Is there a reason you did one way or another?

    3) Do you have a plot of how much CPU load you were seeing on average with the 3-node stress test?

    1. Jay Judkowitz says:

      Stan, can you reach out to me at my google.com e-mail address – my last name is the username. I’m happy to talk with you about your disk performance question and I’m sure we can help to get the best results.

    2. Quentin Conner says:

      Yes, we made use of the Persistent Disk subsystem, with the latest algorithm from Google. The latest PD release scales IOPS up as the size of the PD increases, so larger PDs allow for higher IOPS.

      We did not create a separate PD for the commit log versus the SSTable (data) storage. This split-volume approach can help when a particular use case continually saturates the I/O subsystem.
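
      If you want to try the split-volume approach, the rough idea is to attach a second, smaller PD, mount it, and point commitlog_directory in cassandra.yaml at it. A sketch with hypothetical device and path names:

      # Format and mount a second persistent disk dedicated to the commit log
      # (device name "google-dse-commitlog" is hypothetical).
      sudo mkfs.ext4 /dev/disk/by-id/google-dse-commitlog
      sudo mkdir -p /mnt/commitlog
      sudo mount /dev/disk/by-id/google-dse-commitlog /mnt/commitlog

      # Then set this in cassandra.yaml and restart DSE:
      # commitlog_directory: /mnt/commitlog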

      I don’t have a plot of CPU usage over time, but I do have this “typical” output from iostat during the three node stress WRITE test.


      iostat -mx 10 5000
      avg-cpu: %user %nice %system %iowait %steal %idle
      31.66 8.07 16.99 19.84 0.00 23.44

      Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
      sdb 0.00 63.50 121.80 196.30 16.56 40.93 370.17 68.65 211.32 186.23 226.89 3.14 100.00
