How not to benchmark Cassandra
date: February 4, 2014
As Cassandra continues to increase in popularity, it's natural that more people will benchmark it against systems they're familiar with as part of the evaluation process. Unfortunately, many of these results are less valuable than one would hope, due to preventable errors.
Here are some of the most common mistakes to keep in mind when writing your own, or reading others'. Many of these apply more broadly than just Cassandra or even databases.
Testing on virtual machines
This one is actually defensible: if you deploy on VMs, then benchmarking on VMs is the most relevant scenario for you. So as long as you understand the performance hit you're taking, this can be a reasonable choice. However, care must be taken: noisy neighbors can skew results, especially when using smaller instance sizes. Even with larger instances, it's much more difficult than you think to get consistent results.
As an alternative, consider using a "bare metal cloud" like Softlayer or Storm on Demand.
Testing on shared storage
Here we are firmly into antipattern territory. Modern databases are uniformly designed around scaling out across commodity hardware; shared storage will become your bottleneck sooner rather than later and will muddy the water considerably as you attempt to test scaling the database.
Amazon's EBS is particularly bad: the network adds significant latency, penalizing systems like Cassandra that care about durability, because they pay that latency hit twice: first to append to the commitlog, and later when flushing the sstables.
There's not much point in testing a high-level system like a database if low level operations like random reads don't work. I've seen systems with readahead defaulting as high as 32MB -- meaning, if Cassandra requests a 1KB row, the kernel will actually do 32MB of disk i/o.
Do a sanity check: do you get reasonable numbers for dd and bonnie++? DataStax Enterprise's pre-flight check is also useful.
Follow the documentation on configuring disks. It's particularly important on spinning disks that the commitlog be on its own spindle or set of spindles to avoid contention with reads.
P.S. Of course you don't have swap enabled. Right?
Inadequate load generation
There are only two principles to doing load generation right:
- Feed Cassandra enough work
- Generate the workload on separate machines
That's it! But it's frequently done wrong, from the extreme case of a single-threaded client running on the same laptop as Cassandra, to more subtle problems with the Python Global Interpreter Lock. It seems that like binary search, it's surprisingly difficult to build a good load generator. If possible, avoid the temptation of rolling your own and use something battle-tested. Reasonable options include:
- cassandra-stress: shipped with Cassandra; as a rough rule of thumb you can expect to need one stress machine for every two nodes in your cassandra cluster (divided by replica count)
- stress-ng: a replacement for stress that is about twice as fast, so plan a 1:4 ratio of load:cassandra machines
- ycsb: a cross-platform nosql benchmarking tool, limited to simple tables and queries by primary key. Expect to need about a 1:2 ratio of ycsb workload generators to Cassandra machines.
(Of course, only fairly coarse differences show up with simple synthetic benchmarks like these. We are working on a workload recording tool that will allow replaying a live workload from one cluster in a different one, which will be much more useful to test upgrades or performance tuning against real-world conditions.)
Related to but distinct from load generation, it's tempting to benchmark with a small dataset because hey, it gets fast results! I've seen Cassandra benchmarked with as few as 100,000 rows. This just isn't enough to tell us anything interesting about the performance properties of modern systems.
There are a few inflection points that can tell you a lot about how a system will scale. One is when the dataset no longer fits in memory. Another is when the indexes no longer fit in memory. Smaller datasets may not be representative of what you will see as you push past those thresholds.
If you are impatient, you can artificially restrict the memory available to the system with something like this, but the closer you can come to the real conditions your application will face, the more relevant the results will be.
Measuring different things in different systems
Particularly when comparing different products, it's important to take care that the same thing is being measured across the board. Some mistakes I've seen include
- Comparing mongodb's non-durable write performance with Cassandra doing durable writes
- Comparing latency writing to Cassandra with latency buffering an HBase update locally (a YCSB bug)
- Comparing a Cassandra cluster (where each node can route requests) to a larger MongoDB cluster (with extra router nodes)
Poor benchmark hygiene
The cluster should be reset in between benchmarking runs, including dropping the page cache. Even when benchmarking pure inserts, data from earlier runs will "pollute" your results as compaction kicks in, so clean it out completely. (Of course it's legitimate to test compaction's effect on performance, but that should be done deliberately and not as part of bad hygiene.)
Tests against JVM-based systems like Cassandra need to be long enough to allow the JVM to "warm up" and JIT-compile the bytecode to machine code. cassandra-stress has a "daemon" mode to make it easy to separate the warm-up phase from results that you actually measure; for other workload generators, your best bet is to simply make enough requests that the warmup is negligible.
Flouting best practices
There are a lot of performance tuning dials available in both the JVM and Cassandra. These can be used to optimize performance for specific workloads as Shift recently illustrated so well, but in general the defaults reflect the collective experience of the Cassandra community and improving upon them takes both understanding and effort.
I'm reminded of a benchmark that set the Cassandra row cache size equal to the entire JVM heap size, starving everything else of memory. I can only guess at the thought process that went into that: cache is good, so more cache must be better? In fact, the row cache is a fairly specialized structure that is disabled by default for good reason.
When in doubt, ask for a sanity check!