Questions from the Tokyo Cassandra conference
I asked for questions ahead of my talk at the Tokyo Cassandra Conference (google translate), and got more good ones than I could reasonably answer live, so I’m posting the answers here for everyone. Gemini has posted the Japanese translation in part 1 and part 2.
How is Cassandra different from other NoSQL databases?
Cassandra offers the best mix of performance, scalability, and high availability. The other options fall short in at least one of these dimensions.
One under-appreciated aspect of Cassandra’s performance is that its ColumnFamily data model lets you use dynamic columns to precompute materialized views, enabling complex queries that would otherwise require joins.
How does Cassandra deal with network and hardware failure?
Cassandra deals with failures in two ways: first, its failure detector recognizes when peers are down or unreachable.
Second, Cassandra’s dynamic snitch performs what is actually a more difficult task: determining when other nodes are experiencing degraded performance (e.g. because a disk is failing) and routing queries to other replicas instead.
Importantly, none of this is “special” code that only runs when a “master” node fails. There are no masters and no special nodes, making dealing with failure a normal and robust part of Cassandra.
What is the roadmap for future Cassandra releases?
In my opinion, Cassandra has a pretty good story on its core functionality now, and it’s time to start paying more attention to ease of use. I don’t mean operationally, but around making it easier to build applications on top of Cassandra.
What important performance considerations are there when deploying across multiple data centers?
The main one (assuming your data centers are geographically remote) is that the QUORUM ConsistencyLevel is often a poor choice, because coordinators in at least one, and often all, data centers will have to wait out the round-trip latency to other data centers before answering client requests.
Cassandra offers LOCAL_QUORUM as a compromise, to preserve strong consistency with other clients in the same datacenter. And of course, ConsistencyLevel.ONE is often all you need; I estimate that about 60% of applications use ONE exclusively.
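The arithmetic behind this is simple enough to sketch. This is an illustration, not Cassandra’s actual implementation; the replica counts are hypothetical:

```python
# Quorum arithmetic behind QUORUM vs. LOCAL_QUORUM.
# A quorum is a majority of replicas: floor(rf / 2) + 1.

def quorum(rf: int) -> int:
    """Number of replicas that must respond at QUORUM, for replication factor rf."""
    return rf // 2 + 1

# Example: RF=3 per datacenter, two datacenters (6 replicas total).
rf_per_dc = 3
datacenters = 2

global_quorum = quorum(rf_per_dc * datacenters)  # majority of ALL replicas
local_quorum = quorum(rf_per_dc)                 # majority within one DC

print(global_quorum)  # 4 -- at least one response must cross a DC boundary
print(local_quorum)   # 2 -- satisfiable without leaving the local DC
```

With only three replicas in each datacenter, a global quorum of four forces the coordinator to wait on at least one remote response, while a local quorum of two never leaves the datacenter.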
What are your thoughts on strong consistency via R/W=One/All?
I don’t think this is a good general-purpose setting, because it means that any single node going down renders your cluster unable to accept writes. That is why doing both reads and writes at QUORUM is the more popular choice when strong consistency is necessary. However, I can imagine scenarios where it would be appropriate — e.g., if your writes are done as periodic bulk loading, and the rest of your workload is read-only.
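The trade-off comes down to two conditions, sketched below with illustrative helper functions (not Cassandra code): strong consistency requires the read and write replica sets to overlap (R + W > RF), while write availability requires W replicas to be reachable.

```python
# The availability trade-off between ONE/ALL and QUORUM/QUORUM, for RF=3.

RF = 3

def overlaps(r: int, w: int) -> bool:
    """True if every read replica set must intersect every write replica set."""
    return r + w > RF

def writes_available(w: int, replicas_up: int) -> bool:
    """A write at level w succeeds only if at least w replicas are reachable."""
    return replicas_up >= w

# R=ONE, W=ALL is strongly consistent...
assert overlaps(r=1, w=RF)
# ...but a single down replica blocks all writes:
assert not writes_available(w=RF, replicas_up=RF - 1)

# QUORUM/QUORUM is also strongly consistent, and tolerates one failure:
q = RF // 2 + 1
assert overlaps(r=q, w=q)
assert writes_available(w=q, replicas_up=RF - 1)
```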
Is it possible to use journaling filesystems like XFS or Btrfs? Can you mix different filesystems across different machines in a cluster?
Yes, and yes. Currently, DataStax recommends XFS for Cassandra. Anecdotally, I have heard reports of improved performance on Btrfs, but I would be leery of deploying it in production until it stabilizes further.
Note that while filesystem choice is machine-local (the rest of the cluster doesn’t know or care what it is), Cassandra does not support mixing operating systems in a single cluster: you can have, for instance, a Windows cluster or a Linux cluster, but not a mixed Windows/Linux one.
I would also note that while it’s fine to test different filesystems in a staging environment, best practice for production is to keep things as homogeneous as possible, to narrow the surface you need to troubleshoot when investigating anomalies.
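For reference, setting up an XFS data volume is a one-liner each way; the device name and mount point below are placeholders for your environment:

```shell
# Format a dedicated data device with XFS and mount it for Cassandra.
# /dev/sdb1 and /var/lib/cassandra are examples; substitute your own.
mkfs.xfs -f /dev/sdb1
mkdir -p /var/lib/cassandra
mount -t xfs -o noatime /dev/sdb1 /var/lib/cassandra

# Make the mount survive reboots:
echo '/dev/sdb1  /var/lib/cassandra  xfs  noatime  0 0' >> /etc/fstab
```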
Does Cassandra support multiple storage devices per server?
Yes: you can list multiple data_file_directories in cassandra.yaml, and Cassandra will spread its data files across them. That said, most deployments instead present the devices to Cassandra as a single RAID0 volume, keeping the commitlog on a separate device.
Can you give an example of troubleshooting a Cassandra cluster?
A few months ago, a customer had one of his Cassandra nodes run out of disk space. Examining the filesystem showed that most of the space was taken up by commitlog segments. So we had two things to figure out: why was this node handling a higher level of write volume than the rest of the cluster, and why were the commitlog segments not being deleted as expected?
The first thing we did was check his Cassandra log file. It showed an unexpectedly high amount of data being written, but otherwise nothing unusual. Looking at the logs of other nodes in the cluster showed that many of them started performing hinted handoff to the node in question before its write volume went through the roof.
Checking with the customer, that node had been down for an extended period of time prior to this. Since the rest of the cluster had remained under high write load, a large backlog of hints to replay when the node came back up made sense.
But why were the commitlog segments not being deleted? We had the customer enable debug logging on the CommitLog class. It turned out that there were some low-volume ColumnFamilies that received just a few writes per segment, not enough to make their memtables large enough to flush. With unflushed data still present, the commitlog segments were not eligible for deletion.
We recommended that the customer reduce his memtable_flush_after_mins setting on the low-volume ColumnFamilies to make them flush more often. We also recommended reducing his max_hint_window_in_ms setting, so that hint replay would not be so extensive.
Finally, we created an issue to replace memtable_flush_after_mins with commitlog_total_space_in_mb, making it much easier to avoid this problem in the future. This feature is available starting with Cassandra 1.0.
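The settings involved look like this in cassandra.yaml; the values here are illustrative, not recommendations (memtable_flush_after_mins itself is a per-ColumnFamily setting, adjusted through the schema rather than this file):

```yaml
# cassandra.yaml -- cap total commitlog size so unflushed low-volume
# ColumnFamilies cannot pin segments indefinitely (Cassandra 1.0+):
commitlog_total_space_in_mb: 4096

# Shrink the window during which hints accumulate for a dead node,
# so replay after a long outage is less of a flood:
max_hint_window_in_ms: 3600000   # one hour
```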
What is leveled compaction and what kinds of workloads does it help?
The short answer is, leveled compaction provides better performance for read-heavy workloads that also involve many updates to the same rows (as opposed to inserting new ones).
The longer answer deserves its own blog post, but in the meantime, the slides and video from Ben Coverston’s Cassandra Summit lightning talk on leveled compaction are available here.
What management tools are there for performance tuning and recovery?
DataStax’s OpsCenter currently focuses on cluster management and monitoring, which helps with high-level performance tuning like cache sizing and I/O capacity. Backup and restore management is on the roadmap.
Does Cassandra offer Hadoop integration?
Yes, Cassandra provides support for basic map/reduce, Pig, and Hive read and write access. DataStax also offers DataStax Enterprise, a commercial product that gives you Cassandra and Hadoop support in a single cluster, with no need for separate HDFS, job tracker, task tracker, or metastore.
How do you add and remove nodes in a Cassandra cluster?
It’s basically as simple as pointing the new node to an existing seed and starting it up. DataStax’s documentation covers the specifics under adding capacity.
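In outline, and with placeholder addresses, it looks like this on the new node:

```shell
# In cassandra.yaml on the new node, before starting it:
#   seeds: "10.0.0.1"           # an existing node in the cluster
#   listen_address: 10.0.0.5    # this node's own address
#   auto_bootstrap: true        # stream data for its token range on startup

# Then start Cassandra and the node bootstraps itself into the ring:
sudo service cassandra start

# Watch streaming progress and confirm the node joined:
nodetool -h 10.0.0.5 netstats
nodetool -h 10.0.0.5 ring
```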
What is snapshot used for?
Snapshots are great for giving you a point-in-time backup of your cluster. They can be combined with the incremental_backups setting in cassandra.yaml to give you very fine-grained control over restores with minimal overhead.
Snapshots are implemented using hard links, which is possible because of Cassandra’s log-structured storage engine. This makes them both fast and extremely space-efficient.
Snapshots can also be used with the bulk loader to load data into another cluster, e.g., so you can test a new version of your app against real data. We also plan to support doing Hadoop queries directly against snapshot data.
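The snapshot workflow itself is just a couple of nodetool commands:

```shell
# Take a point-in-time snapshot of all keyspaces on this node:
nodetool -h localhost snapshot

# The snapshot appears as hard links under each keyspace's
# snapshots directory inside the data directory, costing almost
# no extra space until the original sstables are compacted away.

# Remove snapshots once they are no longer needed:
nodetool -h localhost clearsnapshot
```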
What hardware do you recommend for Cassandra deployments?
Other things being equal, you’ll want more, cheaper machines rather than fewer large ones: the failure of any single machine then has less impact on your total capacity. As a rule of thumb, I’d say whatever you can reasonably get in a 1U box; currently 8 cores and 32GB of RAM is the sweet spot.
But you can adjust this depending on your workload. If you’re going to do a lot of Hadoop analysis, you will typically be disk-space or I/O bound. If your workload is insert-heavy, you’ll be CPU bound. If you do a lot of random reads, go with SSDs instead of magnetic disks.
What performance tuning advice do you have at the OS, JVM, and Cassandra levels?
Use XFS. Use RAID0 or RAID10 (depending on whether you will be disk-space bound), and leave a separate spindle (or two, with RAID1) for the commitlog. Use the noop or deadline I/O scheduler.
The Cassandra packages do a good job of setting the important JVM settings; the only default you’ll usually want to change is the heap size. Out of the box, the startup scripts allocate half the machine’s RAM to the JVM heap, but you should cap that at 8GB on larger machines. (Capping it will be the default behavior in Cassandra 1.0.)
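Concretely, the scheduler and heap cap look like this; the device name is a placeholder, and the heap values are examples for a large-memory box:

```shell
# Switch the data/commitlog device to the deadline elevator
# (sdb is an example device name; repeat per device):
echo deadline > /sys/block/sdb/queue/scheduler

# Cap the JVM heap in conf/cassandra-env.sh instead of taking
# the default of half of RAM:
#   MAX_HEAP_SIZE="8G"
#   HEAP_NEWSIZE="800M"
```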
We’ve also done our best to make the Cassandra defaults performant and self-tuning. I can’t think of anything you’d want to change immediately there. But once you start putting some load on it, you’ll want to look into tuning cache sizes.
Are there best practices for running a cluster on Amazon EC2?
Use L or XL instances with local storage: I/O performance is disproportionately worse on S and M sizes, and EBS is a bad fit for several reasons (see Erik Onnen’s excellent explanation).
RAID0 all the ephemeral disks and put both the commitlog and data on that volume. (This has been tested to work better than putting the commitlog on the root volume.)
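With mdadm, striping the ephemeral disks looks roughly like this; the device names are typical for EC2 instance storage, but verify yours before running anything:

```shell
# Stripe four ephemeral disks into one RAID0 volume:
mdadm --create /dev/md0 --level=0 --raid-devices=4 \
      /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Format and mount it; both commitlog and data live here:
mkfs.xfs /dev/md0
mount -o noatime /dev/md0 /var/lib/cassandra
```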
How does Cassandra compare to OpenStack?
They are pretty different animals: Cassandra is a high-performance distributed database, and OpenStack is cloud management software. You could deploy Cassandra in an OpenStack cloud, or use Cassandra under the hood to store OpenStack configuration and monitoring data, but otherwise they’re not really related.
What considerations are there for running repair in a cluster with a few tens of servers?
The biggest consideration is that building the repair Merkle trees is fairly heavyweight (for now), so you usually don’t want more than one repair running at a time. Run repair on one machine, and when it finishes, start it on the next.
If you find that performance is adversely affected by repair, reduce the compaction_throughput_mb_per_sec setting in cassandra.yaml.
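A simple serial loop over the cluster is enough to enforce the one-repair-at-a-time rule; the hostnames below are placeholders:

```shell
# Run repair on each node in turn; nodetool blocks until the
# repair on that node completes, so repairs never overlap:
for host in cass1 cass2 cass3; do
    nodetool -h "$host" repair
done
```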