Common Mistakes and Misconceptions
date: October 11, 2013
For my first two years at DataStax I was in a position to speak directly to customers. On a daily basis I spent much of my time telling people what they need to do to ensure they were going to be successful with Cassandra. After a few years I became desensitized to the fact that these were just things people needed to know. In the past two weeks I visited again with a few customers and it was obvious that we need to talk much more about a few of these issues. If you're running in production, or you plan to run C* in production there are a few things you must know before going forward. Each of these topics probably deserves a blog post of its own, in addition I've taken some liberties in my explanations for the sake of brevity.
Repair has an unfortunate, but appropriate name. Repair doesn't really have an analog in the RDBMS world. It's closest relative is the file syncing tool called rsync. The job of repair is to inspect the local copy of the data, and compare the local copy with remote copies. When discrepancies are found the replicas then stream their differences between them, and then finally merge the copies. Once completed the replicas will be in sync for the data that was on the nodes when the repair task was started.
Great! you say, I want my data to be as consistent as possible, if replicas disagree we want to fix that ASAP, so let's run repair every day/hour/continuously. The problem with this is that repair is expensive. While running repair you're likely to see some increased latency. Repair actually needs to read the data that needs to be checked between the nodes. Even though it doesn't have to transfer that data over the network, it's still expensive to run this process (you can see this in the form of a validation compaction when running
We recommend that you run repair weekly. If something goes wrong you'll have a few days to remedy the situation. More than that is probably a waste of resources. If you think you have a use case where running repair makes more sense, let us know, or at least bring it up in the community forums, or have a conversation with us about it.
Read Repair Chance
One of the more confusing settings is read_repair_chance. Read repair is not directly related to repair, but both play a role in the overall anti-entropy system in Cassandra. This setting used to be in
cassandra.yaml, where it started out as 1. That is, at a consistency level of 1, for every read, we would check the other replicas to see if the thing data we just read is consistent with the other replicas. This was good, because if you ever read stale data, the next time you read the same row you would probably read something more up to date. The bad part about this was requiring every read to become RF reads (and typically your RF is set to at least 3). Meaning that reads happen more often, and require more IO. In newer versions of Cassandra the default for this value is 0.1, and it is set on a per-columnfamily basis. Which means 10% of your requests will trigger a background read repair. This is more than enough for typical scenarios. Also note that there are now two settings on the column family level for read repair. The first is
read_repair_chance which does what we discussed above. The second is
dclocal_read_repair_chance which will limit the repair to only the local datacenter avoiding potentially expensive repairs that cross data centers.
Cleanup's job is to scan all the data on a node and throw out any data that is no longer owned by that node. "Why would a node have data that it doesn't own?" you ask. Well this can happen when you change the topology of a cluster (add, move, or remove nodes). When this happens some nodes give up, or add to their range ownership. When ranges are given up the data is not automatically deleted. It sticks around until we explicitly tell Cassandra to clean itself up. This omission is intentional, and can help to prevent data loss in the case that a newly introduced node fails soon after bootstrapping.
Like in the repair scenario, the result is nearly always a good thing, but the process in getting there is is not without a cost. On more than a few occasions I've seen cleanups scheduled by well meaning administrators at regular intervals, in one example, every other day. Adding, moving, or removing nodes are usually not surprises. The are the result of deliberate processes and procedures in your production environment. When one of these things happens you should then schedule a repair to happen on your cluster for some time in the future (usually a week or two, one node at a time) rather than having a general cleanup operation rewrite all of your data in on the off chance that something might have happened requiring a cleanup. Your buffer cache is too important to clobber on a chance.
Cassandra uses a Log Structured Merge Tree (LSMT) for writes. That means that rows can be fragmented over time requiring more IO and CPU time for reads. Compaction is an optimization that amortizes this cost over time by merging rows in the background. In
SizeTieredCompactionStrategy (STCS) there's something called a major compaction that can be run through
nodetool. Major compactions will take all the SSTables for a Column Family and merge them into a single large SSTable, let's call this a 'Mega Table'. This will merge all the rows so that every read will only have to access a single SSTable before it returns. While this can be faster in the short term, if you continue to write data into the system it will actually exacerbate the issue, as that newly created 'mega' table will probably never again participate in a compaction again (or for a long time). This means that any tombstones or updates related to the rows in the mega table will not be cleaned up until you run another major compaction increasing your operational burden for a short term gain.
We recommend that with any compaction strategy you let it run its natural course. Over time we have made improvements to both STCS and the
LeveledCompactionStrategy and we hope to see improvements in compaction continue to progress with future versions of Cassandra. If you notice that reads are slowing down because of the IO required by compaction it may be time to consider increasing the size of your cluster.
In 2.0 there were several improvements to compaction, take a look at what's under the hood for 2.0
Looking back three years ago in the days of C* 0.6.x the Java platform was probably the biggest struggle from a tuning perspective. Bigger is better right? For a JVM heap that certainly is not the case. Generally you will see diminishing returns from Java heaps > 12 GB. Our general guidance for heap sizes in Cassandra is as small as possible, and no larger. Sometimes this takes the form of some trial and error in exercising your use case. Cassandra has come a long way in the last three years, we've changed to accommodate the JVM, and the JVM itself has improved. Just because your heap is small it doesn't mean the rest of your memory is going to waste. In fact, because Cassandra utilizes memory mapped IO the memory not allocated to the heap will be utilized by Cassandra as part of the buffer cache making IO operations that would normally have to go to disk read their data directly from memory.
Since 0.6 we've made a lot of improvements to how we manage memory in Cassandra. You can see some of the most recent developments in this article.
There are a few other topics here that could also be addressed. Tombstones, Secondary Indexes, and choosing the appropriate write consistency for operations. These are all things I would like to discuss in a future installment oriented around data modeling rather than operational topics as we have discussed here. I hope this helps!