Atomic batches in Cassandra 1.2
In Cassandra, batch allows the client to group related updates into a single statement. If some of the replicas for the batch fail mid-operation, the coordinator will hint those rows automatically.
But there is one failure scenario that the classic batch design does not address: if the coordinator itself fails mid-batch, you could end up with partially applied batches.
In the past Cassandra has relied on the client to deal with this by retrying the batch to a different coordinator. This is usually adequate since writes in Cassandra, including batches, are idempotent -- that is, performing the same update multiple times is harmless.
But, if the client and coordinator fail at the same time -- say, because the client is an app server in the same datacenter, and suffers a power failure at the same time as the coordinator -- then there is no way to recover other than manually crawling through your records and reconciling inconsistencies.
Some particularly sophisticated clients have implemented a client-side commitlog to handle this scenario, but this is a responsibility that logically belongs on the server. That is what atomic batches bring in Cassandra 1.2.
Using atomic batches
Batches are atomic by default starting in 1.2. Unfortunately, the price for atomicity is about a 30% hit to performance compared to the old non-atomic batches. So, we also provide BEGIN UNLOGGED BATCH for when performance is more important than atomicity guarantees.
1.2 also introduces a separate BEGIN COUNTER BATCH for batched counter updates. Unlike other writes, counter updates are not idempotent, so replaying them automatically from the batchlog is not safe. Counter batches are thus strictly for improved performance when updating multiple counters in the same partition.
(Note that we mean "atomic" in the database sense that if any part of the batch succeeds, all of it will. No other guarantees are implied; in particular, there is no isolation; other clients will be able to read the first updated rows from the batch, while others are in progress. However, updates within a single row are isolated.)
Under the hood
Atomic batches use a new system table, batchlog, defined as follows:
CREATE TABLE batchlog (
id uuid PRIMARY KEY,
When an atomic batch is written, we first write the serialized batch to the batchlog as the data blob. After the rows in the batch have been successfully written (or hinted), we remove the batchlog entry. (There are thus some similarities to how Megastore uses a Bigtable ColumnFamily as a transaction log, but atomic batches are much, much more performant than Megastore writes.)
The batchlog table is node-local, along with the rest of the system keyspace. Instead of relying on normal Cassandra replication, StorageProxy special-cases the batchlog. This lets us make two big improvements over a naively replicated approach.
First, we can dynamically adjust behavior depending on the cluster size and arrangement. Cassandra prefers to perform batchlog writes to two different replicas in the same datacenter as the coordinator. This is not for durability -- since we only need to record the batch until it's successfully written -- so much as fault tolerance: if one batchlog replica fails, we don't need to wait for it to timeout before retrying to another. But, if only one replica is available, Cassandra will work with that without requiring an operator to manually adjust replication parameters.
Second, we can make each batchlog replica responsible for replaying batches that didn't finish in a timely fashion. This saves the complexity of requiring coordination between coordinator and replicas, and having to "failover" replay responsibility if the coordinator is itself replaced. (Again, since writes are idempotent, having multiple replicas of the batch replayed occasionally is fine.)
We are also able to make some performance optimizations based on our knowledge of the batchlog's function. For instance, since each replica has local responsibility to replay failed batches, we don't need to worry about preserving tombstones on delete. So in the normal case when a batch is written to and removed from the batchlog in quick succession, we don't need to write anything to disk on memtable flush.
Atomic batches are feature complete in Cassandra 1.2beta1, which is available for download on the Apache site; we're projecting a final release by the end of the year.