Coming up in Cassandra 1.1: Row Level Isolation
date: February 21, 2012
While Apache Cassandra does not provide ACID properties (no complex transactions support), it still provides some useful atomicity guarantees.
More precisely, Cassandra has always provided row-level atomicity of batch mutations. This means that multiple batched writes to the same row are applied atomically by each node. For example, when executing
UPDATE Users SET login='eric22', password='f3g$dq!' WHERE key='550e8400-e29b-41d4-a716-446655440000'
Cassandra guarantees that the changes to login and password are either both applied or none are.
However, up to Cassandra 1.0, the isolation of such an update was not guaranteed. In other words, it is possible (for a very brief moment while the update is being applied) that a read like
SELECT login, password FROM Users WHERE key='550e8400-e29b-41d4-a716-446655440000'
returns the new login ('eric22') but not the new password ('f3g$dq!'). This changes in Cassandra 1.1 as row-level updates are now made in isolation. Cassandra 1.1 guarantees that if you update both the login and password in the same update (for the same row key) then no concurrent read may see only a partial update.
These atomicity and isolation guarantees apply to columns written under the same physical row, i.e. columns that are within the same column family and share the same partition key. For atomicity, the guarantee actually extends across column families (within the same keyspace): updates for the same partition key are persisted atomically even when they target different column families. This is not the case for isolation, however: updates to different column families are not isolated from one another.
Note that when we say that Cassandra persists row-level writes atomically, this applies to each node of the cluster individually; Cassandra does not provide any cluster-wide rollback mechanism. In the preceding example, the guarantee is that the new login cannot be persisted without the new password being persisted too (and vice versa). It is however possible for both to be persisted even if the client operation ends with a timeout (because not enough nodes have acknowledged the write to satisfy the requested consistency level). It is up to the client to retry a failed write in such cases.
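To make the retry advice concrete, here is a minimal client-side sketch. It is not a real driver API: `WriteTimeout`, `send_write` and `write_with_retry` are hypothetical names chosen for illustration. It also assumes the client supplies the write timestamp itself, so that every retry carries exactly the same mutation and re-applying it on a replica that already has it changes nothing.

```python
import time


class WriteTimeout(Exception):
    """Hypothetical error: too few replicas acknowledged the write in time."""


def write_with_retry(send_write, columns, max_attempts=3):
    """Retry a timed-out write, reusing one timestamp for every attempt.

    send_write(columns, timestamp) is a hypothetical callable that performs
    the write and raises WriteTimeout when the consistency level is not met.
    Returns the number of attempts that were needed.
    """
    # Chosen once (microseconds since the epoch) so all attempts are identical.
    timestamp = int(time.time() * 1_000_000)
    for attempt in range(1, max_attempts + 1):
        try:
            send_write(columns, timestamp)
            return attempt
        except WriteTimeout:
            # The write may still have been persisted on some replicas;
            # sending the very same mutation again is harmless.
            if attempt == max_attempts:
                raise
```

A real client would probably wait between attempts; that is left out to keep the sketch short.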
Internally, row-level atomicity is guaranteed mainly by the commit log. Upon reception by the coordinator, each write query is transformed into a set of RowMutation objects. Each RowMutation groups all the updates for a given row key (even for different column families). On every replica, each RowMutation is first serialized and written to the commit log as a single entry (individually checksummed so that its integrity can be verified after a failure). This ensures that on failure, the RowMutation is either replayed entirely (if it was completely written to the commit log and is not corrupted) or not at all. The other part of guaranteeing atomic persistence comes from the fact that a given RowMutation is applied to one and only one memtable. It follows that a RowMutation (all the updates from a client query for a given row key) can only be persisted together or not at all.
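The all-or-nothing replay described above can be illustrated with a toy log. This is only a sketch of the idea, not Cassandra's actual commit log format: the JSON payload, the length-plus-CRC32 header, and the function names `append_mutation` and `replay` are all assumptions made for the example.

```python
import json
import struct
import zlib

HEADER = ">II"  # payload length, CRC32 of payload (both unsigned 32-bit)


def append_mutation(log: bytearray, row_key: str, columns: dict) -> None:
    """Serialize all updates for one row key as a single checksummed entry."""
    payload = json.dumps({"key": row_key, "columns": columns}).encode("utf-8")
    log.extend(struct.pack(HEADER, len(payload), zlib.crc32(payload)))
    log.extend(payload)


def replay(log) -> list:
    """Return the mutations that were completely written and are not corrupted."""
    mutations = []
    header_size = struct.calcsize(HEADER)
    offset = 0
    while offset + header_size <= len(log):
        length, checksum = struct.unpack_from(HEADER, log, offset)
        start = offset + header_size
        payload = bytes(log[start:start + length])
        # A torn or corrupted entry is dropped as a whole, never half-applied.
        if len(payload) < length or zlib.crc32(payload) != checksum:
            break
        mutations.append(json.loads(payload))
        offset = start + length
    return mutations
```

Because the login and password travel in the same entry, a crash in the middle of the write loses both or neither; replay never yields one without the other.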
To a large extent, the log-structured nature of Cassandra's storage engine makes row-level isolation easier. Writes are applied to memtables, which are then persisted as sstables, and sstables are immutable. Thus, ensuring that a RowMutation is applied to the current memtable in isolation (from other writes and reads) is enough to ensure complete isolation. That is what was added in Cassandra 1.1: the application of a RowMutation to the memtable in isolation. Technically, we use SnapTree's copy-on-write clone facilities: all the columns of a new mutation are applied to a non-visible (and thus isolated) copy of the in-memtable row they belong to, and then we atomically replace the original row with the new copy through a compare-and-set.
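The copy-then-swap pattern can be sketched in a few lines. This is a simplified illustration rather than Cassandra's code: `AtomicReference` here is a small stand-in for the Java class of the same name, `apply_mutation` is a hypothetical helper, and a plain `dict` copy replaces SnapTree's cheaper lazy clone.

```python
import threading


class AtomicReference:
    """Minimal stand-in for java.util.concurrent.atomic.AtomicReference."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()  # only used to make compare_and_set atomic

    def get(self):
        return self._value

    def compare_and_set(self, expected, new):
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False


def apply_mutation(row_ref, updates):
    """Apply all column updates to a private copy, then publish it atomically."""
    while True:
        current = row_ref.get()
        candidate = dict(current)   # private copy, not visible to readers
        candidate.update(updates)   # every column of the mutation goes in here
        if row_ref.compare_and_set(current, candidate):
            return candidate
        # Another writer replaced the row first: retry on the fresh version.
```

A reader calls `row_ref.get()` and works on whatever version it obtained. Published versions are never modified afterwards, so a reader sees the row either entirely before or entirely after a given mutation.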
Cassandra guarantees that updates to the same row will be applied together, but not that they will be resolved the same way. Suppose that the original user row was inserted at time 100. We can easily construct an update that leaves us with a new login but not a new password:
BEGIN BATCH
  UPDATE Users USING TIMESTAMP 99 SET password='f3g$dq!' WHERE key='550e8400-e29b-41d4-a716-446655440000';
  UPDATE Users USING TIMESTAMP 101 SET login='eric22' WHERE key='550e8400-e29b-41d4-a716-446655440000';
APPLY BATCH;
Here, the login column will be updated since the new timestamp is higher than the old one; the password column will not. (For equal timestamps, the result depends on Cassandra's tie-breaking rule.) See this post for more details on Cassandra's philosophy on conflict resolution.
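The per-column resolution in this example can be modeled with a small sketch. The `reconcile` function and the old values `'eric'` and `'old-secret'` are made up for illustration (the original row's contents are not given above), and the sketch deliberately keeps the existing column on equal timestamps instead of modeling Cassandra's tie-breaking rule.

```python
def reconcile(existing, update):
    """Merge two versions of a row, column by column.

    Both arguments map a column name to a (value, timestamp) pair.
    Each column is resolved on its own: the higher timestamp wins.
    """
    merged = dict(existing)
    for name, (value, timestamp) in update.items():
        if name not in merged or timestamp > merged[name][1]:
            merged[name] = (value, timestamp)
    return merged


# Row inserted at time 100 (old values are hypothetical).
row = {"login": ("eric", 100), "password": ("old-secret", 100)}
# The batch from the example: password at time 99, login at time 101.
batch = {"password": ("f3g$dq!", 99), "login": ("eric22", 101)}

result = reconcile(row, batch)
# result["login"] is ("eric22", 101); result["password"] stays ("old-secret", 100)
```

Both updates are applied together and in isolation, yet only the login changes, because each column is compared against its own previous timestamp.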