DataStax Developer Blog

Lightweight transactions in Cassandra 2.0

By Jonathan Ellis -  July 23, 2013 | 10 Comments

Background

When discussing the tradeoffs between availability and consistency, we say that a distributed system exhibits strong consistency when a reader will always see the most recently written value.

It is easy to see how we can achieve strong consistency in a master-based system, where reads and writes are routed to a single master. However, this also has the unfortunate implication that the system must be unavailable when the master fails until a new master can take over.

Fortunately, you can also achieve strong consistency in a fully distributed, masterless system like Cassandra with quorum reads and writes. Cassandra also takes the next logical step and lets the client specify, per operation, whether it needs that stronger consistency — often, eventual consistency is completely adequate.
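For example, here is a rough sketch using the DataStax Python driver (the contact point, keyspace, and table names are made up for illustration) showing that the consistency level is chosen per statement rather than per cluster:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(['127.0.0.1']).connect('demo')  # hypothetical keyspace

# Strongly consistent: write and read both at QUORUM, so the read is
# guaranteed to overlap at least one replica that saw the write.
write = SimpleStatement(
    "UPDATE users SET name = %s WHERE login = %s",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(write, ('Jonathan Ellis', 'jbellis'))

read = SimpleStatement(
    "SELECT name FROM users WHERE login = %s",
    consistency_level=ConsistencyLevel.QUORUM)
for row in session.execute(read, ('jbellis',)):
    print(row.name)

# Where eventual consistency is adequate, ONE is cheaper and stays
# available even when a quorum of replicas cannot be reached.
cheap_read = SimpleStatement(
    "SELECT name FROM users WHERE login = %s",
    consistency_level=ConsistencyLevel.ONE)
session.execute(cheap_read, ('jbellis',))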

But what if strong consistency is not enough? What if we have operations to perform in sequence that must not be interrupted by others? That is, we must perform them one at a time, or make sure that any we do run concurrently produce the same results as if they really were processed one at a time. This is linearizable consistency, or in ACID terms, a serial isolation level.

For example, suppose that I have an application that allows users to register new accounts. Without linearizable consistency, I have no way to make sure I allow exactly one user to claim a given account — I have a race condition analogous to two threads attempting to insert into a [non-concurrent] Map: even if I check for existence before performing the insert in one thread, I can’t guarantee that no other thread inserts it after the check but before my own insert.
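In driver terms, the broken check-then-insert pattern looks roughly like this (a sketch with the Python driver; contact point, keyspace, and table names are made up). The existence check and the insert are two separate operations, so another client can claim the login in between:

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('demo')  # hypothetical keyspace

def register_naively(login, email, name):
    # Step 1: check whether the login is already taken.
    taken = list(session.execute(
        "SELECT login FROM users WHERE login = %s", (login,)))
    if taken:
        return False
    # RACE: another client can insert the same login right here,
    # between our check and our insert.
    # Step 2: claim the login.
    session.execute(
        "INSERT INTO users (login, email, name, login_count) "
        "VALUES (%s, %s, %s, 1)",
        (login, email, name))
    return True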

Paxos

Again, we can easily see how to get linearizable consistency if we route all requests through a single master. In a fully distributed system, it is less obvious. Early attempts in Cassandra tried to address this by wrapping a lock around sensitive operations, e.g. with the Cages library or with Hector’s native locks. Unfortunately, the locking approach exposes you to unpleasant edge cases including lost updates in failure scenarios. We need to do better.

Fortunately, we can. The Paxos consensus protocol allows a distributed system to agree on proposals with a quorum-based algorithm, with no masters required and without the problems of two-phase commit. There are two phases to Paxos: prepare/promise, and propose/accept.

Paxos round trips

Prepare/promise is the core of the algorithm. Any node may propose a value; we call that node the leader. (Note that many nodes may attempt to act as leaders simultaneously! This is not a “master” role.) The leader picks a ballot and sends it to the participating replicas. If the ballot is the highest a replica has seen, it promises to not accept any proposals associated with any earlier ballot. Along with that promise, it includes the most recent proposal it has already accepted.

If a majority of the nodes promise to accept the leader’s proposal, it may proceed to the actual proposal, but with the wrinkle that if a majority of replicas included an earlier proposal with their promise, then that is the value the leader must propose. Conceptually, if a leader interrupts an earlier leader, it must first finish that leader’s proposal before proceeding with its own, thus giving us our desired linearizable behavior.

(For more details on Paxos itself, the best resource is Paxos Made Simple.)
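To make the two phases concrete, here is a minimal single-decree Paxos sketch in Python. It uses in-memory replicas, plain integer ballots, and no networking or failure handling; it illustrates the protocol described above, not Cassandra’s actual implementation:

class Replica:
    """One Paxos participant's durable state."""
    def __init__(self):
        self.promised = 0           # highest ballot this replica has promised
        self.accepted_ballot = 0    # ballot of the most recently accepted proposal
        self.accepted_value = None  # value of the most recently accepted proposal

    def prepare(self, ballot):
        # Phase 1 (prepare/promise): promise to ignore earlier ballots and
        # report the most recent proposal already accepted, if any.
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted_ballot, self.accepted_value
        return False, None, None

    def propose(self, ballot, value):
        # Phase 2 (propose/accept): accept unless a newer ballot was promised.
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted_ballot = ballot
            self.accepted_value = value
            return True
        return False

def paxos_round(replicas, ballot, my_value):
    """One leader attempt: prepare/promise, then propose/accept."""
    quorum = len(replicas) // 2 + 1

    promises = [r.prepare(ballot) for r in replicas]
    granted = [p for p in promises if p[0]]
    if len(granted) < quorum:
        return None  # beaten by a newer ballot; retry with a higher one

    # If any promise carried an earlier accepted proposal, finish that
    # proposal instead of proposing our own value.
    prior = max(granted, key=lambda p: p[1])
    value = prior[2] if prior[2] is not None else my_value

    accepts = sum(r.propose(ballot, value) for r in replicas)
    return value if accepts >= quorum else None

replicas = [Replica() for _ in range(3)]
print(paxos_round(replicas, ballot=1, my_value="first"))   # 'first'
print(paxos_round(replicas, ballot=2, my_value="second"))  # 'first'

The second call shows the wrinkle described above: the leader using ballot 2 finds the proposal accepted under ballot 1 and must re-propose it rather than its own value.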

Extending Paxos

The alert reader will notice here that Paxos gives us the ability to agree on exactly one proposal. After one has been accepted, it will be returned to future leaders in the promise, and the new leader will have to re-propose it before proceeding. We need a way to “reset” the Paxos state for subsequent proposals. Most discussions of Paxos focus on using it to build an edit log. In our case, what we really want to do is move the accepted value into “normal” Cassandra storage. To do this, we add a third phase of commit/acknowledge:

Paxos round trips with commit

Finally, because we expose this functionality as a compare-and-set operation, we need to read the current value of the row to see if it matches the expected one. Our full sequence then looks like this:

Paxos with commit and current-value read

Thus, at the cost of four round trips, we can provide linearizability. That sounds like a high cost—perhaps too high, if you have the rare case of an application that requires every operation to be linearizable. But for most applications, only a very small minority of operations require linearizability, and this is a good tool to add to the strong/eventual consistency we’ve provided so far.

Lightweight transactions in CQL

Lightweight transactions can be used for both INSERT and UPDATE statements, using the new IF clause. Here’s an example of registering a new user:

INSERT INTO users (login, email, name, login_count)
VALUES ('jbellis', 'jbellis@datastax.com', 'Jonathan Ellis', 1)
IF NOT EXISTS

And an example of resetting his password transactionally:

UPDATE users
SET reset_token = null, password = 'newpassword'
WHERE login = 'jbellis'
IF reset_token = 'some-generated-reset-token'

Some fine print:
  • The columns updated do NOT have to be the same as the columns in the IF clause.
  • Lightweight transactions are restricted to a single partition; this is the granularity at which we keep the internal Paxos state. As a corollary, transactions in different partitions will never interrupt each other.
  • If your transaction fails because the existing values did not match the ones you expected, Cassandra will include the current values in the response so you can decide whether to retry or abort without needing to make an extra request (see the sketch after this list).
  • ConsistencyLevel.SERIAL has been added to allow reading the current (possibly un-committed) Paxos state without having to propose a new update. If a SERIAL read finds an uncommitted update in progress, it will commit it as part of the read.
  • For details of how we deal with failures, see the comments and code.
  • Tickets for DC-local transactions, updating multiple rows in a transaction, and cqlsh support for retrieving the current value from an interrupted transaction are open and will be fixed for 2.0.1.
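
As a rough sketch of what the last two points look like from the Python driver (contact point, keyspace, and table names are hypothetical; the [applied] flag comes back as the first column of the result row, followed by the current values when the condition fails):

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(['127.0.0.1']).connect('demo')  # hypothetical keyspace

# Conditional update: row[0] is the [applied] flag; if the condition failed,
# the current values follow in the same row, so no extra read is needed.
result = list(session.execute(
    "UPDATE users SET reset_token = null, password = %s "
    "WHERE login = %s IF reset_token = %s",
    ('newpassword', 'jbellis', 'some-generated-reset-token')))
row = result[0]
if not row[0]:
    print("condition failed; current values:", row[1:])

# SERIAL read: observes the latest (possibly uncommitted) Paxos state and
# commits any in-progress update it finds, without proposing a new one.
serial_read = SimpleStatement(
    "SELECT reset_token FROM users WHERE login = %s",
    consistency_level=ConsistencyLevel.SERIAL)
for r in session.execute(serial_read, ('jbellis',)):
    print(r.reset_token)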


Comments

  1. Greg says:

    Have you considered the Raft protocol instead of Paxos?

  2. Jonathan Ellis says:

    We considered it, but Raft builds in master election as an essential step which makes it much less interesting.

    That said, the API we offer leaves us free to change the consensus implementation down the road.

  3. Julian Morrison says:

    The step of leader election in Raft doesn’t have to start after the transaction is requested (unlike the Paxos approach above); it would be a continuous background process. One node would already be the leader when the transaction was requested, so the transaction could just be shipped to that leader.

  4. Jacek Furmankiewicz says:

    How does this work with Thrift? We use raw Astyanax APIs and never use CQL directly.

  5. Harish M says:

    How do you reliably determine the replica set for a key in order to run a paxos round? The replica set for a key can’t possibly stay constant if ring membership keeps changing (which happens in C*)?

  6. Jonathan Ellis says:

    Same answer as “do you write to the old replicas or the new ones when changing ownership” — both.

  7. DuyHai DOAN says:

    Jonathan.

    If we take the example you gave

    UPDATE users
    SET reset_token = null AND password = 'newpassword'
    IF reset_token = 'some-generated-reset-token'

    During the read/results phase, the leader will send something like (SELECT reset_token FROM users) and the quorum gives back the current value.

    During the propose/accept phase, the leader will send the real statement (UPDATE users
    SET reset_token = null AND password = 'newpassword').

    Is my interpretation correct?

    If the answer is yes, what are the payloads exchanged by the leader and the quorum during the prepare/promise phase? Is it a kind of TimeUUID?

    If the current leader dies before the propose/accept phase, since the quorum did not commit to accept any proposal yet, future leaders will be able to proceed with their own proposals, won’t they?

    Last but not least, since multiple Paxos rounds can occur at the same time, what happens if one node (n1) among the quorum already has a pending proposal (P1) whereas the other members have none? The leader has to propose P1 as per protocol design, but to which nodes of the quorum? All nodes or only n1? And how could the leader ‘contact’ the other nodes which were involved in the previous P1 proposal to make them commit?

    Sorry for asking a lot of questions.

  8. Nikolai Grigoriev says:

    I wish it would be possible to use the timestamp of the column when making an update…this would allow me to update the value only if the timestamp of the value is the same as it was when I fetched it :)

  9. Peter Fales says:

    I’m not following the password update example. Is there an implied missing step that sets the token to a non-null value? Or is reset_token a special value within Cassandra? Also, shouldn’t there be a “WHERE” clause so that only one user’s password is updated?

  10. Jonathan Ellis says:

    You’re right, I forgot the WHERE clause!

    The implied context is that the user has forgotten his password, so we created a reset_token and sent it to him. Now we want to make sure that reset_token only gets used once.
