DataStax Developer Blog

When a timeout is not a failure: how Cassandra delivers high availability, part 1

By Jonathan Ellis -  August 15, 2012 | 10 Comments

We often talk about how Cassandra is designed from the ground up for fault-tolerance and availability. Today I want to discuss how this actually works in practice, and in particular what Cassandra can tell clients about request state when machines go down.

First, let’s start with a very basic diagram describing a non-distributed server:

Client makes a request and gets a response. Simple!

But what happens when the server goes down before it replies?

The client doesn’t know what happened. If it was trying to perform an update, it will have to retry the request when the server comes back up.

Everything else we’re going to cover here is just an extension of this to a distributed system.

First, let’s look at how things work when everything goes according to plan:

A client may talk to any node in a Cassandra cluster, whether or not it is a replica of the data being read or updated. The node a client talks to, the “coordinator,” is responsible for routing the request to appropriate replicas.

If the coordinator fails mid-request, we’re in a similar situation to what we had in the non-distributed case: the client is in the dark and has no choice but to retry. The only difference is that the client can reconnect to any node in the cluster immediately.

The interesting case is when a replica fails but the coordinator does not. There are actually two distinct scenarios here. In the first, the coordinator’s failure detector knows that the replica is down before the request arrives:

Since the coordinator knows the replica is down, it doesn’t even attempt to route the request to it. Instead, it immeditely responds to the client with an UnavailableException. This is the only time Cassandra will fail a write. (And even then, you can request that Cassandra allow the write with ConsistencyLevel.ANY.)

Let me say that again, because it’s the single greatest point of confusion I see: the only time Cassandra will fail a write is when too few replicas are alive when the coordinator receives the request.

So what happens if the replica doesn’t fail until after the coordinator has forwarded the client’s request?

In this case, the coordinator replies with a TimedOutException. Starting in Cassandra 1.2, it will also include an acknowledged_by count of how many replicas succeeded. Also in 1.2, Cassandra provides different timeouts for reads, writes, and other operations such as truncate.

The coordinator is in the same situation the client was in during the single-server failure scenario: it doesn’t know whether the request succeeded or failed, so all it can tell the client is that the request timed out.

Remember that for writes, a timeout is not a failure. (To make this more clear, we considered renaming TimedOutException to InProgressException for Cassandra 1.2, but decided against it to preserve backwards compatibility.)

How can we say that since we don’t know what happened before the replica failed? The coordinator can force the results towards either the pre-update or post-update state. This is what Cassandra does with hinted handoff:

I labled “timeout response” step 5 in the previous diagram. Recording the hint is the missing step 4: the coordinator stores the update locally, and will re-send it to the failed replica when it recovers, thus forcing it to the post-update state that the client wanted originally.

In the next post, I’ll cover how we generalize this to multiple replicas.



Comments

  1. sumit thakur says:

    Hello,

    from last paragraphs, coordinator stores data locally if replica down .
    suppose replica will never come up then what happen with locally store data?

    thanks for this article and i waiting for next part.
    Sumit Thakur

  2. Jonathan Ellis says:

    When a node is removed from the cluster, other nodes excise any hints they had for it.

  3. Jonathan, delighted to see a series on this. I think it would be worth covering the consistency implications around quorum/all writes and subsequent reads under these kinds of failures.

  4. Jonathan Ellis says:

    I’d be glad to. (Part 2 is on the back burner until after JavaOne but it’s not forgotten.) Can you flesh out what you’re interested in, particularly?

  5. Samarth Gahire says:

    Thanks a lot for this post, it was really confusing and hard to implement a catch for data unavailable and timed out scenarios.This is very helpful.

  6. Nathaniel Troutman says:

    What happens when the coordinator is the replica when you get a TimedOutException?

    If I know what server is a replica for the key I’m about to mutate, so I connect to that server, make the mutation, but then get a TimedOutException, what does it mean in that case?

  7. Bill says:

    Is there a next post? I can’t find any link to it.

  8. Anne says:

    What exception will you get if the co-ordinator (not the replica) goes down during the operation in the above example?

    1. Jonathan Ellis says:

      Depends on your client. For the java driver, when the coordinator dies, we’ll apply a failover strategy that is just driven by the LoadBalancingPolicy: we’ll try the next node in the Iterator provided by this policy (see http://www.datastax.com/drivers/java/1.0/apidocs/com/datastax/driver/core/policies/LoadBalancingPolicy.html#newQueryPlan(com.datastax.driver.core.Query) ). This is done transparently by the driver, and doesn’t lead to an Exception. If the client code is interested in how the query has been executed, it can look as the ExecutionInfo (http://www.datastax.com/drivers/java/1.0/apidocs/com/datastax/driver/core/ExecutionInfo.html) found in the ResultSet.

  9. Yasin says:

    I am curious if this timeout-retransmission discipline is automated at the client side. Does the client wait for a certain time and automatically retransmit a request?
    In the fifth scenario, after a write request, a read request may return stale data. Is that right?
    Even though you say you will write a second article on this, I think you have not?

    Thanks

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>