DataStax Developer Blog

Modern hinted handoff

By Jonathan Ellis - December 11, 2012 | 5 Comments

Hinted Handoff is an optional part of writes in Cassandra, enabled by default, with two purposes:

  1. Hinted handoff allows Cassandra to offer full write availability when consistency is not required.
  2. Hinted handoff dramatically improves response consistency after temporary outages such as network failures.

This post applies to Cassandra 1.0 and later versions.

How it works

When a write is performed and a replica node for the row is either known to be down ahead of time, or does not respond to the write request, the coordinator will store a hint locally, in the system.hints table. This hint is basically a wrapper around the mutation indicating that it needs to be replayed to the unavailable node(s).
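The coordinator-side bookkeeping can be sketched as a tiny Python model. This is an illustrative sketch only; the class and field names are ours, not Cassandra's actual internals, and the `hints` dict stands in for the system.hints table:

```python
from dataclasses import dataclass, field

@dataclass
class Hint:
    target_node: str   # the replica the write could not reach
    mutation: bytes    # the serialized write to replay later

@dataclass
class Coordinator:
    # stand-in for the system.hints table: hints keyed by target node
    hints: dict = field(default_factory=dict)

    def write(self, replicas, live_nodes, mutation):
        """Apply the mutation on live replicas; hint the rest."""
        acked = []
        for node in replicas:
            if node in live_nodes:
                acked.append(node)  # write applied directly
            else:
                # replica down or unresponsive: store a hint locally
                self.hints.setdefault(node, []).append(
                    Hint(target_node=node, mutation=mutation))
        return acked
```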

Once a node discovers via gossip that a node for which it holds hints has recovered, it will send the data row corresponding to each hint to the target. Additionally, it will check every ten minutes to see if there are any hints for writes that timed out during an outage too brief for the failure detector to notice via gossip.
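Replay itself is conceptually simple: when the target comes back, stream its stored hints to it and clear them. A hedged sketch, where `deliver` is a stand-in for the actual RPC and the dict mirrors hint storage keyed by target node:

```python
def replay_hints(hints, recovered_node, deliver):
    """When gossip reports recovered_node up, send it each stored
    hint in order, then drop them from local storage."""
    for hint in hints.pop(recovered_node, []):
        deliver(recovered_node, hint)
```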

Hinted Handoff and ConsistencyLevel

A hinted write does not count towards ConsistencyLevel requirements of ONE, QUORUM, or ALL. If insufficient replica targets are alive to satisfy a requested ConsistencyLevel, UnavailableException will be thrown with or without Hinted Handoff. (This is an important difference from Dynamo’s replication model; Cassandra does not default to sloppy quorum. But, see “Extreme write availability” below.)

To see why, let’s look at a simple cluster of two nodes, A and B, and a replication factor (RF) of 1: each row is stored on one node.

Suppose node A is down while we write row K to it with ConsistencyLevel.ONE. We must fail the write: recall that the ConsistencyLevel contract is that “reads always reflect the most recent write when W + R > RF, where W is the number of nodes to block for on write, and R the number to block for on reads.”

If we wrote a hint to B and called the write good because the data is written “somewhere,” the contract would be violated: there is no way to read the data at any ConsistencyLevel until A comes back up and B forwards the hint to it.
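The contract quoted above reduces to a one-line overlap check, which we can state as a hedged sketch (the function name is ours, for illustration):

```python
def reads_see_latest_write(w, r, rf):
    """W + R > RF guarantees that the set of replicas blocked on
    for the write and the set read from overlap in at least one
    node, so reads reflect the most recent write."""
    return w + r > rf

# Two-node cluster, RF=1: a hinted write blocks on 0 real replicas,
# so no read level can guarantee overlap with it.
assert not reads_see_latest_write(w=0, r=1, rf=1)
# A real write at ONE (W=1) plus a read at ONE (R=1) does overlap:
assert reads_see_latest_write(w=1, r=1, rf=1)
```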

Extreme write availability

For applications that want Cassandra to accept writes even when all the normal replicas are down (so even ConsistencyLevel.ONE cannot be satisfied), Cassandra provides ConsistencyLevel.ANY. ConsistencyLevel.ANY guarantees that the write is durable and will be readable once an appropriate replica target becomes available and receives the hint replay.
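As a hedged sketch of the availability rule (function and constant names are ours, not the server's), ANY is the one level satisfiable with zero live replicas, because a hint stored on the coordinator counts:

```python
def write_is_available(consistency, live_replicas, rf):
    """Illustrative availability rule for a write. Hinted writes do
    not count toward ONE/QUORUM/ALL, but ANY accepts a hint stored
    locally on the coordinator."""
    if consistency == "ANY":
        return True  # the coordinator can always hint locally
    required = {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[consistency]
    return live_replicas >= required
```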

Performance

By design, hinted handoff inherently forces Cassandra to continue performing the same number of writes even when the cluster is operating at reduced capacity. So pushing your cluster to maximum capacity with no allowance for failures is a bad idea. That said, Cassandra’s hinted handoff is designed to minimize the extra load on the cluster.

All hints for a given replica are stored under a single partition key, so replaying hints is a simple sequential read with minimal performance impact.

But if a replica node is overloaded or unavailable, and the failure detector has not yet marked it down, then we can expect most or all writes to that node to fail after write_request_timeout_in_ms, which defaults to 10s. During that time, we have to keep a hint callback alive on the coordinator, waiting to write the hint when the timeout is reached.

If this happens on many nodes at once, it can create substantial memory pressure on the coordinator. So the coordinator tracks how many hints it is currently writing, and if this number gets too high it will temporarily refuse writes (with UnavailableException) whose replicas include the misbehaving nodes.
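That load-shedding behavior amounts to a counter and a cap. A minimal sketch, assuming an arbitrary threshold; the class, field names, and limit are illustrative, not Cassandra's actual internals:

```python
class HintBackpressure:
    """Coordinator-side load shedding: refuse writes up front when
    too many hint callbacks are already pending in memory."""

    def __init__(self, max_hints_in_flight=128):
        self.max_hints_in_flight = max_hints_in_flight
        self.hints_in_flight = 0

    def try_write(self, dead_replicas):
        # If accepting this write would push pending hints past the
        # cap, refuse it rather than buffer yet more callbacks.
        if self.hints_in_flight + len(dead_replicas) > self.max_hints_in_flight:
            raise RuntimeError("UnavailableException: too many hints in flight")
        self.hints_in_flight += len(dead_replicas)

    def hint_written(self, count=1):
        # Called when a hint finally lands in local storage.
        self.hints_in_flight -= count
```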

Operations

When removing a node from the cluster (with decommission or removetoken), Cassandra automatically removes hints targeting the node that no longer exists.

Cassandra will also remove hints for dropped tables.

Repair and the fine print

At first glance, it may appear that Hinted Handoff lets you safely get away without needing repair. This is only true if you never have hardware failure. Hardware failure means that

  1. We lose “historical” data for which the write has already finished, so there is nothing to tell the rest of the cluster exactly what data has gone missing.
  2. We can also lose hints not yet replayed from requests the failed node coordinated.

With sufficient dedication, you can get by with “only run repair after hardware failure and rely on hinted handoff the rest of the time,” but as your clusters grow (and hardware failure becomes more common) performing repair as a one-off special case will become increasingly difficult to do perfectly. Thus, we continue to recommend running a full repair weekly.



Comments

  1. Hi,

    Nice explanation, but I’m left with one question: what happens when you have a replication level > 1 and only one node where the write must happen is down? I mean, suppose the token belongs to node 1, and a replica is going to be written in node 2, but node 1 is down. Is the record stored in node 2 and a hint stored on the coordinator node to be sent to node 1 when it comes up? Or is the hint not stored in that case and node 1 gets the data from node 2 once it comes up again?

    Also, if consistency level is set to ONE, in that case the write returns successfully, correct? I’m guessing so as it was written to node 2.

    Thanks

  2. Jonathan Ellis says:

    Hints are always stored on the coordinator. Yes, the write would succeed at ONE.

  3. Hi,

    There may be an error or inconsistency in the documentation. On this page:

    http://www.datastax.com/docs/1.1/dml/data_consistency

    Says:

    Writes are always sent to all replicas for the specified row regardless of the consistency level specified by the client. If a node happens to be down at the time of write, its corresponding replicas will save hints about the missed writes, and then handoff the affected rows once the node comes back online.

    Furthermore, below that text the same article states:

    When a write is made, Cassandra attempts to write to all replicas for the affected row key. If a replica is known to be down before the write is forwarded to it, or if it fails to acknowledge the write for any reason, the coordinator will store a hint for it. The hint consists of the target replica, as well as the mutation to be replayed.

    It would seem that the first text is incorrect?

    Thanks,


    1. Abhimanyu A says:

      http://wiki.apache.org/cassandra/HintedHandoff
      Here it says, in versions prior to 1.0, hints were written to live replicas. In version 1.0 hints were written in the coordinator.
