Apache Cassandra 1.1 Documentation

About Data Consistency in Cassandra

This document corresponds to an earlier product version. Make sure you are using the version that corresponds to your version.

Latest Cassandra documentation | Earlier Cassandra documentation

In Cassandra, consistency refers to how up-to-date and synchronized a row of data is on all of its replicas. Cassandra extends the concept of eventual consistency by offering tunable consistency. For any given read or write operation, the client application decides how consistent the requested data should be.

In addition to tunable consistency, Cassandra has a number of built-in repair mechanisms to ensure that data remains consistent across replicas.

In this Cassandra version, many schema changes can take place simultaneously in a cluster without any schema disagreement among nodes. For example, if one client sets a column to an integer and another client sets the column to text, one or the another will be instantly agreed upon, which one is unpredictable.

The new schema resolution design eliminates delays caused by schema changes when a new node joins the cluster. As soon as the node joins the cluster, it receives the current schema with instanteous reconciliation of changes.

Tunable Consistency for Client Requests

Consistency levels in Cassandra can be set on any read or write query. This allows application developers to tune consistency on a per-query basis depending on their requirements for response time versus data accuracy. Cassandra offers a number of consistency levels for both reads and writes.

About Write Consistency

When you do a write in Cassandra, the consistency level specifies on how many replicas the write must succeed before returning an acknowledgement to the client application.

The following consistency levels are available, with ANY being the lowest consistency (but highest availability), and ALL being the highest consistency (but lowest availability). QUORUM is a good middle-ground ensuring strong consistency, yet still tolerating some level of failure.

A quorum is calculated as (rounded down to a whole number):

(replication_factor / 2) + 1

For example, with a replication factor of 3, a quorum is 2 (can tolerate 1 replica down). With a replication factor of 6, a quorum is 4 (can tolerate 2 replicas down).

Level Description
ANY A write must be written to at least one node. If all replica nodes for the given row key are down, the write can still succeed once a hinted handoff has been written. Note that if all replica nodes are down at write time, an ANY write will not be readable until the replica nodes for that row key have recovered.
ONE A write must be written to the commit log and memory table of at least one replica node.
TWO A write must be written to the commit log and memory table of at least two replica nodes.
THREE A write must be written to the commit log and memory table of at least three replica nodes.
QUORUM A write must be written to the commit log and memory table on a quorum of replica nodes.
LOCAL_QUORUM A write must be written to the commit log and memory table on a quorum of replica nodes in the same data center as the coordinator node. Avoids latency of inter-data center communication.
EACH_QUORUM A write must be written to the commit log and memory table on a quorum of replica nodes in all data centers.
ALL A write must be written to the commit log and memory table on all replica nodes in the cluster for that row key.

About Read Consistency

When you do a read in Cassandra, the consistency level specifies how many replicas must respond before a result is returned to the client application.

Cassandra checks the specified number of replicas for the most recent data to satisfy the read request (based on the timestamp).

The following consistency levels are available, with ONE being the lowest consistency (but highest availability), and ALL being the highest consistency (but lowest availability). QUORUM is a good middle-ground ensuring strong consistency, yet still tolerating some level of failure.

A quorum is calculated as (rounded down to a whole number):

(replication_factor / 2) + 1

For example, with a replication factor of 3, a quorum is 2 (can tolerate 1 replica down). With a replication factor of 6, a quorum is 4 (can tolerate 2 replicas down).

Level Description
ONE Returns a response from the closest replica (as determined by the snitch). By default, a read repair runs in the background to make the other replicas consistent.
TWO Returns the most recent data from two of the closest replicas.
THREE Returns the most recent data from three of the closest replicas.
QUORUM Returns the record with the most recent timestamp after a quorum of replicas has responded.
LOCAL_QUORUM Returns the record with the most recent timestamp after a quorum of replicas in the current data center as the coordinator node has reported. Avoids latency of inter-data center communication.
EACH_QUORUM Returns the record with the most recent timestamp after a quorum of replicas in each data center of the cluster has responded.
ALL Returns the record with the most recent timestamp after all replicas have responded. The read operation fails if a replica does not respond.

Choosing Client Consistency Levels

Choosing a consistency level for reads and writes involves determining your requirements for consistent results (always reading the most recently written data) versus read or write latency (the time it takes for the requested data to be returned or for the write to succeed).

If latency is a top priority, consider a consistency level of ONE (only one replica node must successfully respond to the read or write request). There is a higher probability of stale data being read with this consistency level (as the replicas contacted for reads may not always have the most recent write). For some applications, this may be an acceptable trade-off. If it is an absolute requirement that a write never fail, you may also consider a write consistency level of ANY. This consistency level has the highest probability of a read not returning the latest written values (see hinted handoff).

If consistency is top priority, you can ensure that a read will always reflect the most recent write by using the following formula:

(nodes_written + nodes_read) > replication_factor

For example, if your application is using the QUORUM consistency level for both write and read operations and you are using a replication factor of 3, then this ensures that 2 nodes are always written and 2 nodes are always read. The combination of nodes written and read (4) being greater than the replication factor (3) ensures strong read consistency.

Consistency Levels for Multiple Data Center Clusters

A client read or write request to a Cassandra cluster always specifies the consistency level it requires. Ideally, you want a client request to be served by replicas in the same data center in order to avoid latency. Contacting multiple data centers for a read or write request can slow down the response. LOCAL_QUORUM and EACH_QUORUM are used in multiple data center clusters of real-time Cassandra nodes using a rack-aware replica placement strategy (such as NetworkTopologyStrategy) and a properly configured snitch. These consistency levels will fail when using SimpleStrategy.

A consistency level of ONE is also fine for applications with less stringent consistency requirements. A majority of Cassandra users do writes at consistency level ONE. With this consistency, the request will always be served by the replica node closest to the coordinator node that received the request (unless the dynamic snitch determines that the node is performing poorly and routes it elsewhere).

Keep in mind that even at consistency level ONE or LOCAL_QUORUM, the write is still sent to all replicas for the written key, even replicas in other data centers. The consistency level just determines how many replicas are required to respond that they received the write.

Specifying Client Consistency Levels

Consistency level is specified by the client application when a read or write request is made. The default consistency level may differ depending on the client you are using.

For example, in CQL the default consistency level for reads and writes is ONE. If you wanted to use QUORUM instead, you could specify that consistency level in the client request as follows:

SELECT * FROM users USING CONSISTENCY QUORUM WHERE state='TX';

About Cassandra's Built-in Consistency Repair Features

Cassandra has a number of built-in repair features to ensure that data remains consistent across replicas. These features are:

  • Read Repair - For reads, there are two types of read requests that a coordinator can send to a replica; a direct read request and a background read repair request. The number of replicas contacted by a direct read request is determined by the consistency level specified by the client. Background read repair requests are sent to any additional replicas that did not receive a direct request. To ensure that frequently-read data remains consistent, the coordinator compares the data from all the remaining replicas that own the row in the background, and if they are inconsistent, issues writes to the out-of-date replicas to update the row to reflect the most recently written values. Read repair can be configured per column family (using read_repair_chance), and is enabled by default.
  • Anti-Entropy Node Repair - For data that is not read frequently, or to update data on a node that has been down for a while, the nodetool repair process (also referred to as anti-entropy repair) ensures that all data on a replica is made consistent. Node repair should be run routinely as part of regular cluster maintenance operations.
  • Hinted Handoff - Writes are always sent to all replicas for the specified row regardless of the consistency level specified by the client. If a node happens to be down at the time of write, its corresponding replicas will save hints about the missed writes, and then handoff the affected rows once the node comes back online. Hinted handoff ensures data consistency due to short, transient node outages. The hinted handoff feature is configurable at the node-level in the cassandra.yaml file

About Hinted Handoff Writes

Hinted handoff is an optional feature of Cassandra that reduces the time to restore a failed node to consistency once the failed node returns to the cluster. It can also be used for absolute write availability for applications that cannot tolerate a failed write, but can tolerate inconsistent reads.

When a write is made, Cassandra attempts to write to all replicas for the affected row key. If a replica is known to be down before the write is forwarded to it, or if it fails to acknowledge the write for any reason, the coordinator will store a hint for it. The hint consists of the target replica, as well as the mutation to be replayed.

If all replicas for the affected row key are down, it is still possible for a write to succeed when using a write consistency level of ANY. Under this scenario, the hint and write data are stored on the coordinator node but not available to reads until the hint is replayed to the actual replicas that own the row. The ANY consistency level provides absolute write availability at the cost of consistency; there is no guarantee as to when write data is available to reads because availability depends on how long the replicas are down. The coordinator node stores hints for dead replicas regardless of consistency level unless hinted handoff is disabled. A TimedOutException is reported if the coordinator node cannot replay to the replica. In Cassandra, a timeout is not a failure for writes.

Note

By default, hints are only saved for one hour after a replica fails because if the replica is down longer than that, it is likely permanently dead. In this case, you should run a repair to re-replicate the data before the failure occurred. You can configure this time using the max_hint_window_in_ms property in the cassandra.yaml file.

Hint creation does not count towards any consistency level besides ANY. For example, if no replicas respond to a write at a consistency level of ONE, hints are created for each replica but the request is reported to the client as timed out. However, since hints are replayed at the earliest opportunity, a timeout here represents a write-in-progress, rather than failure. The only time Cassandra will fail a write entirely is when too few replicas are alive when the coordinator receives the request. For a complete explanation of how Cassandra deals with replica failure, see When a timeout is not a failure: how Cassandra delivers high availability.

When a replica that is storing hints detects via gossip that the failed node is alive again, it will begin streaming the missed writes to catch up the out-of-date replica.

Note

Hinted handoff does not completely replace the need for regular node repair operations. In addition to the time set by max_hint_window_in_ms, the coordinator node storing hints could fail before replay. You should always run a full repair after losing a node or disk.