Apache Cassandra 1.2 Documentation

About internode communications (gossip)

Cassandra uses a protocol called gossip to discover location and state information about the other nodes participating in a Cassandra cluster. Gossip is a peer-to-peer communication protocol in which nodes periodically exchange state information about themselves and about other nodes they know about.

In Cassandra, the gossip process runs every second and exchanges state messages with up to three other nodes in the cluster. The nodes exchange information about themselves and about the other nodes that they have gossiped about, so all nodes quickly learn about all other nodes in the cluster. A gossip message has a version associated with it, so that during a gossip exchange, older information is overwritten with the most current state for a particular node.

About cluster membership and seed nodes

When a node first starts up, it looks at its configuration file to determine the name of the Cassandra cluster it belongs to and which node(s), called seeds, to contact to obtain information about the other nodes in the cluster. You must configure these cluster contact points for each node in the cassandra.yaml configuration file.

Configuring gossip settings

The gossip-related settings are:

Property Description
cluster_name Name of the cluster that this node is joining. Should be the same for every node in the cluster.
listen_address The IP address or hostname that other Cassandra nodes use to connect to this node. Should be changed from localhost to the public address for the host.
seed_provider A -seeds list is comma-delimited list of hosts (IP addresses) that gossip uses to learn the topology of the ring. Every node should have the same list of seeds. In multiple data-center clusters, the seed list should include a node from each data center.
storage_port The intra-node communication port (default is 7000). Must be the same for every node in the cluster.
initial_token Determines the range of data the node is responsible for in version 1.1 and earlier.
num_tokens Determines the ranges of data the node is responsible for in version 1.2 and later.

Purging gossip state on a node

Gossip information is persisted locally by each node to use immediately on node restart without having to wait for gossip. To clear gossip history on node restart (for example, if node IP addresses have changed), add the following line to the cassandra-env.sh file. This file is located in /usr/share/cassandra`` in packed installs or <install_location>/conf in binary installs.

-Dcassandra.load_ring_state=false

To prevent partitions in gossip communications, all nodes in a cluster must have the same list of seed nodes listed in their configuration file. This is most critical the first time a node starts up. By default, a node remembers other nodes it has gossiped with between subsequent restarts.

Note

The seed node designation has no purpose other than bootstrapping the gossip process for new nodes joining the cluster. Seed nodes are not a single point of failure, nor do they have any other special purpose in cluster operations beyond the bootstrapping of nodes.

About failure detection and recovery

Failure detection is a method for locally determining, from gossip state and history, if another node in the system is up or down. Cassandra uses this information to avoid routing client requests to unreachable nodes whenever possible. (Cassandra can also avoid routing requests to nodes that are alive, but performing poorly, through the dynamic snitch.)

The gossip process tracks state from other nodes both directly (nodes gossiping directly to it) and indirectly (nodes communicated about secondhand, thirdhand, and so on). Rather than have a fixed threshold for marking failing nodes, Cassandra uses an accrual detection mechanism to calculate a per-node threshold that takes into account network performance, workload, or other conditions. During gossip exchanges, every node maintains a sliding window of inter-arrival times of gossip messages from other nodes in the cluster. In Cassandra, configuring the phi_convict_threshold property adjusts the sensitivity of the failure detector. Use default value for most situations, but increase it to 12 for Amazon EC2 (due to the frequently experienced network congestion).

Node failures can result from various causes such as hardware failures and network outages. Node outages are often transient but can last for extended intervals. A node outage rarely signifies a permanent departure from the cluster, and therefore does not automatically result in permanent removal of the node from the ring. Other nodes will periodically try to initiate gossip contact with failed nodes to see if they are back up. To permanently change a node's membership in a cluster, administrators must explicitly add or remove nodes from a Cassandra cluster using the nodetool utility.

When a node comes back online after an outage, it may have missed writes for the replica data it maintains. Once the failure detector marks a node as down, missed writes are stored by other replicas for a period of time providing hinted handoff is enabled. If a node is down for longer than max_hint_window_in_ms (3 hours by default), hints are no longer saved. Because nodes that die may have stored undelivered hints, you should run a repair after recovering a node that has been down for an extended period. Moreover, you should routinely run nodetool repair on all nodes to ensure they have consistent data.

For more explanation about recovery, see Modern hinted handoff.