Apache Cassandra 1.1 Documentation

About Data Partitioning in Cassandra

This document corresponds to an earlier product version. Make sure you are using the version of the documentation that corresponds to your version of Cassandra.


A partitioner determines how data is distributed across the nodes in the cluster (including replicas). Basically, a partitioner is a hash function for computing the token (its hash) of a row key. Each row of data is uniquely identified by a row key and distributed across the cluster by the value of the token.

Data Distribution in the Ring

In Cassandra, the total amount of data managed by the cluster is represented as a ring. The ring is divided into ranges equal to the number of nodes, with each node being responsible for one or more ranges of the data. Before a node can join the ring, it must be assigned a token. The token value determines the node's position in the ring and its range of data. Column family data is partitioned across the nodes based on the row key. To determine the node where the first replica of a row will live, the ring is walked clockwise until it locates the node with a token value greater than or equal to that of the row key. Each node is responsible for the region of the ring between itself (inclusive) and its predecessor (exclusive). With the nodes sorted in token order, the last node is considered the predecessor of the first node; hence the ring representation.

For example, consider a simple four node cluster where all of the row keys managed by the cluster are numbers in the range of 0 to 100. Each node is assigned a token that represents a point in this range. In this simple example, the token values are 0, 25, 50, and 75. The first node, the one with token 0, is responsible for the wrapping range (76-0). The node with the lowest token also accepts row keys less than the lowest token and more than the highest token.
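The ring walk described above can be sketched in a few lines of Python. This is illustrative only and reuses the hypothetical 0-100 key space and tokens from this example; in a real cluster the position of a key comes from the partitioner's hash of the row key, not from the raw key value.

    # Hypothetical four-node ring from the example above.
    TOKENS = [0, 25, 50, 75]

    def find_first_replica(key_token, tokens=TOKENS):
        """Walk the ring clockwise: the owner is the first node whose token is
        greater than or equal to the key's token; if there is none, wrap around
        to the node with the lowest token (the wrapping range)."""
        for node_token in sorted(tokens):
            if key_token <= node_token:
                return node_token
        return min(tokens)

    print(find_first_replica(10))   # -> 25
    print(find_first_replica(80))   # -> 0  (wrapping range 76-0)
    print(find_first_replica(0))    # -> 0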


[Figure: ring_partitions.png]

Understanding the Partitioner Types

When you deploy a Cassandra cluster, you must set a partitioner and assign each node an initial_token value so that each node is responsible for roughly an equal amount of data (load balancing). DataStax strongly recommends using the RandomPartitioner (default) for all cluster deployments.

To calculate the tokens for nodes in a single data center cluster, you divide the range by the total number of nodes in the cluster. In multiple data center deployments, you calculate the tokens such that each data center is individually load balanced. See Generating Tokens for the different approaches to generating tokens for nodes in single and multiple data center clusters.

Unlike almost every other configuration choice in Cassandra, the partitioner may not be changed without reloading all of your data. Therefore, it is important to choose and configure the correct partitioner before initializing your cluster. You set the partitioner in the cassandra.yaml file.
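For reference, the setting looks like the following cassandra.yaml excerpt; the partitioner class names are the ones described in the sections below, and the value must be identical on every node in the cluster.

    # cassandra.yaml (excerpt)
    # The partitioner cannot be changed after data has been loaded.
    partitioner: org.apache.cassandra.dht.RandomPartitioner
    # For ordered partitioning, use instead:
    # partitioner: org.apache.cassandra.dht.ByteOrderedPartitioner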

Cassandra offers the following partitioners:

About the RandomPartitioner

The RandomPartitioner uses tokens to help assign equal portions of data to each node and to evenly distribute data from all the tables throughout the ring or other grouping, such as a keyspace. This is true even if the tables use different row keys, such as usernames or timestamps. Moreover, read and write requests to the cluster are also evenly distributed, and load balancing is simplified because each part of the hash range receives an equal number of rows on average. The RandomPartitioner (org.apache.cassandra.dht.RandomPartitioner) is the default partitioning strategy for a Cassandra cluster, and in almost all cases is the right choice. The RandomPartitioner distributes data evenly across the nodes using an MD5 hash value of the row key. The possible range of hash values is from 0 to 2^127 - 1.
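The following Python fragment is a rough sketch of this idea (it is not the Cassandra implementation): the MD5 digest of the row key is interpreted as an integer and its absolute value is used as the token.

    import hashlib

    def random_partitioner_token(row_key: bytes) -> int:
        """Sketch: derive a token in the 0 to 2^127 - 1 range from a row key."""
        digest = hashlib.md5(row_key).digest()   # 128-bit MD5 hash of the row key
        return abs(int.from_bytes(digest, byteorder="big", signed=True))

    print(random_partitioner_token(b"jsmith"))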

When using the RandomPartitioner for single data center deployments, you calculate the tokens by dividing the hash range by the number of nodes in the cluster. For multiple data center deployments, you calculate the tokens per data center so that the hash range is evenly divided for the nodes in each data center. See About Partitioning in Multiple Data Center Clusters.
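As a minimal sketch of the single data center calculation, the hash range can be divided evenly in a few lines of Python; generate_tokens is an illustrative helper, not a Cassandra tool.

    RANGE = 2 ** 127  # RandomPartitioner hash range: 0 to 2**127 - 1

    def generate_tokens(node_count):
        """Evenly spaced initial_token values for a single data center."""
        return [i * (RANGE // node_count) for i in range(node_count)]

    for node, token in enumerate(generate_tokens(4)):
        print("node %d: initial_token = %d" % (node, token))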

About the ByteOrderedPartitioner

Cassandra provides the ByteOrderedPartitioner (org.apache.cassandra.dht.ByteOrderedPartitioner) for ordered partitioning. This partitioner orders rows lexically by key bytes. You calculate tokens by looking at the actual values of your row key data and using a hexadecimal representation of the leading character(s) in a key. For example, if you wanted to partition rows alphabetically, you could assign an A token using its hexadecimal representation of 41.
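The hexadecimal convention can be sketched as follows; hex_token is an illustrative helper (not a Cassandra utility) that returns the hex representation of the leading character(s) of a row key.

    def hex_token(leading_chars):
        """Hex representation of the leading character(s) of a row key."""
        return leading_chars.encode("utf-8").hex()

    print(hex_token("A"))   # -> '41', as in the example above
    print(hex_token("N"))   # -> '4e', e.g. a node whose range starts at keys beginning with 'N'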

Using the ordered partitioner allows range scans over rows. This means you can scan rows as though you were moving a cursor through a traditional index. For example, if your application has user names as the row key, you can scan rows for users whose names fall between Jake and Joe. This type of query is not possible with randomly partitioned row keys, since the keys are stored in the order of their MD5 hash (not sequentially). However, you can achieve the same functionality using column family indexes. Most applications can be designed with a data model that supports ordered queries as slices over a set of columns rather than range scans over a set of rows.

Unless absolutely required by your application, DataStax strongly recommends against using the ordered partitioner for the following reasons:

  • Sequential writes can cause hot spots: If your application tends to write or update a sequential block of rows at a time, then these writes are not distributed across the cluster; they all go to one node. This is frequently a problem for applications dealing with timestamped data.
  • More administrative overhead to load balance the cluster: An ordered partitioner requires administrators to manually calculate token ranges based on their estimates of the row key distribution. In practice, this requires actively moving node tokens around to accommodate the actual distribution of data once it is loaded.
  • Uneven load balancing for multiple column families: If your application has multiple column families, chances are that those column families have different row keys and different distributions of data. An ordered partitioner that is balanced for one column family may cause hot spots and uneven distribution for another column family in the same cluster.

About Partitioning in Multiple Data Center Clusters

The preferred replication placement strategy for multiple data center deployments is the NetworkTopologyStrategy, which calculates replica placement per data center. This strategy places the first replica for each row by the token value assigned to each node. It places additional replicas in the same data center by walking the ring clockwise until it reaches the first node in another rack. This means that you must calculate partitioner tokens so that the data ranges are evenly distributed for each data center; otherwise, uneven data distribution within a data center may occur:

Uneven Data Distribution


[Figure: multidc_tokens_uneven.png]

To ensure that the nodes for each data center have token assignments that evenly divide the overall range, each data center should be partitioned as if it were its own distinct ring. This prevents any one data center from holding a disproportionate number of row keys. However, you must avoid assigning tokens that conflict with other token assignments elsewhere in the cluster. To make sure that each node has a unique token, see Generating Tokens.
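A hedged sketch of this per data center approach: compute an evenly spaced set of tokens for each data center independently, then offset the tokens of the second (third, and so on) data center by a small amount so that no two nodes in the cluster share a token. The offset of 100 used here is an illustrative choice, not a Cassandra requirement.

    RANGE = 2 ** 127  # RandomPartitioner hash range

    def tokens_for_datacenter(nodes_in_dc, dc_index, offset=100):
        """Evenly spaced tokens for one data center, offset to avoid conflicts."""
        return [i * (RANGE // nodes_in_dc) + dc_index * offset
                for i in range(nodes_in_dc)]

    print(tokens_for_datacenter(3, dc_index=0))  # DC1: 0, RANGE/3, 2*RANGE/3
    print(tokens_for_datacenter(3, dc_index=1))  # DC2: the same positions + 100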

Even Data Distribution


[Figure: multidc_tokens_even.png]