Apache Cassandra 1.0 Documentation

About Data Partitioning in Cassandra

Data partitioning determines how data is distributed across the nodes in the cluster. Three factors are involved with data distribution:

  • A partitioner that determines which node to store the data on.
  • The number of copies of data, which is determined by the replica placement strategy.
  • The topology of the cluster, which is the number of nodes, the distribution of the nodes on racks, and the number of data centers.

Column family data is partitioned across the nodes based on the row key. To determine the node where the first replica of a row will live, Cassandra walks the ring clockwise until it locates the node with a token value greater than or equal to that of the row key. Each node is responsible for the region of the ring between itself (inclusive) and its predecessor (exclusive). With the nodes sorted in token order, the last node is considered the predecessor of the first node; hence the ring representation.

For example, consider a simple four-node cluster where all of the row keys managed by the cluster are numbers in the range of 0 to 100. Each node is assigned a token that represents a point in this range. In this simple example, the token values are 0, 25, 50, and 75. The first node, the one with token 0, is responsible for the wrapping range (76-0). The node with the lowest token also accepts row keys less than the lowest token and greater than the highest token.
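
As a concrete illustration of the walk described above, here is a minimal sketch (plain Python, not Cassandra code) that assigns a row key to a node in the example four-node ring; the function name find_owner is invented for illustration:

    def find_owner(row_key_token, node_tokens):
        # Walk the sorted ring clockwise: the first node whose token is
        # greater than or equal to the row key owns the first replica.
        for token in sorted(node_tokens):
            if row_key_token <= token:
                return token
        # Keys beyond the highest token wrap around to the lowest token.
        return min(node_tokens)

    tokens = [0, 25, 50, 75]
    print(find_owner(18, tokens))   # 25 -- the node with token 25 owns keys 1-25
    print(find_owner(80, tokens))   # 0  -- the wrapping range (76-0)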


[Figure: ring_partitions.png — token ranges in the example four-node ring]

Understanding the Partitioner Types

Unlike almost every other configuration choice in Cassandra, the partitioner cannot be changed without reloading all of your data. Therefore, it is important to choose and configure the correct partitioner before initializing your cluster. You set the partitioner in the cassandra.yaml file.
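
For reference, the partitioner setting in cassandra.yaml is a single line naming the partitioner class; the excerpt below shows the default choice (an excerpt only, not a complete configuration file):

    partitioner: org.apache.cassandra.dht.RandomPartitioner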

Cassandra offers the following partitioners:

  • RandomPartitioner
  • ByteOrderedPartitioner

About the RandomPartitioner

The RandomPartitioner (org.apache.cassandra.dht.RandomPartitioner) is the default partitioning strategy for a Cassandra cluster, and in almost all cases is the right choice.

The RandomPartitioner uses consistent hashing to determine which node stores which row. Unlike naive modulus-by-node-count schemes, consistent hashing ensures that when nodes are added to the cluster, the minimum possible set of data is affected.
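
The following small Python sketch (not Cassandra code; the key names, node counts, and the new node's token are made up for illustration) shows the difference: growing a modulus-based scheme from 4 to 5 nodes remaps most keys, while adding one node to a consistent-hashing ring only moves the keys in the range the new node takes over:

    import hashlib

    def md5_int(key):
        return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")

    keys = ["user%d" % i for i in range(10000)]

    # Naive modulus: growing from 4 to 5 nodes remaps roughly 80% of the keys.
    moved_mod = sum(1 for k in keys if md5_int(k) % 4 != md5_int(k) % 5)

    # Consistent hashing: each node owns a contiguous hash range, so adding a
    # node (here given a token that splits the first range) only moves the
    # keys in the range it takes over -- roughly 12% in this example.
    def owner(token, node_tokens):
        for t in sorted(node_tokens):
            if token <= t:
                return t
        return min(node_tokens)

    space = 2 ** 128                      # MD5 produces 128-bit values
    four_nodes = [i * space // 4 for i in range(4)]
    five_nodes = four_nodes + [space // 8]
    moved_ring = sum(1 for k in keys
                     if owner(md5_int(k), four_nodes) != owner(md5_int(k), five_nodes))

    print("modulus: %.0f%% moved, consistent hashing: %.0f%% moved"
          % (100.0 * moved_mod / len(keys), 100.0 * moved_ring / len(keys)))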

To distribute the data evenly across the number of nodes, a hashing algorithm creates an MD5 hash value of the row key. The possible range of hash values is 0 to 2^127 - 1. Each node in the cluster is assigned a token that represents a hash value within this range and then owns the rows with a hash value less than or equal to its token number (and greater than the token of the previous node on the ring). The primary benefit of this approach is that once your tokens are set appropriately, data from all of your column families is evenly distributed across the cluster with no further effort. For example, one column family could use user names as the row key and another could use timestamps, but the row keys from each individual column family are still spread evenly. This also ensures that read and write requests to the cluster are evenly distributed. Another benefit of using random partitioning is that it simplifies load balancing a cluster. Because each part of the hash range receives an equal number of rows on average, it is easier to correctly assign tokens to new nodes.
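
A minimal sketch of this mapping in Python, using an md5_token helper invented for illustration (Cassandra's own RandomPartitioner works on the absolute value of the 128-bit MD5 integer, so this approximates the idea rather than reproducing the exact implementation):

    import hashlib

    TOKEN_SPACE = 2 ** 127

    def md5_token(row_key):
        # Hash the row key with MD5 and reduce it to the 0 to 2^127 - 1 range.
        digest = hashlib.md5(row_key.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % TOKEN_SPACE

    # Row keys from different column families (user names in one, timestamps
    # in another) all land spread out across the same token range.
    for key in ("jsmith", "bobbyjo", "2011-10-31 09:00:00", "2011-10-31 09:00:01"):
        print(key, md5_token(key))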

When using the RandomPartitioner for single data center deployments, tokens are calculated by dividing the hash range by the number of nodes in the cluster. For multiple data center deployments, tokens are calculated per data center so that the hash range is evenly divided for the nodes in each data center. See About Partitioning in Multiple Data Center Clusters.
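
For a single data center, the calculation amounts to the following sketch (an informal illustration, not an official token generator):

    def tokens_for_single_dc(node_count):
        # Evenly divide the 0 to 2^127 - 1 hash range among the nodes.
        return [i * 2 ** 127 // node_count for i in range(node_count)]

    for token in tokens_for_single_dc(4):
        print(token)
    # 0
    # 42535295865117307932921825928971026432
    # 85070591730234615865843651857942052864
    # 127605887595351923798765477786913079296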

About the ByteOrderedPartitioner

Cassandra provides the ByteOrderedPartitioner (org.apache.cassandra.dht.ByteOrderedPartitioner) for ordered partitioning. (The OrderPreservingPartitioner and CollatingOrderPreservingPartitioner are deprecated as of Cassandra 0.7.) This partitioner orders rows lexically by key bytes. You calculate tokens by looking at the actual values of your row key data and using a hexadecimal representation of the leading character(s) in a key. For example, if you wanted to partition rows alphabetically, you could assign an A token using its hexadecimal representation of 41.
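
A small sketch of that calculation in Python; the alphabetical prefixes are illustrative only:

    def ordered_token(prefix):
        # Hex-encode the leading character(s) of the row keys a node should own.
        return prefix.encode("utf-8").hex()

    for prefix in ("A", "H", "O", "U"):
        print(prefix, ordered_token(prefix))   # A 41, H 48, O 4f, U 55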

Using the ordered partitioner allows range scans over rows. This means you can scan rows as though you were moving a cursor through a traditional index. For example, if your application has user names as the row key, you can scan rows for users whose names fall between Jake and Joe. This type of query is not possible with randomly partitioned row keys, since the keys are stored in the order of their MD5 hash (not sequentially). However, you can achieve the same functionality using column family indexes. Most applications can be designed with a data model that supports ordered queries as slices over a set of columns rather than range scans over a set of rows.
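
The difference in storage order can be illustrated with a short sketch (plain Python, not a Cassandra API; the user names are made up):

    import hashlib

    names = ["Jack", "Jake", "James", "Jane", "Jim", "Joe", "John"]

    byte_ordered = sorted(names)   # storage order under an ordered partitioner
    hash_ordered = sorted(names, key=lambda n: hashlib.md5(n.encode()).digest())

    # With byte-ordered keys, the users from Jake through Joe form one
    # contiguous slice of the ring, so a range scan can return them in order.
    print(byte_ordered[byte_ordered.index("Jake"):byte_ordered.index("Joe") + 1])

    # With hash-ordered keys the same users are scattered across the ring,
    # so there is no contiguous range to scan.
    print(hash_ordered)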

Unless absolutely required by your application, DataStax strongly recommends against using the ordered partitioner for the following reasons:

  • Sequential writes can cause hot spots: If your application tends to write or update a sequential block of rows at a time, then these writes are not distributed across the cluster; they all go to one node. This is frequently a problem for applications dealing with timestamped data.
  • More administrative overhead to load balance the cluster: An ordered partitioner requires administrators to manually calculate token ranges based on their estimates of the row key distribution. In practice, this requires actively moving node tokens around to accommodate the actual distribution of data once it is loaded.
  • Uneven load balancing for multiple column families: If your application has multiple column families, chances are that those column families have different row keys and different distributions of data. An ordered partitioner that is balanced for one column family may cause hot spots and uneven distribution for another column family in the same cluster.

About Partitioning in Multiple Data Center Clusters

The preferred replica placement strategy for multiple data center deployments is the NetworkTopologyStrategy, which calculates replica placement per data center. This strategy places the first replica for each row on the node that owns the row's token value. It places additional replicas in the same data center by walking the ring clockwise until it reaches the first node in another rack. This means that you must calculate partitioner tokens so that the data ranges are evenly distributed for each data center; otherwise, uneven data distribution within a data center may occur:

Uneven Data Distribution


[Figure: multidc_tokens_uneven.png]
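
The placement walk described above can be pictured with a minimal sketch (not the NetworkTopologyStrategy implementation; the tokens, data center names, and rack names are illustrative, and the fallback to same-rack nodes when a data center has too few racks is omitted):

    def place_replicas(row_token, nodes, replicas_per_dc=2):
        # nodes: list of (token, data_center, rack) tuples.
        placement = {}
        for dc in sorted({n[1] for n in nodes}):
            dc_ring = sorted(n for n in nodes if n[1] == dc)
            # First replica: the node whose token range covers the row.
            start = next((i for i, n in enumerate(dc_ring) if row_token <= n[0]), 0)
            chosen = [dc_ring[start]]
            # Walk clockwise until the first node in another rack is reached.
            i = (start + 1) % len(dc_ring)
            while len(chosen) < replicas_per_dc and i != start:
                if dc_ring[i][2] != chosen[0][2]:
                    chosen.append(dc_ring[i])
                i = (i + 1) % len(dc_ring)
            placement[dc] = [n[0] for n in chosen]
        return placement

    nodes = [(0, "DC1", "rack1"), (50, "DC1", "rack2"),
             (25, "DC2", "rack1"), (75, "DC2", "rack2")]
    print(place_replicas(30, nodes))   # {'DC1': [50, 0], 'DC2': [75, 25]}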

To ensure that the nodes for each data center have token assignments that evenly divide the overall range, each data center should be partitioned as if it were its own distinct ring. This prevents any one data center from holding a disproportionate number of row keys. However, you must avoid assigning tokens that conflict with token assignments elsewhere in the cluster. To make sure that each node has a unique token, see Calculating Tokens for a Multiple Data Center Cluster. A sketch of this per-data-center calculation follows the figure below.

Even Data Distribution


[Figure: multidc_tokens_even.png]
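
A minimal sketch of the per-data-center token calculation described above (an informal illustration; the offset of 100 per data center is a common convention for keeping tokens unique, not a requirement stated here):

    def tokens_per_data_center(nodes_per_dc, offset=100):
        # Partition each data center as if it were its own ring, then shift
        # each data center's tokens by a small offset so no token repeats.
        assignments = {}
        for dc, node_count in enumerate(nodes_per_dc):
            assignments[dc] = [i * 2 ** 127 // node_count + dc * offset
                               for i in range(node_count)]
        return assignments

    # Two data centers with three nodes each: both divide the range evenly,
    # and the second data center's tokens are offset by 100 to stay unique.
    for dc, tokens in tokens_per_data_center([3, 3]).items():
        print(dc, tokens)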