Technology•February 20, 2019

Consistent Hashing: Distributed Database Things to Know

Adron Hall

When I worked at Basho in 2013, I wrote about consistent hashing as part of a series called “Learning About Distributed Databases”. Today I’m kicking that back off after a few years (ok, after 5 or so years!) with this post on consistent hashing.

As with Riak, which I wrote about in 2013, Cassandra remains one of the core active distributed database projects that provides an effective and reliable consistent hash ring for a clustered distributed database system. A hash function is an algorithm that maps data of variable length to data of fixed length. Consistent hashing is a kind of hashing that uses this pattern to map keys to particular nodes around the ring in Cassandra. One can think of this as a kind of Dewey Decimal Classification system where the cluster nodes are the various bookshelves in the library.
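To make the "variable length in, fixed length out" idea concrete, here is a minimal Python sketch. It is only an illustration: the function name to_token is made up for this post, and it uses SHA-256 from the standard library as a stand-in, not the Murmur3 partitioner that Cassandra uses by default.

```python
import hashlib


def to_token(key: str) -> int:
    """Map a string of any length to a signed 64-bit integer.

    Assumption: SHA-256 is a stand-in here; Cassandra's default
    partitioner uses Murmur3, which is not in the standard library.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Keep the first 8 bytes so every key lands in the same fixed range.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


for key in ["a", "a-much-longer-key", "an even longer key with spaces in it"]:
    print(f"{key!r} -> {to_token(key)}")
```

Whatever the length of the input, the output always fits in the same 64-bit range, and the same key always produces the same value.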

Ok, so maybe the Dewey Decimal system isn’t the best analogy. Does anybody even learn about that any more? If you don’t know what it is, please read up and support your local library.

Consistent hashing distributes data across a cluster in a way that minimizes reorganization when nodes are added or removed. The data is split into partitions, and each partition is based on a particular partition key. The partition key shouldn’t be confused with a primary key, either. It’s more like a unique identifier controlled by the system that makes up part of a primary key, where that primary key is a composite key made up of multiple candidate keys.
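Here is a rough sketch of why reorganization stays small when a node joins. It is a toy, not how Cassandra is implemented: the Ring class and token function are hypothetical names, each node gets a single position on the ring (no virtual nodes, no replication), and MD5 stands in for Murmur3.

```python
import bisect
import hashlib


def token(value: str) -> int:
    # Assumption: MD5 as a stand-in hash; only the first 8 bytes are used.
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class Ring:
    """Toy consistent hash ring with one token per node."""

    def __init__(self, nodes=()):
        self._tokens = []  # sorted node tokens
        self._owners = {}  # node token -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        t = token(node)
        bisect.insort(self._tokens, t)
        self._owners[t] = node

    def remove(self, node: str) -> None:
        t = token(node)
        self._tokens.remove(t)
        del self._owners[t]

    def owner(self, key: str) -> str:
        # Walk clockwise to the first node token at or after the key's
        # token, wrapping around to the start of the ring if needed.
        i = bisect.bisect_left(self._tokens, token(key))
        return self._owners[self._tokens[i % len(self._tokens)]]


keys = [f"key-{i}" for i in range(1000)]
ring = Ring(["node-1", "node-2", "node-3", "node-4"])
before = {k: ring.owner(k) for k in keys}

ring.add("node-5")
after = {k: ring.owner(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to node-5")
```

The only keys that change owner are the ones the new node takes over from its neighbor on the ring. Everything else stays where it was, which is the whole point.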

For an example, let’s take a look at sample data from the DataStax docs on consistent hashing.

Suppose you have the following data:

NAME	AGE	CAR	GENDER
jim	36	camaro	M
carol	37	345s	F
johnny	12	supra	M
suzy	10	mustang	F

The database assigns a hash value to each partition key:

PARTITION KEY	MURMUR3 HASH VALUE
jim	-2245462676723223822
carol	7723358927203680754
johnny	-6723372854036780875
suzy	1168604627387940318

Each node in the cluster is responsible for a range of data based on the hash value.

[Figure: Hash values in a four node cluster]

DataStax Enterprise places the data on each node according to the value of the partition key and the range that the node is responsible for. For example, in a four node cluster, the data in this example is distributed as follows:

NODE	START RANGE	END RANGE	PARTITION KEY	HASH VALUE
1	-9223372036854775808	-4611686018427387904	johnny	-6723372854036780875
2	-4611686018427387903	-1	jim	-2245462676723223822
3	0	4611686018427387903	suzy	1168604627387940318
4	4611686018427387904	9223372036854775807	carol	7723358927203680754
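The lookup in that table can be sketched in a few lines of Python. The range boundaries and hash values are copied from the tables above; the names RANGE_ENDS, NODES, and node_for are made up for this illustration and are not part of any Cassandra or DataStax API.

```python
import bisect

# Inclusive end of each node's token range, from the four node table.
RANGE_ENDS = [
    -4611686018427387904,  # node 1
    -1,                    # node 2
    4611686018427387903,   # node 3
    9223372036854775807,   # node 4
]
NODES = [1, 2, 3, 4]

# Murmur3 hash values for each partition key, from the hash value table.
HASHES = {
    "jim": -2245462676723223822,
    "carol": 7723358927203680754,
    "johnny": -6723372854036780875,
    "suzy": 1168604627387940318,
}


def node_for(hash_value: int) -> int:
    """Return the node whose range contains the given 64-bit hash value."""
    # bisect_left finds the first range end that is >= hash_value.
    return NODES[bisect.bisect_left(RANGE_ENDS, hash_value)]


for name, hash_value in HASHES.items():
    print(f"{name} -> node {node_for(hash_value)}")
```

Because the range ends are sorted, finding the owning node is a binary search rather than a scan over every node.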

So there you go, that’s consistent hashing and how it works in a distributed database like Apache Cassandra, the derived distributed database DataStax Enterprise, or the mostly defunct (RIP) Riak. If you’d like to dig in further, I’ve also found distributed hash tables interesting, along with a host of other articles that delve into coding up a consistent hash table, the respective ring, and the whole enchilada. Check out these articles for more information and details:

The Simple Magic of Consistent Hashing by Mathias Meyer @roidrage, CTO of Travis CI. Mathias’s post is well written and drives home some good points.
Consistent Hashing: Algorithmic Tradeoffs by Damian Gryski @dgryski. This post from Damian is pretty intense, and if you want code, he’s got code for ya.
How Ably Efficiently Implemented Consistent Hashing by Srushtika Neelakantam. Srushtika does a great job not only of describing what consistent hashing is but also of drawing up diagrams, charts, and more to visualize what is going on. But that isn’t all; she also wrote up some code to show nodes coming and going. A really great post.

For more on distributed database things to know, subscribe to the blog--of course, the ole’ RSS feed works great too--and follow @CompositeCode on Twitter for blog updates.

The article was cross-posted from Adron's personal blog, Composite Code.
