Get your copy of the O’Reilly Cassandra eBook: The Definitive Guide - Download FREE Today
Consistent Hashing: Distributed Database Things to Know
When I worked at Basho in 2013, I wrote about consistent hashing as part of a series called “Learning About Distributed Databases”. Today I’m kicking that back off after a few years (ok, after 5 or so years!) with this post on consistent hashing.
As with Riak, which I wrote about in 2013, Cassandra remains one of the core active distributed database projects alive today that provides an effective and reliable consistent hash ring for the clustered distributed database system. This hash function is an algorithm that maps data to variable length to data that’s fixed. This consistent hash is a kind of hashing that provides this pattern for mapping keys to particular nodes around the ring in Cassandra. One can think of this as a kind of Dewey Decimal Classification system where the cluster nodes are the various bookshelves in the library.
Ok, so maybe the Dewey Decimal system isn’t the best analogy. Does anybody even learn about that any more? If you don’t know what it is, please read up and support your local library.
Consistent hashing allows data distributed across a cluster to minimize reorganization when nodes are added or removed. These partitions are based on a particular partition key. The partition key shouldn’t be confused with a primary key either, it’s more like a unique identifier controlled by the system that would make up part of a primary key of a primary key that is made up of multiple candidate keys in a composite key.
For an example, let’s take a look at sample data from the DataStax docs on consistent hashing.
For example, if you have the following data:
The database assigns a hash value to each partition key:
|PARTITION KEY||MURMUR3 HASH VALUE|
Each node in the cluster is responsible for a range of data based on the hash value.
Hash values in a four node cluster
DataStax Enterprise places the data on each node according to the value of the partition key and the range that the node is responsible for. For example, in a four node cluster, the data in this example is distributed as follows:
|NODE||START RANGE||END RANGE||PARTITION KEY||HASH VALUE|
So there you go, that’s consistent hashing and how it works in a distributed database like Apache Cassandra, the derived distributed database DataStax Enterprise, or the mostly defunct (RIP) Riak. If you’d like to dig in further, I’ve also found Distributed Hash Tables interesting and also a host of other articles that delve into coding up a consistent hash table, respective ring, and the whole enchilada. Check out these articles for more information and details:
- Simple Magic Consistent by Mathias Meyer @roidrage CTO of Travis CI. Mathias’s post is well written and drives home some good points.
- Consistent Hashing: Algorithmic Tradeoffs by Damien Gryski @dgryski. This post from Damien is pretty intense, and if you want code, he’s got code for ya.
- How Ably Efficiently Implemented Consistent Hashing by Srushtika Neelakantam. Srushtika does a great job not only of describing what consistent hashing is but also has drawn up diagrams, charts, and more to visualize what is going on. But that isn’t all, she also wrote up some code to show nodes coming and going. A really great post.
For more on distributed database things to know, subscribe to the blog--of course, the ole’ RSS feed works great too--and follow @CompositeCode on Twitter for blog updates.
The article was cross-posted from Adron's personal blog, Composite Code.