|Understanding the architecture|
In Cassandra, data distribution and replication go together, because Cassandra is designed as a peer-to-peer system that makes copies of the data and distributes those copies among a group of nodes. Data is organized by table and identified by a primary key, and the primary key determines which node the data is stored on. Copies of rows are called replicas. The first copy written also counts as a replica.
When you create a cluster, you must specify the following:
Details about how the consistent hashing mechanism distributes data across a cluster in Cassandra.
Consistent hashing partitions data based on the primary key. For example, if you have the following data:
|jim||age: 36||car: camaro||gender: M|
|carol||age: 37||car: bmw||gender: F|
|johnny||age: 12||gender: M|
|suzy||age: 10||gender: F|
Cassandra assigns a hash value to each primary key:
|Primary key||Murmur3 hash value|
Each node in the cluster is responsible for a range of data based on the hash value:
|Node||Murmur3 start range||Murmur3 end range|
Cassandra places the data on each node according to the hash value of the primary key and the range that the node is responsible for. For example, in a four-node cluster, the data in this example is distributed as follows:
|Node||Start range||End range||Primary key||Hash value|
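The placement logic described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Cassandra's implementation: the node names A to D are hypothetical, the token space is split into four equal contiguous ranges, and the hash is a stand-in built from MD5 rather than Murmur3, so the tokens it prints will not match the values a real cluster computes. The helper names `token_for`, `build_ring`, and `node_for` are invented for this example.

```python
import bisect
import hashlib

# Signed 64-bit token space, the same span the Murmur3 partitioner uses.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def token_for(key: str) -> int:
    """Map a primary key to a signed 64-bit token.

    ASSUMPTION: stand-in hash (first 8 bytes of MD5), NOT Murmur3.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


def build_ring(nodes):
    """Split the token space into equal contiguous ranges, one per node.

    Returns (start, end, node) tuples with both bounds inclusive.
    """
    span = 2**64 // len(nodes)
    ring = []
    for i, node in enumerate(nodes):
        start = MIN_TOKEN + i * span
        # The last node absorbs any remainder so the ring has no gap.
        end = MAX_TOKEN if i == len(nodes) - 1 else start + span - 1
        ring.append((start, end, node))
    return ring


def node_for(ring, token):
    """Return the node whose range contains the given token."""
    ends = [end for _, end, _ in ring]
    # First range whose end is >= token; ranges are sorted and contiguous.
    return ring[bisect.bisect_left(ends, token)][2]


ring = build_ring(["A", "B", "C", "D"])  # hypothetical four-node cluster
for key in ["jim", "carol", "johnny", "suzy"]:
    token = token_for(key)
    print(key, token, node_for(ring, token))
```

Because the token depends only on the key, every node can compute where a row belongs without consulting a central directory. Adding or removing a node changes only the ranges adjacent to it, which is the property that makes the hashing "consistent".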