The documentation I've read elsewhere agrees with this post: within a DC, replicas are assigned according to the key's hash and then around the ring, skipping nodes in the same rack. But that is not what I am seeing.
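To spell out what I expect, here is a small Python sketch of the rack-aware placement I have in mind (my own simplified model, not Cassandra's actual NetworkTopologyStrategy code): start at the node that owns the key's token, walk the ring, and take a node only if its rack has not been used yet, falling back to already-skipped nodes once every rack is represented.

```python
# Simplified model of rack-aware replica placement within one DC.
# This is my reading of the docs, not Cassandra's actual implementation.
from bisect import bisect_left

def rack_aware_replicas(ring, key_token, rf):
    ring = sorted(ring)                          # (token, node, rack), by token
    tokens = [t for t, _, _ in ring]
    start = bisect_left(tokens, key_token) % len(ring)   # primary node
    order = [ring[(start + i) % len(ring)] for i in range(len(ring))]

    chosen, seen_racks, skipped = [], set(), []
    for token, node, rack in order:
        if len(chosen) == rf:
            break
        if rack in seen_racks:
            skipped.append(node)                 # same rack: hold back for later
        else:
            chosen.append(node)
            seen_racks.add(rack)
    # once every rack is represented, fill up from the nodes we skipped
    chosen += skipped[:rf - len(chosen)]
    return chosen
```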
I start with a two-node DC spanning two racks. The nodes are in fact virtual machines on two separate physical machines. The tokens are 0 for node 1 on rack1 and 85070591730234615865843651857942052864 for node 2 on rack2. RF is 2, so the nodes replicate to each other. And that is the point: my idea is to replicate across racks and thereby across physical machines.
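Those look like RandomPartitioner tokens (range 0 to 2**127); assuming that, the second token is exactly the midpoint, so each node owns half the ring:

```python
# Node 2's token is 2**126, i.e. the midpoint of the RandomPartitioner
# range 0 .. 2**127, so the two nodes own half the ring each.
assert 85070591730234615865843651857942052864 == 2**126 == (2**127) // 2
```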
I then expand the cluster with two more nodes, one in each rack. The two new tokens that split the existing token ranges in half are assigned to the new nodes: 42535295865117307932921825928971026432 to the new node on rack1 and 127605887595351923798765477786913079296 to the new node on rack2. In the ring (i.e. in token order) the first node is still number 1, the new node on rack1 is now number 2, the old second node on rack2 is now number 3, and the new node on rack2 is number 4.
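To be explicit about the layout after the expansion (again assuming the RandomPartitioner range), the new tokens are the quarter points, and sorting by token gives exactly the order I described:

```python
# The two new tokens are the quarter points of the 0 .. 2**127 range.
assert 42535295865117307932921825928971026432 == 2**125
assert 127605887595351923798765477786913079296 == 3 * 2**125

ring = [
    (0,          "node1", "rack1"),   # original node
    (2**125,     "node2", "rack1"),   # new node
    (2**126,     "node3", "rack2"),   # original node
    (3 * 2**125, "node4", "rack2"),   # new node
]
for token, node, rack in sorted(ring):
    print(node, rack, token)
# token order: node1 (rack1), node2 (rack1), node3 (rack2), node4 (rack2)
```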
After waiting for bootstrapping to finish, I find that node 1 now replicates to node 2 (same rack), node 2 to node 3 (across to the other rack), node 3 to node 4 (same rack), and node 4 to node 1 (across). This is not what I wanted or expected. I expected a key to get its first replica on the node that owns the key (for example node 1), and the next replica on the next node in the ring that is not in the same rack, which in this case would be node 3 in rack2.
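Using the same toy model as above, comparing a plain token-order walk with the rack-skipping walk reproduces exactly this discrepancy (taking each node's own token as the key token):

```python
# Observed behaviour (plain token-order walk) vs what I expected
# (skip the ring successor when it is in the same rack as the primary).
from bisect import bisect_left

ring = sorted([
    (0,          "node1", "rack1"),
    (2**125,     "node2", "rack1"),
    (2**126,     "node3", "rack2"),
    (3 * 2**125, "node4", "rack2"),
])
tokens = [t for t, _, _ in ring]

def ring_order(key_token):
    """All nodes in ring order, starting at the key's primary node."""
    start = bisect_left(tokens, key_token) % len(ring)
    return [ring[(start + i) % len(ring)] for i in range(len(ring))]

for token, node, rack in ring:
    order = ring_order(token)              # use the node's own token as the key
    observed = order[1][1]                 # next node in the ring
    expected = next(n for _, n, r in order[1:] if r != rack)
    print(f"{node}: observed 2nd replica on {observed}, expected on {expected}")
# node1: observed node2 (same rack), expected node3
# node2: observed node3,             expected node3
# node3: observed node4 (same rack), expected node1
# node4: observed node1,             expected node1
```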
So I try to change the token assignments around, which is something I would really like to avoid, since the old nodes have accumulated quite a bit of data. I swap the tokens between node 2 on rack1 and node 3 on rack2. Now replication works the way I want, across racks, and I believe both racks now contain the full data set, so I have fault tolerance and can lose a whole rack and still have all the data. But it seems to work only because replication simply follows the token order around the ring.
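For completeness, here is the ring after the swap: the racks now alternate in token order, so even a plain token-order walk always lands on the other rack, which seems to be why it works.

```python
# Ring after swapping tokens between node2 (rack1) and node3 (rack2):
# the racks alternate, so the ring successor is always in the other rack.
ring = sorted([
    (0,          "node1", "rack1"),
    (2**126,     "node2", "rack1"),   # node2 now has node3's old token
    (2**125,     "node3", "rack2"),   # node3 now has node2's old token
    (3 * 2**125, "node4", "rack2"),
])
for i, (token, node, rack) in enumerate(ring):
    nxt = ring[(i + 1) % len(ring)]
    print(f"{node} ({rack}) -> {nxt[1]} ({nxt[2]})")
# node1 (rack1) -> node3 (rack2), node3 -> node2 (rack1),
# node2 -> node4 (rack2), node4 -> node1 (rack1)
```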
What am I missing?