Paul Cannon

The term eventual consistency often seems to bother newcomers to distributed data storage systems. Hopefully this post will put a more concrete face on it.

(If you are already familiar with the nature of eventual consistency, you may&nbsp;want to&nbsp;<a href="https://legacy-datastax-corporate.pantheonsite.io/dev/blog/your-ideal-performance-consistency-tradeoff#skip-a-bit">skip a bit</a>.)

Eventual consistency refers to a strategy used by many distributed systems to&nbsp;improve query and update latencies, and in a more limited way, to provide&nbsp;stronger availability to a system than could otherwise be attained.

There are a lot of parameters which come into play when trying to predict or&nbsp;model the performance of distributed systems. How many nodes can die before&nbsp;data is lost? How many can die without affecting the usability of the system?&nbsp;Are there any single points of failure? Can the system be used if, for some&nbsp;period of time, half of the nodes can't see the other half? How fast can I get&nbsp;data out of the system? How fast can it accept new data?

Since Apache Cassandra's distributed nature is based on <a href="http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf">Dynamo</a>, let us consider Dynamo-style systems here. These allow the user to specify how many nodes should get replicas of particular classes of data (the replication factor, commonly called <tt>N</tt> in the literature). They also allow the user to specify the number of nodes which must accept a write before it is considered successful (<tt>W</tt>) and the number of nodes which are consulted for each read (<tt>R</tt>). By varying <tt>N</tt>, <tt>W</tt>, and <tt>R</tt>, one can obtain a wide variety of scenarios with different properties of availability, consistency, reliability, and speed.

For example, Abby's top priority is that data never, ever be lost or out of date, and she has determined that she wants her system to be able to tolerate the loss of two nodes without going down. She may want to go all the way up to <tt>N=5</tt>, <tt>W=3</tt>, and <tt>R=3</tt>. Since <tt>W+R &gt; N</tt>, any node set chosen for reading will always intersect with any node set chosen for writing, and so Abby's data is guaranteed to be consistent, even if she loses up to two nodes within a replication set. (Note that <tt>N</tt> is not the same as the number of nodes in the whole system; it is just a lower bound on that number.)
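
The overlap claim is easy to check by brute force. Here is a minimal Python sketch; the function name is my own, and it simply enumerates every possible write set and read set rather than reflecting how Cassandra or Dynamo pick replicas:

```python
from itertools import combinations


def quorums_always_overlap(n, w, r):
    """Return True if every write set of size w shares at least one
    replica with every read set of size r, out of n replicas."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)  # an empty intersection is falsy
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


# Abby's configuration: N=5, W=3, R=3, so W+R > N.
print(quorums_always_overlap(5, 3, 3))
# A configuration with W+R <= N, for comparison: N=3, W=1, R=1.
print(quorums_always_overlap(3, 1, 1))
```

For small values of <tt>N</tt>, the brute-force answer agrees with the simple <tt>W+R &gt; N</tt> rule.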

Meanwhile, Bart's top priorities are speed and low hardware and disk costs, and he doesn't care much if he incurs downtime or data loss when a hardware failure occurs. He might then want <tt>N=W=R=1</tt>. He is also guaranteed data consistency, as long as the nodes stay available and working, but he only keeps one copy of each piece of data, so he only needs 1/5 of the hardware that Abby would need for the same amount of data.

Charlotte's app demands speed above all else. Her data is vital to her&nbsp;business, but there is a vast amount of it (maybe, like Netflix, she will be&nbsp;seeing around a&nbsp;<a href="http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html">million writes per second</a>), and she needs to keep costs&nbsp;down as much as possible. So she does some math, and determines that she must&nbsp;have 90% of reads up-to-date within 50 ms, and 99.9% of reads up-to-date&nbsp;within 250 ms.

Is Cassandra a good fit for Charlotte? She is going to be running at a pretty large scale: let's say 1000 nodes. Her MTTF numbers indicate she should expect a hardware failure about every 10 days, so she can't afford to use <tt>N=1</tt> like Bart. She has so much data that <tt>N=5</tt> would be cost-prohibitive. <tt>N=3</tt> may be an option, but should she use <tt>R=W=2</tt> for full strong consistency, or <tt>R=W=1</tt> for faster and cheaper eventual consistency? Or maybe <tt>R=1, W=2</tt> or vice versa? Surely the "eventual" in eventual consistency means that she won't be able to meet the consistency requirements stated above? Just how "eventual" will it be?

Up until now, this question would largely have been a matter of intuition,&nbsp;guesswork, or large-scale profiling. But some folks in the&nbsp;<a href="http://www.eecs.berkeley.edu/">EECS department at&nbsp;UC Berkeley</a>&nbsp;threw a whole bunch of math and simulation at the problem in an&nbsp;effort to get a more objective handle on "How Eventual is Eventual&nbsp;Consistency?" You can read a&nbsp;<a href="http://www.eecs.berkeley.edu/~pbailis/projects/pbs/">summary of their results here</a>, and a much&nbsp;deeper&nbsp;<a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-4.pdf">technical report here</a>.

The best part is that they also provided the world with an interactive demo,&nbsp;which lets you fiddle with&nbsp;<tt>N</tt>,&nbsp;<tt>R</tt>, and&nbsp;<tt>W</tt>, as well as parameters defining your system's read and write latency distributions, and gives you a&nbsp;nice graph showing what you can expect in terms of consistent reads after a&nbsp;given time.

<a href="http://www.eecs.berkeley.edu/~pbailis/projects/pbs/#demo" name="skip-a-bit">See the interactive demo here</a>.

This terrific tool actually runs thousands of&nbsp;<a href="http://en.wikipedia.org/wiki/Monte_carlo_simulation">Monte Carlo simulations</a>&nbsp;per&nbsp;data point (turns out the math to create a full, precise formulaic solution&nbsp;was too hairy) to give a very reliable approximation of consistency for a&nbsp;range of times after a write.

It even goes as far as to model parts of the anti-entropy provided by Dynamo&nbsp;(and Cassandra): expanding partial quorums, which refers to writes being sent&nbsp;to all&nbsp;<tt>N</tt>&nbsp;nodes in a replication set, even when only the first&nbsp;<tt>W</tt>&nbsp;nodes&nbsp;will be waited for. I.e., with&nbsp;<tt>N=3</tt>&nbsp;and&nbsp;<tt>W=1</tt>, the effective latency of a&nbsp;write will be the latency of whichever of three nodes is fastest, but the&nbsp;write is still sent to all the nodes, if available. Cassandra also provides&nbsp;ongoing&nbsp;read repair&nbsp;and&nbsp;Merkle tree data repair&nbsp;(when requested) as&nbsp;additional anti-entropy measures which increase consistency even more, but&nbsp;this model does not take those into account. So it's on the conservative side.
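
To make the expanding partial quorum idea concrete, here is a minimal Python sketch. The function name and the sample latencies are made up for illustration; the only thing taken from the description above is that the write goes to all <tt>N</tt> replicas while the coordinator waits only for the <tt>W</tt> fastest acknowledgements.

```python
def effective_write_latency(ack_latencies_ms, w):
    """Return how long the coordinator waits for a write.

    ack_latencies_ms: one round-trip latency per replica; every replica
                      is sent the write, so the list has N entries.
    w:                number of acknowledgements required.
    """
    if not 1 <= w <= len(ack_latencies_ms):
        raise ValueError("w must be between 1 and the number of replicas")
    # The coordinator returns as soon as the w-th fastest ack arrives.
    return sorted(ack_latencies_ms)[w - 1]


# Hypothetical round-trip latencies for N=3 replicas, in milliseconds.
sample_latencies_ms = [12.0, 3.0, 7.5]
for w in (1, 2, 3):
    print(f"W={w}: {effective_write_latency(sample_latencies_ms, w)} ms")
```

With <tt>W=1</tt> the client only pays for the fastest replica, yet the two slower replicas still receive the write a little later, which is what pushes the system toward consistency over time.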

Let's use the demo to determine Charlotte's cheapest mode of operation. The four sliders at the bottom let you specify the latency distributions:
<img alt="latency sliders" data-align="center" data-entity-type="file" data-entity-uuid="c96b4d1c-6491-40e3-9f61-a54d957c2d64" src="https://www.datastax.com/sites/default/files/inline-images/latency_sliders-2.png" />
These four latency measurements are modeled as following an&nbsp;<a href="http://en.wikipedia.org/wiki/Exponential_distribution">Exponential&nbsp;distribution</a>. If you determine that your latency distribution doesn't fit,&nbsp;you can pretty easily modify the PBS simulation code to get a better model.

The&nbsp;W&nbsp;latency represents the amount of time between a client issuing a&nbsp;write, and the write actually being received by a node. The&nbsp;A&nbsp;latency models&nbsp;the amount of time between a node receiving a write and the reception by the&nbsp;coordinator of the write acknowledgement.&nbsp;R&nbsp;is for the latency between&nbsp;issuing a read and the read arriving at a node, and&nbsp;S&nbsp;is the latency between&nbsp;a read arriving on a node and the response arriving at the coordinator.
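
If you want a feel for what a simulation like this does, here is a rough Python sketch based on my reading of the four latencies described above. It is not the PBS code: the function name, parameter names, and trial count are my own, and it ignores read repair and node failures. Each trial draws exponential W, A, R, and S latencies for every replica, finds when the write is acknowledged by <tt>W</tt> nodes, issues a read <tt>t</tt> milliseconds later, and checks whether any of the first <tt>R</tt> responders had already received the write.

```python
import random


def consistency_probability(n, w, r, t_ms,
                            lam_w, lam_a, lam_r, lam_s,
                            trials=20000, seed=0):
    """Estimate the chance that a read issued t_ms after a write commits
    sees that write. The lam_* arguments are exponential rates (1/ms)."""
    rng = random.Random(seed)
    consistent = 0
    for _ in range(trials):
        w_lat = [rng.expovariate(lam_w) for _ in range(n)]  # write reaches replica
        a_lat = [rng.expovariate(lam_a) for _ in range(n)]  # ack reaches coordinator
        r_lat = [rng.expovariate(lam_r) for _ in range(n)]  # read reaches replica
        s_lat = [rng.expovariate(lam_s) for _ in range(n)]  # response reaches coordinator

        # The write commits when the w-th fastest acknowledgement arrives.
        commit_time = sorted(wi + ai for wi, ai in zip(w_lat, a_lat))[w - 1]
        read_start = commit_time + t_ms

        # The coordinator uses the r replicas whose responses arrive first.
        responders = sorted(range(n), key=lambda i: r_lat[i] + s_lat[i])[:r]

        # The read is consistent if any responder already held the write
        # at the moment the read request reached it.
        if any(w_lat[i] <= read_start + r_lat[i] for i in responders):
            consistent += 1
    return consistent / trials


# Hypothetical rates, chosen only for illustration.
for t_ms in (0, 10, 50, 250):
    p = consistency_probability(3, 1, 1, t_ms,
                                lam_w=0.1, lam_a=1.0, lam_r=1.0, lam_s=1.0)
    print(f"t={t_ms} ms: P(consistent) ~ {p:.4f}")
```

Reusing the same seed for every value of <tt>t</tt> means each run sees the same latency draws, so the estimates can only go up as <tt>t</tt> grows, which makes the curve easier to read.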

Cassandra isn't tooled to give straightforward average or histogram values for&nbsp;each of these metrics, although the Berkeley authors did make a patch for&nbsp;Cassandra to do so (we've requested a copy and may incorporate it back into&nbsp;stock Cassandra). However, you can get reasonable approximations if you have&nbsp;things like metrics showing average read and write latencies from your client&nbsp;software's perspective, read and write latencies as derived from the&nbsp;StorageProxyMBean, and&nbsp;<tt>nodetool cfstats</tt>&nbsp;output.

The parameter for the exponential distribution is λ, the&nbsp;rate. The&nbsp;rate is the inverse of the mean, so you can divide 1 by your determined&nbsp;average for each latency metric in milliseconds to get λ. Then just&nbsp;move the slider in the demo to get as close as you can to that value (the&nbsp;slider controls are a little bit finicky, but there doesn't seem to be a whole&nbsp;lot of change in the output over small differences in λ).
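
The conversion is a one-liner, sketched here in Python. The helper name and the sample averages are hypothetical; plug in whatever means you measured.

```python
def rate_from_mean_latency(mean_ms):
    """Convert a mean latency in milliseconds into the exponential
    rate parameter (lambda), in events per millisecond."""
    if mean_ms <= 0:
        raise ValueError("mean latency must be positive")
    return 1.0 / mean_ms


# Hypothetical measured averages for the four sliders, in milliseconds.
measured_means_ms = {"W": 4.0, "A": 1.0, "R": 2.0, "S": 0.5}
for name, mean_ms in measured_means_ms.items():
    print(f"{name}: mean {mean_ms} ms -> lambda {rate_from_mean_latency(mean_ms):.3f}")
```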

Here are some reasonable values I plugged in for Charlotte (although your&nbsp;numbers may vary considerably based on application, network, hardware, etc):
<img alt="tuned sliders" data-align="center" data-entity-type="file" data-entity-uuid="9a3f88a6-c172-4208-9e4e-362e916d5ee6" src="https://www.datastax.com/sites/default/files/inline-images/tuned_sliders-1.png" />
Charlotte wants to see, first, if she can get away with&nbsp;<tt>R=W=1</tt>, to get the&nbsp;best possible read and write latencies and expected availability. So we tune&nbsp;the&nbsp;Replica Configuration&nbsp;sliders:
<img alt="replication configuration" data-align="center" data-entity-type="file" data-entity-uuid="e4e4a5d2-fde2-4c90-a541-6aca0713819f" src="https://www.datastax.com/sites/default/files/inline-images/replica_config-1.png" />
And she wants to see the probability of reading any data that's not the&nbsp;absolute latest version, so she sets&nbsp;Tolerable Staleness&nbsp;to 1.
<img alt="tolerable staleness" data-align="center" data-entity-type="file" data-entity-uuid="61ffa65b-1d7d-4adb-9e4d-753375abde99" src="https://www.datastax.com/sites/default/files/inline-images/tolerable_staleness-1.png" />
And voilà, myriad calculations are performed, and answers are given.
<img alt="results" data-align="center" data-entity-type="file" data-entity-uuid="63557299-b419-47eb-ae79-db353d471b40" src="https://www.datastax.com/sites/default/files/inline-images/results1-1.png" />
Wow! It turns out that even with <tt>N=3</tt> and <tt>R=W=1</tt>, under "eventual consistency" semantics, Charlotte can expect a remarkable amount of "real consistency". The numbers exceed her requirements!

Go ahead and explore the demo. You'll find that the write latency (W) distribution is often a particularly strong factor in determining how much consistency you can expect. This works out great for Cassandra, which is absurdly fast at performing writes.

Keep in mind that the demo shows a conservative lower bound on consistency probabilities, and the actual probabilities are likely to be noticeably higher, provided the latencies you entered are accurate.

Your Ideal Performance: Consistency Tradeoff
