This is a guest post by Peter G. Walen.
The CAP theorem has become a standard way to understand data management behavior in a distributed system. Fundamentally, the CAP theorem asserts that of the three aspects of data management that make up the CAP acronym — consistency (C), availability (A) and partition tolerance (P) — only two are possible at any moment in time.
In other words, if you want the latest information (consistency) from your data source immediately (availability), that information needs to be stored in a single, fail-safe location, which means giving up partition tolerance. If you want to spread the data out over many nodes (partition tolerance), you have to choose: either every read returns the latest data but may have to wait for it (consistency), or every read returns immediately but the data might be stale, awaiting an update to the most recent state (availability).
To use the quick and dirty explanation, when it comes to data management in a distributed system, you have a choice of good (consistent), fast (available) or distributed (partition-tolerant): pick two.
For testing large systems at scale, CAP can be a nightmare. It creates a moving target. The theorem asserts that if you want the latest data, you’ll need to wait. Or, if you don’t want to wait, your data might be stale. If this is the case, then how does one design tests in a distributed system and conduct them in a meaningful way?
Well, it’s not easy. The place to start is with a service-level agreement (SLA) that defines the required state of data according to the situation at hand.
Testing in Terms of the SLA
An SLA that defines the required state of data for each situation gives developers, operations personnel and test practitioners the common understanding they need to make sure a system works to expectations.
As the CAP theorem indicates, a distributed system cannot guarantee that you will always get the latest data immediately. In some situations, this might be a show-stopper, but there are times when it’s not. For example, if you have an application that publishes recipes from a foodie website, having readers wait for a minute or two for a new recipe to post might be quite acceptable.
On the other hand, if you’re running an auction site that requires real-time bidding, the time users need to wait to see the last bid is critical and cannot be left to individual interpretation. In this case, there’s a big difference between a write-to-read timespan of 1 millisecond and 100 milliseconds. Thus, an SLA that describes exactly the expected timespan between the latest write and current read is essential; otherwise, development, operations, and testing personnel are flying blind.
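One way to make such an SLA testable is to measure the write-to-read timespan directly and compare it with the agreed limit. The sketch below is a minimal illustration, not a real client: `FakeReplicatedStore`, its fixed `replication_lag_ms` and the simulated clock are all assumptions invented for the example, standing in for whatever system is actually under test.

```python
from dataclasses import dataclass, field


@dataclass
class FakeReplicatedStore:
    """Hypothetical store: a write becomes readable only after a fixed lag."""
    replication_lag_ms: int
    now_ms: int = 0  # simulated clock, so the example is deterministic
    _pending: list = field(default_factory=list)  # (visible_at_ms, key, value)
    _replica: dict = field(default_factory=dict)

    def write(self, key, value):
        self._pending.append((self.now_ms + self.replication_lag_ms, key, value))

    def advance(self, ms):
        """Move the simulated clock forward and apply writes that are now due."""
        self.now_ms += ms
        due = [w for w in self._pending if w[0] <= self.now_ms]
        self._pending = [w for w in self._pending if w[0] > self.now_ms]
        for _, key, value in due:
            self._replica[key] = value

    def read(self, key):
        return self._replica.get(key)


def write_to_read_lag_ms(store, key, value, step_ms=1, timeout_ms=1000):
    """Write a value, poll until a read returns it, and report the elapsed time."""
    start = store.now_ms
    store.write(key, value)
    while store.read(key) != value:
        if store.now_ms - start >= timeout_ms:
            raise TimeoutError(f"{key!r} not visible after {timeout_ms} ms")
        store.advance(step_ms)
    return store.now_ms - start


def meets_sla(lag_ms, max_lag_ms):
    """True when the measured lag is within the limit the SLA allows."""
    return lag_ms <= max_lag_ms


lag = write_to_read_lag_ms(FakeReplicatedStore(replication_lag_ms=40), "bid", 105)
print(lag, meets_sla(lag, max_lag_ms=100), meets_sla(lag, max_lag_ms=1))
```

With a simulated 40 ms replication lag, the same measurement passes an SLA of 100 milliseconds and fails an SLA of 1 millisecond, which is why the number has to be written down rather than left to interpretation.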
The Degrees of Consistency and Availability
The good news is that when it comes to creating an SLA that describes performance metrics in terms of the constraints of the CAP theorem, you don’t have to reinvent the wheel. Companies have had to deal with these problems for a while.
In fact, there is a rubric created by Doug Terry at Microsoft Research that describes different degrees of availability and consistency for data within a given system.
Guarantee | Description | Consistency | Performance | Availability |
Strong Consistency | See all previous writes | excellent | poor | poor |
Consistent Prefix | See the initial sequence of writes | acceptable | good | excellent |
Bounded Staleness | See all “old” writes | good | acceptable | poor |
Monotonic Reads | See increasing subset of writes | acceptable | good | good |
Read My Writes | See all writes performed by reader | acceptable | acceptable | acceptable |
Eventual Consistency | See subset of previous writes | poor | excellent | excellent |
Table 1: States of data availability and consistency, from strong to weak
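The ratings in Table 1 can also be captured as data, so that an SLA discussion can start from a simple query. The following is a rough sketch only: it assumes the three rating columns are consistency, performance and availability in that order, and the names `Rating`, `Profile`, `GUARANTEES` and `candidates` are made up for illustration.

```python
from enum import IntEnum
from typing import NamedTuple


class Rating(IntEnum):
    """Ordered so that ratings can be compared with >= and <=."""
    POOR = 0
    ACCEPTABLE = 1
    GOOD = 2
    EXCELLENT = 3


class Profile(NamedTuple):
    consistency: Rating
    performance: Rating
    availability: Rating


# Ratings copied from Table 1.
GUARANTEES = {
    "Strong Consistency": Profile(Rating.EXCELLENT, Rating.POOR, Rating.POOR),
    "Consistent Prefix": Profile(Rating.ACCEPTABLE, Rating.GOOD, Rating.EXCELLENT),
    "Bounded Staleness": Profile(Rating.GOOD, Rating.ACCEPTABLE, Rating.POOR),
    "Monotonic Reads": Profile(Rating.ACCEPTABLE, Rating.GOOD, Rating.GOOD),
    "Read My Writes": Profile(Rating.ACCEPTABLE, Rating.ACCEPTABLE, Rating.ACCEPTABLE),
    "Eventual Consistency": Profile(Rating.POOR, Rating.EXCELLENT, Rating.EXCELLENT),
}


def candidates(min_consistency, min_availability):
    """Guarantees whose ratings meet both minimums, in table order."""
    return [
        name
        for name, profile in GUARANTEES.items()
        if profile.consistency >= min_consistency
        and profile.availability >= min_availability
    ]


print(candidates(Rating.ACCEPTABLE, Rating.GOOD))
print(candidates(Rating.GOOD, Rating.POOR))
```

The first query asks for at least acceptable consistency with good availability; the second asks for good consistency and accepts any availability.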
Table 1 above provides language that you can use when creating an SLA for data access performance. To understand how the different types of data availability and consistency can be applied, Terry describes the degrees of data state in terms of the various data roles involved in a baseball game.
In a baseball game, there are a variety of parties who need access to data being generated by the game. However, each party does not need the same degree of availability and consistency. Table 2 below describes the parties and their requirements.
Role | Required guarantee |
Official scorekeeper | Read My Writes |
Umpire | Strong Consistency |
Radio reporter | Consistent Prefix, Monotonic Reads |
Sportswriter | Bounded Staleness |
Statistician | Strong Consistency, Read My Writes |
Stat watcher | Eventual Consistency |
Table 2: Different aspects of data access illustrated by a baseball game
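Table 2 is effectively a lookup from role to required guarantees, and it can be written down as one. The sketch below is illustrative only: `ROLE_REQUIREMENTS` and `roles_served` are hypothetical names, and the check treats each guarantee as a plain label. It does not model the fact that a stronger guarantee may also satisfy a weaker one.

```python
# Requirements copied from Table 2.
ROLE_REQUIREMENTS = {
    "Official scorekeeper": {"Read My Writes"},
    "Umpire": {"Strong Consistency"},
    "Radio reporter": {"Consistent Prefix", "Monotonic Reads"},
    "Sportswriter": {"Bounded Staleness"},
    "Statistician": {"Strong Consistency", "Read My Writes"},
    "Stat watcher": {"Eventual Consistency"},
}


def roles_served(offered):
    """Roles whose every required guarantee is in the offered set, sorted by name."""
    offered = set(offered)
    return sorted(
        role for role, needed in ROLE_REQUIREMENTS.items() if needed <= offered
    )


print(roles_served({"Strong Consistency", "Read My Writes"}))
print(roles_served({"Consistent Prefix"}))
```

A data source that offers only Consistent Prefix serves nobody in this table, because the radio reporter needs Monotonic Reads as well.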
For example, an official scorekeeper will be keeping track of the result of each pitch, plays, outs and runs scored. Therefore, when scorekeepers enter this data into the scoring system, they need to be able to read any write they made immediately. Thus, we’ll say that the SLA for scorekeepers needs to support Read My Writes.
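A test for Read My Writes can be sketched as a check over a recorded history of operations. This is a simplified illustration under stated assumptions: the history format, the integer version numbers that grow with each write, and the function name `read_my_writes_violations` are all hypothetical.

```python
def read_my_writes_violations(history):
    """Return the indexes of reads that miss the reader's own earlier write.

    history is a list of (client, op, key, version) tuples in the order each
    operation completed; op is "write" or "read". Versions are assumed to be
    integers that increase with every write to a key.
    """
    last_written = {}  # (client, key) -> highest version that client wrote
    violations = []
    for index, (client, op, key, version) in enumerate(history):
        own = last_written.get((client, key))
        if op == "write":
            last_written[(client, key)] = version if own is None else max(own, version)
        elif op == "read" and own is not None and version < own:
            violations.append(index)
    return violations


history = [
    ("scorekeeper", "write", "score", 1),
    ("scorekeeper", "read", "score", 1),
    ("scorekeeper", "write", "score", 2),
    ("scorekeeper", "read", "score", 1),  # stale: misses the scorekeeper's own write
    ("fan", "read", "score", 1),          # allowed: the fan wrote nothing
]
print(read_my_writes_violations(history))
```

Only the scorekeeper's second read is flagged. The fan sees the same stale value, but Read My Writes makes no promise to readers about other people's writes.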
An umpire needs to be aware of strikes and balls in order to determine when an inning is over. However, the only time the umpire needs to know the score is at the end of the top of the ninth inning. (If the home team is ahead when the bottom of the ninth inning is about to begin, there is no need to continue the game. The home team will be the winner no matter what.) At that moment, the umpire needs to know the current score immediately, and it has to reflect every run scored so far.
Hence, the umpire’s SLA needs to prescribe Strong Consistency.
Radio reporters broadcast with a certain amount of lag time between when the game’s data was generated and when it’s announced. That’s the expectation of the audience listening in. If radio reporters are delayed a few seconds or even a minute when reporting the score, the world won’t come to an end; they can tolerate a bit of a delay getting information. However, what radio reporters cannot tolerate is the information coming in out of sequence. They have to see the results of each batter’s performance as it happens within the batting order. Therefore, the SLA for radio reporters needs to declare an availability and consistency state of Consistent Prefix and Monotonic Reads.
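The two guarantees can be sketched as separate checks over what a reader sees. This is a minimal illustration, assuming each read returns the list of plays visible to the reader and that every play is a unique string; the function names are invented for the example.

```python
def is_consistent_prefix(writes, snapshot):
    """True when the snapshot is exactly the first len(snapshot) writes, in order."""
    return list(snapshot) == list(writes[: len(snapshot)])


def reads_are_monotonic(snapshots):
    """True when each read contains everything the previous read contained."""
    return all(
        set(earlier) <= set(later)
        for earlier, later in zip(snapshots, snapshots[1:])
    )


writes = ["batter 1: single", "batter 2: strikeout", "batter 3: home run"]

in_order = [writes[:1], writes[:2]]
print(all(is_consistent_prefix(writes, s) for s in in_order))  # every read is a prefix
print(reads_are_monotonic(in_order))                           # reads only grow

print(is_consistent_prefix(writes, ["batter 2: strikeout"]))   # batter 1 is missing
print(reads_are_monotonic([writes[:2], writes[:1]]))           # second read went backwards
```

The first two checks pass and the last two fail. A reporter who sees batter 2 before batter 1, or whose view of the game moves backwards between reads, would announce the game out of sequence.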
Sportswriters have a lot of flexibility because they will post an article after the game is over. Their SLA needs to ensure that all the data is consistent and available at a predefined point after the game, so it needs to be Bounded Staleness. The data can be stale up to a given point in time, but it must be accurate.
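Bounded Staleness turns into a straightforward check once the bound is written down. The sketch below assumes writes are recorded as `(timestamp, version)` pairs with increasing integer versions, that version 0 means "nothing written yet", and that all times share one unit; `within_staleness_bound` is a hypothetical helper, not part of any library.

```python
def within_staleness_bound(writes, read_time, observed_version, bound):
    """True when the read reflects every write made at or before read_time - bound.

    writes is a list of (timestamp, version) pairs. Writes newer than the
    cutoff are allowed to be missing from the read.
    """
    cutoff = read_time - bound
    required = max((version for ts, version in writes if ts <= cutoff), default=0)
    return observed_version >= required


# Timestamps in minutes after the first pitch; versions count score updates.
writes = [(0, 1), (10, 2), (20, 3)]

print(within_staleness_bound(writes, read_time=25, observed_version=2, bound=10))
print(within_staleness_bound(writes, read_time=25, observed_version=1, bound=10))
```

With a bound of 10 minutes, a read at minute 25 must include the update from minute 10 but may still be missing the one from minute 20, so the first check passes and the second fails.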
Statisticians require the most up-to-date data possible, and they need to see the data they enter themselves immediately. They also need to see all other data entered as close to immediately as possible. The SLA for a statistician needs to indicate Strong Consistency and Read My Writes.
The stat watcher who periodically checks on the team’s season statistics has more flexibility. Numbers that are slightly out of date are acceptable and probably expected occasionally, so the stat watcher would be content with an SLA stipulating Eventual Consistency.
Putting It All Together
As the baseball example shows, the SLA for each role varies according to the needs of that role. This analogy applies well to creating SLAs in the real world.
Not every role in your organization will need the same degree of data consistency and availability. Some parties will be able to wait for the latest data. Some can make do with what’s available. Some others will want the latest data instantaneously, or as near to instantaneously as possible.
The expense of support will vary according to the need at hand. Getting data at near-instantaneous speeds might be very expensive. Is it worth the expense? It very well might be under certain conditions. In order to make such a judgment, the business needs to know what it’s getting for the spend. This is why the service-level agreement matters.
It’s the job of development and operations to meet the conditions of the SLA. It’s the job of test practitioners to ensure the conditions are being met. It’s the job of the business to make sure that the funds are there to cover the costs at hand. But, no matter which department is in play, having a well-defined SLA is critical to achieving success.
Peter G. Walen has over 25 years of experience in software development, testing, and agile practices. He works hard to help teams understand how their software works and interacts with other software and the people using it. He is a member of the Agile Alliance, the Scrum Alliance and the American Society for Quality (ASQ) and an active participant in software meetups and frequent conference speaker.