How database companies keep their data straight

As developers tackle increasingly large problems, they have to store their data in more complex ways -- adding a constellation of computers to house it all.

But adding more computer hardware can lead to confusion when different parts of the network need to be accessed for any particular query, especially when speedy requests for data are so common. Each database update must be broadcast to all the computers -- sometimes sprawled across different datacenters -- before the update is complete.

Complex data requires complex solutions

Developers like to have a "single source of truth" when they build applications: one authoritative record of essential information that can report the most current values at any time.

Delivering this consistency with one computer running a database is simple. When there are several machines running in parallel, defining a single version of the truth can become complicated. If two or more changes arrive on different machines in short succession, there's no simple way for the database to choose which came first. When computers do their jobs in milliseconds, the order of such changes can be ambiguous, forcing the database to choose who gets the airplane seat or the concert tickets.

The problem only grows with the size of tasks assigned to a database. More and more jobs require large databases that span multiple machines. These machines may be located in different datacenters around the world to improve response time and add remote redundancy. But the extra communication time required greatly increases complexity when the database updates arrive in close succession on different machines.

And the problem can't just be solved by handing everything over to a high-end cloud provider. Database services offered by giants like Amazon AWS, Google Cloud, and Microsoft Azure all have limits when it comes to consistency, and they may offer multiple variations of consistency to choose from.

To be sure, some jobs aren't affected by this problem. Many applications merely ask databases to track slowly evolving or unchanging values -- like, say, the size of your monthly utility bill or the winner of last season's ball games. The information is written once, and all subsequent requests will get the same answer.

Other jobs, like tracking the number of open seats on an airplane, can be very tricky. If two people are trying to buy the last seat on the plane, they may both receive a response saying one seat is left. The database needs to take extra steps to ensure that seat is only sold once. (The airline may still choose to overbook a flight, but that's a business decision, not a database mistake.)
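The "extra steps" can be pictured on a single machine as making the check and the sale one indivisible operation. The following is a minimal sketch, not any airline's or vendor's implementation; the SeatInventory class and sell_last_seat helper are hypothetical names invented for illustration, and a real distributed database would need coordination across machines rather than an in-process lock.

```python
import threading


class SeatInventory:
    """Hypothetical single-node seat counter guarded by a lock."""

    def __init__(self, seats_left):
        self._seats_left = seats_left
        self._lock = threading.Lock()

    def try_buy(self):
        # The check and the decrement happen under one lock, so two buyers
        # can never both see "one seat left" and both succeed.
        with self._lock:
            if self._seats_left > 0:
                self._seats_left -= 1
                return True
            return False

    @property
    def seats_left(self):
        with self._lock:
            return self._seats_left


def sell_last_seat(buyers=2):
    """Let several buyers race for one seat; return their results and the seats left."""
    inventory = SeatInventory(seats_left=1)
    results = []
    results_lock = threading.Lock()

    def buyer():
        ok = inventory.try_buy()
        with results_lock:
            results.append(ok)

    threads = [threading.Thread(target=buyer) for _ in range(buyers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, inventory.seats_left
```

However many buyers race, exactly one of them gets True and the count never drops below zero.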

Databases work hard to maintain consistency when the changes are elaborate by bundling any number of complicated changes into single packages known as "transactions." If four people flying together want seats on the same flight, the database can keep the set together and only process the changes if there are four empty seats available, for example.
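The four-travelers case can be sketched with Python's built-in sqlite3 module. This is an illustrative sketch on a single local database, not a distributed one; the seats table, the flight code used in the usage note, and the helper names (make_flight, book_group, free_seats) are assumptions made up for the example.

```python
import sqlite3


class NotEnoughSeats(Exception):
    pass


def make_flight(conn, flight, n_seats):
    """Create the hypothetical seats table and add empty seats for one flight."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seats (flight TEXT, seat INTEGER, passenger TEXT)"
    )
    with conn:  # commit the inserts
        conn.executemany(
            "INSERT INTO seats (flight, seat, passenger) VALUES (?, ?, NULL)",
            [(flight, n) for n in range(1, n_seats + 1)],
        )


def free_seats(conn, flight):
    row = conn.execute(
        "SELECT COUNT(*) FROM seats WHERE flight = ? AND passenger IS NULL",
        (flight,),
    ).fetchone()
    return row[0]


def book_group(conn, flight, passengers):
    """Seat every passenger in one transaction, or seat nobody."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for name in passengers:
                row = conn.execute(
                    "SELECT seat FROM seats WHERE flight = ? AND passenger IS NULL "
                    "ORDER BY seat LIMIT 1",
                    (flight,),
                ).fetchone()
                if row is None:
                    # Earlier updates in this loop are undone by the rollback.
                    raise NotEnoughSeats(flight)
                conn.execute(
                    "UPDATE seats SET passenger = ? WHERE flight = ? AND seat = ?",
                    (name, flight, row[0]),
                )
        return True
    except NotEnoughSeats:
        return False
```

With three seats on a flight, a call such as book_group(conn, "XY100", ["Ann", "Bo", "Cy", "Di"]) returns False and leaves all three seats empty, because the partial bookings are rolled back.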

In many cases, database creators need to decide whether they want to trade consistency for speed. Is strong consistency worth slowing down the updates until they reach all corners of the database? Or is it better to plow ahead because the odds are low that any inconsistency will cause a significant problem? After all, is it really all that tragic if someone who buys a ticket five milliseconds later than someone else actually gets the ticket? You could argue no one will notice.

The problem only occurs in the sliver of time it takes new versions of the data to propagate throughout the network. The databases will converge on a correct and consistent answer, so why not take a chance if the stakes are low?
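That sliver of time can be made visible with a toy model. The sketch below is a deliberate simplification: the Replica and Cluster classes are hypothetical, propagation is triggered by hand, and a single cluster-wide version counter stands in for the much harder problem of ordering writes across machines.

```python
class Replica:
    """One copy of the data, storing key -> (version, value)."""

    def __init__(self):
        self.data = {}

    def apply(self, key, value, version):
        current = self.data.get(key)
        # Keep only the newest version so replicas converge on the same value.
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


class Cluster:
    """Toy cluster: a write lands on one replica and reaches the rest later."""

    def __init__(self, n):
        self.replicas = [Replica() for _ in range(n)]
        self.pending = []   # updates not yet delivered: (replica index, key, value, version)
        self.version = 0    # assumption: a global counter orders all writes

    def write(self, key, value, at=0):
        self.version += 1
        self.replicas[at].apply(key, value, self.version)
        for i in range(len(self.replicas)):
            if i != at:
                self.pending.append((i, key, value, self.version))

    def propagate(self):
        # Deliver every queued update; afterwards all replicas agree.
        for i, key, value, version in self.pending:
            self.replicas[i].apply(key, value, version)
        self.pending.clear()
```

Between write() and propagate(), a read from another replica returns the old value (or nothing at all); after propagate(), every replica returns the same answer.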

There are now several "eventually consistent" versions supported by different databases. The quandary of how best to approach the problem has been studied extensively over the years. Computer scientists like to talk about the CAP theorem, which describes the tradeoff among consistency, availability, and partition tolerance. A system can be designed to deliver any two of the three, but the theorem says it cannot guarantee all three at once.

Why is eventual consistency important?

The idea of eventual consistency evolved as a way to soften the expectations of accuracy in the moments when it's hardest to deliver: just after new information has been written to one node but before it has propagated throughout the constellation of machines responsible for storing the data. Database developers often try to be more precise by spelling out the different versions of consistency they are able to offer. Amazon chief technology officer Werner Vogels described five different versions Amazon considered when designing some of the databases that power Amazon Web Services (AWS). The list includes versions like "session consistency," which promises consistency but only within the context of a particular session.

The notion is closely connected to NoSQL databases because many of these products began by promising only eventual consistency. Over the years, database designers have studied the problem in greater detail and developed better models to describe the tradeoffs with more precision. The idea still troubles some database administrators, the kind that wear both belts and suspenders to work, but users who don't need perfect answers appreciate the speed.

How are legacy players approaching this?

Traditional database companies like Oracle and IBM remain committed to strong consistency, and their main database products continue to support it. Some developers use very large computers with terabytes of RAM to run a single database that maintains a single, consistent record. For banking and warehouse inventory jobs, this can be the simplest way to grow.

Oracle also supports clusters of databases, including MySQL, and these may resort to supplying eventual consistency for jobs that require more size and speed than perfection.

Microsoft's Azure Cosmos DB offers five levels of guarantee, ranging from strong to eventual consistency. Developers can trade speed for accuracy depending upon the application.

What are the upstarts doing?

Many of the emerging NoSQL database services explicitly embrace eventual consistency to simplify development and increase speed. The startups may have begun by offering the simplest model for consistency, but lately they've been giving developers more options to trade away raw speed for better accuracy when needed.

Cassandra, one of the earliest NoSQL database offerings, now offers nine options for write consistency and 10 options for read consistency. Developers can trade speed for consistency according to the application's demands.
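The arithmetic behind this kind of tuning is often summarized by a quorum rule: if a write must be acknowledged by W replicas and a read must consult R replicas out of N copies, a read is guaranteed to overlap the latest acknowledged write when R + W > N. The sketch below only illustrates that rule; the function names are hypothetical and it does not call Cassandra or any driver.

```python
def quorum(n_replicas):
    """Smallest majority of n_replicas copies."""
    return n_replicas // 2 + 1


def read_overlaps_latest_write(n_replicas, write_acks, read_acks):
    """True when every read set must share at least one replica with every write set."""
    return read_acks + write_acks > n_replicas
```

With three copies, writing and reading at a quorum of two satisfies the rule (2 + 2 > 3), while writing to one replica and reading from one does not (1 + 1 is not greater than 3). The second combination is faster but may return stale data.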

Couchbase, for instance, offers what the company calls a "tunable" amount of consistency that can vary from query to query. MongoDB may be configured to offer eventual consistency for read-only replicas for speed, but it can also be configured with a variety of options that offer more robust consistency. PlanetScale offers a model that balances consistent replication with speed, arguing that banks aren't the only ones that need to fight inconsistency.

Some companies are building new protocols that come closer to strong consistency. For example, Google's Spanner relies upon a very accurate set of clocks to synchronize the versions running in different datacenters. The database is able to use these timestamps to determine which new block of data arrived first. FaunaDB, on the other hand, uses a version of a protocol that doesn't rely on highly accurate clocks. Instead, the company creates synthetic timestamps that can help decide which version of competing values to keep.
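One classic way to order updates without accurate clocks is a Lamport logical clock, sketched below. This illustrates the general idea of synthetic timestamps only; it is not a description of FaunaDB's or Spanner's actual protocol, and the LamportClock class and latest helper are hypothetical names.

```python
class LamportClock:
    """Logical clock: a counter plus a node id used to break ties."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        # Stamp a local event, such as a write, with the next logical time.
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, timestamp):
        # On receiving another node's timestamp, never fall behind it.
        self.counter = max(self.counter, timestamp[0])


def latest(versions):
    """Pick the value with the highest timestamp from (timestamp, value) pairs."""
    return max(versions, key=lambda pair: pair[0])[1]
```

If node "a" writes and node "b" sees that write before making its own, b's timestamp is strictly larger, so latest() keeps b's value. If the two writes are truly concurrent, the counters tie and the node id breaks the tie the same way on every machine, which is what lets all replicas agree on a winner.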

Yugabyte has chosen to embrace consistency and partition tolerance from the CAP theorem and trade away availability. Some read queries will pause until the database reaches a consistent state. CockroachDB uses a model that it says sometimes offers a serialized version of the data, but not a linearized one.

The limits of eventual consistency

For critical tasks, like those involving money, users are willing to wait for answers without inconsistencies. Eventually consistent models may be acceptable for many data collection jobs, but they aren't appropriate for tasks that require a high degree of trust. When companies can afford to support large computers with plenty of RAM, databases that offer strong consistency are appropriate for any applications that control scarce resources.