CAP Theorem Proof, Narrow Definitions and Modern Context

This blog focuses on CAP theorem proof, the narrow definitions and the modern context i.e. the relevance of CAP theorem in system design and architecture today.

The CAP Theorem (also known as Brewer’s Theorem) is a fundamental principle in distributed systems. It states that a distributed data store can only provide two out of three guarantees at any given time:

Consistency: Every read receives the most recent write or an error (Linearizability).

Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write.

Partition Tolerance: The system continues to operate despite an arbitrary number of messages being dropped or delayed by the network between nodes.

Is the theorem still valid?

The short answer is yes, the mathematical proof remains valid, but its practical application has evolved.The "pick two" framing is often seen as too simplistic for following reasons:

The “P” is not optional

In any distributed system running over a network, partitions (failures) are inevitable. Therefore, you cannot choose "CA" (Consistency + Availability) because you cannot choose to ignore network failures. You are actually choosing between CP (Consistency during a partition) and AP (Availability during a partition).

The “PACELC” Extension

Engineers now often use the PACELC framework[4] to fill the gaps in CAP. While CAP only describes what happens during a failure (Partition), PACELC describes what happens during normal operation.

Partition: if there is a Partition, choose between Availability and Consistency.
- Else: when things are running normally, choose between Latency and Consistency.

History

It started as a "Conjecture". Eric Brewer (year 2000) presented the idea as Brewer's Conjecture. He didn't have a formal proof, he had observations from past experience of dealing with distributed systems that you simply couldn't have all three properties at once.

Formal Proof

CAP conjecture became a "Theorem" in 2002 when two researchers from MIT, Seth Gilbert and Nancy Lynch, published a formal mathematical proof [1]. They took Brewer's informal idea, gave strict definitions to "Consistency," "Availability," and "Partition Tolerance," and proved that it is impossible to achieve all three in an asynchronous network model.

The Proof (Gilbert & Lynch)

The proof is simple and follows proof by contradiction also known as indirect proof or the method of “reductio ad absurdum”, it is a common proof technique that is based on a very simple principle: something that leads to a contradiction can not be true, and if so, the opposite must be true. An example is that we have all studied in 10th standard in school while proving that square root of 2 is an irrational number (the proof assumes that it is a rational number and then reaches a situation which is absurd so by contradiction square root of 2 must be irrational) [2].

The CAP theorem proof follows the same pattern, it assumes that all 3 properties are possible and then reach a contradiction as follows:

Imagine a simple distributed system with two servers, Server A and Server B, both storing a value “v” that is currently 0.

The Partition (P): A network failure occurs between Server A and Server B. They can no longer talk to each other.
The Write: A client connects to Server A and writes a new value: “v” = 1.

If the system is Available (A), Server A must respond to the client with "Success."
Because of the Partition (P), Server A cannot tell Server B about this update.

The Read: A second client connects to Server B and asks for the value of “v”.

If the system is Available (A), Server B must respond to the client.
However, Server B still thinks “v”= 0 because it never received the update from Server A.

At this point, the system is forced into a contradiction:

If Server B returns 0, it is Available but not consistent(it gave the wrong, stale data).

If Server B waits for the network to fix itself so it can get the update, it is Consistent but Unavailable (it failed to give a prompt response).

Because this scenario can happen in any distributed system, Gilbert and Lynch proved that you cannot have all three (C, A, and P) simultaneously.

CAP Limitations or Narrow Definitions

Martin Kleppmann (author of Designing Data-Intensive Applications) argues that the CAP theorem has become a "misleading oversimplification"[3]. Here are the key points of his critique:

The Definitions are Too Narrow

He points out that for CAP to be a "theorem," you must use the precise mathematical definitions used in the 2002 proof, which don't align with how engineers use the words today:

Consistency (C) = Linearizability: This is a very specific, strict type of consistency. It is not the "C" in ACID (which refers to data integrity/rules). Most "consistent" databases (like Oracle or Postgres) actually use Snapshot Isolation, which is not linearizable, meaning they technically aren't "CP."

Availability (A) = Every node must respond: In CAP, if even one non-failing node cannot handle a request, the system is "unavailable." By this definition, most "highly available" systems (like a standard leader-follower setup) are technically "unavailable" during a partition. Basically CAP theorem and its proof treats availability as binary but in reality availability in modern times is a percentage on a scale between 0 and 100 for example 99.99% or 99.999%.

Partition Tolerance (P) = Not a choice: You cannot "choose" P. Networks fail. The only choice is what happens during a partition.

CAP Ignores Latency

A system can be "CAP-available" but take 30 seconds to respond. In the real world, a 30-second delay for a request is an outage. If your SLO (Service Level Objective) for P90 latency is say 1 second (it will be actually 100 ms for any respectable service dealing with millions of requests) but requests are taking 20 seconds then CAP theorem considers service to be available but in reality it will eventually converge to cascading failure in your cluster even in the face of auto scaling (there is a limit of auto scaling and hardware, it is not a panacea for bad system design). Because CAP doesn't account for performance or response times during "normal" (non-partitioned) operation, it misses the trade-off engineers care about most: Consistency vs. Latency.

Most Databases are "Neither" CP not AP

Under the strict definitions of the proof:

Single-leader databases (Postgres, MySQL, MongoDB) are often not A (because if the leader is partitioned, you can't write) and not C (because followers might have stale data).

They end up being "just P," which makes the "pick 2 out of 3" framework useless for describing them.

It Overlooks Other Faults

CAP focuses exclusively on network partitions. It says nothing about:

Node crashes

Disk corruption

Software bugs

Human error By focusing only on CAP, designers might ignore more common ways that systems fail.

Instead of a binary "pick two," Kleppmann suggests we should use more descriptive terms that actually explain a system's behavior, such as: Linearizability vs. Eventual Consistency Synchronous vs. Asynchronous replication PACELC (which adds the Latency trade-off during normal operation)

Summary

While the CAP theorem is over 20 years old, it remains a vital starting point for understanding how to build robust distributed systems, provided one understands its limitations as was proposed in original form. It can be used as a guide to weigh consequences of decisions.

References

Proof of CAP theorem by Seth Gilbert and Nancy Lynch https://users.ece.cmu.edu/~adrian/731-sp04/readings/GL-cap.pdf
Proof by contradiction (reductio ad absurdum) for square root 2 as irrational https://en.wikipedia.org/wiki/Square_root_of_2#Proofs_of_irrationality
Martin Kleppmann “Please stop calling databases CP or AP https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html
Daniel Abadi’s PACELC framework https://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html

Search This Blog

Gaurav Blog