Distributed Transactions: Commit Protocols
How distributed systems get every node to agree on whether a transaction committed. From two-phase commit through Paxos Commit, and the safety/liveness trade-offs each one makes.
Introduction
A transaction is one or more operations treated as a single logical change. The reason transactions exist is that the guarantees they provide (ACID) are what make concurrency and failure recovery tractable to reason about.
The four properties:
- Atomicity. All operations take effect, or none do.
- Consistency. Valid state to valid state, no half-states.
- Isolation. Concurrent transactions don’t see each other’s incomplete work.
- Durability. Committed work survives crashes.
On a single node these are well-understood. Lock, journal, fsync, done. The interesting question is what changes when the transaction spans multiple nodes.
What counts as a distributed transaction
Anything that has to commit atomically across two or more nodes. In practice that splits into two cases:
- Replicated updates. The same data lives on multiple replicas, and all of them have to apply the write.
- Cross-partition updates. Different pieces of data live on different nodes, and we want them to commit together. This is the common case; pretty much any sharded system has it.
What’s actually hard
Of the four ACID properties, two of them are mostly the same problem they were on one node.
Consistency is still a function of the data model: constraints, cascades, triggers, application-level invariants. Durability is still fsync; you just do it on every node that holds a piece of the transaction.
Atomicity and isolation are where distribution actually bites.
Atomicity because partial failure is now the default. A participant can crash after committing locally but before the coordinator hears about it. Another can ack the prepare and then fall over before commit. The protocol has to keep the cluster from disagreeing about whether the transaction happened.
Isolation because the participants don’t share a lock manager. Network asynchrony means one node’s “after T1” can overlap with another node’s “before T1”. The system has to do real work to make concurrent transactions look as if they ran one at a time.
Almost everything in this post is about those two.
Isolation
The gold standard is strict serializability: transactions look as if they ran one at a time, in some serial order, even though they really ran concurrently across many nodes. Everything weaker is a compromise that buys you performance at the cost of some specific anomaly creeping in.
Two families of techniques get used.
Pessimistic
Assume conflicts will happen and stop them before they do. Locking is the canonical version: a transaction grabs a lock on the rows it touches and holds it for the duration. Anyone else who wants the same row waits.
It pays for itself when contention is high. Aborts and restarts are expensive in any workload that’s already heavy, so blocking the conflicting transaction up front beats letting both run and tearing one down at commit. The cost is lock management overhead and the deadlock risk it brings.
Optimistic
Assume conflicts are rare. Let everything run, then check at commit time whether anyone stepped on anyone else’s writes. If they did, abort and restart.
This wins on read-heavy workloads and workloads where transactions touch mostly disjoint data, because the common case is a free pass through commit. It’s the wrong call when conflicts are common, since every restart is wasted work, and you can starve a transaction that keeps losing the race.
Two-phase locking (2PL)
The classic pessimistic protocol. Two kinds of locks:
- Write lock (exclusive). Taken when a transaction wants to modify a row. Nobody else, reader or writer, gets through.
- Read lock (shared). Taken on every read. Multiple readers can hold the shared lock simultaneously; writers wait.
What makes 2PL “two-phase” isn’t the lock types, it’s the shape of a transaction’s lock-set over time:
- Expanding phase. The transaction can acquire locks. It can’t release any yet.
- Shrinking phase. Once it releases its first lock, it can’t acquire any new ones.
The “no new locks after the first release” rule is what gives 2PL its serializability guarantee. The price is deadlock risk: two transactions can each hold a lock the other needs, and neither can let go. Detecting and breaking deadlocks is its own subsystem.
Snapshot isolation (MVCC)
MVCC takes the opposite tack: instead of locking, keep every version. Each transaction reads from a snapshot taken when it started. Writers never block readers because the reader is looking at an older version. Building Axis taught me how much MVCC pulls its weight here: readers never have to coordinate with writers at all, which is exactly what makes the SWMR design work.
The store tracks two timestamps per transaction: T_start (when it began) and T_end (when it committed). Each row has a chain of versions tagged with the commit timestamp of the transaction that wrote them.
- On read, the transaction sees the latest version whose commit timestamp is
< T_start. If it has already written that row itself in this transaction, it sees its own write. - On write, check whether anyone else committed an update to this row after
T_start. If yes, abort (this is the optimistic check that prevents lost updates). If no, install the new version.
The trade-off is well known: MVCC is fast for reads and rarely conflicts, but it isn’t quite serializable. The classic anomaly it leaves on the table is write skew: two transactions read overlapping data, each updates a disjoint piece of it, and the resulting state is one that no serial order could have produced. If you need to rule that out, you reach for serializable snapshot isolation (SSI) or fall back to locking.
Atomicity
Isolation gives us a way to reason about concurrent transactions on one node. Atomicity is the hard one: getting all the nodes to agree on whether the transaction happened.
Two-phase commit (2PC)
The classic answer. Two roles:
- Coordinator (transaction manager). Runs the protocol. Often one of the participants doubles as the coordinator.
- Participants (resource managers). The nodes that hold the data.
The protocol
Phase 1: prepare. Coordinator sends the transaction to every participant. Each participant runs it to the point just before commit, makes the changes durable (WAL), and votes yes or no.
Phase 2: commit/abort. Coordinator collects the votes. All yes → tell everyone to commit. Any no, or a timeout → tell everyone to abort. Participants ack, protocol ends.
Both sides log everything to a WAL so they can recover after a crash. The coordinator uses a timeout in phase 1 and treats silence as a no.
What goes wrong
A participant dies in phase 1. Coordinator times out, treats it as a no, the transaction aborts. Annoying, not unsafe.
A participant dies in phase 2. The remaining participants still get the decision. When the failed node comes back, it asks the coordinator what happened and applies it.
The coordinator dies between phases. This is the one that kills 2PC. It crashed after collecting “yes” votes but before sending the decision. The participants are stuck. They can’t commit unilaterally (maybe someone else voted no), and they can’t abort (maybe everyone voted yes and the coordinator’s intent was commit). They have to wait for the coordinator to come back. Until then, their locks are held, and the rows they prepared are blocked for any other transaction.
What it gives you
2PC is safe: every participant arrives at the same decision, so atomicity holds. It is not live: a coordinator crash can stall the protocol indefinitely. In production this shows up as: things mostly work, but every so often the system gets wedged for as long as it takes to manually recover a coordinator. That’s the price.
Three-phase commit (3PC)
3PC tries to fix the blocking problem by adding an intermediate step. Three phases:
- Prepare. Same as 2PC.
- Pre-commit. Coordinator tells everyone “I have all the yes votes, get ready to commit.” Participants ack.
- Commit. Coordinator sends the final commit/abort once it has all the pre-commit acks.
The reason this helps: if the coordinator dies between phase 2 and phase 3, every participant has the pre-commit message in hand, and they can elect a new coordinator who concludes the transaction.
The catch is the partial pre-commit case. If the coordinator dies in the middle of sending out pre-commit, some participants got the message and some didn’t. The ones that did, after a timeout, commit. The ones that didn’t abort. Now you have an inconsistent cluster, which is the failure mode 2PC was designed to prevent in the first place.
3PC trades safety for liveness. Real-world systems almost never deploy it as-is, because the asynchronous network assumption it depends on doesn’t hold in practice, so you take the correctness hit without getting the availability you wanted. It’s mostly a textbook stepping stone to the protocols below.
Quorum-based commit
The right way to fix 3PC’s flaw is to require quorums. Define:
- Commit quorum
Vc: how many nodes have to agree to commit. - Abort quorum
Va: how many to abort.
Pick them so Vc + Va > V, where V is the total number of participants. That inequality is what makes the protocol safe under partitions: it’s impossible for one side of a partition to have a commit quorum while the other has an abort quorum, because the two quorums would have to overlap.
In normal operation it looks like 3PC, except the coordinator waits for Vc pre-commit acks before issuing commit.
When the network partitions, each side runs a termination protocol. They elect a surrogate coordinator locally and ask “what does our side know?” If anyone in this partition already committed or aborted, that’s the answer. If Vc participants are in pre-commit, commit. If enough are still waiting on votes, abort. The quorum overlap rule guarantees the two sides can’t disagree.
When the partition heals, a merge protocol runs the same termination logic across the now-rejoined nodes.
This is the first protocol on the list that’s both safe under partitions and makes progress most of the time. It can still stall under pathological churn (a sequence of small partitions that never lets either quorum form), but you have to engineer a bad day to reproduce it.
Paxos commit
People sometimes conflate atomic commit with consensus, but they’re different problems. Atomic commit needs unanimous agreement (one no vote and the transaction aborts). Consensus needs only a majority to decide. So you can’t just point Paxos at the commit decision and call it done; the unanimity requirement isn’t in Paxos’s job description.
What you can do is use Paxos as the durable, fault-tolerant store for each participant’s vote. That’s Paxos Commit.
The blocking problem in 2PC is that the decision lives in one place: the coordinator’s memory. Lose the coordinator and the decision is gone. Paxos Commit fixes this by giving each Resource Manager its own Paxos instance for its vote. The vote is stored durably across acceptors, so any healthy node can read it. The transaction commits if and only if every Paxos instance decided YES.
Three roles:
- Resource Managers (RMs). Own the data, do the transactional work, and decide whether they can commit.
- Acceptors. Run the Paxos protocol. They don’t know anything about transactions; they just store proposed values durably and reach agreement on them.
- Proposers/Learners. Usually the RMs themselves, wearing different hats.
The flow:
- Each RM proposes its vote (yes/no) to its own Paxos instance.
- Acceptors run Paxos and decide on a value for each instance.
- RMs learn each other’s decided values.
- If an RM never proposes (it crashed before voting), any other RM can propose NO on its behalf, which breaks the wait.
- All instances decided YES → everyone commits.
- Otherwise → everyone aborts.
The clever bit, due to Lamport and Gray, is that you keep using Paxos for what it’s good at (durable agreement on one value) and just run N copies of it, one per participant. Plain 2PC is the degenerate version of this with zero replicated state. Once you’ve replicated the votes, the coordinator can die and any of the participants can finish the protocol on their own.
Summary
| Protocol | Safety | Liveness | Trade-off |
|---|---|---|---|
| 2PC | Yes | No | Blocks on coordinator failure |
| 3PC | No | Yes | Can violate atomicity during partitions |
| Quorum-based | Yes | Mostly | Complex, handles most failures |
| Paxos Commit | Yes | Yes | Higher latency, most robust |
Reach for these when transactions are short, you need strong consistency across nodes, and a brief stall during a coordinator failure is acceptable. When the work spans hours or days (and you can’t hold locks across that time) the saga pattern is the answer. That’s the next post.
References