Distributed Transactions: A Deep Dive

Tags: Distributed Systems, Transactions, Consensus

Introduction

A transaction is a unit of work performed in a database system, representing a change potentially composed of multiple operations. Transactions provide guarantees that make reasoning about concurrent operations and failures significantly easier.

These guarantees are captured by the ACID properties:

  • Atomicity - Either all operations complete successfully or none take effect
  • Consistency - The database transitions from one valid state to another valid state
  • Isolation - Concurrent transactions don’t interfere with each other
  • Durability - Committed transactions persist even through failures

In a single-node database, implementing these guarantees is well-understood. But what happens when a transaction spans multiple nodes?


What is a Distributed Transaction?

A distributed transaction is a transaction that takes place across two or more different nodes. There are two main variants:

Replicated data updates - The same piece of data is updated across multiple replicas, and all replicas need to be updated atomically.

Cross-partition operations - Different pieces of data residing on different nodes need to be updated atomically. This is the more common use case for distributed transactions.


The Challenge

Implementing ACID guarantees becomes significantly harder in a distributed environment.

Consistency and Durability are relatively straightforward:

  • Consistency is achieved through database mechanisms like constraints, cascades, and triggers
  • Durability is achieved by persisting transactions to non-volatile storage at commit time

Atomicity and Isolation are the hard problems:

  • Atomicity - Partial failures make it difficult to guarantee all-or-nothing semantics. A node might crash after some participants have committed but before others have.
  • Isolation - Network asynchrony and concurrency make it harder to preserve isolation between transactions running on different nodes.

The rest of this post focuses on how distributed systems tackle these two challenges.


Isolation in Distributed Transactions

To be completely protected against anomalies, a system should be strictly serializable: transactions appear to execute one at a time in some serial order, and that order is consistent with the real-time order in which the transactions actually ran.

There are two main approaches to concurrency control:

Pessimistic Concurrency Control

Blocks a transaction if it’s expected to cause a violation, resuming it when safe. This is typically achieved through locking, where transactions acquire locks on data they process to prevent concurrent access.

Characteristics:

  • Incurs overhead from lock management
  • Performs well with many conflicting transactions
  • Reduces aborts and restarts, minimizing wasted work

Optimistic Concurrency Control

Delays checking until the end of the transaction. If a commit would violate any rules, the transaction is aborted and can be restarted.

Characteristics:

  • Lower overhead when conflicts are rare
  • Performs well with read-heavy workloads
  • Better when transactions touch different data

Two-Phase Locking (2PL)

Two-Phase Locking is a pessimistic concurrency control protocol that uses locks to prevent interference between concurrent transactions.

Lock Types

Write Locks (Exclusive)

  • Acquired when a record will be written (insert, update, delete)
  • Blocks both reads and writes from other transactions
  • Other transactions must wait for the lock to be released

Read Locks (Shared)

  • Acquired when a record is read
  • Allows multiple concurrent read locks
  • Blocks write operations from other transactions

How It Works

Transactions acquire and release locks in two distinct phases:

Expanding Phase - The transaction can acquire locks but cannot release any.

Shrinking Phase - The transaction can release locks but cannot acquire any new ones.

This two-phase structure guarantees serializable executions. However, it introduces the risk of deadlocks, where two transactions each hold a lock the other needs.
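The two phases can be illustrated with a toy lock table. This is a minimal single-threaded sketch, not any real database's lock manager: `LockTable`, `Transaction`, and the convention of returning `False` instead of blocking are all illustrative (real systems queue waiters and run deadlock detection).

```python
from enum import Enum

class LockMode(Enum):
    SHARED = "shared"        # read lock: compatible with other shared locks
    EXCLUSIVE = "exclusive"  # write lock: compatible with nothing

class Transaction:
    def __init__(self, txn_id):
        self.txn_id = txn_id
        self.shrinking = False  # once True, no new locks may be acquired
        self.held = set()

class LockTable:
    def __init__(self):
        self.locks = {}  # record -> (mode, set of holder txn_ids)

    def acquire(self, txn, record, mode):
        # Expanding phase only: acquiring after any release violates 2PL
        if txn.shrinking:
            raise RuntimeError("2PL violation: acquire during shrinking phase")
        entry = self.locks.get(record)
        if entry is None:
            self.locks[record] = (mode, {txn.txn_id})
        else:
            held_mode, owners = entry
            if held_mode == LockMode.SHARED and mode == LockMode.SHARED:
                owners.add(txn.txn_id)          # shared locks coexist
            elif owners == {txn.txn_id}:
                self.locks[record] = (mode, owners)  # upgrade our own lock
            else:
                return False                    # conflict: caller would block
        txn.held.add(record)
        return True

    def release(self, txn, record):
        txn.shrinking = True  # first release starts the shrinking phase
        mode, owners = self.locks[record]
        owners.discard(txn.txn_id)
        if not owners:
            del self.locks[record]
        txn.held.discard(record)
```

Two readers can share a lock, a writer is rejected while readers hold it, and once a transaction releases anything it can no longer acquire.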


Snapshot Isolation (MVCC)

Multi-Version Concurrency Control (MVCC) is an optimistic technique that stores multiple versions of each record. Transactions see a consistent snapshot from when they started.

Key guarantee: All reads in a transaction see a consistent snapshot of the database, and the transaction commits only if no other transaction has updated the same data since the snapshot.

How It Works

The database tracks two timestamps for each transaction:

  • T_start - When the transaction started
  • T_end - When the transaction completed

For each data item, the database maintains versions with their commit timestamps.

On read: The transaction receives the last committed value from before T_start. If the transaction itself has already updated the value, it sees its own update.

On write: The system checks whether any other transaction has updated the same item after T_start. If so, the transaction aborts (to prevent lost updates). Otherwise, the write proceeds.
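These read and write rules can be sketched with a toy in-memory version store. `VersionStore` and its API are illustrative; real MVCC engines track this per row with garbage collection, and the conflict check here is the simple first-committer-wins rule.

```python
class VersionStore:
    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), oldest first
        self.clock = 0      # logical timestamp source

    def begin(self):
        self.clock += 1
        return {"t_start": self.clock, "writes": {}}

    def read(self, txn, key):
        if key in txn["writes"]:
            return txn["writes"][key]  # a transaction sees its own updates
        committed = self.versions.get(key, [])
        # Last version committed before the transaction's snapshot
        visible = [v for ts, v in committed if ts <= txn["t_start"]]
        return visible[-1] if visible else None

    def write(self, txn, key, value):
        txn["writes"][key] = value  # buffered until commit

    def commit(self, txn):
        # Abort if anyone committed to a written key after our snapshot
        for key in txn["writes"]:
            for ts, _ in self.versions.get(key, []):
                if ts > txn["t_start"]:
                    return False  # would cause a lost update
        self.clock += 1
        for key, value in txn["writes"].items():
            self.versions.setdefault(key, []).append((self.clock, value))
        return True
```

A concurrent transaction keeps seeing its snapshot, and the second writer to the same key aborts.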

Trade-off: Snapshot isolation provides excellent read performance since readers never block writers. However, it’s not fully serializable and can allow certain anomalies like write skew.


Atomicity in Distributed Transactions

Now we turn to the harder problem: ensuring all-or-nothing semantics when a transaction spans multiple nodes.

Two-Phase Commit (2PC)

Two-Phase Commit is the classic protocol for distributed atomic commits. It involves two roles:

Coordinator (Transaction Manager) - Responsible for coordinating the protocol phases. May be one of the participants.

Participants (Resource Managers) - Nodes participating in the transaction.

The Protocol

Phase 1: Voting/Prepare

  1. Coordinator sends the transaction to all participants
  2. Participants execute the transaction up to the commit point
  3. Participants respond with a vote (Yes or No)

Phase 2: Commit/Abort

  1. Coordinator gathers all votes
  2. If all votes are Yes → instruct participants to commit
  3. If any vote is No → instruct all participants to abort
  4. Participants acknowledge, closing the protocol

Both coordinator and participants use write-ahead logging to recover from failures. The coordinator uses timeouts in the first phase, treating no response as a No vote.
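The coordinator's decision logic can be sketched in a few lines. This is a minimal sketch of the voting rule only, assuming participants are modeled as callables that return a vote; messaging, logging, and the second-phase delivery are omitted.

```python
def two_phase_commit(participants):
    """participants: callables returning 'yes' or 'no' when asked to prepare."""
    # Phase 1: voting. A timeout or crash is treated the same as a No vote.
    votes = []
    for prepare in participants:
        try:
            votes.append(prepare())
        except Exception:
            votes.append(None)  # unreachable participant counts as No
    # Phase 2: commit only on unanimous Yes; otherwise abort everywhere.
    return "commit" if all(v == "yes" for v in votes) else "abort"
```

A single No vote, or a single unreachable participant, is enough to abort.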

Failure Scenarios

Participant failure in voting phase: The coordinator times out and treats it as a No vote. Transaction aborts.

Participant failure in commit phase: The protocol concludes without the failed participant. Upon recovery, the participant contacts the coordinator to learn the outcome.

Coordinator failure: This is the critical weakness. If the coordinator fails after collecting votes but before sending the decision, participants are blocked. They cannot safely commit (another participant might have voted No) or abort (all might have voted Yes).

Properties

2PC is a blocking protocol. It satisfies the safety property: all participants arrive at the same decision (atomicity). But it does not satisfy the liveness property: it may not always make progress.


Three-Phase Commit (3PC)

3PC attempts to address 2PC’s blocking problem by splitting the first round into two sub-rounds.

The Three Phases

Phase 1: Prepare - Same as 2PC voting phase.

Phase 2: Pre-commit - Coordinator communicates the vote results to participants. Participants acknowledge.

Phase 3: Commit - Coordinator sends final commit/abort after receiving acknowledgments.

Handling Coordinator Failure

If the coordinator fails during pre-commit, participants who received the pre-commit message can take over after a timeout, assuming all participants received the message.

The Problem

If the coordinator fails during pre-commit where only some participants received the message:

  • Participants who received pre-commit may unilaterally commit after timeout
  • Participants who didn’t receive it may abort
  • This violates atomicity, resulting in an inconsistent state

Properties

3PC increases availability by letting participants make progress after a coordinator failure, but at the cost of correctness. It satisfies liveness (it always makes progress) but can violate safety (atomicity).


Quorum-Based Commit Protocol

This protocol uses quorums to make progress even during network partitions while maintaining safety.

Core Concept

The protocol defines:

  • Commit quorum (Vc) - Minimum nodes needed to commit
  • Abort quorum (Va) - Minimum nodes needed to abort

The constraint: Vc + Va > V, where V is the total number of participants.

This ensures both quorums cannot form simultaneously on different sides of a partition.
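A quick check of why the constraint works: a commit quorum on one side of a partition and an abort quorum on the other would together need at least Vc + Va nodes, which is impossible when Vc + Va > V. The helper below is purely illustrative.

```python
def quorums_can_conflict(v, vc, va):
    """Can a commit quorum and an abort quorum form on disjoint node sets?

    That requires vc + va nodes in total, so it is only possible
    when vc + va <= v. Choosing vc + va > v rules it out.
    """
    return vc + va <= v
```

For example, with V = 5, Vc = 3, Va = 3 the quorums cannot conflict; with Vc = 2, Va = 3 they could.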

Sub-Protocols

Commit Protocol - Used for normal transaction execution. Similar to 3PC, but the coordinator waits for Vc acknowledgments before committing.

Termination Protocol - Used during network partitions. Each partition elects a surrogate coordinator via leader election. The surrogate queries partition members and determines if they can complete the transaction:

  • If any participant has already committed/aborted → follow that decision
  • If enough participants are in pre-commit state → proceed with commit
  • If enough participants are waiting for votes → proceed with abort

Merge Protocol - Used when partitions heal. A leader is selected among partition leaders, then the termination protocol runs to reconcile state.

Properties

This protocol satisfies safety: all participants arrive at the same decision. It doesn’t guarantee liveness in extreme failure scenarios (continuous small partitions), but handles most common failures much better than 2PC or 3PC.


Paxos Commit

Distributed transactions and consensus are related but distinct problems:

  • Atomic commit requires unanimous agreement
  • Consensus requires only majority agreement

Paxos cannot replace atomic commit logic, but it can make atomic commit fault-tolerant.

The 2PC Blocking Problem

If the Transaction Manager crashes after collecting votes but before sending the decision:

  • Resource Managers are blocked
  • No one knows whether to commit or abort
  • The system can hang indefinitely

The Solution

Instead of the Transaction Manager storing the decision locally, we use Paxos to store it durably across multiple nodes.

Each Resource Manager runs its own Paxos instance to record its vote. The transaction commits if all Paxos instances decide YES.

Roles in Paxos Commit

Resource Managers (RMs)

  • Own the actual data (locks, rows, files)
  • Execute transactional work
  • Decide whether they can commit (YES/NO)

Paxos Acceptors

  • Store Paxos decisions durably
  • Ensure agreement on values
  • Know nothing about transactions, only answer Paxos messages

Proposers/Learners

  • Often the Resource Managers themselves
  • Propose values and learn decisions

The Flow

  1. Each RM proposes its vote to its Paxos instance
  2. Paxos acceptors choose YES or NO
  3. RMs learn decisions from other instances
  4. If an RM is unresponsive, another RM proposes NO for that instance
  5. All RMs commit only if all instances decided YES
  6. Otherwise, all RMs abort
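The outcome rule of this flow is simple to state in code. This sketch assumes each RM's Paxos instance has (or has not) reached a decision; the function name and the dict shape are illustrative, and the Paxos machinery itself is elided.

```python
def paxos_commit_outcome(rms, decided):
    """rms: list of resource manager ids.
    decided: dict mapping RM id -> 'yes' or 'no' for instances that decided.
    """
    # An instance with no decision gets NO proposed on its behalf (step 4),
    # so a missing entry defaults to "no".
    decisions = {rm: decided.get(rm, "no") for rm in rms}
    # Commit requires every instance to have decided YES (step 5).
    return "commit" if all(v == "yes" for v in decisions.values()) else "abort"
```

A single timed-out RM therefore drives the transaction to abort rather than blocking it, which is exactly the property 2PC lacks.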

Properties

Paxos Commit provides fault-tolerant atomic commit. Standard 2PC is essentially Paxos Commit with zero fault tolerance (f = 0).

The key insight: we use multiple Paxos instances (one per participant) rather than trying to use Paxos directly for the commit decision.


Long-Lived Transactions

Traditional transaction protocols assume short-lived transactions (milliseconds). But some operations span hours or days:

  • Batch jobs calculating reports over large datasets
  • Insurance claims requiring human input at various stages
  • Online orders from purchase to delivery

Holding locks for such durations would severely degrade system performance. This is where the Saga pattern comes in.


The Saga Pattern

A Saga is a sequence of local transactions T1, T2, …, Tn where each service performs its operation and initiates the next step. Unlike traditional distributed transactions, sagas don’t hold locks across services. Each local transaction commits immediately.

The key guarantee: either all transactions succeed, or compensating transactions undo any partial progress.

Key Concepts

Compensable Transactions - Transactions that can be undone by executing a compensating transaction with the opposite effect. If step 5 fails, compensating transactions for steps 4, 3, 2, and 1 execute in reverse order.

Pivot Transaction - The point of no return in a saga. Before the pivot, all transactions are compensable. After the pivot succeeds, the saga must complete. There's no going back. The pivot can be:

  • The last compensable transaction
  • The first transaction that cannot be undone
  • An irreversible action (like sending an email or charging a card)

Retryable Transactions - Transactions that follow the pivot. These are idempotent operations that will eventually succeed, even if they need multiple attempts. They ensure the saga reaches its final state.

Example: Order Processing Saga

Consider an e-commerce order:

  1. Create Order (compensable) → Compensation: Cancel Order
  2. Reserve Inventory (compensable) → Compensation: Release Inventory
  3. Process Payment (pivot) → Point of no return
  4. Ship Order (retryable) → Must eventually succeed
  5. Send Confirmation (retryable) → Must eventually succeed

If payment fails, we compensate by releasing inventory and canceling the order. If shipping fails after payment, we retry until it succeeds. We don’t reverse the payment.
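The three step kinds can be captured in a small executor. This is a minimal sketch, assuming steps are (action, compensation, kind) tuples; the shape of the steps and the retry-forever loop are illustrative (a real implementation would persist progress and back off between retries).

```python
def run_saga(steps):
    """steps: list of (action, compensation, kind) tuples, where kind is
    'compensable', 'pivot', or 'retryable'. Actions signal failure by raising.
    """
    done = []
    past_pivot = False
    for action, compensation, kind in steps:
        if not past_pivot:
            try:
                action()
            except Exception:
                # Failure before/at the pivot: undo completed steps in
                # reverse order, then give up.
                for _, comp, _ in reversed(done):
                    if comp:
                        comp()
                return "aborted"
            done.append((action, compensation, kind))
            if kind == "pivot":
                past_pivot = True  # point of no return
        else:
            # Retryable steps after the pivot must eventually succeed.
            while True:
                try:
                    action()
                    break
                except Exception:
                    continue  # in practice: back off, then retry
            done.append((action, compensation, kind))
    return "completed"
```

If the pivot (payment) fails, inventory is released and the order cancelled, in reverse order; once the pivot succeeds, shipping is retried rather than rolled back.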


Saga Implementation Approaches

There are two main ways to coordinate a saga: Choreography and Orchestration.

Choreography

In choreography, there’s no central controller. Each service knows what to do next and communicates through events. Services are loosely coupled and react to events from other services.

How it works:

  1. Service A completes its local transaction
  2. Service A publishes an event (e.g., “OrderCreated”)
  3. Service B listens for this event and triggers its local transaction
  4. Service B publishes its own event (e.g., “InventoryReserved”)
  5. The chain continues until completion or failure

On failure: When a service fails, it publishes a failure event. Other services listen for failure events and execute their compensating transactions.
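The event chain can be sketched with a toy in-memory bus. `EventBus` and the event names are illustrative; in production this role is played by a message broker, and handlers live in separate services.

```python
class EventBus:
    def __init__(self):
        self.handlers = {}  # event name -> list of subscriber callbacks

    def subscribe(self, event, handler):
        self.handlers.setdefault(event, []).append(handler)

    def publish(self, event, payload=None):
        for handler in self.handlers.get(event, []):
            handler(payload)

bus = EventBus()
trace = []

# Inventory service: reacts to OrderCreated, then emits its own event.
def on_order_created(payload):
    trace.append("reserve-inventory")
    bus.publish("InventoryReserved", payload)

# Payment service: reacts to InventoryReserved.
def on_inventory_reserved(payload):
    trace.append("charge-payment")

bus.subscribe("OrderCreated", on_order_created)
bus.subscribe("InventoryReserved", on_inventory_reserved)

# The order service kicks off the chain; no coordinator is involved.
bus.publish("OrderCreated", {"order_id": 1})
```

Note that no component sees the whole workflow: each service only knows which events it consumes and emits, which is exactly what makes the overall state hard to track.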

Advantages:

  • Simple for workflows with few services
  • No single point of failure, responsibilities are distributed
  • Services remain loosely coupled
  • No additional coordination service to build and maintain

Disadvantages:

  • Difficult to track the overall workflow state
  • Adding new steps can be confusing (which service listens to what?)
  • Risk of cyclic dependencies between services
  • Hard to test since all services must run to simulate a transaction

Best for: Simple workflows with 3-5 services that don’t change frequently.


Orchestration

In orchestration, a central orchestrator (or saga coordinator) controls the entire workflow. It tells each service what to do and handles the responses.

How it works:

  1. Orchestrator receives a request to start the saga
  2. Orchestrator sends command to Service A
  3. Service A executes and responds with success/failure
  4. Orchestrator sends command to Service B
  5. The orchestrator continues until completion or failure

On failure: The orchestrator knows exactly which steps have completed. It sends compensating commands to services in reverse order.
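The control flow above can be sketched as follows. Services are modeled as (do, undo) pairs purely for illustration; a real orchestrator would send commands over the network and persist its progress so it can resume after a crash.

```python
def orchestrate(services):
    """services: list of (do, undo) pairs executed in order.
    undo may be None for steps with no compensation."""
    completed = []
    for do, undo in services:
        try:
            do()
            completed.append((do, undo))
        except Exception:
            # The orchestrator knows exactly which steps completed,
            # so it compensates them in reverse order.
            for _, comp in reversed(completed):
                if comp:
                    comp()
            return "compensated"
    return "completed"
```

Compare this with choreography: the workflow order lives in one list the orchestrator owns, rather than being scattered across event subscriptions.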

Advantages:

  • Clear visibility into workflow state
  • Easy to add new steps by updating the orchestrator
  • No cyclic dependencies since the orchestrator manages the flow
  • Easier to test and debug
  • Clear separation of concerns, services don’t need to know about each other

Disadvantages:

  • Orchestrator is a single point of failure
  • Additional complexity in building the orchestrator
  • Orchestrator can become a bottleneck
  • Tighter coupling to the orchestrator service

Best for: Complex workflows, frequently changing processes, or when you need clear visibility into transaction state.


Data Anomalies in Sagas

Since sagas don’t provide isolation between services, several anomalies can occur:

Lost Updates - One saga overwrites changes made by another saga without knowing about them.

Dirty Reads - A saga reads data that another saga has modified but not yet completed. If that saga later compensates, the first saga made decisions based on invalid data.

Fuzzy Reads (Non-repeatable Reads) - A saga reads the same data twice and gets different values because another saga modified it in between.


Countermeasures for Isolation

Since sagas sacrifice isolation for availability, applications use these techniques to mitigate anomalies:

Semantic Locks

Mark data as “in process” using an application-level flag or status field. Other sagas can see the data but know it might change.

Example: An order has status “PENDING” while the saga runs. Other services treat pending orders differently than confirmed ones. The final transaction sets status to “CONFIRMED” or “CANCELLED”.

Commutative Updates

Design updates so the order of execution doesn’t matter. Addition and subtraction are commutative: adding 10 then subtracting 5 gives the same result as subtracting 5 then adding 10.

Example: Instead of “set balance to X”, use “add Y to balance”. Multiple sagas can update the same balance without lost updates.
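A quick demonstration that delta-based updates commute: applying the same set of deltas in every possible order always produces the same balance. The numbers are illustrative.

```python
from itertools import permutations

deltas = [+10, -5, +3]  # updates from three concurrent sagas

# Apply the deltas in every possible interleaving order.
balances = {100 + sum(order) for order in permutations(deltas)}
# Every ordering converges to the same balance, so no update is lost.
```

A "set balance to X" update has no such property: the last writer wins and the other updates are silently lost.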

Pessimistic View

Reorder saga steps so updates happen in retryable transactions (after the pivot). This prevents dirty reads because by the time data is updated, the saga is committed to completing.

Reread Values

Before making critical updates, re-read the data to confirm it hasn’t changed. If it has, abort and restart the saga.

Version Files

Maintain a version number or timestamp on records. Updates only succeed if the version matches what was previously read. This is optimistic locking at the application level.
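A minimal sketch of this check, assuming an in-memory `Record` with a version counter (the type and function names are illustrative; in a database this is typically an `UPDATE ... WHERE version = ?` statement):

```python
class Record:
    def __init__(self, value):
        self.value = value
        self.version = 0  # bumped on every successful update

def update_if_unchanged(record, expected_version, new_value):
    # Succeeds only if no other saga bumped the version since we read it.
    if record.version != expected_version:
        return False  # stale read: caller should re-read and retry, or abort
    record.value = new_value
    record.version += 1
    return True
```

The second writer using a stale version fails and must re-read, which prevents the lost update.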

Pivot Transaction Strategy

Structure the saga so the pivot transaction delineates reversible from irreversible operations:

  • Before pivot - Operations that can fail and be compensated
  • Pivot - The commit point, often an external action like payment
  • After pivot - Operations that must succeed (retryable)

Example: A transaction that increases an account balance could cause problems if another saga reads the increased balance, but then this saga compensates and reverses it. Moving the balance update after the pivot ensures it never gets rolled back. Once we reach that point, we’re committed to completing.


Summary

Atomic Commit Protocols

For transactions requiring immediate consistency and strict atomicity:

Protocol       Safety   Liveness   Trade-off
2PC            Yes      No         Blocks on coordinator failure
3PC            No       Yes        Can violate atomicity during partitions
Quorum-based   Yes      Mostly     Complex, handles most failures
Paxos Commit   Yes      Yes        Higher latency, most robust

The fundamental tension: perfect safety and perfect liveness cannot both be guaranteed in an asynchronous system where failures are possible (FLP impossibility).


Saga Pattern

For long-lived transactions where holding locks is impractical:

Approach        Coordination           Complexity   Visibility             Best For
Choreography    Distributed (events)   Lower        Harder to track        Simple, stable workflows
Orchestration   Centralized            Higher       Clear state tracking   Complex, evolving workflows

Saga provides eventual consistency through compensation rather than strict atomicity. It always makes progress (liveness) but trades immediate consistency for availability.


Traditional vs Saga: When to Use Which

Aspect             Atomic Commit (2PC/Paxos)     Saga Pattern
Consistency        Immediate (strong)            Eventual (compensating)
Duration           Short-lived (ms to seconds)   Long-lived (minutes to days)
Coupling           Tight (locks held)            Loose (no cross-service locks)
Failure handling   Rollback                      Compensation
Isolation          Full (serializable)           Partial (anomalies possible)
Availability       Lower (blocking possible)     Higher (always progresses)
Complexity         Protocol complexity           Business logic complexity
Use atomic commit when transactions are short-lived, strong consistency is required, and you can tolerate potential blocking during failures. This is the right choice when cross-service isolation is critical and you need guarantees that all participants see the same outcome immediately.

Use Saga when operations span longer durations (minutes to days), high availability is prioritized over immediate consistency, and your business logic can handle temporary inconsistency. Saga works well when services need to remain loosely coupled and you can design reasonable compensation logic for each step.