Distributed Transactions: Sagas

The saga pattern for long-lived business transactions. Coordinating multi-step workflows across services without holding locks, and handling failure through compensations.

distributed-systems · sagas · microservices

Apr 30, 2026 · 20 min

Long-lived transactions

Classical transactions are short. Open a connection, do a few reads and writes, commit. Milliseconds, usually. The mechanisms we use to keep concurrent work safe (locks, snapshots, serializability, atomic commit) all assume that shape. Hold a few locks for a few milliseconds, release them, and the next transaction picks them up. Throughput stays high because nobody waits long.

That model also assumes the things a transaction touches are reachable, fast, and inside the same system. Most real business work isn’t.

A few examples:

Booking a vacation: flight, hotel, car, payment, confirmation email. Each step hits a different third party. Airline, hotel chain, rental company, payment processor, email service. Goes through in seconds when everything is up, drags into minutes when any one of them is slow.
An insurance claim: customer files online, an adjuster reviews, maybe more documents are requested, claim approved or denied, payout issued. Days to weeks.
An e-commerce order: payment, inventory, fulfillment, courier dispatch, delivery, return window. Days.
A batch report scanning terabytes and joining across systems. An hour or more.

These are long-lived transactions (LLTs). They still want the same atomic-ish behavior a short transaction gives. Either the whole thing goes through, or the system ends up back where it started. They can’t get there the classical way, for three reasons.

Holding locks across hours kills throughput. A row lock held across a vacation booking blocks every other booking touching the same hotel for the duration. One slow transaction is annoying. Thousands at once and the lock manager is the bottleneck. The whole system grinds.

Most of the components aren’t your database. The airline’s reservation system, the payment processor, the courier. None of them share your transaction manager. There’s no 2PC across them, no shared lock manager, no rollback you can call. You’re hitting REST APIs, and each one commits when it commits.

Some steps wait on humans. A claim that needs an adjuster to review can’t hold any lock. The adjuster might come back tomorrow, or next week.

So we need a pattern that gives “all-or-nothing” without holding locks across the duration, works across systems we don’t control, and survives steps that take days. It’s called a saga.

Sagas

A saga is a sequence of local transactions T1, T2, …, Tn. Each one runs in some service, commits immediately, and moves on. There’s no global transaction holding them together. Every Ti has a compensating transaction Ci, ready to undo what Ti did if anything later fails.

The contract: either all of T1…Tn commit, or some prefix T1…Tk commits and then Ck…C1 run in reverse to undo it. No locks held between steps. The price is the saga doesn’t get isolation from other work touching the same data. We’ll come back to that.

Saga forward path with compensation.
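The contract above can be sketched in a few lines. This is a minimal in-process illustration, not a production executor: the names Step and run_saga are made up for the example, each action is assumed to either commit fully or raise without side effects, and compensations are assumed not to fail.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]       # local transaction Ti; raises on failure
    compensate: Callable[[], None]   # compensating transaction Ci


def run_saga(steps: List[Step]) -> bool:
    """Run T1..Tn in order; if a step fails, run Ck..C1 for the committed prefix."""
    committed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # Undo the committed prefix in reverse order of commitment.
            for done in reversed(committed):
                done.compensate()
            return False
        committed.append(step)
    return True


log: List[str] = []


def declined() -> None:
    raise RuntimeError("payment declined")


steps = [
    Step("flight", lambda: log.append("T1 flight"), lambda: log.append("C1 flight")),
    Step("hotel", lambda: log.append("T2 hotel"), lambda: log.append("C2 hotel")),
    Step("payment", declined, lambda: log.append("C3 payment")),
]
ok = run_saga(steps)
```

The failing step is not compensated, because under the assumption above it never committed. Only the prefix that did commit is unwound.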

Three kinds of step

Every step in a saga is one of three:

Compensable. Has an inverse you can run if a later step fails.
Pivot. The boundary. Before it, the saga can still abort. After it commits, the saga is committed too.
Retryable. Comes after the pivot. Must succeed eventually. Idempotent, retried with backoff until it does.
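One way to make the ordering rule checkable is to tag each step with its kind and validate the sequence. The sketch below assumes a saga has at most one pivot and that retryable steps only make sense once a pivot exists. Kind and is_well_formed are illustrative names, not part of any library.

```python
from enum import Enum
from typing import List


class Kind(Enum):
    COMPENSABLE = "compensable"
    PIVOT = "pivot"
    RETRYABLE = "retryable"


def is_well_formed(kinds: List[Kind]) -> bool:
    """True if the steps read: compensable*, then at most one pivot, then retryable*."""
    order = {Kind.COMPENSABLE: 0, Kind.PIVOT: 1, Kind.RETRYABLE: 2}
    ranks = [order[k] for k in kinds]
    if ranks != sorted(ranks):
        return False                      # e.g. a compensable step after the pivot
    pivots = kinds.count(Kind.PIVOT)
    if pivots > 1:
        return False
    # Retryable steps are only allowed once a pivot has committed the saga.
    return pivots == 1 or Kind.RETRYABLE not in kinds


order_saga = [Kind.COMPENSABLE, Kind.COMPENSABLE, Kind.PIVOT, Kind.RETRYABLE]
misordered = [Kind.COMPENSABLE, Kind.PIVOT, Kind.COMPENSABLE]
```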

What “pivot” actually means

The usual explanation of pivot is “the point of no return: after this, you can’t roll back.” That’s misleading. Most steps in a real saga can be rolled back. A successful charge can be refunded. A shipped package can be recalled. So why have a pivot at all?

Pivot is about isolation between sagas running concurrently, not whether the step physically reverses.

Here’s the case that motivates it. Suppose one step in your order saga credits 2,000 points to the customer’s loyalty balance. That commits in your loyalty service, and any saga reading the balance now sees the new total. While your saga is still running, a second saga starts. The customer redeems some of those points for a discount on another item in their cart. The redemption reads the new balance, takes 1,500 off it, and commits.

Now your order saga fails downstream and runs the compensation, debiting 2,000 to undo the credit. Unless the customer already had at least 1,500 points before the credit, the balance is now negative: starting from zero, it ends at -1,500. They’ve redeemed against a reward that no longer exists, and there’s no clean fix. The system is in a state no serial order of those two sagas could have produced.
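The arithmetic is easy to trace by hand. The text doesn’t give a starting balance, so the sketch below assumes zero. Any starting balance below 1,500 ends up negative the same way.

```python
balance = 0            # assumed starting balance; not given in the text

balance += 2000        # order saga: the credit step commits
balance -= 1500        # redemption saga: reads 2,000, redeems 1,500, commits
after_redemption = balance

balance -= 2000        # order saga fails later: compensation debits the credit
after_compensation = balance
```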

So we put a pivot just before the points credit. After the pivot, the saga is committed and the remaining steps just retry until they succeed. The credit never rolls back, so other sagas can read the new balance without worrying it’ll disappear.

Where you put the pivot is a design call. A step belongs after the pivot if its effects, once committed, are visible to other sagas that might act on them. Roll those effects back and you’ve left those sagas operating on data that’s no longer real. Payment is usually the pivot for exactly this reason. The loyalty, billing, and analytics services all read what it did, and there’s no way to take that back.

In practice a saga runs as one of two patterns: choreography (services react to each other’s events) or orchestration (a coordinator drives every step).

Choreography

In choreography, services don’t talk to each other directly. Each one publishes a domain event when something happens in its local transaction, and subscribes to events from others that matter to it. There’s no coordinator. Each event lives on its own topic, and any service that cares about it can subscribe. The workflow is the union of all those subscriptions.

Take a cab booking. Four services and two external actors run the saga:

ride-svc owns the ride state machine. The booking record, the driver-side state transitions (accept, start, complete), the completion event.
driver-svc matches available drivers to incoming rides and broadcasts pending rides to driver apps.
payment-svc charges the customer’s card after the ride completes.
notification-svc sends status updates to both parties.

The two actors are the rider (who requests the ride) and the driver (who accepts and runs it). Their app actions are the external triggers that drive each service’s local work.

Happy path

The rider taps “request ride.” ride-svc records the booking and publishes RideRequested to its ride events topic. driver-svc is subscribed to ride events. It picks up the request, finds available drivers nearby, and broadcasts the ride to their apps (that’s the “show rides” step).

A driver taps “accept” on their app. The driver-side actions (accept, start, complete) go directly to ride-svc, which moves the ride through its state machine: assigned, in-progress, completed. driver-svc isn’t on the critical path for these state transitions; its job was matching.

When the ride completes, ride-svc publishes RideCompleted to ride events. payment-svc subscribes to ride events and watches for RideCompleted. It charges the customer’s card and publishes PaymentCompleted to its payment events topic. notification-svc subscribes to payment events and sends two notifications in parallel: one to the driver (with their earnings) and one to the rider (with the receipt).

Cab booking choreography. Happy path top to bottom; the no-driver branch splits at step 4.
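The subscription graph for the happy path can be sketched with an in-memory bus. This is an illustration only. Bus is a hypothetical stand-in for a real broker, delivery here is synchronous where a real broker would be asynchronous, and the topic names "ride-events" and "payment-events" are assumed spellings of the topics described above.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Event = Dict[str, str]
Handler = Callable[[Event], None]


class Bus:
    """In-memory stand-in for a message broker with named topics."""

    def __init__(self) -> None:
        self.subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self.subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        for handler in list(self.subscribers[topic]):
            handler(event)


bus = Bus()
trace: List[str] = []


def driver_svc(event: Event) -> None:
    if event["type"] == "RideRequested":
        trace.append("driver-svc: show ride to nearby drivers")


def payment_svc(event: Event) -> None:
    if event["type"] == "RideCompleted":
        trace.append("payment-svc: charge card")
        bus.publish("payment-events", {"type": "PaymentCompleted", "ride": event["ride"]})


def notification_svc(event: Event) -> None:
    if event["type"] == "PaymentCompleted":
        trace.append("notification-svc: notify driver and rider")


bus.subscribe("ride-events", driver_svc)
bus.subscribe("ride-events", payment_svc)
bus.subscribe("payment-events", notification_svc)

# ride-svc publishes these as the ride moves through its state machine.
bus.publish("ride-events", {"type": "RideRequested", "ride": "r1"})
bus.publish("ride-events", {"type": "RideCompleted", "ride": "r1"})
```

No function here calls another service directly. The workflow exists only in the three subscribe calls, which is both the appeal and the cost of the pattern.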

When no driver is available

driver-svc couldn’t match the ride within the matching window, so it publishes DriverNotFound to its driver events topic. ride-svc is also subscribed to driver events, picks up the event, and cancels the booking. notification-svc tells the rider there are no drivers available.

That cancellation is the saga’s compensation for the create-booking step. No card has been touched and no ride has happened, so there’s nothing else to undo. The branch is visible on the figure above as the step-4 split: Show rides on the happy side, DriverNotFound on the failure side.

Payment failure is retry, not compensation

Payment runs after the ride completes. If the card declines at that point, there’s no saga compensation to run, because you can’t undo the ride. The driver drove the customer, and the customer reached their destination. The time and cost are already spent.

What runs instead is a recovery process:

payment-svc retries the charge with backoff. Cards declined for transient reasons (network blips, temporary fraud holds) often clear on a second or third attempt.
After repeated failures, payment-svc falls back to alternate payment methods on file.
notification-svc subscribes to PaymentFailed and asks the rider to update their payment details.
The booking is flagged for manual recovery or collections.
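The first, second, and last items of that recovery process can be sketched as one function. This is a rough illustration under stated assumptions: recover_payment and its parameters are invented for the example, charge is assumed to be idempotent (so a retry can never double-charge), and the rider notification is left out.

```python
import time
from typing import Callable, List, Optional


def recover_payment(
    charge: Callable[[str], bool],
    methods: List[str],
    attempts_per_method: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return the method that was charged, or None to flag for manual recovery."""
    for method in methods:                       # primary method first, then fallbacks
        for attempt in range(attempts_per_method):
            if charge(method):
                return method
            if attempt < attempts_per_method - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff between retries
    return None


calls: List[str] = []
delays: List[float] = []


def fake_charge(method: str) -> bool:
    # Test double: the primary card always declines, the backup clears on its second try.
    calls.append(method)
    return method == "backup-card" and calls.count("backup-card") == 2


charged = recover_payment(fake_charge, ["primary-card", "backup-card"], sleep=delays.append)
unrecoverable = recover_payment(lambda m: False, ["primary-card"], sleep=lambda d: None)
```

A None result is the point where the booking would be handed to manual recovery or collections.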

This is the honest case for post-completion failures in a real saga. Reconciliation and escalation, not rollback. The textbook “run the inverse” model only applies when nothing irreversible has happened yet, like the no-driver branch above.

Strengths

Autonomy. Each service can be owned by a different team, with its own retry logic, and needs no knowledge of the others.
Easy to extend. Adding a new subscriber doesn’t change any existing service. A new consumer just starts listening.
No single point of failure. If notification-svc is down, the rest of the saga keeps going.

Trade-offs

Implicit workflow. It only exists as a subscription graph spread across services and topics. No single place shows what state a saga is in.
Cycles can happen accidentally. A publishes X, B reacts and publishes Y, A reacts to Y and publishes X.
Evolution is multi-place. Adding a new step often touches event contracts in several services.
Debugging across log streams. Tracing a saga means correlating events across multiple topics and services.
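Accidental cycles are one trade-off that can be checked mechanically, provided the subscription graph is written down somewhere. The sketch below assumes a hand-maintained map from each event to the events that subscribers publish in reaction to it. has_cycle and the map format are illustrative, not a feature of any broker.

```python
from typing import Dict, List, Set


def has_cycle(triggers: Dict[str, List[str]]) -> bool:
    """triggers[e] lists the events some subscriber publishes in reaction to e."""
    visiting: Set[str] = set()   # events on the current depth-first path
    done: Set[str] = set()       # events fully explored with no cycle found

    def visit(event: str) -> bool:
        if event in done:
            return False
        if event in visiting:
            return True          # back edge: this event eventually re-triggers itself
        visiting.add(event)
        found = any(visit(nxt) for nxt in triggers.get(event, []))
        visiting.discard(event)
        done.add(event)
        return found

    return any(visit(event) for event in list(triggers))


cab_booking = {
    "RideRequested": ["DriverNotFound"],
    "RideCompleted": ["PaymentCompleted"],
}
accidental = {"X": ["Y"], "Y": ["X"]}   # A publishes X, B reacts with Y, A reacts with X
```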

Choreography fits when the event topology and contracts are stable. The trade-offs above scale directly with the number of topics, the rate at which event contracts change, and the fan-out from each event.

Orchestration

In orchestration, a single coordinator drives the entire workflow. The orchestrator knows the saga’s steps and their order. It issues a command to a service, waits for the reply, records the result, then decides what to do next. Services don’t subscribe to anything. They expose request-response APIs that the orchestrator calls.

Take a travel booking. The customer wants to book a trip: flight, hotel, car rental, payment, and confirmation. The orchestrator drives it:

booking-orchestrator holds the saga’s state and decides each step.
flight-svc reserves seats with the airline.
hotel-svc reserves a room with the chain.
car-svc reserves a rental.
payment-svc charges the customer’s card.
notification-svc sends the confirmation (or, on failure, a failure message).

Every interaction is a command from the orchestrator to a service, followed by the service’s reply. The orchestrator updates its saga state on each reply before issuing the next command.

Happy path

The customer submits the booking. booking-orchestrator creates a saga record (one row in its own database, holding the state machine) and issues command 1: flight-svc.reserve(...). flight-svc contacts the airline API and replies { ok: true, flightRef: "..." }. The orchestrator records the ref, marks step 1 complete, and issues command 2: hotel-svc.reserve(...). Reply, record, advance. Command 3: car-svc.reserve(...). Command 4: payment-svc.charge(...). Command 5: notification-svc.send(...) to confirm the booking.

Each command is a synchronous request-response interaction (function call, RPC, REST request, whatever the transport is). The orchestrator waits for the reply before issuing the next command, and it knows exactly where the saga is at any point because it tracks the state itself.

Orchestration: travel booking happy path.

When a reservation fails

The orchestrator runs commands 1 and 2 successfully (flight and hotel reserved). Command 3 hits car-svc.reserve(...), which replies with a failure: { ok: false, reason: "no inventory" }. The orchestrator now has two committed reservations and has to unwind them. It runs the compensations in reverse order of commitment. Command numbers here count the commands issued in this run, so commands 4 and 5 are compensations rather than the happy path’s charge and confirmation:

Command 4 (compensation): hotel-svc.releaseReservation(hotelRef) to release the room.
Command 5 (compensation): flight-svc.releaseReservation(flightRef) to release the seats.

Then command 6: notification-svc.sendFailure(...) to tell the customer the booking couldn’t be completed. payment-svc is never invoked, because the saga aborted before reaching the charge step.

The orchestrator’s role is what makes the cascade work. It knows the reverse order to compensate in because it tracked which services committed and in what order. No service knows about any of the others.

Orchestration: car-svc fails. Compensations on flight-svc and hotel-svc run in reverse, then notification-svc sends the failure message. payment-svc is never invoked.
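The unwind can be sketched as a small orchestrator loop. This is a simplified illustration. The saga record is a plain dict rather than a database row, the services are test doubles that only log their calls, the notification step is omitted, and payment-svc.refund is a hypothetical compensation name that the text does not mention.

```python
from typing import Callable, Dict, List, Tuple

Reply = Dict[str, object]
# (step name, command, compensation); the compensation receives the command's ref.
Step = Tuple[str, Callable[[], Reply], Callable[[object], None]]


def run_booking(steps: List[Step], saga: Dict[str, object]) -> None:
    """Issue commands in order; on a failed reply, compensate in reverse and abort."""
    completed: List[Tuple[str, object, Callable[[object], None]]] = []
    for name, command, compensate in steps:
        reply = command()
        if not reply["ok"]:
            saga["state"] = "compensating"
            for _, ref, undo in reversed(completed):
                undo(ref)
            saga["state"] = "aborted"
            saga["failed_at"] = name
            return
        completed.append((name, reply.get("ref"), compensate))
        saga["completed"] = [c[0] for c in completed]   # record progress on each reply
    saga["state"] = "completed"


calls: List[str] = []


def command(name: str, reply: Reply) -> Callable[[], Reply]:
    def run() -> Reply:
        calls.append(name)
        return reply
    return run


def compensation(name: str) -> Callable[[object], None]:
    return lambda ref: calls.append(f"{name}({ref})")


saga: Dict[str, object] = {"state": "running", "completed": []}
run_booking(
    [
        ("flight", command("flight-svc.reserve", {"ok": True, "ref": "flightRef"}),
         compensation("flight-svc.releaseReservation")),
        ("hotel", command("hotel-svc.reserve", {"ok": True, "ref": "hotelRef"}),
         compensation("hotel-svc.releaseReservation")),
        ("car", command("car-svc.reserve", {"ok": False, "reason": "no inventory"}),
         compensation("car-svc.releaseReservation")),
        ("payment", command("payment-svc.charge", {"ok": True, "ref": "paymentRef"}),
         compensation("payment-svc.refund")),
    ],
    saga,
)
```

A real orchestrator would persist the saga record before and after each command so it can resume after a crash. The dict here only shows what gets tracked.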

Strengths

Explicit saga state. The orchestrator’s database has one row per saga that says exactly where it is. Customer support can answer “where is my booking?” from a single query.
One-place evolution. Adding or removing a step is a change inside the orchestrator, not across event contracts.
Easy debugging. The saga state and the orchestrator’s logs show the entire journey.
Real branching is natural. Conditional logic and decision trees live in one place that can be tested as a unit.

Trade-offs

Single point of failure. If the orchestrator goes down, no saga progresses. Running ones stall, new ones can’t start.
Logic accretes in the orchestrator. “If hotel-svc replies X then call …” starts as a state machine and grows into a sprawling decision tree.
Tight coupling. Services are coupled to the orchestrator through their command APIs. Changing a command’s contract propagates back.

Orchestration fits when the workflow has real branching, evolves frequently, or requires queryable saga state.

Choosing between them

The hint is usually in the workflow itself.

Reach for choreography when:

The workflow is stable and rarely changes.
Services are owned by different teams that want to ship independently.
Most failures are non-cascading (one service can retry on its own).
You want easy extensibility, adding new subscribers without coordinating across teams.

Reach for orchestration when:

The workflow is complex and has real branching.
The workflow evolves often.
You need to query saga state (customer support asks “where is my booking?” and somebody needs an answer).
Failures cascade and need careful sequencing of compensations.
Operations needs a single place to debug and monitor.

In practice, large systems often use both. The customer’s order saga might be orchestrated end-to-end (you want to track its state), but each step might fan out to several teams via choreography (when payment completes, analytics, marketing, and audit all subscribe).

	Choreography	Orchestration
Coordination	Distributed, via events	Centralised, in the orchestrator
Visibility into state	Hard (state is implicit in subscriptions)	Easy (orchestrator owns the saga record)
Evolution	Touches event contracts in several services	Change the orchestrator
Failure handling	Each subscriber reacts to its events	Orchestrator drives compensations
Coupling	Loose	Tighter (the orchestrator depends on each service’s command API)
Single point of failure	None	The orchestrator
Best for	Stable workflows with autonomous services	Complex, evolving, state-heavy workflows

Anomalies sagas can let through

Because each step commits as it goes, a saga doesn’t isolate itself from other sagas touching the same data. The classic concurrency anomalies all become possible.

Here’s a concrete example. Two customers are placing orders for the same item at roughly the same time. The store has one unit left.

The first order’s inventory step runs, reserves the unit, and commits. Stock is now zero.
The second order’s inventory step runs a moment later, reads zero stock, and rejects the order. That customer is told “out of stock.”
The first order continues to the payment step. The customer’s card is declined.
The first order’s compensation runs and releases the reservation. Stock goes back to one unit.

The system is in a state no serial order could have produced. The second customer was told “out of stock,” but if the two orders had run one at a time, in either order, the second one would have succeeded. The “out of stock” answer was a temporary fact based on the first order’s reservation, which has since been undone.
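The interleaving can be replayed as a straight-line trace. This is a toy model with a single stock counter. The declined payment is hard-coded, and the labels A and B for the first and second order are added for the example.

```python
stock = 1
outcomes = {}

# Order A, inventory step: reserve the last unit and commit.
stock -= 1

# Order B, inventory step: reads a value that is committed but not yet final.
outcomes["B"] = "accepted" if stock > 0 else "out of stock"

# Order A, payment step: the card is declined (hard-coded for this trace).
payment_ok = False

# Order A, compensation: release the reservation.
if not payment_ok:
    stock += 1
    outcomes["A"] = "aborted"
```

The trace ends with one unit in stock and no successful order, even though a customer who wanted that unit was turned away.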

This is a dirty read that leaked into a permanent decision. Three kinds of anomaly come up:

Dirty reads. A saga reads data written by another saga that hasn’t finished and may still compensate. Decisions made on that read can become permanently wrong, as above.
Lost updates. Two sagas read the same row, then both write it. The second write is based on the stale read, so it overwrites the first saga’s committed change without noticing it. The first saga’s update is lost.
Fuzzy reads. A saga reads the same row twice and sees different values because another saga committed in between.

Countermeasures

A handful of standard tricks for blunting these.

Semantic locks. Mark a row “in process” with an application-level flag, and have other sagas refuse to touch flagged rows. It’s a lock, just at a higher level of abstraction. The flag is cleared by the saga’s final step when it completes, or by its compensation when it rolls back.
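A semantic lock can be as small as one extra column. The sketch below is an in-memory illustration. The pending_saga field and the try_lock and unlock helpers are invented names, and the flag is assumed to be cleared on compensation as well as on completion. In a real database the check-and-set would need to be a single atomic conditional update.

```python
from typing import Dict, Optional

# One row per hotel room; "pending_saga" is the application-level flag.
rooms: Dict[str, Dict[str, Optional[str]]] = {"room-12": {"pending_saga": None}}


def try_lock(room_id: str, saga_id: str) -> bool:
    """Flag the row for saga_id unless another saga already holds the flag."""
    row = rooms[room_id]
    if row["pending_saga"] not in (None, saga_id):
        return False
    row["pending_saga"] = saga_id
    return True


def unlock(room_id: str, saga_id: str) -> None:
    """Clear the flag if saga_id holds it (on completion or on compensation)."""
    if rooms[room_id]["pending_saga"] == saga_id:
        rooms[room_id]["pending_saga"] = None


first = try_lock("room-12", "saga-A")    # row was unflagged
second = try_lock("room-12", "saga-B")   # saga-A holds the flag
unlock("room-12", "saga-A")
third = try_lock("room-12", "saga-B")    # flag was cleared
```

What saga-B does on a refused lock (fail fast, or wait and retry) is a separate design decision that this sketch leaves open.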

Commutative updates. Design the operation so the order doesn’t matter. balance += delta instead of balance = new_value. Sagas can interleave freely because every interleaving produces the same final state.
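A quick check of that claim, using made-up numbers: apply the same three deltas in every possible order, then do the same with three absolute writes.

```python
from itertools import permutations

deltas = [+2000, -1500, +300]   # illustrative balance changes from three sagas

finals = set()
for order in permutations(deltas):
    balance = 100
    for delta in order:
        balance += delta        # commutative: the order of application is irrelevant
    finals.add(balance)

# Contrast: absolute writes depend on order, because the last writer wins.
last_write = set()
for order in permutations([2100, 600, 900]):
    balance = 100
    for value in order:
        balance = value
    last_write.add(balance)
```

All six orderings of the deltas land on the same balance, while the absolute writes can end on any of the three values.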

Pivot reordering. Move steps with hard-to-unwind effects to after the pivot. The harder-to-undo work then happens in the retryable phase, where it never has to be rolled back. This is the structural countermeasure introduced in the pivot section above.

Reread before commit. Right before a critical write, re-read the row to confirm it still looks the way you assumed when you decided to write. Cheap, catches most of the damage.

Version numbers. Stamp each row with a version. Updates carry the version they expected. If the row’s version has moved on, the update fails and the saga decides whether to retry, abort, or take a different path. This is optimistic concurrency control at the application layer.
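A minimal in-memory sketch of the version check follows. The row layout and update_stock are illustrative. In a relational store the same idea is usually written as an update whose WHERE clause includes the expected version.

```python
from typing import Dict

row: Dict[str, int] = {"stock": 5, "version": 1}


def update_stock(new_stock: int, expected_version: int) -> bool:
    """Apply the write only if the row still has the version the caller read."""
    if row["version"] != expected_version:
        return False                 # another saga committed first; caller decides what next
    row["stock"] = new_stock
    row["version"] += 1
    return True


seen_by_a = dict(row)                # saga A reads stock=5, version=1
seen_by_b = dict(row)                # saga B reads the same values

a_ok = update_stock(seen_by_a["stock"] - 1, seen_by_a["version"])   # applies, version becomes 2
b_ok = update_stock(seen_by_b["stock"] - 2, seen_by_b["version"])   # stale version, rejected
```

The rejected write is what prevents the lost update: saga B has to re-read the row before it can decide again.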

Each of these has a cost. Semantic locks bring back some of the contention sagas were adopted to avoid. Commutative updates constrain what the operation can be. Pivot reordering means more of the saga has to be retryable. Reread and version checks add latency. The right combination depends on which anomalies you can tolerate and which would cause real damage in your domain.

When sagas are wrong

Not every distributed transaction wants a saga. If you need real isolation between concurrent operations (no anomalies, no dirty reads, no lost updates), the saga pattern can’t give you that. The whole premise is “we’ll commit as we go and clean up after.” For some domains (financial settlement, regulatory reporting, anything with strict serialisable guarantees), the cleanup model is a non-starter.

In those cases, you reach back for one of the atomic commit protocols (2PC, Paxos Commit) and accept their trade-offs: tight coupling and possible blocking, in exchange for atomic commit across participants and, with locks held until that commit, real isolation.

	Atomic commit	Saga
Consistency	Strong	Eventual (after compensations)
Duration tolerated	Milliseconds to seconds	Minutes to days
Coupling	Tight (locks across nodes)	Loose (no cross-service locks)
Failure handling	Rollback	Compensation
Isolation	Full	Anomalies possible
Availability	Lower (blocking on coordinator failure)	Higher (no step waits on a global commit)
Complexity	Protocol-level (in the commit infra)	Business-logic-level (in your application)

Reach for atomic commit when transactions are short and you need strong consistency. Reach for sagas when the work spans long durations, availability matters more than immediate consistency, and the business can tolerate (or compensate for) intermediate states being visible.
