SAGA vs Two-Phase Commit (2PC) — when does each fit?
Both solve "distributed transactions" but make opposite trade-offs.
2PC — Two-Phase Commit
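A coordinator asks every participant "can you commit?" (phase 1, prepare). If all vote yes, it sends COMMIT to all (phase 2); if any votes no, it sends ABORT to all. The result is an atomic, strongly consistent commit, but each participant holds its locks from the moment it prepares until phase 2 completes.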
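A minimal in-memory sketch of the two phases, assuming reliable calls and no crashes. `Participant`, `prepare`, `commit`, `abort`, and `two_phase_commit` are hypothetical names for illustration, not the XA API; a real participant would also write its vote to a durable log before answering. Note that locks taken in `prepare` are released only in phase 2.

```python
from enum import Enum, auto


class Vote(Enum):
    YES = auto()
    NO = auto()


class Participant:
    """Hypothetical in-memory stand-in for one resource manager (e.g. a database)."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"
        self.locked = False

    def prepare(self) -> Vote:
        # Phase 1: take locks and (in a real system) durably record the pending
        # change, then vote. A YES vote is a promise to commit if told to.
        if not self.can_commit:
            self.state = "aborted"
            return Vote.NO
        self.locked = True
        self.state = "prepared"
        return Vote.YES

    def commit(self) -> None:
        self.state = "committed"
        self.locked = False  # locks are released only in phase 2

    def abort(self) -> None:
        self.state = "aborted"
        self.locked = False


def two_phase_commit(participants: list[Participant]) -> bool:
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit everywhere only if every vote was YES; otherwise abort everywhere.
    if all(v is Vote.YES for v in votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

Because every vote is collected before any COMMIT is sent, a single NO aborts everyone; the price is that every YES voter sits on its locks while the coordinator waits for the slowest participant.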
SAGA
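Each service commits its step as a local transaction, then triggers the next step (by publishing an event, or through an orchestrator that tracks the saga's progress). No locks are held across services. The result is eventual consistency, with explicit compensating actions that undo completed steps when a later step fails.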
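A minimal sketch of the compensation rule, assuming an in-process orchestrator (a loop that calls each step in turn) rather than an event-driven hand-off between services; both styles undo completed steps in reverse order. `Step` and `run_saga` are hypothetical names, and each `action` stands in for one service's local transaction.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # hypothetical: one service's local transaction
    compensate: Callable[[], None]  # hypothetical: semantic undo for `action`


def run_saga(steps: list[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: list[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failing step did not commit, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True
```

Compensation is semantic, not a database rollback: "release the reserved stock" or "refund the card" is itself a new local transaction, and other requests may have observed the intermediate state before it runs.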
Side-by-side
| | 2PC | SAGA |
|---|---|---|
| Consistency model | Strong (atomic, all-or-nothing) | Eventual |
| Locks held during transaction | Yes, across all participants until phase 2 | No, only local and per step |
| When the coordinator dies | Prepared participants block until it recovers | Saga state is durably persisted; the saga resumes on restart |
| Latency | Higher: two sequential round trips to every participant plus forced log writes; the slowest participant sets the pace | Each step is one local transaction plus a message |
| Scalability | Poor: held locks limit throughput | Excellent: steps run asynchronously |
| Compensation needed | No: rollback is automatic | Yes: each step pairs with an undo |
| Best for | A few participants under your control | Microservices at scale |
| Modern usage | Banking core, ERP | E-commerce, travel, SaaS workflows |
Why 2PC fell out of favor
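- Coordinator is a single point of failure. If it dies after participants have voted yes but before they learn the decision, they are blocked in the "prepared" state, unable to commit OR roll back on their own, and they keep holding their locks. They stay blocked until the coordinator recovers from its log; if it never does, an operator must resolve them by hand.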
- Locks block other transactions. Hot rows such as inventory counts stay locked for the whole protocol, network round trips included, which kills throughput.
- Network partitions block progress. The CAP theorem says a system cannot guarantee both consistency and availability during a partition; 2PC picks consistency, so prepared participants cut off from the coordinator stay blocked until the partition heals.
- Cross-database support is uneven. XA transactions work on relational databases (Oracle, SQL Server, PostgreSQL, MySQL) but are weak or absent on cloud-native stores (MongoDB, DynamoDB, Cassandra).
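The blocking problem in the first point, sketched as a participant's local decision when the coordinator stops responding. The `State` values and `on_coordinator_timeout` are hypothetical names; the rule they encode is standard 2PC: a participant that has not voted may abort on its own, but one that voted YES may not.

```python
from enum import Enum


class State(Enum):
    WORKING = "working"      # has not voted yet
    PREPARED = "prepared"    # voted YES, waiting for the coordinator's decision
    COMMITTED = "committed"
    ABORTED = "aborted"


class Participant:
    def __init__(self) -> None:
        self.state = State.WORKING

    def vote_yes(self) -> None:
        self.state = State.PREPARED

    def on_coordinator_timeout(self) -> State:
        """What a participant may safely do alone when the coordinator goes silent."""
        if self.state is State.WORKING:
            # It has promised nothing yet, so aborting unilaterally is safe.
            self.state = State.ABORTED
        # If PREPARED: it promised to commit if asked, and the coordinator may
        # already have told others to commit. Deciding alone could break
        # atomicity, so it keeps its locks and waits (or escalates to an operator).
        return self.state
```

This is why 2PC is called a blocking protocol: the YES promise that makes it safe is exactly what leaves a prepared participant stuck.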
Why SAGA wins for microservices
- Each service stays independent, with no coordinator holding locks across services
- Locks are local and brief
- Network partitions reduce to "stuck saga", recoverable when partition heals
- Works with any storage tech — relational, document, key-value
- Aligns with how distributed systems actually behave (asynchronous, partial failures)
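A sketch of "saga state durably persisted, resumes", assuming a local JSON file stands in for the orchestrator's durable store (a real one would use a database with atomic writes). `run_durable_saga` is a hypothetical name. Progress is recorded after each step, so a restart skips finished steps; a crash between running a step and recording it re-runs that step, so step actions must be idempotent.

```python
import json
from pathlib import Path
from typing import Callable

Action = Callable[[], None]


def run_durable_saga(log_path: Path, steps: list[tuple[str, Action]]) -> None:
    """Run named steps in order, persisting progress after each one.

    Calling this again with the same log after a crash skips steps already
    recorded as done, so the saga resumes where it stopped.
    """
    done: list[str] = json.loads(log_path.read_text()) if log_path.exists() else []
    for name, action in steps:
        if name in done:
            continue  # finished before the crash
        action()  # may raise, e.g. if the process "dies" mid-saga
        done.append(name)
        log_path.write_text(json.dumps(done))  # hypothetical durable store
```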
When 2PC still wins
- Banking core ledgers, where regulation requires that no intermediate state ever be visible
- All participants are under one team's operational control, and that team has mature ops
- Workflow latency budget is high (back-office batch)
- Transaction rate is modest (tens per second, not tens of thousands)
What about distributed transactions in cloud databases?
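Modern distributed SQL databases (Spanner, CockroachDB, YugabyteDB) offer distributed transactions internally. They still run a 2PC-style commit across shards, but each participant is a consensus-replicated group (Paxos in Spanner, Raft in CockroachDB and YugabyteDB), so losing one node does not block the protocol the way a crashed 2PC coordinator does. These give 2PC-like guarantees with better failure handling. BUT they only span data inside the same DB cluster; you still need SAGA when a workflow spans multiple services.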
Interview rule of thumb
If the question asks about a workflow across multiple services owned by different teams, or using different languages or storage, answer SAGA. If it is about a single database (even a sharded one), answer "use the DB's transaction". If it mentions banking, ATMs, ledgers, or regulation, mention 2PC and note that it demands operational maturity.
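For the single-database case, "use the DB's transaction" can be as small as the sketch below. It uses Python's built-in `sqlite3` module as a stand-in (SQLite is local, not distributed; a distributed SQL database would typically be driven through the same begin/commit/rollback shape). The `accounts` table and `transfer` function are hypothetical.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> None:
    """Move `amount` between two hypothetical accounts atomically in one database."""
    # Used as a context manager, the connection commits if the block succeeds
    # and rolls back if it raises, so a half-done transfer is never persisted.
    with conn:
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ? AND balance >= ?",
            (amount, src, amount),
        )
        if cur.rowcount != 1:
            raise ValueError("insufficient funds or unknown source account")
        cur = conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, dst),
        )
        if cur.rowcount != 1:
            raise ValueError("unknown destination account")  # undoes the debit too
```

No compensation logic is needed here: the database's own rollback undoes the debit if the credit fails.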