What happens if a compensating transaction fails in a saga?
Compensation failure is the worst-case scenario in saga design. You have committed prior steps that need undoing, but the undo is now blocked. The system is in a partially-completed state with no automatic path back.
The three-layer recovery strategy
Layer 1 — Retry with exponential backoff
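Most compensation failures are transient (a network blip, a downstream restart), so retry persistently, with backoff and jitter so the retries do not hammer a service that is still recovering. Retrying is only safe when the compensation itself is idempotent.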
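```csharp
// Policy comes from Polly; TransientException and Jitter() are assumed to be application-defined.
await Policy
    .Handle<TransientException>()
    .Or<HttpRequestException>()
    .Or<TimeoutException>()
    .WaitAndRetryAsync(
        retryCount: 8,
        // 2, 4, 8, ... 256 seconds, plus jitter so concurrent retries do not synchronize
        sleepDurationProvider: attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)) + Jitter(),
        onRetry: (ex, ts, attempt, _) =>
            _log.LogWarning(ex, "compensation retry {N} after {Delay}", attempt, ts))
    .ExecuteAsync(() => _payments.RefundAsync(orderId, amount, ct));
```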
Most transient failures resolve within 8 retries. With this schedule the waits add up to 2 + 4 + ... + 256 = 510 seconds, roughly 8.5 minutes of cumulative backoff before jitter.
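To sanity-check a retry budget before shipping it, the schedule can be computed directly. The sketch below reproduces the `2^attempt` rule from the policy above and shows one possible shape for `Jitter()`. The source does not define that helper, so this version (up to one extra second, using `Random.Shared` from .NET 6+) is an assumption.

```csharp
using System;
using System.Linq;

// Base delays for attempts 1..8 under the 2^attempt rule, before jitter.
double[] baseDelaysSeconds = Enumerable.Range(1, 8)
    .Select(attempt => Math.Pow(2, attempt))
    .ToArray();

Console.WriteLine(string.Join(", ", baseDelaysSeconds)); // 2, 4, 8, 16, 32, 64, 128, 256
Console.WriteLine(baseDelaysSeconds.Sum());               // 510 (seconds), about 8.5 minutes

// With this jitter, each wait grows by less than one second.
TimeSpan firstWait = TimeSpan.FromSeconds(baseDelaysSeconds[0]) + Jitter();
Console.WriteLine(firstWait >= TimeSpan.FromSeconds(2) && firstWait < TimeSpan.FromSeconds(3)); // True

// Hypothetical Jitter(): a random extra delay in [0, 1000) ms so that many sagas
// failing at the same moment do not all retry in lockstep.
static TimeSpan Jitter() => TimeSpan.FromMilliseconds(Random.Shared.Next(0, 1000));
```

If 8.5 minutes is longer than the business can tolerate before a human is paged, lower `retryCount` or cap the per-attempt delay rather than removing the backoff.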
Layer 2 — Dead-letter table
If retries exhaust, persist the failure with full context. The saga state machine moves to a "stuck" state.
```sql
CREATE TABLE saga_failures (
    saga_id       UUID PRIMARY KEY,
    failed_step   TEXT NOT NULL,
    error_message TEXT NOT NULL,
    payload       JSONB NOT NULL,
    attempts      INT NOT NULL,
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    resolved_at   TIMESTAMPTZ,
    resolution    TEXT
);
```
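Operations dashboards query this table. Alerts fire when the number of unresolved rows (resolved_at IS NULL) is greater than zero; counting all rows would keep the alert firing forever after the first resolved incident.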
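The hand-off from Layer 1 to Layer 2 can be sketched as a bounded retry loop that records the failure once attempts run out. Everything named here (`SagaFailure`, `ISagaFailureStore`, `InMemoryFailureStore`, `Compensation.RunAsync`) is hypothetical and only mirrors the columns of the table above; a real store would write to the database, and the backoff between attempts is left out to keep the sketch short.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

var store = new InMemoryFailureStore();
var state = await Compensation.RunAsync(
    sagaId: Guid.NewGuid(),
    step: "refund_payment",
    payloadJson: "{\"orderId\":42}",
    compensate: () => throw new TimeoutException("payment provider unreachable"),
    store: store,
    maxAttempts: 3);

Console.WriteLine($"{state}, dead-lettered rows: {store.Rows.Count}"); // Stuck, dead-lettered rows: 1

public enum SagaState { Compensated, Stuck }

// Mirrors the saga_failures columns that are known at failure time.
public record SagaFailure(Guid SagaId, string FailedStep, string ErrorMessage,
                          string PayloadJson, int Attempts, DateTimeOffset CreatedAt);

public interface ISagaFailureStore
{
    Task SaveAsync(SagaFailure failure);
}

// Stand-in for the real table so the sketch runs without a database.
public sealed class InMemoryFailureStore : ISagaFailureStore
{
    public List<SagaFailure> Rows { get; } = new();

    public Task SaveAsync(SagaFailure failure)
    {
        Rows.Add(failure);
        return Task.CompletedTask;
    }
}

public static class Compensation
{
    public static async Task<SagaState> RunAsync(
        Guid sagaId, string step, string payloadJson,
        Func<Task> compensate, ISagaFailureStore store, int maxAttempts)
    {
        var lastError = "no attempts made";
        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                await compensate();
                return SagaState.Compensated;
            }
            catch (Exception ex)
            {
                lastError = ex.Message; // a real loop would back off here before retrying
            }
        }

        // Retries exhausted: persist full context and park the saga for operators.
        await store.SaveAsync(new SagaFailure(
            sagaId, step, lastError, payloadJson, maxAttempts, DateTimeOffset.UtcNow));
        return SagaState.Stuck;
    }
}
```

Returning an explicit `Stuck` state, rather than rethrowing, keeps the decision about what happens next in the saga state machine instead of in whichever caller happens to catch the exception.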
Layer 3 — Human intervention with a runbook
Some failures require manual action. Every saga should have a documented runbook per failure type:
```markdown
## Runbook: Payment refund failed
1. Open the dead-letter row in admin / saga_failures
2. Check Stripe dashboard for the original charge
3. Issue a manual refund via Stripe console
4. Mark resolution = 'manual_refund_via_stripe_<txn_id>'
5. Update saga state to 'compensated_manually'
```
The "pivot transaction" technique
When compensation is truly impossible, redesign the workflow around a pivot transaction: the last step whose failure can still trigger compensation. Once the pivot commits, it acts as a "point of no return". Every step after it must be safe to retry until it succeeds, so no later step can fail in a way that requires undoing earlier steps.
```text
✅ Reserve stock → Charge card → COMMIT POINT → Hand off to warehouse → Send email
                                               ^^^^^^^^^^^^^^^^^^^^^^^^
                                               everything past here either
                                               succeeds or alerts ops, but
                                               never triggers compensation
```
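One way to make the commit point explicit in code is to give the saga runner a pivot index and let it decide between compensating and escalating. This is a minimal synchronous sketch; `Step`, `PivotSaga`, and the outcome strings are hypothetical names, and a production runner would be asynchronous and would persist state between steps.

```csharp
using System;
using System.Collections.Generic;

var log = new List<string>();
var steps = new List<Step>
{
    new Step("reserve_stock", () => log.Add("reserve"), () => log.Add("release stock")),
    new Step("charge_card", () => log.Add("charge"), () => log.Add("refund")),
    // Steps after the pivot carry no compensation: they are retried or escalated.
    new Step("warehouse_handoff", () => throw new InvalidOperationException("warehouse down"), PivotSaga.NoCompensation),
    new Step("send_email", () => log.Add("email"), PivotSaga.NoCompensation),
};

var outcome = PivotSaga.Run(steps, pivotIndex: 1, alertOps: name => log.Add($"alert ops: {name}"));
Console.WriteLine($"{outcome}: {string.Join(" | ", log)}");
// needs_attention: reserve | charge | alert ops: warehouse_handoff

public record Step(string Name, Action Execute, Action Compensate);

public static class PivotSaga
{
    public static readonly Action NoCompensation = () => { };

    public static string Run(IReadOnlyList<Step> steps, int pivotIndex, Action<string> alertOps)
    {
        for (var i = 0; i < steps.Count; i++)
        {
            try
            {
                steps[i].Execute();
            }
            catch (Exception)
            {
                if (i > pivotIndex)
                {
                    // Past the pivot: never undo earlier work, escalate instead.
                    alertOps(steps[i].Name);
                    return "needs_attention";
                }

                // At or before the pivot: undo the completed steps in reverse order.
                for (var j = i - 1; j >= 0; j--)
                {
                    steps[j].Compensate();
                }
                return "compensated";
            }
        }
        return "completed";
    }
}
```

In the run above the charge has already committed when the warehouse hand-off fails, so the runner alerts operations and leaves both the stock reservation and the charge in place.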
The semantic compensation technique
When literal undo is impossible, design a forward-only "compensation" that semantically corrects the situation.
| Original | Cannot undo | Forward correction |
|---|---|---|
| Sent "order confirmed" email | Cannot unsend | Send "order cancelled" email |
| Pushed "your package shipped" notification | Cannot unpush | Push "shipping cancelled — refund processing" |
| Charged 9.99 + 0.30 fee | Cannot reverse the fee | Refund 9.99, eat the fee, log as cost-of-doing-business |
The customer sees a corrected state, even though strictly speaking nothing was reversed.
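The table can be read as a lookup from "what already happened" to "what to do next". The sketch below encodes the three rows as a pattern match that returns a list of forward actions. The record types and action strings are hypothetical placeholders, not a real messaging or payments API.

```csharp
using System;
using System.Collections.Generic;

var plan = ForwardCorrection.Plan(new CardCharged(Amount: 9.99m, ProcessorFee: 0.30m));
foreach (var action in plan)
{
    Console.WriteLine(action);
}

// Things that already happened and cannot be literally undone.
public abstract record CompletedAction;
public record ConfirmationEmailSent : CompletedAction;
public record ShippedNotificationPushed : CompletedAction;
public record CardCharged(decimal Amount, decimal ProcessorFee) : CompletedAction;

public static class ForwardCorrection
{
    // Maps each irreversible action to the new actions that correct its effect.
    public static IReadOnlyList<string> Plan(CompletedAction done) => done switch
    {
        ConfirmationEmailSent => new[] { "send email: order cancelled" },
        ShippedNotificationPushed => new[] { "push: shipping cancelled, refund processing" },
        CardCharged c => new[]
        {
            // The customer gets the full amount back; the fee is absorbed and logged.
            FormattableString.Invariant($"refund {c.Amount}"),
            FormattableString.Invariant($"record processor fee {c.ProcessorFee} as cost"),
        },
        _ => throw new NotSupportedException($"no forward correction defined for {done.GetType().Name}"),
    };
}
```

Throwing on an unknown action is deliberate: a step with no defined correction should fail loudly during testing rather than be skipped silently in production.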
What you must NEVER do
- Silently swallow the failure. Lost money, partial orders, no audit. Customer support nightmare.
- Roll back via untested code paths. Compensation code runs rarely in production, so it must be exercised in tests.
- Skip idempotency checks. Without them, retrying a "release stock" compensation 5 times might release 5x the stock.
- Block the saga forever while waiting. Every compensation needs a timeout and an escalation path.
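The "release stock" hazard from the list above is easy to demonstrate. In this sketch each compensation carries an ID, and the inventory applies a given ID at most once. `Inventory`, `ReleaseOnce`, and the ID format are hypothetical. In a real system the set of applied IDs would live in the same database transaction as the stock update.

```csharp
using System;
using System.Collections.Generic;

var inventory = new Inventory(onHand: 10);
inventory.Reserve(3); // 7 available

// The retry policy may deliver the same compensation several times.
for (var i = 0; i < 5; i++)
{
    inventory.ReleaseOnce(compensationId: "saga-42:release_stock", quantity: 3);
}

Console.WriteLine(inventory.Available); // 10 (a non-idempotent release would report 22)

public sealed class Inventory
{
    private readonly HashSet<string> _appliedCompensations = new();

    public Inventory(int onHand) => Available = onHand;

    public int Available { get; private set; }

    public void Reserve(int quantity) => Available -= quantity;

    // Returns true only the first time a given compensation ID is applied.
    public bool ReleaseOnce(string compensationId, int quantity)
    {
        if (!_appliedCompensations.Add(compensationId))
        {
            return false; // duplicate delivery: already applied, do nothing
        }

        Available += quantity;
        return true;
    }
}
```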
Production design checklist
- Saga state durably persisted before each step + before each compensation attempt
- All compensations are idempotent
- Dead-letter table + dashboard + alerts on unresolved rows
- Runbook per saga type and failure type, kept up to date with the saga code
- Postmortems for every dead-letter event — the goal is to drive that count to zero
- Manual replay tool — ops can re-trigger a stuck compensation after fixing the root cause
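The manual replay item in the checklist above can be sketched as a function that walks the unresolved dead-letter rows, re-runs the matching compensation handler, and marks a row resolved only when the handler succeeds. `FailureRow`, `Replay.Unresolved`, and the handler dictionary are hypothetical. A real tool would load rows from the dead-letter table and would normally require an operator to pick which rows to replay.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var failures = new List<FailureRow>
{
    new FailureRow(Guid.NewGuid(), "refund_payment"),
    new FailureRow(Guid.NewGuid(), "unknown_step"),
};
var handlers = new Dictionary<string, Action<Guid>>
{
    ["refund_payment"] = sagaId => { /* call the payment provider here */ },
};

var replayed = Replay.Unresolved(failures, handlers, DateTimeOffset.UtcNow);
Console.WriteLine($"{replayed} replayed, {failures.Count(f => f.ResolvedAt is null)} still open");
// 1 replayed, 1 still open

public sealed class FailureRow
{
    public FailureRow(Guid sagaId, string failedStep)
    {
        SagaId = sagaId;
        FailedStep = failedStep;
    }

    public Guid SagaId { get; }
    public string FailedStep { get; }
    public DateTimeOffset? ResolvedAt { get; set; }
    public string Resolution { get; set; } = "";
}

public static class Replay
{
    public static int Unresolved(
        List<FailureRow> rows,
        IReadOnlyDictionary<string, Action<Guid>> handlers,
        DateTimeOffset now)
    {
        var count = 0;
        foreach (var row in rows.Where(r => r.ResolvedAt is null).ToList())
        {
            if (!handlers.TryGetValue(row.FailedStep, out var handler))
            {
                continue; // no handler registered: leave the row for a human
            }

            try
            {
                handler(row.SagaId);
            }
            catch (Exception)
            {
                continue; // still failing: keep the row open
            }

            row.ResolvedAt = now;
            row.Resolution = "replayed";
            count++;
        }
        return count;
    }
}
```

Because a replay re-runs a compensation that may have partially succeeded before, it is only safe when the handlers are idempotent.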
Interview-grade summary
"Compensation failure is the failure mode that distinguishes a serious distributed-systems engineer from someone who copy-pasted the saga template. You need three layers: aggressive retries for transient failures, a dead-letter table for permanent ones, and a runbook for human intervention. The pivot-transaction technique reduces the surface where compensation matters at all."