What happens if the Outbox relay crashes mid-batch?
The whole point of the outbox pattern is to be crash-safe at every point. If the relay dies mid-batch, no events are lost: when it (or another instance) restarts, the same events are still in the outbox waiting.
The two relay states
sent_at IS NULL → not yet published, will be retried
sent_at IS NOT NULL → published successfully, eligible for cleanup
A crash never leaves an event in an intermediate state.
The exact race condition — and why it's safe
Imagine the relay picked up 10 events, published 6 to the bus, then crashed before updating the DB.
- 6 events were published. Consumers received them.
- 4 events never made it.
- All 10 rows still have sent_at IS NULL in the DB.
After restart:
- The relay picks up the same 10 events again
- Publishes all 10 → 6 are duplicates, 4 are new
- Consumers handle duplicates via idempotency
Net result: at-least-once delivery preserved, no events lost.
Why duplicates are unavoidable here
The DB update (SET sent_at = now()) and the bus publish are themselves a dual-write. They cannot be made atomic. So either:
- Publish first, then mark sent → if mark-sent fails, duplicate on restart
- Mark sent first, then publish → if publish fails, EVENT IS LOST
The pattern always picks "publish first" — because losing events is worse than duplicating them, and consumers can be made idempotent.
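The trade-off between the two orderings can be checked with a small in-memory simulation. This is only a sketch: OutboxRow, OrderingDemo, and the crashAfter parameter are hypothetical names invented for illustration, a plain list stands in for the bus, and a boolean stands in for sent_at.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-in for one outbox row.
public sealed class OutboxRow
{
    public int Id { get; set; }
    public bool Sent { get; set; }   // models sent_at IS NOT NULL
}

public static class OrderingDemo
{
    // Runs one relay pass that crashes after `crashAfter` publishes, then a
    // clean pass after "restart". Returns every event id the bus received.
    public static List<int> Run(bool publishFirst, int total, int crashAfter)
    {
        var outbox = Enumerable.Range(1, total)
            .Select(i => new OutboxRow { Id = i })
            .ToList();
        var bus = new List<int>();

        Pass(outbox, bus, publishFirst, crashAfter);    // first pass: crashes
        Pass(outbox, bus, publishFirst, int.MaxValue);  // second pass: no crash
        return bus;
    }

    private static void Pass(List<OutboxRow> outbox, List<int> bus,
                             bool publishFirst, int crashAfter)
    {
        var batch = outbox.Where(r => !r.Sent).ToList();

        if (!publishFirst)
        {
            // Mark-sent-first: the marks are committed before any publish.
            foreach (var row in batch) row.Sent = true;
        }

        int published = 0;
        foreach (var row in batch)
        {
            if (published == crashAfter) return;  // simulated crash
            bus.Add(row.Id);
            published++;
        }

        if (publishFirst)
        {
            // Publish-first: marks are saved only if the pass got this far.
            foreach (var row in batch) row.Sent = true;
        }
    }
}
```

With 10 events and a crash after 6 publishes, OrderingDemo.Run(true, 10, 6) returns 16 deliveries covering all 10 ids (6 of them duplicated), while OrderingDemo.Run(false, 10, 6) returns only ids 1 to 6, and ids 7 to 10 are never delivered.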
Code that handles this correctly
private async Task DispatchBatchAsync(CancellationToken stop)
{
    using var scope = _scopes.CreateScope();
    var db = scope.ServiceProvider.GetRequiredService<AppDb>();

    // FOR UPDATE row locks last only as long as the surrounding transaction,
    // so open one explicitly. Without it the SELECT autocommits and the locks
    // are released before the first publish.
    await using var tx = await db.Database.BeginTransactionAsync(stop);

    var batch = await db.OutboxEvents
        .FromSqlRaw(@"
            SELECT * FROM outbox_events
            WHERE sent_at IS NULL
            ORDER BY created_at
            LIMIT 100
            FOR UPDATE SKIP LOCKED")
        .ToListAsync(stop);

    foreach (var ev in batch)
    {
        try
        {
            await _bus.PublishAsync(ev.EventType, ev.Payload, stop);
            ev.SentAt = DateTimeOffset.UtcNow;
            // do NOT save here; batch the saves at the end
        }
        catch (Exception ex)
        {
            ev.Attempts += 1;
            ev.LastError = ex.Message;
            // do NOT throw; keep processing the rest of the batch
        }
    }

    // ONE save at the end. If we crash before this, on restart we re-publish.
    // The locks from FOR UPDATE SKIP LOCKED are released when the transaction
    // commits, or rolls back because the connection dropped.
    await db.SaveChangesAsync(stop);
    await tx.CommitAsync(stop);
}
Key details:
- FOR UPDATE SKIP LOCKED: multi-instance safety. If one relay locks rows 1-100, another relay skips them and locks 101-200.
- Batch save at the end: one write per batch instead of one per event, so the row locks are held only for the duration of the batch.
- Per-event try/catch — one bad event doesn't kill the batch.
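The SKIP LOCKED behaviour described in the first bullet can be modelled without a database. The sketch below is an in-memory approximation, not how PostgreSQL implements row locks; SkipLockedModel and Claim are hypothetical names, and a HashSet stands in for the set of rows currently locked by any relay.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SkipLockedModel
{
    // Returns up to `limit` unsent ids that no other relay currently holds,
    // and records them as locked. Ids already locked are skipped, not waited on.
    public static List<int> Claim(IEnumerable<int> unsentIdsInOrder,
                                  HashSet<int> locked, int limit)
    {
        var claimed = unsentIdsInOrder
            .Where(id => !locked.Contains(id))
            .Take(limit)
            .ToList();   // materialize before mutating `locked`

        foreach (var id in claimed) locked.Add(id);
        return claimed;
    }
}
```

Calling Claim twice against ids 1 to 250 with a limit of 100 and a shared locked set gives the first relay ids 1 to 100 and the second relay ids 101 to 200, matching the example in the list above.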
What about idempotency on the consumer side?
The outbox pattern gives at-least-once delivery. Consumers MUST handle duplicates:
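public async Task Handle(OrderPlaced ev, CancellationToken ct)
{
    // Inbox pattern: record that we've seen this event id.
    // The insert and ProcessAsync should share one DB transaction; otherwise a
    // crash between them marks the event as seen without processing it.
    var inserted = await db.Inbox.AddIfNotExistsAsync(ev.EventId);
    if (!inserted) return;  // duplicate, already processed

    await ProcessAsync(ev);
}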
Together, Outbox (producer) + Inbox (consumer) = effectively-once processing. Delivery is still at-least-once, but each event takes effect only once.
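The deduplication step can be illustrated in isolation. This is a minimal in-memory sketch, assuming event ids are GUIDs; InboxConsumer is a hypothetical class, and the HashSet stands in for an inbox table with a unique constraint on the event id.

```csharp
using System;
using System.Collections.Generic;

public sealed class InboxConsumer
{
    // Stand-in for the inbox table: one entry per event id already handled.
    private readonly HashSet<Guid> _seen = new HashSet<Guid>();

    public int ProcessedCount { get; private set; }

    // Returns true if the event was processed, false if it was a duplicate.
    public bool Handle(Guid eventId)
    {
        // HashSet.Add returns false when the id is already present,
        // which plays the role of a unique-key violation on insert.
        if (!_seen.Add(eventId)) return false;

        ProcessedCount++;   // the real side effect would go here
        return true;
    }
}
```

Delivering the same event id twice calls Handle twice but increments ProcessedCount only once, which is the behaviour the relay's redelivery after a crash relies on.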
Crash failure modes — handled
| Crash point | Result | Recovery |
|---|---|---|
| Before any publish | No events published yet; outbox rows still there | Next relay picks them up |
| After some publishes, before save | Duplicates on next run; idempotent consumers dedup | Automatic |
| During DB save | Connection dropped, DB rolls back the SaveChanges → no rows marked sent → retry | Automatic |
| After save, before delete (in janitor) | sent_at set, but row not deleted | Janitor catches it next cycle |
| During bus publish | Polly retries; if exhausted, attempts increments, last_error captured | Dead-letter after N attempts |
Multi-replica relays
Run 2-3 relay instances for HA. FOR UPDATE SKIP LOCKED ensures they don't double-publish the same event under normal conditions.
# Kubernetes deployment
replicas: 2
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0   # keep at least 1 alive during deploys
Production health checks
- Liveness probe: relay process is alive
- Readiness probe: relay can reach DB AND bus
- Custom metrics: outbox_unsent_count and outbox_max_age_seconds, with alerts at thresholds
- Logs: every publish + every retry, with correlation IDs
If you see outbox_max_age_seconds > 60 on a healthy system, the relay is stuck. Page on-call.
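One way the two metrics could be derived from outbox rows is sketched below. OutboxMetrics and Compute are hypothetical names, and the rows are passed in as tuples so the example runs without a database; in practice the same numbers would more likely come from a single aggregate query over rows where sent_at IS NULL.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OutboxMetrics
{
    // Returns (outbox_unsent_count, outbox_max_age_seconds) for the given rows.
    // A row counts as unsent when SentAt is null; the age is measured from the
    // oldest unsent row's CreatedAt to `now`.
    public static (int UnsentCount, double MaxAgeSeconds) Compute(
        IEnumerable<(DateTimeOffset CreatedAt, DateTimeOffset? SentAt)> rows,
        DateTimeOffset now)
    {
        var unsent = rows.Where(r => r.SentAt == null).ToList();
        if (unsent.Count == 0) return (0, 0.0);   // empty backlog: age is zero

        var oldest = unsent.Min(r => r.CreatedAt);
        return (unsent.Count, (now - oldest).TotalSeconds);
    }
}
```

For one unsent row created 90 seconds ago and one row already sent, Compute returns an unsent count of 1 and a max age of 90 seconds, which would trip the 60-second threshold above.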