What happens if the Outbox relay crashes mid-batch?

Question

Randhir Jassal · Accepted Answer

The whole point of the outbox pattern is that it is crash-safe at every point. If the relay dies mid-batch, no events are lost: when it (or another instance) restarts, the same events are still in the outbox waiting.

The two relay states

sent_at IS NULL    → not yet published, will be retried
sent_at IS NOT NULL → published successfully, eligible for cleanup

A crash never leaves a row in an intermediate state: an event that was published but not yet marked still reads as sent_at IS NULL and is simply retried.

The exact race condition — and why it's safe

Imagine the relay picked up 10 events, published 6 to the bus, then crashed before updating the DB.

6 events were published. Consumers received them.
4 events never made it.
All 10 rows still have sent_at IS NULL in the DB.

After restart:

The relay picks up the same 10 events again
Publishes all 10 → 6 are duplicates, 4 are new
Consumers handle duplicates via idempotency

Net result: at-least-once delivery preserved, no events lost.
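The scenario above can be replayed in memory. This is only a sketch: OutboxRow, CrashSimulation, and the List<int> standing in for the bus are hypothetical names, and the "crash" is modeled as returning before any row is marked sent.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-in for a row of the outbox table.
public sealed class OutboxRow
{
    public int Id { get; init; }
    public DateTimeOffset? SentAt { get; set; }
}

public static class CrashSimulation
{
    // Publishes every unsent row, then marks them all sent in one step at the end.
    // crashAfter simulates the process dying after that many publishes,
    // before any row has been marked as sent.
    public static void RunRelay(List<OutboxRow> outbox, List<int> bus, int? crashAfter = null)
    {
        var batch = outbox.Where(r => r.SentAt is null).OrderBy(r => r.Id).ToList();
        var published = 0;
        foreach (var row in batch)
        {
            if (published == crashAfter) return; // crash: nothing gets marked sent
            bus.Add(row.Id);                     // publish to the (fake) bus
            published++;
        }
        foreach (var row in batch) row.SentAt = DateTimeOffset.UtcNow; // the single save
    }

    // 10 events, crash after 6 publishes, then a clean restart.
    public static (int Delivered, int Distinct, int Unsent) Scenario()
    {
        var outbox = Enumerable.Range(1, 10).Select(i => new OutboxRow { Id = i }).ToList();
        var bus = new List<int>();
        RunRelay(outbox, bus, crashAfter: 6); // first run dies mid-batch
        RunRelay(outbox, bus);                // restart republishes all 10
        return (bus.Count, bus.Distinct().Count(), outbox.Count(r => r.SentAt is null));
    }
}
```

Scenario() returns (16, 10, 0): 16 deliveries in total, all 10 distinct events delivered at least once, and no rows left unsent. The 6 extra deliveries are the duplicates that consumers have to absorb.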

Why duplicates are unavoidable here

The DB update (SET sent_at = now()) and the bus publish are themselves a dual-write. They cannot be made atomic. So either:

Publish first, then mark sent → if mark-sent fails, duplicate on restart
Mark sent first, then publish → if publish fails, EVENT IS LOST

The pattern always picks "publish first" — because losing events is worse than duplicating them, and consumers can be made idempotent.
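To make the trade-off concrete, here is a toy model of one event with a crash between the two writes, followed by a restart that retries only if the row still looks unsent. The class and method names are illustrative, not part of any library.

```csharp
using System.Collections.Generic;

public static class OrderingComparison
{
    // Publish first, then mark sent. The crash hits before the mark.
    public static int PublishFirstDeliveries()
    {
        var bus = new List<string>();
        var sent = false;

        bus.Add("order-placed"); // publish succeeds
        // crash here: the row is never marked, so sent stays false

        if (!sent) { bus.Add("order-placed"); sent = true; } // restart retries
        return bus.Count;
    }

    // Mark sent first, then publish. The crash hits before the publish.
    public static int MarkFirstDeliveries()
    {
        var bus = new List<string>();
        var sent = false;

        sent = true; // mark succeeds
        // crash here: the publish never happens

        if (!sent) { bus.Add("order-placed"); } // restart sees "sent" and skips it
        return bus.Count;
    }
}
```

PublishFirstDeliveries() returns 2 (one duplicate) and MarkFirstDeliveries() returns 0 (the event is gone). A duplicate can be filtered downstream; a missing event cannot be recovered.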

Code that handles this correctly

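// Requires: using Microsoft.EntityFrameworkCore;
private async Task DispatchBatchAsync(CancellationToken stop)
{
    using var scope = _scopes.CreateScope();
    var db = scope.ServiceProvider.GetRequiredService<AppDb>();

    // FOR UPDATE locks last only as long as the surrounding transaction, so open one
    // explicitly. Without it the SELECT auto-commits and the rows unlock immediately.
    await using var tx = await db.Database.BeginTransactionAsync(stop);

    var batch = await db.OutboxEvents
        .FromSqlRaw(@"
            SELECT * FROM outbox_events
            WHERE sent_at IS NULL
            ORDER BY created_at
            LIMIT 100
            FOR UPDATE SKIP LOCKED")
        .ToListAsync(stop);

    foreach (var ev in batch)
    {
        try
        {
            await _bus.PublishAsync(ev.EventType, ev.Payload, stop);
            ev.SentAt = DateTimeOffset.UtcNow;
            // do NOT save here: batch the saves at the end
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Shutdown cancellation is allowed to propagate; real failures are recorded.
            ev.Attempts += 1;
            ev.LastError = ex.Message;
            // do NOT throw: keep processing the rest of the batch
        }
    }

    // ONE save at the end. If we crash before this, on restart we re-publish.
    // The row locks are released when the transaction commits or rolls back.
    await db.SaveChangesAsync(stop);
    await tx.CommitAsync(stop);
}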

Key details:

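FOR UPDATE SKIP LOCKED: multi-instance safety. If one relay locks rows 1-100, another relay skips them and locks 101-200. The locks exist only inside a transaction and are released when it commits or rolls back.
Batch save at the end: one DB round trip per batch instead of one per event. The row locks are held while the whole batch is published, so keep batches small and publish timeouts short.
Per-event try/catch: one bad event doesn't kill the batch. The failure is recorded in attempts and last_error, and the row stays unsent for the next run.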
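The claiming behaviour of SKIP LOCKED can be illustrated without a database. The sketch below is an in-memory approximation only: RowLocks and its methods are hypothetical, and a real database tracks locks per transaction rather than in a shared set.

```csharp
using System.Collections.Generic;
using System.Linq;

// In-memory model of row locks shared by several relay instances.
public sealed class RowLocks
{
    private readonly HashSet<int> _locked = new();

    // Mimics SELECT ... LIMIT n FOR UPDATE SKIP LOCKED: walk the rows in order,
    // skip any that another relay already holds, and lock up to `limit` of the rest.
    public List<int> ClaimBatch(IEnumerable<int> unsentIds, int limit)
    {
        var claimed = unsentIds.Where(id => !_locked.Contains(id)).Take(limit).ToList();
        foreach (var id in claimed) _locked.Add(id);
        return claimed;
    }

    // Mimics commit or rollback: the relay's locks disappear.
    public void Release(IEnumerable<int> ids)
    {
        foreach (var id in ids) _locked.Remove(id);
    }
}
```

With six unsent ids 1..6 and a limit of 3, the first ClaimBatch call returns 1, 2, 3 and a second call from another relay returns 4, 5, 6. If the first relay then releases its rows without marking them sent (a rollback), they become claimable again.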

What about idempotency on the consumer side?

The outbox pattern gives at-least-once delivery. Consumers MUST handle duplicates:

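public async Task Handle(OrderPlaced ev, CancellationToken ct)
{
    // Record the event id and apply its effects in ONE transaction. Otherwise a crash
    // between the two steps would mark the event as seen without ever processing it.
    await using var tx = await db.Database.BeginTransactionAsync(ct);

    // Inbox pattern: record that we've seen this event id
    var inserted = await db.Inbox.AddIfNotExistsAsync(ev.EventId);
    if (!inserted) return;   // duplicate: already processed (disposing tx rolls back)

    await ProcessAsync(ev);
    await tx.CommitAsync(ct);
}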

Together, Outbox (producer) + Inbox (consumer) give effectively-once processing: delivery is still at-least-once, but each event's effects are applied only once.
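A minimal in-memory illustration of that dedup step follows. It assumes a HashSet<Guid> as a stand-in for the inbox table; OrderTotals and the Amount field are made up for the example.

```csharp
using System;
using System.Collections.Generic;

public sealed record OrderPlaced(Guid EventId, decimal Amount);

public sealed class OrderTotals
{
    private readonly HashSet<Guid> _inbox = new(); // stand-in for the inbox table
    public decimal Total { get; private set; }

    // Returns true if the event was applied, false if it was a duplicate.
    public bool Handle(OrderPlaced ev)
    {
        // HashSet.Add returns false when the id is already present.
        if (!_inbox.Add(ev.EventId)) return false;
        Total += ev.Amount;
        return true;
    }
}
```

Handling the same OrderPlaced twice applies it once: the first call returns true and adds the amount, the second returns false and leaves Total unchanged. In a real consumer the inbox insert and the state change would share one database transaction.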

Crash failure modes — handled

Crash point	Result	Recovery
Before any publish	No events published yet; outbox rows still there	Next relay picks them up
After some publishes, before save	Duplicates on next run; idempotent consumers dedup	Automatic
During DB save	Connection dropped, DB rolls back the SaveChanges → no rows marked sent → retry	Automatic
After save, before delete (in janitor)	sent_at set, but row not deleted	Janitor catches it next cycle
During bus publish	Polly retries; if exhausted, attempts increments, last_error captured	Dead-letter after N attempts
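The last row implies a third, derived state on top of the two relay states. One way to express it, assuming a hypothetical maxAttempts threshold that the answer does not specify:

```csharp
using System;

public enum OutboxStatus { Pending, Sent, DeadLettered }

public static class OutboxTriage
{
    // sentAt and attempts mirror the sent_at and attempts columns.
    // maxAttempts is an assumed, configurable cutoff (the "N" in the table).
    public static OutboxStatus Classify(DateTimeOffset? sentAt, int attempts, int maxAttempts)
    {
        if (sentAt is not null) return OutboxStatus.Sent;
        return attempts >= maxAttempts ? OutboxStatus.DeadLettered : OutboxStatus.Pending;
    }
}
```

A relay using this rule would skip DeadLettered rows when selecting a batch, so a permanently failing event stops blocking newer ones and can be inspected by hand.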

Multi-replica relays

Run 2-3 relay instances for HA. FOR UPDATE SKIP LOCKED ensures they don't double-publish the same event under normal conditions.

# Kubernetes deployment
replicas: 2
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0   # never take an old pod down before its replacement is ready

Production health checks

Liveness probe: relay process is alive
Readiness probe: relay can reach DB AND bus
Custom metric: outbox_unsent_count and outbox_max_age_seconds — alert at thresholds
Logs: every publish + every retry, with correlation IDs

If outbox_max_age_seconds stays above 60 while the DB and bus are otherwise healthy, the relay is probably stuck. Page on-call.
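As a rough sketch of what those two metrics measure, here they are computed over in-memory rows. OutboxMetrics and the tuple shape are assumptions; in production the values would more likely come from a COUNT and MIN(created_at) query against the outbox table.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OutboxMetrics
{
    // Returns the number of unsent rows and the age in seconds of the oldest one.
    public static (int UnsentCount, double MaxAgeSeconds) Compute(
        IEnumerable<(DateTimeOffset CreatedAt, DateTimeOffset? SentAt)> rows,
        DateTimeOffset now)
    {
        var unsent = rows.Where(r => r.SentAt is null).ToList();
        var maxAge = unsent.Count == 0
            ? 0
            : unsent.Max(r => (now - r.CreatedAt).TotalSeconds);
        return (unsent.Count, maxAge);
    }
}
```

For two unsent rows created 90 and 10 seconds ago plus one already-sent row, Compute returns (2, 90), which would trip the 60-second alert.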
