How We Deliver 15 Million Webhooks a Day Without Losing a Single Event

A webhook looks like the easiest feature you'll ever build: something happens, you POST it to the customer's URL. Then you ship it, and reality arrives — the customer's endpoint is down, or slow, or returns 500, or times out, or your own process restarts mid-send. Multiply that by 15 million events a day across thousands of endpoints you don't control, and "just POST it" becomes one of the hardest reliability problems in your system.

This is the design we run on Mattrx, our multi-tenant marketing-analytics SaaS, to deliver ~15 million webhook events per day — campaign.completed, conversion.tracked, budget.threshold.crossed, report.ready — to customer-configured endpoints. The first version was a synchronous POST inside the request handler. It lost events on every deploy and turned one slow customer into an outage for everyone. This post is everything we changed, and why.

TL;DR

Aspect	Naive sync POST (before)	Outbox + queue + workers (after)
Durability	events lost on crash/deploy	persisted before delivery, never lost
API latency	blocked on the customer's endpoint	decoupled; API p95 unaffected
Retries	none	exponential backoff + jitter, 8 attempts / ~24h
Isolation	one slow customer stalls everyone	per-tenant partitioning + concurrency caps
Giving up	fails the API call	dead-letter queue + circuit breaker
Security	ad hoc	HMAC-SHA256 signed, HTTPS, timestamped
Duplicates	unhandled	stable event id; customers de-dupe

~15M events/day ≈ 173/sec average, with peaks 5–10× (~870–1,730/sec).
Outbox pattern → zero events dropped after a committed change (we used to lose thousands per deploy).
Decoupling kept API p95 at 120 ms — synchronous webhooks had spiked it into seconds behind slow endpoints.
First-attempt delivery ~96%; ~99.98% eventual after retries.
Up to 8 delivery attempts over ~24h with exponential backoff + full jitter.
~0.02% permanently fail → dead-letter queue → per-customer status + alert.
Per-tenant queue partitioning + concurrency caps → no noisy-neighbour starvation.
HMAC-SHA256 signatures (per-tenant secret) + timestamp → integrity and replay protection.
Circuit breaker auto-disables ~40 chronically-dead endpoints/day after 20 consecutive failures.
Peak in-flight deliveries ~1,400 (Little's law: 1,730/s × ~0.8s).

The one mental shift: you don't control the endpoints, so you cannot prevent failure — you can only make failure survivable. Persist before you deliver, retry with discipline, isolate the slow from the fast, and make giving up a first-class, observable outcome.

The running example: Mattrx

Mattrx is a real system — Angular 19 front end, .NET 9 / ASP.NET Core back end (Clean Architecture + CQRS), Azure SQL, Azure App Service. Kafka handles ingestion; Azure Service Bus already carries our report-command queue; Event Grid wires the reactive paths. 110k MAU, ~3,200 req/sec of inbound traffic at peak.

Webhooks are how customers react to what happens inside Mattrx without polling us. Every completed campaign, tracked conversion, or generated report can fan out to a customer endpoint. At our scale that's ~15M outbound deliveries a day — and every one goes to a URL some customer typed into a settings page, running software we have never seen and cannot fix.

The problem: why this is deceptively hard

The naive mental model is "an HTTP POST." The real problem is a distributed-systems problem wearing an HTTP POST's clothes:

The endpoints are untrusted and unreliable. They go down, deploy, rate-limit you, return 500s, hang until timeout, and change IP mid-flight. You own none of that.
You must not lose a committed event. If Mattrx tells its own database "campaign completed," the customer's webhook is now owed. Dropping it is a correctness bug, not a blip.
You can't guarantee exactly-once. To an endpoint you don't control, across a network that can fail after the endpoint processed but before you got the 200, exactly-once is a fantasy. The honest target is at-least-once + idempotency.
One slow customer can sink the rest. Shared threads or a shared worker pool mean a single endpoint that takes 30 seconds starves every other customer's deliveries.
Failure is the common case, not the edge case. At 15M/day, even if only 0.1% of deliveries hit a broken endpoint, that is 15,000 events fighting your retry machinery every day.

Requirements

Functional

Deliver every committed event to the customer's endpoint at least once.
Retry transient failures automatically, with sane backoff.
Expose delivery status per event and per customer (delivered / retrying / dead-lettered).
Let customers configure endpoints + secrets and choose event types.

Non-functional

Durability: never lose an event once its domain change committed.
Scale: 15M/day sustained, absorb 5–10× peaks without falling over.
Isolation: one tenant's broken endpoint must not delay another's deliveries.
Security: payloads signed and tamper-evident; HTTPS only; replay-resistant.
Idempotency: safe to retry; duplicates are expected and identifiable.

Back-of-the-envelope

Throughput: 15,000,000 ÷ 86,400 s ≈ 173/sec average. Real traffic is bursty (campaigns end on the hour, reports finish in batches), so plan for 5–10× peaks ≈ 870–1,730/sec.
Concurrency (Little's law): in-flight = arrival rate × latency. At peak 1,730/s × ~0.8s average delivery ≈ ~1,400 concurrent deliveries. Size the worker pool for that, not for the average.
Storage: keep ~24h of events retryable. 15M rows/day × ~2 KB ≈ ~30 GB working set for the outbox; prune delivered rows aggressively.
Egress: 15M × ~5 KB median payload ≈ ~75 GB/day outbound.

These numbers pick the architecture: average throughput is trivial, but the peaks + the long retry tail + the isolation requirement are what force a queue, a worker pool, and per-tenant partitioning.
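
The arithmetic above can be checked in a few lines. This is a minimal sketch: the Capacity class and its method names are hypothetical, and the inputs are simply the figures quoted in this section.

```csharp
using System;

// Hypothetical helper that reproduces the back-of-the-envelope figures.
public static class Capacity
{
    // Average events per second for a given daily volume.
    public static double AveragePerSecond(long eventsPerDay) => eventsPerDay / 86_400.0;

    // Little's law: in-flight = arrival rate (per second) x average latency (seconds).
    public static double InFlight(double arrivalsPerSecond, double latencySeconds) =>
        arrivalsPerSecond * latencySeconds;

    // Volume in decimal gigabytes for a per-event size given in kilobytes.
    public static double Gigabytes(long events, double kilobytesPerEvent) =>
        events * kilobytesPerEvent / 1_000_000.0;
}
```

With these inputs, AveragePerSecond(15_000_000) is about 173.6, InFlight(1730, 0.8) is 1,384, and Gigabytes gives 30 for the outbox (2 KB rows) and 75 for egress (5 KB payloads).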

The naive approach — and why it collapses

The first version delivered the webhook inside the request that caused the event.

// BEFORE: fire the webhook synchronously, in the request path.
[HttpPost("campaigns/{id}/complete")]
public async Task<IActionResult> Complete(string id, CancellationToken ct)
{
    await campaigns.CompleteAsync(id, ct);

    var endpoint = await webhooks.GetEndpointAsync(TenantId, "campaign.completed", ct);
    using var http = new HttpClient { Timeout = TimeSpan.FromSeconds(5) };
    await http.PostAsJsonAsync(endpoint.Url, new { type = "campaign.completed", id }, ct); // blocks here

    return Ok();
}

Why it collapses:

Lost events. If the process restarts (deploy, scale-in, crash) between CompleteAsync and the POST, the event is gone forever: the DB says "completed," but the customer never hears about it.
Head-of-line blocking. The request waits on the customer's endpoint. A 5-second timeout × many slow customers ties up request-handling capacity and takes down unrelated API routes.
Coupled failure. A customer endpoint that times out or refuses the connection throws inside the handler and fails the whole API call, so their broken server becomes your 500. (A 500 response on its own does not throw from PostAsJsonAsync, so that failure is silently ignored instead.)
No retries. A transient blip = a permanently missed event.

Mattrx metric: in the naive era, every deploy dropped thousands of in-flight events, and a single slow endpoint could push API p95 from 120 ms into multiple seconds. That is the problem the rest of this post solves.

The architecture

  Domain change (API request / job)
        |  (same DB transaction)
        v
  [ OUTBOX table ]   persist-before-deliver — a crash never drops an event
        |
        v
  Dispatcher (relay)   claims Pending rows: UPDLOCK + READPAST (skip locked rows)
        |
        v
  [ Azure Service Bus: webhook-deliveries ]   partitioned by tenantId
        |
        +--> Worker 1 --\
        +--> Worker 2 -----> HTTP POST (HMAC-signed, HTTPS) --> Customer endpoint
        +--> Worker N --/                 |
                                          | 5xx / timeout / network?
                                          v
                          re-queue with exponential backoff + jitter
                                          |
                                 exhausted (8 attempts / ~24h)?
                                          v
                              [ DEAD-LETTER QUEUE ] --> per-customer status + alert

  (N consecutive failures for an endpoint => circuit OPEN => endpoint auto-disabled)

Now each box, with the before that broke and the after that holds.

Fix 1: the Outbox Pattern — persist before you deliver

Before

Deliver first, hope it worked. A crash between the state change and the send loses the event (shown above).

After

Write the event into an outbox table in the same database transaction as the domain change. If the transaction commits, the event will be delivered — later, by a separate relay. If it rolls back, the event never existed. No window to lose anything.

// AFTER: the outbox row commits atomically with the state change.
public async Task CompleteCampaignAsync(string tenantId, string campaignId, CancellationToken ct)
{
    await using var tx = await db.BeginTransactionAsync(ct);

    await campaigns.MarkCompletedAsync(campaignId, ct);

    await db.Outbox.InsertAsync(new OutboxEvent
    {
        Id = Guid.NewGuid(),                 // stable event id == idempotency key
        TenantId = tenantId,
        Type = "campaign.completed",
        Payload = JsonSerializer.Serialize(new { campaignId }),
        Status = OutboxStatus.Pending,
        CreatedAt = clock.UtcNow,
    }, ct);

    await tx.CommitAsync(ct);   // state change + event, all or nothing
}

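A background relay polls the outbox and publishes to the queue, claiming rows with a skip-locked read so multiple relay instances never grab the same row. On Azure SQL that means the UPDLOCK and READPAST table hints; FOR UPDATE SKIP LOCKED is the PostgreSQL equivalent: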

public sealed class OutboxDispatcher(IDb db, IServiceBus bus)
{
    public async Task PumpAsync(CancellationToken ct)
    {
        // Claim a batch without blocking other dispatchers on locked rows.
        var batch = await db.Outbox.ClaimPendingAsync(limit: 500, ct); // UPDATE ... OUTPUT with UPDLOCK, READPAST (skips locked rows)
        foreach (var e in batch)
        {
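            // Publish first, then mark. A crash between the two calls re-publishes the
            // event on restart (at-least-once), which is why the event id never changes.
            await bus.PublishAsync("webhook-deliveries", e.ToMessage(), partitionKey: e.TenantId, ct);
            await db.Outbox.MarkPublishedAsync(e.Id, ct);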
        }
    }
}
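
For illustration, a claim statement of the kind ClaimPendingAsync might run on Azure SQL could look like the sketch below. The table name, column names, and the 'Claimed' status are assumptions, not the actual Mattrx schema.

```csharp
using System;

// Hypothetical sketch: table, column, and status names are assumptions.
public static class OutboxSql
{
    // Claims up to @limit pending rows in a single statement.
    // UPDLOCK takes update locks on the rows it reads; READPAST skips rows
    // another dispatcher has already locked instead of waiting on them.
    public const string ClaimPending = """
        WITH next AS (
            SELECT TOP (@limit) *
            FROM dbo.Outbox WITH (UPDLOCK, READPAST, ROWLOCK)
            WHERE Status = 'Pending'
            ORDER BY CreatedAt
        )
        UPDATE next
        SET Status = 'Claimed', ClaimedAt = SYSUTCDATETIME()
        OUTPUT inserted.Id, inserted.TenantId, inserted.Type, inserted.Payload;
        """;
}
```

A row left in 'Claimed' by a dispatcher that crashed needs a sweep that returns it to 'Pending' after a timeout; that part is not shown here.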

Diagnostic: the outbox turns "we sent it" (a hope) into "we durably owe it" (a fact). Delivery becomes a retryable background task over committed state, not a fragile step in the request path.

Mattrx metric: events dropped per deploy went from thousands to zero. The outbox is the single change that made the whole system trustworthy.

Fix 2: queue + dispatcher + workers — parallelism without noisy neighbours

Before

Even after the outbox, a single worker loop delivering events one at a time can't keep up with peaks, and a shared pool lets one slow tenant hog every worker.

After

Publish to Azure Service Bus partitioned by tenantId, and run a pool of competing-consumer workers with per-tenant concurrency caps. Partitioning keeps each tenant's events together and roughly in order (concurrent workers and retries can still reorder them); the caps give isolation.

// Each worker pulls messages, but a per-tenant semaphore stops any single
// tenant from consuming the whole pool — the noisy-neighbour guard.
public sealed class DeliveryPump(IServiceBus bus, ITenantConcurrency limits, DeliveryWorker worker)
{
    public async Task RunAsync(CancellationToken ct)
    {
        await foreach (var msg in bus.ReceiveAsync("webhook-deliveries", ct))
        {
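            // Caveat: while this tenant is at its cap, the await below pauses the receive loop.
            // If that delays other tenants, defer the message (re-schedule it) instead of waiting.
            var slot = await limits.AcquireAsync(msg.TenantId, maxPerTenant: 20, ct); // isolation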
            _ = DeliverAndRelease(msg, slot, ct);   // fan out; don't block the receive loop
        }
    }
}

Diagnostic: without the per-tenant cap, a customer whose endpoint hangs for 30s will, at enough volume, occupy every worker — and everyone else's deliveries stall behind them. The cap means a broken tenant can waste at most its own 20 slots.
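
One way to picture the cap is a semaphore per tenant. The sketch below is a hypothetical in-memory limiter, not the ITenantConcurrency implementation used above; it uses a non-waiting TryAcquire so that a saturated tenant can never stall the caller.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical in-memory limiter: one semaphore per tenant.
public sealed class TenantLimiter
{
    private readonly int _maxPerTenant;
    private readonly ConcurrentDictionary<string, SemaphoreSlim> _slots = new();

    public TenantLimiter(int maxPerTenant) => _maxPerTenant = maxPerTenant;

    // Returns a releaser when a slot is free, or null when the tenant is at its cap.
    // Never waits, so a saturated tenant cannot block the caller.
    public IDisposable? TryAcquire(string tenantId)
    {
        var gate = _slots.GetOrAdd(tenantId, _ => new SemaphoreSlim(_maxPerTenant, _maxPerTenant));
        return gate.Wait(0) ? new Releaser(gate) : null;
    }

    private sealed class Releaser : IDisposable
    {
        private SemaphoreSlim? _gate;
        public Releaser(SemaphoreSlim gate) => _gate = gate;

        // Idempotent: disposing twice must not free an extra slot.
        public void Dispose() => Interlocked.Exchange(ref _gate, null)?.Release();
    }
}
```

When TryAcquire returns null, the caller can re-schedule the message for a short delay and move on to the next one, so other tenants keep flowing.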

Mattrx metric: per-tenant caps eliminated noisy-neighbour incidents entirely — a single dead endpoint no longer moves the fleet's delivery p95. Peak concurrency sits around ~1,400 in-flight deliveries, exactly what the back-of-envelope predicted.

Fix 3: retries, exponential backoff + jitter, and the dead-letter queue

Before

One failure = one lost event. Or, worse, a naive while(!ok) retry() hot-loops against a struggling endpoint and DDoSes it back.

After

On a retryable failure, re-queue the message with a delay that grows exponentially and is jittered to avoid synchronized retry storms. After a fixed number of attempts, dead-letter it.

public sealed class DeliveryWorker(IHttpClientFactory http, IWebhookStore store, ISigner signer, IServiceBus bus)
{
    public async Task HandleAsync(WebhookMessage msg, CancellationToken ct)
    {
        var result = await DeliverAsync(msg, ct);
        if (result.Ok) { await store.RecordSuccessAsync(msg, result.Status, ct); return; }

        if (result.Retryable && msg.Attempt + 1 < MaxAttempts)
        {
            var delay = NextDelay(msg.Attempt);                       // backoff + jitter
            await bus.ScheduleAsync("webhook-deliveries", msg.NextAttempt(), delay, ct);
        }
        else
        {
            await bus.DeadLetterAsync(msg, reason: result.Describe(), ct);  // give up, visibly
            await store.RecordDeadLetterAsync(msg, result, ct);
        }
    }

    // Exponential backoff with FULL jitter, capped. 8 attempts span ~24h.
    private static TimeSpan NextDelay(int attempt)
    {
        var baseSeconds = Math.Min(BaseDelaySeconds * Math.Pow(2, attempt), MaxDelaySeconds); // cap at 6h
        var jittered = Random.Shared.NextDouble() * baseSeconds;                               // full jitter: uniform in [0, baseSeconds)
        return TimeSpan.FromSeconds(jittered);
    }
}

Not every failure is retryable: a 410 Gone or 400 Bad Request is the endpoint telling you to stop — dead-letter immediately. A 503, 429, timeout, or connection reset is transient — retry.

Diagnostic: jitter is not optional. Without it, ten thousand events that failed at the same instant (a customer's 60-second deploy) all retry at the same instant, hammering them the moment they recover. Full jitter spreads the herd.
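
To make the schedule concrete, here is a small sketch that computes the delay ceilings and draws a fully jittered delay under each one. The Backoff class is hypothetical, and the 30-minute base in the note below is an assumed value chosen for illustration; only the 6-hour cap and the 8 attempts come from this post.

```csharp
using System;
using System.Linq;

// Hypothetical helper: capped exponential backoff with full jitter.
public static class Backoff
{
    // Upper bound for the delay before retry number `attempt` (0-based):
    // base * 2^attempt, capped.
    public static double CeilingSeconds(int attempt, double baseSeconds, double capSeconds) =>
        Math.Min(baseSeconds * Math.Pow(2, attempt), capSeconds);

    // Full jitter: a uniform draw from [0, ceiling).
    public static TimeSpan NextDelay(int attempt, double baseSeconds, double capSeconds, Random rng) =>
        TimeSpan.FromSeconds(rng.NextDouble() * CeilingSeconds(attempt, baseSeconds, capSeconds));

    // Worst-case total wait for one event: the sum of (attempts - 1) ceilings.
    public static double WorstCaseTotalSeconds(int attempts, double baseSeconds, double capSeconds) =>
        Enumerable.Range(0, attempts - 1).Sum(a => CeilingSeconds(a, baseSeconds, capSeconds));
}
```

With an assumed 30-minute base and the 6-hour cap, the ceilings run 30m, 1h, 2h, 4h, 6h, 6h, 6h: a worst case of 25.5 hours across 8 attempts, and about half that on average once full jitter is applied.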

Mattrx metric: retries lift delivery from ~96% first-attempt to ~99.98% eventual. The remaining ~0.02% land in the dead-letter queue — visible, queryable, and alertable, never silently dropped.

Fix 4: idempotency & de-duplication

Before

At-least-once means customers will sometimes get the same event twice (we delivered, our process died before recording success, we retried). Without a dedup key, that's a duplicate charge, a double email, a corrupted count on their side.

After

Every event carries a stable id (the outbox Id, unchanged across all retries) in a header. Customers de-duplicate on it; we document it as a contract.

req.Headers.Add("X-Mattrx-Event-Id", msg.Id.ToString());   // same id on every retry of this event
req.Headers.Add("X-Mattrx-Timestamp", msg.TimestampUnix.ToString());

The customer's handler becomes idempotent with a few lines:

// Customer-side (illustrative): ignore an event id you've already processed.
if (await seen.ExistsAsync(eventId)) return Ok();   // duplicate — safe no-op
await ProcessAsync(payload);
await seen.RecordAsync(eventId, ttl: TimeSpan.FromDays(2));

Diagnostic: you cannot make delivery exactly-once, but you can make processing effectively-once by shipping a stable id and telling customers to key on it. The honesty is the feature: "we deliver at least once; de-dupe on X-Mattrx-Event-Id."
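
A check-then-record handler can still process the same event twice if two deliveries arrive concurrently. One possible shape for a race-free version claims the id first and releases the claim if processing fails. This is a hypothetical customer-side sketch: the IdempotentHandler class is illustrative, and an in-memory set (with no expiry) stands in for a durable store.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical customer-side handler; an in-memory set stands in for a real store.
public sealed class IdempotentHandler
{
    private readonly ConcurrentDictionary<string, byte> _seen = new();
    private readonly Func<string, Task> _process;

    public IdempotentHandler(Func<string, Task> process) => _process = process;

    // Returns true if the event was processed, false if it was a duplicate.
    public async Task<bool> HandleAsync(string eventId, string payload)
    {
        // Claim the id atomically: only one concurrent delivery wins.
        if (!_seen.TryAdd(eventId, 0)) return false;
        try
        {
            await _process(payload);
            return true;
        }
        catch
        {
            // Release the claim so the sender's retry is processed, not dropped.
            _seen.TryRemove(eventId, out _);
            throw;
        }
    }
}
```

The handler would be called with the value of the X-Mattrx-Event-Id header as eventId; a duplicate returns false and can be answered with a 2xx.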

Mattrx metric: the stable event id turned duplicate deliveries from support tickets into a documented, handled non-event.

Fix 5: security — HMAC-signed payloads over HTTPS

Before

An unauthenticated POST to a customer URL. The customer has no way to know the request actually came from Mattrx and wasn't forged or tampered with.

After

Sign each payload with a per-tenant secret using HMAC-SHA256, over the timestamp + event id + body, and require HTTPS. The timestamp lets the customer reject stale/replayed requests.

public sealed class HmacSigner : ISigner
{
    public string Sign(string secret, Guid eventId, long timestamp, string body)
    {
        // Signing the timestamp + id + body gives integrity AND replay protection.
        var signingInput = $"{timestamp}.{eventId}.{body}";
        using var hmac = new HMACSHA256(Encoding.UTF8.GetBytes(secret));
        var hash = hmac.ComputeHash(Encoding.UTF8.GetBytes(signingInput));
        return Convert.ToHexString(hash).ToLowerInvariant();
    }
}

req.Headers.Add("X-Mattrx-Signature", $"sha256={signer.Sign(endpoint.Secret, msg.Id, ts, body)}");

The customer verifies by recomputing the HMAC with their shared secret and comparing in constant time, and rejects anything whose timestamp is older than a few minutes.
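
The verification step can be sketched as follows. This is a hypothetical customer-side helper, not code Mattrx ships: the WebhookVerifier name and the 300-second tolerance are assumptions, while the signing input and the sha256= prefix follow the signer shown above.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical customer-side verifier for the signing scheme described above.
public static class WebhookVerifier
{
    public static bool Verify(string secret, string eventId, long timestamp, string body,
                              string signatureHeader, long nowUnix, long toleranceSeconds = 300)
    {
        // Reject stale (or far-future) timestamps before doing any crypto.
        if (Math.Abs(nowUnix - timestamp) > toleranceSeconds) return false;

        const string prefix = "sha256=";
        if (!signatureHeader.StartsWith(prefix, StringComparison.Ordinal)) return false;

        byte[] provided;
        try { provided = Convert.FromHexString(signatureHeader.AsSpan(prefix.Length)); }
        catch (FormatException) { return false; }

        // Recompute over the same input the sender signed: timestamp.eventId.body
        var expected = HMACSHA256.HashData(
            Encoding.UTF8.GetBytes(secret),
            Encoding.UTF8.GetBytes($"{timestamp}.{eventId}.{body}"));

        // Constant-time comparison: no early exit on the first differing byte.
        return CryptographicOperations.FixedTimeEquals(provided, expected);
    }
}
```

The body argument must be the raw request body exactly as received, and eventId and timestamp are the values of the X-Mattrx-Event-Id and X-Mattrx-Timestamp headers.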

Diagnostic: the signature covers the exact bytes you send, so a proxy that "helpfully" reformats JSON breaks verification — sign and send the raw serialized body, and tell customers to verify against the raw bytes.

Mattrx metric: every one of the ~15M daily deliveries is HMAC-signed and HTTPS-only; the timestamp window blocks replayed deliveries at the customer's edge.

Fix 6: chronically failing customers — the circuit breaker

Before

A customer deletes their endpoint but leaves it configured. Every event to them fails, retries 8 times over 24h, and burns worker capacity forever — multiplied across every dead endpoint.

After

Track consecutive failures per endpoint. After a threshold, auto-disable the endpoint (open the circuit) and email the owner. Any success resets the counter (closes it).

public async Task RecordFailureAsync(string endpointId, CancellationToken ct)
{
    var failures = await store.IncrementConsecutiveFailuresAsync(endpointId, ct);
    if (failures >= DisableThreshold)   // 20 consecutive
    {
        await store.DisableEndpointAsync(endpointId,
            reason: "auto-disabled after 20 consecutive failures", ct);
        await notifications.EmailEndpointOwnerAsync(endpointId, ct);
    }
}

public Task RecordSuccessAsync(string endpointId, CancellationToken ct) =>
    store.ResetConsecutiveFailuresAsync(endpointId, ct);   // close the breaker

Disabled endpoints are skipped at delivery time (the worker checks the endpoint's status before sending), so they stop consuming retry capacity immediately. The customer re-enables from their dashboard once fixed.

Diagnostic: without a breaker, your retry budget is silently consumed by endpoints that will never succeed. The breaker converts "waste capacity forever" into "give up loudly, tell the human, move on."

Mattrx metric: the breaker auto-disables ~40 chronically-dead endpoints/day, reclaiming the worker capacity they'd otherwise waste and turning silent failure into an actionable email.

Fix 7: observability — per-customer delivery status

Before

"Did my webhook fire?" was unanswerable. We had application logs, not a delivery ledger.

After

Every attempt writes a delivery record (event id, endpoint, attempt, status code, latency, outcome), powering a per-customer delivery dashboard and internal alerts.

public sealed record DeliveryAttempt(
    Guid EventId, string TenantId, string EndpointId,
    int Attempt, int StatusCode, int LatencyMs,
    DeliveryOutcome Outcome, DateTimeOffset At);   // Delivered | Retrying | DeadLettered

Customers see each event's status and can replay a dead-lettered event after fixing their endpoint. Internally, we alert on delivery success-rate dips and dead-letter spikes per tenant.

Diagnostic: at 15M/day, aggregate "99.98% delivered" hides the one tenant at 40%. Per-customer status is what makes a partner integration debuggable instead of a mystery.
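
A per-tenant breakdown is a simple group-by over the delivery ledger. The sketch below is illustrative only: DeliveryStats is a hypothetical helper, and it takes simplified (tenant, delivered) pairs rather than the full DeliveryAttempt record.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: success rate per tenant from finished events.
public static class DeliveryStats
{
    // events: one (TenantId, Delivered) pair per event that reached a terminal state.
    public static Dictionary<string, double> SuccessRateByTenant(
        IEnumerable<(string TenantId, bool Delivered)> events) =>
        events.GroupBy(e => e.TenantId)
              .ToDictionary(g => g.Key, g => g.Count(e => e.Delivered) / (double)g.Count());
}
```

For example, if tenant "a" delivers 98 of 100 events and tenant "b" delivers 2 of 5, the aggregate is 100 of 105 (about 95%), while the per-tenant view shows "b" at 40%.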

Mattrx metric: delivery success rate, p95 delivery latency, and dead-letter volume are dashboarded per tenant; a customer endpoint degrading is now a chart, not a support thread.

The delivery lifecycle

  Pending --publish--> Queued --deliver--> Delivered (2xx)   [terminal, success]
                          |                     ^
                          | 5xx / 429 / timeout |
                          v                     |
                      Retrying --backoff+jitter-+   (attempt < 8)
                          |
                          | attempt == 8  OR  4xx non-retryable (400/410)
                          v
                     Dead-lettered   [terminal, visible] --> per-customer status + alert + replay

  Side effect: 20 consecutive failures for an endpoint => circuit OPEN => endpoint auto-disabled
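
The diagram can also be read as a single transition function. This is a minimal sketch under stated assumptions: the DeliveryState enum and Lifecycle class are hypothetical names, attempts are numbered from 1, and a null status stands for "no HTTP response" (timeout or network error).

```csharp
using System;

public enum DeliveryState { Pending, Queued, Retrying, Delivered, DeadLettered }

// Hypothetical encoding of the lifecycle diagram.
public static class Lifecycle
{
    public const int MaxAttempts = 8;

    // State after delivery attempt number `attempt` (1-based) finishes.
    // status is null when no HTTP response arrived (timeout, network error).
    public static DeliveryState AfterAttempt(int attempt, int? status)
    {
        if (status is >= 200 and < 300) return DeliveryState.Delivered;

        // Transient failures: no response, 429, or any 5xx.
        bool retryable = status is null || status == 429 || status >= 500;
        if (retryable && attempt < MaxAttempts) return DeliveryState.Retrying;

        // Attempts exhausted, or a response that is not worth retrying (400, 410, ...).
        return DeliveryState.DeadLettered;
    }
}
```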

The numbers, in one place

Metric	Naive sync (before)	Outbox + queue + workers (after)
Events dropped per deploy	thousands	0
API p95 under slow endpoints	seconds	120 ms (unaffected)
First-attempt delivery	n/a (no retry)	~96%
Eventual delivery	best-effort	~99.98%
Retry attempts / window	0	8 / ~24h (backoff + jitter)
Dead-letter rate	(silent loss)	~0.02% (visible)
Peak concurrency	thread-pool bound	~1,400 in-flight
Noisy-neighbour isolation	none	per-tenant caps
Signed payloads	no	100% HMAC-SHA256 + HTTPS
Auto-disabled dead endpoints/day	0 (wasted forever)	~40

Design checklist

The honest stuff: when NOT to build this

This machinery earns its keep at scale and with untrusted endpoints. Skip parts of it when the situation is simpler:

Low volume. A few hundred events a day? An outbox table + one background worker + a retry column gets you there. Don't stand up Service Bus and a worker fleet for that.
You control both ends. Internal service-to-service can publish to a durable queue directly — the HMAC signing, endpoint config, and circuit breaker exist for untrusted external endpoints.
Dropping events is acceptable. Best-effort telemetry doesn't need an outbox. If losing some is fine, say so and save the complexity.
You need a synchronous answer. If the caller must have the result inline, a webhook is the wrong tool — that's an RPC.
You're promising global ordering. Per-tenant partitioning gives approximate per-tenant order at best. Total global order across tenants is a much harder problem, so don't promise what you can't cheaply deliver.
You haven't measured. Build for the volume you have. The outbox + a single worker scales surprisingly far; add partitioning and caps when a real peak forces it.
You're chasing exactly-once. You can't have it against endpoints you don't control. Design at-least-once + idempotency and be honest with customers about it.

The model to carry forward

At-least-once plus idempotency — never exactly-once. You cannot stop the endpoints from failing, so the whole design is about surviving their failure: persist before you deliver, isolate the slow from the fast, and make giving up a loud, observable, recoverable outcome instead of a silent drop.

Three habits that make it reliable:

Persist before you act. The outbox is the entire difference between "we think we sent it" and "we durably owe it and will."
Design the failure paths first. Retries, dead-letter, and the circuit breaker are the system. The happy-path POST is the trivial part.
Isolate tenants. One customer's dead endpoint must never move another customer's delivery latency — cap concurrency per tenant and mean it.

A webhook really is just a POST. Delivering fifteen million of them a day, to endpoints you don't control, without losing one — that's a distributed system, and it deserves to be designed like one.

TL;DR

Aspect	Naive sync POST (before)	Outbox + queue + workers (after)
Durability	events lost on crash/deploy	persisted before delivery, never lost
API latency	blocked on the customer's endpoint	decoupled; API p95 unaffected
Retries	none	exponential backoff + jitter, 8 attempts / ~24h
Isolation	one slow customer stalls everyone	per-tenant partitioning + concurrency caps
Giving up	fails the API call	dead-letter queue + circuit breaker
Security	ad hoc	HMAC-SHA256 signed, HTTPS, timestamped
Duplicates	unhandled	stable event id; customers de-dupe

~15M events/day ≈ 175/sec average, with peaks 5–10× (~1,500–1,730/sec).
Outbox pattern → zero events dropped after a committed change (we used to lose thousands per deploy).
Decoupling kept API p95 at 120 ms — synchronous webhooks had spiked it into seconds behind slow endpoints.
First-attempt delivery ~96%; ~99.98% eventual after retries.
8 retry attempts over ~24h with exponential backoff + full jitter.
~0.02% permanently fail → dead-letter queue → per-customer status + alert.
Per-tenant queue partitioning + concurrency caps → no noisy-neighbour starvation.
HMAC-SHA256 signatures (per-tenant secret) + timestamp → integrity and replay protection.
Circuit breaker auto-disables ~40 chronically-dead endpoints/day after 20 consecutive failures.
Peak in-flight deliveries ~1,400 (Little's law: 1,730/s × ~0.8s).

The one mental shift: you don't control the endpoints, so you cannot prevent failure — you can only make failure survivable. Persist before you deliver, retry with discipline, isolate the slow from the fast, and make giving up a first-class, observable outcome.

The running example: Mattrx

The problem: why this is deceptively hard

The naive mental model is "an HTTP POST." The real problem is a distributed-systems problem wearing an HTTP POST's clothes:

The endpoints are untrusted and unreliable. They go down, deploy, rate-limit you, return 500s, hang until timeout, and change IP mid-flight. You own none of that.
You must not lose a committed event. If Mattrx tells its own database "campaign completed," the customer's webhook is now owed. Dropping it is a correctness bug, not a blip.
You can't guarantee exactly-once. To an endpoint you don't control, across a network that can fail after the endpoint processed but before you got the 200, exactly-once is a fantasy. The honest target is at-least-once + idempotency.
One slow customer can sink the rest. Shared threads or a shared worker pool mean a single endpoint that takes 30 seconds starves every other customer's deliveries.
Failure is the common case, not the edge case. At 15M/day, "0.1% of endpoints are broken right now" is 15,000 events fighting your retry machinery every day.

Requirements

Functional

Deliver every committed event to the customer's endpoint at least once.
Retry transient failures automatically, with sane backoff.
Expose delivery status per event and per customer (delivered / retrying / dead-lettered).
Let customers configure endpoints + secrets and choose event types.

Non-functional

Durability: never lose an event once its domain change committed.
Scale: 15M/day sustained, absorb 5–10× peaks without falling over.
Isolation: one tenant's broken endpoint must not delay another's deliveries.
Security: payloads signed and tamper-evident; HTTPS only; replay-resistant.
Idempotency: safe to retry; duplicates are expected and identifiable.

Back-of-the-envelope

Throughput: 15,000,000 ÷ 86,400 s ≈ 173/sec average. Real traffic is bursty (campaigns end on the hour, reports finish in batches), so plan for 5–10× peaks ≈ 870–1,730/sec.
Concurrency (Little's law): in-flight = arrival rate × latency. At peak 1,730/s × ~0.8s average delivery ≈ ~1,400 concurrent deliveries. Size the worker pool for that, not for the average.
Storage: keep ~24h of events retryable. 15M rows/day × ~2 KB ≈ ~30 GB working set for the outbox; prune delivered rows aggressively.
Egress: 15M × ~5 KB median payload ≈ ~75 GB/day outbound.

The naive approach — and why it collapses

The first version delivered the webhook inside the request that caused the event.

// BEFORE: fire the webhook synchronously, in the request path.
[HttpPost("campaigns/{id}/complete")]
public async Task<IActionResult> Complete(string id, CancellationToken ct)
{
    await campaigns.CompleteAsync(id, ct);

    var endpoint = await webhooks.GetEndpointAsync(TenantId, "campaign.completed", ct);
    using var http = new HttpClient { Timeout = TimeSpan.FromSeconds(5) };
    await http.PostAsJsonAsync(endpoint.Url, new { type = "campaign.completed", id }, ct); // blocks here

    return Ok();
}

Why it collapses:

Lost events. If the process restarts (deploy, scale-in, crash) between CompleteAsync and the POST, the event is gone forever — the DB says "completed," the customer never hears.
Head-of-line blocking. The API thread waits on the customer's endpoint. A 5-second timeout × many slow customers exhausts the thread pool and takes down unrelated endpoints.
Coupled failure. A customer returning 500 fails the whole API call — their broken server becomes your 500.
No retries. A transient blip = a permanently missed event.

The architecture

  Domain change (API request / job)
        |  (same DB transaction)
        v
  [ OUTBOX table ]   persist-before-deliver — a crash never drops an event
        |
        v
  Dispatcher (relay)   claims Pending rows: FOR UPDATE SKIP LOCKED
        |
        v
  [ Azure Service Bus: webhook-deliveries ]   partitioned by tenantId
        |
        +--> Worker 1 --\
        +--> Worker 2 -----> HTTP POST (HMAC-signed, HTTPS) --> Customer endpoint
        +--> Worker N --/                 |
                                          | 5xx / timeout / network?
                                          v
                          re-queue with exponential backoff + jitter
                                          |
                                 exhausted (8 attempts / ~24h)?
                                          v
                              [ DEAD-LETTER QUEUE ] --> per-customer status + alert

  (N consecutive failures for an endpoint => circuit OPEN => endpoint auto-disabled)

Now each box, with the before that broke and the after that holds.

Fix 1: the Outbox Pattern — persist before you deliver

Before

Deliver first, hope it worked. A crash between the state change and the send loses the event (shown above).

After

// AFTER: the outbox row commits atomically with the state change.
public async Task CompleteCampaignAsync(string tenantId, string campaignId, CancellationToken ct)
{
    await using var tx = await db.BeginTransactionAsync(ct);

    await campaigns.MarkCompletedAsync(campaignId, ct);

    await db.Outbox.InsertAsync(new OutboxEvent
    {
        Id = Guid.NewGuid(),                 // stable event id == idempotency key
        TenantId = tenantId,
        Type = "campaign.completed",
        Payload = JsonSerializer.Serialize(new { campaignId }),
        Status = OutboxStatus.Pending,
        CreatedAt = clock.UtcNow,
    }, ct);

    await tx.CommitAsync(ct);   // state change + event, all or nothing
}

A background relay polls the outbox and publishes to the queue, claiming rows with FOR UPDATE SKIP LOCKED so multiple relay instances never grab the same row:

public sealed class OutboxDispatcher(IDb db, IServiceBus bus)
{
    public async Task PumpAsync(CancellationToken ct)
    {
        // Claim a batch without blocking other dispatchers on locked rows.
        var batch = await db.Outbox.ClaimPendingAsync(limit: 500, ct); // UPDATE ... RETURNING, SKIP LOCKED
        foreach (var e in batch)
        {
            await bus.PublishAsync("webhook-deliveries", e.ToMessage(), partitionKey: e.TenantId, ct);
            await db.Outbox.MarkPublishedAsync(e.Id, ct);
        }
    }
}

Diagnostic: the outbox turns "we sent it" (a hope) into "we durably owe it" (a fact). Delivery becomes a retryable background task over committed state, not a fragile step in the request path.

Mattrx metric: events dropped per deploy went from thousands to zero. The outbox is the single change that made the whole system trustworthy.

Fix 2: queue + dispatcher + workers — parallelism without noisy neighbours

Before

Even after the outbox, a single worker loop delivering events one at a time can't keep up with peaks, and a shared pool lets one slow tenant hog every worker.

After

// Each worker pulls messages, but a per-tenant semaphore stops any single
// tenant from consuming the whole pool — the noisy-neighbour guard.
public sealed class DeliveryPump(IServiceBus bus, ITenantConcurrency limits, DeliveryWorker worker)
{
    public async Task RunAsync(CancellationToken ct)
    {
        await foreach (var msg in bus.ReceiveAsync("webhook-deliveries", ct))
        {
            var slot = await limits.AcquireAsync(msg.TenantId, maxPerTenant: 20, ct); // isolation
            _ = DeliverAndRelease(msg, slot, ct);   // fan out; don't block the receive loop
        }
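    }

    // Assumed helper (the original omits it): run the delivery, then always hand the
    // tenant's slot back, even if delivery throws. Assumes the slot is IDisposable.
    private async Task DeliverAndRelease(WebhookMessage msg, IDisposable slot, CancellationToken ct)
    {
        try { await worker.HandleAsync(msg, ct); }
        finally { slot.Dispose(); }
    }
}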
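
The post does not show what sits behind ITenantConcurrency. A minimal in-process sketch of one way to build it, assuming one SemaphoreSlim per tenant, could look like this. The TenantConcurrency class, its Available helper, and the choice to pass the cap in the constructor rather than per call are all hypothetical, not Mattrx's implementation.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical in-process limiter: one SemaphoreSlim per tenant, created on first use.
public sealed class TenantConcurrency
{
    private readonly ConcurrentDictionary<string, SemaphoreSlim> _gates = new();
    private readonly int _maxPerTenant;

    public TenantConcurrency(int maxPerTenant) => _maxPerTenant = maxPerTenant;

    // Waits until the tenant has a free slot; dispose the result to give it back.
    public async Task<IDisposable> AcquireAsync(string tenantId, CancellationToken ct = default)
    {
        var gate = _gates.GetOrAdd(tenantId, _ => new SemaphoreSlim(_maxPerTenant, _maxPerTenant));
        await gate.WaitAsync(ct);
        return new Slot(gate);
    }

    // Free slots for a tenant (full capacity if the tenant has never been seen).
    public int Available(string tenantId) =>
        _gates.TryGetValue(tenantId, out var gate) ? gate.CurrentCount : _maxPerTenant;

    private sealed class Slot : IDisposable
    {
        private SemaphoreSlim? _gate;
        public Slot(SemaphoreSlim gate) => _gate = gate;

        // Interlocked.Exchange ensures a double Dispose releases the slot only once.
        public void Dispose() => Interlocked.Exchange(ref _gate, null)?.Release();
    }
}
```

One thing to watch with this shape: awaiting AcquireAsync inside the receive loop pauses the loop while a tenant is at its cap, so the cap protects the worker pool but relies on queue partitioning to keep other tenants' messages flowing.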

Fix 3: retries, exponential backoff + jitter, and the dead-letter queue

Before

One failure = one lost event. Or, worse, a naive while(!ok) retry() hot-loops against a struggling endpoint and DDoSes it back.

After

On a retryable failure, re-queue the message with a delay that grows exponentially and is jittered to avoid synchronized retry storms. After a fixed number of attempts, dead-letter it.

public sealed class DeliveryWorker(IHttpClientFactory http, IWebhookStore store, ISigner signer, IServiceBus bus)
{
    public async Task HandleAsync(WebhookMessage msg, CancellationToken ct)
    {
        var result = await DeliverAsync(msg, ct);
        if (result.Ok) { await store.RecordSuccessAsync(msg, result.Status, ct); return; }

        if (result.Retryable && msg.Attempt + 1 < MaxAttempts)
        {
            var delay = NextDelay(msg.Attempt);                       // backoff + jitter
            await bus.ScheduleAsync("webhook-deliveries", msg.NextAttempt(), delay, ct);
        }
        else
        {
            await bus.DeadLetterAsync(msg, reason: result.Describe(), ct);  // give up, visibly
            await store.RecordDeadLetterAsync(msg, result, ct);
        }
    }

    // Exponential backoff with FULL jitter, capped. 8 attempts span ~24h.
    private static TimeSpan NextDelay(int attempt)
    {
        var baseSeconds = Math.Min(BaseDelaySeconds * Math.Pow(2, attempt), MaxDelaySeconds); // cap at 6h
        var jittered = Random.Shared.NextDouble() * baseSeconds;   // full jitter: uniform in [0, capped delay)
        return TimeSpan.FromSeconds(jittered);
    }
}
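
To see how the cap shapes the retry window, it helps to list the un-jittered ceiling for each retry; jitter then picks a delay at or below that ceiling. The sketch below does only that arithmetic. The post states the 6h cap and the 8 attempts but not BaseDelaySeconds, so the 30-minute base in the usage note is a made-up value chosen purely for illustration, and BackoffSchedule is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BackoffSchedule
{
    // Un-jittered ceiling for the delay scheduled after the given zero-based attempt fails.
    public static TimeSpan Ceiling(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        var seconds = Math.Min(baseDelay.TotalSeconds * Math.Pow(2, attempt), maxDelay.TotalSeconds);
        return TimeSpan.FromSeconds(seconds);
    }

    // Ceilings for every retry: maxAttempts attempts leave maxAttempts - 1 gaps between them.
    public static IReadOnlyList<TimeSpan> Ceilings(int maxAttempts, TimeSpan baseDelay, TimeSpan maxDelay) =>
        Enumerable.Range(0, maxAttempts - 1)
                  .Select(attempt => Ceiling(attempt, baseDelay, maxDelay))
                  .ToList();
}
```

With the hypothetical 30-minute base, Ceilings(8, TimeSpan.FromMinutes(30), TimeSpan.FromHours(6)) yields 30 min, 1h, 2h, 4h, 6h, 6h, 6h: the cap takes over from the fifth retry, and the ceilings sum to 25.5h. Because jitter only shortens each delay, the real window for any one event is at most that sum.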

Not every failure is retryable: a 410 Gone or 400 Bad Request is the endpoint telling you to stop — dead-letter immediately. A 503, 429, timeout, or connection reset is transient — retry.
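
The retryable/non-retryable split above can be sketched as a small classifier. The post only names 400 and 410 as "stop" and 5xx, 429, timeouts, and connection resets as transient; treating every other non-2xx status as non-retryable is an assumption of this sketch, and RetryPolicy is a hypothetical name.

```csharp
using System;

public static class RetryPolicy
{
    // statusCode is null when no HTTP response arrived at all (timeout, connection reset).
    public static bool IsRetryable(int? statusCode) => statusCode switch
    {
        null => true,                  // timeout / connection reset: transient
        429 => true,                   // rate limited: back off and retry
        >= 500 and <= 599 => true,     // server-side failure: transient
        _ => false,                    // 400, 410 and (assumption) any other status: stop
    };
}
```

Keeping the decision in one pure function makes it easy to unit-test and to adjust if a particular status turns out to deserve different treatment.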

Fix 4: idempotency & de-duplication

Before

Retries make delivery at-least-once: when an endpoint processes an event but its response times out, the retry delivers the same event again. With no stable identifier on the request, customers had no reliable way to spot the duplicate, and duplicates turned into support tickets.

After

Every event carries a stable id (the outbox Id, unchanged across all retries) in a header. Customers de-duplicate on it; we document it as a contract.

req.Headers.Add("X-Mattrx-Event-Id", msg.Id.ToString());   // same id on every retry of this event
req.Headers.Add("X-Mattrx-Timestamp", msg.TimestampUnix.ToString());

The customer's handler becomes idempotent with a few lines:

// Customer-side (illustrative): ignore an event id you've already processed.
if (await seen.ExistsAsync(eventId)) return Ok();   // duplicate — safe no-op
await ProcessAsync(payload);
await seen.RecordAsync(eventId, ttl: TimeSpan.FromDays(2));
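
The check-then-record shape above leaves a small window in which two concurrent deliveries of the same event can both pass the ExistsAsync check. One way to close it, sketched here under the assumption that the store can claim an id atomically, is to claim first and release the claim if processing fails. EventDeduplicator is a hypothetical in-memory stand-in; it omits the TTL, and a real handler would need a shared store so every instance sees the same claims.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical in-memory de-duplicator (no expiry, single process only).
public sealed class EventDeduplicator
{
    private readonly ConcurrentDictionary<Guid, byte> _claimed = new();

    // Runs process for the first caller to claim the event id; returns false for a duplicate.
    public async Task<bool> ProcessOnceAsync(Guid eventId, Func<Task> process)
    {
        if (!_claimed.TryAdd(eventId, 0)) return false;   // id already claimed: duplicate
        try
        {
            await process();
            return true;
        }
        catch
        {
            _claimed.TryRemove(eventId, out _);           // failed: let the sender's retry through
            throw;
        }
    }
}
```

A duplicate that arrives while the first copy is still being processed is reported as a duplicate straight away; a stricter handler might answer that case with a retryable status instead of a 2xx.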

Mattrx metric: the stable event id turned duplicate deliveries from support tickets into a documented, handled non-event.

Fix 5: security — HMAC-signed payloads over HTTPS

Before

An unauthenticated POST to a customer URL. The customer has no way to know the request actually came from Mattrx and wasn't forged or tampered with.

After

Sign each payload with a per-tenant secret using HMAC-SHA256, over the timestamp + event id + body, and require HTTPS. The timestamp lets the customer reject stale/replayed requests.

public sealed class HmacSigner : ISigner
{
    public string Sign(string secret, Guid eventId, long timestamp, string body)
    {
        // Signing the timestamp + id + body gives integrity AND replay protection.
        var signingInput = $"{timestamp}.{eventId}.{body}";
        using var hmac = new HMACSHA256(Encoding.UTF8.GetBytes(secret));
        var hash = hmac.ComputeHash(Encoding.UTF8.GetBytes(signingInput));
        return Convert.ToHexString(hash).ToLowerInvariant();
    }
}

req.Headers.Add("X-Mattrx-Signature", $"sha256={signer.Sign(endpoint.Secret, msg.Id, ts, body)}");

The customer verifies by recomputing the HMAC with their shared secret and comparing in constant time, and rejects anything whose timestamp is older than a few minutes.
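
A customer-side check along those lines might look like the sketch below. It mirrors the signing input used by HmacSigner above; WebhookVerifier, its parameter names, and the reading of the timestamp as Unix seconds are assumptions for illustration, not a published Mattrx SDK.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class WebhookVerifier
{
    // Recomputes the signature over "{timestamp}.{eventId}.{body}" and checks freshness.
    public static bool IsValid(string secret, Guid eventId, long timestampUnix, string body,
                               string signatureHeader, DateTimeOffset now, TimeSpan tolerance)
    {
        // Reject stale (or far-future) timestamps before doing any crypto.
        var age = now - DateTimeOffset.FromUnixTimeSeconds(timestampUnix);
        if (age.Duration() > tolerance) return false;

        const string prefix = "sha256=";
        if (!signatureHeader.StartsWith(prefix, StringComparison.Ordinal)) return false;

        using var hmac = new HMACSHA256(Encoding.UTF8.GetBytes(secret));
        var expected = hmac.ComputeHash(Encoding.UTF8.GetBytes($"{timestampUnix}.{eventId}.{body}"));

        byte[] provided;
        try { provided = Convert.FromHexString(signatureHeader[prefix.Length..]); }
        catch (FormatException) { return false; }   // not valid hex: treat as a bad signature

        // Constant-time comparison: no early exit on the first mismatching byte.
        return CryptographicOperations.FixedTimeEquals(expected, provided);
    }
}
```

The body passed in should be the raw request body exactly as received; re-serializing parsed JSON can change whitespace or key order and break the signature.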

Mattrx metric: every one of the ~15M daily deliveries is HMAC-signed and HTTPS-only; the timestamp window blocks replayed deliveries at the customer's edge.

Fix 6: chronically failing customers — the circuit breaker

Before

A customer deletes their endpoint but leaves it configured. Every event to them fails, is attempted 8 times over ~24h, and burns worker capacity forever, multiplied across every dead endpoint.

After

Track consecutive failures per endpoint. After a threshold, auto-disable the endpoint (open the circuit) and email the owner. Any success resets the counter (closes it).

public async Task RecordFailureAsync(string endpointId, CancellationToken ct)
{
    var failures = await store.IncrementConsecutiveFailuresAsync(endpointId, ct);
    if (failures >= DisableThreshold)   // 20 consecutive
    {
        await store.DisableEndpointAsync(endpointId,
            reason: "auto-disabled after 20 consecutive failures", ct);
        await notifications.EmailEndpointOwnerAsync(endpointId, ct);
    }
}

public Task RecordSuccessAsync(string endpointId, CancellationToken ct) =>
    store.ResetConsecutiveFailuresAsync(endpointId, ct);   // close the breaker

Mattrx metric: the breaker auto-disables ~40 chronically dead endpoints/day, reclaiming the worker capacity they'd otherwise waste and turning silent failure into an actionable email.

Fix 7: observability — per-customer delivery status

Before

"Did my webhook fire?" was unanswerable. We had application logs, not a delivery ledger.

After

Every attempt writes a delivery record (event id, endpoint, attempt, status code, latency, outcome), powering a per-customer delivery dashboard and internal alerts.

public sealed record DeliveryAttempt(
    Guid EventId, string TenantId, string EndpointId,
    int Attempt, int StatusCode, int LatencyMs,
    DeliveryOutcome Outcome, DateTimeOffset At);   // Delivered | Retrying | DeadLettered

Customers see each event's status and can replay a dead-lettered event after fixing their endpoint. Internally, we alert on delivery success-rate dips and dead-letter spikes per tenant.

Diagnostic: at 15M/day, aggregate "99.98% delivered" hides the one tenant at 40%. Per-customer status is what makes a partner integration debuggable instead of a mystery.
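
A rough sketch of the per-tenant roll-up that makes this visible, assuming success rate is counted per attempt and using a trimmed-down stand-in for the DeliveryAttempt record (AttemptRow, AttemptOutcome, and DeliveryStats are hypothetical names):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum AttemptOutcome { Delivered, Retrying, DeadLettered }

// Trimmed-down stand-in for the delivery record: only the fields the roll-up needs.
public sealed record AttemptRow(string TenantId, AttemptOutcome Outcome);

public static class DeliveryStats
{
    // Fraction of attempts that ended in Delivered, keyed by tenant.
    public static Dictionary<string, double> SuccessRateByTenant(IEnumerable<AttemptRow> attempts) =>
        attempts.GroupBy(a => a.TenantId)
                .ToDictionary(g => g.Key,
                              g => g.Count(a => a.Outcome == AttemptOutcome.Delivered) / (double)g.Count());

    // Tenants whose success rate falls below the alert threshold, sorted by id.
    public static List<string> Degraded(IEnumerable<AttemptRow> attempts, double threshold) =>
        SuccessRateByTenant(attempts).Where(kv => kv.Value < threshold)
                                     .Select(kv => kv.Key)
                                     .OrderBy(t => t)
                                     .ToList();
}
```

Grouping before averaging is the whole point: a healthy aggregate can only hide a failing tenant when the rows are pooled first.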

Mattrx metric: delivery success rate, p95 delivery latency, and dead-letter volume are dashboarded per tenant; a customer endpoint degrading is now a chart, not a support thread.

The delivery lifecycle

  Pending --publish--> Queued --deliver--> Delivered (2xx)   [terminal, success]
                          |                     ^
                          | 5xx / 429 / timeout |
                          v                     |
                      Retrying --backoff+jitter-+   (attempt < 8)
                          |
                          | attempt == 8  OR  4xx non-retryable (400/410)
                          v
                     Dead-lettered   [terminal, visible] --> per-customer status + alert + replay

  Side effect: 20 consecutive failures for an endpoint => circuit OPEN => endpoint auto-disabled

The numbers, in one place

Metric	Naive sync (before)	Outbox + queue + workers (after)
Events dropped per deploy	thousands	0
API p95 under slow endpoints	seconds	120 ms (unaffected)
First-attempt delivery	n/a (no retry)	~96%
Eventual delivery	best-effort	~99.98%
Retry attempts / window	0	8 / ~24h (backoff + jitter)
Dead-letter rate	(silent loss)	~0.02% (visible)
Peak concurrency	thread-pool bound	~1,400 in-flight
Noisy-neighbour isolation	none	per-tenant caps
Signed payloads	no	100% HMAC-SHA256 + HTTPS
Auto-disabled dead endpoints/day	0 (wasted forever)	~40

Design checklist

The honest stuff: when NOT to build this

This machinery earns its keep at scale and with untrusted endpoints. Skip parts of it when the situation is simpler:

Low volume. A few hundred events a day? An outbox table + one background worker + a retry column gets you there. Don't stand up Service Bus and a worker fleet for that.
You control both ends. Internal service-to-service can publish to a durable queue directly — the HMAC signing, endpoint config, and circuit breaker exist for untrusted external endpoints.
Dropping events is acceptable. Best-effort telemetry doesn't need an outbox. If losing some is fine, say so and save the complexity.
You need a synchronous answer. If the caller must have the result inline, a webhook is the wrong tool — that's an RPC.
You're promising global ordering. Per-tenant partitioning gives per-tenant order. Total global order across tenants is a much harder problem — don't promise what you can't cheaply deliver.
You haven't measured. Build for the volume you have. The outbox + a single worker scales surprisingly far; add partitioning and caps when a real peak forces it.
You're chasing exactly-once. You can't have it against endpoints you don't control. Design at-least-once + idempotency and be honest with customers about it.

The model to carry forward

Three habits that make it reliable:

Persist before you act. The outbox is the entire difference between "we think we sent it" and "we durably owe it and will."
Design the failure paths first. Retries, dead-letter, and the circuit breaker are the system. The happy-path POST is the trivial part.
Isolate tenants. One customer's dead endpoint must never move another customer's delivery latency — cap concurrency per tenant and mean it.

A webhook really is just a POST. Delivering fifteen million of them a day, to endpoints you don't control, without losing one — that's a distributed system, and it deserves to be designed like one.
