How We Deliver 15 Million Webhooks a Day Without Losing a Single Event
Delivering 15M webhooks a day to endpoints you don't control is deceptively hard. Here's the outbox + queue + retry design that never drops an event.
- Author: Randhir Jassal
- Reading time: 20 min
A webhook looks like the easiest feature you'll ever build: something happens, you POST it to the customer's URL. Then you ship it, and reality arrives — the customer's endpoint is down, or slow, or returns 500, or times out, or your own process restarts mid-send. Multiply that by 15 million events a day across thousands of endpoints you don't control, and "just POST it" becomes one of the hardest reliability problems in your system.
This is the design we run on Mattrx, our multi-tenant marketing-analytics SaaS, to deliver ~15 million webhook events per day — campaign.completed, conversion.tracked, budget.threshold.crossed, report.ready — to customer-configured endpoints. The first version was a synchronous POST inside the request handler. It lost events on every deploy and turned one slow customer into an outage for everyone. This post is everything we changed, and why.
TL;DR
| Aspect | Naive sync POST (before) | Outbox + queue + workers (after) |
|---|---|---|
| Durability | events lost on crash/deploy | persisted before delivery, never lost |
| API latency | blocked on the customer's endpoint | decoupled; API p95 unaffected |
| Retries | none | exponential backoff + jitter, 8 attempts / ~24h |
| Isolation | one slow customer stalls everyone | per-tenant partitioning + concurrency caps |
| Giving up | fails the API call | dead-letter queue + circuit breaker |
| Security | ad hoc | HMAC-SHA256 signed, HTTPS, timestamped |
| Duplicates | unhandled | stable event id; customers de-dupe |
- ~15M events/day ≈ 173/sec average, with peaks 5–10× (~870–1,730/sec).
- Outbox pattern → zero events dropped after a committed change (we used to lose thousands per deploy).
- Decoupling kept API p95 at 120 ms — synchronous webhooks had spiked it into seconds behind slow endpoints.
- First-attempt delivery ~96%; ~99.98% eventual after retries.
- 8 retry attempts over ~24h with exponential backoff + full jitter.
- ~0.02% permanently fail → dead-letter queue → per-customer status + alert.
- Per-tenant queue partitioning + concurrency caps → no noisy-neighbour starvation.
- HMAC-SHA256 signatures (per-tenant secret) + timestamp → integrity and replay protection.
- Circuit breaker auto-disables ~40 chronically-dead endpoints/day after 20 consecutive failures.
- Peak in-flight deliveries ~1,400 (Little's law: 1,730/s × ~0.8s).
The one mental shift: you don't control the endpoints, so you cannot prevent failure — you can only make failure survivable. Persist before you deliver, retry with discipline, isolate the slow from the fast, and make giving up a first-class, observable outcome.
The running example: Mattrx
Mattrx is a real system — Angular 19 front end, .NET 9 / ASP.NET Core back end (Clean Architecture + CQRS), Azure SQL, Azure App Service. Kafka handles ingestion; Azure Service Bus already carries our report-command queue; Event Grid wires the reactive paths. 110k MAU, ~3,200 req/sec of inbound traffic at peak.
Webhooks are how customers react to what happens inside Mattrx without polling us. Every completed campaign, tracked conversion, or generated report can fan out to a customer endpoint. At our scale that's ~15M outbound deliveries a day — and every one goes to a URL some customer typed into a settings page, running software we have never seen and cannot fix.
The problem: why this is deceptively hard
The naive mental model is "an HTTP POST." The real problem is a distributed-systems problem wearing an HTTP POST's clothes:
- The endpoints are untrusted and unreliable. They go down, deploy, rate-limit you, return 500s, hang until timeout, and change IP mid-flight. You own none of that.
- You must not lose a committed event. If Mattrx tells its own database "campaign completed," the customer's webhook is now owed. Dropping it is a correctness bug, not a blip.
- You can't guarantee exactly-once. To an endpoint you don't control, across a network that can fail after the endpoint processed but before you got the 200, exactly-once is a fantasy. The honest target is at-least-once + idempotency.
- One slow customer can sink the rest. Shared threads or a shared worker pool mean a single endpoint that takes 30 seconds starves every other customer's deliveries.
- Failure is the common case, not the edge case. At 15M/day, if just 0.1% of deliveries hit an endpoint that is broken right now, that is 15,000 events fighting your retry machinery every day.
Requirements
Functional
- Deliver every committed event to the customer's endpoint at least once.
- Retry transient failures automatically, with sane backoff.
- Expose delivery status per event and per customer (delivered / retrying / dead-lettered).
- Let customers configure endpoints + secrets and choose event types.
Non-functional
- Durability: never lose an event once its domain change committed.
- Scale: 15M/day sustained, absorb 5–10× peaks without falling over.
- Isolation: one tenant's broken endpoint must not delay another's deliveries.
- Security: payloads signed and tamper-evident; HTTPS only; replay-resistant.
- Idempotency: safe to retry; duplicates are expected and identifiable.
Back-of-the-envelope
- Throughput: 15,000,000 ÷ 86,400 s ≈ 173/sec average. Real traffic is bursty (campaigns end on the hour, reports finish in batches), so plan for 5–10× peaks ≈ 870–1,730/sec.
- Concurrency (Little's law): in-flight = arrival rate × latency. At peak 1,730/s × ~0.8s average delivery ≈ ~1,400 concurrent deliveries. Size the worker pool for that, not for the average.
- Storage: keep ~24h of events retryable. 15M rows/day × ~2 KB ≈ ~30 GB working set for the outbox; prune delivered rows aggressively.
- Egress: 15M × ~5 KB median payload ≈ ~75 GB/day outbound.
These numbers pick the architecture: average throughput is trivial, but the peaks + the long retry tail + the isolation requirement are what force a queue, a worker pool, and per-tenant partitioning.
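The arithmetic above is easy to re-run when the inputs change. A minimal sketch, assuming a hypothetical `WebhookCapacity` helper (the class and method names are illustrative and not part of Mattrx):

```csharp
using System;

// Back-of-the-envelope sizing from the figures quoted above.
public static class WebhookCapacity
{
    public const double SecondsPerDay = 86_400;

    // Average arrival rate in events per second.
    public static double AveragePerSecond(double eventsPerDay) => eventsPerDay / SecondsPerDay;

    // Little's law: in-flight work = arrival rate x time each item spends in the system.
    public static double InFlight(double arrivalsPerSecond, double latencySeconds) =>
        arrivalsPerSecond * latencySeconds;

    // Daily volume in decimal gigabytes for a given per-item size in KB.
    public static double GigabytesPerDay(double itemsPerDay, double kilobytesEach) =>
        itemsPerDay * kilobytesEach / 1_000_000;
}
```

For example, `WebhookCapacity.InFlight(1730, 0.8)` comes to roughly 1,384, which is the figure rounded to ~1,400 above.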
The naive approach — and why it collapses
The first version delivered the webhook inside the request that caused the event.
// BEFORE: fire the webhook synchronously, in the request path.
[HttpPost("campaigns/{id}/complete")]
public async Task<IActionResult> Complete(string id, CancellationToken ct)
{
    await campaigns.CompleteAsync(id, ct);
    var endpoint = await webhooks.GetEndpointAsync(TenantId, "campaign.completed", ct);
    using var http = new HttpClient { Timeout = TimeSpan.FromSeconds(5) };
    await http.PostAsJsonAsync(endpoint.Url, new { type = "campaign.completed", id }, ct); // blocks here
    return Ok();
}
Why it collapses:
- Lost events. If the process restarts (deploy, scale-in, crash) between CompleteAsync and the POST, the event is gone forever: the DB says "completed," the customer never hears.
- Head-of-line blocking. The API thread waits on the customer's endpoint. A 5-second timeout × many slow customers exhausts the thread pool and takes down unrelated endpoints.
- Coupled failure. A customer returning 500 fails the whole API call — their broken server becomes your 500.
- No retries. A transient blip = a permanently missed event.
Mattrx metric: in the naive era, every deploy dropped thousands of in-flight events, and a single slow endpoint could push API p95 from 120 ms into multiple seconds. That is the problem the rest of this post solves.
The architecture
Domain change (API request / job)
      |   (same DB transaction)
      v
[ OUTBOX table ]        persist-before-deliver: a crash never drops an event
      |
      v
Dispatcher (relay)      claims Pending rows, skipping locked ones (UPDLOCK + READPAST on Azure SQL)
      |
      v
[ Azure Service Bus: webhook-deliveries ]   partitioned by tenantId
      |
      +--> Worker 1 --\
      +--> Worker 2 -----> HTTP POST (HMAC-signed, HTTPS) --> Customer endpoint
      +--> Worker N --/                                             |
                                                        5xx / timeout / network?
                                                                    v
                                             re-queue with exponential backoff + jitter
                                                                    |
                                                     exhausted (8 attempts / ~24h)?
                                                                    v
                                     [ DEAD-LETTER QUEUE ] --> per-customer status + alert

(N consecutive failures for an endpoint => circuit OPEN => endpoint auto-disabled)
Now each box, with the before that broke and the after that holds.
Fix 1: the Outbox Pattern — persist before you deliver
Before
Deliver first, hope it worked. A crash between the state change and the send loses the event (shown above).
After
Write the event into an outbox table in the same database transaction as the domain change. If the transaction commits, the event will be delivered — later, by a separate relay. If it rolls back, the event never existed. No window to lose anything.
// AFTER: the outbox row commits atomically with the state change.
public async Task CompleteCampaignAsync(string tenantId, string campaignId, CancellationToken ct)
{
    await using var tx = await db.BeginTransactionAsync(ct);
    await campaigns.MarkCompletedAsync(campaignId, ct);
    await db.Outbox.InsertAsync(new OutboxEvent
    {
        Id = Guid.NewGuid(), // stable event id == idempotency key
        TenantId = tenantId,
        Type = "campaign.completed",
        Payload = JsonSerializer.Serialize(new { campaignId }),
        Status = OutboxStatus.Pending,
        CreatedAt = clock.UtcNow,
    }, ct);
    await tx.CommitAsync(ct); // state change + event, all or nothing
}
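A background relay polls the outbox and publishes to the queue. It claims rows with a skip-locked read so multiple relay instances never grab the same row. On Azure SQL that means the UPDLOCK and READPAST table hints, the SQL Server counterpart to PostgreSQL's FOR UPDATE SKIP LOCKED: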
public sealed class OutboxDispatcher(IDb db, IServiceBus bus)
{
    public async Task PumpAsync(CancellationToken ct)
    {
        // Claim a batch without blocking other dispatchers on locked rows.
        // On Azure SQL: UPDATE ... OUTPUT with the UPDLOCK and READPAST table hints.
        var batch = await db.Outbox.ClaimPendingAsync(limit: 500, ct);
        foreach (var e in batch)
        {
            await bus.PublishAsync("webhook-deliveries", e.ToMessage(), partitionKey: e.TenantId, ct);
            // A crash between publish and mark re-publishes the row on the next pump:
            // at-least-once, which the stable event id makes safe.
            await db.Outbox.MarkPublishedAsync(e.Id, ct);
        }
    }
}
Diagnostic: the outbox turns "we sent it" (a hope) into "we durably owe it" (a fact). Delivery becomes a retryable background task over committed state, not a fragile step in the request path.
Mattrx metric: events dropped per deploy went from thousands to zero. The outbox is the single change that made the whole system trustworthy.
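The stack here is Azure SQL, where skip-locked claiming is usually written with table hints rather than FOR UPDATE SKIP LOCKED. Below is a rough sketch of the statement ClaimPendingAsync might run. The `dbo.Outbox` table, its column names, and the intermediate `'Claimed'` status are assumptions for illustration and do not come from the Mattrx schema.

```csharp
public static class OutboxSql
{
    // Hypothetical schema: dbo.Outbox(Id, TenantId, Type, Payload, Status, CreatedAt).
    // UPDLOCK takes update locks on the rows read, READPAST skips rows another
    // dispatcher has already locked, and OUTPUT returns the rows just claimed.
    public const string ClaimPending = @"
WITH batch AS (
    SELECT TOP (@limit) *
    FROM dbo.Outbox WITH (UPDLOCK, READPAST, ROWLOCK)
    WHERE Status = 'Pending'
    ORDER BY CreatedAt
)
UPDATE batch
SET Status = 'Claimed'
OUTPUT inserted.Id, inserted.TenantId, inserted.Type, inserted.Payload;";
}
```

If a dispatcher can crash after claiming, a claim like this probably also needs a lease (for example a claimed-at column and a sweep that returns stale claims to Pending), otherwise claimed rows would be stranded.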
Fix 2: queue + dispatcher + workers — parallelism without noisy neighbours
Before
Even after the outbox, a single worker loop delivering events one at a time can't keep up with peaks, and a shared pool lets one slow tenant hog every worker.
After
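Publish to Azure Service Bus partitioned by tenantId, and run a pool of competing-consumer workers with per-tenant concurrency caps. Partitioning gives per-tenant ordering on the queue; the caps give isolation. Concurrent delivery and retries can still reorder arrivals at the endpoint, so customers should not depend on strict order.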
// Each worker pulls messages, but a per-tenant semaphore stops any single
// tenant from consuming the whole pool: the noisy-neighbour guard.
public sealed class DeliveryPump(IServiceBus bus, ITenantConcurrency limits, DeliveryWorker worker)
{
    public async Task RunAsync(CancellationToken ct)
    {
        await foreach (var msg in bus.ReceiveAsync("webhook-deliveries", ct))
        {
            // Caveat: awaiting here pauses this receive loop while the tenant is at its cap,
            // so run several pumps or defer the message instead of waiting.
            var slot = await limits.AcquireAsync(msg.TenantId, maxPerTenant: 20, ct); // isolation
            _ = DeliverAndRelease(msg, slot, ct); // fan out; don't block the receive loop
        }
    }
}
Diagnostic: without the per-tenant cap, a customer whose endpoint hangs for 30s will, at enough volume, occupy every worker — and everyone else's deliveries stall behind them. The cap means a broken tenant can waste at most its own 20 slots.
Mattrx metric: per-tenant caps eliminated noisy-neighbour incidents entirely — a single dead endpoint no longer moves the fleet's delivery p95. Peak concurrency sits around ~1,400 in-flight deliveries, exactly what the back-of-envelope predicted.
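One way the per-tenant cap could be implemented in-process is a semaphore per tenant with a non-blocking acquire. This is a sketch under assumptions: `TenantConcurrencyLimiter` and `TryAcquire` are hypothetical names, not the `ITenantConcurrency` interface used above, and a multi-instance fleet would need the count shared across workers.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical in-process limiter: one SemaphoreSlim per tenant.
public sealed class TenantConcurrencyLimiter
{
    private readonly ConcurrentDictionary<string, SemaphoreSlim> _slots = new();
    private readonly int _maxPerTenant;

    public TenantConcurrencyLimiter(int maxPerTenant) => _maxPerTenant = maxPerTenant;

    // Returns a releaser if the tenant has a free slot, or null if it is at its cap.
    // Never waits, so a saturated tenant cannot park the caller.
    public IDisposable? TryAcquire(string tenantId)
    {
        var gate = _slots.GetOrAdd(tenantId, _ => new SemaphoreSlim(_maxPerTenant, _maxPerTenant));
        return gate.Wait(0) ? new Releaser(gate) : null;
    }

    private sealed class Releaser : IDisposable
    {
        private SemaphoreSlim? _gate;
        public Releaser(SemaphoreSlim gate) => _gate = gate;

        // Release exactly once, even if Dispose is called twice.
        public void Dispose() => Interlocked.Exchange(ref _gate, null)?.Release();
    }
}
```

When `TryAcquire` returns null, the pump can put the message back with a short delay and move on to the next tenant's message, which keeps one saturated tenant from holding up the receive loop.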
Fix 3: retries, exponential backoff + jitter, and the dead-letter queue
Before
One failure = one lost event. Or, worse, a naive while(!ok) retry() hot-loops against a struggling endpoint and DDoSes it back.
After
On a retryable failure, re-queue the message with a delay that grows exponentially and is jittered to avoid synchronized retry storms. After a fixed number of attempts, dead-letter it.
public sealed class DeliveryWorker(IHttpClientFactory http, IWebhookStore store, ISigner signer, IServiceBus bus)
{
    public async Task HandleAsync(WebhookMessage msg, CancellationToken ct)
    {
        var result = await DeliverAsync(msg, ct);
        if (result.Ok) { await store.RecordSuccessAsync(msg, result.Status, ct); return; }

        // Attempt is zero-based, so this allows MaxAttempts deliveries in total.
        if (result.Retryable && msg.Attempt + 1 < MaxAttempts)
        {
            var delay = NextDelay(msg.Attempt); // backoff + jitter
            await bus.ScheduleAsync("webhook-deliveries", msg.NextAttempt(), delay, ct);
        }
        else
        {
            await bus.DeadLetterAsync(msg, reason: result.Describe(), ct); // give up, visibly
            await store.RecordDeadLetterAsync(msg, result, ct);
        }
    }

    // Exponential backoff with FULL jitter, capped. 8 attempts span up to ~24h.
    private static TimeSpan NextDelay(int attempt)
    {
        var capSeconds = Math.Min(BaseDelaySeconds * Math.Pow(2, attempt), MaxDelaySeconds); // cap at 6h
        var jittered = Random.Shared.NextDouble() * capSeconds; // full jitter: uniform in [0, cap)
        return TimeSpan.FromSeconds(jittered);
    }
}
Not every failure is retryable: a 410 Gone or 400 Bad Request is the endpoint telling you to stop — dead-letter immediately. A 503, 429, timeout, or connection reset is transient — retry.
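That rule can be captured in one small function. A minimal sketch, assuming a hypothetical `DeliveryClassifier` helper; treating 408 as retryable and every other 4xx as permanent is an assumption that goes slightly beyond the codes named above.

```csharp
using System.Net;

public static class DeliveryClassifier
{
    // Classifies a FAILED delivery. statusCode is null when no HTTP response
    // arrived at all (timeout, connection reset, DNS failure).
    public static bool IsRetryable(HttpStatusCode? statusCode)
    {
        if (statusCode is null) return true;            // network-level failure: transient
        var code = (int)statusCode.Value;
        if (code == 408 || code == 429) return true;    // request timeout, rate limited
        if (code >= 500) return true;                   // server-side errors such as 500, 503
        return false;                                   // other 4xx (400, 410, ...): stop retrying
    }
}
```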
Diagnostic: jitter is not optional. Without it, ten thousand events that failed at the same instant (a customer's 60-second deploy) all retry at the same instant, hammering them the moment they recover. Full jitter spreads the herd.
Mattrx metric: retries lift delivery from ~96% first-attempt to ~99.98% eventual. The remaining ~0.02% land in the dead-letter queue — visible, queryable, and alertable, never silently dropped.
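To see how a capped exponential schedule adds up to a retry window, it helps to list the per-attempt upper bounds. The sketch below assumes a 20-minute base delay purely for illustration, since BaseDelaySeconds is not stated here. Only the 6-hour cap and the 8 attempts come from the text, and `BackoffSchedule` is a hypothetical helper.

```csharp
using System;
using System.Linq;

public static class BackoffSchedule
{
    // Upper bound (before jitter) of the delay that follows each failed attempt.
    // With 8 attempts there are 7 waits: after attempts 0 through 6.
    public static double[] CapsSeconds(double baseSeconds, double maxSeconds, int maxAttempts) =>
        Enumerable.Range(0, maxAttempts - 1)
                  .Select(attempt => Math.Min(baseSeconds * Math.Pow(2, attempt), maxSeconds))
                  .ToArray();

    // Full jitter: a uniform draw between zero and the cap for that attempt.
    public static TimeSpan NextDelay(int attempt, double baseSeconds, double maxSeconds, Random rng)
    {
        var cap = Math.Min(baseSeconds * Math.Pow(2, attempt), maxSeconds);
        return TimeSpan.FromSeconds(rng.NextDouble() * cap);
    }
}
```

With the assumed 1,200-second base and a 21,600-second cap, `CapsSeconds(1200, 21600, 8)` sums to 80,400 seconds, about 22 hours in the worst case. Full jitter draws uniformly below each cap, so the expected window is roughly half of that.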
Fix 4: idempotency & de-duplication
Before
At-least-once means customers will sometimes get the same event twice (we delivered, our process died before recording success, we retried). Without a dedup key, that's a duplicate charge, a double email, a corrupted count on their side.
After
Every event carries a stable id (the outbox Id, unchanged across all retries) in a header. Customers de-duplicate on it; we document it as a contract.
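req.Headers.Add("X-Mattrx-Event-Id", msg.Id.ToString()); // same id on every retry of this event
// TimestampUnix must be the send time of this attempt, refreshed on every retry;
// an event-creation time would make late retries fail the customer's freshness check.
req.Headers.Add("X-Mattrx-Timestamp", msg.TimestampUnix.ToString());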
The customer's handler becomes idempotent with a few lines:
// Customer-side (illustrative): ignore an event id you've already processed.
if (await seen.ExistsAsync(eventId)) return Ok(); // duplicate — safe no-op
await ProcessAsync(payload);
await seen.RecordAsync(eventId, ttl: TimeSpan.FromDays(2));
Diagnostic: you cannot make delivery exactly-once, but you can make processing effectively-once by shipping a stable id and telling customers to key on it. The honesty is the feature: "we deliver at least once; de-dupe on X-Mattrx-Event-Id."
Mattrx metric: the stable event id turned duplicate deliveries from support tickets into a documented, handled non-event.
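On the customer side, check-then-record can race when two copies of the same event arrive at once. One option is to claim the id atomically before processing. A minimal in-memory sketch, assuming a hypothetical `SeenEvents` store; a real receiver would more likely use a database unique key or a cache with a TTL.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical in-memory de-dup store keyed by the X-Mattrx-Event-Id value.
public sealed class SeenEvents
{
    private readonly ConcurrentDictionary<Guid, DateTimeOffset> _seen = new();

    // Returns true only for the first caller that claims this event id.
    public bool TryClaim(Guid eventId, DateTimeOffset now) => _seen.TryAdd(eventId, now);

    // Forget ids older than the retention window (it should exceed the sender's retry window).
    public void Prune(DateTimeOffset now, TimeSpan retention)
    {
        foreach (var entry in _seen)
            if (now - entry.Value > retention)
                _seen.TryRemove(entry.Key, out _);
    }
}
```

A handler would call `TryClaim` first and return 200 without side effects when it comes back false. The trade-off is that a claim taken just before a failed processing step has to be released, or the retry will be treated as a duplicate.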
Fix 5: security — HMAC-signed payloads over HTTPS
Before
An unauthenticated POST to a customer URL. The customer has no way to know the request actually came from Mattrx and wasn't forged or tampered with.
After
Sign each payload with a per-tenant secret using HMAC-SHA256, over the timestamp + event id + body, and require HTTPS. The timestamp lets the customer reject stale/replayed requests.
using System;
using System.Security.Cryptography;
using System.Text;

public sealed class HmacSigner : ISigner
{
    public string Sign(string secret, Guid eventId, long timestamp, string body)
    {
        // Signing the timestamp + id + body gives integrity AND replay protection.
        var signingInput = $"{timestamp}.{eventId}.{body}";
        using var hmac = new HMACSHA256(Encoding.UTF8.GetBytes(secret));
        var hash = hmac.ComputeHash(Encoding.UTF8.GetBytes(signingInput));
        return Convert.ToHexString(hash).ToLowerInvariant();
    }
}

// At send time: ts is the same value sent in X-Mattrx-Timestamp.
req.Headers.Add("X-Mattrx-Signature", $"sha256={signer.Sign(endpoint.Secret, msg.Id, ts, body)}");
The customer verifies by recomputing the HMAC with their shared secret and comparing in constant time, and rejects anything whose timestamp is older than a few minutes.
Diagnostic: the signature covers the exact bytes you send, so a proxy that "helpfully" reformats JSON breaks verification — sign and send the raw serialized body, and tell customers to verify against the raw bytes.
Mattrx metric: every one of the ~15M daily deliveries is HMAC-signed and HTTPS-only; the timestamp window blocks replayed deliveries at the customer's edge.
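For the receiving side, verification might look like the sketch below. It assumes the signing input format shown in the signer above (`{timestamp}.{eventId}.{body}`) and the `sha256=` prefix. `WebhookVerifier` and its parameters are hypothetical names, and the tolerance window is the customer's choice.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class WebhookVerifier
{
    // Recomputes the signature over "{timestamp}.{eventId}.{body}" and compares in constant time.
    public static bool IsValid(string secret, string eventId, long timestamp, string rawBody,
                               string signatureHeader, DateTimeOffset now, TimeSpan tolerance)
    {
        // Reject stale or far-future timestamps to limit replay.
        var sentAt = DateTimeOffset.FromUnixTimeSeconds(timestamp);
        if ((now - sentAt).Duration() > tolerance) return false;

        const string prefix = "sha256=";
        if (!signatureHeader.StartsWith(prefix, StringComparison.Ordinal)) return false;

        using var hmac = new HMACSHA256(Encoding.UTF8.GetBytes(secret));
        var expected = hmac.ComputeHash(Encoding.UTF8.GetBytes($"{timestamp}.{eventId}.{rawBody}"));

        byte[] provided;
        try { provided = Convert.FromHexString(signatureHeader.Substring(prefix.Length)); }
        catch (FormatException) { return false; }

        // FixedTimeEquals avoids leaking how many leading bytes matched.
        return CryptographicOperations.FixedTimeEquals(expected, provided);
    }
}
```

The `rawBody` argument should be the request body exactly as received, before any JSON parsing or re-serialization, and `eventId` should be the header value as sent.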
Fix 6: chronically failing customers — the circuit breaker
Before
A customer deletes their endpoint but leaves it configured. Every event to them fails, retries 8 times over 24h, and burns worker capacity forever — multiplied across every dead endpoint.
After
Track consecutive failures per endpoint. After a threshold, auto-disable the endpoint (open the circuit) and email the owner. Any success before the threshold resets the counter; once the circuit is open, it stays open until the customer re-enables the endpoint.
public async Task RecordFailureAsync(string endpointId, CancellationToken ct)
{
    var failures = await store.IncrementConsecutiveFailuresAsync(endpointId, ct);
    if (failures >= DisableThreshold) // 20 consecutive
    {
        await store.DisableEndpointAsync(endpointId,
            reason: "auto-disabled after 20 consecutive failures", ct);
        await notifications.EmailEndpointOwnerAsync(endpointId, ct);
    }
}

public Task RecordSuccessAsync(string endpointId, CancellationToken ct) =>
    store.ResetConsecutiveFailuresAsync(endpointId, ct); // any success resets the failure streak
Disabled endpoints are skipped at delivery time (the worker checks the endpoint's status before sending), so they stop consuming retry capacity immediately. The customer re-enables from their dashboard once fixed.
Diagnostic: without a breaker, your retry budget is silently consumed by endpoints that will never succeed. The breaker converts "waste capacity forever" into "give up loudly, tell the human, move on."
Mattrx metric: the breaker auto-disables ~40 chronically-dead endpoints/day, reclaiming the worker capacity they'd otherwise waste and turning silent failure into an actionable email.
Fix 7: observability — per-customer delivery status
Before
"Did my webhook fire?" was unanswerable. We had application logs, not a delivery ledger.
After
Every attempt writes a delivery record (event id, endpoint, attempt, status code, latency, outcome), powering a per-customer delivery dashboard and internal alerts.
public sealed record DeliveryAttempt(
    Guid EventId, string TenantId, string EndpointId,
    int Attempt, int StatusCode, int LatencyMs,
    DeliveryOutcome Outcome, DateTimeOffset At); // Delivered | Retrying | DeadLettered
Customers see each event's status and can replay a dead-lettered event after fixing their endpoint. Internally, we alert on delivery success-rate dips and dead-letter spikes per tenant.
Diagnostic: at 15M/day, aggregate "99.98% delivered" hides the one tenant at 40%. Per-customer status is what makes a partner integration debuggable instead of a mystery.
Mattrx metric: delivery success rate, p95 delivery latency, and dead-letter volume are dashboarded per tenant; a customer endpoint degrading is now a chart, not a support thread.
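The point about aggregates hiding a failing tenant is easy to demonstrate by grouping attempts per tenant. A minimal sketch, assuming a trimmed-down `Attempt` record and a hypothetical `DeliveryStats` helper rather than the full DeliveryAttempt ledger:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeliveryStats
{
    // Hypothetical trimmed-down attempt record: tenant id plus whether the attempt got a 2xx.
    public readonly record struct Attempt(string TenantId, bool Delivered);

    // Success rate per tenant, so one failing tenant is not averaged away by the fleet.
    public static Dictionary<string, double> SuccessRateByTenant(IEnumerable<Attempt> attempts) =>
        attempts.GroupBy(a => a.TenantId)
                .ToDictionary(g => g.Key, g => g.Count(a => a.Delivered) / (double)g.Count());

    // Tenants whose success rate falls below the alert threshold.
    public static IReadOnlyList<string> TenantsBelow(IEnumerable<Attempt> attempts, double threshold) =>
        SuccessRateByTenant(attempts).Where(kv => kv.Value < threshold)
                                     .Select(kv => kv.Key)
                                     .OrderBy(t => t, StringComparer.Ordinal)
                                     .ToList();
}
```

An alert keyed on `TenantsBelow(attempts, 0.9)` would fire for a tenant sitting at 40% even while the fleet-wide rate still looks healthy.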
The delivery lifecycle
Pending --publish--> Queued --deliver--> Delivered (2xx) [terminal, success]
| ^
| 5xx / 429 / timeout |
v |
Retrying --backoff+jitter-+ (attempt < 8)
|
| attempt == 8 OR 4xx non-retryable (400/410)
v
Dead-lettered [terminal, visible] --> per-customer status + alert + replay
Side effect: 20 consecutive failures for an endpoint => circuit OPEN => endpoint auto-disabled
The numbers, in one place
| Metric | Naive sync (before) | Outbox + queue + workers (after) |
|---|---|---|
| Events dropped per deploy | thousands | 0 |
| API p95 under slow endpoints | seconds | 120 ms (unaffected) |
| First-attempt delivery | n/a (no retry) | ~96% |
| Eventual delivery | best-effort | ~99.98% |
| Retry attempts / window | 0 | 8 / ~24h (backoff + jitter) |
| Dead-letter rate | (silent loss) | ~0.02% (visible) |
| Peak concurrency | thread-pool bound | ~1,400 in-flight |
| Noisy-neighbour isolation | none | per-tenant caps |
| Signed payloads | no | 100% HMAC-SHA256 + HTTPS |
| Auto-disabled dead endpoints/day | 0 (wasted forever) | ~40 |
Design checklist
- Persist the event (outbox) in the same transaction as the domain change.
- Relay with a skip-locked claim (UPDLOCK + READPAST on Azure SQL, FOR UPDATE SKIP LOCKED on PostgreSQL) so multiple dispatchers don't double-send.
- Publish to a queue partitioned by tenant; deliver from a worker pool.
- Enforce per-tenant concurrency caps so one slow endpoint can't starve others.
- Retry only retryable failures; exponential backoff + full jitter; cap attempts.
- Dead-letter exhausted/permanent failures — visible, queryable, replayable.
- Ship a stable event id; document at-least-once + de-dupe on that id.
- HMAC-sign the raw body with a per-tenant secret; require HTTPS; include a timestamp.
- Circuit-break endpoints after N consecutive failures; auto-disable + notify.
- Record per-attempt delivery status; dashboard and alert per tenant.
The honest stuff: when NOT to build this
This machinery earns its keep at scale and with untrusted endpoints. Skip parts of it when the situation is simpler:
- Low volume. A few hundred events a day? An outbox table + one background worker + a retry column gets you there. Don't stand up Service Bus and a worker fleet for that.
- You control both ends. Internal service-to-service can publish to a durable queue directly — the HMAC signing, endpoint config, and circuit breaker exist for untrusted external endpoints.
- Dropping events is acceptable. Best-effort telemetry doesn't need an outbox. If losing some is fine, say so and save the complexity.
- You need a synchronous answer. If the caller must have the result inline, a webhook is the wrong tool — that's an RPC.
- You're promising global ordering. Per-tenant partitioning gives, at best, per-tenant order on the queue. Total global order across tenants is a much harder problem, so don't promise what you can't cheaply deliver.
- You haven't measured. Build for the volume you have. The outbox + a single worker scales surprisingly far; add partitioning and caps when a real peak forces it.
- You're chasing exactly-once. You can't have it against endpoints you don't control. Design at-least-once + idempotency and be honest with customers about it.
The model to carry forward
At-least-once plus idempotency — never exactly-once. You cannot stop the endpoints from failing, so the whole design is about surviving their failure: persist before you deliver, isolate the slow from the fast, and make giving up a loud, observable, recoverable outcome instead of a silent drop.
Three habits that make it reliable:
- Persist before you act. The outbox is the entire difference between "we think we sent it" and "we durably owe it and will."
- Design the failure paths first. Retries, dead-letter, and the circuit breaker are the system. The happy-path POST is the trivial part.
- Isolate tenants. One customer's dead endpoint must never move another customer's delivery latency — cap concurrency per tenant and mean it.
A webhook really is just a POST. Delivering fifteen million of them a day, to endpoints you don't control, without losing one — that's a distributed system, and it deserves to be designed like one.
Designing an event or webhook delivery system and want a second pair of eyes on the failure paths? I'm always happy to compare notes — reach me at randhir.jassal@gmail.com.