We Replaced REST with Kafka and Cut Failures by 90% — The Event-Driven Architecture, Explained
How Mattrx swapped synchronous REST calls for Kafka — decoupling the ingestion pipeline, killing cascading failures, and cutting failures by 90%.
- Author
- Randhir Jassal
- Published
- Reading time
- 21 min read
- Views
- 5 views
Mattrx's event-ingestion pipeline used to be a chain of synchronous REST calls: collect → enrich → roll up → persist. When any link was slow or down, the whole chain failed and customers silently lost events. We replaced those inter-service REST calls with Kafka topics, and end-to-end ingestion failures dropped ~90%. Here's the before architecture, the after architecture, the producer/consumer code, and the trade-offs we'd warn you about.
TL;DR
A synchronous REST call between services is a hidden coupling: the caller's success now depends on the callee being up right now. Chain four of them and your availability is the product of four availabilities — one slow downstream and the whole request fails. For a high-volume, fire-and-forget workload like event ingestion, that's the wrong shape. Events want a log, not a phone call.
We moved Mattrx's ingestion pipeline from REST fan-out to Kafka (managed, Confluent Cloud). The collector now produces an event to a topic and returns immediately; independent consumer groups enrich, roll up, and persist asynchronously, with retries, a dead-letter topic, and replay. A downstream outage no longer reaches the customer — events buffer in the log and drain when the consumer recovers.
| Dimension | Before (synchronous REST) | After (Kafka) |
|---|---|---|
| Coupling | Temporal — caller needs callee up now | Decoupled — producer doesn't know consumers |
| Failure mode | One slow link fails the whole chain | Failures isolated to one consumer |
| Lost events on downstream outage | Yes (timeouts → dropped) | No (buffered in the log, replayed) |
| Ingestion p95 | ~180 ms (waits for the chain) | ~8 ms (produce + 202) |
| Burst handling | Backpressure cascades upstream | Log absorbs the burst |
| Adding a consumer | Edit + redeploy the producer | New consumer group, zero producer changes |
| Retries | Ad-hoc, in-request, amplify load | Built-in: retry topic + DLQ + replay |
Production metrics (8 weeks before vs after the cutover, ingestion pipeline):
- End-to-end ingestion failures (5xx + dropped events): down ~90% (from ~1.9% of requests during downstream incidents to ~0.18%).
- Events lost during a downstream outage: a recurring batch (tens of thousands) → 0 (the log buffers; consumers replay).
- Ingestion endpoint p95: 180 ms → 8 ms (the producer no longer waits for the chain).
- Cascading-failure incidents (one slow service taking the pipeline down): ~3/month → 0.
- Peak absorbed without scaling the web tier: month-end burst now buffers in Kafka instead of melting the API.
- Mean time to add a new event consumer (e.g., fraud scoring): ~1 sprint (edit producer, coordinate deploy) → ~1 day (new consumer group, producer untouched).
- Retry-storm load during incidents: client retries hammering a struggling service → eliminated (the client gets a 202 and walks away).
- On-call pages tagged "ingestion pipeline cascading failure": ~3/month → 0.
The one idea: a REST call couples you to the callee's availability; a Kafka topic couples you to a log that's almost always up. For events that don't need an answer, that swap is most of the resilience win.
The one mental shift
The reflex in a .NET shop is: service A needs something to happen in service B, so A calls B's REST endpoint and waits. That's correct for a query ("give me this tenant's plan") — A genuinely needs the answer to continue. It's wrong for an event ("this campaign event occurred") — A doesn't need anything back, yet it's now blocked on B, C, and D being healthy.
The shift:
Distinguish commands/queries (need an answer, use REST) from events (fire-and-forget, use a log). A synchronous call makes the caller's uptime a function of the callee's uptime. An event published to a durable log decouples them in time — the producer succeeds the instant the event is durably written, and consumers process whenever they can. Resilience isn't added on top; it falls out of the shape.
Once you stop treating "something happened" as a function call and start treating it as a fact appended to a log, cascading failures, retry storms, and burst overload mostly disappear — because nothing downstream is on the request's critical path anymore.
The running example: Mattrx ingestion
Mattrx is a multi-tenant marketing-analytics SaaS — 110k MAU, Angular 19 front end, .NET 9 / ASP.NET Core back end, Azure SQL, ~3,200 req/sec peak, ~1.2B CampaignEvents. The hottest write path is /api/collect: customer websites POST campaign events (impressions, clicks, conversions), and the pipeline enriches them (geo, device, tenant rules), rolls them into analytics aggregates, and persists raw events.
For two years that pipeline was four services calling each other over HTTP. It worked at low load and fell apart under exactly the conditions that mattered: a slow downstream during the month-end burst.
Before: the synchronous REST chain
BEFORE — ingestion is a chain of synchronous REST calls. Availability multiplies.
customer site
│ POST /api/collect
▼
┌───────────┐ HTTP ┌────────────┐ HTTP ┌────────────┐ HTTP ┌───────────┐
│ Collector │ ──────► │ Enrichment │ ──────► │ Analytics │ ──────► │ Persister │
│ (waits) │ ◄────── │ (geo/device)│ ◄────── │ (rollups) │ ◄────── │ (Azure SQL)│
└───────────┘ resp └────────────┘ resp └────────────┘ resp └───────────┘
- Collector can't 200 until ALL of Enrichment, Analytics, Persister return.
- Availability = A_enrich × A_analytics × A_persist (four 99.9%s ≈ 99.6%)
- Analytics slow at month-end → Collector times out → customer's event LOST
- Customer retries → MORE load on the already-struggling chain → retry storm
The collector code was honest about its own fragility:
// BEFORE — the collector waits on the whole chain; any failure loses the event
public async Task<IResult> Collect(CampaignEvent ev, CancellationToken ct)
{
var enriched = await _enrichClient.EnrichAsync(ev, ct); // network hop 1 (can fail)
await _analyticsClient.RollupAsync(enriched, ct); // network hop 2 (can fail)
await _persistClient.SaveAsync(enriched, ct); // network hop 3 (can fail)
return Results.Ok(); // only if ALL three succeed
// if Analytics is slow, this request times out and the event is gone.
}
Three failure modes, all real, all hit during month-end:
- Temporal coupling — the collector's success requires three other services up simultaneously.
- Lost events — a timeout anywhere drops the event; there's no buffer.
- Retry storms — clients retry failed POSTs, piling load onto the service that's already struggling.
After: Kafka decouples the pipeline
AFTER — the collector produces ONE event to a log and returns. Consumers drain async.
customer site
│ POST /api/collect
▼
┌───────────┐ produce ┌──────────────────────────────────────────────┐
│ Collector │ ─────────► │ Kafka: topic "events.raw" │
│ returns │ (8ms) │ [e1][e2][e3][e4][e5][e6] ... durable log │
│ 202 │ └──────────────────────────────────────────────┘
└───────────┘ │ consumer │ consumer │ consumer
│ group A │ group B │ group C
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌───────────┐
│ Enrichment│ │ Analytics │ │ Persister │
└────┬─────┘ └──────────┘ └───────────┘
│ on failure
▼
┌──────────┐ replay ┌──────────────┐
│ retry top.│ ────────► │ events.dlq │ (inspect + replay)
└──────────┘ └──────────────┘
- Collector succeeds the instant the event is in the log. Customer NEVER waits on a consumer.
- Analytics down? Its consumer group lags and catches up later. Zero loss, zero customer impact.
- Each consumer group reads the SAME events independently — add one without touching producers.
The collector becomes trivial and fast: write to the log, return 202.
// AFTER — produce one event to Kafka and return. No downstream on the critical path.
public sealed class Collector(IProducer<string, byte[]> producer)
{
public async Task<IResult> Collect(CampaignEvent ev, CancellationToken ct)
{
var msg = new Message<string, byte[]>
{
Key = ev.TenantId.ToString(), // key by tenant → ordered per tenant
Value = JsonSerializer.SerializeToUtf8Bytes(ev),
Headers = new Headers { { "schema", "v1"u8.ToArray() } },
};
await producer.ProduceAsync("events.raw", msg, ct); // durable write, ~ms
return Results.Accepted(); // 202 — we've got it from here
}
}
Mattrx metric: the collector's p95 dropped 180 ms → 8 ms because it no longer waits on the chain, and an outage in any consumer stopped reaching the customer entirely — the headline 90% failure reduction is mostly this one structural change.
Section 1 — The consumer: groups, manual commit, retry, DLQ
A consumer is a long-running BackgroundService reading from the topic. The important parts: a consumer group (so work parallelizes across partitions and instances), manual offset commit (commit only after successful processing — at-least-once delivery), and a retry → dead-letter path so a poison message can't block the partition forever.
// AFTER — analytics consumer: process, commit on success, retry then DLQ on failure
public sealed class AnalyticsConsumer(
IConsumer<string, byte[]> consumer, IProducer<string, byte[]> producer,
IAnalyticsRollup rollup, ILogger<AnalyticsConsumer> log) : BackgroundService
{
protected override async Task ExecuteAsync(CancellationToken ct)
{
consumer.Subscribe("events.raw");
while (!ct.IsCancellationRequested)
{
var result = consumer.Consume(ct);
var ev = JsonSerializer.Deserialize<CampaignEvent>(result.Message.Value)!;
try
{
await rollup.ApplyAsync(ev, ct); // the actual work
consumer.Commit(result); // commit ONLY after success
}
catch (TransientException) when (Attempts(result) < 5)
{
await producer.ProduceAsync("events.retry", result.Message, ct); // backoff retry
consumer.Commit(result);
}
catch (Exception ex)
{
log.LogError(ex, "Poison message → DLQ at offset {Offset}", result.Offset);
await producer.ProduceAsync("events.dlq", result.Message, ct); // park it
consumer.Commit(result); // don't block the partition
}
}
}
}
// consumer group config — many instances of this service share the partitions
new ConsumerConfig
{
GroupId = "analytics", // each GROUP gets every event; instances split partitions
EnableAutoCommit = false, // we commit manually after processing
AutoOffsetReset = AutoOffsetReset.Earliest,
BootstrapServers = cfg.Brokers,
};
# diagnostic: consumer lag is THE health metric — how far behind real time is each group?
kafka-consumer-groups --bootstrap-server $BROKERS --describe --group analytics
# LAG column climbing = that consumer can't keep up (scale it, don't touch producers)
Mattrx metric: when Analytics had an incident, its consumer group's lag climbed to ~40 minutes and then drained to zero once fixed — with 0 lost events and 0 customer-facing errors. Pre-Kafka, that same incident dropped tens of thousands of events and spiked ingestion 5xx.
Section 2 — Producing reliably from a DB write (the outbox)
There's a subtle trap: if the collector both writes to Azure SQL and produces to Kafka, those are two systems — a crash between them loses or duplicates the event. For events that must reflect a committed DB change (e.g., "campaign published"), use the outbox pattern: write the event to an outbox table in the same transaction as the business data, and a relay publishes it to Kafka.
// AFTER — outbox: the event and the business write commit atomically in ONE SQL transaction
await using var tx = await db.Database.BeginTransactionAsync(ct);
db.Campaigns.Add(campaign);
db.Outbox.Add(new OutboxMessage("events.raw", JsonSerializer.Serialize(ev))); // same tx
await db.SaveChangesAsync(ct);
await tx.CommitAsync(ct);
// a background relay reads unsent outbox rows and produces them to Kafka, then marks them sent.
For the high-volume /api/collect path (which isn't tied to a DB transaction) the direct produce in the section above is fine — at-least-once with idempotent consumers (next section). Use the outbox only where the event must mirror a committed DB state.
Mattrx metric: the outbox on the campaign-lifecycle events eliminated the "DB says published, no event fired" class of bug — dual-write inconsistencies → 0. (Full outbox teardown is linked below.)
Section 3 — Idempotent consumers (because delivery is at-least-once)
Kafka gives you at-least-once delivery by default: a consumer can crash after processing but before committing, and reprocess on restart. So consumers must be idempotent — processing the same event twice must equal processing it once. The simplest tool: a dedup key with a unique constraint.
// AFTER — idempotent rollup: a unique processed-event key makes reprocessing a no-op
public async Task ApplyAsync(CampaignEvent ev, CancellationToken ct)
{
// INSERT the dedup key first; if it already exists, we've seen this event — skip.
var firstTime = await _db.ProcessedEvents
.Where(p => p.EventId == ev.Id)
.ExecuteInsertIfAbsentAsync(ev.Id, ct); // unique index on EventId
if (!firstTime) return; // duplicate → no-op, safe
await _db.Aggregates.IncrementAsync(ev.CampaignId, ev.Type, ct); // the real effect
}
"Exactly-once" is mostly a myth at the boundary; at-least-once delivery + idempotent processing = effectively-once, and it's far simpler than chasing true exactly-once semantics.
Mattrx metric: building idempotency in from day one meant a consumer redeploy or rebalance (which reprocesses a few messages) caused 0 double-counted aggregates — the dedup key absorbs every replay.
Section 4 — Failure isolation, replay, and adding consumers for free
Two superpowers fall out of the log that REST never gave us.
Replay. A bug in the analytics rollup that corrupted a day of aggregates? Fix the consumer, reset its group's offset to yesterday, and reprocess from the log. The raw events are still there — the log is the source of truth, not a transient message.
# replay: reset ONE consumer group to a point in time; other groups are unaffected
kafka-consumer-groups --bootstrap-server $BROKERS --group analytics \
--reset-offsets --to-datetime 2026-06-10T00:00:00.000 --topic events.raw --execute
Add a consumer without touching the producer. When we added fraud scoring, we created a new consumer group on the same events.raw topic. The collector — the producer — didn't change at all. In the REST world that would have meant editing and redeploying the collector to add a fourth call.
EXTENSIBILITY — one topic, many independent readers
events.raw ──► [analytics group] (existing)
──► [persister group] (existing)
──► [enrichment group] (existing)
──► [fraud group] (NEW — producer never knew it was added)
Mattrx metric: shipping the fraud-scoring consumer took ~1 day with zero changes to the ingestion producer or any other consumer — versus the multi-team, edit-the-producer coordination the REST chain required.
Aggregate metrics
| Metric | Before (REST) | After (Kafka) | Delta |
|---|---|---|---|
| Ingestion failures (incident windows) | ~1.9% | ~0.18% | −90% |
| Events lost on downstream outage | tens of thousands | 0 | eliminated |
| Ingestion p95 | 180 ms | 8 ms | −96% |
| Cascading-failure incidents / mo | ~3 | 0 | eliminated |
| Retry-storm load during incidents | severe | none (202 + walk away) | eliminated |
| Time to add a new consumer | ~1 sprint | ~1 day | −80%+ |
| Replay a bad day of processing | impossible | one command | new capability |
| Dual-write inconsistencies (outbox paths) | recurring | 0 | eliminated |
The shape: most of the win isn't throughput, it's failure isolation. The producer succeeds against a log that's almost always up, and every downstream problem became "a consumer is lagging" instead of "the pipeline is down and we're losing data."
Pre-adoption checklist
- The interaction is an event (fire-and-forget), not a query that needs an answer — don't Kafka a request/response.
- Partition key chosen for the ordering you need (we key by
TenantId→ per-tenant order; global order is not guaranteed). - Consumers are idempotent (dedup key + unique constraint) — delivery is at-least-once.
- Manual offset commit after successful processing, not auto-commit before.
- A retry topic + dead-letter topic, with a way to inspect and replay the DLQ.
- DB-coupled events use the outbox pattern (atomic with the business write), not a dual write.
- Consumer lag is monitored and alerted — it's the real health signal.
- Schema strategy decided (versioned JSON or a Schema Registry with Avro) so producers/consumers can evolve.
- Managed Kafka (Confluent Cloud / Event Hubs) unless you have the ops capacity to run brokers.
- A plan for poison messages (they will happen) so one bad event can't wedge a partition.
Honest stuff — when NOT to reach for Kafka
-
Don't replace request/response with Kafka. If the caller needs the answer to continue (auth check, "is this tenant valid?"), that's a synchronous query — REST/gRPC is correct. Kafka is for events, not for getting an answer back.
-
Kafka is operational weight. Brokers, partitions, consumer groups, schema registry, lag monitoring — it's a real system to run. For a small app with modest volume, a simpler queue (Azure Service Bus, a DB-backed queue) gives most of the decoupling with far less to operate. We used managed Kafka precisely so a 5-person team didn't run brokers.
-
"Exactly-once" is a trap. Plan for at-least-once and make consumers idempotent. Chasing true exactly-once (transactions across produce+consume) adds complexity most teams don't need.
-
Ordering is per-partition, not global. You get ordering within a partition (we key by tenant). If you assumed total global ordering, Kafka will surprise you — design the key around the ordering you actually require.
-
You traded synchronous errors for asynchronous lag. The failure didn't vanish; it moved. Instead of a 500 the customer sees, you get a consumer falling behind that you must watch. If nobody monitors lag, you've hidden the problem, not solved it.
-
On Azure, weigh Service Bus / Event Hubs first. We chose Kafka (via Confluent Cloud) for the replay + high-throughput log semantics and ecosystem, but Azure Service Bus (queues/topics) or Event Hubs (with the Kafka endpoint) are first-party and may fit a .NET shop with less friction. This was a deliberate trade, not a default.
-
What we'd do differently: introduce the outbox and idempotency from commit one, before the first consumer. We retrofitted idempotency after a rebalance double-counted a few aggregates — cheap to fix, cheaper to never ship.
The closing mental model
A synchronous call couples you to the callee being up; an event in a log couples you to the log being up. For anything that's "this happened" rather than "tell me this," publishing to a durable log decouples producers from consumers in time — and most of resilience, burst absorption, and extensibility falls out of that one change. The failure doesn't disappear; it moves from the customer's request to a consumer's lag, where you can absorb it on your own schedule.
Three habits this leaves you with:
- Sort every interaction into query vs event. Queries stay synchronous; events go to the log. Most "we need a message bus" problems are really "we mislabeled an event as a call."
- Make consumers idempotent and watch the lag. At-least-once + a dedup key = effectively-once; consumer lag is the one dashboard that tells you the pipeline's truth.
- Treat the log as the source of truth. If you can replay it, a bad consumer is a fix-and-reprocess, not a data-loss incident.
Further reading
- Outbox Pattern — A Complete Guide with Order Processing Example — the atomic produce-with-DB-write from Section 2, in full.
- SAGA Pattern in Microservices — A Complete Guide — coordinating multi-step workflows once they're event-driven.
- Race Conditions and Concurrency Bugs in ASP.NET Core — idempotent consumers are the at-least-once version of the dedup patterns here.
- Scaling ASP.NET Core APIs to 100,000 Requests Per Minute — why a fast, decoupled ingestion endpoint matters at peak.
- When NOT to Use Microservices: A Decision Framework — Kafka is a tool for real seams, not an excuse to distribute everything.
Staring at a chain of REST calls that falls over under load, or unsure whether your interaction is an event or a query? Email randhir.jassal@gmail.com with the call chain and where it breaks, and I'll tell you which links belong on a log and which should stay synchronous.
Get the next issue
A short, curated email with the newest posts and questions.