We Replaced REST with Kafka and Cut Failures by 90% — The Event-Driven Architecture, Explained

Mattrx's event-ingestion pipeline used to be a chain of synchronous REST calls: collect → enrich → roll up → persist. When any link was slow or down, the whole chain failed and customers silently lost events. We replaced those inter-service REST calls with Kafka topics, and end-to-end ingestion failures dropped ~90%. Here's the before architecture, the after architecture, the producer/consumer code, and the trade-offs we'd warn you about.

TL;DR

A synchronous REST call between services is a hidden coupling: the caller's success now depends on the callee being up right now. Chain four of them and your availability is the product of four availabilities — one slow downstream and the whole request fails. For a high-volume, fire-and-forget workload like event ingestion, that's the wrong shape. Events want a log, not a phone call.

We moved Mattrx's ingestion pipeline from REST fan-out to Kafka (managed, Confluent Cloud). The collector now produces an event to a topic and returns immediately; independent consumer groups enrich, roll up, and persist asynchronously, with retries, a dead-letter topic, and replay. A downstream outage no longer reaches the customer — events buffer in the log and drain when the consumer recovers.

Dimension	Before (synchronous REST)	After (Kafka)
Coupling	Temporal — caller needs callee up now	Decoupled — producer doesn't know consumers
Failure mode	One slow link fails the whole chain	Failures isolated to one consumer
Lost events on downstream outage	Yes (timeouts → dropped)	No (buffered in the log, replayed)
Ingestion p95	~180 ms (waits for the chain)	~8 ms (produce + 202)
Burst handling	Backpressure cascades upstream	Log absorbs the burst
Adding a consumer	Edit + redeploy the producer	New consumer group, zero producer changes
Retries	Ad-hoc, in-request, amplify load	Built-in: retry topic + DLQ + replay

Production metrics (8 weeks before vs after the cutover, ingestion pipeline):

End-to-end ingestion failures (5xx + dropped events): down ~90% (from ~1.9% of requests during downstream incidents to ~0.18%).
Events lost during a downstream outage: a recurring batch (tens of thousands) → 0 (the log buffers; consumers replay).
Ingestion endpoint p95: 180 ms → 8 ms (the producer no longer waits for the chain).
Cascading-failure incidents (one slow service taking the pipeline down): ~3/month → 0.
Peak absorbed without scaling the web tier: month-end burst now buffers in Kafka instead of melting the API.
Mean time to add a new event consumer (e.g., fraud scoring): ~1 sprint (edit producer, coordinate deploy) → ~1 day (new consumer group, producer untouched).
Retry-storm load during incidents: client retries hammering a struggling service → eliminated (the client gets a 202 and walks away).
On-call pages tagged "ingestion pipeline cascading failure": ~3/month → 0.

The one idea: a REST call couples you to the callee's availability; a Kafka topic couples you to a log that's almost always up. For events that don't need an answer, that swap is most of the resilience win.

The one mental shift

The reflex in a .NET shop is: service A needs something to happen in service B, so A calls B's REST endpoint and waits. That's correct for a query ("give me this tenant's plan") — A genuinely needs the answer to continue. It's wrong for an event ("this campaign event occurred") — A doesn't need anything back, yet it's now blocked on B, C, and D being healthy.

The shift:

Distinguish commands/queries (need an answer, use REST) from events (fire-and-forget, use a log). A synchronous call makes the caller's uptime a function of the callee's uptime. An event published to a durable log decouples them in time — the producer succeeds the instant the event is durably written, and consumers process whenever they can. Resilience isn't added on top; it falls out of the shape.

Once you stop treating "something happened" as a function call and start treating it as a fact appended to a log, cascading failures, retry storms, and burst overload mostly disappear — because nothing downstream is on the request's critical path anymore.

The running example: Mattrx ingestion

Mattrx is a multi-tenant marketing-analytics SaaS — 110k MAU, Angular 19 front end, .NET 9 / ASP.NET Core back end, Azure SQL, ~3,200 req/sec peak, ~1.2B CampaignEvents. The hottest write path is /api/collect: customer websites POST campaign events (impressions, clicks, conversions), and the pipeline enriches them (geo, device, tenant rules), rolls them into analytics aggregates, and persists raw events.

For two years that pipeline was four services calling each other over HTTP. It worked at low load and fell apart under exactly the conditions that mattered: a slow downstream during the month-end burst.

Before: the synchronous REST chain

BEFORE — ingestion is a chain of synchronous REST calls. Availability multiplies.

  customer site
      │ POST /api/collect
      ▼
┌───────────┐  HTTP   ┌────────────┐  HTTP   ┌────────────┐  HTTP   ┌───────────┐
│ Collector │ ──────► │ Enrichment │ ──────► │ Analytics   │ ──────► │ Persister │
│ (waits)   │ ◄────── │ (geo/device)│ ◄────── │ (rollups)   │ ◄────── │ (Azure SQL)│
└───────────┘  resp   └────────────┘  resp   └────────────┘  resp   └───────────┘

  - Collector can't 200 until ALL of Enrichment, Analytics, Persister return.
  - Availability = A_collect × A_enrich × A_analytics × A_persist  (four 99.9%s ≈ 99.6%)
  - Analytics slow at month-end → Collector times out → customer's event LOST
  - Customer retries → MORE load on the already-struggling chain → retry storm

The collector code was honest about its own fragility:

// BEFORE — the collector waits on the whole chain; any failure loses the event
public async Task<IResult> Collect(CampaignEvent ev, CancellationToken ct)
{
    var enriched = await _enrichClient.EnrichAsync(ev, ct);        // network hop 1 (can fail)
    await _analyticsClient.RollupAsync(enriched, ct);             // network hop 2 (can fail)
    await _persistClient.SaveAsync(enriched, ct);                // network hop 3 (can fail)
    return Results.Ok();                                          // only if ALL three succeed
    // if Analytics is slow, this request times out and the event is gone.
}

Three failure modes, all real, all hit during month-end:

Temporal coupling — the collector's success requires three other services up simultaneously.
Lost events — a timeout anywhere drops the event; there's no buffer.
Retry storms — clients retry failed POSTs, piling load onto the service that's already struggling.
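
The availability arithmetic from the diagram can be checked in a few lines. This is a sketch that assumes each of the four services is independently available 99.9% of the time; that figure is illustrative, not a measured Mattrx number, and the function names are made up for the example.

```csharp
using System;
using System.Linq;

// Assumed per-service availability (illustrative): collector, enrichment, analytics, persister.
double[] availabilities = { 0.999, 0.999, 0.999, 0.999 };

// A synchronous chain is up only when every link is up, so the availabilities multiply.
double ChainAvailability(double[] links) => links.Aggregate(1.0, (acc, a) => acc * a);

// Convert an availability into expected hours of downtime per 365-day year.
double DowntimeHoursPerYear(double availability) => (1 - availability) * 365 * 24;

double chain = ChainAvailability(availabilities);
Console.WriteLine($"chain availability: {chain:P2}");
Console.WriteLine($"one service: {DowntimeHoursPerYear(0.999):F1} h/yr down");
Console.WriteLine($"whole chain: {DowntimeHoursPerYear(chain):F1} h/yr down");
```

Under those assumptions the chain is expected to be down about 35 hours a year, against under 9 hours for any single service.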

After: Kafka decouples the pipeline

AFTER — the collector produces ONE event to a log and returns. Consumers drain async.

  customer site
      │ POST /api/collect
      ▼
┌───────────┐  produce   ┌──────────────────────────────────────────────┐
│ Collector │ ─────────► │              Kafka: topic "events.raw"         │
│ returns   │   (8ms)    │   [e1][e2][e3][e4][e5][e6] ... durable log     │
│   202     │            └──────────────────────────────────────────────┘
└───────────┘                 │ consumer        │ consumer        │ consumer
                              │ group A         │ group B         │ group C
                              ▼                 ▼                 ▼
                        ┌──────────┐      ┌──────────┐      ┌───────────┐
                        │ Enrichment│      │ Analytics │      │ Persister  │
                        └────┬─────┘      └──────────┘      └───────────┘
                             │ on failure
                             ▼
                        ┌──────────┐ exhausted  ┌──────────────┐
                        │ retry top.│ ────────► │ events.dlq    │ (inspect + replay)
                        └──────────┘            └──────────────┘

  - Collector succeeds the instant the event is in the log. Customer NEVER waits on a consumer.
  - Analytics down? Its consumer group lags and catches up later. Zero loss, zero customer impact.
  - Each consumer group reads the SAME events independently — add one without touching producers.

The collector becomes trivial and fast: write to the log, return 202.

// AFTER — produce one event to Kafka and return. No downstream on the critical path.
public sealed class Collector(IProducer<string, byte[]> producer)
{
    public async Task<IResult> Collect(CampaignEvent ev, CancellationToken ct)
    {
        var msg = new Message<string, byte[]>
        {
            Key = ev.TenantId.ToString(),                 // key by tenant → ordered per tenant
            Value = JsonSerializer.SerializeToUtf8Bytes(ev),
            Headers = new Headers { { "schema", "v1"u8.ToArray() } },
        };
        await producer.ProduceAsync("events.raw", msg, ct);  // durable write, ~ms
        return Results.Accepted();                            // 202 — we've got it from here
    }
}
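
The "durable write" comment in the collector only holds if the producer waits for a replicated acknowledgement. Below is a minimal sketch of a producer configuration that would back that claim, using Confluent.Kafka's ProducerConfig. The broker address and the timeout value are assumptions, not Mattrx's actual settings.

```csharp
using Confluent.Kafka;

// Sketch only: values are assumptions, tune them for your cluster.
var producerConfig = new ProducerConfig
{
    BootstrapServers = "broker-1:9092",   // placeholder address
    Acks = Acks.All,                      // ProduceAsync completes only after all in-sync replicas have the event
    EnableIdempotence = true,             // the broker de-duplicates the producer's own internal retries
    MessageTimeoutMs = 5000,              // fail the produce instead of hanging the request indefinitely
};

// Build one producer and share it; it is thread-safe and meant to be long-lived.
using var producer = new ProducerBuilder<string, byte[]>(producerConfig).Build();
```

If the produce fails or times out, the event is not in the log, so the collector should surface an error to the client rather than return 202.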

Mattrx metric: the collector's p95 dropped 180 ms → 8 ms because it no longer waits on the chain, and an outage in any consumer stopped reaching the customer entirely — the headline 90% failure reduction is mostly this one structural change.

Section 1 — The consumer: groups, manual commit, retry, DLQ

A consumer is a long-running BackgroundService reading from the topic. The important parts: a consumer group (so work parallelizes across partitions and instances), manual offset commit (commit only after successful processing — at-least-once delivery), and a retry → dead-letter path so a poison message can't block the partition forever.

// AFTER: analytics consumer. Process, commit on success, retry then DLQ on failure.
// Attempts(...) is a helper that is not shown in this article.
public sealed class AnalyticsConsumer(
    IConsumer<string, byte[]> consumer, IProducer<string, byte[]> producer,
    IAnalyticsRollup rollup, ILogger<AnalyticsConsumer> log) : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken ct)
    {
        await Task.Yield();                                // Consume() blocks; yield so host startup isn't held up
        consumer.Subscribe("events.raw");
        while (!ct.IsCancellationRequested)
        {
            var result = consumer.Consume(ct);
            try
            {
                // Deserialize inside the try so a malformed payload is parked in the DLQ instead of crashing the loop.
                var ev = JsonSerializer.Deserialize<CampaignEvent>(result.Message.Value)
                    ?? throw new JsonException("Event payload deserialized to null.");
                await rollup.ApplyAsync(ev, ct);            // the actual work
                consumer.Commit(result);                   // commit ONLY after success
            }
            catch (OperationCanceledException) when (ct.IsCancellationRequested)
            {
                break;                                     // shutdown: leave uncommitted so it is redelivered, not DLQ'd
            }
            catch (TransientException) when (Attempts(result) < 5)
            {
                await producer.ProduceAsync("events.retry", result.Message, ct);  // backoff retry
                consumer.Commit(result);
            }
            catch (Exception ex)
            {
                log.LogError(ex, "Poison message → DLQ at offset {Offset}", result.Offset);
                await producer.ProduceAsync("events.dlq", result.Message, ct);    // park it
                consumer.Commit(result);                   // don't block the partition
            }
        }
    }
}
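
The consumer above calls Attempts(result) without showing it. One way such a helper could work is sketched here, assuming the attempt count travels in a message header: decode the count on read, and write an incremented value onto the copy sent to events.retry. The header name x-attempts, the three function names, and the backoff schedule are all hypothetical.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical header name; the article does not say how attempts are tracked.
const string AttemptsHeader = "x-attempts";

// Decode the attempt count from a header value; a missing or malformed header means "first try".
int ReadAttempts(byte[]? headerValue) =>
    headerValue is { Length: 4 } ? BinaryPrimitives.ReadInt32BigEndian(headerValue) : 0;

// Encode the incremented count to attach to the copy sent to the retry topic.
byte[] NextAttemptsValue(int currentAttempts)
{
    var buffer = new byte[4];
    BinaryPrimitives.WriteInt32BigEndian(buffer, currentAttempts + 1);
    return buffer;
}

// Exponential backoff with a cap: 1s, 2s, 4s, ... up to 60s.
TimeSpan BackoffFor(int attempts) =>
    TimeSpan.FromSeconds(Math.Min(60, Math.Pow(2, attempts)));

var first = ReadAttempts(null);
var second = ReadAttempts(NextAttemptsValue(first));
Console.WriteLine($"{AttemptsHeader}: {first} -> {second}, wait {BackoffFor(first)} before retrying");
```

With Confluent.Kafka, the bytes would come from the message's Headers.TryGetLastBytes and the incremented value would be added to the retried message's headers before producing it. If the message is re-produced unchanged, the count never advances and the `< 5` guard never trips.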

// consumer group config — many instances of this service share the partitions
new ConsumerConfig
{
    GroupId = "analytics",                 // each GROUP gets every event; instances split partitions
    EnableAutoCommit = false,              // we commit manually after processing
    AutoOffsetReset = AutoOffsetReset.Earliest,
    BootstrapServers = cfg.Brokers,
};

# diagnostic: consumer lag is THE health metric — how far behind real time is each group?
kafka-consumer-groups --bootstrap-server $BROKERS --describe --group analytics
# LAG column climbing = that consumer can't keep up (scale it, don't touch producers)

Mattrx metric: when Analytics had an incident, its consumer group's lag climbed to ~40 minutes and then drained to zero once fixed — with 0 lost events and 0 customer-facing errors. Pre-Kafka, that same incident dropped tens of thousands of events and spiked ingestion 5xx.
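
Lag itself is simple arithmetic: per partition, it is the log end offset minus the group's committed offset, which is what the LAG column reports. Here is a minimal sketch with made-up offsets, showing the total and the worst partition. The tuple shape and function name are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Per-partition snapshot: (partition, log end offset, committed offset). Numbers are made up.
var partitions = new List<(int Partition, long LogEndOffset, long Committed)>
{
    (0, 1_204_500, 1_204_500),
    (1, 1_198_200, 1_150_000),
    (2, 1_210_900, 1_210_100),
};

// Lag is how many messages the group still has to read on a partition.
long Lag((int Partition, long LogEndOffset, long Committed) p) =>
    Math.Max(0, p.LogEndOffset - p.Committed);

long totalLag = partitions.Sum(p => Lag(p));
var worst = partitions.OrderByDescending(p => Lag(p)).First();

Console.WriteLine($"total lag: {totalLag} messages");
Console.WriteLine($"worst partition: {worst.Partition} ({Lag(worst)} behind)");
```

The CLI reports lag in messages. A figure in minutes, like the one above, has to be derived separately, for example from message timestamps.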

Section 2 — Producing reliably from a DB write (the outbox)

There's a subtle trap: if the collector both writes to Azure SQL and produces to Kafka, those are two systems — a crash between them loses or duplicates the event. For events that must reflect a committed DB change (e.g., "campaign published"), use the outbox pattern: write the event to an outbox table in the same transaction as the business data, and a relay publishes it to Kafka.

// AFTER — outbox: the event and the business write commit atomically in ONE SQL transaction
await using var tx = await db.Database.BeginTransactionAsync(ct);
db.Campaigns.Add(campaign);
db.Outbox.Add(new OutboxMessage("events.raw", JsonSerializer.Serialize(ev)));  // same tx
await db.SaveChangesAsync(ct);
await tx.CommitAsync(ct);
// a background relay reads unsent outbox rows and produces them to Kafka, then marks them sent.

For the high-volume /api/collect path (which isn't tied to a DB transaction) the direct produce in the section above is fine — at-least-once with idempotent consumers (next section). Use the outbox only where the event must mirror a committed DB state.

Mattrx metric: the outbox on the campaign-lifecycle events eliminated the "DB says published, no event fired" class of bug — dual-write inconsistencies → 0. (Full outbox teardown is linked below.)
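
The relay half of the outbox is only described in a comment, so here is a minimal in-memory sketch of one relay pass. The outbox list, the sentIds set, and PublishAsync are stand-ins (assumptions) for the outbox table, its "sent" flag, and the Kafka producer. The point is the ordering: mark a row sent only after the publish succeeds.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// In-memory stand-ins for the outbox table and its "sent" flag (hypothetical shapes).
var outbox = new List<(long Id, string Topic, string Payload)>
{
    (1, "events.raw", "{\"type\":\"campaign.published\",\"campaignId\":42}"),
    (2, "events.raw", "{\"type\":\"campaign.paused\",\"campaignId\":42}"),
};
var sentIds = new HashSet<long>();
var published = new List<string>();

// Stand-in for producer.ProduceAsync; a real relay would await the broker's acknowledgement here.
Task PublishAsync(string topic, string payload)
{
    published.Add($"{topic}:{payload}");
    return Task.CompletedTask;
}

// One relay pass: publish unsent rows in insertion order, marking each sent only AFTER the publish succeeds.
// A crash between publish and mark re-sends the row on the next pass, so consumers must be idempotent.
async Task<int> RelayOnceAsync()
{
    var pending = outbox.Where(row => !sentIds.Contains(row.Id)).OrderBy(row => row.Id).ToList();
    foreach (var row in pending)
    {
        await PublishAsync(row.Topic, row.Payload);
        sentIds.Add(row.Id);
    }
    return pending.Count;
}

Console.WriteLine($"first pass relayed {await RelayOnceAsync()} rows");
Console.WriteLine($"second pass relayed {await RelayOnceAsync()} rows");
```

Marking before publishing would flip the failure mode from "may send twice" to "may never send", which is the bug the outbox exists to prevent.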

Section 3 — Idempotent consumers (because delivery is at-least-once)

Committing offsets only after processing gives you at-least-once delivery: a consumer can crash after processing but before committing, and it will reprocess that message on restart. So consumers must be idempotent: processing the same event twice must have the same effect as processing it once. The simplest tool is a dedup key with a unique constraint.

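// AFTER: idempotent rollup. A unique processed-event key makes reprocessing a no-op.
// ExecuteInsertIfAbsentAsync and IncrementAsync are not built-in EF Core methods;
// they are assumed to be project-level extension helpers.
public async Task ApplyAsync(CampaignEvent ev, CancellationToken ct)
{
    // The dedup key and the aggregate update share ONE transaction: a crash between
    // them must not leave the event marked as processed but never counted.
    await using var tx = await _db.Database.BeginTransactionAsync(ct);

    // INSERT the dedup key first; if it already exists, we've seen this event, so skip.
    var firstTime = await _db.ProcessedEvents
        .Where(p => p.EventId == ev.Id)
        .ExecuteInsertIfAbsentAsync(ev.Id, ct);          // unique index on EventId
    if (!firstTime) return;                              // duplicate → no-op, safe (tx rolls back on dispose)

    await _db.Aggregates.IncrementAsync(ev.CampaignId, ev.Type, ct);  // the real effect
    await tx.CommitAsync(ct);
}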

"Exactly-once" is mostly a myth at the boundary; at-least-once delivery + idempotent processing = effectively-once, and it's far simpler than chasing true exactly-once semantics.

Mattrx metric: building idempotency in from day one meant a consumer redeploy or rebalance (which reprocesses a few messages) caused 0 double-counted aggregates — the dedup key absorbs every replay.
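
To see "twice equals once" in isolation, here is a self-contained in-memory sketch of the same dedup idea. The HashSet stands in for the unique index and the Dictionary for the aggregates table. Both, and the Apply function, are illustrative stand-ins rather than the production code.

```csharp
using System;
using System.Collections.Generic;

// In-memory stand-ins: the HashSet plays the unique index, the Dictionary plays the aggregates table.
var processedEventIds = new HashSet<Guid>();
var clicksByCampaign = new Dictionary<int, long>();

// Returns true when the event was applied, false when it was a duplicate delivery.
bool Apply(Guid eventId, int campaignId)
{
    if (!processedEventIds.Add(eventId)) return false;   // Add returns false if the id is already present
    clicksByCampaign[campaignId] = clicksByCampaign.GetValueOrDefault(campaignId) + 1;
    return true;
}

var click = Guid.NewGuid();
Apply(click, campaignId: 7);          // first delivery
Apply(click, campaignId: 7);          // redelivery after a crash or rebalance
Apply(Guid.NewGuid(), campaignId: 7); // a genuinely new event

Console.WriteLine($"campaign 7 clicks: {clicksByCampaign[7]}");
```

Three deliveries, two distinct events, so the count is 2. This only works if producers assign a stable event ID once, at the source.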

Section 4 — Failure isolation, replay, and adding consumers for free

Two superpowers fall out of the log that REST never gave us.

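Replay. A bug in the analytics rollup that corrupted a day of aggregates? Fix the consumer, stop the group (offsets can only be reset while the group has no active members), reset its offsets to yesterday, and reprocess from the log. The raw events are still there as long as they fall within the topic's retention window; the log is the source of truth, not a transient message. One catch: because the rollup dedups on event ID (Section 3), clear that window's dedup keys and bad aggregates before replaying, or every replayed event is skipped as a duplicate.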

# replay: reset ONE consumer group to a point in time; other groups are unaffected
kafka-consumer-groups --bootstrap-server $BROKERS --group analytics \
  --reset-offsets --to-datetime 2026-06-10T00:00:00.000 --topic events.raw --execute

Add a consumer without touching the producer. When we added fraud scoring, we created a new consumer group on the same events.raw topic. The collector — the producer — didn't change at all. In the REST world that would have meant editing and redeploying the collector to add a fourth call.

EXTENSIBILITY — one topic, many independent readers

  events.raw ──► [analytics group]   (existing)
             ──► [persister group]   (existing)
             ──► [enrichment group]  (existing)
             ──► [fraud group]       (NEW — producer never knew it was added)

Mattrx metric: shipping the fraud-scoring consumer took ~1 day with zero changes to the ingestion producer or any other consumer — versus the multi-team, edit-the-producer coordination the REST chain required.
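
In configuration terms, "a new consumer group" is little more than a new GroupId on the same topic. Below is a sketch of what the fraud consumer's setup might look like with Confluent.Kafka; the group name and broker address are placeholders, not the actual Mattrx values.

```csharp
using Confluent.Kafka;

// Sketch: a new reader is just a new GroupId on the same topic. Values are placeholders.
var fraudConfig = new ConsumerConfig
{
    BootstrapServers = "broker-1:9092",          // placeholder address
    GroupId = "fraud",                           // new group: its own offsets, its own copy of every event
    EnableAutoCommit = false,                    // commit manually after processing, as in the analytics consumer
    AutoOffsetReset = AutoOffsetReset.Earliest,  // with no committed offsets yet, start from the oldest retained event
};

using var fraudConsumer = new ConsumerBuilder<string, byte[]>(fraudConfig).Build();
fraudConsumer.Subscribe("events.raw");
```

The AutoOffsetReset choice matters on day one: Earliest backfills the new consumer from everything still retained, while Latest would have it see only events produced after it starts.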

Aggregate metrics

Metric	Before (REST)	After (Kafka)	Delta
Ingestion failures (incident windows)	~1.9%	~0.18%	−90%
Events lost on downstream outage	tens of thousands	0	eliminated
Ingestion p95	180 ms	8 ms	−96%
Cascading-failure incidents / mo	~3	0	eliminated
Retry-storm load during incidents	severe	none (202 + walk away)	eliminated
Time to add a new consumer	~1 sprint	~1 day	−80%+
Replay a bad day of processing	impossible	one command	new capability
Dual-write inconsistencies (outbox paths)	recurring	0	eliminated

The shape: most of the win isn't throughput, it's failure isolation. The producer succeeds against a log that's almost always up, and every downstream problem became "a consumer is lagging" instead of "the pipeline is down and we're losing data."

Pre-adoption checklist

Honest stuff — when NOT to reach for Kafka

Don't replace request/response with Kafka. If the caller needs the answer to continue (auth check, "is this tenant valid?"), that's a synchronous query — REST/gRPC is correct. Kafka is for events, not for getting an answer back.
Kafka is operational weight. Brokers, partitions, consumer groups, schema registry, lag monitoring — it's a real system to run. For a small app with modest volume, a simpler queue (Azure Service Bus, a DB-backed queue) gives most of the decoupling with far less to operate. We used managed Kafka precisely so a 5-person team didn't run brokers.
"Exactly-once" is a trap. Plan for at-least-once and make consumers idempotent. Chasing true exactly-once (transactions across produce+consume) adds complexity most teams don't need.
Ordering is per-partition, not global. You get ordering within a partition (we key by tenant). If you assumed total global ordering, Kafka will surprise you — design the key around the ordering you actually require.
You traded synchronous errors for asynchronous lag. The failure didn't vanish; it moved. Instead of a 500 the customer sees, you get a consumer falling behind that you must watch. If nobody monitors lag, you've hidden the problem, not solved it.
On Azure, weigh Service Bus / Event Hubs first. We chose Kafka (via Confluent Cloud) for the replay + high-throughput log semantics and ecosystem, but Azure Service Bus (queues/topics) or Event Hubs (with the Kafka endpoint) are first-party and may fit a .NET shop with less friction. This was a deliberate trade, not a default.
What we'd do differently: introduce the outbox and idempotency from commit one, before the first consumer. We retrofitted idempotency after a rebalance double-counted a few aggregates — cheap to fix, cheaper to never ship.
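
On the ordering point above: per-key ordering works because a key always maps to the same partition. The sketch below illustrates that mapping with a stand-in hash (FNV-1a). Real Kafka clients use their own partitioner, and the partition count of 12 is an assumption.

```csharp
using System;
using System.Text;

// Illustrative stable hash (FNV-1a). Real Kafka clients use their own partitioner hash;
// the point is only that the mapping is deterministic per key.
uint Fnv1a(string key)
{
    uint hash = 2166136261;
    foreach (byte b in Encoding.UTF8.GetBytes(key))
    {
        hash ^= b;
        unchecked { hash *= 16777619; }   // wrap-around on overflow is intended
    }
    return hash;
}

// Same key and same partition count always give the same partition.
int PartitionFor(string key, int partitionCount) => (int)(Fnv1a(key) % (uint)partitionCount);

const int partitions = 12; // assumed partition count
foreach (var tenant in new[] { "tenant-a", "tenant-b", "tenant-a" })
    Console.WriteLine($"{tenant} -> partition {PartitionFor(tenant, partitions)}");
```

Two consequences follow from keying by tenant: one very busy tenant makes one hot partition, and changing the partition count remaps keys, so ordering across the change is not preserved.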

The closing mental model

A synchronous call couples you to the callee being up; an event in a log couples you to the log being up. For anything that's "this happened" rather than "tell me this," publishing to a durable log decouples producers from consumers in time — and most of resilience, burst absorption, and extensibility falls out of that one change. The failure doesn't disappear; it moves from the customer's request to a consumer's lag, where you can absorb it on your own schedule.

Three habits this leaves you with:

Sort every interaction into query vs event. Queries stay synchronous; events go to the log. Most "we need a message bus" problems are really "we mislabeled an event as a call."
Make consumers idempotent and watch the lag. At-least-once + a dedup key = effectively-once; consumer lag is the one dashboard that tells you the pipeline's truth.
Treat the log as the source of truth. If you can replay it, a bad consumer is a fix-and-reprocess, not a data-loss incident.

