Our Azure Bill Spiked Overnight — Here's Exactly How We Cut It 60% (7 Real Fixes)

Finance pinged us on a Monday: Mattrx's Azure bill was up ~150% month-over-month and still climbing. Nobody had shipped a "big" feature. This is the investigation — how we found where the money was actually going, the seven causes (most of them invisible until you look), and the before/after that took the bill down 60% and kept it there. Real Azure config, the diagnostic queries, and the dollar figures.

TL;DR

Cloud bills don't usually spike because of one dramatic thing. They spike because of several small, invisible things at once — a debug log left on, an autoscale rule that scales out but never in, a load-test environment nobody shut down, a feature serving images without a CDN. None of them pages anyone. All of them bill you hourly.

Mattrx's bill jumped from a steady ~$4,800/month to ~$12,100/month over three weeks. We traced it to seven causes, fixed them, and landed at ~$4,700/month — a 60% reduction from the spike, and slightly below the original baseline. The biggest lesson: you can't optimize what you can't see, and almost nobody is looking at the cost dashboard until finance makes them.

Cost driver	Spiked	After fix	What it was
Log / telemetry ingestion	$3,100/mo	$340/mo	A debug log level left on → GBs/day to App Insights
Compute (autoscale)	$3,400/mo	$1,500/mo	Scale-OUT rule, no scale-IN — instances ran 24/7
Orphaned resources	$1,900/mo	$0	Load-test env + unattached disks/IPs left running
Egress / bandwidth	$1,200/mo	$280/mo	New report feature served large files, no CDN
SQL + Redis (over-provisioned)	$1,600/mo	$900/mo	Premium tiers + no reservations on steady baseline
Storage transactions	$500/mo	$190/mo	Millions of tiny blob ops, hot tier, no lifecycle
AI / LLM tokens	$400/mo	$190/mo	RAG calls uncached, oversized model for cheap tasks
Total	~$12,100/mo	~$4,700/mo	−61%

Production / billing metrics (the month after the cleanup):

Monthly Azure spend: $12,100 → $4,700 (−61%), now below the pre-spike $4,800 baseline.
Log ingestion: ~95 GB/day → ~9 GB/day (sampling + log levels + a daily cap).
Always-on instances at 3am (zero traffic): 8 → 2 (scale-in rules fixed).
Untagged resources: ~40% of spend → <3% (tagging policy enforced).
Time from "bill spikes" to "we know why": was never (no alerting) → same-day (anomaly alert + tag-sliced dashboard).
Engineer-time to investigate + fix: ~3 days; payback period: ~2 days of the savings.

The one rule we adopted: cost is a feature with an owner and a dashboard. The spike wasn't an Azure problem; it was a visibility problem.

The spike, as finance saw it

MONTHLY AZURE SPEND (the graph that started the investigation)

 $12k ┤                                   ╭────  ← "why is it still climbing?"
 $10k ┤                              ╭────╯
  $8k ┤                         ╭────╯
  $6k ┤                    ╭────╯
  $4.8k ┤━━━━━━━━━━━━━━━━━━╯   ← steady for a year, then 3 weeks of climb
  $2k ┤
      └──────────────────────────────────────────────►
        Jan   Feb   Mar   Apr   May   Jun
                              ▲ a routine release + a load test + a forgotten log level

No single change caused it. Three unrelated things landed in the same fortnight, each adding cost quietly, and there was no alert watching the total.

The one mental shift

Engineers treat the cloud bill as finance's problem and finance treats it as a black box. Both are wrong. The bill is a direct, line-by-line consequence of architecture and configuration decisions engineers make — and it's the one production signal nobody instruments.

Cost is observability you're not doing. You alert on p95 and error rate; you should alert on spend-per-day and cost-per-tenant the same way. Every resource should be tagged (so you can slice the bill by team/feature/env) and every unusual jump should page someone the way a latency spike does. You can't optimize a number you never look at.

The fix for "our bill spiked" is rarely one clever change. It's turning the lights on — tagging, a cost dashboard, an anomaly alert — and then the seven causes below become obvious instead of invisible.
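As a rough illustration of alerting on spend-per-day, the sketch below flags the latest day when it runs well above the trailing average. It assumes you already export one daily total per line, oldest first. The function name, the 7-day window and the 1.5x factor are illustrative choices of ours, not Azure defaults, and this is not how the built-in anomaly alert works.

```shell
#!/bin/sh
# spend_anomaly [WINDOW] [FACTOR]
# stdin: daily spend totals, one number per line, oldest first.
# Flags the LAST day if it exceeds FACTOR x the mean of the WINDOW days before it.
# WINDOW=7 and FACTOR=1.5 are hypothetical defaults for this sketch.
spend_anomaly() {
  window="${1:-7}"
  factor="${2:-1.5}"
  awk -v w="$window" -v f="$factor" '
    { v[NR] = $1 }                                  # remember every daily total
    END {
      if (NR < w + 1) { print "INSUFFICIENT_DATA"; exit }
      sum = 0
      for (i = NR - w; i < NR; i++) sum += v[i]     # the w days before today
      mean = sum / w
      if (v[NR] > f * mean) printf "ANOMALY today=%.0f baseline=%.0f\n", v[NR], mean
      else                  printf "OK today=%.0f baseline=%.0f\n", v[NR], mean
    }'
}
```

Fed a week at $160/day (roughly $4,800/month) followed by a $400 day, it prints `ANOMALY today=400 baseline=160`; a scheduled job could page on that line.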

The running example: Mattrx on Azure

Mattrx is a multi-tenant marketing-analytics SaaS — 110k MAU, Angular 19 front end, .NET 9 / ASP.NET Core back end, Azure SQL, ~3,200 req/sec peak, on Azure App Service. Five backend engineers, one SRE. The bill had been a boring ~$4,800/month for a year, which is exactly why nobody watched it — until it wasn't boring.

The investigation: turn the lights on first

Before fixing anything, we made the spend visible. This is the five-step opener for any cost investigation:

# 1. Cost Analysis by RESOURCE — where is the money actually going this month?
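#    (--top only limits how many rows come back, so sort client-side and keep the 20 costliest.)
az consumption usage list --query "reverse(sort_by([].{name:instanceName, cost:to_number(pretaxCost)}, &cost))[:20]" -o table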

# 2. Slice by TAG (if you tagged things) — by env, team, feature
#    Cost Management -> Cost analysis -> Group by: Tag -> 'env' / 'feature'

# 3. Find the BIGGEST single line item, then drill in. (For us: Log Analytics.)

# 4. Log ingestion by source — the #1 surprise cost (KQL in Log Analytics):
#    Usage | where TimeGenerated > ago(7d) | summarize GB=sum(Quantity)/1000 by DataType
#         | sort by GB desc

# 5. Turn on anomaly alerts so the NEXT spike pages you on day one, not day 21.
#    Cost Management -> Cost alerts -> Anomaly alert

The tag slice was damning: ~40% of spend was untagged, meaning we couldn't even attribute it. Step one of the fix was a tagging policy (below), because you can't manage what you can't name.
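To put a number on "how much of the bill can't we attribute", a small sketch like the one below works on any cost export. It assumes a headerless CSV of `resource,env,cost` rows, which is a hypothetical layout for illustration; a real Cost Management export has more columns and would need the field numbers adjusted.

```shell
#!/bin/sh
# untagged_share
# stdin: CSV rows "resource,env,cost" (hypothetical export layout, no header).
# Prints the percentage of total cost that carries no env tag.
untagged_share() {
  awk -F',' '
    { total += $3; if ($2 == "") untagged += $3 }   # empty env column = untagged
    END {
      if (total == 0) { print "0.0"; exit }
      printf "%.1f\n", 100 * untagged / total
    }'
}
```

For example, `printf 'vm1,prod,600\ndisk7,,400\n' | untagged_share` prints `40.0`.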

Cause 1 — Runaway telemetry ingestion (the #1 surprise)

The single biggest line item wasn't compute; it was Log Analytics / Application Insights ingestion. A release had flipped a logger to Debug "temporarily" to chase a bug and never flipped it back. With traffic peaking at ~3,200 req/sec, Debug-level logging added up to ~95 GB/day into Log Analytics, billed per GB.

Before

// BEFORE: Debug level in production + no sampling = every request logs everything
// appsettings.Production.json
"Logging": { "LogLevel": { "Default": "Debug" } }   // left on after a debugging session
// Program.cs
builder.Services.AddApplicationInsightsTelemetry();  // no sampling -> 100% of telemetry ingested

After

Sensible log levels, adaptive sampling, and a daily ingestion cap as a backstop so a future mistake can't run away.

// AFTER: Information level + adaptive sampling caps telemetry volume
// appsettings.Production.json
"Logging": { "LogLevel": { "Default": "Information", "Microsoft.AspNetCore": "Warning" } }

// Program.cs
builder.Services.AddApplicationInsightsTelemetry(o =>
{
    // Sample to a target rate, not 100%. This is already the SDK default,
    // so also check that nothing else has switched it off.
    o.EnableAdaptiveSampling = true;
});
// + Log Analytics workspace: set a Daily Cap (GB/day) as a hard backstop against runaways.

// diagnostic that found it — top ingesting tables over 7 days
Usage
| where TimeGenerated > ago(7d) and IsBillable == true
| summarize GB = sum(Quantity) / 1000 by DataType
| sort by GB desc   // AppTraces was 80% of it

Mattrx metric: ingestion 95 GB/day → 9 GB/day, cutting this line from $3,100 → $340/month. The daily cap means the next "temporary Debug" mistake costs a capped amount, not an open-ended one.
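A back-of-envelope estimate makes it easier to see how a log level turns into a bill. The sketch below is only arithmetic: the average request rate, the bytes logged per request and the per-GB price are all hypothetical inputs (picked here so the result lands near the ~95 GB/day and ~$3,100/month above), not measured Mattrx values or quoted Azure prices.

```shell
#!/bin/sh
# ingest_gb_per_day AVG_REQ_PER_SEC BYTES_LOGGED_PER_REQ -> decimal GB/day
ingest_gb_per_day() {
  awk -v r="$1" -v b="$2" 'BEGIN { printf "%.1f\n", r * b * 86400 / 1000000000 }'
}

# monthly_ingest_cost GB_PER_DAY PRICE_PER_GB -> dollars per 30-day month
monthly_ingest_cost() {
  awk -v g="$1" -v p="$2" 'BEGIN { printf "%.0f\n", g * p * 30 }'
}

# Setting the workspace daily cap from the CLI (placeholders in <>, 20 GB is a
# made-up value; confirm the flag with `az monitor log-analytics workspace update --help`):
#   az monitor log-analytics workspace update -g <resource-group> -n <workspace> --quota 20
```

With an assumed 1,000 req/sec average and 1,100 bytes of telemetry per request, `ingest_gb_per_day 1000 1100` gives `95.0`; at a placeholder $1.10/GB, `monthly_ingest_cost 95 1.10` gives `3135`, and the same price at 9 GB/day gives `297`.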

Cause 2 — Autoscale that scaled out but never in

The web tier had an autoscale rule to add instances when CPU was high — but the scale-in rule was misconfigured, so once it scaled to 8 instances during a busy afternoon, it stayed at 8 forever, including overnight at zero traffic.

Before

BEFORE: the scale-in rule never fires, so the fleet ratchets up and never comes down
Rule: CPU > 70% for 10 min  -> +2 instances     [works]
Rule: CPU < 30% for 10 min  -> -1 instance       [window too short / cooldown wrong -> never fired]
Result: 8 instances running at 3am with ~0 traffic, billed hourly.

After

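A scale-in rule that actually fires, paired with the scale-out rule, plus a schedule for the predictable overnight lull (the scheduled profile is not shown in the snippet).

// AFTER: paired scale-out and scale-in rules + a floor of 2 instances; scales DOWN as readily as up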
autoscaleProfile: {
  capacity: { minimum: '2', maximum: '8', default: '2' }
  rules: [
    { metricTrigger: { metricName: 'CpuPercentage', operator: 'GreaterThan', threshold: 70, timeWindow: 'PT10M' }
      scaleAction: { direction: 'Increase', value: '2', cooldown: 'PT5M' } }
    { metricTrigger: { metricName: 'CpuPercentage', operator: 'LessThan', threshold: 40, timeWindow: 'PT10M' }
      scaleAction: { direction: 'Decrease', value: '1', cooldown: 'PT5M' } }  // the missing half
  ]
}

AFTER — instance count tracks traffic both ways
 8 ┤        ╭──╮                  (peak)
 4 ┤    ╭───╯  ╰───╮
 2 ┤━━━━╯          ╰━━━━━━━━━━━━  (scales back to 2 overnight)
   └───────────────────────────►  midnight -> noon -> midnight

Mattrx metric: average running instances dropped from a stuck 8 to a traffic-tracking 2–8, cutting compute $3,400 → $1,500/month. Same peak headroom, no 3am idle fleet.
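The ratchet is easy to reproduce in a toy simulation. The sketch below replays a made-up load series through the two rules (CPU > 70% adds 2, CPU < 40% removes 1, clamped to 2..8) with scale-in switched on or off. It is a simplification of ours: it ignores cooldowns, time aggregation and the platform's own flapping protection, and the load numbers are invented.

```shell
#!/bin/sh
# simulate_autoscale SCALE_IN
#   SCALE_IN: 1 = scale-in rule works, 0 = scale-in rule never fires
# stdin: total load per 10-minute window, in "percent of one instance" units
#        (250 means 2.5 instances' worth of CPU). One number per line.
simulate_autoscale() {
  awk -v scale_in="$1" '
    BEGIN { n = 2; min = 2; max = 8 }
    {
      cpu = $1 / n                                   # average CPU% per instance
      if (cpu > 70)                  { n += 2; if (n > max) n = max }
      else if (scale_in && cpu < 40) { n -= 1; if (n < min) n = min }
      total += n                                     # billed instance-windows
    }
    END { printf "final=%d instance_windows=%d\n", n, total }'
}
```

For a busy stretch followed by an idle one (`500 500 500 500 0 0 0 0 0 0 0 0`), the run without scale-in ends at `final=8 instance_windows=90`, while the run with scale-in ends at `final=2 instance_windows=57`. The gap keeps widening for every idle window after that.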

Cause 3 — Orphaned and untagged resources

A load test two weeks earlier spun up a parallel environment at production scale — and nobody deleted it. Add the usual graveyard: unattached managed disks, idle public IPs, and old snapshots. All billing, none serving traffic.

Before / After

# BEFORE — find the graveyard (these cost money while attached to nothing)
az disk list --query "[?diskState=='Unattached'].{name:name, gb:diskSizeGb}" -o table
az network public-ip list --query "[?ipConfiguration==null].name" -o table
az resource list --query "[?tags.env=='loadtest']" -o table   # the forgotten environment

# AFTER — delete the orphans + enforce a tagging POLICY so it can't recur
az policy assignment create --name require-env-tag \
  --policy "/providers/Microsoft.Authorization/policyDefinitions/<require-tag>" \
  --params '{ "tagName": { "value": "env" } }'   # resources without an 'env' tag are denied

# auto-expire load-test environments so they can't be "forgotten" again
az resource tag --is-incremental --tags env=loadtest expiry=2026-06-15 --ids <resource-id>   # -i merges; without it existing tags are replaced
# + a scheduled job deletes anything past its 'expiry' tag

Mattrx metric: deleting the orphaned load-test env, 6 unattached disks, and 3 idle IPs removed $1,900/month outright. The tagging policy (deny-without-env-tag) took untagged spend from ~40% to <3%, which is what made every other slice in this post possible.
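The "scheduled job deletes anything past its expiry tag" step could look roughly like the sketch below. It only decides which resource ids are past their date and prints them; the function names and the tab-separated input format are assumptions of this sketch, and the actual delete step is deliberately left out so the list can be reviewed first.

```shell
#!/bin/sh
# is_expired EXPIRY TODAY: both dates are YYYY-MM-DD.
# Succeeds (exit 0) only when EXPIRY is strictly before TODAY.
is_expired() {
  _exp=$(printf '%s' "$1" | tr -d '-')     # 2026-06-15 -> 20260615
  _now=$(printf '%s' "$2" | tr -d '-')
  [ "$_exp" -lt "$_now" ]
}

# expired_ids TODAY
# stdin: "<resource-id><TAB><expiry>" lines. Prints ids whose expiry is in the past.
expired_ids() {
  _today="$1"
  _tab=$(printf '\t')
  while IFS="$_tab" read -r _id _expiry; do
    [ -n "$_expiry" ] || continue           # no expiry tag: leave it alone
    if is_expired "$_expiry" "$_today"; then
      printf '%s\n' "$_id"
    fi
  done
}

# Possible wiring (not run here; review the printed list before deleting anything):
#   az resource list --query "[?tags.expiry].[id, tags.expiry]" -o tsv \
#     | expired_ids "$(date +%Y-%m-%d)"
```

Comparing the dates as plain integers works because ISO dates sort the same way numerically once the dashes are removed. A resource that expires today is kept until tomorrow's run.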

Cause 4 — Egress with no CDN

A new "download full report as PDF/PNG" feature served multi-MB files straight from the app — and customers are global, so a lot of that was internet egress, billed per GB. Worse, some assets were fetched cross-region.

Before / After

BEFORE:  global users --> App Service (one region) --> multi-MB files over internet egress
AFTER:   global users --> Azure CDN edge (cached) --> origin only on a miss
                          + Brotli compression + same-region storage

// AFTER — serve static/report assets via CDN, compress, cache at the edge
builder.Services.AddResponseCompression(o => o.Providers.Add<BrotliCompressionProvider>());
// report blobs are written to storage and served through the CDN endpoint, not the app:
// https://mattrx-cdn.azureedge.net/reports/...  (cache-control: public, max-age=86400)

Mattrx metric: moving report/asset delivery to the CDN with compression cut egress $1,200 → $280/month — and made downloads faster for the global user base as a bonus.
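To reason about how much traffic still reaches the origin after a change like this, a one-line model is often enough. The sketch below is that model only, with hypothetical inputs: it does not price anything, and CDN delivery is itself billed, so treat it as a way to compare scenarios rather than a cost calculator. Already-compressed formats such as PNG or most PDFs will have a compression ratio close to 1.

```shell
#!/bin/sh
# origin_gb TOTAL_GB HIT_RATIO COMPRESSION_RATIO
#   TOTAL_GB          GB users download per month, before compression
#   HIT_RATIO         fraction of requests served from the edge cache (0..1)
#   COMPRESSION_RATIO compressed size / original size (0..1, 1 = no gain)
# Prints the GB per month the origin still has to send.
origin_gb() {
  awk -v t="$1" -v h="$2" -v c="$3" 'BEGIN { printf "%.1f\n", t * (1 - h) * c }'
}
```

For example, `origin_gb 1000 0.75 0.5` prints `125.0`: a quarter of requests miss the cache, and each miss is half the original size.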

Cause 5 — Over-provisioned tiers + no reservations

Two compounding issues: the SQL tier and Redis were on premium SKUs sized for a worst-case that rarely happened, and the steady baseline was billed pay-as-you-go when it should have been on a 1-year reservation.

Before / After

# BEFORE — right-size first (Azure Advisor flags under-utilized resources)
az advisor recommendation list --category Cost -o table   # "downsize SQL", "buy reservation"

# AFTER — 1) downsize to the measured need, 2) reserve the steady baseline
#   - Azure SQL: dropped two vCore sizes after confirming p95 DB CPU sat at ~22%
#   - Redis: Premium -> Standard (we don't use the premium-only features)
#   - Reserved Instances / Savings Plan on the always-on P1v3 baseline (1-yr) -> ~35% off compute

The rule we follow: right-size on measured utilization first, then reserve the baseline. Reserving an over-provisioned resource just locks in waste.

Mattrx metric: downsizing SQL + Redis and reserving the steady compute baseline cut $1,600 → $900/month. About $280 of that is the SQL-tier saving the perf audit had already identified; the Redis downgrade and reservations account for the rest. Reservations alone gave ~35% off the always-on compute we know we'll run for a year.
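The "right-size, then reserve" ordering is easiest to see with numbers. The sketch below uses the ~35% discount from above; the $1,000 and $600 monthly figures are made-up examples, not Mattrx line items.

```shell
#!/bin/sh
# reserved_monthly PAYG_MONTHLY DISCOUNT -> monthly cost under a reservation
reserved_monthly() {
  awk -v p="$1" -v d="$2" 'BEGIN { printf "%.0f\n", p * (1 - d) }'
}

# breakeven_utilization DISCOUNT -> percent of hours the resource must actually
# run for the reservation to cost less than pay-as-you-go
breakeven_utilization() {
  awk -v d="$1" 'BEGIN { printf "%.0f\n", 100 * (1 - d) }'
}
```

Reserving a hypothetical over-provisioned $1,000/month resource gives `reserved_monthly 1000 0.35` = `650`; right-sizing it to $600 first gives `reserved_monthly 600 0.35` = `390`, and the $260 difference is locked in for the whole term if you reserve first. `breakeven_utilization 0.35` prints `65`: below roughly 65% utilization, pay-as-you-go is the cheaper option, which is why only the always-on baseline is worth reserving.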

Cause 6 — Storage tier and transaction costs

Storage isn't just GB stored — it's transactions (every read/write/list is billed) and access tier (hot vs cool vs archive). Mattrx wrote millions of tiny per-event blobs to the hot tier and never tiered old ones down.

Before / After

// AFTER — a blob lifecycle policy: auto-tier old data, delete what's expired
{
  "rules": [{
    "name": "tier-and-expire-events",
    "type": "Lifecycle",
    "definition": {
      "filters": { "blobTypes": ["blockBlob"], "prefixMatch": ["raw-events/"] },
      "actions": { "baseBlob": {
        "tierToCool":    { "daysAfterModificationGreaterThan": 30 },
        "tierToArchive": { "daysAfterModificationGreaterThan": 90 },
        "delete":        { "daysAfterModificationGreaterThan": 400 }
      }}
    }
  }]
}

Plus batching: instead of one blob per event, we write batched files — fewer, larger writes mean far fewer billed transactions.

Mattrx metric: lifecycle tiering + batching small writes cut storage $500 → $190/month. Most of the win was transaction count, not bytes stored — a cost nobody thinks about until they read the bill line by line.
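A quick sketch of why batching moves the transaction count so much. It assumes newline-delimited events and a batch size of 500, both hypothetical; the real writer lives in the app, and this only shows the shape of the saving.

```shell
#!/bin/sh
# batched_writes EVENTS BATCH_SIZE -> number of blob writes after batching
batched_writes() {
  awk -v e="$1" -v b="$2" 'BEGIN { printf "%d\n", int((e + b - 1) / b) }'   # ceiling division
}

# batch_events BATCH_SIZE OUT_PREFIX
# stdin: one event per line. Writes files OUT_PREFIXaa, OUT_PREFIXab, ...
# with at most BATCH_SIZE events each (each file would become one blob upload).
batch_events() {
  split -l "$1" - "$2"
}
```

One million events written one blob each is one million write transactions; `batched_writes 1000000 500` prints `2000`. Later tiering and delete operations shrink by the same factor, since the lifecycle policy acts per blob.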

Cause 7 — AI/LLM token spend

Mattrx Help (the RAG chatbot) sends prompts to a hosted model, billed per token. Spend had crept up because identical questions weren't cached and a large, expensive model was doing work a small one could.

Before / After

// AFTER — cache identical answers + route cheap tasks to a cheaper model
public async Task<HelpAnswer> AskAsync(string q, CancellationToken ct)
{
    // 1) cache: identical (normalized) questions return the cached answer — no token spend
    return await _cache.GetOrCreateAsync($"help:{Normalize(q)}", async t =>
    {
        // 2) route: classification/extraction -> small model; only synthesis -> the big one
        var model = _router.Pick(q);                 // gpt-4o-mini for simple, 4o for hard
        var ctx = await _search.RetrieveAsync(q, 5, t);
        return await _chat.CompleteGroundedAsync(model, q, ctx, t);
    }, cancellationToken: ct);
}

Mattrx metric: answer caching (a big fraction of help questions repeat) plus model routing cut AI $400 → $190/month with no quality drop on the questions users actually ask. (The classical-ML predictions stay on ML.NET — zero token cost.)
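The cache only pays off if trivially different phrasings map to the same key. The snippet above calls a `Normalize` helper it doesn't show, so here is one plausible version of that transformation, written in shell to match the CLI snippets in this post; the app itself would do this in C#, and the exact rules (lowercase, collapse whitespace, drop trailing punctuation) are our assumption rather than the production implementation.

```shell
#!/bin/sh
# normalize_question TEXT
# Lowercases, collapses runs of whitespace to one space, trims the ends,
# and strips trailing "?", "!" and "." so near-identical questions share a cache key.
normalize_question() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -s '[:space:]' ' ' \
    | sed -e 's/^ //' -e 's/[ ?!.]*$//'
}
```

Both `normalize_question "  How do I  export a Report?? "` and `normalize_question "how do i export a report"` produce `how do i export a report`. Going further (stemming, synonym folding) raises the hit rate but also the risk of returning an answer to a question the user didn't ask.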

Where the money went, before and after

COST BREAKDOWN

SPIKED (~$12,100/mo)              AFTER (~$4,700/mo)
Logs       ████████ $3.1k        Logs       █ $0.34k
Compute    █████████ $3.4k       Compute    ████ $1.5k
Orphaned   █████ $1.9k           Orphaned   (deleted) $0
Egress     ███ $1.2k             Egress     █ $0.28k
SQL+Redis  ████ $1.6k            SQL+Redis  ██ $0.9k
Storage    █ $0.5k               Storage    ▌ $0.19k
AI         ▌ $0.4k               AI         ▌ $0.19k

The two that mattered most — logs and compute — were both configuration mistakes, not architecture problems. That's the usual story: the big, scary bill is mostly small misconfigurations nobody was watching.

Aggregate metrics

Metric	Spiked	After	Delta
Monthly Azure spend	~$12,100	~$4,700	−61%
Log ingestion	95 GB/day	9 GB/day	−91%
Idle instances at 3am	8	2	−75%
Untagged spend	~40%	<3%	−93%
Egress cost	$1,200/mo	$280/mo	−77%
Time-to-detect a spike	never (no alert)	same day	new capability
Engineer-time to fix	—	~3 days	paid back in ~2 days

The headline 60% wasn't a heroic re-architecture. It was turning on visibility, deleting waste, fixing two config mistakes, and reserving the steady baseline.

FinOps checklist — so it doesn't recur

Honest stuff — cost optimization has limits

The cheapest architecture isn't the goal — the right one is. You can always cut more by degrading reliability or developer velocity; don't. We stopped at "no waste," not "minimum possible spend." Some spend (redundancy, headroom, good observability) is worth paying for.
Engineer time isn't free. A 3-day investigation that saves $7k/month is a no-brainer; a 2-week hunt to shave $40/month is a waste of a more expensive resource. Optimize the big line items; ignore the rounding errors.
Reservations are a commitment. A 1-year reserved instance saves ~35% but locks you in. Reserve only the baseline you're certain you'll run; keep burst capacity on-demand. And never reserve before right-sizing — you'd lock in the waste.
Don't gut observability to save on logs. We cut log volume (sampling, levels, a cap), not log value. Going dark to save $300/month so you can't diagnose a $50k outage is a terrible trade. Sample, don't blind yourself.
A daily cap can drop data you need. The Log Analytics daily cap is a backstop against runaways, but if you hit it, you lose telemetry until reset. Set it well above normal volume so only a genuine runaway trips it.
Tags are only useful if enforced. A tagging convention gets ignored; a tagging policy (deny untagged) is what actually works. We learned this the expensive way — 40% untagged is 40% you can't attribute or optimize.
What we'd do differently: the dashboard and anomaly alert should have existed before the spike. The whole 21-day climb would have been a same-day alert. Cost observability is cheap to set up and expensive to lack.

The closing mental model

The cloud bill is a production signal — instrument it like one. Spikes are almost never one dramatic thing; they're several invisible small things at once, on resources nobody tagged, watched by no alert. Turn the lights on (tag, dashboard, anomaly alert), and 60% of a runaway bill is usually deleted waste and two fixed config mistakes — not a re-architecture.

Three habits this leaves you with:

Tag everything and enforce it with policy. You can't manage, attribute, or optimize what you can't name.
Alert on spend like you alert on latency. A daily-cost anomaly alert turns a 21-day surprise into a same-day fix.
Right-size, then reserve. Measure real utilization, cut to it, and only then buy reservations for the steady baseline — never the other way around.
