Context Engineering for Enterprise AI, Part 5: Multi-Tenant Patterns That Don't Leak, Starve, or Overspend
Part 5: multi-tenant context engineering — tenant-scoped retrieval, per-tenant prompts and caches, noisy-neighbor fairness, model routing, and per-tenant cost attribution in C# + Python.
- Author
- Randhir Jassal
This is Part 5 in Context Engineering for Enterprise AI — the multi-tenant deep-dive. Parts 1–4 built the context layer for Mattrx, a multi-tenant marketing-analytics SaaS (110k MAU, ~9,000 tenants, ~3,200 req/sec peak, ASP.NET Core / .NET 9 on Azure SQL plus a separate Python (FastAPI) AI compute service): what goes into one prompt, what survives between prompts, how agents coordinate, and the governance/eval/cost spine. Every one of those parts quietly leaned on one primitive — the tenant boundary. This part makes it the design surface instead of an afterthought. Get it wrong and one customer's data, cost, or latency bleeds into another's; get it right and one pipeline serves thousands of tenants without any of them noticing the others exist.
Recap
Part 1 — Context Management budgeted the window (tokens ~14,000 -> ~3,500, wrong-answer 18% -> 3%). Part 2 — The Memory Layer made memory tenant-isolated with a hard WHERE predicate and a right-to-be-forgotten that clears every store. Part 3 — Multi-Agent Architecture split work across scoped agents (agentic p95 4.2s -> 1.8s). Part 4 — Enterprise AI Design added eval gates, injection/PII defense, cost control, and tracing.
Part 2 proved isolation for one layer — memory. This part generalizes it to the entire context pipeline: retrieval corpora, prompt assembly, caches, rate limits, model routing, cost, and data residency. It answers the question Part 2 left open — what does multi-tenancy look like when every layer, not just memory, has to respect the boundary?
TL;DR
Multi-tenancy is not a WHERE tenant_id = ? you sprinkle on queries. It is a single boundary that has to hold across every surface of the context pipeline, each with its own failure mode:
| Surface | Leak / abuse if ignored | Where Mattrx enforces it |
|---|---|---|
| Identity / scope | Client spoofs tenant_id in the request body | TenantScope resolved from the workspace JWT at the edge |
| Retrieval corpus | Tenant A's docs surface in B's prompt | Pool index + hard tenant_id filter; silo index for promoted tenants |
| Memory | Cross-tenant recall | Hard WHERE predicate (built in Part 2) |
| Prompt / policy | Wrong persona, tools, or redaction per tenant | TenantConfig as data; shared prefix + small tenant delta |
| Cache | A's cached answer served to B | Cache key = residency:tenant:hash, never just the question |
| Rate / quota | One whale starves everyone | PartitionedRateLimiter keyed by tenant, per-plan budget |
| Model routing | Everyone pays whale economics (or all get the cheap model) | Plan- and budget-aware routing |
| Cost / billing | One blended bill, no caps, surprise spend | Per-call usage_ledger + per-tenant budget cap |
| Data residency | EU tenant's data leaves the region | Region-pinned indexes + Azure SQL Row-Level Security |
Mattrx production results after making tenant scope a first-class context primitive (2-week build, 5 backend engineers + 1 SRE):
- Cross-tenant leak incidents across docs + cache + prompt + logs, in load and red-team testing: 0 (every surface enforces the boundary, none relies on a prompt instruction).
- Noisy-neighbor: other tenants' Insights p95 during the whale's nightly batch: 6.4s -> 1.9s (partitioned fairness).
- Prompt-cache hit rate on the shared system preamble: 0% -> 71% (stable shared prefix + tenant delta) — ~50% cheaper prefill on the system block.
- Per-tenant cost attribution: 0% (one blended Azure bill) -> 100% (per-call token + USD ledger).
- Runaway-tenant spend in one hour: ~$140 (uncapped) -> $5 (budget cap trips, clean 429 instead of a surprise bill).
- New-tenant isolation onboarding (config + index/namespace + quotas + residency): ~2 days manual -> < 5 min automated.
- Retrieval p95 on the pool index with a tenant filter at ~9,000 tenants: 31 ms (held); the whale moved to a dedicated index: 22 ms, and stopped degrading everyone else's recall.
- Cost per AI query: $0.008 (unchanged) — but now attributed, capped, and routed per plan.
- C# app API p95 120 ms and agentic p95 1.8s — unchanged from Parts 1–3.
The one mental shift
Stop treating the tenant as a filter you remember to add. Treat tenant scope as part of the context itself — resolved once at the edge, carried as an unforgeable token through retrieval, prompt, cache, model, and ledger. If isolation depends on anyone remembering to add a filter, it will leak the day someone forgets.
A naive multi-tenant AI system enforces isolation in N places and trusts N engineers to never miss one. A context-engineered one resolves the tenant once, hands every layer a TenantScope it cannot widen, and backs it with a database that refuses cross-tenant rows even when the app code is wrong. Isolation stops being a discipline and becomes a property of the pipeline.
The running example
Mattrx runs ~9,000 tenants on one deployment, and the load is brutally uneven: the top 1% of tenants drive ~45% of traffic, led by one "whale" retailer that runs nightly batch analyses across its whole catalog. Free-tier tenants share everything; Enterprise-tier tenants buy EU data residency and a dedicated vector index. Both AI products feel multi-tenancy differently — Mattrx Help (RAG over each tenant's docs) must never retrieve another tenant's knowledge base, and Mattrx Insights (the agentic system from Part 3) must never let the whale's batch starve a small tenant's interactive question.
One pipeline, many isolation levels. As in every prior part, C# (ASP.NET Core + Azure SQL) resolves and owns the tenant boundary; Python (FastAPI + Azure OpenAI + Azure AI Search) never decides whose data it touches — it receives a scope it can only narrow, never widen.
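One way to picture "narrow, never widen" on the Python side is an immutable scope object. The sketch below is an assumption about how the Python TenantScope could look (the article only shows the C# record), and the narrow_to_session helper is hypothetical.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class TenantScope:
    """Immutable scope handed to the Python service (fields mirror the C# record)."""
    tenant_id: str
    user_id: str
    plan: str                 # "free" | "pro" | "enterprise"
    residency_region: str     # "global" | "eu" | ...
    session_id: Optional[str] = None

    def narrow_to_session(self, session_id: str) -> "TenantScope":
        # Narrowing adds a restriction; tenant_id and residency_region are copied unchanged.
        return replace(self, session_id=session_id)


scope = TenantScope("t_42", "u_7", "pro", "global")
narrowed = scope.narrow_to_session("s_1")

try:
    scope.tenant_id = "t_99"   # in-place widening or tenant swapping is rejected
    mutated = True
except FrozenInstanceError:
    mutated = False
```

A frozen dataclass only blocks in-place mutation; nothing stops code from constructing a brand-new scope. That is why construction should live in exactly one place, behind whatever verification the edge provides.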
The tenant boundary, end to end
The whole design in one picture: the tenant is resolved at the edge and travels as an unforgeable TenantScope that every downstream surface takes as a required input.
                       THE TENANT BOUNDARY, END TO END
         (resolved once at the edge, carried as an unforgeable scope)

 Client -- Authorization: Bearer <JWT with workspace_id claim>
    |
    v
+----------------------------------------------------------------------------+
| ASP.NET Core / .NET 9 -- OWNS the boundary                                 |
|                                                                            |
| TenantContextMiddleware --> TenantScope { TenantId, UserId, Plan,          |
| (from JWT, never body)                    Residency }  <-- unforgeable     |
|        |                                                                   |
|        +--> [1] sp_set_session_context(TenantId) --> Azure SQL RLS backstop|
|        +--> [2] PartitionedRateLimiter[TenantId] --> fair quota per plan   |
|        +--> [3] PromptAssembler: shared prefix + tenant delta (cacheable)  |
|        +--> [4] cache key = residency:tenant:hash                          |
|        +--> [5] TokenMeter --> usage_ledger(tenant) --> budget cap         |
|                   |  HTTP (scope travels as signed fields, not client data)|
+-------------------|--------------------------------------------------------+
                    v
+----------------------------------------------------------------------------+
| Python AI service (FastAPI) -- never decides WHOSE data                    |
|                                                                            |
| IVectorIndexRouter(scope)                                                  |
|   +-- pool: mattrx-kb-pool-{region}  filter: tenant_id eq ...  <- long tail|
|   +-- silo: mattrx-kb-{tenantId}     (Enterprise / whale / EU) <- promoted |
|            |                                                               |
| memory (Part 2, tenant-scoped) --+                                         |
| docs (routed above) -------------+--> assemble into Part 1 budget --> model|
| model_router(scope, budget) -----+    gpt-4o (quality) / gpt-4o-mini (cost)|
+----------------------------------------------------------------------------+
      One pipeline. ~9,000 tenants. The boundary holds on every surface.
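The diagram says the scope travels from C# to Python "as signed fields", but the article does not say how it is signed. A minimal sketch, assuming an HMAC-SHA256 over canonical JSON with a secret shared by the edge and the AI service (the function names and the key are illustrative, not Mattrx's actual scheme):

```python
import hashlib
import hmac
import json


def sign_scope(scope: dict, key: bytes) -> str:
    """What the edge would do before calling the AI service (shown in Python for symmetry)."""
    body = json.dumps(scope, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def verify_scope(scope: dict, signature: str, key: bytes) -> dict:
    """Return the scope only if the signature matches; otherwise refuse the request."""
    expected = sign_scope(scope, key)
    if not hmac.compare_digest(expected, signature):
        raise PermissionError("tenant scope signature mismatch")
    return scope


KEY = b"demo-only-shared-secret"   # assumption: a secret known to both services
scope = {"tenant_id": "t_42", "user_id": "u_7", "plan": "pro", "residency_region": "global"}
sig = sign_scope(scope, KEY)

verified = verify_scope(scope, sig, KEY)

tampered = dict(scope, tenant_id="t_99")   # someone swaps the tenant in transit
try:
    verify_scope(tampered, sig, KEY)
    tamper_accepted = True
except PermissionError:
    tamper_accepted = False
```

The point of the design is that the Python service never parses a tenant id it cannot verify: a scope that fails the check is rejected before any index, cache, or model is touched.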
1. Tenant identity: resolve once, carry everywhere, never trust the body
Before
The first multi-tenant version read tenant_id from wherever was convenient — sometimes a query string, sometimes the JSON body — and passed it straight to the Python service. Any authenticated user could ask for another tenant's data by changing one field.
// BEFORE: InsightsController.cs. Tenant comes from the request body. Spoofable.
[HttpPost("ask")]
public async Task<IActionResult> Ask(AskRequest req, CancellationToken ct)
{
    // req.TenantId is whatever the client sent. Change it -> read another tenant.
    var answer = await _ai.AskAsync(req.TenantId, req.UserId, req.Question, ct);
    return Ok(new { answer });
}
Isolation here is a polite request, not a guarantee. The Python service trusts the field, the vector filter trusts the field, the cache key trusts the field — and the field is attacker-controlled.
After
Resolve the tenant once, at the edge, from the signed workspace claim in the JWT. Everything downstream receives a TenantScope it cannot forge or widen. The request body never carries identity again.
// AFTER: TenantScope.cs. The unforgeable identity every layer reads.
public sealed record TenantScope
{
    public required Guid TenantId { get; init; }
    public required Guid UserId { get; init; }
    public Guid? SessionId { get; init; }
    public required TenantPlan Plan { get; init; }          // Free | Pro | Enterprise
    public required string ResidencyRegion { get; init; }   // "global" | "eu" | ...
    // Derived server-side from the auth token. Never model-bound from the body.
}

public enum TenantPlan { Free = 0, Pro = 1, Enterprise = 2 }
// AFTER: TenantContextMiddleware.cs. Scope from auth, not from input.
// Must run after UseAuthentication() so ctx.User carries the validated claims.
public sealed class TenantContextMiddleware(RequestDelegate next, ITenantConfigStore configs)
{
    public async Task InvokeAsync(HttpContext ctx)
    {
        // workspace_id is signed into the JWT at login. The body is never trusted.
        // A missing or malformed claim is a 401, not an unhandled exception (500).
        if (!Guid.TryParse(ctx.User.FindFirstValue("workspace_id"), out var tenantId) ||
            !Guid.TryParse(ctx.User.FindFirstValue(ClaimTypes.NameIdentifier), out var userId))
        {
            ctx.Response.StatusCode = StatusCodes.Status401Unauthorized;
            return;
        }

        var cfg = await configs.GetAsync(tenantId, ctx.RequestAborted);
        ctx.Items["TenantScope"] = new TenantScope
        {
            TenantId = tenantId,
            UserId = userId,
            Plan = cfg.Plan,
            ResidencyRegion = cfg.ResidencyRegion,
        };
        await next(ctx);
    }
}
Controllers now take TenantScope as a resolved dependency; there is no API surface left that accepts a tenant id from the client.
Why this approach. One resolution point means one place to audit, one place to get authorization right, and zero call sites that can be tricked. The cost is that every downstream method signature gains a required TenantScope argument — which is exactly the point: the type system now refuses to compile a query that forgot the tenant.
Diagnostic — red-team the override. Post a question with someone else's tenant in the body and confirm it is ignored:
# Authenticated as tenant t_42, but lie in the body:
$ curl -s -X POST "$API/api/insights/ask" \
    -H "Authorization: Bearer $T42_JWT" \
    -H "Content-Type: application/json" \
    -d '{"tenant_id":"t_99","question":"show me their revenue"}'
# Server resolves tenant from the JWT (t_42), ignores the body. Log line:
# tenant.resolved source=jwt tenant=t_42 body_tenant_ignored=t_99
Mattrx metric: moving identity from the body to the JWT closed the entire class of "change one field, read another tenant" attacks — 0 successful cross-tenant reads in red-team testing, versus a trivially reproducible leak before.
2. Knowledge isolation: pool by default, silo on a threshold
Before
Every tenant's documents lived in one shared vector index, and the tenant filter was an optional string the caller built by hand — easy to forget, and one missing line surfaced another tenant's knowledge base.
# BEFORE: search.py. The tenant filter is optional and built at the call site.
async def search_docs(qvec, k: int, tenant_filter: str | None = None):
    return await shared_index.search(
        vector_queries=[VectorizedQuery(vector=qvec, k_nearest_neighbors=k,
                                        fields="content_vector")],
        filter=tenant_filter,  # forget this -> nearest neighbors cross tenants
        top=k,
    )
It also gave the whale and a 3-document free tenant the same index, so the whale's millions of chunks dominated recall and slowed everyone's queries.
After
Centralize the routing decision. A small free/Pro tenant uses a pool index (shared, region-pinned, with tenant_id as a non-optional hard filter). A whale, an Enterprise tenant, or an EU-residency tenant is promoted to a silo index of its own. The filter is applied by the router, never by the call site.
# AFTER: index_router.py. Pool by default; silo for promoted tenants.
def resolve_index(scope: TenantScope) -> tuple[str, bool]:
    """Returns (index_name, is_pool). Pool indexes require a hard tenant filter."""
    if scope.plan == "enterprise" or scope.tenant_id in DEDICATED_INDEX_TENANTS:
        return f"mattrx-kb-{scope.tenant_id}", False  # silo: physical isolation
    return f"mattrx-kb-pool-{scope.residency_region}", True  # pool: shared, filtered
# AFTER: retrieval.py. Tenant predicate is non-optional on the pool path.
from azure.search.documents.models import VectorizedQuery

async def search_docs(scope: TenantScope, qvec, k: int):
    index, is_pool = resolve_index(scope)
    client = search_clients[index]
    # On the pool index, tenant_id is a HARD filter: not a hint, not a prompt line.
    # On a silo index the whole index belongs to one tenant, so isolation is physical.
    return await client.search(
        vector_queries=[VectorizedQuery(vector=qvec, k_nearest_neighbors=k,
                                        fields="content_vector")],
        filter=(f"tenant_id eq '{scope.tenant_id}'" if is_pool else None),
        top=k,
    )
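Because the pool filter is built by string interpolation, it is worth refusing any tenant id that is not a well-formed identifier before it reaches the index. A small sketch, assuming tenant ids are GUIDs as in the C# TenantScope (the article's t_42-style ids are shorthand) and using a hypothetical tenant_filter helper:

```python
import uuid


def tenant_filter(tenant_id: str) -> str:
    """Build the pool-index filter; reject anything that is not a well-formed UUID."""
    canonical = str(uuid.UUID(tenant_id))   # raises ValueError on malformed input
    return f"tenant_id eq '{canonical}'"


ok = tenant_filter("3F2504E0-4F89-11D3-9A0C-0305E82C3301")

try:
    tenant_filter("x' or tenant_id ne 'x")   # a quote-breaking value never reaches the index
    rejected = False
except ValueError:
    rejected = True
```

With the scope resolved from the JWT this should never trigger in production; the check exists so that a future code path that builds a scope some other way fails loudly instead of widening the filter.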
long-tail tenants                              promoted tenant (size / SLA / residency)
-----------------                              ----------------------------------------
t_103 -+                                       t_42 (whale)
t_104 -+--> mattrx-kb-pool-global --promote--> mattrx-kb-t_42 (dedicated)
t_277 -+    shared, tenant_id filter           - own recall tuning
            - cheap, instant onboarding        - no noisy neighbor
            - filter = hard predicate          - easy residency + bulk delete
The three multi-tenancy models, and why Mattrx runs the hybrid:
| Model | What it is | Isolation | Cost / ops | Best for |
|---|---|---|---|---|
| Pool | One index, every doc tagged tenant_id, hard filter on read | Logical (a missing filter = leak) | Cheapest; instant onboarding; one index to operate | The long tail of small tenants |
| Silo | One index per tenant | Physical (nothing to forget) | Expensive at scale (index-count + storage limits); slow onboarding; cross-tenant analytics hard | Whales, Enterprise, regulated / EU-residency |
| Bridge | Pool by default; promote to silo on a threshold | Logical for most, physical for the few that need it | Two code paths; promotion must be automated | Mattrx's choice |
Why we chose Bridge. Pure pool can't give the whale dedicated tuning or an Enterprise tenant true residency, and it lets one tenant's corpus degrade everyone's recall. Pure silo is operationally impossible at 9,000 tenants — Azure AI Search caps indexes per service, and onboarding a tiny tenant by standing up an index is absurd. Bridge keeps the cheap, instant pool for the 99% and pays the silo cost only where size, SLA, or regulation forces it. Disadvantage: two retrieval paths and a promotion/teardown job to maintain — if that automation rots, silo tenants drift out of sync with pool features.
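The promotion rule itself can be a small pure function that the automation job calls. The sketch below is illustrative only: the article names size, SLA, and residency as triggers but gives no numbers, so MAX_POOL_CHUNKS and the TenantStats fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantStats:
    chunk_count: int
    plan: str                # "free" | "pro" | "enterprise"
    residency_region: str    # "global" | "eu" | ...
    has_latency_sla: bool


# Hypothetical size threshold; pick one from your own recall and latency measurements.
MAX_POOL_CHUNKS = 500_000


def should_promote(s: TenantStats) -> tuple[bool, str]:
    """Return (promote?, reason) so the promotion job can log why it acted."""
    if s.residency_region != "global":
        return True, "residency"
    if s.plan == "enterprise" or s.has_latency_sla:
        return True, "sla"
    if s.chunk_count > MAX_POOL_CHUNKS:
        return True, "size"
    return False, "pool"


decisions = [
    should_promote(TenantStats(3, "free", "global", False)),
    should_promote(TenantStats(2_000_000, "pro", "global", False)),
    should_promote(TenantStats(10, "enterprise", "eu", False)),
]
```

Returning the reason alongside the decision keeps "silo on requirement, not on request" auditable: every dedicated index has a recorded trigger.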
Diagnostic — probe for a cross-tenant hit. Embed a phrase that exists only in tenant A's docs, then search as tenant B:
$ curl -s -X POST "$API/api/help/search" \
    -H "Authorization: Bearer $TENANT_B_JWT" \
    -H "Content-Type: application/json" \
    -d '{"q":"ACME internal SKU naming convention"}' | jq 'length'
# 0: B never sees A's corpus, on pool (filtered) or silo (separate index).
Mattrx metric: the pool index held retrieval p95 at 31 ms with a tenant filter across ~9,000 tenants; promoting the whale to its own index dropped its p95 to 22 ms and removed it from the pool, so the long tail's recall@5 recovered from 0.88 back to 0.94.
3. Prompt and policy variation: per-tenant behavior as data, not branches
Before
Per-tenant behavior accreted as if (tenantId == ...) branches scattered through the code, and a few large tenants got fully bespoke system prompts rebuilt on every call. That killed prompt caching (no two prompts shared a prefix) and let safety rules drift per tenant.
// BEFORE: branches in code + per-call bespoke prompt. Uncacheable, drifts.
var system = tenantId switch
{
    var t when t == AcmeId   => "You are ACME's analyst. " + AcmeRules + DateTime.UtcNow,
    var t when t == GlobexId => "You are Globex's analyst. " + GlobexRules,
    _                        => "You are a helpful analytics assistant."
};
The DateTime.UtcNow interpolation alone meant the system block was unique every single call — a 0% prompt-cache hit rate, paying full prefill every time.
After
Tenant behavior is data — a TenantConfig row in Azure SQL — and the prompt is assembled as a stable shared preamble first (identical bytes for every tenant, so Azure OpenAI prompt caching reuses it) followed by a small tenant delta. Allowed tools, persona, and redaction all come from config.
// AFTER: TenantConfig.cs. Behavior is data, not branches.
public sealed record TenantConfig
{
    public required TenantPlan Plan { get; init; }
    public required string ResidencyRegion { get; init; }
    public string? PersonaPreamble { get; init; }                  // optional tenant tone/brand
    public required IReadOnlySet<string> AllowedTools { get; init; }
    public required IReadOnlyList<string> RedactionPatterns { get; init; }
    public required string ModelPolicy { get; init; }              // "cost" | "balanced" | "quality"
    public required decimal HourlyUsdBudget { get; init; }
}
// AFTER: PromptAssembler.cs. Stable shared prefix (cache hit) + tenant delta.
public string Build(TenantConfig cfg, string task, string context)
{
    var sb = new StringBuilder();

    // 1. SHARED, tenant-invariant preamble: identical bytes for EVERY tenant, so
    //    Azure OpenAI prompt caching reuses it. Keep it FIRST and BYTE-STABLE.
    sb.Append(SharedSystemPreamble.Value); // versioned constant, no clock, no ids

    // 2. Tenant delta: small, and LAST in the system block so the cache breaks
    //    here, not above. Everything tenant-specific lives below the cached prefix.
    if (cfg.PersonaPreamble is { Length: > 0 } persona)
        sb.Append("\n\n## Workspace style\n").Append(persona);
    // Sort the tools: set enumeration order is not guaranteed, and a stable order
    // keeps each tenant's own delta byte-identical from call to call.
    sb.Append("\n\n## Tools available\n")
      .Append(string.Join(", ", cfg.AllowedTools.Order(StringComparer.Ordinal)));

    // 3. Task + already-tenant-scoped context (Sections 1–2).
    sb.Append("\n\n## Task\n").Append(task);
    sb.Append("\n\n## Context\n").Append(context);
    return sb.ToString();
}
Why this ordering matters. Prompt caching only discounts the longest prefix that is identical across requests: the match stops at the first differing token, and Azure OpenAI only caches prompts of at least 1,024 tokens. Put the shared safety/behavior rules first and keep them stable, and 71% of the system block is cached across all tenants; let one engineer interpolate a tenant name or a timestamp into the preamble and the hit rate silently collapses to 0%. Advantage: cheaper prefill and one canonical set of safety rules every tenant inherits. Disadvantage: a tenant can only customize the delta, never the core rules. That is deliberate, because it is the safety win; a genuinely bespoke requirement becomes a config-schema change with review, not a prompt hack.
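A cheap guard against that silent collapse is a unit test asserting that prompts built for different tenants share at least the full preamble as a common prefix. The sketch below mirrors the assembler's ordering in Python; the preamble text, tenant values, and helper names are placeholders, not Mattrx's real prompt.

```python
SHARED_PREAMBLE = "You are an analytics assistant. Follow the safety rules below.\n"  # placeholder


def build_system_block(persona: str, tools: list[str]) -> str:
    """Shared preamble first, tenant delta last (same ordering as the C# assembler)."""
    parts = [SHARED_PREAMBLE]
    if persona:
        parts.append("\n\n## Workspace style\n" + persona)
    parts.append("\n\n## Tools available\n" + ", ".join(sorted(tools)))
    return "".join(parts)


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading bytes two prompts have in common."""
    n = 0
    for x, y in zip(a.encode(), b.encode()):
        if x != y:
            break
        n += 1
    return n


acme = build_system_block("Formal, concise.", ["sql", "charts"])
globex = build_system_block("", ["charts"])
preamble_bytes = len(SHARED_PREAMBLE.encode())

common = shared_prefix_len(acme, globex)

# Regression case: anything interpolated ahead of the preamble destroys the shared prefix.
bad = "Today is 2024-01-01. " + acme
common_bad = shared_prefix_len(bad, globex)
```

In a CI test the assertion would be `common >= preamble_bytes` for every pair of sample tenants; the `bad` case shows what the test catches.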
Diagnostic — confirm the cache is actually hitting:
$ az monitor log-analytics query -w $LAW_ID --analytics-query \
"AppTraces | where Message has 'aoai.usage' \
| summarize cached=avg(toint(Properties.cached_prompt_tokens)), \
prompt=avg(toint(Properties.prompt_tokens))"
# cached/prompt: before 0/1850 -> after 1310/1850 (~71% of the system block cached)
Mattrx metric: config-as-data + a stable prefix took the system-block cache hit rate from 0% to 71%, cutting prefill cost on that block ~50% while making every tenant inherit the same injection and redaction rules — no per-tenant safety drift.
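The two percentages measure different things, and a few lines of arithmetic keep them apart. The token counts come from the diagnostic above; the 50% discount on cached tokens is an assumption taken from the article's "~50% cheaper" figure and should be checked against the pricing of your own model and deployment type.

```python
def cached_fraction(cached_tokens: int, prompt_tokens: int) -> float:
    """Share of prompt tokens that were served from the prompt cache."""
    return cached_tokens / prompt_tokens


def prompt_cost_saving(cached_tokens: int, prompt_tokens: int, cached_discount: float) -> float:
    """Fraction of total prompt-token cost saved when cached tokens are billed at a discount."""
    return cached_fraction(cached_tokens, prompt_tokens) * cached_discount


frac = cached_fraction(1310, 1850)               # figures from the Log Analytics query
saving = prompt_cost_saving(1310, 1850, 0.50)    # assumption: cached tokens cost 50% less
print(f"cached share: {frac:.1%}, blended prompt saving: {saving:.1%}")
# prints: cached share: 70.8%, blended prompt saving: 35.4%
```

Under that assumption the ~50% applies to the cached tokens themselves; averaged over the whole 1,850-token prompt, the saving is closer to 35%.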
4. Cache, quota, and routing: stop one tenant from starving (or impersonating) another
Before
Two shared resources had no tenant in them. The semantic answer cache was keyed only by the question hash — so tenant A's cached answer was served to tenant B. And one global rate limiter meant the whale's nightly batch consumed the whole token budget, and small tenants' interactive questions queued behind it.
// BEFORE — semantic cache keyed only by the question. Cross-tenant answer bleed.
var key = $"ans:{Sha256(question)}"; // identical question -> A's answer returned to B
After
The cache key carries the tenant and residency, so an answer can never cross the boundary. Rate limiting is partitioned by tenant with per-plan token buckets, so fairness is structural — one tenant's burst can only exhaust its own bucket. And model routing respects the tenant's plan and remaining budget.
// AFTER — cache key is tenant- and residency-scoped. Never just the question.
var key = $"ans:{scope.ResidencyRegion}:{scope.TenantId}:{Sha256(question)}";
// AFTER: RateLimiting.cs. PartitionedRateLimiter keyed by tenant, per-plan budgets.
public static PartitionedRateLimiter<TenantScope> BuildLimiter() =>
    PartitionedRateLimiter.Create<TenantScope, Guid>(scope =>
        RateLimitPartition.GetTokenBucketLimiter(scope.TenantId, _ =>
            new TokenBucketRateLimiterOptions
            {
                TokenLimit = PlanBurst(scope.Plan),        // Enterprise 600 / Pro 120 / Free 30
                TokensPerPeriod = PlanBurst(scope.Plan),
                ReplenishmentPeriod = TimeSpan.FromMinutes(1),
                QueueLimit = 0,                            // fail fast & fair, no unbounded queue
                AutoReplenishment = true,
            }));

static int PlanBurst(TenantPlan p) => p switch
{
    TenantPlan.Enterprise => 600,
    TenantPlan.Pro => 120,
    _ => 30,
};
# AFTER: model_router.py. Routing respects the tenant's plan AND remaining budget.
def choose_model(scope: TenantScope, cfg: TenantConfig, est_cost_usd: float) -> str:
    if cfg.model_policy == "cost":
        return "gpt-4o-mini"
    if scope.remaining_budget_usd < est_cost_usd:  # over budget -> downgrade, never 500
        return "gpt-4o-mini"
    return "gpt-4o" if cfg.model_policy == "quality" else "gpt-4o-mini"
Why partition instead of one global limiter. A global limiter is "first come, first served," which at scale means "whoever batches hardest wins." Partitioning by tenant makes fairness a property of the data structure, not of luck. QueueLimit = 0 is deliberate: requests over the tenant's rate limit get an immediate 429 with Retry-After rather than silently queuing, which is predictable for the tenant and keeps the whale's overflow off everyone else's latency. Disadvantage: bursts become visible 429s, so clients need honest retry/backoff or they will read throttling as an outage; and per-plan quality routing means a Free tenant on gpt-4o-mini gets measurably weaker answers than an Enterprise tenant on gpt-4o for the same question. That is intended, but support has to know.
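The fairness property is easy to see in a toy model. The sketch below is a simplified Python stand-in for the partitioned limiter, not the .NET implementation: it omits replenishment entirely and only shows that one tenant's burst drains its own bucket. The per-plan capacities are the ones from the C# code; the class names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: int
    tokens: int

    def try_acquire(self, n: int = 1) -> bool:
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False   # QueueLimit = 0 analogue: reject immediately, never queue


PLAN_BURST = {"enterprise": 600, "pro": 120, "free": 30}   # values from the C# limiter


class PartitionedLimiter:
    """One bucket per tenant, sized by plan, created on first use."""

    def __init__(self) -> None:
        self._buckets: dict[str, TokenBucket] = {}

    def try_acquire(self, tenant_id: str, plan: str) -> bool:
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            cap = PLAN_BURST.get(plan, PLAN_BURST["free"])
            bucket = self._buckets[tenant_id] = TokenBucket(cap, cap)
        return bucket.try_acquire()


limiter = PartitionedLimiter()
# The whale fires 1,000 requests in one window; only its own 600 tokens are spent.
whale_ok = sum(limiter.try_acquire("t_42", "enterprise") for _ in range(1000))
# A small tenant's interactive request still goes straight through.
small_ok = limiter.try_acquire("t_104", "free")
```

With a single global bucket the same burst would have left nothing for t_104; with partitions, the whale's 400 rejected requests are the only casualties of its own batch.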
Diagnostic — reproduce the noisy-neighbor test. Fire the whale's batch and watch a small tenant's interactive p95:
# Whale (t_42) batch running; measure tenant t_104's interactive p95 meanwhile.
$ az monitor log-analytics query -w $LAW_ID --analytics-query \
"AppRequests | where Properties.tenant == 't_104' and Name == 'insights.ask' \
| summarize p95=percentile(DurationMs, 95)"
# Before partitioned limiter: 6,420 ms -> After: 1,910 ms
Mattrx metric: tenant-scoped cache keys removed cross-tenant answer bleed entirely, and partitioned fairness held small tenants' Insights p95 at 1.9s while the whale's batch ran — down from 6.4s when one global limiter let the whale consume the shared budget.
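If the Python service keeps any cache of its own, the same key discipline applies there. A minimal sketch of the key shape in Python, with an in-memory dict standing in for the real cache (the helper name and sample values are illustrative):

```python
import hashlib


def answer_cache_key(residency_region: str, tenant_id: str, question: str) -> str:
    """Same shape as the C# key: ans:{residency}:{tenant}:{sha256(question)}."""
    digest = hashlib.sha256(question.encode("utf-8")).hexdigest()
    return f"ans:{residency_region}:{tenant_id}:{digest}"


cache: dict[str, str] = {}
question = "What was last month's conversion rate?"

# Tenant A asks first and the answer is cached under A's key.
cache[answer_cache_key("global", "t_103", question)] = "answer computed from A's data"

hit_for_a = cache.get(answer_cache_key("global", "t_103", question))
hit_for_b = cache.get(answer_cache_key("global", "t_104", question))   # identical question
```

Tenant B's lookup misses even though the question is byte-identical, which is the whole point: the cost of a cache miss is one model call, the cost of a cross-tenant hit is a breach.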
5. Cost attribution, residency, and the RLS backstop
Before
There was one blended Azure OpenAI bill. Nobody could say which tenant spent what, there was no per-tenant cap, and "EU residency" was a promise enforced by hope — a single mis-routed query could read EU rows from a US region.
// BEFORE — fire the model, bill nobody. No ledger, no cap, no residency proof.
var resp = await _aoai.GetChatCompletionsAsync(model, options, ct);
return resp.Value.Choices[0].Message.Content; // cost vanishes into the monthly invoice
After
Every model call books cost to a per-tenant ledger, which feeds dashboards, budget caps (the cap that trips Section 4's router), and a usage-based billing export. Residency is enforced by region-pinned indexes (Section 2) and Azure SQL Row-Level Security, so even a buggy query that forgets the tenant filter returns nothing instead of leaking.
// AFTER: TokenMeter.cs. Every model call books cost to a tenant ledger.
public async Task RecordAsync(TenantScope scope, string model, Usage u, CancellationToken ct)
{
    var usd = Pricing.Cost(model, u.PromptTokens, u.CompletionTokens);
    db.UsageLedger.Add(new UsageRow
    {
        TenantId = scope.TenantId,
        Model = model,
        PromptTokens = u.PromptTokens,
        CompletionTokens = u.CompletionTokens,
        Usd = usd,
        At = DateTimeOffset.UtcNow,
    });
    await db.SaveChangesAsync(ct);
    await _budgets.SpendAsync(scope.TenantId, usd, ct); // trips the cap from Section 4
}
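How the pre-call gate and the post-call ledger interact can be sketched with an in-memory hourly budget. This is an illustration under assumptions, not Mattrx's budget service: the HourlyBudget class is hypothetical, the $5 cap and $0.008 per-query cost are the article's figures, and Decimal is used so the arithmetic is exact.

```python
from collections import defaultdict
from decimal import Decimal


class HourlyBudget:
    """In-memory stand-in for the per-tenant budget that the ledger feeds."""

    def __init__(self, cap_usd: Decimal) -> None:
        self.cap_usd = cap_usd
        self._spent: dict[tuple[str, int], Decimal] = defaultdict(Decimal)

    def remaining(self, tenant_id: str, now_s: int) -> Decimal:
        return self.cap_usd - self._spent[(tenant_id, now_s // 3600)]

    def allow(self, tenant_id: str, est_usd: Decimal, now_s: int) -> bool:
        # Pre-call gate: works from an estimate of the call's cost.
        return self.remaining(tenant_id, now_s) >= est_usd

    def record(self, tenant_id: str, actual_usd: Decimal, now_s: int) -> None:
        # Post-call booking: the ledger records what the call actually cost.
        self._spent[(tenant_id, now_s // 3600)] += actual_usd


COST = Decimal("0.008")                       # cost per AI query from the article
budget = HourlyBudget(cap_usd=Decimal("5.00"))

calls = 0
while budget.allow("t_42", COST, now_s=0):    # a runaway tenant hammering the API
    budget.record("t_42", COST, now_s=0)
    calls += 1

spent = budget.cap_usd - budget.remaining("t_42", 0)
```

Here estimate and actual are equal, so the cap trips at exactly 625 calls and $5.00. When actuals exceed estimates, the last allowed call can push spend slightly past the cap, which is why billing should read the ledger rather than the gate.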
-- AFTER: Azure SQL Row-Level Security. The DB itself refuses cross-tenant rows.
CREATE FUNCTION dbo.fn_tenant_predicate(@TenantId uniqueidentifier)
    RETURNS TABLE WITH SCHEMABINDING AS
    RETURN SELECT 1 AS ok
           WHERE @TenantId = CAST(SESSION_CONTEXT(N'TenantId') AS uniqueidentifier);
-- CREATE FUNCTION must be the only statement in its batch, hence the separator.
GO

CREATE SECURITY POLICY dbo.TenantIsolation
    ADD FILTER PREDICATE dbo.fn_tenant_predicate(TenantId) ON dbo.Memories,
    ADD BLOCK PREDICATE  dbo.fn_tenant_predicate(TenantId) ON dbo.Memories AFTER INSERT,
    ADD FILTER PREDICATE dbo.fn_tenant_predicate(TenantId) ON dbo.UsageLedger
    WITH (STATE = ON);
// AFTER: DbConnectionInterceptor: pin the RLS tenant on every pooled connection.
// EF Core's ConnectionOpenedAsync hook returns Task, so there is nothing to return.
public override async Task ConnectionOpenedAsync(
    DbConnection conn, ConnectionEndEventData data, CancellationToken ct = default)
{
    await using var cmd = conn.CreateCommand();
    cmd.CommandText = "EXEC sp_set_session_context @key=N'TenantId', @value=@t";
    cmd.Parameters.Add(new SqlParameter("@t", _scope.TenantId));
    await cmd.ExecuteNonQueryAsync(ct); // RLS now scopes every query on this connection
}
// Override the synchronous ConnectionOpened the same way so sync opens are pinned too.
Why enforce isolation twice. The app-level filter is the fast path; RLS is the backstop for the day a new query forgets the WHERE. With both, a forgotten filter degrades to an empty result set, not a breach. Disadvantage: SESSION_CONTEXT must be set on every pooled connection (hence the interceptor), RLS adds a small per-query cost, and — subtly — RLS can mask a missing app filter by silently returning no rows, hiding the bug. So Mattrx alerts whenever RLS filters rows that the app should have scoped itself: the backstop catching something is a signal, not a success.
Diagnostic — prove residency and the backstop together. Run a query that "forgets" the tenant filter, in staging:
$ sqlcmd -Q "EXEC sp_set_session_context @key=N'TenantId', @value='t_42'; \
    SELECT COUNT(*) FROM dbo.Memories;"   # no WHERE clause at all
# The count covers only t_42's rows (0 from any other tenant): RLS scoped it even though the query forgot to.
Mattrx metric: the ledger took per-tenant cost attribution from 0% to 100%, enabling caps that held a runaway tenant's hour of spend to $5 instead of ~$140; RLS + region-pinned indexes kept EU-residency tenants' data in-region with 0 cross-region reads in audit.
Aggregate metrics
| Metric | Before (tenant as an afterthought) | After (tenant as context) |
|---|---|---|
| Cross-tenant leak incidents (docs + cache + prompt + logs, load + red-team) | observed | 0 |
| Other tenants' Insights p95 during the whale's batch | 6.4 s | 1.9 s |
| Prompt-cache hit on the shared system preamble | 0% | 71% |
| Per-tenant cost attribution | 0% (one blended bill) | 100% (ledger) |
| Runaway-tenant spend in 1 hour | ~$140 (uncapped) | $5 (cap trips) |
| New-tenant isolation onboarding | ~2 days (manual) | < 5 min (automated) |
| Retrieval p95 — pool index, tenant filter (~9,000 tenants) | 31 ms | 31 ms (held) |
| Whale retrieval p95 after promotion to a silo index | 31 ms | 22 ms |
| Pool recall@5 after removing the whale from the pool | 0.88 | 0.94 |
| Cost / AI query | $0.008 | $0.008 (now attributed + capped) |
| Cross-region reads for EU-residency tenants (audit) | unverified | 0 |
| C# app API p95 / agentic p95 | 120 ms / 1.8 s | unchanged |
Pre-ship checklist
- Tenant identity is resolved once from the auth token at the edge; no endpoint accepts a tenant id from the request body or query string.
- TenantScope is a required argument on every downstream method: retrieval, prompt assembly, cache, limiter, model router, ledger. The type system refuses a tenant-less call.
- The pool vector index applies tenant_id as a hard filter centrally (in the router), never as an optional per-call-site string.
- A documented threshold (size / SLA / residency) promotes a tenant from pool to a silo index, and promotion and teardown are automated.
- Per-tenant behavior lives in TenantConfig data, not in if (tenantId == ...) branches.
- The system prompt is a byte-stable shared prefix + a small tenant delta; no clock, tenant name, or id is interpolated above the cached prefix.
- The semantic/answer cache key includes residency and tenant; an identical question from two tenants can never collide.
- Rate limiting is partitioned by tenant with per-plan buckets and QueueLimit = 0; throttled requests return 429 + Retry-After, not an unbounded queue.
- Model routing respects plan and remaining budget; over-budget downgrades the model, it never 500s.
- Every model call writes a usage_ledger row (tenant, model, tokens, USD); a per-tenant budget cap reads from it.
- Azure SQL RLS is enabled with SESSION_CONTEXT(TenantId) set on every pooled connection, as a backstop to the app filter, and you alert when RLS catches a row the app should have scoped.
- EU/region-pinned tenants use region-pinned indexes and storage; an audit confirms 0 cross-region reads.
Honest stuff
- Bridge means two code paths. Pool and silo retrieval diverge, and the silo path needs onboarding/teardown automation or it rots — silo tenants quietly miss features the pool gets. If you can't staff that automation, stay pure-pool until you must.
- You cannot silo everyone. Azure AI Search caps indexes per service; per-tenant indexes for 9,000 tenants is a non-starter. The long tail must stay pooled — silo is a scarce resource you spend on whales, Enterprise, and regulated tenants.
- Prompt caching is fragile. It only pays off while the shared prefix is byte-identical. One engineer interpolating a timestamp or tenant name into the preamble silently drops your hit rate to 0% and nobody notices until the bill moves. Guard the preamble with a test that asserts byte-stability.
- Fail-fast fairness looks like downtime. QueueLimit = 0 turns bursts into 429s. Without honest client retry/backoff and a clear Retry-After, tenants experience throttling as an outage. Document the contract.
- Budgets are estimates; bills are actuals. Token-bucket budgets gate before the call; the ledger records after. A tenant can slightly overspend within a window before the cap trips, so reconcile on the ledger, and bill on actuals, not on the pre-call estimate.
- RLS can hide bugs. As a backstop it's invaluable, but a missing app filter silently returns "no data" instead of failing loudly. Keep both layers and alert when the backstop fires: a caught leak is a bug to fix, not a win to celebrate.
- Per-tenant routing creates per-tenant quality. A Free tenant on gpt-4o-mini gets worse answers than an Enterprise tenant on gpt-4o for the identical question. That's the business model, but support and your evals must account for it: run the eval suite per plan, not just once.
- Residency fragments everything. Region-pinned data fragments your indexes and your evals and dashboards. A global "top failing queries" view must aggregate without ever joining across residency boundaries, or you've recreated the leak in your analytics.
- Cost dashboards tempt you to over-serve the whale. The bill is dominated by the top 1%, but churn lives in the long tail. Watch the pool's p95 and recall, not just the whale's invoice.
- Don't silo on request — silo on requirement. A tenant asking for "our own index" is not a reason; size, SLA, or regulation is. Premature siloing is how you end up operating thousands of indexes you can't afford to maintain.
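The "fail-fast fairness looks like downtime" item depends on clients honoring the throttle. A minimal client-side sketch of that contract, assuming the server sends Retry-After in seconds; the backoff_seconds helper and its default base and cap are illustrative choices, not part of the Mattrx API.

```python
import random
from typing import Optional


def backoff_seconds(attempt: int, retry_after: Optional[str],
                    base: float = 0.5, cap: float = 30.0) -> float:
    """Wait time before retry number `attempt` (0-based) after a 429."""
    if retry_after is not None:
        try:
            # The server's hint wins, clamped to a sane range.
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass   # Retry-After can also be an HTTP date; this sketch falls back to backoff
    # No usable hint: exponential backoff with full jitter.
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


hinted = backoff_seconds(0, "12")
clamped = backoff_seconds(3, "999")
jittered = backoff_seconds(2, None)
```

Jitter matters for the same reason partitioning does: if every throttled client retries at the same instant, the retries themselves become the next burst.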
The closing mental model
Multi-tenancy is one boundary, enforced in many places, resolved once and never re-derived. The tenant is part of the context — not a filter you remember to add.
Three enforceable habits:
- Resolve at the edge, carry as a token. Derive TenantScope from auth once, make every layer take it as a required input, and let the type system reject any tenant-less call. If isolation depends on memory, it will fail on the day someone forgets.
- Default to pool, promote on a threshold. Keep the cheap shared path for the 99%, spend physical isolation only where size, SLA, or regulation demands it, and automate the promotion, or you won't do it.
- Enforce isolation twice. App filter for speed, RLS for the day the filter is missing. If a single forgotten WHERE can leak, you have not isolated anything; you've documented an intention.
Continue the series
This is Part 5 of Context Engineering for Enterprise AI. (You are here.) It extends the original four-part arc with the cross-cutting concern that touched every part: the tenant boundary.
The full series:
- Part 1: Context Management
- Part 2: The Memory Layer
- Part 3: Multi-Agent Architecture
- Part 4: Enterprise AI Design
- Part 5: Multi-Tenant Patterns (you are here)
Next in the series: Designing Context Layers for Enterprise AI — the layered architecture (system, retrieved knowledge, memory, tools, user input) that composes everything in Parts 1–5 into one governed window.
Further reading
- Part 2: The Memory Layer — the tenant-scoped memory isolation this part generalizes to the whole pipeline.
- Part 4: Enterprise AI Design — the eval gates, cost controls, and tracing the per-tenant ledger plugs into.
- RAG with Azure OpenAI, Azure AI Search and C# — the retrieval foundation the pool/silo indexes are built on.
- LLM patterns in .NET — the Semantic Kernel / typed-HttpClient patterns behind the C# code here.
- Azure observability and AI for enterprise ASP.NET — the Log Analytics queries used for the noisy-neighbor and cache-hit diagnostics.
Running one context pipeline across thousands of tenants and unsure where to draw the silo/pool line, or how to make isolation and cost hold at the same time? Email me at randhir.jassal@gmail.com with your tenant distribution and residency requirements, and I'll send back the boundaries I'd enforce first.