AI-Native Architecture: The 9-Layer Blueprint Every Enterprise Will Adopt by 2027
Most enterprises bolt AI onto a backend built for CRUD. We rebuilt Mattrx around nine AI-native layers in production. Here is the blueprint, with code.
Author: Randhir Jassal
Every enterprise has now shipped "an AI feature." Almost none have shipped an AI-native architecture. That difference decides whether your AI roadmap survives 2027 or quietly gets ripped out after the third incident.
We learned this the expensive way on Mattrx, our multi-tenant marketing-analytics SaaS. The first version of "Mattrx Help" was a single MVC action that called a frontier model inline. It demoed beautifully. In production it leaked context across tenants, hallucinated 18% of the time, cost $0.021 per query, and one tenant's runaway retry loop billed the entire fleet for an afternoon.
The fix was not a better prompt. It was an architecture. This post is that architecture — nine layers, each with the before that broke and the after that holds, the real C# and Python we run, and the production numbers from a system serving 110k MAU at ~3,200 req/sec peak.
TL;DR
| Dimension | Bolt-on AI (before) | AI-native architecture (after) |
|---|---|---|
| Model access | SDK called inline in controllers | Single governed AI gateway |
| Tenant isolation | "Please don't leak" in the prompt | Filters pushed into the data layer |
| Context size | ~14,000 tokens, stuff everything | ~3,500 tokens, assembled + ranked |
| Memory | Stateless, cold every turn | Short-term buffer + semantic long-term |
| Retrieval | Naive top-k cosine | Hybrid recall + cross-encoder rerank |
| Reasoning | One mega-prompt | Orchestrator + specialist agents + eval gate |
| Model choice | gpt-4 everywhere | Routed by task complexity, with fallback |
| Actions | Agent had raw DB access | Typed, authorized tool contracts |
- Hallucination rate 18% → 3% after hybrid retrieval + rerank.
- Faithfulness 0.96, answer-relevance 0.91 on our offline eval set.
- Context tokens per call 14k → 3.5k: a quarter of the input tokens, with answers that got more accurate, not less.
- Cost per query $0.021 → $0.008, mostly from model routing.
- Agentic p95 latency 4.2s → 1.8s after the planner picked shorter paths.
- Prompt-injection attempts blocked ~40/week at the gateway + identity layers.
- Eval gate threshold 0.90 — answers below it never reach a user.
- Zero cross-tenant data leaks in the six months since the rebuild.
- "Mattrx Help" now deflects ~520 support tickets/month.
- Infrastructure cost stayed inside the same envelope; token spend now goes only where it is needed.
The one mental shift: stop treating the model as a feature you call, and start treating it as a tier you operate — with its own gateway, identity, memory, and governance, exactly like you already do for your database.
The running example: Mattrx, in production
Mattrx is a real system, not a toy. Angular 19 on the front, .NET 9 / ASP.NET Core on the back (Clean Architecture + CQRS with MediatR), Azure SQL, Azure App Service. Campaigns table ~4M rows, Events ~180M, CampaignEvents ~1.2B. Ingestion runs through Confluent Kafka; report commands queue on Azure Service Bus; Event Grid glues the reactive bits together.
The AI surface is two products:
- Mattrx Help — retrieval-augmented support assistant (Semantic Kernel + Azure AI Search).
- Mattrx Insights — an agentic analyst that plans, queries, forecasts, and writes up findings.
One architectural decision shapes everything below: C# owns orchestration and governance; a Python FastAPI service owns embeddings, retrieval, agents, and evaluation. C# is where the rules live. Python is where the model-heavy work lives. They talk over a typed internal contract.
Here is the shape we converged on — the same stack rendered in our docs:
Frontend
    |
    v
API Gateway      <- AI gateway: routing, budgets, redaction
    |
    v
Identity         <- tenant + scope propagation
    |
    v
Context Layer    <- assemble, rank, compress to a budget
    |
    v
Memory           <- short-term buffer + semantic long-term
    |
    v
Knowledge Base   <- hybrid retrieval + rerank (Python)
    |
    v
Agents           <- orchestrator + specialists + eval gate
    |
    v
LLM              <- model router (cheap -> frontier)
    |
    v
Business APIs    <- typed, authorized tool contracts
Read the diagram top to bottom: every request flows through each layer instead of skipping straight from a controller to a model. That single rule is what turned a demo into a system. Let's walk each layer with the before and the after.
Layer 1 → 2: the API Gateway becomes an AI Gateway
Before
The very first version put the model call inside an MVC action.
[ApiController]
[Route("api/help")]
public class HelpController(IOpenAiClient openai) : ControllerBase
{
    [HttpPost("ask")]
    public async Task<IActionResult> Ask([FromBody] AskRequest req)
    {
        // The entire AI call lives inside a controller. Nothing governs it.
        var completion = await openai.CompleteAsync(
            model: "gpt-4",
            system: "You are Mattrx Help.",
            user: req.Question);
        return Ok(new { answer = completion.Text });
    }
}
Diagnostic: no per-tenant rate limit, no token accounting, no PII redaction, no fallback model, no audit trail. When one tenant's client retried in a tight loop, every other tenant paid for it, and we had no record of who spent what.
After
Every model call — Help, Insights, internal classification — passes through one gateway. The controller becomes thin; the gateway is the chokepoint where policy lives.
public sealed record AiGatewayContext
{
    public required string TenantId { get; init; }
    public required string UserId { get; init; }
    public required string Feature { get; init; } // "help", "insights", "classify"
    public required int TokenBudget { get; init; }
}

public sealed class AiGateway(
    IModelRouter router,
    ITokenBudgetStore budgets,
    IPiiRedactor redactor,
    IAiAuditLog audit) : IAiGateway
{
    public async Task<ModelResult> SendAsync(
        AiGatewayContext ctx, ModelRequest request, CancellationToken ct)
    {
        var remaining = await budgets.ReserveAsync(
            ctx.TenantId, ctx.Feature, request.EstimatedTokens, ct);
        if (remaining < 0)
            throw new TokenBudgetExceededException(ctx.TenantId, ctx.Feature);

        // Redact before anything leaves our boundary.
        request = request with { Prompt = redactor.Redact(request.Prompt) };

        var model = router.Select(ctx.Feature, request.Complexity);
        var result = await model.CompleteAsync(request, ct);

        await audit.RecordAsync(ctx, model.Name, result.Usage, ct);
        return result;
    }
}
Mattrx metric: once budgets and redaction lived in the gateway, runaway loops became a 429 for one tenant instead of a fleet-wide bill, and we started blocking ~40 prompt-injection attempts per week at this boundary. Cost per query began its drop from $0.021 as routing kicked in (more on that at Layer 8).
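To make the reserve-then-reject behaviour concrete, here is a minimal Python sketch of what a budget store could look like. `InMemoryBudgetStore`, its single fixed window, and the `reconcile` step are illustrative assumptions, not the production `ITokenBudgetStore`; a real store would need shared state and atomic updates.

```python
from collections import defaultdict


class TokenBudgetExceeded(Exception):
    """Raised when a tenant/feature pair has no budget left in the window."""


class InMemoryBudgetStore:
    """Hypothetical per-(tenant, feature) token budget for one time window."""

    def __init__(self, limit_per_window: int) -> None:
        self.limit = limit_per_window
        self.used: dict[tuple[str, str], int] = defaultdict(int)

    def reserve(self, tenant_id: str, feature: str, estimated_tokens: int) -> int:
        """Reserve tokens up front; return what remains, or raise if over budget."""
        key = (tenant_id, feature)
        remaining = self.limit - self.used[key] - estimated_tokens
        if remaining < 0:
            raise TokenBudgetExceeded(f"{tenant_id}/{feature}")
        self.used[key] += estimated_tokens
        return remaining

    def reconcile(self, tenant_id: str, feature: str, estimated: int, actual: int) -> None:
        """Swap the up-front estimate for actual usage once the call returns."""
        self.used[(tenant_id, feature)] += actual - estimated
```

Because the key is the (tenant, feature) pair, a retry loop from one tenant exhausts only that tenant's budget, which is the property the gateway relies on to turn a fleet-wide bill into a single 429.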
Layer 3: Identity stops meaning "who logged in"
Before
Tenant scoping existed in SQL, but the AI path rebuilt prompts by hand and the model saw whatever the developer happened to pass. The isolation guarantee was, effectively, a sentence in the system prompt.
var system = "You are Mattrx Help. Only answer for the current customer. " +
             "Do not reveal other customers' data.";
Diagnostic: "tell the model not to leak" is not a security control. The moment retrieval pulled a chunk from the wrong index partition, the model happily summarized another tenant's campaign.
After
Identity is a first-class object that propagates into retrieval and every tool call. The tenant filter is pushed down into the vector store — it is data-layer enforcement, not a polite request.
public sealed record AiPrincipal(
    string TenantId,
    string UserId,
    IReadOnlySet<string> Scopes); // "campaigns:read", "reports:create"

public sealed class TenantScopedRetriever(
    IVectorStore store, AiPrincipal principal) : IRetriever
{
    public Task<IReadOnlyList<Chunk>> SearchAsync(
        string query, int k, CancellationToken ct) =>
        store.SearchAsync(query, k, filter: new VectorFilter
        {
            ["tenant_id"] = principal.TenantId // hard isolation in the store
        }, ct);
}
Mattrx metric: zero cross-tenant leakage in the six months since. The injection attempts we count at the gateway hit a second wall here — even a successful jailbreak can only ever retrieve within its own tenant partition.
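The difference between a prompt-level rule and a data-layer rule is easier to see in a toy store. The Python sketch below is an assumption-heavy illustration (an in-memory list, a hypothetical `TenantScopedStore`, plain cosine ranking), not the Azure AI Search setup described above. What it shows is that the filter runs before ranking, so another tenant's chunks are never candidates.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    tenant_id: str
    text: str
    vector: tuple[float, ...]


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TenantScopedStore:
    """Toy store: tenant_id is a required keyword, so no caller can omit the filter."""

    def __init__(self, chunks: list[Chunk]) -> None:
        self._chunks = chunks

    def search(self, query_vec: tuple[float, ...], *, tenant_id: str, k: int) -> list[Chunk]:
        # Filter first, rank second: other tenants' chunks never enter the ranking.
        own = [c for c in self._chunks if c.tenant_id == tenant_id]
        return sorted(own, key=lambda c: cosine(query_vec, c.vector), reverse=True)[:k]
```

However the query is phrased, a search with `tenant_id="a"` can only return chunks whose `tenant_id` is `"a"`; no wording in the prompt changes the candidate set.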
Layer 4: the Context Layer (the one most teams skip)
Before
We concatenated everything and hoped the context window was big enough.
var prompt = $"""
System: You are Mattrx Help.
Conversation so far:
{entireConversationHistory}
Knowledge:
{allRetrievedDocs}
Question: {question}
""";
// ~14,000 tokens per call. Slow, expensive, and the model lost the thread
// somewhere in the middle of a wall of barely-relevant text.
Diagnostic: bigger context is not better context. Past a point, recall drops — the model anchors on whatever is at the edges and misses the middle. We were paying frontier prices to confuse the model.
After
A context assembler treats the prompt like a budget to be packed, not a bucket to be filled. It selects, ranks, and compresses to a hard token ceiling.
public sealed class ContextAssembler(
    IRetriever retriever,
    IMemoryStore memory,
    ISummarizer summarizer)
{
    public async Task<AssembledContext> BuildAsync(
        AiPrincipal principal, string question, int tokenBudget, CancellationToken ct)
    {
        var recent = await memory.GetRecentTurnsAsync(principal, maxTurns: 6, ct);
        var facts = await memory.GetSalientFactsAsync(principal, question, ct);
        var chunks = await retriever.SearchAsync(question, k: 8, ct);

        var ranked = chunks
            .OrderByDescending(c => c.Score)
            .ThenByDescending(c => c.Recency)
            .ToList();

        var packer = new TokenBudgetPacker(tokenBudget);
        packer.Add(Section.Instructions, weight: 1.0);
        packer.Add(Section.Facts, facts);
        packer.Add(Section.History, recent, compressWith: summarizer);
        packer.Add(Section.Knowledge, ranked);
        return packer.Pack(); // guaranteed to fit the budget
    }
}
Mattrx metric: context tokens 14k → 3.5k, with faithfulness 0.96 and answer-relevance 0.91 on our eval set. Smaller, better-ordered context made answers more accurate, not less. (We go deep on this in the context-engineering series linked below.)
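The "budget to be packed" idea can be sketched as a greedy packer. The Python below is a simplified stand-in for `TokenBudgetPacker`, under stated assumptions: whitespace word counts replace a real tokenizer, sections arrive already in priority order, and the summarizer-based compression of history is left out.

```python
from dataclasses import dataclass


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for a real tokenizer.
    return len(text.split())


@dataclass
class Packed:
    sections: dict[str, list[str]]
    tokens_used: int


def pack(sections: list[tuple[str, list[str]]], budget: int) -> Packed:
    """Greedy packer: sections come in priority order, items in rank order.

    An item that does not fit is skipped, so the result never exceeds the budget.
    """
    out: dict[str, list[str]] = {}
    used = 0
    for name, items in sections:
        kept = []
        for item in items:
            cost = count_tokens(item)
            if used + cost <= budget:
                kept.append(item)
                used += cost
        out[name] = kept
    return Packed(out, used)
```

Skipping an oversized item rather than stopping lets smaller, lower-ranked items still use the leftover budget. Whether that beats truncating at the first miss is a design choice worth checking against your own eval set.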
Layer 5: Memory
Before
Every turn started cold. The assistant re-asked for the account, the date range, the metric the user had already given it thirty seconds earlier.
After
Two-tier memory: a short-term conversation buffer (TTL-bounded) and a long-term semantic memory that captures only salient turns — decisions, preferences, corrections — into the vector store.
public sealed class ConversationMemory(
    IVectorStore vectors, IKeyValueStore kv) : IMemoryStore
{
    public async Task RememberAsync(AiPrincipal p, Turn turn, CancellationToken ct)
    {
        // Short-term: full turn, expires on its own.
        await kv.PushAsync(
            key: $"conv:{p.TenantId}:{turn.ConversationId}",
            value: turn,
            ttl: TimeSpan.FromHours(4), ct);

        // Long-term: only what is worth remembering.
        if (turn.IsSalient)
        {
            var embedding = await vectors.EmbedAsync(turn.Summary, ct);
            await vectors.UpsertAsync(new MemoryRecord(
                Id: turn.Id,
                TenantId: p.TenantId,
                UserId: p.UserId,
                Text: turn.Summary,
                Vector: embedding), ct);
        }
    }
}
Mattrx metric: memory is a big part of how Help deflects ~520 tickets/month — it stops re-interrogating returning users, so conversations resolve in fewer turns and fewer escalate to a human.
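For readers who want to see the short-term tier in isolation, here is a small Python sketch of a TTL-bounded buffer. It is an assumption, not the production key-value store: `ShortTermBuffer` is a hypothetical name, expiry is checked per turn at read time rather than per key, and the clock is injectable so the behaviour can be tested.

```python
import time
from collections import deque


class ShortTermBuffer:
    """Hypothetical TTL-bounded conversation buffer keyed by tenant and conversation."""

    def __init__(self, ttl_seconds: float, max_turns: int = 6, clock=time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.max_turns = max_turns
        self.clock = clock
        self._turns: dict[str, deque] = {}

    def push(self, tenant_id: str, conversation_id: str, turn: str) -> None:
        key = f"conv:{tenant_id}:{conversation_id}"
        # deque(maxlen=...) silently drops the oldest turn once the buffer is full.
        buf = self._turns.setdefault(key, deque(maxlen=self.max_turns))
        buf.append((self.clock(), turn))

    def recent(self, tenant_id: str, conversation_id: str) -> list[str]:
        key = f"conv:{tenant_id}:{conversation_id}"
        cutoff = self.clock() - self.ttl
        return [turn for ts, turn in self._turns.get(key, ()) if ts >= cutoff]
```

The tenant id is part of the key, mirroring the `conv:{TenantId}:{ConversationId}` key in the C# above, so two tenants with the same conversation id never share a buffer.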
Layer 6: the Knowledge Base — RAG done properly
Before
Top-k cosine over a single flat index. It returned chunks that looked relevant and were often subtly wrong, which is the worst failure mode because it sounds confident.
After
This is the part we moved to the Python FastAPI service. Hybrid recall (lexical BM25 + vector), then a cross-encoder rerank — the step most "RAG" implementations skip and the one that actually moved our hallucination number.
# ai-service/retrieval.py: Python owns embeddings, retrieval, rerank
async def retrieve(query: str, tenant_id: str, k: int = 8) -> list[Chunk]:
    # 1. Hybrid recall, both tenant-scoped at the source.
    lexical = await search_index.bm25(query, tenant_id=tenant_id, k=40)
    vector = await vector_store.search(await embed(query),
                                       tenant_id=tenant_id, k=40)
    candidates = dedupe(lexical + vector)

    # 2. Cross-encoder rerank: judges query+chunk together, not in isolation.
    scores = await reranker.score(query, [c.text for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)

    # 3. Drop weak matches entirely. An empty result beats a wrong one.
    return [c for c, s in ranked[:k] if s > 0.2]
Diagnostic: cosine similarity answers "are these vectors close?" The cross-encoder answers "does this chunk actually answer this question?" Those are different questions, and the gap between them was most of our hallucination rate.
Mattrx metric: hallucination 18% → 3%. The single biggest accuracy win in the whole rebuild came from the rerank, plus the rule that returning nothing is allowed.
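The `retrieve` function leans on a `dedupe` helper and a score threshold that are easy to gloss over. The sketch below shows one plausible shape for both. It assumes chunks carry a stable `id`, and it uses a word-overlap scorer purely as a stand-in so the example runs without a model; that scorer is not a cross-encoder and should not be read as one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    id: str
    text: str


def dedupe(chunks: list[Chunk]) -> list[Chunk]:
    """Keep the first occurrence of each chunk id, preserving order."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        if chunk.id not in seen:
            seen.add(chunk.id)
            unique.append(chunk)
    return unique


def overlap_score(query: str, text: str) -> float:
    # Stand-in scorer, NOT a cross-encoder: fraction of query words found in the text.
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0


def rerank(query: str, candidates: list[Chunk], k: int, min_score: float = 0.2) -> list[Chunk]:
    scored = sorted(((overlap_score(query, c.text), c) for c in candidates),
                    key=lambda pair: pair[0], reverse=True)
    # Weak matches are dropped even if that leaves fewer than k results, or none.
    return [c for score, c in scored[:k] if score > min_score]
```

The empty-list case is the point: a query with no good match returns `[]`, and the caller can answer "I don't know" instead of summarizing a near miss.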
Layer 7: Agents
Before
Mattrx Insights was one enormous prompt that tried to plan, query the warehouse, forecast, and write the summary in a single shot. When any step drifted, the whole answer drifted, and we couldn't tell which step failed.
After
An orchestrator plans, then dispatches to specialist agents with narrow capabilities. Nothing reaches the user until an eval gate scores it.
public sealed class InsightsOrchestrator(
    IAgent planner,
    IReadOnlyDictionary<string, IAgent> specialists,
    IEvalGate gate) : IOrchestrator
{
    public async Task<AgentResult> RunAsync(
        AiPrincipal p, string goal, CancellationToken ct)
    {
        var plan = await planner.PlanAsync(goal, ct); // ordered, typed steps
        var scratch = new AgentScratchpad();

        foreach (var step in plan.Steps)
        {
            var agent = specialists[step.Capability]; // "sql", "forecast", "summarize"
            var output = await agent.ExecuteAsync(step, scratch, ct);
            scratch.Record(step, output);
        }

        var answer = scratch.Synthesize();
        var verdict = await gate.EvaluateAsync(goal, answer, ct);
        return verdict.Score < 0.90 // the eval gate
            ? AgentResult.Rejected(verdict)
            : AgentResult.Ok(answer, verdict);
    }
}
Mattrx metric: agentic p95 4.2s → 1.8s — the planner routes most goals through two or three steps instead of brute-forcing one giant prompt. The 0.90 eval gate quietly suppresses low-confidence answers; users see "let me get a human" instead of a confident hallucination.
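The plan-dispatch-gate loop is small enough to sketch end to end. The Python below is a simplified illustration, not the Mattrx orchestrator: specialists are plain callables, synthesis is a string join, and `evaluate` is whatever scorer you supply. It also fails closed when the planner names a capability that does not exist, a case worth handling explicitly in any real implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

EVAL_THRESHOLD = 0.90  # the gate threshold quoted in this post


@dataclass
class Result:
    ok: bool
    answer: Optional[str]
    reason: str


def run_plan(
    steps: list[str],
    specialists: dict[str, Callable[[list[str]], str]],
    evaluate: Callable[[str], float],
) -> Result:
    """Run planner steps in order, then gate the synthesized answer."""
    scratch: list[str] = []
    for capability in steps:
        agent = specialists.get(capability)
        if agent is None:
            # A planner can name a capability that does not exist; fail closed.
            return Result(False, None, f"unknown capability: {capability}")
        scratch.append(agent(scratch))
    answer = " ".join(scratch)
    score = evaluate(answer)
    if score < EVAL_THRESHOLD:
        return Result(False, None, f"eval score {score:.2f} below gate")
    return Result(True, answer, f"eval score {score:.2f}")
```

A rejected result carries no answer at all, so the calling layer cannot accidentally show a low-scoring draft to the user.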
Layer 8: the LLM layer is a router, not a model
Before
gpt-4 (or whatever the current frontier was) for every single call, including "is this question about billing or analytics?"
After
A router picks the cheapest model that can do the job, with a fallback path.
public sealed class ModelRouter(IModelCatalog catalog) : IModelRouter
{
    public IChatModel Select(string feature, Complexity complexity) => complexity switch
    {
        Complexity.Trivial  => catalog.Get("small-fast"), // classify, route, extract
        Complexity.Standard => catalog.Get("mid"),        // most Help answers
        Complexity.Hard     => catalog.Get("frontier"),   // Insights synthesis
        _                   => catalog.Get("mid")
    };
}
Diagnostic: when we instrumented it, ~70% of calls were trivial — classification and extraction that never needed a frontier model. We were paying frontier prices to detect intent.
Mattrx metric: cost per query $0.021 → $0.008. Most of that delta is simply not sending cheap work to expensive models.
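The router above shows selection but not the fallback path. One way that path could look is sketched below in Python. The tier names mirror the catalog keys in the C#, but the fallback order and the broad `except` are assumptions made for illustration; a real gateway would catch provider-specific errors and probably add timeouts.

```python
from typing import Callable, Optional

# Tier names mirror the catalog keys used in the router above.
TIER_FOR_COMPLEXITY = {"trivial": "small-fast", "standard": "mid", "hard": "frontier"}
# Assumed fallback order: if the routed tier fails, try the next one listed.
FALLBACKS = {"small-fast": ["mid"], "mid": ["frontier"], "frontier": ["mid"]}


def complete_with_fallback(
    complexity: str,
    prompt: str,
    models: dict[str, Callable[[str], str]],
) -> tuple[str, str]:
    """Return (model_name, completion), trying the routed tier first."""
    primary = TIER_FOR_COMPLEXITY.get(complexity, "mid")
    last_error: Optional[Exception] = None
    for name in [primary, *FALLBACKS[primary]]:
        try:
            return name, models[name](prompt)
        except Exception as exc:  # real code would catch provider-specific errors
            last_error = exc
    raise RuntimeError(f"all models failed for {complexity!r}") from last_error
```

Returning the model name alongside the completion keeps the audit log honest: the record shows which model actually answered, not which one was first chosen.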
Layer 9: Business APIs become governed tools
Before
The agent had a database connection. It could "decide" to run a query or kick off a report. The model was, functionally, an unaudited admin user.
After
Every action the model can take is a typed, authorized tool. The model proposes; the tool layer decides. Authorization and tenant binding happen in code the model can't talk its way around.
public sealed class CreateReportTool(
    IReportService reports, AiPrincipal principal) : IAgentTool
{
    public string Name => "create_report";

    public async Task<ToolResult> InvokeAsync(JsonElement args, CancellationToken ct)
    {
        if (!principal.Scopes.Contains("reports:create"))
            return ToolResult.Denied("missing scope reports:create");

        var request = args.Deserialize<CreateReportRequest>()!;
        request = request with { TenantId = principal.TenantId }; // never trust model-supplied tenant

        var id = await reports.EnqueueAsync(request, ct); // -> Azure Service Bus
        return ToolResult.Ok(new { reportId = id, status = "queued" });
    }
}
Diagnostic: a model can hallucinate an action; it cannot hallucinate a scope check. Binding the tenant in code, not from the model's arguments, closes the most dangerous class of agent bug.
Mattrx metric: report generation (PuppeteerSharp PDF, 1.2M renders / 48h at peak) now flows from governed tool calls onto Service Bus. No agent ever holds a database handle.
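The tenant-binding rule is worth seeing with a hostile argument. In this Python sketch the model-supplied arguments include another tenant's id and the tool overwrites it with the authenticated one. `Principal`, the dict-shaped request, and the list standing in for a queue are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principal:
    tenant_id: str
    scopes: frozenset[str]


def create_report(principal: Principal, model_args: dict, queue: list[dict]) -> dict:
    """Scope-checked tool: the model proposes arguments, this code decides."""
    if "reports:create" not in principal.scopes:
        return {"status": "denied", "reason": "missing scope reports:create"}
    # Overwrite whatever tenant the model supplied with the authenticated one.
    request = {**model_args, "tenant_id": principal.tenant_id}
    queue.append(request)  # stand-in for enqueueing onto a message bus
    return {"status": "queued", "position": len(queue)}


queue: list[dict] = []
caller = Principal("tenant-a", frozenset({"reports:create"}))
outcome = create_report(caller, {"tenant_id": "tenant-b", "campaign": 4821}, queue)
```

After this call the queued request carries `tenant-a`, not the `tenant-b` the model asked for, and a principal without the `reports:create` scope never reaches the queue at all.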
How a single request actually flows
User asks Insights: "Why did campaign 4821's CTR drop last week?"
Frontend --> API Gateway
    - reserve token budget for (tenant, "insights")
    - redact PII from the question
Identity
    - attach AiPrincipal { tenant, scopes }
Context Layer
    - pull 6 recent turns + salient facts
Memory
    - "user cares about mobile segment" (remembered)
Knowledge Base (Python)
    - hybrid recall -> rerank -> 8 chunks, tenant-scoped
Agents
    - plan: [sql -> forecast -> summarize]
    - run specialists, fill scratchpad
    - eval gate scores 0.93 (>= 0.90, pass)
LLM
    - router: "summarize" -> mid model, not frontier
Business APIs
    - create_report tool (scope checked) -> Service Bus
<-- answer + linked PDF report
Nine layers, one request, every hop governed. The model is never the thing in charge — the architecture is.
The numbers, in one place
| Metric | Before | After |
|---|---|---|
| Hallucination rate | 18% | 3% |
| Faithfulness (eval) | — | 0.96 |
| Answer-relevance (eval) | — | 0.91 |
| Context tokens / call | ~14,000 | ~3,500 |
| Cost / query | $0.021 | $0.008 |
| Agentic p95 latency | 4.2s | 1.8s |
| Injection attempts blocked | not measured | ~40 / week |
| Cross-tenant leaks (6 mo) | — | 0 |
| Tickets deflected / month | — | ~520 |
Adoption checklist
- Route every model call through one gateway — no SDK calls in controllers.
- Enforce per-tenant token budgets and audit every call's usage.
- Redact PII before anything crosses your boundary.
- Make identity a propagated object; push tenant filters into the data layer.
- Build a context assembler with a hard token budget — select, rank, compress.
- Add two-tier memory; persist only salient long-term facts.
- Use hybrid retrieval plus a cross-encoder rerank; allow empty results.
- Split monolithic prompts into orchestrator + specialist agents.
- Put an eval gate in front of users with a real threshold (we use 0.90).
- Route models by complexity; default to the cheapest that passes eval.
- Expose actions only as typed, scope-checked tools; bind tenant in code.
The honest stuff: when NOT to do this
This architecture earns its complexity at scale and under real risk. It is overkill for plenty of cases:
- You have one AI feature and no multi-tenancy. A clean service with a thin gateway is enough — don't build nine layers for one chatbot.
- Your traffic is low. If you serve hundreds of AI calls a day, model routing and budget reservation save you pennies and cost you weeks.
- There's no regulated or cross-tenant data. The identity and governance layers exist to stop leaks and injection. No sensitive data, far less payoff.
- You haven't shipped a v1 yet. Build the naive version first. You cannot design the right context budget or eval gate before you've watched the simple thing fail.
- You lack an eval set. The eval gate is theater without labeled data to score against. Build the dataset before the gate.
- Your latency budget is single-digit milliseconds and non-negotiable. Multi-layer orchestration adds hops; some real-time paths can't afford them.
- Your team is one or two engineers. This is a 5+ backend-engineer architecture. Smaller teams should adopt the gateway and the rerank first and stop there.
We didn't build all nine layers at once, either. We added the gateway after the billing incident, identity after the first leak scare, the rerank after measuring hallucination, the eval gate after a confident-but-wrong answer reached a customer. Each layer was a response to a real failure — which is exactly how you should adopt it.
The model to carry forward
Treat AI as a tier, not a feature. Your database has a gateway, identity, isolation, and an audit trail because you learned, painfully, that direct access doesn't scale or stay safe. The model is no different. Give it the same operational respect and the same architectural boundaries.
Three habits that keep it healthy:
- Measure before you add a layer. Every layer here exists because a number was bad. No number, no layer.
- Make governance structural, not textual. If a guarantee lives in a prompt, it isn't a guarantee. Put it in code and in the data layer.
- Default to the cheapest path that passes eval. Spend frontier tokens only where eval proves you need them.
By 2027, "we added AI" will mean nothing — everyone will have. What will separate the systems that endure is whether AI was architected or merely attached. Build the tier.
Further reading
- Context Engineering for Enterprise GenAI, Part 1: Context Management
- Context Engineering for Enterprise GenAI, Part 3: Multi-Agent Architecture
- Context Engineering for Enterprise GenAI, Part 4: Enterprise AI Design
Building an AI-native tier and want a second pair of eyes on the architecture? I'm always happy to compare notes — reach me at randhir.jassal@gmail.com.