Context Engineering for Enterprise AI, Part 1: Context Management (Why RAG Alone Isn't Enough)
Part 1 of a context-engineering series: why naive RAG hallucinates and the C#+Python context layer that fixes it — rewriting, re-ranking, budgeting.
- Author
- Randhir Jassal
This is Part 1 of Context Engineering for Enterprise AI, a four-part series on building a production context-engineering layer for enterprise generative AI. This part is about context management: the unglamorous plumbing that decides what actually lands inside the model's context window. We are going to take Mattrx Help from an 18% wrong-answer rate to 3% without touching the model — only by getting smarter about what we feed it.
TL;DR
The model is not your product. The context you assemble is. Prompt engineering tweaks the wrapper; context engineering controls the payload. Naive top-k RAG retrieves the wrong chunks, blows the token budget, and leaves the answer ungrounded. The fix is a pipeline: rewrite the query, retrieve hybrid, re-rank, budget, compress, ground, and cite, or refuse.
| Concern | Prompt engineering | Context engineering | Verdict |
|---|---|---|---|
| What you control | Wording of the instruction | What data enters the window | Context wins |
| Failure mode | Model "ignores" the prompt | Wrong/missing/stale chunks | Fix the input |
| Retrieval | Vector top-k, dump it | Hybrid + re-rank + compress | Engineered |
| Token cost | Grows with naive context | Budgeted + deduped + compressed | 3.5k vs 14k |
| Hallucination | Hope | Grounding + refuse path | 3% vs 18% |
| Where it lives | Prompt string | C# Context API + Python retrieval svc | Split by concern |
Production metrics after shipping context management to Mattrx Help:
- Wrong-answer / hallucination rate: 18% (naive RAG) -> 3% (engineered context).
- Context tokens per request: ~14,000 (naive dump) -> ~3,500 (engineered + compressed).
- Mattrx Help still deflects ~520 support tickets/month — now with citations users trust.
- Faithfulness/groundedness eval score: 0.96; answer-relevance eval: 0.91.
- Cost per AI query: $0.021 -> $0.008 (less context, fewer retries).
- Retrieval recall@5 after re-rank: 0.94 (up from 0.71 vector-only).
- C# Context API p95: 140 ms for assembly (excludes generation).
- Python retrieval service p95: 95 ms for hybrid + cross-encoder re-rank.
- Empty-retrieval refuse rate: 6% of queries — answered honestly instead of guessed.
- Prompt-injection attempts blocked at the boundary: ~40/week.
The one mental shift
Stop asking "what prompt makes the model answer correctly?" Start asking "what is the smallest, most relevant, most trustworthy set of tokens that makes the correct answer the only reasonable one?" The model is a function of its context window. Engineer the window, not the wording.
A frontier model with garbage context will confidently produce garbage. A mediocre model with surgically assembled context will produce a correct, cited answer. Context engineering is the discipline of treating the context window as a scarce, governed resource — like a CPU cache, not a junk drawer.
The running example
Mattrx is a multi-tenant marketing-analytics SaaS — 110k MAU, ~3,200 req/sec at peak. The enterprise app, orchestration, governance, and system-of-record are ASP.NET Core / .NET 9 on Azure SQL. The AI compute lives in a separate Python FastAPI service doing embeddings, retrieval, ranking, and evaluation against Azure OpenAI + Azure AI Search. The C# app calls Python over HTTP.
Mattrx Help is the RAG product: it answers customer questions over product docs, runbooks, and release notes, and deflects ~520 support tickets a month. When it shipped on naive top-k RAG, 18% of answers were wrong or made-up. Support stopped trusting it. This part is how we fixed the context, not the model.
Section A — Query understanding and rewriting
Users do not ask retrievable questions. They ask "why is my dashboard empty again" three weeks after a feature rename. Embedding that raw string and searching is how you retrieve the wrong era of docs.
Before
We embedded the raw user question and searched. No rewriting, no expansion, no decomposition.
# python: naive retrieval, embed the raw question and dump top-k
# ENDPOINT, KEY, and search_client (an azure.search.documents.SearchClient)
# are assumed to be configured elsewhere.
from azure.search.documents.models import VectorizedQuery
from openai import AzureOpenAI

client = AzureOpenAI(azure_endpoint=ENDPOINT, api_key=KEY, api_version="2024-10-21")

def retrieve(question: str) -> list[str]:
    emb = client.embeddings.create(
        model="text-embedding-3-large", input=question
    ).data[0].embedding
    results = search_client.search(
        search_text=None,  # vector-only: no keyword channel
        vector_queries=[VectorizedQuery(vector=emb, k_nearest_neighbors=8,
                                        fields="contentVector")],
    )
    return [r["content"] for r in results]  # 8 chunks, no idea if relevant
A question like "why is my dashboard empty" embeds near every doc mentioning dashboards. It misses the actual cause ("Insights migration changed the default date range") because the user never said "date range."
After
We run a cheap rewrite/expansion pass on the C# orchestrator boundary before retrieval. One model call turns a vague question into a self-contained query plus a few expanded variants, and flags whether decomposition is needed.
// csharp: ASP.NET Core, query understanding before retrieval
public sealed record QueryPlan(string Rewritten, string[] Expansions, bool MultiHop);

public sealed class QueryRewriter(IChatCompletionService chat)
{
    // Web defaults make property matching case-insensitive, so the model's
    // camelCase keys (rewritten, expansions, multiHop) bind to the record.
    private static readonly JsonSerializerOptions Json = new(JsonSerializerDefaults.Web);

    public async Task<QueryPlan> PlanAsync(string raw, string tenant, CancellationToken ct)
    {
        var history = new ChatHistory(
            "Rewrite the user question into a single self-contained search query. " +
            "Then give 2 expansion queries covering synonyms/renamed features. " +
            "Set multiHop=true only if it needs multiple retrievals. " +
            "Reply as strict JSON: {rewritten, expansions[], multiHop}.");
        history.AddUserMessage(raw);
        var reply = await chat.GetChatMessageContentAsync(
            history,
            new OpenAIPromptExecutionSettings { Temperature = 0, MaxTokens = 220 },
            cancellationToken: ct);

        QueryPlan? plan = null;
        try { plan = JsonSerializer.Deserialize<QueryPlan>(reply.Content ?? "", Json); }
        catch (JsonException ex)
        {
            // Malformed JSON: fall back to the raw question, but never silently.
            Log.Warning(ex, "QueryPlan fallback to raw tenant={Tenant}", tenant);
        }
        plan ??= new QueryPlan(raw, Array.Empty<string>(), false);
        Log.Information("QueryPlan tenant={Tenant} multiHop={Hop}", tenant, plan.MultiHop);
        return plan;
    }
}
Diagnostic to confirm the rewrite is actually firing (and not silently falling back to raw):
az monitor app-insights query --app mattrx-help \
  --analytics-query "traces | where message has 'QueryPlan' | summarize count() by tostring(customDimensions.Hop)"
Mattrx metric: query rewriting lifted recall@5 from 0.71 to 0.86 on its own, before any re-ranking — the single highest-leverage change in the whole pipeline.
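The rewriter asks the model for strict JSON, and the checklist later requires that any fallback to the raw question be logged rather than silent. Here is a minimal Python sketch of that parsing step, for readers who want to test the behavior outside the C# orchestrator. The `QueryPlan` dataclass, the `fell_back` flag, and the example strings are illustrative assumptions, not part of the Mattrx codebase.

```python
import json
import logging
from dataclasses import dataclass, field

log = logging.getLogger("query_plan")


@dataclass(frozen=True)
class QueryPlan:
    rewritten: str
    expansions: list[str] = field(default_factory=list)
    multi_hop: bool = False
    fell_back: bool = False  # True when the raw question was used instead


def parse_plan(raw_question: str, model_reply: str) -> QueryPlan:
    """Parse the rewriter's JSON reply; fall back to the raw question, loudly."""
    try:
        data = json.loads(model_reply)
        rewritten = data["rewritten"].strip()
        if not rewritten:
            raise ValueError("empty rewrite")
        # Drop blank or non-string expansions instead of searching on them.
        expansions = [e for e in data.get("expansions", [])
                      if isinstance(e, str) and e.strip()]
        return QueryPlan(rewritten, expansions, bool(data.get("multiHop", False)))
    except (KeyError, AttributeError, TypeError, ValueError) as exc:
        # json.JSONDecodeError is a ValueError, so malformed JSON lands here too.
        log.warning("QueryPlan fallback to raw question: %s", exc)
        return QueryPlan(raw_question, fell_back=True)


plan = parse_plan(
    "why is my dashboard empty",
    '{"rewritten": "dashboard shows no data after Insights migration", '
    '"expansions": ["default date range changed"], "multiHop": false}',
)
print(plan.rewritten, plan.fell_back)
```

Carrying `fell_back` on the plan means the fallback rate can be counted downstream, which makes a silent regression visible.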
Section B — Hybrid retrieval + re-ranking
Vector search alone is fuzzy on exact tokens: error codes, API names, SKU strings, version numbers. Keyword search alone is brittle on paraphrase. You need both, then a re-ranker to sort the candidate pool by true relevance.
Before
Vector-only, top-k=8, no keyword channel, no re-rank. Whatever cosine similarity returned first won, including near-duplicates from three doc versions.
NAIVE RETRIEVAL
question --> embed --> vector top-8 --> dump all 8 into prompt
(dupes, stale versions, off-topic)
After
The FastAPI service runs hybrid search (BM25 keyword + vector) over Azure AI Search to build a wide candidate pool of ~40, then a cross-encoder re-ranker scores each candidate against the query and keeps the top 6. Re-ranking is where recall turns into precision.
# python: FastAPI retrieval service, hybrid search + cross-encoder re-rank
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from fastapi import FastAPI
from openai import AzureOpenAI
from pydantic import BaseModel
from sentence_transformers import CrossEncoder

# ENDPOINT, KEY, SEARCH_ENDPOINT, SEARCH_KEY, and INDEX are assumed to come
# from configuration; the last three names are placeholders.
app = FastAPI()
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512)
aoai = AzureOpenAI(azure_endpoint=ENDPOINT, api_key=KEY, api_version="2024-10-21")
search_client = SearchClient(endpoint=SEARCH_ENDPOINT, index_name=INDEX,
                             credential=AzureKeyCredential(SEARCH_KEY))

class RetrieveReq(BaseModel):
    rewritten: str
    expansions: list[str] = []
    tenant: str
    top_k: int = 6

@app.post("/retrieve")
def retrieve(req: RetrieveReq):
    emb = aoai.embeddings.create(
        model="text-embedding-3-large", input=req.rewritten
    ).data[0].embedding

    # OData string literals escape a single quote by doubling it.
    tenant = req.tenant.replace("'", "''")

    # Hybrid: BM25 keyword channel + vector channel, tenant-scoped
    raw = search_client.search(
        search_text=req.rewritten,  # keyword (BM25)
        vector_queries=[VectorizedQuery(vector=emb, k_nearest_neighbors=40,
                                        fields="contentVector")],
        filter=f"tenantId eq '{tenant}'",
        top=40,
    )
    pool = [{"id": r["id"], "content": r["content"], "source": r["url"]} for r in raw]
    if not pool:
        return {"chunks": [], "empty": True}

    # Cross-encoder re-rank: score (query, chunk) pairs, keep best
    pairs = [(req.rewritten, c["content"]) for c in pool]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(pool, scores), key=lambda x: x[1], reverse=True)
    keep = [{**c, "score": float(s)} for c, s in ranked[: req.top_k] if s > 0.20]
    return {"chunks": keep, "empty": len(keep) == 0}
Diagnostic — eyeball the re-rank score distribution to tune the cutoff:
curl -s localhost:8001/retrieve -H 'Content-Type: application/json' \
  -d '{"rewritten":"dashboard empty after migration","tenant":"acme"}' \
  | python -c "import sys,json;print([round(c['score'],2) for c in json.load(sys.stdin)['chunks']])"
Mattrx metric: hybrid + cross-encoder re-rank pushed recall@5 from 0.86 (rewrite-only) to 0.94, and cut near-duplicate chunks reaching the window by 80%. Python retrieval p95 came in at 95 ms because the cross-encoder scores a candidate pool of about 40, not the whole index.
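It helps to see how two ranked lists become one candidate pool. Azure AI Search merges the keyword and vector results of a hybrid query with Reciprocal Rank Fusion (RRF). The sketch below shows the idea on plain lists of document IDs. The IDs are made up, and `k=60` is the constant commonly used with RRF; treat it as an assumption rather than the service's exact configuration.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results for "dashboard empty after migration"
bm25 = ["doc-err-4012", "doc-dashboards", "doc-date-range"]
vector = ["doc-date-range", "doc-insights-migration", "doc-dashboards"]

for doc_id, score in rrf_fuse([bm25, vector]):
    print(f"{doc_id:24s} {score:.4f}")
```

Documents that appear in both channels rise to the top even when neither channel ranked them first. The fused order is still only a candidate ordering, so the cross-encoder pass remains worthwhile.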
Section C — Context assembly and token budgeting
This is the part everyone skips and it is the part that controls cost and quality. The context window is a fixed budget. System prompt, tool schemas, conversation history, and retrieved chunks all compete for it. Dump everything and you pay 14k tokens, drown the answer in noise, and hit the "lost in the middle" problem where the model ignores the center of a long context.
Before
The C# orchestrator concatenated system prompt + full history + all 8 chunks + every tool schema and shipped it. No budget, no dedup, no compression.
// csharp: BEFORE, concatenate everything and hope it fits
var prompt = systemPrompt
    + string.Join("\n", history.Select(m => m.Content))   // entire conversation
    + string.Join("\n\n", chunks)                         // all 8 chunks raw
    + JsonSerializer.Serialize(allToolSchemas);           // every tool, always
// ~14,000 tokens. $0.021/query. Answer quality: a coin flip.
After
We treat the window as a budget with explicit allocations, assemble in priority order, dedupe by content hash, and ask the Python service to contextually compress chunks (extract only sentences relevant to the query) before they land. Here is the target allocation:
CONTEXT WINDOW BUDGET (target ~3,500 tokens)
+----------------------------+---------+----------------------------+
| System prompt + guardrails | 450 | ########## |
| Tool schemas (filtered) | 300 | ####### |
| Conversation history (n=3) | 600 | ############# |
| Retrieved chunks (comp.) | 1900 | ######################## . |
| Answer headroom | 250 | ##### |
+----------------------------+---------+----------------------------+
= 3500 (was ~14,000 unbudgeted)
The assembly logic lives in the C# Context API. It allocates, fills highest-priority first, and truncates the lowest-priority bucket (history before chunks) when the budget is tight.
// csharp: ASP.NET Core minimal API, budgeted and deduped context assembly
app.MapPost("/context/assemble", async (
    AssembleRequest req, IRetrievalClient retrieval, ITokenizer tok,
    CancellationToken ct) =>
{
    const int Budget = 3500;
    var used = 0;
    var parts = new List<string>();

    // 1. System prompt + guardrails: non-negotiable, always included
    used += tok.Count(SystemPrompt);
    parts.Add(SystemPrompt);

    // 2. Filter tool schemas to ones plausibly relevant to this query
    var tools = ToolCatalog.Relevant(req.Rewritten);
    used += tok.Count(tools);
    parts.Add(tools);

    // 3. Retrieve compressed, deduped chunks from the Python service
    var r = await retrieval.RetrieveAsync(req.Rewritten, req.Expansions, req.Tenant, ct);
    if (r.Empty)
        return Results.Ok(new ContextResult(Empty: true, Context: "", Citations: []));

    var seen = new HashSet<string>();
    var citations = new List<Citation>();
    foreach (var c in r.Chunks)
    {
        var hash = Sha.Of(c.Content);
        if (!seen.Add(hash)) continue;              // dedupe identical text
        var cost = tok.Count(c.Content);
        if (used + cost > Budget - 250) break;      // reserve answer headroom
        used += cost;
        parts.Add($"[{citations.Count + 1}] {c.Content}");
        citations.Add(new Citation(citations.Count + 1, c.Source, c.Score));
    }

    // 4. History fills whatever budget remains, newest-first
    foreach (var m in req.History.AsEnumerable().Reverse())
    {
        var cost = tok.Count(m.Content);
        if (used + cost > Budget - 250) break;
        used += cost;
        parts.Insert(2, m.Content);
    }

    Log.Information("Context assembled tenant={Tenant} tokens={Used}", req.Tenant, used);
    return Results.Ok(new ContextResult(false, string.Join("\n\n", parts), citations));
}).WithName("AssembleContext");
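The same priority-ordered fill can be sketched in Python, which is handy for unit-testing the budget rules without standing up the API. This is an illustration under stated assumptions, not the production handler: `count_tokens` is a whitespace stand-in for a real tokenizer, `Chunk` and `assemble` are hypothetical names, and the three-turn history cap mirrors the `n=3` row in the budget table.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Chunk:
    content: str
    source: str
    score: float


def count_tokens(text: str) -> int:
    # Stand-in only: whitespace split. Swap in the model's real tokenizer.
    return len(text.split())


def assemble(system_prompt: str, tools: str, chunks: list[Chunk],
             history: list[str], budget: int = 3500, headroom: int = 250,
             max_turns: int = 3) -> tuple[str, list[tuple[int, str]], int]:
    limit = budget - headroom  # reserve answer headroom
    used = count_tokens(system_prompt) + count_tokens(tools)

    seen: set[str] = set()
    chunk_parts: list[str] = []
    citations: list[tuple[int, str]] = []
    for c in chunks:  # assumed already sorted best-first by the re-ranker
        digest = hashlib.sha256(c.content.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # dedupe identical text
        seen.add(digest)
        cost = count_tokens(c.content)
        if used + cost > limit:
            break
        used += cost
        n = len(citations) + 1
        chunk_parts.append(f"[{n}] {c.content}")
        citations.append((n, c.source))

    # History is the lowest-priority bucket: newest turns claim what is left.
    history_parts: list[str] = []
    for turn in reversed(history[-max_turns:]):
        cost = count_tokens(turn)
        if used + cost > limit:
            break
        used += cost
        history_parts.insert(0, turn)  # keep chronological order in the prompt

    parts = [system_prompt, tools, *history_parts, *chunk_parts]
    return "\n\n".join(parts), citations, used
```

Filling chunks before history is what makes history the bucket that gets truncated first when the budget is tight.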
And the Python side that does the actual compression — extract only query-relevant sentences so a 900-token chunk shrinks to the 180 tokens that matter:
# python: contextual compression, keep only sentences that earn their tokens
import re

def compress(chunk: str, query: str, keep_ratio: float = 0.4) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", chunk)
    if len(sentences) <= 2:
        return chunk
    pairs = [(query, s) for s in sentences]
    scores = reranker.predict(pairs)  # reuse the cross-encoder
    ranked = sorted(zip(sentences, scores), key=lambda x: x[1], reverse=True)
    keep_n = max(2, int(len(sentences) * keep_ratio))
    chosen = {s for s, _ in ranked[:keep_n]}
    # preserve original order so the prose still reads coherently
    return " ".join(s for s in sentences if s in chosen)
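To exercise the compression logic without loading the cross-encoder, one option is a variant that takes the scorer as a parameter. In the sketch below, `compress_with` and `overlap_score` are hypothetical names, and the word-overlap scorer is a toy stand-in for `reranker.predict`, used only so the example runs anywhere.

```python
import re
from typing import Callable, Sequence

Scorer = Callable[[Sequence[tuple[str, str]]], Sequence[float]]


def overlap_score(pairs: Sequence[tuple[str, str]]) -> list[float]:
    """Toy scorer: fraction of query words that appear in the sentence."""
    out = []
    for query, sentence in pairs:
        q = set(re.findall(r"\w+", query.lower()))
        s = set(re.findall(r"\w+", sentence.lower()))
        out.append(len(q & s) / max(1, len(q)))
    return out


def compress_with(chunk: str, query: str, scorer: Scorer = overlap_score,
                  keep_ratio: float = 0.4) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    if len(sentences) <= 2:
        return chunk
    scores = scorer([(query, s) for s in sentences])
    # Rank by index rather than by text so repeated sentences are handled cleanly.
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep_n = max(2, int(len(sentences) * keep_ratio))
    chosen = set(order[:keep_n])
    # Emit survivors in their original order so the prose still reads coherently.
    return " ".join(s for i, s in enumerate(sentences) if i in chosen)


chunk = (
    "Dashboards load from the warehouse. "
    "The Insights migration changed the default date range. "
    "Billing runs nightly. "
    "An empty dashboard usually means the date range excludes your data. "
    "Contact sales for pricing."
)
print(compress_with(chunk, "dashboard empty date range"))
```

In production the scorer would be the cross-encoder; passing it in keeps the sentence-selection rules testable on their own.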
Diagnostic — assert the budget is actually being honored in production:
az monitor app-insights query --app mattrx-help --analytics-query \
  "traces | where message has 'Context assembled' | summarize p95=percentile(toint(customDimensions.Used),95)"
Mattrx metric: budgeting + dedup + compression cut context tokens from ~14,000 to ~3,500 and cost per query from $0.021 toward $0.008, while improving answer quality — less noise in the middle of the window means the model stops getting distracted.
Section D — Grounding and citations (and the refuse path)
The most dangerous answer is a fluent, confident, wrong one with no citation. Grounding means the model answers only from the assembled context and attaches inline citations to the chunks it used. And when retrieval comes back empty, the system must refuse — not improvise.
Before
The model answered freely from parametric memory. No citations, no refusal — it would invent a plausible-sounding config setting that never existed.
UNGROUNDED
context (maybe empty) --> model --> fluent answer (no sources)
--> 18% wrong, support stops trusting it
After
The pipeline ends with two rules enforced in C#: (1) if Empty, short-circuit to an honest refusal that routes the user to a human — never call the model; (2) otherwise instruct the model to ground every claim and emit [n] citations that map back to the assembled chunks. Here is the full grounded flow:
GROUNDED CONTEXT-ASSEMBLY PIPELINE
user question
|
v
[C#] query rewrite + expansion ----------------+
| |
v |
[Py] hybrid search (BM25 + vector, 40) ---------+--> Azure AI Search
|
v
[Py] cross-encoder re-rank --> top 6 --> compress
|
v
[C#] assemble: budget 3500, dedup, prioritize
|
+--(empty?)--> REFUSE + route to human ( ~6% of queries )
|
v
[C#] grounded generation w/ inline [n] citations --> answer + sources
// csharp: grounding + explicit refuse path
public async Task<HelpAnswer> AnswerAsync(string raw, string tenant,
    IReadOnlyList<Turn> history, CancellationToken ct)
{
    var plan = await _rewriter.PlanAsync(raw, tenant, ct);
    var ctx = await _context.AssembleAsync(plan, tenant, history, ct);

    if (ctx.Empty) // retrieval found nothing trustworthy: do NOT guess
    {
        Log.Information("Refuse: empty retrieval tenant={Tenant}", tenant);
        return HelpAnswer.Refused(
            "I don't have a documented answer for that. I've routed you to support.");
    }

    var sys = new ChatHistory(
        "Answer ONLY from the provided context. Cite sources inline as [n]. " +
        "If the context does not contain the answer, say so. Do not use outside knowledge.");
    sys.AddUserMessage($"Context:\n{ctx.Context}\n\nQuestion: {raw}");

    var reply = await _chat.GetChatMessageContentAsync(
        sys, new OpenAIPromptExecutionSettings { Temperature = 0.1 }, cancellationToken: ct);
    return new HelpAnswer(reply.Content!, ctx.Citations, Refused: false);
}
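The prompt asks for `[n]` citations that map back to the assembled chunks, but a prompt alone cannot guarantee the mapping holds. One possible safeguard, sketched here as an assumption and not described as part of the Mattrx pipeline, is a post-generation check that every cited number exists and that the answer cites something at all. The `check_citations` helper is hypothetical.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def check_citations(answer: str, citation_count: int) -> dict:
    """Compare [n] markers in the answer against the chunks that were assembled."""
    cited = {int(n) for n in CITATION.findall(answer)}
    valid = set(range(1, citation_count + 1))
    return {
        "cited": sorted(cited & valid),      # markers that map to a real chunk
        "dangling": sorted(cited - valid),   # markers pointing at nothing
        "uncited_answer": not cited,         # fluent answer with no sources
    }


report = check_citations(
    "Set the date range to 90 days [1]. See the migration notes [4].",
    citation_count=2,
)
print(report)
```

A non-empty `dangling` list or a true `uncited_answer` flag could be logged or routed to the same refusal path, depending on how strict the product needs to be.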
Diagnostic — watch the refuse rate; a sudden drop to ~0% means grounding silently regressed and the model is improvising again:
az monitor app-insights query --app mattrx-help --analytics-query \
"traces | where message has 'Refuse:' | summarize refusals=count() by bin(timestamp, 1d)"
Mattrx metric: grounding + the refuse path took the wrong-answer rate from 18% to 3%, lifted faithfulness eval to 0.96, and made support trust the ~520 monthly deflections — because every answer now carries a clickable source, and "I don't know" is a valid, honest output 6% of the time.
Aggregate metrics
| Metric | Before (naive RAG) | After (engineered context) |
|---|---|---|
| Wrong-answer / hallucination rate | 18% | 3% |
| Context tokens per request | ~14,000 | ~3,500 |
| Cost per AI query | $0.021 | $0.008 |
| Faithfulness eval score | 0.71 | 0.96 |
| Answer-relevance eval | 0.68 | 0.91 |
| Recall@5 (retrieval) | 0.71 | 0.94 |
| Tickets deflected / month | 520 (low trust) | 520 (trusted, cited) |
| C# Context API p95 | n/a | 140 ms |
| Python retrieval p95 | 120 ms | 95 ms |
| Empty-retrieval refuse rate | 0% (guessed) | 6% (honest) |
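The article does not say how recall@5 was measured, so the following is the standard definition and not a description of the Mattrx eval harness. It assumes a labeled set in which each question has a known set of relevant chunk IDs; the function names are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Share of the relevant IDs that show up in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("need at least one relevant id")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per question."""
    return sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Two hypothetical labeled questions
runs = [
    (["a", "b", "c", "d", "e", "f"], {"a", "f"}),  # "f" sits at rank 6, so it misses
    (["x", "y"], {"x"}),
]
print(mean_recall_at_k(runs))
```

Measuring each stage (vector-only, rewrite, re-rank) against the same labeled set is what makes before-and-after numbers comparable.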
Pre-ship checklist
- Query rewriting runs before every retrieval; raw-question fallback is logged, not silent.
- Retrieval is hybrid (BM25 keyword + vector), not vector-only.
- A cross-encoder re-ranker sorts a wide candidate pool down to the final top-k.
- Every retrieval call is tenant-scoped with a `filter`: no cross-tenant leakage.
- The context window has an explicit token budget with per-bucket allocations.
- Chunks are deduplicated by content hash before entering the window.
- Contextual compression strips non-relevant sentences from each chunk.
- Answer headroom is reserved in the budget so generation never truncates.
- The system prompt enforces "answer only from context, cite inline as [n]".
- An explicit refuse path triggers on empty/low-score retrieval — the model is never called.
- Token count, refuse rate, and recall are emitted as telemetry and alarmed on.
- Prompt-injection patterns are stripped/flagged at the C# boundary (~40/week blocked).
Honest stuff
- Cross-encoders cost latency. Re-ranking adds ~30-50 ms. If you are under a 300 ms hard p95 with simple, well-tagged docs, vector-only may genuinely be enough. Measure before you add it.
- Query rewriting can over-rewrite. A confident rewrite of an already-precise question (an exact error code) can hurt recall. Keep `temperature=0`, and keep the keyword channel so exact tokens survive a bad rewrite.
- Compression is lossy by design. Aggressive `keep_ratio` can drop the one caveat sentence that mattered. Tune it per corpus and gate changes behind your faithfulness eval, not vibes.
- Token budgeting needs a real tokenizer. Estimating tokens as `chars/4` will silently overflow on code-heavy or non-English docs. Use the model's actual tokenizer in C#.
- The refuse path will annoy people. A 6% "I don't know" rate is correct behavior, but stakeholders read it as "the bot is dumb." You have to defend it with the hallucination numbers.
- Re-rankers go stale. A generic `ms-marco` cross-encoder is fine to start; on a specialized corpus you may eventually need a domain-tuned one. Don't tune it before you've shipped the basics.
- Hybrid search hides config bugs. If BM25 quietly returns nothing (wrong analyzer), vector results mask it and you'll never notice until a keyword-heavy query fails. Test each channel in isolation.
- More context is not more quality. Past a point, adding chunks lowers answer quality via "lost in the middle." The win here came from removing tokens, not adding them.
The closing mental model
The context window is a budget you spend on relevance, not a bucket you fill with hope.
Three enforceable habits:
- Budget every byte. No context enters the window without an allocation, a dedup check, and a token cost. If it doesn't fit the budget, it doesn't ship — full stop.
- Rewrite, retrieve hybrid, re-rank — always in that order. Never embed a raw user question. Never trust a single retrieval channel. Never skip the re-rank.
- Refuse before you guess. Empty or low-score retrieval is a first-class outcome. Wire the refuse path before you wire the happy path, or you will ship the 18%.
Continue the series
This is Part 1 of 4 in Context Engineering for Enterprise AI. Next: Part 2, The Memory Layer, for when retrieved context isn't enough and the system needs to remember across turns, sessions, and tenants. The full series:
- Part 1: Context Management (you are here)
- Part 2: The Memory Layer
- Part 3: Multi-Agent Architecture
- Part 4: Enterprise AI Design
Further reading
- Part 2: The Memory Layer — the next step once retrieval alone stops scaling.
- Part 3: Multi-Agent Architecture — how context flows between specialized agents.
- RAG with Azure OpenAI, Azure SQL, and C# — the end-to-end RAG plumbing this part builds on.
- LLM patterns in .NET — Semantic Kernel and orchestration patterns for the C# side.
- LLM foundations: how they actually work (Python) — why context windows and tokenization behave the way they do.
Building a context layer over Azure OpenAI and Azure AI Search and fighting the same 18% wrong-answer problem? Tell me what your retrieval is actually returning — email me at randhir.jassal@gmail.com and I'll tell you which stage of the pipeline is leaking.