Context Engineering for Enterprise AI, Part 1: Context Management (Why RAG Alone Isn't Enough)

This is Part 1 of Context Engineering for Enterprise AI, a four-part series on building a production context-engineering layer for enterprise generative AI. This part is about context management: the unglamorous plumbing that decides what actually lands inside the model's context window. We are going to take Mattrx Help from an 18% wrong-answer rate to 3% without touching the model — only by getting smarter about what we feed it.

TL;DR

The model is not your product. The context you assemble is. Prompt engineering tweaks the wrapper; context engineering controls the payload. Naive top-k RAG dumps the wrong chunks, blows the token budget, and ungrounds the answer. The fix is a pipeline: rewrite the query, retrieve hybrid, re-rank, budget, compress, ground, cite — or refuse.

Concern	Prompt engineering	Context engineering	Verdict
What you control	Wording of the instruction	What data enters the window	Context wins
Failure mode	Model "ignores" the prompt	Wrong/missing/stale chunks	Fix the input
Retrieval	Vector top-k, dump it	Hybrid + re-rank + compress	Engineered
Token cost	Grows with naive context	Budgeted + deduped + compressed	3.5k vs 14k
Hallucination	Hope	Grounding + refuse path	3% vs 18%
Where it lives	Prompt string	C# Context API + Python retrieval svc	Split by concern

Production metrics after shipping context management to Mattrx Help:

Wrong-answer / hallucination rate: 18% (naive RAG) -> 3% (engineered context).
Context tokens per request: ~14,000 (naive dump) -> ~3,500 (engineered + compressed).
Mattrx Help still deflects ~520 support tickets/month — now with citations users trust.
Faithfulness/groundedness eval score: 0.96; answer-relevance eval: 0.91.
Cost per AI query: $0.021 -> $0.008 (less context, fewer retries).
Retrieval recall@5 after re-rank: 0.94 (up from 0.71 vector-only).
C# Context API p95: 140 ms for assembly (excludes generation).
Python retrieval service p95: 95 ms for hybrid + cross-encoder re-rank.
Empty-retrieval refuse rate: 6% of queries — answered honestly instead of guessed.
Prompt-injection attempts blocked at the boundary: ~40/week.

The one mental shift

Stop asking "what prompt makes the model answer correctly?" Start asking "what is the smallest, most relevant, most trustworthy set of tokens that makes the correct answer the only reasonable one?" The model is a function of its context window. Engineer the window, not the wording.

A frontier model with garbage context will confidently produce garbage. A mediocre model with surgically assembled context will produce a correct, cited answer. Context engineering is the discipline of treating the context window as a scarce, governed resource — like a CPU cache, not a junk drawer.

The running example

Mattrx is a multi-tenant marketing-analytics SaaS — 110k MAU, ~3,200 req/sec at peak. The enterprise app, orchestration, governance, and system-of-record are ASP.NET Core / .NET 9 on Azure SQL. The AI compute lives in a separate Python FastAPI service doing embeddings, retrieval, ranking, and evaluation against Azure OpenAI + Azure AI Search. The C# app calls Python over HTTP.

Mattrx Help is the RAG product: it answers customer questions over product docs, runbooks, and release notes, and deflects ~520 support tickets a month. When it shipped on naive top-k RAG, 18% of answers were wrong or made-up. Support stopped trusting it. This part is how we fixed the context, not the model.

Section A — Query understanding and rewriting

Users do not ask retrievable questions. They ask "why is my dashboard empty again" three weeks after a feature rename. Embedding that raw string and searching is how you retrieve the wrong era of docs.

Before

We embedded the raw user question and searched. No rewriting, no expansion, no decomposition.

# python: naive retrieval — embed the raw question, dump top-k
from azure.search.documents.models import VectorizedQuery
from openai import AzureOpenAI

# ENDPOINT, KEY, and search_client (a SearchClient on the docs index) are configured elsewhere
client = AzureOpenAI(azure_endpoint=ENDPOINT, api_key=KEY, api_version="2024-10-21")

def retrieve(question: str) -> list[str]:
    emb = client.embeddings.create(
        model="text-embedding-3-large", input=question
    ).data[0].embedding
    results = search_client.search(
        search_text=None,  # no keyword channel: vector-only
        vector_queries=[VectorizedQuery(vector=emb, k_nearest_neighbors=8,
                                        fields="contentVector")],
        top=8,
    )
    return [r["content"] for r in results]  # 8 chunks, no idea if relevant

A question like "why is my dashboard empty" embeds near every doc mentioning dashboards. It misses the actual cause ("Insights migration changed the default date range") because the user never said "date range."

After

We run a cheap rewrite/expansion pass on the C# orchestrator boundary before retrieval. One model call turns a vague question into a self-contained query plus a few expanded variants, and flags whether decomposition is needed.

// csharp: ASP.NET Core — query understanding before retrieval
public sealed record QueryPlan(string Rewritten, string[] Expansions, bool MultiHop);

public sealed class QueryRewriter(IChatCompletionService chat)
{
    // The prompt asks for camelCase keys; Web defaults match property names case-insensitively.
    private static readonly JsonSerializerOptions JsonOpts = new(JsonSerializerDefaults.Web);

    public async Task<QueryPlan> PlanAsync(string raw, string tenant, CancellationToken ct)
    {
        var history = new ChatHistory(
            "Rewrite the user question into a single self-contained search query. " +
            "Then give 2 expansion queries covering synonyms/renamed features. " +
            "Set multiHop=true only if it needs multiple retrievals. " +
            "Reply as strict JSON: {rewritten, expansions[], multiHop}.");
        history.AddUserMessage(raw);

        var reply = await chat.GetChatMessageContentAsync(
            history,
            new OpenAIPromptExecutionSettings { Temperature = 0, MaxTokens = 220 },
            cancellationToken: ct);

        QueryPlan? plan = null;
        try
        {
            plan = JsonSerializer.Deserialize<QueryPlan>(reply.Content ?? "", JsonOpts);
        }
        catch (JsonException)
        {
            // Malformed JSON throws instead of returning null; fall through to the raw question.
        }

        if (plan is null || string.IsNullOrWhiteSpace(plan.Rewritten))
        {
            Log.Warning("QueryPlan fallback to raw question tenant={Tenant}", tenant);
            plan = new QueryPlan(raw, Array.Empty<string>(), false);
        }
        Log.Information("QueryPlan tenant={Tenant} multiHop={MultiHop}", tenant, plan.MultiHop);
        return plan;
    }
}

Diagnostic to confirm the rewrite is actually firing (and not silently falling back to raw):

az monitor app-insights query --app mattrx-help \
  --analytics-query "traces | where message has 'QueryPlan' | summarize count() by fellBack = message has 'fallback', multiHop = tostring(customDimensions.MultiHop)"

Mattrx metric: query rewriting lifted recall@5 from 0.71 to 0.86 on its own, before any re-ranking — the single highest-leverage change in the whole pipeline.
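
The fallback behavior described above is easy to get subtly wrong, so here is a minimal Python sketch of the same parse-or-fall-back logic that can be unit-tested without a model call. The `QueryPlan` dataclass, `parse_plan`, and the `fell_back` flag are illustrative names for this sketch, not part of the Mattrx codebase.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryPlan:
    rewritten: str
    expansions: tuple[str, ...] = ()
    multi_hop: bool = False
    fell_back: bool = False  # True when the raw question was used instead


def parse_plan(raw_question: str, model_reply: str, max_expansions: int = 2) -> QueryPlan:
    """Parse the rewriter's strict-JSON reply; fall back to the raw question on any defect."""
    try:
        data = json.loads(model_reply)
        rewritten = data["rewritten"]
        if not isinstance(rewritten, str) or not rewritten.strip():
            raise ValueError("empty rewrite")
        expansions = [
            e.strip() for e in data.get("expansions", [])
            if isinstance(e, str) and e.strip()
        ]
        return QueryPlan(
            rewritten=rewritten.strip(),
            expansions=tuple(expansions[:max_expansions]),
            multi_hop=bool(data.get("multiHop", False)),
        )
    except (KeyError, TypeError, ValueError):  # JSONDecodeError is a ValueError
        return QueryPlan(rewritten=raw_question, fell_back=True)
```

Carrying `fell_back` on the plan makes the fallback rate countable, which is what the diagnostic above is trying to confirm.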

Section B — Hybrid retrieval + re-ranking

Vector search alone is fuzzy on exact tokens: error codes, API names, SKU strings, version numbers. Keyword search alone is brittle on paraphrase. You need both, then a re-ranker to sort the candidate pool by true relevance.

Before

Vector-only, top-k=8, no keyword channel, no re-rank. Whatever cosine similarity returned first won, including near-duplicates from three doc versions.

NAIVE RETRIEVAL
  question --> embed --> vector top-8 --> dump all 8 into prompt
                                          (dupes, stale versions, off-topic)

After

The FastAPI service runs hybrid search (BM25 keyword + vector) over Azure AI Search to build a wide candidate pool of ~40, then a cross-encoder re-ranker scores each candidate against the query and keeps the top 6. Re-ranking is where recall turns into precision.

# python: FastAPI retrieval service — hybrid search + cross-encoder re-rank
from azure.core.credentials import AzureKeyCredential
from fastapi import FastAPI
from pydantic import BaseModel
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from sentence_transformers import CrossEncoder
from openai import AzureOpenAI

app = FastAPI()
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512)
aoai = AzureOpenAI(azure_endpoint=ENDPOINT, api_key=KEY, api_version="2024-10-21")
# SEARCH_ENDPOINT, SEARCH_INDEX, SEARCH_KEY are placeholders, configured like ENDPOINT and KEY
search_client = SearchClient(SEARCH_ENDPOINT, SEARCH_INDEX, AzureKeyCredential(SEARCH_KEY))

class RetrieveReq(BaseModel):
    rewritten: str
    expansions: list[str] = []   # accepted, but this handler only searches `rewritten`
    tenant: str
    top_k: int = 6

@app.post("/retrieve")
def retrieve(req: RetrieveReq):
    emb = aoai.embeddings.create(
        model="text-embedding-3-large", input=req.rewritten
    ).data[0].embedding

    # Escape single quotes so a tenant value cannot break out of the OData string literal
    tenant = req.tenant.replace("'", "''")

    # Hybrid: BM25 keyword channel + vector channel, tenant-scoped
    raw = search_client.search(
        search_text=req.rewritten,                       # keyword (BM25)
        vector_queries=[VectorizedQuery(vector=emb, k_nearest_neighbors=40,
                                        fields="contentVector")],
        filter=f"tenantId eq '{tenant}'",
        top=40,
    )
    pool = [{"id": r["id"], "content": r["content"], "source": r["url"]} for r in raw]
    if not pool:
        return {"chunks": [], "empty": True}

    # Cross-encoder re-rank: score (query, chunk) pairs, keep best
    pairs = [(req.rewritten, c["content"]) for c in pool]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(pool, scores), key=lambda x: x[1], reverse=True)
    # 0.20 is on the re-ranker's own score scale; recalibrate if the model or its activation changes
    keep = [{**c, "score": float(s)} for c, s in ranked[: req.top_k] if s > 0.20]
    return {"chunks": keep, "empty": len(keep) == 0}
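
The request model carries `expansions`, but the handler above searches only with `rewritten`. One way to use them, sketched here as an assumption rather than as the Mattrx implementation, is to run one search per query variant and merge the ranked ID lists with reciprocal rank fusion (RRF), the same family of method Azure AI Search uses to merge its keyword and vector rankings. The function name `rrf_fuse` and the constant `k=60` are choices made for this sketch.

```python
from collections import defaultdict
from typing import Iterable, Sequence


def rrf_fuse(ranked_lists: Iterable[Sequence[str]], k: int = 60, top_n: int = 40) -> list[str]:
    """Fuse several ranked lists of document IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Documents ranked well by several query variants
    rise to the top; ties break on ID so the output is deterministic.
    """
    scores: dict[str, float] = defaultdict(float)
    for ids in ranked_lists:
        for rank, doc_id in enumerate(ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]
```

The fused pool would then go through the cross-encoder exactly as before, so the re-ranker still has the final say on relevance.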

Diagnostic — eyeball the re-rank score distribution to tune the cutoff:

curl -s localhost:8001/retrieve -H 'Content-Type: application/json' \
  -d '{"rewritten":"dashboard empty after migration","tenant":"acme"}' \
  | python -c "import sys,json;print([round(c['score'],2) for c in json.load(sys.stdin)['chunks']])"

Mattrx metric: hybrid + cross-encoder re-rank pushed recall@5 from 0.86 (rewrite-only) to 0.94, and cut near-duplicate chunks reaching the window by 80%. Python retrieval p95 stayed at 95 ms because the cross-encoder runs on a small candidate pool, not the whole index.
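
For readers who want to reproduce a recall@5 number on their own corpus, here is a minimal sketch of the metric. It assumes a labeled set of (retrieved IDs, relevant IDs) pairs and defines recall@k as the fraction of relevant documents found in the top k; the article does not spell out the exact definition Mattrx used, so treat this as one common convention.

```python
from typing import Iterable, Sequence


def recall_at_k(retrieved: Sequence[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant IDs that appear in the first k retrieved IDs."""
    if not relevant:
        raise ValueError("each labeled query needs at least one relevant document")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs: Iterable[tuple[Sequence[str], set[str]]], k: int = 5) -> float:
    """Average recall@k over a labeled evaluation set."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs]
    if not scores:
        raise ValueError("evaluation set is empty")
    return sum(scores) / len(scores)
```

Running it once per pipeline stage (vector-only, after rewrite, after re-rank) gives the kind of stage-by-stage comparison quoted in this section.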

Section C — Context assembly and token budgeting

This is the part everyone skips and it is the part that controls cost and quality. The context window is a fixed budget. System prompt, tool schemas, conversation history, and retrieved chunks all compete for it. Dump everything and you pay 14k tokens, drown the answer in noise, and hit the "lost in the middle" problem where the model ignores the center of a long context.

Before

The C# orchestrator concatenated system prompt + full history + all 8 chunks + every tool schema and shipped it. No budget, no dedup, no compression.

// csharp: BEFORE — concatenate everything, hope it fits
var prompt = systemPrompt
    + string.Join("\n", history.Select(m => m.Content))   // entire conversation
    + string.Join("\n\n", chunks)                          // all 8 chunks raw
    + JsonSerializer.Serialize(allToolSchemas);            // every tool, always
// ~14,000 tokens. $0.021/query. Answer quality: a coin flip.

After

We treat the window as a budget with explicit allocations, assemble in priority order, dedupe by content hash, and ask the Python service to contextually compress chunks (extract only sentences relevant to the query) before they land. Here is the target allocation:

CONTEXT WINDOW BUDGET  (target ~3,500 tokens)
+----------------------------+---------+----------------------------+
| System prompt + guardrails |   450   | ##########                 |
| Tool schemas (filtered)    |   300   | #######                    |
| Conversation history (n=3) |   600   | #############              |
| Retrieved chunks (comp.)   |  1900   | ######################## . |
| Answer headroom            |   250   | #####                      |
+----------------------------+---------+----------------------------+
                              =  3500   (was ~14,000 unbudgeted)

The assembly logic lives in the C# Context API. It allocates, fills highest-priority first, and truncates the lowest-priority bucket (history before chunks) when the budget is tight.

// csharp: ASP.NET Core minimal API — budgeted, deduped context assembly
app.MapPost("/context/assemble", async (
    AssembleRequest req, IRetrievalClient retrieval, ITokenizer tok,
    CancellationToken ct) =>
{
    const int Budget = 3500;
    var used = 0;
    var parts = new List<string>();

    // 1. System prompt + guardrails — non-negotiable, always included
    used += tok.Count(SystemPrompt);
    parts.Add(SystemPrompt);

    // 2. Filter tool schemas to ones plausibly relevant to this query
    var tools = ToolCatalog.Relevant(req.Rewritten);
    used += tok.Count(tools);
    parts.Add(tools);

    // 3. Retrieve compressed, deduped chunks from the Python service
    var r = await retrieval.RetrieveAsync(req.Rewritten, req.Expansions, req.Tenant, ct);
    if (r.Empty)
        return Results.Ok(new ContextResult(Empty: true, Context: "", Citations: []));

    var seen = new HashSet<string>();
    var citations = new List<Citation>();
    foreach (var c in r.Chunks)
    {
        var hash = Sha.Of(c.Content);
        if (!seen.Add(hash)) continue;                 // dedupe identical text
        var cost = tok.Count(c.Content);
        if (used + cost > Budget - 250) break;          // reserve answer headroom
        used += cost;
        parts.Add($"[{citations.Count + 1}] {c.Content}");
        citations.Add(new Citation(citations.Count + 1, c.Source, c.Score));
    }

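    // Every chunk was a duplicate or over budget: report empty so the caller refuses
    if (citations.Count == 0)
        return Results.Ok(new ContextResult(Empty: true, Context: "", Citations: []));

    // 4. History fills whatever budget remains, newest-first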
    foreach (var m in req.History.AsEnumerable().Reverse())
    {
        var cost = tok.Count(m.Content);
        if (used + cost > Budget - 250) break;
        used += cost;
        parts.Insert(2, m.Content);
    }

    Log.Information("Context assembled tenant={Tenant} tokens={Used}", req.Tenant, used);
    return Results.Ok(new ContextResult(false, string.Join("\n\n", parts), citations));
}).WithName("AssembleContext");

And the Python side that does the actual compression: extract only query-relevant sentences, so a 900-token chunk shrinks to the sentences that matter (roughly 40% of them at the default keep_ratio of 0.4):

# python: contextual compression — keep only sentences that earn their tokens
import re

def compress(chunk: str, query: str, keep_ratio: float = 0.4) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", chunk)
    if len(sentences) <= 2:
        return chunk
    pairs = [(query, s) for s in sentences]
    scores = reranker.predict(pairs)                    # reuse the cross-encoder
    ranked = sorted(zip(sentences, scores), key=lambda x: x[1], reverse=True)
    keep_n = max(2, int(len(sentences) * keep_ratio))
    chosen = {s for s, _ in ranked[:keep_n]}
    # preserve original order so the prose still reads coherently
    return " ".join(s for s in sentences if s in chosen)
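
Because `compress` depends on the loaded cross-encoder, it is awkward to unit-test. The sketch below is a variant with an injectable scorer so the selection logic can be checked in isolation. `compress_with` and `overlap_scorer` are hypothetical names, and the word-overlap scorer is only a test stand-in, not a substitute for the re-ranker. It also selects sentences by index, so a sentence that appears twice is not kept twice by accident.

```python
import re
from typing import Callable, Sequence

Scorer = Callable[[str, Sequence[str]], Sequence[float]]


def overlap_scorer(query: str, sentences: Sequence[str]) -> list[float]:
    """Stand-in scorer: share of query words that appear in each sentence."""
    q = set(re.findall(r"\w+", query.lower()))
    return [len(q & set(re.findall(r"\w+", s.lower()))) / (len(q) or 1) for s in sentences]


def compress_with(chunk: str, query: str, scorer: Scorer = overlap_scorer,
                  keep_ratio: float = 0.4) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    if len(sentences) <= 2:
        return chunk
    scores = scorer(query, sentences)
    keep_n = max(2, int(len(sentences) * keep_ratio))
    # Rank sentence indices by score, then restore document order for readability
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep_n]
    return " ".join(sentences[i] for i in sorted(top))
```

In production the scorer argument would wrap `reranker.predict`; in tests it can be any deterministic function.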

Diagnostic — assert the budget is actually being honored in production:

az monitor app-insights query --app mattrx-help --analytics-query \
  "traces | where message has 'Context assembled' | summarize p95=percentile(toint(customDimensions.Used),95)"

Mattrx metric: budgeting + dedup + compression cut context tokens from ~14,000 to ~3,500 and cost per query from $0.021 toward $0.008, while improving answer quality — less noise in the middle of the window means the model stops getting distracted.
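
The allocation table gives each bucket its own cap, while the assembly endpoint shown earlier enforces only the overall total. Here is a minimal sketch of per-bucket enforcement, assuming the caps from the table and a whitespace word count as a stand-in tokenizer (a real deployment would pass the model's tokenizer as `count`). The names `CAPS`, `fill`, and `assemble` are illustrative.

```python
from typing import Callable, Sequence

# Caps from the allocation table. The remaining 250 tokens of answer headroom
# are deliberately left unspent so the total stays within 3,500.
CAPS = {"system": 450, "tools": 300, "history": 600, "chunks": 1900}


def fill(items: Sequence[str], cap: int, count: Callable[[str], int],
         contiguous: bool = False) -> tuple[list[str], int]:
    """Keep items in order until the cap is hit.

    contiguous=True stops at the first item that does not fit (history must not
    have holes); otherwise an oversized item is skipped and later ones may fit.
    """
    kept, used = [], 0
    for item in items:
        cost = count(item)
        if used + cost > cap:
            if contiguous:
                break
            continue
        kept.append(item)
        used += cost
    return kept, used


def assemble(system: str, tools: Sequence[str], history: Sequence[str],
             chunks: Sequence[str],
             count: Callable[[str], int] = lambda s: len(s.split())) -> tuple[list[str], int]:
    sys_kept, sys_used = fill([system], CAPS["system"], count)
    if not sys_kept:
        raise ValueError("system prompt exceeds its allocation")
    tool_kept, tool_used = fill(tools, CAPS["tools"], count)
    chunk_kept, chunk_used = fill(chunks, CAPS["chunks"], count)
    # Walk history newest-first, then flip back to chronological order
    recent, hist_used = fill(list(reversed(history)), CAPS["history"], count, contiguous=True)
    parts = sys_kept + tool_kept + recent[::-1] + chunk_kept
    return parts, sys_used + tool_used + hist_used + chunk_used
```

Per-bucket caps keep one oversized input, such as a long pasted stack trace in history, from starving the retrieved chunks.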

Section D — Grounding and citations (and the refuse path)

The most dangerous answer is a fluent, confident, wrong one with no citation. Grounding means the model answers only from the assembled context and attaches inline citations to the chunks it used. And when retrieval comes back empty, the system must refuse — not improvise.

Before

The model answered freely from parametric memory. No citations, no refusal — it would invent a plausible-sounding config setting that never existed.

UNGROUNDED
  context (maybe empty) --> model --> fluent answer (no sources)
                                      --> 18% wrong, support stops trusting it

After

The pipeline ends with two rules enforced in C#: (1) if Empty, short-circuit to an honest refusal that routes the user to a human — never call the model; (2) otherwise instruct the model to ground every claim and emit [n] citations that map back to the assembled chunks. Here is the full grounded flow:

GROUNDED CONTEXT-ASSEMBLY PIPELINE
  user question
       |
       v
  [C#] query rewrite + expansion ----------------+
       |                                          |
       v                                          |
  [Py] hybrid search (BM25 + vector, 40) ---------+--> Azure AI Search
       |
       v
  [Py] cross-encoder re-rank --> top 6 --> compress
       |
       v
  [C#] assemble: budget 3500, dedup, prioritize
       |
       +--(empty?)--> REFUSE + route to human  ( ~6% of queries )
       |
       v
  [C#] grounded generation w/ inline [n] citations --> answer + sources

// csharp: grounding + explicit refuse path
public async Task<HelpAnswer> AnswerAsync(string raw, string tenant,
    IReadOnlyList<Turn> history, CancellationToken ct)
{
    var plan = await _rewriter.PlanAsync(raw, tenant, ct);
    var ctx = await _context.AssembleAsync(plan, tenant, history, ct);

    if (ctx.Empty)   // retrieval found nothing trustworthy — do NOT guess
    {
        Log.Information("Refuse: empty retrieval tenant={Tenant}", tenant);
        return HelpAnswer.Refused(
            "I don't have a documented answer for that. I've routed you to support.");
    }

    var sys = new ChatHistory(
        "Answer ONLY from the provided context. Cite sources inline as [n]. " +
        "If the context does not contain the answer, say so. Do not use outside knowledge.");
    sys.AddUserMessage($"Context:\n{ctx.Context}\n\nQuestion: {raw}");

    var reply = await _chat.GetChatMessageContentAsync(
        sys, new OpenAIPromptExecutionSettings { Temperature = 0.1 }, cancellationToken: ct);

    return new HelpAnswer(reply.Content!, ctx.Citations, Refused: false);
}
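
The generation call above asks the model for `[n]` citations but does not verify them. A post-generation check, sketched here under the assumption that chunks are numbered 1..N exactly as in the assembly step, can flag answers that cite a chunk that was never provided or that cite nothing at all. `check_citations` is a hypothetical helper, not part of the pipeline shown.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def check_citations(answer: str, n_chunks: int) -> dict:
    """Compare the [n] markers in an answer against the chunks that were assembled."""
    cited = {int(n) for n in CITATION.findall(answer)}
    valid = set(range(1, n_chunks + 1))
    return {
        "cited": sorted(cited & valid),      # markers that map to a real chunk
        "dangling": sorted(cited - valid),   # markers pointing at nothing
        "uncited": not (cited & valid),      # answer carries no usable citation
    }
```

An answer with dangling or missing citations is a reasonable candidate for the same refusal path as an empty retrieval.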

Diagnostic — watch the refuse rate; a sudden drop to ~0% means grounding silently regressed and the model is improvising again:

az monitor app-insights query --app mattrx-help --analytics-query \
  "traces | where message has 'Refuse:' | summarize refusals=count() by bin(timestamp, 1d)"

Mattrx metric: grounding + the refuse path took the wrong-answer rate from 18% to 3%, lifted faithfulness eval to 0.96, and made support trust the ~520 monthly deflections — because every answer now carries a clickable source, and "I don't know" is a valid, honest output 6% of the time.
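
The refuse-rate diagnostic above counts refusals per day; turning that into an alert also needs the daily request total. The sketch below assumes both numbers are already exported per day and flags days where the rate falls far below the ~6% baseline. The thresholds (`floor_ratio`, `min_requests`) and the function name are illustrative choices, not tuned values.

```python
def refuse_rate_alerts(daily: dict[str, tuple[int, int]], baseline: float = 0.06,
                       floor_ratio: float = 0.25, min_requests: int = 200) -> list[tuple[str, float]]:
    """Return (day, rate) for days whose refuse rate drops below baseline * floor_ratio.

    daily maps an ISO date to (refusals, total_requests). Days with too little
    traffic are skipped because their rate is too noisy to act on.
    """
    alerts = []
    for day, (refusals, total) in sorted(daily.items()):
        if total < min_requests:
            continue
        rate = refusals / total
        if rate < baseline * floor_ratio:
            alerts.append((day, round(rate, 4)))
    return alerts
```

With the defaults, a day at 0.3% refusals on 1,000 requests is flagged, while a day at 6% is not.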

Aggregate metrics

Metric	Before (naive RAG)	After (engineered context)
Wrong-answer / hallucination rate	18%	3%
Context tokens per request	~14,000	~3,500
Cost per AI query	$0.021	$0.008
Faithfulness eval score	0.71	0.96
Answer-relevance eval	0.68	0.91
Recall@5 (retrieval)	0.71	0.94
Tickets deflected / month	520 (low trust)	520 (trusted, cited)
C# Context API p95	n/a	140 ms
Python retrieval p95	120 ms	95 ms
Empty-retrieval refuse rate	0% (guessed)	6% (honest)

Pre-ship checklist

Honest stuff

Cross-encoders cost latency. Re-ranking adds ~30-50 ms. If you are under a 300 ms hard p95 with simple, well-tagged docs, vector-only may genuinely be enough. Measure before you add it.
Query rewriting can over-rewrite. A confident rewrite of an already-precise question (an exact error code) can hurt recall. Keep temperature=0, and keep the keyword channel so exact tokens survive a bad rewrite.
Compression is lossy by design. Aggressive keep_ratio can drop the one caveat sentence that mattered. Tune it per corpus and gate changes behind your faithfulness eval, not vibes.
Token budgeting needs a real tokenizer. Estimating tokens as chars/4 will silently overflow on code-heavy or non-English docs. Use the model's actual tokenizer in C#.
The refuse path will annoy people. A 6% "I don't know" rate is correct behavior, but stakeholders read it as "the bot is dumb." You have to defend it with the hallucination numbers.
Re-rankers go stale. A generic ms-marco cross-encoder is fine to start; on a specialized corpus you may eventually need a domain-tuned one. Don't tune it before you've shipped the basics.
Hybrid search hides config bugs. If BM25 quietly returns nothing (wrong analyzer), vector results mask it and you'll never notice until a keyword-heavy query fails. Test each channel in isolation.
More context is not more quality. Past a point, adding chunks lowers answer quality via "lost in the middle." The win here came from removing tokens, not adding them.

The closing mental model

The context window is a budget you spend on relevance, not a bucket you fill with hope.

Three enforceable habits:

Budget every byte. No context enters the window without an allocation, a dedup check, and a token cost. If it doesn't fit the budget, it doesn't ship — full stop.
Rewrite, retrieve hybrid, re-rank — always in that order. Never embed a raw user question. Never trust a single retrieval channel. Never skip the re-rank.
Refuse before you guess. Empty or low-score retrieval is a first-class outcome. Wire the refuse path before you wire the happy path, or you will ship the 18%.

Continue the series

This is Part 1 of Context Engineering for Enterprise AI (you are here). Next: Part 2, The Memory Layer, for when retrieved context isn't enough and the system needs to remember across turns, sessions, and tenants.

TL;DR

Concern	Prompt engineering	Context engineering	Verdict
What you control	Wording of the instruction	What data enters the window	Context wins
Failure mode	Model "ignores" the prompt	Wrong/missing/stale chunks	Fix the input
Retrieval	Vector top-k, dump it	Hybrid + re-rank + compress	Engineered
Token cost	Grows with naive context	Budgeted + deduped + compressed	3.5k vs 14k
Hallucination	Hope	Grounding + refuse path	3% vs 18%
Where it lives	Prompt string	C# Context API + Python retrieval svc	Split by concern

Production metrics after shipping context management to Mattrx Help:

Wrong-answer / hallucination rate: 18% (naive RAG) -> 3% (engineered context).
Context tokens per request: ~14,000 (naive dump) -> ~3,500 (engineered + compressed).
Mattrx Help still deflects ~520 support tickets/month — now with citations users trust.
Faithfulness/groundedness eval score: 0.96; answer-relevance eval: 0.91.
Cost per AI query: $0.021 -> $0.008 (less context, fewer retries).
Retrieval recall@5 after re-rank: 0.94 (up from 0.71 vector-only).
C# Context API p95: 140 ms for assembly (excludes generation).
Python retrieval service p95: 95 ms for hybrid + cross-encoder re-rank.
Empty-retrieval refuse rate: 6% of queries — answered honestly instead of guessed.
Prompt-injection attempts blocked at the boundary: ~40/week.

The one mental shift

Stop asking "what prompt makes the model answer correctly?" Start asking "what is the smallest, most relevant, most trustworthy set of tokens that makes the correct answer the only reasonable one?" The model is a function of its context window. Engineer the window, not the wording.

The running example

Section A — Query understanding and rewriting

Before

We embedded the raw user question and searched. No rewriting, no expansion, no decomposition.

# python: naive retrieval — embed the raw question, dump top-k
from openai import AzureOpenAI

client = AzureOpenAI(azure_endpoint=ENDPOINT, api_key=KEY, api_version="2024-10-21")

def retrieve(question: str) -> list[str]:
    emb = client.embeddings.create(
        model="text-embedding-3-large", input=question
    ).data[0].embedding
    results = search_client.search(
        search_text=None,
        vector_queries=[{"vector": emb, "k": 8, "fields": "contentVector"}],
    )
    return [r["content"] for r in results]  # 8 chunks, no idea if relevant

After

// csharp: ASP.NET Core — query understanding before retrieval
public sealed record QueryPlan(string Rewritten, string[] Expansions, bool MultiHop);

public sealed class QueryRewriter(IChatCompletionService chat)
{
    public async Task<QueryPlan> PlanAsync(string raw, string tenant, CancellationToken ct)
    {
        var history = new ChatHistory(
            "Rewrite the user question into a single self-contained search query. " +
            "Then give 2 expansion queries covering synonyms/renamed features. " +
            "Set multiHop=true only if it needs multiple retrievals. " +
            "Reply as strict JSON: {rewritten, expansions[], multiHop}.");
        history.AddUserMessage(raw);

        var reply = await chat.GetChatMessageContentAsync(
            history,
            new OpenAIPromptExecutionSettings { Temperature = 0, MaxTokens = 220 },
            cancellationToken: ct);

        var plan = JsonSerializer.Deserialize<QueryPlan>(reply.Content!)
                   ?? new QueryPlan(raw, Array.Empty<string>(), false);
        Log.Information("QueryPlan tenant={Tenant} multiHop={Hop}", tenant, plan.MultiHop);
        return plan;
    }
}

Diagnostic to confirm the rewrite is actually firing (and not silently falling back to raw):

az monitor app-insights query --app mattrx-help \
  --analytics-query "traces | where message has 'QueryPlan' | summarize count() by tostring(customDimensions.multiHop)"

Mattrx metric: query rewriting lifted recall@5 from 0.71 to 0.86 on its own, before any re-ranking — the single highest-leverage change in the whole pipeline.

Section B — Hybrid retrieval + re-ranking

Before

Vector-only, top-k=8, no keyword channel, no re-rank. Whatever cosine similarity returned first won, including near-duplicates from three doc versions.

NAIVE RETRIEVAL
  question --> embed --> vector top-8 --> dump all 8 into prompt
                                          (dupes, stale versions, off-topic)

After

# python: FastAPI retrieval service — hybrid search + cross-encoder re-rank
from fastapi import FastAPI
from pydantic import BaseModel
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from sentence_transformers import CrossEncoder
from openai import AzureOpenAI

app = FastAPI()
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512)
aoai = AzureOpenAI(azure_endpoint=ENDPOINT, api_key=KEY, api_version="2024-10-21")

class RetrieveReq(BaseModel):
    rewritten: str
    expansions: list[str] = []
    tenant: str
    top_k: int = 6

@app.post("/retrieve")
def retrieve(req: RetrieveReq):
    emb = aoai.embeddings.create(
        model="text-embedding-3-large", input=req.rewritten
    ).data[0].embedding

    # Hybrid: BM25 keyword channel + vector channel, tenant-scoped
    raw = search_client.search(
        search_text=req.rewritten,                       # keyword (BM25)
        vector_queries=[VectorizedQuery(vector=emb, k_nearest_neighbors=40,
                                        fields="contentVector")],
        filter=f"tenantId eq '{req.tenant}'",
        top=40,
    )
    pool = [{"id": r["id"], "content": r["content"], "source": r["url"]} for r in raw]
    if not pool:
        return {"chunks": [], "empty": True}

    # Cross-encoder re-rank: score (query, chunk) pairs, keep best
    pairs = [(req.rewritten, c["content"]) for c in pool]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(pool, scores), key=lambda x: x[1], reverse=True)
    keep = [{**c, "score": float(s)} for c, s in ranked[: req.top_k] if s > 0.20]
    return {"chunks": keep, "empty": len(keep) == 0}

Diagnostic — eyeball the re-rank score distribution to tune the cutoff:

curl -s localhost:8001/retrieve -d '{"rewritten":"dashboard empty after migration","tenant":"acme"}' \
  | python -c "import sys,json;print([round(c['score'],2) for c in json.load(sys.stdin)['chunks']])"

Section C — Context assembly and token budgeting

Before

The C# orchestrator concatenated system prompt + full history + all 8 chunks + every tool schema and shipped it. No budget, no dedup, no compression.

// csharp: BEFORE — concatenate everything, hope it fits
var prompt = systemPrompt
    + string.Join("\n", history.Select(m => m.Content))   // entire conversation
    + string.Join("\n\n", chunks)                          // all 8 chunks raw
    + JsonSerializer.Serialize(allToolSchemas);            // every tool, always
// ~14,000 tokens. $0.021/query. Answer quality: a coin flip.

After

CONTEXT WINDOW BUDGET  (target ~3,500 tokens)
+----------------------------+---------+----------------------------+
| System prompt + guardrails |   450   | ##########                 |
| Tool schemas (filtered)    |   300   | #######                    |
| Conversation history (n=3) |   600   | #############              |
| Retrieved chunks (comp.)   |  1900   | ######################## . |
| Answer headroom            |   250   | #####                      |
+----------------------------+---------+----------------------------+
                              =  3500   (was ~14,000 unbudgeted)

The assembly logic lives in the C# Context API. It allocates, fills highest-priority first, and truncates the lowest-priority bucket (history before chunks) when the budget is tight.

// csharp: ASP.NET Core minimal API — budgeted, deduped context assembly
app.MapPost("/context/assemble", async (
    AssembleRequest req, IRetrievalClient retrieval, ITokenizer tok,
    CancellationToken ct) =>
{
    const int Budget = 3500;
    var used = 0;
    var parts = new List<string>();

    // 1. System prompt + guardrails — non-negotiable, always included
    used += tok.Count(SystemPrompt);
    parts.Add(SystemPrompt);

    // 2. Filter tool schemas to ones plausibly relevant to this query
    var tools = ToolCatalog.Relevant(req.Rewritten);
    used += tok.Count(tools);
    parts.Add(tools);

    // 3. Retrieve compressed, deduped chunks from the Python service
    var r = await retrieval.RetrieveAsync(req.Rewritten, req.Expansions, req.Tenant, ct);
    if (r.Empty)
        return Results.Ok(new ContextResult(Empty: true, Context: "", Citations: []));

    var seen = new HashSet<string>();
    var citations = new List<Citation>();
    foreach (var c in r.Chunks)
    {
        var hash = Sha.Of(c.Content);
        if (!seen.Add(hash)) continue;                 // dedupe identical text
        var cost = tok.Count(c.Content);
        if (used + cost > Budget - 250) break;          // reserve answer headroom
        used += cost;
        parts.Add($"[{citations.Count + 1}] {c.Content}");
        citations.Add(new Citation(citations.Count + 1, c.Source, c.Score));
    }

    // 4. History fills whatever budget remains, newest-first
    foreach (var m in req.History.AsEnumerable().Reverse())
    {
        var cost = tok.Count(m.Content);
        if (used + cost > Budget - 250) break;
        used += cost;
        parts.Insert(2, m.Content);
    }

    Log.Information("Context assembled tenant={Tenant} tokens={Used}", req.Tenant, used);
    return Results.Ok(new ContextResult(false, string.Join("\n\n", parts), citations));
}).WithName("AssembleContext");

And the Python side that does the actual compression — extract only query-relevant sentences so a 900-token chunk shrinks to the 180 tokens that matter:

# python: contextual compression — keep only sentences that earn their tokens
import re

def compress(chunk: str, query: str, keep_ratio: float = 0.4) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", chunk)
    if len(sentences) <= 2:
        return chunk
    pairs = [(query, s) for s in sentences]
    scores = reranker.predict(pairs)                    # reuse the cross-encoder
    ranked = sorted(zip(sentences, scores), key=lambda x: x[1], reverse=True)
    keep_n = max(2, int(len(sentences) * keep_ratio))
    chosen = {s for s, _ in ranked[:keep_n]}
    # preserve original order so the prose still reads coherently
    return " ".join(s for s in sentences if s in chosen)

Diagnostic — assert the budget is actually being honored in production:

az monitor app-insights query --app mattrx-help --analytics-query \
  "traces | where message has 'Context assembled' | summarize p95=percentile(toint(customDimensions.tokens),95)"

Section D — Grounding and citations (and the refuse path)

Before

The model answered freely from parametric memory. No citations, no refusal — it would invent a plausible-sounding config setting that never existed.

UNGROUNDED
  context (maybe empty) --> model --> fluent answer (no sources)
                                      --> 18% wrong, support stops trusting it

After

GROUNDED CONTEXT-ASSEMBLY PIPELINE
  user question
       |
       v
  [C#] query rewrite + expansion ----------------+
       |                                          |
       v                                          |
  [Py] hybrid search (BM25 + vector, 40) ---------+--> Azure AI Search
       |
       v
  [Py] cross-encoder re-rank --> top 6 --> compress
       |
       v
  [C#] assemble: budget 3500, dedup, prioritize
       |
       +--(empty?)--> REFUSE + route to human  ( ~6% of queries )
       |
       v
  [C#] grounded generation w/ inline [n] citations --> answer + sources

// csharp: grounding + explicit refuse path
public async Task<HelpAnswer> AnswerAsync(string raw, string tenant,
    IReadOnlyList<Turn> history, CancellationToken ct)
{
    var plan = await _rewriter.PlanAsync(raw, tenant, ct);
    var ctx = await _context.AssembleAsync(plan, tenant, history, ct);

    if (ctx.Empty)   // retrieval found nothing trustworthy — do NOT guess
    {
        Log.Information("Refuse: empty retrieval tenant={Tenant}", tenant);
        return HelpAnswer.Refused(
            "I don't have a documented answer for that. I've routed you to support.");
    }

    var sys = new ChatHistory(
        "Answer ONLY from the provided context. Cite sources inline as [n]. " +
        "If the context does not contain the answer, say so. Do not use outside knowledge.");
    sys.AddUserMessage($"Context:\n{ctx.Context}\n\nQuestion: {raw}");

    var reply = await _chat.GetChatMessageContentAsync(
        sys, new OpenAIPromptExecutionSettings { Temperature = 0.1 }, cancellationToken: ct);

    return new HelpAnswer(reply.Content!, ctx.Citations, Refused: false);
}
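
The system prompt asks for inline `[n]` citations, but nothing in the method above checks that the model complied. One possible post-generation check, sketched in Python for brevity, is to parse the markers and compare them with the citation ids that were actually sent. The name `check_citations` and the rule that an uncited answer counts as ungrounded are assumptions, not part of the Mattrx Help code shown here.

```python
# python: validate inline [n] markers against the supplied citations (illustrative sketch)
import re

def check_citations(answer: str, valid_ids: set[int]) -> dict:
    """Compare inline [n] markers in an answer with the ids that were supplied."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    unknown = cited - valid_ids          # markers that point at no real source
    return {
        "cited": sorted(cited & valid_ids),
        "unknown": sorted(unknown),
        # assumption: no valid citation, or any invented one, makes the answer suspect
        "grounded": bool(cited) and not unknown,
    }

ok = check_citations("Rotate the key under Settings [1]. It applies in five minutes [2].", {1, 2, 3})
bad = check_citations("Set max_retries to 9 [7].", {1, 2, 3})
bare = check_citations("Just restart the service.", {1, 2, 3})
print(ok, bad, bare, sep="\n")
```

An answer that fails this check could be routed down the same refuse path as an empty retrieval rather than shown to the user.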

Diagnostic — watch the refuse rate; a sudden drop to ~0% means grounding silently regressed and the model is improvising again:

az monitor app-insights query --app mattrx-help --analytics-query \
  "traces | where message has 'Refuse:' | summarize refusals=count() by bin(timestamp, 1d)"

Aggregate metrics

Metric	Before (naive RAG)	After (engineered context)
Wrong-answer / hallucination rate	18%	3%
Context tokens per request	~14,000	~3,500
Cost per AI query	$0.021	$0.008
Faithfulness eval score	0.71	0.96
Answer-relevance eval	0.68	0.91
Recall@5 (retrieval)	0.71	0.94
Tickets deflected / month	520 (low trust)	520 (trusted, cited)
C# Context API p95	n/a	140 ms
Python retrieval p95	120 ms	95 ms
Empty-retrieval refuse rate	0% (guessed)	6% (honest)
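
The refuse rate in the last row is worth alerting on, because a collapse toward 0% is the signature of the model guessing again. A minimal sketch of that check, assuming you can export daily request and refusal counts: the `DayStats` shape, the 1% floor, and the sample numbers are all hypothetical, and only the ~6% baseline comes from the table.

```python
# python: flag days where the refuse rate collapses (illustrative sketch)
from dataclasses import dataclass

@dataclass
class DayStats:
    day: str
    requests: int
    refusals: int

def refuse_rate_alerts(days, floor=0.01):
    """Return (day, rate) for every day whose refuse rate fell below `floor`."""
    alerts = []
    for d in days:
        if d.requests == 0:
            continue                      # no traffic, nothing to judge
        rate = d.refusals / d.requests
        if rate < floor:
            alerts.append((d.day, round(rate, 4)))
    return alerts

# Made-up sample data for the demo
history = [
    DayStats("day-1", 1000, 61),          # about 6%, the expected baseline
    DayStats("day-2", 980, 57),
    DayStats("day-3", 1020, 2),           # near zero: grounding may have regressed
]
alerts = refuse_rate_alerts(history)
print(alerts)
```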

Pre-ship checklist

Honest stuff

Cross-encoders cost latency. Re-ranking adds ~30-50 ms. If you have a hard p95 budget under 300 ms and simple, well-tagged docs, vector-only may genuinely be enough. Measure before you add it.
Query rewriting can over-rewrite. A confident rewrite of an already-precise question (an exact error code) can hurt recall. Keep temperature=0, and keep the keyword channel so exact tokens survive a bad rewrite.
Compression is lossy by design. Aggressive keep_ratio can drop the one caveat sentence that mattered. Tune it per corpus and gate changes behind your faithfulness eval, not vibes.
Token budgeting needs a real tokenizer. Estimating tokens as chars/4 will silently overflow on code-heavy or non-English docs. Use the model's actual tokenizer in C#.
The refuse path will annoy people. A 6% "I don't know" rate is correct behavior, but stakeholders read it as "the bot is dumb." You have to defend it with the hallucination numbers.
Re-rankers go stale. A generic ms-marco cross-encoder is fine to start; on a specialized corpus you may eventually need a domain-tuned one. Don't tune it before you've shipped the basics.
Hybrid search hides config bugs. If BM25 quietly returns nothing (wrong analyzer), vector results mask it and you'll never notice until a keyword-heavy query fails. Test each channel in isolation.
More context is not more quality. Past a point, adding chunks lowers answer quality via "lost in the middle." The win here came from removing tokens, not adding them.
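
The hybrid-search caveat above can be turned into a small smoke test. The sketch below assumes each channel is reachable as a plain callable that returns a list of hits; `keyword_search` and `vector_search` are hypothetical stand-ins, not Azure AI Search SDK calls.

```python
# python: test each retrieval channel in isolation (illustrative sketch)
def dead_channels(channels, probes):
    """Run every probe query through each channel alone; report channels with zero hits overall."""
    dead = []
    for name, search in channels.items():
        total_hits = sum(len(search(q)) for q in probes)
        if total_hits == 0:
            dead.append(name)
    return dead

# Hypothetical stand-ins: the keyword channel is misconfigured and returns nothing.
def keyword_search(query):
    return []

def vector_search(query):
    return [{"id": "doc-1", "score": 0.82}]

probes = ["ERR_4012", "reset API key"]
result = dead_channels({"bm25": keyword_search, "vector": vector_search}, probes)
print(result)
```

Include at least one keyword-heavy probe, such as an exact error code, so a silent BM25 failure cannot hide behind vector hits.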

The closing mental model

The context window is a budget you spend on relevance, not a bucket you fill with hope.

Three enforceable habits:

Budget every byte. No context enters the window without an allocation, a dedup check, and a token cost. If it doesn't fit the budget, it doesn't ship — full stop.
Rewrite, retrieve hybrid, re-rank — always in that order. Never embed a raw user question. Never trust a single retrieval channel. Never skip the re-rank.
Refuse before you guess. Empty or low-score retrieval is a first-class outcome. Wire the refuse path before you wire the happy path, or you will ship the 18%.
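
The third habit mentions low-score retrieval, while the C# example in Section D only checks for an empty context. One way a score gate could look, as a sketch: `min_score` and `min_hits` are hypothetical knobs that would need tuning against your own re-ranker's score distribution.

```python
# python: refuse on empty OR weak retrieval (illustrative sketch)
def should_refuse(scores, min_score=0.3, min_hits=1):
    """Refuse when fewer than `min_hits` re-ranked chunks clear `min_score`."""
    strong = [s for s in scores if s >= min_score]
    return len(strong) < min_hits

empty = should_refuse([])                 # nothing retrieved
weak = should_refuse([0.12, 0.08])        # retrieved, but nothing trustworthy
solid = should_refuse([0.71, 0.15])       # one strong chunk is enough here
print(empty, weak, solid)
```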
