Production RAG with Azure OpenAI + Azure SQL — A Complete Guide for .NET Teams (with C# Code)
Deep-dive into Retrieval-Augmented Generation: the dual problem (LLMs do not know your enterprise data + you cannot send confidential data to public APIs), full architecture, ingestion + retrieval + generation pipelines in C# with Azure OpenAI and the new Azure SQL VECTOR type, real code, security, costs, hallucination prevention, and the pitfalls that ruin production RAG systems.
- Author
- Randhir Jassal
- Published
- Reading time
- 22 min read
RAG — Retrieval-Augmented Generation — is the standard pattern enterprises now use to feed proprietary data to large language models (LLMs) without sending that data to public API training pipelines, without fine-tuning, and without the hallucination problem you get when an LLM tries to answer questions it has no knowledge of.
This guide is the complete production picture: the actual problems enterprises face, why RAG solves them, the architecture, real C# code using Azure OpenAI + Azure SQL with the new VECTOR column type, security and cost concerns, and the eight pitfalls that ruin most first RAG implementations.
The dual problem RAG solves
Every enterprise wanting to add LLM-powered features hits the same two walls.
Problem 1 — The LLM has no idea about your enterprise data
GPT-4o knows the public internet up to its training cutoff. It does NOT know:
- Your company's 4,000-page employee handbook
- Your last 5 years of customer support tickets
- Your product catalog with internal SKUs and warehouse locations
- The 200 internal Confluence pages on how your deployment pipeline works
- The compliance documents specific to your industry / region
- Yesterday's prices, today's stock levels
Ask GPT-4o "what's our return policy for items damaged during shipping?" and you get either a generic web-trained answer (wrong) or a hallucination (worse — confidently wrong).
Problem 2 — You cannot just send enterprise data to a public API
Most companies that have private data also have:
- Data residency laws (GDPR, HIPAA, RBI guidelines for Indian fintech)
- Customer contracts forbidding third-party data sharing
- Trade secrets they don't want vendor models training on
- Confidentiality classifications that prohibit cloud transit to certain regions
Sending raw documents to a public OpenAI endpoint is often a non-starter. The data must stay inside the company's Azure tenant, encrypted at rest, audited, deletable on request.
What RAG does
RAG combines:
- Retrieval — a search system over your proprietary data that returns the most relevant chunks for a given query
- Augmented Generation — the LLM receives the user's question PLUS those retrieved chunks as context, and is instructed to answer using ONLY that context
Result:
- The LLM answers using your data, never inventing facts not in the context
- Your data stays in your storage (Azure SQL, Azure AI Search) — only the retrieved chunks + the question travel to the LLM endpoint
- With Azure OpenAI in your own Azure tenant, the entire pipeline is within your data boundary, encrypted, audited, region-pinned
- You can cite sources back to the user — "I found this answer in section 4.2 of the employee handbook"
This is why every enterprise launching an internal chatbot, support copilot, or document Q&A feature in 2026 uses RAG.
Architecture in one diagram
RAG with Azure OpenAI + Azure SQL
──────────────────────────────────
INGESTION PIPELINE (one-time + incremental)
Documents (PDF, Word, Markdown, HTML, Confluence, SharePoint)
│
▼
┌──────────────────────┐
│ Text Extractor │ pdf → text via Azure Document
│ (Form Recognizer, │ Intelligence; preserve page numbers
│ or pdfpig, etc.) │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Chunker │ Split into 500-1000 token chunks
│ (semantic / │ with 100-token overlap, preserve
│ recursive split) │ doc + page metadata
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Azure OpenAI │ text-embedding-3-large
│ Embeddings API │ → 3072-dim float[] per chunk
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Azure SQL Database │ knowledge_chunks table:
│ VECTOR(3072) col │ (id, doc_id, content, metadata,
│ │ embedding VECTOR(3072))
└──────────────────────┘
─── QUERY PIPELINE (every user message) ───
User question: "What's our policy on damaged shipments?"
│
▼
┌──────────────────────┐
│ Same embedder │ text-embedding-3-large
│ (MUST match index) │ → 3072-dim float[]
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Vector Search │ SELECT TOP 5 ... ORDER BY
│ Azure SQL │ VECTOR_DISTANCE('cosine', ...)
│ + optional keyword │ (hybrid: vector + BM25 / FTS)
│ re-rank │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Context Builder │ Pack top-5 chunks into a prompt
│ + Citation Tagger │ with [1], [2], [3] index markers
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Azure OpenAI Chat │ GPT-4o or gpt-4o-mini
│ System: "Answer ONLY │ temperature: 0.0
│ from context. Cite. │ max_tokens: 800
│ Say so if missing." │
└──────────┬───────────┘
│
▼
Answer to user + citations
(with source links back to the original docs)
Everything inside ONE Azure tenant + region
No data leaves your data boundary
Vector index sits next to your transactional data
Three properties to internalize:
- The embedding model MUST be identical between ingestion and query time. Mixing models = garbage retrieval. Pin the deployment name in config.
- The LLM never sees your full corpus. It only sees the top-K retrieved chunks. This is what keeps the prompt small and predictable.
- Everything stays in your tenant. Azure OpenAI deployments live in your Azure subscription. Your SQL DB never leaves your region. The data boundary is closed.
Implementation in C# — the three services
Step 1 — Database schema (Azure SQL with VECTOR type)
Azure SQL Database supports a native VECTOR column type (rolled out 2024). It's the cleanest path for .NET shops already on Azure SQL.
-- Create the chunks table
CREATE TABLE knowledge_chunks (
id BIGINT IDENTITY PRIMARY KEY,
document_id NVARCHAR(255) NOT NULL,
document_title NVARCHAR(500) NOT NULL,
chunk_index INT NOT NULL,
content NVARCHAR(MAX) NOT NULL,
metadata NVARCHAR(MAX) NOT NULL, -- JSON: page, section, url
embedding VECTOR(3072) NOT NULL, -- text-embedding-3-large dim
created_at DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
);
-- Index for fast vector search (Azure SQL clustered columnstore variant)
CREATE INDEX idx_chunks_document ON knowledge_chunks(document_id);
Add an index on document_id so you can re-ingest by removing + re-inserting one doc at a time. The VECTOR column itself is queried with VECTOR_DISTANCE — Azure SQL handles the math.
Step 2 — Ingestion service (C#)
public class RagIngestionService
{
private readonly OpenAIClient _openAi;
private readonly string _connectionString;
private readonly string _embeddingDeployment;
private readonly ILogger<RagIngestionService> _log;
public RagIngestionService(
OpenAIClient openAi,
IConfiguration cfg,
ILogger<RagIngestionService> log)
{
_openAi = openAi;
_connectionString = cfg.GetConnectionString("AzureSql")!;
_embeddingDeployment = cfg["AzureOpenAI:EmbeddingDeployment"]!; // e.g. "text-embedding-3-large"
_log = log;
}
public async Task IngestAsync(IngestionRequest req, CancellationToken ct)
{
// 1. Chunk the document
var chunks = ChunkText(req.Content, chunkSize: 800, overlap: 100).ToList();
_log.LogInformation("Ingesting {Doc} as {N} chunks", req.DocumentId, chunks.Count);
// 2. Generate embeddings for the batch (cheaper than one-at-a-time)
var embedResp = await _openAi.GetEmbeddingsAsync(
new EmbeddingsOptions(_embeddingDeployment, chunks),
ct);
// 3. Bulk insert into Azure SQL
await using var sql = new SqlConnection(_connectionString);
await sql.OpenAsync(ct);
// Remove prior chunks for this document (idempotent re-ingestion)
await using (var del = sql.CreateCommand())
{
del.CommandText = "DELETE FROM knowledge_chunks WHERE document_id = @docId";
del.Parameters.AddWithValue("@docId", req.DocumentId);
await del.ExecuteNonQueryAsync(ct);
}
for (int i = 0; i < chunks.Count; i++)
{
var embedding = embedResp.Value.Data[i].Embedding.ToArray();
await using var cmd = sql.CreateCommand();
cmd.CommandText = @"
INSERT INTO knowledge_chunks
(document_id, document_title, chunk_index, content, metadata, embedding)
VALUES
(@docId, @title, @idx, @content, @metadata, @embedding)";
cmd.Parameters.AddWithValue("@docId", req.DocumentId);
cmd.Parameters.AddWithValue("@title", req.Title);
cmd.Parameters.AddWithValue("@idx", i);
cmd.Parameters.AddWithValue("@content", chunks[i]);
cmd.Parameters.AddWithValue("@metadata", JsonSerializer.Serialize(req.Metadata));
cmd.Parameters.Add(new SqlParameter("@embedding", SqlDbType.Variant)
{
Value = new SqlVector<float>(embedding)
});
await cmd.ExecuteNonQueryAsync(ct);
}
_log.LogInformation("Ingested {Doc} ({N} chunks, {Bytes} embedding bytes)",
req.DocumentId, chunks.Count, chunks.Count * 3072 * 4);
}
private static IEnumerable<string> ChunkText(string text, int chunkSize, int overlap)
{
// Token-aware chunking. Simple word-split works for most prose; for code
// or structured docs use a recursive splitter (split on \n\n, then \n,
// then sentences, then words).
var words = text.Split(' ', StringSplitOptions.RemoveEmptyEntries);
for (int i = 0; i < words.Length; i += chunkSize - overlap)
{
int len = Math.Min(chunkSize, words.Length - i);
yield return string.Join(' ', words.Skip(i).Take(len));
if (i + len >= words.Length) yield break;
}
}
}
public record IngestionRequest(
string DocumentId,
string Title,
string Content,
IDictionary<string, string> Metadata);
Key details to notice:
- Idempotent re-ingestion —
DELETEthen re-insert per document. Update a doc by ingesting it again. - Batch embedding call — Azure OpenAI accepts arrays. One round-trip per document, not per chunk.
- Metadata as JSON — flexible without schema migrations. Store
page,section,url,confidentiality_level, etc.
Step 3 — Query / answer service (C#)
public class RagChatService
{
private readonly OpenAIClient _openAi;
private readonly string _connectionString;
private readonly string _embeddingDeployment;
private readonly string _chatDeployment;
public RagChatService(OpenAIClient openAi, IConfiguration cfg)
{
_openAi = openAi;
_connectionString = cfg.GetConnectionString("AzureSql")!;
_embeddingDeployment = cfg["AzureOpenAI:EmbeddingDeployment"]!;
_chatDeployment = cfg["AzureOpenAI:ChatDeployment"]!; // e.g. "gpt-4o" or "gpt-4o-mini"
}
public async Task<RagAnswer> AnswerAsync(string question, CancellationToken ct)
{
// 1. Embed the question (same model as ingestion)
var embedResp = await _openAi.GetEmbeddingsAsync(
new EmbeddingsOptions(_embeddingDeployment, new[] { question }), ct);
var queryVector = embedResp.Value.Data[0].Embedding.ToArray();
// 2. Retrieve top-5 most similar chunks from Azure SQL
var retrieved = new List<RetrievedChunk>();
await using (var sql = new SqlConnection(_connectionString))
{
await sql.OpenAsync(ct);
await using var cmd = sql.CreateCommand();
cmd.CommandText = @"
SELECT TOP 5
document_id,
document_title,
chunk_index,
content,
metadata,
VECTOR_DISTANCE('cosine', embedding, @qvec) AS distance
FROM knowledge_chunks
ORDER BY distance ASC;";
cmd.Parameters.Add(new SqlParameter("@qvec", SqlDbType.Variant)
{
Value = new SqlVector<float>(queryVector)
});
await using var reader = await cmd.ExecuteReaderAsync(ct);
while (await reader.ReadAsync(ct))
{
retrieved.Add(new RetrievedChunk(
DocumentId: reader.GetString(0),
DocumentTitle: reader.GetString(1),
ChunkIndex: reader.GetInt32(2),
Content: reader.GetString(3),
Metadata: JsonSerializer.Deserialize<Dictionary<string, string>>(reader.GetString(4))
?? new(),
Distance: reader.GetDouble(5)
));
}
}
// 3. If retrieval scored too poorly, bail with "I don't know"
const double MAX_DISTANCE = 0.45; // cosine; tune for your data
if (retrieved.Count == 0 || retrieved[0].Distance > MAX_DISTANCE)
{
return new RagAnswer(
Question: question,
Answer: "I couldn't find anything in our knowledge base relevant to that question. " +
"Please rephrase or ask your team lead.",
Citations: Array.Empty<RetrievedChunk>(),
ConfidenceScore: 0.0
);
}
// 4. Build the augmented prompt
var contextBlock = string.Join("\n\n---\n\n",
retrieved.Select((c, i) =>
$"[{i + 1}] (from \"{c.DocumentTitle}\", chunk {c.ChunkIndex}):\n{c.Content}"));
var systemPrompt = """
You are a helpful enterprise assistant. Answer the user's question using ONLY the
provided context below. Cite sources using [1], [2], etc. at the end of each
factual claim. If the context does not contain the answer, say:
"I couldn't find a clear answer in the provided documents." — do NOT invent facts.
Keep answers concise. Use bullet points where it improves clarity.
""";
var userPrompt = $"""
Context:
{contextBlock}
Question: {question}
""";
// 5. Call the chat completion
var chatResp = await _openAi.GetChatCompletionsAsync(
new ChatCompletionsOptions
{
DeploymentName = _chatDeployment,
Messages =
{
new ChatRequestSystemMessage(systemPrompt),
new ChatRequestUserMessage(userPrompt)
},
Temperature = 0.0f, // deterministic
MaxTokens = 800,
}, ct);
var answer = chatResp.Value.Choices[0].Message.Content;
return new RagAnswer(
Question: question,
Answer: answer,
Citations: retrieved.ToArray(),
ConfidenceScore: 1.0 - retrieved[0].Distance // rough heuristic
);
}
}
public record RetrievedChunk(
string DocumentId,
string DocumentTitle,
int ChunkIndex,
string Content,
Dictionary<string, string> Metadata,
double Distance);
public record RagAnswer(
string Question,
string Answer,
IReadOnlyList<RetrievedChunk> Citations,
double ConfidenceScore);
Step 4 — DI registration + controller
// Program.cs
builder.Services.AddSingleton(sp =>
{
var cfg = sp.GetRequiredService<IConfiguration>();
return new OpenAIClient(
new Uri(cfg["AzureOpenAI:Endpoint"]!),
new DefaultAzureCredential()); // Managed Identity — no API keys in code
});
builder.Services.AddScoped<RagIngestionService>();
builder.Services.AddScoped<RagChatService>();
// Controllers/RagController.cs
[ApiController]
[Route("rag")]
public class RagController : ControllerBase
{
private readonly RagChatService _chat;
public RagController(RagChatService chat) => _chat = chat;
[HttpPost("ask")]
[Authorize] // require auth — RAG over enterprise data MUST be authenticated
public async Task<ActionResult<RagAnswer>> Ask(
[FromBody] AskRequest req, CancellationToken ct)
{
if (string.IsNullOrWhiteSpace(req.Question))
return BadRequest("Question is required");
if (req.Question.Length > 1000)
return BadRequest("Question too long");
var answer = await _chat.AnswerAsync(req.Question, ct);
return Ok(answer);
}
}
public record AskRequest(string Question);
That's the full pipeline — ingest, retrieve, answer — in three services + one controller. Total: under 200 lines of real code.
Production concerns
Security + privacy
- Managed Identity, not API keys.
DefaultAzureCredentiallets the app authenticate to Azure OpenAI without storing keys. Rotate keys never; rotate the identity's permissions through Azure RBAC. - Row-level security on
knowledge_chunks. If different users see different documents, add atenant_idorconfidentiality_levelcolumn and enforce via SQL row-level security policies. The retrieval query filters BEFORE vector search. - Audit every query. Persist
user_id,question, retrieved chunk IDs, the LLM response, timestamps. Useful for incident review and quality work. - Block prompt injection. Add a system prompt that explicitly ignores "ignore previous instructions" patterns. Strip user input that looks like instructions to the model.
- PII scrubbing on ingestion. Run documents through Azure AI Language's PII detection BEFORE chunking and embedding. Don't index what you can't legally retrieve later.
Cost
| Item | Cost (Azure, approx., 2026) |
|---|---|
| text-embedding-3-large | ~$0.13 per 1M tokens |
| GPT-4o input | ~$5 per 1M tokens |
| GPT-4o output | ~$15 per 1M tokens |
| GPT-4o-mini input | ~$0.15 per 1M tokens (~30x cheaper) |
| Azure SQL with VECTOR | Standard SQL pricing — vectors add ~50% storage |
For a typical "internal documentation Q&A" use case at 1000 questions/day with 5 chunks of ~600 tokens each:
- Embeddings: 1000 × 100 tokens (the question) ≈ ₹2/day
- LLM input: 1000 × (300 system + 3000 context + 100 question) ≈ ₹140/day
- LLM output: 1000 × 200 tokens ≈ ₹250/day
- Total: ~₹12,000/month for the LLM bill
If costs explode: switch GPT-4o → GPT-4o-mini (10x cheaper, often sufficient for Q&A use cases). Use semantic caching to avoid re-asking the LLM for repeated questions.
Latency
| Stage | Latency (p50) |
|---|---|
| Embed question | ~50-150 ms |
| Vector search in Azure SQL | ~30-80 ms (with index) |
| Generate answer (GPT-4o, ~300 token reply) | ~2-4 s |
| Generate answer (GPT-4o-mini) | ~0.5-1.5 s |
User-perceived total: typically 2-5 seconds for GPT-4o, 1-2 seconds for mini. Stream the response (use GetChatCompletionsStreamingAsync) to show first tokens within 500ms — perceived latency drops dramatically.
Freshness
When a document changes, you must re-ingest it. Options:
| Approach | When |
|---|---|
| Manual re-ingest button | Internal tools with few authors |
| Cron job nightly | Sources that change daily (SharePoint, Confluence) |
| Event-driven (Service Bus subscription) | High-volume, near-real-time updates |
| Lambda on blob upload (Azure Function) | "Drop a PDF in this container, it gets indexed" UX |
Stale RAG = wrong answers with confident citations. Build observability around "oldest unrefreshed document" before launching.
The eight pitfalls that ruin production RAG
1. Chunks too small or too large
300 tokens = fragments without context, retrieval works but the answer is incomplete. 2000 tokens = the LLM context fills with one chunk's worth of detail, no room for diverse sources, costs explode.
Sweet spot for prose: 500-1000 tokens with 100-token overlap. For structured docs (code, tables), chunk by structure not by word count.
2. Mixing embedding models
You ingest with text-embedding-3-large and someone changes config to text-embedding-ada-002 for queries. Vectors live in different spaces — distances are nonsense. Retrieval returns garbage.
Pin the embedding deployment name in config and assert it on app startup. If the model changes, re-ingest everything.
3. No "I don't know" path
The LLM, by default, will try to answer EVEN with bad context. It hallucinates plausibly. Users trust it.
Always set a retrieval distance threshold. If the top result is too far, return "I couldn't find an answer" — never invoke the LLM with weak context. Worth showing to users.
4. Pure vector search misses keyword matches
Vector search is great for semantic similarity, terrible for exact terms. "Order #INV-2024-008923" via vector search returns documents about orders and invoices, not the specific record.
Hybrid search — combine vector similarity + BM25 / Azure SQL full-text search — and re-rank. The two methods compensate for each other's weaknesses.
5. No re-ranker
Vector search retrieves the top-50 candidates. Sending them ALL to the LLM is expensive. Sending only top-5 by vector distance often misses the best match.
Add a cross-encoder re-ranker (Cohere Rerank API, or a local cross-encoder model) between retrieval and generation. Take top-50 by vector, re-rank to top-5 by cross-encoder, send those to the LLM. Massive quality jump.
6. Ignoring document metadata in retrieval
Retrieval finds the right CONTENT but ignores that the user is in the EU and the document is "US-only". Or the user is a junior employee asking about salary data.
Filter before vector search by tenant, region, confidentiality level, document type. Azure SQL makes this trivial — just add WHERE clauses before the ORDER BY VECTOR_DISTANCE.
7. No streaming
Generating a 500-token answer takes 3-5 seconds with GPT-4o. Showing nothing for 5 seconds feels broken.
Stream the response with GetChatCompletionsStreamingAsync and pipe to the client via Server-Sent Events. First token arrives in ~500ms; users see motion immediately.
8. Treating every question as a retrieval question
User: "What time is it?" → your system embeds, retrieves random documents, and generates a confused answer.
Add a router that classifies the question first. Greetings, small talk, math, time — answer directly. Only domain queries go through RAG.
RAG vs fine-tuning vs continued pretraining
| RAG | Fine-tuning | Continued pretraining | |
|---|---|---|---|
| Updates content | Add a chunk → live | Re-train | Re-train (expensive) |
| Cite sources | Yes, naturally | No (model "remembers") | No |
| Cost | Pay per query | Pay once to train + small inference | Massive train cost |
| Privacy | Data stays in your DB | Data baked into model weights | Same |
| Latency | Adds retrieval step (~100ms) | Native LLM speed | Native |
| Hallucination risk | Low (with threshold + cite) | Higher | Higher |
| Right for | Enterprise Q&A, support, search | Style / format / classification | Building your own base model |
For 95% of enterprise "feed our data to AI" use cases, RAG is the right answer. Fine-tuning is mostly used for tone-of-voice, output format, or classification problems — NOT for adding knowledge.
When RAG is the right tool
- Q&A over a corpus of >50 documents that changes
- Support / customer service copilots over internal KB
- Code search / "how do we do X in our codebase"
- Compliance Q&A — legal, HR, security policies
- E-commerce product Q&A from catalog
- Enterprise search with summarization
When RAG is the wrong tool
- Math / calculations → use function calling + a real calculator
- Real-time data (stock prices, live inventory) → call APIs, don't index
- Tasks where you need creative generation, not factual recall
- Highly personal user-specific reasoning (chat memory, planning) → use memory + agent patterns
Production checklist
- ✅ Pin embedding model name in config, assert at startup
- ✅ Use Managed Identity to talk to Azure OpenAI — no API keys in code
- ✅ Row-level security or pre-filter on retrieval for multi-tenant / classified data
- ✅ Retrieval distance threshold + "I don't know" path
- ✅ Hybrid search (vector + keyword) for entity / number / acronym queries
- ✅ Optional re-ranker for quality
- ✅ Stream the LLM response with SSE
- ✅ Audit log every query (user, question, retrieved IDs, response, latency)
- ✅ Re-ingestion mechanism for stale documents
- ✅ Semantic cache (Redis) for repeated questions
- ✅ Cost dashboard with daily Azure OpenAI token spend
- ✅ Guardrails: input length limits, prompt-injection-resistant system prompt, output content filter (Azure provides one)
- ✅ Health probe that runs a fixed test question + expected citation, alerts if quality degrades
Summary
RAG with Azure OpenAI + Azure SQL solves the real enterprise problem: giving an LLM access to your proprietary data without sending that data outside your tenant, without fine-tuning, without hallucinations on domain-specific questions, and with citation back to sources.
The architecture is three services: ingest documents → embed chunks → store in Azure SQL VECTOR column; embed user question → vector-search top chunks → prompt the LLM with context. Total production code is under 300 lines.
The real work isn't writing the pipeline — it's the eight pitfalls (chunking strategy, embedding consistency, the "I don't know" path, hybrid search, re-ranking, metadata filtering, streaming, intent routing) that separate a demo from a system you can trust in production.
Pair RAG with a proper observability stack, cost dashboard, and re-ingestion mechanism. With Azure OpenAI in your tenant, Azure SQL holding the index, and Managed Identity for auth, the entire pipeline stays inside your data boundary — which is exactly what your security team needs to hear before they approve the launch.
📚 Test your knowledge → Practice with our RAG interview questions — chunking strategies, RAG vs fine-tuning, hallucination prevention, security gotchas, and the production patterns that actually matter.
Get the next issue
A short, curated email with the newest posts and questions.