Production RAG with Azure OpenAI + Azure SQL — A Complete Guide for .NET Teams (with C# Code)

RAG — Retrieval-Augmented Generation — is the standard pattern enterprises now use to feed proprietary data to large language models (LLMs) without sending that data to public API training pipelines, without fine-tuning, and without the hallucination problem you get when an LLM tries to answer questions it has no knowledge of.

This guide is the complete production picture: the actual problems enterprises face, why RAG solves them, the architecture, real C# code using Azure OpenAI + Azure SQL with the new VECTOR column type, security and cost concerns, and the eight pitfalls that ruin most first RAG implementations.

The dual problem RAG solves

Every enterprise wanting to add LLM-powered features hits the same two walls.

Problem 1 — The LLM has no idea about your enterprise data

GPT-4o knows the public internet up to its training cutoff. It does NOT know:

Your company's 4,000-page employee handbook
Your last 5 years of customer support tickets
Your product catalog with internal SKUs and warehouse locations
The 200 internal Confluence pages on how your deployment pipeline works
The compliance documents specific to your industry / region
Yesterday's prices, today's stock levels

Ask GPT-4o "what's our return policy for items damaged during shipping?" and you get either a generic web-trained answer (wrong) or a hallucination (worse — confidently wrong).

Problem 2 — You cannot just send enterprise data to a public API

Most companies that have private data also have:

Data residency laws (GDPR, HIPAA, RBI guidelines for Indian fintech)
Customer contracts forbidding third-party data sharing
Trade secrets they don't want vendor models training on
Confidentiality classifications that prohibit cloud transit to certain regions

Sending raw documents to a public OpenAI endpoint is often a non-starter. The data must stay inside the company's Azure tenant, encrypted at rest, audited, deletable on request.

What RAG does

RAG combines:

Retrieval — a search system over your proprietary data that returns the most relevant chunks for a given query
Augmented Generation — the LLM receives the user's question PLUS those retrieved chunks as context, and is instructed to answer using ONLY that context

Result:

The LLM answers using your data, never inventing facts not in the context
Your data stays in your storage (Azure SQL, Azure AI Search) — only the retrieved chunks + the question travel to the LLM endpoint
With Azure OpenAI in your own Azure tenant, the entire pipeline is within your data boundary, encrypted, audited, region-pinned
You can cite sources back to the user — "I found this answer in section 4.2 of the employee handbook"

This is why every enterprise launching an internal chatbot, support copilot, or document Q&A feature in 2026 uses RAG.

Architecture in one diagram

                              RAG with Azure OpenAI + Azure SQL
                              ──────────────────────────────────

                              INGESTION PIPELINE (one-time + incremental)

       Documents (PDF, Word, Markdown, HTML, Confluence, SharePoint)
              │
              ▼
       ┌──────────────────────┐
       │  Text Extractor       │   pdf → text via Azure Document
       │  (Form Recognizer,    │   Intelligence; preserve page numbers
       │   or pdfpig, etc.)    │
       └──────────┬───────────┘
                  │
                  ▼
       ┌──────────────────────┐
       │  Chunker              │   Split into 500-1000 token chunks
       │  (semantic /          │   with 100-token overlap, preserve
       │   recursive split)    │   doc + page metadata
       └──────────┬───────────┘
                  │
                  ▼
       ┌──────────────────────┐
       │  Azure OpenAI         │   text-embedding-3-large
       │  Embeddings API       │   → 3072-dim float[] per chunk
       └──────────┬───────────┘
                  │
                  ▼
       ┌──────────────────────┐
       │  Azure SQL Database   │   knowledge_chunks table:
       │   VECTOR(3072) col    │   (id, doc_id, content, metadata,
       │                       │    embedding VECTOR(3072))
       └──────────────────────┘

                              ─── QUERY PIPELINE (every user message) ───

       User question: "What's our policy on damaged shipments?"
              │
              ▼
       ┌──────────────────────┐
       │  Same embedder        │   text-embedding-3-large
       │  (MUST match index)   │   → 3072-dim float[]
       └──────────┬───────────┘
                  │
                  ▼
       ┌──────────────────────┐
       │  Vector Search        │   SELECT TOP 5 ... ORDER BY
       │  Azure SQL            │   VECTOR_DISTANCE('cosine', ...)
       │   + optional keyword  │   (hybrid: vector + BM25 / FTS)
       │     re-rank           │
       └──────────┬───────────┘
                  │
                  ▼
       ┌──────────────────────┐
       │  Context Builder      │   Pack top-5 chunks into a prompt
       │  + Citation Tagger    │   with [1], [2], [3] index markers
       └──────────┬───────────┘
                  │
                  ▼
       ┌──────────────────────┐
       │  Azure OpenAI Chat    │   GPT-4o or gpt-4o-mini
       │  System: "Answer ONLY │   temperature: 0.0
       │  from context. Cite.  │   max_tokens: 800
       │  Say so if missing."  │
       └──────────┬───────────┘
                  │
                  ▼
        Answer to user + citations
        (with source links back to the original docs)

   Everything inside ONE Azure tenant + region
   No data leaves your data boundary
   Vector index sits next to your transactional data

Three properties to internalize:

The embedding model MUST be identical between ingestion and query time. Mixing models = garbage retrieval. Pin the deployment name in config.
The LLM never sees your full corpus. It only sees the top-K retrieved chunks. This is what keeps the prompt small and predictable.
Everything stays in your tenant. Azure OpenAI deployments live in your Azure subscription. Your SQL DB never leaves your region. The data boundary is closed.

Implementation in C# — the three services

Step 1 — Database schema (Azure SQL with VECTOR type)

Azure SQL Database supports a native VECTOR column type (rolled out 2024). It's the cleanest path for .NET shops already on Azure SQL.

-- Create the chunks table
CREATE TABLE knowledge_chunks (
    id              BIGINT IDENTITY PRIMARY KEY,
    document_id     NVARCHAR(255) NOT NULL,
    document_title  NVARCHAR(500) NOT NULL,
    chunk_index     INT NOT NULL,
    content         NVARCHAR(MAX) NOT NULL,
    metadata        NVARCHAR(MAX) NOT NULL,       -- JSON: page, section, url
    embedding       VECTOR(3072) NOT NULL,         -- text-embedding-3-large dim
    created_at      DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
);

-- Index for fast vector search (Azure SQL clustered columnstore variant)
CREATE INDEX idx_chunks_document ON knowledge_chunks(document_id);

Add an index on document_id so you can re-ingest by removing + re-inserting one doc at a time. The VECTOR column itself is queried with VECTOR_DISTANCE — Azure SQL handles the math.

Step 2 — Ingestion service (C#)

public class RagIngestionService
{
    private readonly OpenAIClient _openAi;
    private readonly string _connectionString;
    private readonly string _embeddingDeployment;
    private readonly ILogger<RagIngestionService> _log;

    public RagIngestionService(
        OpenAIClient openAi,
        IConfiguration cfg,
        ILogger<RagIngestionService> log)
    {
        _openAi = openAi;
        _connectionString = cfg.GetConnectionString("AzureSql")!;
        _embeddingDeployment = cfg["AzureOpenAI:EmbeddingDeployment"]!; // e.g. "text-embedding-3-large"
        _log = log;
    }

    public async Task IngestAsync(IngestionRequest req, CancellationToken ct)
    {
        // 1. Chunk the document
        var chunks = ChunkText(req.Content, chunkSize: 800, overlap: 100).ToList();
        _log.LogInformation("Ingesting {Doc} as {N} chunks", req.DocumentId, chunks.Count);

        // 2. Generate embeddings for the batch (cheaper than one-at-a-time)
        var embedResp = await _openAi.GetEmbeddingsAsync(
            new EmbeddingsOptions(_embeddingDeployment, chunks),
            ct);

        // 3. Bulk insert into Azure SQL
        await using var sql = new SqlConnection(_connectionString);
        await sql.OpenAsync(ct);

        // Remove prior chunks for this document (idempotent re-ingestion)
        await using (var del = sql.CreateCommand())
        {
            del.CommandText = "DELETE FROM knowledge_chunks WHERE document_id = @docId";
            del.Parameters.AddWithValue("@docId", req.DocumentId);
            await del.ExecuteNonQueryAsync(ct);
        }

        for (int i = 0; i < chunks.Count; i++)
        {
            var embedding = embedResp.Value.Data[i].Embedding.ToArray();
            await using var cmd = sql.CreateCommand();
            cmd.CommandText = @"
                INSERT INTO knowledge_chunks
                    (document_id, document_title, chunk_index, content, metadata, embedding)
                VALUES
                    (@docId, @title, @idx, @content, @metadata, @embedding)";
            cmd.Parameters.AddWithValue("@docId",    req.DocumentId);
            cmd.Parameters.AddWithValue("@title",    req.Title);
            cmd.Parameters.AddWithValue("@idx",      i);
            cmd.Parameters.AddWithValue("@content",  chunks[i]);
            cmd.Parameters.AddWithValue("@metadata", JsonSerializer.Serialize(req.Metadata));
            cmd.Parameters.Add(new SqlParameter("@embedding", SqlDbType.Variant)
            {
                Value = new SqlVector<float>(embedding)
            });
            await cmd.ExecuteNonQueryAsync(ct);
        }

        _log.LogInformation("Ingested {Doc} ({N} chunks, {Bytes} embedding bytes)",
            req.DocumentId, chunks.Count, chunks.Count * 3072 * 4);
    }

    private static IEnumerable<string> ChunkText(string text, int chunkSize, int overlap)
    {
        // Token-aware chunking. Simple word-split works for most prose; for code
        // or structured docs use a recursive splitter (split on \n\n, then \n,
        // then sentences, then words).
        var words = text.Split(' ', StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < words.Length; i += chunkSize - overlap)
        {
            int len = Math.Min(chunkSize, words.Length - i);
            yield return string.Join(' ', words.Skip(i).Take(len));
            if (i + len >= words.Length) yield break;
        }
    }
}

public record IngestionRequest(
    string DocumentId,
    string Title,
    string Content,
    IDictionary<string, string> Metadata);

Key details to notice:

Idempotent re-ingestion — DELETE then re-insert per document. Update a doc by ingesting it again.
Batch embedding call — Azure OpenAI accepts arrays. One round-trip per document, not per chunk.
Metadata as JSON — flexible without schema migrations. Store page, section, url, confidentiality_level, etc.

Step 3 — Query / answer service (C#)

public class RagChatService
{
    private readonly OpenAIClient _openAi;
    private readonly string _connectionString;
    private readonly string _embeddingDeployment;
    private readonly string _chatDeployment;

    public RagChatService(OpenAIClient openAi, IConfiguration cfg)
    {
        _openAi = openAi;
        _connectionString = cfg.GetConnectionString("AzureSql")!;
        _embeddingDeployment = cfg["AzureOpenAI:EmbeddingDeployment"]!;
        _chatDeployment = cfg["AzureOpenAI:ChatDeployment"]!; // e.g. "gpt-4o" or "gpt-4o-mini"
    }

    public async Task<RagAnswer> AnswerAsync(string question, CancellationToken ct)
    {
        // 1. Embed the question (same model as ingestion)
        var embedResp = await _openAi.GetEmbeddingsAsync(
            new EmbeddingsOptions(_embeddingDeployment, new[] { question }), ct);
        var queryVector = embedResp.Value.Data[0].Embedding.ToArray();

        // 2. Retrieve top-5 most similar chunks from Azure SQL
        var retrieved = new List<RetrievedChunk>();
        await using (var sql = new SqlConnection(_connectionString))
        {
            await sql.OpenAsync(ct);
            await using var cmd = sql.CreateCommand();
            cmd.CommandText = @"
                SELECT TOP 5
                    document_id,
                    document_title,
                    chunk_index,
                    content,
                    metadata,
                    VECTOR_DISTANCE('cosine', embedding, @qvec) AS distance
                FROM knowledge_chunks
                ORDER BY distance ASC;";
            cmd.Parameters.Add(new SqlParameter("@qvec", SqlDbType.Variant)
            {
                Value = new SqlVector<float>(queryVector)
            });

            await using var reader = await cmd.ExecuteReaderAsync(ct);
            while (await reader.ReadAsync(ct))
            {
                retrieved.Add(new RetrievedChunk(
                    DocumentId: reader.GetString(0),
                    DocumentTitle: reader.GetString(1),
                    ChunkIndex: reader.GetInt32(2),
                    Content: reader.GetString(3),
                    Metadata: JsonSerializer.Deserialize<Dictionary<string, string>>(reader.GetString(4))
                              ?? new(),
                    Distance: reader.GetDouble(5)
                ));
            }
        }

        // 3. If retrieval scored too poorly, bail with "I don't know"
        const double MAX_DISTANCE = 0.45;   // cosine; tune for your data
        if (retrieved.Count == 0 || retrieved[0].Distance > MAX_DISTANCE)
        {
            return new RagAnswer(
                Question: question,
                Answer: "I couldn't find anything in our knowledge base relevant to that question. " +
                        "Please rephrase or ask your team lead.",
                Citations: Array.Empty<RetrievedChunk>(),
                ConfidenceScore: 0.0
            );
        }

        // 4. Build the augmented prompt
        var contextBlock = string.Join("\n\n---\n\n",
            retrieved.Select((c, i) =>
                $"[{i + 1}] (from \"{c.DocumentTitle}\", chunk {c.ChunkIndex}):\n{c.Content}"));

        var systemPrompt = """
            You are a helpful enterprise assistant. Answer the user's question using ONLY the
            provided context below. Cite sources using [1], [2], etc. at the end of each
            factual claim. If the context does not contain the answer, say:
            "I couldn't find a clear answer in the provided documents." — do NOT invent facts.
            Keep answers concise. Use bullet points where it improves clarity.
            """;

        var userPrompt = $"""
            Context:
            {contextBlock}

            Question: {question}
            """;

        // 5. Call the chat completion
        var chatResp = await _openAi.GetChatCompletionsAsync(
            new ChatCompletionsOptions
            {
                DeploymentName = _chatDeployment,
                Messages =
                {
                    new ChatRequestSystemMessage(systemPrompt),
                    new ChatRequestUserMessage(userPrompt)
                },
                Temperature = 0.0f,         // deterministic
                MaxTokens = 800,
            }, ct);

        var answer = chatResp.Value.Choices[0].Message.Content;

        return new RagAnswer(
            Question: question,
            Answer: answer,
            Citations: retrieved.ToArray(),
            ConfidenceScore: 1.0 - retrieved[0].Distance  // rough heuristic
        );
    }
}

public record RetrievedChunk(
    string DocumentId,
    string DocumentTitle,
    int ChunkIndex,
    string Content,
    Dictionary<string, string> Metadata,
    double Distance);

public record RagAnswer(
    string Question,
    string Answer,
    IReadOnlyList<RetrievedChunk> Citations,
    double ConfidenceScore);

Step 4 — DI registration + controller

// Program.cs
builder.Services.AddSingleton(sp =>
{
    var cfg = sp.GetRequiredService<IConfiguration>();
    return new OpenAIClient(
        new Uri(cfg["AzureOpenAI:Endpoint"]!),
        new DefaultAzureCredential());   // Managed Identity — no API keys in code
});
builder.Services.AddScoped<RagIngestionService>();
builder.Services.AddScoped<RagChatService>();

// Controllers/RagController.cs
[ApiController]
[Route("rag")]
public class RagController : ControllerBase
{
    private readonly RagChatService _chat;
    public RagController(RagChatService chat) => _chat = chat;

    [HttpPost("ask")]
    [Authorize]   // require auth — RAG over enterprise data MUST be authenticated
    public async Task<ActionResult<RagAnswer>> Ask(
        [FromBody] AskRequest req, CancellationToken ct)
    {
        if (string.IsNullOrWhiteSpace(req.Question))
            return BadRequest("Question is required");
        if (req.Question.Length > 1000)
            return BadRequest("Question too long");

        var answer = await _chat.AnswerAsync(req.Question, ct);
        return Ok(answer);
    }
}

public record AskRequest(string Question);

That's the full pipeline — ingest, retrieve, answer — in three services + one controller. Total: under 200 lines of real code.

Production concerns

Security + privacy

Managed Identity, not API keys. DefaultAzureCredential lets the app authenticate to Azure OpenAI without storing keys. Rotate keys never; rotate the identity's permissions through Azure RBAC.
Row-level security on knowledge_chunks. If different users see different documents, add a tenant_id or confidentiality_level column and enforce via SQL row-level security policies. The retrieval query filters BEFORE vector search.
Audit every query. Persist user_id, question, retrieved chunk IDs, the LLM response, timestamps. Useful for incident review and quality work.
Block prompt injection. Add a system prompt that explicitly ignores "ignore previous instructions" patterns. Strip user input that looks like instructions to the model.
PII scrubbing on ingestion. Run documents through Azure AI Language's PII detection BEFORE chunking and embedding. Don't index what you can't legally retrieve later.

Cost

Item	Cost (Azure, approx., 2026)
text-embedding-3-large	~$0.13 per 1M tokens
GPT-4o input	~$5 per 1M tokens
GPT-4o output	~$15 per 1M tokens
GPT-4o-mini input	~$0.15 per 1M tokens (~30x cheaper)
Azure SQL with VECTOR	Standard SQL pricing — vectors add ~50% storage

For a typical "internal documentation Q&A" use case at 1000 questions/day with 5 chunks of ~600 tokens each:

Embeddings: 1000 × 100 tokens (the question) ≈ ₹2/day
LLM input: 1000 × (300 system + 3000 context + 100 question) ≈ ₹140/day
LLM output: 1000 × 200 tokens ≈ ₹250/day
Total: ~₹12,000/month for the LLM bill

If costs explode: switch GPT-4o → GPT-4o-mini (10x cheaper, often sufficient for Q&A use cases). Use semantic caching to avoid re-asking the LLM for repeated questions.

Latency

Stage	Latency (p50)
Embed question	~50-150 ms
Vector search in Azure SQL	~30-80 ms (with index)
Generate answer (GPT-4o, ~300 token reply)	~2-4 s
Generate answer (GPT-4o-mini)	~0.5-1.5 s

User-perceived total: typically 2-5 seconds for GPT-4o, 1-2 seconds for mini. Stream the response (use GetChatCompletionsStreamingAsync) to show first tokens within 500ms — perceived latency drops dramatically.

Freshness

When a document changes, you must re-ingest it. Options:

Approach	When
Manual re-ingest button	Internal tools with few authors
Cron job nightly	Sources that change daily (SharePoint, Confluence)
Event-driven (Service Bus subscription)	High-volume, near-real-time updates
Lambda on blob upload (Azure Function)	"Drop a PDF in this container, it gets indexed" UX

Stale RAG = wrong answers with confident citations. Build observability around "oldest unrefreshed document" before launching.

The eight pitfalls that ruin production RAG

1. Chunks too small or too large

300 tokens = fragments without context, retrieval works but the answer is incomplete. 2000 tokens = the LLM context fills with one chunk's worth of detail, no room for diverse sources, costs explode.

Sweet spot for prose: 500-1000 tokens with 100-token overlap. For structured docs (code, tables), chunk by structure not by word count.

2. Mixing embedding models

You ingest with text-embedding-3-large and someone changes config to text-embedding-ada-002 for queries. Vectors live in different spaces — distances are nonsense. Retrieval returns garbage.

Pin the embedding deployment name in config and assert it on app startup. If the model changes, re-ingest everything.

3. No "I don't know" path

The LLM, by default, will try to answer EVEN with bad context. It hallucinates plausibly. Users trust it.

Always set a retrieval distance threshold. If the top result is too far, return "I couldn't find an answer" — never invoke the LLM with weak context. Worth showing to users.

4. Pure vector search misses keyword matches

Vector search is great for semantic similarity, terrible for exact terms. "Order #INV-2024-008923" via vector search returns documents about orders and invoices, not the specific record.

Hybrid search — combine vector similarity + BM25 / Azure SQL full-text search — and re-rank. The two methods compensate for each other's weaknesses.

5. No re-ranker

Vector search retrieves the top-50 candidates. Sending them ALL to the LLM is expensive. Sending only top-5 by vector distance often misses the best match.

Add a cross-encoder re-ranker (Cohere Rerank API, or a local cross-encoder model) between retrieval and generation. Take top-50 by vector, re-rank to top-5 by cross-encoder, send those to the LLM. Massive quality jump.

6. Ignoring document metadata in retrieval

Retrieval finds the right CONTENT but ignores that the user is in the EU and the document is "US-only". Or the user is a junior employee asking about salary data.

Filter before vector search by tenant, region, confidentiality level, document type. Azure SQL makes this trivial — just add WHERE clauses before the ORDER BY VECTOR_DISTANCE.

7. No streaming

Generating a 500-token answer takes 3-5 seconds with GPT-4o. Showing nothing for 5 seconds feels broken.

Stream the response with GetChatCompletionsStreamingAsync and pipe to the client via Server-Sent Events. First token arrives in ~500ms; users see motion immediately.

8. Treating every question as a retrieval question

User: "What time is it?" → your system embeds, retrieves random documents, and generates a confused answer.

Add a router that classifies the question first. Greetings, small talk, math, time — answer directly. Only domain queries go through RAG.

RAG vs fine-tuning vs continued pretraining

	RAG	Fine-tuning	Continued pretraining
Updates content	Add a chunk → live	Re-train	Re-train (expensive)
Cite sources	Yes, naturally	No (model "remembers")	No
Cost	Pay per query	Pay once to train + small inference	Massive train cost
Privacy	Data stays in your DB	Data baked into model weights	Same
Latency	Adds retrieval step (~100ms)	Native LLM speed	Native
Hallucination risk	Low (with threshold + cite)	Higher	Higher
Right for	Enterprise Q&A, support, search	Style / format / classification	Building your own base model

For 95% of enterprise "feed our data to AI" use cases, RAG is the right answer. Fine-tuning is mostly used for tone-of-voice, output format, or classification problems — NOT for adding knowledge.

When RAG is the right tool

Q&A over a corpus of >50 documents that changes
Support / customer service copilots over internal KB
Code search / "how do we do X in our codebase"
Compliance Q&A — legal, HR, security policies
E-commerce product Q&A from catalog
Enterprise search with summarization

When RAG is the wrong tool

Math / calculations → use function calling + a real calculator
Real-time data (stock prices, live inventory) → call APIs, don't index
Tasks where you need creative generation, not factual recall
Highly personal user-specific reasoning (chat memory, planning) → use memory + agent patterns

Production checklist

✅ Pin embedding model name in config, assert at startup
✅ Use Managed Identity to talk to Azure OpenAI — no API keys in code
✅ Row-level security or pre-filter on retrieval for multi-tenant / classified data
✅ Retrieval distance threshold + "I don't know" path
✅ Hybrid search (vector + keyword) for entity / number / acronym queries
✅ Optional re-ranker for quality
✅ Stream the LLM response with SSE
✅ Audit log every query (user, question, retrieved IDs, response, latency)
✅ Re-ingestion mechanism for stale documents
✅ Semantic cache (Redis) for repeated questions
✅ Cost dashboard with daily Azure OpenAI token spend
✅ Guardrails: input length limits, prompt-injection-resistant system prompt, output content filter (Azure provides one)
✅ Health probe that runs a fixed test question + expected citation, alerts if quality degrades

Summary

RAG with Azure OpenAI + Azure SQL solves the real enterprise problem: giving an LLM access to your proprietary data without sending that data outside your tenant, without fine-tuning, without hallucinations on domain-specific questions, and with citation back to sources.

The architecture is three services: ingest documents → embed chunks → store in Azure SQL VECTOR column; embed user question → vector-search top chunks → prompt the LLM with context. Total production code is under 300 lines.

The real work isn't writing the pipeline — it's the eight pitfalls (chunking strategy, embedding consistency, the "I don't know" path, hybrid search, re-ranking, metadata filtering, streaming, intent routing) that separate a demo from a system you can trust in production.

Pair RAG with a proper observability stack, cost dashboard, and re-ingestion mechanism. With Azure OpenAI in your tenant, Azure SQL holding the index, and Managed Identity for auth, the entire pipeline stays inside your data boundary — which is exactly what your security team needs to hear before they approve the launch.

📚 Test your knowledge → Practice with our RAG interview questions — chunking strategies, RAG vs fine-tuning, hallucination prevention, security gotchas, and the production patterns that actually matter.

The dual problem RAG solves

Every enterprise wanting to add LLM-powered features hits the same two walls.

Problem 1 — The LLM has no idea about your enterprise data

GPT-4o knows the public internet up to its training cutoff. It does NOT know:

Your company's 4,000-page employee handbook
Your last 5 years of customer support tickets
Your product catalog with internal SKUs and warehouse locations
The 200 internal Confluence pages on how your deployment pipeline works
The compliance documents specific to your industry / region
Yesterday's prices, today's stock levels

Ask GPT-4o "what's our return policy for items damaged during shipping?" and you get either a generic web-trained answer (wrong) or a hallucination (worse — confidently wrong).

Problem 2 — You cannot just send enterprise data to a public API

Most companies that have private data also have:

Data residency laws (GDPR, HIPAA, RBI guidelines for Indian fintech)
Customer contracts forbidding third-party data sharing
Trade secrets they don't want vendor models training on
Confidentiality classifications that prohibit cloud transit to certain regions

Sending raw documents to a public OpenAI endpoint is often a non-starter. The data must stay inside the company's Azure tenant, encrypted at rest, audited, deletable on request.

What RAG does

RAG combines:

Retrieval — a search system over your proprietary data that returns the most relevant chunks for a given query
Augmented Generation — the LLM receives the user's question PLUS those retrieved chunks as context, and is instructed to answer using ONLY that context

Result:

The LLM answers using your data, never inventing facts not in the context
Your data stays in your storage (Azure SQL, Azure AI Search) — only the retrieved chunks + the question travel to the LLM endpoint
With Azure OpenAI in your own Azure tenant, the entire pipeline is within your data boundary, encrypted, audited, region-pinned
You can cite sources back to the user — "I found this answer in section 4.2 of the employee handbook"

This is why every enterprise launching an internal chatbot, support copilot, or document Q&A feature in 2026 uses RAG.

Architecture in one diagram

                              RAG with Azure OpenAI + Azure SQL
                              ──────────────────────────────────

                              INGESTION PIPELINE (one-time + incremental)

       Documents (PDF, Word, Markdown, HTML, Confluence, SharePoint)
              │
              ▼
       ┌──────────────────────┐
       │  Text Extractor       │   pdf → text via Azure Document
       │  (Form Recognizer,    │   Intelligence; preserve page numbers
       │   or pdfpig, etc.)    │
       └──────────┬───────────┘
                  │
                  ▼
       ┌──────────────────────┐
       │  Chunker              │   Split into 500-1000 token chunks
       │  (semantic /          │   with 100-token overlap, preserve
       │   recursive split)    │   doc + page metadata
       └──────────┬───────────┘
                  │
                  ▼
       ┌──────────────────────┐
       │  Azure OpenAI         │   text-embedding-3-large
       │  Embeddings API       │   → 3072-dim float[] per chunk
       └──────────┬───────────┘
                  │
                  ▼
       ┌──────────────────────┐
       │  Azure SQL Database   │   knowledge_chunks table:
       │   VECTOR(3072) col    │   (id, doc_id, content, metadata,
       │                       │    embedding VECTOR(3072))
       └──────────────────────┘

                              ─── QUERY PIPELINE (every user message) ───

       User question: "What's our policy on damaged shipments?"
              │
              ▼
       ┌──────────────────────┐
       │  Same embedder        │   text-embedding-3-large
       │  (MUST match index)   │   → 3072-dim float[]
       └──────────┬───────────┘
                  │
                  ▼
       ┌──────────────────────┐
       │  Vector Search        │   SELECT TOP 5 ... ORDER BY
       │  Azure SQL            │   VECTOR_DISTANCE('cosine', ...)
       │   + optional keyword  │   (hybrid: vector + BM25 / FTS)
       │     re-rank           │
       └──────────┬───────────┘
                  │
                  ▼
       ┌──────────────────────┐
       │  Context Builder      │   Pack top-5 chunks into a prompt
       │  + Citation Tagger    │   with [1], [2], [3] index markers
       └──────────┬───────────┘
                  │
                  ▼
       ┌──────────────────────┐
       │  Azure OpenAI Chat    │   GPT-4o or gpt-4o-mini
       │  System: "Answer ONLY │   temperature: 0.0
       │  from context. Cite.  │   max_tokens: 800
       │  Say so if missing."  │
       └──────────┬───────────┘
                  │
                  ▼
        Answer to user + citations
        (with source links back to the original docs)

   Everything inside ONE Azure tenant + region
   No data leaves your data boundary
   Vector index sits next to your transactional data

Three properties to internalize:

The embedding model MUST be identical between ingestion and query time. Mixing models = garbage retrieval. Pin the deployment name in config.
The LLM never sees your full corpus. It only sees the top-K retrieved chunks. This is what keeps the prompt small and predictable.
Everything stays in your tenant. Azure OpenAI deployments live in your Azure subscription. Your SQL DB never leaves your region. The data boundary is closed.

Implementation in C# — the three services

Step 1 — Database schema (Azure SQL with VECTOR type)

Azure SQL Database supports a native VECTOR column type (rolled out 2024). It's the cleanest path for .NET shops already on Azure SQL.

-- Create the chunks table
CREATE TABLE knowledge_chunks (
    id              BIGINT IDENTITY PRIMARY KEY,
    document_id     NVARCHAR(255) NOT NULL,
    document_title  NVARCHAR(500) NOT NULL,
    chunk_index     INT NOT NULL,
    content         NVARCHAR(MAX) NOT NULL,
    metadata        NVARCHAR(MAX) NOT NULL,       -- JSON: page, section, url
    embedding       VECTOR(3072) NOT NULL,         -- text-embedding-3-large dim
    created_at      DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
);

-- Index for fast vector search (Azure SQL clustered columnstore variant)
CREATE INDEX idx_chunks_document ON knowledge_chunks(document_id);

Add an index on document_id so you can re-ingest by removing + re-inserting one doc at a time. The VECTOR column itself is queried with VECTOR_DISTANCE — Azure SQL handles the math.

Step 2 — Ingestion service (C#)

public class RagIngestionService
{
    private readonly OpenAIClient _openAi;
    private readonly string _connectionString;
    private readonly string _embeddingDeployment;
    private readonly ILogger<RagIngestionService> _log;

    public RagIngestionService(
        OpenAIClient openAi,
        IConfiguration cfg,
        ILogger<RagIngestionService> log)
    {
        _openAi = openAi;
        _connectionString = cfg.GetConnectionString("AzureSql")!;
        _embeddingDeployment = cfg["AzureOpenAI:EmbeddingDeployment"]!; // e.g. "text-embedding-3-large"
        _log = log;
    }

    public async Task IngestAsync(IngestionRequest req, CancellationToken ct)
    {
        // 1. Chunk the document
        var chunks = ChunkText(req.Content, chunkSize: 800, overlap: 100).ToList();
        _log.LogInformation("Ingesting {Doc} as {N} chunks", req.DocumentId, chunks.Count);

        // 2. Generate embeddings for the batch (cheaper than one-at-a-time)
        var embedResp = await _openAi.GetEmbeddingsAsync(
            new EmbeddingsOptions(_embeddingDeployment, chunks),
            ct);

        // 3. Bulk insert into Azure SQL
        await using var sql = new SqlConnection(_connectionString);
        await sql.OpenAsync(ct);

        // Remove prior chunks for this document (idempotent re-ingestion)
        await using (var del = sql.CreateCommand())
        {
            del.CommandText = "DELETE FROM knowledge_chunks WHERE document_id = @docId";
            del.Parameters.AddWithValue("@docId", req.DocumentId);
            await del.ExecuteNonQueryAsync(ct);
        }

        for (int i = 0; i < chunks.Count; i++)
        {
            var embedding = embedResp.Value.Data[i].Embedding.ToArray();
            await using var cmd = sql.CreateCommand();
            cmd.CommandText = @"
                INSERT INTO knowledge_chunks
                    (document_id, document_title, chunk_index, content, metadata, embedding)
                VALUES
                    (@docId, @title, @idx, @content, @metadata, @embedding)";
            cmd.Parameters.AddWithValue("@docId",    req.DocumentId);
            cmd.Parameters.AddWithValue("@title",    req.Title);
            cmd.Parameters.AddWithValue("@idx",      i);
            cmd.Parameters.AddWithValue("@content",  chunks[i]);
            cmd.Parameters.AddWithValue("@metadata", JsonSerializer.Serialize(req.Metadata));
            cmd.Parameters.Add(new SqlParameter("@embedding", SqlDbType.Variant)
            {
                Value = new SqlVector<float>(embedding)
            });
            await cmd.ExecuteNonQueryAsync(ct);
        }

        _log.LogInformation("Ingested {Doc} ({N} chunks, {Bytes} embedding bytes)",
            req.DocumentId, chunks.Count, chunks.Count * 3072 * 4);
    }

    private static IEnumerable<string> ChunkText(string text, int chunkSize, int overlap)
    {
        // Token-aware chunking. Simple word-split works for most prose; for code
        // or structured docs use a recursive splitter (split on \n\n, then \n,
        // then sentences, then words).
        var words = text.Split(' ', StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < words.Length; i += chunkSize - overlap)
        {
            int len = Math.Min(chunkSize, words.Length - i);
            yield return string.Join(' ', words.Skip(i).Take(len));
            if (i + len >= words.Length) yield break;
        }
    }
}

public record IngestionRequest(
    string DocumentId,
    string Title,
    string Content,
    IDictionary<string, string> Metadata);

Key details to notice:

Idempotent re-ingestion — DELETE then re-insert per document. Update a doc by ingesting it again.
Batch embedding call — Azure OpenAI accepts arrays. One round-trip per document, not per chunk.
Metadata as JSON — flexible without schema migrations. Store page, section, url, confidentiality_level, etc.

Step 3 — Query / answer service (C#)

public class RagChatService
{
    private readonly OpenAIClient _openAi;
    private readonly string _connectionString;
    private readonly string _embeddingDeployment;
    private readonly string _chatDeployment;

    public RagChatService(OpenAIClient openAi, IConfiguration cfg)
    {
        _openAi = openAi;
        _connectionString = cfg.GetConnectionString("AzureSql")!;
        _embeddingDeployment = cfg["AzureOpenAI:EmbeddingDeployment"]!;
        _chatDeployment = cfg["AzureOpenAI:ChatDeployment"]!; // e.g. "gpt-4o" or "gpt-4o-mini"
    }

    public async Task<RagAnswer> AnswerAsync(string question, CancellationToken ct)
    {
        // 1. Embed the question (same model as ingestion)
        var embedResp = await _openAi.GetEmbeddingsAsync(
            new EmbeddingsOptions(_embeddingDeployment, new[] { question }), ct);
        var queryVector = embedResp.Value.Data[0].Embedding.ToArray();

        // 2. Retrieve top-5 most similar chunks from Azure SQL
        var retrieved = new List<RetrievedChunk>();
        await using (var sql = new SqlConnection(_connectionString))
        {
            await sql.OpenAsync(ct);
            await using var cmd = sql.CreateCommand();
            cmd.CommandText = @"
                SELECT TOP 5
                    document_id,
                    document_title,
                    chunk_index,
                    content,
                    metadata,
                    VECTOR_DISTANCE('cosine', embedding, @qvec) AS distance
                FROM knowledge_chunks
                ORDER BY distance ASC;";
            cmd.Parameters.Add(new SqlParameter("@qvec", SqlDbType.Variant)
            {
                Value = new SqlVector<float>(queryVector)
            });

            await using var reader = await cmd.ExecuteReaderAsync(ct);
            while (await reader.ReadAsync(ct))
            {
                retrieved.Add(new RetrievedChunk(
                    DocumentId: reader.GetString(0),
                    DocumentTitle: reader.GetString(1),
                    ChunkIndex: reader.GetInt32(2),
                    Content: reader.GetString(3),
                    Metadata: JsonSerializer.Deserialize<Dictionary<string, string>>(reader.GetString(4))
                              ?? new(),
                    Distance: reader.GetDouble(5)
                ));
            }
        }

        // 3. If retrieval scored too poorly, bail with "I don't know"
        const double MAX_DISTANCE = 0.45;   // cosine; tune for your data
        if (retrieved.Count == 0 || retrieved[0].Distance > MAX_DISTANCE)
        {
            return new RagAnswer(
                Question: question,
                Answer: "I couldn't find anything in our knowledge base relevant to that question. " +
                        "Please rephrase or ask your team lead.",
                Citations: Array.Empty<RetrievedChunk>(),
                ConfidenceScore: 0.0
            );
        }

        // 4. Build the augmented prompt
        var contextBlock = string.Join("\n\n---\n\n",
            retrieved.Select((c, i) =>
                $"[{i + 1}] (from \"{c.DocumentTitle}\", chunk {c.ChunkIndex}):\n{c.Content}"));

        var systemPrompt = """
            You are a helpful enterprise assistant. Answer the user's question using ONLY the
            provided context below. Cite sources using [1], [2], etc. at the end of each
            factual claim. If the context does not contain the answer, say:
            "I couldn't find a clear answer in the provided documents." — do NOT invent facts.
            Keep answers concise. Use bullet points where it improves clarity.
            """;

        var userPrompt = $"""
            Context:
            {contextBlock}

            Question: {question}
            """;

        // 5. Call the chat completion
        var chatResp = await _openAi.GetChatCompletionsAsync(
            new ChatCompletionsOptions
            {
                DeploymentName = _chatDeployment,
                Messages =
                {
                    new ChatRequestSystemMessage(systemPrompt),
                    new ChatRequestUserMessage(userPrompt)
                },
                Temperature = 0.0f,         // deterministic
                MaxTokens = 800,
            }, ct);

        var answer = chatResp.Value.Choices[0].Message.Content;

        return new RagAnswer(
            Question: question,
            Answer: answer,
            Citations: retrieved.ToArray(),
            ConfidenceScore: 1.0 - retrieved[0].Distance  // rough heuristic
        );
    }
}

public record RetrievedChunk(
    string DocumentId,
    string DocumentTitle,
    int ChunkIndex,
    string Content,
    Dictionary<string, string> Metadata,
    double Distance);

public record RagAnswer(
    string Question,
    string Answer,
    IReadOnlyList<RetrievedChunk> Citations,
    double ConfidenceScore);

Step 4 — DI registration + controller

// Program.cs
builder.Services.AddSingleton(sp =>
{
    var cfg = sp.GetRequiredService<IConfiguration>();
    return new OpenAIClient(
        new Uri(cfg["AzureOpenAI:Endpoint"]!),
        new DefaultAzureCredential());   // Managed Identity — no API keys in code
});
builder.Services.AddScoped<RagIngestionService>();
builder.Services.AddScoped<RagChatService>();

// Controllers/RagController.cs
[ApiController]
[Route("rag")]
public class RagController : ControllerBase
{
    private readonly RagChatService _chat;
    public RagController(RagChatService chat) => _chat = chat;

    [HttpPost("ask")]
    [Authorize]   // require auth — RAG over enterprise data MUST be authenticated
    public async Task<ActionResult<RagAnswer>> Ask(
        [FromBody] AskRequest req, CancellationToken ct)
    {
        if (string.IsNullOrWhiteSpace(req.Question))
            return BadRequest("Question is required");
        if (req.Question.Length > 1000)
            return BadRequest("Question too long");

        var answer = await _chat.AnswerAsync(req.Question, ct);
        return Ok(answer);
    }
}

public record AskRequest(string Question);

That's the full pipeline — ingest, retrieve, answer — in three services + one controller. Total: under 200 lines of real code.

Production concerns

Security + privacy

Managed Identity, not API keys. DefaultAzureCredential lets the app authenticate to Azure OpenAI without storing keys. Rotate keys never; rotate the identity's permissions through Azure RBAC.
Row-level security on knowledge_chunks. If different users see different documents, add a tenant_id or confidentiality_level column and enforce via SQL row-level security policies. The retrieval query filters BEFORE vector search.
Audit every query. Persist user_id, question, retrieved chunk IDs, the LLM response, timestamps. Useful for incident review and quality work.
Block prompt injection. Add a system prompt that explicitly ignores "ignore previous instructions" patterns. Strip user input that looks like instructions to the model.
PII scrubbing on ingestion. Run documents through Azure AI Language's PII detection BEFORE chunking and embedding. Don't index what you can't legally retrieve later.

Cost

Item	Cost (Azure, approx., 2026)
text-embedding-3-large	~$0.13 per 1M tokens
GPT-4o input	~$5 per 1M tokens
GPT-4o output	~$15 per 1M tokens
GPT-4o-mini input	~$0.15 per 1M tokens (~30x cheaper)
Azure SQL with VECTOR	Standard SQL pricing — vectors add ~50% storage

For a typical "internal documentation Q&A" use case at 1000 questions/day with 5 chunks of ~600 tokens each:

Embeddings: 1000 × 100 tokens (the question) ≈ ₹2/day
LLM input: 1000 × (300 system + 3000 context + 100 question) ≈ ₹140/day
LLM output: 1000 × 200 tokens ≈ ₹250/day
Total: ~₹12,000/month for the LLM bill

If costs explode: switch GPT-4o → GPT-4o-mini (10x cheaper, often sufficient for Q&A use cases). Use semantic caching to avoid re-asking the LLM for repeated questions.

Latency

Stage	Latency (p50)
Embed question	~50-150 ms
Vector search in Azure SQL	~30-80 ms (with index)
Generate answer (GPT-4o, ~300 token reply)	~2-4 s
Generate answer (GPT-4o-mini)	~0.5-1.5 s

Freshness

When a document changes, you must re-ingest it. Options:

Approach	When
Manual re-ingest button	Internal tools with few authors
Cron job nightly	Sources that change daily (SharePoint, Confluence)
Event-driven (Service Bus subscription)	High-volume, near-real-time updates
Lambda on blob upload (Azure Function)	"Drop a PDF in this container, it gets indexed" UX

Stale RAG = wrong answers with confident citations. Build observability around "oldest unrefreshed document" before launching.

	RAG	Fine-tuning	Continued pretraining
Updates content	Add a chunk → live	Re-train	Re-train (expensive)
Cite sources	Yes, naturally	No (model "remembers")	No
Cost	Pay per query	Pay once to train + small inference	Massive train cost
Privacy	Data stays in your DB	Data baked into model weights	Same
Latency	Adds retrieval step (~100ms)	Native LLM speed	Native
Hallucination risk	Low (with threshold + cite)	Higher	Higher
Right for	Enterprise Q&A, support, search	Style / format / classification	Building your own base model

When RAG is the right tool

Q&A over a corpus of >50 documents that changes
Support / customer service copilots over internal KB
Code search / "how do we do X in our codebase"
Compliance Q&A — legal, HR, security policies
E-commerce product Q&A from catalog
Enterprise search with summarization

When RAG is the wrong tool

Math / calculations → use function calling + a real calculator
Real-time data (stock prices, live inventory) → call APIs, don't index
Tasks where you need creative generation, not factual recall
Highly personal user-specific reasoning (chat memory, planning) → use memory + agent patterns

Production checklist

✅ Pin embedding model name in config, assert at startup
✅ Use Managed Identity to talk to Azure OpenAI — no API keys in code
✅ Row-level security or pre-filter on retrieval for multi-tenant / classified data
✅ Retrieval distance threshold + "I don't know" path
✅ Hybrid search (vector + keyword) for entity / number / acronym queries
✅ Optional re-ranker for quality
✅ Stream the LLM response with SSE
✅ Audit log every query (user, question, retrieved IDs, response, latency)
✅ Re-ingestion mechanism for stale documents
✅ Semantic cache (Redis) for repeated questions
✅ Cost dashboard with daily Azure OpenAI token spend
✅ Guardrails: input length limits, prompt-injection-resistant system prompt, output content filter (Azure provides one)
✅ Health probe that runs a fixed test question + expected citation, alerts if quality degrades

Get the next issue

Keep reading

Get the next issue

Keep reading