How do you prevent hallucinations and improve answer quality in RAG?
Hallucination = the LLM confidently produces text that's not supported by the provided context (or worse, plain wrong). In a RAG system this is the #1 production concern because users TRUST cited answers.
The six techniques that actually work
1. System prompt that forces context-only answers + "I don't know"
The single highest-leverage fix.
var systemPrompt = """
You are an enterprise assistant. Answer the user's question using ONLY the
provided context below. If the answer is not in the context, respond with:
"I couldn't find a clear answer in the provided documents."
Do NOT use any prior knowledge. Do NOT speculate. Do NOT invent facts.
For every factual claim, cite the source using [1], [2], [3] markers
matching the context numbering.
Keep answers concise. Use bullet points where it improves clarity.
""";
Models like GPT-4o generally follow this instruction when temperature is 0.0, though not perfectly. Without the explicit "say I don't know" clause, they tend to fall back on general knowledge or invent an answer.
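The prompt asks for [1], [2], [3] markers "matching the context numbering", so the context itself has to be numbered before it is sent. A minimal sketch of that step, assuming each retrieved chunk is available as a plain string (the `ContextBuilder` class and its method name are hypothetical, not from any SDK):

```csharp
using System.Collections.Generic;
using System.Text;

public static class ContextBuilder
{
    // Numbers each chunk from 1 so the model's [n] markers map back to a chunk.
    public static string BuildNumberedContext(IReadOnlyList<string> chunkTexts)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < chunkTexts.Count; i++)
        {
            sb.Append('[').Append(i + 1).Append("] ").AppendLine(chunkTexts[i].Trim());
        }
        return sb.ToString();
    }
}
```

The resulting string would be appended to the user message (or the system prompt) as the "provided context".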
2. Retrieval distance threshold
If the best retrieved chunk is too far (semantically dissimilar) from the question, do NOT call the LLM. Return "I couldn't find an answer" directly.
const double MAX_DISTANCE = 0.45; // cosine distance; tune per data
// Assumes retrieved is sorted by ascending distance, so [0] is the best match
if (retrieved.Count == 0 || retrieved[0].Distance > MAX_DISTANCE)
{
    return new RagAnswer(
        Answer: "I couldn't find any relevant information in our knowledge base.",
        Citations: Array.Empty<RetrievedChunk>(),
        ConfidenceScore: 0.0
    );
}
This is what prevents the system from inventing plausible-sounding rubbish when the user asks about something not in your corpus.
3. Temperature 0.0 (deterministic)
Default to Temperature = 0.0f for factual Q&A. Higher temperatures introduce randomness — which means more variations on the truth, including invented ones.
For creative tasks (drafting, brainstorming) temperature > 0 is fine. For RAG over factual docs, keep it at 0.
4. Hybrid search — vector + keyword
Pure vector search misses exact terms ("Order #INV-2024-008923" matches documents about orders generally, not the specific record).
Combine:
- Vector similarity (for semantic match)
- BM25 / SQL full-text search (for keyword match)
- Re-rank the union to top-K
In Azure SQL:
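-- Vector top-50 (smaller cosine distance = closer match)
WITH vector_results AS (
    SELECT TOP 50 id, content, metadata,
           VECTOR_DISTANCE('cosine', embedding, @qvec) AS vec_score,
           ROW_NUMBER() OVER (ORDER BY VECTOR_DISTANCE('cosine', embedding, @qvec) ASC) AS vec_rank
    FROM knowledge_chunks
    ORDER BY vec_score ASC
),
-- Full-text top-50. CONTAINS is only a filter; CONTAINSTABLE also returns a RANK.
-- Assumes id is the full-text key column of knowledge_chunks.
text_results AS (
    SELECT TOP 50 kc.id, kc.content, kc.metadata,
           ft.[RANK] AS text_score,
           ROW_NUMBER() OVER (ORDER BY ft.[RANK] DESC) AS text_rank
    FROM CONTAINSTABLE(knowledge_chunks, content, @keywords) AS ft
    JOIN knowledge_chunks AS kc ON kc.id = ft.[KEY]
    ORDER BY ft.[RANK] DESC
)
-- Reciprocal rank fusion: join the two result sets on id (FULL OUTER JOIN),
-- treating a rank that is missing from one side as contributing 0
SELECT ... ORDER BY (1.0/vec_rank + 1.0/text_rank) DESC;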
Hybrid gives a noticeable quality bump, especially on queries with proper nouns, IDs, or specific numbers.
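If the fusion step is easier to do in application code than in SQL, reciprocal rank fusion can be sketched as below. This is an illustration under assumptions: the two inputs are lists of chunk ids already ordered best-first, and the smoothing constant `k = 60` is a commonly used default rather than something the query above specifies (passing `k = 0` reproduces the plain `1.0/rank` formula). `RankFusion` is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RankFusion
{
    // Reciprocal rank fusion: score(id) = sum over lists of 1 / (k + rank),
    // with rank starting at 1. An id missing from a list contributes nothing.
    public static List<string> Fuse(
        IReadOnlyList<string> vectorIds,
        IReadOnlyList<string> keywordIds,
        int topK,
        double k = 60.0)
    {
        var scores = new Dictionary<string, double>();
        foreach (var ranked in new[] { vectorIds, keywordIds })
        {
            for (int i = 0; i < ranked.Count; i++)
            {
                scores.TryGetValue(ranked[i], out var current);
                scores[ranked[i]] = current + 1.0 / (k + i + 1);
            }
        }

        // Highest fused score first; ties broken by id for stable output.
        return scores
            .OrderByDescending(pair => pair.Value)
            .ThenBy(pair => pair.Key, StringComparer.Ordinal)
            .Take(topK)
            .Select(pair => pair.Key)
            .ToList();
    }
}
```

A chunk that appears in both lists accumulates two contributions, which is why documents matching both semantically and by keyword tend to rise to the top.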
5. Cross-encoder re-ranker
Vector search retrieves the top-50 candidates fast but coarsely. A cross-encoder model (Cohere Rerank, Azure AI Search semantic ranker, or a local one) compares each candidate AGAINST the query and re-ranks more precisely.
Vector search ───▶ top-50 candidates ───▶ Re-ranker ───▶ top-5 to LLM
   (~50 ms)          (broad recall)        (precise)     (high precision)
Re-rankers add ~100-300ms but dramatically improve retrieval quality on ambiguous queries. Worth it for production.
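The shape of the re-rank stage can be sketched independently of any particular vendor. In the snippet below, `scoreFn` is a hypothetical stand-in for the actual cross-encoder call (Cohere, Azure AI Search, or a local model); only the "score every candidate against the query, keep the best few" logic is shown.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RerankStage
{
    // scoreFn stands in for a real cross-encoder call (hypothetical): it takes
    // (query, candidateText) and returns a relevance score, higher is better.
    public static List<string> Rerank(
        string query,
        IEnumerable<string> candidates,
        Func<string, string, double> scoreFn,
        int keep = 5)
    {
        return candidates
            .Select(text => (Text: text, Score: scoreFn(query, text)))
            .OrderByDescending(c => c.Score)
            .Take(keep)
            .Select(c => c.Text)
            .ToList();
    }
}
```

In a real pipeline the scoring would usually be a single batched request rather than one call per candidate, which is where most of the added latency comes from.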
6. Self-check / answer verification pass
For high-stakes answers (medical, legal, financial), add a second LLM call to verify the first:
var verifyPrompt = $"""
Question: {question}
Proposed answer: {firstAnswer}
Source context:
{context}
Does every factual claim in the proposed answer appear in the source context?
Respond with: SUPPORTED, PARTIAL, or NOT_SUPPORTED. If NOT_SUPPORTED, list
which claims are not in the context.
""";
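var response = await _openAi.GetChatCompletionsAsync(/* ... */);
// Assumes the Azure.AI.OpenAI 1.0 beta client, where the reply text lives at
// Value.Choices[0].Message.Content; adjust for your SDK version.
var verdict = response.Value.Choices[0].Message.Content.Trim();
if (verdict.StartsWith("NOT_SUPPORTED", StringComparison.Ordinal))
{
    // Return generic "I don't know" instead of the unverified answer
}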
This roughly doubles LLM cost per answer, so use it only for high-stakes flows.
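The verifier's reply is free text, so it helps to parse it into a closed set of outcomes before acting on it. The sketch below is one possible policy, not the only one: it treats PARTIAL and any unparseable reply as a failure ("fail closed"), which is stricter than checking for NOT_SUPPORTED alone. `Verdict` and `VerdictParser` are hypothetical names.

```csharp
using System;

public enum Verdict { Supported, Partial, NotSupported, Unknown }

public static class VerdictParser
{
    // Maps the verifier's raw reply to a verdict based on its leading keyword.
    public static Verdict Parse(string modelReply)
    {
        var text = (modelReply ?? string.Empty).Trim();
        if (text.StartsWith("NOT_SUPPORTED", StringComparison.OrdinalIgnoreCase))
            return Verdict.NotSupported;
        if (text.StartsWith("PARTIAL", StringComparison.OrdinalIgnoreCase))
            return Verdict.Partial;
        if (text.StartsWith("SUPPORTED", StringComparison.OrdinalIgnoreCase))
            return Verdict.Supported;
        return Verdict.Unknown; // reply did not follow the requested format
    }

    // Fail closed: only a clean SUPPORTED verdict lets the answer through.
    public static bool ShouldReturnAnswer(Verdict verdict) => verdict == Verdict.Supported;
}
```

Whether PARTIAL should block the answer or merely flag it for review depends on the stakes of the flow.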
Quality measurement — without it you're guessing
Build a golden question set:
// 30-50 representative questions that an internal expert has answered
public record GoldenQuestion(
    string Question,
    string ExpectedAnswer,
    string[] MustCiteDocumentIds);
Run them through your pipeline whenever you change anything (chunking, embedding model, retrieval params, system prompt). Score:
- Retrieval recall — were the expected source chunks in the top-K? (automated)
- Answer faithfulness — does every claim trace to retrieved context? (LLM-graded or human-graded)
- Answer correctness — does the answer match the expected answer? (human-graded)
Track these over time. If a config change drops recall from 92% to 78%, revert.
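Retrieval recall is the one metric in that list that can be fully automated. A minimal sketch, assuming the golden set uses the `GoldenQuestion` record shape from above (repeated here so the snippet is self-contained) and that `retrieve` is a hypothetical delegate wrapping your own retrieval step:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public record GoldenQuestion(
    string Question,
    string ExpectedAnswer,
    string[] MustCiteDocumentIds);

public static class RetrievalEval
{
    // retrieve is a stand-in for your pipeline's retrieval step (hypothetical):
    // given a question and K, it returns the document ids of the top-K chunks.
    // Recall = expected document ids found in top-K / all expected document ids.
    public static double RecallAtK(
        IReadOnlyList<GoldenQuestion> goldenSet,
        Func<string, int, IEnumerable<string>> retrieve,
        int k)
    {
        int expected = 0, found = 0;
        foreach (var q in goldenSet)
        {
            var retrieved = new HashSet<string>(retrieve(q.Question, k));
            expected += q.MustCiteDocumentIds.Length;
            found += q.MustCiteDocumentIds.Count(id => retrieved.Contains(id));
        }
        return expected == 0 ? 0.0 : (double)found / expected;
    }
}
```

Running this in CI against a fixed golden set gives a single number to compare before and after each configuration change.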
Common mistakes
Mistake 1 — High temperature for factual Q&A
Temperature = 0.7f makes answers feel "natural" but invites hallucination. Use 0.0.
Mistake 2 — No retrieval threshold
Sending the LLM weak context and trusting it to "say it doesn't know" — sometimes it does, often it doesn't. Reject weak retrieval at the threshold layer, not the LLM layer.
Mistake 3 — Treating top-1 as enough
Top-1 retrieval is fragile. Top-5 lets the LLM see corroborating sources. The LLM's answer often combines partial information from chunks 1, 3, and 4.
Mistake 4 — Trusting the LLM's self-reported confidence
Don't ask the LLM "are you sure?" — it has no calibrated confidence. Use retrieval distance as your confidence signal.
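One way to turn retrieval distance into a number the UI or an audit log can use is a simple linear mapping against the rejection threshold. This is a heuristic sketch, not a calibrated probability: the linear shape and the 0.45 default are assumptions, and `RetrievalConfidence` is a hypothetical helper.

```csharp
using System;

public static class RetrievalConfidence
{
    // Maps the best chunk's cosine distance to a 0..1 score: 1.0 at distance 0,
    // falling linearly to 0.0 at maxDistance (the rejection threshold).
    public static double FromDistance(double bestDistance, double maxDistance = 0.45)
    {
        if (maxDistance <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxDistance));
        return Math.Clamp(1.0 - bestDistance / maxDistance, 0.0, 1.0);
    }
}
```

A score like this is mainly useful for routing, for example sending answers below some cutoff to a human review queue.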
Mistake 5 — Not verifying citations are real
LLMs can invent [3] when there's no chunk 3. Post-process the answer: extract the citation markers and verify that each one falls between 1 and the number of chunks you actually sent. Checking only the highest marker would miss an invalid [0].
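A post-processing check along these lines can be sketched with a regular expression. It assumes citations use the `[n]` format from the system prompt and that chunks were numbered from 1; `CitationValidator` is a hypothetical helper name.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CitationValidator
{
    // Matches [n] markers with up to 9 digits, which always fits in an int.
    private static readonly Regex Marker = new Regex(@"\[(\d{1,9})\]");

    // Returns every distinct cited number that does not map to a chunk that
    // was actually sent to the model (valid numbers are 1..chunkCount).
    public static List<int> FindInvalidCitations(string answer, int chunkCount)
    {
        return Marker.Matches(answer)
            .Cast<Match>()
            .Select(m => int.Parse(m.Groups[1].Value))
            .Distinct()
            .Where(n => n < 1 || n > chunkCount)
            .OrderBy(n => n)
            .ToList();
    }
}
```

If the returned list is non-empty, the answer can be rejected, regenerated, or returned with the invalid markers removed, depending on how strict the flow needs to be.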
Mistake 6 — Long, complex system prompts
Models follow short, clear instructions better than long ones. A 200-300 word system prompt is plenty. Cut everything else.
Production patterns by stakes
| Use case | Recommended controls |
|---|---|
| Internal FAQ chatbot | Threshold + "I don't know" + cite |
| Customer-facing support | Above + re-ranker + audit log + human review queue for low-confidence |
| Medical / legal / financial advice | Above + self-check pass + always show retrieved snippets + human-in-the-loop for high-impact |
| Compliance Q&A | Above + explicit "this is informational, not legal advice" + audit + version control on source docs |
Interview-grade summary
"Hallucination prevention in RAG comes from six layers: a system prompt that forces context-only answers with an explicit 'I don't know' clause; a retrieval distance threshold that rejects weak matches before calling the LLM; temperature 0.0 for determinism; hybrid search (vector + keyword) for query types vector misses; a cross-encoder re-ranker for precision; and optional verification passes for high-stakes flows. Measure on a golden question set — without measurement you're guessing whether quality improved or regressed."