RAG vs fine-tuning — when do you choose which?
Default: RAG. Fine-tuning has a specific narrow role most people apply too broadly.
The single most common interview question on LLM productionization, and the most common mistake teams make in real life.
Side-by-side
| RAG | Fine-tuning | |
|---|---|---|
| What it changes | What the model SEES at query time | What the model has INTERNALIZED |
| Adds knowledge | ✅ Yes — adds documents to the index | ❌ Not really — even fine-tuning to "remember" is unreliable |
| Updates content | ✅ Add a chunk → queryable instantly | ❌ Re-train each time data changes |
| Citation of sources | ✅ Naturally — you know which chunk fed the answer | ❌ Model has "absorbed" the data; you can't point to where |
| Changes style / format | ⚠️ Possible via prompt | ✅ Native — change the model's output shape |
| Changes language register | ⚠️ Via prompt | ✅ Native |
| Inference cost | Same model + retrieval cost | Same |
| One-time training cost | None | $thousands per fine-tune (depending on data + base model) |
| Privacy of training data | Stays in your DB | Baked into model weights — can't un-bake |
| Hallucination risk | Low when prompt forces context-only answers | Higher — the model still hallucinates plausibly |
| Right for | Adding knowledge | Changing style / format / classification |
When fine-tuning is actually the right tool
- Specific output format / structure — "Always output JSON matching this schema." Doable via prompt but fine-tuning is more reliable for high-volume calls.
- Domain-specific terminology / register — A medical-document model, a legal-brief model, a code-review model with your company's style.
- Classification / extraction — "Given a customer email, return one of {complaint, query, praise, escalation}." Fine-tuning a small model is cheap and accurate.
- Distillation — Train a small (cheap) model to mimic GPT-4o's behavior on a narrow task.
In all four, fine-tuning shapes how the model responds, not what it knows.
When fine-tuning is the wrong tool (but tempting)
- "Let's fine-tune the model on our docs so it knows our company." → ❌ This is a RAG problem.
- "Let's fine-tune it on our 50,000 support tickets to learn our products." → ❌ RAG. The fine-tune absorbs patterns but won't reliably recall facts.
- "We need real-time prices." → ❌ Function-calling, not fine-tuning.
A fine-tuned model still hallucinates with confidence — there's no "I don't know" mechanism baked into the weights.
Hybrid: RAG + a small fine-tuned classifier
Common real-world pattern:
User input ─▶ Fine-tuned router classifier ─▶ Decides intent
│
┌───────────────────┼───────────────────┐
▼ ▼ ▼
greeting RAG (Q&A) function call
(template) (retrieve+gen) (real API)
Fine-tune a small / cheap model (gpt-4o-mini or a fine-tunable open model) for the router — fast, cheap classification. Then RAG handles knowledge queries; function-calling handles real-time data. Best of both.
Cost difference at scale
For 1M queries/month at 1000 tokens each:
| Pattern | Monthly cost (rough, Azure 2026) |
|---|---|
| Pure GPT-4o + RAG | ~₹50,000 |
| Fine-tuned GPT-4o + RAG | ~₹50,000 + one-time fine-tune ~₹50,000 |
| GPT-4o-mini + RAG | ~₹5,000 |
| Fine-tuned GPT-4o-mini + RAG | ~₹5,000 + one-time ~₹15,000 |
The biggest cost lever isn't fine-tuning. It's choosing a cheaper base model (mini) for tasks where the gap doesn't matter.
Common interview trap
"Our company wants to make an internal chatbot that knows our HR policies. Should we fine-tune the model on our handbook?"
Answer: No. Use RAG. The handbook is knowledge, not style. Fine-tuning would:
- Cost more
- Bake the data into a model file that's awkward to update
- Still hallucinate when the user asks about something not in the handbook
- Make it impossible to add citations back to source
Instead, chunk the handbook, embed it, store in Azure SQL VECTOR / Azure AI Search, retrieve the top-K relevant chunks per question, ask the LLM to answer using only those chunks with citations.
Interview-grade summary
"RAG adds knowledge at query time by retrieving relevant chunks and feeding them to the LLM. Fine-tuning changes how the model responds — its style, format, or classification behavior. For 95% of 'feed our data to AI' use cases, RAG is the right answer; fine-tuning would be the wrong answer that costs more and works worse. Fine-tune for style, format, or classification — never for adding knowledge."