How do you chunk documents effectively for RAG?
Chunking is the single biggest quality lever in a RAG system. Get it wrong and retrieval returns garbage no matter how good your embedding model is.
The core trade-off
| Smaller chunks | Larger chunks |
|---|---|
| ✅ Higher precision (chunk matches query exactly) | ✅ More context per chunk (LLM has fuller picture) |
| ❌ Less context (LLM gets fragments, may misunderstand) | ❌ Less precise retrieval (one chunk drowns out diversity) |
| ❌ More chunks per document (more storage + slower search) | ❌ Hits model context limits faster (5 large chunks = full prompt budget) |
The default that works for most prose
500-1000 tokens per chunk with 100-200 tokens of overlap between adjacent chunks.
- 500-1000 tokens ≈ 400-800 English words ≈ 2-4 paragraphs
- Overlap prevents an answer that straddles chunk boundaries from being split
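// Word-based approximation: chunkSize and overlap are counted in words, not tokens.
private static IEnumerable<string> ChunkText(string text, int chunkSize = 800, int overlap = 100)
{
    // A stride of zero or less would loop forever.
    if (overlap < 0 || overlap >= chunkSize)
        throw new ArgumentOutOfRangeException(nameof(overlap), "overlap must be in [0, chunkSize).");

    // An empty separator array splits on any whitespace, including newlines and tabs.
    var words = text.Split(Array.Empty<char>(), StringSplitOptions.RemoveEmptyEntries);
    for (int i = 0; i < words.Length; i += chunkSize - overlap)
    {
        int len = Math.Min(chunkSize, words.Length - i);
        yield return string.Join(' ', words, i, len);
        if (i + len >= words.Length) yield break; // last chunk reached the end of the text
    }
}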
This is the "good enough" baseline. Move past it only if quality measurements demand it.
Recursive splitter — better quality for free
Naïve word-split breaks sentences mid-thought. A recursive splitter tries to split at natural boundaries in order:
1. Try splitting on double-newline ("\n\n") — paragraphs
2. If chunk still too big, split on single-newline ("\n") — lines
3. If still too big, split on sentences (".", "!", "?")
4. If still too big, split on words
LangChain's RecursiveCharacterTextSplitter is the reference implementation. In .NET, Semantic Kernel's TextChunker provides comparable line- and paragraph-based splitting. The quality jump is real: chunks are more coherent.
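The separator cascade above can be sketched in a few dozen lines. This is a simplified illustration, not LangChain's or Semantic Kernel's implementation: `RecursiveSplitter`, its separator list, and the word-count size limit (standing in for a real tokenizer) are all assumptions made for the example, and the sentence separators are tried one after another rather than as a single step.

```csharp
using System;
using System.Collections.Generic;

public static class RecursiveSplitter
{
    // Tried in order: paragraphs, lines, sentence ends, then single words.
    private static readonly string[] Separators = { "\n\n", "\n", ". ", "! ", "? ", " " };

    // Splits text into chunks of at most maxWords words (words approximate tokens here).
    public static List<string> Split(string text, int maxWords, int level = 0)
    {
        if (maxWords < 1) throw new ArgumentOutOfRangeException(nameof(maxWords));

        // Base case: the text already fits, or there is no finer separator left.
        if (CountWords(text) <= maxWords || level >= Separators.Length)
        {
            var single = new List<string>();
            if (text.Trim().Length > 0) single.Add(text.Trim());
            return single;
        }

        var chunks = new List<string>();
        string current = "";
        foreach (var piece in SplitKeeping(text, Separators[level]))
        {
            // Greedily pack adjacent pieces while they still fit in one chunk.
            if (CountWords(current + piece) <= maxWords) { current += piece; continue; }

            if (current.Trim().Length > 0) chunks.Add(current.Trim());
            current = "";

            if (CountWords(piece) <= maxWords) current = piece;
            else chunks.AddRange(Split(piece, maxWords, level + 1)); // too big: try the next separator
        }
        if (current.Trim().Length > 0) chunks.Add(current.Trim());
        return chunks;
    }

    // Splits on sep but keeps the separator attached to the preceding piece.
    private static IEnumerable<string> SplitKeeping(string text, string sep)
    {
        int start = 0;
        while (start < text.Length)
        {
            int idx = text.IndexOf(sep, start, StringComparison.Ordinal);
            int end = idx < 0 ? text.Length : idx + sep.Length;
            yield return text.Substring(start, end - start);
            start = end;
        }
    }

    private static int CountWords(string text) =>
        text.Split(Array.Empty<char>(), StringSplitOptions.RemoveEmptyEntries).Length;
}
```

With a two-paragraph input of four three-word sentences, `Split(text, 6)` returns one chunk per paragraph and `Split(text, 3)` returns one chunk per sentence. A production splitter would also add overlap and count real tokens.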
Structure-aware chunking — by far the best
Different content needs different chunking strategies:
| Content type | Strategy |
|---|---|
| Markdown / docs | Split by headings (#, ##) — preserves section context |
| Code | Split by function / class boundaries — preserves syntactic units |
| Tables | Keep the whole table as one chunk if possible — splitting destroys meaning |
| PDFs with layout | Use Azure Document Intelligence layout model — preserves columns, tables, images |
| Q&A pairs | One Q+A = one chunk — never split them |
| Conversation transcripts | Chunk by speaker turn or topic shift |
A 200-line markdown file has roughly:
- 30 H1 (#) sections
- Hundreds of words
Chunking by H1 gives ~30 chunks, each a self-contained section. Vector search now effectively answers "did the user ask about THIS section?", which gives much higher precision.
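Heading-based splitting for markdown can be sketched as follows. `MarkdownChunker` and its `maxLevel` parameter are hypothetical names for this example; it assumes ATX-style headings ("#" followed by a space) and fenced code blocks, and ignores setext headings.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class MarkdownChunker
{
    // Returns one (Heading, Body) pair per heading of level 1..maxLevel.
    // Text before the first heading is returned with an empty heading.
    public static List<(string Heading, string Body)> SplitByHeading(string markdown, int maxLevel = 2)
    {
        var sections = new List<(string Heading, string Body)>();
        string heading = "";
        var body = new StringBuilder();
        bool inFence = false;

        foreach (var rawLine in markdown.Split('\n'))
        {
            var line = rawLine.TrimEnd('\r');

            // A '#' inside a fenced code block is a comment, not a heading.
            if (line.TrimStart().StartsWith("```", StringComparison.Ordinal)) inFence = !inFence;

            int level = inFence ? 0 : HeadingLevel(line);
            if (level >= 1 && level <= maxLevel)
            {
                Flush(sections, heading, body);      // close the previous section
                heading = line.Substring(level).Trim();
                continue;
            }
            body.AppendLine(line);
        }
        Flush(sections, heading, body);
        return sections;
    }

    // Counts leading '#' characters; returns 0 unless they form a valid ATX heading.
    private static int HeadingLevel(string line)
    {
        int n = 0;
        while (n < line.Length && line[n] == '#') n++;
        return (n >= 1 && n <= 6 && n < line.Length && line[n] == ' ') ? n : 0;
    }

    private static void Flush(List<(string Heading, string Body)> sections, string heading, StringBuilder body)
    {
        var text = body.ToString().Trim();
        if (heading.Length > 0 || text.Length > 0) sections.Add((heading, text));
        body.Clear();
    }
}
```

Sections that still exceed the size limit can be passed through a size-based splitter afterwards, keeping the heading as metadata for every resulting chunk.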
Always include metadata in the chunk header
Bare text:
"... refunds must be requested within 30 days ..."
vs metadata-prefixed:
[Section 4.2 "Return Policy" — page 12 of Employee Handbook v3]
... refunds must be requested within 30 days ...
The metadata prefix:
- Helps the LLM cite sources correctly
- Improves embedding quality — the chunk has more context
- Lets you filter results post-retrieval (e.g. only return chunks from "Section 4.2")
var chunkWithContext = $"""
[Document: {document.Title}, Section: {section.Heading}, Page: {page}]
{chunkText}
""";
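Post-retrieval filtering is easier when the metadata is also kept in structured fields rather than only inside the text. A minimal sketch, assuming a hypothetical `Chunk` record and `ChunkFilter` helper (neither belongs to any particular vector store SDK):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical chunk shape: metadata is stored as fields and also rendered into the text.
public record Chunk(string Document, string Section, int Page, string Text)
{
    // The string that gets embedded and later shown to the LLM.
    public string ToPrompt() => $"[Document: {Document}, Section: {Section}, Page: {Page}]\n{Text}";
}

public static class ChunkFilter
{
    // Keeps only the retrieved chunks that belong to the requested section.
    public static List<Chunk> InSection(IEnumerable<Chunk> retrieved, string section) =>
        retrieved.Where(c => c.Section.Equals(section, StringComparison.OrdinalIgnoreCase)).ToList();
}
```

Most vector stores can also apply such a filter at query time, which avoids spending top-K slots on chunks that would be discarded anyway.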
Overlap — why it matters
Without overlap, if the answer spans the boundary between chunk 3 and chunk 4:
- Chunk 3 ends mid-sentence: "...the policy applies when..."
- Chunk 4 starts mid-sentence: "...the customer reports damage within 30 days."
Vector search may retrieve chunk 3 OR chunk 4, but neither has the full answer.
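With a 100-token overlap, the last 100 tokens of chunk 3 are repeated at the start of chunk 4, so any sentence shorter than the overlap appears whole in at least one of them. Retrieving that chunk gives the LLM the complete fact.
A 100-token overlap is 12.5% of an 800-token chunk. The stride between chunk starts drops from 800 to 700 tokens, so you store and embed roughly 14% more chunks. Worth it.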
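The cost of overlap can be checked with a little arithmetic. The helper below is an illustrative sketch (`OverlapMath.ChunkCount` is a made-up name) that counts the chunks produced by a sliding window of `chunkSize` tokens whose start advances by `chunkSize - overlap` tokens.

```csharp
using System;

public static class OverlapMath
{
    // Number of chunks a sliding window produces over totalTokens tokens.
    public static int ChunkCount(int totalTokens, int chunkSize, int overlap)
    {
        if (chunkSize < 1) throw new ArgumentOutOfRangeException(nameof(chunkSize));
        if (overlap < 0 || overlap >= chunkSize) throw new ArgumentOutOfRangeException(nameof(overlap));
        if (totalTokens <= 0) return 0;
        if (totalTokens <= chunkSize) return 1;

        int stride = chunkSize - overlap;
        // One full first chunk, then ceil(remaining / stride) further chunks.
        return 1 + (totalTokens - chunkSize + stride - 1) / stride;
    }
}
```

For a 56,000-token document with 800-token chunks, this gives 70 chunks without overlap and 80 chunks with a 100-token overlap, an increase of about 14%.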
Common chunking mistakes
Mistake 1 — Chunking by fixed character count
// ❌ Splits in the middle of words and sentences
for (int i = 0; i < text.Length; i += 1000)
yield return text.Substring(i, Math.Min(1000, text.Length - i));
Use token-aware or word-aware splitting.
Mistake 2 — Chunks too small (< 200 tokens)
The LLM gets fragments that lack the surrounding context it needs to answer the question. Retrieval precision is high but answer quality is low.
Mistake 3 — Chunks too large (> 1500 tokens)
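You can fit fewer chunks in the LLM context window. With top-5 retrieval at 1,500 tokens each, that's 7,500 tokens of context per request. That fits comfortably within GPT-4o's 128K-token context window, but it adds cost and latency to every call, buries the relevant passage in surrounding text, and eats into any tighter prompt budget you set, leaving less room for the system prompt, the question, and multi-turn history.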
Mistake 4 — Same chunking strategy for all content types
A legal contract and a code snippet need different chunking. Build a per-content-type strategy.
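One way to organize a per-content-type strategy is a small dispatch table. The sketch below is illustrative only: `ContentType`, `ChunkRouter`, and the placeholder strategies are hypothetical, and each lambda would be replaced by a real splitter for that content type.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum ContentType { Prose, Table, QaPair }

public static class ChunkRouter
{
    // Placeholder strategies; swap in real splitters per content type.
    private static readonly Dictionary<ContentType, Func<string, IEnumerable<string>>> Strategies = new()
    {
        [ContentType.Table]  = text => new[] { text },   // keep the whole table together
        [ContentType.QaPair] = text => new[] { text },   // one Q+A = one chunk
        [ContentType.Prose]  = text => text.Split("\n\n",
            StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries), // paragraphs
    };

    // Routes the text to the strategy registered for its content type.
    public static List<string> Chunk(string text, ContentType type) =>
        Strategies.TryGetValue(type, out var strategy)
            ? strategy(text).ToList()
            : throw new NotSupportedException($"No chunking strategy registered for {type}.");
}
```

Keeping the mapping in one place makes it easy to add a strategy for a new content type and to test each strategy against the golden question set in isolation.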
Mistake 5 — No metadata in the chunk
The LLM can't cite sources. The chunk lacks the broader context. Always include [Document, Section, Page] at the top.
Mistake 6 — Not measuring chunk quality
You can't improve what you don't measure. Build a golden set of 20-50 representative questions with expected answers. Run them through your pipeline whenever you change chunking. Track:
- Retrieval recall: did the right chunk make it into top-K?
- Answer quality: human-graded 1-5 against the expected answer
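Retrieval recall is straightforward to automate. The sketch below assumes each golden question is labeled with the ID of the chunk that contains its answer; `GoldenQuestion`, `RetrievalEval`, and the `retrieve` delegate (your own search function returning ranked chunk IDs) are hypothetical names for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One golden-set entry: a question plus the ID of the chunk that should be retrieved for it.
public record GoldenQuestion(string Question, string ExpectedChunkId);

public static class RetrievalEval
{
    // Fraction of questions whose expected chunk appears in the top-k results.
    // retrieve: (question, k) -> chunk IDs ranked best first.
    public static double RecallAtK(
        IEnumerable<GoldenQuestion> goldenSet,
        Func<string, int, IReadOnlyList<string>> retrieve,
        int k = 5)
    {
        var questions = goldenSet.ToList();
        if (questions.Count == 0) throw new ArgumentException("Golden set is empty.", nameof(goldenSet));

        int hits = questions.Count(q => retrieve(q.Question, k).Take(k).Contains(q.ExpectedChunkId));
        return (double)hits / questions.Count;
    }
}
```

Chunk IDs change whenever the chunking strategy changes, so in practice the expected label may need to be a source span or an answer string that is matched against the retrieved text instead.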
When to chunk by other means
- Question-style FAQs — one Q+A = one chunk; never split
- Tables — one table = one chunk; split only if extremely large
- Multi-modal (PDFs with images) — use Azure Document Intelligence + describe images with a vision model + index the description
Interview-grade summary
"Default to 500-1000 token chunks with 100-200 tokens of overlap, split via a recursive splitter that respects paragraph and sentence boundaries. For structured content (markdown, code, tables) use structure-aware chunking: by heading, by function, and with each table kept whole. Always prefix each chunk with its document and section metadata. Measure retrieval recall on a golden question set and iterate. Chunking is the single biggest quality lever in RAG, bigger than embedding model choice and bigger than re-ranking, so get it right first."