How do you chunk documents effectively for RAG?

Question

Randhir Jassal · Accepted Answer

Chunking is the single biggest quality lever in a RAG system. Get it wrong and retrieval returns garbage no matter how good your embedding model is.

The core trade-off

Smaller chunks                                 Larger chunks
──────────────                                 ──────────────
✅ Higher precision                            ✅ More context per chunk
   (chunk matches query exactly)                  (LLM has fuller picture)
❌ Less context                                ❌ Less precise retrieval
   (LLM gets fragments, may misunderstand)         (one chunk drowns out diversity)
❌ More chunks per document                    ❌ Hits model context limits faster
   (more storage + slower search)                  (5 large chunks = full prompt budget)

The default that works for most prose

500-1000 tokens per chunk with 100-200 tokens of overlap between adjacent chunks.

500-1000 tokens ≈ 400-800 English words ≈ 2-4 paragraphs
Overlap prevents an answer that straddles chunk boundaries from being split

private static IEnumerable<string> ChunkText(string text, int chunkSize = 800, int overlap = 100)
{
    var words = text.Split(' ', StringSplitOptions.RemoveEmptyEntries);
    for (int i = 0; i < words.Length; i += chunkSize - overlap)
    {
        int len = Math.Min(chunkSize, words.Length - i);
        yield return string.Join(' ', words.Skip(i).Take(len));
        if (i + len >= words.Length) yield break;
    }
}

This is the "good enough" baseline. Move past it only if quality measurements demand it.

Recursive splitter — better quality for free

Naïve word-split breaks sentences mid-thought. A recursive splitter tries to split at natural boundaries in order:

1. Try splitting on double-newline ("\n\n") — paragraphs
2. If chunk still too big, split on single-newline ("\n") — lines
3. If still too big, split on sentences ("." "! "?)
4. If still too big, split on words

LangChain's RecursiveCharacterTextSplitter is the reference. In .NET, Semantic Kernel has equivalents. Quality jump is real — chunks are more coherent.

Structure-aware chunking — by far the best

Different content needs different chunking strategies:

Content type	Strategy
Markdown / docs	Split by headings (#, ##) — preserves section context
Code	Split by function / class boundaries — preserves syntactic units
Tables	Keep the whole table as one chunk if possible — splitting destroys meaning
PDFs with layout	Use Azure Document Intelligence layout model — preserves columns, tables, images
Q&A pairs	One Q+A = one chunk — never split them
Conversation transcripts	Chunk by speaker turn or topic shift

A 200-line markdown file has roughly:

30 # H1 sections
Hundreds of words

Chunking by H1 gives ~30 chunks. Each chunk has a self-contained section. Vector search now hits "did the user ask about THIS section" — much higher precision.

Always include metadata in the chunk header

Bare text:

"... refunds must be requested within 30 days ..."

vs metadata-prefixed:

[Section 4.2 "Return Policy" — page 12 of Employee Handbook v3]

... refunds must be requested within 30 days ...

The metadata prefix:

Helps the LLM cite sources correctly
Improves embedding quality — the chunk has more context
Lets you filter results post-retrieval (e.g. only return chunks from "Section 4.2")

var chunkWithContext = $"""
    [Document: {document.Title}, Section: {section.Heading}, Page: {page}]
    
    {chunkText}
    """;

Overlap — why it matters

Without overlap, if the answer spans the boundary between chunk 3 and chunk 4:

Chunk 3 ends mid-sentence: "...the policy applies when..."
Chunk 4 starts mid-sentence: "...the customer reports damage within 30 days."

Vector search may retrieve chunk 3 OR chunk 4, but neither has the full answer.

With 100-token overlap, both chunks contain the full sentence. Whichever is retrieved, the LLM gets the complete fact.

100-token overlap ≈ 10-15% of an 800-token chunk. Cost = ~15% more storage + ~15% more chunks to embed. Worth it.

Common chunking mistakes

Mistake 1 — Chunking by fixed character count

// ❌ Splits in the middle of words and sentences
for (int i = 0; i < text.Length; i += 1000)
    yield return text.Substring(i, Math.Min(1000, text.Length - i));

Use token-aware or word-aware splitting.

Mistake 2 — Chunks too small (< 200 tokens)

The LLM gets fragments and can't make sense of the question. Retrieval precision is high but the answer quality is low.

Mistake 3 — Chunks too large (> 1500 tokens)

You can fit fewer chunks in the LLM context window. With top-5 retrieval each at 1500 tokens, that's 7,500 tokens of context — half of GPT-4o's input budget. Less room for system prompt + question + multi-turn history.

Mistake 4 — Same chunking strategy for all content types

A legal contract and a code snippet need different chunking. Build a per-content-type strategy.

Mistake 5 — No metadata in the chunk

The LLM can't cite sources. The chunk lacks the broader context. Always include [Document, Section, Page] at the top.

Mistake 6 — Not measuring chunk quality

You can't improve what you don't measure. Build a golden set of 20-50 representative questions with expected answers. Run them through your pipeline whenever you change chunking. Track:

Retrieval recall: did the right chunk make it into top-K?
Answer quality: human-graded 1-5 against the expected answer

When to chunk by other means

Question-style FAQs — one Q+A = one chunk; never split
Tables — one table = one chunk; split only if extremely large
Multi-modal (PDFs with images) — use Azure Document Intelligence + describe images with a vision model + index the description

Interview-grade summary

"Default to 500-1000 token chunks with 100-token overlap, split via a recursive splitter that respects paragraph and sentence boundaries. For structured content (markdown, code, tables) use structure-aware chunking — by heading, by function, by row. Always prefix each chunk with its document and section metadata. Measure retrieval recall on a golden question set; iterate. Chunking is the single biggest quality lever in RAG — bigger than embedding model choice, bigger than re-ranking — get it right first."

How do you chunk documents effectively for RAG?

The core trade-off

The default that works for most prose

Recursive splitter — better quality for free

Structure-aware chunking — by far the best

Always include metadata in the chunk header

Overlap — why it matters

Common chunking mistakes

When to chunk by other means

Interview-grade summary

How do you chunk documents effectively for RAG?

The core trade-off

The default that works for most prose

Recursive splitter — better quality for free

Structure-aware chunking — by far the best

Always include metadata in the chunk header

Overlap — why it matters

Common chunking mistakes

When to chunk by other means

Interview-grade summary

Why does Python dominate AI/ML development — what are the real reasons?

Tokens, context windows, and the O(n²) attention cost — what every dev should know

LLM sampling parameters — temperature, top-p, top-k — when to tune each

Why does Python dominate AI/ML development — what are the real reasons?

Tokens, context windows, and the O(n²) attention cost — what every dev should know

LLM sampling parameters — temperature, top-p, top-k — when to tune each

Related questions

Related questions