Tokens, context windows, and the O(n²) attention cost — what every dev should know

Question

Randhir Jassal · Accepted Answer

Three things developers underestimate about LLMs: tokens aren't words, context windows are bounded, and doubling context quadruples compute. Get these wrong and you'll be confused by costs, latency, and weird truncation bugs. Tokens vs words vs characters LLMs see tokens — sub-word units learned during training via Byte-Pair Encoding (BPE). The vocabulary is typically 32k-200k tokens. python
for i in range(10):
 print(i)
 Key observations: - English words ≈ 0.75-1.3 tokens average - Non-Latin scripts (Hindi, Chinese, Tamil) ≈ 2-5x more tokens per char than English - Numbers, code, JSON tokenize less efficiently than prose - You PAY per token — non-English content is more expensive Context windows Each model has a maximum context size — the total tokens (input + output) it can process in one call. | Model | Context window | |---|---| | GPT-3.5-turbo (old) | 4,096 | | GPT-4 (original) | 8,192 | | GPT-4o | 128,000 | | Claude 3.5 Sonnet | 200,000 | | Gemini 1.5 Pro | 2,000,000 | | Llama 3.1 | 128,000 | | Mistral Large | 128,000 | 128k tokens sounds enormous. In practice it means 96k words ≈ a short novel. Easy to hit if you're packing many documents into the context: - System prompt: 500 tokens - Chat history (20 turns): 10,000 tokens - 10 retrieved RAG chunks: 6,000 tokens - User question: 100 tokens - Reserve for response: 4,000 tokens - Total: 20,600 tokens — well under, but watch it grow If your application packs heavy context (legal docs, code repos), you'll hit the limit faster than you expect. The O(n²) cost Self-attention computes how much each token should pay attention to every other token. For n tokens, that's n × n score calculations. - 1,000 tokens → 1M scores - 10,000 tokens → 100M scores - 100,000 tokens → 10 BILLION scores Cost scales quadratically. Doubling context = 4× the compute. This is why: - 128k-context queries are SLOW — multi-second first-token latency - They're EXPENSIVE — input pricing reflects this - Very long context models have inference tricks (sliding window attention, sparse attention, ring attention) but quadratic in the base case Practical implications 1. Watch your prompt token count 2. Truncate / summarize old chat history For long-running conversations, drop or summarize old turns: 3. "Lost in the middle" — long context isn't fully usable Even with 128k tokens, models attend more to the START and END of context. Bury an important fact in the middle and the model may miss it. Workaround: put critical context up front or duplicate it at the end. For RAG, sort retrieved chunks by relevance and put most relevant near the beginning AND the user query at the end. 4. Streaming reduces perceived latency, not real latency A 500-token response takes 3-5 seconds with GPT-4o. Streaming shows tokens as they generate — first appears in 300ms. Total time unchanged, but UX dramatically better. 5. KV-cache makes per-token inference fast Naively, generating the 1000th token would re-process all 999 previous tokens — O(n²) per step. The KV-cache stores Keys and Values from past tokens, so each new token is O(n) instead. Memory cost: tens to hundreds of GB for long-context large models. This is why production inference systems care deeply about KV-cache management (PagedAttention in vLLM is built around this). Cost math example For a typical RAG application: 5 chunks × 600 tokens + system prompt + question = 3,500 input tokens. Response = 500 tokens. | Model | Input cost per 1M | Output cost per 1M | Cost per call | |---|---|---|---| | GPT-4o | $5 | $15 | $0.025 | | GPT-4o-mini | $0.15 | $0.60 | $0.0008 (30x cheaper) | | Claude 3.5 Sonnet | $3 | $15 | $0.018 | 10,000 calls/day with GPT-4o-mini = $8/day ≈ ₹680/day. With GPT-4o = $250/day ≈ ₹21,000/day. Model choice matters enormously at scale. Common interview follow-ups "Why is the context window finite if more is always better?" Three reasons: 1) attention is O(n²) — compute and memory cost; 2) training data of very long sequences is scarce; 3) "lost in the…

Tokens, context windows, and the O(n²) attention cost — what every dev should know

Tokens vs words vs characters

Context windows

The O(n²) cost

Practical implications

1. Watch your prompt token count

2. Truncate / summarize old chat history

3. "Lost in the middle" — long context isn't fully usable

4. Streaming reduces perceived latency, not real latency

5. KV-cache makes per-token inference fast

Cost math example

Common interview follow-ups

Interview-grade summary

Tokens, context windows, and the O(n²) attention cost — what every dev should know

Tokens vs words vs characters

Context windows

The O(n²) cost

Practical implications

1. Watch your prompt token count

2. Truncate / summarize old chat history

3. "Lost in the middle" — long context isn't fully usable

4. Streaming reduces perceived latency, not real latency

5. KV-cache makes per-token inference fast

Cost math example

Common interview follow-ups

Interview-grade summary

Why does Python dominate AI/ML development — what are the real reasons?

LLM sampling parameters — temperature, top-p, top-k — when to tune each

What is a Large Language Model (LLM) and how does it work internally?

Why does Python dominate AI/ML development — what are the real reasons?

LLM sampling parameters — temperature, top-p, top-k — when to tune each

What is a Large Language Model (LLM) and how does it work internally?

Model	Context window
GPT-3.5-turbo (old)	4,096
GPT-4 (original)	8,192
GPT-4o	128,000
Claude 3.5 Sonnet	200,000
Gemini 1.5 Pro	2,000,000
Llama 3.1	128,000
Mistral Large	128,000

Model	Input cost per 1M	Output cost per 1M	Cost per call
GPT-4o	$5	$15	~$0.025
GPT-4o-mini	$0.15	$0.60	~$0.0008 (~30x cheaper)
Claude 3.5 Sonnet	$3	$15	~$0.018

Related questions

Related questions