Tokens, context windows, and the O(n²) attention cost — what every dev should know
Three things developers underestimate about LLMs: tokens aren't words, context windows are bounded, and doubling context quadruples compute. Get these wrong and you'll be confused by costs, latency, and weird truncation bugs.
Tokens vs words vs characters
LLMs see tokens — sub-word units learned during training via Byte-Pair Encoding (BPE). The vocabulary is typically 32k-200k tokens.
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
enc.encode("hello world")
# [15339, 1917] — 2 tokens
enc.encode("Bengaluru")
# [33, 1969, 24074] — 3 tokens
enc.encode("लर्निंग")
# [12060, 35930, 14013, 11574, 39429] — 5 tokens, expensive
enc.encode("```python\nfor i in range(10):\n print(i)\n```")
# 17 tokens
len(enc.encode("a" * 1000)) # 1000 'a's → ~250 tokens
Key observations:
- English words ≈ 0.75-1.3 tokens average
- Non-Latin scripts (Hindi, Chinese, Tamil) ≈ 2-5x more tokens per char than English
- Numbers, code, JSON tokenize less efficiently than prose
- You PAY per token — non-English content is more expensive
Context windows
Each model has a maximum context size — the total tokens (input + output) it can process in one call.
| Model | Context window |
|---|---|
| GPT-3.5-turbo (old) | 4,096 |
| GPT-4 (original) | 8,192 |
| GPT-4o | 128,000 |
| Claude 3.5 Sonnet | 200,000 |
| Gemini 1.5 Pro | 2,000,000 |
| Llama 3.1 | 128,000 |
| Mistral Large | 128,000 |
128k tokens sounds enormous. In practice it means ~96k words ≈ a short novel. Easy to hit if you're packing many documents into the context:
- System prompt: 500 tokens
- Chat history (20 turns): 10,000 tokens
- 10 retrieved RAG chunks: 6,000 tokens
- User question: 100 tokens
- Reserve for response: 4,000 tokens
- Total: 20,600 tokens — well under, but watch it grow
If your application packs heavy context (legal docs, code repos), you'll hit the limit faster than you expect.
The O(n²) cost
Self-attention computes how much each token should pay attention to every other token. For n tokens, that's n × n score calculations.
- 1,000 tokens → 1M scores
- 10,000 tokens → 100M scores
- 100,000 tokens → 10 BILLION scores
Cost scales quadratically. Doubling context = 4× the compute. This is why:
- 128k-context queries are SLOW — multi-second first-token latency
- They're EXPENSIVE — input pricing reflects this
- Very long context models have inference tricks (sliding window attention, sparse attention, ring attention) but quadratic in the base case
Practical implications
1. Watch your prompt token count
def estimate_tokens(text: str, model: str = "gpt-4o") -> int:
enc = tiktoken.encoding_for_model(model)
return len(enc.encode(text))
# Before sending — verify you'll fit
total = estimate_tokens(system_prompt) + estimate_tokens(user_prompt)
if total + max_response_tokens > MODEL_CONTEXT_WINDOW:
raise ValueError(f"Prompt too long: {total} tokens")
2. Truncate / summarize old chat history
For long-running conversations, drop or summarize old turns:
def fit_history(messages: list[dict], budget_tokens: int) -> list[dict]:
# Keep system + last N messages that fit
system = [m for m in messages if m["role"] == "system"]
rest = [m for m in messages if m["role"] != "system"]
kept_rest = []
tokens_used = sum(estimate_tokens(m["content"]) for m in system)
for msg in reversed(rest):
msg_tokens = estimate_tokens(msg["content"])
if tokens_used + msg_tokens > budget_tokens:
break
kept_rest.insert(0, msg)
tokens_used += msg_tokens
return system + kept_rest
3. "Lost in the middle" — long context isn't fully usable
Even with 128k tokens, models attend more to the START and END of context. Bury an important fact in the middle and the model may miss it.
Workaround: put critical context up front or duplicate it at the end. For RAG, sort retrieved chunks by relevance and put most relevant near the beginning AND the user query at the end.
4. Streaming reduces perceived latency, not real latency
A 500-token response takes 3-5 seconds with GPT-4o. Streaming shows tokens as they generate — first appears in ~300ms. Total time unchanged, but UX dramatically better.
5. KV-cache makes per-token inference fast
Naively, generating the 1000th token would re-process all 999 previous tokens — O(n²) per step. The KV-cache stores Keys and Values from past tokens, so each new token is O(n) instead.
Memory cost: tens to hundreds of GB for long-context large models. This is why production inference systems care deeply about KV-cache management (PagedAttention in vLLM is built around this).
Cost math example
For a typical RAG application: 5 chunks × 600 tokens + system prompt + question = ~3,500 input tokens. Response = ~500 tokens.
| Model | Input cost per 1M | Output cost per 1M | Cost per call |
|---|---|---|---|
| GPT-4o | $5 | $15 | ~$0.025 |
| GPT-4o-mini | $0.15 | $0.60 | ~$0.0008 (~30x cheaper) |
| Claude 3.5 Sonnet | $3 | $15 | ~$0.018 |
10,000 calls/day with GPT-4o-mini = ~$8/day ≈ ₹680/day. With GPT-4o = ~$250/day ≈ ₹21,000/day.
Model choice matters enormously at scale.
Common interview follow-ups
"Why is the context window finite if more is always better?"
Three reasons: 1) attention is O(n²) — compute and memory cost; 2) training data of very long sequences is scarce; 3) "lost in the middle" — models can't usefully attend to all tokens equally at huge contexts. The 2M-token Gemini window is impressive but real-world usefulness drops past ~32k for most tasks.
"How do you handle a document larger than the context window?"
Three patterns:
- RAG (most common) — chunk + embed + retrieve top-K, only the relevant chunks enter the prompt
- Iterative summarization — summarize chunk by chunk, then summarize the summaries
- Sliding window — process overlapping windows, combine outputs
Interview-grade summary
"Tokens are sub-word units; pricing and context limits are token-based, not word-based. Context windows are finite (128k-2M tokens depending on model) and you must budget for prompt, history, retrieved context, AND reserved response space. Attention is O(n²) — doubling context quadruples compute. Watch out for the 'lost in the middle' effect on long contexts. For Q&A over data that doesn't fit, use RAG to retrieve relevant chunks instead of stuffing everything into the prompt."