LLM Foundations — How Large Language Models Actually Work (with Python Code)
A complete foundational guide to LLMs: what they are, the problems they solve, how transformers + attention + tokenization + sampling actually work internally, why Python dominates the AI ecosystem, real production code for every use case (chat, streaming, embeddings, tool-calling, local inference, RAG), and the pitfalls that ruin LLM applications.
- Author: Randhir Jassal
- Reading time: 28 min read
Large Language Models (LLMs) — GPT-4, Claude, Llama, Mistral, Gemini — are the foundation of every serious AI application built in the last two years. But for most developers they remain a black box: "send a prompt, get text back, hope it's right."
This guide is the complete foundational picture: what problems LLMs solve, why they're built in Python, how they actually work internally (tokenization, transformers, attention, sampling), and real production-grade Python code for every common use case. By the end, the API call client.chat.completions.create(...) will no longer be a mystery — you'll know exactly what happens at each layer between your prompt and the response.
The problems LLMs solve
Before LLMs, every NLP task needed its own model:
- Sentiment analysis → train a classifier on labeled reviews
- Translation → train a Seq2Seq model per language pair
- Summarization → train an encoder-decoder on article+summary pairs
- Q&A → train a span-extraction model on SQuAD
- Code completion → train a model on code corpora
Each model required thousands of labeled examples, weeks of training, and worked only on its narrow task. Adding a new task = new model.
LLMs replaced this with a single general-purpose model that handles them all via natural-language instructions.
# Same model. Different prompts. Different tasks.
client.chat.completions.create(model="gpt-4o", messages=[
{"role": "user", "content": "Translate to French: 'Where is the station?'"}
])
# → "Où est la gare?"
client.chat.completions.create(model="gpt-4o", messages=[
{"role": "user", "content": "Summarize this article in 2 sentences: ..."}
])
# → 2-sentence summary
client.chat.completions.create(model="gpt-4o", messages=[
{"role": "user", "content": "Extract the company name, amount and date from: 'Invoice #INV-2024 from Acme Corp for ₹45,000 dated 2026-05-22'"}
])
# → {"company": "Acme Corp", "amount": 45000, "date": "2026-05-22"}
One model, dozens of tasks, no per-task training data. That's the LLM revolution in one paragraph.
Beyond replacing classic NLP, LLMs unlocked entirely new capabilities:
| Capability | Pre-LLM era | LLM era |
|---|---|---|
| Code generation | Templates / autocomplete | "Write a Python function that..." |
| Document Q&A | Keyword search + extraction | Natural-language conversation with citations |
| Agents / tool use | Custom finite-state machines | "Decide which API to call" |
| Structured extraction | Regex + custom parsers | Schema-aware JSON output |
| Creative writing | Pure human | Drafts, edits, brainstorms |
| Multi-modal understanding | Separate vision + text models | Single model: image + text → answer |
Why Python dominates AI/ML development
Almost every AI library, model, framework, and tool ships Python-first. Reasons, in order of importance:
1. The library ecosystem is unmatched
| Library | What it does | Why it's in Python |
|---|---|---|
| PyTorch | Defines and trains neural networks | Researcher-friendly tensor API |
| TensorFlow / Keras | Same — Google's stack | Same |
| transformers (HuggingFace) | 1M+ pre-trained models, one API | Wraps PyTorch + TF + ONNX |
| NumPy | Numerical arrays (the foundation) | C-speed core, Python-ergonomic |
| pandas | Tabular data manipulation | The "Excel for code" of AI |
| scikit-learn | Classical ML (regression, trees, etc.) | Decade of refinement |
| sentence-transformers | Embeddings for semantic search | Built on transformers |
| LangChain / LlamaIndex | LLM orchestration frameworks | Python-first ecosystems |
| tiktoken | Fast tokenization for OpenAI models | OpenAI's reference impl |
| vLLM / llama.cpp | Fast local LLM inference | Python bindings + C++ core |
Every research paper publishes Python code. Every model on HuggingFace ships with Python loaders. Every cloud AI service has a Python SDK as the first-class client.
2. Researcher → production has the same language
A researcher trains a model in a Jupyter notebook. A platform engineer ships it to production. Both use Python. No "research code in Python, prod code in C++" handoff like the bad old days. The same model.generate(...) call works in both places.
3. Numeric arrays are first-class
Python (with NumPy) has the cleanest API for n-dimensional arrays of any mainstream language. tensor[batch, head, seq, dim] reads naturally. Slicing, broadcasting, einsum — all expressive.
In .NET / Java you can do the same math, but with verbose loops or third-party wrappers. The vocabulary of AI is multi-dimensional tensors; Python speaks it natively.
4. C/C++ underneath where speed matters
When people say "Python is slow", they're right — but it doesn't matter. The heavy compute (GPU operations, matrix multiplication, attention) runs in CUDA / C++ kernels. Python just orchestrates. It's the "glue" language; the math runs at C speed.
5. Community, papers, weights, tutorials — all Python
Search "fine-tune Llama 3 example" → 95% of results are Python notebooks. "Run Stable Diffusion locally" → diffusers, Python. The community gravity is enormous and self-reinforcing.
When Python isn't the right choice
- High-throughput inference services at the request/response layer — use Go or Rust for the API, call into Python or ONNX for the actual model
- Mobile / embedded inference — ONNX Runtime, TensorFlow Lite, Core ML; no Python on the device
- Real-time game / robotics — C++ for hard latency requirements
- Enterprise integration — .NET/Java where the rest of the system lives; call the LLM via HTTP
But for building, training, fine-tuning, evaluating, and prototyping — Python is unrivaled.
How LLMs actually work — the internal architecture
How an LLM Generates Text
─────────────────────────
Input prompt: "The capital of France is"
           │
           ▼
┌─────────────────────┐
│ 1. Tokenizer        │  Byte-pair encoding (BPE)
│                     │  ["The", " capital", " of", " France", " is"]
│                     │  → [464, 5963, 286, 4881, 318]
└──────────┬──────────┘
           │ token IDs (integers)
           ▼
┌─────────────────────┐
│ 2. Embeddings       │  Each token → high-dim vector
│ + Positional info   │  shape: [seq_len, d_model]
│                     │  (e.g. d_model = 4096 for Llama 3 8B)
└──────────┬──────────┘
           │
           ▼
┌─────────────────────────────────────────────────┐
│ 3. Transformer Blocks (×N layers, 32 to 80)     │
│                                                 │
│  ┌────────────────────────────────┐             │
│  │ Self-Attention (multi-head)    │             │
│  │ Q · Kᵀ → softmax → · V         │ ← O(n²)     │
│  │ "which past tokens matter      │   cost on   │
│  │  for predicting next?"         │   context   │
│  └──────────┬─────────────────────┘             │
│             │                                   │
│    residual + LayerNorm                         │
│             │                                   │
│             ▼                                   │
│  ┌────────────────────────────────┐             │
│  │ Feed-Forward (MLP)             │             │
│  │ d_model → 4·d_model → d_model  │             │
│  │ "what does this token mean     │             │
│  │  given its context?"           │             │
│  └──────────┬─────────────────────┘             │
│             │                                   │
│    residual + LayerNorm                         │
│             │                                   │
└─────────────┼───────────────────────────────────┘
              ▼
┌─────────────────────┐
│ 4. Output Head      │  Linear: d_model → vocab_size
│                     │  logits: [vocab_size] unnormalized scores (softmax → probabilities)
│                     │  (vocab ≈ 32k-200k tokens depending on model)
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ 5. Sampling         │  Temperature, top-k, top-p, greedy
│                     │  Pick next token: " Paris"
└──────────┬──────────┘
           │
           ▼
Append " Paris" to the token sequence. Repeat from step 2.
Stop on EOS token or max_tokens limit.
Final: "The capital of France is Paris."
Now let's walk through each step with real Python code.
Step 1 — Tokenization
LLMs don't see characters or words — they see tokens, which are sub-word units learned during training. GPT-4 has a vocabulary of ~100,000 tokens; Llama 3 has 128,256.
# Install: pip install tiktoken
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
text = "Building LLM applications is fun!"
tokens = enc.encode(text)
print(tokens)
# e.g. [27418, 445, 11237, 8522, 374, 2523, 0]
# (illustrative: the exact IDs depend on which encoding the model name maps to)
# Decode back
print(enc.decode(tokens))
# 'Building LLM applications is fun!'
# Token counts matter: they determine cost AND whether the text fits in the context window
print(f"Token count: {len(tokens)}")
# e.g. Token count: 7
# Rule of thumb: 1 token ≈ 4 English characters ≈ 0.75 words
The key implications:
- You pay per token, not per character or word. A 100-word email is ~133 tokens.
- Context windows are token-limited. GPT-4o has 128k tokens (~96k words ≈ a short novel).
- Numbers, code, and non-English text tokenize less efficiently. "लर्निंग" (Hindi for "learning") might be 6 tokens; "learning" is 1.
For local / HuggingFace models, use the model's own tokenizer:
# pip install transformers
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokens = tok.encode("Building LLM applications is fun!")
print(tokens, tok.decode(tokens))
Why sub-word tokenization (BPE)? Words are too sparse (millions); characters are too granular (long sequences). BPE finds the sweet spot — common words become single tokens, rare ones get split into pieces. "internationalization" might tokenize as ["international", "ization"].
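To make the merge idea concrete, here is a toy sketch of how BPE can learn its vocabulary: start from single characters and repeatedly merge the most frequent adjacent pair. The function name `learn_bpe_merges`, the tiny corpus, and the first-seen tie-breaking are illustrative assumptions; production tokenizers operate on bytes and train on far larger corpora.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words (toy version)."""
    # Start with every word split into single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

words = ["low", "low", "lower", "lowest", "new", "newer"]
merges, corpus = learn_bpe_merges(words, num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent word "low" has become a single symbol, while rarer words such as "lowest" are still split into pieces.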
Step 2 — Embeddings + position
Each token ID is mapped to a high-dimensional vector (4096 dims for Llama 3 8B; 12288 for GPT-3). Tokens with similar meanings end up near each other in this vector space.
But raw embeddings don't carry order — the model also needs to know that "dog bit man" is different from "man bit dog". So positional encoding is added.
# Conceptual — what happens inside
import torch
vocab_size = 128256
d_model = 4096
embedding_table = torch.randn(vocab_size, d_model) # learned during training
token_ids = torch.tensor([464, 5963, 286, 4881, 318])
token_embeddings = embedding_table[token_ids] # shape: [5, 4096]
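# Positional information (modern models use RoPE, rotary position embeddings,
# which rotate Q and K inside attention instead of adding a vector here)
# This snippet only builds the position indices; the encoding itself is omitted
positions = torch.arange(len(token_ids))  # tensor([0, 1, 2, 3, 4])
# With additive encodings: input to first transformer block = token_embeddings + position_info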
After this step, the input to the first transformer layer is a [seq_len, d_model] matrix of vectors. Each row represents one token, in its position.
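As one concrete instance of "position info", the sketch below builds the classic sinusoidal table from the original Transformer design in plain Python. It is shown only because it is easy to read; RoPE-based models handle position differently. The function name and the small sizes are illustrative assumptions.

```python
import math

def sinusoidal_positions(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))  # even dimension
            row.append(math.cos(angle))  # odd dimension
        table.append(row[:d_model])      # trim if d_model is odd
    return table

pe = sinusoidal_positions(seq_len=5, d_model=8)
print(len(pe), len(pe[0]))  # 5 8
```

Each row would be added element-wise to the token embedding at the same position, so two identical tokens at different positions enter the first block with different vectors.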
Step 3 — Transformer blocks (the actual "brain")
This is where the magic happens. Each block has two stages:
3a — Self-attention
The signature operation of transformers. For each token, the model decides which other tokens to pay attention to when figuring out what comes next.
# Simplified single-head self-attention (PyTorch)
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
class SimpleAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        # x: [seq_len, d_model]
        Q = self.W_q(x)  # [seq_len, d_model]: "what am I looking for?"
        K = self.W_k(x)  # [seq_len, d_model]: "what do I have to offer?"
        V = self.W_v(x)  # [seq_len, d_model]: "what's my actual content?"
        # Attention scores: how relevant each token is to each other
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
        # shape: [seq_len, seq_len]
        # Causal mask: token at position i can only see tokens 0..i
        # (LLMs predict left-to-right; can't peek at the future)
        mask = torch.tril(torch.ones_like(scores))
        scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = F.softmax(scores, dim=-1)
        # weights[i][j] = "how much should token i pay attention to token j"
        return weights @ V  # [seq_len, d_model]: context-aware representation
The cost is O(n²) in the sequence length n — each token must compute scores against every other token. This is why context windows are bounded; doubling context = 4× the attention compute.
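The quadratic growth is easy to check with a few lines of arithmetic. This sketch simply counts how many query-key scores one attention head computes for one sequence; the function name is illustrative, and it ignores constant factors such as head count and hidden size.

```python
def attention_score_entries(seq_len, causal=True):
    """Number of query-key scores one attention head computes for one sequence."""
    if causal:
        # Token i attends to tokens 0..i, so row i has i + 1 entries.
        return seq_len * (seq_len + 1) // 2
    return seq_len * seq_len

for n in (1_000, 2_000, 4_000):
    print(n, attention_score_entries(n, causal=False))
# 1000 1000000
# 2000 4000000
# 4000 16000000
```

The causal mask roughly halves the count, but the growth is still quadratic: doubling the sequence length quadruples the work.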
Multi-head attention runs multiple parallel attention operations (typically 32-128 "heads"), each with its own learned focus. One head might learn syntactic relationships, another semantic ones, another long-range coreference. The outputs are concatenated and projected.
3b — Feed-Forward Network (MLP)
After attention pools context across tokens, the MLP transforms each token individually with a non-linear function.
class SimpleMLP(nn.Module):
    def __init__(self, d_model, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model  # typical ratio
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # GELU activation; modern models use SwiGLU or similar
        return self.down(F.gelu(self.up(x)))
Interpretability research suggests that much of the model's factual knowledge lives in the MLP weights: facts like "the capital of France is Paris" appear to be stored there. As a rough mental model, attention moves information between tokens, and the MLP recalls and applies world knowledge.
Both stages have residual connections (x + sublayer(x)) and LayerNorm (or RMSNorm) — together they keep gradients well-behaved during training.
A typical model stacks 32-80 of these blocks. The output is a [seq_len, d_model] tensor where each row is a deep contextual representation of that token.
Step 4 — Output head + 5. Sampling
The final layer projects each token's representation to one score (a logit) per vocabulary entry; softmax then turns those scores into a probability distribution over the entire vocabulary.
# Output: logits over the vocabulary
output_head = nn.Linear(d_model, vocab_size)
logits = output_head(final_hidden) # shape: [seq_len, vocab_size]
# We only care about predicting the NEXT token (the last position)
next_token_logits = logits[-1] # shape: [vocab_size]
probs = F.softmax(next_token_logits, dim=-1)
Now we sample the next token. Sampling strategy = the most important inference-time knob.
# Greedy — always pick the most likely token (deterministic, often boring)
next_token = torch.argmax(probs)
# Temperature — flatten or sharpen the distribution
# temp < 1: more confident, more repetitive
# temp = 1: original distribution
# temp > 1: more random, more creative
def with_temperature(logits, temperature):
    # temperature must be > 0; treat temperature == 0 as greedy (argmax) instead
    return logits / temperature

# Top-k: sample only from the k most likely tokens
def top_k_sample(probs, k=50):
    top_values, top_indices = probs.topk(k)
    top_values = top_values / top_values.sum()  # renormalize
    choice = torch.multinomial(top_values, num_samples=1)
    return top_indices[choice]

# Top-p (nucleus): sample from the smallest set whose probabilities sum to at least p
def top_p_sample(probs, p=0.9):
    sorted_probs, sorted_indices = probs.sort(descending=True)
    cumsum = sorted_probs.cumsum(dim=-1)
    # Keep a token if the mass before it is still below p, so the token that
    # crosses the threshold is included and the top token is always kept
    nucleus = (cumsum - sorted_probs) < p
    nucleus_probs = sorted_probs * nucleus
    nucleus_probs = nucleus_probs / nucleus_probs.sum()
    choice = torch.multinomial(nucleus_probs, num_samples=1)
    return sorted_indices[choice]
Production rules:
- Factual Q&A (RAG, extraction): temperature=0.0 (deterministic)
- Creative writing, brainstorming: temperature=0.7-1.0
- Code generation: temperature=0.0-0.3
- top_p=0.9 and top_k=50 are good defaults for variability without garbage
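To see what these temperature values do to a distribution, the sketch below applies temperature to three made-up logits in plain Python. Since dividing by zero is undefined, it treats temperature 0 as greedy selection, which is a common convention but an assumption here rather than a description of any specific provider's implementation.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities; temperature 0 is treated as greedy."""
    if temperature <= 0:
        # Fall back to argmax: all probability mass on the best token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate probability on the top token; higher temperatures flatten the distribution so less likely tokens get sampled more often.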
The token is sampled, appended to the input sequence, and the whole forward pass repeats — one token at a time. This is why LLM inference latency scales linearly with output length.
The autoregressive loop
# Pseudo-code for the entire generation
prompt_ids = tokenizer.encode(prompt)
generated = list(prompt_ids)
for _ in range(max_tokens):
    logits = model(generated)  # [vocab_size]
    next_id = sample(logits, temperature, top_p, top_k)
    if next_id == tokenizer.eos_token_id:
        break
    generated.append(next_id)
text = tokenizer.decode(generated)
This loop is, in essence, what happens on the server when you call client.chat.completions.create(...). It is also why streaming works so well: the model produces tokens one at a time, so each one can be flushed to the client as soon as it is generated instead of waiting for the whole answer.
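The same loop can be run end to end with a toy stand-in for the model. In this sketch the "model" is a hypothetical hand-written bigram table (`BIGRAMS`) rather than a neural network, and tokens are whole words; only the control flow (predict, pick, append, stop on EOS or the token limit) mirrors real generation.

```python
import random

# Hypothetical toy "model": next-token probabilities keyed by the previous token.
BIGRAMS = {
    "<bos>": {"the": 1.0},
    "the": {"capital": 0.6, "city": 0.4},
    "capital": {"of": 1.0},
    "city": {"of": 1.0},
    "of": {"France": 1.0},
    "France": {"<eos>": 1.0},
}

def generate(max_tokens=10, greedy=True, seed=0):
    rng = random.Random(seed)
    generated = ["<bos>"]
    for _ in range(max_tokens):
        dist = BIGRAMS[generated[-1]]  # "forward pass" on the current context
        if greedy:
            next_token = max(dist, key=dist.get)
        else:
            next_token = rng.choices(list(dist), weights=list(dist.values()))[0]
        if next_token == "<eos>":      # stop token ends generation
            break
        generated.append(next_token)   # feed the output back in as input
    return generated[1:]

print(" ".join(generate()))  # the capital of France
```

Each iteration yields exactly one token, which is the unit a streaming API can send to the client.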
KV-cache — the inference optimization that makes LLMs viable
Naively, predicting token N+1 means re-running the model on ALL N tokens. That would be O(n³) for the whole generation — unusable.
The KV-cache stores the keys (K) and values (V) of each attention layer for all previously-processed tokens. When generating token N+1, the model only computes Q for the new token, but reuses cached K and V for previous ones. Cost drops to O(n) per new token.
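Memory cost per sequence: 2 × num_layers × num_kv_heads × head_dim × seq_len × dtype_size. For a 70B-class model (80 layers, head_dim 128, 16-bit values) at 32k context, caching all 64 attention heads would take ~80 GiB. Llama 3 70B uses grouped-query attention with only 8 KV heads, which brings this down to ~10 GiB per sequence, and the total still grows with every concurrent request. This is why long-context inference needs lots of GPU memory.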
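The formula is simple enough to evaluate directly. The sketch below assumes 16-bit cache entries and the published Llama 3 70B shape (80 layers, head_dim 128), and compares caching K/V for all 64 attention heads against the 8 KV heads that grouped-query attention actually stores. The function name is illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   dtype_bytes=2, batch_size=1):
    """Bytes needed to cache keys and values for a batch of sequences."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes * batch_size

GIB = 1024 ** 3
full_mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=32_768)
gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=32_768)
print(f"64 KV heads: {full_mha / GIB:.0f} GiB")  # 64 KV heads: 80 GiB
print(f" 8 KV heads: {gqa / GIB:.0f} GiB")       #  8 KV heads: 10 GiB
```

The result is per sequence, so serving many long conversations at once multiplies it by the batch size.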
Real Python code — every common use case
Use case 1 — Simple chat with OpenAI / Azure OpenAI
# pip install openai
from openai import OpenAI
client = OpenAI(api_key="sk-...")
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Write a Python function to reverse a string."}
],
temperature=0.0,
max_tokens=200,
)
print(response.choices[0].message.content)
print(f"Used {response.usage.total_tokens} tokens")
Use case 2 — Streaming responses (perceived latency)
A 500-token answer takes ~3-5 seconds. Showing nothing for 5 seconds feels broken. Stream tokens as they generate; first token appears in ~300ms.
stream = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "Explain quantum entanglement in 3 paragraphs"}],
stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
In a web app, pipe each chunk via Server-Sent Events (SSE) or WebSocket to the browser.
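For the SSE route, each text delta has to be wrapped in the `data: ...` framing followed by a blank line. The sketch below shows only that framing step in plain Python; the generator name, the `{"delta": ...}` payload shape, and the `[DONE]` end marker are assumptions for illustration, and a real app would yield these frames from its web framework's streaming response.

```python
import json

def sse_events(text_chunks):
    """Wrap text chunks as Server-Sent Events frames."""
    for chunk in text_chunks:
        # JSON-encode so newlines inside a chunk cannot break the framing.
        yield f"data: {json.dumps({'delta': chunk})}\n\n"
    # Hypothetical end-of-stream marker the browser code agrees to watch for.
    yield "data: [DONE]\n\n"

frames = list(sse_events(["Hello", ", ", "world"]))
print("".join(frames), end="")
```

On the browser side, an `EventSource` (or a fetch-based reader) receives one message per frame and appends the delta to the page.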
Use case 3 — Structured output (JSON schema)
LLMs can return JSON that conforms to a schema you define. Hugely useful for extraction, classification, agent steps.
from pydantic import BaseModel
from openai import OpenAI
class Invoice(BaseModel):
    company: str
    amount: float
    currency: str
    date: str
client = OpenAI()
response = client.beta.chat.completions.parse(
model="gpt-4o-2024-08-06",
messages=[
{"role": "system", "content": "Extract structured data from invoice text."},
{"role": "user", "content": "Invoice from Acme Corp for $1,250.00 USD dated 2026-05-22"}
],
response_format=Invoice,
)
invoice: Invoice = response.choices[0].message.parsed
print(invoice.company) # "Acme Corp"
print(invoice.amount) # 1250.0
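With Structured Outputs, the returned JSON is constrained to match your Pydantic schema. No more regex parsing. No more "the model returned 'amount: $1,250'" headaches. The values themselves can still be wrong, and message.parsed can be None if the model refuses, so check it before use.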
Use case 4 — Tool / function calling (agents)
LLMs can call your functions when they decide they need to.
import json
from openai import OpenAI
def get_weather(city: str) -> dict:
    # In real code: call a weather API
    return {"city": city, "temp_c": 28, "condition": "sunny"}
client = OpenAI()
tools = [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "City name"}
},
"required": ["city"],
},
},
}]
messages = [{"role": "user", "content": "What's the weather in Mumbai?"}]
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=messages,
tools=tools,
)
# Did the model decide to call a tool?
message = response.choices[0].message
if not message.tool_calls:
    # The model answered directly without calling a tool
    print(message.content)
else:
    tool_call = message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)
    result = get_weather(**args)
    # Send the result back to the model for the final answer
    messages.append(message)
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(result),
    })
    final = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
    print(final.choices[0].message.content)
    # e.g. "It's currently 28°C and sunny in Mumbai."
This pattern — "model decides which tool to call, you execute it, you feed the result back" — is the core of every agent framework (LangChain, LlamaIndex, AutoGen, CrewAI, etc.). They just package this loop with retry, planning, and multi-step coordination.
Use case 5 — Embeddings for semantic search
Embeddings turn text into vectors where similar meaning = nearby vectors. Foundation of RAG, semantic search, clustering.
from openai import OpenAI
import numpy as np
client = OpenAI()
def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=[text],
    )
    return np.array(response.data[0].embedding)
q = embed("How do I deploy a Next.js app to Vercel?")
docs = [
embed("Vercel deploy guide: push to GitHub, connect repo, done."),
embed("Apple pie recipe: flour, butter, apples..."),
embed("Docker tutorial for beginners"),
]
# Dot product equals cosine similarity here because OpenAI embeddings are normalized to unit length
similarities = [float(q @ doc) for doc in docs]
print(similarities)
# e.g. [0.71, 0.06, 0.21] (illustrative values): the first doc is clearly most relevant
For production, embed documents once at ingestion time and store the vectors in Postgres + pgvector, Azure SQL VECTOR type, Pinecone, Weaviate, or Azure AI Search. Only the query should be embedded at request time; never re-embed the corpus per query.
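What a vector store does at query time can be sketched in a few lines: score the query vector against every stored vector and return the best matches. The example below is a brute-force version in plain Python with made-up 3-dimensional vectors standing in for real embeddings; the function names and document IDs are illustrative, and real stores use approximate indexes to avoid scanning everything.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=2):
    """index: list of (doc_id, vector) pairs that were embedded ahead of time."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index]
    return sorted(scored, reverse=True)[:k]

# Made-up vectors; a real index would hold embeddings from an embedding model.
index = [
    ("vercel-guide", [0.9, 0.1, 0.0]),
    ("apple-pie", [0.0, 0.2, 0.9]),
    ("docker-intro", [0.6, 0.6, 0.1]),
]
results = top_k([1.0, 0.2, 0.0], index)
print([doc_id for _, doc_id in results])  # ['vercel-guide', 'docker-intro']
```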
Use case 6 — Local inference with HuggingFace
For private data that can't go to a public API, or for cost-sensitive workloads:
# pip install transformers accelerate torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto", # auto-place on GPU(s)
)
prompt = "Explain transformer attention in one paragraph."
input_ids = tok.apply_chat_template(
[{"role": "user", "content": prompt}],
return_tensors="pt",
add_generation_prompt=True,
).to(model.device)
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=300,
        temperature=0.3,
        top_p=0.9,
        do_sample=True,
    )
# Decode only the newly generated tokens (skip the prompt)
print(tok.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
For production-grade local inference, use vLLM (PagedAttention, much faster) or llama.cpp (CPU-friendly quantized inference) instead of vanilla transformers.
Use case 7 — Token counting + cost estimation
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
def estimate_cost(prompt: str, max_response: int = 500) -> float:
    input_tokens = len(enc.encode(prompt))
    # GPT-4o pricing (May 2026, USD per 1M tokens); prices change, so verify current rates
    input_cost_per_m = 5.00
    output_cost_per_m = 15.00
    # Upper bound: assumes the response uses all max_response tokens
    return (
        (input_tokens / 1_000_000) * input_cost_per_m
        + (max_response / 1_000_000) * output_cost_per_m
    )
p = "Summarize the entire War and Peace novel"
print(f"~${estimate_cost(p):.4f} per call")
Build this into your monitoring. A bad query that retrieves 10 huge chunks can cost 50x a normal query.
All the major use cases — what LLMs handle today
| Category | Use case | Notes |
|---|---|---|
| Chat | Customer support copilot | Stream + escalation to human |
| Chat | Internal help-desk bot | RAG over internal docs |
| Code | Code completion (Copilot-style) | Fine-tuned on code corpora |
| Code | Code review / explanation | Strong on common languages |
| Code | SQL generation from natural language | Good with schema in prompt |
| Documents | RAG Q&A over PDFs | The single biggest enterprise use case |
| Documents | Summarization | Long-form → short-form |
| Documents | Translation | High quality for major languages |
| Extraction | Invoice / receipt parsing | Structured output via JSON schema |
| Extraction | Contract clause extraction | Legal, HR, compliance |
| Extraction | Email classification + routing | Sentiment + topic |
| Agents | Multi-step workflows | "Book me a flight under $500" (tool calling) |
| Agents | Coding agents | Write + run + debug autonomously |
| Agents | Customer service agents | Use tools to look up orders, issue refunds |
| Multi-modal | Image + text Q&A | "What's wrong with this circuit diagram?" |
| Multi-modal | Document layout understanding | PDF with tables, columns |
| Multi-modal | Speech transcription → analysis | Whisper + LLM |
| Creative | Marketing copy, ad creative | Brainstorm 50 angles, pick 3 |
| Creative | Drafting emails / reports | Human-in-the-loop |
Open-source vs proprietary — when to pick which
| Factor | OpenAI / Anthropic / Google | Open-source (Llama, Mistral, Qwen, Phi) |
|---|---|---|
| Quality (general benchmark) | Highest (GPT-4o, Claude 3.5) | Llama 3.1 405B comparable; smaller models 6-12 months behind |
| Latency | Network-bound (~500ms-3s) | Can run locally — 50-200ms for small models |
| Cost (high volume) | Per-token API | Fixed GPU cost; cheaper above ~1M tokens/day |
| Privacy | Data leaves your tenant (mostly OK with Azure OpenAI in your subscription) | Fully on-prem possible |
| Fine-tuning | Limited (OpenAI has it, expensive) | Full control |
| Latest features | Day-1 (newest releases) | Often 3-6 months later |
| Operational burden | None | Lots — GPUs, serving infra, monitoring |
| Right for | Most apps, especially early stage | High volume + privacy-sensitive + technical team |
Pragmatic default: start with Azure OpenAI / GPT-4o-mini. Move to open-source when you have:
- A measurable cost crisis at scale, OR
- A specific privacy / regulatory requirement, OR
- A specific fine-tuning need
Don't run open-source models locally just to feel virtuous — running production-grade LLM inference well is its own engineering problem.
Production pitfalls
1. Ignoring the context window
GPT-4o has 128k tokens. Sounds huge until you're packing 50 documents + system prompt + chat history. Token-budget every component:
def fit_to_context(system: str, docs: list[str], history: list[dict],
                   max_total_tokens: int = 100_000) -> tuple[str, list[str], list[dict]]:
    # Reserve room for the response
    budget = max_total_tokens - 4000  # reserve 4k for response
    # ... drop oldest history items, then truncate docs, until fit ...
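One possible way to fill in the elided steps is sketched below. It is an assumption-laden illustration, not the author's implementation: `count_tokens` is a hypothetical 4-characters-per-token estimate (swap in a real tokenizer), the `reserve` parameter is added for clarity, and the sketch drops whole documents from the end of the list rather than truncating them, assuming docs are ordered most-relevant-first.

```python
def count_tokens(text):
    """Rough estimate (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)

def fit_to_context(system, docs, history, max_total_tokens=100_000, reserve=4_000):
    """Trim history and docs until prompt + reserved response fit the budget."""
    budget = max_total_tokens - reserve - count_tokens(system)
    history = list(history)  # copy so the caller's lists are not mutated
    docs = list(docs)

    def used():
        return (sum(count_tokens(m["content"]) for m in history)
                + sum(count_tokens(d) for d in docs))

    # 1) Drop the oldest chat turns first.
    while history and used() > budget:
        history.pop(0)
    # 2) Then drop the last (assumed lowest-ranked) documents.
    while docs and used() > budget:
        docs.pop()
    return system, docs, history

system = "s" * 40                                    # ~10 tokens
docs = ["d" * 400] * 3                               # ~100 tokens each
history = [{"role": "user", "content": "h" * 400}] * 2
_, kept_docs, kept_history = fit_to_context(system, docs, history,
                                            max_total_tokens=350, reserve=40)
print(len(kept_docs), len(kept_history))  # 3 0
```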
2. No retry logic for transient failures
OpenAI returns 429 (rate limit), 500 (server error), 503 fairly often. Use exponential backoff:
from tenacity import retry, stop_after_attempt, wait_exponential
@retry(stop=stop_after_attempt(5), wait=wait_exponential(min=1, max=30))
def call_llm(prompt: str) -> str:
    return client.chat.completions.create(...)
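To show what such a decorator does under the hood, here is a dependency-free sketch of exponential backoff with jitter. `TransientError` is a hypothetical stand-in for the 429/500/503 cases; other errors (for example a malformed request) are deliberately not retried, since repeating them cannot succeed. The `sleep` parameter exists only so the demo and tests can run without waiting.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical stand-in for a 429/500/503 response from the API."""

def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn(); retry on TransientError with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads retries out so many clients do not retry in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))

attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("429 rate limited")
    return "ok"

result = call_with_backoff(flaky_call, sleep=lambda seconds: None)  # no real sleeping in the demo
print(result, "after", attempts["count"], "attempts")  # ok after 3 attempts
```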
3. Trusting LLM output without validation
The model can return malformed JSON, hallucinate field values, return wrong types. Always:
- Use structured output (Pydantic schema)
- Validate ranges / formats
- Have an "I don't know" path
- Log all responses for audit
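The checklist above can be sketched with the standard library alone. In this illustration `parse_invoice`, the field names, and the amount limit are assumptions; returning `None` is the hook for the "I don't know" or human-review path, and a schema library such as Pydantic would replace most of the manual checks.

```python
import json
from datetime import date

def parse_invoice(raw):
    """Return a validated dict, or None so the caller can take the fallback path."""
    try:
        data = json.loads(raw)
        company = data["company"]
        amount = float(data["amount"])
        invoice_date = date.fromisoformat(data["date"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None  # malformed JSON, missing field, or wrong type
    if not isinstance(company, str) or not company.strip():
        return None
    if not 0 < amount < 10_000_000:  # hypothetical business limit
        return None
    return {"company": company.strip(), "amount": amount, "date": invoice_date}

good = parse_invoice('{"company": "Acme Corp", "amount": 45000, "date": "2026-05-22"}')
bad = parse_invoice('{"company": "Acme Corp", "amount": "lots", "date": "2026-05-22"}')
print(good)
print(bad)  # None
```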
4. Not measuring quality
Build a golden set of 30-50 representative input/expected-output pairs. Run them whenever you change the prompt, model, or temperature. Without this you're guessing whether you improved or regressed.
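A golden-set runner can be very small. The sketch below assumes case-insensitive exact match as the scoring rule, which only suits short factual answers; `run_golden_set`, the four sample cases, and `fake_model` (a stand-in for the real LLM call) are all hypothetical.

```python
def run_golden_set(predict, golden):
    """Score predict() against expected outputs using case-insensitive exact match."""
    failures = []
    for case in golden:
        actual = predict(case["input"])
        if actual.strip().lower() != case["expected"].strip().lower():
            failures.append({**case, "actual": actual})
    pass_rate = 1 - len(failures) / len(golden)
    return pass_rate, failures

# Hypothetical golden set and a stand-in for the real LLM call.
GOLDEN = [
    {"input": "Capital of France?", "expected": "Paris"},
    {"input": "2 + 2?", "expected": "4"},
    {"input": "Capital of Japan?", "expected": "Tokyo"},
    {"input": "Largest planet?", "expected": "Jupiter"},
]

def fake_model(question):
    canned = {
        "Capital of France?": "Paris",
        "2 + 2?": "4",
        "Capital of Japan?": "Kyoto",   # deliberate regression
        "Largest planet?": "jupiter",
    }
    return canned[question]

pass_rate, failures = run_golden_set(fake_model, GOLDEN)
print(f"pass rate: {pass_rate:.0%}, failures: {len(failures)}")  # pass rate: 75%, failures: 1
```

Storing the pass rate per prompt or model version makes a regression visible as soon as a change lands.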
5. Ignoring the "I don't know" path
The default LLM behavior is to always answer, even confidently wrong. For factual systems, force the model to say "I don't know" when context is insufficient:
system = """You answer using ONLY the provided context. If the answer isn't
in the context, respond with: "I couldn't find that information." Do not
guess or invent facts."""
6. Forgetting the human-in-the-loop layer for high-stakes actions
LLMs occasionally do dumb things confidently. For agents that take real-world actions (sending emails, charging cards, deleting data), require explicit user confirmation for irreversible operations.
7. No caching for repeated queries
Same question asked 100 times = 100 LLM calls = 100× cost. Semantic cache (vector similarity on the query) catches near-duplicate questions:
def cached_call(question: str):
    q_emb = embed(question)
    cached = vector_search(q_emb, threshold=0.95)
    if cached:
        return cached
    answer = call_llm(question)
    store_cache(q_emb, answer, ttl=3600)
    return answer
15-30% cache hit rate is common on Q&A traffic. Direct cost saving.
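A minimal in-memory version of such a cache might look like the sketch below. Everything here is illustrative: `toy_embed` is a hypothetical letter-count stand-in for a real embedding model, the linear scan would be replaced by a vector index in production, and the class name and `clock` parameter (used to make expiry testable) are assumptions.

```python
import math
import time

def toy_embed(text):
    """Hypothetical stand-in for a real embedding model: letter counts."""
    text = text.lower()
    return [text.count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, embed, threshold=0.95, ttl=3600.0, clock=time.monotonic):
        self.embed = embed
        self.threshold = threshold
        self.ttl = ttl
        self.clock = clock
        self.entries = []  # list of (vector, answer, expires_at)

    def get(self, question):
        now = self.clock()
        self.entries = [e for e in self.entries if e[2] > now]  # drop expired entries
        vec = self.embed(question)
        scored = [(cosine(vec, e[0]), e[1]) for e in self.entries]
        if not scored:
            return None
        score, answer = max(scored, key=lambda pair: pair[0])
        return answer if score >= self.threshold else None

    def put(self, question, answer):
        self.entries.append((self.embed(question), answer, self.clock() + self.ttl))

cache = SemanticCache(toy_embed)
cache.put("What is RAG?", "Retrieval-augmented generation")
print(cache.get("what is rag"))           # same letters, so this is a cache hit
print(cache.get("How do I bake bread?"))  # unrelated question, so this is a miss
```

The threshold is the main tuning knob: set it too low and users get answers to a different question, too high and near-duplicates miss the cache.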
8. Not handling cost spikes
A single user with a 1MB prompt can cost ₹100. Multiply by 1000 users / day = ₹100,000. Set hard token limits on input length per request, and alert on daily spend exceeding budget.
The Python ecosystem you should know
If you're building serious LLM applications in Python, learn these libraries in order of priority:
| Tier | Library | What for |
|---|---|---|
| Must | openai / anthropic / google-generativeai | Call hosted models |
| Must | tiktoken (or transformers.AutoTokenizer) | Token counting |
| Must | pydantic | Structured output validation |
| Must | tenacity (retry) + httpx (async HTTP) | Production-grade calls |
| Should | langchain OR llama-index | Higher-level orchestration |
| Should | sentence-transformers | Free local embeddings |
| Should | chromadb or qdrant-client | Vector DB clients |
| Nice | transformers (HuggingFace) | Local model inference |
| Nice | vllm | Fast self-hosted inference |
| Nice | instructor | Cleaner structured output |
| Nice | pydantic-ai | Modern agent framework |
For a fresh project: start with raw openai SDK + Pydantic for structured output. Reach for LangChain only when you have multiple chained LLM steps + tool use; don't add the abstraction tax for a simple chat app.
What you should know about LLMs but probably don't
A few non-obvious facts that bite teams in production:
- The model's "confidence" is unreliable. GPT will say "I am 99% sure" about complete fabrications. Don't trust self-reported confidence; use retrieval distance, schema validation, and human review for high-stakes flows.
- Same prompt → different answers. Even at temperature 0, slight variations occur (especially across model versions). Pin the model version (gpt-4o-2024-08-06, not gpt-4o) for reproducibility.
- Context order matters. Models pay more attention to the start and end of context. Bury an important fact in the middle and the model may miss it ("lost in the middle" effect). Put critical context up front or at the end.
- "Bigger" doesn't always mean "better". GPT-4o-mini outperforms GPT-3.5 on most tasks at a lower price. For each use case, benchmark several models; don't default to the most expensive.
- Function calling doesn't always work. Sometimes the model invents tool calls or argument types. Validate before executing; have a fallback "I'd like to call this tool but the args don't match" handler.
- Long context ≠ effective context. Just because the model accepts 128k tokens doesn't mean it uses them well. Quality drops with very long contexts. Stay under ~32k for most use cases.
- Streaming costs the same as non-streaming. You don't save tokens by streaming; it is the same model and the same compute. You save perceived latency only.
Summary
Large Language Models are general-purpose text predictors trained on massive amounts of data, built using the transformer architecture. They work by:
- Tokenizing input into sub-word units
- Embedding each token into a vector
- Running it through dozens of transformer blocks that mix attention (cross-token context) with MLPs (per-token knowledge)
- Producing a probability distribution over the vocabulary
- Sampling the next token, appending, and repeating
Python dominates the ecosystem because of NumPy/PyTorch tensor ergonomics, the HuggingFace + research community gravity, and the seamless researcher → production handoff.
To ship production LLM applications:
- Pick the right model for the use case (don't default to GPT-4o)
- Stream responses for UX
- Use structured output (Pydantic schemas) for any non-conversational output
- Always have an "I don't know" path
- Measure quality on a golden set
- Cache, retry, monitor cost
The model is the easy part. The hard part is everything else — context management, error handling, evaluation, cost control, and integrating LLM output into a system that real users can trust.