LLM Foundations — How Large Language Models Actually Work (with Python Code)

Large Language Models (LLMs) — GPT-4, Claude, Llama, Mistral, Gemini — are the foundation of every serious AI application built in the last two years. But for most developers they remain a black box: "send a prompt, get text back, hope it's right."

This guide is the complete foundational picture: what problems LLMs solve, why they're built in Python, how they actually work internally (tokenization, transformers, attention, sampling), and real production-grade Python code for every common use case. By the end, the API call client.chat.completions.create(...) will no longer be a mystery — you'll know exactly what happens at each layer between your prompt and the response.

The problems LLMs solve

Before LLMs, every NLP task needed its own model:

Sentiment analysis → train a classifier on labeled reviews
Translation → train a Seq2Seq model per language pair
Summarization → train an encoder-decoder on article+summary pairs
Q&A → train a span-extraction model on SQuAD
Code completion → train a model on code corpora

Each model required thousands of labeled examples, weeks of training, and worked only on its narrow task. Adding a new task = new model.

LLMs replaced this with a single general-purpose model that handles them all via natural-language instructions.

# Same model. Different prompts. Different tasks.
client.chat.completions.create(model="gpt-4o", messages=[
    {"role": "user", "content": "Translate to French: 'Where is the station?'"}
])
# → "Où est la gare?"

client.chat.completions.create(model="gpt-4o", messages=[
    {"role": "user", "content": "Summarize this article in 2 sentences: ..."}
])
# → 2-sentence summary

client.chat.completions.create(model="gpt-4o", messages=[
    {"role": "user", "content": "Extract the company name, amount and date from: 'Invoice #INV-2024 from Acme Corp for ₹45,000 dated 2026-05-22'"}
])
# → {"company": "Acme Corp", "amount": 45000, "date": "2026-05-22"}

One model, dozens of tasks, no per-task training data. That's the LLM revolution in one paragraph.

Beyond replacing classic NLP, LLMs unlocked entirely new capabilities:

Capability	Pre-LLM era	LLM era
Code generation	Templates / autocomplete	"Write a Python function that..."
Document Q&A	Keyword search + extraction	Natural-language conversation with citations
Agents / tool use	Custom finite-state machines	"Decide which API to call"
Structured extraction	Regex + custom parsers	Schema-aware JSON output
Creative writing	Pure human	Drafts, edits, brainstorms
Multi-modal understanding	Separate vision + text models	Single model: image + text → answer

Why Python dominates AI/ML development

Almost every AI library, model, framework, and tool ships Python-first. Reasons, in order of importance:

1. The library ecosystem is unmatched

Library	What it does	Why it's in Python
PyTorch	Defines and trains neural networks	Researcher-friendly tensor API
TensorFlow / Keras	Same — Google's stack	Same
transformers (HuggingFace)	1M+ pre-trained models, one API	Wraps PyTorch + TF + ONNX
NumPy	Numerical arrays (the foundation)	C-fast core, Python-ergonomic
pandas	Tabular data manipulation	The "Excel for code" of AI
scikit-learn	Classical ML (regression, trees, etc.)	Decade of refinement
sentence-transformers	Embeddings for semantic search	Built on transformers
LangChain / LlamaIndex	LLM orchestration frameworks	Python-first ecosystems
tiktoken	Fast tokenization for OpenAI models	OpenAI's reference impl
vLLM / llama.cpp	Fast local LLM inference	Python bindings + C++ core

Every research paper publishes Python code. Every model on HuggingFace ships with Python loaders. Every cloud AI service has a Python SDK as the first-class client.

2. Research and production share one language

A researcher trains a model in a Jupyter notebook. A platform engineer ships it to production. Both use Python. No "research code in Python, prod code in C++" handoff like the bad old days. The same model.generate(...) call works in both places.

3. Numeric arrays are first-class

Python (with NumPy) has the cleanest API for n-dimensional arrays of any mainstream language. tensor[batch, head, seq, dim] reads naturally. Slicing, broadcasting, einsum — all expressive.

In .NET / Java you can do the same math, but with verbose loops or third-party wrappers. The vocabulary of AI is multi-dimensional tensors; Python speaks it natively.

4. C/C++ underneath where speed matters

When people say "Python is slow", they're right — but it doesn't matter. The heavy compute (GPU operations, matrix multiplication, attention) runs in CUDA / C++ kernels. Python just orchestrates. It's the "glue" language; the math runs at C speed.

5. Community, papers, weights, tutorials — all Python

Search "fine-tune Llama 3 example" → the overwhelming majority of results are Python notebooks. "Run Stable Diffusion locally" → diffusers, Python. The community gravity is enormous and self-reinforcing.

When Python isn't the right choice

High-throughput inference services at the request/response layer — use Go or Rust for the API, call into Python or ONNX for the actual model
Mobile / embedded inference — ONNX Runtime, TensorFlow Lite, Core ML; no Python on the device
Real-time game / robotics — C++ for hard latency requirements
Enterprise integration — .NET/Java where the rest of the system lives; call the LLM via HTTP

But for building, training, fine-tuning, evaluating, and prototyping — Python is unrivaled.

How LLMs actually work — the internal architecture

                              How an LLM Generates Text
                              ────────────────────────

   Input prompt: "The capital of France is"
        │
        ▼
   ┌─────────────────────┐
   │   1. Tokenizer       │   Byte-pair encoding (BPE)
   │                      │   ["The", " capital", " of", " France", " is"]
   │                      │   → [464, 5963, 286, 4881, 318]
   └──────────┬──────────┘
              │ token IDs (integers)
              ▼
   ┌─────────────────────┐
   │  2. Embeddings       │   Each token → high-dim vector
   │  + Positional info   │   shape: [seq_len, d_model]
   │                      │   (e.g. d_model = 4096 for Llama 3 8B)
   └──────────┬──────────┘
              │
              ▼
   ┌─────────────────────────────────────────────────┐
   │  3. Transformer Blocks (×N layers — 32 to 80)   │
   │                                                 │
   │  ┌────────────────────────────────┐             │
   │  │   Self-Attention (multi-head)  │             │
   │  │   Q · Kᵀ → softmax → · V       │   ← O(n²)   │
   │  │   "which past tokens matter    │     cost on │
   │  │    for predicting next?"        │     context │
   │  └──────────┬────────────────────┘             │
   │             │                                   │
   │       residual + LayerNorm                      │
   │             │                                   │
   │             ▼                                   │
   │  ┌────────────────────────────────┐             │
   │  │   Feed-Forward (MLP)           │             │
   │  │   d_model → 4·d_model → d_model│             │
   │  │   "what does this token mean   │             │
   │  │    given its context?"          │             │
   │  └──────────┬────────────────────┘             │
   │             │                                   │
   │       residual + LayerNorm                      │
   │             │                                   │
   └─────────────┼───────────────────────────────────┘
                 ▼
   ┌─────────────────────┐
   │  4. Output Head      │   Linear: d_model → vocab_size
   │                      │   logits = [vocab_size] raw scores (softmax → probabilities)
   │                      │   (vocab ≈ 32k-200k tokens depending on model)
   └──────────┬──────────┘
              │
              ▼
   ┌─────────────────────┐
   │  5. Sampling         │   Temperature, top-k, top-p, greedy
   │                      │   Pick next token: " Paris"
   └──────────┬──────────┘
              │
              ▼
       Append " Paris" to input. Repeat from step 1.
       Stop on EOS token or max_tokens limit.

       Final: "The capital of France is Paris."

Now let's walk through each step with real Python code.

Step 1 — Tokenization

LLMs don't see characters or words — they see tokens, which are sub-word units learned during training. GPT-4 has a vocabulary of ~100,000 tokens; Llama 3 has 128,256.

# Install: pip install tiktoken
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

text = "Building LLM applications is fun!"
tokens = enc.encode(text)
print(tokens)
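# A list of integer token IDs, e.g. [27418, 445, 11237, 8522, 374, 2523, 0]
# (illustrative: the exact IDs and count depend on the encoding the model uses)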

# Decode back
print(enc.decode(tokens))
# 'Building LLM applications is fun!'

# Token counts matter — they determine cost AND fit in context window
print(f"Token count: {len(tokens)}")
# Token count: 7

# Rule of thumb: 1 token ≈ 4 English characters ≈ 0.75 words

The key implications:

You pay per token, not per character or word. A 100-word email is ~133 tokens.
Context windows are token-limited. GPT-4o has 128k tokens (~96k words ≈ a short novel).
Numbers, code, and non-English text tokenize less efficiently. "लर्निंग" (Hindi for "learning") might be 6 tokens; "learning" is 1.

For local / HuggingFace models, use the model's own tokenizer:

# pip install transformers
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokens = tok.encode("Building LLM applications is fun!")
print(tokens, tok.decode(tokens))

Why sub-word tokenization (BPE)? Words are too sparse (millions); characters are too granular (long sequences). BPE finds the sweet spot — common words become single tokens, rare ones get split into pieces. "internationalization" might tokenize as ["international", "ization"].
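To make the merge idea concrete, here is a toy BPE trainer. It is a minimal sketch, not how tiktoken or any production tokenizer is implemented: it works on characters rather than bytes, the corpus is made up, and the helper names (learn_bpe_merges, apply_merge, tokenize) are invented for this illustration.

```python
from collections import Counter

def apply_merge(symbols: tuple, pair: tuple) -> tuple:
    """Fuse every adjacent occurrence of `pair` into a single symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe_merges(words: list, num_merges: int) -> list:
    """Learn merge rules from a tiny corpus (toy version, no byte fallback)."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair; first seen wins ties
        merges.append(best)
        # Rewrite every word with the winning pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[apply_merge(symbols, best)] += freq
        vocab = new_vocab
    return merges

def tokenize(word: str, merges: list) -> list:
    symbols = tuple(word)
    for pair in merges:   # replay merges in the order they were learned
        symbols = apply_merge(symbols, pair)
    return list(symbols)

corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe_merges(corpus, num_merges=10)
print(tokenize("lowest", merges))
# ['low', 'est']
```

"lowest" never appears in the corpus, yet it splits into two pieces the trainer has seen often. That is the property that lets a fixed vocabulary cover unseen words.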

Step 2 — Embeddings + position

Each token ID is mapped to a high-dimensional vector (4096 dims for Llama 3 8B; 12288 for GPT-3). Tokens with similar meanings end up near each other in this vector space.

But raw embeddings don't carry order — the model also needs to know that "dog bit man" is different from "man bit dog". So positional encoding is added.

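# Conceptual: what happens inside
import math
import torch

vocab_size = 128256
d_model = 4096

embedding_table = torch.randn(vocab_size, d_model)   # learned during training (random here, about 2 GB)
token_ids = torch.tensor([464, 5963, 286, 4881, 318])
token_embeddings = embedding_table[token_ids]   # shape: [5, 4096]

# Positional encoding. Modern models use RoPE (rotary position embeddings), which
# rotates Q and K inside attention. Classic sinusoidal positions are shown here
# because they are simply added to the embeddings.
positions = torch.arange(len(token_ids)).unsqueeze(1)                          # [5, 1]
freqs = torch.exp(-math.log(10000.0) * torch.arange(0, d_model, 2) / d_model)  # [2048]
position_info = torch.zeros(len(token_ids), d_model)
position_info[:, 0::2] = torch.sin(positions * freqs)   # even dims
position_info[:, 1::2] = torch.cos(positions * freqs)   # odd dims
x = token_embeddings + position_info   # input to the first transformer block, [5, 4096]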

After this step, the input to the first transformer layer is a [seq_len, d_model] matrix of vectors. Each row represents one token, in its position.

Step 3 — Transformer blocks (the actual "brain")

This is where the magic happens. Each block has two stages:

3a — Self-attention

The signature operation of transformers. For each token, the model decides which other tokens to pay attention to when figuring out what comes next.

# Simplified single-head self-attention (PyTorch)
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SimpleAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        # x: [seq_len, d_model]
        Q = self.W_q(x)   # [seq_len, d_model] — "what am I looking for?"
        K = self.W_k(x)   # [seq_len, d_model] — "what do I have to offer?"
        V = self.W_v(x)   # [seq_len, d_model] — "what's my actual content?"

        # Attention scores: how relevant each token is to each other
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
        # shape: [seq_len, seq_len]

        # Causal mask — token at position i can only see tokens 0..i
        # (LLMs predict left-to-right; can't peek at the future)
        mask = torch.tril(torch.ones_like(scores))
        scores = scores.masked_fill(mask == 0, float('-inf'))

        weights = F.softmax(scores, dim=-1)
        # weights[i][j] = "how much should token i pay attention to token j"

        return weights @ V   # [seq_len, d_model] — context-aware representation

The cost is O(n²) in the sequence length n — each token must compute scores against every other token. This is why context windows are bounded; doubling context = 4× the attention compute.

Multi-head attention runs multiple parallel attention operations (typically 32-128 "heads"), each with its own learned focus. One head might learn syntactic relationships, another semantic ones, another long-range coreference. The outputs are concatenated and projected.

3b — Feed-Forward Network (MLP)

After attention pools context across tokens, the MLP transforms each token individually with a non-linear function.

class SimpleMLP(nn.Module):
    def __init__(self, d_model, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model   # typical ratio
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # GELU activation; modern models use SwiGLU or similar
        return self.down(F.gelu(self.up(x)))

Interpretability studies suggest the MLP layers hold much of the model's factual knowledge: facts like "the capital of France is Paris" appear to be stored largely in MLP weights. As a rough mental model, attention moves information between tokens, while the MLP recalls and applies world knowledge.

Both stages have residual connections (x + sublayer(x)) and LayerNorm (or RMSNorm) — together they keep gradients well-behaved during training.

A typical model stacks 32-80 of these blocks. The output is a [seq_len, d_model] tensor where each row is a deep contextual representation of that token.

Step 4 — Output head + 5. Sampling

The final layer projects each token's representation to one raw score (logit) per vocabulary entry; a softmax then turns those scores into a probability distribution over the entire vocabulary.

# Output: logits over the vocabulary
# final_hidden is the [seq_len, d_model] output of the last transformer block
output_head = nn.Linear(d_model, vocab_size)
logits = output_head(final_hidden)   # shape: [seq_len, vocab_size]

# We only care about predicting the NEXT token (the last position)
next_token_logits = logits[-1]   # shape: [vocab_size]
probs = F.softmax(next_token_logits, dim=-1)

Now we sample the next token. Sampling strategy = the most important inference-time knob.

# Greedy — always pick the most likely token (deterministic, often boring)
next_token = torch.argmax(probs)

# Temperature: flatten or sharpen the distribution
# temp < 1: more confident, more repetitive
# temp = 1: original distribution
# temp > 1: more random, more creative
# temp = 0 would divide by zero; treat it as greedy decoding instead
def with_temperature(logits, temperature):
    return logits / temperature   # apply to logits BEFORE the softmax

# Top-k — sample only from the k most likely tokens
def top_k_sample(probs, k=50):
    top_values, top_indices = probs.topk(k)
    top_values = top_values / top_values.sum()   # renormalize
    choice = torch.multinomial(top_values, num_samples=1)
    return top_indices[choice]

# Top-p (nucleus) — sample from the smallest set whose probabilities sum to p
def top_p_sample(probs, p=0.9):
    sorted_probs, sorted_indices = probs.sort(descending=True)
    cumsum = sorted_probs.cumsum(dim=-1)
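    # Keep a token if the mass BEFORE it is still below p. This includes the token
    # that crosses the threshold (so the kept mass is >= p) and always keeps the first.
    nucleus = (cumsum - sorted_probs) < p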
    nucleus_probs = sorted_probs * nucleus
    nucleus_probs = nucleus_probs / nucleus_probs.sum()
    choice = torch.multinomial(nucleus_probs, num_samples=1)
    return sorted_indices[choice]

Production rules:

Factual Q&A (RAG, extraction): temperature=0.0 — deterministic
Creative writing, brainstorming: temperature=0.7-1.0
Code generation: temperature=0.0-0.3
top_p=0.9 and top_k=50 are good defaults for variability without garbage
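The effect of temperature is easiest to see on a tiny example. The sketch below uses plain Python instead of PyTorch so it runs anywhere; the three logits are made up, and treating temperature 0 as greedy decoding is a convention assumed here rather than something the formula itself defines.

```python
import math

def softmax_with_temperature(logits: list, temperature: float) -> list:
    if temperature <= 0:
        # Assumed convention: temperature 0 means "put all mass on the argmax".
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]   # made-up scores for three candidate tokens
for t in (0.0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures push probability toward the top-scoring token; higher ones spread it across the alternatives.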

The token is sampled, appended to the input sequence, and the whole forward pass repeats — one token at a time. This is why LLM inference latency scales linearly with output length.

The autoregressive loop

# Pseudo-code for the entire generation
prompt_ids = tokenizer.encode(prompt)
generated = list(prompt_ids)

for _ in range(max_tokens):
    logits = model(generated)              # [vocab_size]
    next_id = sample(logits, temperature, top_p, top_k)
    if next_id == tokenizer.eos_token_id:
        break
    generated.append(next_id)

text = tokenizer.decode(generated)

This loop is, in essence, what runs on the server behind client.chat.completions.create(...). It is also why streaming responses work so well: the model produces tokens one at a time, so each one can be flushed to the client as soon as it is generated instead of waiting for the whole answer.
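For a version of the loop that actually runs, the sketch below swaps the real model for a hypothetical lookup table (toy_model, VOCAB and NEXT are invented for this example). Only the control flow is meant to carry over: call the model on the sequence so far, pick a token, stop on EOS, append, repeat.

```python
import math
import random

# Hypothetical stand-in for a real model: a fixed "what comes next" table.
VOCAB = ["<eos>", "the", "capital", "of", "France", "is", "Paris"]
NEXT = {"the": "capital", "capital": "of", "of": "France",
        "France": "is", "is": "Paris", "Paris": "<eos>"}

def toy_model(token_ids: list) -> list:
    """Return one logit per vocabulary entry for the token after the sequence."""
    target = NEXT.get(VOCAB[token_ids[-1]], "<eos>")
    return [5.0 if tok == target else 0.0 for tok in VOCAB]

def generate(prompt_ids: list, max_tokens: int = 10, greedy: bool = True, seed: int = 0) -> list:
    rng = random.Random(seed)
    generated = list(prompt_ids)
    for _ in range(max_tokens):
        logits = toy_model(generated)              # sequence in, next-token logits out
        if greedy:
            next_id = max(range(len(logits)), key=lambda i: logits[i])
        else:
            weights = [math.exp(x) for x in logits]    # unnormalised softmax weights
            next_id = rng.choices(range(len(logits)), weights=weights)[0]
        if next_id == 0:                           # index 0 is <eos>
            break
        generated.append(next_id)
    return generated

ids = generate([VOCAB.index("the"), VOCAB.index("capital")])
print(" ".join(VOCAB[i] for i in ids))
# the capital of France is Paris
```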

KV-cache — the inference optimization that makes LLMs viable

Naively, predicting token N+1 means re-running the model on ALL N tokens. That would be O(n³) for the whole generation — unusable.

The KV-cache stores the keys (K) and values (V) of each attention layer for all previously-processed tokens. When generating token N+1, the model only computes Q for the new token, but reuses cached K and V for previous ones. Cost drops to O(n) per new token.
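A small single-head sketch shows that caching changes the cost but not the result. It is an illustration under simplifying assumptions: the token vectors double as queries, keys and values (no learned projections), and the KVCache class is invented here rather than taken from any inference library.

```python
import math

def attend(q: list, keys: list, values: list) -> list:
    """Attention output for ONE query vector over the given keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q: list, k: list, v: list) -> list:
        # Store only the NEW token's key/value, then attend over everything cached.
        # Causality is implicit: the cache never contains future tokens.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy token vectors

cache = KVCache()
cached_outputs = [cache.step(q=x, k=x, v=x) for x in tokens]

# Naive version: rebuild every key/value from scratch at each position.
naive_outputs = [attend(tokens[i], tokens[: i + 1], tokens[: i + 1])
                 for i in range(len(tokens))]
```

Both paths produce the same vectors; the cached path simply avoids recomputing keys and values for tokens it has already seen.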

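Memory cost per sequence: 2 × num_layers × num_kv_heads × head_dim × seq_len × dtype_size. Llama 3 70B has 80 layers and uses grouped-query attention with 8 KV heads of dimension 128, so at 32k context in 16-bit precision the cache is about 10 GiB per sequence; with full multi-head attention (64 KV heads) it would be about 80 GiB. Multiply by the batch size and it is clear why long-context inference needs lots of GPU memory.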
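The arithmetic is worth running once. The sketch below plugs Llama 3 70B's published configuration (80 layers, 64 query heads, 8 key/value heads under grouped-query attention, head dimension 128) into the formula, assuming 16-bit values and a single sequence; kv_cache_bytes is a helper name made up for this example.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2, batch: int = 1) -> int:
    # The leading 2 covers one tensor for keys plus one for values, per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes * batch

GIB = 1024 ** 3
seq_len = 32 * 1024

with_gqa = kv_cache_bytes(80, 8, 128, seq_len)       # 8 KV heads (grouped-query attention)
without_gqa = kv_cache_bytes(80, 64, 128, seq_len)   # hypothetical: one KV head per query head

print(f"GQA (8 KV heads):  {with_gqa / GIB:.1f} GiB")
print(f"MHA (64 KV heads): {without_gqa / GIB:.1f} GiB")
# GQA (8 KV heads):  10.0 GiB
# MHA (64 KV heads): 80.0 GiB
```

The eightfold gap is the main reason recent models share key/value heads across query heads.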

Real Python code — every common use case

Use case 1 — Simple chat with OpenAI / Azure OpenAI

# pip install openai
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment; avoid hardcoding keys

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to reverse a string."}
    ],
    temperature=0.0,
    max_tokens=200,
)
print(response.choices[0].message.content)
print(f"Used {response.usage.total_tokens} tokens")

Use case 2 — Streaming responses (perceived latency)

A 500-token answer takes ~3-5 seconds. Showing nothing for 5 seconds feels broken. Stream tokens as they generate; first token appears in ~300ms.

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain quantum entanglement in 3 paragraphs"}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:   # some chunks (e.g. usage-only ones) carry no choices
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

In a web app, pipe each chunk via Server-Sent Events (SSE) or WebSocket to the browser.
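On the server side, SSE is just a text format: each event is a "data:" line followed by a blank line. A minimal, framework-agnostic sketch follows; the JSON payload shape and the [DONE] sentinel are conventions assumed for this example, not part of the SSE format itself.

```python
import json
from typing import Iterable, Iterator

def sse_frames(chunks: Iterable[str]) -> Iterator[str]:
    """Wrap text deltas as Server-Sent Events frames."""
    for delta in chunks:
        if not delta:
            continue
        # JSON-encode so a newline inside a delta cannot break the SSE framing.
        payload = json.dumps({"delta": delta})
        yield f"data: {payload}\n\n"
    yield "data: [DONE]\n\n"   # end-of-stream marker (assumed convention)

frames = list(sse_frames(["Hel", "", "lo\nworld"]))
for frame in frames:
    print(repr(frame))
```

In a real endpoint, the generator would be fed by the streaming loop above and returned with the text/event-stream content type.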

Use case 3 — Structured output (JSON schema)

LLMs can return JSON that conforms to a schema you define. Hugely useful for extraction, classification, agent steps.

from pydantic import BaseModel
from openai import OpenAI

class Invoice(BaseModel):
    company: str
    amount: float
    currency: str
    date: str

client = OpenAI()

response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract structured data from invoice text."},
        {"role": "user", "content": "Invoice from Acme Corp for $1,250.00 USD dated 2026-05-22"}
    ],
    response_format=Invoice,
)

invoice: Invoice = response.choices[0].message.parsed
print(invoice.company)   # "Acme Corp"
print(invoice.amount)    # 1250.0

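With structured outputs, the returned JSON is constrained to your Pydantic schema, so regex parsing and "the model returned 'amount: $1,250'" headaches go away. Two caveats: message.parsed is None when the model refuses the request, and a schema-valid value can still be factually wrong, so keep validating the content.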

Use case 4 — Tool / function calling (agents)

LLMs can call your functions when they decide they need to.

import json
from openai import OpenAI

def get_weather(city: str) -> dict:
    # In real code: call a weather API
    return {"city": city, "temp_c": 28, "condition": "sunny"}

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Mumbai?"}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    tools=tools,
)

# Did the model decide to call a tool? tool_calls is None when it answers directly.
message = response.choices[0].message
if message.tool_calls:
    messages.append(message)   # the assistant turn that requested the tool call(s)
    for tool_call in message.tool_calls:
        args = json.loads(tool_call.function.arguments)
        result = get_weather(**args)
        # Send the result back to the model, keyed by the call it answers
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result),
        })
    final = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
    print(final.choices[0].message.content)
    # e.g. "It's currently 28°C and sunny in Mumbai."
else:
    print(message.content)

This pattern — "model decides which tool to call, you execute it, you feed the result back" — is the core of every agent framework (LangChain, LlamaIndex, AutoGen, CrewAI, etc.). They just package this loop with retry, planning, and multi-step coordination.
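Because the model chooses the tool name and writes the arguments, the "you execute it" step deserves a guard. The sketch below is one way to do that with the standard library; TOOL_REGISTRY and run_tool are hypothetical names, and the check covers argument names and required parameters only, not types.

```python
import inspect
import json

def get_weather(city: str) -> dict:
    # Stub, as in the example above
    return {"city": city, "temp_c": 28, "condition": "sunny"}

TOOL_REGISTRY = {"get_weather": get_weather}   # only functions listed here can ever run

def run_tool(name: str, raw_arguments: str) -> dict:
    """Execute a model-requested tool call; return an error payload instead of raising."""
    func = TOOL_REGISTRY.get(name)
    if func is None:
        return {"error": f"unknown tool: {name}"}
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return {"error": f"arguments are not valid JSON: {exc}"}
    if not isinstance(args, dict):
        return {"error": "arguments must be a JSON object"}
    try:
        inspect.signature(func).bind(**args)   # rejects unknown or missing parameters
    except TypeError as exc:
        return {"error": f"bad arguments: {exc}"}
    return func(**args)

print(run_tool("get_weather", '{"city": "Mumbai"}'))
print(run_tool("get_weather", '{"town": "Mumbai"}'))
```

Returning the error as the tool result, rather than raising, gives the model a chance to correct its own call on the next turn.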

Use case 5 — Embeddings for semantic search

Embeddings turn text into vectors where similar meaning = nearby vectors. Foundation of RAG, semantic search, clustering.

from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=[text],
    )
    return np.array(response.data[0].embedding)

q = embed("How do I deploy a Next.js app to Vercel?")
docs = [
    embed("Vercel deploy guide: push to GitHub, connect repo, done."),
    embed("Apple pie recipe: flour, butter, apples..."),
    embed("Docker tutorial for beginners"),
]

# OpenAI embeddings are normalized to unit length, so the dot product equals cosine similarity
similarities = [float(q @ doc) for doc in docs]
print(similarities)
# e.g. [0.71, 0.06, 0.21] (illustrative values); the first doc is clearly most relevant

For production, compute document embeddings once at ingestion time and store them in Postgres + pgvector, Azure SQL VECTOR type, Pinecone, Weaviate, or Azure AI Search. At query time only the query itself needs embedding; never re-embed the corpus per query.
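What a vector store does at query time can be sketched in a few lines: score the query against every stored vector and return the best matches. The 3-dimensional vectors below are made up to stand in for real embeddings, and a real store would use an approximate index rather than this linear scan.

```python
import math

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: list, index: list, k: int = 2) -> list:
    """index: (doc_id, vector) pairs computed once, at ingestion time."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors, invented for illustration
index = [
    ("vercel-guide", [0.9, 0.1, 0.0]),
    ("apple-pie",    [0.0, 0.2, 0.9]),
    ("docker-intro", [0.6, 0.6, 0.1]),
]
results = top_k([1.0, 0.2, 0.0], index)
print([doc_id for doc_id, _ in results])
# ['vercel-guide', 'docker-intro']
```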

Use case 6 — Local inference with HuggingFace

For private data that can't go to a public API, or for cost-sensitive workloads:

# pip install transformers accelerate torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",   # auto-place on GPU(s)
)

prompt = "Explain transformer attention in one paragraph."
input_ids = tok.apply_chat_template(
    [{"role": "user", "content": prompt}],
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=300,
        temperature=0.3,
        top_p=0.9,
        do_sample=True,
    )

print(tok.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))

For production-grade local inference, use vLLM (PagedAttention, much faster) or llama.cpp (CPU-friendly quantized inference) instead of vanilla transformers.

Use case 7 — Token counting + cost estimation

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def estimate_cost(prompt: str, max_response: int = 500) -> float:
    input_tokens = len(enc.encode(prompt))
    # GPT-4o pricing (May 2026, USD per 1M tokens); rates change, so verify against the current price list
    input_cost_per_m = 5.00
    output_cost_per_m = 15.00
    return (
        (input_tokens / 1_000_000) * input_cost_per_m
        + (max_response / 1_000_000) * output_cost_per_m
    )

p = "Summarize the entire War and Peace novel"
print(f"~${estimate_cost(p):.4f} per call")

Build this into your monitoring. A bad query that retrieves 10 huge chunks can cost 50x a normal query.

All the major use cases — what LLMs handle today

Category	Use case	Notes
Chat	Customer support copilot	Stream + escalation to human
	Internal help-desk bot	RAG over internal docs
Code	Code completion (Copilot-style)	Fine-tuned on code corpora
	Code review / explanation	Strong on common languages
	SQL generation from natural language	Good with schema in prompt
Documents	RAG Q&A over PDFs	The single biggest enterprise use case
	Summarization	Long-form → short-form
	Translation	High quality for major languages
Extraction	Invoice / receipt parsing	Structured output via JSON schema
	Contract clause extraction	Legal, HR, compliance
	Email classification + routing	Sentiment + topic
Agents	Multi-step workflows	"Book me a flight under $500" — tool calling
	Coding agents	Write + run + debug autonomously
	Customer service agents	Use tools to look up orders, issue refunds
Multi-modal	Image + text Q&A	"What's wrong with this circuit diagram?"
	Document layout understanding	PDF with tables, columns
	Speech transcription → analysis	Whisper + LLM
Creative	Marketing copy, ad creative	Brainstorm 50 angles, pick 3
	Drafting emails / reports	Human-in-the-loop

Open-source vs proprietary — when to pick which

Factor	OpenAI / Anthropic / Google	Open-source (Llama, Mistral, Qwen, Phi)
Quality (general benchmark)	Highest (GPT-4o, Claude 3.5)	Llama 3.1 405B comparable; smaller models 6-12 months behind
Latency	Network-bound (~500ms-3s)	Can run locally — 50-200ms for small models
Cost (high volume)	Per-token API	Fixed GPU cost; cheaper only at sustained high volume (break-even depends on model size and GPU utilization)
Privacy	Data leaves your tenant (mostly OK with Azure OpenAI in your subscription)	Fully on-prem possible
Fine-tuning	Limited (OpenAI has it, expensive)	Full control
Latest features	Day-1 (newest releases)	Often 3-6 months later
Operational burden	None	Lots — GPUs, serving infra, monitoring
Right for	Most apps, especially early stage	High volume + privacy-sensitive + technical team

Pragmatic default: start with Azure OpenAI / GPT-4o-mini. Move to open-source when you have:

A measurable cost crisis at scale, OR
A specific privacy / regulatory requirement, OR
A specific fine-tuning need

Don't run open-source models locally just to feel virtuous — running production-grade LLM inference well is its own engineering problem.

Production pitfalls

1. Ignoring the context window

GPT-4o has 128k tokens. Sounds huge until you're packing 50 documents + system prompt + chat history. Token-budget every component:

def fit_to_context(system: str, docs: list[str], history: list[dict],
                   max_total_tokens: int = 100_000) -> tuple[str, list[str], list[dict]]:
    # Reserve room for the response
    budget = max_total_tokens - 4000  # reserve 4k for response
    # ... drop oldest history items, then truncate docs, until fit ...
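One possible way to fill in the elided body is sketched below. It makes several assumptions that are not in the original: tokens are approximated at 4 characters each (swap in a real tokenizer via the count parameter), docs are assumed sorted best-first, and whole documents are dropped from the end rather than truncated.

```python
from typing import Callable

def approx_tokens(text: str) -> int:
    # Rough heuristic from the tokenization section: about 4 characters per token.
    return max(1, len(text) // 4)

def fit_to_context(system: str, docs: list, history: list,
                   max_total_tokens: int = 100_000, reserve: int = 4000,
                   count: Callable[[str], int] = approx_tokens) -> tuple:
    # Reserve room for the response and the system prompt up front.
    budget = max_total_tokens - reserve - count(system)
    if budget < 0:
        raise ValueError("system prompt alone exceeds the token budget")
    history, docs = list(history), list(docs)   # work on copies

    def used() -> int:
        return (sum(count(m["content"]) for m in history)
                + sum(count(d) for d in docs))

    # 1) Drop the oldest chat turns first.
    while history and used() > budget:
        history.pop(0)
    # 2) Then drop the lowest-ranked documents (assumes best-first ordering).
    while docs and used() > budget:
        docs.pop()
    return system, docs, history

system_prompt = "s" * 40                      # about 10 tokens
docs_in = ["d" * 400] * 3                     # about 100 tokens each
history_in = [{"role": "user", "content": "h" * 400}] * 2
_, kept_docs, kept_history = fit_to_context(
    system_prompt, docs_in, history_in, max_total_tokens=350, reserve=40)
print(len(kept_docs), len(kept_history))
# 3 0
```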

2. No retry logic for transient failures

OpenAI returns 429 (rate limit), 500 (server error), 503 fairly often. Use exponential backoff:

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(5), wait=wait_exponential(min=1, max=30))
def call_llm(prompt: str) -> str:
    response = client.chat.completions.create(...)
    return response.choices[0].message.content
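For readers who want to see what the decorator is doing, here is a hand-rolled sketch of exponential backoff with jitter. TransientError is a hypothetical marker class for retryable failures; in real code you would catch the SDK's rate-limit and server-error exceptions and let authentication or validation errors fail immediately.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for retryable failures (429, 500, 503)."""

def with_backoff(func, max_attempts: int = 5, base_delay: float = 1.0,
                 max_delay: float = 30.0, sleep=time.sleep):
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise                      # out of attempts: surface the error
            # Delay doubles each attempt, capped at max_delay; the random factor
            # (full jitter) keeps many clients from retrying in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))

calls = {"n": 0}

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("simulated 429")
    return "ok"

result = with_backoff(flaky, sleep=lambda seconds: None)   # skip real sleeping in the demo
print(result, calls["n"])
# ok 3
```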

3. Trusting LLM output without validation

The model can return malformed JSON, hallucinate field values, return wrong types. Always:

Use structured output (Pydantic schema)
Validate ranges / formats
Have an "I don't know" path
Log all responses for audit
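As a sketch of the "validate ranges / formats" step, the function below checks an extracted invoice using only the standard library. The field names mirror the invoice example earlier in this guide; the specific rules (positive amount, ISO date, non-empty company) are assumptions chosen for illustration.

```python
import json
from datetime import date

def validate_invoice(raw: str) -> dict:
    """Return a cleaned invoice dict, or raise ValueError with the reason."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    company = data.get("company")
    if not isinstance(company, str) or not company.strip():
        raise ValueError("company missing or empty")

    amount = data.get("amount")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount <= 0:
        raise ValueError("amount must be a positive number")

    try:
        parsed_date = date.fromisoformat(str(data.get("date")))
    except ValueError as exc:
        raise ValueError("date must be an ISO date such as 2026-05-22") from exc

    return {"company": company.strip(), "amount": float(amount), "date": parsed_date}

ok = validate_invoice('{"company": "Acme Corp", "amount": 45000, "date": "2026-05-22"}')
print(ok["company"], ok["amount"], ok["date"].isoformat())
# Acme Corp 45000.0 2026-05-22
```

A ValueError here is the natural trigger for the "I don't know" path: log the raw response and route the item to a retry or a human.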

4. Not measuring quality

Build a golden set of 30-50 representative input/expected-output pairs. Run them whenever you change the prompt, model, or temperature. Without this you're guessing whether you improved or regressed.
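A golden-set runner can start very small. The sketch below scores by exact match after normalisation, which is an assumption that suits extraction or classification; free-form answers usually need fuzzier scoring. The fake model is a lookup table so the example runs without an API key.

```python
def run_golden_set(predict, golden: list) -> dict:
    """predict: callable that takes the input text and returns the model's answer."""
    failures = []
    for case in golden:
        got = predict(case["input"])
        if got.strip().lower() != case["expected"].strip().lower():
            failures.append({"input": case["input"],
                             "expected": case["expected"], "got": got})
    passed = len(golden) - len(failures)
    return {"pass_rate": passed / len(golden) if golden else 0.0,
            "failures": failures}

golden = [
    {"input": "Translate to French: cat", "expected": "chat"},
    {"input": "Translate to French: dog", "expected": "chien"},
]
# Stand-in for a real model call: gets the first case right, the second wrong.
fake_model = {"Translate to French: cat": "Chat",
              "Translate to French: dog": "chat"}.get

report = run_golden_set(fake_model, golden)
print(report["pass_rate"])
# 0.5
```

Storing the report per prompt or model version makes regressions visible as a number instead of a hunch.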

5. Ignoring the "I don't know" path

The default LLM behavior is to always answer, even confidently wrong. For factual systems, force the model to say "I don't know" when context is insufficient:

system = """You answer using ONLY the provided context. If the answer isn't
in the context, respond with: "I couldn't find that information." Do not
guess or invent facts."""

6. Forgetting the human-in-the-loop layer for high-stakes actions

LLMs occasionally do dumb things confidently. For agents that take real-world actions (sending emails, charging cards, deleting data), require explicit user confirmation for irreversible operations.

7. No caching for repeated queries

Same question asked 100 times = 100 LLM calls = 100× cost. Semantic cache (vector similarity on the query) catches near-duplicate questions:

# embed, vector_search, store_cache and call_llm stand for your own helpers
def cached_call(question: str):
    q_emb = embed(question)
    cached = vector_search(q_emb, threshold=0.95)
    if cached:
        return cached
    answer = call_llm(question)
    store_cache(q_emb, answer, ttl=3600)
    return answer

A 15-30% cache hit rate is common on Q&A traffic, and every hit is an LLM call you do not pay for.
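To show the moving parts of such a cache, here is a self-contained in-memory sketch. Everything in it is an assumption made for illustration: the SemanticCache class, the linear scan instead of a vector index, and especially toy_embed, a word-count stand-in for a real embedding model.

```python
import math
import time

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.95, ttl: float = 3600.0,
                 clock=time.monotonic):
        self.embed, self.threshold, self.ttl, self.clock = embed, threshold, ttl, clock
        self.entries = []   # (vector, answer, expires_at)

    @staticmethod
    def _cosine(a, b) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def get_or_call(self, question: str, call_llm):
        now = self.clock()
        self.entries = [e for e in self.entries if e[2] > now]   # evict expired entries
        vec = self.embed(question)
        best = max(self.entries, key=lambda e: self._cosine(vec, e[0]), default=None)
        if best is not None and self._cosine(vec, best[0]) >= self.threshold:
            return best[1]                                       # cache hit
        answer = call_llm(question)                              # cache miss
        self.entries.append((vec, answer, now + self.ttl))
        return answer

def toy_embed(text: str) -> list:
    # Stand-in for a real embedding model: counts of a few keywords.
    words = text.lower().replace("?", "").split()
    return [float(words.count(w)) for w in ("reset", "password", "refund", "order")]

llm_calls = []

def fake_llm(question: str) -> str:
    llm_calls.append(question)
    return f"answer to: {question}"

cache = SemanticCache(toy_embed)
a1 = cache.get_or_call("How do I reset my password?", fake_llm)
a2 = cache.get_or_call("reset password how?", fake_llm)     # near-duplicate: served from cache
a3 = cache.get_or_call("Where is my refund?", fake_llm)     # different topic: new call
print(len(llm_calls))
# 2
```

The threshold is the knob to watch: set it too low and users receive answers to questions they did not ask.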

8. Not handling cost spikes

A single request with a 1 MB prompt can cost around ₹100; 1,000 such requests a day adds up to ₹100,000 a day. Set hard token limits on input length per request, and alert when daily spend exceeds budget.

The Python ecosystem you should know

If you're building serious LLM applications in Python, learn these libraries in order of priority:

Tier	Library	What for
Must	`openai` / `anthropic` / `google-generativeai`	Call hosted models
Must	`tiktoken` (or `transformers.AutoTokenizer`)	Token counting
Must	`pydantic`	Structured output validation
Must	`tenacity` (retry) + `httpx` (async HTTP)	Production-grade calls
Should	`langchain` OR `llama-index`	Higher-level orchestration
Should	`sentence-transformers`	Free local embeddings
Should	`chromadb` or `qdrant-client`	Vector DB clients
Nice	`transformers` (HuggingFace)	Local model inference
Nice	`vllm`	Fast self-hosted inference
Nice	`instructor`	Cleaner structured output
Nice	`pydantic-ai`	Modern agent framework

For a fresh project: start with raw openai SDK + Pydantic for structured output. Reach for LangChain only when you have multiple chained LLM steps + tool use; don't add the abstraction tax for a simple chat app.

What you should know about LLMs but probably don't

A few non-obvious facts that bite teams in production:

The model's "confidence" is unreliable. GPT will say "I am 99% sure" about complete fabrications. Don't trust self-reported confidence; use retrieval distance, schema validation, and human review for high-stakes flows.
Same prompt → different answers. Even at temperature 0, slight variations occur (especially across model versions). Pin model version (gpt-4o-2024-08-06, not gpt-4o) for reproducibility.
Context order matters. Models pay more attention to the start and end of context. Bury an important fact in the middle and the model may miss it ("lost in the middle" effect). Put critical context up front or at the end.
"Bigger" doesn't always mean "better". GPT-4o-mini outperforms GPT-3.5 on most tasks and launched at well under half the price. For each use case, benchmark several models; don't default to the most expensive.
Function calling doesn't always work. Sometimes the model invents tool calls or argument types. Validate before executing; have a fallback "I'd like to call this tool but the args don't match" handler.
Long context ≠ effective context. Just because the model accepts 128k tokens doesn't mean it uses them well. Quality drops with very long contexts. Stay under ~32k for most use cases.
Streaming costs the same as non-streaming. You don't save tokens by streaming — same model, same compute. You save perceived latency only.

Summary

Large Language Models are general-purpose text predictors trained on massive amounts of data, built using the transformer architecture. They work by:

Tokenizing input into sub-word units
Embedding each token into a vector
Running it through dozens of transformer blocks that mix attention (cross-token context) with MLPs (per-token knowledge)
Producing a probability distribution over the vocabulary
Sampling the next token, appending, and repeating

Python dominates the ecosystem because of NumPy/PyTorch tensor ergonomics, the HuggingFace + research community gravity, and the seamless researcher → production handoff.

To ship production LLM applications:

Pick the right model for the use case (don't default to GPT-4o)
Stream responses for UX
Use structured output (Pydantic schemas) for any non-conversational output
Always have an "I don't know" path
Measure quality on a golden set
Cache, retry, monitor cost

The model is the easy part. The hard part is everything else — context management, error handling, evaluation, cost control, and integrating LLM output into a system that real users can trust.


