What is a Large Language Model (LLM) and how does it work internally?
An LLM is a transformer-based neural network trained on massive amounts of text to predict the next token given a sequence of preceding tokens. Stack enough transformer layers, train on enough data, and the result is a model that can write, reason, code, translate, and answer questions, all from one core capability: next-token prediction.
The 5-step inference pipeline
1. Tokenize → break text into sub-word IDs
2. Embed → map each ID to a high-dim vector
3. Transformer → N blocks of self-attention + MLP
4. Output head → project to vocabulary probabilities
5. Sample → pick next token, append, repeat
Step 1 — Tokenization (BPE)
The model has a fixed vocabulary (~32k-200k tokens). Text is split into sub-word units using Byte-Pair Encoding:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
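ids = enc.encode("Building LLM apps")  # → list of integer token IDs (exact values depend on the encoding)
assert enc.decode(ids) == "Building LLM apps"  # decoding round-trips the text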
Rule of thumb: 1 token ≈ 4 English characters ≈ 0.75 words.
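To make the merge idea behind BPE concrete, here is a minimal sketch of BPE training on a toy corpus. It is not how tiktoken works internally (production tokenizers operate on bytes and load a pre-trained merge table); the function names and the corpus are illustrative assumptions.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common one."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

def train_bpe(corpus, num_merges):
    """Start from single characters and greedily merge the most frequent pair."""
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

# Toy corpus (hypothetical): after two merges, "low" has become a single symbol.
merges, words = train_bpe(["low", "lower", "lowest", "low"], num_merges=2)
```

Each learned merge becomes one vocabulary entry, which is why frequent words end up as a single token while rare words are split into several pieces.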
Step 2 — Embeddings + positions
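Each token ID is looked up in a learned embedding table of shape [vocab_size, d_model]. Position information is then injected: older models add a positional embedding to each token vector, while most modern models use RoPE (rotary position embedding), which rotates the query and key vectors inside each attention layer.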
For Llama 3 8B: d_model = 4096. For GPT-3 175B: d_model = 12288. Each token becomes a vector of that dimensionality.
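The lookup and the rotary idea can be sketched in plain Python. This is a toy illustration under stated assumptions: the table is random rather than learned, the sizes are tiny, and `rope` uses the interleaved-pair formulation with base 10000 (real implementations may pair dimensions differently). The property worth noticing is that the dot product of two rotated vectors depends only on their relative offset.

```python
import math
import random

random.seed(0)
vocab_size, d_model = 10, 4  # toy sizes; the Llama 3 8B example uses d_model = 4096

# Embedding table of shape [vocab_size, d_model]; learned in training, random here.
embedding_table = [[random.gauss(0.0, 0.02) for _ in range(d_model)]
                   for _ in range(vocab_size)]

def embed(token_ids):
    """Look up one row per token ID, giving a [seq_len, d_model] list of vectors."""
    return [embedding_table[t] for t in token_ids]

def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle pos * base**(-i / d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

x = embed([3, 7, 3])        # the same ID always maps to the same vector
q, k = x[0], x[1]           # stand-ins for a query and a key vector
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 12), rope(k, 10))  # same offset (2), different absolute positions
# near and far agree up to floating-point error: only the offset matters.
```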
Step 3 — Transformer blocks (the actual computation)
A stack of structurally identical blocks, each with its own weights: 32 in Llama 3 8B, 80 in Llama 3 70B, 96 in GPT-3 175B. Each block has two stages:
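Self-attention: for each token, compute Q (query), K (key), V (value) vectors, then softmax(QK^T / sqrt(d_head)) · V, where d_head is the per-head dimension. A causal mask hides later positions, so each token can only "look at" itself and relevant earlier tokens.
import math, torch
mask = torch.ones(Q.size(-2), Q.size(-2), device=Q.device).triu(diagonal=1).bool()  # True = future position
scores = (Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))).masked_fill(mask, float("-inf"))
weights = torch.softmax(scores, dim=-1) # which past tokens matter for this one?
context = weights @ V # weighted sum of values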
Multi-head: 32-128 parallel attention computations, each with its own learned focus pattern.
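For readers who want to run attention without a tensor library, here is a dependency-free sketch of causal attention and the multi-head split. It assumes Q, K, and V are already computed and leaves out the learned projection matrices (and the final output projection), so it only illustrates the masking, scaling, and head-splitting logic.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Q, K, V: lists of shape [n, d_head]. Returns (context, weights)."""
    n, d_head = len(Q), len(Q[0])
    context, weights = [], []
    for i in range(n):
        # Token i only scores positions 0..i: this is the causal mask.
        scores = [sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_head)
                  for j in range(i + 1)]
        w = softmax(scores)
        weights.append(w + [0.0] * (n - i - 1))
        context.append([sum(w[j] * V[j][c] for j in range(i + 1))
                        for c in range(len(V[0]))])
    return context, weights

def multi_head_attention(Q, K, V, n_heads):
    """Split the feature dimension into n_heads slices, attend per head, concatenate."""
    d_head = len(Q[0]) // n_heads

    def head(M, h):
        return [row[h * d_head:(h + 1) * d_head] for row in M]

    per_head = [causal_attention(head(Q, h), head(K, h), head(V, h))[0]
                for h in range(n_heads)]
    return [sum((per_head[h][i] for h in range(n_heads)), [])
            for i in range(len(Q))]

# Toy input: 3 tokens, 4 features, reused as Q, K and V for brevity.
X = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
context, weights = causal_attention(X, X, X)
multi = multi_head_attention(X, X, X, n_heads=2)
```

The first token can only attend to itself, so its weight row is all on position 0 and its context equals its own value vector.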
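Feed-forward (MLP): a per-token transformation that expands each vector into a wider hidden layer (a few times d_model), applies a non-linearity, and projects back to d_model. Interpretability work suggests this is where much factual knowledge ("Paris is the capital of France") is stored.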
Both stages are wrapped in residual connections (x + sublayer(x)) and a normalization layer (LayerNorm, or RMSNorm in Llama-style models).
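How the two stages, the residual connections, and the normalization fit together can be sketched as follows. Several things here are assumptions rather than anything the text specifies: the pre-norm placement (normalize before each sublayer), ReLU as the non-linearity, LayerNorm without its learned gain and bias, and `causal_mean` as a hypothetical stand-in for real self-attention.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def mlp(vec, W1, W2):
    """Expand, apply ReLU, project back. W1: [d_model, d_ff], W2: [d_ff, d_model]."""
    hidden = [max(0.0, sum(v * W1[i][j] for i, v in enumerate(vec)))
              for j in range(len(W1[0]))]
    return [sum(h * W2[j][k] for j, h in enumerate(hidden))
            for k in range(len(W2[0]))]

def add(a, b):
    """Element-wise sum of two [n, d_model] lists: the residual connection."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def transformer_block(x, attention, W1, W2):
    """Pre-norm block: x + attention(norm(x)), then x + mlp(norm(x))."""
    x = add(x, attention([layer_norm(row) for row in x]))
    x = add(x, [mlp(layer_norm(row), W1, W2) for row in x])
    return x

def causal_mean(rows):
    """Stand-in for self-attention: each token averages itself and earlier tokens."""
    return [[sum(r[c] for r in rows[:i + 1]) / (i + 1) for c in range(len(rows[0]))]
            for i in range(len(rows))]

x = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]   # 2 tokens, d_model = 4
W1 = [[0.1] * 8 for _ in range(4)]                 # d_ff = 8, toy weights
W2 = [[0.1] * 4 for _ in range(8)]
y = transformer_block(x, causal_mean, W1, W2)
```

Because of the residual path, a block whose sublayers output zeros returns its input unchanged, which is part of why deep stacks of such blocks remain trainable.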
Step 4 — Output head
A linear layer projects the last token's representation to vocabulary-sized logits. Softmax → probability distribution over every possible next token.
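As a small worked example of this step, the sketch below projects a 2-dimensional hidden vector onto a 3-token vocabulary and converts the logits to probabilities. The weights and sizes are made up for illustration; a real output head has shape [d_model, vocab_size].

```python
import math

def output_head(hidden, W_out):
    """hidden: [d_model]; W_out: [d_model, vocab_size] -> one logit per vocabulary entry."""
    return [sum(h * W_out[i][t] for i, h in enumerate(hidden))
            for t in range(len(W_out[0]))]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

hidden = [1.0, -1.0]             # last token's representation (toy values)
W_out = [[2.0, 0.0, -1.0],
         [0.0, 1.0, 0.5]]
logits = output_head(hidden, W_out)   # → [2.0, -1.0, -1.5]
probs = softmax(logits)               # sums to 1; token 0 is the most likely
```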
Step 5 — Sampling
Pick the next token using temperature / top-p / top-k. Append its ID to the sequence and repeat from step 2; the new token is already an ID, so nothing needs re-tokenizing.
This is the autoregressive loop: one token per forward pass. It's also why LLM latency grows roughly linearly with output length.
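The sampling knobs and the loop can be sketched together. This is one plausible way to combine temperature, top-k, and top-p, not the exact procedure of any particular library; `next_logits` is a hypothetical callable standing in for steps 2 to 4, and `toy_model` is a made-up model that always prefers the next ID.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick a token ID from logits using temperature, then top-k, then top-p."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda t: logits[t])  # greedy decoding
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Candidate IDs sorted from most to least likely.
    order = sorted(range(len(probs)), key=lambda t: probs[t], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for t in order:  # keep the smallest prefix whose mass reaches top_p
            kept.append(t)
            cumulative += probs[t]
            if cumulative >= top_p:
                break
        order = kept
    # random.choices renormalizes the remaining weights for us.
    return rng.choices(order, weights=[probs[t] for t in order], k=1)[0]

def generate(next_logits, prompt_ids, max_new_tokens, eos_id=None, **sampling):
    """Autoregressive loop: get logits, sample, append, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = sample_next(next_logits(ids), **sampling)
        ids.append(token)
        if token == eos_id:
            break
    return ids

def toy_model(ids, vocab_size=5):
    """Hypothetical model: strongly prefers (last ID + 1) mod vocab_size."""
    logits = [0.0] * vocab_size
    logits[(ids[-1] + 1) % vocab_size] = 5.0
    return logits

print(generate(toy_model, [0], max_new_tokens=3, temperature=0))  # → [0, 1, 2, 3]
```

Lower temperatures sharpen the distribution toward the top token, while top-k and top-p cut off the unlikely tail before sampling.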
Key intuitions
- It's pattern recognition at massive scale, not reasoning — though the patterns are deep enough that it FEELS like reasoning for many tasks.
- Knowledge lives in MLP weights — attention shuffles context; the MLP recalls facts.
- Attention is O(n²) in sequence length: doubling the context roughly quadruples attention compute. This is one reason context windows are bounded.
- The KV-cache stores past attention keys/values during inference, so each new token costs O(n) attention work instead of recomputing the full O(n²).
- The model has no notion of "true" — it predicts plausible next tokens. Hallucinations are inevitable; the engineering job is to constrain when it matters.
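The two attention-cost points in the list above can be checked with a small sketch. It counts query-key dot products only (a simplification that ignores the MLP and the projections), and `KVCache` is an illustrative class, not any library's API.

```python
import math

def attend(q, keys, values):
    """One query against all cached keys/values: len(keys) dot products."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[c] for e, v in zip(exps, values))
            for c in range(len(values[0]))]

class KVCache:
    """Keeps every past key and value so they are never recomputed."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        """Append this token's key/value, then attend with its query."""
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

def full_attention_scores(n):
    """Unmasked attention over n tokens computes an n x n score matrix."""
    return n * n

def scores_without_cache(n):
    """Step t recomputes the whole causal score matrix: 1 + 2 + ... + t entries."""
    return sum(t * (t + 1) // 2 for t in range(1, n + 1))

def scores_with_cache(n):
    """Step t computes only the new token's row: t entries."""
    return sum(range(1, n + 1))

# Toy (q, k, v) triples for three tokens.
tokens = [([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
          ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
          ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0])]
cache = KVCache()
outputs = [cache.step(q, k, v) for q, k, v in tokens]
```

The price of the cache is memory: it grows by one key and one value per token, per layer, and per head, which is why long contexts are memory-hungry at inference time.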
What makes LLMs different from previous NLP
Before LLMs, each task needed its own trained model (sentiment, translation, summarization, Q&A: four different models, four training pipelines). LLMs collapse all of them into one model controlled by natural-language instructions.
Interview-grade summary
"An LLM is a transformer-based neural net trained to predict the next token. Inference flows: tokenize → embed → transformer blocks (attention + MLP) → output logits → sample → repeat. Attention lets each token attend to relevant earlier ones; MLPs hold factual knowledge. The whole system is a probability model over text, not a reasoning engine — though pattern matching at this scale produces reasoning-like behavior."