What is a Large Language Model (LLM) and how does it work internally?

Randhir Jassal · Accepted Answer

An LLM is a transformer-based neural network trained on massive amounts of text to predict the next token given a sequence of preceding tokens. Stack enough transformer blocks, train on enough data, and the result is a model that can write, reason, code, translate, and answer questions, all from one core capability: next-token prediction.

The 5-step inference pipeline

Step 1: Tokenization (BPE)

The model has a fixed vocabulary (roughly 32k-200k tokens). Text is split into sub-word units using Byte-Pair Encoding, which starts from individual bytes or characters and repeatedly merges the most frequent adjacent pair, so common words end up as a single token while rare words are split into several pieces. Rule of thumb: 1 token ≈ 4 English characters ≈ 0.75 words.

Step 2: Embeddings + positions

Each token ID is looked up in a learned embedding table of shape [vocab_size, d_model], so each token becomes a vector of d_model dimensions. For Llama 3 8B, d_model = 4096. For GPT-3 175B, d_model = 12288. Position information is also injected. Most modern models use RoPE (rotary positional embedding), which is applied by rotating the query and key vectors inside each attention layer rather than by adding a position vector to the embedding.

Step 3: Transformer blocks (the actual computation)

A stack of blocks that are identical in structure but have their own weights, typically a few dozen to around a hundred (32 in Llama 3 8B, 96 in GPT-3 175B). Each block has two stages:

- Self-attention: for each token, compute Q (query), K (key), and V (value) vectors, then softmax(QK^T / sqrt(d_k)) · V, where d_k is the per-head dimension. A causal mask restricts each token to itself and earlier positions, which lets each token "look at" relevant earlier tokens. Multi-head: 32-128 parallel attention computations, each with its own learned focus pattern.
- Feed-forward (MLP): a per-token transformation through a non-linear layer. This is where much of the factual knowledge ("Paris is the capital of France") appears to be stored.

Both stages have residual connections (x + sublayer(x)) and a normalization layer (LayerNorm, or RMSNorm in many recent models).

Step 4: Output head

A linear layer projects the last token's representation to vocabulary-sized logits. Softmax turns the logits into a probability distribution over every possible next token.

Step 5: Sampling

Pick the next token using temperature / top-p / top-k. Append it. The new token is already a token ID, so the loop resumes at step 2 rather than re-tokenizing. This is the autoregressive loop: one token at a time. It's also why LLM latency scales roughly linearly with output length.
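Steps 4 and 5 above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular inference library implements it: the function names (`softmax`, `filter_top_k_top_p`, `sample_next_token`) and the four-entry `logits` list are made up for the example, and the order of operations (temperature, then top-k, then top-p) is one common convention that real libraries may vary.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities. temperature must be > 0;
    values below 1 sharpen the distribution, values above 1 flatten it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Keep the top_k most likely tokens, then the smallest prefix of those
    whose cumulative probability reaches top_p. Returns {token_id: prob}."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        kept, mass = [], 0.0
        for i in ranked:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
        ranked = kept
    total = sum(probs[i] for i in ranked)
    return {i: probs[i] / total for i in ranked}  # renormalise the survivors

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Step 4 (softmax over logits) followed by step 5 (filtered sampling)."""
    probs = softmax(logits, temperature)
    candidates = filter_top_k_top_p(probs, top_k, top_p)
    ids = list(candidates)
    weights = [candidates[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]

# Hypothetical logits for a 4-token vocabulary
logits = [2.0, 1.0, 0.5, -1.0]
next_id = sample_next_token(logits, temperature=0.7, top_k=3, top_p=0.9,
                            rng=random.Random(0))
print(next_id)
```

With `top_k=1` the function always returns the highest-logit token (greedy decoding); loosening `top_k` and `top_p` or raising `temperature` trades determinism for variety.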
Key intuitions

- It's pattern recognition at massive scale, not reasoning, though the patterns are deep enough that it feels like reasoning for many tasks.
- Knowledge lives in MLP weights: attention moves information between positions; the MLP recalls facts.
- Attention is O(n²) in sequence length: doubling the context means 4× the attention compute. This is one reason context windows are bounded.
- The KV-cache stores past attention keys and values during inference, so each new token costs O(n) attention work instead of recomputing the full O(n²).
- The model has no notion of "true"; it predicts plausible next tokens. Hallucinations are inevitable; the engineering job is to constrain them when it matters.

What makes LLMs different from previous NLP

Before LLMs, each task (sentiment, translation, summarization, Q&A) needed its own trained model and its own training pipeline. LLMs collapse all of them into one model controlled by natural-language instructions.

Interview-grade summary

"An LLM is a transformer-based neural net trained to predict the next token. Inference flows: tokenize → embed → transformer blocks (attention + MLP) → output logits → sample → repeat. Attention lets each token attend to relevant earlier ones; MLPs hold factual knowledge. The whole system is a probability model over text, not a reasoning engine, though pattern matching at this scale produces reasoning-like behavior."
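The attention formula and the KV-cache intuition above can be checked with a toy single-head example in plain Python. This is a sketch under stated assumptions: `attend`, `full_causal_attention`, and `KVCache` are hypothetical names, and the `Q`, `K`, `V` vectors are hand-written stand-ins for what a real model would produce with learned projections. It shows that caching keys and values gives the same outputs as recomputing causal attention from scratch, while each new token only needs one pass over the cache.

```python
import math

def attend(q, keys, values):
    """softmax(q · K^T / sqrt(d)) · V for one query over the given keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

def full_causal_attention(Q, K, V):
    """Recompute every position: token t sees positions 0..t, so the total
    work grows as O(n^2) in the sequence length n."""
    return [attend(Q[t], K[: t + 1], V[: t + 1]) for t in range(len(Q))]

class KVCache:
    """Stores keys/values of past tokens so a new token needs one attend() call."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)  # O(n) work for this token

# Toy per-token vectors (assumed, not taken from any real model)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

cache = KVCache()
incremental = [cache.step(q, k, v) for q, k, v in zip(Q, K, V)]
full = full_causal_attention(Q, K, V)
print(incremental == full)
```

The first token can only attend to itself, so its output is just its own value vector; later tokens get a weighted mix of everything in the cache so far.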

