LLM sampling parameters — temperature, top-p, top-k — when to tune each

Question

Randhir Jassal · Accepted Answer

Sampling is how an LLM picks the next token from a probability distribution over its vocabulary. Three parameters control it: temperature, top-p, and top-k. Mis-tuned sampling parameters are a common cause of "the LLM keeps giving stupid answers" complaints.

Temperature: the creativity dial

After the model produces logits (one number per vocabulary token), it computes logits / temperature before the softmax.

- Temperature → 0: the distribution becomes spiky. The top token's probability approaches 1.0. Effectively greedy and deterministic. (At exactly 0 the division is undefined, so implementations typically special-case it as an argmax.)
- Temperature = 1: the original distribution. Standard sampling.
- Temperature > 1: the distribution flattens. Lower-ranked tokens become more likely. More creative / random / weird.

When to use each temperature

| Use case | Temperature |
|---|---|
| Factual Q&A (RAG, extraction) | 0.0 (deterministic, no creativity) |
| Code generation | 0.0 - 0.3 (slight variability for hard problems) |
| Structured extraction (JSON) | 0.0 |
| Summarization | 0.2 - 0.5 (slight variety in phrasing) |
| General chat assistant | 0.5 - 0.7 (natural-feeling but not weird) |
| Creative writing, brainstorming | 0.7 - 1.0 |
| Generating diverse alternatives | 1.0 - 1.5 |

Default rule: when in doubt, start at 0.0 for factual tasks. Move up only if outputs feel monotonous.

Top-k: limit to the k most likely tokens

After softmax, keep only the k highest-probability tokens, renormalize, and sample from those.

- top_k = 1: greedy (same as temperature 0)
- top_k = 50: typical default; cuts the long tail of garbage tokens
- top_k = ∞: no filtering

Top-k is brittle when the distribution is uneven. If the top 3 tokens carry 99% of the probability, sampling from the top 50 still includes 47 nearly impossible tokens. Top-p is usually better.

Top-p (nucleus sampling): the smart filter

Pick the smallest set of tokens whose cumulative probability is at least p, renormalize, then sample from that set.
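The temperature scaling, top-k cut, and nucleus rule described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names and the toy logits are made up, temperature 0 is treated as greedy by convention, and real inference stacks do this on tensors rather than lists.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by temperature."""
    if temperature <= 0:
        # Assumption: treat temperature 0 as greedy (all mass on the argmax).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs):
    """Rescale the surviving probabilities so they sum to 1 again."""
    total = sum(probs)
    return [p / total for p in probs]

def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero out the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    return renormalize([p if i in keep else 0.0 for i, p in enumerate(probs)])

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability is >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize([q if i in keep else 0.0 for i, q in enumerate(probs)])

# A confident distribution: three strong tokens plus a long tail of weak ones.
logits = [5.0, 4.5, 4.0] + [-2.0] * 47
probs = softmax_with_temperature(logits, 1.0)

kept_by_top_k = sum(1 for q in top_k_filter(probs, 10) if q > 0)
kept_by_top_p = sum(1 for q in top_p_filter(probs, 0.9) if q > 0)
print(kept_by_top_k, kept_by_top_p)  # → 10 3
```

On this toy distribution the fixed top-k cut keeps seven tail tokens it has no reason to keep, while the nucleus rule stops after the three tokens that already cover 90% of the mass.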
- top_p = 0.9: standard default; keep tokens until 90% of the probability mass is covered
- top_p = 1.0: no filtering (consider all tokens)
- top_p = 0.5: only the most likely tokens; tighter

Top-p adapts to the distribution. If the model is confident (one clear next token), the nucleus naturally narrows. If it's uncertain (many plausible tokens), the nucleus widens. Top-p is usually preferred over top-k. Use top-k as a fallback / extra safety.

Combining them

In practice you often combine all three: temperature=0.7, top_p=0.9, top_k=50. This means:

1. Rescale the distribution with temperature (0.7 sharpens it slightly; values above 1 would flatten it)
2. Then keep only the nucleus (90% mass)
3. Then enforce a top-k cap (defense against unusual distributions)

The order of the two filters varies by implementation; some apply the top-k cap before the nucleus cut.

Most providers expose temperature and top_p; some also expose top_k (Anthropic does; OpenAI doesn't directly).

Other related parameters

frequency_penalty / presence_penalty

Reduce repetition by penalizing tokens that have already appeared.

- frequency_penalty (-2.0 to 2.0 in OpenAI's API; positive values discourage repetition): the penalty grows with how many times the token has already appeared
- presence_penalty (same range): a flat penalty for any token that has appeared at least once

Useful for: brainstorming (force variety), reducing repetitive lists.

seed (for reproducibility at temp > 0)

OpenAI accepts a seed parameter. Same seed + same parameters = same output (mostly; determinism is best-effort and model versions can still vary). Useful for: reproducible tests, regression detection.

Common interview questions

"You're building a structured-extraction service that returns JSON. What sampling params?"

temperature=0.0. No need for variety; we want the most likely correct extraction. Maybe seed=42 for reproducibility in tests.

"Why does temperature 0.0 still sometimes produce slightly different outputs?"

Two reasons: 1) floating-point non-determinism on GPUs (especially across batch sizes), and 2) model-version updates from the provider. To pin: use a specific model version (gpt-4o-2024-08-06, not gpt-4o) AND set a seed.

"Customer says GPT keeps giving the same answer. Suggest fixes."

Likely temperature is 0. Bump to 0.5-0.7. If they want diverse alternatives in one call, generate n=5 samples at temperature 0.8.

"GPT keeps giving weird, off-topic answers. Suggest fixes."

Temperature t…
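Going back to frequency_penalty and presence_penalty: a rough sketch of the adjustment, modeled on the formula OpenAI's documentation describes (subtract count × frequency_penalty, plus a one-time presence_penalty, from the logit of each token that has already appeared). The function name and toy logits are made up for illustration; providers apply this server-side and their exact details may differ.

```python
from collections import Counter

def apply_repetition_penalties(logits, generated_token_ids,
                               frequency_penalty=0.0, presence_penalty=0.0):
    """Return a copy of logits with penalties subtracted for already-used tokens."""
    counts = Counter(generated_token_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        adjusted[token_id] -= count * frequency_penalty  # grows with each repeat
        adjusted[token_id] -= presence_penalty           # flat, applied once
    return adjusted

logits = [2.0, 2.0, 2.0, 2.0]
history = [0, 0, 0, 1]  # token 0 appeared three times, token 1 once

penalized = apply_repetition_penalties(
    logits, history, frequency_penalty=0.5, presence_penalty=0.25
)
print(penalized)  # → [0.25, 1.25, 2.0, 2.0]
```

Tokens that never appeared keep their original logits, so after the softmax they gain probability relative to the repeated ones.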

LLM sampling parameters — temperature, top-p, top-k — when to tune each

Temperature — the creativity dial

When to use each temperature

Top-k — limit to the k most likely tokens

Top-p (nucleus sampling) — the smart filter

Combining them

frequency_penalty / presence_penalty

seed (for reproducibility at temp > 0)

Common interview questions

Production defaults I use

Interview-grade summary

Why does Python dominate AI/ML development — what are the real reasons?

Tokens, context windows, and the O(n²) attention cost — what every dev should know

What is a Large Language Model (LLM) and how does it work internally?

Production defaults I use

| Task | Temperature | Top-p |
|---|---|---|
| RAG Q&A | 0.0 | 1.0 (default) |
| Structured extraction | 0.0 | 1.0 |
| Code generation | 0.2 | 0.95 |
| Summarization | 0.3 | 0.9 |
| Conversational chatbot | 0.7 | 0.9 |
| Creative brainstorming | 0.9 | 0.95 |
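One way to keep defaults like these in code is a small lookup that callers can override per request. This is a sketch only: the dictionary name, task keys, and helper function are hypothetical, and the numbers are copied from the table above.

```python
# Hypothetical lookup mirroring the defaults table; task keys are made-up names.
SAMPLING_DEFAULTS = {
    "rag_qa": {"temperature": 0.0, "top_p": 1.0},
    "structured_extraction": {"temperature": 0.0, "top_p": 1.0},
    "code_generation": {"temperature": 0.2, "top_p": 0.95},
    "summarization": {"temperature": 0.3, "top_p": 0.9},
    "chatbot": {"temperature": 0.7, "top_p": 0.9},
    "brainstorming": {"temperature": 0.9, "top_p": 0.95},
}

def sampling_params(task, **overrides):
    """Return a copy of the defaults for a task, with optional per-call overrides."""
    if task not in SAMPLING_DEFAULTS:
        raise KeyError(f"no sampling defaults for task {task!r}")
    return {**SAMPLING_DEFAULTS[task], **overrides}

print(sampling_params("code_generation"))
# → {'temperature': 0.2, 'top_p': 0.95}
print(sampling_params("chatbot", temperature=0.5))
# → {'temperature': 0.5, 'top_p': 0.9}
```

The returned dict can be unpacked into whichever client call you use, as long as that provider accepts parameters with these names.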

Related questions
