LLM sampling parameters — temperature, top-p, top-k — when to tune each
Sampling is how an LLM picks the next token from a probability distribution over its vocabulary. Three parameters control it: temperature, top-p, and top-k. Poorly chosen values are a common cause of "the LLM keeps giving stupid answers" complaints.
Temperature — the creativity dial
After the model produces logits (one number per vocabulary token), it divides them by the temperature before applying softmax.
import torch
import torch.nn.functional as F

def softmax_with_temp(logits, temperature):
    # temperature must be > 0; treat 0 as greedy argmax rather than dividing by zero
    return F.softmax(logits / temperature, dim=-1)
- Temperature → 0: distribution becomes spiky. The top token's probability approaches 1.0. Effectively greedy, deterministic.
- Temperature = 1: original distribution. Standard sampling.
- Temperature > 1: distribution flattens. Lower-ranked tokens become more likely. More creative / random / weird.
# Same logits, different temperatures
logits = torch.tensor([3.0, 2.0, 1.0, 0.5])
softmax_with_temp(logits, 0.0001).round(decimals=3)
# ≈ [1.000, 0.000, 0.000, 0.000]: only the top token
softmax_with_temp(logits, 1.0).round(decimals=3)
# ≈ [0.631, 0.232, 0.085, 0.052]: original distribution
softmax_with_temp(logits, 2.0).round(decimals=3)
# ≈ [0.442, 0.268, 0.163, 0.127]: flatter, more variety
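Because temperature divides the logits, a literal 0 cannot be passed through the formula. A minimal sketch of one way to handle that, written in plain Python so it runs without torch: the function name `sample_with_temperature` and the choice to treat any non-positive temperature as greedy argmax are assumptions for illustration, not how any particular provider implements it.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Sample a token index; temperature <= 0 falls back to greedy argmax."""
    if temperature <= 0:
        # Dividing by zero is undefined, so treat 0 as "pick the top token".
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    # random.choices normalizes the weights internally
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [3.0, 2.0, 1.0, 0.5]
print(sample_with_temperature(logits, 0.0))  # 0 (always the top token)
```

At temperature 0 the result is always index 0 for these logits; at any positive temperature the result depends on the random generator.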
When to use each temperature
| Use case | Temperature |
|---|---|
| Factual Q&A (RAG, extraction) | 0.0 — deterministic, no creativity |
| Code generation | 0.0 - 0.3 — slight variability for hard problems |
| Structured extraction (JSON) | 0.0 |
| Summarization | 0.2 - 0.5 — slight variety in phrasing |
| General chat assistant | 0.5 - 0.7 — natural-feeling but not weird |
| Creative writing, brainstorming | 0.7 - 1.0 |
| Generating diverse alternatives | 1.0 - 1.5 |
Default rule: when in doubt, start at 0.0 for factual tasks. Move up only if outputs feel monotonous.
Top-k — limit to the k most likely tokens
After softmax, only consider the top-k highest-probability tokens, renormalize, sample from those.
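def top_k_sample(probs, k=50):
    k = min(k, probs.size(-1))  # topk raises an error if k exceeds the vocabulary size
    top_values, top_indices = probs.topk(k)
    top_values = top_values / top_values.sum()  # renormalize over the kept tokens
    choice = torch.multinomial(top_values, num_samples=1)
    return top_indices[choice]  # map back to the original vocabulary index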
- top_k = 1: greedy (same as temperature 0)
- top_k = 50: typical default — cuts the long tail of garbage tokens
- top_k = ∞: no filtering
Top-k is brittle when the distribution is uneven. If the top 3 tokens carry 99% probability, sampling from top-50 includes 47 nearly-impossible tokens. Top-p is usually better.
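To put numbers on that brittleness, here is a small sketch in plain Python. The 100-token vocabulary and its probabilities are made up for illustration, and `top_k_indices` is a hypothetical helper, not a library function.

```python
def top_k_indices(probs, k):
    """Indices of the k highest-probability tokens (ties keep index order)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]

# Hypothetical vocabulary: 3 tokens hold 99% of the mass, 97 share the last 1%.
peaked = [0.70, 0.20, 0.09] + [0.01 / 97] * 97

kept = top_k_indices(peaked, k=50)
tail = [i for i in kept if i >= 3]           # kept tokens outside the top 3
tail_mass = sum(peaked[i] for i in tail)

print(len(tail), round(tail_mass, 4))  # 47 0.0048
```

The fixed cap keeps 47 tokens that together carry under half a percent of the mass. Each is unlikely on any single step, but the cap does nothing to exclude them.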
Top-p (nucleus sampling) — the smart filter
Pick the smallest set of tokens whose cumulative probability is at least p, then sample from that set.
def top_p_sample(probs, p=0.9):
    sorted_probs, sorted_indices = probs.sort(descending=True)
    cumsum = sorted_probs.cumsum(dim=-1)
    # Keep a token if the mass before it is still below p, so the token that
    # crosses the threshold is included and the top token is always kept.
    nucleus = (cumsum - sorted_probs) < p
    nucleus_probs = sorted_probs * nucleus
    nucleus_probs = nucleus_probs / nucleus_probs.sum()
    choice = torch.multinomial(nucleus_probs, num_samples=1)
    return sorted_indices[choice]
- top_p = 0.9: a common choice; keep the most likely tokens until 90% of the probability mass is covered
- top_p = 1.0: no filtering (consider all tokens)
- top_p = 0.5: only the most likely tokens — tighter
Top-p adapts to the distribution. If the model is confident (one clear next token), top-p naturally narrows. If it's uncertain (many plausible tokens), top-p includes more.
Top-p is usually preferred over top-k. Use top-k as a fallback / extra safety.
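The adaptive behavior is easy to see by counting how many tokens end up in the nucleus. A minimal sketch in plain Python; `nucleus_size` is a hypothetical helper and both distributions are invented for illustration.

```python
def nucleus_size(probs, p=0.9):
    """Number of tokens in the smallest set whose cumulative probability >= p."""
    total = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return count
    return len(probs)  # p was not reached (e.g. rounding); keep everything

confident = [0.95, 0.03, 0.01, 0.01]   # one clear next token
uncertain = [0.25, 0.25, 0.25, 0.25]   # four equally plausible tokens

print(nucleus_size(confident))  # 1
print(nucleus_size(uncertain))  # 4
```

With the same p, the confident distribution collapses to a single candidate while the uncertain one keeps all four. A fixed top-k cannot do both.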
Combining them
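In practice you often combine them, e.g. temperature=0.7, top_p=0.9, top_k=50. This means:
- Rescale the distribution with temperature (0.7 sharpens it slightly; values above 1 flatten it)
- Keep only the nucleus (90% of the mass)
- Enforce a top-k cap (defense against unusual distributions)
The order of the two filters depends on the implementation; Hugging Face transformers, for example, applies top-k before top-p.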
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[...],
    temperature=0.7,
    top_p=0.9,
)
Most providers expose temperature and top_p; some also expose top_k (Anthropic does; OpenAI doesn't directly).
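For intuition about what a combined setting does, the whole pipeline can be sketched locally in plain Python. This is an illustrative sketch, not any provider's implementation: the function names are hypothetical, it assumes temperature > 0, and it applies the filters in the order temperature, top-k, top-p, which is one common ordering (other implementations may order them differently).

```python
import math
import random

def filtered_distribution(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # 1) Temperature: rescale logits, then softmax (max-subtracted for stability).
    scaled = [x / temperature for x in logits]  # assumes temperature > 0
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]

    # 2) Top-k: keep the k most likely tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in order)

    # 3) Top-p on the renormalized top-k set: keep a token while the mass
    #    before it is below top_p, so the threshold-crossing token is included.
    kept, mass_before = [], 0.0
    for i in order:
        if mass_before >= top_p:
            break
        kept.append(i)
        mass_before += probs[i] / k_mass

    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample(logits, rng=random, **params):
    """Draw one token index from the filtered distribution."""
    dist = filtered_distribution(logits, **params)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

dist = filtered_distribution([3.0, 2.0, 1.0, 0.5], temperature=1.0, top_k=3, top_p=0.9)
print(sorted(dist))  # [0, 1]
```

In the usage line, top-k drops the weakest of the four tokens, and top-p then drops the third because the first two already cover more than 90% of the remaining mass.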
Other related parameters
frequency_penalty / presence_penalty
Reduce repetition by penalizing tokens that have already appeared.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[...],
    frequency_penalty=0.5,  # penalize tokens that appear multiple times
    presence_penalty=0.5,   # penalize tokens that appeared at all
)
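- frequency_penalty (OpenAI accepts -2.0 to 2.0; positive values discourage repetition): penalty grows with how many times the token has already appeared
- presence_penalty (OpenAI accepts -2.0 to 2.0): flat, one-time penalty for any token that has appeared at least once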
Useful for: brainstorming (force variety), reducing repetitive lists.
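The difference between the two penalties is clearer as arithmetic on the logits. The sketch below follows the adjustment described in OpenAI's documentation (subtract count × frequency_penalty, plus presence_penalty once if the token has appeared at all). The function name and the toy logits are assumptions for illustration, and other providers may implement penalties differently.

```python
from collections import Counter

def apply_penalties(logits, generated_tokens,
                    frequency_penalty=0.5, presence_penalty=0.5):
    """Lower the logits of tokens that already appear in the output so far."""
    counts = Counter(generated_tokens)
    adjusted = list(logits)
    for token, count in counts.items():
        # frequency: scales with the count; presence: flat, applied once
        adjusted[token] -= count * frequency_penalty + presence_penalty
    return adjusted

# Token 0 was generated three times, token 1 once, token 2 never.
print(apply_penalties([2.0, 2.0, 2.0], generated_tokens=[0, 0, 0, 1]))
# [0.0, 1.0, 2.0]
```

Token 0 loses 3 × 0.5 + 0.5 = 2.0, token 1 loses 0.5 + 0.5 = 1.0, and the unused token 2 is untouched.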
seed (for reproducibility at temp > 0)
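OpenAI accepts a seed parameter. With the same seed and the same parameters, the API makes a best effort to return the same output; determinism is not guaranteed, and backend or model-version changes can still alter results.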
client.chat.completions.create(model="gpt-4o-mini", messages=[...], seed=42)
Useful for: reproducible tests, regression detection.
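The idea behind a seed can be shown locally with Python's own random generator. This is only an analogy for what a seed does to sampling, not a description of how OpenAI implements it; `sample_tokens` and the toy distribution are hypothetical.

```python
import random

def sample_tokens(probs, n, seed):
    """Draw n token indices using a private, seeded generator."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return rng.choices(range(len(probs)), weights=probs, k=n)

probs = [0.5, 0.3, 0.2]
run_a = sample_tokens(probs, 10, seed=42)
run_b = sample_tokens(probs, 10, seed=42)
print(run_a == run_b)  # True: same seed, same draws
```

The sampling is still random with respect to the distribution. The seed only fixes which random draws are made, which is what makes regression tests at temperature > 0 possible.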
Common interview questions
"You're building a structured-extraction service that returns JSON. What sampling params?"
temperature=0.0. No need for variety; we want the most likely correct extraction. Maybe seed=42 for reproducibility in tests.
"Why does temperature 0.0 still sometimes produce slightly different outputs?"
Two reasons: 1) floating-point non-determinism on GPUs (especially across batch sizes), and 2) model-version updates from the provider. To pin: use a specific model version (gpt-4o-2024-08-06, not gpt-4o) AND set a seed.
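The floating-point part of that answer comes down to addition not being associative: summing the same numbers in a different order (as can happen when a GPU reduction is split differently across batch sizes) can change the last bits of a logit. A minimal illustration in plain Python, using a made-up competing logit to show how a near-tie can flip:

```python
# The same three numbers, summed in two different orders.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

print(a == b)  # False
print(a, b)    # 0.6000000000000001 0.6

# Hypothetical near-tie: a competing token has a logit of exactly 0.6.
competitor = 0.6
print(a > competitor)  # True: this token wins under greedy decoding
print(b > competitor)  # False: same math, different order, no longer ahead
```

Differences this small only matter when two candidates are almost tied, but once one token flips, every later token is conditioned on a different prefix.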
"Customer says GPT keeps giving the same answer. Suggest fixes."
Likely temperature is 0. Bump to 0.5-0.7. If they want diverse alternatives in one call, generate n=5 samples at temperature 0.8.
"GPT keeps giving weird, off-topic answers. Suggest fixes."
Temperature too high. Drop to 0.0-0.5. Also check the prompt — high temperature amplifies any ambiguity in the prompt.
Production defaults I use
| Task | Temperature | Top-p |
|---|---|---|
| RAG Q&A | 0.0 | 1.0 (default) |
| Structured extraction | 0.0 | 1.0 |
| Code generation | 0.2 | 0.95 |
| Summarization | 0.3 | 0.9 |
| Conversational chatbot | 0.7 | 0.9 |
| Creative brainstorming | 0.9 | 0.95 |
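One way to keep these defaults in a single place is a small lookup table in code. This is a sketch: the task keys and the `sampling_params` helper are hypothetical names, and falling back to the factual settings for unknown tasks is an assumption based on the "start at 0.0 when in doubt" rule.

```python
# Values copied from the table above; the task keys are hypothetical names.
SAMPLING_DEFAULTS = {
    "rag_qa": {"temperature": 0.0, "top_p": 1.0},
    "structured_extraction": {"temperature": 0.0, "top_p": 1.0},
    "code_generation": {"temperature": 0.2, "top_p": 0.95},
    "summarization": {"temperature": 0.3, "top_p": 0.9},
    "chatbot": {"temperature": 0.7, "top_p": 0.9},
    "brainstorming": {"temperature": 0.9, "top_p": 0.95},
}

def sampling_params(task):
    """Return a copy of the defaults; unknown tasks get the factual settings."""
    return dict(SAMPLING_DEFAULTS.get(task, SAMPLING_DEFAULTS["rag_qa"]))

print(sampling_params("summarization"))  # {'temperature': 0.3, 'top_p': 0.9}
```

The returned dict can be unpacked into a request with `**sampling_params("summarization")` alongside the model and messages arguments.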
Interview-grade summary
"Temperature is the creativity dial: 0.0 for factual or deterministic tasks, 0.7-1.0 for creative ones. Top-p (nucleus) keeps the smallest set of tokens whose probabilities sum to at least p, so it adapts to the distribution; 0.9 is usually fine. Top-k is a fallback. Factual use cases want temperature 0.0 with default top-p; leaving a factual system on conversational defaults, which sample from the whole distribution at a higher temperature, is what causes the 'why is GPT giving weird answers' problem. For factual systems, default to 0.0 and only raise it if outputs feel monotonous."