// Guide 05 — Applied Concepts

Sampling & Decoding

Temperature, Top-p, Top-k, and How Outputs Are Actually Chosen

After the model computes a probability distribution over its vocabulary, something has to pick the next token. That decision — made thousands of times per response — determines whether the output is creative or conservative, coherent or chaotic. These are the mechanics behind that choice.

20 min read

From logits to tokens

At each decode step, the model produces a vector of raw scores — one per token in the vocabulary — called logits. A vocabulary of 100,000 tokens produces 100,000 logit values. These scores are then converted to probabilities via a softmax operation, which exponentiates each value and normalises the result so all probabilities sum to 1.

The sampling strategy then decides which token from this distribution to select. This is where temperature, top-k, and top-p come in — they each modify either the distribution itself or the subset from which selection is made.
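The logits-to-probabilities step can be sketched in a few lines. This is a minimal illustration with a toy four-token vocabulary and made-up logit values, not output from a real model:

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary with illustrative logits (a real model has ~100,000 entries).
vocab = ["Paris", "Lyon", "Nice", "Bordeaux"]
logits = [5.0, 2.7, 2.0, 1.3]
probs = softmax(logits)
# probs sums to 1.0, and the largest logit maps to the largest probability.
```

The max-subtraction trick doesn't change the result (softmax is shift-invariant) but prevents `exp` from overflowing on large logits, which is why real implementations do it too.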

// decode step — logits to token

Prompt: "The capital of France is" — top candidates after softmax

"Paris"       0.82
"Lyon"        0.08
"Nice"        0.04
"Bordeaux"    0.02
(99,996 remaining tokens, combined)   0.04
For a factual prompt like this, the distribution is sharply peaked. Temperature, top-k, and top-p all behave differently on peaked vs. flat distributions.

Greedy decoding

The simplest strategy: always pick the token with the highest probability. No randomness, no parameters to tune. The output is fully deterministic — the same prompt will always produce exactly the same output.

Greedy decoding works well for tasks where the correct answer is unambiguous and you want reproducibility: classification, structured extraction, code generation in narrow domains. It fails on longer or more open-ended generations because it gets stuck in repetitive loops — once a locally high-probability token is chosen, it can set up conditions where the same token stays highest on subsequent steps.
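Greedy decoding is a one-line argmax over the distribution. A minimal sketch, reusing the illustrative probabilities from the example above:

```python
def greedy_pick(probs, vocab):
    # Argmax over the distribution: deterministic, no randomness involved.
    best = max(range(len(probs)), key=lambda i: probs[i])
    return vocab[best]

# Illustrative distribution (same toy values as the earlier example).
vocab = ["Paris", "Lyon", "Nice", "Bordeaux"]
probs = [0.82, 0.08, 0.04, 0.02]
```

Calling `greedy_pick(probs, vocab)` any number of times returns the same token, which is exactly the reproducibility property described above.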

Temperature

Temperature is a scalar applied to the logits before the softmax. Dividing logits by a value greater than 1 flattens the distribution — high-probability tokens become less dominant, low-probability tokens become more plausible. Dividing by a value less than 1 sharpens the distribution — the highest-probability token gets even more probability mass concentrated on it.

// temperature effect on distribution

Same logits, different temperatures — "Paris" probability:

temp = 0.1    ≈0.99 (near greedy)
temp = 1.0    0.82 (default)
temp = 1.5    ≈0.55 (more varied)
temp = 2.0    ≈0.30 (high entropy)
High temperature doesn't mean random — it means the long tail of the distribution has more influence. At very high temperatures, low-quality tokens become sampled often enough to degrade coherence.

A temperature of 1.0 means the softmax is applied to the raw logits unchanged. This is the baseline the model was trained with. Values below 1.0 make the model more confident and consistent; values above 1.0 increase entropy and diversity.
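The scaling itself is just a division before the softmax. A sketch with the same illustrative toy logits as before:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temp):
    # Divide logits by the temperature, then take the softmax.
    # temp > 1 flattens the distribution; temp < 1 sharpens it.
    return softmax([x / temp for x in logits])

logits = [5.0, 2.7, 2.0, 1.3]  # illustrative values, not from a real model
cold = apply_temperature(logits, 0.5)
hot = apply_temperature(logits, 2.0)
# The top token's probability rises as temperature falls, and vice versa.
```

Note that `apply_temperature(logits, 1.0)` is identical to plain softmax, matching the "baseline" behaviour described above.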

// Temperature = 0 is not greedy decoding

Setting temperature to 0 is mathematically undefined (it would mean dividing by zero), but most frameworks interpret it as greedy decoding: argmax over the logits. In practice, "temperature 0" and "greedy" behave identically in almost every inference framework you're likely to use, but they're conceptually distinct operations.

Top-k sampling

Top-k restricts sampling to the k highest-probability tokens, then re-normalises and samples from that subset. If k=50, the model considers only the 50 most likely next tokens — regardless of how much of the probability mass they collectively represent.

The problem with a fixed k is that it's context-insensitive. On a peaked distribution (one token at 95%), k=50 includes 49 essentially irrelevant tokens. On a flat distribution (50 tokens each at 2%), k=50 captures the entire meaningful distribution. The same k value behaves very differently depending on the shape of the logits.
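A minimal top-k implementation makes the mechanics concrete: rank, truncate, renormalise, sample. The vocabulary and probabilities are the same illustrative toy values used earlier:

```python
import random

def top_k_sample(probs, vocab, k, rng=random):
    # Rank token indices by probability and keep only the k most likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in ranked)
    # Renormalise within the top-k subset, then sample from it.
    weights = [probs[i] / total for i in ranked]
    choice = rng.choices(ranked, weights=weights, k=1)[0]
    return vocab[choice]

vocab = ["Paris", "Lyon", "Nice", "Bordeaux"]
probs = [0.82, 0.08, 0.04, 0.02]
```

With `k=1` this degenerates to greedy decoding; with `k=len(vocab)` it is plain temperature-1 sampling. The fixed-`k` problem described above is visible here: nothing in the function looks at how peaked `probs` actually is.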

Top-p (nucleus) sampling

Top-p, also called nucleus sampling, is designed to address this. Instead of a fixed count of tokens, top-p keeps the smallest set of tokens whose cumulative probability exceeds a threshold p. If p=0.9, the model samples from however many tokens it takes to cover 90% of the probability mass.

On a peaked distribution, the nucleus might be just 3–5 tokens. On a flat distribution, it might be 500. The candidate set adapts to the shape of each individual prediction, which is why nucleus sampling tends to produce more coherent text than top-k across a wide range of prompts.
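The adaptive behaviour is easy to see in code. This sketch builds the nucleus for two hypothetical distributions, one peaked and one flat:

```python
def top_p_nucleus(probs, p):
    # Keep the smallest set of tokens whose cumulative probability reaches p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in ranked:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return nucleus

peaked = [0.82, 0.08, 0.04, 0.02, 0.04]   # sharp: one dominant token
flat = [0.2, 0.2, 0.2, 0.2, 0.2]          # uniform: no dominant token
```

For the same threshold, the peaked distribution yields a nucleus of just a couple of tokens while the flat one pulls in every token, which is exactly the adaptivity described above.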

// Top-p + temperature together

Most production systems apply temperature first (reshaping the distribution), then top-p (selecting the nucleus), then sample. The order matters: because temperature runs before top-p, the nucleus size changes with temperature. Reversing the order would produce different results, but mainstream frameworks apply the steps in this fixed order, so behaviour is consistent for identical settings.
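The full pipeline, in the order described above (temperature, then top-p, then sample), can be sketched as one function. All values are illustrative:

```python
import math
import random

def sample_next(logits, vocab, temperature=1.0, top_p=0.9, rng=random):
    # 1) Temperature: rescale logits before the softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 2) Top-p: keep the smallest nucleus covering top_p of the mass.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cum = [], 0.0
    for i in ranked:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # 3) Renormalise within the nucleus and sample from it.
    weights = [probs[i] / cum for i in nucleus]
    return vocab[rng.choices(nucleus, weights=weights, k=1)[0]]
```

Because temperature runs first, lowering it concentrates mass on the top token and shrinks the nucleus, so low-temperature calls become nearly deterministic even before the sampling step.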

Beam search

All of the above are sample-once-and-move-on strategies. Beam search is different: it maintains multiple candidate sequences simultaneously, extending each one at every step and pruning the worst candidates. With a beam width of 4, you track the 4 highest-probability partial sequences and evaluate them in parallel.

Beam search tends to produce more grammatically correct and predictable output than sampling, but it has significant downsides for generative tasks: it's slower (you're generating beam_width sequences instead of 1), it penalises diversity by design, and it's known to produce bland, repetitive text on open-ended prompts. It performs well on tasks where there's a clearly correct output — machine translation, summarisation with tight constraints.
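A toy sketch shows why tracking multiple beams can beat greedy decoding. The "model" here is a hypothetical lookup table of next-token probabilities, invented purely for illustration:

```python
import math

def beam_search(step_fn, start, width, steps):
    # Each beam is (token sequence, cumulative log-probability).
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            # step_fn returns {next_token: probability} for a partial sequence.
            for tok, p in step_fn(tokens).items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Prune: keep only the `width` highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:width]
    return beams[0][0]

# Hypothetical two-step "model": a lookup table of next-token probabilities.
def toy_model(tokens):
    table = {
        "<s>": {"the": 0.6, "a": 0.4},
        "the": {"cat": 0.5, "dog": 0.5},
        "a": {"cat": 0.9, "dog": 0.1},
    }
    return table.get(tokens[-1], {"<eos>": 1.0})
```

In this toy example, greedy commits to "the" (0.6) and ends with sequence probability at most 0.6 × 0.5 = 0.30, while a width-2 beam keeps "a" alive and finds "a cat" at 0.4 × 0.9 = 0.36. It also shows the cost: every decode step does `width` times the work.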

Strategy | Deterministic? | Best for | Risks
Greedy | Yes | Classification, structured extraction | Repetition loops on long output
Temperature sampling | No | Creative generation, diversity | Incoherence at high values
Top-k | No | Moderate control over randomness | Fixed k is context-insensitive
Top-p (nucleus) | No | General-purpose generation | Can still produce poor tokens on flat distributions
Beam search | Yes (given width) | Translation, constrained generation | Slow, bland, poor on open-ended tasks

Reproducibility and seeds

Any strategy involving sampling is probabilistic — two identical calls can produce different outputs. For applications that need reproducible results (testing, debugging, deterministic pipelines), use greedy decoding or set a fixed random seed. Most inference frameworks expose a seed parameter that initialises the random number generator and makes sampling deterministic across identical inputs.
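The seed mechanism can be illustrated with Python's standard library: two generators initialised with the same seed produce identical sample sequences. The distribution values are the toy ones from earlier:

```python
import random

def sample_token(probs, vocab, rng):
    # Weighted sampling driven by an explicit, seedable generator.
    return vocab[rng.choices(range(len(vocab)), weights=probs, k=1)[0]]

vocab = ["Paris", "Lyon", "Nice", "Bordeaux"]
probs = [0.82, 0.08, 0.04, 0.02]

# Two generators with the same seed replay the exact same randomness.
a = random.Random(42)
b = random.Random(42)
run_a = [sample_token(probs, vocab, a) for _ in range(10)]
run_b = [sample_token(probs, vocab, b) for _ in range(10)]
# run_a == run_b: sampling, but reproducible.
```

Passing the generator explicitly (rather than using module-level global state) is what makes this reliable in larger systems: each pipeline run owns its own seeded generator.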

Note that reproducibility can still break across different hardware, framework versions, or even between batched and unbatched execution, because floating-point operations may be reordered differently. If exact reproducibility is a hard requirement, test it explicitly on the hardware and software you'll run in production.

Effect on latency and throughput

Sampling strategy has essentially no effect on per-token latency. The logit computation is the expensive part — selecting from the resulting distribution is negligible by comparison. Beam search is the exception: it multiplies the memory and compute cost of each decode step by the beam width, since you're tracking multiple sequences in parallel.

// In short

01. Greedy decoding always picks the highest-probability token. Deterministic, fast to reason about, prone to repetition on long outputs.
02. Temperature scales logits before softmax, flattening or sharpening the distribution. It doesn't change which tokens are possible — it changes how likely each one is to be chosen.
03. Top-k restricts sampling to the k highest-probability tokens. Simple, but context-insensitive — the same k behaves very differently depending on distribution shape.
04. Top-p (nucleus sampling) keeps the smallest set of tokens that covers probability mass p. Adapts to distribution shape, which is why it tends to outperform fixed-k on diverse prompts.
05. Beam search maintains multiple candidate sequences simultaneously. Better on constrained tasks (translation), worse on open-ended generation, and proportionally slower than single-sequence strategies.
06. Sampling strategy has negligible effect on per-token latency except for beam search, which scales cost by beam width.