// inference-engineering.com
inference
n. The act of running a trained model to produce an output. Distinct from training — inference does not update weights. 1. In statistics: deriving conclusions from evidence. 2. In ML: a single forward pass through a model to generate a prediction, token, or embedding. Every AI response you have ever received was inference.
engineering
n. The application of scientific principles to design, build, and operate systems reliably at scale. 1. In software: turning a working prototype into a system that handles production load. 2. Here: the full discipline of making model inference fast, cheap, and dependable — from GPU selection to serving architecture.
From GPU memory bandwidth to KV cache management, continuous batching to quantisation — a technical resource for engineers deploying AI in production.
Foundations — how inference works
01
What happens between sending a prompt and receiving a response. The prefill and decode phases that every optimisation flows from.
02
Why GPU memory bandwidth matters more than raw compute for most LLMs. How to think about accelerator selection for inference.
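The bandwidth-versus-compute question can be made concrete with a roofline-style check. A sketch with assumed, illustrative accelerator figures (the 1 PFLOP/s and 3 TB/s numbers are round placeholders, not vendor-exact specs):

```python
# Back-of-envelope check: is a workload compute-bound or bandwidth-bound?
# Illustrative accelerator numbers (assumed, not vendor-exact).
PEAK_FLOPS = 1.0e15        # 1 PFLOP/s of dense compute
PEAK_BANDWIDTH = 3.0e12    # 3 TB/s of HBM bandwidth

def bound_by(flops: float, bytes_moved: float) -> str:
    """Compare arithmetic intensity (FLOPs per byte) to the machine balance."""
    intensity = flops / bytes_moved
    machine_balance = PEAK_FLOPS / PEAK_BANDWIDTH  # ~333 FLOPs per byte here
    return "compute" if intensity > machine_balance else "bandwidth"

# One decode step for a 7B-parameter model in FP16: every weight is read once
# (~14e9 bytes) to do ~2 FLOPs per parameter (~14e9 FLOPs), so intensity ~1.
print(bound_by(flops=14e9, bytes_moved=14e9))   # bandwidth
```

With intensity near 1 against a machine balance in the hundreds, decode sits deep in the bandwidth-bound region, which is why faster memory helps more than more FLOPS.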
03
Why attention scales quadratically, how the KV cache trades memory for speed, and why it dominates VRAM at long context lengths.
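KV cache growth is easy to estimate from the model shape alone. A back-of-envelope sketch; the 32-layer, 32-KV-head, 128-dim configuration below is an assumed Llama-2-7B-like shape used for illustration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Two cached tensors (K and V) per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len * batch

# Assumed Llama-2-7B-like shape: 32 layers, 32 KV heads, head_dim 128, FP16.
gib = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1) / 2**30
print(f"{gib:.2f} GiB")   # 2.00 GiB at 4k context, per sequence
```

The linear factor on `seq_len` is the point: at 32k context the same shape needs 16 GiB of cache per sequence, before any weights are counted.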
04
How context length affects memory, cost, and attention complexity — and what actually happens at the limits.
05
Temperature, top-p, top-k, beam search — how the decoding algorithm shapes output quality, diversity, and reproducibility.
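The interaction of temperature, top-k, and top-p can be sketched as successive filters over the softmax distribution. A minimal pure-Python version that returns the filtered, renormalised distribution (the final random draw is omitted to keep it deterministic):

```python
import math

def sample_dist(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Turn raw logits into a filtered, renormalised sampling distribution."""
    # Temperature: rescale logits before softmax (lower -> peakier).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    probs = [e / sum(exps) for e in exps]
    # Rank token indices by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k:                       # top-k: keep only the k most likely (0 = off)
        order = order[:top_k]
    if top_p < 1.0:                 # nucleus: smallest set with mass >= top_p
        kept, mass = [], 0.0
        for i in order:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
        order = kept
    total = sum(probs[i] for i in order)
    return {i: probs[i] / total for i in order}

dist = sample_dist([2.0, 1.0, 0.1], temperature=0.7, top_k=2)
```

Applying the filters in this order (temperature, then top-k, then top-p) matches common serving defaults, but engines differ, which is one reason identical parameters can produce different distributions across stacks.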
Working with models — directing and orchestrating
06
How prompt structure affects prefill cost, KV cache reuse, and latency. The engineering angle, not tips for better outputs.
07
What users say versus what they mean — and how to design systems that close that gap reliably through prompts, structure, and evaluation.
08
What an agent actually is, how a state machine applies to inference loops, and why most agent bugs are state bugs.
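The loop can be sketched as an explicit state machine. Everything here is a hypothetical illustration, not a real framework API: the states, the `toy_llm` stand-in, and the `calc` tool are all invented for the sketch:

```python
# Minimal agent loop as an explicit state machine (hypothetical names throughout).
# Making states explicit is what keeps "most agent bugs are state bugs" findable.
PLAN, ACT, OBSERVE, DONE = "plan", "act", "observe", "done"

def run_agent(task, llm, tools, max_steps=8):
    state, scratchpad, steps = PLAN, [task], 0
    while state != DONE and steps < max_steps:
        steps += 1
        if state == PLAN:
            thought = llm(scratchpad)            # decide next tool call or finish
            scratchpad.append(thought)
            state = DONE if thought.startswith("FINAL:") else ACT
        elif state == ACT:
            name, arg = scratchpad[-1].split(":", 1)
            scratchpad.append(tools[name](arg))  # run the chosen tool
            state = OBSERVE
        elif state == OBSERVE:
            state = PLAN                         # feed observation back to the model
    return scratchpad[-1]

# Toy deterministic "LLM": calls the calc tool once, then finishes.
def toy_llm(pad):
    return "calc: 2+2" if len(pad) == 1 else "FINAL: " + pad[-1]

result = run_agent("add numbers", toy_llm, {"calc": lambda a: str(eval(a))})
print(result)  # FINAL: 4
```

The `max_steps` guard and the single `state` variable are the point: every transition is enumerable, so a stuck loop or a skipped observation shows up as a reachable, testable state.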
Data & knowledge — retrieval and memory
09
What an embedding is, how vector databases work, and where this fits in an AI system pipeline.
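At its core, vector search is nearest-neighbour lookup over embeddings by cosine similarity. A brute-force sketch with toy 3-dimensional vectors (real embeddings have hundreds to thousands of dimensions, and production databases use approximate indexes to avoid the full scan):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nearest(query, index, k=2):
    """Brute-force vector search: what an ANN index approximates at scale."""
    scored = sorted(index, key=lambda item: -cosine(query, item[1]))
    return [doc for doc, _ in scored[:k]]

# Toy 3-d "embeddings" standing in for real model outputs.
index = [("dogs", [0.9, 0.1, 0.0]),
         ("cats", [0.8, 0.2, 0.1]),
         ("stocks", [0.0, 0.1, 0.9])]
print(nearest([1.0, 0.0, 0.0], index, k=2))  # ['dogs', 'cats']
```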
10
How retrieval fits into the inference pipeline — chunking, embedding, reranking, and the tradeoffs vs. longer context.
11
A decision framework for when each approach makes sense — what each costs in time, money, and complexity.
Optimisation & production — scale and efficiency
12
From static batching to continuous batching — how serving systems amortise inference cost across concurrent users.
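The scheduling difference can be shown with a toy simulator: under continuous batching, a short request leaves the batch as soon as it finishes and its slot is refilled immediately. A sketch of the idea only, not how any particular serving system is implemented:

```python
# Continuous batching sketch: new requests join the batch as soon as others
# finish, instead of waiting for the whole batch to drain (static batching).
def simulate(arrivals, max_batch=2):
    """arrivals: list of (arrival_step, tokens_to_generate) -> finish step per id."""
    waiting = sorted(range(len(arrivals)), key=lambda i: arrivals[i][0])
    running, remaining, finished, step = [], {}, {}, 0
    while waiting or running:
        # Admit any arrived request into a free batch slot -- checked every step.
        while waiting and len(running) < max_batch and arrivals[waiting[0]][0] <= step:
            i = waiting.pop(0)
            running.append(i)
            remaining[i] = arrivals[i][1]
        for i in list(running):          # one decode step for the whole batch
            remaining[i] -= 1
            if remaining[i] == 0:
                running.remove(i)
                finished[i] = step       # slot frees up this same step
        step += 1
    return finished

# Request 1 (one token) finishes at step 0, so request 2 starts at step 1
# instead of waiting for the longest request in the batch to drain.
print(simulate([(0, 3), (0, 1), (1, 2)]))
```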
13
Shrinking model weights from full precision to INT4 to fit more on GPU and increase throughput. When the accuracy tradeoff is worth it.
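A minimal sketch of symmetric round-to-nearest quantisation, shown at INT8 for clarity; production INT4 schemes add per-group scales and careful outlier handling on top of this core idea:

```python
# Symmetric per-tensor quantisation sketch (assumed simplest-case scheme).
def quantise(weights, bits=8):
    qmax = 2 ** (bits - 1) - 1                  # e.g. 127 for INT8
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]     # round-to-nearest integers
    return q, scale

def dequantise(q, scale):
    return [x * scale for x in q]

w = [0.12, -0.50, 0.33, 0.07]
q, s = quantise(w)
w_hat = dequantise(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
# Worst-case round-trip error is bounded by half the quantisation step.
assert max_err <= s / 2
```

Dropping `bits` from 8 to 4 shrinks storage 2x but grows the step size 16x, which is the accuracy tradeoff the guide weighs.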
14
Using a small draft model to generate candidate tokens the large model verifies in one pass. More tokens per forward pass, same output quality.
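The control flow can be sketched with toy deterministic functions standing in for the draft and target models. The greedy acceptance rule below is a simplification of the rejection-sampling scheme real implementations use, but it shows why output quality is preserved:

```python
# Speculative decoding control flow with toy deterministic models (assumed).
# Greedy acceptance: keep draft tokens while they match the target's own
# greedy choice, then take the target's token at the first mismatch.
def speculative_step(target, draft, prefix, k=4):
    proposed, ctx = [], list(prefix)
    for _ in range(k):                      # cheap model proposes k tokens
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    # In a real engine this verification is one batched target forward pass.
    for t in proposed:
        expected = target(ctx)
        if t != expected:
            accepted.append(expected)       # correction token, then stop
            break
        accepted.append(t)
        ctx.append(t)
    return accepted

target = lambda ctx: len(ctx) % 3           # toy next-token rules
draft  = lambda ctx: len(ctx) % 3 if len(ctx) < 4 else 0
out = speculative_step(target, draft, prefix=[7, 7], k=4)
print(out)  # [2, 0, 1]: two draft tokens accepted, plus the target's correction
```

The output is token-for-token what the target alone would have produced, only cheaper: three tokens emerged from a single verification pass.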
15
How MoE models route each token through a subset of specialised sub-networks — more capacity without proportionally more compute per token.
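Top-k gating can be sketched in a few lines; the toy experts and gate logits below are illustrative, not any real architecture:

```python
import math

# Top-2 MoE routing sketch: softmax the gate scores, run only the chosen
# experts, combine outputs weighted by the renormalised gate probabilities.
def moe_forward(x, gate_logits, experts, k=2):
    m = max(gate_logits)
    exps = [math.exp(g - m) for g in gate_logits]
    probs = [e / sum(exps) for e in exps]
    top = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    total = sum(probs[i] for i in top)
    # Only k experts execute per token -- the "more capacity, same compute" trick.
    return sum(probs[i] / total * experts[i](x) for i in top)

experts = [lambda x: x + 1, lambda x: x * 10, lambda x: -x]  # toy experts
y = moe_forward(2.0, gate_logits=[0.0, 2.0, -1.0], experts=experts, k=2)
```

Note the third expert never runs for this token: its parameters add capacity to the model without adding compute to this forward pass.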
16
TTFT, TPOT, tokens/second, cost per million tokens — the numbers that determine whether your inference system is working.
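All four metrics fall out of three timestamps per request. A sketch; the price per million tokens is an assumed illustrative figure, not a real vendor rate:

```python
# Deriving headline serving metrics from per-request timestamps (in seconds).
def metrics(t_sent, t_first_token, t_done, n_output_tokens, price_per_mtok=2.0):
    ttft = t_first_token - t_sent                            # time to first token
    tpot = (t_done - t_first_token) / (n_output_tokens - 1)  # time per output token
    tps = n_output_tokens / (t_done - t_sent)                # end-to-end tokens/sec
    cost = n_output_tokens / 1e6 * price_per_mtok            # output cost, USD
    return ttft, tpot, tps, cost

# price_per_mtok = 2.0 is an assumed illustrative price.
ttft, tpot, tps, cost = metrics(0.0, 0.25, 2.25, n_output_tokens=101)
```

TTFT is dominated by prefill and queueing; TPOT by decode bandwidth. Measuring only tokens/second hides which phase is actually slow.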
Featured Concept
Every LLM request goes through exactly two phases, and confusing them is the single most common mistake when optimising for latency.
In the prefill phase, the model ingests your entire prompt in parallel. All input tokens are processed simultaneously — it's compute-bound and relatively fast. This phase populates the KV cache.
In the decode phase, the model generates output tokens one at a time. Each new token is appended to the KV cache, which is read in full on every step. This phase is memory-bandwidth-bound — throwing more FLOPS at it rarely helps.
Understanding this split tells you why optimisations like speculative decoding target the decode phase, and why prefill batching has different economics than decode batching.
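The bandwidth bound on decode can be turned into a ceiling estimate. A sketch with assumed figures (14 GB of FP16 weights, 2 GB of KV cache, 3 TB/s of HBM), at batch size 1:

```python
# Why decode is bandwidth-bound: each generated token must stream the full
# weights (plus the KV cache) through the GPU once. Assumed figures, batch = 1.
def decode_tokens_per_sec(param_bytes, kv_bytes, hbm_bandwidth):
    bytes_per_token = param_bytes + kv_bytes
    return hbm_bandwidth / bytes_per_token   # upper bound, ignores overheads

# 7B params in FP16 (~14 GB), 2 GB of KV cache, 3 TB/s of HBM (assumed).
print(decode_tokens_per_sec(14e9, 2e9, 3e12))   # 187.5 tokens/sec ceiling
```

No amount of extra compute raises that ceiling; only faster memory, smaller weights (quantisation), or amortising reads across a batch does.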
Read the full guide
// inference_pipeline.txt
Phase 1 — Prefill
Phase 2 — Decode
Glossary — selected terms
The Inference Stack
| Layer | What it does | Key Tools | Guide |
|---|---|---|---|
| Application | Prompt construction, streaming, response parsing | OpenAI SDK, LangChain | Guide → |
| Retrieval | Embedding, vector search, reranking, RAG pipelines | pgvector, Pinecone, Weaviate | Guide → |
| Serving | Batching, scheduling, KV cache management, APIs | vLLM, SGLang, TGI | Guide → |
| Inference Engine | Kernel optimisation, graph compilation, quantisation | TensorRT-LLM, FlashAttention | Guide → |
| Hardware | Compute, HBM, NVLink interconnects, PCIe | HBM GPUs, NVLink, PCIe | Guide → |