// inference-engineering.com

Inference.

n. The act of running a trained model to produce an output. Distinct from training — inference does not update weights. 1. In statistics: deriving conclusions from evidence. 2. In ML: a single forward pass through a model to generate a prediction, token, or embedding. Every AI response you have ever received was inference.

Engineering.

n. The application of scientific principles to design, build, and operate systems reliably at scale. 1. In software: turning a working prototype into something that handles production load. 2. Here: the full discipline of making model inference fast, cheap, and dependable — from GPU selection to serving architecture.

From GPU memory bandwidth to KV cache management, continuous batching to quantisation — a technical resource for engineers deploying AI in production.

16 Guides
40+ Glossary Terms
prefill → decode → stream

The Two Phases of Inference

Every LLM request goes through exactly two phases, and confusing them is the single most common mistake when optimising for latency.

In the prefill phase, the model ingests your entire prompt in parallel. All input tokens are processed simultaneously — it's compute-bound and relatively fast. This phase populates the KV cache.

In the decode phase, the model generates output tokens one at a time. Each new token is appended to the KV cache, which is read in full on every step. This phase is memory-bandwidth-bound — throwing more FLOPS at it rarely helps.

Understanding this split tells you why optimisations like speculative decoding target the decode phase, and why prefill batching has different economics than decode batching.
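The two-phase control flow can be sketched in a few lines. `ToyModel` below is a stand-in, not a real transformer: it simply grows the KV cache by one entry per input token, which makes the structural difference visible — prefill adds many cache entries in a single call, decode adds exactly one per step.

```python
# A toy sketch of the two inference phases. ToyModel is illustrative:
# it records one K/V entry per token and returns fake "logits".

class ToyModel:
    def forward(self, tokens, kv_cache):
        # Prefill appends many entries at once; decode appends one per step.
        kv_cache.extend(("k", "v") for _ in tokens)
        return len(kv_cache) % 100  # fake next-token prediction

def generate(model, prompt_tokens, max_new_tokens):
    kv_cache = []
    # Phase 1 -- prefill: the whole prompt in one parallel pass (compute-bound).
    logits = model.forward(prompt_tokens, kv_cache)
    out = []
    # Phase 2 -- decode: one token per pass, re-reading the full cache
    # (memory-bandwidth-bound).
    for _ in range(max_new_tokens):
        next_token = logits
        out.append(next_token)
        logits = model.forward([next_token], kv_cache)
    return out, len(kv_cache)

tokens, cache_len = generate(ToyModel(), [1, 2, 3, 4], max_new_tokens=3)
# cache_len == 7: four prompt entries from prefill plus three from decode.
```

Note that the cache length at the end equals prompt tokens plus generated tokens — the cache never shrinks during a request, which is why long generations keep consuming more VRAM.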

Read the full guide

// inference_pipeline.txt

Phase 1 — Prefill

The model reads your entire prompt in parallel
Computed in parallel → populates KV cache

Phase 2 — Decode

The answer is…
One token per forward pass → reads full KV cache
Memory-bandwidth bound, not compute bound
→ KV cache grows with every output token. At long context lengths it becomes the dominant consumer of GPU memory.
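How dominant does the cache get? The standard sizing formula is 2 (one K and one V tensor) × layers × KV heads × head dimension × bytes per element, per token. The model dimensions below are illustrative, roughly Llama-style, and not tied to any specific model:

```python
# Back-of-envelope KV cache sizing. Dimensions are illustrative
# (32 layers, 8 KV heads, head_dim 128, fp16), not a specific model.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # K and V each hold num_kv_heads * head_dim values per layer per token.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len

per_token = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                           seq_len=1)
full_ctx = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                          seq_len=131_072)

print(f"{per_token / 1024:.0f} KiB per token")          # 128 KiB
print(f"{full_ctx / 1024**3:.0f} GiB at 128k context")  # 16 GiB
```

At these dimensions a single 128k-context request needs 16 GiB of cache — comparable to the weights themselves — which is exactly why paged and quantised KV caches exist.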
TTFT: Time to First Token. The latency between sending a request and receiving the first output token. Dominated by the prefill phase. The metric users feel most acutely.
TPOT: Time Per Output Token. The inter-token latency during decode. Determines perceived streaming speed. Bottlenecked by memory bandwidth, not compute.
KV Cache: Key-Value cache. Stores intermediate attention states from the prefill phase so they needn't be recomputed on each decode step. The primary VRAM consumer at runtime.
Continuous Batching: A serving strategy where new requests are inserted into the batch mid-generation rather than waiting for the whole batch to finish. Improves GPU utilisation substantially.
Arithmetic Intensity: FLOP count divided by bytes accessed. Compared against a GPU's ops:byte ratio to determine if a workload is compute-bound or memory-bound.
Chunked Prefill: Breaking long prompts into chunks processed across multiple forward passes, allowing decode requests to interleave with prefill. Reduces head-of-line blocking.
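The arithmetic-intensity test can be made concrete for a single decode step, which is dominated by matrix-vector products against the weight matrices. The GPU figures below are illustrative round numbers, not vendor specifications:

```python
# Worked arithmetic-intensity check for one decode-step matrix-vector
# product in fp16. Accelerator numbers are illustrative, not vendor specs.

def arithmetic_intensity(m, n, dtype_bytes=2):
    flops = 2 * m * n                            # one multiply-add per weight
    bytes_moved = dtype_bytes * (m * n + n + m)  # weights + input + output
    return flops / bytes_moved

# Decode at batch size 1: a matvec against a 4096x4096 weight matrix.
ai = arithmetic_intensity(4096, 4096)

# Illustrative accelerator: ~1000 TFLOP/s fp16, ~3 TB/s HBM bandwidth.
ops_per_byte = 1000e12 / 3e12

print(f"matvec intensity ≈ {ai:.2f} FLOP/byte")    # ≈ 1
print(f"GPU ops:byte ratio ≈ {ops_per_byte:.0f}")  # ≈ 333
# Intensity far below the ops:byte ratio → decode is memory-bandwidth-bound.
```

Each weight byte is read once and used for a single multiply-add, so the intensity sits near 1 FLOP/byte regardless of matrix size — hundreds of times below the hardware's balance point. Batching raises the FLOPs per weight byte read, which is why continuous batching recovers so much throughput.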
Layer | What it does | Key Tools | Guide
Application | Prompt construction, streaming, response parsing | OpenAI SDK, LangChain | Guide →
Retrieval | Embedding, vector search, reranking, RAG pipelines | pgvector, Pinecone, Weaviate | Guide →
Serving | Batching, scheduling, KV cache management, APIs | vLLM, SGLang, TGI | Guide →
Inference Engine | Kernel optimisation, graph compilation, quantisation | TensorRT-LLM, FlashAttention | Guide →
Hardware | Compute, HBM, NVLink interconnects, PCIe | HBM GPUs, NVLink, PCIe | Guide →