// inference-engineering.com
inference
n. The act of running a trained model to produce an output. Distinct from training — inference does not update weights. 1. In statistics: deriving conclusions from evidence. 2. In ML: a single forward pass through a model to generate a prediction, token, or embedding. Every AI response you have ever received was inference.
engineering
n. The application of scientific principles to design, build, and operate systems reliably at scale. 1. In software: turning a working prototype into a system that handles production load. 2. Here: the full discipline of making model inference fast, cheap, and dependable — from GPU selection to serving architecture.
From GPU memory bandwidth to KV cache management, continuous batching to quantisation — a technical resource for engineers deploying AI in production.
Foundations — how inference works
01
What happens between sending a prompt and receiving a response. The prefill and decode phases that every optimisation flows from.
02
Why GPU memory bandwidth matters more than raw compute for most LLMs. How to think about accelerator selection for inference.
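The bandwidth-versus-compute question can be made concrete with a roofline-style check. A sketch with assumed, illustrative accelerator figures (the 1 PFLOP/s and 3 TB/s numbers are round placeholders, not vendor-exact specs):

```python
# Back-of-envelope check: is a workload compute-bound or bandwidth-bound?
# Illustrative accelerator numbers (assumed, not vendor-exact).
PEAK_FLOPS = 1.0e15        # 1 PFLOP/s of dense compute
PEAK_BANDWIDTH = 3.0e12    # 3 TB/s of HBM bandwidth

def bound_by(flops: float, bytes_moved: float) -> str:
    """Compare arithmetic intensity (FLOPs per byte) to the machine balance."""
    intensity = flops / bytes_moved
    machine_balance = PEAK_FLOPS / PEAK_BANDWIDTH  # ~333 FLOPs per byte here
    return "compute" if intensity > machine_balance else "bandwidth"

# One decode step for a 7B-parameter model in FP16: every weight is read once
# (~14e9 bytes) to do ~2 FLOPs per parameter (~14e9 FLOPs), so intensity ~1.
print(bound_by(flops=14e9, bytes_moved=14e9))   # bandwidth
```

With intensity near 1 against a machine balance in the hundreds, decode sits deep in the bandwidth-bound region, which is why faster memory helps more than more FLOPS.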
03
Why attention scales quadratically, how the KV cache trades memory for speed, and why it dominates VRAM at long context lengths.
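KV cache growth is easy to estimate from the model shape alone. A back-of-envelope sketch; the 32-layer, 32-KV-head, 128-dim configuration below is an assumed Llama-2-7B-like shape used for illustration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Two cached tensors (K and V) per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len * batch

# Assumed Llama-2-7B-like shape: 32 layers, 32 KV heads, head_dim 128, FP16.
gib = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1) / 2**30
print(f"{gib:.2f} GiB")   # 2.00 GiB at 4k context, per sequence
```

The linear factor on `seq_len` is the point: at 32k context the same shape needs 16 GiB of cache per sequence, before any weights are counted.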
04
How context length affects memory, cost, and attention complexity — and what actually happens at the limits.
05
Temperature, top-p, top-k, beam search — how the decoding algorithm shapes output quality, diversity, and reproducibility.
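The interaction of temperature, top-k, and top-p can be sketched as successive filters over the softmax distribution. A minimal pure-Python version that returns the filtered, renormalised distribution (the final random draw is omitted to keep it deterministic):

```python
import math

def sample_dist(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Turn raw logits into a filtered, renormalised sampling distribution."""
    # Temperature: rescale logits before softmax (lower -> peakier).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    probs = [e / sum(exps) for e in exps]
    # Rank token indices by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k:                       # top-k: keep only the k most likely (0 = off)
        order = order[:top_k]
    if top_p < 1.0:                 # nucleus: smallest set with mass >= top_p
        kept, mass = [], 0.0
        for i in order:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
        order = kept
    total = sum(probs[i] for i in order)
    return {i: probs[i] / total for i in order}

dist = sample_dist([2.0, 1.0, 0.1], temperature=0.7, top_k=2)
```

Applying the filters in this order (temperature, then top-k, then top-p) matches common serving defaults, but engines differ, which is one reason identical parameters can produce different distributions across stacks.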
Working with models — directing and orchestrating
06
How prompt structure affects prefill cost, KV cache reuse, and latency. The engineering angle, not tips for better outputs.
07
What users say versus what they mean — and how to design systems that close that gap reliably through prompts, structure, and evaluation.
08
What an agent actually is, how a state machine applies to inference loops, and why most agent bugs are state bugs.
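The loop can be sketched as an explicit state machine. Everything here is a hypothetical illustration, not a real framework API: the states, the `toy_llm` stand-in, and the `calc` tool are all invented for the sketch:

```python
# Minimal agent loop as an explicit state machine (hypothetical names throughout).
# Making states explicit is what keeps "most agent bugs are state bugs" findable.
PLAN, ACT, OBSERVE, DONE = "plan", "act", "observe", "done"

def run_agent(task, llm, tools, max_steps=8):
    state, scratchpad, steps = PLAN, [task], 0
    while state != DONE and steps < max_steps:
        steps += 1
        if state == PLAN:
            thought = llm(scratchpad)            # decide next tool call or finish
            scratchpad.append(thought)
            state = DONE if thought.startswith("FINAL:") else ACT
        elif state == ACT:
            name, arg = scratchpad[-1].split(":", 1)
            scratchpad.append(tools[name](arg))  # run the chosen tool
            state = OBSERVE
        elif state == OBSERVE:
            state = PLAN                         # feed observation back to the model
    return scratchpad[-1]

# Toy deterministic "LLM": calls the calc tool once, then finishes.
def toy_llm(pad):
    return "calc: 2+2" if len(pad) == 1 else "FINAL: " + pad[-1]

result = run_agent("add numbers", toy_llm, {"calc": lambda a: str(eval(a))})
print(result)  # FINAL: 4
```

The `max_steps` guard and the single `state` variable are the point: every transition is enumerable, so a stuck loop or a skipped observation shows up as a reachable, testable state.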
Data & knowledge — retrieval and memory
09
What an embedding is, how vector databases work, and where this fits in an AI system pipeline.
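At its core, vector search is nearest-neighbour lookup over embeddings by cosine similarity. A brute-force sketch with toy 3-dimensional vectors (real embeddings have hundreds to thousands of dimensions, and production databases use approximate indexes to avoid the full scan):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nearest(query, index, k=2):
    """Brute-force vector search: what an ANN index approximates at scale."""
    scored = sorted(index, key=lambda item: -cosine(query, item[1]))
    return [doc for doc, _ in scored[:k]]

# Toy 3-d "embeddings" standing in for real model outputs.
index = [("dogs", [0.9, 0.1, 0.0]),
         ("cats", [0.8, 0.2, 0.1]),
         ("stocks", [0.0, 0.1, 0.9])]
print(nearest([1.0, 0.0, 0.0], index, k=2))  # ['dogs', 'cats']
```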
10
How retrieval fits into the inference pipeline — chunking, embedding, reranking, and the tradeoffs vs. longer context.
11
A decision framework for when each approach makes sense — what each costs in time, money, and complexity.
Optimisation & production — scale and efficiency
12
From static batching to continuous batching — how serving systems amortise inference cost across concurrent users.
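The scheduling difference can be shown with a toy simulator: under continuous batching, a short request leaves the batch as soon as it finishes and its slot is refilled immediately. A sketch of the idea only, not how any particular serving system is implemented:

```python
# Continuous batching sketch: new requests join the batch as soon as others
# finish, instead of waiting for the whole batch to drain (static batching).
def simulate(arrivals, max_batch=2):
    """arrivals: list of (arrival_step, tokens_to_generate) -> finish step per id."""
    waiting = sorted(range(len(arrivals)), key=lambda i: arrivals[i][0])
    running, remaining, finished, step = [], {}, {}, 0
    while waiting or running:
        # Admit any arrived request into a free batch slot -- checked every step.
        while waiting and len(running) < max_batch and arrivals[waiting[0]][0] <= step:
            i = waiting.pop(0)
            running.append(i)
            remaining[i] = arrivals[i][1]
        for i in list(running):          # one decode step for the whole batch
            remaining[i] -= 1
            if remaining[i] == 0:
                running.remove(i)
                finished[i] = step       # slot frees up this same step
        step += 1
    return finished

# Request 1 (one token) finishes at step 0, so request 2 starts at step 1
# instead of waiting for the longest request in the batch to drain.
print(simulate([(0, 3), (0, 1), (1, 2)]))
```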
13
Shrinking model weights from full precision to INT4 to fit more on GPU and increase throughput. When the accuracy tradeoff is worth it.
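A minimal sketch of symmetric round-to-nearest quantisation, shown at INT8 for clarity; production INT4 schemes add per-group scales and careful outlier handling on top of this core idea:

```python
# Symmetric per-tensor quantisation sketch (assumed simplest-case scheme).
def quantise(weights, bits=8):
    qmax = 2 ** (bits - 1) - 1                  # e.g. 127 for INT8
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]     # round-to-nearest integers
    return q, scale

def dequantise(q, scale):
    return [x * scale for x in q]

w = [0.12, -0.50, 0.33, 0.07]
q, s = quantise(w)
w_hat = dequantise(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
# Worst-case round-trip error is bounded by half the quantisation step.
assert max_err <= s / 2
```

Dropping `bits` from 8 to 4 shrinks storage 2x but grows the step size 16x, which is the accuracy tradeoff the guide weighs.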
14
Using a small draft model to generate candidate tokens the large model verifies in one pass. More tokens per forward pass, same output quality.
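The control flow can be sketched with toy deterministic functions standing in for the draft and target models. The greedy acceptance rule below is a simplification of the rejection-sampling scheme real implementations use, but it shows why output quality is preserved:

```python
# Speculative decoding control flow with toy deterministic models (assumed).
# Greedy acceptance: keep draft tokens while they match the target's own
# greedy choice, then take the target's token at the first mismatch.
def speculative_step(target, draft, prefix, k=4):
    proposed, ctx = [], list(prefix)
    for _ in range(k):                      # cheap model proposes k tokens
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    # In a real engine this verification is one batched target forward pass.
    for t in proposed:
        expected = target(ctx)
        if t != expected:
            accepted.append(expected)       # correction token, then stop
            break
        accepted.append(t)
        ctx.append(t)
    return accepted

target = lambda ctx: len(ctx) % 3           # toy next-token rules
draft  = lambda ctx: len(ctx) % 3 if len(ctx) < 4 else 0
out = speculative_step(target, draft, prefix=[7, 7], k=4)
print(out)  # [2, 0, 1]: two draft tokens accepted, plus the target's correction
```

The output is token-for-token what the target alone would have produced, only cheaper: three tokens emerged from a single verification pass.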
15
How MoE models route each token through a subset of specialised sub-networks — more capacity without proportionally more compute per token.
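Top-k gating can be sketched in a few lines; the toy experts and gate logits below are illustrative, not any real architecture:

```python
import math

# Top-2 MoE routing sketch: softmax the gate scores, run only the chosen
# experts, combine outputs weighted by the renormalised gate probabilities.
def moe_forward(x, gate_logits, experts, k=2):
    m = max(gate_logits)
    exps = [math.exp(g - m) for g in gate_logits]
    probs = [e / sum(exps) for e in exps]
    top = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    total = sum(probs[i] for i in top)
    # Only k experts execute per token -- the "more capacity, same compute" trick.
    return sum(probs[i] / total * experts[i](x) for i in top)

experts = [lambda x: x + 1, lambda x: x * 10, lambda x: -x]  # toy experts
y = moe_forward(2.0, gate_logits=[0.0, 2.0, -1.0], experts=experts, k=2)
```

Note the third expert never runs for this token: its parameters add capacity to the model without adding compute to this forward pass.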
16
TTFT, TPOT, tokens/second, cost per million tokens — the numbers that determine whether your inference system is working.
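All four metrics fall out of three timestamps per request. A sketch; the price per million tokens is an assumed illustrative figure, not a real vendor rate:

```python
# Deriving headline serving metrics from per-request timestamps (in seconds).
def metrics(t_sent, t_first_token, t_done, n_output_tokens, price_per_mtok=2.0):
    ttft = t_first_token - t_sent                            # time to first token
    tpot = (t_done - t_first_token) / (n_output_tokens - 1)  # time per output token
    tps = n_output_tokens / (t_done - t_sent)                # end-to-end tokens/sec
    cost = n_output_tokens / 1e6 * price_per_mtok            # output cost, USD
    return ttft, tpot, tps, cost

# price_per_mtok = 2.0 is an assumed illustrative price.
ttft, tpot, tps, cost = metrics(0.0, 0.25, 2.25, n_output_tokens=101)
```

TTFT is dominated by prefill and queueing; TPOT by decode bandwidth. Measuring only tokens/second hides which phase is actually slow.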
Featured Concept
Every LLM request goes through exactly two phases, and confusing them is the single most common mistake when optimising for latency.
In the prefill phase, the model ingests your entire prompt in parallel. All input tokens are processed simultaneously — it's compute-bound and relatively fast. This phase populates the KV cache.
In the decode phase, the model generates output tokens one at a time. Each new token is appended to the KV cache, which is read in full on every step. This phase is memory-bandwidth-bound — throwing more FLOPS at it rarely helps.
Understanding this split tells you why optimisations like speculative decoding target the decode phase, and why prefill batching has different economics than decode batching.
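The bandwidth bound on decode can be turned into a ceiling estimate. A sketch with assumed figures (14 GB of FP16 weights, 2 GB of KV cache, 3 TB/s of HBM), at batch size 1:

```python
# Why decode is bandwidth-bound: each generated token must stream the full
# weights (plus the KV cache) through the GPU once. Assumed figures, batch = 1.
def decode_tokens_per_sec(param_bytes, kv_bytes, hbm_bandwidth):
    bytes_per_token = param_bytes + kv_bytes
    return hbm_bandwidth / bytes_per_token   # upper bound, ignores overheads

# 7B params in FP16 (~14 GB), 2 GB of KV cache, 3 TB/s of HBM (assumed).
print(decode_tokens_per_sec(14e9, 2e9, 3e12))   # 187.5 tokens/sec ceiling
```

No amount of extra compute raises that ceiling; only faster memory, smaller weights (quantisation), or amortising reads across a batch does.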
Read the full guide
// inference_pipeline.txt
Phase 1 — Prefill
Phase 2 — Decode
Glossary — selected terms
The Inference Stack
| Layer | What it does | Key Tools | Guide |
|---|---|---|---|
| Application | Prompt construction, streaming, response parsing | OpenAI SDK, LangChain | Guide → |
| Retrieval | Embedding, vector search, reranking, RAG pipelines | pgvector, Pinecone, Weaviate | Guide → |
| Serving | Batching, scheduling, KV cache management, APIs | vLLM, SGLang, TGI | Guide → |
| Inference Engine | Kernel optimisation, graph compilation, quantisation | TensorRT-LLM, FlashAttention | Guide → |
| Hardware | Compute, HBM, NVLink interconnects, PCIe | HBM GPUs, NVLink, PCIe | Guide → |