Attention & the KV Cache
The Memory Structure at the Heart of Every LLM
At short context lengths the KV cache is a minor detail. At 32k+ tokens it becomes the dominant memory consumer, often exceeding the model weights themselves. This is where requests get slow, memory gets exhausted, and costs balloon.
What attention actually does
Every modern LLM is built on attention: a mechanism where each token, as it is processed, attends to the tokens before it in the context and decides how much weight to give each one. This is what lets a model resolve pronouns, track references, and maintain coherent reasoning across thousands of tokens.
Concretely: to generate the next token, the model must consider the relationship between that token and every previous token in the context window. This isn't optional — it's the core of how transformers work.
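The mechanism can be sketched in a few lines of numpy. This is a single-head, causally masked scaled dot-product attention; all names here are illustrative, not from any particular library:

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V: (seq_len, head_dim). Each output row is a weighted average
    of V rows, with weights derived from that row's query scored against
    every key at or before its position.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (seq, seq) pairwise scores
    future = np.triu(np.ones_like(scores), k=1)   # positions after the query
    scores = np.where(future.astype(bool), -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))   # five toy token vectors
out = attention(x, x, x)          # toy case: Q = K = V
```

Because of the causal mask, the first token can only attend to itself, so its output is exactly its own value vector; every later token mixes in everything before it.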
// The O(n²) problem
Attention complexity scales as O(n²) with sequence length n. Double the context length, quadruple the attention computation. This is why long-context inference is more expensive than short-context inference — a 4× cost increase for 2× the context length.
The naive approach: recompute everything
In a naive implementation, nothing is stored between steps, so generating each new token means recomputing the attention scores for the entire sequence from scratch. For a 4,096-token context, generating token 4,097 requires on the order of 4,097 × 4,097 score computations. Token 4,098 requires even more.
This is wildly inefficient. The attention scores for tokens 1–4,096 were already computed during the previous step. They haven't changed. Recomputing them is pure waste.
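The waste is easy to quantify with two toy cost counters (hypothetical helper names, counting causal attention-score computations only):

```python
def naive_decode_cost(prompt_len, new_tokens):
    """Scores computed when every decode step redoes the full causal
    attention for the whole sequence from scratch."""
    total = 0
    for step in range(new_tokens):
        n = prompt_len + step + 1      # current sequence length
        total += n * (n + 1) // 2      # every causal pair, recomputed
    return total

def cached_decode_cost(prompt_len, new_tokens):
    """Scores computed when K,V are cached: each step only scores the
    newest query against the keys already in the cache."""
    total = 0
    for step in range(new_tokens):
        n = prompt_len + step + 1
        total += n                     # one query row against n keys
    return total
```

For a 4-token prompt and one new token, the naive approach computes 15 scores where the cached approach computes 5; the gap widens quadratically as generation continues.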
What the KV cache is
The KV (Key-Value) cache solves this by storing intermediate attention computations from previous steps so they don't need to be recomputed. In the attention mechanism, each token produces three vectors: a Query (Q), a Key (K), and a Value (V). To generate a new token, the model computes attention between the current query and all previous keys, then uses the result to weight the values.
The keys and values for all previous tokens never change: under causal attention they depend only on those tokens themselves and their positions, not on anything generated later. The KV cache stores them. On each decode step, the model only needs to compute Q, K, and V for the new token, append the new K and V to the cache, and run attention between the new query and the full cached sequence.
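A minimal decode loop makes the bookkeeping concrete. This is a toy sketch, with random matrices standing in for the model's learned Q/K/V projections; the last step's output is identical to what a full from-scratch recomputation would produce:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Fixed random projections standing in for learned Q/K/V weight matrices.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """One decode step: the single new query scored against all keys."""
    s = (K @ q) / np.sqrt(d)
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

tokens = rng.standard_normal((6, d))   # toy token embeddings
K_cache, V_cache = [], []
outputs = []
for x in tokens:
    K_cache.append(x @ Wk)             # compute K,V for the new token only...
    V_cache.append(x @ Wv)             # ...and append them to the cache
    outputs.append(attend(x @ Wq, np.asarray(K_cache), np.asarray(V_cache)))
```

Each iteration does O(1) projection work for the new token plus one attention pass over the cache; the projections for earlier tokens are never touched again.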
// kv_cache_decode_step
Blue = K,V cached. Green = token being generated, K,V being computed fresh.
How much memory the KV cache uses
The KV cache is not free. Its memory footprint grows with every generated token and scales with context length, batch size, model size, and precision.
KV cache memory per token
bytes = 2 × num_layers × num_kv_heads × head_dim × bytes_per_element
The factor of 2 is for K and V. For a large model with 80 layers, 8 KV heads, and 128-dimensional heads at FP16 (2 bytes per element): 2 × 80 × 8 × 128 × 2 = 327,680 bytes ≈ 0.33 MB per token.
Total KV cache for a running batch
total = bytes_per_token × max_seq_len × batch_size
A large 70B-class model at 128k context, batch of 8: 327,680 bytes × 131,072 tokens × 8 ≈ 344 GB. This exceeds the model weights themselves.
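Both formulas are trivial to encode as a sanity-check calculator (function names are illustrative):

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """Per-token KV cache footprint: K and V (the factor of 2) for every
    layer and KV head, at the given element width (2 bytes for FP16)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

def kv_cache_total(num_layers, num_kv_heads, head_dim,
                   max_seq_len, batch_size, bytes_per_element=2):
    """Worst-case cache for a running batch: every sequence at full length."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads,
                                   head_dim, bytes_per_element)
    return per_token * max_seq_len * batch_size

# The 80-layer example from the text, FP16, 128k context, batch of 8:
per_tok = kv_bytes_per_token(80, 8, 128)        # 327,680 bytes per token
total = kv_cache_total(80, 8, 128, 131_072, 8)  # ~344 GB
```

Note this is the worst case: real servers allocate for actual sequence lengths, which is exactly the slack that paged allocation (below in this document's sense, PagedAttention) exploits.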
// The long-context problem
At 128k context length, the KV cache for a single request on a large 70B-class model can consume tens of gigabytes — comparable to or exceeding the model weights at INT4. Serving multiple concurrent long-context requests is a memory management problem as much as a compute problem.
FlashAttention: reordering the computation
Even with the KV cache, the attention computation itself has a memory access problem. The standard algorithm loads Q, K, and V matrices from HBM (slow), performs the computation in SRAM (fast), and writes results back. For long sequences, this involves enormous data movement between HBM and SRAM.
FlashAttention reorders the attention calculation into blocks that fit entirely within SRAM. Instead of loading the full KV matrices from HBM, it tiles the computation so intermediate results never leave SRAM. HBM access drops sharply — the original paper reported an order of magnitude reduction in memory reads — and it is now universal across production inference stacks.
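The core numerical trick is the online softmax: process K and V one block at a time while carrying only a running max, a running normaliser, and a running output, so the full score row never materialises. A simplified single-query numpy sketch of that idea (illustrative only; the real kernel also tiles Q and runs the blocks inside SRAM):

```python
import numpy as np

def flash_like_attention(q, K, V, block=4):
    """Attend one query over K/V in blocks using the online-softmax
    recurrence, producing bit-for-bit the same result as full softmax
    attention without ever holding all scores at once."""
    d = q.shape[-1]
    m = -np.inf          # running max of scores seen so far
    l = 0.0              # running softmax normaliser
    acc = np.zeros(d)    # running (unnormalised) output accumulator
    for start in range(0, len(K), block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Kb @ q / np.sqrt(d)
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)      # rescale earlier partial results
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ Vb
        m = m_new
    return acc / l

rng = np.random.default_rng(1)
K = rng.standard_normal((10, 8))
V = rng.standard_normal((10, 8))
q = rng.standard_normal(8)
```

The output matches ordinary softmax attention for any block size; the win on real hardware comes from where the blocks live (SRAM), not from doing less math.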
Grouped Query Attention: shrinking the cache
Standard multi-head attention (MHA) requires storing one K,V pair per attention head per token. For a model with 64 heads, that's 64 K vectors and 64 V vectors per token. Grouped Query Attention (GQA) allows multiple query heads to share a single K,V pair.
A model using GQA with 8 KV heads instead of 64 shrinks the KV cache by 8×. This is why grouped-query attention has become the standard architectural choice for large models: it makes long-context inference viable with little to no loss in quality.
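The sharing can be sketched in numpy: groups of consecutive query heads score against the same KV head, so only the 8 K,V pairs per token are ever stored. The function name and layout are illustrative:

```python
import numpy as np

def gqa_scores(Q, K):
    """Attention scores with grouped queries.

    Q: (n_q_heads, seq, d) query heads; K: (n_kv_heads, seq, d) shared
    KV heads, where n_q_heads is a multiple of n_kv_heads. Each KV head
    serves a group of n_q_heads // n_kv_heads consecutive query heads.
    """
    n_q, n_kv = Q.shape[0], K.shape[0]
    group = n_q // n_kv
    K_expanded = np.repeat(K, group, axis=0)   # (n_q_heads, seq, d) view of 8 heads
    return Q @ K_expanded.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])

rng = np.random.default_rng(2)
Q = rng.standard_normal((64, 5, 16))   # 64 query heads
K = rng.standard_normal((8, 5, 16))    # only 8 KV heads cached per token
scores = gqa_scores(Q, K)
```

The expansion happens transiently at compute time; the cache itself holds only the 8 KV heads, which is where the 8× memory saving comes from.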
PagedAttention: managing the cache like virtual memory
In naive implementations, each request's KV cache is allocated as a contiguous block of memory. This leads to severe fragmentation — gaps between allocations that can't be used. PagedAttention breaks the KV cache into fixed-size pages, like operating system virtual memory, enabling nearly waste-free memory utilisation under concurrent load.
This is covered in depth in the Batching Strategies guide, but the core insight is architectural: treating the KV cache as paged virtual memory rather than contiguous buffers eliminates the largest source of GPU memory waste at production concurrency levels.
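A toy allocator captures the architectural idea. Pages of a fixed token capacity are handed out on demand and returned the moment a request finishes; the class and method names here are hypothetical, not vLLM's API:

```python
class PagedKVAllocator:
    """Toy page-table allocator in the spirit of PagedAttention: a
    request's KV cache is a list of physical pages, not one contiguous
    block, so no space is reserved that may never be used."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free = list(range(num_pages))   # pool of free physical page ids
        self.tables = {}                     # request id -> list of page ids
        self.lengths = {}                    # request id -> tokens stored

    def append_token(self, req):
        """Record one more token for `req`, grabbing a new page only
        when the current page is full (or this is the first token)."""
        n = self.lengths.get(req, 0)
        if n % self.page_size == 0:
            if not self.free:
                raise MemoryError("no free KV pages")
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req):
        """Request finished: its pages return to the pool immediately,
        ready for any other request regardless of sequence length."""
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)
```

Because every allocation is a whole page, waste is bounded by less than one page per sequence, instead of growing with the gap between reserved and actual context length.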
// In short