Attention & the KV Cache
The Memory Structure at the Heart of Every LLM
At short context lengths the KV cache is a minor detail. At 32k+ tokens it becomes the dominant memory consumer, often exceeding the model weights themselves. This is where requests get slow, memory gets exhausted, and costs balloon.
What attention actually does
Every modern LLM is built on attention: a mechanism where each token, as it is processed, attends to the tokens before it in the context and decides how much weight to give each one. This is what lets a model resolve pronouns, track references, and maintain coherent reasoning across thousands of tokens.
Concretely: to generate the next token, the model must consider the relationship between that token and every previous token in the context window. This isn't optional — it's the core of how transformers work.
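The mechanism can be sketched in a few lines of numpy. This is a single-head, causally masked scaled dot-product attention; all names here are illustrative, not from any particular library:

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V: (seq_len, head_dim). Each output row is a weighted average
    of V rows, with weights derived from that row's query scored against
    every key at or before its position.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (seq, seq) pairwise scores
    future = np.triu(np.ones_like(scores), k=1)   # positions after the query
    scores = np.where(future.astype(bool), -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))   # five toy token vectors
out = attention(x, x, x)          # toy case: Q = K = V
```

Because of the causal mask, the first token can only attend to itself, so its output is exactly its own value vector; every later token mixes in everything before it.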
// The O(n²) problem
Attention complexity scales as O(n²) with sequence length n. Double the context length, quadruple the attention computation. This is why long-context inference is more expensive than short-context inference — a 4× cost increase for 2× the context length.
The naive approach: recompute everything
In a naive implementation, nothing is stored between steps, so generating each new token means recomputing the attention scores for the entire sequence from scratch. For a 4,096-token context, generating token 4,097 requires on the order of 4,097 × 4,097 score computations. Token 4,098 requires even more.
This is wildly inefficient. The attention scores for tokens 1–4,096 were already computed during the previous step. They haven't changed. Recomputing them is pure waste.
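The waste is easy to quantify with two toy cost counters (hypothetical helper names, counting causal attention-score computations only):

```python
def naive_decode_cost(prompt_len, new_tokens):
    """Scores computed when every decode step redoes the full causal
    attention for the whole sequence from scratch."""
    total = 0
    for step in range(new_tokens):
        n = prompt_len + step + 1      # current sequence length
        total += n * (n + 1) // 2      # every causal pair, recomputed
    return total

def cached_decode_cost(prompt_len, new_tokens):
    """Scores computed when K,V are cached: each step only scores the
    newest query against the keys already in the cache."""
    total = 0
    for step in range(new_tokens):
        n = prompt_len + step + 1
        total += n                     # one query row against n keys
    return total
```

For a 4-token prompt and one new token, the naive approach computes 15 scores where the cached approach computes 5; the gap widens quadratically as generation continues.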
What the KV cache is
The KV (Key-Value) cache solves this by storing intermediate attention computations from previous steps so they don't need to be recomputed. In the attention mechanism, each token produces three vectors: a Query (Q), a Key (K), and a Value (V). To generate a new token, the model computes attention between the current query and all previous keys, then uses the result to weight the values.
The keys and values for all previous tokens never change: under causal attention they depend only on those tokens themselves and their positions, not on anything generated later. The KV cache stores them. On each decode step, the model only needs to compute Q, K, and V for the new token, append the new K and V to the cache, and run attention between the new query and the full cached sequence.
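A minimal decode loop makes the bookkeeping concrete. This is a toy sketch, with random matrices standing in for the model's learned Q/K/V projections; the last step's output is identical to what a full from-scratch recomputation would produce:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Fixed random projections standing in for learned Q/K/V weight matrices.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """One decode step: the single new query scored against all keys."""
    s = (K @ q) / np.sqrt(d)
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

tokens = rng.standard_normal((6, d))   # toy token embeddings
K_cache, V_cache = [], []
outputs = []
for x in tokens:
    K_cache.append(x @ Wk)             # compute K,V for the new token only...
    V_cache.append(x @ Wv)             # ...and append them to the cache
    outputs.append(attend(x @ Wq, np.asarray(K_cache), np.asarray(V_cache)))
```

Each iteration does O(1) projection work for the new token plus one attention pass over the cache; the projections for earlier tokens are never touched again.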
// kv_cache_decode_step
Blue = K,V cached. Green = token being generated, K,V being computed fresh.
How much memory the KV cache uses
The KV cache is not free. Its memory footprint grows with every generated token and scales with context length, batch size, model size, and precision.
KV cache memory per token
bytes = 2 × num_layers × num_kv_heads × head_dim × bytes_per_element
The factor of 2 is for K and V. For a large model with 80 layers, 8 KV heads, and 128-dimensional heads at FP16 (2 bytes per element): 2 × 80 × 8 × 128 × 2 = 327,680 bytes ≈ 0.33 MB per token.
Total KV cache for a running batch
total = bytes_per_token × max_seq_len × batch_size
A large 70B-class model at 128k context, batch of 8: 327,680 bytes × 131,072 tokens × 8 ≈ 344 GB. This exceeds the model weights themselves.
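Both formulas are trivial to encode as a sanity-check calculator (function names are illustrative):

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """Per-token KV cache footprint: K and V (the factor of 2) for every
    layer and KV head, at the given element width (2 bytes for FP16)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

def kv_cache_total(num_layers, num_kv_heads, head_dim,
                   max_seq_len, batch_size, bytes_per_element=2):
    """Worst-case cache for a running batch: every sequence at full length."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads,
                                   head_dim, bytes_per_element)
    return per_token * max_seq_len * batch_size

# The 80-layer example from the text, FP16, 128k context, batch of 8:
per_tok = kv_bytes_per_token(80, 8, 128)        # 327,680 bytes per token
total = kv_cache_total(80, 8, 128, 131_072, 8)  # ~344 GB
```

Note this is the worst case: real servers allocate for actual sequence lengths, which is exactly the slack that paged allocation (below in this document's sense, PagedAttention) exploits.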
// The long-context problem
At 128k context length, the KV cache for a single request on a large 70B-class model can consume tens of gigabytes — comparable to or exceeding the model weights at INT4. Serving multiple concurrent long-context requests is a memory management problem as much as a compute problem.
FlashAttention: reordering the computation
Even with the KV cache, the attention computation itself has a memory access problem. The standard algorithm loads Q, K, and V matrices from HBM (slow), performs the computation in SRAM (fast), and writes results back. For long sequences, this involves enormous data movement between HBM and SRAM.
FlashAttention reorders the attention calculation into blocks that fit entirely within SRAM. Instead of loading the full KV matrices from HBM, it tiles the computation so intermediate results never leave SRAM. HBM access drops sharply — the original paper reported an order of magnitude reduction in memory reads — and it is now universal across production inference stacks.
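The core numerical trick is the online softmax: process K and V one block at a time while carrying only a running max, a running normaliser, and a running output, so the full score row never materialises. A simplified single-query numpy sketch of that idea (illustrative only; the real kernel also tiles Q and runs the blocks inside SRAM):

```python
import numpy as np

def flash_like_attention(q, K, V, block=4):
    """Attend one query over K/V in blocks using the online-softmax
    recurrence, producing bit-for-bit the same result as full softmax
    attention without ever holding all scores at once."""
    d = q.shape[-1]
    m = -np.inf          # running max of scores seen so far
    l = 0.0              # running softmax normaliser
    acc = np.zeros(d)    # running (unnormalised) output accumulator
    for start in range(0, len(K), block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Kb @ q / np.sqrt(d)
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)      # rescale earlier partial results
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ Vb
        m = m_new
    return acc / l

rng = np.random.default_rng(1)
K = rng.standard_normal((10, 8))
V = rng.standard_normal((10, 8))
q = rng.standard_normal(8)
```

The output matches ordinary softmax attention for any block size; the win on real hardware comes from where the blocks live (SRAM), not from doing less math.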
Grouped Query Attention: shrinking the cache
Standard multi-head attention (MHA) requires storing one K,V pair per attention head per token. For a model with 64 heads, that's 64 K vectors and 64 V vectors per token. Grouped Query Attention (GQA) allows multiple query heads to share a single K,V pair.
A model using GQA with 8 KV heads instead of 64 shrinks the KV cache by 8×. This is why grouped-query attention has become the standard architectural choice for large models: it makes long-context inference viable with little to no loss in quality.
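The sharing can be sketched in numpy: groups of consecutive query heads score against the same KV head, so only the 8 K,V pairs per token are ever stored. The function name and layout are illustrative:

```python
import numpy as np

def gqa_scores(Q, K):
    """Attention scores with grouped queries.

    Q: (n_q_heads, seq, d) query heads; K: (n_kv_heads, seq, d) shared
    KV heads, where n_q_heads is a multiple of n_kv_heads. Each KV head
    serves a group of n_q_heads // n_kv_heads consecutive query heads.
    """
    n_q, n_kv = Q.shape[0], K.shape[0]
    group = n_q // n_kv
    K_expanded = np.repeat(K, group, axis=0)   # (n_q_heads, seq, d) view of 8 heads
    return Q @ K_expanded.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])

rng = np.random.default_rng(2)
Q = rng.standard_normal((64, 5, 16))   # 64 query heads
K = rng.standard_normal((8, 5, 16))    # only 8 KV heads cached per token
scores = gqa_scores(Q, K)
```

The expansion happens transiently at compute time; the cache itself holds only the 8 KV heads, which is where the 8× memory saving comes from.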
PagedAttention: managing the cache like virtual memory
In naive implementations, each request's KV cache is allocated as a contiguous block of memory. This leads to severe fragmentation — gaps between allocations that can't be used. PagedAttention breaks the KV cache into fixed-size pages, like operating system virtual memory, enabling nearly waste-free memory utilisation under concurrent load.
This is covered in depth in the Batching Strategies guide, but the core insight is architectural: treating the KV cache as paged virtual memory rather than contiguous buffers eliminates the largest source of GPU memory waste at production concurrency levels.
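A toy allocator captures the architectural idea. Pages of a fixed token capacity are handed out on demand and returned the moment a request finishes; the class and method names here are hypothetical, not vLLM's API:

```python
class PagedKVAllocator:
    """Toy page-table allocator in the spirit of PagedAttention: a
    request's KV cache is a list of physical pages, not one contiguous
    block, so no space is reserved that may never be used."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free = list(range(num_pages))   # pool of free physical page ids
        self.tables = {}                     # request id -> list of page ids
        self.lengths = {}                    # request id -> tokens stored

    def append_token(self, req):
        """Record one more token for `req`, grabbing a new page only
        when the current page is full (or this is the first token)."""
        n = self.lengths.get(req, 0)
        if n % self.page_size == 0:
            if not self.free:
                raise MemoryError("no free KV pages")
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req):
        """Request finished: its pages return to the pool immediately,
        ready for any other request regardless of sequence length."""
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)
```

Because every allocation is a whole page, waste is bounded by less than one page per sequence, instead of growing with the gap between reserved and actual context length.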
// In short