// Concept 05 — Serving

Batching Strategies

How Modern Serving Engines Handle Concurrent Requests

Most of the gap between naive LLM serving and production-grade LLM serving comes down to batching. Continuous batching and PagedAttention are the two techniques that make it economically viable to serve requests at scale.


The naive serving problem

Suppose you have ten users submitting requests at roughly the same time. The most naive implementation processes them serially — user 1 completes, then user 2, then user 3. Most of your GPU sits idle waiting for each request to finish. Throughput is terrible.

The obvious improvement is to batch requests together — process all ten simultaneously. The GPU runs a single forward pass over all ten inputs at once, amortising weight loads across more compute and lifting GPU utilisation. This is static batching.

// The static batching problem

In static batching, the entire batch must complete before new requests are admitted. If request 1 generates 500 tokens but requests 2–10 only need 50, the nine short requests finish early — then sit idle, holding their GPU memory, waiting for request 1 to finish. This is wasteful.

Continuous batching

Continuous batching (also called iteration-level scheduling or in-flight batching) solves this by making the batch dynamic. Instead of processing a fixed set of requests from start to finish, the scheduler operates at the token iteration level: after every decode step, finished sequences are evicted from the batch and new requests are admitted immediately.

// static_vs_continuous_batching

STATIC BATCHING — GPU timeline (slot = 1 decode step)

Request A | GENERATING (50 tokens) | idle .............. |
Request B | GENERATING (70 tokens) .................... |
Request C | WAITING for batch to complete ............. |

CONTINUOUS BATCHING — GPU timeline

Request A | GEN (30 tokens) |
Request B | GENERATING (70 tokens) .................... |
Request C | wait | GENERATING (admitted after A) ...... |
Continuous batching admits a new request the moment a slot opens, eliminating idle cycles and substantially increasing GPU utilisation.

The throughput gains are substantial — typically an order of magnitude or more over static batching in mixed-length workloads, where some requests complete quickly and others run long. The gains are largest precisely when request lengths vary widely, because that variance is exactly what static batching handles worst. All major serving systems implement continuous batching: it's the baseline, not a differentiator.
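One way to make the iteration-level loop concrete is a toy scheduler like the following. The class, request format, and token counting are invented for illustration — real engines track far more state — but the core move is the same: evict and admit after every decode step, not after every batch.

```python
from collections import deque

class ContinuousBatcher:
    """Toy iteration-level scheduler: finished sequences leave the batch
    after every decode step, and queued requests fill the freed slots
    immediately. Illustrative only — not any real engine's API."""

    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size
        self.queue = deque()   # waiting requests
        self.running = []      # (request_id, tokens_left_to_generate)

    def submit(self, request_id, tokens_to_generate):
        self.queue.append((request_id, tokens_to_generate))

    def step(self):
        """One decode iteration: admit, generate one token each, evict."""
        # Admit the moment slots are free — the key difference from static
        # batching, which waits for the entire batch to drain first.
        while self.queue and len(self.running) < self.max_batch_size:
            self.running.append(self.queue.popleft())
        # Every running sequence produces one token this iteration.
        self.running = [(rid, left - 1) for rid, left in self.running]
        finished = [rid for rid, left in self.running if left == 0]
        self.running = [(rid, left) for rid, left in self.running if left > 0]
        return finished
```

With `max_batch_size=2` and requests A (1 token), B (3), C (2), request C enters the batch on the very next step after A finishes — no waiting for B.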

PagedAttention: virtual memory for the KV cache

Continuous batching improves scheduling efficiency. But it exposes a new problem: memory fragmentation. In a naive implementation, each request's KV cache is allocated as a single contiguous block. You must pre-allocate the maximum possible size upfront (because you don't know how long the response will be), leading to massive internal fragmentation when responses are short.

PagedAttention solves this by borrowing a concept from operating systems: virtual memory paging. Instead of allocating contiguous KV cache blocks, PagedAttention divides the cache into fixed-size pages (typically 16 tokens each) that can be allocated and freed independently, anywhere in GPU memory.

// kv_cache_memory_management

NAIVE — contiguous allocation, fragmentation visible

| R1 | R1 | R1 | reserved, unused ... | R2 | R2 | reserved ... | R3 | reserved ... |

PAGED — non-contiguous pages, near-zero fragmentation

| R1 | R2 | R3 | R1 | R2 | R1 | R3 | R2 |
Pages from R1, R2, R3 are interleaved throughout memory. No pre-allocation required. Pages grow on demand. Fragmentation waste drops from substantial to near-zero.

The practical result: paged allocation eliminates the largest source of KV cache waste. Pre-allocated contiguous buffers waste a substantial fraction of VRAM on padding and fragmentation; page-based allocation brings that close to zero, which directly translates to more concurrent requests per GPU.
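A minimal sketch of the page-table idea follows. The class, pool sizes, and method names are invented for illustration; the point is that a sequence's logical token positions map to physical pages scattered anywhere in the pool, and pages are taken one at a time as the sequence grows.

```python
class PagedKVAllocator:
    """Toy page-table allocator in the spirit of PagedAttention: each
    sequence grows page by page instead of reserving its maximum length
    up front. Illustrative only."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # physical page ids
        self.page_tables = {}                     # seq_id -> [page ids]
        self.lengths = {}                         # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more KV entry, taking a fresh page only
        when the current one fills."""
        table = self.page_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.page_size == 0:          # current page is full
            if not self.free_pages:
                raise MemoryError("KV pool exhausted")
            table.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return every page of a finished sequence to the pool at once."""
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because pages return to a shared pool the instant a sequence finishes, the only waste is the partially filled last page of each sequence — at most `page_size - 1` tokens per request.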

Prefix caching

Many production use cases send the same or highly similar prefixes — system prompts, few-shot examples, shared document contexts. Without caching, each request recomputes the KV values for these tokens from scratch during prefill, wasting compute on redundant work.

Prefix caching (sometimes called prompt caching) retains the KV pages for common prefixes in GPU memory. When a new request arrives with a matching prefix, the engine maps those cached pages directly, skipping the prefill computation for those tokens entirely.

In deployments with long system prompts or document-based Q&A, prefix caching can eliminate most of the repeated prefill work — which directly cuts time-to-first-token for any request that shares a prefix with a prior one.
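The matching step can be sketched as a hash-chained block lookup, roughly in the spirit of how paged engines key cached prefix blocks. Everything here — the class, the block size, the stand-in "pages" — is invented for illustration; a real implementation stores actual KV pages and handles eviction.

```python
import hashlib

class PrefixCache:
    """Toy prefix cache: full blocks of prompt tokens are keyed by a hash
    chained through all preceding blocks, so identical prefixes map to
    identical keys. Illustrative only."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.cache = {}   # chained block hash -> stand-in for KV pages

    def _block_key(self, prev_key, block):
        return hashlib.sha256((prev_key + str(block)).encode()).hexdigest()

    def split(self, tokens):
        """Return (tokens covered by cached KV, tokens needing prefill),
        recording any newly seen full blocks along the way."""
        covered, key = 0, ""
        full = len(tokens) - len(tokens) % self.block_size
        for i in range(0, full, self.block_size):
            key = self._block_key(key, tokens[i:i + self.block_size])
            if key in self.cache:
                covered = i + self.block_size   # hit: skip this block's prefill
            else:
                self.cache[key] = f"kv-{key[:8]}"  # miss: compute, then store
        return tokens[:covered], tokens[covered:]
```

Chaining the hash through preceding blocks is what makes a block reusable only when its entire prefix matches — attention over a block depends on every token before it, so a block-level match alone would be incorrect.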

Chunked prefill

Prefill (processing the input prompt) and decode (generating tokens one at a time) have very different compute profiles. Prefill is compute-bound: the GPU processes every prompt token in parallel and runs near peak utilisation. Decode is memory-bandwidth-bound and slow: each step emits one token per sequence, and most of the time goes to streaming weights and KV cache from memory.

When a long prefill request arrives, it can monopolise the GPU for hundreds of milliseconds, stalling all active decode requests and spiking their time-to-first-token. Chunked prefill addresses this by breaking prefill into fixed-size chunks (e.g., 512 tokens), interleaving them with decode steps. This keeps the GPU busy without blocking active generations, improving latency predictability under load.
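A per-iteration budget is one way to express this. The sketch below is illustrative — the function shape, the request dicts, and the budget value are invented — but it captures the rule: every step runs all active decodes plus at most a fixed number of prefill tokens.

```python
def schedule_step(prefill_queue, decode_batch, chunk_budget=512):
    """One iteration of a toy chunked-prefill scheduler: decodes always
    run; prefill work is capped at `chunk_budget` tokens per step, so a
    long prompt never monopolises the GPU. Illustrative only."""
    # Active decodes each contribute one token of work this step.
    work = [("decode", seq_id, 1) for seq_id in decode_batch]
    budget = chunk_budget
    for req in prefill_queue:
        if budget == 0:
            break
        take = min(req["remaining"], budget)  # next chunk of this prompt
        req["remaining"] -= take
        budget -= take
        work.append(("prefill", req["id"], take))
    # Drop prompts whose prefill has fully completed.
    prefill_queue[:] = [r for r in prefill_queue if r["remaining"] > 0]
    return work
```

A 1200-token prompt under a 512-token budget spreads across three iterations (512 + 512 + 176), and every one of those iterations still advances the active decodes.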

Scheduling priorities

Real serving systems must make tradeoffs between throughput and latency. High-throughput optimisation maximises tokens per second across all users — good for cost-per-token metrics but may increase individual user latency. Latency-first scheduling prioritises completing requests quickly at the expense of overall utilisation.

// The key insight

The right batching strategy depends on your SLO. Batch-size-maximising continuous batching optimises for throughput. Admission control (limiting max batch size) protects latency SLOs. These are separate dials. Most production systems use both, tuned for their specific request distribution.
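A crude cost model makes the tension visible. The numbers below are entirely invented — real step times depend on model, hardware, and sequence lengths — but the shape is representative: a decode step has a large fixed memory-bound cost plus a small per-sequence cost, so batching raises aggregate throughput while slowing every individual stream.

```python
def step_time_ms(batch_size, base_ms=10.0, per_seq_ms=0.5):
    """Toy decode-step cost model (numbers invented): a fixed
    memory-bandwidth-bound cost plus a small per-sequence increment."""
    return base_ms + per_seq_ms * batch_size

def tokens_per_second(batch_size):
    """Aggregate throughput: tokens emitted per step / step duration."""
    return batch_size * 1000.0 / step_time_ms(batch_size)
```

Under this model, growing the batch from 8 to 32 roughly doubles aggregate tokens per second, but each user's inter-token latency also rises — which is exactly why an admission-control cap on batch size is the latency dial.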

// In short

01  Continuous batching is now universal. Every production serving system uses it. The gains over static batching are largest in mixed-length workloads — which is most real traffic.
02  PagedAttention eliminates KV cache fragmentation. Contiguous allocation wastes a substantial fraction of VRAM on padding and gaps; paged allocation brings fragmentation close to zero, fitting significantly more concurrent requests on the same hardware.
03  Prefix caching can eliminate most repeated prefill work in workloads with shared system prompts or document contexts. It's often the biggest win for chatbot and RAG deployments.
04  Chunked prefill keeps latency predictable. Long prompts don't block decode steps, reducing tail latency under load.
05  Throughput and latency are in tension. Tuning batch size and admission control is how you navigate that tradeoff in deployment.