Batching Strategies
How Modern Serving Engines Handle Concurrent Requests
Most of the gap between naive LLM serving and production-grade LLM serving comes down to batching. Continuous batching and PagedAttention are the two techniques that make it economically viable to serve requests at scale.
The naive serving problem
Suppose you have ten users submitting requests at roughly the same time. The most naive implementation processes them serially — user 1 completes, then user 2, then user 3. Most of your GPU sits idle waiting for each request to finish. Throughput is terrible.
The obvious improvement is to batch requests together — process all ten simultaneously. The GPU runs a single forward pass over all ten inputs at once, amortising weight loads across more compute and lifting GPU utilisation. This is static batching.
The static batching problem
In static batching, the entire batch must complete before new requests are admitted. If request 1 generates 500 tokens but requests 2–10 only need 50, the nine short requests finish early — then sit idle, holding their GPU memory, waiting for request 1 to finish. This is wasteful.
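A few lines of arithmetic make the waste concrete. This toy model (hypothetical request lengths, one decode step per batch slot) measures the fraction of batch capacity that sits idle waiting for the longest request:

```python
# Toy model of static batching: the whole batch occupies the GPU until
# the longest request finishes. Request lengths are hypothetical.
def static_batch_waste(lengths):
    steps = max(lengths)              # batch runs until the longest request is done
    occupied = steps * len(lengths)   # slot-steps held on the GPU
    useful = sum(lengths)             # slot-steps doing real decode work
    return 1 - useful / occupied

lengths = [500] + [50] * 9            # one long request, nine short ones
print(f"wasted capacity: {static_batch_waste(lengths):.0%}")  # → wasted capacity: 81%
```

With the mixed lengths from the text, over four fifths of the batch's capacity is spent holding finished requests in place.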
Continuous batching
Continuous batching (also called iteration-level scheduling or in-flight batching) solves this by making the batch dynamic. Instead of processing a fixed set of requests from start to finish, the scheduler operates at the token iteration level: after every decode step, finished sequences are evicted from the batch and new requests are admitted immediately.
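A minimal sketch of that iteration-level loop might look like the following; the `Request` class, `serve` function, and batch size are illustrative, not taken from any particular engine:

```python
from collections import deque

# Minimal sketch of iteration-level scheduling. Names are illustrative.
class Request:
    def __init__(self, rid, tokens_needed):
        self.rid = rid
        self.remaining = tokens_needed

def serve(waiting, max_batch=4):
    waiting = deque(waiting)
    running, completed_order = [], []
    while waiting or running:
        # Admit new requests into any free batch slots...
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # ...then run ONE decode step for every active sequence.
        for req in running:
            req.remaining -= 1
        # Evict finished sequences immediately: their slots free up
        # for the next iteration rather than at end-of-batch.
        finished = [r for r in running if r.remaining == 0]
        running = [r for r in running if r.remaining > 0]
        completed_order.extend(r.rid for r in finished)
    return completed_order

reqs = [Request(i, n) for i, n in enumerate([8, 2, 2, 2, 2])]
print(serve(reqs))  # → [1, 2, 3, 4, 0]: short requests finish first, slots are reused
```

Note that request 4 starts and finishes while request 0 is still decoding, which is exactly what static batching cannot do.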
[Figure: static vs continuous batching — GPU timelines, one slot per decode step]
The throughput gains are substantial: typically an order of magnitude or more over static batching on mixed-length workloads. The gains are largest precisely when request lengths vary widely, because that variance is exactly what static batching handles worst. All major serving systems implement continuous batching: it's the baseline, not a differentiator.
PagedAttention: virtual memory for the KV cache
Continuous batching improves scheduling efficiency. But it exposes a new problem: memory fragmentation. In a naive implementation, each request's KV cache is allocated as a single contiguous block. You must pre-allocate the maximum possible size upfront (because you don't know how long the response will be), leading to massive internal fragmentation when responses are short.
PagedAttention solves this by borrowing a concept from operating systems: virtual memory paging. Instead of allocating contiguous KV cache blocks, PagedAttention divides the cache into fixed-size pages (typically 16 tokens each) that can be allocated and freed independently, anywhere in GPU memory.
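The bookkeeping can be sketched as a block table mapping each sequence's logical token positions to physical pages; `PAGE_SIZE`, the class names, and the free-list scheme here are illustrative:

```python
# Sketch of PagedAttention-style bookkeeping: each sequence maps logical
# token positions to fixed-size physical pages via a block table.
PAGE_SIZE = 16

class PagedKVCache:
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))   # physical pages, any order
        self.block_tables = {}                     # seq_id -> [physical page ids]

    def append_token(self, seq_id, position):
        table = self.block_tables.setdefault(seq_id, [])
        if position % PAGE_SIZE == 0:              # crossed a page boundary
            table.append(self.free_pages.pop())    # grab ANY free page
        return table[position // PAGE_SIZE]        # physical page for this token

    def free(self, seq_id):
        # Finished sequence: return its pages to the pool immediately.
        self.free_pages.extend(self.block_tables.pop(seq_id))

cache = PagedKVCache(num_pages=64)
for pos in range(40):                              # 40 tokens need 3 pages
    cache.append_token("seq-0", pos)
print(len(cache.block_tables["seq-0"]))  # → 3
```

The key property is that pages need not be adjacent in GPU memory: a sequence only ever wastes the unused tail of its last page, at most 15 token slots.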
[Figure: KV cache memory management — naive contiguous allocation with visible fragmentation vs paged non-contiguous allocation with near-zero fragmentation]
The practical result: paged allocation eliminates the largest source of KV cache waste. Pre-allocated contiguous buffers waste a substantial fraction of VRAM on padding and fragmentation; page-based allocation brings that close to zero, which directly translates to more concurrent requests per GPU.
Prefix caching
Many production use cases send the same or highly similar prefixes — system prompts, few-shot examples, shared document contexts. Without caching, each request recomputes the KV values for these tokens from scratch during prefill, wasting compute on redundant work.
Prefix caching (sometimes called prompt caching) retains the KV pages for common prefixes in GPU memory. When a new request arrives with a matching prefix, the engine maps those cached pages directly, skipping the prefill computation for those tokens entirely.
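One way to sketch the lookup is to hash page-aligned token prefixes, loosely modelled on how engines such as vLLM key cached KV blocks (the real schemes differ in detail; everything named here is illustrative):

```python
# Illustrative prefix cache keyed by hashes of page-aligned token prefixes.
PAGE_SIZE = 16

def prefill_with_cache(tokens, cache):
    """Return how many tokens still need prefill after cache hits."""
    reused = 0
    # Walk page-aligned prefixes longest-first and reuse the longest hit.
    for end in range(len(tokens) - len(tokens) % PAGE_SIZE, 0, -PAGE_SIZE):
        if hash(tuple(tokens[:end])) in cache:
            reused = end
            break
    # Register every page-aligned prefix of this request for future hits.
    for end in range(PAGE_SIZE, len(tokens) + 1, PAGE_SIZE):
        cache.add(hash(tuple(tokens[:end])))
    return len(tokens) - reused

cache = set()
system_prompt = list(range(100))                 # stand-in for shared tokens
print(prefill_with_cache(system_prompt + [1, 2, 3], cache))  # → 103 (cold)
print(prefill_with_cache(system_prompt + [7, 8, 9], cache))  # → 7 (96 tokens reused)
```

The second request pays prefill only for the tokens past the last fully shared page, which is why long shared system prompts benefit the most.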
In deployments with long system prompts or document-based Q&A, prefix caching can eliminate most of the repeated prefill work — which directly cuts time-to-first-token for any request that shares a prefix with a prior one.
Chunked prefill
Prefill (processing the input prompt) and decode (generating tokens one at a time) have very different compute profiles. Prefill is compute-intensive and fast — the GPU is near-saturated processing many tokens in parallel. Decode is memory-bandwidth-bound and slow — one token at a time.
When a long prefill request arrives, it can monopolise the GPU for hundreds of milliseconds, stalling all active decode requests and spiking their time-to-first-token. Chunked prefill addresses this by breaking prefill into fixed-size chunks (e.g., 512 tokens), interleaving them with decode steps. This keeps the GPU busy without blocking active generations, improving latency predictability under load.
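The chunking itself is simple to sketch: split the prompt into fixed-size pieces and hand one piece to the GPU per iteration, leaving room in the same iteration for decode steps. The `schedule` generator and the 2000-token prompt are illustrative:

```python
# Toy schedule for one long prefill, using the 512-token chunk size
# mentioned in the text as a per-iteration budget.
def schedule(prefill_tokens, chunk=512):
    """Yield the prefill work done in each iteration."""
    done = 0
    while done < prefill_tokens:
        take = min(chunk, prefill_tokens - done)
        done += take
        # Decode steps for active sequences run in the same iteration,
        # so no generation stalls for the full length of the prefill.
        yield take

print(list(schedule(2000)))  # → [512, 512, 512, 464]
```

Instead of one 2000-token stall, active decodes wait at most one 512-token chunk, which bounds the jitter they see.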
Scheduling priorities
Real serving systems must make tradeoffs between throughput and latency. High-throughput optimisation maximises tokens per second across all users — good for cost-per-token metrics but may increase individual user latency. Latency-first scheduling prioritises completing requests quickly at the expense of overall utilisation.
The key insight
The right batching strategy depends on your SLO. Batch-size-maximising continuous batching optimises for throughput. Admission control (limiting max batch size) protects latency SLOs. These are separate dials. Most production systems use both, tuned for their specific request distribution.
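As a sketch of how the two dials compose, an admission check might gate new requests on both a batch-size cap and an observed per-step latency budget; the function and thresholds here are made-up examples, not any engine's API:

```python
# Two dials from the text: a throughput dial (how full the batch may get)
# and a latency dial (a per-decode-step budget protecting the SLO).
MAX_BATCH = 64          # illustrative cap on concurrent sequences
MAX_DECODE_MS = 40      # illustrative per-step latency budget

def admit(batch_size, observed_decode_ms):
    # Admission control: refuse new work when either dial is at its limit.
    return batch_size < MAX_BATCH and observed_decode_ms < MAX_DECODE_MS

print(admit(10, 25))   # → True   (room on both dials)
print(admit(10, 45))   # → False  (decode step already over latency budget)
```

Raising `MAX_BATCH` trades latency for throughput; tightening `MAX_DECODE_MS` does the reverse, which is why both are tuned against the measured request distribution.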