Hardware & Memory
Why Bandwidth Beats FLOPS for LLM Inference
Every GPU spec sheet leads with FLOPS. For LLM inference, that's the wrong number to look at. The thing that actually limits token generation speed is memory bandwidth — how fast the GPU can read weights from memory into its compute cores.
The wrong question
When engineers new to inference start GPU shopping, they reach for the wrong number first: FLOPS — floating-point operations per second. It is the most prominent spec on every GPU data sheet, the number that fills press releases, and the one most engineers already understand from training. It is also largely irrelevant for LLM inference.
The right question is: how fast can this GPU move data from memory to its compute cores? That speed — memory bandwidth — is what determines how quickly a language model generates tokens.
// Key Principle
LLM inference is memory-bandwidth-bound, not compute-bound. The bottleneck is reading model weights from GPU memory into compute cores, not the arithmetic performed on those weights. More FLOPS does not solve a memory bottleneck.
Anatomy of a GPU for inference
A modern datacenter GPU has three numbers that matter for inference:
// gpu_spec_anatomy
The roofline model
The roofline model is a simple framework for diagnosing whether a workload is compute-bound or memory-bound. It requires two numbers:
Ops:Byte ratio of the GPU
ops_per_byte = peak_FLOPS ÷ memory_bandwidth
On any modern high-end datacenter GPU, this ratio sits in the hundreds of FLOPs per byte — meaning the hardware can perform hundreds of arithmetic operations for every byte it reads from memory before it becomes compute-limited.
Arithmetic intensity of the workload
arithmetic_intensity = FLOPs_performed ÷ bytes_accessed
If intensity < ops_per_byte → memory-bound. If intensity > ops_per_byte → compute-bound.
For LLM decode at small batch sizes, the arithmetic intensity of the attention step is roughly 1 FLOP per byte — far below the ops-per-byte ratio of any datacenter GPU. The workload is deeply memory-bound. The GPU's compute cores sit largely idle, waiting for data to arrive from memory.
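The roofline check above reduces to a two-line calculation. The sketch below uses illustrative hardware numbers (roughly 1,000 TFLOP/s peak and 3.3 TB/s of bandwidth; not any specific SKU) and the decode-attention intensity of about 2 FLOPs per 2 bytes read:

```python
def roofline_regime(peak_flops, mem_bw_bytes_per_s, flops_performed, bytes_accessed):
    """Classify a workload as memory-bound or compute-bound on a given GPU."""
    ops_per_byte = peak_flops / mem_bw_bytes_per_s   # hardware balance point
    intensity = flops_performed / bytes_accessed     # workload arithmetic intensity
    regime = "memory-bound" if intensity < ops_per_byte else "compute-bound"
    return regime, intensity, ops_per_byte

# Decode attention: ~2 FLOPs (multiply + add) per KV-cache element,
# each element read as 2 bytes in FP16.
regime, intensity, balance = roofline_regime(
    peak_flops=1.0e15, mem_bw_bytes_per_s=3.3e12,
    flops_performed=2.0, bytes_accessed=2.0,
)
print(regime)  # memory-bound: intensity ~1 vs. a balance point of ~300
```

With an intensity of 1 against a balance point in the hundreds, the compute cores are starved by two orders of magnitude, which is exactly the idleness described above.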
// Why batching changes the picture
Batching multiple requests together increases arithmetic intensity by amortising the weight loads across more compute: each weight read from memory now serves every request in the batch. As the batch grows into the tens or hundreds, the workload moves toward the compute-bound regime where FLOPS start to matter. This is why throughput-optimised deployments use large batches, while latency-optimised deployments for single users keep batches small.
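The amortisation effect can be made concrete for a single weight-matrix multiply. In the sketch below (an 8192x8192 FP16 layer, an assumed but representative shape), the weights are read once per forward pass regardless of batch size, so arithmetic intensity grows almost linearly with the batch:

```python
def gemm_intensity(batch, d_in, d_out, dtype_bytes=2):
    """Arithmetic intensity (FLOPs/byte) of one weight-matrix multiply."""
    flops = 2 * batch * d_in * d_out                 # multiply-accumulate per output
    weight_bytes = d_in * d_out * dtype_bytes        # read once, shared by the batch
    act_bytes = batch * (d_in + d_out) * dtype_bytes # inputs and outputs per request
    return flops / (weight_bytes + act_bytes)

for b in (1, 8, 32, 128):
    print(b, round(gemm_intensity(b, 8192, 8192), 1))
```

At batch 1 the intensity is about 1 FLOP/byte; at batch 32 it is about 32, still memory-bound against a balance point in the hundreds but an order of magnitude closer to it.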
GPU tiers for inference
Datacenter GPUs for LLM inference generally fall into a few practical tiers. The right axis to sort on is memory bandwidth per dollar for latency-sensitive workloads, and memory capacity for large-model feasibility. FLOPS is largely a secondary concern until you're running very large batch sizes.
| Tier | Memory type | Typical capacity | Best fit |
|---|---|---|---|
| Flagship HBM (latest gen) | HBM3e | Very high (120GB+) | Largest models, highest throughput, long context |
| Production HBM | HBM3 / HBM2e | High (80–96GB) | 70B-scale models, mainstream production serving |
| GDDR6 datacenter | GDDR6 | Moderate (24–48GB) | Smaller models, cost-sensitive inference, edge of production |
| Consumer / prosumer | GDDR6X | Lower (16–24GB) | Development, small models, local experimentation |
Two GPUs in the same tier can have identical compute throughput (FLOPS) but meaningfully different inference speed if their memory bandwidth differs. This is the clearest evidence that bandwidth — not FLOPS — is the operative constraint. When comparing hardware options in the same FLOPS tier, the one with higher bandwidth will almost always produce faster tokens at low-to-medium batch sizes.
HBM vs. GDDR6: why memory type matters
Datacenter GPUs use High Bandwidth Memory (HBM): DRAM dies stacked vertically and connected to the processor through a silicon interposer with an extremely wide memory interface. Consumer and mid-tier GPUs use GDDR6 chips spread around the board on a much narrower memory bus. (PCIe is the GPU-to-host link; it is not the path between a GPU and its own memory.) The bandwidth gap between the two approaches is large, typically several times, and translates almost directly into proportionally faster token generation at low batch sizes.
The practical consequence: for the same model at batch size 1, moving from a GDDR6-based GPU to an HBM-based GPU of comparable generation can roughly double or triple token throughput. Not because of more FLOPS, but because weights arrive at the compute cores faster. This is why HBM-based hardware dominates production inference despite its higher cost.
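A back-of-envelope bound makes this concrete: at batch 1, every decoded token requires reading all the weights once, so bandwidth divided by weight bytes is a ceiling on tokens per second. The bandwidth figures below are illustrative (HBM-class vs. GDDR6-class), not specific SKUs:

```python
def est_tokens_per_s(mem_bw_gb_s, params_billions, bytes_per_param=2):
    """Upper bound on batch-1 decode speed: each token reads all weights once."""
    weight_gb = params_billions * bytes_per_param
    return mem_bw_gb_s / weight_gb

# Same 70B FP16 model (140 GB of weights) on two memory systems:
print(round(est_tokens_per_s(3300, 70), 1))  # HBM-class:   ~23.6 tok/s ceiling
print(round(est_tokens_per_s(900, 70), 1))   # GDDR6-class:  ~6.4 tok/s ceiling
```

Real throughput lands below these ceilings (KV-cache reads, kernel overheads), but the ratio between the two systems tracks the bandwidth ratio, not the FLOPS ratio.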
Memory capacity: fitting the model
Before bandwidth matters, the model must fit. The rule of thumb:
VRAM required for model weights
GB ≈ params_billions × bytes_per_param × 1.2
FP16: 2 bytes/param → 70B model ≈ 70 × 2 × 1.2 = 168 GB. INT8: 1 byte → 84 GB. INT4: 0.5 bytes → 42 GB.
The 1.2× overhead accounts for activations, the KV cache at short context lengths, and framework overhead. At long context lengths the KV cache can grow to dominate — see the KV Cache guide for the full calculation.
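The rule of thumb translates directly into a helper, reproducing the worked numbers above:

```python
def vram_gb(params_billions, bytes_per_param, overhead=1.2):
    """Rule-of-thumb VRAM for weights plus runtime overhead (short context)."""
    return params_billions * bytes_per_param * overhead

for name, bpp in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(name, round(vram_gb(70, bpp), 1))  # 168.0, 84.0, 42.0
```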
Multi-GPU serving
When a model doesn't fit on a single GPU, it can be sharded across multiple GPUs using tensor parallelism. Each GPU holds a slice of each weight matrix, and they collaborate via NVLink or PCIe during each forward pass.
High-speed interconnects like NVLink deliver substantially more bandwidth between GPUs than standard PCIe — often an order of magnitude more. This is why the SXM form factor (which enables NVLink) is strongly preferred for tensor-parallel inference, where every forward pass requires multiple all-reduce operations across GPUs. For pipeline parallelism (where each GPU holds entire layers and communicates only at layer boundaries), PCIe is often sufficient since inter-GPU traffic is much lower.
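To give a feel for the traffic involved, the sketch below estimates per-GPU communication for one decoded token under a common assumption: two all-reduces per transformer layer (after attention and after the MLP) on a hidden-dimension-sized activation, with a ring all-reduce moving roughly 2*(N-1)/N of the payload. The model shape (80 layers, 8192 hidden, 70B-scale) is assumed for illustration:

```python
def tp_allreduce_gb_per_token(layers, hidden_dim, tp_degree, dtype_bytes=2):
    """Rough per-GPU all-reduce volume for one decoded token under tensor parallelism.

    Assumes two all-reduces per layer on a hidden_dim-sized activation;
    a ring all-reduce moves ~2*(N-1)/N of the payload per GPU.
    """
    payload = hidden_dim * dtype_bytes                      # one token's activation
    per_allreduce = 2 * (tp_degree - 1) / tp_degree * payload
    return layers * 2 * per_allreduce / 1e9

print(round(tp_allreduce_gb_per_token(80, 8192, 4), 5))  # ~0.004 GB per GPU per token
```

The volume per token is modest; the pressure comes from the *count*: here, 160 latency-sensitive synchronisation points per token, every token, which is why a low-latency, high-bandwidth interconnect pays off so directly.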
// Practical Decision Rule
If your model fits on one GPU: choose based on bandwidth-per-dollar. If it requires 2–4 GPUs: NVLink (SXM/NVL form factor) materially outperforms PCIe. Beyond 4 GPUs: you're in cluster territory and interconnect topology dominates.
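The decision rule can be written down as a small helper (thresholds and wording taken from the rule above; the GPU count is just weight footprint divided by per-GPU capacity, rounded up):

```python
import math

def pick_setup(model_vram_gb, gpu_vram_gb):
    """Map required VRAM onto a GPU count and the matching decision-rule advice."""
    n = math.ceil(model_vram_gb / gpu_vram_gb)
    if n == 1:
        return "single GPU: optimise bandwidth-per-dollar"
    if n <= 4:
        return f"{n} GPUs: prefer NVLink (SXM/NVL) over PCIe"
    return f"{n} GPUs: cluster territory; interconnect topology dominates"

# A 70B FP16 model (~168 GB with overhead) on 80 GB GPUs:
print(pick_setup(168, 80))  # 3 GPUs: prefer NVLink (SXM/NVL) over PCIe
```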
// In short