Hardware & Memory
Why Bandwidth Beats FLOPS for LLM Inference
Every GPU spec sheet leads with FLOPS. For LLM inference, that's the wrong number to look at. The thing that actually limits token generation speed is memory bandwidth — how fast the GPU can read weights from memory into its compute cores.
The wrong question
When engineers new to inference start GPU shopping, they reach for the wrong number first: FLOPS — floating-point operations per second. It is the most prominent spec on every GPU data sheet, the number that fills press releases, and the one most engineers already understand from training. It is also largely irrelevant for LLM inference.
The right question is: how fast can this GPU move data from memory to its compute cores? That speed — memory bandwidth — is what determines how quickly a language model generates tokens.
// Key Principle
LLM inference is memory-bandwidth-bound, not compute-bound. The bottleneck is reading model weights from GPU memory into compute cores, not the arithmetic performed on those weights. More FLOPS does not solve a memory bottleneck.
Anatomy of a GPU for inference
A modern datacenter GPU has three numbers that matter for inference:
// gpu_spec_anatomy
The roofline model
The roofline model is a simple framework for diagnosing whether a workload is compute-bound or memory-bound. It requires two numbers:
Ops:Byte ratio of the GPU
ops_per_byte = peak_FLOPS ÷ memory_bandwidth
On any modern high-end datacenter GPU, this ratio sits in the hundreds of FLOPs per byte — meaning the hardware can perform hundreds of arithmetic operations for every byte it reads from memory before it becomes compute-limited.
Arithmetic intensity of the workload
arithmetic_intensity = FLOPs_performed ÷ bytes_accessed
If intensity < ops_per_byte → memory-bound. If intensity > ops_per_byte → compute-bound.
For LLM decode at small batch sizes, the arithmetic intensity of the attention step is roughly 1 FLOP per byte — far below the ops-per-byte ratio of any datacenter GPU. The workload is deeply memory-bound. The GPU's compute cores sit largely idle, waiting for data to arrive from memory.
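The roofline check above reduces to a two-line calculation. The sketch below uses illustrative hardware numbers (roughly 1,000 TFLOP/s peak and 3.3 TB/s of bandwidth; not any specific SKU) and the decode-attention intensity of about 2 FLOPs per 2 bytes read:

```python
def roofline_regime(peak_flops, mem_bw_bytes_per_s, flops_performed, bytes_accessed):
    """Classify a workload as memory-bound or compute-bound on a given GPU."""
    ops_per_byte = peak_flops / mem_bw_bytes_per_s   # hardware balance point
    intensity = flops_performed / bytes_accessed     # workload arithmetic intensity
    regime = "memory-bound" if intensity < ops_per_byte else "compute-bound"
    return regime, intensity, ops_per_byte

# Decode attention: ~2 FLOPs (multiply + add) per KV-cache element,
# each element read as 2 bytes in FP16.
regime, intensity, balance = roofline_regime(
    peak_flops=1.0e15, mem_bw_bytes_per_s=3.3e12,
    flops_performed=2.0, bytes_accessed=2.0,
)
print(regime)  # memory-bound: intensity ~1 vs. a balance point of ~300
```

With an intensity of 1 against a balance point in the hundreds, the compute cores are starved by two orders of magnitude, which is exactly the idleness described above.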
// Why batching changes the picture
Batching multiple requests together increases arithmetic intensity by amortising the weight loads across more compute: each weight read from memory now serves every request in the batch. As the batch grows into the tens or hundreds, the workload moves toward the compute-bound regime where FLOPS start to matter. This is why throughput-optimised deployments use large batches, while latency-optimised deployments for single users keep batches small.
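The amortisation effect can be made concrete for a single weight-matrix multiply. In the sketch below (an 8192x8192 FP16 layer, an assumed but representative shape), the weights are read once per forward pass regardless of batch size, so arithmetic intensity grows almost linearly with the batch:

```python
def gemm_intensity(batch, d_in, d_out, dtype_bytes=2):
    """Arithmetic intensity (FLOPs/byte) of one weight-matrix multiply."""
    flops = 2 * batch * d_in * d_out                 # multiply-accumulate per output
    weight_bytes = d_in * d_out * dtype_bytes        # read once, shared by the batch
    act_bytes = batch * (d_in + d_out) * dtype_bytes # inputs and outputs per request
    return flops / (weight_bytes + act_bytes)

for b in (1, 8, 32, 128):
    print(b, round(gemm_intensity(b, 8192, 8192), 1))
```

At batch 1 the intensity is about 1 FLOP/byte; at batch 32 it is about 32, still memory-bound against a balance point in the hundreds but an order of magnitude closer to it.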
GPU tiers for inference
Datacenter GPUs for LLM inference generally fall into a few practical tiers. The right axis to sort on is memory bandwidth per dollar for latency-sensitive workloads, and memory capacity for large-model feasibility. FLOPS is largely a secondary concern until you're running very large batch sizes.
| Tier | Memory type | Typical capacity | Best fit |
|---|---|---|---|
| Flagship HBM (latest gen) | HBM3e | Very high (120GB+) | Largest models, highest throughput, long context |
| Production HBM | HBM3 / HBM2e | High (80–96GB) | 70B-scale models, mainstream production serving |
| GDDR6 datacenter | GDDR6 | Moderate (24–48GB) | Smaller models, cost-sensitive inference, edge of production |
| Consumer / prosumer | GDDR6X | Lower (16–24GB) | Development, small models, local experimentation |
Two GPUs in the same tier can have identical compute throughput (FLOPS) but meaningfully different inference speed if their memory bandwidth differs. This is the clearest evidence that bandwidth — not FLOPS — is the operative constraint. When comparing hardware options in the same FLOPS tier, the one with higher bandwidth will almost always produce faster tokens at low-to-medium batch sizes.
HBM vs. GDDR6: why memory type matters
Datacenter GPUs use High Bandwidth Memory (HBM): DRAM dies stacked vertically and connected to the processor through a silicon interposer with an extremely wide memory interface. Consumer and mid-tier GPUs use GDDR6 chips spread around the board on a much narrower memory bus. (PCIe is the GPU-to-host link; it is not the path between a GPU and its own memory.) The bandwidth gap between the two approaches is large, typically several times, and translates almost directly into proportionally faster token generation at low batch sizes.
The practical consequence: for the same model at batch size 1, moving from a GDDR6-based GPU to an HBM-based GPU of comparable generation can roughly double or triple token throughput. Not because of more FLOPS, but because weights arrive at the compute cores faster. This is why HBM-based hardware dominates production inference despite its higher cost.
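A back-of-envelope bound makes this concrete: at batch 1, every decoded token requires reading all the weights once, so bandwidth divided by weight bytes is a ceiling on tokens per second. The bandwidth figures below are illustrative (HBM-class vs. GDDR6-class), not specific SKUs:

```python
def est_tokens_per_s(mem_bw_gb_s, params_billions, bytes_per_param=2):
    """Upper bound on batch-1 decode speed: each token reads all weights once."""
    weight_gb = params_billions * bytes_per_param
    return mem_bw_gb_s / weight_gb

# Same 70B FP16 model (140 GB of weights) on two memory systems:
print(round(est_tokens_per_s(3300, 70), 1))  # HBM-class:   ~23.6 tok/s ceiling
print(round(est_tokens_per_s(900, 70), 1))   # GDDR6-class:  ~6.4 tok/s ceiling
```

Real throughput lands below these ceilings (KV-cache reads, kernel overheads), but the ratio between the two systems tracks the bandwidth ratio, not the FLOPS ratio.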
Memory capacity: fitting the model
Before bandwidth matters, the model must fit. The rule of thumb:
VRAM required for model weights
GB ≈ params_billions × bytes_per_param × 1.2
FP16: 2 bytes/param → 70B model ≈ 70 × 2 × 1.2 = 168 GB. INT8: 1 byte → 84 GB. INT4: 0.5 bytes → 42 GB.
The 1.2× overhead accounts for activations, the KV cache at short context lengths, and framework overhead. At long context lengths the KV cache can grow to dominate — see the KV Cache guide for the full calculation.
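The rule of thumb translates directly into a helper, reproducing the worked numbers above:

```python
def vram_gb(params_billions, bytes_per_param, overhead=1.2):
    """Rule-of-thumb VRAM for weights plus runtime overhead (short context)."""
    return params_billions * bytes_per_param * overhead

for name, bpp in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(name, round(vram_gb(70, bpp), 1))  # 168.0, 84.0, 42.0
```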
Multi-GPU serving
When a model doesn't fit on a single GPU, it can be sharded across multiple GPUs using tensor parallelism. Each GPU holds a slice of each weight matrix, and they collaborate via NVLink or PCIe during each forward pass.
High-speed interconnects like NVLink deliver substantially more bandwidth between GPUs than standard PCIe — often an order of magnitude more. This is why the SXM form factor (which enables NVLink) is strongly preferred for tensor-parallel inference, where every forward pass requires multiple all-reduce operations across GPUs. For pipeline parallelism (where each GPU holds entire layers and communicates only at layer boundaries), PCIe is often sufficient since inter-GPU traffic is much lower.
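To give a feel for the traffic involved, the sketch below estimates per-GPU communication for one decoded token under a common assumption: two all-reduces per transformer layer (after attention and after the MLP) on a hidden-dimension-sized activation, with a ring all-reduce moving roughly 2*(N-1)/N of the payload. The model shape (80 layers, 8192 hidden, 70B-scale) is assumed for illustration:

```python
def tp_allreduce_gb_per_token(layers, hidden_dim, tp_degree, dtype_bytes=2):
    """Rough per-GPU all-reduce volume for one decoded token under tensor parallelism.

    Assumes two all-reduces per layer on a hidden_dim-sized activation;
    a ring all-reduce moves ~2*(N-1)/N of the payload per GPU.
    """
    payload = hidden_dim * dtype_bytes                      # one token's activation
    per_allreduce = 2 * (tp_degree - 1) / tp_degree * payload
    return layers * 2 * per_allreduce / 1e9

print(round(tp_allreduce_gb_per_token(80, 8192, 4), 5))  # ~0.004 GB per GPU per token
```

The volume per token is modest; the pressure comes from the *count*: here, 160 latency-sensitive synchronisation points per token, every token, which is why a low-latency, high-bandwidth interconnect pays off so directly.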
// Practical Decision Rule
If your model fits on one GPU: choose based on bandwidth-per-dollar. If it requires 2–4 GPUs: NVLink (SXM/NVL form factor) materially outperforms PCIe. Beyond 4 GPUs: you're in cluster territory and interconnect topology dominates.
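The decision rule can be written down as a small helper (thresholds and wording taken from the rule above; the GPU count is just weight footprint divided by per-GPU capacity, rounded up):

```python
import math

def pick_setup(model_vram_gb, gpu_vram_gb):
    """Map required VRAM onto a GPU count and the matching decision-rule advice."""
    n = math.ceil(model_vram_gb / gpu_vram_gb)
    if n == 1:
        return "single GPU: optimise bandwidth-per-dollar"
    if n <= 4:
        return f"{n} GPUs: prefer NVLink (SXM/NVL) over PCIe"
    return f"{n} GPUs: cluster territory; interconnect topology dominates"

# A 70B FP16 model (~168 GB with overhead) on 80 GB GPUs:
print(pick_setup(168, 80))  # 3 GPUs: prefer NVLink (SXM/NVL) over PCIe
```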
// In short