Quantization
From FP32 to INT4 — Shrinking Models Without Losing Your Mind
Quantization is the most practical lever an inference engineer has. It can cut memory use by 8× and double throughput with minimal quality loss — if you understand what you're trading and when each method applies.
What quantization is
Neural network weights are numbers. Like all numbers on a computer, they're represented in some numerical format that determines their precision and range. The standard format used during training is FP32 — 32-bit floating point, which can represent values from roughly 1.4×10⁻⁴⁵ to 3.4×10³⁸ with about 7 decimal digits of precision.
Most of that precision is wasted for inference. Experiments consistently show that model quality is largely preserved even when weights are stored in much coarser formats. Quantization is the process of converting weights from a high-precision format to a lower-precision one, trading precision for memory and speed.
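The simplest version of this, absmax quantization to INT8, fits in a few lines. This is a minimal sketch, not any particular library's API; the function names are illustrative:

```python
import numpy as np

# Absmax quantization: pick one scale so the largest weight lands at the
# edge of the signed INT8 grid, round everything onto that grid, and keep
# the scale around to approximately reconstruct the FP32 values later.
def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = (rng.standard_normal(4096) * 0.02).astype(np.float32)  # toy weight vector
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# Worst-case error is half a grid step: at most 1/254 of the largest weight.
rel_err = np.abs(w - w_hat).max() / np.abs(w).max()
print(f"bytes: {w.nbytes} -> {q.nbytes}, max relative error: {rel_err:.4f}")
```

Storage drops 4× versus FP32 while the worst-case per-weight error stays under half a percent of the largest weight, which is the trade the rest of this section builds on.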
The number formats
FP16 and BF16 are the standard inference formats without explicit quantization — both use 16 bits but differ in how they allocate bits between exponent and mantissa. BF16 (Brain Float) keeps FP32's 8 exponent bits and therefore its full dynamic range, making it stable for training. FP16 spends those bits on a longer mantissa instead, giving higher precision but a much narrower range: it overflows above 65504. Most modern inference stacks default to BF16.
Moving to INT8 halves memory and typically increases throughput meaningfully. Moving to INT4 cuts memory by 4× versus FP16 and often doubles throughput again, but requires more careful calibration to maintain quality.
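The range-versus-precision trade between the two 16-bit formats is easy to see in code. NumPy has no native bfloat16, so this sketch simulates BF16 by truncating an FP32 bit pattern to its top 16 bits (round-toward-zero, which is close enough for illustration):

```python
import numpy as np

# FP16 has 5 exponent bits and overflows above 65504; BF16 keeps FP32's
# 8 exponent bits, trading mantissa precision for range.
def to_bf16(x) -> np.ndarray:
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

big = np.float32(1e5)    # representable in FP32 and BF16, not in FP16
print(np.float16(big))   # inf: FP16's range tops out at 65504
print(to_bf16(big))      # 99840.0: magnitude survives, low mantissa bits lost
```

The same value that FP16 turns into `inf` survives the BF16 round trip with only a small relative error, which is exactly why BF16 is the safer training and default inference format.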
Why it works at all
Neural networks are redundant — the same information is encoded across many overlapping weights, so small perturbations to individual weights have limited effect. And larger models tolerate it better than smaller ones. A 70B model quantized to INT4 typically loses less performance than a 7B model quantized to INT4, because larger models have more capacity to absorb perturbation.
// The key insight
Quantization is not just a compression trick — it's also a bandwidth trick. INT4 weights occupy a quarter of the memory of FP16 weights, so they stream from HBM to the compute cores 4× faster. Since autoregressive decoding is typically memory-bandwidth-bound (every weight must be read once per generated token), this directly accelerates token generation.
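A back-of-the-envelope roofline makes this concrete. Assuming decode streams every weight from HBM once per token, the bandwidth ceiling on tokens per second is just bandwidth divided by total weight bytes. The bandwidth figure below is an assumed, illustrative number, not a measurement:

```python
# Decode-speed ceiling: each generated token must read all weights once,
# so tokens/s is bounded by HBM bandwidth / total weight bytes.
def decode_tokens_per_sec(n_params: float, bytes_per_weight: float,
                          hbm_bandwidth_gbs: float) -> float:
    weight_bytes = n_params * bytes_per_weight
    return hbm_bandwidth_gbs * 1e9 / weight_bytes

params = 70e9   # a 70B-parameter model
bw = 3350.0     # assumed ~3.35 TB/s, datacenter-GPU class

for name, b in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    tps = decode_tokens_per_sec(params, b, bw)
    print(f"{name}: ~{tps:.0f} tokens/s ceiling")
```

Halving bytes per weight doubles the ceiling, independent of compute throughput — which is why weight-only quantization speeds up decoding even when the arithmetic still happens in FP16.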
Post-training quantization methods
Post-training quantization (PTQ) applies quantization after training, without any retraining. The three dominant methods for LLMs each make different tradeoffs:
| Method | Approach | Quality | Inference speed | Best for |
|---|---|---|---|---|
| Dynamic INT8 | Per-row absmax scaling, performed on-the-fly during inference | Good | Medium | Prototyping, fitting large models into limited VRAM |
| GPTQ | Layer-wise quantization using second-order info (Hessian). Requires calibration dataset. | Very good | Fast | Production INT4, static deployment |
| AWQ | Identifies and protects the 1% of weights most important to quality. Activation-aware. | Excellent | Fast | Production INT4, especially 4-bit deployment |
| FP8 | 8-bit floating point. Supported natively on newer datacenter GPUs. | Near-lossless | Very fast | Production inference on hardware with native FP8 support |
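The dynamic INT8 row above can be sketched in a few lines. Per-row scales are what distinguish it from the per-tensor scheme: each row tracks its own magnitude, so small-valued rows aren't drowned out by one large outlier elsewhere in the tensor. Names here are illustrative:

```python
import numpy as np

# "Dynamic" per-row absmax INT8: each row gets its own scale, computed
# on the fly at inference time, with no calibration pass.
def quantize_int8_per_row(W: np.ndarray):
    scales = np.abs(W).max(axis=1, keepdims=True) / 127.0 + 1e-12
    q = np.round(W / scales).astype(np.int8)
    return q, scales

rng = np.random.default_rng(0)
# rows with very different magnitudes, as in real weight matrices
W = rng.standard_normal((8, 1024)) * np.logspace(-3, 0, 8)[:, None]
q, scales = quantize_int8_per_row(W)
W_hat = q.astype(np.float32) * scales

# compare against a single per-tensor scale
gscale = np.abs(W).max() / 127.0
W_hat_tensor = np.round(W / gscale) * gscale
print("per-row err:   ", np.abs(W - W_hat).mean())
print("per-tensor err:", np.abs(W - W_hat_tensor).mean())
```

With one global scale, the small-magnitude rows round almost entirely to zero; per-row scaling preserves them, at the cost of storing one extra scale per row.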
GPTQ vs AWQ in practice
GPTQ quantizes each layer by solving an optimization problem: find the INT4 weights that best preserve the layer's outputs, using second-order information (a Hessian approximated from calibration activations) to determine which weights matter most. It requires a small calibration dataset (a few hundred examples) and takes several minutes to quantize a large model — but you only do it once.
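A stripped-down sketch of that column-by-column procedure, assuming a per-column absmax INT4 grid. The real method adds grouping, lazy batched updates, and a Cholesky-based solve; this keeps only the core idea of pushing each column's quantization error onto the not-yet-quantized columns:

```python
import numpy as np

def int4_round(col: np.ndarray) -> np.ndarray:
    """Round one weight column onto a 16-level absmax INT4 grid."""
    scale = np.abs(col).max() / 7.0 + 1e-12
    return np.clip(np.round(col / scale), -8, 7) * scale

def gptq_like(W: np.ndarray, X: np.ndarray, damp: float = 0.01) -> np.ndarray:
    """Quantize W (out x in) using calibration inputs X (samples x in)."""
    H = X.T @ X                                        # proxy Hessian from activations
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])  # damping for stability
    Hinv = np.linalg.inv(H)
    W = W.copy()
    for j in range(W.shape[1]):
        q = int4_round(W[:, j])
        err = (W[:, j] - q) / Hinv[j, j]
        W[:, j] = q
        # compensate: spread this column's error over the remaining columns
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return W

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 64))
W = rng.standard_normal((32, 64)) * 0.1
Wq = gptq_like(W, X)
naive = np.stack([int4_round(W[:, j]) for j in range(W.shape[1])], axis=1)
print("gptq-like output err:", np.linalg.norm(X @ Wq.T - X @ W.T))
print("round-to-nearest err:", np.linalg.norm(X @ naive.T - X @ W.T))
```

The error-compensation step is the whole trick: each column still lands on a 16-level grid, but later columns absorb earlier rounding errors in proportion to how correlated their calibration activations are.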
AWQ starts from a different insight: activation magnitudes vary enormously across channels, and the weights connected to high-activation channels matter far more to the output. AWQ identifies the top ~1% of salient weights and scales their channels up so they use more of the INT4 range, then quantizes all weights uniformly. The result is generally better perplexity than GPTQ at the same bit width, with comparable or faster inference.
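The scaling idea can be sketched as follows. The `alpha` exponent and the magnitude-based channel weighting here are simplifications; real AWQ grid-searches the scales against calibration data:

```python
import numpy as np

def int4_round_rows(W: np.ndarray) -> np.ndarray:
    """Per-row absmax INT4 rounding (illustrative grid)."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 7.0 + 1e-12
    return np.clip(np.round(W / scale), -8, 7) * scale

def awq_like(W: np.ndarray, X: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Scale salient input channels up before rounding. W: out x in,
    X: samples x in calibration activations."""
    act = np.abs(X).mean(axis=0)              # per-channel activation magnitude
    s = (act / act.mean() + 1e-12) ** alpha   # bigger activations -> bigger scale
    Wq = int4_round_rows(W * s)               # quantize the scaled weights
    return Wq / s                             # fold the scale back out

rng = np.random.default_rng(1)
X = rng.standard_normal((512, 64))
X[:, :4] *= 20.0                              # a few channels dominate activations
W = rng.standard_normal((32, 64)) * 0.1
for name, Wq in [("plain", int4_round_rows(W)), ("awq-like", awq_like(W, X))]:
    err = np.linalg.norm(X @ Wq.T - X @ W.T) / np.linalg.norm(X @ W.T)
    print(f"{name}: relative output error {err:.4f}")
```

Scaling a salient channel up and dividing the scale back out afterwards shrinks the effective rounding error on exactly the weights the activations amplify most, which is why no weight needs to stay in FP16.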
// Where quality loss accumulates
INT4 quality loss is not uniformly distributed. It concentrates in: small models (<7B parameters), long-form reasoning tasks, coding, and math. For chat and summarisation on models ≥13B, INT4 is often indistinguishable from FP16 in practice. For agentic tasks requiring precise multi-step reasoning, consider INT8 or FP16.
FP8: the best of both worlds
Newer datacenter GPUs include native FP8 tensor cores, enabling 8-bit floating-point computation at nearly INT4 speeds but with cleaner numerical properties. Like FP16, FP8 is a floating-point encoding with exponent and mantissa bits (the common variants are E4M3 and E5M2), so it retains a wide dynamic range and avoids the artefacts that a fixed integer grid introduces.
On hardware with native FP8 support, FP8 inference achieves near-FP16 quality at meaningfully higher throughput. It requires no calibration dataset and is generally preferable to INT4 where supported.
GGUF and llama.cpp
For local inference and research, GGUF (the format used by llama.cpp) has become the dominant quantization format. GGUF supports a family of quantization levels from Q2 (2-bit) through Q8 (8-bit), with mixed-precision variants (like Q4_K_M) that use higher precision for the most sensitive layers. GGUF models can run on CPU, GPU, or both, making them practical for laptops and consumer hardware.
For production GPU serving, GPTQ and AWQ remain the standard — they're well-optimised for GPU tensor core execution and integrate cleanly with major serving frameworks.
// Choosing a method
If your hardware supports FP8 natively, use FP8: near-lossless quality with no calibration step. For production GPU serving at 4 bits, reach for AWQ (or GPTQ). For local, CPU, or consumer-hardware inference, use GGUF via llama.cpp (Q4_K_M is a common starting point). If you just need to fit a model into limited VRAM quickly, dynamic INT8 is the lowest-effort path.
// In short
Quantization trades precision you don't need for memory and bandwidth you do. INT8 is a safe default, AWQ and GPTQ make INT4 production-ready, FP8 is near-lossless where hardware supports it, and GGUF brings all of this to consumer machines.