// Concept 04 — Quantization

Quantization

From FP32 to INT4 — Shrinking Models Without Losing Your Mind

Quantization is the most practical lever an inference engineer has. It can cut memory use by 8× and double throughput with minimal quality loss — if you understand what you're trading and when each method applies.


What quantization is

Neural network weights are numbers. Like all numbers on a computer, they're represented in some numerical format that determines their precision and range. The standard format used during training is FP32 — 32-bit floating point, which can represent magnitudes from roughly 1.4×10⁻⁴⁵ to 3.4×10³⁸ with about 7 decimal digits of precision.

Most of that precision is wasted for inference. Experiments consistently show that model quality is largely preserved even when weights are stored in much coarser formats. Quantization is the process of converting weights from a high-precision format to a lower-precision one, trading precision for memory and speed.
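A minimal sketch of the core mechanic, using symmetric absmax scaling to INT8 (the function names and use of NumPy here are illustrative, not any particular library's API):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric absmax quantization: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int8(w)      # 1 byte per weight instead of 4
w_hat = dequantize(q, scale)
# Reconstruction error is bounded by half a quantization step:
assert np.abs(w - w_hat).max() <= scale / 2 + 1e-6
```

Real stacks quantize per row or per group rather than per tensor, but the arithmetic is the same.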

The number formats

| Format | Bits | Bytes/param |
|--------|------|-------------|
| FP32   | 32   | 4           |
| BF16   | 16   | 2           |
| FP16   | 16   | 2           |
| INT8   | 8    | 1           |
| INT4   | 4    | 0.5         |

FP16 and BF16 are the standard inference formats without explicit quantization — both use 16 bits but differ in how they allocate bits between exponent and mantissa. BF16 (Brain Float) matches FP32's dynamic range, making it stable for training. FP16 has higher precision but can overflow. Most modern inference stacks default to BF16.

Moving to INT8 halves memory and typically increases throughput meaningfully. Moving to INT4 cuts memory by 4× versus FP16 and often doubles throughput again, but requires more careful calibration to maintain quality.
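The bytes-per-parameter figures translate directly into memory footprints. A quick back-of-the-envelope (the 70B model size is illustrative, and this counts weights only, not KV cache or activations):

```python
BYTES_PER_PARAM = {"FP32": 4.0, "BF16": 2.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(n_params: float, fmt: str) -> float:
    """Memory needed for the weights alone, in GB."""
    return n_params * BYTES_PER_PARAM[fmt] / 1e9

# A 70B-parameter model:
for fmt in ("FP16", "INT8", "INT4"):
    print(fmt, weight_memory_gb(70e9, fmt), "GB")
# FP16 140.0 GB / INT8 70.0 GB / INT4 35.0 GB
```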

Why it works at all

Neural networks are redundant — the same information is encoded across many overlapping weights, so small perturbations to individual weights have limited effect. And larger models tolerate it better than smaller ones. A 70B model quantized to INT4 typically loses less performance than a 7B model quantized to INT4, because larger models have more capacity to absorb perturbation.

// The key insight

Quantization is not just a compression trick; it's also a bandwidth trick. INT4 weights occupy a quarter of the memory of FP16 weights, which means they can be streamed from HBM to the compute cores 4× faster. Since autoregressive decoding is memory-bandwidth-bound, this directly accelerates token generation.
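The bandwidth argument can be made concrete with a rough ceiling on decode speed: each generated token must stream every weight from HBM once, so throughput is at most bandwidth divided by weight bytes. The 2 TB/s figure below is a hypothetical GPU, not a specific part:

```python
def decode_ceiling_tok_s(hbm_bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper bound on decode throughput when memory-bandwidth-bound."""
    return hbm_bandwidth_gb_s / weight_gb

fp16 = decode_ceiling_tok_s(2000, 140)  # 70B weights in FP16 -> ~14 tok/s
int4 = decode_ceiling_tok_s(2000, 35)   # same weights in INT4 -> ~57 tok/s
```

The 4× memory reduction shows up directly as a 4× higher ceiling; real systems land below both numbers but preserve the ratio.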

Post-training quantization methods

Post-training quantization (PTQ) applies quantization after training, without any retraining. The three dominant methods for LLMs each make different tradeoffs:

| Method | Approach | Quality | Speed | Best for |
|---|---|---|---|---|
| Dynamic INT8 | Per-row absmax scaling, performed on-the-fly during inference | Good | Medium | Prototyping; fitting large models into limited VRAM |
| GPTQ | Layer-wise quantization using second-order (Hessian) information; requires a calibration dataset | Very good | Fast | Production INT4, static deployment |
| AWQ | Activation-aware: identifies and protects the ~1% of weights most important to quality | Excellent | Fast | Production INT4, especially 4-bit deployment |
| FP8 | 8-bit floating point, natively supported on newer datacenter GPUs | Near-lossless | Very fast | Production inference on hardware with native FP8 support |
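To see why the per-row scaling used by dynamic INT8 matters, compare it against a single per-tensor scale when one row contains outliers. This is a synthetic sketch (sizes and the 50× outlier are made up), but real LLM weight matrices show the same effect:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 64)).astype(np.float32)
w[0] *= 50.0   # one outlier row

# Per-tensor: one scale for the whole matrix; the outlier row inflates it,
# wasting INT8 codes on range that ordinary rows never use.
s_tensor = np.abs(w).max() / 127.0
err_tensor = np.abs(w - np.round(w / s_tensor) * s_tensor).max()

# Per-row: every row gets its own scale, so the outlier stays contained.
s_row = np.abs(w).max(axis=1, keepdims=True) / 127.0
err_row = np.abs(w - np.round(w / s_row) * s_row)[1:].max()  # non-outlier rows

assert err_row < err_tensor  # finer granularity, smaller error for typical rows
```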

GPTQ vs AWQ in practice

GPTQ quantizes each layer by solving an optimization problem: find the INT4 weights that best preserve the layer's outputs, using second-order information (a Hessian approximated from calibration activations) to determine which weights matter most. It requires a small calibration dataset (a few hundred examples) and takes several minutes to quantize a large model — but you only do it once.

AWQ starts from a different insight: activation magnitudes vary enormously across channels, and the weights attached to high-activation channels matter far more. AWQ identifies the salient ~1% of weights, scales them up before quantization so that less of the rounding error lands on them (folding the inverse scale into the activations), then quantizes everything to INT4. The result is generally better perplexity than GPTQ at the same bit width, with comparable or faster inference.
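A toy illustration of the AWQ idea (all shapes, magnitudes, and the scale of 4 are made up for the demonstration; real AWQ grid-searches the per-channel scale against calibration data):

```python
import numpy as np

def quant_dequant_int4(w):
    """Per-row symmetric 4-bit quantization (qmax = 7), returned dequantized."""
    s = np.abs(w).max(axis=1, keepdims=True) / 7.0
    return np.round(w / s) * s

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128)).astype(np.float32)
x = rng.normal(size=128).astype(np.float32)
x[:4] *= 100.0     # a few high-magnitude activation channels: the salient ones
w[:, :4] *= 0.1    # their weights are unremarkable in size

# Scale salient weight columns up, divide the matching activations down:
# a no-op in full precision, but it shifts quantization error off the
# channels that dominate the output.
s = np.ones(128, dtype=np.float32)
s[:4] = 4.0

y_ref = w @ x
err_plain = np.abs(quant_dequant_int4(w) @ x - y_ref).mean()
err_awq = np.abs(quant_dequant_int4(w * s) @ (x / s) - y_ref).mean()
assert err_awq < err_plain
```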

// Where quality loss accumulates

INT4 quality loss is not uniformly distributed. It concentrates in: small models (<7B parameters), long-form reasoning tasks, coding, and math. For chat and summarisation on models ≥13B, INT4 is often indistinguishable from FP16 in practice. For agentic tasks requiring precise multi-step reasoning, consider INT8 or FP16.

FP8: the best of both worlds

Newer datacenter GPUs include native FP8 tensor cores, enabling 8-bit floating-point computation at nearly the speed of INT4 but with cleaner numerical properties. FP8 is a floating-point encoding like FP16, just with fewer bits: the E5M2 variant keeps FP16's exponent width (and therefore its wide dynamic range), while E4M3 trades range for an extra bit of precision. Either way, FP8 avoids the quantization artefacts that integer formats introduce.
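The dynamic-range claim is easy to verify from the bit layouts. For an IEEE-style format, the largest finite value follows directly from the exponent and mantissa widths:

```python
def ieee_max(exp_bits: int, man_bits: int) -> float:
    """Largest finite value for an IEEE-style format (top exponent code
    reserved for inf/NaN)."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = 2 ** exp_bits - 2 - bias
    return (2 - 2 ** -man_bits) * 2.0 ** max_exp

assert ieee_max(5, 10) == 65504.0  # FP16
assert ieee_max(5, 2) == 57344.0   # FP8 E5M2: same exponent width as FP16
# FP8 E4M3 bends the IEEE rules (it keeps only one NaN code), giving a
# max of 448 rather than the ieee_max(4, 3) == 240 this formula predicts.
```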

On hardware with native FP8 support, FP8 inference achieves near-FP16 quality at meaningfully higher throughput. It requires no calibration dataset and is generally preferable to INT4 where supported.

GGUF and llama.cpp

For local inference and research, GGUF (the format used by llama.cpp) has become the dominant quantization format. GGUF supports a family of quantization levels from Q2 (2-bit) through Q8 (8-bit), with mixed-precision variants (like Q4_K_M) that use higher precision for the most sensitive layers. GGUF models can run on CPU, GPU, or both, making them practical for laptops and consumer hardware.

For production GPU serving, GPTQ and AWQ remain the standard — they're well-optimised for GPU tensor core execution and integrate cleanly with major serving frameworks.

// quantization_decision_tree

Q: Does your GPU have native FP8 tensor core support?
   Yes → Use FP8: near-lossless quality, higher throughput, no calibration needed.
Q: Is quality the top priority (coding, math, agents)?
   Yes → Use INT8 / BF16: don't compromise on precision.
   No  → Use AWQ INT4: best quality-to-size ratio for chat/summarisation.
Q: Running locally on consumer hardware?
   Yes → Use GGUF Q4_K_M (llama.cpp): best CPU/GPU split support.
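The decision tree above is small enough to write down directly (the argument names and return labels are illustrative, not product names):

```python
def pick_quantization(has_fp8_hw: bool, quality_critical: bool,
                      local_consumer_hw: bool) -> str:
    """Encodes the quantization decision tree above."""
    if has_fp8_hw:
        return "FP8"                      # near-lossless, no calibration
    if quality_critical:
        return "INT8 / BF16"              # coding, math, agents
    if local_consumer_hw:
        return "GGUF Q4_K_M (llama.cpp)"  # CPU/GPU split on consumer hardware
    return "AWQ INT4"                     # best quality-to-size for chat
```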

// In short

01. Quantization is a bandwidth strategy first. INT4 moves 4× less data from HBM per token, directly increasing decode throughput on memory-bandwidth-bound hardware.
02. Larger models handle quantization better. A 70B INT4 model often beats a 70B FP16 on a cost-per-quality basis. A 7B INT4 model noticeably degrades.
03. AWQ > GPTQ for most production INT4 deployments. Better perplexity, comparable speed, especially at 4-bit.
04. FP8 is the cleanest tradeoff where hardware supports it. Near-FP16 quality, higher throughput, no calibration required.
05. Task type matters. Chat and summarisation tolerate INT4 well. Coding, math, and multi-step reasoning should use INT8 or higher.