Production Metrics
TTFT, TPOT, and the Numbers That Define LLM Experience
A model that is fast in a benchmark can feel slow in production. The metrics you track determine what you optimise. Know these numbers and you can have a useful conversation about any LLM deployment.
The two phases create two separate problems
LLM inference has two distinct phases, prefill and decode, and they create different user-experience problems. Prefill processes the entire prompt in parallel, so it is fast per token, but the user sees nothing until it completes. Decode is slow per token, but each token is visible progress. Users perceive these very differently, which is why a single latency number fails to capture what matters.
A response that takes 10 seconds with nothing displayed, then dumps 500 words instantly, is experienced as a 10-second wait. A response that streams 500 words over 12 seconds, starting within 200ms, feels fast and responsive. Both have the same end-to-end latency. The metrics below capture this distinction precisely.
Time to First Token (TTFT)
The elapsed time from when a request is submitted to when the first output token is received by the client. Determined primarily by prefill time and queue wait time. This is the metric that governs perceived responsiveness — how long users wait before anything appears.
TTFT is dominated by prompt length and system load. Long system prompts (2k+ tokens) can add hundreds of milliseconds of prefill time before any streaming begins. Prefix caching is the primary lever for reducing TTFT in production — a cached system prompt costs near zero prefill time.
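Client-side, TTFT is simply the elapsed time from submission to the first streamed token. A minimal measurement sketch, where `stream_tokens` stands in for whatever streaming client you actually use (the fake generator below is illustrative, not a real API):

```python
import time

# Minimal sketch of measuring TTFT client-side. `stream_tokens` is any
# generator that yields tokens as the server streams them; `fake_stream`
# below is a stand-in, not a real serving client.
def measure_ttft(stream_tokens, prompt):
    start = time.monotonic()
    for _token in stream_tokens(prompt):
        return time.monotonic() - start  # elapsed time to the first token
    return None  # stream produced no tokens

def fake_stream(prompt):
    time.sleep(0.15)  # stand-in for queue wait + prefill of a long prompt
    yield from ["Hello", ",", " world"]

ttft = measure_ttft(fake_stream, "hi")  # ~0.15 s, dominated by "prefill"
```

In production you would record this per request and aggregate into percentiles rather than inspect single values.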
Time per Output Token (TPOT)
The average time between successive output tokens, measured during the decode phase. Also called inter-token latency (ITL). This determines the streaming "feel" — whether the text appears to flow naturally or stutters. At good TPOT values, text appears at roughly reading speed.
Human reading speed is roughly 250 words per minute, or about 4 words per second. At ~1.3 tokens per word, this is approximately 5 tokens per second, or 200ms per token. Anything below 100ms TPOT produces a perceptibly smooth stream; above 150ms, users typically notice stuttering.
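The conversion above can be checked in a few lines (figures from the text; the 200ms value is a round-up of roughly 185ms):

```python
# Reading-speed arithmetic from the text: 250 words/min, ~1.3 tokens/word.
words_per_minute = 250
tokens_per_word = 1.3

tokens_per_second = words_per_minute / 60 * tokens_per_word  # ~5.4 tok/s
tpot_ms = 1000 / tokens_per_second                           # ~185 ms/token
```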
End-to-End Latency (E2EL)
Total time from request submission to receiving the final token. Equals TTFT + (TPOT × output_tokens). The least useful single metric for UX, but important for batch/offline workloads where streaming is irrelevant.
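The formula is straightforward to apply; a sketch with invented numbers:

```python
def e2el_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """End-to-end latency per the text: TTFT + (TPOT x output_tokens)."""
    return ttft_ms + tpot_ms * output_tokens

# Invented example: 300 ms TTFT, 40 ms TPOT, 500 output tokens.
total = e2el_ms(300, 40, 500)  # 20_300 ms, about 20.3 s
```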
Throughput
The rate at which the system generates output tokens across all concurrent requests, measured in tokens per second (TPS) or requests per second (RPS). The primary metric for cost efficiency — maximising throughput minimises cost per token.
Beware that quoted throughput figures are often estimates derived from memory bandwidth and model size. Actual numbers vary significantly with batch size, context length, quantization implementation, and serving configuration, so treat such figures as order-of-magnitude references, not benchmarks.
// Throughput vs. latency tension
Increasing throughput usually increases latency. Batching more requests together improves GPU utilisation and tokens-per-second, but each individual request waits longer in the queue and experiences higher TPOT as the batch competes for compute. Your SLO defines where to set this tradeoff.
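A toy model makes the tradeoff concrete. Assume a decode step costs a fixed base time plus a small per-request increment, and every request in the batch receives one token per step (illustrative constants, not measurements):

```python
# Toy model of the batching tradeoff: a decode step has a fixed base cost
# plus a small per-request cost. Constants are invented for illustration.
def decode_step_ms(batch_size, base_ms=20.0, per_request_ms=1.0):
    return base_ms + per_request_ms * batch_size

def tpot_ms(batch_size):
    # Each request gets one token per step, so its TPOT equals step time.
    return decode_step_ms(batch_size)

def throughput_tps(batch_size):
    # Aggregate tokens per second across the whole batch.
    return batch_size * 1000 / decode_step_ms(batch_size)

# batch 1:  TPOT 21 ms, ~48 tok/s total
# batch 32: TPOT 52 ms, ~615 tok/s total -- more throughput, worse latency
```

The model captures the qualitative shape only: real step times depend on memory bandwidth, context length, and kernel efficiency, but the direction of the tradeoff is the same.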
Percentiles, not averages
Average latency is nearly useless for SLO management. A system with 200ms average TTFT might have p99 of 3 seconds — meaning 1% of users wait 15× longer than average. In production, you care about the tail.
// Latency percentile hierarchy
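A minimal sketch of the point above, using a nearest-rank percentile over synthetic TTFT samples (invented numbers): a handful of stragglers barely move the mean while dominating p99.

```python
# Why averages hide the tail: 98 fast requests plus 2 stragglers keep the
# mean near 200 ms while p99 sits at 3 seconds. Samples are synthetic.
def percentile(samples, p):
    xs = sorted(samples)
    # nearest-rank percentile over the sorted samples
    k = max(0, min(len(xs) - 1, round(p / 100 * (len(xs) - 1))))
    return xs[k]

ttfts_ms = [140] * 98 + [3000] * 2       # two 3-second stragglers

mean_ms = sum(ttfts_ms) / len(ttfts_ms)  # ~197 ms -- looks healthy
p50 = percentile(ttfts_ms, 50)           # 140 ms
p99 = percentile(ttfts_ms, 99)           # 3000 ms -- the users who suffer
```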
Cost per token
The economic metric that ultimately governs hardware and optimisation decisions.
Cost per million output tokens
cost = (hourly_GPU_cost / throughput_tokens_per_hour) × 1_000_000
Example: at $3/hr and 4,000 tokens/sec, you generate ~14.4M tokens/hr → ~$0.21 per million tokens before margin. Substitute your actual GPU cost and measured throughput.
Throughput is the key variable in the denominator. Every optimisation that increases tokens per second — continuous batching, quantization, prefix caching, speculative decoding — directly reduces cost per token. This is why throughput is the primary engineering target for cost-sensitive deployments.
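The formula above as a function, with the worked example from the text:

```python
def cost_per_million_tokens(hourly_gpu_cost: float,
                            throughput_tps: float) -> float:
    # Formula from the text: hourly cost / tokens per hour, scaled to 1M.
    tokens_per_hour = throughput_tps * 3600
    return hourly_gpu_cost / tokens_per_hour * 1_000_000

# $3/hr at 4,000 tok/s -> ~14.4M tokens/hr -> ~$0.21 per million tokens
cost = cost_per_million_tokens(3.0, 4000)
```

Substitute your actual GPU price and measured (not theoretical) throughput.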
Goodput
A subtler metric gaining adoption in production teams: goodput is the fraction of GPU compute that produces tokens delivered to users within SLO, as opposed to tokens that were generated but arrived too late (after a client timeout), or compute spent on cancelled requests.
A system at 90% throughput utilisation but 20% timeout rate has much lower goodput than it appears. Goodput = useful work ÷ total work. It's the metric that most honestly captures whether your serving system is doing what you're paying for.
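The calculation is simple once instrumentation exists. A sketch assuming hypothetical per-request records of (tokens generated, whether the request met its SLO); the record shape is invented, not a real API:

```python
# Goodput sketch: tokens delivered within SLO / all tokens generated.
# `requests` is hypothetical instrumentation output: (tokens, met_slo).
def goodput(requests):
    useful = sum(tokens for tokens, met_slo in requests if met_slo)
    total = sum(tokens for tokens, _ in requests)
    return useful / total if total else 0.0

# 80% of requests within SLO, 20% timed out after generating their tokens:
reqs = [(500, True)] * 8 + [(500, False)] * 2
g = goodput(reqs)  # 0.8 -- the timed-out tokens were wasted compute
```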
What to instrument
| Metric | Unit | Why it matters |
|---|---|---|
| TTFT p50/p95/p99 | ms | User-perceived responsiveness; first impression |
| TPOT p50/p95/p99 | ms/token | Streaming quality; reading speed match |
| Token throughput | tokens/sec | Cost efficiency; capacity planning |
| Request queue depth | count | Early warning of overload; latency predictor |
| KV cache utilisation | % | Memory pressure; preemption risk |
| Prefix cache hit rate | % | Prefill efficiency; TTFT reduction |
| Token error rate | % | Generation failures; OOM events |
| GPU utilisation | % | Headroom assessment; over/under-provisioning |
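A per-request record covering the first three rows of the table can be derived from just three timestamps plus a token count. A sketch with hypothetical field names:

```python
from dataclasses import dataclass

# Sketch of per-request instrumentation: derive TTFT, TPOT, and E2EL from
# three timestamps, which is usually all a client can observe directly.
@dataclass
class RequestTiming:
    submitted_s: float
    first_token_s: float
    last_token_s: float
    output_tokens: int

    @property
    def ttft_ms(self) -> float:
        return (self.first_token_s - self.submitted_s) * 1000

    @property
    def tpot_ms(self) -> float:
        # Average gap between tokens after the first one arrives.
        decode_s = self.last_token_s - self.first_token_s
        return decode_s / max(self.output_tokens - 1, 1) * 1000

    @property
    def e2el_ms(self) -> float:
        return (self.last_token_s - self.submitted_s) * 1000

t = RequestTiming(submitted_s=0.0, first_token_s=0.3,
                  last_token_s=10.3, output_tokens=501)
# TTFT ~300 ms, TPOT ~20 ms/token, E2EL ~10,300 ms
```

Feed these per-request values into your histogram/percentile pipeline; the p50/p95/p99 columns in the table are aggregations of exactly these numbers.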
// In short