// Guide 04 — Applied Concepts

Context Windows

Length, Cost, and What Happens at the Limits

Context length is one of the most visible specs on a model's data sheet. It's also one of the most misunderstood — because the cost of using a long context doesn't scale the way most people expect, and the quality often doesn't either.

18 min read

What the context window actually is

The context window is the total number of tokens a model can see at once during a single forward pass. This includes everything: the system prompt, conversation history, any documents you've retrieved, the current user message, and space for the model's response. All of it competes for the same fixed budget.

When people say a model has a "128k context," they mean 128,000 tokens can be present simultaneously. In rough terms, that's somewhere around 90,000–100,000 words — but token count and word count don't map cleanly, and the exact ratio depends on the language and content type.
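The token-to-word arithmetic can be sketched in a few lines. The 1.33 tokens-per-word ratio below is an assumption, a common rule of thumb for English prose, not a property of any particular tokenizer:

```python
def estimate_words(tokens: int, tokens_per_word: float = 1.33) -> int:
    """Rough word-count estimate from a token count.

    The tokens_per_word ratio is an illustrative assumption for
    English prose; real ratios vary with language, tokenizer,
    and content type (code tokenizes very differently from text).
    """
    return int(tokens / tokens_per_word)
```

Plugging in a 128k window gives a figure in the 90,000–100,000 word range, consistent with the rough estimate above.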

// How context is used — example allocation of a 128k token budget, one request

- System prompt
- Conversation history
- Retrieved documents
- Current query
- Reserved for output

In RAG systems, retrieved documents are often the dominant consumer of context. System prompts, history, and the current query typically take a small fraction. Output reservation is often overlooked — the model can't generate past the context limit.
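The budget arithmetic is simple but worth making explicit, especially the output reservation. This is a minimal sketch with illustrative numbers (none of these allocations come from a specific model or system):

```python
def tokens_for_documents(
    budget: int = 128_000,
    system: int = 1_500,
    history: int = 6_000,
    query: int = 500,
    output_reserve: int = 4_000,
) -> int:
    """Tokens left for retrieved documents after fixed allocations.

    All numbers are illustrative. The key point: output reservation
    must be subtracted up front, because the model cannot generate
    past the context limit.
    """
    remaining = budget - (system + history + query + output_reserve)
    if remaining <= 0:
        raise ValueError("fixed allocations exceed the context budget")
    return remaining
```

With these example numbers, retrieved documents get roughly 116k of the 128k budget — which is why they dominate context consumption in RAG systems.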

How attention cost scales with context

Attention is the mechanism that lets each token attend to every other token in the context (in causal decoders, every preceding token). The cost of computing attention over the full sequence scales with the square of the sequence length: double the context, quadruple the attention computation. This is the O(n²) complexity that comes up repeatedly in discussions of long-context inference.

In practice, attention is only part of the total compute cost — the feed-forward layers scale linearly with context length, not quadratically. But attention dominates at long sequence lengths, which is why going from a short context to a very long one increases cost disproportionately.

// The quadratic problem

Doubling context from 8k to 16k tokens doesn't double the prefill cost — it roughly quadruples the attention computation. This is why FlashAttention and similar optimisations exist: they reduce the memory footprint of attention without changing its mathematical output, making long-context inference feasible on hardware that would otherwise run out of memory.
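The quadratic relationship is easy to express as relative cost against a baseline context length. This ignores the linear feed-forward component mentioned above and isolates just the attention term:

```python
def relative_attention_cost(n_tokens: int, baseline: int = 8_000) -> float:
    """Attention compute relative to a baseline context length.

    Attention scales as O(n^2) in sequence length, so doubling
    the context roughly quadruples the attention computation.
    """
    return (n_tokens / baseline) ** 2
```

Going from 8k to 16k tokens yields a 4× factor; going from 8k to 128k yields 256× — which is why long-context prefill cost grows so disproportionately.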

Memory cost of long context

The KV cache stores the key and value tensors for every token in the context, for every layer of the model. It grows linearly with context length. A short context uses little KV cache memory; a long context can consume a significant portion of available VRAM — and in extreme cases, more VRAM than the model weights themselves.

This is why context length and batch size trade off directly. If you're serving a model at a very long context, the KV cache for a single request takes up so much memory that you can't fit many concurrent requests. Throughput drops even if the hardware is otherwise capable.
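The KV cache size per request follows directly from the model's dimensions. A back-of-envelope sketch — the default dimensions below are illustrative of a mid-size model with grouped-query attention, not any specific architecture:

```python
def kv_cache_bytes(
    seq_len: int,
    n_layers: int = 32,
    n_kv_heads: int = 8,
    head_dim: int = 128,
    bytes_per_elem: int = 2,  # fp16/bf16
) -> int:
    """Per-request KV cache size: key and value tensors for every
    token, at every layer. Grows linearly with context length.

    Dimensions are illustrative; substitute your model's config.
    """
    # Leading factor of 2 accounts for both K and V tensors.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem
```

With these example dimensions, a single 128k-token request holds roughly 16.8 GB of KV cache — memory that can't be used for concurrent requests, which is the batch-size tradeoff described above.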

// Relative KV cache memory — same model, different context lengths

| Context length | KV cache memory |
| --- | --- |
| 4k tokens | 1× (baseline) |
| 16k tokens | 4× |
| 64k tokens | 16× |
| 128k tokens | 32× |

KV cache scales linearly with context length. At 128k tokens it can exceed model weight memory for many architectures.

Prefill cost at long context

The prefill phase processes the entire context in parallel before generating the first token. At short contexts, prefill is fast. At very long contexts — say, a document-heavy RAG request with 50k+ tokens of retrieved content — prefill can take seconds. This directly increases TTFT (time to first token), which users experience as the model "thinking" before it starts responding.
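A first-order TTFT estimate divides prompt length by prefill throughput. The throughput figure below is a placeholder assumption, not a benchmark — measure your own hardware and model:

```python
def prefill_ttft_seconds(
    prompt_tokens: int,
    prefill_tokens_per_sec: float = 10_000.0,  # illustrative placeholder
) -> float:
    """First-order estimate of prefill time before the first output
    token. Ignores queueing, network, and scheduler overhead, which
    add to real-world TTFT.
    """
    return prompt_tokens / prefill_tokens_per_sec
```

At this assumed throughput, a 50k-token document-heavy request spends about five seconds in prefill alone — the "thinking" pause users experience before the response starts.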

This is one reason why chunked prefill matters in production: by breaking long prefill into smaller chunks, the serving system can interleave prefill work with ongoing decode for other requests, rather than blocking everything while one long request completes its prefill.
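The chunking itself is straightforward to sketch. The chunk size here is illustrative; real schedulers tune it against a per-step token budget shared with decode work:

```python
def prefill_chunks(prompt_tokens: int, chunk_size: int = 2_048) -> list[range]:
    """Split a long prefill into fixed-size token chunks.

    A serving system can process one chunk per scheduling step and
    interleave decode steps for other requests in between, instead
    of blocking on one monolithic prefill.
    """
    return [
        range(start, min(start + chunk_size, prompt_tokens))
        for start in range(0, prompt_tokens, chunk_size)
    ]
```

Each range covers a contiguous slice of prompt token positions; the final chunk is simply shorter when the prompt length isn't a multiple of the chunk size.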

Quality at the limits

Models are trained and evaluated on specific context distributions. Most training data contains relatively short sequences. While models are often fine-tuned to extend context length, their ability to effectively use information spread across a very long context is not uniform — and it degrades in specific ways that are worth knowing.

The lost-in-the-middle problem

Research consistently shows that models are better at using information near the beginning and end of their context than information buried in the middle. If you have a long context with critical information in the middle, the model may effectively ignore it even though it's technically within the context window. This isn't a bug — it's a reflection of how attention patterns form during training and fine-tuning.
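One practical response to this finding (a common mitigation, not something prescribed above) is to order retrieved content so the most relevant chunks sit at the edges of the context rather than the middle. A minimal sketch, assuming the input list is already sorted most-relevant first:

```python
def edge_biased_order(chunks_by_relevance: list[str]) -> list[str]:
    """Reorder chunks so the most relevant land at the start and end
    of the context, pushing the least relevant toward the middle.

    Input is assumed sorted most-relevant first. This mirrors the
    lost-in-the-middle finding: models attend more reliably to the
    edges of a long context than to its middle.
    """
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(chunks_by_relevance):
        # Alternate chunks between the front and the back of the context.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

Whether this helps is model- and task-dependent; as with effective context, testing on your own workload is the only reliable check.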

Effective context vs. nominal context

A model with a 128k token context window doesn't uniformly utilise all 128k tokens. The effective context — the range within which information is reliably attended to — is typically shorter than the nominal maximum. How much shorter depends on the model, the task, and how the context is structured. Testing on your specific workload is the only reliable way to know.

// Don't assume the context window is the ceiling

Just because a model supports 128k tokens doesn't mean you should use 128k tokens. Longer context means higher cost, higher latency, more VRAM pressure, and potentially lower quality on information in the middle. The right context length is the shortest one that contains what the model actually needs.

Long context vs. RAG

One of the practical decisions teams face is whether to give the model a very long context (stuffing in all potentially relevant documents) or to use retrieval to select only the most relevant content and keep the context shorter. This is a cost, latency, and quality tradeoff.

| Approach | Cost | Latency | Quality risk |
| --- | --- | --- | --- |
| Long context (stuff everything in) | High — scales with total content | High TTFT from long prefill | Lost-in-the-middle; model may miss key information |
| RAG (retrieve then generate) | Lower — only relevant chunks in context | Adds retrieval latency, lower prefill cost | Retrieval can miss relevant content; chunking quality matters |
| Hybrid (retrieve + rerank + longer context) | Moderate | Moderate | Better coverage than RAG alone; less bloat than stuffing |

There's no universal winner. For tasks where the model needs to reason across the full document (contracts, codebases, long transcripts), long context can outperform RAG. For tasks where specific facts need to be retrieved from a large corpus, RAG is typically more practical and cost-effective.
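The "retrieve then trim" side of this tradeoff comes down to packing ranked chunks into a token budget. A minimal greedy sketch (the chunk format and budget are illustrative assumptions):

```python
def pack_chunks(ranked_chunks: list[tuple[str, int]], budget: int) -> list[str]:
    """Greedily pack ranked (text, token_count) chunks into a token
    budget, most relevant first.

    Chunks that would overflow the budget are skipped rather than
    truncated, so a later smaller chunk can still fit.
    """
    packed: list[str] = []
    used = 0
    for text, n_tokens in ranked_chunks:
        if used + n_tokens > budget:
            continue  # skip chunks that would overflow the budget
        packed.append(text)
        used += n_tokens
    return packed
```

This keeps the context at "the shortest one that contains what the model actually needs" — the principle from the callout above — rather than filling the window just because it's available.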

// In short

01. Context window = total token budget for everything. System prompt, history, documents, query, and output all compete for the same limit.
02. Attention cost scales quadratically with context length. Double the context, roughly quadruple the attention compute. Feed-forward layers are linear; attention dominates at long context.
03. KV cache grows linearly and can exceed model weight memory. Long context squeezes batch size and reduces throughput, even on capable hardware.
04. Prefill at long context increases TTFT. A 50k-token prefill takes noticeably longer than a 2k-token prefill. Users feel this as latency before the first token arrives.
05. Models don't use all of their context uniformly. Information in the middle of a very long context is attended to less reliably than information at the start or end. Effective context is shorter than nominal context.
06. Use the shortest context that contains what you need. Long context has real costs. RAG and chunking are tools for keeping context focused rather than exhaustive.