Prompt Engineering for Inference
How Prompt Structure Affects Cost, Latency, and Cache Efficiency
Most prompt engineering content focuses on getting better outputs. This guide covers the other half: how the structure and composition of your prompts affect what happens at the infrastructure level — prefill cost, KV cache hit rate, TTFT, and cost per request.
Tokens cost money and time
Every token in your prompt is processed during the prefill phase. Prefill is fast relative to decode, but it's not free — and at scale, the cumulative cost of large system prompts, over-retrieved context, and redundant few-shot examples adds up in compute time and API cost.
The engineering lens on prompt design starts here: prompts are not just instructions, they're compute inputs with measurable cost. Token count affects prefill time (and thus TTFT), KV cache memory usage, and how much room remains in the context window for output.
Prefix caching
This is the most consequential infrastructure concept for prompt design. When the prefix of a prompt — typically the system prompt — is identical across multiple requests, serving frameworks can cache the KV cache entries for those tokens and reuse them without recomputing the prefill.
The implication is significant: a long system prompt that stays constant across all requests only needs to be prefilled once per cache entry lifetime. Subsequent requests that share that prefix pay near-zero prefill cost for the cached portion. A 2,000-token system prompt has very different economics if it's always the same (cacheable) vs. slightly different per user (not cacheable).
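To see the scale of the difference, here is a back-of-envelope model of prefill work with and without a cacheable prefix. The 95% hit rate, token counts, and request volume are illustrative assumptions, not measurements from any particular system.

```python
# Back-of-envelope prefill economics for a cacheable system prompt.
# All numbers (token counts, hit rate, volume) are illustrative assumptions.

def prefill_tokens_computed(system_tokens: int, dynamic_tokens: int,
                            requests: int, cache_hit_rate: float) -> int:
    """Total prefill tokens actually computed across `requests` calls.

    On a cache hit only the dynamic suffix is prefilled; on a miss the
    full prompt (system + dynamic) is prefilled from scratch.
    """
    hits = int(requests * cache_hit_rate)
    misses = requests - hits
    return hits * dynamic_tokens + misses * (system_tokens + dynamic_tokens)

# 2,000-token system prompt, 500 dynamic tokens, 10,000 requests:
cached   = prefill_tokens_computed(2000, 500, 10_000, cache_hit_rate=0.95)
uncached = prefill_tokens_computed(2000, 500, 10_000, cache_hit_rate=0.0)
print(cached, uncached)  # 6000000 25000000
```

Under these assumptions the cacheable layout does roughly a quarter of the prefill work of the uncacheable one, and the gap widens as the static prefix grows relative to the dynamic portion.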
// prompt structure — cache-friendly layout
┌─ STATIC (cacheable prefix) ─────────────────────┐
│ System prompt                                   │
│ You are a customer support assistant for...     │
│ Core instructions, policies, tone guidelines    │
│ Few-shot examples (if any)                      │
│                                                 │
│ [These ~1500 tokens hit the cache on request 2] │
└─────────────────────────────────────────────────┘
┌─ DYNAMIC (computed fresh each request) ─────────┐
│ Retrieved context (if using RAG)                │
│ Conversation history                            │
│ Current user message                            │
└─────────────────────────────────────────────────┘

Cache-friendly prompt design keeps static content at the front of the context. Systems that prepend dynamic content before the system prompt break the prefix match and invalidate the cache on every request.
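As a concrete sketch, a request builder that follows this layout might look like the following. The message shape and field names are generic illustrations, not any specific provider's API.

```python
# Sketch of assembling messages with a stable, cacheable prefix.
# Message shape and field names are generic, not a specific provider's API.

STATIC_SYSTEM_PROMPT = (
    "You are a customer support assistant for ExampleCo.\n"
    "Core instructions, policies, and tone guidelines go here, followed\n"
    "by any few-shot examples. This text never varies between requests,\n"
    "so the serving layer can reuse its KV cache entries."
)

def build_messages(retrieved_context: str, history: list[dict],
                   user_msg: str) -> list[dict]:
    # Static content first: anything injected before this point
    # (timestamps, user IDs, per-request context) breaks the prefix match.
    messages = [{"role": "system", "content": STATIC_SYSTEM_PROMPT}]
    # Dynamic content only after the cacheable prefix.
    if retrieved_context:
        messages.append({"role": "user",
                         "content": "Relevant context:\n" + retrieved_context})
    messages.extend(history)
    messages.append({"role": "user", "content": user_msg})
    return messages
```

The design point is the ordering constraint, not the helper itself: any per-request value that sneaks in front of the system prompt turns every request into a cache miss.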
// When prefix caching is available
Prefix caching (also called prompt caching or KV cache sharing) is supported by all major serving frameworks and is offered as a first-class API feature by most hosted providers. Cache hit rates depend on how frequently the same prefix appears within the cache TTL. Short TTLs and high-variance prefixes reduce effectiveness substantially.
System prompt size
System prompts often grow incrementally — a few lines added here, an edge case handled there — until they're several thousand tokens. At this scale, a few things happen. The prefill cost for non-cached requests increases. More KV cache memory is consumed per request. And there's less room in the context window for output and user content.
The practical question is whether all of that content is actually used by the model on a given request. A 3,000-token system prompt covering 40 different edge cases is only useful when those edge cases occur. One approach is to retrieve only the relevant portions of a large instruction set rather than including everything, treating the instructions themselves as a retrieval problem.
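A minimal sketch of that retrieval idea follows. The section names, texts, and naive keyword-overlap scoring are invented for illustration; a real system would use proper retrieval (embeddings, BM25, or similar).

```python
# Sketch: retrieve only the relevant sections of a large instruction set.
# Sections and the keyword-overlap scoring are invented for illustration.
import re

INSTRUCTION_SECTIONS = {
    "refunds": "If the user asks for a refund, explain the 30-day refund policy.",
    "shipping": "For shipping delays, check carrier tracking before promising dates.",
    "billing": "For billing disputes, escalate to a human agent.",
    # ...dozens more edge-case sections
}

def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))

def relevant_instructions(query: str, top_k: int = 2) -> str:
    """Return the top_k instruction sections by keyword overlap with the query."""
    q = _words(query)
    scored = sorted(INSTRUCTION_SECTIONS.items(),
                    key=lambda kv: len(q & (_words(kv[0]) | _words(kv[1]))),
                    reverse=True)
    return "\n\n".join(text for _, text in scored[:top_k])
```

Note that sections selected this way vary per request, so they belong in the dynamic region of the prompt, not the cacheable prefix; the win is paying for 2 sections of prefill instead of 40, not cache reuse.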
Few-shot examples
Few-shot examples (input-output pairs that demonstrate the desired behaviour) are one of the most effective prompting techniques, and one of the most expensive in token terms. Each example consumes tokens for both its input and its expected output, so the cost grows linearly with the number and length of examples.
When few-shot examples are static and used across many requests, they're excellent candidates for prefix caching. When they need to be dynamic — selected based on the user's query — consider whether the selection is doing enough work to justify the added prefill cost vs. better system prompt instructions.
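The trade-off can be made concrete with a rough expected-cost calculation. The hit rate and token counts here are assumptions for illustration:

```python
# Expected per-request prefill cost attributable to few-shot examples.
# Hit rate and token counts are illustrative assumptions.

def fewshot_prefill_cost(n_examples: int, tokens_per_example: int,
                         cached: bool, cache_hit_rate: float = 0.95) -> float:
    """Expected prefill tokens per request spent on the examples."""
    total = n_examples * tokens_per_example
    if not cached:
        return float(total)              # recomputed on every request
    return total * (1 - cache_hit_rate)  # only paid on cache misses

static_cost  = fewshot_prefill_cost(5, 150, cached=True)   # ~37.5 tokens/request
dynamic_cost = fewshot_prefill_cost(5, 150, cached=False)  # 750 tokens/request
```

Under these assumptions, dynamically selected examples need to earn back roughly a 20x difference in expected prefill cost over static, cached ones.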
Output length
Output token count has a different cost profile than input token count. While input tokens are processed during prefill (fast, parallel), output tokens are generated one at a time during decode (slow, sequential). An instruction that produces a longer output doesn't just cost more in tokens — it takes proportionally longer to complete.
Where the output length matters for your application, consider instructing the model explicitly. "Answer in one sentence" or "respond with only the JSON object, no explanation" can substantially reduce output token count without reducing quality for structured tasks. Verbose outputs are expensive in both latency and compute.
Structured output and JSON mode
Structured output (requesting JSON, XML, or a specific schema) reduces post-processing complexity, but the output format itself affects token count. A verbose JSON structure with many nested keys produces more tokens than a compact representation of the same data. Schema design is prompt engineering, and compact schemas reduce decode time.
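A small illustration of the same record under two schemas. Serialized length is a crude proxy for token count, since exact counts depend on the tokenizer; the field names are invented for the example.

```python
# The same record under a verbose and a compact schema. Serialized length
# is a crude proxy for token count; exact counts depend on the tokenizer.
import json

verbose = {
    "customer_record": {
        "customer_full_name": "Ada Lovelace",
        "customer_account_status": "active",
        "customer_open_ticket_count": 3,
    }
}
compact = {"name": "Ada Lovelace", "status": "active", "open_tickets": 3}

print(len(json.dumps(verbose)), len(json.dumps(compact)))
```

The compact form carries identical information in roughly half the characters, and every one of those saved characters is decode time the model doesn't spend.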
| Design decision | Cost/latency effect | Recommendation |
|---|---|---|
| Static system prompt at start | Enables prefix caching — major savings at scale | Always keep static content first |
| Dynamic content mixed into system prompt | Breaks cache, full prefill on every request | Separate static and dynamic regions cleanly |
| Long static system prompt | One-time prefill cost, then cached | Acceptable if cacheable; avoid uncritical growth |
| Long dynamic context (RAG) | Full prefill cost every request | Retrieve only what's needed; rank and trim |
| Few-shot examples (static) | Prefill cost offset by cache hit | Good pattern — keep examples in the cacheable prefix |
| Verbose output instructions | Longer decode, higher TPOT cost | Instruct for concise output where quality allows |
Measuring the impact
Prompt engineering decisions have measurable effects that should be validated, not assumed. Cache hit rate, average prefill token count, TTFT, and cost per request are all directly observable in production. A prompt change that looks minor can meaningfully shift these metrics if it changes the cacheable prefix length or alters output verbosity.
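A sketch of deriving these metrics from per-request logs. The log record fields here are assumptions about what your serving layer exposes:

```python
# Deriving cache hit rate, average computed prefill, and TTFT from request
# logs. The log record fields are assumptions about what you collect.
from statistics import mean

request_logs = [
    {"prefill_tokens": 2500, "cached_tokens": 2000, "ttft_ms": 180},
    {"prefill_tokens": 2500, "cached_tokens": 0,    "ttft_ms": 620},
    {"prefill_tokens": 2600, "cached_tokens": 2000, "ttft_ms": 210},
]

cache_hit_rate = mean(1 if r["cached_tokens"] > 0 else 0 for r in request_logs)
avg_computed_prefill = mean(r["prefill_tokens"] - r["cached_tokens"]
                            for r in request_logs)
avg_ttft_ms = mean(r["ttft_ms"] for r in request_logs)
```

Tracking these three numbers before and after a prompt change is usually enough to tell whether it helped, hurt, or silently broke the cacheable prefix.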
// In short
Prompts are compute inputs with measurable cost. Keep static content at the front so it can be cached, separate it cleanly from dynamic content, retrieve and trim context rather than over-including it, instruct for concise output where quality allows, and validate every prompt change against cache hit rate, prefill token count, TTFT, and cost per request.