Intent Engineering
Closing the gap between what users say and what they mean
A language model responds to the literal content of its input. Users communicate with goals in mind that often aren't fully expressed in what they type. Intent engineering is the discipline of designing systems — prompts, pipelines, retrieval strategies, output structures — that reliably close that gap. It's where inference engineering meets product thinking.
The gap between input and intent
When a user sends a message to an AI system, they're rarely expressing their full intent. They're expressing the part of their intent that they expect to be understood — which is a much smaller thing. Human communication works this way because it's heavily context-dependent: we rely on shared assumptions, implicit goals, and the other party's ability to infer what we actually want.
Language models are trained on human-generated text, so they develop some capacity to infer intent beyond literal meaning. But that inference happens inside the model, without visibility into the actual system the user is interacting with, the user's real goal, or the constraints that define a "good" response in that context. The system designer has to do the work of surfacing those things — either by providing context to the model or by structuring the interaction in ways that reduce the gap before the model ever sees the input.
// intent_gap.txt — three layers of what a user wants

What they typed: "fix my code"
What they meant: Make this function behave correctly without changing its interface, and explain what was wrong so I can avoid it next time

What they typed: "summarise this document"
What they meant: Extract the three things I need to act on — I have 90 seconds before a call

What they typed: "write a cold email to a CTO"
What they meant: Sound like a peer, not a vendor — no jargon, no flattery, one paragraph, specific to their industry

Layers of intent
Intent has structure. It helps to decompose it explicitly before deciding what the system should do with it.
How models handle intent
Models trained on human feedback develop a working model of intent — they learn to infer what someone probably wants from the pattern of their words, and they weight their responses toward what satisfies that inferred goal. This happens implicitly, not through explicit reasoning about intent.
The quality of this inference varies significantly by domain, specificity, and ambiguity. Short, ambiguous queries are hard. Queries that carry the goal in the words themselves are easy. Domain-specific queries where the intended meaning only makes sense with background knowledge can fail completely if the model's training didn't include that domain.
One important consequence of how models learn intent: they generalise from training data patterns. If your use case is genuinely novel — a workflow or task type that wasn't common in training — the model's intent inference will be unreliable, and the system design has to compensate. You can't assume the model has a correct prior for what your users are trying to do.
Intent failure modes
When intent engineering fails, the failure usually falls into one of a small number of recognisable patterns.
Literal compliance
"Shorten this email" → model deletes every other sentence
The model satisfies the literal request while violating the obvious underlying goal. Common when the instruction is underspecified and the model lacks context about why.
Goal substitution
"Help me write a cover letter" → model writes a generic one from scratch instead of working with the user's draft
The model correctly identifies the domain but substitutes its own interpretation of the goal. Often happens when common patterns in training override the specific request.
Implicit standard violation
"Clean up my essay" → model rewrites it in a completely different voice
The model accomplishes the surface task but violates an assumption the user considered obvious. These failures often generate strong user frustration because they feel like the model "wasn't listening."
Overcorrection
"Make this more formal" → model adds legal disclaimers and removes all personality
The model applies the instruction beyond the intended scope. Usually a calibration problem — it doesn't know when to stop, because the stopping point was implicit.
Context blindness
User asks a follow-up question that only makes sense in the context of earlier messages — model treats it as standalone
The model doesn't connect the current request to prior context. Can happen when context window management drops relevant history, or when the model fails to maintain state.
Autonomy override
"Translate this literally" → model "improves" the translation based on its own judgement
The model substitutes its own preferences for the user's explicit choice. Common in creative tasks where the model has strong priors about what "good" looks like.
Designing systems for intent
Intent engineering happens at multiple points in system design — not just in the system prompt. Each architectural decision either narrows or widens the gap between what users express and what the system delivers.
System prompt as intent contract
The system prompt is where you encode your best understanding of what users of this system are actually trying to achieve. A good system prompt doesn't just describe the task — it articulates the underlying goal, the implicit standards that should never be violated, and the constraints that should be treated as fixed. It establishes the interpretive frame the model uses when user intent is ambiguous.
The temptation is to write a system prompt that covers every edge case. This usually produces prompts that are too long to be followed reliably and that create contradictions the model has to resolve on its own. A better approach is to identify the three or four things that matter most — the constraints whose violation would produce unacceptable outputs — and focus the system prompt there. Everything else is noise that dilutes the signal.
Few-shot examples as intent demonstration
Natural language instructions are ambiguous by nature. Showing the model example input-output pairs removes much of that ambiguity. A well-chosen example communicates not just what the output looks like, but how to handle the range of cases that the instruction doesn't fully specify. Two or three strong examples often do more for intent alignment than a paragraph of instructions.
The examples should be representative of the actual distribution of inputs — not idealised cases that the model already handles well. If your system regularly receives short, vague queries, your examples should show how to handle short, vague queries. If users frequently provide incomplete context, examples should demonstrate how to ask for clarification rather than hallucinate what's missing.
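A minimal sketch of this, assuming an OpenAI-style chat message format (a common convention, not a requirement): the examples mirror the short, vague inputs the system actually receives, and one of them demonstrates asking for clarification instead of guessing. The example contents are invented for illustration.

```python
# Hypothetical few-shot examples, formatted as chat messages. They are chosen
# to match the real input distribution (short, underspecified queries), and
# the first pair demonstrates clarification rather than hallucination.

FEW_SHOT = [
    {"role": "user", "content": "fix my code"},
    {"role": "assistant", "content": (
        "I don't see any code in your message yet. Paste the function "
        "and I'll fix it without changing its interface."
    )},
    {"role": "user", "content": "summarise this\n<doc>...quarterly report...</doc>"},
    {"role": "assistant", "content": (
        "Three things to act on:\n1. ...\n2. ...\n3. ..."
    )},
]

def build_messages(system_prompt: str, user_input: str) -> list[dict]:
    """Interleave system prompt, demonstrations, and the live request."""
    return (
        [{"role": "system", "content": system_prompt}]
        + FEW_SHOT
        + [{"role": "user", "content": user_input}]
    )
```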
Structured input collection
Sometimes the right solution is to gather more intent information before sending anything to the model. A form that asks for tone, length, audience, and purpose before generating a document isn't friction — it's intent capture. The more information you have about what the user actually wants, the less the model has to infer, and the more reliable the output.
This is most applicable in workflows where users can be expected to specify their requirements upfront. In conversational contexts, the equivalent is designing the system to ask clarifying questions when the initial message is genuinely ambiguous — rather than guessing and producing an output that may miss the mark entirely.
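One way to sketch this kind of upfront intent capture, assuming a Python stack. The field names follow the tone/length/audience/purpose example above; `DocumentRequest` and the example values are hypothetical:

```python
from dataclasses import dataclass

# Hypothetical intent-capture form: each field is information the model
# would otherwise have to infer, collected before any generation happens.

@dataclass
class DocumentRequest:
    task: str          # what to write
    tone: str          # e.g. "peer, not vendor"
    max_words: int     # hard length budget
    audience: str      # who will read it
    purpose: str       # what the reader should do afterwards

    def to_prompt(self) -> str:
        """Render the captured intent as explicit instructions."""
        return (
            f"{self.task}\n"
            f"Tone: {self.tone}. Audience: {self.audience}.\n"
            f"Purpose: {self.purpose}.\n"
            f"Stay under {self.max_words} words."
        )

req = DocumentRequest(
    task="Write a cold email introducing our observability tool.",
    tone="peer, not vendor",
    max_words=120,
    audience="CTO at a mid-size fintech",
    purpose="agree to a 20-minute call",
)
```

Compare `req.to_prompt()` with the bare request "write a cold email to a CTO": every field here is intent the model previously had to guess.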
Output structure as implicit intent specification
Defining the structure of the output is a form of intent specification. If you tell the model to produce a JSON object with specific fields, you're constraining not just the format but the interpretive choices it makes along the way. A field called concise_summary communicates length and style expectations that a generic "summary" instruction doesn't. The structure encodes part of the intent.
Structured outputs are also more evaluable. If the model should produce a list of action items, you can check whether it did. If it should produce a response under 150 words, you can measure that. Making intent explicit through structure makes it possible to verify whether intent was satisfied, which is the prerequisite for systematic improvement.
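A sketch of that verification step, under the assumption that the system asks for a JSON object with a `concise_summary` field and a list of `action_items` (both names and the 150-word limit are illustrative):

```python
import json

# Hypothetical verifier: checks whether a model's raw output satisfies the
# structural part of the intent. Returns a list of violations (empty = ok).

def verify_output(raw: str) -> list[str]:
    problems = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    summary = data.get("concise_summary", "")
    if not summary:
        problems.append("missing concise_summary")
    elif len(summary.split()) > 150:
        problems.append("concise_summary exceeds 150 words")
    items = data.get("action_items")
    if not isinstance(items, list) or not items:
        problems.append("action_items missing or empty")
    return problems
```

Checks like these can't tell you whether the summary is a good one, but they turn part of the intent into a pass/fail signal that can gate retries and feed dashboards.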
Intent in multi-turn systems
Single-turn intent is already complex. Multi-turn systems add the complication of intent that evolves across a conversation — the user starts with one goal, learns something, and shifts to a related but different goal. Or they express dissatisfaction with an output without being explicit about what they wanted instead.
Multi-turn intent engineering requires that the system maintain a coherent model of what the user is trying to achieve across the entire conversation, not just the current message. This means thoughtful context window management — what history to retain, what to summarise, and what to drop — and prompts that explicitly direct the model to reason about the current request in light of the conversation so far.
A common failure pattern in multi-turn systems is treating each message as an independent request. The model responds to what the user just said without connecting it to the thread of what they've been working toward. This produces conversations that feel like they have no memory of themselves, and it consistently fails users who have spent multiple turns building toward something specific.
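One common shape for this, sketched under assumptions: carry a running summary of the user's goal plus the last few raw turns, and tell the model explicitly to interpret the new message against that goal. The function name, turn budget, and prompt wording are illustrative; producing `goal_summary` (typically a cheap model call) is omitted.

```python
# Hypothetical multi-turn context builder: a goal summary stands in for the
# older history, and the system message instructs the model to connect the
# current request to that goal rather than treat it as standalone.

RECENT_TURNS = 6  # how many raw messages to keep verbatim

def build_context(goal_summary: str, history: list[dict],
                  user_input: str) -> list[dict]:
    recent = history[-RECENT_TURNS:]
    system = (
        "What the user has been working toward so far:\n"
        f"{goal_summary}\n\n"
        "Interpret the next message in light of that goal, "
        "not as a standalone request."
    )
    return ([{"role": "system", "content": system}]
            + recent
            + [{"role": "user", "content": user_input}])
```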
Evaluating intent satisfaction
The practical question is: how do you know whether your system is correctly handling user intent? Direct evaluation — humans reviewing whether outputs matched what users actually wanted — is the most reliable but also the most expensive. For most teams, it's the right baseline to establish before trying to automate.
| Evaluation method | What it measures | Limitation |
|---|---|---|
| Human review | Whether output matched actual user intent | Expensive; doesn't scale; annotator agreement varies |
| Task completion rate | Whether the user accomplished their goal (via follow-up survey or implicit signals) | Delayed signal; doesn't identify what went wrong |
| Regeneration/edit rate | Whether users accepted the output or tried again | Conflates many failure types; doesn't capture silent dissatisfaction |
| LLM-as-judge | A second model evaluates whether the output matches the stated intent | Reliable only if the evaluator model understands the intent better than the generator |
| Structured field verification | Whether structured output fields are populated correctly | Measures format compliance, not whether the content satisfied the goal |
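For the LLM-as-judge row, one mitigation for its limitation is to hand the evaluator the stated intent explicitly rather than asking it to re-infer intent from the input. A sketch of such a judge prompt, with hypothetical wording and the model call itself left out:

```python
# Hypothetical judge prompt: the evaluator sees the intent spelled out, so it
# only has to compare output against intent, not reconstruct the intent.

JUDGE_TEMPLATE = """\
The user's stated intent: {intent}
The user's input: {user_input}
The system's output: {output}

Did the output satisfy the stated intent? Answer with exactly one word,
PASS or FAIL, then one sentence naming which part of the intent was
missed (or "none")."""

def build_judge_prompt(intent: str, user_input: str, output: str) -> str:
    return JUDGE_TEMPLATE.format(
        intent=intent, user_input=user_input, output=output
    )
```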
The most useful signal is often the simplest: if users frequently rephrase their request and try again, or follow up with corrections, intent is consistently being missed. The pattern of corrections often identifies exactly which part of the intent the system is failing to capture — which is much more actionable than an aggregate quality score.
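The rephrase-and-retry signal can be approximated cheaply. A sketch, with the token-overlap measure and the threshold as illustrative assumptions (real systems might use embeddings instead):

```python
# Hypothetical detector for rephrased retries: consecutive user messages with
# high word overlap suggest the previous response missed the intent.

def _overlap(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two messages."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def missed_intent_turns(user_messages: list[str],
                        threshold: float = 0.3) -> list[int]:
    """Indices of messages that look like rephrased retries of the previous one."""
    return [
        i for i in range(1, len(user_messages))
        if _overlap(user_messages[i - 1], user_messages[i]) >= threshold
    ]

msgs = [
    "shorten this email please",
    "shorten this email but keep the greeting",
    "what's the weather",
]
```

Clustering the flagged pairs by what changed between attempts is what turns this from a rate into a diagnosis.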
// Intent engineering is ongoing
User intent evolves as a product evolves. New features create new use cases. New user populations bring different communication styles and different background assumptions. Intent engineering isn't a one-time calibration — it's a continuous discipline of observing how users actually interact with the system and updating the design to match what they're actually trying to do.
// In short