Intent Engineering
Closing the gap between what users say and what they mean
A language model responds to the literal content of its input. Users communicate with goals in mind that often aren't fully expressed in what they type. Intent engineering is the discipline of designing systems — prompts, pipelines, retrieval strategies, output structures — that reliably close that gap. It's where inference engineering meets product thinking.
The gap between input and intent
When a user sends a message to an AI system, they're rarely expressing their full intent. They're expressing the part of their intent that they expect to be understood — which is a much smaller thing. Human communication works this way because it's heavily context-dependent: we rely on shared assumptions, implicit goals, and the other party's ability to infer what we actually want.
Language models are trained on human-generated text, so they develop some capacity to infer intent beyond literal meaning. But that inference happens inside the model, without visibility into the actual system the user is interacting with, the user's real goal, or the constraints that define a "good" response in that context. The system designer has to do the work of surfacing those things — either by providing context to the model or by structuring the interaction in ways that reduce the gap before the model ever sees the input.
// intent_gap.txt — three layers of what a user wants

What they typed: "fix my code"
What they meant: Make this function behave correctly without changing its interface, and explain what was wrong so I can avoid it next time

What they typed: "summarise this document"
What they meant: Extract the three things I need to act on — I have 90 seconds before a call

What they typed: "write a cold email to a CTO"
What they meant: Sound like a peer, not a vendor — no jargon, no flattery, one paragraph, specific to their industry

Layers of intent
Intent has structure. It helps to decompose it explicitly before deciding what the system should do with it.
How models handle intent
Models trained on human feedback develop a working model of intent — they learn to infer what someone probably wants from the pattern of their words, and they weight their responses toward what satisfies that inferred goal. This happens implicitly, not through explicit reasoning about intent.
The quality of this inference varies significantly by domain, specificity, and ambiguity. Short, ambiguous queries are hard. Queries that carry the goal in the words themselves are easy. Domain-specific queries where the intended meaning only makes sense with background knowledge can fail completely if the model's training didn't include that domain.
One important consequence of how models learn intent: they generalise from training data patterns. If your use case is genuinely novel — a workflow or task type that wasn't common in training — the model's intent inference will be unreliable, and the system design has to compensate. You can't assume the model has a correct prior for what your users are trying to do.
Intent failure modes
When intent engineering fails, the failure usually falls into one of a small number of recognisable patterns.
Literal compliance
"Shorten this email" → model deletes every other sentence
The model satisfies the literal request while violating the obvious underlying goal. Common when the instruction is underspecified and the model lacks context about why.
Goal substitution
"Help me write a cover letter" → model writes a generic one from scratch instead of working with the user's draft
The model correctly identifies the domain but substitutes its own interpretation of the goal. Often happens when common patterns in training override the specific request.
Implicit standard violation
"Clean up my essay" → model rewrites it in a completely different voice
The model accomplishes the surface task but violates an assumption the user considered obvious. These failures often generate strong user frustration because they feel like the model "wasn't listening."
Overcorrection
"Make this more formal" → model adds legal disclaimers and removes all personality
The model applies the instruction beyond the intended scope. Usually a calibration problem — it doesn't know when to stop, because the stopping point was implicit.
Context blindness
User asks a follow-up question that only makes sense in the context of earlier messages — model treats it as standalone
The model doesn't connect the current request to prior context. Can happen when context window management drops relevant history, or when the model fails to maintain state.
Autonomy override
"Translate this literally" → model "improves" the translation based on its own judgement
The model substitutes its own preferences for the user's explicit choice. Common in creative tasks where the model has strong priors about what "good" looks like.
Designing systems for intent
Intent engineering happens at multiple points in system design — not just in the system prompt. Each architectural decision either narrows or widens the gap between what users express and what the system delivers.
System prompt as intent contract
The system prompt is where you encode your best understanding of what users of this system are actually trying to achieve. A good system prompt doesn't just describe the task — it articulates the underlying goal, the implicit standards that should never be violated, and the constraints that should be treated as fixed. It establishes the interpretive frame the model uses when user intent is ambiguous.
The temptation is to write a system prompt that covers every edge case. This usually produces prompts that are too long to be followed reliably and that create contradictions the model has to resolve on its own. A better approach is to identify the three or four things that matter most — the constraints whose violation would produce unacceptable outputs — and focus the system prompt there. Everything else is noise that dilutes the signal.
Few-shot examples as intent demonstration
Natural language instructions are ambiguous by nature. Showing the model example input-output pairs removes much of that ambiguity. A well-chosen example communicates not just what the output looks like, but how to handle the range of cases that the instruction doesn't fully specify. Two or three strong examples often do more for intent alignment than a paragraph of instructions.
The examples should be representative of the actual distribution of inputs — not idealised cases that the model already handles well. If your system regularly receives short, vague queries, your examples should show how to handle short, vague queries. If users frequently provide incomplete context, examples should demonstrate how to ask for clarification rather than hallucinate what's missing.
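A minimal sketch of this, assuming an OpenAI-style chat message format (a common convention, not a requirement): the examples mirror the short, vague inputs the system actually receives, and one of them demonstrates asking for clarification instead of guessing. The example contents are invented for illustration.

```python
# Hypothetical few-shot examples, formatted as chat messages. They are chosen
# to match the real input distribution (short, underspecified queries), and
# the first pair demonstrates clarification rather than hallucination.

FEW_SHOT = [
    {"role": "user", "content": "fix my code"},
    {"role": "assistant", "content": (
        "I don't see any code in your message yet. Paste the function "
        "and I'll fix it without changing its interface."
    )},
    {"role": "user", "content": "summarise this\n<doc>...quarterly report...</doc>"},
    {"role": "assistant", "content": (
        "Three things to act on:\n1. ...\n2. ...\n3. ..."
    )},
]

def build_messages(system_prompt: str, user_input: str) -> list[dict]:
    """Interleave system prompt, demonstrations, and the live request."""
    return (
        [{"role": "system", "content": system_prompt}]
        + FEW_SHOT
        + [{"role": "user", "content": user_input}]
    )
```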
Structured input collection
Sometimes the right solution is to gather more intent information before sending anything to the model. A form that asks for tone, length, audience, and purpose before generating a document isn't friction — it's intent capture. The more information you have about what the user actually wants, the less the model has to infer, and the more reliable the output.
This is most applicable in workflows where users can be expected to specify their requirements upfront. In conversational contexts, the equivalent is designing the system to ask clarifying questions when the initial message is genuinely ambiguous — rather than guessing and producing an output that may miss the mark entirely.
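One way to sketch this kind of upfront intent capture, assuming a Python stack. The field names follow the tone/length/audience/purpose example above; `DocumentRequest` and the example values are hypothetical:

```python
from dataclasses import dataclass

# Hypothetical intent-capture form: each field is information the model
# would otherwise have to infer, collected before any generation happens.

@dataclass
class DocumentRequest:
    task: str          # what to write
    tone: str          # e.g. "peer, not vendor"
    max_words: int     # hard length budget
    audience: str      # who will read it
    purpose: str       # what the reader should do afterwards

    def to_prompt(self) -> str:
        """Render the captured intent as explicit instructions."""
        return (
            f"{self.task}\n"
            f"Tone: {self.tone}. Audience: {self.audience}.\n"
            f"Purpose: {self.purpose}.\n"
            f"Stay under {self.max_words} words."
        )

req = DocumentRequest(
    task="Write a cold email introducing our observability tool.",
    tone="peer, not vendor",
    max_words=120,
    audience="CTO at a mid-size fintech",
    purpose="agree to a 20-minute call",
)
```

Compare `req.to_prompt()` with the bare request "write a cold email to a CTO": every field here is intent the model previously had to guess.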
Output structure as implicit intent specification
Defining the structure of the output is a form of intent specification. If you tell the model to produce a JSON object with specific fields, you're constraining not just the format but the interpretive choices it makes along the way. A field called concise_summary communicates length and style expectations that a generic "summary" instruction doesn't. The structure encodes part of the intent.
Structured outputs are also more evaluable. If the model should produce a list of action items, you can check whether it did. If it should produce a response under 150 words, you can measure that. Making intent explicit through structure makes it possible to verify whether intent was satisfied, which is the prerequisite for systematic improvement.
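A sketch of that verification step, under the assumption that the system asks for a JSON object with a `concise_summary` field and a list of `action_items` (both names and the 150-word limit are illustrative):

```python
import json

# Hypothetical verifier: checks whether a model's raw output satisfies the
# structural part of the intent. Returns a list of violations (empty = ok).

def verify_output(raw: str) -> list[str]:
    problems = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    summary = data.get("concise_summary", "")
    if not summary:
        problems.append("missing concise_summary")
    elif len(summary.split()) > 150:
        problems.append("concise_summary exceeds 150 words")
    items = data.get("action_items")
    if not isinstance(items, list) or not items:
        problems.append("action_items missing or empty")
    return problems
```

Checks like these can't tell you whether the summary is a good one, but they turn part of the intent into a pass/fail signal that can gate retries and feed dashboards.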
Intent in multi-turn systems
Single-turn intent is already complex. Multi-turn systems add the complication of intent that evolves across a conversation — the user starts with one goal, learns something, and shifts to a related but different goal. Or they express dissatisfaction with an output without being explicit about what they wanted instead.
Multi-turn intent engineering requires that the system maintain a coherent model of what the user is trying to achieve across the entire conversation, not just the current message. This means thoughtful context window management — what history to retain, what to summarise, and what to drop — and prompts that explicitly direct the model to reason about the current request in light of the conversation so far.
A common failure pattern in multi-turn systems is treating each message as an independent request. The model responds to what the user just said without connecting it to the thread of what they've been working toward. This produces conversations that feel like they have no memory of themselves, and it consistently fails users who have spent multiple turns building toward something specific.
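One common shape for this, sketched under assumptions: carry a running summary of the user's goal plus the last few raw turns, and tell the model explicitly to interpret the new message against that goal. The function name, turn budget, and prompt wording are illustrative; producing `goal_summary` (typically a cheap model call) is omitted.

```python
# Hypothetical multi-turn context builder: a goal summary stands in for the
# older history, and the system message instructs the model to connect the
# current request to that goal rather than treat it as standalone.

RECENT_TURNS = 6  # how many raw messages to keep verbatim

def build_context(goal_summary: str, history: list[dict],
                  user_input: str) -> list[dict]:
    recent = history[-RECENT_TURNS:]
    system = (
        "What the user has been working toward so far:\n"
        f"{goal_summary}\n\n"
        "Interpret the next message in light of that goal, "
        "not as a standalone request."
    )
    return ([{"role": "system", "content": system}]
            + recent
            + [{"role": "user", "content": user_input}])
```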
Evaluating intent satisfaction
The practical question is: how do you know whether your system is correctly handling user intent? Direct evaluation — humans reviewing whether outputs matched what users actually wanted — is the most reliable but also the most expensive. For most teams, it's the right baseline to establish before trying to automate.
| Evaluation method | What it measures | Limitation |
|---|---|---|
| Human review | Whether output matched actual user intent | Expensive; doesn't scale; annotator agreement varies |
| Task completion rate | Whether the user accomplished their goal (via follow-up survey or implicit signals) | Delayed signal; doesn't identify what went wrong |
| Regeneration/edit rate | Whether users accepted the output or tried again | Conflates many failure types; doesn't capture silent dissatisfaction |
| LLM-as-judge | A second model evaluates whether the output matches the stated intent | Reliable only if the evaluator model understands the intent better than the generator |
| Structured field verification | Whether structured output fields are populated correctly | Measures format compliance, not whether the content satisfied the goal |
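For the LLM-as-judge row, one mitigation for its limitation is to hand the evaluator the stated intent explicitly rather than asking it to re-infer intent from the input. A sketch of such a judge prompt, with hypothetical wording and the model call itself left out:

```python
# Hypothetical judge prompt: the evaluator sees the intent spelled out, so it
# only has to compare output against intent, not reconstruct the intent.

JUDGE_TEMPLATE = """\
The user's stated intent: {intent}
The user's input: {user_input}
The system's output: {output}

Did the output satisfy the stated intent? Answer with exactly one word,
PASS or FAIL, then one sentence naming which part of the intent was
missed (or "none")."""

def build_judge_prompt(intent: str, user_input: str, output: str) -> str:
    return JUDGE_TEMPLATE.format(
        intent=intent, user_input=user_input, output=output
    )
```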
The most useful signal is often the simplest: if users frequently rephrase their request and try again, or follow up with corrections, intent is consistently being missed. The pattern of corrections often identifies exactly which part of the intent the system is failing to capture — which is much more actionable than an aggregate quality score.
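The rephrase-and-retry signal can be approximated cheaply. A sketch, with the token-overlap measure and the threshold as illustrative assumptions (real systems might use embeddings instead):

```python
# Hypothetical detector for rephrased retries: consecutive user messages with
# high word overlap suggest the previous response missed the intent.

def _overlap(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two messages."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def missed_intent_turns(user_messages: list[str],
                        threshold: float = 0.3) -> list[int]:
    """Indices of messages that look like rephrased retries of the previous one."""
    return [
        i for i in range(1, len(user_messages))
        if _overlap(user_messages[i - 1], user_messages[i]) >= threshold
    ]

msgs = [
    "shorten this email please",
    "shorten this email but keep the greeting",
    "what's the weather",
]
```

Clustering the flagged pairs by what changed between attempts is what turns this from a rate into a diagnosis.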
// Intent engineering is ongoing
User intent evolves as a product evolves. New features create new use cases. New user populations bring different communication styles and different background assumptions. Intent engineering isn't a one-time calibration — it's a continuous discipline of observing how users actually interact with the system and updating the design to match what they're actually trying to do.
// In short