RAG
Retrieval-Augmented Generation
RAG is the practice of retrieving relevant content from an external source and inserting it into the model's context before generation. It's a practical solution to a real constraint: models have fixed knowledge cutoffs and can't fit every relevant document in their context at once. Understanding how retrieval fits into the inference pipeline — and where it breaks — makes the difference between a system that works and one that just looks like it works.
Why RAG exists
A language model's knowledge is frozen at training time. It can't look things up, doesn't know about recent events, and can't access private data your organisation holds. The naive solutions are to fine-tune the model on your data (expensive, slow to update, easy to overfit) or to dump everything into the context window (impractical at scale, expensive, noisy). RAG offers a middle path: retrieve only what's relevant, when it's needed.
The core idea is straightforward. Before sending a user's query to the model, you run a retrieval step that finds the most relevant documents or passages from a corpus. You insert those passages into the model's context. The model then answers with access to that specific content. The model itself doesn't need to memorise the corpus — it just reads from what you hand it at inference time.
The RAG pipeline, step by step
Document ingestion & chunking
Raw documents (PDFs, webpages, databases) are split into chunks. Chunk size and strategy significantly affect downstream quality.
Embedding
Each chunk is passed through an embedding model to produce a dense vector representation. These vectors are stored in a vector database.
Query embedding (at runtime)
The user's query is embedded with the same model used to embed the corpus. This converts the query into the same vector space.
Approximate nearest neighbour search
The query vector is searched against the indexed chunk vectors and the top-k most similar chunks are retrieved (typically k=5–20).
Reranking (optional)
A cross-encoder model scores query–chunk pairs more precisely and re-orders the retrieved results. Slower but higher quality than vector search alone.
Context assembly & generation
Retrieved chunks are formatted and inserted into the prompt. The LLM generates a response using the provided content.
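Stripped to its essentials, the whole pipeline fits in a short script. The sketch below is a minimal illustration, assuming the sentence-transformers library, a toy in-memory corpus, exact dot-product search in place of a real ANN index, and a placeholder prompt format; it stops at the assembled prompt rather than calling any particular LLM API.

```python
# Minimal RAG pipeline sketch: embed a small corpus, retrieve, assemble a prompt.
# Assumes `pip install sentence-transformers`; corpus and prompt format are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

chunks = [
    "The refund window is 30 days from the date of delivery.",
    "Shipping to EU countries takes 3-5 business days.",
    "Gift cards are non-refundable and never expire.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")               # same model for corpus and queries
chunk_vecs = model.encode(chunks, normalize_embeddings=True)  # one vector per chunk

def retrieve(query: str, k: int = 2) -> list[str]:
    """Exact top-k search; a production system would use an ANN index instead."""
    q_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q_vec                               # cosine similarity (vectors are normalised)
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]

def build_prompt(query: str, retrieved: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

query = "How long do I have to return an item?"
prompt = build_prompt(query, retrieve(query))
# `prompt` would now be sent to whatever LLM the system uses.
```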
Chunking
Chunking is one of the most consequential decisions in a RAG system, and one of the least studied. The goal is to split documents into pieces small enough that each chunk is topically coherent and fits in context, but large enough that individual chunks contain enough information to be useful on their own.
Chunk size tradeoffs
Small chunks (100–300 tokens) give the retrieval system more precision — each chunk is about one specific thing, so retrieval signals are cleaner. But small chunks lose context: a sentence that references a table from the previous paragraph will be retrieved without that table. The model gets partial information.
Large chunks (500–1500 tokens) carry more context around each piece of information. But retrieval quality degrades — a large chunk about a general topic will match many different queries, making it harder for the retrieval system to be selective. Inserting large chunks also consumes more of the context window per retrieved result.
A common middle ground is chunks of 300–600 tokens with a small overlap between adjacent chunks (50–100 tokens) so that information at chunk boundaries appears in at least one chunk intact.
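A minimal fixed-size chunker with overlap might look like the sketch below. It uses whitespace-separated words as a rough stand-in for model tokens; a real implementation would count tokens with the embedding model's own tokenizer, and the default sizes here are illustrative.

```python
def chunk_with_overlap(text: str, chunk_size: int = 400, overlap: int = 75) -> list[str]:
    """Split text into fixed-size chunks with overlap so content at chunk
    boundaries appears intact in at least one chunk. Sizes are in words
    as a rough proxy for tokens."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        if start + chunk_size >= len(words):
            break
    return chunks
```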
Chunking strategies
Fixed-size character or token splitting is the default in most frameworks, but it often splits in the middle of sentences or paragraphs. Sentence-aware splitting is better — split on sentence boundaries and aggregate until you reach a target size. Semantic chunking (using an embedding model to identify topic boundaries) can further improve coherence at the cost of more preprocessing compute.
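Sentence-aware splitting can be sketched as follows: detect sentence boundaries (here with a crude regex; a real implementation would use a proper sentence segmenter) and aggregate whole sentences until a target size is reached.

```python
import re

def sentence_chunks(text: str, target_words: int = 400) -> list[str]:
    """Aggregate whole sentences into chunks of roughly target_words,
    so no chunk ends mid-sentence. Boundary detection is deliberately crude."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        if current and count + words > target_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```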
For structured documents — code, tables, legal contracts, technical specifications — generic splitting strategies often fail. Code should chunk at function or class boundaries, not arbitrary token counts. Tables should be kept intact or converted to text before chunking. Document structure is retrieval signal; destroying it degrades results.
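For Python source specifically, chunking at function and class boundaries can be done with the standard-library ast module. The sketch below illustrates the idea only; it keeps top-level definitions and ignores module-level code between them.

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """Chunk Python code at top-level function and class boundaries
    rather than at arbitrary token counts."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment:
                chunks.append(segment)
    return chunks
```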
Embedding models
The embedding model is what converts text into vectors for storage and comparison. The choice of embedding model matters because retrieval quality is only as good as how well the embedding space captures semantic similarity for your specific domain and task.
General-purpose embedding models (trained on broad web text) work well for general queries. Domain-specific corpora — legal, medical, code, scientific — often benefit from models fine-tuned on that domain. Using a general embedding model on a specialised corpus is a common source of retrieval quality problems that doesn't show up in standard benchmarks.
The embedding model used to index the corpus and the model used to embed queries at runtime must be the same model — or at minimum, models trained to produce vectors in the same space. Mixing embedding models produces vectors that aren't comparable, and retrieval fails in subtle ways.
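A cheap safeguard is to record which model produced the index and check it at query time, so a mismatch fails loudly instead of silently degrading retrieval. The metadata layout below is an assumption for illustration, not any particular database's schema.

```python
# Hypothetical metadata stored alongside the index when it was built.
INDEX_METADATA = {"embedding_model": "all-MiniLM-L6-v2", "dimension": 384}

def check_query_model(model_name: str, dimension: int) -> None:
    """Refuse to run queries embedded with a different model than the index."""
    if model_name != INDEX_METADATA["embedding_model"]:
        raise ValueError(
            f"Query embedding model {model_name!r} does not match the model "
            f"used to build the index ({INDEX_METADATA['embedding_model']!r})."
        )
    if dimension != INDEX_METADATA["dimension"]:
        raise ValueError("Embedding dimension mismatch between query and index.")
```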
Vector search
Once the corpus is embedded, retrieval uses approximate nearest neighbour (ANN) search to find the chunks most similar to a query vector. Similarity is typically measured by cosine similarity or dot product. The "approximate" in ANN matters: exact nearest neighbour search over millions of vectors is too slow for real-time systems, so index structures (HNSW, IVF, etc.) trade a small amount of recall for much faster search.
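As a concrete example, a small HNSW index can be built with the hnswlib library (one of several ANN implementations); the random vectors and parameter values below are illustrative stand-ins rather than tuned settings.

```python
import hnswlib
import numpy as np

dim = 384                                                   # must match the embedding model's output size
vectors = np.random.rand(10_000, dim).astype(np.float32)    # stand-in for real chunk embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(vectors), ef_construction=200, M=16)
index.add_items(vectors, np.arange(len(vectors)))
index.set_ef(64)                                            # higher ef = better recall, slower queries

query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=10)            # ids of the 10 nearest chunks
```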
Vector databases (Pinecone, Weaviate, Qdrant, pgvector) handle this indexing and search. For smaller corpora (under a few hundred thousand chunks), Postgres with pgvector is often sufficient. At larger scale, dedicated vector databases offer better performance and operational tooling.
Reranking
Embedding-based retrieval is fast but imprecise. Bi-encoders (the architecture used in most embedding models) encode query and document independently, then compare vectors. This is fast because documents are pre-encoded, but it loses the direct interaction between query tokens and document tokens.
A cross-encoder reranker takes a query-document pair and scores them jointly — the full attention mechanism attends to both simultaneously. This is slower (you can't pre-compute, since you need the query) but produces substantially more accurate relevance scores. The standard pattern is to retrieve k=20–50 candidates from vector search, then rerank with a cross-encoder and keep the top 3–5 for the context.
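With the sentence-transformers library, the retrieve-then-rerank pattern looks roughly like this; the model name is a commonly used public cross-encoder, chosen here purely for illustration.

```python
from sentence_transformers import CrossEncoder

# Rerank candidates from vector search with a cross-encoder, keeping the best few.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    """Score each (query, candidate) pair jointly and keep the top results."""
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [c for c, _ in ranked[:keep]]
```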
Reranking adds latency (typically 50–200ms depending on model size and number of candidates) but often improves answer quality enough to justify it, particularly when retrieval precision matters more than throughput.
Where RAG systems fail
RAG introduces multiple points of failure that are independent from the quality of the underlying language model. Most RAG quality problems are retrieval problems, not generation problems — the model is doing its best with what it was given.
| Failure | Root cause | Fix |
|---|---|---|
| Retrieval miss | Relevant content wasn't retrieved (wrong chunks, poor embedding) | Improve chunking strategy; try domain-specific embeddings; add reranking |
| Context poisoning | Irrelevant chunks retrieved and inserted; model pulls from them | Reduce k; add reranking; add a filtering step before context assembly |
| Lost in the middle | Relevant chunk retrieved but buried in a long context; model ignores it | Put most relevant content at the start or end; reduce total context length |
| Chunk truncation | Answer spans a chunk boundary; neither chunk alone contains the full answer | Increase chunk overlap; use larger chunks for the domain; add parent-chunk retrieval |
| Query-corpus mismatch | Query vocabulary doesn't match how content is indexed (acronyms, synonyms) | Query expansion; hybrid search (vector + keyword); query rewriting |
| Hallucination despite retrieval | Model ignores retrieved content and generates from parametric memory | Prompt engineering (instruct the model to cite sources); evaluate citation behaviour |
Hybrid search
Pure vector search is good at semantic matching — finding documents that mean the same thing even if they use different words. But it can be poor at exact matching — finding specific product names, codes, acronyms, or proper nouns that appear verbatim. Keyword search (BM25) excels at exact matching but fails at semantic similarity.
Hybrid search combines both: run vector search and BM25 in parallel, then merge the result lists using reciprocal rank fusion or a learned fusion model. For most production RAG systems, hybrid search outperforms either approach alone, particularly on corpora with a mix of natural language and structured identifiers.
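Reciprocal rank fusion itself is only a few lines: each document's fused score is the sum of 1/(k + rank) over the result lists it appears in, with k around 60 in the original formulation. The sketch below assumes the inputs are ranked lists of document ids.

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists (e.g. from vector search and BM25).
    Each document scores 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: merge a vector-search ranking with a BM25 ranking.
fused = reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d9", "d3"]])
```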
Latency anatomy of a RAG request
A RAG request has more latency components than a direct generation call: query embedding, ANN search, optional reranking, context assembly, and then the LLM generation. The LLM generation is usually the dominant cost, but the retrieval steps are additive.
| Step | Typical latency | Scales with |
|---|---|---|
| Query embedding | 10–50ms | Embedding model size |
| ANN vector search | 5–30ms | Index size; k |
| Reranking (optional) | 50–300ms | Reranker model size; number of candidates |
| LLM prefill | 500ms–3s | Context length (retrieved chunks dominate) |
| LLM decode | Seconds | Output length; model size |
The LLM generation step is almost always the bottleneck — but RAG makes it more expensive than a raw generation call because it adds tokens to the context. Every retrieved chunk extends the prefill length, increasing time to first token (TTFT) and KV cache memory pressure. Keeping retrieved context focused (fewer, better chunks) directly reduces generation cost.
// Evaluate retrieval separately from generation
Most teams evaluate their RAG system end-to-end: does the final answer match the expected answer? This conflates retrieval quality with generation quality. A better approach is to evaluate retrieval independently — for a set of test queries, did the retrieval step return the relevant passages? Fix retrieval first. The model can't answer well from bad context regardless of its capability.
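A minimal retrieval-only evaluation is a hit-rate@k loop over a labelled set of queries: for each query, did at least one known-relevant chunk appear in the top k? The `retrieve` function and test set below are placeholders for whatever your own system provides.

```python
def hit_rate_at_k(test_set, retrieve, k: int = 5) -> float:
    """test_set: list of (query, relevant_chunk_ids) pairs.
    retrieve(query, k) -> ranked list of chunk ids (supplied by your system).
    Returns the fraction of queries where any relevant chunk was retrieved."""
    hits = 0
    for query, relevant_ids in test_set:
        retrieved = set(retrieve(query, k))
        if retrieved & set(relevant_ids):
            hits += 1
    return hits / len(test_set)
```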
// In short