RAG
Retrieval-Augmented Generation
RAG is the practice of retrieving relevant content from an external source and inserting it into the model's context before generation. It's a practical solution to a real constraint: models have fixed knowledge cutoffs and can't fit every relevant document in their context at once. Understanding how retrieval fits into the inference pipeline — and where it breaks — makes the difference between a system that works and one that just looks like it works.
Why RAG exists
A language model's knowledge is frozen at training time. It can't look things up, doesn't know about recent events, and can't access private data your organisation holds. The naive solutions are to fine-tune the model on your data (expensive, slow to update, easy to overfit) or to dump everything into the context window (impractical at scale, expensive, noisy). RAG offers a middle path: retrieve only what's relevant, when it's needed.
The core idea is straightforward. Before sending a user's query to the model, you run a retrieval step that finds the most relevant documents or passages from a corpus. You insert those passages into the model's context. The model then answers with access to that specific content. The model itself doesn't need to memorise the corpus — it just reads from what you hand it at inference time.
The RAG pipeline, step by step
Document ingestion & chunking
Raw documents (PDFs, webpages, databases) are split into chunks. Chunk size and strategy significantly affect downstream quality.
Embedding
Each chunk is passed through an embedding model to produce a dense vector representation. These vectors are stored in a vector database.
Query embedding (at runtime)
The user's query is embedded with the same model used to embed the corpus. This converts the query into the same vector space.
Approximate nearest neighbour search
The query vector is searched against the indexed chunk vectors and the top-k most similar chunks are retrieved (typically k=5–20).
Reranking (optional)
A cross-encoder model scores query–chunk pairs more precisely and re-orders the retrieved results. Slower but higher quality than vector search alone.
Context assembly & generation
Retrieved chunks are formatted and inserted into the prompt. The LLM generates a response using the provided content.
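Stripped to its essentials, the whole pipeline fits in a short script. The sketch below is a minimal illustration, assuming the sentence-transformers library, a toy in-memory corpus, exact dot-product search in place of a real ANN index, and a placeholder prompt format; it stops at the assembled prompt rather than calling any particular LLM API.

```python
# Minimal RAG pipeline sketch: embed a small corpus, retrieve, assemble a prompt.
# Assumes `pip install sentence-transformers`; corpus and prompt format are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

chunks = [
    "The refund window is 30 days from the date of delivery.",
    "Shipping to EU countries takes 3-5 business days.",
    "Gift cards are non-refundable and never expire.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")               # same model for corpus and queries
chunk_vecs = model.encode(chunks, normalize_embeddings=True)  # one vector per chunk

def retrieve(query: str, k: int = 2) -> list[str]:
    """Exact top-k search; a production system would use an ANN index instead."""
    q_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q_vec                               # cosine similarity (vectors are normalised)
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]

def build_prompt(query: str, retrieved: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

query = "How long do I have to return an item?"
prompt = build_prompt(query, retrieve(query))
# `prompt` would now be sent to whatever LLM the system uses.
```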
Chunking
Chunking is one of the most consequential decisions in a RAG system, and one of the least studied. The goal is to split documents into pieces small enough that each chunk is topically coherent and fits in context, but large enough that individual chunks contain enough information to be useful on their own.
Chunk size tradeoffs
Small chunks (100–300 tokens) give the retrieval system more precision — each chunk is about one specific thing, so retrieval signals are cleaner. But small chunks lose context: a sentence that references a table from the previous paragraph will be retrieved without that table. The model gets partial information.
Large chunks (500–1500 tokens) carry more context around each piece of information. But retrieval quality degrades — a large chunk about a general topic will match many different queries, making it harder for the retrieval system to be selective. Inserting large chunks also consumes more of the context window per retrieved result.
A common middle ground is chunks of 300–600 tokens with a small overlap between adjacent chunks (50–100 tokens) so that information at chunk boundaries appears in at least one chunk intact.
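A minimal fixed-size chunker with overlap might look like the sketch below. It uses whitespace-separated words as a rough stand-in for model tokens; a real implementation would count tokens with the embedding model's own tokenizer, and the default sizes here are illustrative.

```python
def chunk_with_overlap(text: str, chunk_size: int = 400, overlap: int = 75) -> list[str]:
    """Split text into fixed-size chunks with overlap so content at chunk
    boundaries appears intact in at least one chunk. Sizes are in words
    as a rough proxy for tokens."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        if start + chunk_size >= len(words):
            break
    return chunks
```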
Chunking strategies
Fixed-size character or token splitting is the default in most frameworks, but it often splits in the middle of sentences or paragraphs. Sentence-aware splitting is better — split on sentence boundaries and aggregate until you reach a target size. Semantic chunking (using an embedding model to identify topic boundaries) can further improve coherence at the cost of more preprocessing compute.
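Sentence-aware splitting can be sketched as follows: detect sentence boundaries (here with a crude regex; a real implementation would use a proper sentence segmenter) and aggregate whole sentences until a target size is reached.

```python
import re

def sentence_chunks(text: str, target_words: int = 400) -> list[str]:
    """Aggregate whole sentences into chunks of roughly target_words,
    so no chunk ends mid-sentence. Boundary detection is deliberately crude."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        if current and count + words > target_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```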
For structured documents — code, tables, legal contracts, technical specifications — generic splitting strategies often fail. Code should chunk at function or class boundaries, not arbitrary token counts. Tables should be kept intact or converted to text before chunking. Document structure is retrieval signal; destroying it degrades results.
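For Python source specifically, chunking at function and class boundaries can be done with the standard-library ast module. The sketch below illustrates the idea only; it keeps top-level definitions and ignores module-level code between them.

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """Chunk Python code at top-level function and class boundaries
    rather than at arbitrary token counts."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment:
                chunks.append(segment)
    return chunks
```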
Embedding models
The embedding model is what converts text into vectors for storage and comparison. The choice of embedding model matters because retrieval quality is only as good as how well the embedding space captures semantic similarity for your specific domain and task.
General-purpose embedding models (trained on broad web text) work well for general queries. Domain-specific corpora — legal, medical, code, scientific — often benefit from models fine-tuned on that domain. Using a general embedding model on a specialised corpus is a common source of retrieval quality problems that doesn't show up in standard benchmarks.
The embedding model used to index the corpus and the model used to embed queries at runtime must be the same model — or at minimum, models trained to produce vectors in the same space. Mixing embedding models produces vectors that aren't comparable, and retrieval fails in subtle ways.
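A cheap safeguard is to record which model produced the index and check it at query time, so a mismatch fails loudly instead of silently degrading retrieval. The metadata layout below is an assumption for illustration, not any particular database's schema.

```python
# Hypothetical metadata stored alongside the index when it was built.
INDEX_METADATA = {"embedding_model": "all-MiniLM-L6-v2", "dimension": 384}

def check_query_model(model_name: str, dimension: int) -> None:
    """Refuse to run queries embedded with a different model than the index."""
    if model_name != INDEX_METADATA["embedding_model"]:
        raise ValueError(
            f"Query embedding model {model_name!r} does not match the model "
            f"used to build the index ({INDEX_METADATA['embedding_model']!r})."
        )
    if dimension != INDEX_METADATA["dimension"]:
        raise ValueError("Embedding dimension mismatch between query and index.")
```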
Vector search
Once the corpus is embedded, retrieval uses approximate nearest neighbour (ANN) search to find the chunks most similar to a query vector. Similarity is typically measured by cosine similarity or dot product. The "approximate" in ANN matters: exact nearest neighbour search over millions of vectors is too slow for real-time systems, so index structures (HNSW, IVF, etc.) trade a small amount of recall for much faster search.
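As a concrete example, a small HNSW index can be built with the hnswlib library (one of several ANN implementations); the random vectors and parameter values below are illustrative stand-ins rather than tuned settings.

```python
import hnswlib
import numpy as np

dim = 384                                                   # must match the embedding model's output size
vectors = np.random.rand(10_000, dim).astype(np.float32)    # stand-in for real chunk embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(vectors), ef_construction=200, M=16)
index.add_items(vectors, np.arange(len(vectors)))
index.set_ef(64)                                            # higher ef = better recall, slower queries

query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=10)            # ids of the 10 nearest chunks
```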
Vector databases (Pinecone, Weaviate, Qdrant, pgvector) handle this indexing and search. For smaller corpora (under a few hundred thousand chunks), Postgres with pgvector is often sufficient. At larger scale, dedicated vector databases offer better performance and operational tooling.
Reranking
Embedding-based retrieval is fast but imprecise. Bi-encoders (the architecture used in most embedding models) encode query and document independently, then compare vectors. This is fast because documents are pre-encoded, but it loses the direct interaction between query tokens and document tokens.
A cross-encoder reranker takes a query-document pair and scores them jointly — the full attention mechanism attends to both simultaneously. This is slower (you can't pre-compute, since you need the query) but produces substantially more accurate relevance scores. The standard pattern is to retrieve k=20–50 candidates from vector search, then rerank with a cross-encoder and keep the top 3–5 for the context.
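With the sentence-transformers library, the retrieve-then-rerank pattern looks roughly like this; the model name is a commonly used public cross-encoder, chosen here purely for illustration.

```python
from sentence_transformers import CrossEncoder

# Rerank candidates from vector search with a cross-encoder, keeping the best few.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    """Score each (query, candidate) pair jointly and keep the top results."""
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [c for c, _ in ranked[:keep]]
```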
Reranking adds latency (typically 50–200ms depending on model size and number of candidates) but often improves answer quality enough to justify it, particularly when retrieval precision matters more than throughput.
Where RAG systems fail
RAG introduces multiple points of failure that are independent from the quality of the underlying language model. Most RAG quality problems are retrieval problems, not generation problems — the model is doing its best with what it was given.
| Failure | Root cause | Fix |
|---|---|---|
| Retrieval miss | Relevant content wasn't retrieved (wrong chunks, poor embedding) | Improve chunking strategy; try domain-specific embeddings; add reranking |
| Context poisoning | Irrelevant chunks retrieved and inserted; model pulls from them | Reduce k; add reranking; add a filtering step before context assembly |
| Lost in the middle | Relevant chunk retrieved but buried in a long context; model ignores it | Put most relevant content at the start or end; reduce total context length |
| Chunk truncation | Answer spans a chunk boundary; neither chunk alone contains the full answer | Increase chunk overlap; use larger chunks for the domain; add parent-chunk retrieval |
| Query-corpus mismatch | Query vocabulary doesn't match how content is indexed (acronyms, synonyms) | Query expansion; hybrid search (vector + keyword); query rewriting |
| Hallucination despite retrieval | Model ignores retrieved content and generates from parametric memory | Prompt engineering (instruct the model to cite sources); evaluate citation behaviour |
Hybrid search
Pure vector search is good at semantic matching — finding documents that mean the same thing even if they use different words. But it can be poor at exact matching — finding specific product names, codes, acronyms, or proper nouns that appear verbatim. Keyword search (BM25) excels at exact matching but fails at semantic similarity.
Hybrid search combines both: run vector search and BM25 in parallel, then merge the result lists using reciprocal rank fusion or a learned fusion model. For most production RAG systems, hybrid search outperforms either approach alone, particularly on corpora with a mix of natural language and structured identifiers.
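Reciprocal rank fusion itself is only a few lines: each document's fused score is the sum of 1/(k + rank) over the result lists it appears in, with k around 60 in the original formulation. The sketch below assumes the inputs are ranked lists of document ids.

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists (e.g. from vector search and BM25).
    Each document scores 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: merge a vector-search ranking with a BM25 ranking.
fused = reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d9", "d3"]])
```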
Latency anatomy of a RAG request
A RAG request has more latency components than a direct generation call: query embedding, ANN search, optional reranking, context assembly, and then the LLM generation. The LLM generation is usually the dominant cost, but the retrieval steps are additive.
| Step | Typical latency | Scales with |
|---|---|---|
| Query embedding | 10–50ms | Embedding model size |
| ANN vector search | 5–30ms | Index size; k |
| Reranking (optional) | 50–300ms | Reranker model size; number of candidates |
| LLM prefill | 500ms–3s | Context length (retrieved chunks dominate) |
| LLM decode | Seconds | Output length; model size |
The LLM generation step is almost always the bottleneck — but RAG makes it more expensive than a raw generation call because it adds tokens to the context. Every retrieved chunk extends the prefill length, increasing time to first token (TTFT) and KV cache memory pressure. Keeping retrieved context focused (fewer, better chunks) directly reduces generation cost.
// Evaluate retrieval separately from generation
Most teams evaluate their RAG system end-to-end: does the final answer match the expected answer? This conflates retrieval quality with generation quality. A better approach is to evaluate retrieval independently — for a set of test queries, did the retrieval step return the relevant passages? Fix retrieval first. The model can't answer well from bad context regardless of its capability.
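A minimal retrieval-only evaluation is a hit-rate@k loop over a labelled set of queries: for each query, did at least one known-relevant chunk appear in the top k? The `retrieve` function and test set below are placeholders for whatever your own system provides.

```python
def hit_rate_at_k(test_set, retrieve, k: int = 5) -> float:
    """test_set: list of (query, relevant_chunk_ids) pairs.
    retrieve(query, k) -> ranked list of chunk ids (supplied by your system).
    Returns the fraction of queries where any relevant chunk was retrieved."""
    hits = 0
    for query, relevant_ids in test_set:
        retrieved = set(retrieve(query, k))
        if retrieved & set(relevant_ids):
            hits += 1
    return hits / len(test_set)
```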
// In short