What is RAG?

RAG (Retrieval-Augmented Generation) is a pattern that gives a language model access to an external knowledge base at inference time. Instead of relying solely on what was baked into its weights during training, the model first retrieves relevant documents from a database, then generates a response grounded in that retrieved context.

The key insight: the model's knowledge and the model's reasoning are separated. You keep knowledge in a database you control, and use the LLM for what it's actually good at — synthesis, reasoning, and language generation.

How it works

A RAG pipeline has three stages:

  1. Indexing (offline, re-run whenever documents change): Split your documents into chunks, embed each chunk into a vector, and store the vectors in a vector database.
  2. Retrieval (at query time): Embed the user's query, find the top-k chunks whose vectors are closest to the query vector.
  3. Generation: Inject the retrieved chunks into the model's context window as a prompt prefix, then generate the answer.

User query
    ↓
[Embed query] → vector
    ↓
[Vector DB] → top-k relevant chunks
    ↓
[LLM] ← prompt = system + chunks + query
    ↓
Answer grounded in retrieved context

The quality of retrieval determines the quality of the answer. Garbage in, garbage out — no matter how good the LLM is.
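The indexing and retrieval stages can be sketched end to end in plain Python. This is illustrative only: chunk_text is a naive fixed-size splitter, and embed is a toy bag-of-words stand-in for a real embedding model, so the whole sketch runs without any external services.

```python
import math
from collections import Counter

def chunk_text(text, chunk_size=50, overlap=10):
    """Naive fixed-size chunker: split on words, with overlap so
    sentences straddling a boundary land in two chunks."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text):
    """Toy embedding: a bag-of-words count vector. A real pipeline
    would call an embedding model here instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def top_k(query, index, k=2):
    """Rank indexed chunks by cosine similarity to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda chunk: cosine(q, embed(chunk)), reverse=True)
    return ranked[:k]

docs = ("Retrieval-augmented generation retrieves relevant chunks from a "
        "vector database at query time and feeds them to the model.")
index = chunk_text(docs, chunk_size=8, overlap=2)
hits = top_k("vector database retrieval", index, k=1)
```

The overlap parameter directly addresses the coarse-chunking failure mode discussed below: without it, a fact split across a chunk boundary is retrievable from neither chunk.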

When to use it

RAG is the right choice when:

  • Your knowledge base changes frequently (product docs, internal wikis, news)
  • You need citations — answers traceable to specific source documents
  • You're working with domain-specific or proprietary information the model wasn't trained on
  • You want to reduce hallucinations on factual questions
  • Cost is a concern — fine-tuning is expensive, RAG can use any base model

Trade-offs vs fine-tuning

                          RAG                              Fine-tuning
  Knowledge update        Change the DB, no retraining     Full or partial retraining
  Cost                    Low (retrieval + inference)      High (GPU training)
  Hallucination on facts  Lower (grounded in docs)         Higher (baked in weights)
  Reasoning style / tone  Unchanged                        Can be adapted
  Latency                 Adds a retrieval step            Same as base model
  Knowledge cutoff        None — DB is always current      Frozen at training time

Rule of thumb: Use RAG for what the model knows. Use fine-tuning for how the model behaves.

Failure modes to know

  • Chunking too coarsely: relevant info gets split across chunks that aren't retrieved together
  • Chunking too finely: chunks lose context, embeddings become noisy
  • Query-document mismatch: user asks in a different "register" than how docs are written — semantic search struggles
  • Context window overflow: retrieving too many chunks can push the actual question out of focus
  • Retrieval without reranking: top-k by cosine similarity ≠ top-k by relevance to the actual question. A reranker (cross-encoder) helps.
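The reranking fix in the last bullet is a second pass: retrieve a generous top-k cheaply, then re-score each (query, candidate) pair with a more expensive model and keep the best. In the sketch below, score_pair is where a real cross-encoder forward pass would go; token_overlap is just a toy scorer so the example runs standalone.

```python
def rerank(query, candidates, score_pair, keep=3):
    """Re-score (query, candidate) pairs and keep the best. In production,
    score_pair would be a cross-encoder, which reads both texts jointly and
    is more accurate (and much slower) than comparing precomputed vectors."""
    ranked = sorted(candidates, key=lambda c: score_pair(query, c), reverse=True)
    return ranked[:keep]

def token_overlap(query, text):
    """Toy stand-in scorer: fraction of query tokens found in the candidate."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q)

candidates = [
    "Vectors are elements of a vector space.",     # similar words, wrong topic
    "our vector database stores document chunks",  # actually relevant
]
best = rerank("vector database chunks", candidates, token_overlap, keep=1)
```

The two-stage design is the point: the cheap retriever narrows millions of chunks to a few dozen, and the expensive reranker only ever sees those few dozen.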

In practice

A minimal RAG stack looks like:

# embedder, vector_db, and llm are placeholders for your chosen
# embedding model, vector store, and LLM client.

# 1. Embed the query
query_vec = embedder.encode(user_query)

# 2. Retrieve top-k chunks
results = vector_db.search(query_vec, top_k=5)

# 3. Build prompt
context = "\n\n".join(r.text for r in results)
prompt = f"Use the following context to answer.\n\n{context}\n\nQuestion: {user_query}"

# 4. Generate
answer = llm.generate(prompt)

In production, you'll want: a reranker after retrieval, hybrid search (dense + BM25), metadata filtering, and eval on retrieval quality separately from generation quality.
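Hybrid search needs a way to merge the dense and BM25 result lists, whose scores aren't on a common scale. Reciprocal rank fusion (RRF) is a common score-agnostic choice: each list contributes 1/(k + rank) per document. A minimal sketch (k=60 is the conventional default constant):

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Merge several ranked lists of doc IDs with reciprocal rank fusion.
    Only rank positions matter, so dense and BM25 scores never need
    to be normalized onto a common scale."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc3", "doc1", "doc7"]  # ranked by cosine similarity
bm25  = ["doc1", "doc9", "doc3"]  # ranked by keyword match
fused = rrf_fuse([dense, bm25])   # → ['doc1', 'doc3', 'doc9', 'doc7']
```

Documents that appear high in both lists (doc1, doc3) float to the top, which is exactly the behavior hybrid search is after.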