What is RAG?

RAG (Retrieval-Augmented Generation) is a pattern that gives a language model access to an external knowledge base at inference time. Instead of relying solely on what was baked into its weights during training, the model first retrieves relevant documents from a database, then generates a response grounded in that retrieved context.

The key insight: the model's knowledge and the model's reasoning are separated. You keep knowledge in a database you control, and use the LLM for what it's actually good at — synthesis, reasoning, and language generation.

How it works

A RAG pipeline has three stages:

  1. Indexing (offline, re-run whenever documents change): Split your documents into chunks, embed each chunk into a vector, and store the vectors in a vector database.
  2. Retrieval (at query time): Embed the user's query, find the top-k chunks whose vectors are closest to the query vector.
  3. Generation: Inject the retrieved chunks into the model's context window as a prompt prefix, then generate the answer.

User query
    ↓
[Embed query] → vector
    ↓
[Vector DB] → top-k relevant chunks
    ↓
[LLM] ← prompt = system + chunks + query
    ↓
Answer grounded in retrieved context

The quality of retrieval determines the quality of the answer. Garbage in, garbage out — no matter how good the LLM is.
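The indexing and retrieval stages can be sketched end to end in plain Python. This is illustrative only: chunk_text is a naive fixed-size splitter, and embed is a toy bag-of-words stand-in for a real embedding model, so the whole sketch runs without any external services.

```python
import math
from collections import Counter

def chunk_text(text, chunk_size=50, overlap=10):
    """Naive fixed-size chunker: split on words, with overlap so
    sentences straddling a boundary land in two chunks."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text):
    """Toy embedding: a bag-of-words count vector. A real pipeline
    would call an embedding model here instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def top_k(query, index, k=2):
    """Rank indexed chunks by cosine similarity to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda chunk: cosine(q, embed(chunk)), reverse=True)
    return ranked[:k]

docs = ("Retrieval-augmented generation retrieves relevant chunks from a "
        "vector database at query time and feeds them to the model.")
index = chunk_text(docs, chunk_size=8, overlap=2)
hits = top_k("vector database retrieval", index, k=1)
```

The overlap parameter directly addresses the coarse-chunking failure mode discussed below: without it, a fact split across a chunk boundary is retrievable from neither chunk.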

When to use it

RAG is the right choice when:

  • Your knowledge base changes frequently (product docs, internal wikis, news)
  • You need citations — answers traceable to specific source documents
  • You're working with domain-specific or proprietary information the model wasn't trained on
  • You want to reduce hallucinations on factual questions
  • Cost is a concern — fine-tuning is expensive, RAG can use any base model

Trade-offs vs fine-tuning

                          RAG                              Fine-tuning
  Knowledge update        Change the DB, no retraining     Full or partial retraining
  Cost                    Low (retrieval + inference)      High (GPU training)
  Hallucination on facts  Lower (grounded in docs)         Higher (baked in weights)
  Reasoning style / tone  Unchanged                        Can be adapted
  Latency                 Adds a retrieval step            Same as base model
  Knowledge cutoff        None — DB is always current      Frozen at training time

Rule of thumb: Use RAG for what the model knows. Use fine-tuning for how the model behaves.

Failure modes to know

  • Chunking too coarsely: relevant info gets split across chunks that aren't retrieved together
  • Chunking too finely: chunks lose context, embeddings become noisy
  • Query-document mismatch: user asks in a different "register" than how docs are written — semantic search struggles
  • Context window overflow: retrieving too many chunks can push the actual question out of focus
  • Retrieval without reranking: top-k by cosine similarity ≠ top-k by relevance to the actual question. A reranker (cross-encoder) helps.
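The reranking fix in the last bullet is a second pass: retrieve a generous top-k cheaply, then re-score each (query, candidate) pair with a more expensive model and keep the best. In the sketch below, score_pair is where a real cross-encoder forward pass would go; token_overlap is just a toy scorer so the example runs standalone.

```python
def rerank(query, candidates, score_pair, keep=3):
    """Re-score (query, candidate) pairs and keep the best. In production,
    score_pair would be a cross-encoder, which reads both texts jointly and
    is more accurate (and much slower) than comparing precomputed vectors."""
    ranked = sorted(candidates, key=lambda c: score_pair(query, c), reverse=True)
    return ranked[:keep]

def token_overlap(query, text):
    """Toy stand-in scorer: fraction of query tokens found in the candidate."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q)

candidates = [
    "Vectors are elements of a vector space.",     # similar words, wrong topic
    "our vector database stores document chunks",  # actually relevant
]
best = rerank("vector database chunks", candidates, token_overlap, keep=1)
```

The two-stage design is the point: the cheap retriever narrows millions of chunks to a few dozen, and the expensive reranker only ever sees those few dozen.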

In practice

A minimal RAG stack looks like:

# embedder, vector_db, and llm are placeholders for your chosen
# embedding model, vector store, and LLM client.

# 1. Embed the query
query_vec = embedder.encode(user_query)

# 2. Retrieve top-k chunks
results = vector_db.search(query_vec, top_k=5)

# 3. Build prompt
context = "\n\n".join(r.text for r in results)
prompt = f"Use the following context to answer.\n\n{context}\n\nQuestion: {user_query}"

# 4. Generate
answer = llm.generate(prompt)

In production, you'll want: a reranker after retrieval, hybrid search (dense + BM25), metadata filtering, and eval on retrieval quality separately from generation quality.
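Hybrid search needs a way to merge the dense and BM25 result lists, whose scores aren't on a common scale. Reciprocal rank fusion (RRF) is a common score-agnostic choice: each list contributes 1/(k + rank) per document. A minimal sketch (k=60 is the conventional default constant):

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Merge several ranked lists of doc IDs with reciprocal rank fusion.
    Only rank positions matter, so dense and BM25 scores never need
    to be normalized onto a common scale."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc3", "doc1", "doc7"]  # ranked by cosine similarity
bm25  = ["doc1", "doc9", "doc3"]  # ranked by keyword match
fused = rrf_fuse([dense, bm25])   # → ['doc1', 'doc3', 'doc9', 'doc7']
```

Documents that appear high in both lists (doc1, doc3) float to the top, which is exactly the behavior hybrid search is after.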