What is RAG?
RAG (Retrieval-Augmented Generation) is a pattern that gives a language model access to an external knowledge base at inference time. Instead of relying solely on what was baked into its weights during training, the model first retrieves relevant documents from a database, then generates a response grounded in that retrieved context.
The key insight: the model's knowledge and the model's reasoning are separated. You keep knowledge in a database you control, and use the LLM for what it's actually good at — synthesis, reasoning, and language generation.
How it works
A RAG pipeline has three stages:
- Indexing (offline, done once): Split your documents into chunks, embed each chunk into a vector, store the vectors in a vector database (sketched in code after the diagram below).
- Retrieval (at query time): Embed the user's query, find the top-k chunks whose vectors are closest to the query vector.
- Generation: Inject the retrieved chunks into the model's context window as a prompt prefix, then generate the answer.
```
User query
    ↓
[Embed query] → vector
    ↓
[Vector DB] → top-k relevant chunks
    ↓
[LLM] ← prompt = system + chunks + query
    ↓
Answer grounded in retrieved context
```
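The indexing stage can be sketched in a few lines. This is a minimal sketch under stated assumptions: it uses sentence-transformers for embeddings and an in-memory FAISS index as the vector store; the model name, chunk size, and overlap are illustrative placeholders, not recommendations.

```python
# Offline indexing: chunk -> embed -> store.
# Assumes `pip install sentence-transformers faiss-cpu`; model name and
# chunk sizes are illustrative choices.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size chunking with overlap, so content split at a
    chunk boundary still appears intact in at least one chunk."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works

documents = ["...your raw documents..."]             # placeholder corpus
chunks = [c for doc in documents for c in chunk_text(doc)]

# Embed every chunk; normalize so inner product equals cosine similarity.
vectors = embedder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(np.asarray(vectors, dtype="float32"))
# Keep `chunks` alongside `index` so retrieved ids map back to text.
```

The one hard requirement: queries and chunks must be embedded with the same model, so retrieval compares like with like.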
The quality of retrieval determines the quality of the answer. Garbage in, garbage out — no matter how good the LLM is.
When to use it
RAG is the right choice when:
- Your knowledge base changes frequently (product docs, internal wikis, news)
- You need citations — answers traceable to specific source documents
- You're working with domain-specific or proprietary information the model wasn't trained on
- You want to reduce hallucinations on factual questions
- Cost is a concern — fine-tuning is expensive, RAG can use any base model
Trade-offs vs fine-tuning
| Aspect | RAG | Fine-tuning |
|---|---|---|
| Knowledge update | Change the DB, no retraining | Full or partial retraining |
| Cost | Low (retrieval + inference) | High (GPU training) |
| Hallucination on facts | Lower (grounded in docs) | Higher (relies on weights alone) |
| Reasoning style / tone | Unchanged | Can be adapted |
| Latency | Adds a retrieval step | Same as base model |
| Knowledge cutoff | None (the DB is as current as you keep it) | Frozen at training time |
Rule of thumb: Use RAG for what the model knows. Use fine-tuning for how the model behaves.
Failure modes to know
- Chunking too coarsely: relevant info gets split across chunks that aren't retrieved together
- Chunking too finely: chunks lose context, embeddings become noisy
- Query-document mismatch: the user asks in a different "register" than the one the docs are written in, and pure semantic search struggles to bridge the gap
- Context window overflow: retrieving too many chunks can push the actual question out of focus
- Retrieval without reranking: top-k by cosine similarity ≠ top-k by relevance to the actual question. A reranker (cross-encoder) helps; see the sketch below.
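To make that last point concrete, here is a minimal reranking sketch, assuming the sentence-transformers CrossEncoder class; the model name and the cutoff are illustrative.

```python
# Rerank retrieved chunks with a cross-encoder before generation.
# Assumes `pip install sentence-transformers`; model name is illustrative.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], keep: int = 5) -> list[str]:
    """Score each (query, chunk) pair jointly, then keep the best `keep`.
    Unlike bi-encoder cosine similarity, the cross-encoder sees query and
    chunk together, so it can judge relevance to the actual question."""
    scores = reranker.predict([(query, c) for c in chunks])
    ranked = sorted(zip(scores, chunks), key=lambda p: p[0], reverse=True)
    return [chunk for _, chunk in ranked[:keep]]

# Typical pattern: retrieve generously (say, the top 50 by cosine similarity),
# then rerank down to the handful of chunks that go into the prompt.
```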
In practice
A minimal RAG stack looks like:
```python
# Placeholders: `embedder` is your embedding model, `vector_db` your vector
# store client, and `llm` your generation model client.

# 1. Embed the query
query_vec = embedder.encode(user_query)

# 2. Retrieve the top-k chunks closest to the query vector
results = vector_db.search(query_vec, top_k=5)

# 3. Build the prompt: retrieved chunks first, then the question
context = "\n\n".join(r.text for r in results)
prompt = f"Use the following context to answer.\n\n{context}\n\nQuestion: {user_query}"

# 4. Generate an answer grounded in the retrieved context
answer = llm.generate(prompt)
```
In production, you'll want: a reranker after retrieval, hybrid search (dense + BM25), metadata filtering, and eval on retrieval quality separately from generation quality.
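As an illustration of the hybrid-search piece, below is a sketch that fuses a BM25 ranking and a dense ranking with reciprocal rank fusion (RRF). It assumes the rank_bm25 package and reuses the `chunks`, `embedder`, and `index` from the indexing sketch earlier; the constant `k=60` is the value commonly used for RRF.

```python
# Hybrid retrieval sketch: fuse BM25 (lexical) and dense (semantic) rankings
# with reciprocal rank fusion. Assumes `pip install rank-bm25`; `chunks`,
# `embedder`, and `index` come from the indexing sketch above.
from rank_bm25 import BM25Okapi

bm25 = BM25Okapi([c.split() for c in chunks])   # naive whitespace tokenization

def hybrid_search(query: str, top_k: int = 5, k: int = 60) -> list[str]:
    # Lexical ranking: BM25 scores over the corpus, best first.
    bm25_scores = bm25.get_scores(query.split())
    bm25_rank = sorted(range(len(chunks)), key=lambda i: bm25_scores[i], reverse=True)

    # Dense ranking: nearest neighbours in the FAISS index, best first.
    q = embedder.encode([query], normalize_embeddings=True)
    _, dense_ids = index.search(q.astype("float32"), len(chunks))
    dense_rank = list(dense_ids[0])

    # Reciprocal rank fusion: a chunk ranked highly by either list wins.
    fused: dict[int, float] = {}
    for ranking in (bm25_rank, dense_rank):
        for rank, idx in enumerate(ranking):
            fused[idx] = fused.get(idx, 0.0) + 1.0 / (k + rank + 1)
    best = sorted(fused, key=fused.get, reverse=True)[:top_k]
    return [chunks[i] for i in best]
```

RRF works on ranks rather than raw scores, which sidesteps the problem that BM25 scores and cosine similarities live on different scales.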