The question nobody asks out loud
When people get a wrong answer from an LLM, the usual reaction is: "it hallucinated." The word implies the model made a mistake it shouldn't have made — like a reliable system that glitched. That framing is wrong, and it leads to a bad mental model for everything that follows.
The better question is: why would a language model ever be right about a specific fact?
What an LLM actually is
A large language model is trained on a massive corpus of text — web pages, books, code, forums. The training process doesn't build a database of facts. It compresses statistical patterns into billions of numerical parameters called weights.
When you ask a question, the model doesn't look anything up. It runs a mathematical operation over its weights and the tokens in your prompt to compute what text is most likely to come next. That's it. The mechanism is prediction, not recall.
This has a non-obvious consequence: the model can produce a fluent, confident, detailed answer about something it was never directly trained on — by interpolating from related patterns. Sometimes that interpolation is accurate. Sometimes it isn't. The model has no internal signal to tell the difference.
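To make the mechanism concrete, here is a minimal sketch of that generation loop. The `next_token_logits` function and the toy numbers are invented stand-ins for the real network; the point is that the only operation anywhere in the loop is "score possible next tokens". There is no lookup against a store of facts, and nothing in the loop signals whether the continuation is true.

```python
# Sketch of autoregressive generation. next_token_logits() stands in for the
# trained network: given the tokens so far, return one score per vocabulary entry.
def generate(prompt_tokens, next_token_logits, end_token, max_new_tokens=100):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)                           # scores over the whole vocabulary
        next_tok = max(range(len(logits)), key=lambda i: logits[i])  # greedy: take the most probable token
        tokens.append(next_tok)
        if next_tok == end_token:
            break
    return tokens

# Toy "model" over a 4-token vocabulary that always favors token 3 (pretend it's an end token).
fake_model = lambda tokens: [0.1, 0.2, 0.3, 0.9]
print(generate([0, 1], fake_model, end_token=3))   # -> [0, 1, 3]
```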
What changes the output
If you ask the same question twice, you may not get the same answer. Here's why:
Temperature is a parameter that controls randomness in the sampling process. At temperature 0, the model always picks the most probable next token, so outputs are effectively deterministic. At higher temperatures, lower-probability tokens get more chances to be selected. The same question becomes a probability distribution over possible answers, not a single answer.
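As an illustration (with made-up logits, not any particular model's), here is how temperature reshapes that choice: at 0 the pick is greedy, above 0 the scaled logits become a distribution that gets sampled.

```python
import math
import random

def sample_with_temperature(logits, temperature):
    """Turn raw logits into a distribution, sharpened or flattened by temperature."""
    if temperature == 0:
        # Greedy decoding: always take the single most probable token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    # Softmax (subtract the max for numerical stability).
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

# Made-up logits for three candidate tokens: at temperature 0 the first token
# wins every time; at higher temperatures the others get real chances.
logits = [4.0, 2.5, 1.0]
print(sample_with_temperature(logits, 0))     # always index 0
print(sample_with_temperature(logits, 1.2))   # sometimes index 1 or 2
```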
Top-p and top-k are related sampling parameters that constrain which tokens are eligible at each step: top-k keeps only the k most probable tokens, while top-p (nucleus sampling) keeps the smallest set of tokens whose cumulative probability reaches p. Both tune the balance between coherence and diversity.
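A sketch of both filters over an invented five-token distribution:

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens; zero out the rest and renormalize."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    filtered = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(filtered)
    return [q / total for q in filtered]

# Made-up distribution over five candidate tokens.
probs = [0.50, 0.25, 0.15, 0.07, 0.03]
print(top_k_filter(probs, 2))    # only the top two tokens remain eligible
print(top_p_filter(probs, 0.9))  # tokens are added until 90% of the mass is covered
```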
The system prompt and context window change what the model "sees" when generating. A different system prompt — even a subtle one — shifts the probability distribution over outputs. The model doesn't have a view of facts independent of its context; it reasons over what's in front of it.
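In the common chat-message format that most providers use, this context is explicit. The snippet below only builds two requests that differ in their system prompt; the wording is invented, but the structural point holds regardless of provider: the model samples from a distribution conditioned on everything in this list.

```python
question = "Is coffee good for you?"

# Two requests that differ only in the system prompt. The model conditions on
# everything in the message list, so these define two different distributions
# over answers. There is no view of the facts independent of this context.
request_a = [
    {"role": "system", "content": "You are a cautious medical reviewer. Flag uncertainty explicitly."},
    {"role": "user", "content": question},
]
request_b = [
    {"role": "system", "content": "You are an upbeat lifestyle blogger."},
    {"role": "user", "content": question},
]
```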
The model itself matters. GPT-4, Claude, Llama 3, and Mistral were trained on different data, with different architectures and different fine-tuning processes (RLHF, RLAIF, DPO...). They have systematically different strengths, biases, and failure modes. On contested facts, they may genuinely disagree.
None of these parameters has a "correct" setting for factual accuracy. The model has no access to ground truth.
The training cutoff problem
Model weights are frozen at the end of training. Nothing that happened after the cutoff date exists in the model's weights. If you ask about recent events, the model may:
- correctly say it doesn't know
- confabulate something plausible based on earlier patterns
- confidently give you outdated information presented as current
The cutoff is published for each model, but it's not a clean line — data from just before the cutoff is underrepresented compared to older data (the internet takes time to process and discuss events). Models are often less reliable about the year preceding their cutoff than about things from several years earlier.
Hallucination is not a bug
The word "hallucination" gets used as if it describes an anomaly. It describes the default behavior of the architecture.
When a model generates a citation that doesn't exist, it's doing exactly what it was trained to do: producing a sequence of tokens that looks like a valid citation, based on the patterns of how citations are structured in its training data. The model isn't checking against a database of real papers. There is no such check in the architecture.
The same applies to statistics, dates, names, addresses, code behavior, and legal facts. The model generates plausible continuations. Plausibility is not accuracy.
This doesn't mean LLMs are useless for factual tasks. It means the right use of an LLM for factual tasks requires a different architecture — one that separates retrieval of real data from generation of language.
The right mental model
Think of an LLM as an expert reasoner with an imperfect memory and no access to recent documents.
Ask it to explain a concept, structure an argument, synthesize multiple perspectives, write code, translate, or reason over something you've given it — it's excellent at these. These are tasks where the quality of reasoning matters more than access to ground truth.
Ask it to tell you what happened last week, whether a specific fact is true, or what a specific document says (without giving it the document) — these are tasks where you need data, not reasoning. An LLM alone is the wrong tool.
What RAG does
RAG (Retrieval-Augmented Generation) is the standard architectural answer to this problem. Instead of asking the model to recall facts from its weights, you:
- Retrieve relevant documents from a database you control (recent, verified, specific)
- Inject those documents into the model's context window
- Ask the model to reason over the provided documents
The model's job becomes reasoning and synthesis over real data you supplied — not recalling compressed patterns from training. The knowledge comes from your database. The reasoning comes from the model. These two responsibilities are cleanly separated.
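Here is a deliberately tiny sketch of that pipeline. The in-memory "database", the keyword-overlap retriever, and the `ask_llm` placeholder are stand-ins for a real vector store, embedding search, and model call, but the separation of responsibilities is the same: the documents supply the facts, the model only reasons over what it is shown.

```python
import re

# Stand-in knowledge base: recent, verified, specific documents you control.
DOCUMENTS = [
    "Policy 12.3 (updated 2024-06-01): remote employees may expense one monitor per year.",
    "Policy 8.1: travel must be booked through the internal portal.",
]

def words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, k=1):
    """Step 1: rank documents by naive word overlap with the question, keep the top k."""
    q = words(question)
    return sorted(documents, key=lambda d: len(q & words(d)), reverse=True)[:k]

def build_prompt(question, documents):
    """Step 2: inject the retrieved documents into the context, scoped to them."""
    context = "\n".join(f"- {d}" for d in documents)
    return (
        "Answer using only the documents below. If they don't contain the answer, say so.\n"
        f"Documents:\n{context}\n\nQuestion: {question}"
    )

def answer(question, ask_llm, documents=DOCUMENTS):
    """Step 3: ask the model to reason over the provided documents."""
    return ask_llm(build_prompt(question, retrieve(question, documents)))

# `ask_llm` is whatever model call you use; a stub here shows the flow end to end.
print(answer("Can I expense a monitor?", ask_llm=lambda prompt: f"[model sees]\n{prompt}"))
```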
This is why RAG dramatically reduces hallucinations on factual questions: the model is no longer asked to generate facts from memory. It's asked to extract and synthesize facts from documents it can actually see.
See the RAG deep dive for the full technical breakdown.
What this means in practice
When you use an LLM:
- Don't verify outputs by asking the same model again. The model will confirm its own confabulations fluently.
- Treat confidence of phrasing as unrelated to accuracy. A model can be maximally confident and maximally wrong simultaneously.
- For factual tasks, give the model the documents. Paste in the relevant text, policy, or data. RAG automates this at scale.
- For reasoning tasks, LLMs are powerful. Analysis, synthesis, writing, code generation — these are not ground-truth tasks and LLMs excel at them.
- Different models, different answers. If a question matters, cross-check across models and treat disagreement as a signal to verify externally.
The mental model shift: stop thinking of an LLM as a search engine that knows things. Start thinking of it as a powerful reasoning engine that needs you to supply the data.