Definition
A Chat LLM is a system that allows a user to send a message in natural language and receive a response automatically generated by a language model.
The key point: an LLM doesn't "understand" in the human sense — it predicts the next token from a given context.
Global pipeline
User Input
→ Prompt Construction (system + history + instructions)
→ [Optional] Retrieval (RAG)
→ LLM (inference)
→ Generation (token by token)
→ Post-processing
→ Response
1. User Input
Message sent by the user in natural language. Can include:
- a question
- an instruction
- business context
2. Prompt Construction Layer
Critical step — often invisible.
The system assembles a final prompt by combining:
- the user's message
- conversation history
- system prompt (rules, role, tone)
- constraints (format, safety, etc.)
Final Prompt = System + History + User Input
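A minimal sketch of this assembly step (Python; the message layout mirrors the common chat-completion format, and `build_prompt`, its arguments, and the truncation rule are illustrative, not any specific vendor's API):

```python
def build_prompt(system_prompt, history, user_message, max_history=10):
    """Assemble the final prompt: system rules + recent history + new user input."""
    messages = [{"role": "system", "content": system_prompt}]
    # Keep only the most recent turns so the prompt stays within the context window
    messages.extend(history[-max_history:])
    messages.append({"role": "user", "content": user_message})
    return messages

prompt = build_prompt(
    system_prompt="You are a concise support assistant. Answer in markdown.",
    history=[{"role": "user", "content": "Hi"},
             {"role": "assistant", "content": "Hello! How can I help?"}],
    user_message="What does my contract cover?",
)
```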
3. Retrieval (RAG) — Optional
Important: this is not always used.
When to use it:
- need for up-to-date information
- specific documents (PDF, internal knowledge base…)
- knowledge outside the model's training data
RAG pipeline:
User Query
→ Embedding
→ Vector Search / DB / API
→ Top-K documents
→ Injection into prompt
The model doesn't "go fetch" the information itself — the context is injected into it.
Key insight: RAG is a component external to the LLM, not a native capability of the model.
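A hedged sketch of that external retrieval layer, assuming hypothetical `embed` and `vector_store` helpers (in practice, an embedding model and a vector database client):

```python
def retrieve_and_augment(user_query, embed, vector_store, k=3):
    """RAG step: the orchestration code fetches context; the LLM never 'goes fetching'."""
    query_vector = embed(user_query)                          # 1. embed the query
    documents = vector_store.search(query_vector, top_k=k)    # 2. vector search -> top-k docs
    context = "\n\n".join(doc["text"] for doc in documents)   # 3. concatenate retrieved text
    # 4. inject the retrieved context into the prompt sent to the LLM
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {user_query}"
    )
```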
4. LLM Core (Inference)
4.1 Tokenization
Text → Tokens
"Hello world" → ["Hello", " world"]
4.2 Embedding
Tokens → numerical vectors + positional encoding
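A compact NumPy illustration of the idea, using a random embedding table and the sinusoidal positional encoding from the original Transformer paper; real models learn the embedding matrix, and many use learned or rotary positions instead:

```python
import numpy as np

vocab_size, d_model, seq_len = 50_000, 512, 3
token_ids = np.array([17, 942, 305])              # output of the tokenizer

embedding_table = np.random.randn(vocab_size, d_model) * 0.02
token_vectors = embedding_table[token_ids]        # (3, 512): one vector per token

# Sinusoidal positional encoding (Vaswani et al., 2017)
pos = np.arange(seq_len)[:, None]
i = np.arange(d_model)[None, :]
angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
pos_encoding = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

model_input = token_vectors + pos_encoding        # what the transformer layers consume
```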
4.3 Transformer (model core)
Stack of layers, each with:
- self-attention: each token "looks at" the others to build context
- feedforward: per-token transformation
4.4 Attention Mechanism
Weights the importance of each word relative to the others.
"it" → linked to "the cat"
4.5 Inference
Computes the probability distribution over the next token:
P(token | context)
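Concretely, the model's final layer emits one score (logit) per vocabulary entry, and a softmax turns those scores into this distribution. A toy illustration over a 5-token vocabulary:

```python
import numpy as np

vocab = ["cat", "dog", "mat", "the", "sat"]
logits = np.array([2.1, 0.3, 4.0, 1.2, 3.5])    # raw scores for the next token

probs = np.exp(logits - logits.max())
probs /= probs.sum()                            # softmax -> probability distribution
for word, p in zip(vocab, probs):
    print(f"P({word!r} | context) = {p:.3f}")
```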
4.6 Decoding Strategy
Chooses the next token from the distribution:
| Strategy | Behavior |
|---|---|
| Greedy | Always picks the most probable token |
| Top-k | Samples from the k most probable |
| Top-p (nucleus) | Samples from the smallest set covering probability p |
| Temperature | Scales the distribution (↑ creative, ↓ precise) |
Direct impact on: creativity, precision, hallucination rate.
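A sketch of the four strategies applied to a probability vector like the one above (NumPy only; the function names are illustrative, and production decoders add refinements such as repetition penalties or beam search):

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy(probs):
    return int(np.argmax(probs))                        # always the single most probable token

def sample_top_k(probs, k=2):
    top = np.argsort(probs)[-k:]                        # keep the k most probable tokens
    p = probs[top] / probs[top].sum()
    return int(rng.choice(top, p=p))

def sample_top_p(probs, p=0.9):
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, p) + 1]  # smallest set covering mass p
    q = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=q))

def apply_temperature(logits, temperature=0.7):
    scaled = logits / temperature                       # <1 sharpens, >1 flattens the distribution
    e = np.exp(scaled - scaled.max())
    return e / e.sum()
```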
5. Generation Loop (Auto-regressive)
Token → appended to context → next token prediction → repeat
The model generates one token at a time, conditioning each new token on everything that came before.
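A hedged sketch of that loop; `model.next_token_probs` and `pick_token` are placeholders for a real model and one of the decoding strategies above:

```python
def generate(model, prompt_tokens, pick_token, max_new_tokens=100, eos_id=0):
    """Auto-regressive loop: each new token is conditioned on everything before it."""
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = model.next_token_probs(context)   # P(token | context) over the vocabulary
        token = pick_token(probs)                 # greedy / top-k / top-p (see above)
        if token == eos_id:                       # stop when the model emits end-of-sequence
            break
        context.append(token)                     # the new token becomes part of the context
    return context[len(prompt_tokens):]
```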
6. Post-processing
- Detokenization (tokens → text)
- Formatting (markdown, structured output)
- Optional filters (safety, PII, etc.)
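An illustrative post-processing pass, reusing a tiktoken-style `enc.decode` for detokenization and a deliberately simple regex as a stand-in for a PII filter:

```python
import re

def postprocess(token_ids, enc):
    """Turn generated token ids back into text, then apply simple output filters."""
    text = enc.decode(token_ids)        # detokenization
    text = text.strip()                 # basic formatting cleanup
    # Illustrative PII filter: mask anything that looks like an email address
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[redacted email]", text)
    return text
```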
7. Response
Final response displayed to the user — sometimes enriched with UI elements, images, or tool results.
Two modes to distinguish
Mode 1 — Simple Chat (without RAG)
User → Prompt → LLM → Answer
Based solely on:
- the model's weights
- the conversational context
Mode 2 — Chat with RAG
User
→ Retrieval (docs, DB, search)
→ Enriched prompt
→ LLM
→ Answer
Used for: precision, fresh data, business context.
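Putting the two modes together, a hedged orchestration sketch; `call_llm`, `retriever`, and `SYSTEM_PROMPT` are placeholders, and `build_prompt` / `retrieve_and_augment` refer to the earlier sketches:

```python
def answer(user_message, history, call_llm, retriever=None):
    """Mode 1 when retriever is None, Mode 2 otherwise: same LLM, different prompt."""
    if retriever is None:
        # Mode 1: rely only on the model's weights + conversational context
        prompt = build_prompt(SYSTEM_PROMPT, history, user_message)
    else:
        # Mode 2: enrich the prompt with retrieved documents before inference
        augmented = retrieve_and_augment(user_message, retriever.embed, retriever.store)
        prompt = build_prompt(SYSTEM_PROMPT, history, augmented)
    return call_llm(prompt)
```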
Fundamental insight
An LLM is not a database, nor a search engine.
It is: a probabilistic text generation engine conditioned on a context.
Common mistakes
| Wrong | Correct |
|---|---|
| "The LLM goes and fetches the info" | The LLM generates from its weights + context |
| "The LLM understands like a human" | It predicts the next token statistically |
| "RAG = LLM" | RAG is a separate retrieval layer, optional |
Correct mental model
- LLM = generation engine
- RAG = retrieval engine (optional)
- Product = orchestration of both
Why add RAG?
RAG enriches the model's context at query time with data external to its training. It doesn't replace the LLM; it augments it.
Main use cases
| Use case | Why RAG |
|---|---|
| Private data access | Internal docs, PDFs, CRM data that the LLM alone cannot see |
| Data freshness | News, prices, regulations — LLMs have a training cutoff |
| Hallucination reduction | Model grounds answers in real documents |
| Business context | Internal jargon, procedures, client-specific knowledge |
| Traceability | Cite sources, audit answers — critical in enterprise |
Trade-offs
- Added complexity: vector DB, embeddings, retrieval pipeline
- Latency: retrieval + injection adds time to each request
- Quality depends on retrieval: garbage in = garbage out
Key insight: RAG transforms a "generalist" LLM into a "specialist" assistant.
Diagram — Chat LLM with / without RAG
```mermaid
flowchart TD
    A[User Input] --> B[Prompt Construction<br/>System + History + Instructions]
    B --> C{Need External Knowledge?}
    C -->|No| D[LLM Inference]
    C -->|Yes| E[Embedding Query]
    E --> F[Vector Search / DB / APIs]
    F --> G[Top-K Retrieved Documents]
    G --> H[Augmented Prompt]
    H --> D
    D --> I[Token Generation Loop<br/>Auto-regressive]
    I --> J[Detokenization]
    J --> K[Post-processing / Formatting]
    K --> L[Final Response]
```