Definition

A Chat LLM is a system that allows a user to send a message in natural language and receive a response automatically generated by a language model.

The key point: an LLM doesn't "understand" in the human sense — it predicts the next token from a given context.


Global pipeline

User Input
→ Prompt Construction (system + history + instructions)
→ [Optional] Retrieval (RAG)
→ LLM (inference)
→ Generation (token by token)
→ Post-processing
→ Response

1. User Input

Message sent by the user in natural language. Can include:

  • a question
  • an instruction
  • business context

2. Prompt Construction Layer

Critical step — often invisible.

The system assembles a final prompt by combining:

  • the user's message
  • conversation history
  • system prompt (rules, role, tone)
  • constraints (format, safety, etc.)

Final Prompt = System + History + User Input
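
A minimal sketch of this assembly step, using the common {"role": ..., "content": ...} chat-message format. All names here (build_prompt, max_history) are illustrative, not any particular framework's API:

```python
# Illustrative prompt-construction layer: Final Prompt = System + History + User Input.
def build_prompt(system_prompt, history, user_input, max_history=10):
    """Assemble the final message list sent to the model."""
    messages = [{"role": "system", "content": system_prompt}]
    # Keep only the most recent turns so the prompt fits the context window.
    messages.extend(history[-max_history:])
    messages.append({"role": "user", "content": user_input})
    return messages

prompt = build_prompt(
    system_prompt="You are a concise assistant. Answer in markdown.",
    history=[{"role": "user", "content": "Hi"},
             {"role": "assistant", "content": "Hello! How can I help?"}],
    user_input="Summarize our refund policy.",
)
```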

3. Retrieval (RAG) — Optional

Important: this is not always used.

When to use it:

  • need for up-to-date information
  • specific documents (PDF, internal knowledge base…)
  • knowledge outside the model's training data

RAG pipeline:

User Query
→ Embedding
→ Vector Search / DB / API
→ Top-K documents
→ Injection into prompt

The model doesn't "go fetch" the information itself — the context is injected into it.

Key insight: RAG is a component external to the LLM, not a native capability of the model.
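
A sketch of that external retrieval step. The embed function and vector_store interface are assumptions for illustration, not a specific library; the point is that the application, not the model, does the fetching and injects the result into the prompt:

```python
# Illustrative RAG step: Query -> Embedding -> Vector Search -> Top-K -> Injection.
def retrieve_and_augment(query, embed, vector_store, k=3):
    """Embed the query, fetch the k nearest documents, build an augmented prompt."""
    query_vector = embed(query)                        # query -> embedding
    docs = vector_store.search(query_vector, top_k=k)  # vector search -> top-k docs
    context = "\n\n".join(d.text for d in docs)        # injection into the prompt
    return (
        "Answer using ONLY the context below. Cite the passages you use.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```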


4. LLM Core (Inference)

4.1 Tokenization

Text → Tokens

"Hello world" → ["Hello", " world"]

4.2 Embedding

Tokens → numerical vectors + positional encoding
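
A minimal numpy sketch of this step: token IDs index into a lookup table (learned during training; random here), and a sinusoidal positional encoding, the scheme from "Attention Is All You Need", is added so the model knows token order:

```python
import numpy as np

vocab_size, d_model, seq_len = 1000, 64, 2
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model))  # learned in a real model

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

token_ids = np.array([42, 7])  # two tokens
x = embedding_table[token_ids] + positional_encoding(seq_len, d_model)
print(x.shape)  # (2, 64): one position-aware vector per token
```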

4.3 Transformer (model core)

Stack of layers, each with:

  • self-attention: each token "looks at" the others to build context
  • feedforward: per-token transformation
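
A stripped-down numpy sketch of one such layer. Real layers add learned Q/K/V projections, multiple attention heads, and layer normalization, all omitted here; only the two sublayers and the residual connections remain:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):                       # x: (seq_len, d_model)
    scores = x @ x.T / np.sqrt(x.shape[-1])  # every token scores every other token
    return softmax(scores) @ x               # weighted mix of all positions

def feedforward(x, w1, w2):
    return np.maximum(0, x @ w1) @ w2        # per-token MLP (ReLU)

def transformer_layer(x, w1, w2):
    x = x + self_attention(x)                # residual connection
    return x + feedforward(x, w1, w2)        # residual connection
```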

4.4 Attention Mechanism

Weights the importance of each token relative to the others.

"it" → linked to "the cat"

4.5 Inference

Computes the probability distribution over the next token:

P(token | context)
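
Concretely, the model's final layer outputs one raw score (logit) per vocabulary entry, and a softmax turns those scores into P(token | context). A toy example with made-up numbers:

```python
import numpy as np

vocab = ["cat", "dog", "the", "sat"]
logits = np.array([2.1, 0.3, -1.0, 1.5])  # hypothetical scores for the next token
probs = np.exp(logits) / np.exp(logits).sum()
print(dict(zip(vocab, probs.round(3))))   # a distribution summing to 1
```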

4.6 Decoding Strategy

Chooses the next token from the distribution:

  • Greedy: always picks the most probable token
  • Top-k: samples among the k most probable tokens
  • Top-p (nucleus): samples from the smallest set whose cumulative probability reaches p
  • Temperature: rescales the distribution (↑ more creative, ↓ more precise)

Direct impact on: creativity, precision, hallucination rate.
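
A minimal numpy sketch of the four strategies above; probs is a probability vector over the vocabulary and logits the raw scores, both assumed given:

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy(probs):
    return int(np.argmax(probs))                      # always the top token

def top_k(probs, k=50):
    idx = np.argsort(probs)[-k:]                      # keep the k most probable
    p = probs[idx] / probs[idx].sum()                 # renormalize, then sample
    return int(rng.choice(idx, p=p))

def top_p(probs, p=0.9):
    order = np.argsort(probs)[::-1]                   # most probable first
    cum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cum, p)) + 1]  # smallest set covering p
    q = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=q))

def apply_temperature(logits, t=0.8):
    z = logits / t                                    # t < 1 sharpens, t > 1 flattens
    e = np.exp(z - z.max())
    return e / e.sum()
```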


5. Generation Loop (Auto-regressive)

Token → appended to context → next token prediction → repeat

The model generates one token at a time, conditioning each new token on everything that came before.
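
The whole loop in miniature. Here model and tokenizer are assumed interfaces rather than a specific library: model(ids) returns next-token logits, encode returns a list of token IDs, and eos_id is the stop token:

```python
def generate(model, tokenizer, prompt, max_new_tokens=100):
    ids = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        logits = model(ids)              # P(next token | everything so far)
        next_id = int(logits.argmax())   # greedy decoding for simplicity
        if next_id == tokenizer.eos_id:  # stop token ends the loop
            break
        ids.append(next_id)              # token appended to context -> repeat
    return tokenizer.decode(ids)
```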


6. Post-processing

  • Detokenization (tokens → text)
  • Formatting (markdown, structured output)
  • Optional filters (safety, PII, etc.)
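
A toy version of this pass, purely illustrative: detokenize is an assumed interface, and the regex stands in for a real PII/safety service:

```python
import re

def postprocess(token_ids, detokenize):
    text = detokenize(token_ids)  # tokens -> text
    text = text.strip()           # formatting cleanup
    # Naive email redaction as a stand-in for a real PII filter.
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[email redacted]", text)
```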

7. Response

Final response displayed to the user — sometimes enriched with UI elements, images, or tool results.


Two modes to distinguish

Mode 1 — Simple Chat (without RAG)

User → Prompt → LLM → Answer

Based solely on:

  • the model's weights
  • the conversational context

Mode 2 — Chat with RAG

User
→ Retrieval (docs, DB, search)
→ Enriched prompt
→ LLM
→ Answer

Used for: precision, fresh data, business context.


Fundamental insight

An LLM is neither a database nor a search engine.

It is: a probabilistic text generation engine conditioned on a context.

Common mistakes

  • "The LLM goes and fetches the info" → the LLM generates from its weights + context
  • "The LLM understands like a human" → it predicts the next token statistically
  • "RAG = LLM" → RAG is a separate retrieval layer, and optional

Correct mental model

  • LLM = generation engine
  • RAG = retrieval engine (optional)
  • Product = orchestration of both
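
A sketch of that orchestration, deciding per request between the two modes above. llm, retriever, and needs_retrieval are illustrative interfaces, not any particular framework's API:

```python
def answer(user_input, history, llm, retriever, needs_retrieval):
    if needs_retrieval(user_input):                   # Mode 2: chat with RAG
        docs = retriever.search(user_input, top_k=3)
        context = "\n\n".join(docs)
        prompt = f"Context:\n{context}\n\nQuestion: {user_input}"
    else:                                             # Mode 1: simple chat
        prompt = user_input
    return llm.chat(system="You are a helpful assistant.",
                    history=history, user=prompt)
```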

Why add RAG?

RAG enriches the model's context at query time with data external to its training. It doesn't replace the LLM; it augments it.

Main use cases

  • Private data access: internal docs, PDFs, CRM (impossible for the LLM alone)
  • Data freshness: news, prices, regulations (LLMs have a training cutoff)
  • Hallucination reduction: the model grounds answers in real documents
  • Business context: internal jargon, procedures, client-specific knowledge
  • Traceability: cite sources, audit answers (critical in enterprise)

Trade-offs

  • Added complexity: vector DB, embeddings, retrieval pipeline
  • Latency: retrieval + injection adds time to each request
  • Quality depends on retrieval: garbage in = garbage out

Key insight: RAG transforms a "generalist" LLM into a "specialist" assistant.


Diagram — Chat LLM with / without RAG

```mermaid
flowchart TD
    A[User Input] --> B[Prompt Construction<br/>System + History + Instructions]
    B --> C{Need External Knowledge?}
    C -->|No| D[LLM Inference]
    C -->|Yes| E[Embedding Query]
    E --> F[Vector Search / DB / APIs]
    F --> G[Top-K Retrieved Documents]
    G --> H[Augmented Prompt]
    H --> D
    D --> I[Token Generation Loop<br/>Auto-regressive]
    I --> J[Detokenization]
    J --> K[Post-processing / Formatting]
    K --> L[Final Response]
```