A mode of operation where an AI model acts autonomously over multiple steps to complete a goal, rather than simply answering a single prompt. An agentic system can use tools (web search, code execution, file access), plan a sequence of actions, observe the results of each step, and adjust its behavior accordingly, often without human intervention between steps. The key properties are: tool use, multi-step reasoning, and a feedback loop between action and observation. Examples: Claude Code, OpenAI Operator, AutoGPT, LangChain agents.
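A minimal sketch of that loop in Python; the `model.next_action()` method, the action dictionary, and the tool registry are illustrative assumptions, not any real framework's API:

```python
# A minimal sketch of the plan -> act -> observe loop, assuming a hypothetical
# `model` object with a next_action() method; no real framework's API is implied.

def run_agent(goal, model, tools, max_steps=10):
    """Loop until the model returns a final answer or the step budget runs out."""
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        action = model.next_action(history)               # plan: a tool call or a final answer
        if action["type"] == "final_answer":
            return action["content"]
        result = tools[action["tool"]](**action["args"])  # act: run the chosen tool
        history.append({"role": "tool", "content": str(result)})  # observe: feed the result back
    return "Step budget exhausted without an answer."
```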
The maximum number of tokens a model can process in a single interaction, counting both the input (prompt + history) and the output. It defines the model's "working memory": anything outside the window is invisible to it. Modern models range from 8k tokens (Llama 3 8B) to 1M+ tokens (Gemini 1.5 Pro). A larger context window means the model can read longer documents, hold longer conversations, and maintain more context, but it also costs more to run.
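A toy illustration of the shared input/output budget (the window size and token counts are illustrative; real counts come from the model's tokenizer):

```python
# The context window is a shared budget for input and output tokens.
CONTEXT_WINDOW = 8_192          # e.g. Llama 3 8B

def fits(prompt_tokens: int, max_output_tokens: int) -> bool:
    """Input and output share the same window: their sum must fit."""
    return prompt_tokens + max_output_tokens <= CONTEXT_WINDOW

print(fits(7_000, 1_000))   # True  (7000 + 1000 <= 8192)
print(fits(7_500, 1_000))   # False (overflows the window)
```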
A neural network architecture where every parameter is activated for every input: all weights participate in every forward pass. Dense models are simpler to train and reason about, but become expensive at very large scales, since doubling the parameters doubles the compute required for each inference. Examples: GPT-3, Llama 3, Mistral 7B.
Federal Risk and Authorization Management Program. A US government program that standardizes security assessment and authorization for cloud services used by federal agencies. Any cloud provider wanting to sell to the US federal government must obtain FedRAMP authorization. Three impact levels: Low, Moderate (most civilian agencies), and High (sensitive government data). The process is lengthy (12–18 months) and expensive, effectively a barrier to entry. In the AI context: FedRAMP authorization is required for AI platforms used in government (defense, healthcare, intelligence). AWS GovCloud and Azure Government have FedRAMP High; Google Vertex AI and some OpenAI/Anthropic offerings have Moderate. FedRAMP is the AI compliance requirement you hear about in US public sector deals.
When a model generates text that is factually incorrect, fabricated, or inconsistent with reality, stated with apparent confidence. Hallucinations arise because LLMs predict plausible continuations of text, not verified facts. Common examples: invented citations, wrong dates, non-existent people or products. Mitigated (but not eliminated) by RAG, grounding, and better alignment.
Health Insurance Portability and Accountability Act. A US federal law (1996) that sets strict rules on how health information (medical records, diagnoses, prescriptions, etc.) can be stored, shared, and processed. In the AI context: any cloud platform or application that handles patient data in the US must be "HIPAA-compliant", meaning the vendor signs a Business Associate Agreement (BAA) and implements specific technical safeguards (encryption, access controls, audit logs). AWS Bedrock, Azure OpenAI, and Google Vertex AI all offer HIPAA-eligible configurations. Non-compliance can lead to fines up to $1.9M per violation category per year.
Short for Key-Value Cache. A memory optimization used during LLM inference (text generation).
How transformers generate text: they produce tokens one at a time. For each new token, the attention mechanism needs the Key (K) and Value (V) matrices for *all previous tokens* in order to understand what came before. Without a cache, those matrices are recomputed from scratch at every step, so cost grows quadratically with context length.
What the KV cache does: after computing K and V for each token, it stores them. When generating the next token, only the new token's K/V are computed; all previous ones are read from cache. This reduces per-token cost from O(n²) to O(n).
The trade-off: the cache lives in GPU/CPU memory. For a 70B model with a 128K context, the KV cache can consume tens of gigabytes of VRAM, often more than the model weights themselves.
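A toy single-head decode loop in NumPy, simplified to one head with no layers or batching, showing what is cached versus recomputed:

```python
import numpy as np

# Toy single-head attention decode loop illustrating the KV cache idea.
d = 64                                   # head dimension
Wk, Wv, Wq = (np.random.randn(d, d) for _ in range(3))
k_cache, v_cache = [], []                # grows by one entry per generated token

def decode_step(x: np.ndarray) -> np.ndarray:
    """Compute the attention output for one new token, reusing cached K/V."""
    k_cache.append(x @ Wk)               # only the NEW token's K/V are computed...
    v_cache.append(x @ Wv)
    K = np.stack(k_cache)                # ...all previous ones are read from the cache
    V = np.stack(v_cache)
    q = x @ Wq
    scores = K @ q / np.sqrt(d)          # attend over all tokens so far: O(n) per step
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

for _ in range(4):                       # generate 4 tokens; the cache grows, nothing is recomputed
    out = decode_step(np.random.randn(d))
```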
Key related concepts:
- GQA / MQA (Grouped/Multi-Query Attention): architectural variants (Llama 3, Gemma, Mistral) that share K/V heads across query heads, shrinking cache size 4–8×.
- PagedAttention (vLLM): manages the cache like OS virtual memory to avoid fragmentation and maximize GPU utilization in multi-user serving.
- Prefix caching: reuses the cached K/V of a repeated system prompt across requests, saving latency and cost (supported by Anthropic, OpenAI, and Google APIs).
A language model trained on massive text datasets with billions of parameters. The "large" refers to both the size of the training data and the number of parameters. LLMs can understand and generate coherent text, answer questions, write code, and more. Examples: GPT-4, Claude, Gemini, Llama.
A subfield of artificial intelligence where systems learn from data to improve their performance on a task, without being explicitly programmed for each case. Deep Learning and LLMs are branches of Machine Learning.
Model Context Protocol. An open standard (governed by the Linux Foundation) that defines how AI applications connect to external data sources and tools. MCP uses a client-server model: the MCP Host (your app) contains an MCP Client that communicates with one or more MCP Servers, each exposing three types of capabilities: Resources (read-only data), Tools (functions with side effects), and Prompts (reusable templates). Before MCP, every AI application had to build custom integrations for every external resource. MCP standardizes this layer so that any compliant client can talk to any compliant server. Official SDKs exist for 10 languages. A community registry lists hundreds of production servers covering filesystems, databases, GitHub, Slack, browser automation, and more.
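For flavor, a minimal server sketch assuming the official Python SDK's FastMCP helper; the server name, tool, and resource URI below are invented for illustration:

```python
# A minimal MCP server sketch using the Python SDK's FastMCP helper
# (pip install mcp). The tool and resource here are hypothetical examples.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()                                   # Tool: a function with side effects
def create_ticket(title: str) -> str:
    """Create a ticket in a hypothetical tracker and return its ID."""
    return f"TICKET-123: {title}"

@mcp.resource("tickets://open")               # Resource: read-only data
def open_tickets() -> str:
    return "No open tickets."

if __name__ == "__main__":
    mcp.run()                                 # serves over stdio by default
```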
An architecture where the model is divided into many specialized sub-networks ("experts"). For each token, a routing mechanism activates only a small subset of experts (typically 2–8). Result: a model can have hundreds of billions of total parameters but use only a fraction of them per inference, achieving high capacity at lower compute cost. Examples: GPT-4 (believed MoE), Mixtral 8x7B, Gemini 1.5, DeepSeek-V3.
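A toy sketch of the routing step, with an illustrative softmax gate and top-2 selection (real routers add load balancing and train the gate jointly with the experts):

```python
import numpy as np

# Toy top-k expert routing: a gate scores all experts, only the top-k run,
# and their outputs are mixed by gate weight.

def moe_layer(x, experts, gate_w, k=2):
    logits = x @ gate_w                        # one score per expert
    top = np.argsort(logits)[-k:]              # indices of the k best experts
    gate = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the selected experts
    return sum(g * experts[i](x) for g, i in zip(gate, top))

d, n_experts = 16, 8
experts = [lambda x, W=np.random.randn(d, d): x @ W for _ in range(n_experts)]
gate_w = np.random.randn(d, n_experts)
out = moe_layer(np.random.randn(d), experts, gate_w)   # only 2 of the 8 experts ran
```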
A compression technique that reduces the numerical precision of model weights (and sometimes activations) from high-precision formats (FP32, BF16) to lower-precision ones (INT8, FP8, INT4). The goal: a smaller model, faster inference, less memory, minimal quality loss.
Why it matters: a 70B-parameter model in BF16 requires ~140 GB of VRAM. In INT4, it fits in ~35 GB, runnable on 2× consumer GPUs or a single high-end one.
Common formats:
- FP32 (32-bit float): the training reference, rarely used for inference
- BF16 / FP16 (16-bit): standard inference today; good quality, half the FP32 size
- INT8 (8-bit integer): ~2× smaller than BF16, fast on modern GPUs, minimal degradation
- FP8 (8-bit float): a newer format (H100 GPUs), better than INT8 at preserving range
- INT4 (4-bit integer): aggressive compression, some quality loss, used in consumer setups
Popular quantization methods:
- GPTQ: post-training quantization (PTQ), layer-by-layer, good INT4 quality
- AWQ (Activation-aware Weight Quantization): preserves the most important weights, often better quality than GPTQ
- GGUF (llama.cpp): a file format that bundles model weights + quantization metadata for CPU/GPU inference
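A toy sketch of the core mapping these methods build on, symmetric per-tensor INT8 quantization; real methods work per-channel or per-group and use calibration data:

```python
import numpy as np

# Toy symmetric INT8 weight quantization: map floats to integers via a scale.
# GPTQ, AWQ, etc. are far more careful; this only shows the core idea.

w = np.random.randn(4, 4).astype(np.float32)       # stand-in for FP32/BF16 weights

scale = np.abs(w).max() / 127                      # one scale per tensor (per-channel in practice)
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # 4x smaller than FP32
w_dequant = w_int8.astype(np.float32) * scale      # approximate reconstruction at inference

print("max error:", np.abs(w - w_dequant).max())   # small but nonzero: the "quality loss"
```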
KV cache quantization: quantization can also be applied to the KV cache (not just the weights), shrinking its memory footprint for long contexts. Not the same thing as the KV cache itself: weight quantization compresses the model at load time, while the KV cache is a runtime optimization for storing attention states during generation.
Retrieval-Augmented Generation. A pattern that addresses the knowledge-cutoff and hallucination problems by injecting real-time or domain-specific information into the prompt at query time. How it works: (1) the user query is converted to a vector, (2) semantically similar documents are retrieved from a knowledge base, (3) those documents are added to the prompt, (4) the model answers based on the provided context. Widely adopted in production; it is the most reliable way to give an LLM access to specific data.
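A minimal sketch of the four steps; the `embed()` function is a toy stand-in (a hashed bag-of-words) for a real embedding model:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for an embedding model: a hashed bag-of-words unit vector."""
    v = np.zeros(64)
    for word in text.lower().split():
        v[hash(word) % 64] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

docs = ["Refunds take 5 days.", "Support is open 9-5.", "We ship worldwide."]
doc_vecs = [embed(d) for d in docs]                         # indexed once, offline

def rag_prompt(query: str, k: int = 2) -> str:
    q = embed(query)                                        # (1) query -> vector
    scores = [float(q @ v) for v in doc_vecs]               # (2) retrieve similar docs
    top = sorted(range(len(docs)), key=lambda i: -scores[i])[:k]
    context = "\n".join(docs[i] for i in top)               # (3) add them to the prompt
    return f"Context:\n{context}\n\nQuestion: {query}"      # (4) model answers from this

print(rag_prompt("How long do refunds take?"))
```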
Reinforcement Learning from Human Feedback. A training technique used to align LLMs with human preferences after the initial pre-training phase. Human raters compare model outputs and rank them; these preferences train a reward model which then guides the LLM via reinforcement learning to produce more helpful, harmless, and honest responses. Used by OpenAI (InstructGPT, ChatGPT) and Anthropic (Claude).
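A sketch of the reward-model step under the standard pairwise (Bradley-Terry) loss; the reward scores below are illustrative scalars, not real model outputs:

```python
import numpy as np

# Raters prefer output A over B; the reward model is trained so r(A) > r(B)
# by minimizing: loss = -log(sigmoid(r_chosen - r_rejected)).

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

print(pairwise_loss(2.0, 0.5))   # small loss: the model already agrees with the raters
print(pairwise_loss(0.5, 2.0))   # large loss: the model ranks the pair backwards
```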
Service Organization Control 2. An auditing framework created by the AICPA (American Institute of CPAs) that evaluates how a cloud service provider manages data security. SOC 2 covers five "Trust Service Criteria": Security, Availability, Processing Integrity, Confidentiality, and Privacy. Two levels: Type I (a snapshot audit: controls exist at a point in time) and Type II (an operational audit: controls worked consistently over 6–12 months). Type II is the standard enterprise buyers require. In the AI context: SOC 2 Type II is the baseline certification expected from any AI API provider (OpenAI, Anthropic, Google, AWS, etc.) before enterprise procurement teams will approve it.
A parameter (typically between 0 and 2) that controls how random or deterministic a model's outputs are. Technically, it scales the probability distribution over the vocabulary before sampling the next token. Low temperature → the model concentrates probability on the most likely tokens (predictable, repetitive). High temperature → probability is spread more evenly (creative, varied, but also more likely to hallucinate). Temperature 0: near-deterministic; the model almost always picks the highest-probability token. Good for: code generation, factual Q&A, structured outputs. Temperature 0.7–1: balanced creativity. Good for: chat, writing assistance. Temperature > 1: highly creative/random. Good for: brainstorming, creative writing. Often used alongside top-p (nucleus sampling) to further control output diversity.
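The scaling step in code, with illustrative logits:

```python
import numpy as np

# Temperature scaling: divide the logits by T before the softmax.

def sample_probs(logits: np.ndarray, temperature: float) -> np.ndarray:
    scaled = logits / max(temperature, 1e-6)   # T -> 0 approaches argmax
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
print(sample_probs(logits, 0.2))   # sharp: nearly all mass on the top token
print(sample_probs(logits, 1.0))   # the model's raw distribution
print(sample_probs(logits, 1.5))   # flatter: more diverse, riskier sampling
```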
The smallest unit of text a language model processes. A token is roughly 0.75 words in English; "tokenization", for instance, is typically split into ["token", "ization"]. Models don't see raw text: they convert everything into a sequence of token IDs before processing. The number of tokens in a request determines its cost and speed.
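For example, with OpenAI's tiktoken library (the exact split depends on the tokenizer; cl100k_base is the GPT-4 encoding):

```python
# Counting and inspecting tokens with tiktoken (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("tokenization")
print(ids)                                   # the token IDs the model actually sees
print([enc.decode([i]) for i in ids])        # e.g. ['token', 'ization']
```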
The neural network architecture that underpins virtually all modern LLMs. Introduced in the 2017 paper "Attention Is All You Need" (Vaswani et al., Google). Key innovation: the self-attention mechanism, which lets the model weigh the relevance of every word against every other word in the input simultaneously, instead of processing tokens one by one like older RNNs. This parallelism made it possible to train on massive datasets using GPUs/TPUs at scale. Three variants: Encoder-only (BERT: good for understanding/classification), Decoder-only (GPT, Claude, Llama: good for generation), and Encoder-Decoder (T5: good for translation). Every LLM you use today (GPT-4, Claude, Gemini, Llama) is a Transformer.
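A NumPy sketch of scaled dot-product self-attention over a whole sequence at once, the parallelism described above, simplified to a single head without the causal mask decoders add:

```python
import numpy as np

# Every token attends to every other token in one matrix operation.

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # (n, n): every token vs. every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # each output mixes all inputs

n, d = 5, 16                                      # 5 tokens, 16-dim embeddings
X = np.random.randn(n, d)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)               # shape (5, 16), computed in parallel
```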
The numerical parameters learned by a neural network during training. Weights encode everything the model has "learned": grammar, facts, reasoning patterns. When a model is said to have billions of parameters, those parameters are its weights. "Open weights" means the trained weight files are publicly released, allowing anyone to run or fine-tune the model locally, even if the training data or code are not shared.