A mode of operation where an AI model acts autonomously over multiple steps to complete a goal, rather than simply answering a single prompt. An agentic system can use tools (web search, code execution, file access), plan a sequence of actions, observe the results of each step, and adjust its behavior accordingly, often without human intervention between steps. The key properties are: tool use, multi-step reasoning, and a feedback loop between action and observation. Examples: Claude Code, OpenAI Operator, AutoGPT, LangChain agents.
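A minimal sketch of that loop in Python; the `model.next_action()` method, the action dictionary, and the tool registry are illustrative assumptions, not any real framework's API:

```python
# A minimal sketch of the plan -> act -> observe loop, assuming a hypothetical
# `model` object with a next_action() method; no real framework's API is implied.

def run_agent(goal, model, tools, max_steps=10):
    """Loop until the model returns a final answer or the step budget runs out."""
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        action = model.next_action(history)               # plan: a tool call or a final answer
        if action["type"] == "final_answer":
            return action["content"]
        result = tools[action["tool"]](**action["args"])  # act: run the chosen tool
        history.append({"role": "tool", "content": str(result)})  # observe: feed the result back
    return "Step budget exhausted without an answer."
```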
The maximum number of tokens a model can process in a single interaction, counting both the input (prompt + history) and the output. It defines the model's "working memory": anything outside the window is invisible to it. Modern models range from 8k tokens (Llama 3 8B) to 1M+ tokens (Gemini 1.5 Pro). A larger context window means the model can read longer documents, hold longer conversations, and maintain more context, but it also costs more to run.
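A toy illustration of the shared input/output budget (the window size and token counts are illustrative; real counts come from the model's tokenizer):

```python
# The context window is a shared budget for input and output tokens.
CONTEXT_WINDOW = 8_192          # e.g. Llama 3 8B

def fits(prompt_tokens: int, max_output_tokens: int) -> bool:
    """Input and output share the same window: their sum must fit."""
    return prompt_tokens + max_output_tokens <= CONTEXT_WINDOW

print(fits(7_000, 1_000))   # True  (7000 + 1000 <= 8192)
print(fits(7_500, 1_000))   # False (overflows the window)
```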
A neural network architecture where every parameter is activated for every input: all weights participate in every forward pass. Dense models are simpler to train and reason about, but become expensive at very large scales, since doubling the parameters doubles the compute required for each inference. Examples: GPT-3, Llama 3, Mistral 7B.
Federal Risk and Authorization Management Program. A US government program that standardizes security assessment and authorization for cloud services used by federal agencies. Any cloud provider wanting to sell to the US federal government must obtain FedRAMP authorization. Three impact levels: Low, Moderate (most civilian agencies), and High (sensitive government data). The process is lengthy (12–18 months) and expensive, effectively a barrier to entry. In the AI context: FedRAMP authorization is required for AI platforms used in government (defense, healthcare, intelligence). AWS GovCloud and Azure Government have FedRAMP High; Google Vertex AI and some OpenAI/Anthropic offerings have Moderate. FedRAMP is the AI compliance requirement you hear about in US public sector deals.
When a model generates text that is factually incorrect, fabricated, or inconsistent with reality, stated with apparent confidence. Hallucinations arise because LLMs predict plausible continuations of text, not verified facts. Common examples: invented citations, wrong dates, non-existent people or products. Mitigated (but not eliminated) by RAG, grounding, and better alignment.
Health Insurance Portability and Accountability Act. A US federal law (1996) that sets strict rules on how health information (medical records, diagnoses, prescriptions, etc.) can be stored, shared, and processed. In the AI context: any cloud platform or application that handles patient data in the US must be "HIPAA-compliant", meaning the vendor signs a Business Associate Agreement (BAA) and implements specific technical safeguards (encryption, access controls, audit logs). AWS Bedrock, Azure OpenAI, and Google Vertex AI all offer HIPAA-eligible configurations. Non-compliance can lead to fines up to $1.9M per violation category per year.
Short for Key-Value Cache. A memory optimization used during LLM inference (text generation).
How transformers generate text: they produce tokens one at a time. For each new token, the attention mechanism needs the Key (K) and Value (V) matrices for *all previous tokens* in order to understand what came before. Without a cache, those matrices are recomputed from scratch at every step, so cost grows quadratically with context length.
What the KV cache does: after computing K and V for each token, it stores them. When generating the next token, only the new token's K/V are computed; all previous ones are read from cache. This reduces per-token cost from O(n²) to O(n).
The trade-off: the cache lives in GPU/CPU memory. For a 70B model with a 128K context, the KV cache can consume tens of gigabytes of VRAM, often more than the model weights themselves.
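A toy single-head decode loop in NumPy, simplified to one head with no layers or batching, showing what is cached versus recomputed:

```python
import numpy as np

# Toy single-head attention decode loop illustrating the KV cache idea.
d = 64                                   # head dimension
Wk, Wv, Wq = (np.random.randn(d, d) for _ in range(3))
k_cache, v_cache = [], []                # grows by one entry per generated token

def decode_step(x: np.ndarray) -> np.ndarray:
    """Compute the attention output for one new token, reusing cached K/V."""
    k_cache.append(x @ Wk)               # only the NEW token's K/V are computed...
    v_cache.append(x @ Wv)
    K = np.stack(k_cache)                # ...all previous ones are read from the cache
    V = np.stack(v_cache)
    q = x @ Wq
    scores = K @ q / np.sqrt(d)          # attend over all tokens so far: O(n) per step
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

for _ in range(4):                       # generate 4 tokens; the cache grows, nothing is recomputed
    out = decode_step(np.random.randn(d))
```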
Key related concepts:
- GQA / MQA (Grouped/Multi-Query Attention): architectural variants (Llama 3, Gemma, Mistral) that share K/V heads across query heads, shrinking cache size 4–8×.
- PagedAttention (vLLM): manages the cache like OS virtual memory to avoid fragmentation and maximize GPU utilization in multi-user serving.
- Prefix caching: reuses the cached K/V of a repeated system prompt across requests, saving latency and cost (supported by Anthropic, OpenAI, and Google APIs).
A language model trained on massive text datasets with billions of parameters. The "large" refers to both the size of the training data and the number of parameters. LLMs can understand and generate coherent text, answer questions, write code, and more. Examples: GPT-4, Claude, Gemini, Llama.
A subfield of artificial intelligence where systems learn from data to improve their performance on a task, without being explicitly programmed for each case. Deep Learning and LLMs are branches of Machine Learning.
Model Context Protocol. An open standard (governed by the Linux Foundation) that defines how AI applications connect to external data sources and tools. MCP uses a client-server model: the MCP Host (your app) contains an MCP Client that communicates with one or more MCP Servers, each exposing three types of capabilities: Resources (read-only data), Tools (functions with side effects), and Prompts (reusable templates). Before MCP, every AI application had to build custom integrations for every external resource. MCP standardizes this layer so that any compliant client can talk to any compliant server. Official SDKs exist for 10 languages. A community registry lists hundreds of production servers covering filesystems, databases, GitHub, Slack, browser automation, and more.
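For flavor, a minimal server sketch assuming the official Python SDK's FastMCP helper; the server name, tool, and resource URI below are invented for illustration:

```python
# A minimal MCP server sketch using the Python SDK's FastMCP helper
# (pip install mcp). The tool and resource here are hypothetical examples.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()                                   # Tool: a function with side effects
def create_ticket(title: str) -> str:
    """Create a ticket in a hypothetical tracker and return its ID."""
    return f"TICKET-123: {title}"

@mcp.resource("tickets://open")               # Resource: read-only data
def open_tickets() -> str:
    return "No open tickets."

if __name__ == "__main__":
    mcp.run()                                 # serves over stdio by default
```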
An architecture where the model is divided into many specialized sub-networks ("experts"). For each token, a routing mechanism activates only a small subset of experts (typically 2–8). Result: a model can have hundreds of billions of total parameters but use only a fraction of them per inference, achieving high capacity at lower compute cost. Examples: GPT-4 (believed MoE), Mixtral 8x7B, Gemini 1.5, DeepSeek-V3.
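A toy sketch of the routing step, with an illustrative softmax gate and top-2 selection (real routers add load balancing and train the gate jointly with the experts):

```python
import numpy as np

# Toy top-k expert routing: a gate scores all experts, only the top-k run,
# and their outputs are mixed by gate weight.

def moe_layer(x, experts, gate_w, k=2):
    logits = x @ gate_w                        # one score per expert
    top = np.argsort(logits)[-k:]              # indices of the k best experts
    gate = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the selected experts
    return sum(g * experts[i](x) for g, i in zip(gate, top))

d, n_experts = 16, 8
experts = [lambda x, W=np.random.randn(d, d): x @ W for _ in range(n_experts)]
gate_w = np.random.randn(d, n_experts)
out = moe_layer(np.random.randn(d), experts, gate_w)   # only 2 of the 8 experts ran
```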
A compression technique that reduces the numerical precision of model weights (and sometimes activations) from high-precision formats (FP32, BF16) to lower-precision ones (INT8, FP8, INT4). The goal: a smaller model, faster inference, less memory, minimal quality loss.
Why it matters: a 70B-parameter model in BF16 requires ~140 GB of VRAM. In INT4, it fits in ~35 GB, runnable on 2× consumer GPUs or a single high-end one.
Common formats:
- FP32 (32-bit float): the training reference, rarely used for inference
- BF16 / FP16 (16-bit): standard inference today; good quality, half the FP32 size
- INT8 (8-bit integer): ~2× smaller than BF16, fast on modern GPUs, minimal degradation
- FP8 (8-bit float): a newer format (H100 GPUs), better than INT8 at preserving range
- INT4 (4-bit integer): aggressive compression, some quality loss, used in consumer setups
Popular quantization methods:
- GPTQ: post-training quantization (PTQ), layer-by-layer, good INT4 quality
- AWQ (Activation-aware Weight Quantization): preserves the most important weights, often better quality than GPTQ
- GGUF (llama.cpp): a file format that bundles model weights + quantization metadata for CPU/GPU inference
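A toy sketch of the core mapping these methods build on, symmetric per-tensor INT8 quantization; real methods work per-channel or per-group and use calibration data:

```python
import numpy as np

# Toy symmetric INT8 weight quantization: map floats to integers via a scale.
# GPTQ, AWQ, etc. are far more careful; this only shows the core idea.

w = np.random.randn(4, 4).astype(np.float32)       # stand-in for FP32/BF16 weights

scale = np.abs(w).max() / 127                      # one scale per tensor (per-channel in practice)
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # 4x smaller than FP32
w_dequant = w_int8.astype(np.float32) * scale      # approximate reconstruction at inference

print("max error:", np.abs(w - w_dequant).max())   # small but nonzero: the "quality loss"
```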
KV cache quantization: quantization can also be applied to the KV cache (not just the weights), shrinking its memory footprint for long contexts. Not the same thing as the KV cache itself: weight quantization compresses the model at load time, while the KV cache is a runtime optimization for storing attention states during generation.
Retrieval-Augmented Generation. A pattern that addresses the knowledge-cutoff and hallucination problems by injecting real-time or domain-specific information into the prompt at query time. How it works: (1) the user query is converted to a vector, (2) semantically similar documents are retrieved from a knowledge base, (3) those documents are added to the prompt, (4) the model answers based on the provided context. Widely adopted in production; it is the most reliable way to give an LLM access to specific data.
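A minimal sketch of the four steps; the `embed()` function is a toy stand-in (a hashed bag-of-words) for a real embedding model:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for an embedding model: a hashed bag-of-words unit vector."""
    v = np.zeros(64)
    for word in text.lower().split():
        v[hash(word) % 64] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

docs = ["Refunds take 5 days.", "Support is open 9-5.", "We ship worldwide."]
doc_vecs = [embed(d) for d in docs]                         # indexed once, offline

def rag_prompt(query: str, k: int = 2) -> str:
    q = embed(query)                                        # (1) query -> vector
    scores = [float(q @ v) for v in doc_vecs]               # (2) retrieve similar docs
    top = sorted(range(len(docs)), key=lambda i: -scores[i])[:k]
    context = "\n".join(docs[i] for i in top)               # (3) add them to the prompt
    return f"Context:\n{context}\n\nQuestion: {query}"      # (4) model answers from this

print(rag_prompt("How long do refunds take?"))
```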
Reinforcement Learning from Human Feedback. A training technique used to align LLMs with human preferences after the initial pre-training phase. Human raters compare model outputs and rank them; these preferences train a reward model which then guides the LLM via reinforcement learning to produce more helpful, harmless, and honest responses. Used by OpenAI (InstructGPT, ChatGPT) and Anthropic (Claude).
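A sketch of the reward-model step under the standard pairwise (Bradley-Terry) loss; the reward scores below are illustrative scalars, not real model outputs:

```python
import numpy as np

# Raters prefer output A over B; the reward model is trained so r(A) > r(B)
# by minimizing: loss = -log(sigmoid(r_chosen - r_rejected)).

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

print(pairwise_loss(2.0, 0.5))   # small loss: the model already agrees with the raters
print(pairwise_loss(0.5, 2.0))   # large loss: the model ranks the pair backwards
```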
Service Organization Control 2. An auditing framework created by the AICPA (American Institute of CPAs) that evaluates how a cloud service provider manages data security. SOC 2 covers five "Trust Service Criteria": Security, Availability, Processing Integrity, Confidentiality, and Privacy. Two levels: Type I (a snapshot audit: controls exist at a point in time) and Type II (an operational audit: controls worked consistently over 6–12 months). Type II is the standard enterprise buyers require. In the AI context: SOC 2 Type II is the baseline certification expected from any AI API provider (OpenAI, Anthropic, Google, AWS, etc.) before enterprise procurement teams will approve it.
A parameter (typically between 0 and 2) that controls how random or deterministic a model's outputs are. Technically, it scales the probability distribution over the vocabulary before sampling the next token. Low temperature → the model concentrates probability on the most likely tokens (predictable, repetitive). High temperature → probability is spread more evenly (creative, varied, but also more likely to hallucinate). Temperature 0: near-deterministic; the model almost always picks the highest-probability token. Good for: code generation, factual Q&A, structured outputs. Temperature 0.7–1: balanced creativity. Good for: chat, writing assistance. Temperature > 1: highly creative/random. Good for: brainstorming, creative writing. Often used alongside top-p (nucleus sampling) to further control output diversity.
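The scaling step in code, with illustrative logits:

```python
import numpy as np

# Temperature scaling: divide the logits by T before the softmax.

def sample_probs(logits: np.ndarray, temperature: float) -> np.ndarray:
    scaled = logits / max(temperature, 1e-6)   # T -> 0 approaches argmax
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
print(sample_probs(logits, 0.2))   # sharp: nearly all mass on the top token
print(sample_probs(logits, 1.0))   # the model's raw distribution
print(sample_probs(logits, 1.5))   # flatter: more diverse, riskier sampling
```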
The smallest unit of text a language model processes. A token is roughly 0.75 words in English; "tokenization", for instance, is typically split into ["token", "ization"]. Models don't see raw text: they convert everything into a sequence of token IDs before processing. The number of tokens in a request determines its cost and speed.
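For example, with OpenAI's tiktoken library (the exact split depends on the tokenizer; cl100k_base is the GPT-4 encoding):

```python
# Counting and inspecting tokens with tiktoken (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("tokenization")
print(ids)                                   # the token IDs the model actually sees
print([enc.decode([i]) for i in ids])        # e.g. ['token', 'ization']
```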
The neural network architecture that underpins virtually all modern LLMs. Introduced in the 2017 paper "Attention Is All You Need" (Vaswani et al., Google). Key innovation: the self-attention mechanism, which lets the model weigh the relevance of every word against every other word in the input simultaneously, instead of processing tokens one by one like older RNNs. This parallelism made it possible to train on massive datasets using GPUs/TPUs at scale. Three variants: Encoder-only (BERT: good for understanding/classification), Decoder-only (GPT, Claude, Llama: good for generation), and Encoder-Decoder (T5: good for translation). Every LLM you use today (GPT-4, Claude, Gemini, Llama) is a Transformer.
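A NumPy sketch of scaled dot-product self-attention over a whole sequence at once, the parallelism described above, simplified to a single head without the causal mask decoders add:

```python
import numpy as np

# Every token attends to every other token in one matrix operation.

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # (n, n): every token vs. every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # each output mixes all inputs

n, d = 5, 16                                      # 5 tokens, 16-dim embeddings
X = np.random.randn(n, d)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)               # shape (5, 16), computed in parallel
```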
The numerical parameters learned by a neural network during training. Weights encode everything the model has "learned": grammar, facts, reasoning patterns. When a model is said to have billions of parameters, those parameters are its weights. "Open weights" means the trained weight files are publicly released, allowing anyone to run or fine-tune the model locally, even if the training data or code are not shared.