Quantization — Memory by Format

Model quantization · reference model with 70B parameters

Memory footprint by numerical format

The same model can occupy very different amounts of memory depending on the numerical precision used to store its parameters.

Format       Bits/value   Memory (70B)   Notes
FP32         32           ~280 GB        Training reference; rarely used for inference
BF16         16           ~140 GB        Standard inference today; half the size of FP32, minimal quality loss
FP8 / INT8   8            ~70 GB         Optimized inference; faster on modern GPUs (H100), near-BF16 quality
INT4         4            ~35 GB         Aggressive compression; fits on consumer hardware (GPTQ, AWQ, GGUF)
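
These figures follow directly from the arithmetic: memory ≈ parameters × bits per value / 8 bytes. A minimal sketch reproducing the table (weights only; KV cache, activations, and runtime overhead are extra and not counted here):

```python
# Weight memory of a 70B-parameter model by numerical format.
# Approximation: parameters only, no KV cache or activation memory.
PARAMS = 70e9

formats = {
    "FP32": 32,  # bits per value
    "BF16": 16,
    "FP8/INT8": 8,
    "INT4": 4,
}

for name, bits in formats.items():
    gigabytes = PARAMS * bits / 8 / 1e9  # bits -> bytes -> GB (decimal)
    print(f"{name:>8}: ~{gigabytes:.0f} GB")
```

Running this prints ~280, ~140, ~70, and ~35 GB, matching the table above.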
Trade-off: less memory means lower numerical precision — but the quality loss is often negligible in practice for inference. Modern quantization methods (AWQ, GPTQ) are optimized to preserve the most important weights.
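
For intuition, here is a minimal sketch of the naive baseline that methods like AWQ and GPTQ improve on: symmetric round-to-nearest INT8 quantization with a single per-tensor scale. The random tensor and per-tensor granularity are illustrative assumptions, not how any particular library implements it; real methods typically use per-channel or per-group scales and calibration data.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)  # stand-in for a weight tensor

# Symmetric round-to-nearest INT8: one scale maps the max magnitude to 127.
scale = np.abs(w).max() / 127.0
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# Dequantize and measure how much precision was lost.
w_hat = q.astype(np.float32) * scale
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"relative reconstruction error: {rel_err:.4%}")
```

Even this naive scheme keeps the relative error small on a well-behaved tensor; methods like AWQ and GPTQ go further by choosing scales and rounding so that the weights that matter most for model outputs are preserved best.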