Quantization — Memory by Format

Model quantization · reference model with 70B parameters

Memory footprint by numerical format

The same model can occupy very different amounts of memory depending on the numerical precision used to store its parameters.

Format       Bits/value   Memory (70B)   Notes
FP32         32           ~280 GB        Training reference; rarely used for inference
BF16         16           ~140 GB        Standard inference today; half the size of FP32, minimal quality loss
FP8 / INT8   8            ~70 GB         Optimized inference; faster on modern GPUs (H100), near-BF16 quality
INT4         4            ~35 GB         Aggressive compression; fits on consumer hardware (GPTQ, AWQ, GGUF)
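
These figures follow directly from the arithmetic: memory ≈ parameters × bits per value / 8 bytes. A minimal sketch reproducing the table (weights only; KV cache, activations, and runtime overhead are extra and not counted here):

```python
# Weight memory of a 70B-parameter model by numerical format.
# Approximation: parameters only, no KV cache or activation memory.
PARAMS = 70e9

formats = {
    "FP32": 32,  # bits per value
    "BF16": 16,
    "FP8/INT8": 8,
    "INT4": 4,
}

for name, bits in formats.items():
    gigabytes = PARAMS * bits / 8 / 1e9  # bits -> bytes -> GB (decimal)
    print(f"{name:>8}: ~{gigabytes:.0f} GB")
```

Running this prints ~280, ~140, ~70, and ~35 GB, matching the table above.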
Trade-off: less memory means lower numerical precision — but the quality loss is often negligible in practice for inference. Modern quantization methods (AWQ, GPTQ) are optimized to preserve the most important weights.
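
For intuition, here is a minimal sketch of the naive baseline that methods like AWQ and GPTQ improve on: symmetric round-to-nearest INT8 quantization with a single per-tensor scale. The random tensor and per-tensor granularity are illustrative assumptions, not how any particular library implements it; real methods typically use per-channel or per-group scales and calibration data.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)  # stand-in for a weight tensor

# Symmetric round-to-nearest INT8: one scale maps the max magnitude to 127.
scale = np.abs(w).max() / 127.0
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# Dequantize and measure how much precision was lost.
w_hat = q.astype(np.float32) * scale
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"relative reconstruction error: {rel_err:.4%}")
```

Even this naive scheme keeps the relative error small on a well-behaved tensor; methods like AWQ and GPTQ go further by choosing scales and rounding so that the weights that matter most for model outputs are preserved best.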