Quantisation
Quantisation reduces the numerical precision of model weights and activations — from 32-bit floating point (FP32) to 16-bit (FP16/BF16), 8-bit (INT8/FP8), or 4-bit (INT4) — to decrease memory footprint and increase inference throughput. A model quantised from FP16 to INT8 uses approximately half the VRAM, and because LLM decoding is typically memory-bandwidth bound, halving the bytes moved per token can roughly double token throughput, at the cost of modest accuracy degradation. Quantisation is the primary technique for making large language models economically viable for inference at scale.
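The core mechanic can be sketched in a few lines. This is a minimal illustration of symmetric per-tensor INT8 quantisation, not any production kernel; the function names and example values are invented for illustration.

```python
def quantise_int8(weights):
    """Map FP32 weights to INT8 integers using a single shared scale.

    Symmetric per-tensor scheme: the scale maps the largest-magnitude
    weight onto the edge of the signed 8-bit range [-127, 127].
    """
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]  # each value now fits in one byte
    return q, scale


def dequantise(q, scale):
    """Recover approximate FP32 values; the rounding error is the accuracy cost."""
    return [v * scale for v in q]


weights = [0.41, -1.27, 0.03, 0.89]
q, scale = quantise_int8(weights)
approx = dequantise(q, scale)
```

Each weight is stored as one signed byte plus a shared scale, which is where the roughly 2x memory saving over FP16 comes from; the reconstruction error is bounded by half the scale per weight.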
Post-training quantisation (PTQ) converts a pre-trained model to lower precision without retraining. Quantisation-aware training (QAT) incorporates quantisation during the training process, producing more accurate quantised models. GPTQ and AWQ are popular PTQ methods for LLMs, and the GGUF format (successor to GGML) packages quantised weights for llama.cpp-style runtimes. The NVIDIA Blackwell architecture adds native FP4 support, enabling 4-bit inference with hardware acceleration. The practical impact: a 70B-parameter model that requires 140 GB in FP16 fits in approximately 35 GB at INT4, enabling single-GPU deployment on hardware with sufficient VRAM.
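The sizing arithmetic behind those figures is simple bytes-per-parameter maths. A back-of-envelope sketch, using only the 70B parameter count from the text; the helper name is illustrative, and the figure covers weights only (KV cache and activations add more):

```python
# Bits per parameter at each precision level.
BITS = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def weight_gb(n_params, precision):
    """Weight memory in GB (1 GB = 1e9 bytes); excludes KV cache and activations."""
    return n_params * BITS[precision] / 8 / 1e9


print(weight_gb(70e9, "FP16"))  # 140.0
print(weight_gb(70e9, "INT4"))  # 35.0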
Quantisation trends affect the GPU hardware required for inference at scale, which in turn affects capacity planning and pricing models. Operators serving inference workloads need to account for quantisation in their fleet sizing — a model that requires 8 GPUs in FP16 may only need 2 GPUs at INT4.
This glossary is maintained by Disintermediate as a reference for GPU infrastructure professionals, investors, and operators. Each entry reflects terminology as used in active advisory engagements and market intelligence work.