Training neural networks in full FP32 precision is expensive: a 7B parameter model with optimizer state and gradients requires over 100 GB of memory. Quantized training reduces this cost by using lower-precision formats throughout the training loop, not just at inference. This page follows Lecture 30 by Thien Tran, with additional coverage of Quartet 4-bit training from Lecture 69.
The accompanying Colab notebook from Lecture 30 contains runnable code for all techniques on this page.
Why train in low precision
Full-precision training (FP32 weights + FP32 optimizer states) is the most memory-hungry configuration. The memory cost per parameter in bytes:

| Configuration | Bytes per parameter |
|---|---|
| FP32 weights + FP32 Adam | 16 (param + grad + 2 × optimizer state) |
| BF16 mixed precision + FP32 Adam states | 12 |
| FP8 weights + FP8 gradients | 2–4 |
| 4-bit weights (Quartet) | ~0.5–1 |
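To make the table concrete, the sketch below multiplies each per-parameter cost by a 7B parameter count. The function name `training_memory_gb` is made up for this example, and ranges from the table are collapsed to their upper bound. Activation memory is not counted.

```python
# Bytes per parameter taken from the table above; where the table gives a
# range, the upper bound is used (an assumption made for this sketch).
BYTES_PER_PARAM = {
    "FP32 weights + FP32 Adam": 16,
    "BF16 mixed precision + FP32 Adam states": 12,
    "FP8 weights + FP8 gradients": 4,
    "4-bit weights (Quartet)": 1,
}


def training_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Weight, gradient, and optimizer memory in decimal GB, ignoring activations."""
    return num_params * bytes_per_param / 1e9


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:42s} {training_memory_gb(7e9, nbytes):6.1f} GB")
```

For the FP32 row this gives 7 × 10⁹ × 16 bytes = 112 GB, which matches the "over 100 GB" figure in the introduction.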
FP16 / BF16 mixed precision training
Mixed precision training keeps an FP32 master copy of the weights for the optimizer update but runs the forward and backward passes in FP16 or BF16. This is the standard training regime for most modern LLMs.
FP16
- 5-bit exponent, 10-bit mantissa
- Normal range: ~6.1 × 10⁻⁵ to ~6.55 × 10⁴
- Requires loss scaling to avoid underflow in gradients
- Supported on Volta and later
BF16
- 8-bit exponent, 7-bit mantissa
- Same exponent range as FP32, with less precision than FP16
- Usually trains without loss scaling
- Tensor Core support on Ampere and later
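The trade-off between the two 16-bit formats can be checked numerically. The sketch below uses only the Python standard library: `struct`'s `e` format rounds to FP16, and BF16 is emulated by keeping the top 16 bits of the FP32 encoding (BF16 shares FP32's 8-bit exponent and stores 7 mantissa bits). The helper names `to_fp16` and `to_bf16` are made up for this example, and the BF16 helper assumes finite inputs.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value; out-of-range inputs become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):  # struct refuses values beyond the FP16 maximum
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Round a finite x to BF16 by keeping the top 16 bits of its FP32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (1e-8, 1.001, 70000.0):
    print(f"{value:>10g}  fp16={to_fp16(value)!r:<14}  bf16={to_bf16(value)!r}")
```

A gradient-sized value such as 1e-8 vanishes in FP16 but survives in BF16, and 70000 overflows FP16 but not BF16. A value close to 1, on the other hand, is represented more finely in FP16.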
FP8 training (emerging standard)
FP8 training uses 8-bit floating-point for both the forward pass and gradient computation, with FP32 master weights for the optimizer. NVIDIA Hopper (H100) introduced native FP8 Tensor Cores capable of 2× the throughput of BF16. Two FP8 variants are used together in training:

| Format | Exponent bits | Mantissa bits | Use |
|---|---|---|---|
| E4M3 | 4 | 3 | Forward pass (weights, activations) — more precision |
| E5M2 | 5 | 2 | Backward pass (gradients) — wider range |
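The precision versus range trade-off in the table can be reproduced with a small software model of the two formats. This is a sketch, not how any library performs the cast: `round_to_fp8` is a made-up helper, the maximum values (448 for E4M3, 57344 for E5M2) follow the common OCP FP8 definitions, and saturating at the maximum is an assumption, since real casts may produce inf or NaN instead.

```python
import math

# Parameters of the two FP8 encodings (OCP FP8 definitions).
FP8_FORMATS = {
    "E4M3": {"mantissa_bits": 3, "min_exp": -6, "max_value": 448.0},
    "E5M2": {"mantissa_bits": 2, "min_exp": -14, "max_value": 57344.0},
}


def round_to_fp8(x: float, fmt: str) -> float:
    """Round x to the nearest value of the format, saturating at its maximum."""
    spec = FP8_FORMATS[fmt]
    if x == 0.0:
        return 0.0
    mag = min(abs(x), spec["max_value"])
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) == e - 1.
    # Clamping to min_exp makes small values land on the subnormal grid.
    exp = max(math.frexp(mag)[1] - 1, spec["min_exp"])
    step = 2.0 ** (exp - spec["mantissa_bits"])  # grid spacing in this binade
    rounded = min(round(mag / step) * step, spec["max_value"])
    return math.copysign(rounded, x)


for value in (0.27, 1e-4, 1000.0):
    print(value, round_to_fp8(value, "E4M3"), round_to_fp8(value, "E5M2"))
```

In this model, 0.27 lands closer to its true value in E4M3, while 1e-4 and 1000 are only representable in E5M2. This is why the wider-range format is the usual choice for gradients.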
Hardware-accelerated FP8 training requires GPUs with native FP8 Tensor Cores, such as Hopper (H100/H200). On older GPUs, FP8 operations fall back to software emulation and are slower than BF16.
INT8 training challenges
INT8 training is harder than INT8 inference because gradients have a much wider dynamic range than activations. Challenges include:
- Gradient underflow: small gradient values round to zero in INT8
- Gradient overflow: large gradient spikes exceed INT8 range
- Accumulation errors: repeated INT8 operations compound quantization error
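The first two items in the list above can be reproduced with a toy symmetric absmax quantizer. This is a minimal sketch: `quantize_int8` and the gradient values are made up for illustration, and per-tensor scaling is assumed.

```python
def quantize_int8(values, scale=None):
    """Symmetric INT8 quantization; derives an absmax scale unless one is given."""
    if scale is None:
        amax = max(abs(v) for v in values)
        scale = amax / 127.0 if amax > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


small_grads = [1e-4, -2e-4, 3e-4, 5e-4]
spike = 0.5

# Without a spike, the small gradients use the full INT8 range.
codes_small, scale_small = quantize_int8(small_grads)

# Underflow: one spike stretches the scale, so every small gradient rounds to 0.
codes_spiky, scale_spiky = quantize_int8(small_grads + [spike])

# Overflow: reusing the earlier scale clips the spike at the INT8 maximum.
codes_stale, _ = quantize_int8([spike], scale=scale_small)
recovered_spike = codes_stale[0] * scale_small

print(codes_small, codes_spiky, codes_stale, recovered_spike)
```

With a scale fitted to the spike, the four small gradients all become 0. With a scale fitted to the small gradients, the spike of 0.5 is stored as roughly 5 × 10⁻⁴.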
Stochastic rounding
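A minimal sketch of the rounding rule this section describes, on an integer grid. `stochastic_round` is a made-up helper. The experiment adds an update of 0.1 to an integer-valued weight 10,000 times, so the exact sum would be 1000.

```python
import math
import random


def stochastic_round(x: float, rng: random.Random) -> int:
    """Round x down or up; the chance of rounding up equals its fractional part."""
    lower = math.floor(x)
    return lower + (1 if rng.random() < x - lower else 0)


rng = random.Random(0)  # fixed seed so the run is repeatable
update = 0.1
w_nearest = 0
w_stochastic = 0
for _ in range(10_000):
    w_nearest = round(w_nearest + update)                         # always rounds back down
    w_stochastic = stochastic_round(w_stochastic + update, rng)   # moves up about 10% of the time

print(w_nearest, w_stochastic)
```

Round-to-nearest discards every update, so the weight never leaves 0. The stochastically rounded weight ends near the exact sum, because each rounding step is correct in expectation.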
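Standard rounding (round-to-nearest) introduces systematic bias: a value that sits below the midpoint between two representable numbers always rounds down, so an update smaller than half a quantization step is lost every time. Stochastic rounding adds random noise before rounding, so that a value rounds up with probability proportional to how close it is to the upper neighbor. This makes the expected value of the rounded result equal to the original value.
Loss scaling for numerical stability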
When training with FP16, small gradients can underflow to zero: the smallest normal FP16 value is ~6.1 × 10⁻⁵, and anything below the subnormal floor of ~6 × 10⁻⁸ vanishes entirely. Loss scaling multiplies the loss by a large constant before backpropagation, shifting the gradient magnitudes into the representable range; the gradients are divided by the same constant before the optimizer update.
GradScaler automates this with dynamic scaling: it increases the scale factor after a run of steps with no overflow and decreases it after an overflow.
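The dynamic-scaling rule can be sketched without a GPU. `ToyLossScaler` below is a hypothetical stand-in written for this page, not the PyTorch class. Its parameter names follow PyTorch's `GradScaler`, but `growth_interval` is set to 3 so the behavior shows up in a few steps. FP16 storage is emulated with `struct`.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value; out-of-range inputs become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)


class ToyLossScaler:
    """Hypothetical dynamic loss scaler; models only the grow/back-off rule."""

    def __init__(self, init_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=3):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, found_inf: bool) -> None:
        if found_inf:
            self.scale *= self.backoff_factor   # overflow: shrink the scale
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps == self.growth_interval:
                self.scale *= self.growth_factor  # stable run: try a larger scale
                self.good_steps = 0


def scaled_fp16_grad(true_grad: float, scaler: ToyLossScaler):
    """Return the unscaled gradient, or None when the step must be skipped."""
    scale = scaler.scale
    grad_fp16 = to_fp16(true_grad * scale)  # what FP16 backprop would store
    found_inf = math.isinf(grad_fp16)
    scaler.update(found_inf)
    return None if found_inf else grad_fp16 / scale  # unscale in higher precision


scaler = ToyLossScaler()
for g in (1e-8, 10.0, 1e-8, 1e-8, 1e-8):
    print(g, scaled_fp16_grad(g, scaler), scaler.scale)
```

A gradient of 1e-8 would be stored as 0 in FP16, but survives once multiplied by 65536. The gradient of 10.0 overflows, so that step is skipped and the scale is halved. In a PyTorch training loop, the same logic sits behind `scaler.scale(loss).backward()`, `scaler.step(optimizer)`, and `scaler.update()`.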
Quantized optimizers: 8-bit Adam
The Adam optimizer stores two FP32 buffers per parameter (the first and second moment estimates), adding 8 bytes per parameter, twice the size of the FP32 weights themselves. The bitsandbytes library provides 8-bit quantized optimizer states using block-wise dynamic quantization.
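The value of quantizing in blocks can be illustrated with a simplified scheme. The sketch below uses linear absmax codes per block; bitsandbytes uses a non-linear dynamic code, so this shows the block-wise idea only and is not the library's algorithm. The function names and the tiny block size are made up for this example.

```python
def quantize_blockwise(values, block_size):
    """Return (8-bit integer codes, one absmax scale per block)."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
        scales.append(amax)
        codes.extend(round(v / amax * 127) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size):
    """Map codes back to floats using the scale of the block each code belongs to."""
    return [c / 127 * scales[i // block_size] for i, c in enumerate(codes)]


# Optimizer state with small entries in one block and large entries in the other.
state = [1e-3, -2e-3, 5e-4, 1.5e-3, 0.8, -0.3, 0.1, 0.5]

per_tensor = dequantize_blockwise(*quantize_blockwise(state, 8), 8)  # one scale for everything
per_block = dequantize_blockwise(*quantize_blockwise(state, 4), 4)   # one scale per 4 values

print(per_tensor[:4])
print(per_block[:4])
```

With a single scale, the large entries force the small ones to zero. With one scale per block, the damage from an outlier stays inside its own block. In practice the library's optimizers are meant as drop-in replacements, for example `bitsandbytes.optim.Adam8bit` in place of `torch.optim.Adam`.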
Quartet 4-bit training (Lecture 69)
Quartet, presented in Lecture 69 by Roberto Castro and Andrei Panferov, extends quantized training to 4 bits. It combines:
- 4-bit weight quantization in the MXFP4 floating-point format, which the Quartet paper targets rather than an integer format
- Low-precision activations and gradients in the linear-layer matrix multiplications, not weights alone
- A per-group scale factor recomputed each step
- A fast 4-bit GEMM backend via qutlass
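The per-group scale factor in the list above can be illustrated with a small software model. The sketch assumes an MXFP4-style layout: 4-bit E2M1 elements that share one power-of-two scale per group, with 32 elements per group by default. It covers only that quantization step and is not Quartet's training algorithm or the qutlass kernels. All function names are made up for this example, and ties are broken toward the smaller magnitude for simplicity.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_group(group):
    """Return (scale, dequantized values) for one group sharing a power-of-two scale."""
    amax = max(abs(v) for v in group)
    if amax == 0.0:
        return 1.0, [0.0] * len(group)
    # frexp gives floor(log2(amax)) + 1; subtracting 3 in total maps amax into [4, 8).
    scale = 2.0 ** (math.frexp(amax)[1] - 1 - 2)
    out = []
    for v in group:
        mag = min(abs(v) / scale, E2M1_GRID[-1])              # saturate at 6
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))  # closest grid point
        out.append(math.copysign(nearest * scale, v))
    return scale, out


def quantize_per_group(values, group_size=32):
    """Quantize a flat list group by group; returns (scales, dequantized values)."""
    scales, out = [], []
    for start in range(0, len(values), group_size):
        scale, quantized = quantize_group(values[start:start + group_size])
        scales.append(scale)
        out.extend(quantized)
    return scales, out


weights = [0.1, -0.25, 0.7, 1.3, 0.01, 0.02, -0.03, 0.005]
scales, dequantized = quantize_per_group(weights, group_size=4)
print(scales)
print(dequantized)
```

Each group gets its own scale, so the second group of much smaller weights still maps onto the full E2M1 grid instead of collapsing to zero.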
Hardware support
Hopper FP8
H100 and H200 include native FP8 (E4M3 and E5M2) Tensor Cores, which provide up to 2× the throughput of BF16 Tensor Cores. This hardware is required for production FP8 training.
Turing / Ampere INT8
INT8 Tensor Cores are available from Turing (RTX 20xx) onwards, and Ampere (A100) adds BF16 Tensor Cores. INT8 Tensor Cores are primarily used for inference; INT8 training is experimental.
Further reading
- Lecture 30 slides — Thien Tran’s quantized training overview
- Lecture 30 notebook — Runnable Colab examples
- Quartet paper — 4-bit training from IST-DASLab
- bitsandbytes — 8-bit optimizer and quantization library
- TransformerEngine — NVIDIA’s FP8 training library