Every neural network computation is subject to the limits of finite-precision arithmetic. Understanding how floating-point formats represent numbers, where errors arise, and how to keep computations numerically stable is essential for GPU programmers working on deep learning. This page follows Lecture 84 by Paulius Micikevicius of NVIDIA (slides), a pioneer of mixed-precision training.
IEEE 754 floating-point formats
Every IEEE 754 floating-point number uses three fields: a sign bit, an exponent, and a mantissa (also called the significand or fraction).
Format overview
| Format | Total bits | Exponent bits | Mantissa bits | Max value | Min normal |
|---|---|---|---|---|---|
| FP32 | 32 | 8 | 23 | ~3.4 × 10³⁸ | ~1.2 × 10⁻³⁸ |
| FP16 | 16 | 5 | 10 | 65504 | ~6.1 × 10⁻⁵ |
| BF16 | 16 | 8 | 7 | ~3.4 × 10³⁸ | ~1.2 × 10⁻³⁸ |
| FP8 E4M3 | 8 | 4 | 3 | 448 | ~1.6 × 10⁻² |
| FP8 E5M2 | 8 | 5 | 2 | 57344 | ~6.1 × 10⁻⁵ |
| FP4 E2M1 | 4 | 2 | 1 | 6 | 1 |
BF16 shares its exponent width with FP32, so the two formats cover the same range of normal values; BF16 is effectively FP32 with the low 16 mantissa bits removed. Converting FP32 to BF16 keeps the upper 16 bits of the bit pattern (sign, exponent, and the top 7 mantissa bits) and rounds away the rest.
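The conversion can be sketched in a few lines of Python. This is a minimal illustration, assuming round-to-nearest-even and ignoring NaN inputs; the helper names `fp32_bits`, `fp32_to_bf16_bits`, and `bf16_bits_to_float` are made up for this example and are not taken from the lecture.

```python
import struct

def fp32_bits(x: float) -> int:
    """Return the 32-bit pattern of x after rounding it to FP32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def fp32_to_bf16_bits(x: float) -> int:
    """Round an FP32 value to BF16 and return the 16-bit pattern.

    Assumes round-to-nearest-even. NaN inputs are not handled.
    """
    bits = fp32_bits(x)
    # Adding 0x7FFF plus the lowest kept bit rounds ties to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    # Drop the low 16 mantissa bits; sign and exponent are untouched
    # unless the rounding carries into them.
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def bf16_bits_to_float(b: int) -> float:
    """Widen a BF16 bit pattern back to a Python float via FP32."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

pi_bf16 = bf16_bits_to_float(fp32_to_bf16_bits(3.141592653589793))
```

For π the FP32 pattern is 0x40490FDB. The low half, 0x0FDB, is below the halfway point, so the BF16 pattern is 0x4049 and `pi_bf16` is 3.140625: the exponent survives intact and only about two to three decimal digits of precision remain.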
Exponent bits vs. mantissa bits tradeoff
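For a fixed bit budget, every bit moved from mantissa to exponent roughly doubles the number of powers of two the format spans while halving the number of representable values within each power of two. The tradeoff runs in two directions: more exponent bits or more mantissa bits. The format below sits at the exponent-heavy end.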
FP8 E5M2 (5 exponent, 2 mantissa):
- Dynamic range: ~1.5 × 10⁻⁵ (smallest subnormal) to ~5.7 × 10⁴
- Only 4 representable levels per power of two
- Good for gradients, which span many orders of magnitude
- Poor precision (relative rounding error can exceed 10%), so suitable when range matters more than accuracy
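To make these numbers concrete, the sketch below decodes every E5M2 bit pattern. It assumes the IEEE-style layout (1 sign bit, 5 exponent bits with bias 15, 2 mantissa bits, all-ones exponent reserved for infinity and NaN); `decode_e5m2` is a hypothetical helper written for this page, not a library function.

```python
def decode_e5m2(byte: int) -> float:
    """Decode one FP8 E5M2 byte, assuming IEEE-style encoding with bias 15."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 2) & 0x1F
    mantissa = byte & 0x03
    if exponent == 0x1F:
        # All-ones exponent: infinity (mantissa 0) or NaN (otherwise).
        return sign * float("inf") if mantissa == 0 else float("nan")
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed exponent of -14.
        return sign * (mantissa / 4) * 2.0 ** -14
    return sign * (1 + mantissa / 4) * 2.0 ** (exponent - 15)

# Every finite non-negative value (bytes 0x00 through 0x7B), in order.
finite = sorted(decode_e5m2(b) for b in range(0x7C))

largest = max(finite)                                   # 57344.0
smallest_subnormal = decode_e5m2(0x01)                  # 2**-16
levels_in_binade = [v for v in finite if 1.0 <= v < 2.0]
```

`levels_in_binade` comes out as `[1.0, 1.25, 1.5, 1.75]`, which is the "4 levels per power of two" from the list above: any value between 1 and 2 must snap to one of these four.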
Catastrophic cancellation and loss scaling
Catastrophic cancellation occurs when two nearly equal numbers are subtracted, causing most significant bits to cancel and leaving a result dominated by rounding error.
Why BF16 is preferred over FP16 for training
FP16’s narrow exponent range (5 bits, max 65504) causes two problems in training:
- Gradient underflow: small gradients round to zero
- Activation overflow: large intermediate values exceed FP16 range
BF16 keeps FP32’s 8-bit exponent at the cost of mantissa bits, which suits training:
- Model weights rarely need more than 7–8 bits of mantissa precision
- Gradient magnitudes span many orders of magnitude and benefit from wider range
- BF16 requires no loss scaling, simplifying training code
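The underflow and overflow cases can be reproduced without a GPU. The sketch below uses Python's `struct` half-precision format for FP16 and emulates BF16 by rounding the FP32 bit pattern; `to_fp16` and `to_bf16` are illustrative helpers, and the sample values 1e-8 and 70000 are chosen for this example only.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to FP16; values beyond the FP16 range become infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack out-of-range values; map them to infinity
        # to mimic what an FP16 conversion produces on overflow.
        return float("inf") if x > 0 else float("-inf")

def to_bf16(x: float) -> float:
    """Round x to BF16 via FP32 (round-to-nearest-even; NaN not handled)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    return struct.unpack("<f", struct.pack("<I", ((bits >> 16) & 0xFFFF) << 16))[0]

tiny_gradient = 1e-8        # below FP16's smallest subnormal (~6e-8)
large_activation = 70000.0  # above FP16's maximum of 65504

fp16_results = (to_fp16(tiny_gradient), to_fp16(large_activation))
bf16_results = (to_bf16(tiny_gradient), to_bf16(large_activation))
```

In FP16 the gradient becomes exactly zero and the activation becomes infinity. In BF16 both values survive, each within about 1% of the original, which is the range-for-precision trade described above.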
FP8 variants: E4M3 and E5M2
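NVIDIA H100 introduced two FP8 formats with complementary roles: E4M3 spends more bits on the mantissa (3 bits, max 448) for precision, while E5M2 spends more on the exponent (2 mantissa bits, max 57344) for range, which suits gradients.
Integer formats: INT8 and INT4 for inference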
Integer formats use fixed-point representation: a scale maps integer values to floating-point magnitudes. They have no exponent, so all representable values are uniformly spaced.
| Format | Range | Levels | Primary use |
|---|---|---|---|
| INT8 | −128 to 127 | 256 | Inference (weights + activations) |
| UINT8 | 0 to 255 | 256 | Post-ReLU activations |
| INT4 | −8 to 7 | 16 | Weight-only quantization |
| INT2 | −2 to 1 | 4 | Extreme compression (research) |
| INT1 | −1 or +1 (or 0/1) | 2 | Binarized networks |
- Strength: simple, fast, hardware-friendly (no exponent decode needed)
- Weakness: cannot represent small values near zero accurately, because the step between adjacent values is the same at every magnitude
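A short sketch shows both properties. It assumes symmetric per-tensor quantization with the scale set from the largest absolute value, which is one common scheme and not necessarily the one used in the lecture; the function names and sample weights are hypothetical.

```python
def quantize_int8(values, scale):
    """Map floats to INT8 codes: round(x / scale), clamped to [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def dequantize_int8(codes, scale):
    """Map INT8 codes back to floats by multiplying with the scale."""
    return [c * scale for c in codes]

weights = [0.5, -1.27, 0.004, 0.02]

# Assumed scheme: one scale per tensor, chosen so the largest |w| maps to 127.
scale = max(abs(w) for w in weights) / 127

codes = quantize_int8(weights, scale)
restored = dequantize_int8(codes, scale)
```

Here the step size is about 0.01 everywhere. The large weights come back almost unchanged, but 0.004 is less than half a step and quantizes to code 0, so it is lost entirely. This is the weakness near zero that a floating-point format with an exponent avoids.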
Mixed precision training recipes
Paulius Micikevicius co-authored the original mixed-precision training paper (Micikevicius et al., 2018). The standard recipe:
Maintain FP32 master weights
Store a full-precision copy of weights for the optimizer update. This prevents accumulated rounding error from degrading convergence.
Run forward and backward in BF16 or FP16
Cast weights to lower precision before each forward pass. Activations and intermediate values stay in BF16/FP16.
Accumulate gradients in FP32
The gradient accumulation buffer should remain in FP32 to prevent precision loss during accumulation over many steps.
Apply loss scaling (FP16 only)
Multiply loss by a scale factor before backpropagation. Unscale gradients before the optimizer step. Skip this step for BF16.
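Two of these steps, master weights and loss scaling, can be illustrated with scalar arithmetic. The sketch below is a toy under stated assumptions: FP16 rounding is emulated with `struct`, a Python float (FP64) stands in for the FP32 master copy, the gradient is scaled directly instead of scaling a loss and backpropagating, and the scale factor 2¹⁶ is an arbitrary static choice.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to FP16; values beyond the FP16 range become infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")

# Master weights: small updates vanish if the weight itself lives in FP16.
update = 1e-4                 # learning rate times gradient (illustrative)
w_fp16 = to_fp16(1.0)
w_master = 1.0                # Python float standing in for the FP32 copy
for _ in range(100):
    w_fp16 = to_fp16(w_fp16 - update)  # rounds back to 1.0 on every step
    w_master -= update                 # accumulates as intended

# Loss scaling: a tiny gradient underflows in FP16 unless it is scaled first.
true_grad = 1e-8
loss_scale = 2.0 ** 16                     # hypothetical static scale factor
unscaled = to_fp16(true_grad)              # underflows to 0.0
scaled = to_fp16(true_grad * loss_scale)   # large enough to be a normal FP16 value
recovered = scaled / loss_scale            # unscale in higher precision
```

After 100 steps `w_fp16` is still exactly 1.0 while `w_master` has moved to about 0.99, and `recovered` matches `true_grad` to within FP16's relative precision even though the unscaled gradient was flushed to zero.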
Numerical debugging techniques
When a model produces NaN or Inf values, or accuracy is unexpectedly poor, these techniques help isolate the problem:
- Check for NaN/Inf during training
- Monitor activation statistics
- Compare FP32 vs. low-precision outputs
- Use torch.backends.cudnn.deterministic
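The first three checks reduce to a handful of summary numbers. The sketch below is framework-agnostic and works on plain Python lists; in a real training loop the same quantities would be computed on tensors. The helpers `tensor_stats` and `max_relative_error` and the sample values are made up for illustration.

```python
import math
import struct

def tensor_stats(values):
    """Summarize a flat list of floats: non-finite count, max |x|, mean |x|."""
    finite = [abs(v) for v in values if math.isfinite(v)]
    return {
        "non_finite": len(values) - len(finite),
        "abs_max": max(finite, default=0.0),
        "abs_mean": sum(finite) / len(finite) if finite else 0.0,
    }

def max_relative_error(reference, candidate, eps=1e-12):
    """Largest |ref - cand| / (|ref| + eps) over paired elements."""
    return max(abs(r - c) / (abs(r) + eps) for r, c in zip(reference, candidate))

def to_fp16(x: float) -> float:
    """Round an in-range value to FP16 using the standard library."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# NaN/Inf check and activation statistics on a sample "activation" list.
activations = [0.5, -3.0, float("nan"), 1200.0, float("inf")]
stats = tensor_stats(activations)

# Compare a reference output against its low-precision counterpart.
reference = [0.1, 1000.1, 3.14159]
low_precision = [to_fp16(v) for v in reference]
error = max_relative_error(reference, low_precision)
```

A nonzero `non_finite` count pinpoints the first layer or step where things went wrong, a growing `abs_max` warns of impending overflow, and an `error` far above the format's expected rounding error (about 2⁻¹¹ for FP16) suggests a precision-sensitive operation.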
torch.autograd.detect_anomaly() is invaluable for tracking down NaN sources but adds significant overhead. Use it only for debugging, not in production training runs.
Further reading
- Lecture 84 slides — Paulius Micikevicius (NVIDIA) on numerics and AI
- Mixed Precision Training paper — Micikevicius et al. (2018), the foundational reference
- FP8 Formats for Deep Learning — NVIDIA/Arm/Intel FP8 specification
- LLM.int8() — Handling outlier activations in INT8 quantization
- SmoothQuant — Migrating quantization difficulty from activations to weights