Every neural network computation is subject to the limits of finite-precision arithmetic. Understanding how floating-point formats represent numbers, where errors arise, and how to keep computations numerically stable is essential for GPU programmers working on deep learning. This page follows Lecture 84 by Paulius Micikevicius of NVIDIA (slides), a pioneer of mixed-precision training.

IEEE 754 floating-point formats

Every IEEE 754 floating-point number uses three fields: a sign bit, an exponent, and a mantissa (the stored fraction bits of the significand). For normal numbers, with the mantissa read as a fraction in [0, 1):
value = (-1)^sign × 2^(exponent - bias) × (1 + mantissa)
The exponent determines the range (how large or small the number can be), and the mantissa determines precision (how many significant digits you get within that range).
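As a minimal sketch of the formula above, the following pulls the three fields out of an FP32 bit pattern using only the standard library. The helper names `decode_fp32` and `rebuild_normal` are illustrative, not part of any library, and the reconstruction only covers normal numbers (not zero, subnormals, infinities, or NaN).

```python
import struct

def decode_fp32(x: float) -> tuple[int, int, float]:
    """Split a value (rounded to FP32) into sign, biased exponent, and fraction."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31                      # 1 bit
    exponent = (bits >> 23) & 0xFF         # 8 bits, stored with a bias of 127
    fraction = (bits & 0x7FFFFF) / 2**23   # 23 bits, read as a value in [0, 1)
    return sign, exponent, fraction

def rebuild_normal(sign: int, exponent: int, fraction: float, bias: int = 127) -> float:
    """Apply value = (-1)^sign * 2^(exponent - bias) * (1 + mantissa)."""
    return (-1) ** sign * 2.0 ** (exponent - bias) * (1 + fraction)

fields = decode_fp32(-6.5)
print(fields)                   # (1, 129, 0.625)
print(rebuild_normal(*fields))  # -6.5
```

Here -6.5 is stored as -1.625 × 2², so the biased exponent is 127 + 2 = 129 and the fraction is 0.625.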

Format overview

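| Format | Total bits | Exponent bits | Mantissa bits | Max value | Min normal |
|---|---|---|---|---|---|
| FP32 | 32 | 8 | 23 | ~3.4 × 10³⁸ | ~1.2 × 10⁻³⁸ |
| FP16 | 16 | 5 | 10 | 65504 | ~6.1 × 10⁻⁵ |
| BF16 | 16 | 8 | 7 | ~3.4 × 10³⁸ | ~1.2 × 10⁻³⁸ |
| FP8 E4M3 | 8 | 4 | 3 | 448 | ~1.6 × 10⁻² |
| FP8 E5M2 | 8 | 5 | 2 | 57344 | ~6.1 × 10⁻⁵ |
| FP4 E2M1 | 4 | 2 | 1 | 6 | 1 |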
BF16 shares its exponent width with FP32, so the two formats have essentially the same dynamic range: BF16 is an FP32 value with the low 16 mantissa bits removed. Converting FP32 to BF16 keeps the top 16 bits of the bit pattern (sign, 8 exponent bits, 7 mantissa bits) and rounds away the rest.
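The conversion can be sketched directly on the bit pattern. This is an illustrative pure-Python version assuming round-to-nearest-even and ignoring NaN inputs; the function names are hypothetical, and real frameworks perform this conversion in hardware or optimized kernels.

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit BF16 pattern for x, using round-to-nearest-even.

    NaN inputs are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Adding 0x7FFF plus the lowest kept bit rounds ties to the even neighbor.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def bf16_bits_to_float(b: int) -> float:
    """Widen a BF16 pattern back to FP32 by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

print(bf16_bits_to_float(fp32_to_bf16_bits(3.14159)))     # 3.140625
print(bf16_bits_to_float(fp32_to_bf16_bits(1.00390625)))  # 1.0 (tie rounds to even)
```

Widening back to FP32 is exact, which is why BF16 values can be mixed into FP32 accumulations without any range adjustment.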

Exponent bits vs. mantissa bits tradeoff

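For a fixed bit budget, every bit moved from mantissa to exponent roughly doubles the number of powers of two the format can span (squaring its dynamic range) while halving the number of representable values within each power-of-two interval.
FP8 E4M3 (4 exponent, 3 mantissa):
  • Largest finite value: 448
  • 8 representable levels per power of two
  • Good for weights and activations, where precision matters more than range
FP8 E5M2 (5 exponent, 2 mantissa):
  • Dynamic range: ~1.5 × 10⁻⁵ (smallest subnormal) to ~5.7 × 10⁴
  • Only 4 representable levels per power of two
  • Good for gradients, which span many orders of magnitude
  • Coarse relative precision, so it suits cases where range matters more than accuracy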
This tradeoff explains why FP8 training uses two formats: E4M3 for the forward pass (weights and activations need precision) and E5M2 for gradients (gradients need range).
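To make the levels-per-power-of-two numbers concrete, here is a small sketch that enumerates the representable values in one interval for each mantissa width and computes the largest finite value of each format. It assumes the usual conventions: E5M2 has bias 15 and reserves its top exponent code for infinity and NaN, while E4M3 (the variant PyTorch exposes as `float8_e4m3fn`) has bias 7 and reserves only the single all-ones pattern for NaN. The helper name `levels_in_binade` is illustrative.

```python
def levels_in_binade(man_bits: int, power: int = 0) -> list[float]:
    """Representable values in [2**power, 2**(power + 1)) for a mantissa width."""
    return [2.0**power * (1 + m / 2**man_bits) for m in range(2**man_bits)]

e4m3_levels = levels_in_binade(3)
e5m2_levels = levels_in_binade(2)
print(len(e4m3_levels), e4m3_levels)  # 8 values, spaced 0.125 apart in [1, 2)
print(len(e5m2_levels), e5m2_levels)  # 4 values, spaced 0.25 apart in [1, 2)

# Largest finite values under the conventions stated above.
e5m2_max = (1 + 3 / 4) * 2.0 ** (30 - 15)  # top usable exponent code is 30
e4m3_max = (1 + 6 / 8) * 2.0 ** (15 - 7)   # exponent 15 with mantissa 7 is NaN
print(e5m2_max, e4m3_max)                  # 57344.0 448.0
```

The extra mantissa bit halves the spacing inside every interval, and the extra exponent bit is what lifts the maximum from 448 to 57344.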

Catastrophic cancellation and loss scaling

Catastrophic cancellation occurs when two nearly equal numbers are subtracted, causing most significant bits to cancel and leaving a result dominated by rounding error.
# Example: rounding followed by cancellation in FP16
import torch

a = torch.tensor(1.0000, dtype=torch.float16)
b = torch.tensor(0.9998, dtype=torch.float16)

# FP16 values just below 1.0 are spaced 2^-11 ≈ 0.00049 apart,
# so 0.9998 rounds to 1.0 and b is stored as exactly 1.0
result = a - b
# Expected: 0.0002, Actual: 0.0 (the difference is lost entirely)
print(result)  # tensor(0., dtype=torch.float16)

# In FP32: sufficient mantissa bits to represent the difference.
# Build the FP32 tensors from the original values; b.float() would still be 1.0.
a32 = torch.tensor(1.0000, dtype=torch.float32)
b32 = torch.tensor(0.9998, dtype=torch.float32)
print(a32 - b32)  # ≈ 0.0002
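Loss scaling counteracts a related issue in FP16 training: gradient underflow. When gradient magnitudes are small, they may fall below the smallest normal FP16 value (~6.1 × 10⁻⁵), where they lose precision as subnormals, and anything below roughly 3 × 10⁻⁸ rounds to zero. Multiplying the loss by a constant shifts every gradient up into the representable range; dividing by the same constant afterwards restores the true values.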
# Loss scaling manually
loss_scale = 65536.0

scaled_loss = loss * loss_scale
scaled_loss.backward()

# Unscale before clipping and optimizer step
for p in model.parameters():
    if p.grad is not None:
        p.grad /= loss_scale

torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
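The effect of scaling on a single underflowing gradient can be checked without a GPU. This sketch uses the standard library's half-precision `struct` format to emulate FP16 rounding; the gradient value and the helper `to_fp16` are illustrative assumptions, and in real training the scaled gradient arises from backpropagating the scaled loss rather than from a direct multiplication.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value and widen it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

grad = 1e-8            # hypothetical true gradient, too small for FP16
loss_scale = 65536.0   # same scale as in the loop above

# Without scaling, the gradient is flushed to zero.
print(to_fp16(grad))   # 0.0

# With scaling, the value lands in the normal FP16 range and survives.
scaled = to_fp16(grad * loss_scale)
recovered = scaled / loss_scale   # unscale in higher precision
print(scaled, recovered)
```

Unscaling has to happen in FP32 (or higher), otherwise the recovered value would underflow again.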

Why BF16 is preferred over FP16 for training

FP16’s narrow exponent range (5 bits, max ~65504) causes two problems in training:
  1. Gradient underflow: small gradients round to zero
  2. Activation overflow: large intermediate values exceed FP16 range
BF16 solves both with 8 exponent bits (matching FP32), at the cost of mantissa precision: 7 bits versus FP16’s 10. For neural networks, range matters more than per-value precision because:
  • Model weights rarely need more than 7–8 bits of mantissa precision
  • Gradient magnitudes span many orders of magnitude and benefit from wider range
  • BF16 typically requires no loss scaling, simplifying training code
When using BF16, you can remove GradScaler entirely from your training loop. Simply use torch.autocast(device_type="cuda", dtype=torch.bfloat16) without a scaler.
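The range-versus-precision tradeoff can be seen by pushing a few values through both 16-bit formats. This is a pure-Python emulation under stated assumptions: `to_fp16` relies on the standard library's half-precision packing (mapping overflow to infinity), and `to_bf16` rounds an FP32 bit pattern to its top 16 bits with round-to-nearest-even, ignoring NaN. Both helper names are illustrative.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round to FP16; values beyond the FP16 range become +/- infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

def to_bf16(x: float) -> float:
    """Round to BF16 by keeping the top 16 bits of the FP32 pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

for value in (1e5, 1e-8, 1.001):
    print(f"{value:>8}: fp16={to_fp16(value)!r}  bf16={to_bf16(value)!r}")
```

The large activation overflows and the small gradient underflows in FP16 but both survive in BF16, while 1.001 keeps more digits in FP16 (1.0009765625) than in BF16 (1.0).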

FP8 variants: E4M3 and E5M2

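NVIDIA's H100 introduced hardware support for two FP8 formats with complementary roles:
import torch

# Placeholder tensors for illustration
weight = torch.randn(256, 256)
grad = torch.randn(256, 256) * 1e-3

# E4M3: forward pass (weights, activations)
# Max value: 448, 8 levels per power-of-two interval
# Per-tensor scaling before conversion: map the largest magnitude to 448
max_val = weight.abs().max().clamp(min=1e-12)  # guard against division by zero
scale = 448.0 / max_val
weight_fp8 = (weight * scale).to(torch.float8_e4m3fn)
# Keep `scale` for dequantization: weight ≈ weight_fp8.float() / scale

# E5M2: backward pass (gradients)
# Max value: 57344, 4 levels per power-of-two interval
grad_scale = 57344.0 / grad.abs().max().clamp(min=1e-12)
grad_fp8 = (grad * grad_scale).to(torch.float8_e5m2)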
FP8 requires explicit per-tensor (or per-channel) scaling because its limited range cannot cover typical neural network value distributions without scaling. Always compute and apply a scale before converting to FP8.
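To illustrate why the scale matters, the sketch below emulates E4M3 rounding in pure Python and compares quantization error with and without per-tensor scaling. It assumes the `float8_e4m3fn` layout (bias 7, subnormals, one NaN pattern, no infinities), uses simplified tie-breaking, and the weight values are made up for the example.

```python
def e4m3_grid() -> list[float]:
    """All non-negative finite E4M3 values (bias 7, 'fn' variant)."""
    values = []
    for e in range(16):
        for m in range(8):
            if e == 15 and m == 7:
                continue  # this bit pattern encodes NaN
            if e == 0:
                values.append(m / 8 * 2.0**-6)             # subnormals
            else:
                values.append((1 + m / 8) * 2.0 ** (e - 7))
    return values

GRID = e4m3_grid()

def quantize_e4m3(x: float) -> float:
    """Round to the nearest grid value, saturating at 448 (ties are arbitrary)."""
    nearest = min(GRID, key=lambda g: abs(g - abs(x)))
    return nearest if x >= 0 else -nearest

def max_error(approx: list[float], exact: list[float]) -> float:
    return max(abs(a - e) for a, e in zip(approx, exact))

weights = [0.0010, -0.0023, 0.0004, 0.0031]  # hypothetical small weights

unscaled = [quantize_e4m3(w) for w in weights]
scale = 448.0 / max(abs(w) for w in weights)
scaled = [quantize_e4m3(w * scale) / scale for w in weights]

print("max error without scaling:", max_error(unscaled, weights))
print("max error with scaling:   ", max_error(scaled, weights))
```

Without scaling, these weights fall among the few subnormal values near zero; with scaling they spread across the full grid, and the error drops by roughly an order of magnitude.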

Integer formats: INT8 and INT4 for inference

Integer formats use fixed-point representation: a scale maps integer values to floating-point magnitudes. They have no exponent — all representable values are uniformly spaced.
| Format | Range | Levels | Primary use |
|---|---|---|---|
| INT8 | −128 to 127 | 256 | Inference (weights + activations) |
| UINT8 | 0 to 255 | 256 | Post-ReLU activations |
| INT4 | −8 to 7 | 16 | Weight-only quantization |
| INT2 | −2 to 1 | 4 | Extreme compression (research) |
| INT1 | −1 to 1 (or 0/1) | 2 | Binarized networks |
Uniform spacing is both a strength and a weakness:
  • Strength: simple, fast, hardware-friendly (no exponent decode needed)
  • Weakness: cannot represent small values near zero accurately, because the step between adjacent values is the same size regardless of magnitude
This is why INT8 works well for weights (roughly Gaussian distribution) but struggles with activations that have large outliers (see SmoothQuant, LLM.int8() for solutions).
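The outlier problem follows directly from uniform spacing, as this minimal sketch of symmetric per-tensor INT8 quantization shows. The function names and sample values are illustrative assumptions; it is not the scheme used by SmoothQuant or LLM.int8(), and it assumes at least one nonzero input.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: one scale maps max |value| to 127."""
    scale = max(abs(v) for v in values) / 127
    ints = [max(-128, min(127, round(v / scale))) for v in values]
    return ints, scale

def dequantize(ints: list[int], scale: float) -> list[float]:
    return [q * scale for q in ints]

def mean_abs_error(values: list[float]) -> float:
    ints, scale = quantize_int8(values)
    restored = dequantize(ints, scale)
    return sum(abs(r - v) for r, v in zip(restored, values)) / len(values)

well_behaved = [0.10, -0.20, 0.05, 0.30]
with_outlier = well_behaved + [50.0]  # a single large activation

print("error without outlier:", mean_abs_error(well_behaved))
print("error with outlier:   ", mean_abs_error(with_outlier))
```

One outlier stretches the step size from about 0.002 to about 0.39, so the small values collapse to 0 or ±1 step and lose nearly all their information.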

Mixed precision training recipes

Paulius Micikevicius co-authored the original mixed-precision training paper (Micikevicius et al., 2018). The standard recipe:
  1. Maintain FP32 master weights. Store a full-precision copy of weights for the optimizer update. This prevents accumulated rounding error from degrading convergence.
  2. Run forward and backward in BF16 or FP16. Cast weights to lower precision before each forward pass. Activations and intermediate values stay in BF16/FP16.
  3. Accumulate gradients in FP32. The gradient accumulation buffer should remain in FP32 to prevent precision loss during accumulation over many steps.
  4. Apply loss scaling (FP16 only). Multiply loss by a scale factor before backpropagation. Unscale gradients before the optimizer step. Skip this step for BF16.
  5. Update FP32 master weights. Apply the optimizer update to FP32 weights. The next forward pass will re-cast to lower precision.
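# Complete BF16 mixed precision training loop (no loss scaling needed)
# Assumes model, dataloader, and loss_fn are defined elsewhere
import torch
from torch.amp import autocast

model = model.cuda()  # parameters stay in FP32 and serve as the master weights
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for batch in dataloader:
    optimizer.zero_grad()
    input_ids = batch["input_ids"].cuda()
    labels = batch["labels"].cuda()
    # autocast runs eligible ops (e.g. matmuls) in BF16 inside this block
    with autocast(device_type="cuda", dtype=torch.bfloat16):
        output = model(input_ids)
        loss = loss_fn(output, labels)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()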
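The reason for keeping FP32 master weights can be demonstrated with a toy update loop. This sketch emulates FP16 and FP32 storage with the standard library; the weight value, update size, and step count are arbitrary assumptions chosen so that each update is smaller than half the FP16 spacing near 1.0.

```python
import struct

def to_fp16(x: float) -> float:
    """Round to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

update = 1e-4   # hypothetical per-step weight update (lr * grad)
w_fp16 = 1.0    # weight stored only in FP16
w_fp32 = 1.0    # FP32 master copy

for _ in range(1000):
    w_fp16 = to_fp16(w_fp16 + update)  # update is rounded away every step
    w_fp32 = to_fp32(w_fp32 + update)  # update is retained

print(w_fp16)  # 1.0
print(w_fp32)  # close to 1.1
```

Near 1.0 the FP16 spacing is about 0.001, so an update of 0.0001 rounds back to the old value every time and the FP16-only weight never moves, while the master copy accumulates all 1000 updates.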

Numerical debugging techniques

When a model produces NaN or Inf values, or accuracy is unexpectedly poor, these techniques help isolate the problem:
def check_gradients(model):
    for name, param in model.named_parameters():
        if param.grad is not None:
            if torch.isnan(param.grad).any():
                print(f"NaN gradient in {name}")
            if torch.isinf(param.grad).any():
                print(f"Inf gradient in {name}")

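# Or use anomaly detection (slow but thorough).
# Run the forward pass inside the context as well, so the traceback
# points at the forward operation that produced the bad value.
with torch.autograd.detect_anomaly():
    output = model(batch["input_ids"])
    loss = loss_fn(output, batch["labels"])
    loss.backward()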
# Hook to track per-layer activation ranges
def make_hook(name):
    def hook(module, input, output):
        if isinstance(output, torch.Tensor):
            print(f"{name}: min={output.min():.4f}, "
                  f"max={output.max():.4f}, "
                  f"nan={output.isnan().any()}")
    return hook

for name, layer in model.named_modules():
    layer.register_forward_hook(make_hook(name))
# Run the same batch in FP32 and BF16, compare outputs
with torch.no_grad():
    out_fp32 = model_fp32(batch)
    with autocast(device_type="cuda", dtype=torch.bfloat16):
        out_bf16 = model_bf16(batch)

rel_error = (out_fp32 - out_bf16.float()).abs() / out_fp32.abs().clamp(min=1e-6)
print(f"Max relative error: {rel_error.max():.6f}")
print(f"Mean relative error: {rel_error.mean():.6f}")
# Force deterministic CUDA operations for reproducibility
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Set seeds
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
torch.autograd.detect_anomaly() is invaluable for tracking down NaN sources but adds significant overhead. Use it only for debugging, not in production training runs.
