Numerics and AI: Floating-Point for GPU Programmers

Every neural network computation is subject to the limits of finite-precision arithmetic. Understanding how floating-point formats represent numbers, where errors arise, and how to keep computations numerically stable is essential for GPU programmers working on deep learning. This page follows Lecture 84 by Paulius Micikevicius of NVIDIA (slides), a pioneer of mixed-precision training.

IEEE 754 floating-point formats

Every IEEE 754 floating-point number uses three fields: a sign bit, an exponent, and a mantissa (also called the significand or fraction).

value = (-1)^sign × 2^(exponent - bias) × (1 + mantissa)

The exponent determines the range (how large or small the number can be), and the mantissa determines precision (how many significant digits you get within that range).
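To make the three fields concrete, the following is a minimal sketch that pulls them out of an FP32 value using only the Python standard library. The helper names `decode_fp32` and `rebuild_normal` are illustrative, not part of any library, and the reconstruction applies the formula above, which holds for normal numbers only.

```python
import struct

def decode_fp32(x):
    """Split a float, rounded to FP32, into (sign, exponent field, mantissa field)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF      # 8 exponent bits, biased by 127
    mantissa = bits & 0x7FFFFF          # 23 stored fraction bits
    return sign, exponent, mantissa

def rebuild_normal(sign, exponent, mantissa):
    """Apply the value formula; valid for normal numbers only (0 < exponent < 255)."""
    return (-1) ** sign * 2.0 ** (exponent - 127) * (1 + mantissa / 2 ** 23)

sign, exponent, mantissa = decode_fp32(-6.5)
print(sign, exponent, mantissa)                    # 1 129 5242880
print(rebuild_normal(sign, exponent, mantissa))    # -6.5
```

Here -6.5 is -1.625 × 2², so the stored exponent is 2 + 127 = 129 and the fraction field encodes 0.625.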

Format overview

Format	Total bits	Exponent bits	Mantissa bits	Max value	Min normal
FP32	32	8	23	~3.4 × 10³⁸	~1.2 × 10⁻³⁸
FP16	16	5	10	65504	~6.1 × 10⁻⁵
BF16	16	8	7	~3.4 × 10³⁸	~1.2 × 10⁻³⁸
FP8 E4M3	8	4	3	448	~1.6 × 10⁻²
FP8 E5M2	8	5	2	57344	~6.1 × 10⁻⁵
FP4 E2M1	4	2	1	6	1
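The maximum and minimum normal values of each format follow from the bit widths and from which bit patterns are reserved for infinity and NaN. The sketch below assumes the usual conventions: IEEE-style formats reserve the top exponent code, E4M3 in its common "fn" variant reserves only the all-ones pattern for NaN, and E2M1 has no special values. The function name `format_limits` and its `reserved` parameter are illustrative.

```python
def format_limits(exp_bits, man_bits, reserved="ieee"):
    """Return (max finite value, min normal value) for a small float format."""
    bias = 2 ** (exp_bits - 1) - 1
    min_normal = 2.0 ** (1 - bias)
    if reserved == "ieee":
        # Top exponent code is reserved for inf/NaN.
        max_exp = (2 ** exp_bits - 2) - bias
        max_frac = 2.0 - 2.0 ** -man_bits
    elif reserved == "nan_only":
        # Only the all-ones pattern is NaN, so the top code loses one mantissa value.
        max_exp = (2 ** exp_bits - 1) - bias
        max_frac = 2.0 - 2.0 ** (1 - man_bits)
    else:
        # "none": every bit pattern is a finite number.
        max_exp = (2 ** exp_bits - 1) - bias
        max_frac = 2.0 - 2.0 ** -man_bits
    return max_frac * 2.0 ** max_exp, min_normal

formats = [
    ("FP32", 8, 23, "ieee"),
    ("FP16", 5, 10, "ieee"),
    ("BF16", 8, 7, "ieee"),
    ("FP8 E4M3", 4, 3, "nan_only"),
    ("FP8 E5M2", 5, 2, "ieee"),
    ("FP4 E2M1", 2, 1, "none"),
]
for name, e, m, reserved in formats:
    max_val, min_normal = format_limits(e, m, reserved)
    print(f"{name:9s} max={max_val:.4g}  min normal={min_normal:.4g}")
```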

BF16 shares its exponent width with FP32, so the two formats have essentially the same dynamic range. A BF16 value is the upper 16 bits of the corresponding FP32 encoding: converting FP32 to BF16 drops the low 16 mantissa bits, with rounding.
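As a rough illustration of that conversion, the sketch below rounds an FP32 bit pattern to its top 16 bits using round-to-nearest-even. The helper names are hypothetical, and NaN inputs are deliberately not special-cased, which a production converter would need to handle.

```python
import struct

def fp32_to_bf16_bits(x):
    """Return the 16-bit BF16 encoding of x (round-to-nearest-even; NaN not special-cased)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)   # ties go to the even result
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def bf16_bits_to_float(b):
    """Widen BF16 back to FP32 by appending 16 zero bits; this direction is exact."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

b = fp32_to_bf16_bits(3.14159)
print(hex(b), bf16_bits_to_float(b))   # 0x4049 3.140625
```

Only 7 mantissa bits survive, so 3.14159 comes back as 3.140625: the exponent is untouched while the value is accurate to about two or three decimal digits.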

Exponent bits vs. mantissa bits tradeoff

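For a fixed bit budget, every bit moved from mantissa to exponent roughly doubles the number of powers of two the format can span (which squares the ratio between its largest and smallest normal values) while halving the number of representable values within each power-of-two interval.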

FP8 E5M2 (5 exponent, 2 mantissa):

Dynamic range: ~1.5 × 10⁻⁵ (smallest subnormal) to ~5.7 × 10⁴
Only 4 representable levels per power of two
Good for gradients, which span many orders of magnitude
Poor relative precision (about one significant decimal digit): suitable when range matters more than accuracy

This tradeoff explains why FP8 training uses two formats: E4M3 for the forward pass (weights and activations need precision) and E5M2 for gradients (gradients need range).
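The "levels per power of two" figure can be checked by listing the representable values in one interval. This is a small sketch with a hypothetical helper, `values_in_binade`, that depends only on the mantissa width.

```python
def values_in_binade(man_bits, exponent=0):
    """All representable values in [2**exponent, 2**(exponent + 1)) for a mantissa width."""
    levels = 2 ** man_bits
    return [2.0 ** exponent * (1 + k / levels) for k in range(levels)]

e5m2 = values_in_binade(2)   # 2 mantissa bits
e4m3 = values_in_binade(3)   # 3 mantissa bits
print(len(e5m2), e5m2)       # 4 [1.0, 1.25, 1.5, 1.75]
print(len(e4m3), e4m3)

# Worst-case relative rounding error is half a step at the bottom of the
# interval, i.e. 2**-(man_bits + 1).
print(2.0 ** -3, 2.0 ** -4)  # 0.125 0.0625
```

With the same 8 bits, E4M3 halves the worst-case relative error compared with E5M2, and E5M2 spends that bit on extra exponent range instead.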

Catastrophic cancellation and loss scaling

Catastrophic cancellation occurs when two nearly equal numbers are subtracted, causing most significant bits to cancel and leaving a result dominated by rounding error.

# Example of catastrophic cancellation in FP16
import torch

a = torch.tensor(1.0000, dtype=torch.float16)
b = torch.tensor(0.9998, dtype=torch.float16)

# In FP16, values just below 1.0 are spaced ~0.0005 apart, so 0.9998 rounds to 1.0
result = a - b
# Expected: 0.0002, Actual: 0.0 (the difference was lost to rounding)
print(result)  # tensor(0., dtype=torch.float16)

# In FP32: build the operands in FP32 from the start. Upcasting a and b with
# .float() would not help, because b has already been rounded to 1.0.
a32 = torch.tensor(1.0000, dtype=torch.float32)
b32 = torch.tensor(0.9998, dtype=torch.float32)
result_fp32 = a32 - b32
print(result_fp32)  # approximately 0.0002

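Loss scaling counteracts a related issue in FP16 training: gradient underflow. When loss values are small, gradient magnitudes can fall below the smallest normal FP16 value (~6.1 × 10⁻⁵), where they lose precision as subnormals, and below roughly 3 × 10⁻⁸ (half the smallest subnormal, ~6 × 10⁻⁸) they round to zero. Multiplying the loss by a constant before backpropagation shifts every gradient up by the same factor, and dividing the gradients by that constant afterward restores their true magnitude.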
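The underflow and its fix can be reproduced without a GPU. The sketch below uses the standard library's half-precision `struct` format to round values to FP16; the gradient value `1e-8` is a made-up example, and `to_fp16` is an illustrative helper.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value using the stdlib 'e' (half) format."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

grad = 1e-8              # hypothetical true gradient value
loss_scale = 65536.0     # 2**16

unscaled = to_fp16(grad)                 # too small for FP16: rounds to zero
scaled = to_fp16(grad * loss_scale)      # ~6.55e-4, a normal FP16 number
recovered = scaled / loss_scale          # unscale in higher precision

print(unscaled)     # 0.0
print(recovered)    # close to 1e-8
```

Because the scale is a power of two, scaling and unscaling only shift the exponent, so the sole error in `recovered` is the FP16 rounding of the scaled value.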

# Manual loss scaling with a fixed scale (model, optimizer, and loss are defined elsewhere)
loss_scale = 65536.0

scaled_loss = loss * loss_scale
scaled_loss.backward()

# Unscale before clipping and optimizer step
for p in model.parameters():
    if p.grad is not None:
        p.grad /= loss_scale

torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
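A fixed scale can itself overflow FP16 when gradients are large, so in practice the scale is usually adjusted on the fly. The following is a framework-free sketch of that policy: skip the step and shrink the scale when any gradient is Inf or NaN, and grow the scale after a run of clean steps. The class `DynamicLossScaler` is hypothetical; its default values are chosen to match the defaults documented for PyTorch's `GradScaler`, but it is an illustration of the idea, not that API.

```python
import math

class DynamicLossScaler:
    """Hypothetical helper: back off on overflow, grow after a run of clean steps."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._clean_steps = 0

    def unscale_and_check(self, scaled_grads):
        """Return unscaled gradients, or None if the optimizer step should be skipped."""
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            self.scale *= self.backoff_factor    # overflow: shrink the scale, skip the step
            self._clean_steps = 0
            return None
        grads = [g / self.scale for g in scaled_grads]   # unscale with the scale that was used
        self._clean_steps += 1
        if self._clean_steps >= self.growth_interval:
            self.scale *= self.growth_factor     # long clean run: try a larger scale
            self._clean_steps = 0
        return grads

scaler = DynamicLossScaler(growth_interval=2)
print(scaler.unscale_and_check([float("inf"), 1.0]), scaler.scale)  # None 32768.0
print(scaler.unscale_and_check([32768.0]), scaler.scale)            # [1.0] 32768.0
print(scaler.unscale_and_check([16384.0]), scaler.scale)            # [0.5] 65536.0
```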

Why BF16 is preferred over FP16 for training

FP16’s narrow exponent range (5 bits, max ~65504) causes two problems in training:

Gradient underflow: small gradients round to zero
Activation overflow: large intermediate values exceed FP16 range

BF16 solves both with 8 exponent bits (matching FP32), at the cost of precision: it keeps 7 mantissa bits, 3 fewer than FP16’s 10. For neural networks, range matters more than per-value precision because:

Model weights rarely need more than 7–8 bits of mantissa precision
Gradient magnitudes span many orders of magnitude and benefit from wider range
BF16 typically requires no loss scaling, simplifying training code

When using BF16, you can remove GradScaler entirely from your training loop. Simply use torch.autocast(device_type="cuda", dtype=torch.bfloat16) without a scaler.
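The range-versus-precision difference between the two 16-bit formats can be seen on two sample values. This sketch emulates both roundings with the standard library; `to_fp16` and `to_bf16` are illustrative helpers, and the BF16 path does not special-case NaN.

```python
import struct

def to_fp16(x):
    """Round to FP16; values beyond the FP16 range become infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:   # struct refuses values too large for half precision
        return float("inf") if x > 0 else float("-inf")

def to_bf16(x):
    """Round to BF16 by keeping the top 16 bits of the FP32 encoding (nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    return struct.unpack("<f", struct.pack("<I", (bits >> 16 << 16) & 0xFFFFFFFF))[0]

print(to_fp16(70000.0), to_bf16(70000.0))   # inf 70144.0
print(to_fp16(1.001), to_bf16(1.001))       # 1.0009765625 1.0
```

A large activation such as 70000 overflows FP16 but survives in BF16 with a small relative error, while a small perturbation such as 1.001 is preserved by FP16 and rounded away by BF16.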

FP8 variants: E4M3 and E5M2

NVIDIA H100 introduced hardware support for two FP8 formats with complementary roles:

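import torch

# Example tensors standing in for real weights and gradients
weight = torch.randn(256, 256)
grad = torch.randn(256, 256) * 1e-3

# E4M3: forward pass (weights, activations)
# Max value: 448, 8 levels per power-of-two interval
weight_fp8 = weight.to(torch.float8_e4m3fn)

# E5M2: backward pass (gradients)
# Max value: 57344, 4 levels per power-of-two interval
grad_fp8 = grad.to(torch.float8_e5m2)

# Per-tensor scaling before conversion
max_val = weight.abs().max().clamp(min=1e-12)  # guard against an all-zero tensor
scale = 448.0 / max_val  # map max to E4M3 max representable value
weight_fp8 = (weight * scale).to(torch.float8_e4m3fn)

# Keep the scale alongside the tensor: dequantize with weight_fp8.float() / scale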

FP8 requires explicit per-tensor (or per-channel) scaling because its limited range cannot cover typical neural network value distributions without scaling. Always compute and apply a scale before converting to FP8.
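To see why the scale matters, the sketch below emulates E4M3 rounding in plain Python and compares direct conversion with scaled conversion on a few small values. The helper `round_to_e4m3` is an illustrative emulation that saturates at 448 rather than producing NaN, and the sample values are made up.

```python
import math

E4M3_MAX = 448.0

def round_to_e4m3(x):
    """Round a float to the nearest E4M3 value (saturating; illustrative emulation)."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)     # saturate instead of overflowing
    _, e = math.frexp(mag)          # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)            # below 2**-6 the format is subnormal
    step = 2.0 ** (exp - 3)         # 3 mantissa bits: 8 steps per power of two
    return math.copysign(round(mag / step) * step, x)

values = [0.001, 0.004, 0.012, 0.02]             # hypothetical small weights
scale = E4M3_MAX / max(abs(v) for v in values)   # map the largest value to 448

direct = [round_to_e4m3(v) for v in values]
scaled = [round_to_e4m3(v * scale) / scale for v in values]

for v, d, s in zip(values, direct, scaled):
    print(f"{v}: direct error {abs(d - v) / v:.1%}, scaled error {abs(s - v) / v:.1%}")
```

Without scaling, these values land in the subnormal region where the step is a fixed 2⁻⁹, so the smallest one is off by almost a factor of two. After scaling, every value uses the full 3-bit mantissa and the relative error stays within a few percent.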

Integer formats: INT8 and INT4 for inference

Integer formats use fixed-point representation: a scale maps integer values to floating-point magnitudes. They have no exponent — all representable values are uniformly spaced.

Format	Range	Levels	Primary use
INT8	−128 to 127	256	Inference (weights + activations)
UINT8	0 to 255	256	Post-ReLU activations
INT4	−8 to 7	16	Weight-only quantization
INT2	−2 to 1	4	Extreme compression (research)
INT1	{−1, +1} (or {0, 1})	2	Binarized networks

Uniform spacing is both a strength and a weakness:

Strength: simple, fast, hardware-friendly (no exponent decode needed)
Weakness: cannot represent small values near zero accurately, because the step between adjacent values is the same size regardless of magnitude

This is why INT8 works well for weights (roughly Gaussian distribution) but struggles with activations that have large outliers (see SmoothQuant, LLM.int8() for solutions).
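The outlier problem is easy to reproduce with symmetric per-tensor quantization, where a single scale is derived from the largest magnitude. This is a minimal sketch with made-up values; `quantize_int8` and `dequantize` are illustrative helpers, not a library API.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: one scale, integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map integers back to floating-point magnitudes."""
    return [qi * scale for qi in q]

small = [0.01, -0.02, 0.03, 0.05]    # hypothetical well-behaved values
with_outlier = small + [60.0]        # same values plus one large outlier

q1, s1 = quantize_int8(small)
q2, s2 = quantize_int8(with_outlier)
print(q1)   # [25, -51, 76, 127]
print(q2)   # [0, 0, 0, 0, 127]
print(dequantize(q1, s1))
```

One outlier stretches the scale so far that every ordinary value rounds to zero, which is the failure mode that per-channel scales and outlier-aware schemes are designed to avoid.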

Mixed precision training recipes

Paulius Micikevicius co-authored the original mixed-precision training paper (Micikevicius et al., 2018). The standard recipe:

Maintain FP32 master weights

Store a full-precision copy of weights for the optimizer update. This prevents accumulated rounding error from degrading convergence.

Run forward and backward in BF16 or FP16

Cast weights to lower precision before each forward pass. Activations and intermediate values stay in BF16/FP16.

Accumulate gradients in FP32

The gradient accumulation buffer should remain in FP32 to prevent precision loss during accumulation over many steps.

Apply loss scaling (FP16 only)

Multiply loss by a scale factor before backpropagation. Unscale gradients before the optimizer step. Skip this step for BF16.

Update FP32 master weights

Apply the optimizer update to FP32 weights. The next forward pass will re-cast to lower precision.

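# BF16 mixed precision training loop (no loss scaling needed)
# Assumes model, dataloader, and loss_fn are defined and batches are already on the GPU.
# Under autocast the parameters stay in FP32, so they act as the master weights.
import torch
from torch.amp import autocast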

model = model.cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for batch in dataloader:
    optimizer.zero_grad()
    with autocast(device_type="cuda", dtype=torch.bfloat16):
        output = model(batch["input_ids"])
        loss = loss_fn(output, batch["labels"])
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
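The reason for keeping FP32 master weights shows up in a toy update loop. The sketch below emulates FP16 and FP32 storage with the standard library and applies the same small update 1000 times; the update size `1e-4` is a made-up stand-in for learning rate times gradient. The effect is stronger still for BF16, which has fewer mantissa bits than FP16.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

update = 1e-4      # hypothetical lr * gradient, well below the FP16 spacing near 1.0 (~9.8e-4)
w_fp16 = 1.0       # weight stored and updated in FP16 only
w_master = 1.0     # FP32 master copy

for _ in range(1000):
    w_fp16 = to_fp16(w_fp16 + update)        # the update is rounded away every step
    w_master = to_fp32(w_master + update)    # the update survives in FP32

print(w_fp16)               # 1.0
print(w_master)             # close to 1.1
print(to_fp16(w_master))    # low-precision copy cast for the next forward pass
```

The FP16-only weight never moves, while the master copy accumulates all 1000 updates and is then cast down for the next forward pass.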

Numerical debugging techniques

When a model produces NaN or Inf values, or accuracy is unexpectedly poor, these techniques help isolate the problem:

Check for NaN/Inf during training

def check_gradients(model):
    for name, param in model.named_parameters():
        if param.grad is not None:
            if torch.isnan(param.grad).any():
                print(f"NaN gradient in {name}")
            if torch.isinf(param.grad).any():
                print(f"Inf gradient in {name}")

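# Or use anomaly detection (slow but thorough). Run the forward pass inside the
# context as well, so the error can point at the forward op that produced the NaN.
with torch.autograd.detect_anomaly():
    output = model(batch["input_ids"])
    loss = loss_fn(output, batch["labels"])
    loss.backward()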

Monitor activation statistics

# Hook to track per-layer activation ranges
def make_hook(name):
    def hook(module, input, output):
        if isinstance(output, torch.Tensor):
            print(f"{name}: min={output.min().item():.4f}, "
                  f"max={output.max().item():.4f}, "
                  f"nan={output.isnan().any().item()}")
    return hook

# Keep the handles so the hooks can be removed once debugging is done
handles = [layer.register_forward_hook(make_hook(name))
           for name, layer in model.named_modules()]
# After running a forward pass: for h in handles: h.remove()

Compare FP32 vs. low-precision outputs

# Run the same batch in FP32 and BF16, compare outputs
with torch.no_grad():
    out_fp32 = model_fp32(batch)
    with autocast(device_type="cuda", dtype=torch.bfloat16):
        out_bf16 = model_bf16(batch)

rel_error = (out_fp32 - out_bf16.float()).abs() / out_fp32.abs().clamp(min=1e-6)
print(f"Max relative error: {rel_error.max():.6f}")
print(f"Mean relative error: {rel_error.mean():.6f}")

Use torch.backends.cudnn.deterministic

# Force deterministic cuDNN algorithms for reproducibility
# (torch.use_deterministic_algorithms(True) is a stricter, framework-wide option)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Set seeds
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

torch.autograd.detect_anomaly() is invaluable for tracking down NaN sources but adds significant overhead. Use it only for debugging, not in production training runs.
