Training neural networks in full FP32 precision is expensive: a 7B parameter model with optimizer state and gradients requires over 100 GB of memory. Quantized training reduces this cost by using lower-precision formats throughout the training loop — not just at inference. This page follows Lecture 30 by Thien Tran, with additional coverage of Quartet 4-bit training from Lecture 69.
The accompanying Colab notebook from Lecture 30 contains runnable code for all techniques on this page.

Why train in low precision

Full-precision training (FP32 weights + FP32 optimizer states) is the most memory-hungry configuration. The memory cost per parameter in bytes:
| Configuration | Bytes per parameter |
| --- | --- |
| FP32 weights + FP32 Adam | 16 (param + grad + 2 × optimizer state) |
| BF16 mixed precision + FP32 Adam states | 12 |
| FP8 weights + FP8 gradients | 2–4 |
| 4-bit weights (Quartet) | ~0.5–1 |
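Reducing precision also increases compute throughput: on A100 (Ampere), BF16 Tensor Cores deliver 2× the peak dense throughput of TF32 Tensor Cores (312 vs. 156 TFLOPS) and far more than non-Tensor-Core FP32, and FP8 Tensor Cores on Hopper double the BF16 rate again.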
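As a rough check on the per-parameter figures above, the sketch below multiplies bytes per parameter by the parameter count. The function name is illustrative, and the estimate covers only weights, gradients, and optimizer state; activations and allocator overhead are ignored.

```python
def training_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Estimate memory for weights, gradients, and optimizer state in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# FP32 weights + FP32 Adam: 4 (param) + 4 (grad) + 2 * 4 (Adam moments) = 16 bytes
fp32_adam_bytes = 4 + 4 + 2 * 4

configs = [
    ("FP32 weights + FP32 Adam", fp32_adam_bytes),
    ("BF16 mixed precision + FP32 Adam states", 12),  # figure taken from the table
]
for name, bytes_per_param in configs:
    gb = training_memory_gb(7e9, bytes_per_param)
    print(f"{name}: {gb:.0f} GB for a 7B-parameter model")
```

The first row reproduces the "over 100 GB" figure quoted for a 7B-parameter model.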

FP16 / BF16 mixed precision training

Mixed precision training keeps an FP32 master copy of the weights for the optimizer update, but performs the forward and backward pass in FP16 or BF16. This is the standard training regime for most modern LLMs.
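import torch

use_fp16 = False  # BF16 by default; set True to train in FP16
amp_dtype = torch.float16 if use_fp16 else torch.bfloat16
# Loss scaling is required for FP16 only; a disabled scaler is a pass-through
scaler = torch.amp.GradScaler("cuda", enabled=use_fp16)

for inputs, labels in dataloader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=amp_dtype):
        output = model(inputs)
        loss = criterion(output, labels)

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()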
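FP16:
  • 5-bit exponent, 10-bit mantissa
  • Normal range: ~6.1 × 10⁻⁵ to 6.55 × 10⁴
  • Requires loss scaling to avoid underflow in gradients
  • Tensor Core support on Volta and later
BF16:
  • 8-bit exponent, 7-bit mantissa
  • Same exponent range as FP32 (~1.2 × 10⁻³⁸ to 3.4 × 10³⁸)
  • Usually trains without loss scaling
  • Tensor Core support on Ampere and later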
Prefer BF16 over FP16 for training. BF16’s wider exponent range eliminates the need for loss scaling and reduces numerical instability, with no meaningful accuracy difference for most LLM training runs.
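The range-versus-precision trade-off can be checked without a GPU. The sketch below rounds Python floats to FP16 with the standard library's half-precision `struct` format and emulates BF16 by keeping the top 16 bits of the FP32 encoding; the helper names `to_fp16` and `to_bf16` are illustrative, not part of any library.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 (IEEE binary16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Round a Python float to BF16: the top 16 bits of its FP32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 low bits that BF16 drops
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


tiny_grad = 1e-8
print(to_fp16(tiny_grad))  # 0.0: below the smallest FP16 subnormal
print(to_bf16(tiny_grad))  # nonzero: BF16 shares FP32's exponent range

fine_value = 1 + 2 ** -9
print(to_fp16(fine_value))  # 1.001953125: FP16's 10 mantissa bits keep it
print(to_bf16(fine_value))  # 1.0: BF16's 7 mantissa bits cannot represent it
```

BF16 gives up mantissa precision in exchange for range, which is why small gradients survive without loss scaling.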

FP8 training (emerging standard)

FP8 training uses 8-bit floating-point for both the forward pass and gradient computation, with FP32 master weights for the optimizer. NVIDIA Hopper (H100) introduced native FP8 Tensor Cores capable of 2× the throughput of BF16. Two FP8 variants are used together in training:
| Format | Exponent bits | Mantissa bits | Use |
| --- | --- | --- | --- |
| E4M3 | 4 | 3 | Forward pass (weights, activations): more precision |
| E5M2 | 5 | 2 | Backward pass (gradients): wider range |
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Define the FP8 recipe: scales are derived from a history of past amax values
fp8_recipe = recipe.DelayedScaling(
    margin=0,
    fp8_format=recipe.Format.HYBRID,  # E4M3 forward, E5M2 backward
    amax_history_len=16,
    amax_compute_algo="max",
)

# Use TransformerEngine modules for FP8
model = te.Linear(1024, 1024)
# Example input; the shape is chosen to satisfy FP8 alignment requirements
input_tensor = torch.randn(32, 1024, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    output = model(input_tensor)
Hardware-accelerated FP8 training requires a GPU with FP8 Tensor Cores, such as Hopper (H100/H200) or Ada Lovelace. Older GPUs have no native FP8 path, so FP8 can only be emulated there and is slower than BF16.
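The idea behind a delayed-scaling recipe can be sketched in plain Python: record the absolute maximum (amax) of a tensor at each step, then choose a scale that maps the largest recent amax to the top of the FP8 range. This is a toy illustration under that assumption, not Transformer Engine's implementation; the class and method names are hypothetical.

```python
from collections import deque

# Largest finite magnitudes of the two FP8 formats
FP8_MAX = {"E4M3": 448.0, "E5M2": 57344.0}


class DelayedScaler:
    """Toy per-tensor delayed scaling: the scale comes from past amax values."""

    def __init__(self, fmt: str = "E4M3", history_len: int = 16, margin: int = 0):
        self.fp8_max = FP8_MAX[fmt]
        self.margin = margin
        self.amax_history = deque(maxlen=history_len)  # sliding window

    def record(self, values) -> None:
        """Store the absolute maximum of the tensor seen in this step."""
        self.amax_history.append(max(abs(v) for v in values))

    def scale(self) -> float:
        """Scale mapping the largest recent amax to the top of the FP8 range."""
        if not self.amax_history:
            return 1.0
        amax = max(self.amax_history)  # analogous to amax_compute_algo="max"
        if amax == 0.0:
            return 1.0
        return self.fp8_max / amax / (2 ** self.margin)


scaler = DelayedScaler("E4M3", history_len=16)
for step_values in ([0.5, -1.0], [4.0, 0.25], [-2.0, 1.5]):
    scaler.record(step_values)
print(scaler.scale())  # 448 / 4.0 = 112.0
```

A longer history makes the scale more robust to a single quiet step, at the cost of reacting more slowly when magnitudes shrink.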

INT8 training challenges

INT8 training is harder than INT8 inference because gradients have a much wider dynamic range than activations. Challenges include:
  • Gradient underflow: small gradient values round to zero in INT8
  • Gradient overflow: large gradient spikes exceed INT8 range
  • Accumulation errors: repeated INT8 operations compound quantization error
Practical INT8 training usually quantizes weights and activations only, keeping gradients in FP16 or BF16. Full INT8 gradient training remains an active research area.
Avoid quantizing gradient accumulators to INT8. Use at least FP16 for the gradient accumulation buffer to prevent significant accuracy degradation.
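The underflow problem is easy to reproduce with symmetric absmax quantization, a common INT8 scheme that is used here as an assumption. In this minimal sketch (function names are illustrative), activation-like values of similar magnitude survive, while a single gradient spike sets the scale and pushes every small gradient to zero.

```python
def quantize_int8(values):
    """Symmetric per-tensor absmax quantization to the INT8 range [-127, 127]."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


# Activation-like values: similar magnitudes, so every entry keeps a distinct code
acts = [0.9, -0.4, 0.25, 0.1]
q_acts, s_acts = quantize_int8(acts)

# Gradient-like values: one spike sets the scale, the small entries round to zero
grads = [1.0, 1e-3, -2e-3, 5e-4]
q_grads, s_grads = quantize_int8(grads)

print(q_acts)   # [127, -56, 35, 14]
print(q_grads)  # [127, 0, 0, 0]
```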

Stochastic rounding

Standard rounding (round-to-nearest) introduces systematic bias: values consistently below the midpoint always round down. Stochastic rounding adds random noise before rounding, making the expected value of the rounded result equal to the original value.
import torch

def stochastic_round(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Stochastically round x onto a signed `bits`-bit grid (assumes x in [-1, 1])."""
    scale = 2 ** (bits - 1) - 1  # 127 for 8 bits
    x_scaled = x * scale
    # floor(x + u) with u ~ U[0, 1) rounds up with probability equal to the
    # fractional part of x, so the expected result equals x_scaled
    rounded = torch.floor(x_scaled + torch.rand_like(x_scaled))
    return rounded.clamp(-scale, scale) / scale
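Stochastic rounding helps low-precision training converge: it prevents the systematic drift that accumulates over many gradient steps. It matters most when the weights themselves are stored in low precision without an FP32 master copy, because round-to-nearest silently discards any update smaller than half a quantization step.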
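The effect shows up in a small simulation. In this sketch (pure Python, with a fixed seed and illustrative helper names), a weight stored on a coarse grid receives 1000 updates that are each a tenth of a grid step: round-to-nearest never moves, while stochastic rounding tracks the true sum in expectation.

```python
import math
import random


def round_nearest(x: float, step: float) -> float:
    """Round x to the nearest multiple of step."""
    return round(x / step) * step


def round_stochastic(x: float, step: float, rng: random.Random) -> float:
    """Round x up or down to a multiple of step, with probability by distance."""
    # floor(x/step + u), u ~ U[0, 1): rounds up with probability frac(x/step)
    return math.floor(x / step + rng.random()) * step


step = 1 / 127          # spacing of the low-precision grid
update = 0.1 * step     # each update is a tenth of one grid step
rng = random.Random(0)  # fixed seed so the run is repeatable

w_nearest = w_stochastic = 0.0
for _ in range(1000):
    w_nearest = round_nearest(w_nearest + update, step)
    w_stochastic = round_stochastic(w_stochastic + update, step, rng)

print(w_nearest)            # 0.0: every update is rounded away
print(w_stochastic / step)  # near the exact total of 100 grid steps
```

The stochastic result is noisy (its exact value depends on the seed), but it is unbiased, which is what matters over many steps.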

Loss scaling for numerical stability

When training with FP16, gradients can underflow to zero: the smallest normal FP16 value is ~6.1 × 10⁻⁵, and even subnormals reach only ~6 × 10⁻⁸. Loss scaling multiplies the loss by a large constant before backpropagation, shifting the gradient magnitudes into the representable range.
# Manual loss scaling with a fixed scale factor
LOSS_SCALE = 1024.0

optimizer.zero_grad()
loss = criterion(output, labels)
scaled_loss = loss * LOSS_SCALE
scaled_loss.backward()  # gradients are now LOSS_SCALE times too large

# Unscale before the optimizer step so the update uses the true gradients
for param in model.parameters():
    if param.grad is not None:
        param.grad.div_(LOSS_SCALE)

optimizer.step()
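To see why the scale-then-unscale round trip helps, the sketch below pushes a single gradient value through FP16 with and without scaling. It uses the standard library's half-precision `struct` format as a stand-in for FP16 storage; `to_fp16` is an illustrative helper.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 (IEEE binary16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


LOSS_SCALE = 1024.0
true_grad = 1e-8

# Without scaling, the gradient is lost when stored in FP16
unscaled = to_fp16(true_grad)

# With scaling, the FP16 value is nonzero; unscaling happens in higher precision
scaled = to_fp16(true_grad * LOSS_SCALE)
recovered = scaled / LOSS_SCALE

print(unscaled)   # 0.0
print(recovered)  # within about 1% of 1e-8
```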
PyTorch’s GradScaler automates this with dynamic scaling: it increases the scale factor when no overflow is detected and decreases it after an overflow.
scaler = torch.amp.GradScaler("cuda", init_scale=1024.0, growth_interval=2000)
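The dynamic policy can be sketched as a small state machine: halve the scale and skip the step when an overflow is found, and double it after a run of clean steps. This is a simplified model of the behaviour described above, not PyTorch's implementation; the class name is hypothetical, and the default factors (2.0 and 0.5) match GradScaler's documented defaults.

```python
class DynamicLossScale:
    """Toy dynamic loss scaling: back off on overflow, grow after a clean streak."""

    def __init__(self, init_scale=1024.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.clean_steps = 0  # consecutive steps without inf/NaN gradients

    def update(self, found_overflow: bool) -> bool:
        """Adjust the scale; return True if the optimizer step should be applied."""
        if found_overflow:
            self.scale *= self.backoff_factor
            self.clean_steps = 0
            return False  # skip this step: the gradients are unusable
        self.clean_steps += 1
        if self.clean_steps == self.growth_interval:
            self.scale *= self.growth_factor
            self.clean_steps = 0
        return True


loss_scale = DynamicLossScale(init_scale=1024.0, growth_interval=3)
history = []
for overflow in [False, False, False, True, False]:
    loss_scale.update(overflow)
    history.append(loss_scale.scale)
print(history)  # [1024.0, 1024.0, 2048.0, 1024.0, 1024.0]
```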

Quantized optimizers: 8-bit Adam

The Adam optimizer stores two FP32 state buffers per parameter (the first and second moment estimates), adding 8 bytes per parameter, which is twice the size of the FP32 weights themselves. The bitsandbytes library provides 8-bit quantized optimizer states using block-wise dynamic quantization.
import bitsandbytes as bnb

# Drop-in replacement for torch.optim.Adam
optimizer = bnb.optim.Adam8bit(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
)

# Or use AdamW with 8-bit states
optimizer = bnb.optim.AdamW8bit(
    model.parameters(),
    lr=1e-4,
    weight_decay=0.01,
)
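8-bit Adam achieves near-identical convergence to FP32 Adam at roughly a quarter of the optimizer-state memory (2 bytes instead of 8 per parameter, plus a small overhead for per-block scale factors), enabling training of larger models on the same hardware.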
8-bit Adam is production-ready and widely used in fine-tuning workflows. The memory savings are particularly impactful when training on a single GPU.
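Block-wise quantization gives each block of the state its own scale, so one outlier only costs precision within its block. The sketch below uses a plain linear INT8 grid for simplicity; that is an assumption made for illustration, since bitsandbytes uses a non-linear dynamic quantization map, and the function names are hypothetical.

```python
def quantize_blockwise(values, block_size=4):
    """Quantize each block to INT8 codes using that block's own absmax scale."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / 127 if amax > 0 else 1.0
        scales.append(scale)
        codes.extend(round(v / scale) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Map INT8 codes back to floats using each block's scale."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


# The outlier (50.0) sits in the second block; the first block keeps a fine scale
state = [0.012, -0.02, 0.015, 0.005, 50.0, 0.5, -1.0, 2.0]
codes, scales = quantize_blockwise(state, block_size=4)
restored = dequantize_blockwise(codes, scales, block_size=4)

# With a single scale for the whole tensor, the small values all collapse to 0
per_tensor_codes, _ = quantize_blockwise(state, block_size=len(state))

print(codes[:4])             # [76, -127, 95, 32]
print(per_tensor_codes[:4])  # [0, 0, 0, 0]
```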

Quartet 4-bit training (Lecture 69)

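Quartet, presented in Lecture 69 by Roberto Castro and Andrei Panferov, extends quantized training to 4 bits. As described in the Quartet paper, it combines:
  • 4-bit floating-point (MXFP4) quantization of the matrix multiplications in linear layers, in both the forward and backward pass
  • A shared scale factor per small group of values, recomputed each step
  • Hadamard transforms to spread outliers before quantization, with stochastic rounding in the backward pass
  • A fast 4-bit GEMM backend via qutlass, targeting NVIDIA Blackwell GPUs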
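# Quartet training (conceptual): the `quartet` module, `Quartet4bitLinear`, and
# `replace_linear_with_quartet` are illustrative names, not a confirmed API
import bitsandbytes as bnb
from quartet import Quartet4bitLinear

# Replace standard linear layers (hypothetical helper that swaps in Quartet4bitLinear)
model = replace_linear_with_quartet(model, group_size=128)

# Training proceeds normally; Quartet handles the 4-bit forward/backward
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)
for inputs, labels in dataloader:
    optimizer.zero_grad()
    output = model(inputs)
    loss = criterion(output, labels)
    loss.backward()
    optimizer.step()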
See the Quartet paper and code for full implementation details.
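The storage side of per-group 4-bit quantization can be sketched independently of Quartet's kernels. The example below uses a plain symmetric integer grid for simplicity rather than Quartet's actual number format, and all function names are illustrative. It quantizes one group, packs two 4-bit codes per byte, and computes the resulting bytes per weight.

```python
def quantize_group_4bit(group):
    """Symmetric 4-bit quantization of one group: codes in [-7, 7] plus one scale."""
    amax = max(abs(v) for v in group)
    scale = amax / 7 if amax > 0 else 1.0
    return [max(-7, min(7, round(v / scale))) for v in group], scale


def pack_nibbles(codes):
    """Pack pairs of signed 4-bit codes into bytes (two's-complement nibbles)."""
    assert len(codes) % 2 == 0
    pairs = zip(codes[0::2], codes[1::2])
    return bytes((lo & 0xF) | ((hi & 0xF) << 4) for lo, hi in pairs)


def unpack_nibbles(packed):
    """Inverse of pack_nibbles: recover the signed 4-bit codes."""
    def signed(n):  # map 0..15 back to -8..7
        return n - 16 if n >= 8 else n
    out = []
    for byte in packed:
        out += [signed(byte & 0xF), signed(byte >> 4)]
    return out


def bytes_per_weight(group_size: int, scale_bytes: int = 2) -> float:
    """Half a byte per code, plus one scale shared by each group."""
    return 0.5 + scale_bytes / group_size


weights = [0.7, -0.3, 0.12, -0.7]
codes, scale = quantize_group_4bit(weights)
packed = pack_nibbles(codes)

print(codes)                  # [7, -3, 1, -7]
print(len(packed))            # 2 bytes for 4 weights
print(bytes_per_weight(128))  # 0.515625
```

Smaller groups track local magnitudes more closely but spend more bytes on scales, which is one reason the per-parameter cost of 4-bit weights lands slightly above 0.5 bytes.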

Hardware support

Hopper FP8

H100 and H200 include native FP8 (E4M3 and E5M2) Tensor Cores, which provide 2× the throughput of BF16 Tensor Cores. Native FP8 Tensor Cores are the practical requirement for production FP8 training.

Turing / Ampere INT8

INT8 Tensor Cores are available from Turing (RTX 20xx) onwards, and Ampere (A100) adds BF16 Tensor Cores. INT8 is primarily used for inference; INT8 training is experimental.
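A training script can pick its precision from the GPU's CUDA compute capability. The mapping below is a minimal sketch based on the architectures named on this page (Volta 7.0, Turing 7.5, Ampere 8.0, Hopper 9.0) plus Ada Lovelace at 8.9; the function name is illustrative.

```python
def tensor_core_formats(major: int, minor: int) -> list[str]:
    """Low-precision Tensor Core formats available at a CUDA compute capability."""
    cc = (major, minor)
    formats = []
    if cc >= (7, 0):  # Volta
        formats.append("FP16")
    if cc >= (7, 5):  # Turing
        formats.append("INT8")
    if cc >= (8, 0):  # Ampere
        formats.append("BF16")
    if cc >= (8, 9):  # Ada Lovelace and Hopper (9.0)
        formats.append("FP8")
    return formats


print(tensor_core_formats(7, 5))  # ['FP16', 'INT8']
print(tensor_core_formats(8, 0))  # ['FP16', 'INT8', 'BF16']
print(tensor_core_formats(9, 0))  # ['FP16', 'INT8', 'BF16', 'FP8']
```

In PyTorch, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple that this function expects.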
