Quantized Training: Low-Precision Gradient Updates

Training neural networks in full FP32 precision is expensive: a 7B-parameter model with gradients and Adam optimizer state requires over 100 GB of memory (7 × 10⁹ parameters × 16 bytes ≈ 112 GB), before counting activations. Quantized training reduces this cost by using lower-precision formats throughout the training loop, not just at inference. This page follows Lecture 30 by Thien Tran, with additional coverage of Quartet 4-bit training from Lecture 69.

The accompanying Colab notebook from Lecture 30 contains runnable code for all techniques on this page.

Why train in low precision

Full-precision training (FP32 weights + FP32 optimizer states) is the most memory-hungry configuration. The memory cost per parameter in bytes:

Configuration	Bytes per parameter
FP32 weights + FP32 Adam	16 (param + grad + 2 × optimizer state)
BF16 mixed precision + FP32 Adam states	12
FP8 weights + FP8 gradients	2–4
4-bit weights (Quartet)	~0.5–1
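The table can be turned into a rough memory estimate for a given model size. The sketch below assumes the per-parameter byte counts listed above (taking the upper end of each range) and ignores activations; `BYTES_PER_PARAM` and `training_memory_gb` are names introduced here for illustration.

```python
# Bytes per parameter, copied from the table above (upper end of each range).
BYTES_PER_PARAM = {
    "FP32 weights + FP32 Adam": 16,
    "BF16 mixed precision + FP32 Adam states": 12,
    "FP8 weights + FP8 gradients": 4,
    "4-bit weights (Quartet)": 1,
}


def training_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory for weights, gradients, and optimizer state in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Estimate for a 7B-parameter model; activations are not included.
for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name}: {training_memory_gb(7e9, nbytes):.0f} GB")
```

For the FP32 row this gives 112 GB, which is where the "over 100 GB" figure for a 7B model comes from.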

Reducing precision also increases compute throughput: on Ampere (A100), BF16 Tensor Cores deliver about 2× the dense throughput of TF32 Tensor Cores and far more than FP32 on CUDA cores, and FP8 Tensor Cores on Hopper roughly double BF16 throughput again.

FP16 / BF16 mixed precision training

Mixed precision training keeps an FP32 master copy of the weights for the optimizer update, but performs the forward and backward pass in FP16 or BF16. This is the standard training regime for most modern LLMs.

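import torch

# Loss scaling is required for FP16 and unnecessary for BF16,
# so the scaler is enabled only when training in FP16.
use_fp16 = False
amp_dtype = torch.float16 if use_fp16 else torch.bfloat16
scaler = torch.cuda.amp.GradScaler(enabled=use_fp16)

for inputs, labels in dataloader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=amp_dtype):
        output = model(inputs)
        loss = criterion(output, labels)

    # With the scaler disabled, these calls reduce to loss.backward() and optimizer.step()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()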
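Why the higher-precision master copy matters can be seen with a small numerical experiment. The sketch below is an illustration rather than part of the lecture code: it emulates FP16 rounding with the standard library's `struct` half-precision format, and `UPDATE` is a hypothetical per-step weight change chosen to be smaller than half the FP16 spacing near 1.0 (about 4.9 × 10⁻⁴).

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value via struct's 'e' format."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


UPDATE = 1e-4  # hypothetical per-step weight update (learning rate * gradient)
STEPS = 100

w_fp16_only = 1.0  # weight stored and updated directly in FP16
w_master = 1.0     # higher-precision master copy (a Python float here)

for _ in range(STEPS):
    w_fp16_only = to_fp16(w_fp16_only + UPDATE)  # each update is rounded away
    w_master += UPDATE                           # updates accumulate

print(w_fp16_only)        # 1.0
print(to_fp16(w_master))  # 1.009765625
```

The FP16-only weight never moves, while the master copy accumulates the updates and only its low-precision view is rounded.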

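FP16

1 sign bit, 5-bit exponent, 10-bit mantissa
Range: ~6.1 × 10⁻⁵ (smallest normal value) to 65504 (~6.55 × 10⁴)
Requires loss scaling to avoid underflow in gradients
Tensor Core support on Volta and later

BF16

1 sign bit, 8-bit exponent, 7-bit mantissa
Range: ~1.2 × 10⁻³⁸ (smallest normal value) to ~3.4 × 10³⁸, the same as FP32
Does not require loss scaling
Tensor Core support on Ampere and later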

Prefer BF16 over FP16 for training. BF16’s wider exponent range eliminates the need for loss scaling and reduces numerical instability, with no meaningful accuracy difference for most LLM training runs.
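The range-versus-precision trade-off can be checked numerically. The following is a minimal sketch using only the standard library: FP16 is emulated with `struct`'s half-precision format, and BF16 is emulated by rounding an FP32 bit pattern to its top 16 bits. Both helper names are introduced here for illustration and assume finite inputs.

```python
import struct


def to_fp16(x: float) -> float:
    """Round to the nearest FP16 value; out-of-range inputs become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Round a finite value to the nearest BF16 value (top 16 bits of FP32)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even on the dropped bits
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Range: a tiny gradient and a large activation
print(to_fp16(1e-8))     # 0.0 (underflow)
print(to_bf16(1e-8))     # close to 1e-8
print(to_fp16(70000.0))  # inf (overflow)
print(to_bf16(70000.0))  # 70144.0

# Precision: BF16 has fewer mantissa bits than FP16
print(to_fp16(1.001))    # 1.0009765625
print(to_bf16(1.001))    # 1.0
```

BF16 keeps values that FP16 loses to underflow or overflow, at the cost of coarser spacing between representable numbers.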

FP8 training (emerging standard)

FP8 training uses 8-bit floating-point for both the forward pass and gradient computation, with FP32 master weights for the optimizer. NVIDIA Hopper (H100) introduced native FP8 Tensor Cores capable of 2× the throughput of BF16. Two FP8 variants are used together in training:

Format	Exponent bits	Mantissa bits	Use
E4M3	4	3	Forward pass (weights, activations) — more precision
E5M2	5	2	Backward pass (gradients) — wider range
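The "more precision" versus "wider range" distinction follows directly from the bit layouts. The sketch below derives the largest finite value and smallest subnormal of each format. It assumes the common FP8 convention in which E5M2 reserves its top exponent for infinities and NaN (IEEE-like), while E4M3 reserves only the all-ones mantissa at the top exponent for NaN; `fp8_limits` is a name introduced here.

```python
def fp8_limits(exp_bits: int, man_bits: int, ieee_like: bool) -> tuple[float, float]:
    """Return (largest finite value, smallest positive subnormal) for an FP8 layout."""
    bias = 2 ** (exp_bits - 1) - 1
    top_exp_field = 2 ** exp_bits - 1
    if ieee_like:
        # Top exponent field is reserved for inf/NaN (E5M2).
        max_exp = top_exp_field - 1 - bias
        max_significand = 2 - 2 ** -man_bits
    else:
        # Top exponent field is usable; only the all-ones mantissa is NaN (E4M3).
        max_exp = top_exp_field - bias
        max_significand = 2 - 2 ** -(man_bits - 1)
    max_finite = max_significand * 2.0 ** max_exp
    min_subnormal = 2.0 ** (1 - bias - man_bits)
    return max_finite, min_subnormal


print("E4M3:", fp8_limits(4, 3, ieee_like=False))  # (448.0, 0.001953125)
print("E5M2:", fp8_limits(5, 2, ieee_like=True))   # (57344.0, 1.52587890625e-05)
```

E5M2 covers a much larger span of magnitudes, which suits gradients, while E4M3 spends the extra bit on resolution for weights and activations.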

import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Define the FP8 recipe (delayed scaling from a history of amax values)
fp8_recipe = recipe.DelayedScaling(
    margin=0,
    fp8_format=recipe.Format.HYBRID,  # E4M3 forward, E5M2 backward
    amax_history_len=16,
    amax_compute_algo="max",
)

# Use TransformerEngine modules for FP8
model = te.Linear(1024, 1024)
input_tensor = torch.randn(16, 1024, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    output = model(input_tensor)
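The idea behind a delayed-scaling recipe can be sketched in plain Python. This is a simplified illustration, not TransformerEngine's implementation: the class name `DelayedScaler` is hypothetical, and it only mirrors the roles of `margin`, `amax_history_len`, and the `"max"` amax algorithm by deriving the scale from the largest absolute value seen over recent steps.

```python
from collections import deque

E4M3_MAX = 448.0  # largest finite E4M3 value


class DelayedScaler:
    """Track recent amax values and derive the FP8 scale from their maximum."""

    def __init__(self, fp8_max: float = E4M3_MAX, history_len: int = 16, margin: int = 0):
        self.fp8_max = fp8_max
        self.margin = margin
        self.amax_history = deque(maxlen=history_len)
        self.scale = 1.0

    def update(self, values: list[float]) -> float:
        """Record this step's amax and return the scale to use on the next step."""
        self.amax_history.append(max(abs(v) for v in values))
        amax = max(self.amax_history)  # the "max" algorithm over the history window
        if amax > 0.0:
            self.scale = self.fp8_max / amax / (2 ** self.margin)
        return self.scale


scaler = DelayedScaler()
print(scaler.update([0.5, -2.0, 1.0]))  # 224.0
print(scaler.update([0.1, -0.1]))       # 224.0 (history still contains amax 2.0)
```

Using a history rather than the current tensor's amax avoids an extra pass over the data before each cast, at the cost of reacting to sudden spikes one step late.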

Hardware-accelerated FP8 training requires a GPU with FP8 Tensor Cores, such as Hopper (H100/H200). On older GPUs without FP8 Tensor Cores, FP8 can at best be emulated in higher precision, which gives no speedup over BF16.

INT8 training challenges

INT8 training is harder than INT8 inference because gradients have a much wider dynamic range than activations. Challenges include:

Gradient underflow: small gradient values round to zero in INT8
Gradient overflow: large gradient spikes exceed INT8 range
Accumulation errors: repeated INT8 operations compound quantization error

Practical INT8 training usually quantizes weights and activations only, keeping gradients in FP16 or BF16. Full INT8 gradient training remains an active research area.
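The underflow problem is easy to reproduce. The sketch below assumes symmetric per-tensor INT8 quantization, one common scheme and not necessarily the one used in the lecture, and a hypothetical gradient tensor with a single large spike.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor INT8 quantization: returns (integer codes, scale)."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


# Hypothetical gradients: three small values and one spike.
grads = [1e-5, -3e-5, 2e-5, 5e-2]
codes, scale = quantize_int8(grads)
dequantized = [c * scale for c in codes]

print(codes)        # [0, 0, 0, 127]
print(dequantized)  # the three small gradients are lost entirely
```

Because the spike sets the scale, every gradient smaller than half a quantization step (about 2 × 10⁻⁴ here) rounds to zero.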

Avoid quantizing gradient accumulators to INT8. Use at least FP16 for the gradient accumulation buffer to prevent significant accuracy degradation.

Stochastic rounding

Standard rounding (round-to-nearest) introduces systematic bias: values consistently below the midpoint always round down. Stochastic rounding adds random noise before rounding, making the expected value of the rounded result equal to the original value.

import torch

def stochastic_round(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Stochastic rounding to simulate low-precision arithmetic."""
    scale = 2 ** (bits - 1) - 1
    x_scaled = x * scale
    # Add uniform noise in [-0.5, 0.5] before rounding
    noise = torch.rand_like(x_scaled) - 0.5
    return (x_scaled + noise).round() / scale

Stochastic rounding matters most when values are updated directly in low precision: it prevents the systematic drift that accumulates when many small gradient steps are each rounded the same way. It is commonly applied to weight updates when no higher-precision master copy is kept.
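The drift described above can be demonstrated without any tensor library. In this sketch, a weight stored on an integer grid receives a hypothetical update of 0.3 grid steps per iteration; the seeded `random.Random` instance and the constants are chosen only for illustration.

```python
import math
import random


def round_nearest(x: float) -> int:
    """Deterministic round-to-nearest (ties round up)."""
    return math.floor(x + 0.5)


def round_stochastic(x: float, rng: random.Random) -> int:
    """Round up with probability equal to the fractional part of x."""
    return math.floor(x + rng.random())


rng = random.Random(0)
UPDATE = 0.3    # update size in units of one quantization step
STEPS = 10_000

w_nearest = 0
w_stochastic = 0
for _ in range(STEPS):
    w_nearest = round_nearest(w_nearest + UPDATE)
    w_stochastic = round_stochastic(w_stochastic + UPDATE, rng)

print(w_nearest)     # 0: every update is rounded away
print(w_stochastic)  # close to STEPS * UPDATE = 3000
```

Round-to-nearest discards every update because each one is below half a step, while stochastic rounding tracks the true sum in expectation.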

Loss scaling for numerical stability

When training with FP16, small gradients can underflow to zero: the smallest normal FP16 value is ~6.1 × 10⁻⁵, and subnormal values extend only to ~6 × 10⁻⁸ with reduced precision. Loss scaling multiplies the loss by a large constant before backpropagation, shifting the gradient magnitudes into the representable range.

# Manual loss scaling
LOSS_SCALE = 1024.0

optimizer.zero_grad()
loss = criterion(output, labels)
scaled_loss = loss * LOSS_SCALE
scaled_loss.backward()

# Unscale the gradients before the optimizer step
for param in model.parameters():
    if param.grad is not None:
        param.grad.div_(LOSS_SCALE)

optimizer.step()

PyTorch’s GradScaler automates this with dynamic scaling: it increases the scale factor when no overflow is detected and decreases it after an overflow.

scaler = torch.cuda.amp.GradScaler(init_scale=1024.0, growth_interval=2000)
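The dynamic-scaling policy can be sketched in a few lines. This is a simplified illustration of the behaviour described above, not PyTorch's implementation: `DynamicLossScaler` is a hypothetical class, and its parameter names are chosen to echo `GradScaler`'s `init_scale`, `growth_factor`, `backoff_factor`, and `growth_interval`.

```python
import math


class DynamicLossScaler:
    """Grow the scale after a run of clean steps; shrink it after an overflow."""

    def __init__(self, init_scale: float = 1024.0, growth_factor: float = 2.0,
                 backoff_factor: float = 0.5, growth_interval: int = 2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._clean_steps = 0

    def update(self, grads: list[float]) -> bool:
        """Inspect scaled gradients; return True if the optimizer step should run."""
        if any(math.isinf(g) or math.isnan(g) for g in grads):
            self.scale *= self.backoff_factor  # overflow: shrink and skip the step
            self._clean_steps = 0
            return False
        self._clean_steps += 1
        if self._clean_steps == self.growth_interval:
            self.scale *= self.growth_factor   # long clean run: try a larger scale
            self._clean_steps = 0
        return True


scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=2)
print(scaler.update([0.1, -0.2]), scaler.scale)      # True 1024.0
print(scaler.update([0.3, 0.1]), scaler.scale)       # True 2048.0
print(scaler.update([float("inf")]), scaler.scale)   # False 1024.0
```

Skipping the step on overflow keeps non-finite gradients out of the weights, and the periodic growth lets the scale climb back toward the largest value that does not overflow.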

Quantized optimizers: 8-bit Adam

The Adam optimizer stores two FP32 state buffers per parameter (the first and second moment estimates), adding 8 bytes per parameter, which is twice the size of the FP32 weights themselves. The bitsandbytes library provides 8-bit quantized optimizer states using block-wise dynamic quantization.

import bitsandbytes as bnb

# Drop-in replacement for torch.optim.Adam
optimizer = bnb.optim.Adam8bit(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
)

# Or use AdamW with 8-bit states
optimizer = bnb.optim.AdamW8bit(
    model.parameters(),
    lr=1e-4,
    weight_decay=0.01,
)

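8-bit Adam achieves near-identical convergence to FP32 Adam at roughly a quarter of the optimizer-state memory (about 2 bytes per parameter instead of 8, plus a small per-block overhead for the scales), enabling training of larger models on the same hardware.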

8-bit Adam is production-ready and widely used in fine-tuning workflows. The memory savings are particularly impactful when training on a single GPU.
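The benefit of block-wise scaling can be shown with a toy example. The sketch below is a simplification: it uses linear absmax quantization per block, whereas bitsandbytes uses a non-linear dynamic quantization map, and the tiny `block_size` and the optimizer-state values are hypothetical.

```python
def quantize_blockwise(values: list[float], block_size: int) -> tuple[list[int], list[float]]:
    """Quantize to signed 8-bit codes with one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
        scales.append(absmax)
        codes.extend(round(v / absmax * 127) for v in block)
    return codes, scales


def dequantize_blockwise(codes: list[int], scales: list[float], block_size: int) -> list[float]:
    """Map 8-bit codes back to floats using each block's scale."""
    return [c / 127 * scales[i // block_size] for i, c in enumerate(codes)]


# Hypothetical optimizer state: one block of small values, one with an outlier.
state = [0.001, -0.002, 0.0015, 0.0005, 10.0, -5.0, 2.5, 1.0]

codes_tensor, _ = quantize_blockwise(state, block_size=len(state))  # one scale overall
codes_block, scales_block = quantize_blockwise(state, block_size=4)
restored = dequantize_blockwise(codes_block, scales_block, block_size=4)

print(codes_tensor[:4])  # [0, 0, 0, 0]: small values vanish under a single scale
print(restored[:4])      # small values survive with block-wise scales
```

Confining each outlier's influence to its own block is what lets 8-bit states stay accurate enough for the optimizer update.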

Quartet 4-bit training (Lecture 69)

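Quartet, presented in Lecture 69 by Roberto Castro and Andrei Panferov, extends quantized training to 4-bit precision. According to the Quartet paper, it runs the linear-layer matrix multiplications of both the forward and backward pass in FP4 on NVIDIA Blackwell GPUs. It combines:

4-bit quantization of weights, activations, and gradients using the MXFP4 floating-point format
Hadamard transforms that spread outliers before quantization
A per-group scale factor recomputed each step
A fast FP4 GEMM backend via qutlass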

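# Quartet training (conceptual). The names imported from quartet below are
# illustrative placeholders, not a confirmed API; see the Quartet code for the real interface.
import bitsandbytes as bnb
from quartet import replace_linear_with_quartet  # hypothetical helper

# Replace standard linear layers
model = replace_linear_with_quartet(model, group_size=128)

# Training proceeds normally: Quartet handles the 4-bit forward/backward
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)
for inputs, labels in dataloader:
    optimizer.zero_grad()
    output = model(inputs)
    loss = criterion(output, labels)
    loss.backward()
    optimizer.step()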

See the Quartet paper and code for full implementation details.
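Independent of the exact 4-bit number format, the storage saving comes from packing two 4-bit codes into each byte. The following is a generic sketch of that packing, not Quartet's actual memory layout; the helper names and the low-nibble-first ordering are assumptions made for illustration.

```python
def pack_nibbles(codes: list[int]) -> bytes:
    """Pack unsigned 4-bit codes (0..15) two per byte, low nibble first."""
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("codes must fit in 4 bits")
    padded = list(codes) + [0] * (len(codes) % 2)  # pad to an even count
    return bytes((hi << 4) | lo for lo, hi in zip(padded[0::2], padded[1::2]))


def unpack_nibbles(packed: bytes, count: int) -> list[int]:
    """Recover the first `count` 4-bit codes from packed bytes."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out[:count]


codes = [1, 15, 7, 0, 9]
packed = pack_nibbles(codes)
print(len(packed))                         # 3 bytes for 5 codes
print(unpack_nibbles(packed, len(codes)))  # [1, 15, 7, 0, 9]
```

With per-group scale factors stored alongside the packed codes, the total lands somewhat above 0.5 bytes per weight.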

Hardware support

Hopper FP8

H100 and H200 include native FP8 (E4M3 and E5M2) Tensor Cores, which provide roughly 2× the throughput of BF16 Tensor Cores. Hardware FP8 support of this kind is required for production FP8 training.

Turing / Ampere INT8

INT8 Tensor Cores are available from Turing (RTX 20xx) onwards, and Ampere (A100) adds BF16 Tensor Cores. These are primarily used for INT8 inference; INT8 training is experimental.
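A training script can choose its precision from the GPU's compute capability. The sketch below is a rough guide rather than an authoritative support matrix: `supported_formats` is a hypothetical helper, and the thresholds encode Volta (7.0) for FP16, Turing (7.5) for INT8, Ampere (8.0) for BF16, and FP8 from compute capability 8.9, which also covers Ada GPUs in addition to the Hopper parts discussed above.

```python
def supported_formats(major: int, minor: int) -> list[str]:
    """Tensor Core formats expected at a given CUDA compute capability."""
    cc = (major, minor)
    formats = []
    if cc >= (7, 0):
        formats.append("FP16")  # Volta and later
    if cc >= (7, 5):
        formats.append("INT8")  # Turing and later
    if cc >= (8, 0):
        formats.append("BF16")  # Ampere and later
    if cc >= (8, 9):
        formats.append("FP8")   # Ada (8.9), Hopper (9.0) and later
    return formats


# With PyTorch installed, the capability could come from
# torch.cuda.get_device_capability(), which returns a (major, minor) tuple.
print(supported_formats(9, 0))  # ['FP16', 'INT8', 'BF16', 'FP8']
print(supported_formats(8, 0))  # ['FP16', 'INT8', 'BF16']
print(supported_formats(7, 5))  # ['FP16', 'INT8']
```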
