Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/gpu-mode/lectures/llms.txt

Use this file to discover all available pages before exploring further.

Quantization reduces the numerical precision of model weights and activations to lower memory usage and increase inference throughput. This page covers the techniques and tradeoffs covered in Lecture 7 by Charles Hernandez, with additional material from Lecture 42 on INT8 matrix multiplication on Turing GPUs.

Why quantization

Modern LLMs are memory-bandwidth bound during inference. A 70B parameter model in FP32 requires 280 GB of memory — far beyond what fits on a single GPU. Quantization attacks this problem directly:
  • Model size: INT8 weights use 4× less memory than FP32; INT4 uses 8×
  • Inference throughput: lower-precision compute units (like Tensor Cores with DP4A) can execute more operations per second
  • Memory bandwidth: moving fewer bytes per parameter allows faster token generation
The primary cost is accuracy degradation, which quantization research aims to minimize.

Post-training quantization (PTQ) overview

PTQ applies quantization after a model has been fully trained in FP32 or BF16, without further gradient updates.
1

Calibrate the model

Run a small calibration dataset through the model to collect activation statistics (min, max, percentiles).
2

Determine quantization parameters

Compute scale and zero-point values for each tensor based on the observed range.
3

Quantize weights and activations

Map floating-point values to integers using the derived parameters.
4

Evaluate accuracy

Benchmark on a downstream task to verify that accuracy degradation is within tolerance.
PTQ is the fastest path to a quantized model but introduces more accuracy loss than quantization-aware training, especially for sub-8-bit precisions.

INT8 quantization: symmetric vs. asymmetric

Quantization maps a floating-point range [min, max] to an integer range (e.g., [-128, 127] for INT8).
The float range is symmetric around zero. The scale s is computed as:
s = max(abs(min_val), abs(max_val)) / 127.0
x_int8 = round(x / s).clamp(-128, 127)
Dequantization is just multiplication: x_fp = x_int8 * s. This is simpler and preferred for weights.
Use symmetric quantization for weights (their distributions are roughly symmetric around zero) and asymmetric for activations (which are often non-negative after ReLU or SiLU).

INT8 GEMM on NVIDIA GPUs

Tensor Cores and the DP4A instruction

NVIDIA Turing and later architectures include dedicated INT8 Tensor Core hardware. The key instruction is DP4A (Dot Product of 4 elements, Accumulate):
DP4A: int32 = dot(int8x4, int8x4) + int32_accumulator
This computes a dot product of four INT8 pairs and accumulates into INT32, enabling high-throughput matrix multiplications entirely in integer arithmetic.
// Pseudocode for INT8 GEMM using DP4A on Turing
// Inputs: A (M x K, int8), B (K x N, int8)
// Output: C (M x N, int32)
for (int m = 0; m < M; m++) {
  for (int n = 0; n < N; n++) {
    int32_t acc = 0;
    for (int k = 0; k < K; k += 4) {
      // DP4A processes 4 elements at once
      acc = __dp4a(*(int32_t*)&A[m][k],
                   *(int32_t*)&B[k][n],
                   acc);
    }
    C[m][n] = acc;
  }
}
After the integer GEMM, dequantize by multiplying by the per-channel scale factors:
# Dequantize output
C_fp = C_int32.float() * (scale_A * scale_B)

INT8 matrix multiply on Turing (Lecture 42)

Lecture 42 (slide deck: int8_mm_turing.pdf) covers the specifics of INT8 matrix multiplication on the Turing microarchitecture. Key points:
  • Turing introduces the first consumer-class INT8 Tensor Cores
  • The ldmatrix instruction loads tiles from shared memory into registers efficiently for tensor core operations
  • Achieving peak INT8 throughput requires careful attention to memory layout: weights should be in column-major order, activations in row-major
  • INT8 throughput is 2× INT16 and 4× FP32 on Turing
Turing INT8 Tensor Cores require the K dimension to be a multiple of 16. Pad your matrices if necessary.

CUDA vs. Triton for quantized kernels

Lecture 7 slides (PDF) compare CUDA and Triton for implementing quantized GEMM kernels.
// CUDA INT8 GEMM kernel (simplified)
__global__ void int8_gemm(
    const int8_t* A, const int8_t* B,
    int32_t* C, int M, int N, int K) {
  // Use wmma (Warp Matrix Multiply-Accumulate) for INT8 Tensor Cores
  using namespace nvcuda::wmma;
  fragment<matrix_a, 16, 16, 16, int8_t, row_major> a_frag;
  fragment<matrix_b, 16, 16, 16, int8_t, col_major> b_frag;
  fragment<accumulator, 16, 16, 16, int32_t> c_frag;

  fill_fragment(c_frag, 0);
  load_matrix_sync(a_frag, A + ..., K);
  load_matrix_sync(b_frag, B + ..., N);
  mma_sync(c_frag, a_frag, b_frag, c_frag);
  store_matrix_sync(C + ..., c_frag, N, mem_row_major);
}
Pros: Full control over memory layout, instruction selection, and register usage. Can match peak hardware throughput.Cons: Verbose; requires deep knowledge of Tensor Core tiling constraints and shared memory layout.
For standard INT8 GEMM on well-supported hardware, start with Triton. Fall back to CUDA when you need sub-8-bit formats, custom dequantization fusions, or architecture-specific instructions.

Sub-8-bit quantization: INT4 and NF4

Going below 8 bits further reduces memory at the cost of accuracy.

INT4

Four-bit integers span [-8, 7] (signed) or [0, 15] (unsigned). Two INT4 values pack into one byte. Typical usage: weight-only quantization (activations stay in FP16/BF16).
# Pack two INT4 values into one uint8
def pack_int4(low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
    """Pack two tensors of 4-bit values into uint8."""
    return (high.to(torch.uint8) << 4) | (low.to(torch.uint8) & 0xF)

# Dequantize packed INT4 weights
def dequantize_int4(packed: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    low = (packed & 0xF).to(torch.int8) - 8   # signed [-8, 7]
    high = (packed >> 4).to(torch.int8) - 8
    return torch.stack([low, high], dim=-1).float() * scale

NF4 (NormalFloat 4-bit)

NF4 is a non-uniform 4-bit format from QLoRA that places quantization levels according to the quantiles of a normal distribution, not linearly. This matches the typical distribution of neural network weights.
# NF4 quantization levels (from QLoRA paper)
NF4_TABLE = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848,
    -0.0911, 0.0,   0.0796,  0.1609,  0.2461,  0.3379,
     0.4407, 0.5626, 0.7230,  1.0
]
NF4 is implemented in the bitsandbytes library and is the default format for QLoRA fine-tuning. It consistently outperforms linear INT4 for weight quantization.

Quantization-aware training (QAT) overview

QAT inserts fake quantization operations during forward passes so gradients flow through quantized representations during training. This closes the accuracy gap compared to PTQ, especially at INT4 and below.
import torch.quantization as tq

# Insert fake quantization observers
model.qconfig = tq.get_default_qat_qconfig('fbgemm')
tq.prepare_qat(model, inplace=True)

# Train normally — fake quant ops simulate quantization error
optimizer = torch.optim.Adam(model.parameters())
for batch in dataloader:
    loss = criterion(model(batch), labels)
    loss.backward()
    optimizer.step()

# Convert to actual quantized model
tq.convert(model.eval(), inplace=True)

Tools: bitsandbytes, GPTQ, AWQ

bitsandbytes

Drop-in INT8 and INT4 (NF4/FP4) quantization for Hugging Face models. Supports 8-bit Adam optimizer.

GPTQ

Layer-wise second-order PTQ that minimizes quantization error. Standard for INT4 LLM deployment.

AWQ

Activation-aware weight quantization. Protects salient weights based on activation magnitudes.
# Load a model in INT8 with bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
# Load a GPTQ-quantized model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto",
)

Further reading

Build docs developers (and LLMs) love