Quantization reduces the numerical precision of model weights and activations to lower memory usage and increase inference throughput. This page summarizes the techniques and tradeoffs from Lecture 7 by Charles Hernandez, with additional material from Lecture 42 on INT8 matrix multiplication on Turing GPUs.
Why quantization
LLM inference, especially token-by-token decoding at small batch sizes, is typically memory-bandwidth bound. A 70B parameter model in FP32 requires 280 GB of memory, far beyond what fits on a single GPU. Quantization attacks this problem directly:
- Model size: INT8 weights use 4× less memory than FP32; INT4 uses 8×
- Inference throughput: lower-precision compute paths (such as INT8 Tensor Cores, or CUDA cores executing DP4A) can execute more operations per second
- Memory bandwidth: moving fewer bytes per parameter allows faster token generation
Post-training quantization (PTQ) overview
PTQ applies quantization after a model has been fully trained in FP32 or BF16, without further gradient updates. It proceeds in three steps:
1. Calibrate the model: run a small calibration dataset through the model to collect activation statistics (min, max, percentiles).
2. Determine quantization parameters: compute scale and zero-point values for each tensor based on the observed range.
3. Quantize weights and activations: map floating-point values to integers using the derived parameters.
PTQ is the fastest path to a quantized model but introduces more accuracy loss than quantization-aware training, especially for sub-8-bit precisions.
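The three PTQ steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the procedure of any particular library: the "activations" are synthetic ReLU-like values with one injected outlier, the range is taken from the minimum and the 99.9th percentile, and the target type is unsigned 8-bit. The names `quantize` and `dequantize` are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: calibrate. Stand-in for real activations (an assumption of this
# sketch): ReLU-like outputs from 8 batches, with one injected outlier.
batches = [np.maximum(rng.normal(size=(32, 64)), 0).astype(np.float32)
           for _ in range(8)]
batches[0][0, 0] = 50.0
acts = np.concatenate([b.ravel() for b in batches])
lo = float(acts.min())
hi = float(np.percentile(acts, 99.9))  # clip the range instead of using max

# Step 2: derive scale and zero-point for unsigned 8-bit integers.
QMIN, QMAX = 0, 255
scale = (hi - lo) / (QMAX - QMIN)
zero_point = int(round(QMIN - lo / scale))

# Step 3: quantize (and dequantize, to inspect the error).
def quantize(x):
    q = np.round(np.asarray(x, dtype=np.float32) / scale) + zero_point
    return np.clip(q, QMIN, QMAX).astype(np.uint8)

def dequantize(q):
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([0.0, 0.5, 1.0, 2.0, 50.0], dtype=np.float32)
q = quantize(x)
x_hat = dequantize(q)  # the 50.0 outlier saturates at QMAX and is clipped
```

Using a percentile rather than the raw maximum trades a large error on the rare outlier for a much finer step size on the bulk of the values; with the raw maximum of 50.0 the scale here would be more than ten times coarser.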
INT8 quantization: symmetric vs. asymmetric
Quantization maps a floating-point range [min, max] to an integer range (e.g., [-128, 127] for INT8).
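This mapping comes in two common variants: a symmetric one that uses only a scale, and an asymmetric one that adds a zero-point. The sketch below compares them on an all-positive tensor. It assumes per-tensor parameters and round-to-nearest; the function names are hypothetical and not taken from the lecture.

```python
import numpy as np

def symmetric_scale(x, qmax=127):
    # Scale only: the largest magnitude maps to +/-qmax, and 0.0 maps to 0.
    return float(np.abs(x).max()) / qmax

def asymmetric_params(x, qmin=-128, qmax=127):
    # Scale plus zero-point: [min, max] is mapped onto [qmin, qmax].
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point

def roundtrip(x, scale, zero_point=0, qmin=-128, qmax=127):
    # Quantize to int8, then dequantize back to float32.
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 6.0, size=1000).astype(np.float32)  # skewed: all positive

s_sym = symmetric_scale(x)
q_sym, x_sym = roundtrip(x, s_sym, qmin=-127)

s_asym, zp = asymmetric_params(x)
q_asym, x_asym = roundtrip(x, s_asym, zp)

err_sym = float(np.abs(x - x_sym).mean())
err_asym = float(np.abs(x - x_asym).mean())
```

On this all-positive input the asymmetric variant spreads the data over all 256 levels, while the symmetric one leaves the negative half unused, so `err_asym` should come out at roughly half of `err_sym`. For tensors centered on zero, such as most weight matrices, the two behave similarly and the symmetric form avoids the zero-point arithmetic.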
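- Symmetric: the float range is symmetric around zero, so only a scale is needed. The scale s is typically computed from the largest absolute value in the tensor: s = max(|x_fp|) / 127. Dequantization is just multiplication: x_fp = x_int8 * s. This is simpler and preferred for weights.
- Asymmetric: the float range [min, max] need not be centered on zero, so a zero-point is stored alongside the scale.
INT8 GEMM on NVIDIA GPUs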
Tensor Cores and the DP4A instruction
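NVIDIA Turing and later architectures include dedicated INT8 Tensor Core hardware. A key instruction for INT8 arithmetic is DP4A (Dot Product of 4 elements, Accumulate): it multiplies four pairs of 8-bit integers and adds the sum of the products to a 32-bit accumulator. DP4A itself runs on the regular CUDA cores and has been available since Pascal (compute capability 6.1); Tensor Cores are driven by separate matrix multiply-accumulate (MMA) instructions.
INT8 matrix multiply on Turing (Lecture 42)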
Lecture 42 (slide deck: int8_mm_turing.pdf) covers the specifics of INT8 matrix multiplication on the Turing microarchitecture. Key points:
- Turing introduces the first consumer-class INT8 Tensor Cores
- The ldmatrix instruction loads tiles from shared memory into registers efficiently for tensor core operations
- Achieving peak INT8 throughput requires careful attention to memory layout: weights should be in column-major order, activations in row-major
- INT8 throughput is 2× INT16 and 4× FP32 on Turing
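To make the INT8-in, INT32-accumulate pattern concrete, here is a slow NumPy emulation of an INT8 GEMM built from 4-element dot products in the style of DP4A. It is a reference for the arithmetic only, not a GPU kernel and not the lecture's code. The names `dp4a` and `int8_gemm` are hypothetical, and the transpose of B is only meant to mimic a column-major weight layout, in which the four K-values consumed by each dot product are contiguous.

```python
import numpy as np

def dp4a(a4, b4, acc):
    """Emulate DP4A: dot product of four int8 pairs, added to an int32 accumulator."""
    return acc + int(np.dot(a4.astype(np.int32), b4.astype(np.int32)))

def int8_gemm(a, b):
    """C = A @ B with int8 inputs and int32 accumulation; K must be a multiple of 4."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and k % 4 == 0
    bt = np.ascontiguousarray(b.T)  # row j of bt holds column j of b contiguously
    c = np.zeros((m, n), dtype=np.int32)
    for i in range(m):
        for j in range(n):
            acc = 0
            for kk in range(0, k, 4):  # one DP4A per group of four K-values
                acc = dp4a(a[i, kk:kk + 4], bt[j, kk:kk + 4], acc)
            c[i, j] = acc
    return c

rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(4, 16), dtype=np.int8)  # activations, row-major
b = rng.integers(-128, 128, size=(16, 8), dtype=np.int8)  # weights
c = int8_gemm(a, b)
ref = a.astype(np.int32) @ b.astype(np.int32)  # reference result in int32
```

Accumulating in 32 bits matters because a single product of two INT8 values can already reach 16384, which overflows an 8-bit result many times over.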
CUDA vs. Triton for quantized kernels
The Lecture 7 slides (PDF) compare CUDA and Triton for implementing quantized GEMM kernels.
- CUDA
- Triton
Sub-8-bit quantization: INT4 and NF4
Going below 8 bits further reduces memory at the cost of accuracy.
INT4
Four-bit integers span [-8, 7] (signed) or [0, 15] (unsigned). Two INT4 values pack into one byte. Typical usage: weight-only quantization (activations stay in FP16/BF16).
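Packing two signed INT4 values into one byte can be sketched as follows. The nibble order (even index in the low nibble) is an assumption of this example; real kernels and libraries choose their own layouts, and the helper names `pack_int4` and `unpack_int4` are hypothetical. Inputs are assumed to be 1-D with an even number of elements.

```python
import numpy as np

def pack_int4(q):
    """Pack signed 4-bit values in [-8, 7] two per byte (even index -> low nibble)."""
    q = np.asarray(q, dtype=np.int8)
    assert q.ndim == 1 and q.size % 2 == 0
    assert q.min() >= -8 and q.max() <= 7
    u = (q & 0x0F).astype(np.uint8)  # keep the two's-complement low nibble
    return u[0::2] | (u[1::2] << 4)

def unpack_int4(packed):
    """Inverse of pack_int4: returns int8 values in [-8, 7]."""
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2] = (packed & 0x0F).astype(np.int8)
    out[1::2] = (packed >> 4).astype(np.int8)
    return np.where(out > 7, out - 16, out).astype(np.int8)  # sign-extend nibbles

w = np.array([-8, 7, -1, 0, 3, -4], dtype=np.int8)
packed = pack_int4(w)        # 3 bytes instead of 6
restored = unpack_int4(packed)
```

In a weight-only scheme the packed bytes are what sits in GPU memory; a kernel would unpack and rescale them to FP16/BF16 on the fly before the multiply.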
NF4 (NormalFloat 4-bit)
NF4 is a non-uniform 4-bit format from QLoRA that places quantization levels according to the quantiles of a normal distribution, not linearly. This matches the typical distribution of neural network weights.
NF4 is implemented in the bitsandbytes library and is the default format for QLoRA fine-tuning. It consistently outperforms linear INT4 for weight quantization.
Quantization-aware training (QAT) overview
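QAT inserts fake quantization operations (quantize followed immediately by dequantize) into the forward pass, so the model is trained on the rounded values it will see at inference. Gradients are typically passed through the rounding step with a straight-through estimator. This narrows the accuracy gap relative to PTQ, especially at INT4 and below.
Tools: bitsandbytes, GPTQ, AWQ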
bitsandbytes
Drop-in INT8 and 4-bit (NF4/FP4) quantization for Hugging Face models. Also provides 8-bit optimizers such as 8-bit Adam.
GPTQ
Layer-wise second-order PTQ that minimizes quantization error. Standard for INT4 LLM deployment.
AWQ
Activation-aware weight quantization. Protects salient weights based on activation magnitudes.
Further reading
- Lecture 7 slides (PDF): Charles Hernandez’s CUDA vs. Triton comparison
- Lecture 42: INT8 matrix multiply on Turing (int8_mm_turing.pdf)
- QLoRA paper: NF4 and double quantization
- GPTQ paper: layer-wise quantization via Hessian approximation