Advanced Quantization: From INT8 to Sub-4-Bit

Quantization reduces the numerical precision of model weights and activations to lower memory usage and increase inference throughput. This page covers the techniques and tradeoffs covered in Lecture 7 by Charles Hernandez, with additional material from Lecture 42 on INT8 matrix multiplication on Turing GPUs.

Why quantization

Modern LLMs are memory-bandwidth bound during inference. A 70B parameter model in FP32 requires 280 GB of memory — far beyond what fits on a single GPU. Quantization attacks this problem directly:

Model size: INT8 weights use 4× less memory than FP32; INT4 uses 8×
Inference throughput: lower-precision compute units (like Tensor Cores with DP4A) can execute more operations per second
Memory bandwidth: moving fewer bytes per parameter allows faster token generation

The primary cost is accuracy degradation, which quantization research aims to minimize.

Post-training quantization (PTQ) overview

PTQ applies quantization after a model has been fully trained in FP32 or BF16, without further gradient updates.

Calibrate the model

Run a small calibration dataset through the model to collect activation statistics (min, max, percentiles).

Determine quantization parameters

Compute scale and zero-point values for each tensor based on the observed range.

Quantize weights and activations

Map floating-point values to integers using the derived parameters.

Evaluate accuracy

Benchmark on a downstream task to verify that accuracy degradation is within tolerance.

PTQ is the fastest path to a quantized model but introduces more accuracy loss than quantization-aware training, especially for sub-8-bit precisions.

INT8 quantization: symmetric vs. asymmetric

Quantization maps a floating-point range [min, max] to an integer range (e.g., [-128, 127] for INT8).

Symmetric
Asymmetric

The float range is symmetric around zero. The scale s is computed as:

s = max(abs(min_val), abs(max_val)) / 127.0
x_int8 = round(x / s).clamp(-128, 127)

Dequantization is just multiplication: x_fp = x_int8 * s. This is simpler and preferred for weights.

The full [min, max] float range maps to [0, 255] (uint8) or [-128, 127] with a non-zero zero-point z:

s = (max_val - min_val) / 255.0
z = round(-min_val / s)
x_uint8 = round(x / s + z).clamp(0, 255)

Dequantization: x_fp = (x_uint8 - z) * s. More accurate for asymmetric activation distributions (e.g., post-ReLU).

Use symmetric quantization for weights (their distributions are roughly symmetric around zero) and asymmetric for activations (which are often non-negative after ReLU or SiLU).

INT8 GEMM on NVIDIA GPUs

Tensor Cores and the DP4A instruction

NVIDIA Turing and later architectures include dedicated INT8 Tensor Core hardware. The key instruction is DP4A (Dot Product of 4 elements, Accumulate):

DP4A: int32 = dot(int8x4, int8x4) + int32_accumulator

This computes a dot product of four INT8 pairs and accumulates into INT32, enabling high-throughput matrix multiplications entirely in integer arithmetic.

// Pseudocode for INT8 GEMM using DP4A on Turing
// Inputs: A (M x K, int8), B (K x N, int8)
// Output: C (M x N, int32)
for (int m = 0; m < M; m++) {
  for (int n = 0; n < N; n++) {
    int32_t acc = 0;
    for (int k = 0; k < K; k += 4) {
      // DP4A processes 4 elements at once
      acc = __dp4a(*(int32_t*)&A[m][k],
                   *(int32_t*)&B[k][n],
                   acc);
    }
    C[m][n] = acc;
  }
}

After the integer GEMM, dequantize by multiplying by the per-channel scale factors:

# Dequantize output
C_fp = C_int32.float() * (scale_A * scale_B)

INT8 matrix multiply on Turing (Lecture 42)

Lecture 42 (slide deck: int8_mm_turing.pdf) covers the specifics of INT8 matrix multiplication on the Turing microarchitecture. Key points:

Turing introduces the first consumer-class INT8 Tensor Cores
The ldmatrix instruction loads tiles from shared memory into registers efficiently for tensor core operations
Achieving peak INT8 throughput requires careful attention to memory layout: weights should be in column-major order, activations in row-major
INT8 throughput is 2× INT16 and 4× FP32 on Turing

Turing INT8 Tensor Cores require the K dimension to be a multiple of 16. Pad your matrices if necessary.

CUDA vs. Triton for quantized kernels

Lecture 7 slides (PDF) compare CUDA and Triton for implementing quantized GEMM kernels.

CUDA
Triton

// CUDA INT8 GEMM kernel (simplified)
__global__ void int8_gemm(
    const int8_t* A, const int8_t* B,
    int32_t* C, int M, int N, int K) {
  // Use wmma (Warp Matrix Multiply-Accumulate) for INT8 Tensor Cores
  using namespace nvcuda::wmma;
  fragment<matrix_a, 16, 16, 16, int8_t, row_major> a_frag;
  fragment<matrix_b, 16, 16, 16, int8_t, col_major> b_frag;
  fragment<accumulator, 16, 16, 16, int32_t> c_frag;

  fill_fragment(c_frag, 0);
  load_matrix_sync(a_frag, A + ..., K);
  load_matrix_sync(b_frag, B + ..., N);
  mma_sync(c_frag, a_frag, b_frag, c_frag);
  store_matrix_sync(C + ..., c_frag, N, mem_row_major);
}

Pros: Full control over memory layout, instruction selection, and register usage. Can match peak hardware throughput.Cons: Verbose; requires deep knowledge of Tensor Core tiling constraints and shared memory layout.

import triton
import triton.language as tl

@triton.jit
def int8_gemm_kernel(
    A_ptr, B_ptr, C_ptr,
    M, N, K,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.int32)

    for k in range(0, K, BLOCK_K):
        a = tl.load(A_ptr + pid_m * BLOCK_M * K + k, ...)
        b = tl.load(B_ptr + k * N + pid_n * BLOCK_N, ...)
        acc += tl.dot(a.to(tl.int8), b.to(tl.int8), out_dtype=tl.int32)

    tl.store(C_ptr + pid_m * BLOCK_M * N + pid_n * BLOCK_N, acc)

Pros: Concise; Triton handles tiling and shared memory automatically; easier to iterate on.Cons: Less control; compiler may not always generate optimal code for unusual quantization schemes.

For standard INT8 GEMM on well-supported hardware, start with Triton. Fall back to CUDA when you need sub-8-bit formats, custom dequantization fusions, or architecture-specific instructions.

Sub-8-bit quantization: INT4 and NF4

Going below 8 bits further reduces memory at the cost of accuracy.

INT4

Four-bit integers span [-8, 7] (signed) or [0, 15] (unsigned). Two INT4 values pack into one byte. Typical usage: weight-only quantization (activations stay in FP16/BF16).

# Pack two INT4 values into one uint8
def pack_int4(low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
    """Pack two tensors of 4-bit values into uint8."""
    return (high.to(torch.uint8) << 4) | (low.to(torch.uint8) & 0xF)

# Dequantize packed INT4 weights
def dequantize_int4(packed: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    low = (packed & 0xF).to(torch.int8) - 8   # signed [-8, 7]
    high = (packed >> 4).to(torch.int8) - 8
    return torch.stack([low, high], dim=-1).float() * scale

NF4 (NormalFloat 4-bit)

NF4 is a non-uniform 4-bit format from QLoRA that places quantization levels according to the quantiles of a normal distribution, not linearly. This matches the typical distribution of neural network weights.

# NF4 quantization levels (from QLoRA paper)
NF4_TABLE = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848,
    -0.0911, 0.0,   0.0796,  0.1609,  0.2461,  0.3379,
     0.4407, 0.5626, 0.7230,  1.0
]

NF4 is implemented in the bitsandbytes library and is the default format for QLoRA fine-tuning. It consistently outperforms linear INT4 for weight quantization.

Quantization-aware training (QAT) overview

QAT inserts fake quantization operations during forward passes so gradients flow through quantized representations during training. This closes the accuracy gap compared to PTQ, especially at INT4 and below.

import torch.quantization as tq

# Insert fake quantization observers
model.qconfig = tq.get_default_qat_qconfig('fbgemm')
tq.prepare_qat(model, inplace=True)

# Train normally — fake quant ops simulate quantization error
optimizer = torch.optim.Adam(model.parameters())
for batch in dataloader:
    loss = criterion(model(batch), labels)
    loss.backward()
    optimizer.step()

# Convert to actual quantized model
tq.convert(model.eval(), inplace=True)

Tools: bitsandbytes, GPTQ, AWQ

bitsandbytes

Drop-in INT8 and INT4 (NF4/FP4) quantization for Hugging Face models. Supports 8-bit Adam optimizer.

GPTQ

Layer-wise second-order PTQ that minimizes quantization error. Standard for INT4 LLM deployment.

AWQ

Activation-aware weight quantization. Protects salient weights based on activation magnitudes.

# Load a model in INT8 with bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Load a GPTQ-quantized model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto",
)

Getting Started

CUDA Fundamentals

Advanced GPU Programming

Triton & High-Level Frameworks

Quantization & Optimization

Multi-GPU & Systems

Hardware Targets

ScaleML Series

Advanced Quantization: From INT8 to Sub-4-Bit

Why quantization

Post-training quantization (PTQ) overview

INT8 quantization: symmetric vs. asymmetric

INT8 GEMM on NVIDIA GPUs

Tensor Cores and the DP4A instruction

INT8 matrix multiply on Turing (Lecture 42)

CUDA vs. Triton for quantized kernels

Sub-8-bit quantization: INT4 and NF4

INT4

NF4 (NormalFloat 4-bit)

Quantization-aware training (QAT) overview

Tools: bitsandbytes, GPTQ, AWQ

bitsandbytes

GPTQ

AWQ

Further reading

Build docs developers (and LLMs) love

Getting Started

CUDA Fundamentals

Advanced GPU Programming

Triton & High-Level Frameworks

Quantization & Optimization

Multi-GPU & Systems

Hardware Targets

ScaleML Series

Documentation Index

​Why quantization

​Post-training quantization (PTQ) overview

​INT8 quantization: symmetric vs. asymmetric

​INT8 GEMM on NVIDIA GPUs

​Tensor Cores and the DP4A instruction

​INT8 matrix multiply on Turing (Lecture 42)

​CUDA vs. Triton for quantized kernels

​Sub-8-bit quantization: INT4 and NF4

​INT4

​NF4 (NormalFloat 4-bit)

​Quantization-aware training (QAT) overview

​Tools: bitsandbytes, GPTQ, AWQ

bitsandbytes

GPTQ

AWQ

​Further reading

Build docs developers (and LLMs) love

Why quantization

Post-training quantization (PTQ) overview

INT8 quantization: symmetric vs. asymmetric

INT8 GEMM on NVIDIA GPUs

Tensor Cores and the DP4A instruction

INT8 matrix multiply on Turing (Lecture 42)

CUDA vs. Triton for quantized kernels

Sub-8-bit quantization: INT4 and NF4

INT4

NF4 (NormalFloat 4-bit)

Quantization-aware training (QAT) overview

Tools: bitsandbytes, GPTQ, AWQ

Further reading