torchvision.models.quantization provides INT8-quantized variants of several popular classification models. Quantized models store weights and activations as 8-bit integers instead of 32-bit floats, which makes them up to 4× smaller and typically 2–3× faster on CPU than their float counterparts, with only a small accuracy trade-off. All quantized models are optimised for CPU inference and are not currently supported on GPU.
Available quantized models
The following models have quantized variants. Each builder function accepts the same weights and progress arguments as its full-precision counterpart, plus a quantize boolean flag.
| Builder | Quantized weights class | Float weights class | Training method |
|---|---|---|---|
| quantization.googlenet | GoogLeNet_QuantizedWeights | GoogLeNet_Weights | PTQ |
| quantization.inception_v3 | Inception_V3_QuantizedWeights | Inception_V3_Weights | PTQ |
| quantization.mobilenet_v2 | MobileNet_V2_QuantizedWeights | MobileNet_V2_Weights | QAT |
| quantization.mobilenet_v3_large | MobileNet_V3_Large_QuantizedWeights | MobileNet_V3_Large_Weights | QAT |
| quantization.resnet18 | ResNet18_QuantizedWeights | ResNet18_Weights | PTQ |
| quantization.resnet50 | ResNet50_QuantizedWeights | ResNet50_Weights | PTQ |
| quantization.resnext101_32x8d | ResNeXt101_32X8D_QuantizedWeights | ResNeXt101_32X8D_Weights | PTQ |
| quantization.resnext101_64x4d | ResNeXt101_64X4D_QuantizedWeights | ResNeXt101_64X4D_Weights | PTQ |
| quantization.shufflenet_v2_x0_5 | ShuffleNet_V2_X0_5_QuantizedWeights | ShuffleNet_V2_X0_5_Weights | PTQ |
| quantization.shufflenet_v2_x1_0 | ShuffleNet_V2_X1_0_QuantizedWeights | ShuffleNet_V2_X1_0_Weights | PTQ |
| quantization.shufflenet_v2_x1_5 | ShuffleNet_V2_X1_5_QuantizedWeights | ShuffleNet_V2_X1_5_Weights | PTQ |
| quantization.shufflenet_v2_x2_0 | ShuffleNet_V2_X2_0_QuantizedWeights | ShuffleNet_V2_X2_0_Weights | PTQ |
PTQ = Post-Training Quantization (calibration-based). QAT = Quantization-Aware Training (simulated quantization during training). MobileNetV2 and MobileNetV3 Large use QAT weights, which generally yield better accuracy than PTQ at the same bit-width.
Loading a quantized model
Pass quantize=True together with a quantized weights object to get a ready-to-use INT8 model. The weights object also carries the matched preprocessing transforms.
Quantized models must be in eval mode for inference. Quantization operators do not support training mode.
The quantize parameter
Every builder in torchvision.models.quantization accepts a quantize keyword argument:
quantize=True: returns a fully quantized INT8 model ready for CPU inference. Weights are stored as 8-bit integers and linear/convolutional operations use integer arithmetic.
quantize=False (default): returns the float model built from a quantization-friendly architecture (fusable modules, FloatFunctional for addition/concatenation). This is useful when you want the float backbone before applying your own quantization pipeline.
End-to-end inference example
Backends: FBGEMM vs QNNPACK
TorchVision quantized models target one of two PyTorch quantization backends, depending on the model and its pretrained weights:
FBGEMM (x86)
Used by ResNet, ResNeXt, GoogLeNet, Inception V3, and ShuffleNetV2 weights. Optimised for x86 CPUs with AVX2/AVX-512 support. Best choice for server and desktop inference.
QNNPACK (ARM)
Used by MobileNetV2 and MobileNetV3 Large weights. Optimised for ARM CPUs. Best choice for mobile and embedded devices.
The backend is selected automatically when you load a model with quantize=True. You can also set it explicitly:
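A short sketch of setting the engine by hand through torch.backends.quantized. The choice of "fbgemm" as the preferred engine is an assumption for an x86 host. The check against supported_engines avoids a RuntimeError on builds that lack it.

```python
import torch

# Engines compiled into this PyTorch build.
supported = torch.backends.quantized.supported_engines
print("supported engines:", supported)

# Assumption: an x86 server, so FBGEMM is preferred. Use "qnnpack" on ARM.
preferred = "fbgemm"
if preferred in supported:
    torch.backends.quantized.engine = preferred

print("active engine:", torch.backends.quantized.engine)
```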
Post-Training Static Quantization (PTSQ)
If the pretrained quantized weights don’t suit your deployment target (e.g., you want a custom calibration dataset), you can run your own Post-Training Static Quantization using the quantization-friendly float models as a starting point.
Load the float model with quantization-aware architecture
Use quantize=False to get the QuantizableMobileNetV2 (or equivalent) with FloatFunctional modules substituted in place of plain tensor operations.
Fuse Conv + BN + ReLU layers
Fusing adjacent Conv → BatchNorm → ReLU into a single operation reduces memory traffic and is required for accurate static quantization.
Calibrate with representative data
Run a few hundred representative batches through the model. The observer modules inserted by prepare collect activation statistics.