torchvision.models provides a comprehensive collection of pretrained image classification architectures — from classic CNNs like AlexNet and VGG through residual networks, efficient mobile-friendly models, and modern vision transformers. Every model uses the new weights API: each builder function accepts a typed weights argument that bundles the pretrained parameters together with the preprocessing transforms that match how the model was originally trained. Passing weights=None constructs a randomly initialised model. All models default to num_classes=1000 for ImageNet and output raw logits of shape [batch_size, 1000].
from torchvision.models import (
    resnet50, ResNet50_Weights,
    efficientnet_b0, EfficientNet_B0_Weights,
    vit_b_16, ViT_B_16_Weights,
)

# ResNet-50 with latest (V2) ImageNet weights
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)

# EfficientNet-B0
model = efficientnet_b0(weights=EfficientNet_B0_Weights.IMAGENET1K_V1)

# Vision Transformer B/16 with SWAG end-to-end fine-tuned weights
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_SWAG_E2E_V1)

# Use DEFAULT to always get the best available weights
model = resnet50(weights=ResNet50_Weights.DEFAULT)
model.eval()

# Retrieve the matching preprocessing pipeline
preprocess = ResNet50_Weights.DEFAULT.transforms()
All classification models output logits (unnormalised scores) of shape [batch_size, num_classes] where num_classes=1000 by default for ImageNet. Apply torch.nn.functional.softmax to convert to probabilities, or torch.argmax to obtain the predicted class index.
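The conversion from logits to probabilities and class indices can be sketched as follows. Random values stand in for real model output (an assumption made so that the snippet runs without any pretrained weights), and the top-5 lookup is an extra illustration not mentioned above.

```python
import torch
import torch.nn.functional as F

# Stand-in for model output: 2 samples over 1000 classes (random, not real predictions).
torch.manual_seed(0)
logits = torch.randn(2, 1000)

probs = F.softmax(logits, dim=1)             # each row now sums to 1
pred = torch.argmax(logits, dim=1)           # predicted class index per sample
top5_prob, top5_idx = probs.topk(5, dim=1)   # five most likely classes per sample

print(probs.sum(dim=1))
print(pred.shape, top5_idx.shape)
```

Because softmax is monotonic, taking argmax over the logits gives the same class as taking it over the probabilities, so softmax can be skipped when only the predicted index is needed.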
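To fine-tune on a custom number of classes, load the pretrained weights and then replace the classification head. Alternatively, pass num_classes directly at construction time together with weights=None; this builds the whole model, backbone included, with random initialisation, because pretrained weights cannot be combined with a num_classes that differs from the one they were trained with.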
import torch.nn as nn
from torchvision.models import (
    resnet50, ResNet50_Weights,
    efficientnet_b0,
    vit_b_16, ViT_B_16_Weights,
    convnext_tiny,
)

num_classes = 10

# Option 1: load the pretrained backbone, then replace the head
model = resnet50(weights=ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, num_classes)  # ResNet uses model.fc

# Option 2: construct with custom num_classes (weights=None, so nothing is pretrained)
model = efficientnet_b0(weights=None, num_classes=num_classes)

# Vision Transformer head is nested under model.heads.head
model = vit_b_16(weights=ViT_B_16_Weights.DEFAULT)
model.heads.head = nn.Linear(model.heads.head.in_features, num_classes)

# ConvNeXt uses model.classifier[2]; Swin Transformer uses model.head
model = convnext_tiny(weights=None)
model.classifier[2] = nn.Linear(model.classifier[2].in_features, num_classes)
Residual networks introduced skip connections to enable very deep networks. TorchVision ships the original ResNets, aggregated residual transformations (ResNeXt), and Wide ResNets.
from torchvision.models import (
    resnet18, ResNet18_Weights,
    resnet34, ResNet34_Weights,
    resnet50, ResNet50_Weights,
    resnet101, ResNet101_Weights,
    resnet152, ResNet152_Weights,
    resnext50_32x4d, ResNeXt50_32X4D_Weights,
    resnext101_32x8d, ResNeXt101_32X8D_Weights,
    resnext101_64x4d, ResNeXt101_64X4D_Weights,
    wide_resnet50_2, Wide_ResNet50_2_Weights,
    wide_resnet101_2, Wide_ResNet101_2_Weights,
)

# V2 weights are higher accuracy and should be preferred
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)

# ResNeXt: grouped convolutions, competitive with ResNet-200
model = resnext101_32x8d(weights=ResNeXt101_32X8D_Weights.IMAGENET1K_V2)

# Wide ResNet: wider rather than deeper
model = wide_resnet50_2(weights=Wide_ResNet50_2_Weights.IMAGENET1K_V2)
The classification head is model.fc — an nn.Linear(in_features, 1000) layer.
The EfficientNet head is a two-element nn.Sequential: model.classifier[0] is nn.Dropout and model.classifier[1] is the nn.Linear(in_features, num_classes) layer to replace for fine-tuning.
Several large RegNet-Y models (regnet_y_16gf, regnet_y_32gf, regnet_y_128gf) provide IMAGENET1K_SWAG_E2E_V1 and IMAGENET1K_SWAG_LINEAR_V1 weights obtained from weakly supervised pre-training with SWAG, yielding significantly higher top-1 accuracy than the ImageNet-only weights.
Swin Transformers are hierarchical transformers with shifted-window self-attention. Swin V2 adds log-spaced continuous relative position biases for improved transfer to higher resolutions.
ConvNeXt is a pure convolutional architecture inspired by transformer design choices (large kernels, LayerNorm, GELU). It matches Swin Transformer accuracy with a standard ConvNet training recipe.
DenseNet uses dense connections: every layer receives the feature maps of all preceding layers in the same block, which maximises feature reuse and reduces the number of parameters.
The weights object carries the matching preprocessing transforms so the input pipeline is always consistent with training.
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.DEFAULT  # alias for the best available weights
model = resnet50(weights=weights)
model.eval()

# weights.transforms() returns a torchvision transform pipeline
preprocess = weights.transforms()

img = Image.open("cat.jpg").convert("RGB")
batch = preprocess(img).unsqueeze(0)  # [1, 3, 224, 224]

with torch.inference_mode():
    logits = model(batch)  # [1, 1000]

class_id = logits.argmax(dim=1).item()
class_name = weights.meta["categories"][class_id]
print(class_name)
Always call model.eval() before inference. Batch Normalization and Dropout behave differently in training mode, so a model left in training mode will produce incorrect or non-deterministic predictions.
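The effect of the mode switch can be seen on a small example. The toy network below is hypothetical and exists only to combine the two mode-sensitive layer types; it is not a torchvision model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model (illustration only) containing BatchNorm and Dropout.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.BatchNorm1d(16),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(16, 4),
)
x = torch.randn(4, 8)

model.train()
train_a, train_b = model(x), model(x)    # a fresh Dropout mask is drawn on each call

model.eval()
with torch.inference_mode():
    eval_a, eval_b = model(x), model(x)  # Dropout disabled, BatchNorm uses running stats

print("train outputs identical:", torch.equal(train_a, train_b))
print("eval outputs identical: ", torch.equal(eval_a, eval_b))
```

In evaluation mode, repeated calls on the same input give identical outputs, which is the behaviour inference relies on.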