TorchVision is designed so that you can go from a raw image file to a meaningful prediction in fewer than fifteen lines of Python. This guide walks through two complete end-to-end examples, image classification with ResNet-50 and object detection with Faster R-CNN, using only real, stable APIs. By the end you will understand how TorchVision’s weights system, built-in preprocessing transforms, and I/O utilities all fit together.
Install torch and torchvision together to guarantee version compatibility. The command below installs the CPU build; replace the index URL with a CUDA variant for GPU support (see the Installation guide for details).

pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu

TorchVision v0.13 introduced a typed WeightsEnum API that replaces the old pretrained=True boolean flag. Every weight entry bundles the download URL, the correct preprocessing transforms, and metadata such as the accuracy on ImageNet and the list of class names, so everything you need lives in one object.

from torchvision.models import resnet50, ResNet50_Weights
# ResNet50_Weights.DEFAULT always resolves to the best available weights
weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights)
model.eval()
You must call model.eval() before running inference. Without it, layers such as BatchNorm and Dropout behave as if the model were still training, which produces incorrect and non-deterministic predictions.

from torchvision.models import ResNet50_Weights

# List every available weights entry for ResNet-50
for w in ResNet50_Weights:
    print(w, w.meta["_metrics"]["ImageNet-1K"]["acc@1"])
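Returning to the model.eval() note above, the following minimal sketch illustrates the effect on a tiny stand-in network. The network is hypothetical and is not a TorchVision model; it only contains the two layer types mentioned, so it runs instantly and needs no pretrained weights.

```python
import torch
from torch import nn

torch.manual_seed(0)

# A tiny stand-in network (hypothetical, not a TorchVision model)
# containing the two layer types mentioned above.
net = nn.Sequential(nn.Linear(8, 8), nn.BatchNorm1d(8), nn.Dropout(p=0.5))
x = torch.randn(4, 8)

# Training mode: each call samples a new dropout mask.
net.train()
train_a, train_b = net(x), net(x)

# Eval mode: dropout is disabled and BatchNorm uses its running statistics.
net.eval()
with torch.no_grad():
    eval_a, eval_b = net(x), net(x)

train_outputs_match = torch.equal(train_a, train_b)
eval_outputs_match = torch.equal(eval_a, eval_b)
print(train_outputs_match, eval_outputs_match)
```

In training mode the two outputs for the same input differ; in eval mode they are identical, which is the behavior you want for inference.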
The weights object exposes a transforms() factory that returns the inference preprocessing pipeline matching those weights, so there is no need to hard-code Normalize mean/std values or resize dimensions by hand.

from torchvision.io import read_image

# Read a JPEG or PNG from disk as a uint8 tensor [C, H, W]
# (pass mode=ImageReadMode.RGB to force three channels for grayscale or RGBA files)
img = read_image("path/to/image.jpg")
# Build the preprocessing pipeline from the weights metadata
preprocess = weights.transforms()
# Apply preprocessing and add the batch dimension → [1, C, H, W]
batch = preprocess(img).unsqueeze(0)
read_image returns a torch.Tensor of shape [C, H, W] with dtype=torch.uint8. The preprocess transform handles resizing, center-cropping, conversion to float32, and normalization automatically.

Pass the preprocessed batch through the model, apply softmax to get probabilities, and look up the predicted class name from the weights metadata.
import torch
# Forward pass — no gradients needed for inference
with torch.no_grad():
    prediction = model(batch).squeeze(0).softmax(0)
# Highest-probability class
class_id = prediction.argmax().item()
score = prediction[class_id].item()
# Class name is stored directly in the weights metadata
category_name = weights.meta["categories"][class_id]
print(f"{category_name}: {100 * score:.1f}%")
# e.g. "golden retriever: 96.5%"
The full weights metadata dictionary also contains useful fields like recipe (a link to the training configuration) and, nested under _metrics, acc@1 / acc@5 (top-1 and top-5 ImageNet accuracies):

print(weights.meta["_metrics"]["ImageNet-1K"]["acc@1"])  # e.g. 80.858
print(weights.meta["recipe"]) # URL to training recipe
The same pattern — load weights, derive transforms, run inference — applies to every model family in TorchVision. Here is a complete object detection example using Faster R-CNN with a ResNet-50 FPN backbone:
import torch
from torchvision.io import read_image
from torchvision.models.detection import fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights
# Load model with best available COCO weights
weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights)
model.eval()
# Detection preprocessing converts one image to a float tensor; the model takes a list of such tensors
preprocess = weights.transforms()
img = read_image("path/to/image.jpg")
batch = [preprocess(img)]
with torch.no_grad():
    predictions = model(batch)
# predictions is a list of dicts, one per image
boxes = predictions[0]["boxes"] # FloatTensor[N, 4] in [x1, y1, x2, y2] format
labels = predictions[0]["labels"] # Int64Tensor[N]
scores = predictions[0]["scores"] # FloatTensor[N]
# Filter to high-confidence detections
keep = scores > 0.8
print(f"Detected {keep.sum().item()} objects with score > 0.8")
# Map label IDs to COCO category names
categories = weights.meta["categories"]
for box, label, score in zip(boxes[keep], labels[keep], scores[keep]):
    print(f" {categories[label]}: {score:.2f} box={box.tolist()}")
Next Steps
Now that you have run your first predictions, explore the rest of TorchVision to go deeper:

Models Overview
Browse all available architectures and learn how to fine-tune pre-trained models on custom datasets.
Transforms Overview
Master the v2 transforms API and TVTensors for consistent augmentation across images, masks, and bounding boxes.