TorchVision is PyTorch’s official computer vision library, maintained by the PyTorch team and contributors since 2016. It bundles the most commonly needed building blocks for computer vision workflows, from loading raw images and videos to applying state-of-the-art pre-trained models, all deeply integrated with PyTorch’s tensor ecosystem and autograd engine. Whether you are training a custom model from scratch, fine-tuning a pre-trained backbone, or running inference in production, TorchVision provides the primitives to get the job done without reinventing the wheel.
What’s Inside TorchVision
TorchVision is organized into six top-level modules, each covering a distinct concern in a typical computer vision pipeline.
- torchvision.models: Pre-trained architectures for classification, detection, segmentation, optical flow, and video understanding. Weights are versioned via the WeightsEnum API introduced in v0.13.
- torchvision.transforms: Image and video augmentation pipelines. The modern v2 API operates on PIL images, tensors, and the new TVTensor types (masks, bounding boxes, keypoints) in a single, composable pass.
- torchvision.datasets: Ready-to-use torch.utils.data.Dataset implementations for dozens of popular benchmarks including ImageNet, COCO, VOC, CIFAR, and more.
- torchvision.ops: Computer-vision-specific operators such as nms, roi_align, deform_conv2d, and box_iou that are not present in core PyTorch.
- torchvision.io: Low-level image and video reading and writing backed by native C++ codecs. Supports JPEG, PNG, WebP, and video via PyAV (FFmpeg bindings).
- torchvision.utils: Visualization helpers such as make_grid, draw_bounding_boxes, draw_segmentation_masks, and draw_keypoints for inspecting model outputs.
Modern Transforms and TVTensors
TorchVision v0.15 introduced a redesigned torchvision.transforms.v2 API alongside a new type system called TVTensors. TVTensors are thin torch.Tensor subclasses (Image, BoundingBoxes, Mask, and Video) that carry semantic metadata, so a single transform pipeline can handle every input type correctly in one call. For example, a RandomHorizontalFlip applied to an (image, BoundingBoxes, Mask) tuple flips all three in a geometrically consistent way.
The v2 transforms API is the recommended path for all new projects. The legacy torchvision.transforms (v1) namespace remains available for backwards compatibility but will not receive new features.
Version Compatibility
TorchVision releases are tightly coupled to PyTorch releases. Always install matching versions using the official PyTorch installer at pytorch.org/get-started/locally. The table below shows recent stable pairings and their supported Python ranges.
| torch | torchvision | Python |
|---|---|---|
| main / nightly | main / nightly | >=3.10, <=3.14 |
| 2.12 | 0.27 | >=3.10, <=3.14 |
| 2.11 | 0.26 | >=3.10, <=3.14 |
| 2.10 | 0.25 | >=3.10, <=3.14 |
| 2.9 | 0.24 | >=3.10, <=3.14 |
| 2.8 | 0.23 | >=3.9, <=3.13 |
| 2.7 | 0.22 | >=3.9, <=3.13 |
| 2.6 | 0.21 | >=3.9, <=3.12 |
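One way to guard against a mismatched install is to compare the two version strings against the pairings in the table above. The sketch below is illustrative only: `PAIRINGS`, `minor_version`, and `is_matching_pair` are hypothetical names, not part of TorchVision, and the mapping simply copies the stable rows of the table.

```python
# Stable pairings copied from the table above (torch -> torchvision).
PAIRINGS = {
    "2.12": "0.27",
    "2.11": "0.26",
    "2.10": "0.25",
    "2.9": "0.24",
    "2.8": "0.23",
    "2.7": "0.22",
    "2.6": "0.21",
}


def minor_version(version: str) -> str:
    """Reduce a full version such as '2.9.1+cu128' to 'major.minor'."""
    major, minor = version.split("+")[0].split(".")[:2]
    return f"{major}.{minor}"


def is_matching_pair(torch_version: str, torchvision_version: str) -> bool:
    """Return True if the two versions form a pairing listed in PAIRINGS."""
    expected = PAIRINGS.get(minor_version(torch_version))
    return expected is not None and expected == minor_version(torchvision_version)


print(is_matching_pair("2.9.1+cu128", "0.24.1"))  # True
print(is_matching_pair("2.9.0", "0.23.0"))        # False
```

In a real environment the arguments would typically be `torch.__version__` and `torchvision.__version__`. Versions outside the table return False, so a failed check means "not a listed pairing" rather than "known to be broken".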
Image Backends
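TorchVision supports multiple image-loading backends. The backend is selected at runtime via torchvision.set_image_backend(), which accepts 'PIL' or 'accimage'; Pillow-SIMD replaces the Pillow package itself and therefore runs under the 'PIL' setting: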
- PIL / Pillow (default) — The reference backend. Supports the widest range of image formats and operations.
- Pillow-SIMD — A drop-in replacement for Pillow that uses SIMD CPU instructions for significantly faster decoding. Swap in without any code changes.
- accimage — Uses the Intel IPP library. Generally faster than Pillow for JPEG decoding but supports fewer operations and formats.
For most workflows the default PIL backend is sufficient. Consider Pillow-SIMD or accimage only when image loading is a measured bottleneck in your training pipeline.