TorchVision transforms are composable building blocks for data augmentation and preprocessing. They live in torchvision.transforms.v2 and work on a variety of input types (raw tensors, PIL images, segmentation masks, bounding boxes, keypoints, and videos), all within a single, unified pipeline. Whether you are training an image classifier, an object detector, or a video model, transforms let you define your preprocessing once and apply it consistently across every modality in your sample.
V1 vs V2
TorchVision ships two generations of transforms. The legacy torchvision.transforms (v1) namespace operates only on single images. The newer torchvision.transforms.v2 namespace, released in TorchVision 0.15, extends every transform to handle all vision modalities simultaneously. New features and performance improvements are added exclusively to v2, so it is the recommended choice for all new projects.
transforms (v1)
- Image-only pipeline
- Slower PIL-based defaults
- No bounding box or mask support
- ToTensor() for dtype conversion
transforms.v2 (recommended)
- Images, boxes, masks, keypoints, videos
- Faster tensor-native backend
- Arbitrary input structures (dicts, lists)
- ToDtype() replaces ToTensor()
A Basic Pipeline
A typical image-classification pipeline built with the v2 API uses Compose to chain transforms sequentially, passing the same input through each step in order.
ToDtype(torch.float32, scale=True) converts the image from uint8 [0, 255] to float32 [0.0, 1.0] in one step, replacing the deprecated ToTensor() from v1.
TVTensors
TVTensors are typed torch.Tensor subclasses that carry the metadata transforms need to correctly augment non-image data. When a transform such as RandomHorizontalFlip receives a BoundingBoxes tensor, it knows to mirror the coordinates rather than the pixel values. All TVTensor types are importable from torchvision.tv_tensors.
Image
Shape: [..., C, H, W]. Interchangeable with a plain tensor inside transforms, but explicitly marks pixel data.
BoundingBoxes
Shape: [N, K], where K=4 for axis-aligned boxes (XYXY, XYWH, CXCYWH) and K=5 or K=8 for rotated boxes (XYWHR, CXCYWHR, XYXYXYXY). Requires format and canvas_size (H, W).
Mask
Shape: [..., H, W]. Used for segmentation masks and binary detection masks.
KeyPoints
Shape: [..., 2] (x, y per point). Requires canvas_size. Supports polygons, skeletons, and polylines.
Video
Shape: [T, C, H, W] or [..., T, C, H, W]. Spatial transforms are applied identically across all frames.
Wrapping tensors
You can wrap any existing tensor into a TVTensor at sample-load time by passing it to the matching constructor, for example tv_tensors.Image(tensor) or tv_tensors.Mask(tensor).
Multi-Modality in One Pass
One of the headline features of v2 is that a single transform(...) call handles all modalities of a sample simultaneously, keeping spatial augmentations such as crops, flips, and rotations consistent across them.
Class-Based vs. Functional API
TorchVision provides two interfaces for applying transforms. Class-based transforms (e.g., T.RandomHorizontalFlip(p=0.5)) are stateful objects that can be composed with Compose or torch.nn.Sequential, serialized, and used in production pipelines. Parameters are sampled at call time via make_params.
Functional transforms in torchvision.transforms.v2.functional are pure, stateless functions that give you direct, low-level control. They are useful when you want to apply the same pre-sampled parameters to multiple tensors, or when building custom transforms.
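Functionals such as F.crop or F.horizontal_flip accept plain tensors, PIL images, and TVTensors, but they never sample
random parameters for you. Only the low-level kernels (for example resize_image or resize_bounding_boxes) skip the TVTensor
dispatch logic and operate on the underlying tensor data directly. Use the class-based API when you need randomized augmentation with multi-modality consistency.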
Transform Categories
Geometry
Spatial transforms that alter image dimensions or layout:
Resize, RandomCrop, RandomResizedCrop, RandomHorizontalFlip, RandomRotation, RandomAffine, ElasticTransform, Pad, CenterCrop, FiveCrop, TenCrop.
Color / Photometric
Pixel-value transforms that do not change spatial layout:
ColorJitter, Grayscale, RandomGrayscale, RandomInvert, RandomPosterize, RandomSolarize, RandomAutocontrast, RandomEqualize, RandomAdjustSharpness, RandomPhotometricDistort.
Augmentation
Advanced learned and stochastic augmentation strategies:
AutoAugment, RandAugment, TrivialAugmentWide, AugMix, CutMix, MixUp, RandomErasing, JPEG.
Type Conversion
Transforms that change the tensor type or encoding without modifying spatial content:
ToDtype, ToImage, PILToTensor, ToPILImage, ToPureTensor.
Composition
Meta-transforms for combining others:
Compose, RandomApply, RandomChoice, RandomOrder.
Composing with torch.nn.Sequential
For TorchScript compatibility, replace Compose with torch.nn.Sequential. Only scriptable transforms (those that operate purely on torch.Tensor) are supported in this mode.