The transforms v2 API was designed from the ground up to handle the full range of computer vision tasks, not just image classification. By wrapping your tensors in typed TVTensor subclasses, you give every transform the context it needs to augment images, bounding boxes, segmentation masks, keypoints, and videos correctly and consistently within a single pipeline call. This guide walks through the essential concepts for working with detection and segmentation data using torchvision.transforms.v2.
TVTensors
TVTensors are torch.Tensor subclasses that annotate data with its semantic type. Transforms in v2 inspect these types at runtime and dispatch the appropriate transformation logic: a RandomHorizontalFlip mirrors pixel data for images but mirrors coordinate values for bounding boxes, all from the same transform(img, boxes) call.
All TVTensor types are exported from torchvision.tv_tensors:
Image: shape [..., C, H, W]. Created from a tensor, PIL image, or ndarray. Values are not rescaled on construction.
BoundingBoxes: shape [N, 4] for axis-aligned formats; rotated formats carry 5 (XYWHR, CXCYWHR) or 8 (XYXYXYXY) values per box. Requires a format (axis-aligned: XYXY, XYWH, CXCYWH; rotated: XYWHR, CXCYWHR, XYXYXYXY) and canvas_size (H, W).
Mask: shape [..., H, W]. Used for both semantic segmentation masks and per-instance binary masks.
KeyPoints: shape [..., 2] (x, y per point). Requires canvas_size. Supports polygons, skeletons, and polylines.
Video: shape [T, C, H, W]. Spatial transforms are applied frame-by-frame with identical parameters.

Wrapping tensors as TVTensors
Construct TVTensors from any tensor-like data. BoundingBoxes additionally requires a coordinate format and the canvas_size of the corresponding image.
TVTensors are thin wrappers: they carry zero overhead beyond the attached metadata attributes. All standard torch.Tensor operations work on them unchanged.
Transforms on Detection Data
Once your tensors are wrapped, pass them together to any v2 transform. The same spatial parameters (crop coordinates, flip decision, rotation angle, etc.) are applied consistently across all inputs.
BoundingBoxFormat options
| Format | Description |
|---|---|
| XYXY | Top-left corner (x1, y1) and bottom-right corner (x2, y2). |
| XYWH | Top-left corner (x1, y1), width w, and height h. |
| CXCYWH | Center (cx, cy), width w, and height h. |
| XYWHR | Top-left (x1, y1), width, height, and rotation angle (degrees). Rotated box. |
| CXCYWHR | Center (cx, cy), width, height, and rotation angle. Rotated box. |
| XYXYXYXY | All four corners in order: top-left, top-right, bottom-right, bottom-left. Rotated box. |
Convert between formats with ConvertBoundingBoxFormat.
Wrapping Existing Datasets
For popular built-in TorchVision datasets, the wrap_dataset_for_transforms_v2 utility automatically converts raw dataset outputs into the appropriate TVTensor types.
wrap_dataset_for_transforms_v2 supports CocoDetection, VOCDetection, VOCSegmentation, CelebA, Kitti, OxfordIIITPet, Cityscapes, WIDERFace, and video classification datasets like Kinetics. Wrapping an image classification dataset (e.g., ImageNet) is a no-op, because those datasets already work with v2 out of the box.
The set_return_type Context Manager
By default, torch operations on TVTensors return plain torch.Tensor objects (the type annotation is stripped). Use set_return_type to preserve the TVTensor subclass through standard torch ops:
Migrating from V1 to V2
Update the import
Replace import torchvision.transforms as T with import torchvision.transforms.v2 as T. For most image classification pipelines, this is the only change required.
Replace ToTensor with ToImage + ToDtype
ToTensor() is deprecated in v2 because it silently rescales values from [0, 255] to [0.0, 1.0] and changes the dtype in a single opaque step. Use the explicit two-step replacement instead: ToImage() followed by ToDtype(torch.float32, scale=True).
Wrap multi-modal data
If your dataset returns bounding boxes or masks alongside images, wrap them in the appropriate TVTensor types (or use wrap_dataset_for_transforms_v2). Pass all modalities as separate arguments to the transform.
Composing with torch.nn.Sequential
For TorchScript-compatible pipelines, use torch.nn.Sequential instead of Compose. Only transforms that operate on torch.Tensor (no PIL, no lambda functions) are scriptable: