TorchVision provides three families of temporal and multi-view datasets: video datasets for action recognition, optical flow datasets for dense motion estimation, and stereo matching datasets for disparity / depth estimation. All three families extend VisionDataset and follow the same transform convention; video datasets additionally ship with clip samplers compatible with distributed training.
Video Datasets
Kinetics (400 / 600 / 700)
The DeepMind Kinetics family of large-scale action-recognition benchmarks. The dataset treats each video as a collection of fixed-length clips; __len__ returns the total number of clips, not videos.
__getitem__ returns (video, audio, label):
| Return value | Shape / type | Description |
|---|---|---|
| video | Tensor[T, C, H, W] (TCHW) or Tensor[T, H, W, C] (THWC) | T frames as uint8 |
| audio | Tensor[K, L] | K audio channels, L sample points, float |
| label | int | Action class index |
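Because __len__ counts clips rather than videos, the dataset length depends on how each video is cut into windows. The sketch below estimates that count with a plain sliding window. It assumes a window length frames_per_clip and a stride step_between_clips (named after the Kinetics constructor arguments, which this page does not list). The helper count_clips and the frame counts are hypothetical, and the real dataset may also resample the frame rate before windowing.

```python
def count_clips(num_frames: int, frames_per_clip: int, step_between_clips: int = 1) -> int:
    """Number of full sliding-window clips that fit in one video."""
    if num_frames < frames_per_clip:
        return 0  # video too short to yield a single clip
    return (num_frames - frames_per_clip) // step_between_clips + 1


# Hypothetical frame counts for three videos.
frames_per_video = [40, 16, 10]
clips_per_video = [
    count_clips(n, frames_per_clip=16, step_between_clips=8) for n in frames_per_video
]
total_clips = sum(clips_per_video)  # what a clip-based __len__ would report
print(clips_per_video, total_clips)
```

A video shorter than one clip contributes nothing to the total, so the dataset length can be smaller than the number of video files.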
HMDB51
51-class human motion database with videos from movies and online sources. __getitem__ returns (video, audio, label), with the same layout as Kinetics.
Download the split annotation files separately from the HMDB51 dataset page. Pass the path to these files as annotation_path.
UCF101
101-class action recognition dataset collected from YouTube. __getitem__ returns (video, audio, label).
Annotation split files for UCF101 can be downloaded from the THUMOS Challenge page.
MovingMNIST
Synthetic dataset of bouncing MNIST digits; useful for video prediction research. __getitem__ returns a Tensor[T, C, H, W]: a sequence of grayscale frames with a single channel (C = 1).
Video Samplers
For distributed training, TorchVision provides clip-aware samplers in torchvision.datasets.samplers:
| Sampler | Description |
|---|---|
| RandomClipSampler(video_clips, max_clips_per_video) | Randomly samples up to max_clips_per_video clips from each video |
| UniformClipSampler(video_clips, num_clips_per_video) | Uniformly samples exactly num_clips_per_video clips per video |
| DistributedSampler(dataset, group_size=1) | Distributes groups of group_size consecutive clips across ranks; ensures temporally adjacent clips stay on the same GPU |
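To make the grouping behaviour of DistributedSampler concrete, here is a minimal pure-Python sketch of dealing whole groups of consecutive clip indices to ranks in round-robin order. The function shard_groups and its arguments num_replicas and rank are illustrative names, not the library's implementation, and the sketch leaves out shuffling and drops a trailing partial group for brevity.

```python
from typing import List


def shard_groups(num_clips: int, num_replicas: int, rank: int, group_size: int = 1) -> List[int]:
    """Clip indices assigned to `rank` when whole groups are dealt round-robin."""
    indices = list(range(num_clips))
    # Assumption for brevity: ignore a trailing partial group instead of padding it.
    usable = (num_clips // group_size) * group_size
    groups = [indices[i:i + group_size] for i in range(0, usable, group_size)]
    # Every num_replicas-th group goes to this rank, so each group stays intact.
    return [idx for group in groups[rank::num_replicas] for idx in group]


rank0 = shard_groups(num_clips=12, num_replicas=2, rank=0, group_size=3)
rank1 = shard_groups(num_clips=12, num_replicas=2, rank=1, group_size=3)
print(rank0, rank1)
```

With group_size=3, each rank receives runs of three consecutive indices, which is what keeps neighbouring clips on the same device.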
Optical Flow Datasets
All optical flow datasets extend the internal FlowDataset base class. __getitem__ returns (img1, img2, flow): a pair of consecutive frames and the ground-truth forward flow field. Datasets with a built-in validity mask return a 4-tuple (img1, img2, flow, valid_flow_mask).
Flow fields are numpy.ndarray objects of shape (2, H, W), with dx and dy as the two channels.
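Since some flow datasets return three items and others four, downstream code often normalizes every sample to one shape. The helper below is a hypothetical convenience wrapper (unpack_flow_sample is not a TorchVision function) that always yields four items and fills the mask with None when the dataset has no built-in one. Placeholder strings stand in for real images and arrays.

```python
from typing import Any, Optional, Tuple


def unpack_flow_sample(sample: Tuple[Any, ...]) -> Tuple[Any, Any, Optional[Any], Optional[Any]]:
    """Return (img1, img2, flow, valid_flow_mask), using None for a missing mask."""
    if len(sample) == 4:
        return sample[0], sample[1], sample[2], sample[3]
    if len(sample) == 3:
        img1, img2, flow = sample
        return img1, img2, flow, None
    raise ValueError(f"expected 3 or 4 items, got {len(sample)}")


# Placeholder values; a real sample would hold images and ndarrays.
three = unpack_flow_sample(("img1", "img2", "flow"))
four = unpack_flow_sample(("img1", "img2", "flow", "mask"))
print(three, four)
```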
Sintel
Rendered synthetic sequences from the Blender short film, in clean and final render passes. __getitem__ returns (img1, img2, flow), where flow is None when split="test". Flow shape: (2, H, W).
KittiFlow
KITTI 2015 optical flow benchmark derived from driving sequences with sparse LiDAR-validated ground truth. __getitem__ always returns a 4-tuple (img1, img2, flow, valid_flow_mask) because KittiFlow has a built-in validity mask. valid_flow_mask is a boolean ndarray of shape (H, W) indicating which pixels have valid flow. Both flow and valid_flow_mask are None when split="test".
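A common use of the validity mask is to restrict an error metric to pixels that have ground truth. The dependency-free sketch below computes the average end-point error over valid pixels, using nested lists in the same (2, H, W) and (H, W) layouts described above. The function masked_epe is hypothetical and not part of TorchVision; real code would more likely use NumPy or torch operations on the arrays.

```python
import math
from typing import List, Sequence


def masked_epe(
    pred: Sequence[List[List[float]]],
    target: Sequence[List[List[float]]],
    valid: List[List[bool]],
) -> float:
    """Mean end-point error over pixels where `valid` is True.

    pred and target are [dx, dy], each an H x W nested list; valid is H x W.
    """
    dx_p, dy_p = pred
    dx_t, dy_t = target
    total, count = 0.0, 0
    for y, row in enumerate(valid):
        for x, ok in enumerate(row):
            if ok:
                total += math.hypot(dx_p[y][x] - dx_t[y][x], dy_p[y][x] - dy_t[y][x])
                count += 1
    return total / count if count else float("nan")


# One row, two pixels; only the first pixel has valid ground truth.
pred = [[[3.0, 9.0]], [[4.0, 9.0]]]
target = [[[0.0, 0.0]], [[0.0, 0.0]]]
valid = [[True, False]]
epe = masked_epe(pred, target, valid)
print(epe)
```

The second pixel has a large error but is ignored because its mask entry is False.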
FlyingChairs
Large synthetic dataset of 2D chair images composited over background images.
You must also download FlyingChairs_train_val.txt from the dataset page and place it under root/FlyingChairs/.
__getitem__ returns (img1, img2, flow). Flow shape: (2, H, W).
FlyingThings3D
Synthetic 3D scenes with randomly flying everyday objects; provides clean and final render passes, and supports left/right cameras. __getitem__ returns (img1, img2, flow). Flow shape: (2, H, W).
HD1K
High-Definition 1K: driving sequences with dense flow annotation. __getitem__ returns (img1, img2, flow, valid_flow_mask) (built-in validity mask). Flow shape: (2, H, W).
Optical Flow Dataset Summary
| Class | Split support | Validity mask | Flow format |
|---|---|---|---|
| Sintel | train / test, pass clean/final/both | No (can be generated by transforms) | ndarray (2, H, W) |
| KittiFlow | train / test | ✅ Built-in | ndarray (2, H, W) |
| FlyingChairs | train / val | No | ndarray (2, H, W) |
| FlyingThings3D | train / test, pass + camera | No | ndarray (2, H, W) |
| HD1K | train / test | ✅ Built-in | ndarray (2, H, W) |
Stereo Matching Datasets
All stereo datasets extend StereoMatchingDataset. __getitem__ returns (img_left, img_right, disparity) or a 4-tuple (img_left, img_right, disparity, valid_mask) when a built-in mask is available. Images are PIL Images; disparity is an ndarray of shape (1, H, W) (left disparity only), or None for test splits without annotations.
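For readers who need depth rather than disparity, the usual conversion for a rectified stereo pair is depth = focal length × baseline / disparity. The sketch below applies it to a single pixel. The calibration values focal_px and baseline_m are hypothetical inputs that the datasets described here do not return, and disparity_to_depth is an illustrative helper, not a TorchVision function.

```python
from typing import Optional


def disparity_to_depth(disparity_px: float, focal_px: float, baseline_m: float) -> Optional[float]:
    """Depth in metres for one pixel, or None when the disparity is not positive."""
    if disparity_px <= 0:
        return None  # no valid match, or a point at infinity
    return focal_px * baseline_m / disparity_px


# Hypothetical calibration: 1000 px focal length, 0.5 m baseline.
depth = disparity_to_depth(disparity_px=50.0, focal_px=1000.0, baseline_m=0.5)
print(depth)
```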
Kitti2012Stereo
KITTI 2012 stereo benchmark from driving sequences.
Kitti2015Stereo
KITTI 2015 stereo benchmark with denser LiDAR-derived ground truth.
CarlaStereo
Carla simulator high-resolution training data, linked from the CREStereo project.
Middlebury2014Stereo
Indoor stereo scenes with photorealistic lighting variation.
CREStereo
Synthetic stereo pairs across four object domains (ShapeNet, reflective objects, trees, holes).
FallingThingsStereo
Synthetic objects dropped onto various backgrounds.
SceneFlowStereo
Covers three variants of the synthetic SceneFlow benchmark.
SintelStereo
Stereo variant of the Sintel synthetic benchmark.
InStereo2k
Real-world indoor stereo dataset with 2 000 scene pairs.
ETH3DStereo
High-resolution indoor and outdoor stereo pairs from the ETH3D benchmark.
Stereo Matching Dataset Summary
| Class | Split support | Built-in mask | download=True |
|---|---|---|---|
| Kitti2012Stereo | train / test | ✅ Built-in | ❌ Manual |
| Kitti2015Stereo | train / test | ✅ Built-in | ❌ Manual |
| CarlaStereo | training only | No | ❌ Manual |
| Middlebury2014Stereo | train / additional / test | ✅ Built-in | ✅ |
| CREStereo | training only | ✅ Built-in | ❌ Manual |
| FallingThingsStereo | single / mixed / both variants | No | ❌ Manual |
| SceneFlowStereo | training only | No | ❌ Manual |
| SintelStereo | training only | ✅ Built-in | ❌ Manual |
| InStereo2k | train / test | No | ❌ Manual |
| ETH3DStereo | train / test | ✅ Built-in | ❌ Manual |