TorchVision includes datasets for object detection, instance segmentation, semantic segmentation, image captioning, and face detection. Each dataset returns an (image, target) tuple where target carries the task-specific annotation structure (bounding boxes, segmentation masks, captions, or keypoints) exactly as produced by the original dataset authors.
COCO
CocoDetection
The MS COCO Detection / Segmentation benchmark. Requires pycocotools (pip install pycocotools).
__getitem__ returns (PIL.Image, list[dict]) where each dict is a raw COCO annotation record:
| Key | Type | Description |
|---|---|---|
| id | int | Unique annotation ID |
| image_id | int | Corresponding image ID |
| category_id | int | COCO category ID |
| segmentation | list or dict | Polygon list, or an RLE dict for crowd regions |
| bbox | list[float] | [x, y, width, height] in pixels, with (x, y) the top-left corner |
| area | float | Area of the segmented region in pixels |
| iscrowd | int | 0 = individual instance, 1 = crowd region |
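Many detection pipelines expect corner-format boxes rather than COCO's [x, y, width, height]. The following is a minimal sketch of that conversion on raw annotation records; the helper name `coco_to_xyxy` and the sample records are hypothetical, and skipping crowd regions is an assumption about what the downstream model wants, not something CocoDetection does for you.

```python
def coco_to_xyxy(annotations, skip_crowd=True):
    """Convert raw COCO records to ([x1, y1, x2, y2] boxes, category IDs)."""
    boxes, labels = [], []
    for ann in annotations:
        # Crowd regions are often excluded from training targets (assumption).
        if skip_crowd and ann.get("iscrowd", 0) == 1:
            continue
        x, y, w, h = ann["bbox"]
        # Drop degenerate boxes with no area.
        if w <= 0 or h <= 0:
            continue
        boxes.append([x, y, x + w, y + h])
        labels.append(ann["category_id"])
    return boxes, labels


# Hypothetical records shaped like the table above.
sample = [
    {"id": 1, "image_id": 7, "category_id": 18,
     "bbox": [10.0, 20.0, 30.0, 40.0], "area": 900.0, "iscrowd": 0},
    {"id": 2, "image_id": 7, "category_id": 1,
     "bbox": [0.0, 0.0, 50.0, 50.0], "area": 2500.0, "iscrowd": 1},
]
boxes, labels = coco_to_xyxy(sample)
print(boxes, labels)
```

Only the first record survives here, because the second is flagged as a crowd region.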
CocoCaptions
Image captioning split of MS COCO. Shares the same constructor as CocoDetection.
__getitem__ returns (PIL.Image, list[str]) — a PIL image and a list of caption strings for that image.
Pascal VOC
Both VOC classes share the same base constructor. Supports dataset years 2007 through 2012.
VOCDetection
__getitem__ returns (PIL.Image, dict) where the dict is a parsed XML annotation tree. The top-level key is "annotation", containing:
- "folder"
- "filename"
- "size" (width, height, depth)
- "object": a list of dicts, each with "name", "pose", "truncated", "difficult", and "bndbox" (xmin, ymin, xmax, ymax)
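The nested dict described above can be flattened into per-object boxes with a few lines. This is a sketch only: the helper name `voc_boxes` and the sample target are hypothetical, and it assumes the leaf values arrive as strings (as parsed XML text usually does), which is why each coordinate is converted explicitly.

```python
def voc_boxes(target):
    """Extract class names and [xmin, ymin, xmax, ymax] boxes from a VOC target."""
    objects = target["annotation"].get("object", [])
    result = []
    for obj in objects:
        bndbox = obj["bndbox"]
        # Values are assumed to be strings; go through float() to tolerate "48.0".
        box = [int(float(bndbox[key])) for key in ("xmin", "ymin", "xmax", "ymax")]
        result.append({
            "name": obj["name"],
            "box": box,
            "difficult": str(obj.get("difficult", "0")) == "1",
        })
    return result


# Hypothetical target shaped like the structure described above.
sample_target = {
    "annotation": {
        "folder": "VOC2012",
        "filename": "example.jpg",
        "size": {"width": "500", "height": "375", "depth": "3"},
        "object": [
            {"name": "dog", "pose": "Left", "truncated": "0", "difficult": "0",
             "bndbox": {"xmin": "48", "ymin": "240", "xmax": "195", "ymax": "371"}},
        ],
    }
}
parsed = voc_boxes(sample_target)
print(parsed)
```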
VOCSegmentation
__getitem__ returns (PIL.Image, PIL.Image) — the input image and its palette-mode segmentation mask (one pixel value per semantic class, with 255 for the void/boundary label).
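Because 255 marks void or boundary pixels rather than a class, it usually needs to be excluded from statistics and losses. A minimal sketch of per-class pixel counting that skips the void label is shown below; the helper name `class_pixel_counts` is hypothetical, and the input is assumed to be a flat sequence of pixel values (for instance, obtained by converting the mask to an array and flattening it).

```python
from collections import Counter

VOID_LABEL = 255  # void/boundary value described above


def class_pixel_counts(pixels, void_label=VOID_LABEL):
    """Count pixels per class index, ignoring the void label."""
    counts = Counter(pixels)
    counts.pop(void_label, None)  # drop void pixels if present
    return dict(counts)


# Hypothetical flattened mask: background (0), class 15, and void pixels.
flat_mask = [0, 0, 15, 15, 15, 255, 255, 0]
counts = class_pixel_counts(flat_mask)
print(counts)
```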
Cityscapes
Urban street-scene dataset with 19 semantic classes, available in fine (gtFine) and coarse (gtCoarse) annotation quality. Cityscapes is not automatically downloadable; register at cityscapes-dataset.com to obtain the archives.
target_type accepts a single string or a list of types; when a list with more than one type is passed, __getitem__ returns a tuple of targets in the same order.
__getitem__ returns (PIL.Image, target) where target depends on target_type:
| target_type | Return type | Description |
|---|---|---|
| "semantic" | PIL.Image | Per-pixel semantic label ID (the labelIds image, not train IDs) |
| "instance" | PIL.Image | Per-pixel instance ID |
| "color" | PIL.Image | RGB-coloured semantic label image |
| "polygon" | dict | Raw polygon annotation JSON dict |
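Since the shape of target depends on how target_type was specified, downstream code can be simplified by normalising it into a dict keyed by type. The sketch below uses a hypothetical helper, `targets_by_type`, with strings standing in for the real images. It assumes that a single requested type (whether given as a string or as a one-element list) yields a bare target rather than a sequence.

```python
def targets_by_type(target_type, target):
    """Map each requested Cityscapes target type to its corresponding target."""
    types = [target_type] if isinstance(target_type, str) else list(target_type)
    if len(types) == 1:
        # Assumption: a single type yields the bare target, not a sequence.
        return {types[0]: target}
    if len(types) != len(target):
        raise ValueError("number of targets does not match target_type")
    # Targets are returned in the same order as the requested types.
    return dict(zip(types, target))


# Placeholder values stand in for PIL images and the polygon dict.
single = targets_by_type("semantic", "semantic_mask")
multi = targets_by_type(["instance", "polygon"], ("instance_mask", {"objects": []}))
print(single)
print(multi)
```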
SBDataset
The Semantic Boundaries Dataset (SBD) provides additional segmentation and boundary annotations for PASCAL VOC images.
The SBD train/val splits differ from the official PASCAL VOC splits: some SBD train images are part of the VOC 2012 val set. The train_noval split is intended to exclude those images.
Requires scipy.
__getitem__ returns:
- In "boundaries" mode: (PIL.Image, ndarray[C, H, W]), with one boundary map per class
- In "segmentation" mode: (PIL.Image, PIL.Image), the input image and a segmentation mask
WIDERFace
Face detection dataset with images covering 61 event categories and varying difficulty levels.
Requires gdown (pip install gdown) for automatic download.
__getitem__ returns (PIL.Image, dict | None). For "train" and "val" splits the dict contains:
| Key | Type | Description |
|---|---|---|
| "bbox" | Tensor[N, 4] | Bounding boxes in [x, y, w, h] format |
| "blur" | Tensor[N] | Blur level (0–2) |
| "expression" | Tensor[N] | Expression label |
| "illumination" | Tensor[N] | Illumination label |
| "occlusion" | Tensor[N] | Occlusion level (0–2) |
| "pose" | Tensor[N] | Pose label |
| "invalid" | Tensor[N] | Invalid flag |
target is None for the "test" split (no annotations provided).
Kitti
KITTI Vision Benchmark Suite for 2D object detection and depth estimation. Images come with per-instance 3D bounding box annotations in .txt files.
__getitem__ returns (PIL.Image, list[dict]). Each dict represents one annotated object with keys including "type", "truncated", "occluded", "alpha", "bbox" (2D box), "dimensions", "location", "rotation_y".
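To make the mapping from label file to dict concrete, here is a sketch of how one line of a KITTI label .txt file could be parsed into the keys listed above. The function name `parse_kitti_line` and the sample line are illustrative, and the field order follows the KITTI object label convention (type, truncated, occluded, alpha, four bbox values, three dimensions, three location values, rotation_y). It is not presented as TorchVision's own parser.

```python
def parse_kitti_line(line):
    """Parse one whitespace-separated KITTI label line into a dict."""
    fields = line.split()
    if len(fields) < 15:
        raise ValueError(f"expected at least 15 fields, got {len(fields)}")
    return {
        "type": fields[0],
        "truncated": float(fields[1]),
        "occluded": int(fields[2]),
        "alpha": float(fields[3]),
        # 2D box in pixels: left, top, right, bottom
        "bbox": [float(v) for v in fields[4:8]],
        # 3D size in metres: height, width, length
        "dimensions": [float(v) for v in fields[8:11]],
        # 3D position in camera coordinates: x, y, z
        "location": [float(v) for v in fields[11:14]],
        "rotation_y": float(fields[14]),
    }


# Illustrative label line (values are made up for the example).
sample_line = ("Car 0.00 0 -1.58 587.01 173.33 614.12 200.12 "
               "1.65 1.67 3.64 -0.65 1.71 46.70 -1.59")
obj = parse_kitti_line(sample_line)
print(obj["type"], obj["bbox"])
```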
LFW (Labeled Faces in the Wild)
Face recognition dataset with two task variants. Note that automatic download is no longer supported; download the dataset manually from vis-www.cs.umass.edu/lfw.
LFWPeople
Face identification: each sample is a face image with a person identity label.
__getitem__ returns (PIL.Image, int): the face image and the person ID.
LFWPairs
Face verification: each sample is a pair of face images with a binary same/different label.
__getitem__ returns (PIL.Image, PIL.Image, int): two face images and a binary label (1 = same person, 0 = different).
Dataset Summary
| Class | Task | Splits | download=True | __getitem__ target type |
|---|---|---|---|---|
| CocoDetection | Object detection / instance segmentation | train / val / test | ❌ Manual | list[dict] (COCO annotations) |
| CocoCaptions | Image captioning | train / val | ❌ Manual | list[str] |
| VOCDetection | Object detection | train / trainval / val / test (2007) | ✅ | dict (XML tree) |
| VOCSegmentation | Semantic segmentation | train / trainval / val / test (2007) | ✅ | PIL.Image mask |
| Cityscapes | Semantic / instance segmentation | train / val / test / train_extra | ❌ Manual | depends on target_type |
| SBDataset | Boundaries / segmentation | train / val / train_noval | ✅ | ndarray or PIL.Image |
| WIDERFace | Face detection | train / val / test | ✅ (needs gdown) | dict of tensors |
| Kitti | 3D object detection | train / test (via train bool) | ✅ | list[dict] |
| LFWPeople | Face identification | 10fold / train / test | ❌ Manual | int |
| LFWPairs | Face verification | 10fold / train / test | ❌ Manual | int (binary label) |
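Several of the target types above (list[dict], dict, list[str]) vary in length from sample to sample, so they generally cannot be stacked into a single batch tensor. One common pattern is a custom collate function that keeps images and targets as parallel lists. The sketch below uses a hypothetical name, `detection_collate`, and placeholder strings instead of real images; it applies to datasets that return (image, target) pairs.

```python
def detection_collate(batch):
    """Group (image, target) samples into parallel lists without stacking."""
    images, targets = zip(*batch)
    return list(images), list(targets)


# Placeholder batch: two samples with a different number of annotations each.
fake_batch = [
    ("image_0", [{"bbox": [0, 0, 10, 10]}]),
    ("image_1", [{"bbox": [5, 5, 20, 20]}, {"bbox": [1, 2, 3, 4]}]),
]
images, targets = detection_collate(fake_batch)
print(len(images), [len(t) for t in targets])
```

With PyTorch installed, a function like this would typically be passed to `torch.utils.data.DataLoader` through its `collate_fn` argument.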