TorchVision provides three families of pretrained semantic segmentation models (DeepLabV3, FCN, and LRASPP) that assign a class label to every pixel in an image. Unlike instance segmentation (which separates individual objects), semantic segmentation produces a single flat label map. All pretrained weights were trained on a subset of COCO 2017 restricted to the 20 Pascal VOC object categories plus a background class (21 classes in total), which makes them immediately useful for scene understanding tasks involving common objects such as people, vehicles, animals, and furniture.
Semantic segmentation models take a single batched tensor of shape [B, 3, H, W] as input (unlike detection models, which take a list of image tensors). The weights.transforms() preprocessor handles resizing (to 520 pixels on the shorter side with the default presets) and ImageNet normalization automatically.
PASCAL VOC Class Categories
All pretrained segmentation weights use the following 21-class vocabulary (index 0 is background):

| Index | Class | Index | Class | Index | Class |
|---|---|---|---|---|---|
| 0 | __background__ | 7 | car | 14 | motorbike |
| 1 | aeroplane | 8 | cat | 15 | person |
| 2 | bicycle | 9 | chair | 16 | pottedplant |
| 3 | bird | 10 | cow | 17 | sheep |
| 4 | boat | 11 | diningtable | 18 | sofa |
| 5 | bottle | 12 | dog | 19 | train |
| 6 | bus | 13 | horse | 20 | tvmonitor |
Input / Output Contract
DeepLabV3
DeepLabV3 uses Atrous Spatial Pyramid Pooling (ASPP) to capture multi-scale context without losing resolution. Dilated (atrous) convolutions with rates {12, 24, 36} are applied in parallel, allowing the network to aggregate information across large receptive fields while maintaining the spatial output stride.
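The parallel-dilation idea can be sketched in a few lines of PyTorch. This is a simplified illustration, not TorchVision's ASPP module: the class name TinyASPP is hypothetical, and the 1×1 branch, image-level pooling branch, batch normalization, and dropout found in full implementations are omitted.

```python
import torch
from torch import nn


class TinyASPP(nn.Module):
    """Hypothetical, stripped-down ASPP: parallel dilated 3x3 convs, then fuse."""

    def __init__(self, in_channels: int, out_channels: int, rates=(12, 24, 36)):
        super().__init__()
        # padding == dilation keeps the spatial size unchanged for a 3x3 kernel.
        self.branches = nn.ModuleList(
            [
                nn.Conv2d(in_channels, out_channels, kernel_size=3,
                          padding=rate, dilation=rate, bias=False)
                for rate in rates
            ]
        )
        # A 1x1 conv fuses the concatenated branch outputs.
        self.project = nn.Conv2d(len(rates) * out_channels, out_channels, kernel_size=1)

    def forward(self, x):
        branch_outputs = [branch(x) for branch in self.branches]
        return self.project(torch.cat(branch_outputs, dim=1))


features = torch.rand(1, 64, 33, 33)  # stand-in for a backbone feature map
aspp = TinyASPP(in_channels=64, out_channels=16)
fused = aspp(features)
print(fused.shape)
```

Because each branch pads by its own dilation rate, every branch sees a different receptive field while producing a feature map of the same height and width as its input.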
deeplabv3_resnet50
ResNet-50 backbone. Fastest ResNet variant.
mIoU: 66.4 | 42.0M params | 178.7 GFLOPs
deeplabv3_resnet101
ResNet-101 backbone. Higher accuracy.
mIoU: 67.4 | 61.0M params | 258.7 GFLOPs
deeplabv3_mobilenet_v3_large
MobileNetV3-Large. Mobile-friendly.
mIoU: 60.3 | 11.0M params | 10.5 GFLOPs
The DEFAULT alias for all three DeepLabV3 weight enums is COCO_WITH_VOC_LABELS_V1. These weights were trained on COCO images filtered to the 20 Pascal VOC object categories (plus background), giving 21 output classes in total.
FCN (Fully Convolutional Network)
FCN was one of the first end-to-end deep networks for dense prediction. It replaces the fully-connected classification head with convolutional layers, and the original design uses skip connections from earlier pooling layers to recover spatial detail. TorchVision ships two backbone variants, both built on a dilated ResNet feature extractor.
fcn_resnet50
ResNet-50 backbone.
mIoU: 60.5 · pixel acc: 91.4% | 35.3M params | 152.7 GFLOPs
fcn_resnet101
ResNet-101 backbone. Higher accuracy.
mIoU: 63.7 · pixel acc: 91.9% | 54.3M params | 232.7 GFLOPs
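The core FCN idea, a classifier made only of convolutions so that it works on any input size, can be sketched as follows. This is an illustrative toy, not TorchVision's FCN head: TinyFCNHead is a hypothetical name, and skip connections and intermediate layers are left out.

```python
import torch
import torch.nn.functional as F
from torch import nn


class TinyFCNHead(nn.Module):
    """Hypothetical minimal head: 1x1 conv classifier plus bilinear upsampling."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        # A 1x1 convolution plays the role of a per-location linear classifier.
        self.classifier = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, features, output_size):
        logits = self.classifier(features)
        # Upsample coarse logits back to the requested image resolution.
        return F.interpolate(logits, size=output_size, mode="bilinear", align_corners=False)


head = TinyFCNHead(in_channels=32, num_classes=21)

# The same head handles feature maps of different spatial sizes.
square = head(torch.rand(1, 32, 8, 8), output_size=(64, 64))
wide = head(torch.rand(1, 32, 8, 12), output_size=(64, 96))
print(square.shape, wide.shape)
```

Since no layer depends on a fixed spatial size, the head produces one logit vector per pixel for whatever resolution it is given.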
LRASPP (Lite R-ASPP)
LRASPP (Lite Reduced Atrous Spatial Pyramid Pooling) is a mobile-first segmentation head introduced in the MobileNetV3 paper. It simplifies the ASPP module to a single average-pooling branch with a large pooling kernel plus lightweight 1×1 convolutions, trading a few mIoU points for a dramatic reduction in parameters and FLOPs.

| Weight | mIoU | Pixel Acc | Params | GFLOPs | File size |
|---|---|---|---|---|---|
| COCO_WITH_VOC_LABELS_V1 (DEFAULT) | 57.9 | 91.2% | 3.2M | 2.1 | 12.5 MB |
Auxiliary Loss During Training
DeepLabV3 and FCN both support an auxiliary classification head attached to an intermediate layer (layer3 of ResNet). When aux_loss=True, the output["aux"] key is populated during the forward pass.
When loading pretrained weights (weights=DeepLabV3_ResNet50_Weights.DEFAULT), aux_loss is automatically set to True because the pretrained checkpoint includes the auxiliary head parameters.
Complete Inference Example
Model Comparison
| Model | mIoU | Pixel Acc | Params | GFLOPs | File size |
|---|---|---|---|---|---|
| deeplabv3_resnet101 | 67.4 | 92.4% | 61.0M | 258.7 | 233 MB |
| deeplabv3_resnet50 | 66.4 | 92.4% | 42.0M | 178.7 | 161 MB |
| fcn_resnet101 | 63.7 | 91.9% | 54.3M | 232.7 | 208 MB |
| fcn_resnet50 | 60.5 | 91.4% | 35.3M | 152.7 | 135 MB |
| deeplabv3_mobilenet_v3_large | 60.3 | 91.2% | 11.0M | 10.5 | 42 MB |
| lraspp_mobilenet_v3_large | 57.9 | 91.2% | 3.2M | 2.1 | 12.5 MB |