TorchVision bundles a suite of pre-trained object detection models spanning a wide range of accuracy–speed trade-offs. Every model follows the same contract: pass a list of [C, H, W] float tensors in the 0–1 range and get back a list of prediction dictionaries. All weights were trained on COCO 2017, which has 80 object classes. Because labels keep the original COCO category ids (1–90, with gaps) and index 0 is reserved for background, the models expose 91 label indices. Each set of weights ships with a matching transforms() preprocessor, so there is nothing to configure manually.
All detection models expect a Python list of tensors, not a single batched tensor. Each tensor can have a different spatial size — the model’s internal GeneralizedRCNNTransform handles resizing and normalization automatically.
In training mode (model.train()), the model takes a second argument: a list of target dictionaries, one per image. It then returns a Dict[str, Tensor] of losses rather than predictions.
    images = [preprocess(img)]
    targets = [{
        "boxes": torch.tensor([[100., 50., 300., 250.]], dtype=torch.float32),
        "labels": torch.tensor([1], dtype=torch.int64),
    }]

    model.train()
    loss_dict = model(images, targets)
    # Keys: 'loss_classifier', 'loss_box_reg', 'loss_objectness', 'loss_rpn_box_reg'
    total_loss = sum(loss for loss in loss_dict.values())
    total_loss.backward()
Boxes must be in [x1, y1, x2, y2] (XYXY) absolute pixel format with 0 ≤ x1 < x2 ≤ W and 0 ≤ y1 < y2 ≤ H. Labels must be torch.int64.
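These constraints are easy to violate when converting annotations, so it can help to check targets before training. The helper below is a hypothetical sketch (check_target is not part of torchvision) that enforces the rules stated above using plain tensor operations.

```python
import torch

def check_target(target, height, width):
    """Raise ValueError if a detection target breaks the box/label rules."""
    boxes, labels = target["boxes"], target["labels"]
    if boxes.ndim != 2 or boxes.shape[1] != 4:
        raise ValueError("boxes must have shape [N, 4]")
    if labels.dtype != torch.int64:
        raise ValueError("labels must be torch.int64")
    if labels.shape[0] != boxes.shape[0]:
        raise ValueError("need exactly one label per box")
    x1, y1, x2, y2 = boxes.unbind(dim=1)
    inside = (x1 >= 0) & (y1 >= 0) & (x2 <= width) & (y2 <= height)
    positive_area = (x1 < x2) & (y1 < y2)
    if not bool((inside & positive_area).all()):
        raise ValueError("boxes must satisfy 0 <= x1 < x2 <= W and 0 <= y1 < y2 <= H")

good = {"boxes": torch.tensor([[100., 50., 300., 250.]]), "labels": torch.tensor([1])}
check_target(good, height=480, width=640)  # passes silently
```

Annotations stored as [x, y, width, height] can be converted first with torchvision.ops.box_convert(boxes, in_fmt="xywh", out_fmt="xyxy").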
Faster R-CNN is a two-stage detector that uses a Region Proposal Network (RPN) to generate candidate bounding boxes, then refines them through a second classification and regression head. The FPN backbone extracts multi-scale features, making it strong on both large and small objects.
fasterrcnn_resnet50_fpn
ResNet-50 + FPN backbone. The canonical baseline — good accuracy with reasonable throughput.
COCO box mAP: 37.0 | 41.8M params | 134.4 GFLOPs
fasterrcnn_resnet50_fpn_v2
Improved training recipe with deeper RPN and box heads.
COCO box mAP: 46.7 | 43.7M params | 280.4 GFLOPs
fasterrcnn_mobilenet_v3_large_320_fpn
MobileNetV3-Large + FPN backbone with a low-resolution 320×320 input for maximum speed on edge devices.
COCO box mAP: 22.8 | 19.4M params | 0.72 GFLOPs
    from torchvision.models.detection import (
        fasterrcnn_resnet50_fpn,
        FasterRCNN_ResNet50_FPN_Weights,
        fasterrcnn_resnet50_fpn_v2,
        FasterRCNN_ResNet50_FPN_V2_Weights,
        fasterrcnn_mobilenet_v3_large_fpn,
        FasterRCNN_MobileNet_V3_Large_FPN_Weights,
        fasterrcnn_mobilenet_v3_large_320_fpn,
        FasterRCNN_MobileNet_V3_Large_320_FPN_Weights,
    )

    # V1: faithful to the original paper
    model_v1 = fasterrcnn_resnet50_fpn(
        weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT
    )

    # V2: enhanced recipe, higher accuracy
    model_v2 = fasterrcnn_resnet50_fpn_v2(
        weights=FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT
    )

    # MobileNet: high-resolution mobile variant
    model_mob = fasterrcnn_mobilenet_v3_large_fpn(
        weights=FasterRCNN_MobileNet_V3_Large_FPN_Weights.DEFAULT
    )

    # MobileNet 320: ultra-fast mobile variant (320 × 320)
    model_320 = fasterrcnn_mobilenet_v3_large_320_fpn(
        weights=FasterRCNN_MobileNet_V3_Large_320_FPN_Weights.DEFAULT
    )
fasterrcnn_resnet50_fpn_v2 is the recommended default for most production use cases: its improved training recipe (deeper convolutional RPN and box heads with BatchNorm) raises COCO box mAP from 37.0 to 46.7 over V1 at roughly 2× the compute (280.4 vs 134.4 GFLOPs).
Mask R-CNN extends Faster R-CNN with a parallel instance segmentation head that predicts a binary pixel mask for each detected object. The training target dictionary requires an additional masks key.
Keypoint R-CNN adds a keypoint prediction head on top of Faster R-CNN. The pretrained weights detect 17 COCO person keypoints (nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles).
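Each keypoint is an (x, y, visibility) triple. The visibility flag is 0 = not labeled, 1 = labeled but occluded, 2 = labeled and visible. Training targets also require a keypoints field of shape [N, K, 3], where N is the number of instances and K is the number of keypoints per instance.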
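To make the [N, K, 3] layout concrete, the sketch below builds a training target for a single person with two labeled keypoints and reads the visibility flags back out. The coordinates are made up; the keypoint indices follow the standard COCO ordering (0 = nose, 5 = left shoulder).

```python
import torch

num_people, num_keypoints = 1, 17
# Each keypoint is (x, y, visibility); start with everything unlabeled.
keypoints = torch.zeros((num_people, num_keypoints, 3), dtype=torch.float32)
keypoints[0, 0] = torch.tensor([120.0, 40.0, 2.0])  # nose: labeled and visible
keypoints[0, 5] = torch.tensor([100.0, 90.0, 1.0])  # left shoulder: labeled, occluded

target = {
    "boxes": torch.tensor([[80.0, 20.0, 160.0, 220.0]]),
    "labels": torch.tensor([1], dtype=torch.int64),  # 1 = person in COCO
    "keypoints": keypoints,
}

visibility = keypoints[..., 2]
labeled = visibility > 0    # flags 1 and 2
visible = visibility == 2   # flag 2 only
```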
FCOS (Fully Convolutional One-Stage Object Detection) is an anchor-free detector. It avoids anchor hyperparameter tuning by predicting bounding box offsets directly from feature map locations, using a centerness branch to suppress low-quality detections.
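The centerness target from the FCOS paper is sqrt((min(l, r) / max(l, r)) · (min(t, b) / max(t, b))), where l, r, t, b are the distances from a location to the left, right, top, and bottom edges of its box. The sketch below evaluates that formula for a few points; it is an illustration of the definition, not torchvision's internal implementation, and it assumes every point lies inside the box.

```python
import torch

def centerness(points, box):
    """FCOS-style centerness of (x, y) points with respect to one XYXY box."""
    x, y = points[:, 0], points[:, 1]
    left, right = x - box[0], box[2] - x
    top, bottom = y - box[1], box[3] - y
    lr = torch.minimum(left, right) / torch.maximum(left, right)
    tb = torch.minimum(top, bottom) / torch.maximum(top, bottom)
    return torch.sqrt(lr * tb)

box = torch.tensor([0.0, 0.0, 100.0, 50.0])
# Box center, a point near the left edge, and a point on the left edge.
points = torch.tensor([[50.0, 25.0], [10.0, 25.0], [0.0, 25.0]])
scores = centerness(points, box)
```

Centerness is 1 at the box center and falls to 0 at the edges, so multiplying it into the classification score down-weights boxes predicted from off-center locations.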
| Weight | Box mAP | Params | GFLOPs | File size |
| --- | --- | --- | --- | --- |
| COCO_V1 (DEFAULT) | 39.2 | 32.3M | 128.2 | 123.6 MB |
    import torch
    from torchvision.io import read_image
    from torchvision.models.detection import (
        fcos_resnet50_fpn,
        FCOS_ResNet50_FPN_Weights,
    )

    weights = FCOS_ResNet50_FPN_Weights.DEFAULT
    model = fcos_resnet50_fpn(weights=weights)
    model.eval()

    preprocess = weights.transforms()
    batch = [preprocess(read_image("image.jpg"))]

    with torch.no_grad():
        predictions = model(batch)

    # Same output schema as Faster R-CNN: 'boxes', 'labels', 'scores'
    print(predictions[0]["boxes"].shape)  # [N, 4]
FCOS is a good drop-in replacement for Faster R-CNN when you want to avoid anchor grid tuning. It reaches 39.2 box mAP at 128.2 GFLOPs, compared with 46.7 mAP at 280.4 GFLOPs for Faster R-CNN V2, and shares the same inference API.
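Because the output dictionaries share one schema, post-processing written for one detector carries over to the others. As a small sketch, the hypothetical helper below (filter_by_score is not a torchvision function) keeps only detections above a score threshold; the prediction is hand-made for illustration rather than real model output.

```python
import torch

def filter_by_score(prediction, min_score=0.5):
    """Keep only detections whose score is at least min_score."""
    keep = prediction["scores"] >= min_score
    return {name: value[keep] for name, value in prediction.items()}

# Hand-made prediction in the documented schema (not real model output).
prediction = {
    "boxes": torch.tensor([[10.0, 10.0, 50.0, 60.0], [5.0, 5.0, 20.0, 20.0]]),
    "labels": torch.tensor([1, 18]),
    "scores": torch.tensor([0.92, 0.30]),
}
confident = filter_by_score(prediction, min_score=0.5)
```

Indexing every value with the same boolean mask keeps boxes, labels, and scores aligned.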
RetinaNet is a one-stage detector that introduces Focal Loss to address the class imbalance between foreground and background anchors during training. It uses an FPN backbone with two subnetworks (classification and box regression), each of which shares its weights across all pyramid levels.
retinanet_resnet50_fpn
Standard recipe from the original paper.
COCO box mAP: 36.4 | 34.0M params | 151.5 GFLOPs
retinanet_resnet50_fpn_v2
Enhanced training recipe with GroupNorm heads.
COCO box mAP: 41.5 | 38.2M params | 152.2 GFLOPs
SSD (Single Shot MultiBox Detector) predicts boxes at multiple fixed aspect-ratio anchors across several feature maps in a single forward pass. SSDLite replaces standard convolutions with depthwise-separable convolutions and pairs with a MobileNetV3 backbone for deployment on mobile hardware.
SSD and SSDLite internally resize all images to a fixed spatial size (300×300 and 320×320, respectively) regardless of the input dimensions. The output boxes are rescaled to the original image coordinates before being returned.
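The rescaling step is simple per-axis arithmetic. The sketch below shows the idea with a hypothetical helper (rescale_boxes is not torchvision's internal function) that maps a box from the 300×300 network input to a 480×640 original image.

```python
import torch

def rescale_boxes(boxes, from_size, to_size):
    """Map XYXY boxes from one (height, width) frame to another."""
    from_h, from_w = from_size
    to_h, to_w = to_size
    # x coordinates scale with width, y coordinates with height.
    scale = torch.tensor([to_w / from_w, to_h / from_h] * 2)
    return boxes * scale

# A box predicted on the 300x300 network input ...
box_300 = torch.tensor([[30.0, 60.0, 150.0, 240.0]])
# ... expressed in the coordinates of a 480x640 (H x W) original image.
box_orig = rescale_boxes(box_300, from_size=(300, 300), to_size=(480, 640))
```

Because the horizontal and vertical scale factors differ when the original image is not square, the fixed-size resize changes the aspect ratio seen by the network.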
All detection models accept any backbone with an out_channels attribute. The following snippet shows how to attach a custom MobileNetV2 backbone to FasterRCNN: