torchvision.ops is a library of vision-specific operations that sit below the model level: box manipulation, region-of-interest pooling, custom losses, deformable layers, and regularization blocks. These primitives are used internally by TorchVision’s detection and segmentation models (Faster R-CNN, RetinaNet, Mask R-CNN, etc.) and are fully available for use in custom architectures. Many operations include both a functional form and an nn.Module wrapper.

Box Operations

All box functions expect tensors in (x1, y1, x2, y2) (XYXY) format unless noted otherwise. Coordinates are floating-point.

Non-Maximum Suppression

import torch
from torchvision.ops import nms, batched_nms

boxes = torch.tensor([
    [100., 50., 300., 250.],
    [110., 55., 310., 260.],  # overlapping box
    [400., 100., 600., 350.],
], dtype=torch.float32)
scores = torch.tensor([0.9, 0.75, 0.85])

# Standard NMS — returns indices of kept boxes
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]) — indices of kept boxes

# Batched NMS — applies NMS per class (idxs selects the class)
idxs = torch.tensor([0, 0, 1])
keep_batched = batched_nms(boxes, scores, idxs, iou_threshold=0.5)

| Function | Signature | Description |
| --- | --- | --- |
| nms | nms(boxes, scores, iou_threshold) -> Tensor | Standard NMS; returns kept indices sorted by descending score |
| batched_nms | batched_nms(boxes, scores, idxs, iou_threshold) -> Tensor | NMS applied independently per element in idxs |
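
For intuition, the greedy procedure behind NMS can be written in a few lines of plain Python. This is a dependency-free sketch rather than torchvision's implementation (which runs as a compiled kernel); the helper names `iou_xyxy` and `greedy_nms` are made up for illustration.

```python
def iou_xyxy(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if its IoU with every already-kept box is within the threshold.
        if all(iou_xyxy(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [
    (100.0, 50.0, 300.0, 250.0),
    (110.0, 55.0, 310.0, 260.0),  # IoU with the first box is about 0.84
    (400.0, 100.0, 600.0, 350.0),
]
scores = [0.9, 0.75, 0.85]
kept = greedy_nms(boxes, scores, iou_threshold=0.5)
print(kept)  # [0, 2]
```

Conceptually, batched_nms runs the same loop separately for each distinct value in idxs, so boxes from different classes never suppress each other.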

IoU Metrics

from torchvision.ops import box_iou, generalized_box_iou, distance_box_iou, complete_box_iou

boxes1 = torch.tensor([[0., 0., 10., 10.]])
boxes2 = torch.tensor([[5., 5., 15., 15.], [20., 20., 30., 30.]])

iou = box_iou(boxes1, boxes2)           # Tensor[1, 2]
giou = generalized_box_iou(boxes1, boxes2)  # Tensor[1, 2] — handles non-overlapping boxes
diou = distance_box_iou(boxes1, boxes2)     # Tensor[1, 2] — adds centre-point distance term
ciou = complete_box_iou(boxes1, boxes2)     # Tensor[1, 2] — adds aspect-ratio consistency term

| Function | Return | Notes |
| --- | --- | --- |
| box_iou | Tensor[N, M] | Standard intersection-over-union |
| generalized_box_iou | Tensor[N, M] | GIoU: stays informative (non-zero gradient) even for non-overlapping boxes |
| distance_box_iou | Tensor[N, M] | DIoU: subtracts a normalised centre-point distance penalty |
| complete_box_iou | Tensor[N, M] | CIoU: DIoU plus an aspect-ratio penalty |
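
To make the difference between the metrics concrete, here is a small plain-Python sketch that evaluates IoU, GIoU, and DIoU for the two box pairs used above. It follows the published formulas and leaves out the small epsilon terms a library implementation may add; `box_metrics` is an illustrative name, not a torchvision function.

```python
def box_metrics(a, b):
    """Return (IoU, GIoU, DIoU) for two XYXY boxes."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest axis-aligned box enclosing both inputs.
    enc_w = max(a[2], b[2]) - min(a[0], b[0])
    enc_h = max(a[3], b[3]) - min(a[1], b[1])
    giou = iou - (enc_w * enc_h - union) / (enc_w * enc_h)

    # Squared centre distance divided by the squared enclosing-box diagonal.
    dx = (a[0] + a[2]) / 2 - (b[0] + b[2]) / 2
    dy = (a[1] + a[3]) / 2 - (b[1] + b[3]) / 2
    diou = iou - (dx * dx + dy * dy) / (enc_w ** 2 + enc_h ** 2)
    return iou, giou, diou


overlapping = box_metrics((0.0, 0.0, 10.0, 10.0), (5.0, 5.0, 15.0, 15.0))
disjoint = box_metrics((0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0))
print([round(v, 4) for v in overlapping])  # [0.1429, -0.0794, 0.0317]
print([round(v, 4) for v in disjoint])     # [0.0, -0.7778, -0.4444]
```

For the disjoint pair, plain IoU is 0 no matter how far apart the boxes are, while GIoU and DIoU still change with the distance between them.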

Box Utilities

from torchvision.ops import (
    box_area, box_convert,
    clip_boxes_to_image, remove_small_boxes, masks_to_boxes
)

# Compute area (supports XYXY by default, also accepts XYWH / CXCYWH via fmt=)
areas = box_area(boxes)  # Tensor[N]

# Convert between formats: "xyxy", "xywh", "cxcywh", "xywhr", "cxcywhr", "xyxyxyxy"
xywh_boxes = box_convert(boxes, in_fmt="xyxy", out_fmt="xywh")
cxcywh_boxes = box_convert(boxes, in_fmt="xyxy", out_fmt="cxcywh")

# Clip boxes to image boundaries
clipped = clip_boxes_to_image(boxes, size=(480, 640))  # (H, W)

# Remove tiny boxes (degenerate proposals)
keep = remove_small_boxes(boxes, min_size=10.0)  # returns valid indices

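# Convert binary segmentation masks to bounding boxes.
# Give every mask at least one True pixel: an all-zero mask has no valid bounding box.
masks = torch.zeros(3, 100, 100, dtype=torch.bool)
masks[0, 20:50, 30:70] = True
masks[1, 10:40, 10:20] = True
masks[2, 60:90, 50:95] = True
tight_boxes = masks_to_boxes(masks)  # Tensor[3, 4] in XYXY format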
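
The axis-aligned format conversions are simple arithmetic. The sketch below spells out three of them in plain Python for a single box; the function names are illustrative only, and box_convert itself operates on whole tensors.

```python
def xyxy_to_xywh(box):
    """(x1, y1, x2, y2) -> (x, y, width, height), anchored at the top-left corner."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)


def xyxy_to_cxcywh(box):
    """(x1, y1, x2, y2) -> (centre_x, centre_y, width, height)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


def cxcywh_to_xyxy(box):
    """(centre_x, centre_y, width, height) -> (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


box = (100.0, 50.0, 300.0, 250.0)
print(xyxy_to_xywh(box))    # (100.0, 50.0, 200.0, 200.0)
print(xyxy_to_cxcywh(box))  # (200.0, 150.0, 200.0, 200.0)
```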

RoI Pooling

Region-of-Interest (RoI) operations extract fixed-size feature maps for each proposed bounding box, enabling box-specific classification and regression heads.

RoIAlign

from torchvision.ops import RoIAlign
import torch

roi_align = RoIAlign(
    output_size=(7, 7),
    spatial_scale=1.0 / 16,  # 1 / stride: maps image coordinates onto a stride-16 feature map
    sampling_ratio=-1,        # -1 = adaptive; >0 = fixed grid per bin
    aligned=True,             # half-pixel offset; default is False
)

feature_map = torch.rand(1, 256, 56, 56)
# Boxes are in image coordinates and can be passed either as one Tensor[K, 5]
# of [batch_idx, x1, y1, x2, y2] rows, or as a list with one Tensor[L, 4] per image.
rois = torch.tensor([[0., 10., 10., 100., 100.]])
pooled = roi_align(feature_map, rois)  # Tensor[1, 256, 7, 7]

RoIAlign samples features with bilinear interpolation instead of quantizing box coordinates, which makes it the preferred choice over RoIPool for most use cases. The aligned parameter defaults to False; setting it to True shifts coordinates by half a pixel to better align with the feature grid, matching the Detectron2 convention (recommended for new models).
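
The effect of aligned can be seen by computing where an RoI lands on the feature map. The following plain-Python sketch reflects my reading of the reference kernels (a 0.5 shift when aligned=True, and a minimum RoI size of one feature cell otherwise); treat those details as assumptions, and note that `roi_bin_geometry` is a hypothetical helper.

```python
def roi_bin_geometry(roi, spatial_scale, output_size, aligned):
    """Return (start_x, start_y, bin_w, bin_h) in feature-map coordinates."""
    x1, y1, x2, y2 = roi
    out_h, out_w = output_size
    offset = 0.5 if aligned else 0.0
    start_x = x1 * spatial_scale - offset
    start_y = y1 * spatial_scale - offset
    roi_w = x2 * spatial_scale - offset - start_x
    roi_h = y2 * spatial_scale - offset - start_y
    if not aligned:
        # Legacy behaviour (assumed): an RoI spans at least one feature cell.
        roi_w = max(roi_w, 1.0)
        roi_h = max(roi_h, 1.0)
    return start_x, start_y, roi_w / out_w, roi_h / out_h


roi = (10.0, 10.0, 100.0, 100.0)
legacy = roi_bin_geometry(roi, 1.0 / 16, (7, 7), aligned=False)
shifted = roi_bin_geometry(roi, 1.0 / 16, (7, 7), aligned=True)
print([round(v, 4) for v in legacy])   # [0.625, 0.625, 0.8036, 0.8036]
print([round(v, 4) for v in shifted])  # [0.125, 0.125, 0.8036, 0.8036]
```

Both settings produce the same bin size here; only the sampling origin moves by half a feature cell.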

Other RoI Poolers

| Class | Description |
| --- | --- |
| RoIPool(output_size, spatial_scale) | Hard max-pool; not differentiable w.r.t. coordinates |
| PSRoIAlign(output_size, spatial_scale, sampling_ratio) | Position-sensitive RoI align (R-FCN) |
| PSRoIPool(output_size, spatial_scale) | Position-sensitive max-pool |
| MultiScaleRoIAlign(featmap_names, output_size, sampling_ratio) | Assigns each RoI to a feature level by scale; used in Faster R-CNN + FPN |

from torchvision.ops import MultiScaleRoIAlign

# Used inside Faster R-CNN / Mask R-CNN
roi_pooler = MultiScaleRoIAlign(
    featmap_names=["0", "1", "2", "3"],
    output_size=7,
    sampling_ratio=2,
)
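
Assigning each RoI to a feature level by scale follows the heuristic from the FPN paper (Eq. 1). The sketch below shows that rule in plain Python. The constants (canonical scale 224, canonical level 4) come from the paper, while the level range 2 to 5 is an assumption matching strides 4 to 32; `fpn_level` is an illustrative name.

```python
import math


def fpn_level(box, k_min=2, k_max=5, canonical_scale=224, canonical_level=4, eps=1e-6):
    """Map an XYXY box to a pyramid level using the FPN paper's heuristic."""
    x1, y1, x2, y2 = box
    scale = math.sqrt((x2 - x1) * (y2 - y1))
    level = math.floor(canonical_level + math.log2(scale / canonical_scale) + eps)
    # Clamp to the levels that actually exist in the pyramid.
    return min(max(level, k_min), k_max)


for side in (32, 112, 224, 448, 1000):
    print(side, fpn_level((0.0, 0.0, float(side), float(side))))
# 32 2
# 112 3
# 224 4
# 448 5
# 1000 5
```

Small boxes are pooled from high-resolution levels and large boxes from coarse ones, so each RoI covers a similar number of feature cells.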

Loss Functions

Focal Loss

sigmoid_focal_loss addresses class imbalance in dense detection by down-weighting well-classified examples:

from torchvision.ops import sigmoid_focal_loss
import torch

# Both tensors: Tensor[N, C], float
inputs = torch.randn(8, 80)   # raw logits
targets = torch.zeros(8, 80)
targets[0, 3] = 1.0           # class 3 positive

loss = sigmoid_focal_loss(
    inputs, targets,
    alpha=0.25,     # balances positive vs negative examples; -1 disables the weighting
    gamma=2.0,      # focusing parameter; 0 removes the focusing term (alpha-weighted BCE)
    reduction="sum",
)
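
To see how the focusing term down-weights easy examples, the per-element computation can be sketched in plain Python. This follows the formula from the focal loss paper for a single logit; `focal_loss_element` is a made-up helper, and a real implementation works on whole tensors with a numerically stable cross-entropy.

```python
import math


def focal_loss_element(logit, target, alpha=0.25, gamma=2.0):
    """Focal loss for one logit/target pair (target is 0.0 or 1.0)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    ce = -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))
    p_t = p * target + (1.0 - p) * (1.0 - target)  # probability of the true class
    loss = ce * (1.0 - p_t) ** gamma               # focusing term
    if alpha >= 0:
        alpha_t = alpha * target + (1.0 - alpha) * (1.0 - target)
        loss *= alpha_t
    return loss


hard = focal_loss_element(logit=0.0, target=1.0)  # p = 0.5, an uncertain prediction
easy = focal_loss_element(logit=4.0, target=1.0)  # p is about 0.98, well classified
print(round(hard, 4))  # 0.0433
print(easy < hard / 1000)  # True
```

With gamma=0 and alpha=-1 the function reduces to plain binary cross-entropy.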

IoU-Based Losses

These losses operate directly on box coordinates, providing smooth gradients even when boxes do not overlap:

from torchvision.ops import (
    generalized_box_iou_loss,
    distance_box_iou_loss,
    complete_box_iou_loss,
)

pred_boxes = torch.tensor([[10., 10., 50., 50.]])
gt_boxes   = torch.tensor([[15., 15., 55., 55.]])

giou_loss = generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
diou_loss = distance_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
ciou_loss = complete_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")

| Loss function | Notes |
| --- | --- |
| generalized_box_iou_loss | GIoU loss; non-zero gradient for non-overlapping boxes |
| distance_box_iou_loss | DIoU loss; penalises centre-point distance |
| complete_box_iou_loss | CIoU loss; adds aspect-ratio consistency term |
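
For reference, the three losses can be written as follows, using the definitions from the GIoU and DIoU/CIoU papers. Here $A$ and $B$ are the predicted and ground-truth boxes, $C$ is the smallest box enclosing both, $\rho$ is the distance between the box centres, and $c$ is the diagonal length of $C$. The symbols are notation chosen for this summary, not names from the library.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{GIoU}} &= 1 - \mathrm{IoU} + \frac{|C \setminus (A \cup B)|}{|C|} \\
\mathcal{L}_{\mathrm{DIoU}} &= 1 - \mathrm{IoU} + \frac{\rho^2}{c^2} \\
\mathcal{L}_{\mathrm{CIoU}} &= \mathcal{L}_{\mathrm{DIoU}} + \alpha v, \qquad
v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2, \qquad
\alpha = \frac{v}{(1 - \mathrm{IoU}) + v}
\end{aligned}
```

For the pair above, the intersection is 35 × 35 = 1225 and the union is 1975, so IoU ≈ 0.620; the enclosing box has area 2025, giving a penalty of 50 / 2025 ≈ 0.025 and a GIoU loss of about 0.404. Unlike the pairwise metrics, which return an N × M matrix, the loss functions compare boxes row by row, so both inputs must have the same shape.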

Layers and Modules

Deformable Convolution

DeformConv2d implements Deformable Convolutional Networks v2 (Zhu et al., 2019). It applies per-location spatial sampling offsets and modulation masks, both supplied as inputs, that allow the receptive field to adapt to object shape.

from torchvision.ops import DeformConv2d, deform_conv2d
import torch

# Module form
dcn = DeformConv2d(
    in_channels=64,
    out_channels=64,
    kernel_size=3,
    stride=1,
    padding=1,
)

x = torch.rand(1, 64, 32, 32)
kH, kW = 3, 3

# Offset: Tensor[B, 2 * kH * kW, out_H, out_W]
offset = torch.rand(1, 2 * kH * kW, 32, 32)
# Mask (v2 only): Tensor[B, kH * kW, out_H, out_W]
mask = torch.sigmoid(torch.rand(1, kH * kW, 32, 32))

out = dcn(x, offset, mask)  # Tensor[1, 64, 32, 32]

offset and mask are typically produced by a small auxiliary convolutional head whose output spatial size matches that of the DCN output. When mask is None, the layer reverts to Deformable Conv v1 behaviour.
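
When wiring up that auxiliary head, the main thing to get right is the channel count and spatial size of offset and mask. The plain-Python sketch below computes them from the convolution hyperparameters, using the standard convolution output-size formula; `deform_conv_shapes` is a hypothetical helper written for this page.

```python
def deform_conv_shapes(in_hw, kernel_size, stride=1, padding=0, dilation=1, offset_groups=1):
    """Expected output size and offset/mask channel counts for a deformable conv."""
    kh, kw = kernel_size
    out_h = (in_hw[0] + 2 * padding - dilation * (kh - 1) - 1) // stride + 1
    out_w = (in_hw[1] + 2 * padding - dilation * (kw - 1) - 1) // stride + 1
    return {
        "output_hw": (out_h, out_w),
        # One (dy, dx) pair per kernel tap and offset group.
        "offset_channels": 2 * offset_groups * kh * kw,
        # One modulation scalar per kernel tap and offset group.
        "mask_channels": offset_groups * kh * kw,
    }


shapes = deform_conv_shapes((32, 32), (3, 3), stride=1, padding=1)
print(shapes)
# {'output_hw': (32, 32), 'offset_channels': 18, 'mask_channels': 9}
```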

Feature Pyramid Network

FeaturePyramidNetwork takes a dict of multi-scale feature maps and produces same-channel FPN representations through lateral connections and top-down merging:

from torchvision.ops import FeaturePyramidNetwork
import torch
from collections import OrderedDict

fpn = FeaturePyramidNetwork(
    in_channels_list=[256, 512, 1024, 2048],  # backbone channel counts per level
    out_channels=256,
    extra_blocks=None,  # optionally add P6, P7 via LastLevelMaxPool or LastLevelP6P7
    norm_layer=None,    # optional normalization layer applied after the lateral (1x1) and output (3x3) convs
)

# Provide an OrderedDict of feature maps from your backbone
backbone_features = OrderedDict([
    ("layer1", torch.rand(1, 256, 56, 56)),
    ("layer2", torch.rand(1, 512, 28, 28)),
    ("layer3", torch.rand(1, 1024, 14, 14)),
    ("layer4", torch.rand(1, 2048, 7, 7)),
])

fpn_features = fpn(backbone_features)
# Each value: Tensor[1, 256, H, W]
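
The top-down merging step can be illustrated on tiny single-channel maps. This plain-Python sketch covers only the upsample-and-add pathway and deliberately omits the 1x1 lateral and 3x3 output convolutions; `upsample_nearest` and `top_down_merge` are illustrative names, and nearest-neighbour upsampling is assumed.

```python
def upsample_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize of a 2-D list to (out_h, out_w)."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [[fmap[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def top_down_merge(laterals):
    """laterals: 2-D maps ordered from finest to coarsest resolution."""
    merged = [laterals[-1]]  # the coarsest level is passed through unchanged
    for lateral in reversed(laterals[:-1]):
        h, w = len(lateral), len(lateral[0])
        up = upsample_nearest(merged[0], h, w)
        # Add the upsampled coarser map to the current lateral map.
        merged.insert(0, [[lateral[r][c] + up[r][c] for c in range(w)] for r in range(h)])
    return merged


fine = [[1] * 4 for _ in range(4)]
mid = [[10] * 2 for _ in range(2)]
coarse = [[100]]
p_fine, p_mid, p_coarse = top_down_merge([fine, mid, coarse])
print(p_mid)      # [[110, 110], [110, 110]]
print(p_fine[0])  # [111, 111, 111, 111]
```

Each finer level therefore accumulates context from every coarser level above it.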

Conv2dNormActivation / Conv3dNormActivation

Convenience blocks that chain Conv → Norm → Activation with sensible defaults (the layers run in sequence; they are not fused into a single kernel). Used throughout EfficientNet, MobileNet, Swin, and Video Swin:

from torchvision.ops import Conv2dNormActivation, Conv3dNormActivation
import torch.nn as nn

# 2D: Conv2d → BatchNorm2d → ReLU (defaults)
block_2d = Conv2dNormActivation(
    in_channels=32,
    out_channels=64,
    kernel_size=3,
    stride=2,
    norm_layer=nn.BatchNorm2d,       # default
    activation_layer=nn.ReLU,        # default
)

# 3D variant for video models: Conv3d → BatchNorm3d → ReLU
block_3d = Conv3dNormActivation(
    in_channels=3,
    out_channels=64,
    kernel_size=3,
    norm_layer=nn.BatchNorm3d,
)

SqueezeExcitation

Channel-wise attention from Hu et al., 2018:

from torchvision.ops import SqueezeExcitation
import torch

se = SqueezeExcitation(
    input_channels=64,
    squeeze_channels=16,   # bottleneck width (typically input_channels // 4)
)

x = torch.rand(1, 64, 14, 14)
out = se(x)  # same shape — applies channel-wise scaling

MLP

Multi-layer perceptron module used in transformer architectures (Swin, MViT):

from torchvision.ops import MLP
import torch

mlp = MLP(
    in_channels=768,
    hidden_channels=[3072, 768],  # width of each successive layer; the last entry is the output width
    dropout=0.1,
)

x = torch.rand(1, 196, 768)
out = mlp(x)  # Tensor[1, 196, 768]

FrozenBatchNorm2d

BatchNorm with permanently frozen running statistics and affine parameters. Used in detection backbones where fine-tuning BN stats is undesirable:

from torchvision.ops import FrozenBatchNorm2d
import torch

fbn = FrozenBatchNorm2d(num_features=64)
x = torch.rand(2, 64, 14, 14)
out = fbn(x)  # statistics never updated during training
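
Because all four buffers are constant, a frozen batch norm reduces to a fixed per-channel scale and shift. The plain-Python sketch below shows that reduction for a single channel; `frozen_bn_scale_shift` is a hypothetical helper, and the eps default of 1e-5 is an assumption.

```python
import math


def frozen_bn_scale_shift(weight, bias, running_mean, running_var, eps=1e-5):
    """Fold frozen BN statistics into y = x * scale + shift."""
    scale = weight / math.sqrt(running_var + eps)
    shift = bias - running_mean * scale
    return scale, shift


scale, shift = frozen_bn_scale_shift(
    weight=3.0, bias=1.0, running_mean=2.0, running_var=4.0, eps=0.0
)
print(scale, shift)         # 1.5 -2.0
print(4.0 * scale + shift)  # 4.0, the same as (4 - 2) / sqrt(4) * 3 + 1
```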

Permute

Lightweight dimension-permutation module (used, for example, after the patch-embedding convolution in Swin Transformer to switch to channels-last layout):

from torchvision.ops import Permute
import torch

perm = Permute(dims=[0, 2, 3, 1])  # NCHW → NHWC
x = torch.rand(1, 64, 14, 14)
out = perm(x)  # Tensor[1, 14, 14, 64]

Regularization

Stochastic Depth (Drop Path)

StochasticDepth randomly drops entire residual branches during training. It is used as a regularizer in the Swin Transformer, ConvNeXt, and MViT models:

from torchvision.ops import StochasticDepth, stochastic_depth
import torch

# Module form: apply it to the residual branch output before adding the skip connection
sd = StochasticDepth(p=0.1, mode="row")

residual = torch.rand(4, 64, 14, 14)
out = sd(residual)  # in training mode, some samples are zeroed and the rest are scaled by 1 / (1 - p)

# Functional form
out = stochastic_depth(residual, p=0.1, mode="batch", training=True)

mode="row" drops individual samples independently; mode="batch" drops the entire batch with probability p.
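
The row mode can be sketched without any tensor library. The version below is a minimal illustration, assuming survivors are rescaled by 1 / (1 - p) so the expected value is unchanged; `stochastic_depth_rows` is a made-up name, and the nested lists stand in for a batch of flattened samples.

```python
import random


def stochastic_depth_rows(batch, p, training=True, rng=random):
    """Zero each sample with probability p and rescale the survivors."""
    if not training or p == 0.0:
        return batch
    survival = 1.0 - p
    out = []
    for sample in batch:
        keep = rng.random() < survival
        factor = 1.0 / survival if keep else 0.0
        out.append([v * factor for v in sample])
    return out


rng = random.Random(0)
batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
out = stochastic_depth_rows(batch, p=0.5, rng=rng)
for src, dst in zip(batch, out):
    print(src, "->", dst)  # each row is either all zeros or doubled
```

In a residual block the result is added back to the skip path, for example `x + sd(branch(x))`, so a dropped sample simply passes through the identity connection.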

DropBlock

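Drops contiguous spatial blocks of activations. Because neighbouring activations in a feature map are strongly correlated, removing whole blocks regularizes convolutional features more effectively than element-wise dropout:

from torchvision.ops import DropBlock2d, DropBlock3d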
import torch

# 2D — for image feature maps
db2d = DropBlock2d(p=0.1, block_size=7)
x2d = torch.rand(2, 64, 28, 28)
out2d = db2d(x2d)

# 3D — for video feature maps
db3d = DropBlock3d(p=0.1, block_size=5)
x3d = torch.rand(2, 64, 8, 14, 14)
out3d = db3d(x3d)
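
Note that p is the target fraction of dropped activations, not the probability of starting a block at a given position. The sketch below computes the per-position seed probability (called gamma in the DropBlock paper) in plain Python; the formula reflects my understanding of the 2D case and `drop_block_gamma` is an illustrative helper.

```python
def drop_block_gamma(p, height, width, block_size):
    """Bernoulli rate for block centres so that roughly a fraction p is dropped."""
    block_size = min(block_size, height, width)
    valid_centres = (height - block_size + 1) * (width - block_size + 1)
    return (p * height * width) / (block_size ** 2 * valid_centres)


gamma = drop_block_gamma(p=0.1, height=28, width=28, block_size=7)
print(round(gamma, 5))  # 0.00331
```

Each seed removes a 7 × 7 block, so the seed rate is far below p; with block_size=1 the rate equals p and the layer behaves like ordinary dropout.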

NMS + Box Conversion Example

import torch
from torchvision.ops import nms, box_convert

boxes = torch.tensor([
    [100., 50., 300., 250.],
    [110., 55., 310., 260.],  # overlapping box
    [400., 100., 600., 350.],
], dtype=torch.float32)
scores = torch.tensor([0.9, 0.75, 0.85])

keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]) — indices of kept boxes

# Convert box formats: XYXY corners to XYWH (top-left corner, width, height)
xywh_boxes = box_convert(boxes, in_fmt="xyxy", out_fmt="xywh")
print(xywh_boxes)
# tensor([[100.,  50., 200., 200.],
#         [110.,  55., 200., 205.],
#         [400., 100., 200., 250.]])

RoIAlign Example

from torchvision.ops import RoIAlign
import torch

roi_align = RoIAlign(
    output_size=(7, 7),
    spatial_scale=1.0 / 16,  # 1 / stride of the feature map
    sampling_ratio=-1,
    aligned=True,             # default is False; set True for Detectron2-style alignment
)

feature_map = torch.rand(1, 256, 56, 56)
rois = torch.tensor([[0., 10., 10., 100., 100.]])  # [batch_idx, x1, y1, x2, y2]
pooled = roi_align(feature_map, rois)  # a Tensor[K, 5] carries the batch index in column 0
print(pooled.shape)  # torch.Size([1, 256, 7, 7])

Quick Reference

Box Operations

nms, batched_nms, box_iou, generalized_box_iou, distance_box_iou, complete_box_iou, box_area, box_convert, clip_boxes_to_image, remove_small_boxes, masks_to_boxes

RoI Pooling

RoIAlign, roi_align, RoIPool, roi_pool, PSRoIAlign, ps_roi_align, PSRoIPool, ps_roi_pool, MultiScaleRoIAlign

Loss Functions

sigmoid_focal_loss, generalized_box_iou_loss, distance_box_iou_loss, complete_box_iou_loss

Layers

DeformConv2d, deform_conv2d, FeaturePyramidNetwork, Conv2dNormActivation, Conv3dNormActivation, SqueezeExcitation, MLP, FrozenBatchNorm2d, Permute

Regularization

StochasticDepth, stochastic_depth, DropBlock2d, DropBlock3d, drop_block2d, drop_block3d
