torchvision.ops: Vision-Specific Primitives and Layers

torchvision.ops is a library of vision-specific operations that sit below the model level: box manipulation, region-of-interest pooling, custom losses, deformable layers, and regularization blocks. These primitives are used internally by TorchVision’s detection and segmentation models (Faster R-CNN, RetinaNet, Mask R-CNN, etc.) and are fully available for use in custom architectures. Many operations include both a functional form and an nn.Module wrapper.

Box Operations

All box functions expect tensors in (x1, y1, x2, y2) (XYXY) format unless noted otherwise. Coordinates are floating-point.

Non-Maximum Suppression

import torch
from torchvision.ops import nms, batched_nms

boxes = torch.tensor([
    [100., 50., 300., 250.],
    [110., 55., 310., 260.],  # overlapping box
    [400., 100., 600., 350.],
], dtype=torch.float32)
scores = torch.tensor([0.9, 0.75, 0.85])

# Standard NMS — returns indices of kept boxes
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]) — indices of kept boxes

# Batched NMS — applies NMS per class (idxs selects the class)
idxs = torch.tensor([0, 0, 1])
keep_batched = batched_nms(boxes, scores, idxs, iou_threshold=0.5)

Function	Signature	Description
`nms`	`(boxes, scores, iou_threshold) -> Tensor`	Standard NMS; returns kept indices sorted by descending score
`batched_nms`	`(boxes, scores, idxs, iou_threshold) -> Tensor`	NMS applied independently per category; a box can only be suppressed by boxes that share its value in `idxs`
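To make the suppression rule concrete, here is a minimal pure-PyTorch sketch of greedy NMS and of the per-category variant. It illustrates the algorithm only and is not torchvision's optimized kernel; `iou_one_to_many`, `greedy_nms`, and `per_class_nms` are hypothetical helper names introduced for this example.

```python
import torch


def iou_one_to_many(box, others):
    """IoU between one XYXY box (shape [4]) and K XYXY boxes (shape [K, 4])."""
    x1 = torch.maximum(box[0], others[:, 0])
    y1 = torch.maximum(box[1], others[:, 1])
    x2 = torch.minimum(box[2], others[:, 2])
    y2 = torch.minimum(box[3], others[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / (area + areas - inter)


def greedy_nms(boxes, scores, iou_threshold):
    """Keep the best-scoring box, drop everything overlapping it too much, repeat."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.numel() == 0:
            break
        ious = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[ious <= iou_threshold]  # survivors of this round
    return torch.tensor(keep, dtype=torch.int64)


def per_class_nms(boxes, scores, idxs, iou_threshold):
    """Run greedy NMS separately for every distinct value in idxs."""
    kept = []
    for c in idxs.unique():
        members = torch.where(idxs == c)[0]
        local_keep = greedy_nms(boxes[members], scores[members], iou_threshold)
        kept.append(members[local_keep])
    kept = torch.cat(kept)
    return kept[scores[kept].argsort(descending=True)]


boxes = torch.tensor([[100., 50., 300., 250.],
                      [110., 55., 310., 260.],
                      [400., 100., 600., 350.]])
scores = torch.tensor([0.9, 0.75, 0.85])
print(greedy_nms(boxes, scores, 0.5))                              # tensor([0, 2])
print(per_class_nms(boxes, scores, torch.tensor([0, 0, 1]), 0.5))  # tensor([0, 2])
```

Boxes 0 and 1 overlap with an IoU of roughly 0.84, so the lower-scoring box 1 is dropped in both calls. It would survive only if it belonged to a different category than box 0.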

IoU Metrics

from torchvision.ops import box_iou, generalized_box_iou, distance_box_iou, complete_box_iou

boxes1 = torch.tensor([[0., 0., 10., 10.]])
boxes2 = torch.tensor([[5., 5., 15., 15.], [20., 20., 30., 30.]])

iou = box_iou(boxes1, boxes2)           # Tensor[1, 2]
giou = generalized_box_iou(boxes1, boxes2)  # Tensor[1, 2] — handles non-overlapping boxes
diou = distance_box_iou(boxes1, boxes2)     # Tensor[1, 2] — adds centre-point distance term
ciou = complete_box_iou(boxes1, boxes2)     # Tensor[1, 2] — adds aspect-ratio consistency term

Function	Return	Notes
`box_iou`	`Tensor[N, M]`	Standard intersection-over-union
`generalized_box_iou`	`Tensor[N, M]`	GIoU: subtracts an enclosing-box penalty, so it stays informative for non-overlapping boxes
`distance_box_iou`	`Tensor[N, M]`	DIoU: subtracts a normalised centre-point distance penalty from IoU
`complete_box_iou`	`Tensor[N, M]`	CIoU: DIoU plus an aspect-ratio consistency penalty
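The difference between these metrics is easiest to see on a single pair of boxes. The following pure-PyTorch sketch computes IoU, GIoU, and DIoU by hand from their standard definitions; `iou_variants` is a hypothetical helper, and the small epsilon terms that library implementations usually add for numerical stability are left out.

```python
import torch


def iou_variants(a, b):
    """Return (IoU, GIoU, DIoU) for two XYXY boxes given as 1-D tensors of length 4."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])

    # Intersection rectangle (clamped to zero when the boxes do not overlap)
    inter_wh = (torch.minimum(a[2:], b[2:]) - torch.maximum(a[:2], b[:2])).clamp(min=0)
    inter = inter_wh.prod()
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest box enclosing both inputs
    enc_lt = torch.minimum(a[:2], b[:2])
    enc_rb = torch.maximum(a[2:], b[2:])
    enc_area = (enc_rb - enc_lt).prod()

    # GIoU: penalise the part of the enclosing box not covered by the union
    giou = iou - (enc_area - union) / enc_area

    # DIoU: penalise squared centre distance relative to the enclosing diagonal
    centre_a = (a[:2] + a[2:]) / 2
    centre_b = (b[:2] + b[2:]) / 2
    centre_dist_sq = ((centre_a - centre_b) ** 2).sum()
    diag_sq = ((enc_rb - enc_lt) ** 2).sum()
    diou = iou - centre_dist_sq / diag_sq

    return iou, giou, diou


overlapping = iou_variants(torch.tensor([0., 0., 10., 10.]), torch.tensor([5., 5., 15., 15.]))
disjoint = iou_variants(torch.tensor([0., 0., 10., 10.]), torch.tensor([20., 20., 30., 30.]))
print([round(float(v), 4) for v in overlapping])  # [0.1429, -0.0794, 0.0317]
print([round(float(v), 4) for v in disjoint])     # [0.0, -0.7778, -0.4444]
```

For the disjoint pair plain IoU is exactly zero regardless of how far apart the boxes are, while GIoU and DIoU still change with the distance between them.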

Box Utilities

from torchvision.ops import (
    box_area, box_convert,
    clip_boxes_to_image, remove_small_boxes, masks_to_boxes
)

# Compute area (supports XYXY by default, also accepts XYWH / CXCYWH via fmt=)
areas = box_area(boxes)  # Tensor[N]

# Convert between formats: "xyxy", "xywh", "cxcywh", "xywhr", "cxcywhr", "xyxyxyxy"
xywh_boxes = box_convert(boxes, in_fmt="xyxy", out_fmt="xywh")
cxcywh_boxes = box_convert(boxes, in_fmt="xyxy", out_fmt="cxcywh")

# Clip boxes to image boundaries
clipped = clip_boxes_to_image(boxes, size=(480, 640))  # (H, W)

# Remove tiny boxes (degenerate proposals)
keep = remove_small_boxes(boxes, min_size=10.0)  # returns valid indices

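# Convert binary segmentation masks to bounding boxes
# (give every mask at least one nonzero pixel; an all-empty mask may raise an error)
masks = torch.zeros(3, 100, 100, dtype=torch.bool)
masks[0, 20:50, 30:70] = True
masks[1, 5:15, 5:25] = True
masks[2, 60:90, 10:40] = True
tight_boxes = masks_to_boxes(masks)  # Tensor[3, 4] in XYXY format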

RoI Pooling

Region-of-Interest (RoI) operations extract fixed-size feature maps for each proposed bounding box, enabling box-specific classification and regression heads.

RoIAlign

from torchvision.ops import RoIAlign
import torch

roi_align = RoIAlign(
    output_size=(7, 7),
    spatial_scale=1.0 / 16,  # 1 / feature-map stride (here: a stride-16 feature map)
    sampling_ratio=-1,        # -1 = adaptive; >0 = fixed grid per bin
    aligned=True,             # half-pixel offset; default is False
)

feature_map = torch.rand(1, 256, 56, 56)
# rois: [batch_idx, x1, y1, x2, y2] in image coordinates
rois = torch.tensor([[0., 10., 10., 100., 100.]])
pooled = roi_align(feature_map, [rois[:, 1:]])  # Tensor[1, 256, 7, 7]

RoIAlign is differentiable and preferred over RoIPool for most use cases. The aligned parameter defaults to False; setting it to True shifts coordinates by half a pixel to better align with the feature grid, matching the Detectron2 convention (recommended for new models).
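The effect of aligned is easiest to see by mapping a box into feature-map coordinates by hand. The sketch below mirrors the coordinate arithmetic as I understand it from the RoIAlign kernels; `roi_to_feature_coords` is a hypothetical helper, and it only computes where sampling starts and how large the region is, not the bilinear sampling itself.

```python
def roi_to_feature_coords(roi_xyxy, spatial_scale, aligned):
    """Map an image-space XYXY box to (x1, y1, width, height) in feature-map units."""
    offset = 0.5 if aligned else 0.0
    x1, y1, x2, y2 = (c * spatial_scale - offset for c in roi_xyxy)
    width, height = x2 - x1, y2 - y1
    if not aligned:
        # Legacy behaviour: RoIs are forced to be at least 1x1 in feature space
        width, height = max(width, 1.0), max(height, 1.0)
    return x1, y1, width, height


roi = (10.0, 10.0, 100.0, 100.0)
print(roi_to_feature_coords(roi, 1.0 / 16, aligned=False))  # (0.625, 0.625, 5.625, 5.625)
print(roi_to_feature_coords(roi, 1.0 / 16, aligned=True))   # (0.125, 0.125, 5.625, 5.625)

# A tiny 4x4-pixel box: the unaligned path inflates it to a full feature cell
tiny = (0.0, 0.0, 4.0, 4.0)
print(roi_to_feature_coords(tiny, 1.0 / 16, aligned=False))  # (0.0, 0.0, 1.0, 1.0)
print(roi_to_feature_coords(tiny, 1.0 / 16, aligned=True))   # (-0.5, -0.5, 0.25, 0.25)
```

With output_size=(7, 7), each bin of the first box covers 5.625 / 7 ≈ 0.80 feature cells per side in both modes. Only the origin moves by half a cell.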

Other RoI Poolers

Class	Description
`RoIPool(output_size, spatial_scale)`	Hard max-pool; not differentiable w.r.t. coordinates
`PSRoIAlign(output_size, spatial_scale, sampling_ratio)`	Position-sensitive RoI align (R-FCN)
`PSRoIPool(output_size, spatial_scale)`	Position-sensitive max-pool
`MultiScaleRoIAlign(featmap_names, output_size, sampling_ratio)`	Assigns each RoI to a feature level by scale; used in Faster R-CNN + FPN

from torchvision.ops import MultiScaleRoIAlign

# Used inside Faster R-CNN / Mask R-CNN
roi_pooler = MultiScaleRoIAlign(
    featmap_names=["0", "1", "2", "3"],
    output_size=7,
    sampling_ratio=2,
)

Loss Functions

Focal Loss

sigmoid_focal_loss addresses class imbalance in dense detection by down-weighting well-classified examples:

from torchvision.ops import sigmoid_focal_loss
import torch

# Both tensors: Tensor[N, C], float
inputs = torch.randn(8, 80)   # raw logits
targets = torch.zeros(8, 80)
targets[0, 3] = 1.0           # class 3 positive

loss = sigmoid_focal_loss(
    inputs, targets,
    alpha=0.25,     # balances positive vs negative examples
    gamma=2.0,      # focusing parameter; 0 disables focusing (leaving alpha-weighted BCE)
    reduction="sum",
)
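The roles of alpha and gamma become clearer when the loss is written out. Below is a pure-PyTorch sketch of the sigmoid focal loss formula from the RetinaNet paper; `focal_loss_sketch` is a hypothetical name, it returns the unreduced per-element loss, and treating a negative alpha as "no class weighting" is an assumption chosen for this example.

```python
import torch
import torch.nn.functional as F


def focal_loss_sketch(logits, targets, alpha=0.25, gamma=2.0):
    """Per-element sigmoid focal loss (no reduction)."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # p_t is the predicted probability of the true class
    p_t = p * targets + (1 - p) * (1 - targets)
    # Well-classified examples (p_t close to 1) are scaled towards zero
    loss = ce * (1 - p_t) ** gamma
    if alpha >= 0:
        # alpha weights positives, (1 - alpha) weights negatives
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        loss = alpha_t * loss
    return loss


logits = torch.tensor([-5.0, 0.0, 5.0])   # confident-negative, unsure, confident-positive
targets = torch.tensor([0.0, 0.0, 0.0])   # all three are true negatives

bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
focal = focal_loss_sketch(logits, targets, alpha=0.25, gamma=2.0)
plain = focal_loss_sketch(logits, targets, alpha=-1, gamma=0.0)

print(bce)    # ordinary cross-entropy per element
print(focal)  # the easy negative (logit -5) is down-weighted far more than the hard one (logit 5)
```

With gamma=0 and class weighting disabled, the sketch reduces to plain binary cross-entropy, which is a useful sanity check when tuning these two parameters.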

IoU-Based Losses

These losses operate directly on box coordinates, providing smooth gradients even when boxes do not overlap:

from torchvision.ops import (
    generalized_box_iou_loss,
    distance_box_iou_loss,
    complete_box_iou_loss,
)

pred_boxes = torch.tensor([[10., 10., 50., 50.]])
gt_boxes   = torch.tensor([[15., 15., 55., 55.]])

giou_loss = generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
diou_loss = distance_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
ciou_loss = complete_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")

Loss function	Notes
`generalized_box_iou_loss`	GIoU loss; non-zero gradient for non-overlapping boxes
`distance_box_iou_loss`	DIoU loss; penalises centre-point distance
`complete_box_iou_loss`	CIoU loss; adds aspect-ratio consistency term

Layers and Modules

Deformable Convolution

DeformConv2d implements Deformable Convolutional Networks v2 (Zhu et al., 2019). It learns spatial sampling offsets and modulation masks that allow the receptive field to adapt to object shape.

from torchvision.ops import DeformConv2d, deform_conv2d
import torch

# Module form
dcn = DeformConv2d(
    in_channels=64,
    out_channels=64,
    kernel_size=3,
    stride=1,
    padding=1,
)

x = torch.rand(1, 64, 32, 32)
kH, kW = 3, 3

# Offset: Tensor[B, 2 * kH * kW, out_H, out_W]
offset = torch.rand(1, 2 * kH * kW, 32, 32)
# Mask (v2 only): Tensor[B, kH * kW, out_H, out_W]
mask = torch.sigmoid(torch.rand(1, kH * kW, 32, 32))

out = dcn(x, offset, mask)  # Tensor[1, 64, 32, 32]

offset and mask are typically produced by a small auxiliary convolutional head whose output spatial size matches that of the DCN output. When mask is None, the layer reverts to Deformable Conv v1 behaviour.

Feature Pyramid Network

FeaturePyramidNetwork takes a dict of multi-scale feature maps and produces same-channel FPN representations through lateral connections and top-down merging:

from torchvision.ops import FeaturePyramidNetwork
import torch
from collections import OrderedDict

fpn = FeaturePyramidNetwork(
    in_channels_list=[256, 512, 1024, 2048],  # backbone channel counts per level
    out_channels=256,
    extra_blocks=None,  # optionally append extra levels, e.g. via LastLevelMaxPool or LastLevelP6P7
    norm_layer=None,    # optional normalization layer for the lateral and output conv blocks
)

# Provide an OrderedDict of feature maps from your backbone
backbone_features = OrderedDict([
    ("layer1", torch.rand(1, 256, 56, 56)),
    ("layer2", torch.rand(1, 512, 28, 28)),
    ("layer3", torch.rand(1, 1024, 14, 14)),
    ("layer4", torch.rand(1, 2048, 7, 7)),
])

fpn_features = fpn(backbone_features)
# Each value: Tensor[1, 256, H, W]

Conv2dNormActivation / Conv3dNormActivation

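Convenience blocks that chain Conv → Norm → Activation with sensible defaults (the layers are stacked sequentially, not fused into one kernel). Used in TorchVision models such as EfficientNet and MobileNet (2D) and S3D (3D):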

from torchvision.ops import Conv2dNormActivation, Conv3dNormActivation
import torch.nn as nn

# 2D: Conv2d → BatchNorm2d → ReLU (defaults)
block_2d = Conv2dNormActivation(
    in_channels=32,
    out_channels=64,
    kernel_size=3,
    stride=2,
    norm_layer=nn.BatchNorm2d,       # default
    activation_layer=nn.ReLU,        # default
)

# 3D variant for video models: Conv3d → BatchNorm3d → ReLU
block_3d = Conv3dNormActivation(
    in_channels=3,
    out_channels=64,
    kernel_size=3,
    norm_layer=nn.BatchNorm3d,
)

SqueezeExcitation

Channel-wise attention from Hu et al., 2018:

from torchvision.ops import SqueezeExcitation
import torch

se = SqueezeExcitation(
    input_channels=64,
    squeeze_channels=16,   # bottleneck width (typically input_channels // 4)
)

x = torch.rand(1, 64, 14, 14)
out = se(x)  # same shape — applies channel-wise scaling
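What "channel-wise scaling" means in practice: the block pools each channel to a single number, passes the resulting vector through a two-layer bottleneck, and multiplies the input by the sigmoid of the result. The following pure-PyTorch module is a minimal sketch of that computation; `SESketch` is a hypothetical name, and ReLU plus sigmoid are the activations from the original paper.

```python
import torch
from torch import nn


class SESketch(nn.Module):
    """Squeeze (global average pool) then excite (bottleneck MLP + sigmoid gate)."""

    def __init__(self, input_channels, squeeze_channels):
        super().__init__()
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        # 1x1 convolutions act as fully connected layers on the pooled vector
        self.fc1 = nn.Conv2d(input_channels, squeeze_channels, kernel_size=1)
        self.fc2 = nn.Conv2d(squeeze_channels, input_channels, kernel_size=1)

    def scale(self, x):
        s = self.avgpool(x)                # [B, C, 1, 1]
        s = torch.relu(self.fc1(s))        # [B, C_squeeze, 1, 1]
        return torch.sigmoid(self.fc2(s))  # [B, C, 1, 1], each value in (0, 1)

    def forward(self, x):
        return self.scale(x) * x           # broadcast over H and W


se_sketch = SESketch(input_channels=64, squeeze_channels=16)
feat = torch.rand(1, 64, 14, 14)
gate = se_sketch.scale(feat)
gated = se_sketch(feat)
print(gate.shape, gated.shape)  # torch.Size([1, 64, 1, 1]) torch.Size([1, 64, 14, 14])
```

Each channel receives a single gate value shared by all spatial positions, so the block re-weights channels without changing the spatial layout.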

MLP

Multi-layer perceptron module used in transformer architectures (Swin, MViT):

from torchvision.ops import MLP
import torch

mlp = MLP(
    in_channels=768,
    hidden_channels=[3072, 768],  # list of layer widths
    dropout=0.1,
)

x = torch.rand(1, 196, 768)
out = mlp(x)  # Tensor[1, 196, 768]
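For orientation, the configuration above corresponds roughly to the following plain nn.Sequential. This is a sketch of the layer ordering as I understand it (Linear → activation → Dropout for every hidden width except the last, then a final Linear → Dropout) with the default ReLU activation; it is not a copy of the library code.

```python
import torch
from torch import nn

# Hand-written counterpart of MLP(768, [3072, 768], dropout=0.1) with default settings
mlp_equivalent = nn.Sequential(
    nn.Linear(768, 3072),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(3072, 768),   # last entry of hidden_channels is the output width
    nn.Dropout(0.1),
)

tokens = torch.rand(1, 196, 768)
projected = mlp_equivalent(tokens)   # Linear layers act on the last dimension only
print(projected.shape)  # torch.Size([1, 196, 768])
```

Transformer blocks commonly swap ReLU for GELU, which MLP supports through its activation_layer argument.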

FrozenBatchNorm2d

BatchNorm with permanently frozen running statistics and affine parameters. Used in detection backbones where fine-tuning BN stats is undesirable:

from torchvision.ops import FrozenBatchNorm2d
import torch

fbn = FrozenBatchNorm2d(num_features=64)
x = torch.rand(2, 64, 14, 14)
out = fbn(x)  # statistics never updated during training

Permute

Lightweight dimension-permutation module (used in Swin Transformer to switch between channels-first and channels-last layouts):

from torchvision.ops import Permute
import torch

perm = Permute(dims=[0, 2, 3, 1])  # NCHW → NHWC
x = torch.rand(1, 64, 14, 14)
out = perm(x)  # Tensor[1, 14, 14, 64]

Regularization

Stochastic Depth (Drop Path)

StochasticDepth randomly drops entire residual branches during training. It is used as a regularizer in architectures such as Swin Transformer, ConvNeXt, and MViT:

from torchvision.ops import StochasticDepth, stochastic_depth
import torch

# Module form: apply it to the residual branch output before adding the skip connection
sd = StochasticDepth(p=0.1, mode="row")

residual = torch.rand(4, 64, 14, 14)
out = sd(residual)  # in training mode, each sample is zeroed with probability p

# Functional form
out = stochastic_depth(residual, p=0.1, mode="batch", training=True)

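mode="row" drops each sample in the batch independently; mode="batch" drops the whole batch at once with probability p. Surviving activations are scaled by 1 / (1 - p) so the expected value is unchanged, and the layer is an identity in evaluation mode.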
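The two modes differ only in the shape of the random keep-mask. The following pure-PyTorch sketch shows the computation under that reading; `stochastic_depth_sketch` is a hypothetical name, and rescaling survivors by 1 / (1 - p) follows the usual drop-path convention.

```python
import torch


def stochastic_depth_sketch(x, p, mode, training=True):
    """Zero whole samples ("row") or the whole batch ("batch") with probability p."""
    if not training or p == 0.0:
        return x
    survival = 1.0 - p
    if mode == "row":
        shape = [x.shape[0]] + [1] * (x.ndim - 1)  # one coin flip per sample
    else:
        shape = [1] * x.ndim                        # one coin flip for the batch
    keep = torch.empty(shape, dtype=x.dtype, device=x.device).bernoulli_(survival)
    if survival > 0.0:
        keep = keep / survival                      # keep the expected value unchanged
    return x * keep


torch.manual_seed(0)
branch_out = torch.ones(8, 3, 4, 4)

dropped = stochastic_depth_sketch(branch_out, p=0.5, mode="row")
per_sample = dropped.flatten(1)
print(per_sample[:, 0])  # every entry is either 0.0 (dropped) or 2.0 (kept and rescaled)

# Typical placement inside a residual block (branch is any sub-network):
#   y = x + stochastic_depth_sketch(branch(x), p, "row", self.training)
```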

DropBlock

Drops contiguous spatial blocks of activations. Because neighbouring activations in a convolutional feature map are strongly correlated, removing whole blocks regularizes more effectively than dropping isolated units:

from torchvision.ops import DropBlock2d, DropBlock3d
import torch

# 2D — for image feature maps
db2d = DropBlock2d(p=0.1, block_size=7)
x2d = torch.rand(2, 64, 28, 28)
out2d = db2d(x2d)

# 3D — for video feature maps
db3d = DropBlock3d(p=0.1, block_size=5)
x3d = torch.rand(2, 64, 8, 14, 14)
out3d = db3d(x3d)
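The block structure can be produced with a small trick: sample sparse "seed" positions, then grow each seed into a square with a max-pool. The pure-PyTorch sketch below follows the procedure from the DropBlock paper for odd block sizes; `drop_block2d_sketch` is a hypothetical name, and the exact seed-rate formula and normalisation are assumptions that may differ in detail from any particular library implementation.

```python
import torch
import torch.nn.functional as F


def drop_block2d_sketch(x, p, block_size, eps=1e-6):
    """Return (output, keep_mask) for a [B, C, H, W] input; block_size must be odd."""
    assert block_size % 2 == 1, "this sketch only handles odd block sizes"
    n, c, h, w = x.shape
    valid_h, valid_w = h - block_size + 1, w - block_size + 1

    # Seed rate chosen so that roughly a fraction p of activations ends up dropped
    gamma = (p * h * w) / (block_size ** 2 * valid_h * valid_w)
    seeds = torch.bernoulli(
        torch.full((n, c, valid_h, valid_w), gamma, dtype=x.dtype, device=x.device)
    )

    # Pad back to H x W, then grow every seed into a block_size x block_size square
    seeds = F.pad(seeds, [block_size // 2] * 4, value=0.0)
    dropped = F.max_pool2d(seeds, kernel_size=block_size, stride=1, padding=block_size // 2)
    keep_mask = 1.0 - dropped

    # Rescale so the average activation magnitude is preserved
    scale = keep_mask.numel() / (eps + keep_mask.sum())
    return x * keep_mask * scale, keep_mask


torch.manual_seed(0)
features = torch.rand(2, 64, 28, 28)
out, keep_mask = drop_block2d_sketch(features, p=0.1, block_size=7)
print(out.shape)                      # torch.Size([2, 64, 28, 28])
print(1.0 - keep_mask.mean().item())  # fraction of activations dropped, roughly p
```

Seeds are sampled only where a full block fits inside the feature map, which is why the seed grid is smaller than the input before padding.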

NMS + Box Conversion Example

import torch
from torchvision.ops import nms, box_convert

boxes = torch.tensor([
    [100., 50., 300., 250.],
    [110., 55., 310., 260.],  # overlapping box
    [400., 100., 600., 350.],
], dtype=torch.float32)
scores = torch.tensor([0.9, 0.75, 0.85])

keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]) — indices of kept boxes

# Convert box formats (XYXY → XYWH)
xywh_boxes = box_convert(boxes, in_fmt="xyxy", out_fmt="xywh")
print(xywh_boxes)
# tensor([[100.,  50., 200., 200.],
#         [110.,  55., 200., 205.],
#         [400., 100., 200., 250.]])

RoIAlign Example

from torchvision.ops import RoIAlign
import torch

roi_align = RoIAlign(
    output_size=(7, 7),
    spatial_scale=1.0 / 16,  # 1 / feature-map stride
    sampling_ratio=-1,
    aligned=True,             # default is False; set True for Detectron2-style alignment
)

feature_map = torch.rand(1, 256, 56, 56)
rois = torch.tensor([[0., 10., 10., 100., 100.]])  # [batch_idx, x1, y1, x2, y2]
pooled = roi_align(feature_map, [rois[:, 1:]])  # Tensor[1, 256, 7, 7]
print(pooled.shape)  # torch.Size([1, 256, 7, 7])

Quick Reference

Box Operations

nms, batched_nms, box_iou, generalized_box_iou, distance_box_iou, complete_box_iou, box_area, box_convert, clip_boxes_to_image, remove_small_boxes, masks_to_boxes

RoI Pooling

RoIAlign, roi_align, RoIPool, roi_pool, PSRoIAlign, ps_roi_align, PSRoIPool, ps_roi_pool, MultiScaleRoIAlign

Loss Functions

sigmoid_focal_loss, generalized_box_iou_loss, distance_box_iou_loss, complete_box_iou_loss

Layers

DeformConv2d, deform_conv2d, FeaturePyramidNetwork, Conv2dNormActivation, Conv3dNormActivation, SqueezeExcitation, MLP, FrozenBatchNorm2d, Permute

Regularization

StochasticDepth, stochastic_depth, DropBlock2d, DropBlock3d, drop_block2d, drop_block3d
