
TorchVision ships a suite of pretrained video classification models that operate on short video clips rather than single frames. All models are trained on the Kinetics-400 benchmark (400 action categories) and accept input tensors of shape [B, C, T, H, W]: batch, channels, time (frames), height, width. Each model family reflects a different architectural approach, from efficient 3D CNNs to multiscale transformers.
All video models in torchvision.models.video are in beta status. Weights default to KINETICS400_V1 unless otherwise noted.

Input Format

Video models expect clips as float tensors with shape Tensor[B, C, T, H, W]:
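| Dimension | Meaning | Notes |
| --- | --- | --- |
| B | Batch size | ≥ 1 |
| C | Channels (RGB) | 3 |
| T | Time / frames | 16 for R3D/MC3/R(2+1)D/MViT; 32 for Swin3D; 128 for S3D (clip length used for its published accuracy) |
| H, W | Spatial dimensions | 112 × 112 for R3D family; 224 × 224 for Swin3D, MViT, S3D |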
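Each model’s weights.transforms() returns a VideoClassification callable that resizes, center-crops, rescales, and normalizes the frames. It accepts frames as Tensor[T, C, H, W] or Tensor[B, T, C, H, W] and returns them permuted to [C, T, H, W] or [B, C, T, H, W], so always apply it before running inference.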

3D ResNet Family

These three models share the same VideoResNet backbone but differ in how they handle the temporal dimension. All are 18-layer networks trained on Kinetics-400; the published accuracies are video-level estimates computed with frame_rate=15, clips_per_video=5, and clip_len=16.

R3D-18

Full 3D convolutions (3×3×3) in every layer. Strongest spatiotemporal coupling; highest parameter count of the three.

MC3-18

Mixed convolution: 3D in the first layer, 2D (spatial-only) in layers 2–4. Balances accuracy and parameter efficiency.

R(2+1)D-18

Factorised convolutions: each 3D conv split into a spatial 2D conv followed by a temporal 1D conv. Best top-1 accuracy of the trio.
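
The three convolution styles described above can be compared side by side with plain `torch.nn` layers. This is an illustrative sketch, not a copy of torchvision's internal modules; the channel width of 64 and the intermediate-width formula (chosen so the factorised pair roughly matches the parameter count of the 3×3×3 convolution) are assumptions for the example.

```python
import torch
from torch import nn

in_ch, out_ch = 64, 64

# Full 3D convolution (R3D style)
full_3d = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)

# Intermediate width that keeps the factorised pair close in size to full_3d
mid_ch = (in_ch * out_ch * 3 * 3 * 3) // (in_ch * 3 * 3 + 3 * out_ch)

# (2+1)D factorisation: spatial 1x3x3 conv followed by temporal 3x1x1 conv
factorised = nn.Sequential(
    nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1), bias=False),
    nn.BatchNorm3d(mid_ch),
    nn.ReLU(inplace=True),
    nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0), bias=False),
)

# Spatial-only convolution (later MC3 stages): no mixing across frames
spatial_only = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1), bias=False)

x = torch.rand(1, in_ch, 8, 28, 28)  # Tensor[B, C, T, H, W]
layers = {"3D": full_3d, "(2+1)D": factorised, "spatial-only": spatial_only}
conv_params = {}
for name, layer in layers.items():
    # Count only 5-D convolution kernels, ignoring BatchNorm parameters
    conv_params[name] = sum(p.numel() for p in layer.parameters() if p.dim() == 5)
    print(name, tuple(layer(x).shape), conv_params[name])
```

All three layers preserve the input shape, so they are interchangeable inside a residual block; they differ only in how much temporal mixing they perform and how many weights they carry.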

Accuracy on Kinetics-400

| Model | Params | Top-1 | Top-5 |
| --- | --- | --- | --- |
| r3d_18 | 33.4 M | 63.2% | 83.5% |
| mc3_18 | 11.7 M | 64.0% | 84.1% |
| r2plus1d_18 | 31.5 M | 67.5% | 86.2% |
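
The clips_per_video=5 and clip_len=16 settings quoted for this family imply that a video's prediction aggregates several clips. A minimal sketch of that idea follows, assuming evenly spaced clips and averaged softmax scores; the helper names and the toy classifier are hypothetical, and the reference evaluation script may sample clips differently.

```python
import torch
from torch import nn


def sample_clip_starts(num_frames: int, clip_len: int, clips_per_video: int) -> list:
    """Evenly spaced start indices for clips_per_video clips of clip_len frames."""
    last_start = max(num_frames - clip_len, 0)
    if clips_per_video == 1:
        return [last_start // 2]
    step = last_start / (clips_per_video - 1)
    return [round(i * step) for i in range(clips_per_video)]


@torch.no_grad()
def video_level_probs(model: nn.Module, video: torch.Tensor,
                      clip_len: int = 16, clips_per_video: int = 5) -> torch.Tensor:
    """Average per-clip softmax scores over a preprocessed video Tensor[C, T, H, W]."""
    starts = sample_clip_starts(video.shape[1], clip_len, clips_per_video)
    clips = torch.stack([video[:, s:s + clip_len] for s in starts])  # [N, C, clip_len, H, W]
    return model(clips).softmax(dim=1).mean(dim=0)


# Hypothetical stand-in classifier so the sketch runs without downloading weights
toy_model = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(3, 400)).eval()

video = torch.rand(3, 80, 112, 112)  # 80 preprocessed frames
probs = video_level_probs(toy_model, video)
print(tuple(probs.shape), float(probs.sum()))
```

In practice `toy_model` would be replaced by one of the pretrained networks, and `video` by the output of the weights' preprocessing transform.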

Quick Inference

import torch
from torchvision.models.video import r3d_18, R3D_18_Weights

weights = R3D_18_Weights.DEFAULT
model = r3d_18(weights=weights)
model.eval()

preprocess = weights.transforms()

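# Stand-in for decoded frames: uint8 Tensor[T, C, H, W]
frames = torch.randint(0, 256, (16, 3, 128, 171), dtype=torch.uint8)
# The transform resizes, crops, normalizes, and returns Tensor[C, T, H, W]
video_clip = preprocess(frames).unsqueeze(0)  # add batch dim: [1, 3, 16, 112, 112]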

with torch.no_grad():
    output = model(video_clip)

predictions = output.softmax(dim=1)
class_id = predictions.argmax(dim=1).item()
print(weights.meta["categories"][class_id])

Switching Variants

from torchvision.models.video import (
    mc3_18, MC3_18_Weights,
    r2plus1d_18, R2Plus1D_18_Weights,
)

# Mixed Convolution: fewest parameters of the three
mc3 = mc3_18(weights=MC3_18_Weights.DEFAULT)

# R(2+1)D: highest accuracy of the three
r2p1d = r2plus1d_18(weights=R2Plus1D_18_Weights.DEFAULT)

MViT — Multiscale Vision Transformers

MViT processes video using a hierarchical attention mechanism that progressively increases channel capacity while reducing spatiotemporal resolution across stages. This multiscale design is more parameter-efficient than isotropic ViTs for video.

MViT V1-B

Base model from Fan et al., 2021. 36.6 M parameters. Top-1: 78.5% on Kinetics-400.

MViT V2-S

Improved version with decomposed relative position embeddings and residual pooling connections (Li et al., 2021). 34.5 M parameters. Top-1: 80.8%.

Accuracy on Kinetics-400

| Model | Params | Top-1 | Top-5 |
| --- | --- | --- | --- |
| mvit_v1_b | 36.6 M | 78.5% | 93.6% |
| mvit_v2_s | 34.5 M | 80.8% | 94.7% |

from torchvision.models.video import mvit_v1_b, MViT_V1_B_Weights
from torchvision.models.video import mvit_v2_s, MViT_V2_S_Weights

# MViT V1
model_v1 = mvit_v1_b(weights=MViT_V1_B_Weights.DEFAULT)

# MViT V2 — improved accuracy
model_v2 = mvit_v2_s(weights=MViT_V2_S_Weights.DEFAULT)
model_v2.eval()

Video Swin Transformer

The Video Swin Transformer (Liu et al., 2021) extends the Swin Transformer to 3D by using shifted 3D windows for local spatiotemporal attention. It comes in three sizes trained on Kinetics-400, with the base model also available with ImageNet-22K pretraining for higher accuracy.
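
The shifted-window idea can be illustrated with tensor reshapes alone. The sketch below assumes a token grid of 16 × 56 × 56 and a window of (8, 7, 7); the helper name is hypothetical, and a real implementation additionally masks attention between tokens that only became neighbours through the cyclic roll.

```python
import torch


def window_partition_3d(x: torch.Tensor, window: tuple) -> torch.Tensor:
    """Split Tensor[B, T, H, W, C] into windows of shape [B * num_windows, wt * wh * ww, C]."""
    b, t, h, w, c = x.shape
    wt, wh, ww = window
    x = x.view(b, t // wt, wt, h // wh, wh, w // ww, ww, c)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).contiguous()
    return x.view(-1, wt * wh * ww, c)


window = (8, 7, 7)                     # assumed (T, H, W) window size
tokens = torch.rand(1, 16, 56, 56, 4)  # assumed token grid; C = 4 is arbitrary

# Regular windows: attention is computed independently inside each one
regular = window_partition_3d(tokens, window)

# Shifted windows: roll the grid by half a window so the new windows
# straddle the boundaries of the regular ones
shift = tuple(-(s // 2) for s in window)
shifted = window_partition_3d(torch.roll(tokens, shifts=shift, dims=(1, 2, 3)), window)

print(tuple(regular.shape), tuple(shifted.shape))
```

Alternating the two partitions across consecutive blocks lets information flow between neighbouring windows while keeping each attention call local.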

Accuracy on Kinetics-400

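| Model | Params | Top-1 | Top-5 | Notes |
| --- | --- | --- | --- | --- |
| swin3d_t | 28.2 M | 77.7% | 93.5% | Tiny |
| swin3d_s | 49.8 M | 79.5% | 94.2% | Small |
| swin3d_b | 88.0 M | 79.4% | 94.4% | Base (K400 only) |
| swin3d_b + IN22K | 88.0 M | 81.6% | 95.6% | KINETICS400_IMAGENET22K_V1 |

from torchvision.models.video import (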
    swin3d_t, Swin3D_T_Weights,
    swin3d_s, Swin3D_S_Weights,
    swin3d_b, Swin3D_B_Weights,
)

# Tiny — fastest
model_t = swin3d_t(weights=Swin3D_T_Weights.DEFAULT)

# Small — balanced
model_s = swin3d_s(weights=Swin3D_S_Weights.DEFAULT)

# Base — highest accuracy with ImageNet-22K pretraining
model_b = swin3d_b(weights=Swin3D_B_Weights.KINETICS400_IMAGENET22K_V1)
model_b.eval()

S3D — Separable 3D CNN

S3D (Xie et al., 2018) factorizes 3D convolutions into separate spatial and temporal convolutions for speed while retaining good accuracy. It is the most compact model in the video zoo at only 8.3 M parameters.

| Model | Params | Top-1 | Top-5 |
| --- | --- | --- | --- |
| s3d | 8.3 M | 68.4% | 88.1% |

from torchvision.models.video import s3d, S3D_Weights

model = s3d(weights=S3D_Weights.DEFAULT)
model.eval()

preprocess = S3D_Weights.DEFAULT.transforms()

Model Comparison

Efficiency-first

Use S3D (8.3 M) or MC3-18 (11.7 M) when inference speed or memory is constrained.

Accuracy-first

Use MViT V2-S or Swin3D-B (ImageNet-22K) when top-1 accuracy is the priority.

Classic baseline

Use R(2+1)D-18 as a well-established 3D CNN baseline with broad literature support.

Architecture research

MViT V1-B is useful when studying multiscale attention ablations or reproducing the original paper.

Full Inference Pipeline

1. Load weights and model

from torchvision.models.video import r3d_18, R3D_18_Weights

weights = R3D_18_Weights.DEFAULT
model = r3d_18(weights=weights)
model.eval()

2. Build the preprocessing transform

preprocess = weights.transforms()
# VideoClassification: crop_size=(112, 112), resize_size=(128, 171)

3. Prepare input clip

import torch

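# Stand-in for decoded frames: uint8 Tensor[T, C, H, W]
frames = torch.randint(0, 256, (16, 3, 128, 171), dtype=torch.uint8)
# preprocess returns Tensor[C, T, H, W]; unsqueeze adds the batch dimension
video_clip = preprocess(frames).unsqueeze(0)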

4. Run inference

with torch.no_grad():
    logits = model(video_clip)

predictions = logits.softmax(dim=1)
class_id = predictions.argmax(dim=1).item()
label = weights.meta["categories"][class_id]
print(f"Predicted action: {label}")
Normalization statistics are specific to each set of weights and are not always the ImageNet mean/std (the R3D family, for example, uses Kinetics-derived values). Always call weights.transforms() rather than building your own normalization to avoid mismatches.
