Video Classification Models: R3D, MViT, Video Swin

TorchVision ships a suite of pretrained video classification models. All of them are trained on the Kinetics-400 benchmark (400 action categories) and accept input tensors of shape [B, C, T, H, W]: batch, channels, time (frames), height, width. Each model family reflects a different architectural philosophy, from efficient 3D CNNs to multiscale transformers.

All video models in torchvision.models.video are in beta status. Weights default to KINETICS400_V1 unless otherwise noted.

Input Format

Video models expect clips as float tensors with shape Tensor[B, C, T, H, W]:

Dimension	Meaning	Notes
`B`	Batch size	≥ 1
`C`	Channels (RGB)	3
`T`	Time / frames	16 for R3D/MC3/R(2+1)D/MViT; 32 for Swin3D
`H`, `W`	Spatial dimensions	112 × 112 for R3D family; 224 × 224 for Swin3D, MViT, S3D

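Each model’s weights.transforms() returns a VideoClassification callable that resizes, center-crops, rescales to float, and normalizes the frames. It takes frames laid out as [T, C, H, W] (or [B, T, C, H, W]) and returns the [C, T, H, W] (or [B, C, T, H, W]) layout the models expect, so always apply it before running inference.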

3D ResNet Family

These three models share the same VideoResNet backbone but differ in how they handle the temporal dimension. All are 18-layer networks trained on Kinetics-400; the reported accuracies are estimated at clip level with frame_rate=15, clips_per_video=5, and clip_len=16.
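To make the clip parameters concrete, here is a small sketch of cutting a decoded video into clips_per_video evenly spaced clips of clip_len frames. The helper sample_clips is hypothetical and is not torchvision's reference sampler; it only assumes the video is already decoded (and resampled to the desired frame rate) as a [T, C, H, W] tensor.

```python
import torch

def sample_clips(frames: torch.Tensor, clip_len: int = 16, clips_per_video: int = 5) -> torch.Tensor:
    """Return Tensor[clips_per_video, clip_len, C, H, W] of evenly spaced clips.

    Hypothetical helper for illustration only.
    """
    num_frames = frames.shape[0]
    if num_frames < clip_len:
        raise ValueError("video is shorter than one clip")
    # Evenly spaced start indices between the first and last valid start.
    starts = torch.linspace(0, num_frames - clip_len, clips_per_video).round().long()
    return torch.stack([frames[s : s + clip_len] for s in starts.tolist()])

# Stand-in for a 10-second video at 15 fps (150 frames).
video = torch.randint(0, 256, (150, 3, 128, 171), dtype=torch.uint8)
clips = sample_clips(video)
print(clips.shape)  # torch.Size([5, 16, 3, 128, 171])
```

The result already has the [B, T, C, H, W] layout that weights.transforms() accepts, so the five clips can be preprocessed and classified as one batch.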

R3D-18

Full 3D convolutions (3×3×3) in every layer. Strongest spatiotemporal coupling; highest parameter count of the three.

MC3-18

Mixed convolution: 3D in the first layer, 2D (spatial-only) in layers 2–4. Balances accuracy and parameter efficiency.

R(2+1)D-18

Factorised convolutions: each 3D conv split into a spatial 2D conv followed by a temporal 1D conv. Best top-1 accuracy of the trio.
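The factorisation can be sketched with plain torch.nn layers. This is an illustration rather than torchvision's own block: the channel sizes are arbitrary, and the rule for the intermediate width mid_ch (chosen so the convolution weights match the full 3D layer) is assumed to follow the parameter-matching heuristic from the R(2+1)D paper.

```python
import torch
from torch import nn

in_ch, out_ch = 64, 64  # arbitrary sizes for this sketch

# Full 3D convolution (3x3x3), the R3D-style building block.
full_3d = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)

# Intermediate width that keeps the number of conv weights comparable
# to the full 3D layer (assumed parameter-matching rule).
mid_ch = (in_ch * out_ch * 3 * 3 * 3) // (in_ch * 3 * 3 + 3 * out_ch)

# (2+1)D factorisation: spatial 1x3x3 conv, then temporal 3x1x1 conv.
factorised = nn.Sequential(
    nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1), bias=False),
    nn.BatchNorm3d(mid_ch),
    nn.ReLU(inplace=True),
    nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0), bias=False),
)

def conv_weights(module: nn.Module) -> int:
    """Count weights in Conv3d layers only (ignores BatchNorm)."""
    convs = [m for m in module.modules() if isinstance(m, nn.Conv3d)]
    return sum(m.weight.numel() for m in convs)

x = torch.rand(1, in_ch, 8, 28, 28)  # [B, C, T, H, W]
print(full_3d(x).shape, factorised(x).shape)
print(conv_weights(full_3d), conv_weights(factorised))
```

Both paths produce the same output shape; the factorised path adds an extra nonlinearity between the spatial and temporal steps.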

Accuracy on Kinetics-400

Model	Params	Top-1	Top-5
`r3d_18`	33.4 M	63.2%	83.5%
`mc3_18`	11.7 M	64.0%	84.1%
`r2plus1d_18`	31.5 M	67.5%	86.2%

Quick Inference

import torch
from torchvision.models.video import r3d_18, R3D_18_Weights

weights = R3D_18_Weights.DEFAULT
model = r3d_18(weights=weights)
model.eval()

preprocess = weights.transforms()

# Stand-in for 16 decoded RGB frames: uint8 Tensor[T, C, H, W]
frames = torch.randint(0, 256, (16, 3, 128, 171), dtype=torch.uint8)

# preprocess returns Tensor[C, T, H, W]; the model needs a leading batch dimension
video_clip = preprocess(frames).unsqueeze(0)  # [1, 3, 16, 112, 112]

with torch.no_grad():
    output = model(video_clip)

predictions = output.softmax(dim=1)
class_id = predictions.argmax(dim=1).item()
print(weights.meta["categories"][class_id])

Switching Variants

from torchvision.models.video import (
    mc3_18, MC3_18_Weights,
    r2plus1d_18, R2Plus1D_18_Weights,
)

# Mixed Convolution: fewest parameters of the three
mc3 = mc3_18(weights=MC3_18_Weights.DEFAULT)
mc3.eval()

# R(2+1)D: highest accuracy of the three
r2p1d = r2plus1d_18(weights=R2Plus1D_18_Weights.DEFAULT)
r2p1d.eval()

MViT — Multiscale Vision Transformers

MViT processes video using a hierarchical attention mechanism that progressively increases channel capacity while reducing spatiotemporal resolution across stages. This multiscale design is more parameter-efficient than isotropic ViTs for video.
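The trade between resolution and channel width can be illustrated with a toy schedule. Everything here is an assumption for illustration: the helper multiscale_schedule is hypothetical, and the starting token grid, channel width, and number of stages are chosen only to resemble a 16-frame, 224 × 224 clip after a patchifying stem. It does not read anything from torchvision's MViT implementation.

```python
def multiscale_schedule(grid=(8, 56, 56), channels=96, num_stages=4):
    """Toy schedule: halve the spatial grid and double the channels per stage.

    Hypothetical helper; the default numbers are illustrative assumptions.
    """
    t, h, w = grid
    stages = []
    for stage in range(1, num_stages + 1):
        stages.append({
            "stage": stage,
            "grid": (t, h, w),
            "channels": channels,
            "tokens": t * h * w,
        })
        # Transition to the next stage: fewer tokens, wider features.
        h, w, channels = h // 2, w // 2, channels * 2
    return stages

schedule = multiscale_schedule()
for s in schedule:
    print(s)
```

Because self-attention cost grows with the square of the token count, shrinking the grid in later stages is what lets the wider layers stay affordable.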

MViT V1-B

Base model from Fan et al., 2021. 36.6 M parameters. Top-1: 78.5% on Kinetics-400.

MViT V2-S

Improved version with decomposed relative position embeddings and residual pooling connections (Li et al., 2021). 34.5 M parameters. Top-1: 80.8%.

Accuracy on Kinetics-400

Model	Params	Top-1	Top-5
`mvit_v1_b`	36.6 M	78.5%	93.6%
`mvit_v2_s`	34.5 M	80.8%	94.7%

from torchvision.models.video import (
    mvit_v1_b, MViT_V1_B_Weights,
    mvit_v2_s, MViT_V2_S_Weights,
)

# MViT V1
model_v1 = mvit_v1_b(weights=MViT_V1_B_Weights.DEFAULT)
model_v1.eval()

# MViT V2: improved accuracy
model_v2 = mvit_v2_s(weights=MViT_V2_S_Weights.DEFAULT)
model_v2.eval()

Video Swin Transformer

The Video Swin Transformer (Liu et al., 2021) extends the Swin Transformer to 3D by using shifted 3D windows for local spatiotemporal attention. It comes in three sizes trained on Kinetics-400, with the base model also available with ImageNet-22K pretraining for higher accuracy.
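The windowing idea can be sketched in a few lines of tensor manipulation. This is a simplified illustration, not torchvision's code: the helper window_partition_3d is hypothetical, the token grid and window size are assumed values, and the attention itself (including the masking that shifted windows require at the borders) is omitted.

```python
import torch

def window_partition_3d(x: torch.Tensor, window: tuple) -> torch.Tensor:
    """Split Tensor[B, T, H, W, C] into non-overlapping 3D windows.

    Returns Tensor[B * num_windows, wt * wh * ww, C]. Hypothetical helper.
    """
    b, t, h, w, c = x.shape
    wt, wh, ww = window
    x = x.view(b, t // wt, wt, h // wh, wh, w // ww, ww, c)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).contiguous()
    return x.view(-1, wt * wh * ww, c)

# Illustrative sizes (assumptions for this sketch).
tokens = torch.rand(1, 16, 56, 56, 8)      # [B, T, H, W, C]
window = (8, 7, 7)
shift = tuple(s // 2 for s in window)       # half a window: (4, 3, 3)

# Regular windows: attention would run independently inside each one.
regular = window_partition_3d(tokens, window)

# Shifted windows: cyclically roll the grid by half a window so the new
# windows straddle the boundaries of the regular ones.
rolled = torch.roll(tokens, shifts=tuple(-s for s in shift), dims=(1, 2, 3))
shifted = window_partition_3d(rolled, window)

print(regular.shape, shifted.shape)
```

Alternating the two partitions across layers lets information flow between neighbouring windows while each attention call still sees only a small, fixed number of tokens.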

Accuracy on Kinetics-400

Model	Params	Top-1	Top-5	Notes
`swin3d_t`	28.2 M	77.7%	93.5%	Tiny
`swin3d_s`	49.8 M	79.5%	94.2%	Small
`swin3d_b`	88.0 M	79.4%	94.4%	Base (K400 only)
`swin3d_b` + IN22K	88.0 M	81.6%	95.6%	`KINETICS400_IMAGENET22K_V1`

from torchvision.models.video import (
    swin3d_t, Swin3D_T_Weights,
    swin3d_s, Swin3D_S_Weights,
    swin3d_b, Swin3D_B_Weights,
)

# Tiny: fastest
model_t = swin3d_t(weights=Swin3D_T_Weights.DEFAULT)

# Small: balanced
model_s = swin3d_s(weights=Swin3D_S_Weights.DEFAULT)

# Base: highest accuracy with ImageNet-22K pretraining
model_b = swin3d_b(weights=Swin3D_B_Weights.KINETICS400_IMAGENET22K_V1)

# Switch every model to inference mode before evaluating it
for m in (model_t, model_s, model_b):
    m.eval()

S3D — Separable 3D CNN

S3D (Xie et al., 2018) factorizes 3D convolutions into separate spatial and temporal convolutions for speed while retaining good accuracy. It is the most compact model in the video zoo at only 8.3 M parameters.

Model	Params	Top-1	Top-5
`s3d`	8.3 M	68.4%	88.1%

from torchvision.models.video import s3d, S3D_Weights

model = s3d(weights=S3D_Weights.DEFAULT)
model.eval()

preprocess = S3D_Weights.DEFAULT.transforms()

Model Comparison

Efficiency-first

Use S3D (8.3 M) or MC3-18 (11.7 M) when inference speed or memory is constrained.

Accuracy-first

Use MViT V2-S or Swin3D-B (ImageNet-22K) when top-1 accuracy is the priority.

Classic baseline

Use R(2+1)D-18 as a well-established 3D CNN baseline with broad literature support.

Architecture research

MViT V1-B is useful when studying multiscale attention ablations or reproducing the original paper.

Full Inference Pipeline

Load weights and model

from torchvision.models.video import r3d_18, R3D_18_Weights

weights = R3D_18_Weights.DEFAULT
model = r3d_18(weights=weights)
model.eval()

Build the preprocessing transform

preprocess = weights.transforms()
# VideoClassification: crop_size=(112, 112), resize_size=(128, 171)

Prepare input clip

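import torch

# Stand-in for 16 decoded RGB frames: uint8 Tensor[T, C, H, W]
frames = torch.randint(0, 256, (16, 3, 128, 171), dtype=torch.uint8)

# preprocess returns Tensor[C, T, H, W]; add the batch dimension
video_clip = preprocess(frames).unsqueeze(0)  # Tensor[B, C, T, H, W] = [1, 3, 16, 112, 112]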

Run inference

with torch.no_grad():
    logits = model(video_clip)

predictions = logits.softmax(dim=1)
class_id = predictions.argmax(dim=1).item()
label = weights.meta["categories"][class_id]
print(f"Predicted action: {label}")

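Normalization statistics are tied to each set of weights and are not the same across model families; the R3D family, for example, uses its own mean/std rather than the ImageNet values. Always call weights.transforms() rather than building your own normalization to avoid mismatches.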
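The last step of the pipeline reports only the single most likely class. A small sketch of a top-k readout follows; the helper top_k_labels is hypothetical, and the logits and label names are synthetic stand-ins so the snippet runs without a model. In practice the logits would come from the model and the labels from weights.meta["categories"].

```python
import torch

def top_k_labels(logits: torch.Tensor, categories: list, k: int = 5) -> list:
    """Return [(label, probability), ...] for the k most likely classes of the first clip.

    Hypothetical helper for illustration.
    """
    probs = logits.softmax(dim=1)[0]
    values, indices = probs.topk(k)
    return [(categories[i], p) for i, p in zip(indices.tolist(), values.tolist())]

# Synthetic stand-ins (assumptions): 400 placeholder labels and hand-set logits.
categories = [f"action_{i}" for i in range(400)]
logits = torch.zeros(1, 400)
logits[0, 7] = 5.0
logits[0, 42] = 3.0

results = top_k_labels(logits, categories, k=2)
for label, prob in results:
    print(f"{label}: {prob:.3f}")
```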
