TorchVision ships a full suite of pretrained video classification models that operate directly on raw video clips. All models are trained on the Kinetics-400 benchmark (400 action categories) and accept input tensors of shape [B, C, T, H, W]: batch, channels, time (frames), height, width. Each model family reflects a different architectural philosophy, from efficient 3D CNNs to multiscale transformers.
All video models in torchvision.models.video are in beta status. Weights default to KINETICS400_V1 unless otherwise noted.
Input Format
Video models expect clips as float tensors with shape Tensor[B, C, T, H, W]:
| Dimension | Meaning | Notes |
|---|---|---|
| B | Batch size | ≥ 1 |
| C | Channels (RGB) | 3 |
| T | Time / frames | 16 for R3D/MC3/R(2+1)D/MViT; 32 for Swin3D |
| H, W | Spatial dimensions | 112 × 112 for R3D family; 224 × 224 for Swin3D, MViT, S3D |
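weights.transforms() returns a VideoClassification callable that handles resizing, center cropping, and normalization. It accepts clips in [T, C, H, W] or [B, T, C, H, W] layout and returns them permuted to [C, T, H, W] or [B, C, T, H, W], the layout the models expect. Always apply it before running inference.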
3D ResNet Family
These three models share the same VideoResNet backbone but differ in how they handle the temporal dimension. All are 18-layer networks trained with frame_rate=15, clips_per_video=5, and clip_len=16 on Kinetics-400.
R3D-18
Full 3D convolutions (3×3×3) in every layer. Strongest spatiotemporal coupling; highest parameter count of the three.
MC3-18
Mixed convolution: 3D in the first layer, 2D (spatial-only) in layers 2–4. Balances accuracy and parameter efficiency.
R(2+1)D-18
Factorised convolutions: each 3D conv split into a spatial 2D conv followed by a temporal 1D conv. Best top-1 accuracy of the trio.
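To make the three convolution styles concrete, the sketch below builds one layer of each kind in plain PyTorch and compares output shapes and parameter counts. It is an illustration only, not the torchvision implementation: the helper name conv2plus1d, the 64-channel width, and the intermediate-width formula (chosen so the factorised layer roughly matches the parameter count of a full 3×3×3 convolution) are assumptions made for this example.

```python
import torch
from torch import nn


def conv2plus1d(in_ch: int, out_ch: int) -> nn.Sequential:
    """Hypothetical helper: spatial 1x3x3 conv followed by temporal 3x1x1 conv."""
    # Intermediate width picked so the parameter count is close to a full 3x3x3 conv.
    mid_ch = (in_ch * out_ch * 3 * 3 * 3) // (in_ch * 3 * 3 + 3 * out_ch)
    return nn.Sequential(
        nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1), bias=False),
        nn.BatchNorm3d(mid_ch),
        nn.ReLU(inplace=True),
        nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0), bias=False),
    )


def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())


# R3D style: one kernel spans time and space together.
full_3d = nn.Conv3d(64, 64, kernel_size=3, padding=1, bias=False)
# MC3 style (later layers): the kernel covers space only, one frame at a time.
spatial_only = nn.Conv3d(64, 64, kernel_size=(1, 3, 3), padding=(0, 1, 1), bias=False)
# R(2+1)D style: spatial conv, then temporal conv.
factorised = conv2plus1d(64, 64)

x = torch.randn(1, 64, 8, 28, 28)  # [B, C, T, H, W]
for name, layer in [("full 3D", full_3d), ("spatial only", spatial_only), ("(2+1)D", factorised)]:
    print(f"{name}: output {tuple(layer(x).shape)}, {count_params(layer)} parameters")
```

All three layers preserve the input shape; the spatial-only layer uses a third of the weights of the full 3D kernel, while the factorised layer stays close to the full count but adds a nonlinearity between the spatial and temporal steps.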
Accuracy on Kinetics-400
| Model | Params | Top-1 | Top-5 |
|---|---|---|---|
| r3d_18 | 33.4 M | 63.2% | 83.5% |
| mc3_18 | 11.7 M | 64.0% | 84.1% |
| r2plus1d_18 | 31.5 M | 67.5% | 86.2% |
Quick Inference
Switching Variants
MViT — Multiscale Vision Transformers
MViT processes video using a hierarchical attention mechanism that progressively increases channel capacity while reducing spatiotemporal resolution across stages. This multiscale design is more parameter-efficient than isotropic ViTs for video.
MViT V1-B
Base model from Fan et al., 2021. 36.6 M parameters. Top-1: 78.5% on Kinetics-400.
MViT V2-S
Improved version with decomposed relative position embeddings and residual pooling connections (Li et al., 2021). 34.5 M parameters. Top-1: 80.8%.
Accuracy on Kinetics-400
| Model | Params | Top-1 | Top-5 |
|---|---|---|---|
| mvit_v1_b | 36.6 M | 78.5% | 93.6% |
| mvit_v2_s | 34.5 M | 80.8% | 94.7% |
Video Swin Transformer
The Video Swin Transformer (Liu et al., 2021) extends the Swin Transformer to 3D by using shifted 3D windows for local spatiotemporal attention. It comes in three sizes trained on Kinetics-400, with the base model also available with ImageNet-22K pretraining for higher accuracy.
Accuracy on Kinetics-400
| Model | Params | Top-1 | Top-5 | Notes |
|---|---|---|---|---|
| swin3d_t | 28.2 M | 77.7% | 93.5% | Tiny |
| swin3d_s | 49.8 M | 79.5% | 94.2% | Small |
| swin3d_b | 88.0 M | 79.4% | 94.4% | Base (K400 only) |
| swin3d_b + IN22K | 88.0 M | 81.6% | 95.6% | KINETICS400_IMAGENET22K_V1 |
S3D — Separable 3D CNN
S3D (Xie et al., 2018) factorizes 3D convolutions into separate spatial and temporal convolutions for speed while retaining good accuracy. It is the most compact model in the video zoo at only 8.3 M parameters.
| Model | Params | Top-1 | Top-5 |
|---|---|---|---|
| s3d | 8.3 M | 68.4% | 88.1% |
Model Comparison
Efficiency-first
Use S3D (8.3 M) or MC3-18 (11.7 M) when inference speed or memory is constrained.
Accuracy-first
Use MViT V2-S or Swin3D-B (ImageNet-22K) when top-1 accuracy is the priority.
Classic baseline
Use R(2+1)D-18 as a well-established 3D CNN baseline with broad literature support.
Architecture research
MViT V1-B is useful when studying multiscale attention ablations or reproducing the original paper.
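The trade-offs above can also be expressed as a small selection helper. This is a rough sketch that only uses the parameter counts and top-1 figures from the tables on this page; the MODEL_ZOO list and best_under_budget function are hypothetical names, and parameter count is used as a crude proxy for memory and speed.

```python
# (label, parameters in millions, Kinetics-400 top-1 %) copied from the tables above.
MODEL_ZOO = [
    ("s3d", 8.3, 68.4),
    ("mc3_18", 11.7, 64.0),
    ("swin3d_t", 28.2, 77.7),
    ("r2plus1d_18", 31.5, 67.5),
    ("r3d_18", 33.4, 63.2),
    ("mvit_v2_s", 34.5, 80.8),
    ("mvit_v1_b", 36.6, 78.5),
    ("swin3d_s", 49.8, 79.5),
    ("swin3d_b", 88.0, 79.4),
    ("swin3d_b (IN22K)", 88.0, 81.6),
]


def best_under_budget(max_params_m: float):
    """Return the label with the highest top-1 among models within the budget."""
    candidates = [m for m in MODEL_ZOO if m[1] <= max_params_m]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m[2])[0]


print(best_under_budget(10))   # s3d
print(best_under_budget(40))   # mvit_v2_s
print(best_under_budget(100))  # swin3d_b (IN22K)
```

Parameter count does not capture compute cost or clip length, so the result is a starting point rather than a recommendation.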