The torchvision.io module is TorchVision’s primary interface for loading and saving image and video data directly as PyTorch tensors. Rather than passing through NumPy arrays or PIL Image objects, these functions decode files straight into uint8 tensors in CHW layout, so the output is ready for transforms and model inference without an intermediate conversion step. JPEG decoding also supports CUDA acceleration via nvjpeg.
ImageReadMode
ImageReadMode is an enum that controls the colour-space conversion applied during decoding. You can pass either the enum member or its string name to any mode parameter.
GRAY and GRAY_ALPHA are only supported for JPEG and PNG images. Passing mode="RGB" as a plain string is equivalent to mode=ImageReadMode.RGB.

| Member | Value | Description |
|---|---|---|
| UNCHANGED | 0 | Preserve the native colour space of the file |
| GRAY | 1 | Force single-channel grayscale output |
| GRAY_ALPHA | 2 | Grayscale with an alpha channel |
| RGB | 3 | Force 3-channel RGB output |
| RGB_ALPHA / RGBA | 4 | RGB with an alpha channel |
Image Decoding
decode_image
Decodes an image from a file path or a 1-D uint8 tensor of raw encoded bytes. Automatically detects the format (JPEG, PNG, GIF, WEBP) and dispatches to the appropriate decoder.
decode_image() does not support AVIF or HEIC yet; use decode_avif() / decode_heic() directly for those formats.
input: Either a one-dimensional uint8 tensor containing raw encoded bytes, or a path to the image file on disk.
mode: Colour-space conversion to apply during decoding. Accepts string names, e.g. "RGB". See ImageReadMode.
apply_exif_orientation: Apply the EXIF orientation tag to automatically rotate/flip the output tensor. Supported for JPEG and PNG only.
Returns: Tensor[C, H, W], uint8 for 8-bit images, uint16 for 16-bit PNGs.
read_image
path: Path to the image file to read.
mode: Colour-space conversion mode. See ImageReadMode.
apply_exif_orientation: Apply EXIF orientation transformation.
Returns: Tensor[C, H, W] uint8.
decode_jpeg
input: A 1-D uint8 CPU tensor of raw JPEG bytes, or a list of such tensors. All input tensors must reside on CPU even when decoding to CUDA.
mode: Colour-space conversion mode.
device: Device for the output tensor. When "cuda", nvjpeg is used for hardware-accelerated decoding. Requires CUDA ≥ 11.6 to avoid a memory leak in nvjpeg.
apply_exif_orientation: Apply EXIF orientation (CPU only).
Returns: Tensor[C, H, W] or list[Tensor[C, H, W]], uint8 values in [0, 255], on the requested device.
decode_png
input: 1-D uint8 tensor containing the raw bytes of the PNG file.
mode: Colour-space conversion mode.
apply_exif_orientation: Apply EXIF orientation transformation.
Returns: Tensor[C, H, W], uint8 for 8-bit PNGs, uint16 for 16-bit PNGs.
For 16-bit PNG output, call torchvision.transforms.v2.functional.to_dtype(img, scale=True) to convert to uint8 or float32.
decode_gif
input: 1-D contiguous uint8 tensor of raw GIF bytes.
Returns: Tensor[C, H, W] if the GIF contains a single frame, or Tensor[N, C, H, W] if the GIF contains N frames (animated). Values are uint8 in [0, 255].
decode_webp
input: 1-D contiguous uint8 tensor of raw WebP bytes.
mode: Colour-space conversion mode. Use "RGB" or "RGB_ALPHA" for explicit channel count.
Returns: Tensor[C, H, W] uint8.
decode_avif
input: 1-D contiguous uint8 tensor of raw AVIF bytes.
mode: Colour-space conversion mode.
Returns: Tensor[C, H, W], uint8 for 8-bit images, uint16 for higher bit-depth.
decode_heic
input: 1-D contiguous uint8 tensor of raw HEIC bytes.
mode: Colour-space conversion mode.
Returns: Tensor[C, H, W], uint8 for 8-bit images, uint16 for higher bit-depth.
Image Encoding
encode_jpeg
input: A uint8 image tensor with C = 1 (grayscale) or C = 3 (RGB), or a list of such tensors. CUDA tensors are encoded with a CUDA-native encoder.
quality: JPEG quality factor, 1 (smallest file) to 100 (best quality).
Returns: Tensor[1] or list[Tensor[1]], 1-D uint8 tensor(s) of raw JPEG bytes.
write_jpeg
Encodes an image tensor as JPEG and writes it to a file (equivalent to write_file(filename, encode_jpeg(input, quality))).
input: uint8 image tensor with C = 1 or C = 3.
filename: Destination file path.
quality: JPEG quality factor, 1–100.
encode_png
input: uint8 image tensor with C = 1 or C = 3.
compression_level: zlib compression level, 0 (no compression, largest file) to 9 (maximum compression).
Returns: Tensor[1], 1-D uint8 tensor of raw PNG bytes.
write_png
input: uint8 image tensor with C = 1 or C = 3.
filename: Destination file path.
compression_level: zlib compression level, 0–9.
File I/O
read_file
Reads the contents of a file into a 1-D uint8 tensor. Useful for loading encoded image bytes before passing them to a format-specific decoder.
path: Path to the file to read.
Returns: Tensor, a 1-D uint8 tensor of the file’s raw bytes.
write_file
Writes the contents of a 1-D uint8 tensor to a file on disk.
filename: Destination file path.
data: 1-D uint8 tensor of bytes to write.
Complete Image I/O Example
Video I/O
read_video
filename: Path to the video file to read.
start_pts: Start presentation timestamp. Interpreted as raw PTS ticks when pts_unit="pts", or seconds when pts_unit="sec".
end_pts: End presentation timestamp (inclusive). None reads to the end of the stream.
pts_unit: Unit for start_pts / end_pts. Either "pts" (raw ticks) or "sec" (seconds).
output_format: Layout of the returned video tensor. "THWC" (Time × Height × Width × Channels, default) or "TCHW".
Returns:
vframes: Tensor[T, H, W, C] (or [T, C, H, W] for "TCHW") uint8 video frames.
aframes: Tensor[K, L] float32 audio samples (K channels, L samples).
info: dict with keys video_fps (float) and audio_fps (int).
read_video_timestamps
filename: Path to the video file.
pts_unit: "pts" returns raw integer ticks; "sec" returns timestamps in seconds.
Returns:
pts: sorted list of presentation timestamps.
video_fps: frames per second as a float, or None if unavailable.
write_video
Requires PyAV (pip install av).
filename: Destination file path (format inferred from extension, e.g. .mp4, .mkv).
video_array: uint8 tensor of video frames in THWC layout (C = 3 for RGB).
fps: Output frame rate.
video_codec: FFmpeg video codec identifier, e.g. "libx264", "libx265", "mpeg4".
options: Additional FFmpeg encoder options passed as key-value strings (e.g. {"crf": "18"}).
audio_array: Optional audio samples as a float32 tensor of shape [K, L] (channels × samples).
audio_fps: Audio sample rate in Hz (required when audio_array is provided).
audio_codec: FFmpeg audio codec, e.g. "aac", "mp3".
audio_options: Additional FFmpeg audio encoder options.
VideoReader
VideoReader is a low-level, frame-by-frame streaming class built on PyAV. It supports fine-grained seeking and is suited for datasets that sample random frames without loading the entire file into memory.
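As a rough illustration of the read_video() parameters described earlier in this section, the sketch below wraps the call in a helper and shows how the two output layouts relate. `load_clip` and `thwc_to_tchw` are hypothetical helpers, not torchvision functions; read_video is imported lazily because it needs PyAV and a torchvision build that still ships video I/O, and the layout conversion is demonstrated on a stand-in tensor so that no video file is required.

```python
import torch


def load_clip(path, start_sec=0.0, end_sec=None):
    """Hypothetical helper: read a clip by time range, in TCHW layout."""
    # Lazy import: requires PyAV and video support in torchvision.
    from torchvision.io import read_video

    vframes, aframes, info = read_video(
        path,
        start_pts=start_sec,
        end_pts=end_sec,
        pts_unit="sec",        # interpret the timestamps as seconds
        output_format="TCHW",  # frames-first, channels-second layout
    )
    return vframes, aframes, info


def thwc_to_tchw(frames):
    """Convert the default THWC layout to TCHW by reordering dimensions."""
    return frames.permute(0, 3, 1, 2)


# Stand-in for five 64x48 RGB frames in the default "THWC" layout.
fake_frames = torch.zeros(5, 48, 64, 3, dtype=torch.uint8)
tchw = thwc_to_tchw(fake_frames)
print(tuple(fake_frames.shape), "->", tuple(tchw.shape))
```

Requesting "TCHW" directly avoids the extra permute when the frames are passed to transforms that expect channels before height and width.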
Backend Control
TorchVision exposes functions in the top-level torchvision namespace to switch the underlying libraries used for image and video loading.
Image Backend
backend: "PIL" (default, full-featured) or "accimage" (Intel IPP library, generally faster but supports fewer operations).
Video Backend
"pyav" is the only supported video backend. set_video_backend() currently has no effect and exists for API compatibility.