Function Signature
Description
Extracts frames from a video file or processes a sequence of image frames. Applies smart resizing and frame sampling based on FPS and pixel constraints. Supports multiple video reading backends (torchcodec, decord, torchvision).
Parameters
Dictionary containing video information and processing parameters.
Required keys:
- video: Video source (file path, URL) or list of frame paths
Optional keys:
- fps: Target frames per second for extraction (default: 2.0)
- nframes: Exact number of frames to extract
- min_frames: Minimum frames to extract when sampling by fps (default: 4)
- max_frames: Maximum frames to extract when sampling by fps (default: 768)
- video_start: Start time in seconds
- video_end: End time in seconds
- resized_height: Target height
- resized_width: Target width
- min_pixels: Minimum pixels per frame (default: 128 * patch_factor²)
- max_pixels: Maximum pixels per frame (calculated dynamically)
- total_pixels: Total pixel budget for all frames (default: MODEL_SEQ_LEN * patch_factor² * 0.9)
The patch size used by the vision encoder.
Common values:
- 14 for Qwen2VL and Qwen2.5VL
- 16 for Qwen3VL
Whether to return the sample FPS alongside the video tensor.
Whether to return video metadata (fps, frame indices, total frames, backend) alongside the video tensor. Required for Qwen3VL models.
Returns
Video tensor with shape (T, C, H, W), where:
- T: Number of frames (divisible by 2)
- C: Color channels (3 for RGB)
- H: Frame height (divisible by patch_factor)
- W: Frame width (divisible by patch_factor)
Only returned if return_video_metadata=True. Contains:
- fps: Original video FPS
- frames_indices: List of extracted frame indices
- total_num_frames: Total frames in the video
- video_backend: Backend used ("torchcodec", "decord", or "torchvision")
Only returned if return_video_sample_fps=True. The effective FPS of the sampled frames.
Video Reading Backends
The function automatically selects the best available backend:
- torchcodec (preferred, fastest)
- decord (good performance)
- torchvision (fallback, always available)
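The priority order above can be illustrated with a small, self-contained sketch. It is not the library's own selection code: `pick_backend` and `BACKEND_PRIORITY` are hypothetical names, and availability is assumed to mean "the package can be imported".

```python
from importlib.util import find_spec

# Priority order taken from the list above; the last entry is the fallback.
BACKEND_PRIORITY = ("torchcodec", "decord", "torchvision")


def pick_backend(is_available=lambda name: find_spec(name) is not None):
    """Return the first importable backend, or the fallback if none is found.

    `is_available` is a hypothetical hook that makes the choice easy to test.
    """
    for name in BACKEND_PRIORITY[:-1]:
        if is_available(name):
            return name
    # The documentation describes torchvision as always available.
    return BACKEND_PRIORITY[-1]


if __name__ == "__main__":
    print("selected backend:", pick_backend())
```

Passing a custom `is_available` callable lets you check the ordering without installing any of the packages.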
Usage Examples
Basic Video Loading
Custom FPS Sampling
Exact Frame Count
Time Range Selection
Frame Sequence Input
Custom Resolution
With Metadata (Qwen3VL)
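As a rough sketch, the scenarios listed above might be expressed as request dictionaries like the ones below. The key names come from the Parameters section; the file paths and numeric values are hypothetical placeholders, and each dictionary is assumed to be passed to the function as its first argument.

```python
# Hypothetical request dictionaries, one per scenario. Paths and values are
# placeholders for illustration only.
VIDEO = "path/to/video.mp4"

examples = {
    # Basic loading: only the required key, defaults for everything else.
    "basic": {"video": VIDEO},
    # Sample one frame per second instead of the default 2.0.
    "custom_fps": {"video": VIDEO, "fps": 1.0},
    # Ask for an exact number of frames (kept a multiple of 2).
    "exact_frame_count": {"video": VIDEO, "nframes": 16},
    # Restrict extraction to the 10 s to 30 s window.
    "time_range": {"video": VIDEO, "video_start": 10.0, "video_end": 30.0},
    # Pre-extracted frames: a list of frame paths instead of a video file.
    "frame_sequence": {
        "video": [f"path/to/frames/frame_{i:04d}.jpg" for i in range(8)]
    },
    # Fixed output resolution.
    "custom_resolution": {
        "video": VIDEO,
        "resized_height": 280,
        "resized_width": 420,
    },
}

for name, request in examples.items():
    print(name, sorted(request))
```

For the metadata scenario, the same dictionaries apply; the difference is that the call sets return_video_metadata=True, so the metadata described under Returns comes back alongside the video tensor.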
Frame Sampling Algorithm
The function samples frames in three steps:
- Determine frame count:
  - If nframes specified: Use that value (rounded to multiple of 2)
  - If fps specified: Calculate nframes = total_frames / video_fps * fps
  - Apply min_frames and max_frames constraints
- Extract frames: Uniformly sample frames across the specified range
- Resize frames: Apply smart_resize to each frame
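The first two steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the helper names are hypothetical, the rounding direction in the fps branch (clamp, then round down to a multiple of 2) is an assumption, and the resize step is omitted.

```python
import math

FRAME_FACTOR = 2  # frame counts are kept at multiples of 2 (see Returns)


def sample_frame_count(total_frames, video_fps, fps=2.0, nframes=None,
                       min_frames=4, max_frames=768):
    """Decide how many frames to extract (hypothetical helper)."""
    if nframes is not None:
        # Exact count requested: round to the nearest multiple of 2.
        count = round(nframes / FRAME_FACTOR) * FRAME_FACTOR
    else:
        # fps-based count, clamped by min_frames / max_frames.
        lo = math.ceil(min_frames / FRAME_FACTOR) * FRAME_FACTOR
        hi = math.floor(min(max_frames, total_frames) / FRAME_FACTOR) * FRAME_FACTOR
        count = total_frames / video_fps * fps
        count = min(max(count, lo), hi)
        # Assumption: round down so the result never exceeds the clamp.
        count = math.floor(count / FRAME_FACTOR) * FRAME_FACTOR
    if not FRAME_FACTOR <= count <= total_frames:
        raise ValueError(
            f"nframes must be in [{FRAME_FACTOR}, {total_frames}], got {count}"
        )
    return int(count)


def uniform_indices(total_frames, count):
    """Spread `count` frame indices evenly over [0, total_frames - 1]."""
    if count == 1:
        return [0]
    step = (total_frames - 1) / (count - 1)
    return [round(i * step) for i in range(count)]


if __name__ == "__main__":
    n = sample_frame_count(total_frames=300, video_fps=30.0, fps=2.0)
    print(n, uniform_indices(300, n)[:3])
```

For a 10-second clip at 30 FPS sampled at 2 FPS, the fps branch gives 300 / 30 * 2 = 20 frames, which already satisfies the default limits.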
Pixel Budget Management
The function automatically manages pixel budgets to prevent exceeding model limits.
Error Handling
Invalid Time Range
Invalid Frame Count
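The two failure cases above can be illustrated with a small validation sketch. The helper names, the exact conditions, and the error messages are assumptions made for illustration; the library's own checks and wording may differ.

```python
def validate_time_range(video_start, video_end, duration):
    """Reject time ranges that cannot select any frames (hypothetical helper)."""
    start = 0.0 if video_start is None else video_start
    end = duration if video_end is None else video_end
    if start < 0 or end < 0:
        raise ValueError("video_start and video_end must be non-negative")
    if start > duration:
        raise ValueError(f"video_start ({start}s) exceeds duration ({duration}s)")
    if end <= start:
        raise ValueError(f"video_end ({end}s) must be after video_start ({start}s)")
    # Clip the end of the range to the actual video length.
    return start, min(end, duration)


def validate_frame_count(nframes, total_frames, frame_factor=2):
    """Reject frame counts outside [frame_factor, total_frames]."""
    if not frame_factor <= nframes <= total_frames:
        raise ValueError(
            f"nframes must be in [{frame_factor}, {total_frames}], got {nframes}"
        )
    return nframes


if __name__ == "__main__":
    for check, args in [
        (validate_time_range, (30.0, 10.0, 60.0)),  # end before start
        (validate_frame_count, (500, 120)),         # more frames than exist
    ]:
        try:
            check(*args)
        except ValueError as err:
            print("rejected:", err)
```

Catching ValueError around the call, as in the loop above, is one way to fall back to default sampling when user-supplied parameters are out of range.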
Performance Tips
- Use torchcodec for best performance (pip install torchcodec)
- Limit max_frames to reduce processing time
- Use frame sequences for pre-extracted frames to skip video decoding
- Adjust total_pixels based on your model’s sequence length limit
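To make the last tip concrete, the sketch below shows how a total pixel budget might translate into a per-frame limit. The two defaults follow the formulas in the Parameters section; the per-frame formula (splitting the budget across pairs of frames, with a floor slightly above min_pixels) and all function names are assumptions, so treat the numbers as illustrative only.

```python
def default_min_pixels(patch_factor):
    # Default from the Parameters section: 128 * patch_factor²
    return 128 * patch_factor ** 2


def default_total_pixels(model_seq_len, patch_factor):
    # Default from the Parameters section: MODEL_SEQ_LEN * patch_factor² * 0.9
    return model_seq_len * patch_factor ** 2 * 0.9


def per_frame_max_pixels(nframes, total_pixels, min_pixels, frame_cap,
                         frame_factor=2):
    """Hypothetical per-frame limit derived from the total budget.

    Assumptions: the budget is shared per group of `frame_factor` frames,
    `frame_cap` is an upper bound on any single frame, and the result is
    never allowed to fall below roughly min_pixels.
    """
    budget = total_pixels / nframes * frame_factor
    return max(min(frame_cap, budget), int(min_pixels * 1.05))


if __name__ == "__main__":
    # Hypothetical values chosen only to show the trend.
    patch_factor = 28
    total = default_total_pixels(model_seq_len=128_000, patch_factor=patch_factor)
    floor = default_min_pixels(patch_factor)
    for nframes in (16, 64, 256, 768):
        limit = per_frame_max_pixels(nframes, total, floor,
                                     frame_cap=768 * patch_factor ** 2)
        print(nframes, int(limit))
```

Under these assumptions, more frames means fewer pixels per frame, which is why lowering max_frames or raising total_pixels both lead to sharper individual frames.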
Environment Variables
See Also
- process_vision_info - Process all vision content from conversations
- fetch_image - Load individual images
- smart_resize - Understanding the resize algorithm