Documentation Index
Fetch the complete documentation index at: https://mintlify.com/QwenLM/Qwen3-VL/llms.txt
Use this file to discover all available pages before exploring further.
General Questions
What is Qwen3-VL?
Qwen3-VL is the most powerful vision-language model in the Qwen series. It delivers comprehensive upgrades including:- Superior text understanding & generation
- Deeper visual perception & reasoning
- Extended context length (256K native, expandable to 1M)
- Enhanced spatial and video dynamics comprehension
- Stronger agent interaction capabilities
What’s the difference between Qwen3-VL and Qwen2.5-VL?
Qwen3-VL introduces several architectural improvements:- Interleaved-MRoPE: Enhanced positional embeddings for better video reasoning
- DeepStack: Multi-level ViT feature fusion for finer details
- Text-Timestamp Alignment: Precise temporal modeling for videos
- Improved Capabilities: Better visual coding, spatial reasoning, OCR (32 languages vs 10)
- Larger Model Sizes: Up to 235B parameters with MoE architecture
Which model should I use?
For edge/mobile deployment: Qwen3-VL-2B- Smallest footprint
- Suitable for consumer GPUs
- Good performance-to-resource ratio
- RTX 3090/4090, A100 40GB
- Strong capabilities
- 30B-A3B uses MoE with only 3B active
- State-of-the-art performance
- Requires 8x H100/H200
Should I use Instruct or Thinking edition?
Instruct Edition:- General-purpose applications
- Better instruction following
- More aligned with human preferences
- Faster inference
- Complex reasoning tasks
- STEM and mathematical problems
- Causal analysis and logical reasoning
- Provides detailed thought processes
Model Capabilities
What visual tasks does Qwen3-VL support?
- Image Understanding: Object recognition, scene understanding, visual reasoning
- OCR: 32 languages, robust to blur/tilt/low-light
- Document Parsing: Layout analysis, structure extraction
- Object Grounding: 2D bounding boxes and points, 3D spatial reasoning
- Video Understanding: Long-form video (hours), temporal reasoning, second-level indexing
- Visual Coding: Generate HTML/CSS/JS/Draw.io from screenshots
- Agent Tasks: GUI interaction, tool use on PC/mobile
- Spatial Reasoning: Viewpoint, occlusion, 3D relationships
What is the maximum context length?
Default: 256K tokens Extended: Up to 1M tokens using YaRN scaling To enable 1M context, modifyconfig.json:
How many images/videos can I process at once?
Qwen3-VL supports multiple images and videos in a single conversation, limited only by context length. Example with multiple inputs:What languages does the OCR support?
Qwen3-VL supports OCR in 32 languages (expanded from 10 in Qwen2.5-VL), including:- Major world languages (English, Chinese, Spanish, French, German, etc.)
- Asian languages (Japanese, Korean, Thai, Vietnamese, etc.)
- Arabic, Hebrew, Cyrillic scripts
- Rare and ancient characters
- Technical jargon and specialized terminology
Installation & Setup
What are the minimum requirements?
Software:- Python 3.8+
- PyTorch 2.0+
- Transformers >= 4.57.0
- CUDA 11.6+ (for GPU)
- Qwen3-VL-2B: 8GB+ VRAM (consumer GPU)
- Qwen3-VL-8B: 24GB+ VRAM (RTX 3090/4090)
- Qwen3-VL-32B: 80GB+ VRAM (A100 80GB or multi-GPU)
- Qwen3-VL-235B-A22B: 8x H100/H200 recommended
How do I install Qwen3-VL?
Basic installation:Where can I download the models?
HuggingFace (global): ModelScope (optimized for mainland China): All models available in:- Base precision (BF16/FP16)
- FP8 quantized (for H100/H200)
Usage
How do I run inference?
Basic inference:How do I control image/video resolution?
Method 1: Global settings via processorHow do I process videos?
From URL or local path:Deployment
How do I deploy for production?
Recommended: vLLMCan I use the OpenAI API?
Yes! Both vLLM and the official Qwen API support OpenAI-compatible endpoints. With vLLM:How do I optimize inference speed?
- Use vLLM or SGLang instead of transformers
- Enable Flash Attention 2
- Use FP8 quantization (H100/H200)
- Batch multiple requests
- Reduce visual token budget for faster processing
Fine-tuning
Can I fine-tune Qwen3-VL?
Yes! Fine-tuning code is available for Qwen2-VL and Qwen2.5-VL, which is compatible with Qwen3-VL. Resources:- Fine-tuning Code
- Released: April 8, 2025
What data format is required for fine-tuning?
Qwen3-VL uses a conversational format with support for interleaved image/video/text:Can I fine-tune on custom visual tasks?
Yes! Qwen3-VL can be fine-tuned for:- Custom object detection/grounding
- Domain-specific OCR
- Specialized document understanding
- Custom visual reasoning tasks
- Agent behaviors
Technical Details
What’s the architecture?
Qwen3-VL introduces three key innovations:- Interleaved-MRoPE: Multi-resolution positional embeddings across time, width, and height
- DeepStack: Multi-level ViT feature fusion for fine-grained details
- Text-Timestamp Alignment: Precise temporal grounding beyond T-RoPE
What is the patch size?
- Qwen3-VL: 16×16 pixels
- Qwen2.5-VL: 14×14 pixels
qwen-vl-utils usage:
How are visual tokens calculated?
For images:- Spatial compression: 32× (16×16 patches → merged)
- Tokens ≈ (height × width) / (32 × 32)
- Spatial compression: 32×
- Temporal compression: 2×
- Tokens ≈ (frames × height × width) / (32 × 32 × 2)
min_pixels, max_pixels, total_pixels parameters.
Community & Support
Where can I get help?
- Documentation: You’re reading it!
- GitHub Issues: Qwen3-VL Issues
- Discord: Join server
- WeChat: QR code
- Troubleshooting: Common issues
How do I report a bug?
- Check existing issues
- Review troubleshooting guide
- Create new issue with:
- Model version and size
- Environment (OS, Python, CUDA, library versions)
- Minimal reproducible example
- Error messages and logs
Where can I find examples?
Cookbooks (Jupyter notebooks):- Omni Recognition
- Document Parsing
- 2D/3D Grounding
- OCR & Key Information Extraction
- Video Understanding
- Mobile/Computer Agent
- Visual Coding
- And more in the cookbooks directory