Qwen3-VL represents a comprehensive upgrade in vision-language capabilities, delivering state-of-the-art performance across multiple domains.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/QwenLM/Qwen3-VL/llms.txt
Use this file to discover all available pages before exploring further.
Visual Agent Capabilities
GUI Interaction
Qwen3-VL can operate both PC and mobile graphical user interfaces:- Element Recognition: Identifies UI components (buttons, forms, menus)
- Function Understanding: Comprehends the purpose of interface elements
- Tool Invocation: Executes commands and interactions
- Task Completion: Performs multi-step workflows autonomously
Visual Coding
Generate code from visual inputs:- Draw.io diagrams → Structured diagram definitions
- Screenshots → HTML/CSS/JavaScript code
- Design mockups → Frontend implementations
- Video tutorials → Code reproduction
Spatial Understanding
2D Grounding
Precise object localization with multiple output formats:- Bounding boxes with relative coordinates
- Point annotations
- Multiple objects simultaneously
- Diverse positioning and labeling tasks
3D Grounding
Spatial reasoning in three dimensions:- Indoor and outdoor object detection
- 3D bounding box generation
- Viewpoint understanding
- Occlusion reasoning
- Depth perception
Advanced Spatial Perception
- Position Judgment: Determines object locations relative to each other
- Viewpoint Analysis: Understands camera angles and perspectives
- Occlusion Detection: Identifies when objects block others
- Spatial Reasoning: Infers relationships between objects in 3D space
Long Context & Video Understanding
Context Window
- Native: 256K tokens
- Extended (with YaRN): Up to 1M tokens
- Applications:
- Full book analysis
- Hours-long video processing
- Complete codebase comprehension
Video Processing
Comprehensive video understanding capabilities:- Temporal Modeling: Precise event timing and sequencing
- Second-level Indexing: Frame-accurate content retrieval
- Long Video Support: Process videos of any length
- Full Recall: Maintain context across entire video duration
Video Features
- Video OCR: Extract text from video frames
- Video Grounding: Locate specific moments
- Action Recognition: Identify activities and events
- Scene Understanding: Comprehend context changes
Enhanced OCR Capabilities
Language Support
32 languages (expanded from 10 in previous versions)- Latin scripts
- Chinese (Simplified & Traditional)
- Japanese, Korean
- Arabic, Hebrew
- Cyrillic scripts
- And 25+ more
Robust Text Recognition
Handles challenging conditions:- Low light: Dimly lit scenes
- Blur: Motion or focus blur
- Tilt: Rotated or skewed text
- Rare characters: Ancient scripts, specialized glyphs
- Technical jargon: Domain-specific terminology
Document Understanding
- Layout position information
- Structure parsing for long documents
- Table extraction
- Form understanding
- Qwen HTML format output
Multimodal Reasoning
Enhanced STEM Performance
Excels in scientific and mathematical tasks:- Visual Math: Solve problems from diagrams and equations
- Causal Analysis: Identify cause-effect relationships
- Logical Reasoning: Evidence-based conclusions
- Multi-step Problems: Complex problem decomposition
Reasoning Capabilities
- Mathematical equation solving
- Scientific diagram interpretation
- Chart and graph analysis
- Statistical reasoning
- Geometry and spatial problems
Omni Recognition
Visual Recognition Scope
Broader, higher-quality pretraining enables “recognize everything”:People & Characters
- Celebrities
- Anime characters
- Historical figures
Objects & Products
- Vehicles (cars, planes)
- Consumer products
- Brand recognition
Nature & Places
- Landmarks
- Flora and fauna
- Geographical features
Recognition Features
- Fine-grained Classification: Distinguish between similar objects
- Attribute Recognition: Identify colors, materials, styles
- Scene Understanding: Comprehend overall context
- Multi-object Recognition: Handle complex scenes
Text Understanding
On Par with Pure LLMs
Seamless text-vision fusion:- No degradation in pure text tasks
- Unified comprehension across modalities
- Lossless information processing
- Consistent performance whether input is text, image, or both
Applications
- Mixed text and image documents
- Code with visual diagrams
- Scientific papers with figures
- Interleaved content processing
Performance Highlights
Visual Tasks
State-of-the-art performance on:- General visual question answering
- Document VQA
- Chart/diagram understanding
- Scene text recognition
- Multi-image reasoning
Text-Centric Tasks
Competitive with specialized text models on:- Language understanding
- Code generation
- Mathematical reasoning
- Knowledge question answering
For detailed benchmark results, see the Performance section.