Qwen3-VL is available in multiple sizes and configurations to meet different deployment scenarios, from edge devices to cloud infrastructure.
Model Sizes
Qwen3-VL comes in six parameter scales: four dense models and two Mixture-of-Experts (MoE) models.

Dense Architecture Models
| Model Size | Parameters | Use Case | Hardware Requirement |
|---|---|---|---|
| 2B | 2 Billion | Edge devices, mobile | Consumer GPUs |
| 4B | 4 Billion | Lightweight deployment | Single GPU |
| 8B | 8 Billion | Balanced performance | Single GPU (16GB+) |
| 32B | 32 Billion | High-performance tasks | Multi-GPU |
Mixture-of-Experts (MoE) Architecture
| Model Size | Total Parameters | Active Parameters | Use Case |
|---|---|---|---|
| 30B-A3B | 30 Billion | 3 Billion | Efficient large-scale |
| 235B-A22B | 235 Billion | 22 Billion | State-of-the-art performance |
MoE models activate only a subset of their parameters for each token, so per-token compute tracks the active parameter count rather than the total. This gives a better efficiency-performance tradeoff than a dense model of similar total parameter count.
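To make the total versus active distinction concrete, the short sketch below computes the share of parameters that is active for each MoE model. The parameter counts come from the table above; the dictionary layout and the function name `active_fraction` are illustrative choices, not part of any Qwen3-VL tooling.

```python
# Total and active parameter counts (in billions), taken from the MoE table above.
MOE_MODELS = {
    "30B-A3B": {"total_b": 30, "active_b": 3},
    "235B-A22B": {"total_b": 235, "active_b": 22},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Return the share of parameters that is active (hypothetical helper)."""
    return active_b / total_b


for name, p in MOE_MODELS.items():
    frac = active_fraction(p["total_b"], p["active_b"])
    print(f"{name}: {frac:.1%} of parameters active")
# 30B-A3B: 10.0% of parameters active
# 235B-A22B: 9.4% of parameters active
```

Both MoE models use roughly a tenth of their parameters at a time, which is why their compute cost is closer to that of a much smaller dense model.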
Model Editions
Each model size is available in two editions:

Instruct Edition
Optimized for direct task execution
- Fast inference and response generation
- Suitable for production deployments
- Optimized for instruction-following
- Lower computational overhead

Example: Qwen3-VL-8B-Instruct
Thinking Edition
Enhanced reasoning with explicit thought processes
- Provides step-by-step reasoning
- Better performance on complex tasks
- Useful for debugging and interpretability
- Longer output sequences

Example: Qwen3-VL-8B-Thinking
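Because the Thinking edition emits its reasoning before the final answer, downstream code often needs to separate the two. The sketch below assumes the decoded output marks the end of the reasoning with a `</think>` tag; this page does not state that delimiter, so treat it as an assumption and check the model's chat template. The helper `split_reasoning` is hypothetical.

```python
def split_reasoning(text: str, end_tag: str = "</think>") -> tuple[str, str]:
    """Split decoded output into (reasoning, answer).

    Assumes the reasoning ends with `end_tag` (an assumption, not confirmed
    by this page). If the tag is absent, the whole text is the answer.
    """
    head, sep, tail = text.partition(end_tag)
    if not sep:
        return "", text.strip()
    # Drop an optional opening tag if the output happens to include one.
    reasoning = head.strip().removeprefix("<think>").strip()
    return reasoning, tail.strip()


reasoning, answer = split_reasoning("Count the cats: 2 + 1.\n</think>\n\nThere are 3 cats.")
print(answer)  # There are 3 cats.
```

Output from an Instruct model passes through unchanged as the answer, so the same helper can sit in front of either edition.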
Architecture Comparison
Dense vs MoE
Dense Models
- All parameters active during inference
- Predictable memory usage
- Simpler deployment
- Best for: Resource-constrained environments

MoE Models
- Sparse activation patterns
- Higher capacity with lower compute
- Requires expert parallelism support
- Best for: Maximum performance scenarios
Available Models
Released Models (Hugging Face)
2B Models
4B Models
8B Models
32B Models
30B-A3B MoE
235B-A22B MoE
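The sizes and editions combine into model names of the form shown earlier (for example `Qwen3-VL-8B-Instruct`). The sketch below enumerates every combination under the assumption that all sizes follow that same pattern; the helper `model_name` is hypothetical, and any organization prefix used in a Hugging Face repository ID is not covered here.

```python
from itertools import product

# Sizes and editions as listed in this document.
SIZES = ["2B", "4B", "8B", "32B", "30B-A3B", "235B-A22B"]
EDITIONS = ["Instruct", "Thinking"]


def model_name(size: str, edition: str) -> str:
    """Build a model name such as 'Qwen3-VL-8B-Instruct' (assumed pattern)."""
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SIZES}")
    if edition not in EDITIONS:
        raise ValueError(f"unknown edition {edition!r}; expected one of {EDITIONS}")
    return f"Qwen3-VL-{size}-{edition}"


ALL_MODELS = [model_name(s, e) for s, e in product(SIZES, EDITIONS)]
print(len(ALL_MODELS))  # 12
print(ALL_MODELS[0])    # Qwen3-VL-2B-Instruct
```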
Model Selection Guide
By Use Case
Edge/Mobile Applications
- Choose: 2B or 4B Instruct
- Rationale: Low memory footprint, fast inference

- Choose: 8B Instruct
- Rationale: Best balance of performance and efficiency

- Choose: 8B or 32B Thinking
- Rationale: Enhanced reasoning capabilities

- Choose: 235B-A22B Instruct/Thinking
- Rationale: State-of-the-art results across benchmarks

- Choose: 30B-A3B Instruct
- Rationale: MoE efficiency with strong performance
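The guide above can be encoded as a small lookup table, which is handy when model choice is driven by configuration. In this sketch the sizes and editions come from the guide, while the use-case keys (`edge_mobile`, `balanced`, and so on) and the function `recommend` are hypothetical labels chosen for illustration.

```python
# Use-case keys are hypothetical labels; sizes and editions come from the guide above.
RECOMMENDATIONS = {
    "edge_mobile": {"sizes": ["2B", "4B"], "editions": ["Instruct"]},
    "balanced": {"sizes": ["8B"], "editions": ["Instruct"]},
    "reasoning": {"sizes": ["8B", "32B"], "editions": ["Thinking"]},
    "max_performance": {"sizes": ["235B-A22B"], "editions": ["Instruct", "Thinking"]},
    "efficient_moe": {"sizes": ["30B-A3B"], "editions": ["Instruct"]},
}


def recommend(use_case: str) -> list[str]:
    """Return candidate model names for a use case (hypothetical helper)."""
    try:
        entry = RECOMMENDATIONS[use_case]
    except KeyError:
        options = sorted(RECOMMENDATIONS)
        raise ValueError(f"unknown use case {use_case!r}; expected one of {options}") from None
    return [f"Qwen3-VL-{s}-{e}" for s in entry["sizes"] for e in entry["editions"]]


print(recommend("reasoning"))
# ['Qwen3-VL-8B-Thinking', 'Qwen3-VL-32B-Thinking']
```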
By Hardware
Quantized Versions
For memory-constrained deployments, FP8 quantized versions are available:
- Requires NVIDIA H100+ and CUDA 12+
- Minimal performance degradation
- Significant memory savings
- Faster inference throughput
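As a rough illustration of the memory savings, the sketch below estimates weight storage at 2 bytes per parameter versus 1 byte per parameter for FP8. It assumes the unquantized checkpoints are stored in BF16, which this page does not state, and it ignores activations, the KV cache, and any layers kept at higher precision, so real usage will be higher. The function `weight_memory_gb` is a hypothetical helper.

```python
# Bytes per parameter; BF16 for unquantized weights is an assumption.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_memory_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return float(params_billion * BYTES_PER_PARAM[dtype])


# MoE models must hold all experts in memory, so the total count applies.
for name, total_b in [("8B", 8), ("32B", 32), ("30B-A3B", 30), ("235B-A22B", 235)]:
    bf16 = weight_memory_gb(total_b, "bf16")
    fp8 = weight_memory_gb(total_b, "fp8")
    print(f"{name}: ~{bf16:.0f} GB in BF16, ~{fp8:.0f} GB in FP8")
# 8B: ~16 GB in BF16, ~8 GB in FP8
# 32B: ~64 GB in BF16, ~32 GB in FP8
# 30B-A3B: ~60 GB in BF16, ~30 GB in FP8
# 235B-A22B: ~470 GB in BF16, ~235 GB in FP8
```

Under these assumptions FP8 halves the weight footprint, which can move a model from a multi-GPU setup to a smaller one.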