Qwen3 is a family of language models from Alibaba Cloud, ranging from 0.6B to 32B parameters. The MLX implementation provides fast inference with Metal GPU acceleration and support for both dense and quantized models.

Documentation Index
Fetch the complete documentation index at: https://mintlify.com/OminiX-ai/OminiX-MLX/llms.txt
Use this file to discover all available pages before exploring further.
Features
- Fast inference: Metal GPU acceleration with async token pipelining
- Quantization support: 4-bit and bf16 models for flexible memory/quality tradeoffs
- Step-based KV cache: Memory-efficient autoregressive generation
- Chat templates: Native support for multi-turn conversations
Installation
Add to your Cargo.toml:
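The dependency snippet was not preserved in this copy; as a placeholder, a minimal sketch assuming the crate is published under the repository's name (check the repository for the actual crate name and version):

```toml
[dependencies]
# Hypothetical crate name and version -- confirm against the repository.
ominix-mlx = "0.1"
```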
Quick start
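The original quick-start snippet is also missing here. The following is a minimal sketch of the load-and-generate flow described under Features; `Qwen3Model::load`, `Tokenizer::load`, `generate`, and related names are assumptions, not the crate's confirmed API:

```rust
use std::io::{self, Write};

// Hypothetical API sketch -- type and function names are illustrative.
// The flow (load weights and tokenizer, encode prompt, stream tokens,
// stop on EOS) matches the behavior this page describes.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let model = Qwen3Model::load("mlx-community/Qwen3-4B-bf16")?; // assumed loader
    let tokenizer = Tokenizer::load("mlx-community/Qwen3-4B-bf16")?; // assumed loader

    let prompt = "Explain KV caching in one paragraph.";
    let tokens = tokenizer.encode(prompt)?;

    // Stream up to 256 new tokens, stopping at a Qwen3 EOS token.
    for token in model.generate(&tokens, 256) {
        let token = token?;
        if token == 151643 || token == 151645 {
            break; // Qwen3 EOS ids, documented below
        }
        print!("{}", tokenizer.decode(&[token])?);
        io::stdout().flush()?;
    }
    Ok(())
}
```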
Examples
Text generation
Generate text from a prompt:
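The example invocation was stripped from this copy; assuming a standard Cargo example target (the target name `text_generation` and the `--prompt` flag are guesses -- check the repository's examples/ directory):

```bash
cargo run --release --example text_generation -- --prompt "Hello, world"
```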
Interactive chat
Multi-turn conversation with chat templates; a sketch of the key pieces follows the list. The example demonstrates:
- Loading chat templates from tokenizer_config.json
- Building conversation history
- Streaming token output
- EOS token detection for Qwen3 (tokens 151643, 151645)
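A minimal sketch of the EOS-detection and history-building pieces above. The token ids come from this page; the ChatML-style `<|im_start|>`/`<|im_end|>` tags are the format Qwen models use, but the real example loads the template from tokenizer_config.json rather than hard-coding it, and the `Message` type here is illustrative:

```rust
// Qwen3 end-of-sequence token ids (see the list above):
// 151643 = <|endoftext|>, 151645 = <|im_end|>.
const QWEN3_EOS: [u32; 2] = [151643, 151645];

fn is_eos(token: u32) -> bool {
    QWEN3_EOS.contains(&token)
}

/// One chat turn; the shape mirrors common chat-template inputs (illustrative).
struct Message {
    role: &'static str, // "system" | "user" | "assistant"
    content: String,
}

/// Render history with ChatML-style tags, ending with a generation prompt.
fn render(history: &[Message]) -> String {
    let mut out = String::new();
    for m in history {
        out.push_str(&format!("<|im_start|>{}\n{}<|im_end|>\n", m.role, m.content));
    }
    out.push_str("<|im_start|>assistant\n");
    out
}
```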
Supported models
| Model | Size (bf16) | Use case | HF path |
|---|---|---|---|
| Qwen3-0.6B | 1.2 GB | Embedded applications, testing | mlx-community/Qwen3-0.6B-bf16 |
| Qwen3-1.7B | 3.4 GB | Resource-constrained deployments | mlx-community/Qwen3-1.7B-bf16 |
| Qwen3-4B | 8 GB | General-purpose chat (recommended) | mlx-community/Qwen3-4B-bf16 |
| Qwen3-8B | 16 GB | Higher quality responses | mlx-community/Qwen3-8B-bf16 |
| Qwen3-14B | 28 GB | Advanced reasoning | mlx-community/Qwen3-14B-bf16 |
| Qwen3-32B | 64 GB | Maximum quality (requires M3 Max 128GB) | mlx-community/Qwen3-32B-bf16 |
Quantized variants
All models are available with 4-bit quantization for a 4x memory reduction: replace -bf16 with -4bit in any HuggingFace path above (e.g., mlx-community/Qwen3-4B-4bit).
Performance
Benchmark results (Apple M3 Max, 40-core GPU)
| Model | Precision | Prompt Speed | Decode Speed | Memory |
|---|---|---|---|---|
| Qwen3-4B | bf16 | 150 tok/s | 45 tok/s | 8 GB |
| Qwen3-4B | 4-bit | 250 tok/s | 75 tok/s | 3 GB |
Compared to bf16, the 4-bit model delivers:
- 1.67x faster prompt processing
- 1.67x faster token generation
- 2.67x less memory usage
Speed vs sequence length
Prompt processing speed scales linearly with input length, while decode speed remains constant per token. For a 1000-token input:
- Qwen3-4B (4-bit): ~4 seconds prefill time (1000 tokens ÷ 250 tok/s)
- Decode: 75 tokens/second regardless of context length
Converting models
Convert any Qwen3 model from HuggingFace; the converted model is written to ./mlx_model by default.
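The conversion command itself was stripped from this copy. The ./mlx_model default matches the upstream mlx-lm Python converter, so a typical invocation presumably looks like this (an assumption, not confirmed by this page):

```bash
# Assumes the upstream mlx-lm tooling: pip install mlx-lm
# -q applies 4-bit quantization; omit it to keep higher-precision weights.
# Output is written to ./mlx_model unless --mlx-path is given.
mlx_lm.convert --hf-path Qwen/Qwen3-4B -q
```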
API reference
Core functions
Model loading expects a directory containing:
- config.json - Model configuration
- model.safetensors or model-*.safetensors - Model weights
- tokenizer.json - Tokenizer
Generation
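Details of the generation API are not preserved in this copy. As a placeholder for the shape of a decode step, here is a self-contained greedy-sampling helper; the crate's real generation path may layer temperature or top-p sampling on top:

```rust
/// Pick the highest-scoring token id from a logits vector (greedy decoding).
fn argmax(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|(_, a), (_, b)| a.total_cmp(b))
        .map(|(i, _)| i)
        .expect("logits must be non-empty")
}

fn main() {
    // Toy logits: index 1 has the highest score, so it is the greedy choice.
    let logits = vec![0.1, 2.5, -0.3, 1.7];
    assert_eq!(argmax(&logits), 1);
}
```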
KV cache types
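The crate's cache types are likewise not documented in this copy. The sketch below illustrates the step-based idea named under Features: storage grows in fixed-size steps so each decoded token appends its key/value rows without a per-token reallocation. All names here are illustrative, not the crate's actual types:

```rust
/// Illustrative step-allocated KV cache for one attention layer.
struct StepKvCache {
    keys: Vec<f32>,   // flattened [capacity, width]
    values: Vec<f32>, // flattened [capacity, width]
    len: usize,       // tokens currently cached
    capacity: usize,  // allocated token slots
    step: usize,      // growth granularity in tokens, e.g. 256
    width: usize,     // n_kv_heads * head_dim
}

impl StepKvCache {
    fn new(width: usize, step: usize) -> Self {
        Self { keys: Vec::new(), values: Vec::new(), len: 0, capacity: 0, step, width }
    }

    /// Append one token's K/V rows, growing the buffers a step at a time.
    fn append(&mut self, k: &[f32], v: &[f32]) {
        assert_eq!(k.len(), self.width);
        assert_eq!(v.len(), self.width);
        if self.len == self.capacity {
            self.capacity += self.step;
            self.keys.resize(self.capacity * self.width, 0.0);
            self.values.resize(self.capacity * self.width, 0.0);
        }
        let o = self.len * self.width;
        self.keys[o..o + self.width].copy_from_slice(k);
        self.values[o..o + self.width].copy_from_slice(v);
        self.len += 1;
    }
}
```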
Troubleshooting
Out of memory errors
Try these solutions in order:
- Use a 4-bit quantized model instead of bf16
- Use a smaller model (e.g., Qwen3-1.7B instead of Qwen3-4B)
- Reduce max token limit in generation
- Close other applications to free memory
Slow generation speed
- Ensure you’re using --release build mode
- Verify Metal is enabled: check for GPU utilization in Activity Monitor
- Update to latest macOS version for best Metal performance
- Use quantized models for faster inference
Model download fails
Related models
- Qwen3-ASR - Speech recognition with Qwen3 backbone
- Qwen-Image - Image generation model with Qwen architecture