OminiX-MLX provides production-ready implementations of popular large language models optimized for Apple Silicon. All models leverage Metal GPU acceleration for fast inference with minimal memory usage.
## Supported model families
### Qwen3
0.6B to 32B parameters. Fast inference with 4-bit quantization support. Best for general-purpose text generation.
### GLM-4

9B-parameter models with a unique architecture: partial RoPE and a fused MLP for efficiency.
### Mixtral
8x7B and 8x22B MoE models. Custom Metal kernels for 10-12x faster expert dispatch.
### Mistral

7B models with sliding window attention. Efficient long-context processing with grouped-query attention (GQA).
### MiniCPM-SALA
9B hybrid attention model. Million-token context with lightning attention.
## Common features

All language model implementations share these capabilities:

- Metal GPU acceleration: Native Apple Silicon optimization with the MLX framework
- Quantization support: 4-bit and 8-bit quantized models for reduced memory usage
- KV cache: Step-based key-value caching for efficient autoregressive generation (see the sketch after this list)
- Streaming generation: Token-by-token output for interactive applications
- Tokenizer integration: HuggingFace tokenizer support with chat templates
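
To make the step-based KV cache concrete, here is a minimal sketch in plain Rust. The `KvCache` type and its methods are simplified stand-ins, using CPU-side `Vec<f32>` buffers where the real implementation keeps MLX arrays on the GPU; they are not actual OminiX-MLX types.

```rust
/// Simplified sketch of a step-based KV cache (hypothetical type; the
/// real implementation stores GPU-resident MLX arrays, not `Vec<f32>`).
struct KvCache {
    keys: Vec<Vec<f32>>,   // one growing key buffer per layer
    values: Vec<Vec<f32>>, // one growing value buffer per layer
    head_dim: usize,       // width of one token's cache entry
}

impl KvCache {
    fn new(num_layers: usize, head_dim: usize) -> Self {
        Self {
            keys: vec![Vec::new(); num_layers],
            values: vec![Vec::new(); num_layers],
            head_dim,
        }
    }

    /// Append one decode step's key/value for `layer`, then return the
    /// full cached history so attention never recomputes earlier tokens.
    fn update(&mut self, layer: usize, k: &[f32], v: &[f32]) -> (&[f32], &[f32]) {
        debug_assert_eq!(k.len(), self.head_dim);
        debug_assert_eq!(v.len(), self.head_dim);
        self.keys[layer].extend_from_slice(k);
        self.values[layer].extend_from_slice(v);
        (&self.keys[layer], &self.values[layer])
    }

    /// Number of tokens currently cached for a layer.
    fn len(&self, layer: usize) -> usize {
        self.keys[layer].len() / self.head_dim
    }
}

fn main() {
    let mut cache = KvCache::new(2, 4);
    cache.update(0, &[0.1; 4], &[0.2; 4]); // decode step 1, layer 0
    let (k_hist, _) = cache.update(0, &[0.3; 4], &[0.4; 4]); // step 2
    assert_eq!(k_hist.len(), 2 * 4); // history now covers two tokens
    assert_eq!(cache.len(0), 2);
}
```

Each decode step appends exactly one token's worth of keys and values per layer, which is what keeps per-token cost flat as the generated sequence grows.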
## Unified API
All models follow a consistent Rust API pattern, sketched below.
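
A minimal sketch of that pattern, assuming hypothetical names: every identifier in it (`Qwen3Model::load`, `Tokenizer::from_file`, `encode`, `generate`, `decode`) is an illustrative placeholder rather than the exact OminiX-MLX signature; consult the API reference for the real interface.

```rust
// Illustrative sketch only: all types and methods are hypothetical
// placeholders for the shared pattern, not the real OminiX-MLX API.
fn chat() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Load a quantized model and its HuggingFace tokenizer.
    let model = Qwen3Model::load("models/qwen3-4b-4bit")?;
    let tokenizer = Tokenizer::from_file("models/qwen3-4b-4bit/tokenizer.json")?;

    // 2. Encode the prompt (chat template applied by the tokenizer).
    let prompt_ids = tokenizer.encode("Explain KV caching in one paragraph.")?;

    // 3. Stream tokens one at a time; the step-based KV cache keeps each
    //    decode step cheap no matter how long the output grows.
    for token_id in model.generate(&prompt_ids, 256 /* max new tokens */) {
        print!("{}", tokenizer.decode(&[token_id?])?);
    }
    Ok(())
}
```

Because every model family exposes the same surface, swapping Qwen3 for Mistral or Mixtral should amount to changing only the load call.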
## Performance comparison

Benchmarks on an Apple M3 Max (40-core GPU):

| Model | Size | Prefill (tok/s) | Decode (tok/s) | Memory |
|---|---|---|---|---|
| Qwen3-4B (4-bit) | 3 GB | 250 | 75 | 3 GB |
| GLM-4-9B (4-bit) | 6 GB | ~200 | ~50 | 6 GB |
| Mixtral-8x7B (4-bit) | 26 GB | 80 | 25 | 26 GB |
| Mistral-7B (4-bit) | 4 GB | ~220 | 55 | 4 GB |
| MiniCPM-SALA-9B (8-bit) | 9.6 GB | 443 | 28 | 9.6 GB |
## Model selection guide
### For interactive chat
- Qwen3-4B (4-bit): Best balance of speed and quality for general chat
- Mistral-7B (4-bit): Strong instruction following with sliding window attention
### For long context
- MiniCPM-SALA-9B: Million-token context capability with hybrid attention
- Mistral-7B: 4096-token sliding window for efficient long sequences
### For maximum quality
- Mixtral-8x7B: 47B total parameters with expert routing
- Qwen3-32B: Largest dense model (requires 64 GB+ of memory)
### For memory-constrained systems
- Qwen3-0.6B: Smallest model at 1.2 GB
- Qwen3-1.7B: Good quality with only 3.4 GB of memory
## Next steps
### Download models

Get pre-converted MLX models from the HuggingFace Hub.

### API reference

Detailed API documentation for all model implementations.