OminiX provides high-performance vision-language model (VLM) inference on Apple Silicon through MLX. These models combine visual understanding with language capabilities, enabling applications such as image captioning, visual question answering, and multimodal understanding.
Available models
OminiX currently supports the following vision-language models:
- Moxin-7B - Dual-encoder VLM with DINOv2 and SigLIP vision backbones
Key features
Dual vision encoders
OminiX VLMs use multiple specialized vision encoders to capture different aspects of visual information:
- DINOv2 - Self-supervised ViT trained on ImageNet, excellent for semantic understanding
- SigLIP - Contrastive vision-language encoder trained on image-text pairs
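A minimal sketch of how the two feature streams can be fused before reaching the language model. The class and argument names are illustrative rather than OminiX's actual API, and each encoder is assumed to return patch features of shape (batch, patches, dim):

```python
import mlx.core as mx
import mlx.nn as nn

class DualVisionEncoder(nn.Module):
    """Illustrative fusion of DINOv2 and SigLIP patch features (hypothetical names)."""

    def __init__(self, dinov2: nn.Module, siglip: nn.Module, fused_dim: int, text_dim: int):
        super().__init__()
        self.dinov2 = dinov2   # self-supervised features for semantic understanding
        self.siglip = siglip   # language-aligned features from contrastive training
        # Project the concatenated visual features into the language model's embedding space.
        self.proj = nn.Linear(fused_dim, text_dim)

    def __call__(self, pixels: mx.array) -> mx.array:
        a = self.dinov2(pixels)                  # (batch, patches, d1)
        b = self.siglip(pixels)                  # (batch, patches, d2)
        fused = mx.concatenate([a, b], axis=-1)  # (batch, patches, d1 + d2)
        return self.proj(fused)                  # (batch, patches, text_dim)
```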
Efficient quantization
Reduce memory usage and improve inference speed with 8-bit or 4-bit quantization:
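OminiX's own quantization entry point is not reproduced here; as a sketch of the mechanism, MLX's built-in `nn.quantize` replaces `Linear` layers with quantized equivalents, which is the standard quantization path in MLX. The toy model below only stands in for the real one:

```python
import mlx.nn as nn

# Toy stand-in for the language decoder; the real model comes from OminiX's loader.
model = nn.Sequential(nn.Linear(4096, 11008), nn.GELU(), nn.Linear(11008, 4096))

# Replace Linear layers in place with quantized versions.
nn.quantize(model, group_size=64, bits=8)    # INT8: roughly halves weight memory vs BF16
# nn.quantize(model, group_size=64, bits=4)  # INT4: roughly quarters weight memory vs BF16
```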
KV-cache generation
All VLMs use efficient KV-caching for fast autoregressive generation:
- Prefill - Process image and prompt in parallel, cache key-value pairs
- Decode - Generate tokens one at a time using cached values
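The loop below is a minimal sketch of this split using greedy sampling and batch size 1. The model call signature (taking a `cache` argument and returning updated cache state) follows common MLX generation loops and is an assumption, not OminiX's exact interface:

```python
import mlx.core as mx

def generate(model, pixels, prompt_tokens, max_tokens=128, eos_id=2):
    """Greedy generation sketch: one prefill pass, then cached single-token decode steps."""
    # Prefill: vision encoding plus the full prompt in one pass; the cache stores K/V pairs.
    logits, cache = model(pixels=pixels, tokens=prompt_tokens, cache=None)
    token = mx.argmax(logits[:, -1, :], axis=-1)   # (batch,)

    output = [token.item()]
    # Decode: each step feeds only the newest token and reuses the cached K/V pairs.
    for _ in range(max_tokens - 1):
        logits, cache = model(tokens=token[:, None], cache=cache)
        token = mx.argmax(logits[:, -1, :], axis=-1)
        if token.item() == eos_id:
            break
        output.append(token.item())
    return output
```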
Architecture overview
The typical VLM architecture in OminiX follows this pipeline: the dual vision backbones encode the image, the resulting features are projected into the language model's embedding space, and the language decoder generates text conditioned on the combined visual and prompt sequence.
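A functional sketch of that data flow, assuming a LLaVA-style layout in which projected visual tokens are prepended to the prompt embeddings; the component names are placeholders and the exact arrangement in OminiX may differ:

```python
import mlx.core as mx

def vlm_forward(vision_encoder, projector, embed_tokens, decoder,
                pixels: mx.array, tokens: mx.array) -> mx.array:
    """End-to-end forward pass: encode image, project, prepend to text, decode."""
    image_feats = projector(vision_encoder(pixels))   # (B, P, D) visual tokens
    text_embeds = embed_tokens(tokens)                # (B, T, D) prompt embeddings
    # Visual tokens are placed ahead of the prompt so the decoder can attend to them.
    sequence = mx.concatenate([image_feats, text_embeds], axis=1)
    return decoder(sequence)                          # (B, P + T, vocab) logits
```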
Performance considerations
Memory usage
- BF16 (no quantization): ~14GB for Moxin-7B
- INT8 quantization: ~7GB for Moxin-7B
- INT4 quantization: ~4GB for Moxin-7B
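These figures follow roughly from 7 billion parameters at 2, 1, and 0.5 bytes per weight respectively, plus overhead for the vision encoders, activations, and the KV cache.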
Inference speed
Typical performance on M3 Max (36 GPU cores):
- Prefill: 200-300ms (vision encoding + prompt processing)
- Decode: 30-50 tokens/second (BF16), 50-80 tokens/second (INT8)
Vision encoding only runs once during prefill. Subsequent token generation uses only the language decoder, making decode steps much faster.
Getting started
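A hypothetical quickstart: the import path, `load`, and `generate` below are placeholders rather than OminiX's confirmed API, so check the Moxin-7B guide and API reference for the real entry points:

```python
# Placeholder API: module name, load(), and generate() are assumptions for illustration.
from ominix_mlx import load, generate

model, processor = load("moxin-7b", quantize="int8")
answer = generate(model, processor,
                  image="photo.jpg",
                  prompt="What is happening in this image?",
                  max_tokens=128)
print(answer)
```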
Next steps
- Moxin-7B guide - Learn how to use the Moxin-7B VLM
- API reference - Explore the complete API documentation