8-bit quantization reduces model memory footprint by 4x compared to FP32 weights (2x compared to BF16) while maintaining generation quality. OminiX-MLX supports INT8 affine quantization with configurable group sizes.
## How quantization works
Quantization compresses 32-bit floating-point weights into 8-bit integers using affine (linear) quantization:
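Concretely, each group of consecutive weights shares one scale and one zero-point offset. The following is a minimal self-contained sketch of the math, not the OminiX-MLX kernels (all names are my own):

```rust
/// Groupwise affine quantization sketch: map a group's [min, max] range
/// onto the 8-bit range [0, 255]. One (scale, offset) pair per group.
fn quantize_group(weights: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = weights.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = weights.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    // Guard against a constant group (max == min).
    let scale = ((max - min) / 255.0).max(f32::EPSILON);
    let q = weights
        .iter()
        .map(|&w| (((w - min) / scale).round() as i32).clamp(0, 255) as u8)
        .collect();
    (q, scale, min) // `min` acts as the zero-point offset
}

/// Dequantize: w ≈ q * scale + offset.
fn dequantize_group(q: &[u8], scale: f32, offset: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale + offset).collect()
}

fn main() {
    let w = [0.02_f32, -0.13, 0.40, 0.07];
    let (q, scale, offset) = quantize_group(&w);
    let back = dequantize_group(&q, scale, offset);
    println!("{q:?} -> {back:?}"); // round-trips within one scale step
}
```

With a group size of 64, every 64 weights share one scale and one offset; this per-group metadata is the "scales/zeros" overhead that appears in the tables below.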
## Quantization in models

### Moxin-7B VLM
Moxin-7B quantizes only the Mistral-7B decoder to INT8. The dual vision encoders (DINOv2 + SigLIP) remain in BF16 to preserve visual feature quality (see `moxin-vlm-mlx/src/lib.rs`).
### Qwen3-ASR
Qwen3-ASR models quantize the text decoder only; the audio encoder (Conv2d + Transformer) stays at full precision. This preserves audio feature quality while reducing memory for the larger LLM component.
| Model | Size | Speed |
|---|---|---|
| Qwen3-ASR-1.7B-8bit | 2.46 GB | ~30x RT |
| Qwen3-ASR-0.6B-8bit | 1.01 GB | ~50x RT |
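Here, "RT" is a real-time factor: 30x RT means audio is transcribed roughly thirty times faster than its playback duration. A quick helper for translating that into wall-clock time (my own sketch, not part of the crate):

```rust
/// Estimated processing time for a clip, given a real-time (RT) factor.
fn processing_secs(audio_secs: f64, rt_factor: f64) -> f64 {
    audio_secs / rt_factor
}

fn main() {
    // One hour of audio at ~50x RT (Qwen3-ASR-0.6B-8bit) -> ~72 s.
    println!("{:.0} s", processing_secs(3600.0, 50.0));
}
```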
### GLM4-MoE
GLM4-MoE supports 3-bit quantization for ultra-low memory use.
## Performance impact

### Memory reduction
| Precision | Size | Relative to FP32 |
|---|---|---|
| FP32 | 4 bytes/param | 1.0x |
| BF16 | 2 bytes/param | 0.5x |
| 8-bit | 1 byte/param | 0.25x |
| 3-bit | 0.375 bytes/param | ~0.09x |
For a 7B-parameter model (arithmetic sketched below):

- BF16: ~14 GB
- 8-bit: ~7 GB (base weights) + ~0.4 GB (scales/zeros) ≈ 7.4 GB
- 3-bit: ~2.6 GB (base weights, plus scales/zeros)
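The arithmetic behind these numbers, as a small self-contained sketch. The 16-bit scale and zero point per group and the group size of 64 are assumptions for illustration, not confirmed storage details:

```rust
/// Rough weight-memory estimate for groupwise-quantized models,
/// assuming one 16-bit scale and one 16-bit zero point per group.
fn quantized_gb(params: f64, bits: f64, group_size: f64) -> f64 {
    let weight_bytes = params * bits / 8.0;
    let metadata_bytes = params / group_size * 4.0; // 2 + 2 bytes per group
    (weight_bytes + metadata_bytes) / 1e9
}

fn main() {
    let p = 7.0e9; // 7B parameters
    println!("BF16 : {:.1} GB", p * 2.0 / 1e9);              // ~14.0 GB
    println!("8-bit: {:.1} GB", quantized_gb(p, 8.0, 64.0)); // ~7.4 GB
    println!("3-bit: {:.1} GB", quantized_gb(p, 3.0, 64.0)); // ~3.1 GB
}
```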
### Speed impact
Quantized inference is typically 5-15% slower than full precision due to dequantization overhead. However, the memory savings enable:

- Running larger models on the same hardware
- Higher batch sizes
- Reduced memory bandwidth pressure
## Group size selection
Smaller group sizes preserve more accuracy but add scale/zero-point overhead (quantified in the sketch after the table):

| Group Size | Quality | Overhead | Use Case |
|---|---|---|---|
| 32 | Best | High | Critical quality tasks |
| 64 | Excellent | Medium | Recommended default |
| 128 | Good | Low | Maximum efficiency |
| 256 | Fair | Minimal | Experimental |
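The Overhead column reflects per-group metadata amortized over the group. Assuming 4 bytes of metadata (16-bit scale + 16-bit zero point) per group, which is an assumption for illustration:

```rust
fn main() {
    // Extra bytes per parameter contributed by scales/zero points.
    for group_size in [32u32, 64, 128, 256] {
        let overhead = 4.0 / group_size as f64; // bytes per parameter
        println!(
            "group_size {:>3}: +{:.4} bytes/param ({:.1}% of the INT8 weight)",
            group_size,
            overhead,
            overhead * 100.0 // INT8 weight is 1 byte/param
        );
    }
}
```

This is why group size 32 lands in the "High" overhead row (+12.5% on top of the INT8 weights) while 256 adds under 2%.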
## Pre-quantized vs runtime quantization
### Pre-quantized models (recommended)
Models from mlx-community on HuggingFace are pre-quantized:
- Weights stored as INT8 + scales/zeros in safetensors
- No conversion overhead at load time
- Optimized scale/zero-point placement
- Immediate inference after loading
### Runtime quantization
Quantize models at load time (sketched after this list):

- Use any BF16 model without conversion
- Experiment with different group sizes
- Slower first load (1-2 minutes for a 7B model)
- Peak memory during conversion holds both the BF16 and INT8 copies
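Structurally, a load-time pass might look like the sketch below, which reuses `quantize_group` from the earlier example; all names here are hypothetical, and the point is only to show why both weight copies are briefly alive:

```rust
/// Hypothetical runtime-quantization pass (`quantize_group` is the
/// groupwise affine routine sketched under "How quantization works").
fn quantize_at_load(
    bf16_weights: Vec<(String, Vec<f32>)>, // decoded to f32 for illustration
    group_size: usize,
) -> Vec<(String, Vec<u8>, Vec<f32>, Vec<f32>)> {
    bf16_weights
        .into_iter()
        .map(|(name, w)| {
            let mut q = Vec::with_capacity(w.len());
            let mut scales = Vec::new();
            let mut zeros = Vec::new();
            for group in w.chunks(group_size) {
                let (gq, scale, zero) = quantize_group(group);
                q.extend(gq);
                scales.push(scale);
                zeros.push(zero);
            }
            // `w` is dropped only here, so each tensor transiently needs
            // full-precision + quantized memory at once.
            (name, q, scales, zeros)
        })
        .collect()
}
```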
### Saving quantized models
Save quantized weights to avoid re-quantization on subsequent loads:
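A minimal sketch of persisting quantized tensors with the `safetensors` crate, assuming INT8 weights stored one byte per element and the key convention described under "Weight format" below (the layer path and dtypes are illustrative):

```rust
use std::collections::HashMap;
use safetensors::tensor::{Dtype, TensorView};

/// Persist one quantized layer so later loads skip re-quantization.
fn save_layer(
    path: &str,
    q: &[u8],       // quantized weights, one byte per element here
    scales: &[f32], // one scale per group
    zeros: &[f32],  // one zero point per group
    shape: (usize, usize),
) -> Result<(), Box<dyn std::error::Error>> {
    let scale_bytes: Vec<u8> = scales.iter().flat_map(|s| s.to_le_bytes()).collect();
    let zero_bytes: Vec<u8> = zeros.iter().flat_map(|z| z.to_le_bytes()).collect();

    let mut tensors: HashMap<String, TensorView> = HashMap::new();
    tensors.insert(
        "layers.0.mlp.weight".into(),
        TensorView::new(Dtype::U8, vec![shape.0, shape.1], q)?,
    );
    tensors.insert(
        "layers.0.mlp.scales".into(),
        TensorView::new(Dtype::F32, vec![scales.len()], &scale_bytes)?,
    );
    tensors.insert(
        "layers.0.mlp.biases".into(),
        TensorView::new(Dtype::F32, vec![zeros.len()], &zero_bytes)?,
    );

    std::fs::write(path, safetensors::serialize(tensors, &None)?)?;
    Ok(())
}
```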
## Quality considerations

### What to quantize
Quantize (a name-based policy is sketched after this list):
- LLM decoder layers (Transformer blocks)
- Linear/Dense layers
- Large parameter-heavy components
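In code, the list above usually becomes a predicate over parameter names. The following sketch is illustrative; the name patterns are hypothetical, not OminiX-MLX's actual keys:

```rust
/// Hypothetical name-based policy: quantize large linear weights in the
/// decoder, leave encoders, norms, and embeddings at full precision.
fn should_quantize(name: &str, num_elements: usize) -> bool {
    let in_encoder = name.starts_with("vision_encoder.")
        || name.starts_with("audio_encoder.");
    let is_linear_weight = name.ends_with(".weight")
        && !name.contains("norm")
        && !name.contains("embed");
    // Tiny tensors gain little from quantization.
    !in_encoder && is_linear_weight && num_elements >= 4096
}
```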
### Model-specific recommendations
| Model Family | Recommended Quantization |
|---|---|
| LLMs (7B+) | 8-bit, group_size=64 |
| VLMs | Decoder only, 8-bit |
| ASR | Decoder only, 8-bit |
| MoE (large) | 3-bit or 4-bit |
| Small models (under 1B) | No quantization |
## Weight format
Quantized models use safetensors with special keys:
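A typical key layout for one quantized projection might look like this; the `.scales`/`.biases` suffixes follow the mlx-community convention of storing group metadata alongside each quantized `weight`, and the exact layer path is an example:

```text
model.layers.0.self_attn.q_proj.weight   # packed quantized weights
model.layers.0.self_attn.q_proj.scales   # one scale per group
model.layers.0.self_attn.q_proj.biases   # one zero point (offset) per group
```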
## Benchmarks

Measured on an Apple M4 Max (128 GB):

| Model | Precision | Memory | Speed | Quality |
|---|---|---|---|---|
| Moxin-7B VLM | BF16 | 14 GB | 32 tok/s | Baseline |
| Moxin-7B VLM | 8-bit | 10 GB | 30 tok/s | Negligible loss |
| Qwen3-ASR-1.7B | BF16 | 3.2 GB | 32x RT | Baseline |
| Qwen3-ASR-1.7B | 8-bit | 2.5 GB | 30x RT | Negligible loss |
| GLM4-MoE | BF16 | 80 GB | N/A | Baseline |
| GLM4-MoE | 3-bit | 20 GB | 15-20 tok/s | Acceptable loss |
In these benchmarks, the speed cost of quantization is roughly 5-10%, primarily from dequantization overhead, while the memory savings often enable running models that wouldn't fit in BF16 at all.
## References
- MLX Quantization Guide
- mlx-community models
- `moxin-vlm-mlx/README.md:24`
- `qwen3-asr-mlx/README.md:229`
- `glm4-moe-mlx/README.md:10`