## Documentation Index

Fetch the complete documentation index at https://mintlify.com/OminiX-ai/OminiX-MLX/llms.txt and use it to discover all available pages before exploring further.
## Overview

OminiX-MLX is a layered Rust ecosystem for ML inference on Apple Silicon. The architecture follows a bottom-up design: lower-level crates provide safe abstractions over MLX, and higher-level crates implement specific model families.

*(Architecture diagram)*
## Layer breakdown

### Foundation layer (mlx-sys)
The lowest layer provides raw FFI bindings to Apple's MLX C++ library:

- **Auto-generated bindings**: Uses bindgen to generate FFI declarations from the MLX C headers
- **mlx-c submodule**: Git submodule tracking the upstream MLX C bindings
- **Zero-cost abstractions**: Direct mapping to C functions with no runtime overhead
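The usual pattern is that the bindgen-generated functions are `unsafe extern "C"` items, and the layer above wraps them in safe Rust. A minimal sketch of that wrapping pattern (the `ffi` module and `mlx_sum` name here are illustrative stand-ins, not the real mlx-sys API):

```rust
// Stand-in for a bindgen-generated binding; in mlx-sys these come from
// MLX's C headers. The name and signature here are illustrative only.
mod ffi {
    pub unsafe fn mlx_sum(ptr: *const f32, len: usize) -> f32 {
        std::slice::from_raw_parts(ptr, len).iter().sum()
    }
}

/// Safe wrapper: the borrow checker guarantees `data` outlives the call,
/// so the raw pointer handed to the FFI layer is always valid.
pub fn sum(data: &[f32]) -> f32 {
    unsafe { ffi::mlx_sum(data.as_ptr(), data.len()) }
}

fn main() {
    println!("{}", sum(&[1.0, 2.0, 3.0])); // prints 6
}
```

Because the wrapper is a direct call into the C function with no extra bookkeeping, it compiles down to the same code as the raw binding, which is what "zero-cost" means here.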
### Core abstraction layer (mlx-rs)

Provides a safe, idiomatic Rust API over mlx-sys:

- **Array operations**: N-dimensional arrays with automatic memory management
- **Neural network layers**: Linear, convolution, attention, normalization
- **Function transforms**: Automatic differentiation (`grad`), compilation
- **Device management**: CPU/GPU device abstraction with unified memory
- **Random operations**: Random number generation and distributions
- **Type safety**: Compile-time shape and dtype validation where possible
Key modules:

- `array/mod.rs`: Core `Array` type and operations
- `device.rs`: Device abstraction (CPU/GPU) (mlx-rs/src/device.rs:11)
- `stream.rs`: Execution streams for parallel computation (mlx-rs/src/stream.rs:110)
- `ops/`: Mathematical and neural network operations
- `transforms/`: Function transformations (grad, compile)
- `nn/`: High-level neural network layers
### Shared infrastructure layer (mlx-rs-core)

Common components shared across model implementations:

**KV cache management**
- `ConcatKeyValueCache`: Simple concatenating cache for autoregressive generation
- `KeyValueCache` trait: Interface for custom cache implementations
- Used by all LLM/VLM crates for efficient token generation
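The concatenating-cache pattern can be pictured as follows. This is an illustrative plain-Rust sketch of the idea, not the actual mlx-rs-core types or signatures (the real cache stores MLX arrays per layer and head, not flat `Vec<f32>`s):

```rust
/// Illustrative stand-in for the cache trait: implementations receive the
/// new step's key/value tensors and return the accumulated history.
trait KeyValueCache {
    fn update(&mut self, keys: Vec<f32>, values: Vec<f32>) -> (&[f32], &[f32]);
}

/// Concatenating cache: appends each step's K/V so attention at step t
/// can see all earlier tokens without recomputing their projections.
#[derive(Default)]
struct ConcatKeyValueCache {
    keys: Vec<f32>,
    values: Vec<f32>,
}

impl KeyValueCache for ConcatKeyValueCache {
    fn update(&mut self, keys: Vec<f32>, values: Vec<f32>) -> (&[f32], &[f32]) {
        self.keys.extend(keys);
        self.values.extend(values);
        (&self.keys, &self.values)
    }
}

fn main() {
    let mut cache = ConcatKeyValueCache::default();
    cache.update(vec![1.0], vec![10.0]);
    let (k, _v) = cache.update(vec![2.0], vec![20.0]);
    assert_eq!(k, &[1.0, 2.0]); // history grows with each generated token
}
```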
**Attention utilities**
- `scaled_dot_product_attention()`: Optimized SDPA with mask support
- `create_attention_mask()`: Causal and sliding window mask generation
- `initialize_rope()`: RoPE embeddings with scaling configurations
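A causal mask lets position i attend only to positions at or before i. A minimal plain-Rust sketch of the idea behind causal mask generation (the real helper operates on MLX arrays and also supports sliding windows):

```rust
/// Build a (len x len) additive causal mask: 0.0 where attention is
/// allowed (j <= i), -inf where a position would see the future.
/// The mask is added to attention logits before the softmax.
fn causal_mask(len: usize) -> Vec<Vec<f32>> {
    (0..len)
        .map(|i| {
            (0..len)
                .map(|j| if j <= i { 0.0 } else { f32::NEG_INFINITY })
                .collect()
        })
        .collect()
}

fn main() {
    let m = causal_mask(3);
    assert_eq!(m[0][1], f32::NEG_INFINITY); // token 0 cannot see token 1
    assert_eq!(m[2][0], 0.0);               // token 2 can see token 0
}
```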
**Audio processing**
- **WAV I/O**: Load/save 16/24/32-bit PCM audio
- **Resampling**: High-quality sinc interpolation
- **Mel spectrograms**: STFT-based feature extraction
- **HuBERT preprocessing**: Specialized audio normalization
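As an illustration of the WAV I/O step, 16-bit PCM samples are conventionally normalized to `f32` in roughly [-1, 1] before feature extraction. A sketch of that convention (the crate's exact scaling may differ):

```rust
/// Convert signed 16-bit PCM samples to f32 in roughly [-1.0, 1.0].
/// Dividing by 32768 maps i16::MIN to exactly -1.0.
fn pcm16_to_f32(samples: &[i16]) -> Vec<f32> {
    samples.iter().map(|&s| s as f32 / 32768.0).collect()
}

fn main() {
    let audio = pcm16_to_f32(&[0, 16384, -32768]);
    assert_eq!(audio, vec![0.0, 0.5, -1.0]);
}
```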
**Fused Metal kernels**
- `fused_swiglu()`: Fused SwiGLU activation (45x faster for MoE models)
- Custom Metal shaders for specialized operations
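For reference, an unfused scalar SwiGLU in plain Rust shows what the fused kernel computes in a single pass (the real `fused_swiglu()` runs on MLX arrays; the speedup comes from doing the sigmoid, multiply, and gate in one kernel launch instead of materializing intermediates):

```rust
/// SiLU (swish) activation: x * sigmoid(x), written as x / (1 + e^-x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// SwiGLU: elementwise silu(gate) * x. A fused kernel computes this
/// without allocating the intermediate silu(gate) tensor.
fn swiglu(x: &[f32], gate: &[f32]) -> Vec<f32> {
    x.iter().zip(gate).map(|(&xi, &gi)| silu(gi) * xi).collect()
}

fn main() {
    let out = swiglu(&[1.0, 2.0], &[0.0, 1.0]);
    assert!((out[0]).abs() < 1e-6); // silu(0) = 0, so the gate closes
}
```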
### Model implementation layer

Model-specific crates implementing complete inference pipelines:

**LLM/VLM crates** (qwen3-mlx, glm4-mlx, mixtral-mlx, etc.)
- Model architecture definitions
- Weight loading from safetensors/HuggingFace
- Tokenizer integration
- Generation loops with KV caching
- Quantization support (4-bit, 8-bit)
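The generation loops these crates implement share a common shape: run the prompt once, then feed back one token at a time while the KV cache carries the history. A schematic sketch with a stub model (`next_token` here is a placeholder, not a real API; a real crate runs a transformer forward pass at that point):

```rust
/// Placeholder "model": real crates run a cached transformer forward
/// pass here and take the argmax (or sample) over the logits.
fn next_token(context: &[u32]) -> u32 {
    (context.last().copied().unwrap_or(0) + 1) % 100
}

/// Greedy autoregressive loop: with a KV cache, each iteration only
/// processes the newest token instead of re-running the whole prompt.
fn generate(prompt: &[u32], max_new: usize, eos: u32) -> Vec<u32> {
    let mut tokens = prompt.to_vec();
    for _ in 0..max_new {
        let tok = next_token(&tokens);
        if tok == eos {
            break; // stop early on the end-of-sequence token
        }
        tokens.push(tok);
    }
    tokens
}

fn main() {
    let out = generate(&[1, 2], 3, 99);
    assert_eq!(out, vec![1, 2, 3, 4, 5]);
}
```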
**ASR crates**
- Audio frontend processing (mel spectrograms, STFT)
- Encoder/decoder architectures (Paraformer, Whisper-style)
- Vocabulary management
- Real-time streaming support
**Image generation crates**
- VAE encoders/decoders
- Diffusion transformers (DiT, MMDiT)
- Text encoder integration
- Latent space manipulation
### Application layer

User-facing applications and APIs:

- **ominix-api**: Unified HTTP server with OpenAI-compatible endpoints
- **Custom applications**: User code directly importing model crates
- **Example binaries**: Reference implementations in each crate's `examples/` directory
## Crate structure

## Data flow patterns

### LLM inference pipeline

### ASR inference pipeline

### Image generation pipeline
## Memory management

OminiX-MLX leverages Rust's ownership system combined with MLX's unified memory:

- **Reference counting**: MLX arrays use internal reference counting; Rust's `Drop` decrements the count
- **Zero-copy operations**: Arrays can be shared between CPU and GPU without copying
- **Lazy materialization**: Array data is only allocated when `eval()` is called
- **Automatic cleanup**: When Rust values go out of scope, MLX memory is freed
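The ownership model mirrors `Arc<T>`: cloning bumps a reference count, `Drop` decrements it, and the buffer is freed when the count reaches zero. A standard-library analogy (MLX manages its count internally on the C++ side rather than through `Arc`):

```rust
use std::sync::Arc;

fn main() {
    let a = Arc::new(vec![1.0_f32; 1024]); // one owner of the buffer
    let b = Arc::clone(&a);                // cheap: bumps refcount, no copy
    assert_eq!(Arc::strong_count(&a), 2);

    drop(b); // Drop decrements the count...
    assert_eq!(Arc::strong_count(&a), 1);
    // ...and the buffer itself is freed only when the last owner drops.
}
```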
The `Array` type is thread-safe (`Send`) and uses MLX's internal reference counting, similar to `Arc<T>` but managed by MLX.

## Parallelism and streams

MLX uses streams to enable parallel execution:

- **Default stream**: Operations use `StreamOrDevice::default()`, which maps to the GPU by default
- **Explicit streams**: Create separate streams for parallel computation
- **No data races**: MLX handles synchronization between operations on different streams
- **Device specification**: Operations can target CPU or GPU via the `stream` parameter
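Conceptually, explicit streams behave like independent work queues that the framework keeps synchronized. As a loose standard-library analogy (threads standing in for streams; this is not the mlx-rs stream API), two independent computations proceed in parallel and are joined where their results meet:

```rust
use std::thread;

fn main() {
    // Two independent "streams" of work; MLX would enqueue these on
    // separate command queues and synchronize only at dependencies.
    let s1 = thread::spawn(|| (0..1000).sum::<u64>());
    let s2 = thread::spawn(|| (0..1000).map(|x: u64| x * x).sum::<u64>());

    // Joining is the analogue of a stream synchronization point.
    let (a, b) = (s1.join().unwrap(), s2.join().unwrap());
    assert_eq!(a, 499_500);
    assert_eq!(b, 332_833_500);
}
```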
## Build system

The project uses Cargo workspaces for efficient builds:

- **Workspace root**: The top-level `Cargo.toml` defines all member crates
- **Shared dependencies**: Common dependencies are specified once at the workspace level
- **Incremental compilation**: Changing one model crate only rebuilds that crate
- **Feature flags**: `metal` and `accelerate` features control the MLX backend
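A workspace of this shape typically has a root manifest like the following sketch (member names are taken from the crates mentioned above; the actual manifest and dependency versions may differ):

```toml
[workspace]
resolver = "2"
members = [
    "mlx-sys",
    "mlx-rs",
    "mlx-rs-core",
    "qwen3-mlx",
    # ...one member per model crate
]

[workspace.dependencies]
# Shared dependencies are declared once here; member crates inherit
# them with e.g. `serde = { workspace = true }`.
serde = "1"
```

Because each model crate is a separate workspace member, editing one crate triggers a rebuild of only that crate and its dependents, not the whole tree.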
## Design principles

- **Modularity**: Each model family is a separate crate with minimal dependencies
- **Type safety**: Leverage Rust's type system to catch errors at compile time
- **Zero-cost abstractions**: Rust wrappers add no runtime overhead over raw MLX
- **Ergonomic APIs**: Provide convenient builders, macros, and method chaining
- **Pure Rust inference**: No Python runtime required; models run standalone
- **Production-ready**: Focus on reliability, error handling, and performance

## Next steps
- **MLX framework**: Learn about the MLX framework and Metal acceleration
- **Unified memory**: Understand Apple Silicon's unified memory architecture
- **Lazy evaluation**: Explore lazy evaluation and compute graph optimization
- **Core API**: Browse the mlx-rs core API reference