## Documentation Index

Fetch the complete documentation index at https://mintlify.com/OminiX-ai/OminiX-MLX/llms.txt and use it to discover all available pages before exploring further.
## Overview

OminiX-MLX is a layered Rust ecosystem for ML inference on Apple Silicon. The architecture follows a bottom-up design: lower-level crates provide safe abstractions over MLX, and higher-level crates implement specific model families.

*(Architecture diagram)*
## Layer breakdown

### Foundation layer (mlx-sys)
The lowest layer provides raw FFI bindings to Apple's MLX C++ library:

- **Auto-generated bindings**: Uses bindgen to generate FFI declarations from the MLX C headers
- **mlx-c submodule**: Git submodule tracking the upstream MLX C bindings
- **Zero-cost abstractions**: Direct mapping to C functions with no runtime overhead
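The usual pattern is that the bindgen-generated functions are `unsafe extern "C"` items, and the layer above wraps them in safe Rust. A minimal sketch of that wrapping pattern (the `ffi` module and `mlx_sum` name here are illustrative stand-ins, not the real mlx-sys API):

```rust
// Stand-in for a bindgen-generated binding; in mlx-sys these come from
// MLX's C headers. The name and signature here are illustrative only.
mod ffi {
    pub unsafe fn mlx_sum(ptr: *const f32, len: usize) -> f32 {
        std::slice::from_raw_parts(ptr, len).iter().sum()
    }
}

/// Safe wrapper: the borrow checker guarantees `data` outlives the call,
/// so the raw pointer handed to the FFI layer is always valid.
pub fn sum(data: &[f32]) -> f32 {
    unsafe { ffi::mlx_sum(data.as_ptr(), data.len()) }
}

fn main() {
    println!("{}", sum(&[1.0, 2.0, 3.0])); // prints 6
}
```

Because the wrapper is a direct call into the C function with no extra bookkeeping, it compiles down to the same code as the raw binding, which is what "zero-cost" means here.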
### Core abstraction layer (mlx-rs)

Provides a safe, idiomatic Rust API over mlx-sys:

- **Array operations**: N-dimensional arrays with automatic memory management
- **Neural network layers**: Linear, convolution, attention, normalization
- **Function transforms**: Automatic differentiation (`grad`), compilation
- **Device management**: CPU/GPU device abstraction with unified memory
- **Random operations**: Random number generation and distributions
- **Type safety**: Compile-time shape and dtype validation where possible
Key modules:

- `array/mod.rs`: Core `Array` type and operations
- `device.rs`: Device abstraction (CPU/GPU) (mlx-rs/src/device.rs:11)
- `stream.rs`: Execution streams for parallel computation (mlx-rs/src/stream.rs:110)
- `ops/`: Mathematical and neural network operations
- `transforms/`: Function transformations (grad, compile)
- `nn/`: High-level neural network layers
### Shared infrastructure layer (mlx-rs-core)

Common components shared across model implementations:

**KV cache management**
- `ConcatKeyValueCache`: Simple concatenating cache for autoregressive generation
- `KeyValueCache` trait: Interface for custom cache implementations
- Used by all LLM/VLM crates for efficient token generation
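The concatenating-cache pattern can be pictured as follows. This is an illustrative plain-Rust sketch of the idea, not the actual mlx-rs-core types or signatures (the real cache stores MLX arrays per layer and head, not flat `Vec<f32>`s):

```rust
/// Illustrative stand-in for the cache trait: implementations receive the
/// new step's key/value tensors and return the accumulated history.
trait KeyValueCache {
    fn update(&mut self, keys: Vec<f32>, values: Vec<f32>) -> (&[f32], &[f32]);
}

/// Concatenating cache: appends each step's K/V so attention at step t
/// can see all earlier tokens without recomputing their projections.
#[derive(Default)]
struct ConcatKeyValueCache {
    keys: Vec<f32>,
    values: Vec<f32>,
}

impl KeyValueCache for ConcatKeyValueCache {
    fn update(&mut self, keys: Vec<f32>, values: Vec<f32>) -> (&[f32], &[f32]) {
        self.keys.extend(keys);
        self.values.extend(values);
        (&self.keys, &self.values)
    }
}

fn main() {
    let mut cache = ConcatKeyValueCache::default();
    cache.update(vec![1.0], vec![10.0]);
    let (k, _v) = cache.update(vec![2.0], vec![20.0]);
    assert_eq!(k, &[1.0, 2.0]); // history grows with each generated token
}
```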
**Attention utilities**
- `scaled_dot_product_attention()`: Optimized SDPA with mask support
- `create_attention_mask()`: Causal and sliding window mask generation
- `initialize_rope()`: RoPE embeddings with scaling configurations
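A causal mask lets position i attend only to positions at or before i. A minimal plain-Rust sketch of the idea behind causal mask generation (the real helper operates on MLX arrays and also supports sliding windows):

```rust
/// Build a (len x len) additive causal mask: 0.0 where attention is
/// allowed (j <= i), -inf where a position would see the future.
/// The mask is added to attention logits before the softmax.
fn causal_mask(len: usize) -> Vec<Vec<f32>> {
    (0..len)
        .map(|i| {
            (0..len)
                .map(|j| if j <= i { 0.0 } else { f32::NEG_INFINITY })
                .collect()
        })
        .collect()
}

fn main() {
    let m = causal_mask(3);
    assert_eq!(m[0][1], f32::NEG_INFINITY); // token 0 cannot see token 1
    assert_eq!(m[2][0], 0.0);               // token 2 can see token 0
}
```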
**Audio processing**
- **WAV I/O**: Load/save 16/24/32-bit PCM audio
- **Resampling**: High-quality sinc interpolation
- **Mel spectrograms**: STFT-based feature extraction
- **HuBERT preprocessing**: Specialized audio normalization
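As an illustration of the WAV I/O step, 16-bit PCM samples are conventionally normalized to `f32` in roughly [-1, 1] before feature extraction. A sketch of that convention (the crate's exact scaling may differ):

```rust
/// Convert signed 16-bit PCM samples to f32 in roughly [-1.0, 1.0].
/// Dividing by 32768 maps i16::MIN to exactly -1.0.
fn pcm16_to_f32(samples: &[i16]) -> Vec<f32> {
    samples.iter().map(|&s| s as f32 / 32768.0).collect()
}

fn main() {
    let audio = pcm16_to_f32(&[0, 16384, -32768]);
    assert_eq!(audio, vec![0.0, 0.5, -1.0]);
}
```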
**Fused Metal kernels**
- `fused_swiglu()`: Fused SwiGLU activation (45x faster for MoE models)
- Custom Metal shaders for specialized operations
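For reference, an unfused scalar SwiGLU in plain Rust shows what the fused kernel computes in a single pass (the real `fused_swiglu()` runs on MLX arrays; the speedup comes from doing the sigmoid, multiply, and gate in one kernel launch instead of materializing intermediates):

```rust
/// SiLU (swish) activation: x * sigmoid(x), written as x / (1 + e^-x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// SwiGLU: elementwise silu(gate) * x. A fused kernel computes this
/// without allocating the intermediate silu(gate) tensor.
fn swiglu(x: &[f32], gate: &[f32]) -> Vec<f32> {
    x.iter().zip(gate).map(|(&xi, &gi)| silu(gi) * xi).collect()
}

fn main() {
    let out = swiglu(&[1.0, 2.0], &[0.0, 1.0]);
    assert!((out[0]).abs() < 1e-6); // silu(0) = 0, so the gate closes
}
```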
### Model implementation layer

Model-specific crates implementing complete inference pipelines:

**LLM/VLM crates** (qwen3-mlx, glm4-mlx, mixtral-mlx, etc.)
- Model architecture definitions
- Weight loading from safetensors/HuggingFace
- Tokenizer integration
- Generation loops with KV caching
- Quantization support (4-bit, 8-bit)
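The generation loops these crates implement share a common shape: run the prompt once, then feed back one token at a time while the KV cache carries the history. A schematic sketch with a stub model (`next_token` here is a placeholder, not a real API; a real crate runs a transformer forward pass at that point):

```rust
/// Placeholder "model": real crates run a cached transformer forward
/// pass here and take the argmax (or sample) over the logits.
fn next_token(context: &[u32]) -> u32 {
    (context.last().copied().unwrap_or(0) + 1) % 100
}

/// Greedy autoregressive loop: with a KV cache, each iteration only
/// processes the newest token instead of re-running the whole prompt.
fn generate(prompt: &[u32], max_new: usize, eos: u32) -> Vec<u32> {
    let mut tokens = prompt.to_vec();
    for _ in 0..max_new {
        let tok = next_token(&tokens);
        if tok == eos {
            break; // stop early on the end-of-sequence token
        }
        tokens.push(tok);
    }
    tokens
}

fn main() {
    let out = generate(&[1, 2], 3, 99);
    assert_eq!(out, vec![1, 2, 3, 4, 5]);
}
```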
**ASR crates**
- Audio frontend processing (mel spectrograms, STFT)
- Encoder/decoder architectures (Paraformer, Whisper-style)
- Vocabulary management
- Real-time streaming support
**Image generation crates**
- VAE encoders/decoders
- Diffusion transformers (DiT, MMDiT)
- Text encoder integration
- Latent space manipulation
### Application layer

User-facing applications and APIs:

- **ominix-api**: Unified HTTP server with OpenAI-compatible endpoints
- **Custom applications**: User code directly importing model crates
- **Example binaries**: Reference implementations in each crate's `examples/` directory
## Crate structure

## Data flow patterns

### LLM inference pipeline

### ASR inference pipeline

### Image generation pipeline
## Memory management

OminiX-MLX leverages Rust's ownership system combined with MLX's unified memory:

- **Reference counting**: MLX arrays use internal reference counting; Rust's `Drop` decrements the count
- **Zero-copy operations**: Arrays can be shared between CPU and GPU without copying
- **Lazy materialization**: Array data is only allocated when `eval()` is called
- **Automatic cleanup**: When Rust values go out of scope, MLX memory is freed
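The ownership model mirrors `Arc<T>`: cloning bumps a reference count, `Drop` decrements it, and the buffer is freed when the count reaches zero. A standard-library analogy (MLX manages its count internally on the C++ side rather than through `Arc`):

```rust
use std::sync::Arc;

fn main() {
    let a = Arc::new(vec![1.0_f32; 1024]); // one owner of the buffer
    let b = Arc::clone(&a);                // cheap: bumps refcount, no copy
    assert_eq!(Arc::strong_count(&a), 2);

    drop(b); // Drop decrements the count...
    assert_eq!(Arc::strong_count(&a), 1);
    // ...and the buffer itself is freed only when the last owner drops.
}
```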
The `Array` type is thread-safe (`Send`) and uses MLX's internal reference counting, similar to `Arc<T>` but managed by MLX.

## Parallelism and streams

MLX uses streams to enable parallel execution:

- **Default stream**: Operations use `StreamOrDevice::default()`, which maps to the GPU by default
- **Explicit streams**: Create separate streams for parallel computation
- **No data races**: MLX handles synchronization between operations on different streams
- **Device specification**: Operations can target CPU or GPU via the `stream` parameter
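Conceptually, explicit streams behave like independent work queues that the framework keeps synchronized. As a loose standard-library analogy (threads standing in for streams; this is not the mlx-rs stream API), two independent computations proceed in parallel and are joined where their results meet:

```rust
use std::thread;

fn main() {
    // Two independent "streams" of work; MLX would enqueue these on
    // separate command queues and synchronize only at dependencies.
    let s1 = thread::spawn(|| (0..1000).sum::<u64>());
    let s2 = thread::spawn(|| (0..1000).map(|x: u64| x * x).sum::<u64>());

    // Joining is the analogue of a stream synchronization point.
    let (a, b) = (s1.join().unwrap(), s2.join().unwrap());
    assert_eq!(a, 499_500);
    assert_eq!(b, 332_833_500);
}
```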
## Build system

The project uses Cargo workspaces for efficient builds:

- **Workspace root**: The top-level `Cargo.toml` defines all member crates
- **Shared dependencies**: Common dependencies are specified once at the workspace level
- **Incremental compilation**: Changing one model crate only rebuilds that crate
- **Feature flags**: `metal` and `accelerate` features control the MLX backend
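A workspace of this shape typically has a root manifest like the following sketch (member names are taken from the crates mentioned above; the actual manifest and dependency versions may differ):

```toml
[workspace]
resolver = "2"
members = [
    "mlx-sys",
    "mlx-rs",
    "mlx-rs-core",
    "qwen3-mlx",
    # ...one member per model crate
]

[workspace.dependencies]
# Shared dependencies are declared once here; member crates inherit
# them with e.g. `serde = { workspace = true }`.
serde = "1"
```

Because each model crate is a separate workspace member, editing one crate triggers a rebuild of only that crate and its dependents, not the whole tree.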
## Design principles

- **Modularity**: Each model family is a separate crate with minimal dependencies
- **Type safety**: Leverage Rust's type system to catch errors at compile time
- **Zero-cost abstractions**: Rust wrappers add no runtime overhead over raw MLX
- **Ergonomic APIs**: Provide convenient builders, macros, and method chaining
- **Pure Rust inference**: No Python runtime required; models run standalone
- **Production-ready**: Focus on reliability, error handling, and performance

## Next steps
- **MLX framework**: Learn about the MLX framework and Metal acceleration
- **Unified memory**: Understand Apple Silicon's unified memory architecture
- **Lazy evaluation**: Explore lazy evaluation and compute graph optimization
- **Core API**: Browse the mlx-rs core API reference