Documentation Index
Fetch the complete documentation index at: https://mintlify.com/OminiX-ai/OminiX-MLX/llms.txt
Use this file to discover all available pages before exploring further.
Overview
The `minicpm-sala-mlx` crate provides inference for MiniCPM-SALA, a compact language model with hybrid attention combining sparse (Lightning Attention) and dense layers. SALA includes built-in thinking capabilities and speculative decoding for faster inference.
Key features
- Hybrid attention - Alternating Lightning (sparse) and dense attention layers
- Built-in thinking - `<think>...</think>` blocks for reasoning
- Speculative decoding - Draft model acceleration
- Custom Metal kernels - Optimized Lightning Attention implementation
- Compact size - High performance in small models (2B-4B parameters)
Installation
Add to your `Cargo.toml`:
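A minimal dependency entry is sketched below; the version number is illustrative, so check the crate's repository or registry for the current release.

```toml
[dependencies]
# Version is illustrative; pin to the release you actually use.
minicpm-sala-mlx = "0.1"
```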
Core functions
load_model
Loads a MiniCPM-SALA model from a directory. Takes the path to a model directory containing:
- `config.json` - Model configuration
- `model.safetensors.index.json` - Weight file index (or a single `model.safetensors`)
- Model weight files
Returns a loaded `Model` ready for inference.
load_tokenizer
Loads the tokenizer from the model directory. Takes the path to a model directory containing `tokenizer.json`.
Returns a HuggingFace `Tokenizer` instance.
get_model_args
Parses the model configuration from `config.json`. Takes the path to a directory containing `config.json`.
Returns parsed `ModelArgs` with the model hyperparameters.
Utility functions
format_chat_prompt
Formats a single-turn chat prompt in ChatML format. Takes a system message defining assistant behavior and a user message/question.
Returns a formatted ChatML prompt ready for tokenization.
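The exact string the crate returns is defined by its implementation, but conventional single-turn ChatML has the shape sketched below; `chatml_single_turn` is an illustrative stand-in for `format_chat_prompt`, not the crate's code.

```rust
// Illustrative stand-in for format_chat_prompt: builds a conventional
// single-turn ChatML prompt ending with an open assistant turn.
fn chatml_single_turn(system: &str, user: &str) -> String {
    format!(
        "<|im_start|>system\n{system}<|im_end|>\n\
         <|im_start|>user\n{user}<|im_end|>\n\
         <|im_start|>assistant\n"
    )
}

fn main() {
    let prompt = chatml_single_turn("You are a helpful assistant.", "What is MLX?");
    // Ends with an open assistant turn so generation continues from there.
    assert!(prompt.ends_with("<|im_start|>assistant\n"));
    println!("{prompt}");
}
```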
format_chat_prompt_multi
Formats a multi-turn chat prompt in ChatML format. Takes a system message and a list of (role, content) pairs where role is "user" or "assistant".
Returns a formatted multi-turn ChatML prompt.
strip_thinking
Removes the `<think>...</think>` block from generated text. Takes generated text that may contain a thinking block.
Returns the text after the `</think>` tag, or the original text if there is no thinking block.
is_stop_token
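The stripping behavior described above can be sketched as a standalone function; `strip_thinking_sketch` is an illustrative re-implementation, not the crate's source, and the leading-whitespace trim after the tag is an assumption.

```rust
// Illustrative sketch of strip_thinking: return everything after the first
// closing </think> tag, or the input unchanged if no tag is present.
// Trimming leading whitespace after the tag is an assumption of this sketch.
fn strip_thinking_sketch(text: &str) -> &str {
    match text.find("</think>") {
        Some(pos) => text[pos + "</think>".len()..].trim_start(),
        None => text,
    }
}

fn main() {
    let out = strip_thinking_sketch("<think>reasoning here</think>\nFinal answer.");
    assert_eq!(out, "Final answer.");
    assert_eq!(strip_thinking_sketch("No thinking block."), "No thinking block.");
}
```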
Checks whether a token is a stop token (EOS or `<|im_end|>`). Takes the token ID to check.
Returns `true` if the token is EOS (2) or `<|im_end|>` (73440).
Types
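Given the two IDs documented above, the check reduces to a simple comparison; this is a sketch of the documented behavior, not the crate's source.

```rust
// Stop-token IDs documented for MiniCPM-SALA: EOS and <|im_end|>.
const EOS_TOKEN_ID: u32 = 2;
const IM_END_TOKEN_ID: u32 = 73440;

// Illustrative sketch of is_stop_token.
fn is_stop_token_sketch(token_id: u32) -> bool {
    token_id == EOS_TOKEN_ID || token_id == IM_END_TOKEN_ID
}

fn main() {
    assert!(is_stop_token_sketch(2));
    assert!(is_stop_token_sketch(73440));
    assert!(!is_stop_token_sketch(42));
}
```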
Model
The main model struct for MiniCPM-SALA inference.
ModelArgs
Model configuration parsed from `config.json`. Fields include:
- Embedding scaling factor (muP scaling)
- Depth scaling for residual connections
- Interval between Lightning (sparse) attention layers
HybridAttention
Enum for sparse or dense attention layers.
ThinkFilter
Incremental filter for streaming output with think-block suppression.
Constructor
If `true`, suppresses `<think>...</think>` content in the output.
next
Takes the full decoded text so far.
Returns the new text to emit (an empty string while still inside the think block).
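The incremental contract above (pass the full decoded text on each call, get back only the newly visible text) can be sketched with a standalone filter. This mirrors the documented behavior but is not the crate's implementation; it simplifies by assuming at most one think block, at the start of the output.

```rust
// Illustrative sketch of ThinkFilter: called with the full decoded text so
// far, it returns only the newly visible portion, suppressing an initial
// <think>...</think> block when `strip` is true.
struct ThinkFilterSketch {
    strip: bool,
    emitted: usize, // byte offset of text already returned to the caller
}

impl ThinkFilterSketch {
    fn new(strip: bool) -> Self {
        Self { strip, emitted: 0 }
    }

    fn next(&mut self, full_text: &str) -> String {
        let visible_from = if self.strip && full_text.starts_with("<think>") {
            match full_text.find("</think>") {
                Some(pos) => pos + "</think>".len(),
                None => return String::new(), // still inside the think block
            }
        } else {
            0
        };
        let start = self.emitted.max(visible_from);
        if start >= full_text.len() {
            return String::new();
        }
        self.emitted = full_text.len();
        full_text[start..].to_string()
    }
}

fn main() {
    let mut f = ThinkFilterSketch::new(true);
    assert_eq!(f.next("<think>let me"), ""); // inside think block: nothing emitted
    assert_eq!(f.next("<think>let me see</think>Hi"), "Hi");
    assert_eq!(f.next("<think>let me see</think>Hi there"), " there");
}
```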
Example usage
Basic generation
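Given the functions documented on this page, a basic flow looks roughly like the sketch below. The decoding step is left as a comment because its API is not documented here, and the `Result`-style error handling is an assumption about the crate's return types.

```rust
// Hedged sketch of single-turn generation. load_model, load_tokenizer,
// format_chat_prompt, and is_stop_token are documented on this page; the
// decoding step itself is a placeholder.
use minicpm_sala_mlx::{format_chat_prompt, is_stop_token, load_model, load_tokenizer};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let model_dir = "models/MiniCPM-SALA"; // illustrative local path
    let model = load_model(model_dir)?;
    let tokenizer = load_tokenizer(model_dir)?;

    let prompt = format_chat_prompt("You are a helpful assistant.", "What is MLX?");
    let encoding = tokenizer.encode(prompt, true)?; // HuggingFace tokenizers API

    // Decoding loop (placeholder): feed encoding.get_ids() to the model,
    // sample one token at a time, and stop when is_stop_token(id) is true.
    Ok(())
}
```

Running this requires the model files on disk, so it is a structural sketch rather than a drop-in program.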
With think filtering
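The documented `ThinkFilter` contract (construct with a suppression flag, call `next` with the full decoded text so far) plugs into a streaming loop as sketched here; the decoding loop is a placeholder, and the exact `next` signature is an assumption.

```rust
// Hedged sketch of streaming output with think-block suppression.
use minicpm_sala_mlx::ThinkFilter;
use std::io::Write;

fn main() {
    let mut filter = ThinkFilter::new(true); // true: hide <think>...</think>
    let mut decoded = String::new();

    // Placeholder for the real decoding loop: after each generated token,
    // append its decoded text to `decoded`, then emit only the visible part.
    for chunk in ["<think>reason", "ing</think>", "Final answer."] {
        decoded.push_str(chunk);
        let visible = filter.next(&decoded);
        print!("{visible}"); // empty while still inside the think block
        std::io::stdout().flush().ok();
    }
}
```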
Multi-turn conversation
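A multi-turn prompt builds on `format_chat_prompt_multi`, documented above with (role, content) turns; the exact parameter types in this sketch are assumptions.

```rust
// Hedged sketch of building a multi-turn ChatML prompt.
use minicpm_sala_mlx::format_chat_prompt_multi;

fn main() {
    let turns = [
        ("user", "What is Lightning Attention?"),
        ("assistant", "A sparse attention mechanism."),
        ("user", "How does MiniCPM-SALA use it?"),
    ];
    let prompt = format_chat_prompt_multi("You are a helpful assistant.", &turns);
    // Tokenize and generate from `prompt` as in basic generation.
    println!("{prompt}");
}
```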
Architecture details
Hybrid attention layers
MiniCPM-SALA alternates between Lightning (sparse) and dense attention layers.
Thinking blocks
MiniCPM-SALA can generate intermediate reasoning in `<think>...</think>` blocks. Use `ThinkFilter` to hide thinking in streaming output, or `strip_thinking` for post-processing.
Speculative decoding
Use `SpeculativeDecoder` for faster inference:
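Only the `SpeculativeDecoder` type name is documented on this page, so the sketch below is hypothetical throughout: the constructor and the second model path are placeholders, shown as comments rather than claimed API.

```rust
// Hypothetical usage sketch of SpeculativeDecoder. Only the type name is
// documented; every call below is a placeholder, not a confirmed signature.
//
// let draft  = load_model("models/MiniCPM-SALA-draft"); // smaller draft model (hypothetical path)
// let target = load_model("models/MiniCPM-SALA");       // full target model
// let decoder = SpeculativeDecoder::new(target, draft); // hypothetical constructor
//
// Idea: the draft model proposes several tokens per step; the target model
// verifies them in a single forward pass and accepts the longest agreeing
// prefix, which is where the 1.5-2.5x speedup comes from.
```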
Performance notes
- Lightning Attention reduces memory and computation in sparse layers
- Hybrid architecture balances quality and efficiency
- Speculative decoding can provide 1.5-2.5x speedup
- Compact models (2B-4B) run efficiently on consumer hardware
Constants
See also
- qwen3-mlx - Similar architecture without hybrid attention
- Lightning Attention paper - Sparse attention mechanism
- Speculative decoding - Acceleration technique