GPT-SoVITS MLX
Pure Rust implementation of GPT-SoVITS with MLX acceleration for Apple Silicon. Enables few-shot voice cloning with just a few seconds of reference audio.
Features
- Few-shot voice cloning: Clone any voice with just a few seconds of reference audio
- Mixed Chinese-English: Natural handling of mixed language text
- High performance: 4x realtime synthesis on Apple Silicon
- Pure Rust: No Python dependencies at runtime
Performance
On Apple Silicon (M-series):
- Model loading: ~50ms
- Synthesis: ~4x realtime (generates 20s audio in 5s)
- Memory: ~2GB for all models
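The "4x realtime" figure is the ratio of generated audio duration to wall-clock synthesis time. A minimal sketch of that arithmetic follows; `realtime_factor` is a hypothetical helper for illustration, not part of the crate.

```rust
/// Realtime factor: seconds of audio produced per second of wall-clock time.
/// Hypothetical helper, not part of the GPT-SoVITS MLX API.
fn realtime_factor(audio_secs: f64, wall_secs: f64) -> Option<f64> {
    if wall_secs > 0.0 {
        Some(audio_secs / wall_secs)
    } else {
        // A zero or negative elapsed time gives no meaningful ratio.
        None
    }
}
```

With the numbers quoted above, `realtime_factor(20.0, 5.0)` gives `Some(4.0)`.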
Installation
Quick start
VoiceCloner
Main API for voice cloning with GPT-SoVITS.
VoiceCloner::new
Create a new voice cloner with configuration.
- Parameter: configuration with model paths and sampling parameters
- Returns: voice cloner instance with loaded models
VoiceCloner::with_defaults
Create with default configuration.
- Returns: voice cloner with default model paths from ~/.OminiX/models/gpt-sovits-mlx
VoiceCloner::set_reference_audio
Set reference audio for voice cloning (zero-shot mode).
- Parameter: path to reference audio file (WAV format)
- Returns: success if reference loaded and mel spectrogram computed
VoiceCloner::set_reference_audio_with_text
Set reference audio with transcript for few-shot mode.
- Parameter: path to reference audio file
- Parameter: transcript of the reference audio
- Returns: success if reference loaded and HuBERT semantic codes extracted
VoiceCloner::set_reference_with_precomputed_codes
Set reference using pre-computed prompt semantic codes.
- Parameter: path to reference audio file (for mel spectrogram)
- Parameter: transcript of the reference audio
- Parameter: path to binary file containing i32 codes (little-endian) or .npy file
- Returns: success if reference and codes loaded
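The raw binary form of the codes file is described as a sequence of little-endian i32 values. The sketch below shows one way such a buffer could be decoded; `parse_codes_le` is a hypothetical helper, not the crate's loader, and it does not handle the .npy variant, which carries a header before the data.

```rust
/// Decode a raw buffer of little-endian i32 values.
/// Hypothetical helper for illustration; the crate's own loader may differ.
fn parse_codes_le(bytes: &[u8]) -> Result<Vec<i32>, String> {
    // Each code occupies exactly four bytes.
    if bytes.len() % 4 != 0 {
        return Err(format!("length {} is not a multiple of 4", bytes.len()));
    }
    Ok(bytes
        .chunks_exact(4)
        .map(|c| i32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect())
}
```

A file read with `std::fs::read` can be passed straight to this function as a byte slice.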
VoiceCloner::synthesize
Synthesize speech from text.
- Parameter: text to synthesize (up to 10,000 characters)
- Returns: generated audio with samples, sample rate, duration, and token count
VoiceCloner::synthesize_with_options
Synthesize speech with timeout and cancellation support.
- Parameter: text to synthesize
- Parameter: synthesis options (timeout, cancellation token, speed override)
- Returns: generated audio, or an error if cancelled or timed out
VoiceCloner::synthesize_from_tokens
Synthesize audio from external semantic tokens.
- Parameter: text used to obtain phoneme IDs
- Parameter: pre-computed semantic tokens
- Returns: generated audio
VoiceCloner::few_shot_available
Check if few-shot mode is available.
- Returns: true if HuBERT model is loaded
VoiceCloner::is_few_shot_mode
Check if currently in few-shot mode.
- Returns: true if prompt semantic codes and reference text are set
Types
VoiceClonerConfig
Configuration for voice cloner.
Default model directory: $GPT_SOVITS_MODEL_DIR if set, otherwise ~/.OminiX/models/gpt-sovits-mlx.
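The default-directory rule above can be sketched as a small pure function. `resolve_model_dir` is a hypothetical name, and treating an empty variable as unset is an assumption of this sketch rather than documented behavior.

```rust
use std::path::PathBuf;

/// Pick the model directory: the environment value if present,
/// otherwise <home>/.OminiX/models/gpt-sovits-mlx.
/// Hypothetical helper; an empty value is treated as unset (assumption).
fn resolve_model_dir(env_value: Option<&str>, home: &str) -> PathBuf {
    match env_value {
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        _ => PathBuf::from(home)
            .join(".OminiX")
            .join("models")
            .join("gpt-sovits-mlx"),
    }
}
```

In a real program the first argument would come from `std::env::var("GPT_SOVITS_MODEL_DIR").ok()`, passed with `.as_deref()`.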
AudioOutput
Generated audio output.
AudioOutput::duration_secs
Get duration in seconds.
- Returns: duration calculated from samples.len() / sample_rate
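As a standalone illustration of the samples.len() / sample_rate formula, the sketch below takes the sample count and rate directly. It assumes mono audio and is not the crate's method; the zero-rate guard is an addition of this sketch.

```rust
/// Duration in seconds for `num_samples` mono samples at `sample_rate` Hz.
/// Standalone sketch of the formula, not the crate's implementation.
fn duration_secs(num_samples: usize, sample_rate: u32) -> f32 {
    if sample_rate == 0 {
        // Guard added for the sketch: avoid dividing by zero.
        return 0.0;
    }
    num_samples as f32 / sample_rate as f32
}
```

For example, 64,000 samples at a hypothetical 32,000 Hz rate correspond to 2 seconds.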
AudioOutput::to_i16_samples
Convert to i16 samples for WAV output.
- Returns: samples converted to 16-bit PCM, clamped to [-1, 1]
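A common way to perform this kind of conversion is to clamp each float sample to [-1, 1] and scale it to the 16-bit range. The sketch below assumes a scale factor of 32767 and truncation toward zero; the crate may scale or round differently.

```rust
/// Convert float samples to 16-bit PCM.
/// Sketch only: the 32767 scale factor and truncation are assumptions.
fn to_i16_samples(samples: &[f32]) -> Vec<i16> {
    samples
        .iter()
        // Clamp first so out-of-range input cannot overflow the i16 range.
        .map(|&s| (s.clamp(-1.0, 1.0) * 32767.0) as i16)
        .collect()
}
```

Clamping before scaling means values such as 2.0 or -3.0 saturate at full scale instead of wrapping.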
AudioOutput::apply_fade_in
Apply fade-in to reduce initial noise artifacts.
- Parameter: fade-in duration in milliseconds (default: 50ms)
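One simple form of fade-in multiplies the first samples by a gain that ramps linearly from 0 to 1. The sketch below assumes a linear ramp and takes the sample rate as an explicit argument; the crate's actual curve is not stated in this page.

```rust
/// Apply a linear fade-in over the first `fade_ms` milliseconds, in place.
/// Sketch only: the linear ramp is an assumption.
fn apply_fade_in(samples: &mut [f32], sample_rate: u32, fade_ms: f32) {
    // Number of samples covered by the fade, capped at the buffer length.
    let n = ((sample_rate as f32 * fade_ms / 1000.0) as usize).min(samples.len());
    for (i, s) in samples.iter_mut().take(n).enumerate() {
        // Gain rises from 0 at the first sample toward 1 at sample n.
        *s *= i as f32 / n as f32;
    }
}
```

Samples after the fade window are left untouched.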
AudioOutput::trim_start
Trim audio from the start to remove initial artifacts.
- Parameter: duration to trim in milliseconds
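Trimming by a duration amounts to converting milliseconds to a sample count and dropping that many leading samples. The sketch below returns a sub-slice instead of mutating in place, which is a choice of this example and not necessarily how the crate does it.

```rust
/// Return the samples that remain after dropping the first `trim_ms` milliseconds.
/// Sketch only; the crate's method may mutate in place instead.
fn trim_start(samples: &[f32], sample_rate: u32, trim_ms: f32) -> &[f32] {
    // Cap at the buffer length so an oversized trim yields an empty slice.
    let n = ((sample_rate as f32 * trim_ms / 1000.0) as usize).min(samples.len());
    &samples[n..]
}
```

A trim longer than the audio yields an empty slice.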
SynthesisOptions
Options for synthesis with timeout and cancellation support.
SynthesisOptions::with_timeout
Create options with a timeout.
- Parameter: maximum time allowed for synthesis
- Returns: options with timeout set
SynthesisOptions::with_cancel_token
Create options with a cancellation token.
- Parameter: cancellation token; set to true to cancel synthesis
- Returns: options with cancel token set
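A cancellation token that is "set to true" is commonly a shared atomic boolean polled by the worker loop. The sketch below illustrates that pattern together with a deadline check. `StopReason` and `run_steps` are hypothetical names, and using `Arc<AtomicBool>` as the token type is an assumption not confirmed by this page.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::{Duration, Instant};

/// Why the loop stopped (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum StopReason {
    Finished,
    Cancelled,
    TimedOut,
}

/// Run `steps` units of work, checking the cancel flag and the deadline
/// before each one. Hypothetical stand-in for a generation loop.
fn run_steps(
    steps: usize,
    timeout: Option<Duration>,
    cancel: Option<Arc<AtomicBool>>,
    mut step: impl FnMut(usize),
) -> StopReason {
    let deadline = timeout.map(|t| Instant::now() + t);
    for i in 0..steps {
        if let Some(flag) = &cancel {
            // Another thread (or a callback) may have set the flag.
            if flag.load(Ordering::Relaxed) {
                return StopReason::Cancelled;
            }
        }
        if let Some(d) = deadline {
            if Instant::now() >= d {
                return StopReason::TimedOut;
            }
        }
        step(i);
    }
    StopReason::Finished
}
```

The caller keeps a clone of the `Arc` and stores `true` into it from another thread to request a stop; the loop notices on its next iteration.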
Text preprocessing
preprocess_text
Preprocess text to phonemes.
- Parameter: input text in Chinese or English
- Returns: tuple of (phoneme_ids, phonemes, word2ph, normalized_text)
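The role of word2ph is not spelled out here. In the reference Python GPT-SoVITS pipeline it holds the number of phonemes produced by each input unit, and it is used to repeat per-unit features so they line up with the phoneme sequence. Assuming the same meaning applies in this crate, the expansion can be sketched as follows; `expand_by_word2ph` is a hypothetical helper.

```rust
/// Repeat each per-unit item `word2ph[i]` times so the result aligns with phonemes.
/// Hypothetical helper; assumes word2ph[i] is the phoneme count of unit i.
fn expand_by_word2ph<T: Clone>(per_unit: &[T], word2ph: &[usize]) -> Option<Vec<T>> {
    // The two sequences must describe the same units.
    if per_unit.len() != word2ph.len() {
        return None;
    }
    let mut out = Vec::with_capacity(word2ph.iter().sum());
    for (item, &count) in per_unit.iter().zip(word2ph) {
        out.extend(std::iter::repeat(item.clone()).take(count));
    }
    Some(out)
}
```

Under this assumption, the length of the expanded vector equals the sum of word2ph, which should match the number of phonemes.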
Language
Supported languages.
Model files
Required files in model directory:
- T2S weights: doubao_mixed_gpt_new.safetensors
- BERT weights: bert.safetensors
- BERT tokenizer: chinese-roberta-tokenizer/tokenizer.json
- VITS weights: doubao_mixed_sovits_new.safetensors
- HuBERT weights: hubert.safetensors (for few-shot mode)
- ONNX VITS (optional): vits.onnx (recommended for best quality)
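Before constructing a cloner it can be useful to check that the model directory is complete. The sketch below checks only the four files needed in every mode; hubert.safetensors (few-shot mode) and vits.onnx (optional) are left out. `REQUIRED_FILES` and `missing_model_files` are hypothetical names, not part of the crate.

```rust
use std::path::Path;

/// Files listed above as needed in every mode (relative to the model directory).
const REQUIRED_FILES: [&str; 4] = [
    "doubao_mixed_gpt_new.safetensors",
    "bert.safetensors",
    "chinese-roberta-tokenizer/tokenizer.json",
    "doubao_mixed_sovits_new.safetensors",
];

/// Return the required files that are not present under `model_dir`.
/// Hypothetical helper for illustration.
fn missing_model_files(model_dir: &Path) -> Vec<&'static str> {
    REQUIRED_FILES
        .iter()
        .copied()
        .filter(|rel| !model_dir.join(rel).is_file())
        .collect()
}
```

An empty result means all four files were found; otherwise the returned names can be reported to the user.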