Other TTS Models
This page covers additional TTS model types, including lightweight KittenTTS, voice cloning with Zipvoice, and flow-matching Pocket models.
Overview
KittenTTS
Lightweight, multi-speaker TTS
Zipvoice
Zero-shot voice cloning
Pocket
Flow-matching TTS with voice cloning
KittenTTS
modelType: 'kitten'
Description
KittenTTS is a lightweight, fast, multi-speaker TTS model optimized for resource-constrained devices.
Characteristics
- Streaming: ✅ Yes
- Quality: ⭐⭐⭐ Good
- Speed: ⭐⭐⭐⭐⭐ Very Fast
- Memory: ⭐⭐⭐⭐⭐ Very Low
- Size: Very Small (typically 10-30 MB)
- Multi-Speaker: ✅ Yes
Configuration
Streaming Example
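A minimal sketch of the consumer side of streaming, assuming chunks arrive as mono float PCM arrays. `PcmChunkBuffer` and its methods are hypothetical names for illustration and are not part of the react-native-sherpa-onnx API; wire `push` into whatever per-chunk callback the streaming call actually exposes.

```typescript
// Hypothetical helper (not part of react-native-sherpa-onnx): collects
// streamed PCM chunks so playback can start early while the full
// utterance remains available afterwards.
class PcmChunkBuffer {
  private chunks: Float32Array[] = [];
  private totalSamples = 0;

  // Call this once for every chunk the streaming API delivers.
  push(chunk: Float32Array): void {
    this.chunks.push(chunk);
    this.totalSamples += chunk.length;
  }

  // Duration of the audio received so far, in seconds.
  durationSeconds(sampleRate: number): number {
    return this.totalSamples / sampleRate;
  }

  // Joins all chunks received so far into one contiguous buffer.
  concat(): Float32Array {
    const out = new Float32Array(this.totalSamples);
    let offset = 0;
    for (const chunk of this.chunks) {
      out.set(chunk, offset);
      offset += chunk.length;
    }
    return out;
  }
}
```

Keeping the chunks separate until `concat()` is called avoids reallocating a growing buffer on every callback, which matters on the low-end devices KittenTTS targets.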
Download
KittenTTS Models
Download KittenTTS models
Model Detection
- Folder name should contain kitten (not kokoro)
- Files: model.onnx, tokens.txt
When to Use
Low-End Devices
Resource-constrained mobile devices
Fast Response
Applications requiring minimal latency
Battery Efficiency
Low power consumption for longer battery life
Embedded Systems
IoT devices with limited resources
Advantages
- Very Fast: Fastest TTS model available
- Very Small: Minimal storage footprint
- Low Memory: Runs on constrained devices
- Streaming: Low-latency incremental generation
- Multi-Speaker: Multiple voices in one model
Limitations
- Quality: Good but not as natural as VITS or Matcha
- Limited Languages: Fewer language options
- No Voice Cloning: Fixed voice set only
Zipvoice
modelType: 'zipvoice'
Description
Zipvoice is a zero-shot voice cloning model that can synthesize speech in any voice from a short reference audio sample.
Characteristics
- Streaming: ❌ No (batch only for voice cloning)
- Quality: ⭐⭐⭐⭐⭐ Excellent
- Speed: ⭐⭐⭐ Medium
- Memory: ⭐⭐ High (requires significant RAM)
- Size: Large (~605 MB for full model)
- Voice Cloning: ✅ Yes
Architecture
Zipvoice uses a three-stage pipeline:
- Encoder – Encodes reference audio
- Decoder (flow-matching) – Generates mel-spectrogram
- Vocoder (e.g. vocos_24khz.onnx) – Converts to waveform
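The three stages above can be pictured as a simple function composition. The sketch below is illustrative only: the type names, data shapes, and `synthesize` function are assumptions made for this example, and the real stages run inside the native ONNX runtime rather than in TypeScript.

```typescript
// Illustrative shapes only; these are not types exported by the library.
type Pcm = Float32Array;
interface VoiceEmbedding { values: number[]; }
interface MelSpectrogram { frames: number[][]; }

interface ZipvoiceStages {
  // Stage 1: encode the reference audio.
  encode(referenceAudio: Pcm): VoiceEmbedding;
  // Stage 2: flow-matching decoder produces a mel-spectrogram.
  decode(text: string, voice: VoiceEmbedding): MelSpectrogram;
  // Stage 3: vocoder converts the mel-spectrogram to a waveform.
  vocode(mel: MelSpectrogram): Pcm;
}

// Runs the stages strictly in order; each stage consumes the previous output.
function synthesize(stages: ZipvoiceStages, text: string, referenceAudio: Pcm): Pcm {
  const voice = stages.encode(referenceAudio);
  const mel = stages.decode(text, voice);
  return stages.vocode(mel);
}
```

Because the waveform only exists after the last stage, a model folder without a vocoder file cannot produce audio at all.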
Configuration
Memory Requirements
Download
Zipvoice Models
Download Zipvoice models (full and int8 distill variants)
Model Detection
Zipvoice is detected by file layout:
- Encoder + decoder + vocoder files
- Optional folder name pattern (containing zipvoice)
- Files: encoder, decoder, vocos_*.onnx (vocoder), tokens.txt, lexicon.txt, espeak-ng-data
When to Use
Custom Voices
Synthesize speech in any voice from reference audio
Voice Cloning Apps
Apps that need user-specific voice synthesis
Dubbing & Translation
Translate content while preserving original voice
Personalization
Personalized voice experiences
Advantages
- Zero-Shot Voice Cloning: Clone any voice from short audio
- Excellent Quality: Very natural-sounding output
- Flexible: Works with various reference voices
- Multilingual: Supports Chinese and English
Limitations
- High Memory: Full model needs 8+ GB device RAM
- No Streaming: Voice cloning only supports batch generation
- Large Size: ~605 MB (use int8 distill variant for smaller size)
- Slower: Flow-matching is computationally intensive
- Requires Vocoder: Distill-only models (no vocoder) will fail
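The memory and vocoder limitations lend themselves to a pre-flight check before loading. The sketch below is one possible way to encode them; `planZipvoiceLoad`, `ZipvoicePlan`, and the exact byte threshold are assumptions for illustration, and the total RAM figure must come from whatever device-info source the app already uses.

```typescript
type ZipvoiceVariant = 'full' | 'int8-distill';

interface ZipvoicePlan {
  canLoad: boolean;
  variant: ZipvoiceVariant | null; // null when loading should not be attempted
  reason: string;
}

// Assumed threshold derived from the "8+ GB device RAM" guidance.
const FULL_MODEL_MIN_RAM_BYTES = 8e9;

function planZipvoiceLoad(totalRamBytes: number, fileNames: string[]): ZipvoicePlan {
  // A vocoder file (vocos_*.onnx) must be present; without it loading fails.
  const hasVocoder = fileNames.some((name) => /^vocos_.*\.onnx$/.test(name));
  if (!hasVocoder) {
    return { canLoad: false, variant: null, reason: 'no vocos_*.onnx vocoder file found' };
  }
  if (totalRamBytes >= FULL_MODEL_MIN_RAM_BYTES) {
    return { canLoad: true, variant: 'full', reason: 'device meets the full-model RAM guidance' };
  }
  return { canLoad: true, variant: 'int8-distill', reason: 'below the full-model RAM guidance' };
}
```

Devices marketed as 8 GB often report slightly less usable memory, so the threshold may need tuning against real hardware.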
Reference Audio Requirements
- Format: Mono, float PCM samples in [-1, 1]
- Sample Rate: Typically 22050 Hz or 24000 Hz
- Duration: 3-10 seconds recommended
- Quality: Clear speech, minimal background noise
- Transcript: Must provide accurate transcript of reference audio
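These requirements can be checked before handing reference audio to the model. The following is a minimal sketch, not library code: `ReferenceAudio` and `checkReferenceAudio` are hypothetical names, and the sample rate and duration checks are treated as warnings because the guidance says "typically" and "recommended".

```typescript
interface ReferenceAudio {
  samples: Float32Array; // mono float PCM, expected range [-1, 1]
  sampleRate: number;    // Hz
  transcript: string;    // what is spoken in the reference audio
}

// Returns a list of human-readable problems; an empty list means no issues found.
// Channel count and background noise cannot be verified from a flat sample array.
function checkReferenceAudio(ref: ReferenceAudio): string[] {
  const problems: string[] = [];

  if (ref.sampleRate !== 22050 && ref.sampleRate !== 24000) {
    problems.push('sample rate is not one of the typical values (22050 or 24000 Hz)');
  }

  const seconds = ref.samples.length / ref.sampleRate;
  if (!(seconds >= 3 && seconds <= 10)) {
    problems.push('duration is outside the recommended 3-10 seconds');
  }

  for (let i = 0; i < ref.samples.length; i++) {
    const s = ref.samples[i] ?? NaN;
    // The negated comparison also catches NaN samples.
    if (!(s >= -1 && s <= 1)) {
      problems.push('sample ' + i + ' is outside [-1, 1]');
      break; // one report is enough
    }
  }

  if (ref.transcript.trim().length === 0) {
    problems.push('transcript is empty');
  }
  return problems;
}
```

Audio decoded from 16-bit integer PCM usually needs dividing by 32768 first to land in the expected float range.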
Pocket
modelType: 'pocket'
Description
Pocket is a flow-matching TTS model that supports both standard synthesis and voice cloning with reference audio.
Characteristics
- Streaming: ✅ Yes (including with reference audio for Kotlin-engine models)
- Quality: ⭐⭐⭐⭐ High
- Speed: ⭐⭐⭐⭐ Fast
- Memory: ⭐⭐⭐ Moderate
- Size: Medium
- Voice Cloning: ✅ Yes
Configuration
Streaming with Voice Cloning
Unlike Zipvoice, Pocket supports streaming even with reference audio.
Extra Options
Pocket accepts model-specific options via the extra parameter.
Download
Pocket Models
Download Pocket TTS models
Model Detection
Pocket is detected by file layout:
- Files: lm_flow, lm_main, text_conditioner, vocab/token_scores
- No folder name pattern required
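The detection rules for the three model types on this page can be approximated in a few lines. This is a sketch of the documented rules, not the library's actual detection code: `detectModelType` is a hypothetical function, file matching by prefix or substring is an assumption, and the vocab/token_scores check is left out.

```typescript
type DetectedModel = 'kitten' | 'zipvoice' | 'pocket';

function detectModelType(folderName: string, fileNames: string[]): DetectedModel | null {
  const folder = folderName.toLowerCase();
  const files = fileNames.map((f) => f.toLowerCase());
  const anyContains = (part: string) => files.some((f) => f.indexOf(part) !== -1);
  const anyStartsWith = (prefix: string) => files.some((f) => f.indexOf(prefix) === 0);

  // Pocket: identified purely by file layout, no folder name pattern needed.
  if (anyStartsWith('lm_flow') && anyStartsWith('lm_main') && anyStartsWith('text_conditioner')) {
    return 'pocket';
  }

  // Zipvoice: encoder + decoder + a vocos_*.onnx vocoder file.
  const hasVocoder = files.some((f) => /^vocos_.*\.onnx$/.test(f));
  if (anyContains('encoder') && anyContains('decoder') && hasVocoder) {
    return 'zipvoice';
  }

  // KittenTTS: folder name contains "kitten" (and, in this sketch's reading
  // of the rule, not "kokoro"), plus model.onnx and tokens.txt.
  const hasKittenFiles = files.indexOf('model.onnx') !== -1 && files.indexOf('tokens.txt') !== -1;
  if (folder.indexOf('kitten') !== -1 && folder.indexOf('kokoro') === -1 && hasKittenFiles) {
    return 'kitten';
  }

  return null; // not one of the model types covered on this page
}
```

Pocket is checked first because its layout is the most specific, so a folder that also happens to contain encoder and decoder files is not misread as Zipvoice.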
When to Use
Voice Cloning + Streaming
Need both voice cloning and low-latency streaming
Modern Architecture
Flow-matching for high-quality synthesis
Flexible Options
Fine-grained control with extra parameters
Interactive Apps
Real-time custom voice applications
Advantages
- Streaming + Voice Cloning: Supports both simultaneously
- Flow-Matching: Modern architecture for quality
- Fast: Good performance with streaming
- Flexible: Extra options for fine-tuning
- Good Quality: Natural-sounding speech
Limitations
- Newer: Less battle-tested than VITS or Zipvoice
- Documentation: Fewer examples and resources
- Model Availability: Fewer pretrained models
Comparison Table
| Feature | KittenTTS | Zipvoice | Pocket |
|---|---|---|---|
| Speed | Very Fast | Medium | Fast |
| Quality | Good | Excellent | High |
| Streaming | Yes | No | Yes |
| Voice Cloning | No | Yes | Yes |
| Model Size | Very Small | Large | Medium |
| Memory | Very Low | High | Moderate |
| Best For | Low-end devices | High-quality cloning | Streaming + cloning |
Choosing Between Models
For Voice Cloning
- Zipvoice – Best quality, batch generation only, high memory
- Pocket – Streaming support, good quality, moderate memory
For Speed
- KittenTTS – Fastest, lightweight
- Pocket – Fast with streaming
For Low-End Devices
- KittenTTS – Minimal resources
- Zipvoice int8 distill – If voice cloning is needed
For High Quality
- Zipvoice – Excellent voice cloning quality
- Pocket – Good quality with more flexibility
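The guidance above can be condensed into a small decision function. This is only one reading of the recommendations; `chooseModel`, `Needs`, and the tie-breaking order are assumptions for illustration rather than anything the library provides.

```typescript
type ModelChoice = 'kitten' | 'zipvoice' | 'zipvoice-int8-distill' | 'pocket';

interface Needs {
  voiceCloning: boolean;
  streaming: boolean;
  lowEndDevice: boolean;
  priority: 'speed' | 'quality';
}

function chooseModel(needs: Needs): ModelChoice {
  if (needs.voiceCloning) {
    // Pocket is the only option here that streams while cloning.
    if (needs.streaming) return 'pocket';
    // On constrained hardware the int8 distill variant is the suggested fallback.
    if (needs.lowEndDevice) return 'zipvoice-int8-distill';
    // Otherwise trade batch-only, high-memory Zipvoice against faster Pocket.
    return needs.priority === 'quality' ? 'zipvoice' : 'pocket';
  }
  // Without cloning, KittenTTS wins on speed and resource use.
  if (needs.lowEndDevice || needs.priority === 'speed') return 'kitten';
  // Quality-first standard synthesis: Pocket also supports non-cloning use.
  return 'pocket';
}
```

For example, a real-time assistant that speaks in a user-supplied voice maps to `pocket`, while an offline dubbing tool on a high-memory device maps to `zipvoice`.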
Next Steps
TTS Overview
Compare all TTS model types
TTS API
Detailed API documentation
Streaming TTS
Low-latency streaming guide
Model Setup
How to download and bundle models