Why Moonshine Models?
Moonshine models were designed to address limitations in existing speech recognition models when building live voice interfaces.

Flexible Input Windows
You can supply audio of any length (though we recommend staying below around 30 seconds), and the model only spends compute on that input; no zero-padding is required. Unlike Whisper’s fixed 30-second input window, this gives a significant latency boost for real-time applications.

Caching for Streaming
Our models support incremental addition of audio over time, caching the input encoding and part of the decoder’s state. This lets the model skip redundant computation, driving latency down dramatically while the user is speaking.

Language-Specific Models
We’ve found that restricting a model to a single language yields much higher accuracy for the same size and compute than training one model across many languages. This allows us to offer better accuracy for languages that are poorly supported by multilingual models.

Cross-Platform Optimization
All models use ONNX format and are optimized for edge deployment with:

- Memory-mappable OnnxRuntime (.ort) flatbuffer encoding
- Post-training quantization to 8-bit weights and calculations
- Optimized CPU inference across Linux, macOS, Windows, iOS, Android, and Raspberry Pi
Model Architectures
Moonshine offers several model architectures, each optimized for different use cases.

Architecture constants are defined in moonshine-c-api.h as:

- MOONSHINE_MODEL_ARCH_TINY (0)
- MOONSHINE_MODEL_ARCH_BASE (1)
- MOONSHINE_MODEL_ARCH_TINY_STREAMING (2)
- MOONSHINE_MODEL_ARCH_BASE_STREAMING (3)
- MOONSHINE_MODEL_ARCH_SMALL_STREAMING (4)
- MOONSHINE_MODEL_ARCH_MEDIUM_STREAMING (5)
Non-Streaming Models
Tiny - 26 million parameters

- Smallest model for highly constrained deployments
- Best for batch processing or applications with relaxed latency requirements

Base

- Balanced accuracy and size
- Available for multiple languages
- Good general-purpose choice for edge devices
Streaming Models
Tiny Streaming - 34 million parameters

- Smallest streaming model with caching support
- Ideal for resource-constrained real-time applications

Base Streaming

- Mid-range streaming model
- Good balance of accuracy and compute for live transcription
Medium Streaming

- Highest accuracy streaming model
- Achieves lower WER than Whisper Large v3 (6.65% vs 7.44%)
- Recommended for applications where accuracy is critical
Streaming models do most of their work while the user is still talking, enabling latency below 200ms for responsive voice interfaces.
Research Papers
These research papers provide detailed information about the architectures and performance strategies:

- Moonshine: Speech Recognition for Live Transcription and Voice Commands - Describes the first-generation model architecture with flexible-duration input windows
- Flavors of Moonshine: Tiny Specialized ASR Models for Edge Devices - How language-specific models improve accuracy for non-English languages
- Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications - Introduces the streaming approach and its advantages for live voice applications
Performance Characteristics
All Moonshine models are optimized for:

- Low Latency: Streaming models cache computation to minimize response time
- Edge Deployment: Run efficiently on CPUs without GPU acceleration
- Privacy: Everything runs on-device with no network calls required
- Accuracy: Top-end models achieve state-of-the-art accuracy for their size