Models are distributed in the ONNX Runtime (`.ort`) flatbuffer encoding.

## Supported Languages and Models
| Language | Architecture | Parameters | WER/CER | HuggingFace Link |
|---|---|---|---|---|
| English | Tiny | 26 million | 12.66% | Model |
| English | Tiny Streaming | 34 million | 12.00% | Model |
| English | Base | 58 million | 10.07% | Model |
| English | Small Streaming | 123 million | 7.84% | Model |
| English | Medium Streaming | 245 million | 6.65% | Model |
| Arabic | Base | 58 million | 5.63% | Model |
| Japanese | Base | 58 million | 13.62% | Model |
| Korean | Tiny | 26 million | 6.46% | Model |
| Mandarin | Base | 58 million | 25.76% | Model |
| Spanish | Base | 58 million | 4.33% | Model |
| Ukrainian | Base | 58 million | 14.55% | Model |
| Vietnamese | Base | 58 million | 8.82% | Model |
WER (Word Error Rate) is used for languages with word boundaries, like English and Spanish. CER (Character Error Rate) is used for languages without clear word boundaries, such as Japanese and Mandarin.
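To make the two metrics concrete, here is a minimal sketch of how WER and CER are typically computed: both are an edit (Levenshtein) distance divided by the reference length, differing only in whether the units are words or characters. This is a generic illustration, not the project's evaluation code (see `scripts/eval-model-accuracy.py` for that).

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            # deletion, insertion, or substitution (free if items match)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, one substituted word out of three gives a WER of about 33%.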
## Evaluation Methodology

- English models: evaluated using the HuggingFace OpenASR Leaderboard datasets and methodology
- Other languages: evaluated using the FLEURS dataset with the `scripts/eval-model-accuracy.py` script
## Downloading Models

The easiest way to get model files is using the Python module.

### Specifying Model Architecture
Optionally, request a specific model architecture using the `model-arch` flag. The valid architecture values are defined in `moonshine-c-api.h`:
- `0` - Tiny
- `1` - Base
- `2` - Tiny Streaming
- `3` - Base Streaming
- `4` - Small Streaming
- `5` - Medium Streaming
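If you are wrapping these constants on the Python side, an `IntEnum` keeps them readable. The integer values below come from the list above; the enum and member names are illustrative, not part of the Moonshine API.

```python
from enum import IntEnum

class ModelArch(IntEnum):
    """Hypothetical Python mirror of the C architecture constants above."""
    TINY = 0
    BASE = 1
    TINY_STREAMING = 2
    BASE_STREAMING = 3
    SMALL_STREAMING = 4
    MEDIUM_STREAMING = 5
```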
### Download Output

The download script will log the model location and architecture.

By default, models are cached in your user cache directory (`~/Library/Caches/moonshine_voice` on macOS). Set the `MOONSHINE_VOICE_CACHE` environment variable to use a different location.

## HuggingFace Models
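The cache-resolution rule above (environment variable first, then the platform's user cache directory) can be sketched as follows. Only the macOS default and the `MOONSHINE_VOICE_CACHE` variable come from the documentation; the function name and the non-macOS fallbacks are assumptions for illustration.

```python
import os
import sys
from pathlib import Path

def model_cache_dir() -> Path:
    """Resolve the model cache directory: MOONSHINE_VOICE_CACHE wins,
    otherwise fall back to the platform's user cache directory."""
    override = os.environ.get("MOONSHINE_VOICE_CACHE")
    if override:
        return Path(override)
    if sys.platform == "darwin":
        # Documented macOS default.
        return Path.home() / "Library" / "Caches" / "moonshine_voice"
    if sys.platform.startswith("win"):
        base = os.environ.get("LOCALAPPDATA", str(Path.home() / "AppData" / "Local"))
        return Path(base) / "moonshine_voice"
    # Linux and other Unix: honor XDG_CACHE_HOME if set.
    base = os.environ.get("XDG_CACHE_HOME", str(Path.home() / ".cache"))
    return Path(base) / "moonshine_voice"
```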
Safetensors versions of the models are available on HuggingFace at huggingface.co/UsefulSensors/models. These are floating-point checkpoints exported directly from the training pipeline.

The organization name "UsefulSensors" dates from an earlier incarnation of the company, which focused on complete voice-interface solutions integrated onto low-cost chips with built-in microphones.
## Non-Latin Language Configuration

This is required because:

- Hallucination detection uses a heuristic based on tokens per second
- Non-Latin languages produce more tokens per second due to tokenization
- Without this setting, valid outputs may be incorrectly truncated
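The reasoning above can be sketched as a simple rate check: if the decoder emits tokens faster than a per-language ceiling, the output is treated as a hallucination and truncated. This is an illustrative sketch of the described heuristic; the function name, signature, and threshold values are assumptions, not Moonshine's actual implementation.

```python
def looks_like_hallucination(num_tokens: int, audio_seconds: float,
                             max_tokens_per_second: float) -> bool:
    """Flag output whose token rate exceeds a per-language ceiling.
    Non-Latin tokenizers emit more tokens per second of speech, so the
    ceiling must be raised for those languages."""
    if audio_seconds <= 0:
        return True  # tokens produced for no audio: treat as suspect
    return num_tokens / audio_seconds > max_tokens_per_second

# A ceiling tuned for Latin scripts (say, 10 tokens/s) would wrongly flag a
# valid non-Latin transcript running at 15 tokens/s; raising the ceiling
# (say, to 20 tokens/s) avoids the spurious truncation.
```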
## Model Components

Each model consists of three files:

- `encoder_model.ort` - encoder neural network (processes audio features)
- `decoder_model_merged.ort` - decoder neural network (generates text)
- `tokenizer.bin` - token-to-character mapping in a compact binary format
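Since all three files must be present for a model to load, a small sanity check before use can save a confusing runtime error. The file names below come from the list above; the helper itself is a hypothetical convenience, not part of the Moonshine API.

```python
from pathlib import Path

# The three files every model directory must contain (from the list above).
REQUIRED_MODEL_FILES = (
    "encoder_model.ort",
    "decoder_model_merged.ort",
    "tokenizer.bin",
)

def missing_model_files(model_dir: str) -> list[str]:
    """Return the names of required model files absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_MODEL_FILES if not (root / name).is_file()]
```

Usage: if `missing_model_files(path)` returns an empty list, the directory holds a complete model; otherwise the list names what needs re-downloading.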