Moonshine Voice supports multiple languages, with models optimized for different deployment scenarios. All models use the ONNX format, converted to a memory-mappable ONNX Runtime (.ort) flatbuffer encoding.
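Because .ort files are standard ONNX Runtime artifacts, you can inspect one with the onnxruntime Python package. The sketch below uses the encoder filename described later on this page, but the printed input metadata is illustrative rather than the documented Moonshine signature:
import onnxruntime as ort

# .ort files load through the same API as .onnx files; ONNX Runtime
# reads the flatbuffer directly instead of parsing a protobuf graph.
session = ort.InferenceSession("encoder_model.ort",
                               providers=["CPUExecutionProvider"])
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)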

Supported Languages and Models

| Language   | Architecture     | Parameters  | WER/CER | HuggingFace Link |
|------------|------------------|-------------|---------|------------------|
| English    | Tiny             | 26 million  | 12.66%  | Model            |
| English    | Tiny Streaming   | 34 million  | 12.00%  | Model            |
| English    | Base             | 58 million  | 10.07%  | Model            |
| English    | Small Streaming  | 123 million | 7.84%   | Model            |
| English    | Medium Streaming | 245 million | 6.65%   | Model            |
| Arabic     | Base             | 58 million  | 5.63%   | Model            |
| Japanese   | Base             | 58 million  | 13.62%  | Model            |
| Korean     | Tiny             | 26 million  | 6.46%   | Model            |
| Mandarin   | Base             | 58 million  | 25.76%  | Model            |
| Spanish    | Base             | 58 million  | 4.33%   | Model            |
| Ukrainian  | Base             | 58 million  | 14.55%  | Model            |
| Vietnamese | Base             | 58 million  | 8.82%   | Model            |
WER (Word Error Rate) is used for languages with word boundaries like English and Spanish. CER (Character Error Rate) is used for languages without clear word boundaries.
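As a quick illustration of the difference between the two metrics, here is a sketch using the third-party jiwer package (not part of Moonshine Voice):
import jiwer

reference = "the quick brown fox"
hypothesis = "the quick brown box"

print(jiwer.wer(reference, hypothesis))  # 1 wrong word out of 4 -> 0.25
print(jiwer.cer(reference, hypothesis))  # 1 wrong character out of ~19 -> ~0.05
The same single-character mistake costs a whole word under WER but only one character under CER, which is why CER is the fairer metric for scripts without spaces between words.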

Evaluation Methodology

  • English models: Evaluated using the HuggingFace OpenASR Leaderboard datasets and methodology
  • Other languages: Evaluated using the FLEURS dataset with the scripts/eval-model-accuracy.py script (see the sketch below for pulling FLEURS samples)
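If you want to spot-check a model against the same data, FLEURS is available through the HuggingFace datasets library. The dataset id and config name below follow HuggingFace conventions and are assumptions on our part (they also assume a datasets version that can still run the google/fleurs loading script); they are not inputs the Moonshine eval script requires:
from datasets import load_dataset

# Stream a few Korean test samples instead of downloading the full split.
fleurs = load_dataset("google/fleurs", "ko_kr", split="test", streaming=True)
for sample in fleurs.take(3):
    print(sample["transcription"])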

Downloading Models

The easiest way to get model files is to use the Python module:
python -m moonshine_voice.download --language en
You can use either the two-letter code or the English name for the language:
# Examples
python -m moonshine_voice.download --language spanish
python -m moonshine_voice.download --language ja
python -m moonshine_voice.download --language korean

Specifying Model Architecture

You can optionally request a specific model architecture using the --model-arch flag:
python -m moonshine_voice.download --language en --model-arch 5
Architecture numbers (from moonshine-c-api.h):
  • 0 - Tiny
  • 1 - Base
  • 2 - Tiny Streaming
  • 3 - Base Streaming
  • 4 - Small Streaming
  • 5 - Medium Streaming
If no architecture is specified, the script loads the highest-quality model available for that language.
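For reference, the same numbering expressed as a plain Python dict; this is illustrative, and the authoritative constants live in moonshine-c-api.h:
MODEL_ARCHS = {
    0: "Tiny",
    1: "Base",
    2: "Tiny Streaming",
    3: "Base Streaming",
    4: "Small Streaming",
    5: "Medium Streaming",
}

print(MODEL_ARCHS[5])  # Medium Streaming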

Download Output

The download script logs download progress, the model location, and the architecture:
encoder_model.ort: 100%|███████████████████████| 29.9M/29.9M [00:00<00:00, 34.5MB/s]
decoder_model_merged.ort: 100%|████████████████| 104M/104M [00:02<00:00, 52.6MB/s]
tokenizer.bin: 100%|█████████████████████████████| 244k/244k [00:00<00:00, 1.44MB/s]
Model download url: https://download.moonshine.ai/model/base-en/quantized/base-en
Model components: ['encoder_model.ort', 'decoder_model_merged.ort', 'tokenizer.bin']
Model arch: 1
Downloaded model path: /Users/username/Library/Caches/moonshine_voice/download.moonshine.ai/model/base-en/quantized/base-en
By default, models are cached in your user cache directory (~/Library/Caches/moonshine_voice on macOS). Set the MOONSHINE_VOICE_CACHE environment variable to use a different location.
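For example, on macOS or Linux you could redirect the cache before running the downloader (the path below is just a placeholder):
export MOONSHINE_VOICE_CACHE=/tmp/moonshine_models
python -m moonshine_voice.download --language en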

HuggingFace Models

Safetensor versions of the models are available on HuggingFace at huggingface.co/UsefulSensors/models. These are floating-point checkpoints exported directly from the training pipeline.
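To peek inside one of these checkpoints, you can use the safetensors library; the local filename here is a placeholder for whichever checkpoint you download:
from safetensors import safe_open

# List tensor names and shapes without loading the weights into memory.
with safe_open("model.safetensors", framework="pt") as f:
    for name in f.keys():
        print(name, f.get_slice(name).get_shape())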
The organization name “UsefulSensors” dates from an earlier incarnation of the company, when it focused on complete voice interface solutions integrated into low-cost chips with built-in microphones.

Non-Latin Language Configuration

For models that don’t use the Latin alphabet (Arabic, Japanese, Korean, Mandarin, Vietnamese), you must set the max_tokens_per_second option to 13.0 when creating the transcriber.
This is required because:
  • Hallucination detection uses a heuristic based on tokens per second
  • Non-Latin languages produce more tokens per second due to tokenization
  • Without this setting, valid outputs may be incorrectly truncated
from moonshine_voice import Transcriber  # assuming the package's top-level export

# Raise the tokens-per-second ceiling so hallucination detection does not
# truncate valid output in token-dense languages.
transcriber = Transcriber(
    model_path=model_path,
    model_arch=model_arch,
    options={"max_tokens_per_second": "13.0"},
)

Model Components

Each model consists of three files:
  1. encoder_model.ort - Encoder neural network (processes audio features)
  2. decoder_model_merged.ort - Decoder neural network (generates text)
  3. tokenizer.bin - Token-to-character mapping in compact binary format
All files must be present in the model directory for the transcriber to load successfully.
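A small sanity check before loading, assuming model_dir points at a downloaded model directory; the helper itself is illustrative and not part of the moonshine_voice API:
from pathlib import Path

REQUIRED_FILES = ["encoder_model.ort", "decoder_model_merged.ort", "tokenizer.bin"]

def check_model_dir(model_dir):
    # The transcriber needs all three components; fail early if any is absent.
    missing = [f for f in REQUIRED_FILES if not (Path(model_dir) / f).is_file()]
    if missing:
        raise FileNotFoundError(f"missing model components: {missing}")

check_model_dir(model_path)  # e.g. the path logged by the download script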
