Answers to common questions about Moonshine Voice.

General Questions

TL;DR - Use Moonshine when you're working with live speech. Moonshine is specifically optimized for real-time voice applications, while Whisper excels at batch processing. Here's why Moonshine is better for live speech:
1. Flexible Input Windows
  • Whisper always operates on a 30-second input window, wasting computation on padding
  • Moonshine accepts any length of audio, spending compute only on actual speech
  • Result: Much lower latency on short phrases (most voice interfaces)
2. Caching for Streaming
  • Whisper starts from scratch on every call, even for repeated audio
  • Moonshine caches encoding and decoder state, skipping redundant work
  • Result: Can provide updates while the user is still talking
3. Better Accuracy-to-Size Ratio
  • Moonshine Medium Streaming: 6.65% WER, 245M parameters
  • Whisper Large v3: 7.44% WER, 1.5B parameters
  • Moonshine achieves higher accuracy with 6x fewer parameters
4. Speed Comparison
| Model | WER | Parameters | MacBook Pro | Linux x86 | Raspberry Pi 5 |
|---|---|---|---|---|---|
| Moonshine Medium Streaming | 6.65% | 245M | 107 ms | 269 ms | 802 ms |
| Whisper Large v3 | 7.44% | 1.5B | 11,286 ms | 16,919 ms | N/A |
| Moonshine Small Streaming | 7.84% | 123M | 73 ms | 165 ms | 527 ms |
| Whisper Small | 8.59% | 244M | 1,940 ms | 3,425 ms | 10,397 ms |
Choose Whisper if: You're doing bulk offline processing with GPUs where throughput matters more than latency.
Choose Moonshine if: You're building voice interfaces that need to respond quickly to users.
Moonshine Voice currently supports:
  • English (Tiny, Tiny Streaming, Base, Small Streaming, Medium Streaming)
  • Arabic (Base)
  • Japanese (Base)
  • Korean (Tiny)
  • Mandarin Chinese (Base)
  • Spanish (Base)
  • Ukrainian (Base)
  • Vietnamese (Base)
Each language has dedicated models trained specifically for that language, providing better accuracy than multilingual models of the same size.
See Models for accuracy benchmarks and model details.
Moonshine Voice runs on:
  • Python (pip install)
  • iOS (Swift Package Manager)
  • Android (Maven)
  • macOS (Swift Package Manager)
  • Linux (x86_64, ARM)
  • Windows (Visual Studio)
  • Raspberry Pi (ARM)
  • IoT devices (via C++ core)
The same API works across all platforms with native language bindings.
No API key or account is needed - everything runs on-device:
  • No API keys required
  • No account or credit card needed
  • Complete privacy - audio never leaves your device
  • Works offline
  • No usage limits or quotas
You only need internet to download models initially.
Code and English Models: MIT License
  • Free for any use (commercial or non-commercial)
  • Modify, distribute, use privately
Non-English Models: Moonshine Community License
  • Free for research and non-commercial use
  • Free for commercial use if revenue < $1,000,000 USD/year
  • Commercial license required if revenue ≥ $1,000,000 USD/year
See the LICENSE file for full terms.

Technical Questions

To get the best transcription accuracy:
1. Use the Right Model
  • Larger models = better accuracy
  • Language-specific models outperform multilingual models
  • Choose based on your accuracy vs. speed tradeoff
2. Optimize Audio Quality
  • Use a good microphone
  • Minimize background noise
  • Test with save_input_wav_path option to verify audio quality
3. Adjust VAD Settings
transcriber = Transcriber(
    model_path=model_path,
    model_arch=model_arch,
    options={
        "vad_threshold": "0.5",  # Lower = longer segments
        "vad_window_duration": "0.5",  # Smoothing window
    }
)
4. For Non-Latin Languages
  • Set max_tokens_per_second to 13.0 (default is 6.5)
  • This prevents false hallucination detection
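As an illustration, this might look like the following - a minimal sketch assuming max_tokens_per_second is passed through the same string-valued options dict as the VAD settings above:
transcriber = Transcriber(
    model_path=model_path,
    model_arch=model_arch,
    options={
        # Assumed option key; raising it from the 6.5 default helps non-Latin scripts
        "max_tokens_per_second": "13.0",
    }
)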
5. Domain Customization

Streaming vs. non-streaming models:
Non-Streaming (Tiny, Base):
  • Process complete audio segments
  • Lower memory usage
  • Good for shorter phrases
  • Simpler architecture
Streaming (Tiny Streaming, Small Streaming, Medium Streaming):
  • Cache encoder output and decoder state
  • Process audio incrementally
  • Much lower latency for real-time use
  • Can update transcription while user is still speaking
  • Ideal for interactive applications
Recommendation: Use streaming models for live voice interfaces.
Use Streams to process multiple inputs with a single transcriber:
transcriber = Transcriber(model_path=model_path, model_arch=model_arch)

# Create separate streams
mic_stream = transcriber.create_stream()
system_audio_stream = transcriber.create_stream()

# Each stream has its own transcript
mic_stream.start()
system_audio_stream.start()

# Events are tagged with stream_handle
def on_line_completed(event):
    if event.stream_handle == mic_stream.handle:
        print(f"Mic: {event.line.text}")
    elif event.stream_handle == system_audio_stream.handle:
        print(f"System: {event.line.text}")
This shares model weights, reducing memory usage.
Moonshine uses ONNX Runtime, which supports hardware acceleration:
  • CPU: Works out of the box (optimized for ARM and x86_64)
  • GPU: ONNX Runtime can use CUDA, DirectML, or CoreML
  • NPU: Some platforms support neural processing units
However, the pre-built packages are optimized for CPU execution across all platforms; Moonshine is designed to be fast enough on CPU for real-time use. For custom GPU builds, you'll need to build ONNX Runtime with GPU support and link against it.
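If you're experimenting with a custom GPU build, you can check which execution providers your ONNX Runtime installation exposes using the standalone onnxruntime Python package (this inspects ONNX Runtime itself, not the Moonshine SDK):
import onnxruntime as ort

# Prints providers such as 'CUDAExecutionProvider' or 'CPUExecutionProvider'
print(ort.get_available_providers())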
To debug transcription issues:
1. Enable Logging
transcriber = Transcriber(
    model_path=model_path,
    model_arch=model_arch,
    options={
        "log_api_calls": "true",
        "log_output_text": "true",
        "log_ort_runs": "true",
    }
)
2. Save Input Audio
options={"save_input_wav_path": "/tmp/debug"}
Listen to the saved WAV files to verify audio quality.
3. Check Audio Format
  • Ensure audio is mono (not stereo)
  • Sample rate is handled automatically
  • Values should be in range [-1.0, 1.0]
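As an illustration (not part of the Moonshine SDK), converting captured audio into that format with NumPy might look like:
import numpy as np

def prepare_audio(samples):
    # Downmix stereo to mono by averaging the two channels
    if samples.ndim == 2:
        samples = samples.mean(axis=1)
    # Convert 16-bit PCM integers to float32 in [-1.0, 1.0]
    if samples.dtype == np.int16:
        samples = samples.astype(np.float32) / 32768.0
    return np.clip(samples, -1.0, 1.0)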
4. Test with Known Audio
Use the included test file:
import os
from moonshine_voice import get_assets_path, load_wav_file
test_wav = os.path.join(get_assets_path(), "two_cities.wav")
audio_data, sample_rate = load_wav_file(test_wav)
See Debugging Guide for more details.
Model sizes (quantized):
  • Tiny: ~26 MB (26M parameters)
  • Tiny Streaming: ~34 MB (34M parameters)
  • Base: ~58 MB (58M parameters)
  • Small Streaming: ~123 MB (123M parameters)
  • Medium Streaming: ~245 MB (245M parameters)
Runtime memory:
  • Small overhead for audio buffers
  • Caching state for streaming models
  • Multiple streams share model weights
All models are designed for edge devices with limited memory.

Integration Questions

Python:
pip install moonshine-voice
iOS/macOS: Add the Swift package: https://github.com/moonshine-ai/moonshine-swift/
Android: Add to gradle/libs.versions.toml:
[versions]
moonshineVoice = "0.0.49"

[libraries]
moonshine-voice = { group = "ai.moonshine", name = "moonshine-voice", version.ref = "moonshineVoice" }
Windows/C++: Download pre-built libraries from the GitHub releases.
See Installation Guide for details.
Moonshine Voice doesn't handle permissions directly - that's platform-specific.
iOS: Add to Info.plist:
<key>NSMicrophoneUsageDescription</key>
<string>We need microphone access to transcribe your speech</string>
Android: Add to AndroidManifest.xml:
<uses-permission android:name="android.permission.RECORD_AUDIO" />
And request the permission at runtime (see the Android examples).
Web: Use the Web Audio API with getUserMedia().
See the platform examples in /resources/examples for complete implementations.
Yes - you can use Moonshine from React Native or Flutter, but you'll need to create bindings:
React Native:
  • Create a native module wrapping the iOS/Android SDKs
  • Use bridge to communicate with JavaScript
Flutter:
  • Create a plugin using platform channels
  • Call native iOS/Android code
The community is working on official bindings for these platforms. Join Discord to collaborate or get updates.
iOS/macOS:
  1. Add model files to Xcode project
  2. Ensure they’re in “Copy Bundle Resources”
  3. Access via Bundle.main.url(forResource:withExtension:)
Android:
  1. Place models in src/main/assets/
  2. Access via AssetManager
Python:
  • Models download to cache directory automatically
  • Set custom location with MOONSHINE_VOICE_CACHE environment variable
See examples for complete implementations.
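For the Python case, a minimal sketch (assuming MOONSHINE_VOICE_CACHE is read when the first model download is triggered) would be:
import os

# Set this before the first Moonshine Voice model download is triggered
os.environ["MOONSHINE_VOICE_CACHE"] = "/opt/models/moonshine"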

Still Have Questions?

If you don't see your question answered here, see Support for more ways to get help.
