Quantization
All Moonshine models use post-training quantization to reduce size and improve inference speed while maintaining accuracy.
Default Quantization Strategy
Moonshine models are quantized using:
- 8-bit weights across the board
- 8-bit calculations for heavy operations like MatMul
- BF16 (bfloat16) float precision for frontend convolution layers
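The arithmetic behind 8-bit weight quantization can be sketched in a few lines. This is an illustration of the general technique, not Moonshine's actual pipeline: each float weight tensor gets a per-tensor scale, weights are rounded to int8, and dequantization recovers an approximation.

```python
# Illustrative sketch of symmetric int8 weight quantization --
# the underlying arithmetic, not Moonshine's actual code.

def quantize_int8(weights):
    """Map float weights to int8 values plus a per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

weights = [0.02, -1.27, 0.64, 0.001]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding bounds the per-weight error by about half the scale step.
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(weights, restored))
```

The per-tensor error bound is why 8-bit weights usually preserve accuracy for large matrices, while the frontend convolutions (which see the full dynamic range of raw audio) are kept in higher precision.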
Frontend Precision Exception
The frontend uses convolution layers to generate features (similar to MEL spectrogram preprocessing, but learned). Since the inputs correspond to 16-bit signed integers from raw audio (encoded as floats), these convolution operations require at least BF16 float precision for optimal quality.
Quantization Tools
Moonshine uses a combination of:
- ONNX Runtime's quantization tools
- The ONNX Shrink Ray utility for additional optimization
See scripts/quantize-streaming-model.sh for specific configuration details.
Model Variants
When downloading models, you can specify different quantization levels:

| Variant | Precision | Model Size | Inference Speed | Quality |
|---|---|---|---|---|
| fp32 | 32-bit float | Largest | Slowest | Highest |
| fp16 | 16-bit float | Medium | Medium | High |
| q8 | 8-bit int | Small | Fast | Good (recommended) |
| q4 | 4-bit int | Smallest | Fastest | Acceptable |
| q4f16 | Mixed 4/16-bit | Small | Fast | Good |
The default q8 (8-bit) quantization provides the best balance of size, speed, and quality for most applications.
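One way to apply the table is to pick a variant from rough device constraints. The helper below is hypothetical (the thresholds are illustrative, not from Moonshine); only the variant names come from the table above.

```python
# Hypothetical helper: pick a quantization variant from the table above
# based on rough device constraints. Thresholds are illustrative only.

def choose_variant(storage_mb, needs_best_quality=False):
    if needs_best_quality and storage_mb >= 500:
        return "fp32"    # largest, highest quality
    if storage_mb < 50:
        return "q4"      # smallest footprint, acceptable quality
    return "q8"          # recommended default balance

assert choose_variant(1000, needs_best_quality=True) == "fp32"
assert choose_variant(30) == "q4"
assert choose_variant(200) == "q8"
```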
Domain Customization
Customizing models for specific vocabulary, jargon, accents, or dialects can significantly improve accuracy for your application.
Commercial Full Retraining
Moonshine AI offers full model retraining as a commercial service:
- Training on Moonshine's internal dataset plus your domain-specific data
- Optimization for technical terms, industry jargon, or specialized vocabulary
- Accent and dialect customization
- Support for new languages or language variants
Community Fine-Tuning Project
A community project provides lightweight fine-tuning capabilities:
Repository: github.com/pierre-cheneau/finetune-moonshine-asr
This project enables:
- Fine-tuning existing Moonshine models on custom datasets
- Adapting models to specific domains without full retraining
- Experimenting with domain adaptation techniques
Community fine-tuning is experimental and may not achieve the same quality as full retraining with Moonshine AI’s proprietary dataset.
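Fine-tuning datasets are generally organized as (audio, transcript) pairs. The JSONL manifest below is a common convention, not the format documented by the community repository, so treat the layout as an assumption to be checked against that project's README.

```python
import json

# Sketch: write a JSONL manifest of (audio, transcript) pairs for fine-tuning.
# The input format expected by finetune-moonshine-asr may differ; this layout
# is an assumption, not taken from the repository.

samples = [
    {"audio": "clips/call_001.wav", "text": "reset the router"},
    {"audio": "clips/call_002.wav", "text": "open ticket four two"},
]

def write_manifest(samples, path):
    """Write one JSON object per line, one line per training sample."""
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample) + "\n")

write_manifest(samples, "train.jsonl")
```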
Model Architecture Customization
For advanced users who want to modify the model architecture itself:
Model Files
Each Moonshine model consists of:
- encoder_model.ort - ONNX model for audio encoding
- decoder_model_merged.ort - ONNX model for text generation
- tokenizer.bin - Binary token vocabulary file
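Since all three files are required, it can help to verify a model directory is complete before creating inference sessions. The helper below is our own convenience, not part of Moonshine's API; the commented loading lines assume ONNX Runtime is installed (it supports the .ort format).

```python
from pathlib import Path

# The three files listed above make up one model. This completeness check is
# our own helper, not part of Moonshine's API.
REQUIRED_FILES = ("encoder_model.ort", "decoder_model_merged.ort", "tokenizer.bin")

def missing_files(model_dir):
    """Return the names of any required model files absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).exists()]

# If nothing is missing, the .ort models can be loaded with ONNX Runtime,
# e.g. (assumption: onnxruntime is installed):
#   import onnxruntime as ort
#   encoder = ort.InferenceSession(str(Path(model_dir) / "encoder_model.ort"))
```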
Source Weights
Original model weights are available on HuggingFace:
- Organization: huggingface.co/UsefulSensors/models
- Format: Safetensors (floating-point checkpoints)
Conversion Scripts
Convert HuggingFace models to ONNX format:
Tokenizer Conversion
Convert JSON tokenizers to Moonshine's binary format:
Runtime Customization Options
Moonshine provides several runtime options to customize behavior without retraining:
Voice Activity Detection (VAD)
- vad_threshold: Lower values (0.3) = longer segments with more background noise; higher values (0.7) = shorter, cleaner segments
- vad_window_duration: Shorter = faster speech detection, less accuracy; Longer = more accurate, may miss short utterances
- vad_max_segment_duration: Maximum segment duration before forcing a break
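Since all runtime option values must be passed as strings (noted below), a small normalization helper can prevent type mistakes. The option names here come from this page; the helper itself is our own convenience, not part of Moonshine's API.

```python
# Normalize a dict of option values into the string form Moonshine expects.
# The helper is our own convenience; option names are from this page.

def as_string_options(options):
    """Convert every option value to a string."""
    return {key: str(value) for key, value in options.items()}

opts = as_string_options({
    "vad_threshold": 0.5,            # balance segment length vs. noise
    "vad_max_segment_duration": 30,  # force a break after 30 seconds
})
assert opts == {"vad_threshold": "0.5", "vad_max_segment_duration": "30"}
```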
Hallucination Prevention
Transcription Behavior
Debug Options
All option values must be passed as strings, even for numeric values:
{"max_tokens_per_second": "13.0"}
Platform-Specific Optimization
Moonshine models are automatically optimized for your target platform, but you can further customize:
Mobile Optimization
- Use Tiny or Base models for smaller binary size
- Consider q4 quantization for minimal storage impact
- Disable speaker identification if not needed
- Reduce transcription_interval to lower compute load
Server Optimization
- Use Medium Streaming for highest accuracy
- Enable all features (speaker ID, audio data)
- Shorter transcription_interval for more responsive updates
Embedded Devices
- Stick with Tiny Streaming for Raspberry Pi and similar devices
- Increase vad_threshold to filter out more background noise
- Set return_audio_data to "false" to reduce memory usage
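Putting the embedded-device guidance together, a configuration might look like the sketch below. The option names come from this page; passing them as one dict is an assumption about the API surface, and all values are strings per the rule above.

```python
# Example embedded-device configuration following the guidance above.
# Option names are from this page; all values are strings as required.

embedded_options = {
    "vad_threshold": "0.7",        # filter more background noise
    "return_audio_data": "false",  # reduce memory usage
}
assert all(isinstance(value, str) for value in embedded_options.values())
```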
Future Customization Features
Moonshine is actively developing:
- Lightweight domain customization: Fine-tuning without full retraining
- More languages: Expanding language support
- Binary size reduction: Smaller models for mobile deployment
- Improved speaker identification: Better diarization accuracy