Overview
Moonshine Voice transcription transforms continuous audio streams into structured text with timestamps, speaker identification, and real-time updates. The pipeline is optimized for live speech applications where responsiveness matters.
Transcription Flow
Stage 1: Audio Preprocessing
Input Format
The transcriber accepts audio in any format through add_audio():
- Sample rate → Resampled to 16kHz
- Channels → Converted to mono
- Format → Normalized to float32 range [-1.0, 1.0]
The library uses 16kHz internally. To avoid resampling overhead, capture audio at 16kHz when possible.
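The three normalization steps above can be sketched in plain NumPy. This is a hedged illustration, not the library's internal code; in particular, the linear-interpolation resampler here is a stand-in for whatever resampler the library actually uses:

```python
import numpy as np

def normalize_audio(samples, rate, target_rate=16000):
    """Convert arbitrary audio to mono float32 at 16 kHz in [-1.0, 1.0]."""
    samples = np.asarray(samples)
    # Format -> float32: scale integer PCM into [-1.0, 1.0].
    if np.issubdtype(samples.dtype, np.integer):
        samples = samples.astype(np.float32) / np.iinfo(samples.dtype).max
    samples = samples.astype(np.float32)
    # Channels -> mono: average across channels if multi-channel.
    if samples.ndim == 2:
        samples = samples.mean(axis=1)
    # Sample rate -> 16 kHz: simple linear-interpolation resample (stand-in).
    if rate != target_rate:
        n_out = int(round(len(samples) * target_rate / rate))
        x_out = np.linspace(0, len(samples) - 1, n_out)
        samples = np.interp(x_out, np.arange(len(samples)), samples)
    return samples.astype(np.float32)
```

Note the order: integer scaling happens before the mono mix, so the integer dtype is still visible when choosing the scale factor.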
Buffering Strategy
From python/src/moonshine_voice/transcriber.py:359-374: incoming audio is buffered and transcription is re-run periodically, governed by update_interval (default 0.5 seconds).
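A minimal sketch of that buffering strategy, assuming only what is stated above (the class and method names here are illustrative, not the library's internal implementation):

```python
import numpy as np

class AudioBuffer:
    """Illustrative buffer: accumulates incoming audio and reports when
    enough new samples have arrived to justify re-running transcription."""

    def __init__(self, update_interval=0.5, rate=16000):
        self.update_interval = update_interval
        self.rate = rate
        self.chunks = []
        self.samples_since_update = 0

    def add_audio(self, chunk):
        """Append a chunk; return True once update_interval seconds of
        new audio have accumulated since the last update."""
        self.chunks.append(chunk)
        self.samples_since_update += len(chunk)
        if self.samples_since_update >= self.update_interval * self.rate:
            self.samples_since_update = 0
            return True
        return False
```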
Stage 2: Voice Activity Detection
The Silero VAD model detects speech and segments audio into phrases.
VAD Configuration
From README.md lines 669-676, these options control VAD behavior:
How VAD Works
From core/silero-vad.h:22-89:
- Frame Processing: VAD runs on 32ms frames (512 samples at 16kHz)
- Context Addition: 64 samples of context from previous chunk for continuity
- Probability Output: Returns probability [0.0-1.0] that frame contains speech
- Averaging: Results averaged over vad_window_duration for stability
- Thresholding: When the average exceeds vad_threshold, speech is detected
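The averaging and thresholding steps can be sketched as follows. This is an illustrative reimplementation of the logic described above, not the silero-vad.h code; the window length in frames is an assumption:

```python
import numpy as np

FRAME_SAMPLES = 512    # 32 ms per frame at 16 kHz
CONTEXT_SAMPLES = 64   # context carried over from the previous chunk

def detect_speech(frame_probs, vad_threshold=0.5, window_frames=4):
    """Average per-frame speech probabilities over a sliding window,
    then threshold the average; returns one boolean per frame."""
    probs = np.asarray(frame_probs, dtype=np.float32)
    decisions = []
    for i in range(len(probs)):
        window = probs[max(0, i - window_frames + 1): i + 1]
        decisions.append(float(window.mean()) > vad_threshold)
    return decisions
```

The averaging keeps a single noisy frame from toggling the speech state on and off.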
Speech Padding
To avoid cutting off speech starts/ends:
- Look-behind: 8192 samples (512ms) prepended when speech detected
- Speech pad: 30ms padding added around detected speech
- Min silence: 100ms silence required to end segment
- Min speech: 250ms minimum segment duration
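Applying those constants to raw VAD boundaries might look like this. A hedged sketch only: the function name and the exact way the library combines look-behind with the 30ms pad are assumptions, the constants come from the list above:

```python
RATE = 16000
LOOK_BEHIND = 8192               # 512 ms prepended when speech is detected
SPEECH_PAD = int(0.030 * RATE)   # 30 ms padding around detected speech
MIN_SPEECH = int(0.250 * RATE)   # 250 ms minimum segment duration

def pad_segment(start, end, total_samples):
    """Expand raw VAD boundaries (in samples) with look-behind and
    padding, clamped to the audio; drop segments that are too short."""
    if end - start < MIN_SPEECH:
        return None
    padded_start = max(0, start - LOOK_BEHIND)
    padded_end = min(total_samples, end + SPEECH_PAD)
    return padded_start, padded_end
```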
Segment Duration Management
From README.md:675: vad_max_segment_duration: Sets the longest duration a line can be before it’s marked as complete. Default is 15 seconds. The vad_threshold is linearly decreased from 2/3 of max duration to force finding a break.
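One plausible reading of that linear decrease, as an illustrative formula (the exact ramp is not given in this excerpt, so the endpoint of zero at max duration is an assumption): the effective threshold stays at vad_threshold until the segment reaches 2/3 of vad_max_segment_duration, then falls linearly so a break becomes progressively easier to find.

```python
def effective_vad_threshold(elapsed, vad_threshold=0.5,
                            vad_max_segment_duration=15.0):
    """Relax the VAD threshold linearly once a segment passes 2/3 of
    its maximum duration, forcing a break point to be found."""
    knee = vad_max_segment_duration * 2.0 / 3.0
    if elapsed <= knee:
        return vad_threshold
    frac = (elapsed - knee) / (vad_max_segment_duration - knee)
    return max(0.0, vad_threshold * (1.0 - frac))
```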
Stage 3: Speech-to-Text Model
Moonshine ASR models convert audio segments to text.
Model Architecture Types
From core/moonshine-c-api.h:97-103:
Non-Streaming Transcription
For offline audio or complete segments (python/src/moonshine_voice/transcriber.py:146-186):
- Processes entire audio array at once
- VAD segments audio into phrases
- Each segment transcribed independently
- Returns a complete Transcript with all lines finalized
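The four steps above can be sketched end to end. The Line and Transcript classes here are simplified stand-ins for the library's types, and segment_fn/asr_fn are placeholders for the VAD and ASR stages:

```python
from dataclasses import dataclass, field

@dataclass
class Line:
    text: str
    start: float      # seconds
    end: float        # seconds
    finalized: bool = True

@dataclass
class Transcript:
    lines: list = field(default_factory=list)

def transcribe_offline(audio, segment_fn, asr_fn, rate=16000):
    """Offline flow: VAD segments the audio, each segment is
    transcribed independently, and every line is finalized."""
    transcript = Transcript()
    for start, end in segment_fn(audio):
        text = asr_fn(audio[start:end])
        transcript.lines.append(Line(text, start / rate, end / rate))
    return transcript
```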
Streaming Transcription
For live audio with incremental updates, the same pipeline re-runs as new audio arrives, updating the active line in place (see the Streaming ASR guide under Next Steps).
Stage 4: Token Decoding
The ASR model outputs tokens that must be decoded to text:
- Encoder: Audio → latent representation
- Decoder: Latent representation → token sequence
- Tokenizer: Tokens → UTF-8 text
The token-to-text mapping comes from the tokenizer.bin file, loaded from the model directory.
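The final Tokens → UTF-8 step is, at its core, a vocabulary lookup. A simplified sketch, assuming a SentencePiece-style vocabulary where "▁" marks a word boundary (that convention is an assumption; tokenizer.bin's actual format is not described in this excerpt, and real subword tokenizers also handle merges and byte fallbacks):

```python
def decode_tokens(token_ids, vocab):
    """Map decoder token IDs to UTF-8 text via a vocabulary lookup,
    turning the '\u2581' word-boundary marker back into spaces."""
    return "".join(vocab[t] for t in token_ids).replace("\u2581", " ").strip()
```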
Hallucination Detection
From README.md:667:
max_tokens_per_second: Models occasionally get caught in an infinite decoder loop, repeating the same words. We compare tokens to duration and truncate if too many. Default is 6.5, but for non-Latin languages use 13.0.
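The check described above amounts to comparing the decoded token count against the audio duration and truncating the tail. A hedged sketch (the function name is mine; the library may truncate at a different point in the loop):

```python
def truncate_hallucination(tokens, audio_duration, max_tokens_per_second=6.5):
    """Cap the token count at max_tokens_per_second * duration,
    discarding the tail of a runaway decoder repetition loop."""
    limit = int(max_tokens_per_second * audio_duration)
    return tokens[:limit]
```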
Stage 5: Speaker Identification
Optional diarization assigns speaker IDs to segments.
Speaker ID Assignment
From core/moonshine-c-api.h:159-162:
- has_speaker_id: Boolean, true when speaker identified
- speaker_id: Unique 64-bit identifier for this speaker
- speaker_index: Display order (0, 1, 2 for “Speaker 1”, “Speaker 2”, etc.)
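The relationship between the three fields can be mirrored in a small illustrative structure (a sketch of the semantics above, not the C API's actual types):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpeakerInfo:
    """Illustrative mirror of the three speaker fields."""
    has_speaker_id: bool
    speaker_id: Optional[int] = None     # unique 64-bit identifier
    speaker_index: Optional[int] = None  # display order: 0 -> "Speaker 1"

def display_name(info):
    """Render the display label implied by speaker_index."""
    if not info.has_speaker_id:
        return "Unknown speaker"
    return f"Speaker {info.speaker_index + 1}"
```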
Configuration
Speaker identification is experimental. Accuracy may not be suitable for all applications.
Transcript Structure
TranscriptLine
From core/moonshine-c-api.h:168-202, each line contains:
Transcript
From core/moonshine-c-api.h:204-208:
Update Intervals
Transcription doesn’t happen on every add_audio() call. From core/moonshine-c-api.h:456-460:
By default this function will only perform full analysis if there has been more than 200ms of new samples since the last complete analysis. This can be overridden by setting the MOONSHINE_FLAG_FORCE_UPDATE flag.
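That gating rule is simple to express. An illustrative sketch of the described behavior (the flag name comes from the source; its numeric value here is assumed):

```python
MOONSHINE_FLAG_FORCE_UPDATE = 1 << 0  # illustrative value; see moonshine-c-api.h

def should_run_full_analysis(new_samples, flags=0, rate=16000,
                             min_new_seconds=0.200):
    """Run full analysis only when more than 200 ms of new samples
    have arrived, unless the force flag overrides the gate."""
    if flags & MOONSHINE_FLAG_FORCE_UPDATE:
        return True
    return new_samples > min_new_seconds * rate
```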
Configurable via:
- Constructor: update_interval=0.5 (seconds)
- Option: transcription_interval in options dict
- Manual: update_transcription(MOONSHINE_FLAG_FORCE_UPDATE)
Event Flow Guarantees
From README.md:277-288, the transcription event system provides these guarantees:
- LineStarted called exactly once per segment
- LineCompleted called exactly once after LineStarted
- LineUpdated/LineTextChanged only between started and completed
- Only one line active at a time per stream
- Completed lines never modified again
- line_id remains stable throughout line’s lifetime
- Calling stop() completes any active lines
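The ordering guarantees can be checked mechanically. A small illustrative checker for a single stream (method names like on_started are mine, not the library's event API):

```python
class LineEventChecker:
    """Asserts the ordering guarantees for one stream: one active line,
    started before updated, and no events after completion."""

    def __init__(self):
        self.active_line = None
        self.completed = set()

    def on_started(self, line_id):
        assert self.active_line is None, "only one active line per stream"
        assert line_id not in self.completed, "completed lines never restart"
        self.active_line = line_id

    def on_updated(self, line_id):
        assert line_id == self.active_line, "updates only between started/completed"

    def on_completed(self, line_id):
        assert line_id == self.active_line, "completed exactly once, after started"
        self.active_line = None
        self.completed.add(line_id)
```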
Performance Optimization
Skip Transcription
If you only need VAD segmentation:
Disable Audio Return
Reduce memory overhead:
Debugging Transcription
Save Input Audio
From README.md:395-404:
Log API Calls
Log Output Text
Example: Complete Transcription
Next Steps
Streaming ASR
Learn how streaming reduces latency
Model Architectures
Choose the right model for your needs