
Overview

Moonshine Voice transcription transforms continuous audio streams into structured text with timestamps, speaker identification, and real-time updates. The pipeline is optimized for live speech applications where responsiveness matters.

Transcription Flow

Audio Input → VAD Segmentation → ASR Model → Speaker ID → Events
     ↓              ↓                ↓           ↓          ↓
  16kHz PCM    Speech/Silence   Tokens→Text  Embeddings  Listeners

Stage 1: Audio Preprocessing

Input Format

The transcriber accepts PCM audio at any sample rate through add_audio():
transcriber.add_audio(
    audio_data,    # List[float] or array of PCM samples
    sample_rate    # int, e.g., 44100, 16000, 48000
)
Internal conversion:
  • Sample rate → Resampled to 16kHz
  • Channels → Converted to mono
  • Format → Normalized to float32 range [-1.0, 1.0]
The library uses 16kHz internally. To avoid resampling overhead, capture audio at 16kHz when possible.
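If your capture pipeline delivers something else, you can do the conversion yourself before calling add_audio(). A minimal sketch, assuming numpy is available and the device delivers interleaved int16 stereo at 48 kHz (both assumptions, not library requirements):
import numpy as np

def to_mono_float32_16k(raw_int16_stereo, input_rate=48000):
    """Downmix interleaved int16 stereo to mono float32 and resample to 16 kHz."""
    samples = np.asarray(raw_int16_stereo, dtype=np.int16).reshape(-1, 2)
    mono = samples.astype(np.float32).mean(axis=1) / 32768.0   # normalize to [-1.0, 1.0]
    target_len = int(len(mono) * 16000 / input_rate)
    positions = np.linspace(0, len(mono) - 1, target_len)
    # Naive linear-interpolation resample; a dedicated resampler gives better quality.
    return np.interp(positions, np.arange(len(mono)), mono).astype(np.float32)

audio_16k = to_mono_float32_16k(raw_chunk)           # raw_chunk: your capture buffer
transcriber.add_audio(audio_16k.tolist(), 16000)     # no internal resampling needed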

Buffering Strategy

From python/src/moonshine_voice/transcriber.py:359-374:
def add_audio(self, audio_data: List[float], sample_rate: int = 16000):
    """Add audio data to the stream."""
    audio_array = (ctypes.c_float * len(audio_data))(*audio_data)
    error = self._lib.moonshine_transcribe_add_audio_to_stream(
        self._transcriber._handle,
        self._handle,
        audio_array,
        len(audio_data),
        sample_rate,
        0,
    )
    check_error(error)
    self._stream_time += len(audio_data) / sample_rate
    if self._stream_time - self._last_update_time >= self._update_interval:
        self.update_transcription(0)
        self._last_update_time = self._stream_time
Audio is buffered internally and automatically triggers transcription updates at the update_interval (default 0.5 seconds).
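For example, with the default update_interval of 0.5 seconds and 16 kHz input, feeding 100 ms chunks means the automatic update fires roughly every fifth add_audio() call:
chunk = [0.0] * 1600                     # 100 ms of silence at 16 kHz
for _ in range(10):                      # 1 second of audio total
    transcriber.add_audio(chunk, 16000)
# _stream_time advances by 0.1 s per call, so the automatic
# update_transcription(0) runs at roughly the 0.5 s and 1.0 s marks.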

Stage 2: Voice Activity Detection

The Silero VAD model detects speech and segments audio into phrases.

VAD Configuration

From README.md lines 669-676, these options control VAD behavior:
transcriber = Transcriber(
    model_path=model_path,
    model_arch=ModelArch.BASE,
    options={
        "vad_threshold": "0.5",          # Sensitivity (0.0-1.0)
        "vad_window_duration": "0.5",    # Averaging window in seconds
        "vad_look_behind_sample_count": "8192",  # Samples to prepend
        "vad_max_segment_duration": "15.0",     # Max line length
    }
)

How VAD Works

From core/silero-vad.h:22-89:
  1. Frame Processing: VAD runs on 32ms frames (512 samples at 16kHz)
  2. Context Addition: 64 samples of context from previous chunk for continuity
  3. Probability Output: Returns probability [0.0-1.0] that frame contains speech
  4. Averaging: Results averaged over vad_window_duration for stability
  5. Thresholding: When average exceeds vad_threshold, speech detected
A lower vad_threshold (e.g., 0.3) produces longer segments that include more background noise; higher values (e.g., 0.7) break speech into smaller chunks but risk clipping words.
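An illustrative sketch of the averaging-and-thresholding step (not the library's implementation; the frame and window sizes follow the numbers above):
FRAME_MS = 32                             # 512 samples at 16 kHz
WINDOW_FRAMES = int(500 / FRAME_MS)       # ~0.5 s vad_window_duration

def window_is_speech(frame_probs, vad_threshold=0.5):
    """Average the most recent per-frame speech probabilities and threshold them."""
    window = frame_probs[-WINDOW_FRAMES:]
    return sum(window) / len(window) >= vad_threshold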

Speech Padding

To avoid cutting off speech starts/ends:
  • Look-behind: 8192 samples (512ms) prepended when speech detected
  • Speech pad: 30ms padding added around detected speech
  • Min silence: 100ms silence required to end segment
  • Min speech: 250ms minimum segment duration

Segment Duration Management

From README.md:675:
vad_max_segment_duration: Sets the longest duration a line can reach before it is marked as complete. Default is 15 seconds. Starting at 2/3 of the max duration, vad_threshold is linearly decreased so that a break is found before the limit.
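A rough sketch of that ramp, assuming a simple linear decay toward zero (the library's exact formula may differ):
def effective_threshold(elapsed_s, vad_threshold=0.5, max_segment=15.0):
    """Hold the configured threshold until 2/3 of the max duration,
    then ramp it linearly toward zero so a break is found before the limit."""
    ramp_start = max_segment * 2.0 / 3.0          # 10 s for the 15 s default
    if elapsed_s <= ramp_start:
        return vad_threshold
    progress = (elapsed_s - ramp_start) / (max_segment - ramp_start)
    return max(0.0, vad_threshold * (1.0 - progress))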

Stage 3: Speech-to-Text Model

Moonshine ASR models convert audio segments to text.

Model Architecture Types

From core/moonshine-c-api.h:97-103:
#define MOONSHINE_MODEL_ARCH_TINY (0)
#define MOONSHINE_MODEL_ARCH_BASE (1)
#define MOONSHINE_MODEL_ARCH_TINY_STREAMING (2)
#define MOONSHINE_MODEL_ARCH_BASE_STREAMING (3)
#define MOONSHINE_MODEL_ARCH_SMALL_STREAMING (4)
#define MOONSHINE_MODEL_ARCH_MEDIUM_STREAMING (5)
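Only ModelArch.BASE appears elsewhere on this page; assuming the Python ModelArch enum mirrors these C constants (the BASE_STREAMING member name below is an assumption), selecting a streaming variant would look like:
from moonshine_voice import Transcriber, ModelArch

# Assumption: ModelArch.BASE_STREAMING mirrors MOONSHINE_MODEL_ARCH_BASE_STREAMING.
transcriber = Transcriber(
    model_path="/path/to/models",
    model_arch=ModelArch.BASE_STREAMING,
)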

Non-Streaming Transcription

For offline audio or complete segments:
transcript = transcriber.transcribe_without_streaming(
    audio_data=audio_samples,
    sample_rate=16000,
    flags=0
)

for line in transcript.lines:
    print(f"[{line.start_time:.2f}s] {line.text}")
From python/src/moonshine_voice/transcriber.py:146-186:
  • Processes entire audio array at once
  • VAD segments audio into phrases
  • Each segment transcribed independently
  • Returns complete Transcript with all lines finalized

Streaming Transcription

For live audio with incremental updates:
transcriber.start()

# Feed audio in chunks as it arrives
for chunk in audio_chunks:
    transcriber.add_audio(chunk, sample_rate)
    # Transcription happens automatically at update_interval

transcript = transcriber.stop()
Streaming models cache computation for lower latency (see Streaming concepts).

Stage 4: Token Decoding

The ASR model outputs tokens that must be decoded to text:
  1. Encoder: Audio → latent representation
  2. Decoder: Latent representation → token sequence
  3. Tokenizer: Tokens → UTF-8 text
The tokenizer is stored in a tokenizer.bin file loaded from the model directory.
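Conceptually (the helper names below are illustrative, not the library's API):
def decode_segment(audio_16khz, encoder, decoder, tokenizer):
    latents = encoder(audio_16khz)         # 1. audio -> latent representation
    token_ids = decoder(latents)           # 2. latents -> token sequence
    return tokenizer.decode(token_ids)     # 3. tokens -> UTF-8 text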

Hallucination Detection

From README.md:667:
max_tokens_per_second: Models occasionally get caught in an infinite decoder loop, repeating the same words. We compare tokens to duration and truncate if too many. Default is 6.5, but for non-Latin languages use 13.0.
options = {
    "max_tokens_per_second": "13.0"  # For Korean, Japanese, etc.
}
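The check this describes amounts to a simple rate comparison; a sketch (illustrative, not the library's code):
def looks_like_hallucination(num_tokens, segment_seconds, max_tokens_per_second=6.5):
    """True when the decoder emitted tokens faster than the configured rate."""
    return num_tokens / segment_seconds > max_tokens_per_second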

Stage 5: Speaker Identification

Optional diarization assigns speaker IDs to segments.

Speaker ID Assignment

From core/moonshine-c-api.h:159-162:
/* The speaker ID is another 64-bit randomly-generated number, used to identify
   the calculated speaker of the line, for diarization purposes. This is not
   available until the line has accumulated enough audio data to be confident
   in the speaker identification, or if the line is complete. */
Speaker metadata:
  • has_speaker_id: Boolean, true when speaker identified
  • speaker_id: Unique 64-bit identifier for this speaker
  • speaker_index: 0-based display order (index 0 is shown as “Speaker 1”, index 1 as “Speaker 2”, etc.)
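A usage sketch that groups completed lines by speaker using these fields:
from collections import defaultdict

by_speaker = defaultdict(list)
for line in transcript.lines:
    if line.is_complete and line.has_speaker_id:
        by_speaker[line.speaker_index].append(line.text)

for index in sorted(by_speaker):
    print(f"Speaker {index + 1}: {' '.join(by_speaker[index])}")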

Configuration

# Enabled by default
options = {"identify_speakers": "true"}

# Disable for performance
options = {"identify_speakers": "false"}
Speaker identification is experimental. Accuracy may not be suitable for all applications.

Transcript Structure

TranscriptLine

From core/moonshine-c-api.h:168-202, each line contains:
line = TranscriptLine(
    text="Hello world",               # UTF-8 transcribed text
    start_time=1.5,                    # Start offset in seconds
    duration=2.3,                      # Segment length in seconds
    line_id=0x1234567890ABCDEF,        # Unique 64-bit ID
    is_complete=True,                  # Speech ended?
    is_updated=True,                   # Changed since last update?
    is_new=False,                      # Just added?
    has_text_changed=True,             # Text changed?
    has_speaker_id=True,               # Speaker identified?
    speaker_id=0xFEDCBA0987654321,    # Speaker's unique ID
    speaker_index=0,                   # Speaker #1
    audio_data=[...],                  # Raw 16kHz PCM audio
    last_transcription_latency_ms=87   # Processing time
)
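In a streaming loop, these flags let you separate finalized lines from in-progress drafts. A minimal sketch:
for line in transcript.lines:
    if line.is_complete:
        print(f"FINAL [{line.start_time:.2f}s] {line.text}")
    elif line.has_text_changed:
        print(f"DRAFT [{line.start_time:.2f}s] {line.text}")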

Transcript

From core/moonshine-c-api.h:204-208:
transcript = Transcript(
    lines=[line1, line2, line3, ...]  # Time-ordered list
)

Update Intervals

Transcription doesn’t happen on every add_audio() call. From core/moonshine-c-api.h:456-460:
By default this function will only perform full analysis if there has been more than 200ms of new samples since the last complete analysis. This can be overridden by setting the MOONSHINE_FLAG_FORCE_UPDATE flag.
Configurable via:
  • Constructor: update_interval=0.5 (seconds)
  • Option: transcription_interval in options dict
  • Manual: update_transcription(MOONSHINE_FLAG_FORCE_UPDATE)
# Force immediate update
transcript = stream.update_transcription(
    flags=Transcriber.MOONSHINE_FLAG_FORCE_UPDATE
)

Event Flow Guarantees

From README.md:277-288, the transcription event system provides these guarantees:
  1. LineStarted called exactly once per segment
  2. LineCompleted called exactly once after LineStarted
  3. LineUpdated/LineTextChanged only between started and completed
  4. Only one line active at a time per stream
  5. Completed lines never modified again
  6. line_id remains stable throughout line’s lifetime
  7. Calling stop() completes any active lines
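These guarantees amount to a small state machine. An illustrative checker (pure Python, no library API assumed) over a sequence of (event, line_id) pairs:
def check_event_order(events):
    """events: iterable of (event_name, line_id) pairs."""
    active = None
    completed = set()
    for event, line_id in events:
        assert line_id not in completed, "completed lines are never touched again"
        if event == "LineStarted":
            assert active is None, "only one active line per stream"
            active = line_id
        elif event in ("LineUpdated", "LineTextChanged"):
            assert line_id == active, "updates occur only while the line is active"
        elif event == "LineCompleted":
            assert line_id == active, "completion follows the matching LineStarted"
            completed.add(line_id)
            active = None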

Performance Optimization

Skip Transcription

If you only need VAD segmentation:
options = {"skip_transcription": "true"}
transcriber = Transcriber(model_path, model_arch, options=options)

# Lines will have audio_data but empty text
for line in transcript.lines:
    process_audio_segment(line.audio_data)

Disable Audio Return

Reduce memory overhead:
options = {"return_audio_data": "false"}
# line.audio_data will be None

Debugging Transcription

Save Input Audio

From README.md:395-404:
options = {"save_input_wav_path": "."}
transcriber = Transcriber(model_path, model_arch, options=options)
# Saves input_1.wav, input_2.wav, etc. for each stream

Log API Calls

options = {"log_api_calls": "true"}
# Prints all C API calls to console

Log Output Text

options = {"log_output_text": "true"}
# Prints transcription results to console

Example: Complete Transcription

from moonshine_voice import Transcriber, ModelArch, load_wav_file

# Load audio
audio_data, sample_rate = load_wav_file("speech.wav")

# Create transcriber
transcriber = Transcriber(
    model_path="/path/to/models",
    model_arch=ModelArch.BASE,
    update_interval=0.5,
    options={
        "vad_threshold": "0.5",
        "identify_speakers": "true"
    }
)

transcriber.start()

# Simulate streaming by chunking
chunk_size = int(0.1 * sample_rate)  # 100ms chunks
for i in range(0, len(audio_data), chunk_size):
    chunk = audio_data[i:i + chunk_size]
    transcriber.add_audio(chunk, sample_rate)

transcript = transcriber.stop()

# Print results
for line in transcript.lines:
    speaker = f"Speaker {line.speaker_index + 1}: " if line.has_speaker_id else ""
    print(f"[{line.start_time:.1f}s] {speaker}{line.text}")

transcriber.close()

Next Steps

Streaming ASR

Learn how streaming reduces latency

Model Architectures

Choose the right model for your needs
