
Overview

Moonshine Voice is designed with a simple, event-based architecture that abstracts away the complexity of voice processing. The framework provides a high-level API that lets developers focus on building applications rather than managing audio pipelines.

Design Philosophy

The basic flow is straightforward:
  1. Create a Transcriber or IntentRecognizer object
  2. Attach an EventListener that gets called when important events occur
  3. Feed in audio and respond to events

Batteries Included: Moonshine Voice includes all stages of the voice processing pipeline in a single library - microphone capture, voice activity detection, speech-to-text, speaker identification, and intent recognition.

Architecture Diagram

Traditionally, adding a voice interface required integrating multiple libraries for different processing stages. Moonshine Voice consolidates these into one framework:
┌─────────────────────────────────────────────────────────────┐
│                      Application Layer                      │
│                 (Event Listeners & Handlers)                │
└────────────────────────────┬────────────────────────────────┘

                    Events (LineStarted,
                    LineTextChanged, etc.)

┌────────────────────────────┴────────────────────────────────┐
│                   Moonshine Voice Library                    │
├──────────────────────────────────────────────────────────────┤
│  ┌──────────────┐  ┌─────────────┐  ┌──────────────────┐   │
│  │  Microphone  │  │    VAD      │  │  Speech to Text  │   │
│  │   Capture    │─▶│  (Silero)   │─▶│  (Moonshine ASR) │   │
│  └──────────────┘  └─────────────┘  └──────────────────┘   │
│                                              │               │
│                                              ▼               │
│  ┌──────────────┐  ┌─────────────────────────────────┐     │
│  │   Speaker    │  │    Intent Recognition            │     │
│  │     ID       │  │    (Semantic Embeddings)         │     │
│  │ (Pyannote)   │  └─────────────────────────────────┘     │
│  └──────────────┘                                           │
└──────────────────────────────────────────────────────────────┘

Core Components

Transcriber

The Transcriber (defined in python/src/moonshine_voice/transcriber.py:68) is the main entry point for speech-to-text functionality:
transcriber = Transcriber(
    model_path=model_path,
    model_arch=ModelArch.BASE,
    update_interval=0.5
)
Key responsibilities:
  • Loading and managing Moonshine ASR models
  • Coordinating the processing pipeline
  • Managing audio streams
  • Emitting events to listeners

Stream

Streams (defined in python/src/moonshine_voice/transcriber.py:321) handle real-time audio input:
stream = transcriber.create_stream()
stream.start()
stream.add_audio(audio_data, sample_rate)
transcript = stream.update_transcription()
stream.stop()
Key features:
  • Multiple streams per transcriber for multiple audio sources
  • Independent transcripts per stream
  • Automatic periodic updates based on update_interval
  • Event emission when transcript changes
If you only have one audio input source, you can use the transcriber’s default stream via transcriber.start(), transcriber.add_audio(), etc. without creating an explicit stream.

Event Listeners

The event system (python/src/moonshine_voice/transcriber.py:290) provides reactive updates:
class MyListener(TranscriptEventListener):
    def on_line_started(self, event):
        print(f"Started: {event.line.text}")
    
    def on_line_text_changed(self, event):
        print(f"Updated: {event.line.text}")
    
    def on_line_completed(self, event):
        print(f"Done: {event.line.text}")

transcriber.add_listener(MyListener())
Event types:
  • LineStarted - New speech segment detected
  • LineUpdated - Any line property changed
  • LineTextChanged - Transcription text updated
  • LineCompleted - Speech segment finished
  • Error - Processing error occurred
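
The relationship between event types and listener callbacks can be sketched as a dispatch table. This is an illustrative pure-Python sketch, not the library's implementation: the `Event`, `Line`, and `Listener` shapes mirror the API shown above, but the `DISPATCH` table and `emit` function are hypothetical (the remaining event types follow the same naming pattern):

```python
# Illustrative sketch: routing event types to listener callbacks.
from dataclasses import dataclass

@dataclass
class Line:
    text: str

@dataclass
class Event:
    type: str   # e.g. "LineStarted", "LineTextChanged", "LineCompleted"
    line: Line

class Listener:
    # Subclasses override only the callbacks they care about.
    def on_line_started(self, event): ...
    def on_line_text_changed(self, event): ...
    def on_line_completed(self, event): ...

# Map each event type name to the listener method that receives it.
DISPATCH = {
    "LineStarted": "on_line_started",
    "LineTextChanged": "on_line_text_changed",
    "LineCompleted": "on_line_completed",
}

def emit(listeners, event):
    """Deliver one event to every registered listener."""
    handler_name = DISPATCH.get(event.type)
    if handler_name is None:
        return
    for listener in listeners:
        getattr(listener, handler_name)(event)
```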

Processing Pipeline

When audio is added via add_audio(), it flows through this pipeline:

1. Audio Buffering

Raw PCM audio is converted to 16kHz mono format internally, regardless of input sample rate.
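
The conversion described above can be approximated with a simple downmix-and-resample step. This is an illustrative sketch using linear interpolation; the library performs the conversion internally and likely uses a higher-quality resampler:

```python
TARGET_RATE = 16000  # the pipeline's internal sample rate

def to_16k_mono(channels, sample_rate):
    """Downmix per-channel sample lists to mono and linearly resample
    to 16 kHz. Illustrative only: approximates the internal conversion
    described above; `channels` is a list of equal-length channel lists."""
    n = len(channels[0])
    mono = [sum(ch[i] for ch in channels) / len(channels) for i in range(n)]
    if sample_rate == TARGET_RATE:
        return mono
    n_out = round(n * TARGET_RATE / sample_rate)
    out = []
    for j in range(n_out):
        pos = j * sample_rate / TARGET_RATE  # fractional source index
        i = int(pos)
        frac = pos - i
        nxt = mono[min(i + 1, n - 1)]
        out.append(mono[i] * (1 - frac) + frac * nxt)
    return out
```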

2. Voice Activity Detection (VAD)

The Silero VAD model (core/silero-vad.h) segments continuous audio into speech phrases:
  • Runs every 30ms on audio chunks
  • Averages results over a window (default 0.5s) for stability
  • Uses a threshold (default 0.5) to distinguish speech from silence
  • Adds padding and look-behind to avoid clipping speech
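
The smoothing step above (per-frame probabilities averaged over a window, then thresholded) can be sketched in a few lines. This is an illustrative sketch of the averaging logic only; the padding and look-behind behavior is omitted, and the real implementation lives in the C++ core:

```python
from collections import deque

FRAME_MS = 30    # VAD runs on 30 ms chunks
WINDOW_S = 0.5   # averaging window (default 0.5 s)
THRESHOLD = 0.5  # speech/silence threshold (default 0.5)

def smooth_vad(frame_probs):
    """Average per-frame speech probabilities over a sliding window and
    threshold the result, so a single noisy frame cannot flip the
    speech/silence decision. Illustrative sketch of the smoothing above."""
    window_frames = max(1, int(WINDOW_S * 1000 / FRAME_MS))
    window = deque(maxlen=window_frames)
    decisions = []
    for p in frame_probs:
        window.append(p)
        decisions.append(sum(window) / len(window) >= THRESHOLD)
    return decisions
```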

3. Speech-to-Text

Moonshine ASR models transcribe segmented audio:
  • Non-streaming models: Process complete segments
  • Streaming models: Cache encoder output and decoder state for incremental updates
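
The streaming strategy can be illustrated with a toy incremental processor: cache the work already done on earlier audio and spend compute only on the newly arrived suffix. The class and its internals are hypothetical stand-ins for the encoder/decoder caching described above:

```python
class StreamingDecoderSketch:
    """Illustrative sketch of incremental updates: work on already-seen
    audio is cached, so each update only pays for new samples. The
    doubling "encoder" is a stand-in for the real model."""

    def __init__(self):
        self.encoded = []  # cached per-sample "encoder output"
        self.n_seen = 0    # how many samples we have already processed

    def _encode(self, samples):
        # Stand-in for the expensive encoder pass.
        return [s * 2 for s in samples]

    def update(self, all_samples):
        """Process only the suffix that arrived since the last call."""
        new = all_samples[self.n_seen:]
        self.encoded.extend(self._encode(new))  # incremental, not from scratch
        self.n_seen = len(all_samples)
        return self.encoded
```

A non-streaming model would re-run `_encode` on the full buffer every time; the cache is what turns repeated updates from quadratic into linear total work.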

4. Speaker Identification (Optional)

Pyannote embedding model identifies speakers for diarization:
  • Requires sufficient audio data per segment
  • Assigns unique speaker_id (64-bit integer)
  • Provides speaker_index for “Speaker 1”, “Speaker 2” labeling
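
The relationship between `speaker_id` and `speaker_index` can be sketched as a first-appearance mapping. This is an illustrative sketch of the labeling scheme described above, not the library's code:

```python
def label_speakers(speaker_ids):
    """Assign a stable 1-based speaker_index to each unique speaker_id
    in order of first appearance, producing "Speaker 1", "Speaker 2",
    ... labels. Illustrative sketch of the scheme described above."""
    index_by_id = {}
    labels = []
    for sid in speaker_ids:
        if sid not in index_by_id:
            index_by_id[sid] = len(index_by_id) + 1
        labels.append(f"Speaker {index_by_id[sid]}")
    return labels
```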

5. Event Emission

The stream analyzes transcript changes and emits appropriate events to all registered listeners.
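
The change analysis can be sketched as a diff between the previous and current transcript snapshots. This is an illustrative sketch only; real transcript lines carry more state than a text string, and the event names come from the list above:

```python
def diff_events(old_lines, new_lines):
    """Compare two transcript snapshots and decide which events to emit,
    as (event_type, line_index) pairs. Illustrative sketch of the change
    analysis described above."""
    events = []
    for i, text in enumerate(new_lines):
        if i >= len(old_lines):
            events.append(("LineStarted", i))      # line is new
        elif old_lines[i] != text:
            events.append(("LineTextChanged", i))  # line text was revised
    return events
```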

Cross-Platform Architecture

Moonshine Voice runs consistently across platforms:
┌─────────────────────────────────────────────────────────┐
│   Platform-Specific Bindings                            │
│   (Python, Swift, Java, etc.)                          │
└────────────────────────┬────────────────────────────────┘

┌────────────────────────┴────────────────────────────────┐
│   C API (moonshine-c-api.h)                            │
│   - Provides platform-agnostic interface                │
│   - Thread-safe handle-based design                     │
└────────────────────────┬────────────────────────────────┘

┌────────────────────────┴────────────────────────────────┐
│   C++ Core Library                                      │
│   - Portable implementation                             │
│   - ONNX Runtime for inference                          │
│   - Cross-platform audio/file handling                  │
└─────────────────────────────────────────────────────────┘
The C++ core (core/) is the single source of truth, with language-specific bindings providing idiomatic interfaces.

Thread Safety

From core/moonshine-c-api.h:64-66:
All API calls are thread-safe, so you can call them from multiple threads concurrently. However, calculations on a single transcriber are serialized, so latency will be affected for calls from other threads while the transcriber is busy.
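
The contract quoted above - safe from any thread, but serialized per transcriber - is the behavior you get from guarding each transcriber's work with a single lock. A minimal sketch of that model (the class and names are hypothetical, not the library's API):

```python
import threading

class SerializedTranscriber:
    """Illustrative sketch of the threading contract quoted above: calls
    may come from any thread, but work on one transcriber is serialized
    behind a single lock, so concurrent callers wait their turn."""

    def __init__(self):
        self._lock = threading.Lock()
        self.processed = 0

    def add_audio(self, samples):
        with self._lock:  # one caller at a time per transcriber
            self.processed += len(samples)

# Eight threads hammer the same transcriber; every update still lands.
t = SerializedTranscriber()
threads = [threading.Thread(target=t.add_audio, args=([0.0] * 100,))
           for _ in range(8)]
for th in threads:
    th.start()
for th in threads:
    th.join()
```

This is why latency rises under contention: a call from another thread blocks until the transcriber finishes its current work.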

Session Management

Sessions define the lifecycle of transcription:
  • start() - Begins new session, resets transcript
  • add_audio() - Feeds audio into active session
  • stop() - Ends session, completes any active lines
Each session has one transcript document. Calling start() resets the transcript, so save any data you need beforehand.
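
The lifecycle above can be sketched with a toy session object, highlighting why the transcript must be saved before restarting. The class is an illustrative stand-in, not the library's API:

```python
class SessionSketch:
    """Illustrative sketch of the session lifecycle described above:
    start() resets the transcript, so save the previous session's
    transcript (e.g. from stop()) before starting a new one."""

    def __init__(self):
        self.transcript = []
        self.active = False

    def start(self):
        self.transcript = []  # start() discards the previous transcript
        self.active = True

    def add_text(self, text):
        assert self.active, "add audio only inside an active session"
        self.transcript.append(text)

    def stop(self):
        self.active = False
        return list(self.transcript)  # snapshot to keep before restarting
```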

Resource Management

The architecture uses handle-based resource management:
// From core/moonshine-c-api.h
int32_t transcriber_handle = moonshine_load_transcriber_from_files(...);
int32_t stream_handle = moonshine_create_stream(transcriber_handle, 0);

// ... use resources ...

moonshine_free_stream(transcriber_handle, stream_handle);
moonshine_free_transcriber(transcriber_handle);
Handle values may be reused for new objects after they are freed, so always clear stale handle references in your code after calling the free functions.
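
Handle recycling can be illustrated with a toy handle table: freed slots are handed out again for new objects, so an old integer handle may suddenly refer to something else. This is an illustrative sketch of the pattern, not the library's internals:

```python
class HandleTable:
    """Illustrative sketch of handle-based resource management: integer
    handles index into a slot table, and freed slots are recycled for
    new objects - the reason stale handles must be cleared."""

    def __init__(self):
        self._slots = []
        self._free = []  # indices of freed slots, available for reuse

    def alloc(self, obj):
        if self._free:
            h = self._free.pop()   # recycle a freed handle value
            self._slots[h] = obj
        else:
            self._slots.append(obj)
            h = len(self._slots) - 1
        return h

    def free(self, handle):
        self._slots[handle] = None
        self._free.append(handle)

table = HandleTable()
h1 = table.alloc("transcriber A")
table.free(h1)
h2 = table.alloc("transcriber B")  # same integer, different object
```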

Next Steps

Transcription

Understand the speech-to-text pipeline

Streaming

Learn how streaming reduces latency

Intent Recognition

Build voice command interfaces

Models

Explore model architectures
