
Overview

Cactus supports high-quality speech-to-text transcription with multiple model families:
  • Whisper (tiny, base, small, medium) - OpenAI’s multilingual models
  • Moonshine - Lightweight, fast transcription
  • Parakeet (CTC 0.6B, 1.1B) - NVIDIA’s efficient models with NPU support
All transcription models support both file-based and real-time streaming transcription.

File Transcription

Basic Usage

from cactus import cactus_init, cactus_transcribe, cactus_destroy
import json

model = cactus_init("weights/parakeet-ctc-1.1b", None, False)

result = json.loads(
    cactus_transcribe(
        model,
        "audio.wav",  # audio file path
        None,         # prompt (optional)
        None,         # options
        None,         # callback
        None          # pcm_data
    )
)

print(result["text"])
print(f"Duration: {result['audio_duration_sec']:.2f}s")
print(f"Latency: {result['total_time_ms']:.2f}ms")

cactus_destroy(model)
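Because `cactus_transcribe` returns a JSON string, it is worth checking the `success` flag before reading other fields. A minimal sketch (`extract_text` is a helper name introduced here, not part of the Cactus API):

```python
import json

def extract_text(raw: str) -> str:
    """Parse a cactus_transcribe response; return the text, or raise on failure."""
    result = json.loads(raw)
    if not result.get("success", False):
        raise RuntimeError(f"transcription failed: {result}")
    return result["text"]
```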

C API

#include <cactus.h>

cactus_model_t model = cactus_init("weights/whisper-base", NULL, false);

char response[8192];
int result = cactus_transcribe(
    model,
    "audio.wav",
    NULL,  // prompt
    response,
    sizeof(response),
    NULL,  // options
    NULL,  // callback
    NULL,  // user_data
    NULL,  // pcm_buffer
    0      // pcm_buffer_size
);

if (result == 0) {
    printf("%s\n", response);
}

cactus_destroy(model);

Audio Format Requirements

All models expect 16 kHz mono PCM audio. If your audio is in a different format, resample it to 16 kHz mono before passing it to Cactus.
For raw PCM data:
import numpy as np

# Load audio at 16 kHz
audio = np.array([...], dtype=np.float32)  # mono, 16 kHz

# Convert to 16-bit PCM
pcm_data = (audio * 32767).astype(np.int16).tobytes()

result = json.loads(
    cactus_transcribe(
        model,
        None,      # audio_path
        None,      # prompt
        None,      # options
        None,      # callback
        pcm_data   # PCM data
    )
)
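If your source audio is at a different sample rate or has multiple channels, it can be converted first. A minimal pure-NumPy sketch using linear interpolation (`to_16k_mono` is a hypothetical helper, not part of the Cactus API; a dedicated resampler such as `scipy.signal.resample_poly` gives better quality):

```python
import numpy as np

TARGET_SR = 16000

def to_16k_mono(audio: np.ndarray, sr: int) -> np.ndarray:
    """Downmix to mono and linearly resample to 16 kHz float32."""
    if audio.ndim == 2:                       # (samples, channels) -> mono
        audio = audio.mean(axis=1)
    if sr != TARGET_SR:
        n_out = int(round(len(audio) * TARGET_SR / sr))
        x_old = np.arange(len(audio)) / sr    # original sample times (s)
        x_new = np.arange(n_out) / TARGET_SR  # target sample times (s)
        audio = np.interp(x_new, x_old, audio)
    return audio.astype(np.float32)
```

The result can then go through the float-to-int16 conversion shown above.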

Response Format

{
    "success": true,
    "text": "This is the transcribed text.",
    "language": "en",
    "audio_duration_sec": 5.2,
    "time_to_first_token_ms": 120.5,
    "total_time_ms": 450.3,
    "decode_tps": 95000,
    "tokens": 12
}
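The timing fields make it easy to compute a real-time factor, i.e. processing time divided by audio duration (`real_time_factor` is a hypothetical helper, not part of the Cactus API):

```python
def real_time_factor(result: dict) -> float:
    """Processing time over audio duration; below 1.0 means faster than real time."""
    return (result["total_time_ms"] / 1000.0) / result["audio_duration_sec"]
```

For the sample response above, 450.3 ms of processing for 5.2 s of audio gives an RTF of about 0.087.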

Streaming Transcription

Get real-time transcription results as audio is captured:
from cactus import (
    cactus_init,
    cactus_stream_transcribe_start,
    cactus_stream_transcribe_process,
    cactus_stream_transcribe_stop,
    cactus_destroy
)
import json

model = cactus_init("weights/moonshine-base", None, False)

# Start streaming session
options = json.dumps({"language": "en"})
stream = cactus_stream_transcribe_start(model, options)

# Process audio chunks (e.g., from microphone)
for audio_chunk in audio_stream:
    partial = json.loads(cactus_stream_transcribe_process(stream, audio_chunk))
    print(f"Partial: {partial['text']}", end="\r")

# Get final result
final = json.loads(cactus_stream_transcribe_stop(stream))
print(f"\nFinal: {final['text']}")

cactus_destroy(model)
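The `audio_stream` in the example above can be any iterable of raw PCM chunks. A sketch that yields fixed-duration chunks from a 16 kHz mono WAV file using the standard-library `wave` module (`pcm_chunks` is a name introduced here):

```python
import wave

def pcm_chunks(path: str, chunk_ms: int = 200):
    """Yield raw 16-bit PCM chunks of chunk_ms milliseconds from a mono WAV file."""
    with wave.open(path, "rb") as wf:
        frames_per_chunk = wf.getframerate() * chunk_ms // 1000
        while True:
            data = wf.readframes(frames_per_chunk)
            if not data:
                break
            yield data
```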

Prompting

Guide transcription with a text prompt:
# Improve accuracy for specific terminology
prompt = "Transcribe this audio about machine learning and neural networks."

result = json.loads(
    cactus_transcribe(model, "audio.wav", prompt, None, None, None)
)

Options

{
    "language": "en",
    "task": "transcribe",
    "temperature": 0.0,
    "best_of": 5
}
  • language (string): Language code (e.g., “en”, “es”, “fr”). Auto-detected if omitted.
  • task (string, default: "transcribe"): Either “transcribe” or “translate” (translate to English).
  • temperature (number, default: 0.0): Sampling temperature for generation.
  • best_of (integer, default: 5): Number of candidates to generate when temperature > 0.
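The options above are passed to `cactus_transcribe` as a JSON string. A small hedged helper that builds and validates the payload (`make_options` is a hypothetical name, not part of the Cactus API):

```python
import json

VALID_TASKS = {"transcribe", "translate"}

def make_options(language=None, task="transcribe", temperature=0.0, best_of=5):
    """Serialize transcription options to the JSON string cactus_transcribe expects."""
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {sorted(VALID_TASKS)}")
    opts = {"task": task, "temperature": temperature, "best_of": best_of}
    if language is not None:  # omit so the model auto-detects
        opts["language"] = language
    return json.dumps(opts)
```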

Streaming with Callbacks

def on_token(text, token_id):
    print(text, end="", flush=True)

result = json.loads(
    cactus_transcribe(
        model,
        "audio.wav",
        None,
        None,
        on_token,
        None
    )
)

Model Comparison

| Model             | Size  | Speed | Quality | NPU |
|-------------------|-------|-------|---------|-----|
| moonshine-base    | 80MB  | ★★★★★ | ★★★     |     |
| whisper-tiny      | 75MB  | ★★★★  | ★★★     |     |
| whisper-base      | 145MB | ★★★   | ★★★★    |     |
| whisper-small     | 488MB | ★★    | ★★★★★   |     |
| parakeet-ctc-0.6b | 600MB | ★★★★  | ★★★★    | ✓   |
| parakeet-ctc-1.1b | 1.1GB | ★★★   | ★★★★★   | ✓   |

Language Detection

Detect audio language before transcription:
result = json.loads(cactus_detect_language(model, "audio.wav", None, None))
print(f"Detected language: {result['language']}")
print(f"Confidence: {result['confidence']:.2f}")
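The detection result can drive the `task` option, e.g. translating confidently detected non-English audio. A hypothetical policy sketch (`pick_task` and the 0.5 threshold are assumptions, not Cactus defaults):

```python
def pick_task(detection: dict, min_confidence: float = 0.5) -> str:
    """Return 'translate' for confidently detected non-English audio, else 'transcribe'."""
    if detection.get("confidence", 0.0) < min_confidence:
        return "transcribe"  # weak detection: don't switch tasks
    return "transcribe" if detection["language"] == "en" else "translate"
```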

CLI Usage

# Live microphone transcription
cactus transcribe

# Transcribe audio file
cactus transcribe --file audio.wav

# Use specific model
cactus transcribe --file audio.wav parakeet-ctc-1.1b

Performance Benchmarks

Real-world latency on mobile devices (30s audio):
| Device           | Moonshine | Whisper-Base | Parakeet 1.1B |
|------------------|-----------|--------------|---------------|
| iPhone 17 Pro    | 0.2s      | 0.3s         | 0.3s          |
| Mac M4 Pro       | 0.1s      | 0.2s         | 0.1s          |
| Galaxy S25 Ultra | N/A       | N/A          | N/A           |
Android NPU support for Whisper and Parakeet is coming in March 2026.

Next Steps

Audio Embeddings

Generate embeddings from audio for similarity search

Voice Activity Detection

Detect speech segments in audio

Supported Models

Browse all transcription models

API Reference

Complete transcription API docs
