
Overview

Cactus supports high-quality speech-to-text transcription with multiple model families:
  • Whisper (tiny, base, small, medium) - OpenAI’s multilingual models
  • Moonshine - Lightweight, fast transcription
  • Parakeet (CTC 0.6B, 1.1B) - NVIDIA’s efficient models with NPU support
All transcription models support both file-based and real-time streaming transcription.

File Transcription

Basic Usage

from cactus import cactus_init, cactus_transcribe, cactus_destroy
import json

model = cactus_init("weights/parakeet-ctc-1.1b", None, False)

result = json.loads(
    cactus_transcribe(
        model,
        "audio.wav",  # audio file path
        None,         # prompt (optional)
        None,         # options
        None,         # callback
        None          # pcm_data
    )
)

print(result["text"])
print(f"Duration: {result['audio_duration_sec']:.2f}s")
print(f"Latency: {result['total_time_ms']:.2f}ms")

cactus_destroy(model)
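Because `cactus_transcribe` returns a JSON string, it is worth checking the `success` flag before reading other fields. A minimal sketch (`extract_text` is a helper name introduced here, not part of the Cactus API):

```python
import json

def extract_text(raw: str) -> str:
    """Parse a cactus_transcribe response; return the text, or raise on failure."""
    result = json.loads(raw)
    if not result.get("success", False):
        raise RuntimeError(f"transcription failed: {result}")
    return result["text"]
```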

C API

#include <cactus.h>

cactus_model_t model = cactus_init("weights/whisper-base", NULL, false);

char response[8192];
int result = cactus_transcribe(
    model,
    "audio.wav",
    NULL,  // prompt
    response,
    sizeof(response),
    NULL,  // options
    NULL,  // callback
    NULL,  // user_data
    NULL,  // pcm_buffer
    0      // pcm_buffer_size
);

if (result == 0) {
    printf("%s\n", response);
}

cactus_destroy(model);

Audio Format Requirements

All models expect 16 kHz mono PCM audio. If your audio is in a different format, resample it to 16 kHz mono before passing it to Cactus.
For raw PCM data:
import numpy as np

# Load audio at 16 kHz
audio = np.array([...], dtype=np.float32)  # mono, 16 kHz

# Convert to 16-bit PCM
pcm_data = (audio * 32767).astype(np.int16).tobytes()

result = json.loads(
    cactus_transcribe(
        model,
        None,      # audio_path
        None,      # prompt
        None,      # options
        None,      # callback
        pcm_data   # PCM data
    )
)
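If your source audio is at a different sample rate or has multiple channels, it can be converted first. A minimal pure-NumPy sketch using linear interpolation (`to_16k_mono` is a hypothetical helper, not part of the Cactus API; a dedicated resampler such as `scipy.signal.resample_poly` gives better quality):

```python
import numpy as np

TARGET_SR = 16000

def to_16k_mono(audio: np.ndarray, sr: int) -> np.ndarray:
    """Downmix to mono and linearly resample to 16 kHz float32."""
    if audio.ndim == 2:                       # (samples, channels) -> mono
        audio = audio.mean(axis=1)
    if sr != TARGET_SR:
        n_out = int(round(len(audio) * TARGET_SR / sr))
        x_old = np.arange(len(audio)) / sr    # original sample times (s)
        x_new = np.arange(n_out) / TARGET_SR  # target sample times (s)
        audio = np.interp(x_new, x_old, audio)
    return audio.astype(np.float32)
```

The result can then go through the float-to-int16 conversion shown above.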

Response Format

{
    "success": true,
    "text": "This is the transcribed text.",
    "language": "en",
    "audio_duration_sec": 5.2,
    "time_to_first_token_ms": 120.5,
    "total_time_ms": 450.3,
    "decode_tps": 95000,
    "tokens": 12
}
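The timing fields make it easy to compute a real-time factor, i.e. processing time divided by audio duration (`real_time_factor` is a hypothetical helper, not part of the Cactus API):

```python
def real_time_factor(result: dict) -> float:
    """Processing time over audio duration; below 1.0 means faster than real time."""
    return (result["total_time_ms"] / 1000.0) / result["audio_duration_sec"]
```

For the sample response above, 450.3 ms of processing for 5.2 s of audio gives an RTF of about 0.087.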

Streaming Transcription

Get real-time transcription results as audio is captured:
from cactus import (
    cactus_init,
    cactus_stream_transcribe_start,
    cactus_stream_transcribe_process,
    cactus_stream_transcribe_stop,
    cactus_destroy
)
import json

model = cactus_init("weights/moonshine-base", None, False)

# Start streaming session
options = json.dumps({"language": "en"})
stream = cactus_stream_transcribe_start(model, options)

# Process audio chunks (e.g., from microphone)
for audio_chunk in audio_stream:
    partial = json.loads(cactus_stream_transcribe_process(stream, audio_chunk))
    print(f"Partial: {partial['text']}", end="\r")

# Get final result
final = json.loads(cactus_stream_transcribe_stop(stream))
print(f"\nFinal: {final['text']}")

cactus_destroy(model)
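The `audio_stream` in the example above can be any iterable of raw PCM chunks. A sketch that yields fixed-duration chunks from a 16 kHz mono WAV file using the standard-library `wave` module (`pcm_chunks` is a name introduced here):

```python
import wave

def pcm_chunks(path: str, chunk_ms: int = 200):
    """Yield raw 16-bit PCM chunks of chunk_ms milliseconds from a mono WAV file."""
    with wave.open(path, "rb") as wf:
        frames_per_chunk = wf.getframerate() * chunk_ms // 1000
        while True:
            data = wf.readframes(frames_per_chunk)
            if not data:
                break
            yield data
```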

Prompting

Guide transcription with a text prompt:
# Improve accuracy for specific terminology
prompt = "Transcribe this audio about machine learning and neural networks."

result = json.loads(
    cactus_transcribe(model, "audio.wav", prompt, None, None, None)
)

Options

{
    "language": "en",
    "task": "transcribe",
    "temperature": 0.0,
    "best_of": 5
}
  • language (string): Language code (e.g., “en”, “es”, “fr”). Auto-detected if omitted.
  • task (string, default: "transcribe"): Either “transcribe” or “translate” (translate to English).
  • temperature (number, default: 0.0): Sampling temperature for generation.
  • best_of (integer, default: 5): Number of candidates to generate when temperature > 0.
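The options above are passed to `cactus_transcribe` as a JSON string. A small hedged helper that builds and validates the payload (`make_options` is a hypothetical name, not part of the Cactus API):

```python
import json

VALID_TASKS = {"transcribe", "translate"}

def make_options(language=None, task="transcribe", temperature=0.0, best_of=5):
    """Serialize transcription options to the JSON string cactus_transcribe expects."""
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {sorted(VALID_TASKS)}")
    opts = {"task": task, "temperature": temperature, "best_of": best_of}
    if language is not None:  # omit so the model auto-detects
        opts["language"] = language
    return json.dumps(opts)
```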

Streaming with Callbacks

def on_token(text, token_id):
    print(text, end="", flush=True)

result = json.loads(
    cactus_transcribe(
        model,
        "audio.wav",
        None,
        None,
        on_token,
        None
    )
)

Model Comparison

| Model             | Size  | Speed | Quality | NPU |
|-------------------|-------|-------|---------|-----|
| moonshine-base    | 80MB  | ★★★★★ | ★★★     |     |
| whisper-tiny      | 75MB  | ★★★★  | ★★★     |     |
| whisper-base      | 145MB | ★★★   | ★★★★    |     |
| whisper-small     | 488MB | ★★    | ★★★★★   |     |
| parakeet-ctc-0.6b | 600MB | ★★★★  | ★★★★    | ✓   |
| parakeet-ctc-1.1b | 1.1GB | ★★★   | ★★★★★   | ✓   |

Language Detection

Detect audio language before transcription:
result = json.loads(cactus_detect_language(model, "audio.wav", None, None))
print(f"Detected language: {result['language']}")
print(f"Confidence: {result['confidence']:.2f}")
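The detection result can drive the `task` option, e.g. translating confidently detected non-English audio. A hypothetical policy sketch (`pick_task` and the 0.5 threshold are assumptions, not Cactus defaults):

```python
def pick_task(detection: dict, min_confidence: float = 0.5) -> str:
    """Return 'translate' for confidently detected non-English audio, else 'transcribe'."""
    if detection.get("confidence", 0.0) < min_confidence:
        return "transcribe"  # weak detection: don't switch tasks
    return "transcribe" if detection["language"] == "en" else "translate"
```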

CLI Usage

# Live microphone transcription
cactus transcribe

# Transcribe audio file
cactus transcribe --file audio.wav

# Use specific model
cactus transcribe --file audio.wav parakeet-ctc-1.1b

Performance Benchmarks

Real-world latency on mobile devices (30s audio):
| Device           | Moonshine | Whisper-Base | Parakeet 1.1B |
|------------------|-----------|--------------|---------------|
| iPhone 17 Pro    | 0.2s      | 0.3s         | 0.3s          |
| Mac M4 Pro       | 0.1s      | 0.2s         | 0.1s          |
| Galaxy S25 Ultra | N/A       | N/A          | N/A           |
Android NPU support for Whisper and Parakeet is coming in March 2026.

Next Steps

Audio Embeddings

Generate embeddings from audio for similarity search

Voice Activity Detection

Detect speech segments in audio

Supported Models

Browse all transcription models

API Reference

Complete transcription API docs
