Overview
Cactus supports high-quality speech-to-text transcription with multiple model families:
- **Whisper** (tiny, base, small, medium) - OpenAI’s multilingual models
- **Moonshine** - Lightweight, fast transcription
- **Parakeet** (CTC 0.6B, 1.1B) - NVIDIA’s efficient models with NPU support
All transcription models support both file-based and real-time streaming transcription.
File Transcription
Basic Usage
```python
from cactus import cactus_init, cactus_transcribe, cactus_destroy
import json

model = cactus_init("weights/parakeet-ctc-1.1b", None, False)

result = json.loads(
    cactus_transcribe(
        model,
        "audio.wav",  # audio file path
        None,         # prompt (optional)
        None,         # options
        None,         # callback
        None,         # pcm_data
    )
)

print(result["text"])
print(f"Duration: {result['audio_duration_sec']:.2f}s")
print(f"Latency: {result['total_time_ms']:.2f}ms")

cactus_destroy(model)
```
C API
```c
#include <cactus.h>
#include <stdio.h>

cactus_model_t model = cactus_init("weights/whisper-base", NULL, false);

char response[8192];
int result = cactus_transcribe(
    model,
    "audio.wav",
    NULL,              // prompt
    response,
    sizeof(response),
    NULL,              // options
    NULL,              // callback
    NULL,              // user_data
    NULL,              // pcm_buffer
    0                  // pcm_buffer_size
);

if (result == 0) {
    printf("%s\n", response);
}

cactus_destroy(model);
```
All models expect 16 kHz mono PCM audio. If your audio is in a different format, resample it before passing to Cactus.
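If your source audio is at another rate (44.1 kHz is common), it must be downmixed and resampled first. Below is a minimal sketch using plain NumPy linear interpolation; `to_16k_mono` is a hypothetical helper, not part of Cactus, and for production quality a proper polyphase resampler (e.g. `scipy.signal.resample_poly`) is preferable:

```python
import numpy as np

def to_16k_mono(audio: np.ndarray, sample_rate: int) -> np.ndarray:
    """Downmix to mono and linearly resample float32 audio to 16 kHz."""
    if audio.ndim == 2:  # (samples, channels) -> average channels to mono
        audio = audio.mean(axis=1)
    if sample_rate != 16000:
        n_out = int(len(audio) * 16000 / sample_rate)
        x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
        x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        audio = np.interp(x_new, x_old, audio)
    return audio.astype(np.float32)

# One second at 44.1 kHz becomes 16000 samples
one_sec = np.zeros(44100, dtype=np.float32)
print(to_16k_mono(one_sec, 44100).shape)  # (16000,)
```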
For raw PCM data:
```python
import numpy as np

# Load audio at 16 kHz
audio = np.array([...], dtype=np.float32)  # mono, 16 kHz

# Convert to 16-bit PCM
pcm_data = (audio * 32767).astype(np.int16).tobytes()

result = json.loads(
    cactus_transcribe(
        model,
        None,      # audio_path
        None,      # prompt
        None,      # options
        None,      # callback
        pcm_data,  # pcm_data
    )
)
```
The result is returned as JSON:

```json
{
  "success": true,
  "text": "This is the transcribed text.",
  "language": "en",
  "audio_duration_sec": 5.2,
  "time_to_first_token_ms": 120.5,
  "total_time_ms": 450.3,
  "decode_tps": 95000,
  "tokens": 12
}
```
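The timing fields make it easy to derive a real-time factor (processing time relative to audio length). A small illustrative helper over the result shown above (`transcription_stats` is not part of Cactus; field names are taken from the example):

```python
import json

def transcription_stats(result_json: str) -> dict:
    """Derive real-time factor and tokens/sec from a transcription result."""
    r = json.loads(result_json)
    total_sec = r["total_time_ms"] / 1000.0
    return {
        "rtf": total_sec / r["audio_duration_sec"],  # < 1.0 means faster than real time
        "tokens_per_sec": r["tokens"] / total_sec,
    }

example = '{"audio_duration_sec": 5.2, "total_time_ms": 450.3, "tokens": 12}'
stats = transcription_stats(example)
print(f"RTF: {stats['rtf']:.3f}")  # RTF ≈ 0.087, about 11x faster than real time
```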
Streaming Transcription
Get real-time transcription results as audio is captured:
```python
from cactus import (
    cactus_init,
    cactus_stream_transcribe_start,
    cactus_stream_transcribe_process,
    cactus_stream_transcribe_stop,
    cactus_destroy,
)
import json

model = cactus_init("weights/moonshine-base", None, False)

# Start streaming session
options = json.dumps({"language": "en"})
stream = cactus_stream_transcribe_start(model, options)

# Process audio chunks (e.g., from microphone)
for audio_chunk in audio_stream:
    partial = json.loads(cactus_stream_transcribe_process(stream, audio_chunk))
    print(f"Partial: {partial['text']}", end="\r")

# Get final result
final = json.loads(cactus_stream_transcribe_stop(stream))
print(f"\nFinal: {final['text']}")

cactus_destroy(model)
```
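In a real app the chunks come from a microphone; for testing you can slice a prerecorded buffer into fixed-size PCM chunks and feed them through the same loop. A sketch of that slicing (the 0.5 s chunk size at 16 kHz is an arbitrary choice, not a Cactus requirement):

```python
import numpy as np

def pcm_chunks(audio: np.ndarray, chunk_samples: int = 8000):
    """Yield successive int16 PCM byte chunks from float32 mono audio."""
    pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
    for start in range(0, len(pcm), chunk_samples):
        yield pcm[start:start + chunk_samples].tobytes()

# Three seconds of 16 kHz audio -> six 0.5 s chunks
audio = np.zeros(48000, dtype=np.float32)
chunks = list(pcm_chunks(audio))
print(len(chunks))  # 6
```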
Prompting
Guide transcription with a text prompt:
```python
# Improve accuracy for specific terminology
prompt = "Transcribe this audio about machine learning and neural networks."

result = json.loads(
    cactus_transcribe(model, "audio.wav", prompt, None, None, None)
)
```
Options
Options are passed as a JSON string:

```json
{
  "language": "en",
  "task": "transcribe",
  "temperature": 0.0,
  "best_of": 5
}
```

| Option | Type | Description |
| --- | --- | --- |
| `language` | string | Language code (e.g., "en", "es", "fr"). Auto-detected if omitted |
| `task` | string | Either "transcribe" or "translate" (translate to English). Default: `"transcribe"` |
| `temperature` | float | Sampling temperature for generation |
| `best_of` | int | Number of candidates to generate when temperature > 0 |
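Putting the options together, here is a sketch of requesting a translation to English from Spanish audio (the specific values are illustrative):

```python
import json

options = json.dumps({
    "language": "es",     # skip auto-detection
    "task": "translate",  # output English text
    "temperature": 0.2,
    "best_of": 5,         # only used when temperature > 0
})

# result = json.loads(cactus_transcribe(model, "audio.wav", None, options, None, None))
print(options)
```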
Streaming with Callbacks
```python
def on_token(text, token_id):
    print(text, end="", flush=True)

result = json.loads(
    cactus_transcribe(
        model,
        "audio.wav",
        None,      # prompt
        None,      # options
        on_token,  # callback
        None,      # pcm_data
    )
)
```
Model Comparison
| Model | Size | Speed | Quality | NPU |
| --- | --- | --- | --- | --- |
| moonshine-base | 80MB | ★★★★★ | ★★★ | ❌ |
| whisper-tiny | 75MB | ★★★★ | ★★★ | ✅ |
| whisper-base | 145MB | ★★★ | ★★★★ | ✅ |
| whisper-small | 488MB | ★★ | ★★★★★ | ✅ |
| parakeet-ctc-0.6b | 600MB | ★★★★ | ★★★★ | ✅ |
| parakeet-ctc-1.1b | 1.1GB | ★★★ | ★★★★★ | ✅ |
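A simple way to use the comparison table is to pick the largest model that fits a storage budget, since size roughly tracks quality. The sizes below are copied from the table; `largest_model_under` is an illustrative helper, not a Cactus API:

```python
# Model sizes in MB, from the comparison table
MODEL_SIZES_MB = {
    "moonshine-base": 80,
    "whisper-tiny": 75,
    "whisper-base": 145,
    "whisper-small": 488,
    "parakeet-ctc-0.6b": 600,
    "parakeet-ctc-1.1b": 1100,
}

def largest_model_under(budget_mb: int) -> str:
    """Pick the largest model that fits the given storage budget."""
    fitting = {m: s for m, s in MODEL_SIZES_MB.items() if s <= budget_mb}
    return max(fitting, key=fitting.get)

print(largest_model_under(500))  # whisper-small
```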
Language Detection
Detect audio language before transcription:
```python
from cactus import cactus_detect_language

result = json.loads(cactus_detect_language(model, "audio.wav", None, None))
print(f"Detected language: {result['language']}")
print(f"Confidence: {result['confidence']:.2f}")
```
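A common pattern is to pin the detected language in the transcription options only when the detection is confident, and otherwise leave it unset so auto-detection runs again. The gating logic is plain Python; the 0.8 threshold and the `language_option` helper are illustrative choices, not part of Cactus:

```python
import json

def language_option(detection: dict, threshold: float = 0.8) -> str:
    """Build an options JSON that pins the language only when detection is confident."""
    opts = {}
    if detection.get("confidence", 0.0) >= threshold:
        opts["language"] = detection["language"]
    return json.dumps(opts)

print(language_option({"language": "fr", "confidence": 0.93}))  # {"language": "fr"}
print(language_option({"language": "fr", "confidence": 0.41}))  # {}
```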
CLI Usage
```bash
# Live microphone transcription
cactus transcribe

# Transcribe audio file
cactus transcribe --file audio.wav

# Use specific model
cactus transcribe --file audio.wav parakeet-ctc-1.1b
```
Real-world latency on mobile devices (30s audio):
| Device | Moonshine | Whisper-Base | Parakeet 1.1B |
| --- | --- | --- | --- |
| iPhone 17 Pro | 0.2s | 0.3s | 0.3s |
| Mac M4 Pro | 0.1s | 0.2s | 0.1s |
| Galaxy S25 Ultra | N/A | N/A | N/A |
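These latencies translate into large real-time speedups. A quick calculation from two of the table's entries (30 s of audio divided by processing latency):

```python
# Real-time speedup = audio length / processing latency (values from the table)
audio_sec = 30.0
latencies = {"iPhone 17 Pro (Moonshine)": 0.2, "Mac M4 Pro (Parakeet 1.1B)": 0.1}
for device, latency in latencies.items():
    print(f"{device}: {audio_sec / latency:.0f}x real time")
# iPhone 17 Pro (Moonshine): 150x real time
# Mac M4 Pro (Parakeet 1.1B): 300x real time
```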
Android NPU support for Whisper and Parakeet is coming in March 2026.
Next Steps
- **Audio Embeddings** - Generate embeddings from audio for similarity search
- **Voice Activity Detection** - Detect speech segments in audio
- **Supported Models** - Browse all transcription models
- **API Reference** - Complete transcription API docs