
Overview

The transcription API supports:
  • Batch transcription from audio files or PCM buffers
  • Streaming transcription for real-time audio
  • Language detection
  • Voice activity detection (VAD)
  • Multiple ASR models (Whisper, Moonshine, Parakeet)

cactus_transcribe

Transcribe audio to text.
int cactus_transcribe(
    cactus_model_t model,
    const char* audio_file_path,
    const char* prompt,
    char* response_buffer,
    size_t buffer_size,
    const char* options_json,
    cactus_token_callback callback,
    void* user_data,
    const uint8_t* pcm_buffer,
    size_t pcm_buffer_size
);
Parameters:
  • model (cactus_model_t, required): ASR model handle from cactus_init
  • audio_file_path (const char*): Path to a WAV file. NULL if using pcm_buffer
  • prompt (const char*, required): Initial decoder prompt (e.g., <|startoftranscript|><|en|><|transcribe|><|notimestamps|>)
  • response_buffer (char*, required): Buffer that receives the JSON response
  • buffer_size (size_t, required): Size of response_buffer in bytes
  • options_json (const char*): Optional JSON object with transcription options
  • callback (cactus_token_callback): Optional streaming callback for partial results
  • user_data (void*): Optional pointer passed to callback
  • pcm_buffer (const uint8_t*): Raw PCM audio (16-bit mono, 16 kHz). NULL if using audio_file_path
  • pcm_buffer_size (size_t): Size of the PCM buffer in bytes (must be even)

Returns (int): the number of bytes written to response_buffer on success, -1 on error.

Options JSON

{
  "temperature": 0.0,
  "max_tokens": 448,
  "use_vad": true,
  "cloud_handoff_threshold": 0.65
}
  • temperature (float, default 0.0): Sampling temperature (0.0 = greedy decoding, recommended)
  • max_tokens (int, default 448): Maximum tokens per audio chunk
  • use_vad (bool, default false): Split audio using voice activity detection
  • cloud_handoff_threshold (float, default 0.0): Entropy threshold for cloud handoff (0.0 = disabled)

Response Format

{
  "success": true,
  "error": null,
  "text": "Hello, how are you today?",
  "segments": [],
  "time_to_first_token_ms": 38.5,
  "total_time_ms": 156.2,
  "prefill_tokens_per_second": 1200.0,
  "decode_tokens_per_second": 85.3,
  "prompt_tokens": 6,
  "completion_tokens": 12,
  "confidence": 0.96,
  "cloud_handoff": false,
  "ram_usage_mb": 245.1
}
  • text (string): Transcribed text
  • confidence (float): Average confidence score (1.0 - mean entropy)
  • cloud_handoff (bool): Whether the transcription should be retried with cloud ASR

Streaming Transcription

cactus_stream_transcribe_start

Start a streaming session.
cactus_stream_transcribe_t cactus_stream_transcribe_start(
    cactus_model_t model,
    const char* options_json
);

cactus_stream_transcribe_process

Process an audio chunk.
int cactus_stream_transcribe_process(
    cactus_stream_transcribe_t stream,
    const uint8_t* pcm_buffer,
    size_t pcm_buffer_size,
    char* response_buffer,
    size_t buffer_size
);
  • stream (cactus_stream_transcribe_t, required): Stream handle from cactus_stream_transcribe_start
  • pcm_buffer (const uint8_t*, required): PCM audio chunk (16-bit mono, 16 kHz)
  • pcm_buffer_size (size_t, required): Chunk size in bytes

cactus_stream_transcribe_stop

Finalize the streaming session.
int cactus_stream_transcribe_stop(
    cactus_stream_transcribe_t stream,
    char* response_buffer,
    size_t buffer_size
);

Language Detection

cactus_detect_language

Detect spoken language (Whisper only).
int cactus_detect_language(
    cactus_model_t model,
    const char* audio_file_path,
    char* response_buffer,
    size_t buffer_size,
    const char* options_json,
    const uint8_t* pcm_buffer,
    size_t pcm_buffer_size
);

Response Format

{
  "success": true,
  "error": null,
  "language": "en",
  "language_token": "<|en|>",
  "token_id": 50259,
  "confidence": 0.98,
  "entropy": 0.02,
  "total_time_ms": 42.1,
  "ram_usage_mb": 210.5
}
  • language (string): ISO 639-1 language code (e.g., en, es, zh)
  • language_token (string): Whisper language token (e.g., <|en|>)
  • confidence (float): Detection confidence (1.0 - entropy)

Example: Batch Transcription

#include "cactus_ffi.h"
#include <stdbool.h>
#include <stdio.h>

int main() {
    cactus_model_t model = cactus_init("/path/to/whisper", NULL, false);
    
    const char* prompt = "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>";
    const char* options = "{\"use_vad\":true}";
    
    char response[16384];
    int result = cactus_transcribe(
        model,
        "/path/to/audio.wav",
        prompt,
        response,
        sizeof(response),
        options,
        NULL, NULL,
        NULL, 0
    );
    
    if (result > 0) {
        printf("%s\n", response);
    } else {
        printf("Error: %s\n", cactus_get_last_error());
    }
    
    cactus_destroy(model);
}

Example: Streaming

#include "cactus_ffi.h"
#include <stdio.h>

void process_audio_stream(cactus_model_t model) {
    cactus_stream_transcribe_t stream = cactus_stream_transcribe_start(
        model,
        "{}"
    );
    
    // Process audio chunks (has_audio_data and read_audio_chunk are
    // application-provided)
    char response[4096];
    while (has_audio_data()) {
        uint8_t chunk[4096];
        size_t chunk_size = read_audio_chunk(chunk, sizeof(chunk));
        
        int result = cactus_stream_transcribe_process(
            stream,
            chunk,
            chunk_size,
            response,
            sizeof(response)
        );
        
        if (result > 0) {
            printf("Partial: %s\n", response);
        }
    }
    
    // Finalize
    cactus_stream_transcribe_stop(stream, response, sizeof(response));
    printf("Final: %s\n", response);
}

Example: Language Detection

cactus_model_t model = cactus_init("/path/to/whisper", NULL, false);

char response[2048];
int result = cactus_detect_language(
    model,
    "/path/to/audio.wav",
    response,
    sizeof(response),
    NULL,
    NULL, 0
);

if (result > 0) {
    // Parse JSON to extract language field
    printf("%s\n", response);
}

cactus_destroy(model);

Audio Format Requirements

All audio must be:
  • Sample rate: 16 kHz
  • Channels: Mono (1 channel)
  • Format: 16-bit signed PCM
WAV files are resampled automatically; raw PCM buffers must already be 16 kHz mono.

See Also

  • VAD API: Voice activity detection
  • Python SDK: Python transcription API
  • Transcription Guide: Speech recognition guide
