RealtimeSTT includes CPU INT8 sherpa-onnx engines for both the Moonshine and Parakeet model families. These engines are useful when you want fully offline CPU inference without loading NeMo or Transformers at runtime. All computation happens through the ONNX Runtime via the sherpa-onnx Python package, and model files are small enough to vendor locally.
Install
Available Engine Names
Two sherpa-onnx engines are available, each with a set of accepted aliases:

| Engine | Class | Aliases |
|---|---|---|
| sherpa_onnx_moonshine | SherpaOnnxMoonshineEngine | sherpa_moonshine, moonshine_sherpa_onnx |
| sherpa_onnx_parakeet | SherpaOnnxParakeetEngine | sherpa_parakeet, parakeet_sherpa_onnx |

Engine names are normalized (hyphens become underscores, - → _) before lookup.
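The alias lookup described above can be pictured with a small sketch. This is not the library's implementation: the `ENGINE_ALIASES` mapping and the `resolve_engine_name` helper are hypothetical names, and only the hyphen-to-underscore step is taken from the text.

```python
# Illustrative only: ENGINE_ALIASES and resolve_engine_name are hypothetical
# names, not part of RealtimeSTT. The alias table mirrors the one above.
ENGINE_ALIASES = {
    "sherpa_onnx_moonshine": ("sherpa_moonshine", "moonshine_sherpa_onnx"),
    "sherpa_onnx_parakeet": ("sherpa_parakeet", "parakeet_sherpa_onnx"),
}


def resolve_engine_name(name):
    """Map an engine name or alias to its canonical engine name."""
    # The only normalization shown here is the documented one: "-" becomes "_".
    normalized = name.replace("-", "_")
    for canonical, aliases in ENGINE_ALIASES.items():
        if normalized == canonical or normalized in aliases:
            return canonical
    raise ValueError(f"unknown sherpa-onnx engine name: {name!r}")


print(resolve_engine_name("sherpa-onnx-parakeet"))
print(resolve_engine_name("moonshine_sherpa_onnx"))
```

Under this sketch, sherpa-onnx-parakeet and sherpa_parakeet both resolve to the same engine.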
Model Download
RealtimeSTT does not download sherpa-onnx model bundles automatically. Download the .tar.bz2 archives from the sherpa-onnx ASR model releases, extract them, and pass the extracted directory path as model.
Known bundle names:
- sherpa-onnx-moonshine-tiny-en-int8.tar.bz2
- sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8.tar.bz2
- Linux / macOS
- Windows PowerShell
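On either platform, the extraction step can also be done from Python with the standard library. The sketch below assumes the archive has already been downloaded; `extract_bundle` is a hypothetical helper, not part of RealtimeSTT, and the paths in the usage note are placeholders.

```python
import tarfile
from pathlib import Path


def extract_bundle(archive_path, dest_dir):
    """Extract a .tar.bz2 bundle and return the top-level paths it created."""
    archive = Path(archive_path)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    with tarfile.open(archive, "r:bz2") as tar:
        # Collect the first path component of every member so the caller
        # learns which directory to pass as the model path.
        top_level = set()
        for member in tar.getmembers():
            parts = Path(member.name).parts
            if parts:
                top_level.add(parts[0])

        # Use the safer "data" extraction filter where this Python supports it.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)

    return sorted(dest / name for name in top_level)
```

For example, `extract_bundle("sherpa-onnx-moonshine-tiny-en-int8.tar.bz2", "models")` returns the extracted directory or directories under models, one of which is then passed as model.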
Expected Model Files
After extraction, the engine looks for these files inside the model directory:
- Moonshine Tiny INT8
- Parakeet TDT INT8
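A quick pre-flight check can catch an unextracted or incomplete bundle before the engine starts. The sketch below is deliberately loose: it only confirms that the path is a directory containing tokens.txt and at least one .onnx file, because the exact ONNX file names per model are not listed here. `check_model_dir` is a hypothetical helper, not part of RealtimeSTT.

```python
from pathlib import Path


def check_model_dir(model_dir):
    """Return a list of problems found; an empty list means the basics look fine."""
    path = Path(model_dir)

    if path.is_file():
        # Typical mistake: pointing at the downloaded archive itself.
        return [f"{path} is a file; pass the extracted directory instead"]
    if not path.is_dir():
        return [f"{path} does not exist"]

    problems = []
    if not (path / "tokens.txt").is_file():
        problems.append(f"missing {path / 'tokens.txt'}")
    if not any(path.glob("*.onnx")):
        problems.append(f"no .onnx files found in {path}")
    return problems
```

An empty result does not guarantee that every file the engine expects is present; it only rules out the most common setup mistakes.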
Basic Usage
- Moonshine
- Parakeet
When download_root is set, known model ids (such as nvidia/parakeet-tdt-0.6b-v3 or sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8) resolve to the expected extracted directory names under that root.
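A minimal sketch of wiring one of these engines into a recorder follows. It rests on assumptions: the engine is selected with a keyword argument named `transcription_engine` (inferred from `transcription_engine_options`, not confirmed here), and the model directory in the usage note is a placeholder. `build_recorder_kwargs` and `transcribe_once` are hypothetical helpers.

```python
def build_recorder_kwargs(engine, model_dir, **engine_options):
    """Assemble keyword arguments for AudioToTextRecorder."""
    return {
        # Assumption: the engine is chosen via "transcription_engine".
        "transcription_engine": engine,
        # The extracted bundle directory is passed as the model.
        "model": str(model_dir),
        "transcription_engine_options": dict(engine_options),
    }


def transcribe_once(recorder_kwargs):
    """Record one utterance from the microphone and return its text."""
    # Imported lazily so the kwargs helper works without RealtimeSTT installed.
    from RealtimeSTT import AudioToTextRecorder

    recorder = AudioToTextRecorder(**recorder_kwargs)
    try:
        return recorder.text()
    finally:
        recorder.shutdown()
```

Usage might look like `transcribe_once(build_recorder_kwargs("sherpa_onnx_moonshine", "models/sherpa-onnx-moonshine-tiny-en-int8", num_threads=2))`. Swapping in sherpa_onnx_parakeet and the Parakeet directory follows the same pattern.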
Engine Options
Pass these keys inside transcription_engine_options:
| Option | Type | Meaning |
|---|---|---|
| model_dir | str | Explicit path to the extracted model directory (overrides model). |
| files | dict | Dictionary overriding individual ONNX file names or paths. |
| num_threads | int | Number of CPU worker threads. |
| provider | str | ONNX Runtime provider. Default: "cpu". |
| decoding_method | str | sherpa-onnx decoding method. Default: "greedy_search". |
| debug | bool | Enables sherpa-onnx debug output. |
| rule_fsts | str | Optional text normalization FST resource path. |
| rule_fars | str | Optional text normalization FAR resource path. |
| hr_dict_dir | str | Optional heteronym replacement dictionary directory. |
| hr_rule_fsts | str | Optional heteronym replacement FST resource path. |
| hr_lexicon | str | Optional heteronym replacement lexicon path. |
| input_sample_rate | int | Input audio sample rate. Default: 16000. Alias: sample_rate. |

| Option | Type | Default | Meaning |
|---|---|---|---|
| model_type | str | "nemo_transducer" | sherpa-onnx model type. |
| max_active_paths | int | 4 | Number of active beam search paths. |
| hotwords_file | str | "" | Path to hotwords file. |
| hotwords_score | float | 1.5 | Hotword boost score. |
| blank_penalty | float | 0.0 | Blank token penalty. |
| feature_dim | int | 80 | Filterbank feature dimension. |
| lm | str | "" | Optional language model path. |
| lm_scale | float | 0.1 | Language model interpolation weight. |
| lodr_fst | str | "" | Low-order density ratio FST path. |
| lodr_scale | float | 0.0 | Low-order density ratio interpolation scale. |
| dither | float | 0.0 | Dithering value applied to filterbank features. |
| modeling_unit | str | "cjkchar" | Modeling unit for the transducer vocabulary. |
| bpe_vocab | str | "" | BPE vocabulary file path. |
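Because the options are plain dictionary keys, a typo can go unnoticed. The sketch below is a local pre-flight check built from the two tables above; `OPTION_TYPES` and `validate_engine_options` are hypothetical names, and whether the engines themselves reject unknown keys is not something this sketch assumes.

```python
# Option names and types copied from the tables above (illustrative helper).
OPTION_TYPES = {
    "model_dir": str, "files": dict, "num_threads": int, "provider": str,
    "decoding_method": str, "debug": bool, "rule_fsts": str, "rule_fars": str,
    "hr_dict_dir": str, "hr_rule_fsts": str, "hr_lexicon": str,
    "input_sample_rate": int, "sample_rate": int,
    "model_type": str, "max_active_paths": int, "hotwords_file": str,
    "hotwords_score": float, "blank_penalty": float, "feature_dim": int,
    "lm": str, "lm_scale": float, "lodr_fst": str, "lodr_scale": float,
    "dither": float, "modeling_unit": str, "bpe_vocab": str,
}


def validate_engine_options(options):
    """Return a list of problems with an options dict; empty means none found."""
    errors = []
    for key, value in options.items():
        expected = OPTION_TYPES.get(key)
        if expected is None:
            errors.append(f"unknown option: {key}")
            continue
        if expected is float:
            # Accept ints where a float is expected, but never bools.
            ok = isinstance(value, (int, float)) and not isinstance(value, bool)
        elif expected is int:
            # bool is a subclass of int in Python, so exclude it explicitly.
            ok = isinstance(value, int) and not isinstance(value, bool)
        else:
            ok = isinstance(value, expected)
        if not ok:
            errors.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors


options = {"num_threads": 4, "provider": "cpu", "decoding_method": "greedy_search"}
print(validate_engine_options(options))
```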
When to Use sherpa-onnx
Good fit
- Fully offline CPU inference without PyTorch, NeMo, or Transformers
- Predictable local model files that you control and version
- INT8 ONNX models with small disk and memory footprint
- Environments where network access to Hugging Face Hub is restricted
Consider other engines
- You want automatic model downloads (faster_whisper or openai_whisper)
- You need multilingual transcription (Moonshine adapter is English-only)
- You have a CUDA-capable GPU (faster_whisper with float16 is faster)
Troubleshooting
Missing file errors
The error message names the exact expected ONNX or tokens.txt path. Check that the archive was fully extracted, not just downloaded. The model path must point to the extracted directory, not the .tar.bz2 file.
High latency
Use a smaller model where possible, reduce the realtime update cadence by raising realtime_processing_pause, and tune num_threads to match your CPU core count.
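As a starting point for tuning, the thread count can be derived from the machine's core count. The sketch below is illustrative: `latency_tuned_kwargs` is a hypothetical helper, and the cap of 4 threads and the 0.3 second pause are placeholder values to adjust by measurement, not recommendations from the library.

```python
import os


def latency_tuned_kwargs(max_threads=4, realtime_processing_pause=0.3):
    """Build recorder keyword arguments with a core-aware thread count."""
    # os.cpu_count() can return None, so fall back to a single thread.
    cores = os.cpu_count() or 1
    num_threads = max(1, min(cores, max_threads))
    return {
        # A longer pause means fewer realtime transcription passes.
        "realtime_processing_pause": realtime_processing_pause,
        "transcription_engine_options": {"num_threads": num_threads},
    }


print(latency_tuned_kwargs())
```

The returned dictionary can be merged into the other recorder keyword arguments before constructing the recorder.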