sherpa-onnx Engine: Offline CPU INT8 Backend for RealtimeSTT

RealtimeSTT includes CPU INT8 sherpa-onnx engines for both the Moonshine and Parakeet model families. These engines are useful when you want fully offline CPU inference without loading NeMo or Transformers at runtime. All computation happens through the ONNX Runtime via the sherpa-onnx Python package, and model files are small enough to vendor locally.

sherpa-onnx models are not downloaded automatically. You must download and extract the model bundle manually before starting the recorder. See Model Download below.

Install

pip install "RealtimeSTT[sherpa-onnx]"

Available Engine Names

Two sherpa-onnx engines are available, each with a set of accepted aliases:

Engine	Class	Aliases
`sherpa_onnx_moonshine`	`SherpaOnnxMoonshineEngine`	`sherpa_moonshine`, `moonshine_sherpa_onnx`
`sherpa_onnx_parakeet`	`SherpaOnnxParakeetEngine`	`sherpa_parakeet`, `parakeet_sherpa_onnx`

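Each alias resolves to the same engine class as its canonical name via the factory. Engine names are normalized by replacing hyphens with underscores before lookup, so sherpa-onnx-moonshine is equivalent to sherpa_onnx_moonshine.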
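The normalization and alias lookup can be pictured with a minimal sketch. The `ALIASES` table and `resolve_engine_name` function are hypothetical names used only for illustration; the engine names and aliases come from the table above, and RealtimeSTT's actual factory may be organized differently.

```python
# Illustrative sketch only: ALIASES and resolve_engine_name are hypothetical
# names, not RealtimeSTT's actual factory code.
ALIASES = {
    "sherpa_onnx_moonshine": "sherpa_onnx_moonshine",
    "sherpa_moonshine": "sherpa_onnx_moonshine",
    "moonshine_sherpa_onnx": "sherpa_onnx_moonshine",
    "sherpa_onnx_parakeet": "sherpa_onnx_parakeet",
    "sherpa_parakeet": "sherpa_onnx_parakeet",
    "parakeet_sherpa_onnx": "sherpa_onnx_parakeet",
}


def resolve_engine_name(name: str) -> str:
    """Replace hyphens with underscores, then map an alias to its canonical name."""
    normalized = name.replace("-", "_")
    try:
        return ALIASES[normalized]
    except KeyError:
        raise ValueError(f"Unknown sherpa-onnx engine name: {name!r}") from None


print(resolve_engine_name("sherpa-onnx-moonshine"))  # sherpa_onnx_moonshine
print(resolve_engine_name("parakeet-sherpa-onnx"))   # sherpa_onnx_parakeet
```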

Model Download

RealtimeSTT does not download sherpa-onnx model bundles automatically. Download the .tar.bz2 archives from the sherpa-onnx ASR model releases, extract them, and pass the path of the extracted directory as the model argument. Known bundle names:

sherpa-onnx-moonshine-tiny-en-int8.tar.bz2
sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8.tar.bz2

Linux / macOS

mkdir -p test-model-cache/sherpa-onnx

# Moonshine
curl -L -o test-model-cache/sherpa-onnx/sherpa-onnx-moonshine-tiny-en-int8.tar.bz2 \
  https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-moonshine-tiny-en-int8.tar.bz2
python -c "import tarfile; tarfile.open('test-model-cache/sherpa-onnx/sherpa-onnx-moonshine-tiny-en-int8.tar.bz2', 'r:bz2').extractall('test-model-cache/sherpa-onnx')"

# Parakeet
curl -L -o test-model-cache/sherpa-onnx/sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8.tar.bz2 \
  https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8.tar.bz2
python -c "import tarfile; tarfile.open('test-model-cache/sherpa-onnx/sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8.tar.bz2', 'r:bz2').extractall('test-model-cache/sherpa-onnx')"

Windows PowerShell

New-Item -ItemType Directory -Path test-model-cache\sherpa-onnx -Force

# Moonshine
curl.exe -L -o test-model-cache\sherpa-onnx\sherpa-onnx-moonshine-tiny-en-int8.tar.bz2 https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-moonshine-tiny-en-int8.tar.bz2
python -c "import tarfile; tarfile.open(r'test-model-cache\sherpa-onnx\sherpa-onnx-moonshine-tiny-en-int8.tar.bz2', 'r:bz2').extractall(r'test-model-cache\sherpa-onnx')"

# Parakeet
curl.exe -L -o test-model-cache\sherpa-onnx\sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8.tar.bz2 https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8.tar.bz2
python -c "import tarfile; tarfile.open(r'test-model-cache\sherpa-onnx\sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8.tar.bz2', 'r:bz2').extractall(r'test-model-cache\sherpa-onnx')"
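
As a platform-independent alternative, the same download and extraction steps could be scripted with the Python standard library. This is a sketch under stated assumptions: `bundle_url` and `download_and_extract` are hypothetical helper names, the URL pattern is copied from the commands above, and the returned path assumes each archive extracts to a directory named after the bundle.

```python
import tarfile
import urllib.request
from pathlib import Path

# Release URL prefix taken from the curl commands above.
BASE_URL = "https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models"


def bundle_url(bundle: str) -> str:
    """Return the download URL for a bundle name without the .tar.bz2 suffix."""
    return f"{BASE_URL}/{bundle}.tar.bz2"


def download_and_extract(bundle: str, dest: str = "test-model-cache/sherpa-onnx") -> Path:
    """Download one bundle into dest, extract it, and return the extracted directory."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / f"{bundle}.tar.bz2"
    if not archive.exists():
        # Download to a temporary name first so an interrupted transfer
        # never leaves a truncated file under the final archive name.
        partial = archive.with_name(archive.name + ".part")
        urllib.request.urlretrieve(bundle_url(bundle), str(partial))
        partial.replace(archive)
    with tarfile.open(archive, "r:bz2") as tar:
        # Use the safer "data" extraction filter where this Python version has it.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest_dir, filter="data")
        else:
            tar.extractall(dest_dir)
    return dest_dir / bundle
```

Calling `download_and_extract("sherpa-onnx-moonshine-tiny-en-int8")` would then return the directory to pass as model.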

Expected Model Files

After extraction, the engine looks for these files inside the model directory:

Moonshine Tiny INT8

sherpa-onnx-moonshine-tiny-en-int8/
├── preprocess.onnx
├── encode.int8.onnx
├── uncached_decode.int8.onnx
├── cached_decode.int8.onnx
└── tokens.txt

Parakeet TDT INT8

sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/
├── encoder.int8.onnx
├── decoder.int8.onnx
├── joiner.int8.onnx
└── tokens.txt
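
Before starting the recorder, it can help to confirm that the extracted directory is complete. The following is a minimal pre-flight sketch; `EXPECTED_FILES` and `missing_model_files` are hypothetical names, and the file lists are copied from the listings above rather than read from the engine itself.

```python
from pathlib import Path

# Expected file names, copied from the directory listings above.
EXPECTED_FILES = {
    "moonshine": [
        "preprocess.onnx",
        "encode.int8.onnx",
        "uncached_decode.int8.onnx",
        "cached_decode.int8.onnx",
        "tokens.txt",
    ],
    "parakeet": [
        "encoder.int8.onnx",
        "decoder.int8.onnx",
        "joiner.int8.onnx",
        "tokens.txt",
    ],
}


def missing_model_files(model_dir: str, family: str) -> list[str]:
    """Return the expected files absent from model_dir (an empty list means complete)."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES[family] if not (root / name).is_file()]


# Prints the names that still need to be extracted, if any.
print(missing_model_files(
    "test-model-cache/sherpa-onnx/sherpa-onnx-moonshine-tiny-en-int8", "moonshine"
))
```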

Basic Usage

Moonshine

from RealtimeSTT import AudioToTextRecorder

recorder = AudioToTextRecorder(
    transcription_engine="sherpa_onnx_moonshine",
    model="test-model-cache/sherpa-onnx/sherpa-onnx-moonshine-tiny-en-int8",
    device="cpu",
    language="en",
    transcription_engine_options={
        "num_threads": 2,
        "provider": "cpu",
    },
)

Parakeet

from RealtimeSTT import AudioToTextRecorder

recorder = AudioToTextRecorder(
    transcription_engine="sherpa_onnx_parakeet",
    model="test-model-cache/sherpa-onnx/sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8",
    device="cpu",
    transcription_engine_options={
        "num_threads": 4,
        "provider": "cpu",
    },
)

When download_root is set, known model ids (such as nvidia/parakeet-tdt-0.6b-v3 or sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8) resolve to the expected extracted directory names under that root.
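
A sketch of how download_root and a known model id might be combined, followed by a simple transcription loop. The root path and thread count are illustrative values, `build_recorder_kwargs` and `main` are hypothetical helper names, and the loop assumes the recorder's usual blocking `text()` call and context-manager support.

```python
def build_recorder_kwargs(download_root: str = "test-model-cache/sherpa-onnx") -> dict:
    """Keyword arguments for a Parakeet recorder that resolves its model under download_root."""
    return {
        "transcription_engine": "sherpa_onnx_parakeet",
        # Known model id; per the paragraph above it should resolve to the
        # extracted directory of the same name under download_root.
        "model": "sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8",
        "download_root": download_root,
        "device": "cpu",
        "transcription_engine_options": {"num_threads": 4, "provider": "cpu"},
    }


def main() -> None:
    # Imported here so the helper above can be used without RealtimeSTT installed.
    from RealtimeSTT import AudioToTextRecorder

    with AudioToTextRecorder(**build_recorder_kwargs()) as recorder:
        print("Speak now (Ctrl+C to stop).")
        while True:
            # text() blocks until a spoken utterance has been transcribed.
            print(recorder.text())
```

Because RealtimeSTT starts worker processes, `main()` is best called from under an `if __name__ == "__main__":` guard.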

Engine Options

Pass these keys inside transcription_engine_options:

Option	Type	Meaning
`model_dir`	`str`	Explicit path to the extracted model directory (overrides `model`).
`files`	`dict`	Dictionary overriding individual ONNX file names or paths.
`num_threads`	`int`	Number of CPU worker threads.
`provider`	`str`	ONNX Runtime provider. Default: `"cpu"`.
`decoding_method`	`str`	sherpa-onnx decoding method. Default: `"greedy_search"`.
`debug`	`bool`	Enables sherpa-onnx debug output.
`rule_fsts`	`str`	Optional text normalization FST resource path.
`rule_fars`	`str`	Optional text normalization FAR resource path.
`hr_dict_dir`	`str`	Optional heteronym replacement dictionary directory.
`hr_rule_fsts`	`str`	Optional heteronym replacement FST resource path.
`hr_lexicon`	`str`	Optional heteronym replacement lexicon path.
`input_sample_rate`	`int`	Input audio sample rate. Default: `16000`. Alias: `sample_rate`.
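
Since these options are passed as a plain dictionary, a wrongly typed value may only surface when the engine starts. The sketch below type-checks an options dictionary against the table above before the recorder is built. `COMMON_OPTION_TYPES` and `check_engine_options` are hypothetical names, not part of RealtimeSTT, and keys outside this table (such as the Parakeet-only options) are deliberately skipped rather than rejected.

```python
# Option names and types copied from the table above (plus the sample_rate alias).
COMMON_OPTION_TYPES = {
    "model_dir": str,
    "files": dict,
    "num_threads": int,
    "provider": str,
    "decoding_method": str,
    "debug": bool,
    "rule_fsts": str,
    "rule_fars": str,
    "hr_dict_dir": str,
    "hr_rule_fsts": str,
    "hr_lexicon": str,
    "input_sample_rate": int,
    "sample_rate": int,
}


def check_engine_options(options: dict) -> list[str]:
    """Return one message per option whose value has the wrong type."""
    problems = []
    for key, value in options.items():
        expected = COMMON_OPTION_TYPES.get(key)
        if expected is None:
            continue  # not in the common table (for example a Parakeet-only option)
        # bool is a subclass of int in Python, so reject it explicitly for int options.
        wrong_type = not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)
        )
        if wrong_type:
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


print(check_engine_options({"num_threads": "2", "provider": "cpu"}))
# ['num_threads: expected int, got str']
```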

Parakeet-only transducer options:

Option	Type	Default	Meaning
`model_type`	`str`	`"nemo_transducer"`	sherpa-onnx model type.
`max_active_paths`	`int`	`4`	Beam search paths.
`hotwords_file`	`str`	`""`	Path to hotwords file.
`hotwords_score`	`float`	`1.5`	Hotword boost score.
`blank_penalty`	`float`	`0.0`	Blank token penalty.
`feature_dim`	`int`	`80`	Filterbank feature dimension.
`lm`	`str`	`""`	Optional language model path.
`lm_scale`	`float`	`0.1`	Language model interpolation weight.
`lodr_fst`	`str`	`""`	Low-order density ratio FST path.
`lodr_scale`	`float`	`0.0`	Low-order density ratio interpolation scale.
`dither`	`float`	`0.0`	Dithering value applied to filterbank features.
`modeling_unit`	`str`	`"cjkchar"`	Modeling unit for the transducer vocabulary.
`bpe_vocab`	`str`	`""`	BPE vocabulary file path.

When to Use sherpa-onnx

Good fit

Fully offline CPU inference without PyTorch, NeMo, or Transformers
Predictable local model files that you control and version
INT8 ONNX models with small disk and memory footprint
Environments where network access to Hugging Face Hub is restricted

Consider other engines

You want automatic model downloads (faster_whisper or openai_whisper)
You need multilingual transcription (Moonshine adapter is English-only)
You have a CUDA-capable GPU (faster_whisper with float16 is faster)

Troubleshooting

Missing file errors

The error message names the exact ONNX or tokens.txt path that the engine expected. Check that the archive was fully extracted, not just downloaded. The model path must point to the extracted directory, not the .tar.bz2 file.

High latency

Use a smaller model where possible (for example Moonshine tiny instead of Parakeet 0.6b), increase realtime_processing_pause so that realtime transcription passes run less often, and tune num_threads to match your CPU core count.
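
One way to apply this advice is sketched below. The cap of 4 threads and the pause of 0.3 seconds are illustrative assumptions to adjust per machine, and `suggest_num_threads` is a hypothetical helper rather than part of RealtimeSTT.

```python
import os


def suggest_num_threads(cap: int = 4) -> int:
    """Pick a thread count no larger than the logical CPU count or cap (a heuristic only)."""
    cores = os.cpu_count() or 1
    return max(1, min(cores, cap))


# Keyword arguments to merge into the AudioToTextRecorder call.
tuning_kwargs = {
    # Seconds to pause between realtime transcription passes; larger values
    # mean fewer passes and less CPU load, at the cost of slower updates.
    "realtime_processing_pause": 0.3,
    "transcription_engine_options": {"num_threads": suggest_num_threads()},
}
print(tuning_kwargs)
```

The pause only matters when realtime transcription is enabled on the recorder.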

Language errors on Moonshine

The sherpa-onnx Moonshine tiny INT8 engine supports English ("en") only. Passing any other language raises a TranscriptionEngineError.
