
Meta Omnilingual ASR is a multilingual speech recognition engine that uses Meta’s ASRInferencePipeline. It supports a wide range of languages through both CTC and LLM-family model cards. The adapter is lazy-loaded, so normal RealtimeSTT imports and installs do not require the Omnilingual runtime until this engine is selected.
Platform requirements are strict. The Omnilingual engine only runs on Linux or WSL2 with Python 3.11.x. Native Windows cannot run the Omnilingual runtime because fairseq2n has no Windows wheel. Python 3.12.x currently fails dependency resolution because upstream omnilingual-asr==0.2.0 declares Requires-Python: <=3.12,>=3.10. Under PEP 440 version comparison, <=3.12 means <=3.12.0, so 3.12.1 and every later 3.12 patch release is rejected.
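You can run a small preflight script to check both requirements before you install anything. This is a hypothetical helper, not part of RealtimeSTT. Its version parser is a deliberate simplification that only understands plain release versions such as 3.11.9, with no pre-release or local tags. It reproduces the PEP 440 rule behind the 3.12 failure: release segments are zero-padded before comparison, so 3.12.0 satisfies <=3.12 but 3.12.1 does not.

```python
import sys


def release_tuple(version: str, width: int = 3) -> tuple:
    """Parse a plain release version like '3.12' into a zero-padded tuple (3, 12, 0).

    Simplification: pre-release, post-release, and local version parts are not handled.
    """
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def satisfies_requires_python(python_version: str, lower: str = "3.10", upper: str = "3.12") -> bool:
    """Evaluate '>=lower,<=upper' by comparing zero-padded release segments, as PEP 440 does."""
    v = release_tuple(python_version)
    return release_tuple(lower) <= v <= release_tuple(upper)


def omnilingual_platform_ok(platform: str = sys.platform, version=sys.version_info) -> bool:
    """Documented requirement: Linux or WSL2 (both report 'linux') with Python 3.11.x."""
    return platform == "linux" and tuple(version[:2]) == (3, 11)


if __name__ == "__main__":
    for candidate in ("3.10.14", "3.11.9", "3.12.0", "3.12.1"):
        print(candidate, satisfies_requires_python(candidate))
    print("this interpreter OK for Omnilingual:", omnilingual_platform_ok())
```

The script prints True for 3.10.14, 3.11.9, and 3.12.0 and False for 3.12.1. It then reports whether the current interpreter meets the Linux plus Python 3.11 requirement. Passing the metadata check on 3.12.0 does not make that version a supported target. The documented runtime is still Python 3.11.x.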

Install

Use a Linux or WSL2 environment with Python 3.11.x, then run:
pip install "RealtimeSTT[omnilingual]"
The extra names omnilingual-asr and meta-omnilingual-asr are aliases for omnilingual if you prefer the explicit package-family name (for example, pip install "RealtimeSTT[omnilingual-asr]"). When working from a source checkout:
pip install -e ".[omnilingual]"
The extra constrains torch==2.8.0 and torchaudio==2.8.0 on Linux/WSL2. If you install a CUDA-enabled PyTorch stack separately, keep torch and torchaudio on matching releases. A mismatched pair can pass pip check but fail at import time with a missing libcudart.so shared-library error.
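You can catch a mismatched pair before it shows up as a libcudart.so import error by comparing the installed distribution versions without importing torch. The sketch uses importlib.metadata from the standard library. Its matching rule requires an identical release number and an identical local build tag such as +cu128. This is a conservative heuristic that fits the documented torch==2.8.0 / torchaudio==2.8.0 pin; it is not an official compatibility table.

```python
from __future__ import annotations

from importlib.metadata import PackageNotFoundError, version


def split_version(v: str) -> tuple[str, str]:
    """Split '2.8.0+cu128' into ('2.8.0', 'cu128'); the local tag is '' when absent."""
    release, _, local = v.partition("+")
    return release, local


def torch_pair_matches(torch_version: str, torchaudio_version: str) -> bool:
    """Heuristic: same release number and same local build tag (for example the CUDA tag)."""
    return split_version(torch_version) == split_version(torchaudio_version)


if __name__ == "__main__":
    try:
        tv, av = version("torch"), version("torchaudio")
    except PackageNotFoundError as exc:
        print("not installed:", exc)
    else:
        print(f"torch {tv}, torchaudio {av}, match: {torch_pair_matches(tv, av)}")
```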

Engine Names

All of the following names are accepted as the transcription_engine value:
  • omnilingual_asr
  • omnilingual
  • meta_omnilingual_asr
  • omni_asr
Hyphenated CLI forms such as omnilingual-asr, meta-omnilingual-asr, and omni-asr are also accepted through the generic engine-name normalization.
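This page does not spell out exactly how RealtimeSTT normalizes engine names. The sketch below assumes the normalization trims whitespace, lowercases, and replaces hyphens with underscores. That is enough to map every listed CLI form onto one of the accepted names. The function and constant names are hypothetical and illustrate the idea; they are not RealtimeSTT's implementation.

```python
# Accepted transcription_engine values listed above.
OMNILINGUAL_ALIASES = {"omnilingual_asr", "omnilingual", "meta_omnilingual_asr", "omni_asr"}


def normalize_engine_name(name: str) -> str:
    """Assumed normalization: trim, lowercase, and turn hyphens into underscores."""
    return name.strip().lower().replace("-", "_")


def is_omnilingual_engine(name: str) -> bool:
    """True if the (normalized) name selects the Omnilingual engine."""
    return normalize_engine_name(name) in OMNILINGUAL_ALIASES
```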

Basic Usage

from RealtimeSTT import AudioToTextRecorder

recorder = AudioToTextRecorder(
    transcription_engine="omnilingual_asr",
    model="omniASR_CTC_1B_v2",
    device="cuda",
    compute_type="float16",
)
If model is left at RealtimeSTT’s default Whisper value, the adapter automatically selects omniASR_CTC_1B_v2.
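The default-model rule and the model_card override listed under Configuration Options can be combined into a small resolution function. This is an illustrative sketch, not the adapter's code. It assumes RealtimeSTT's default Whisper model value is "tiny"; change WHISPER_DEFAULT_MODEL if your version uses a different default.

```python
from __future__ import annotations

DEFAULT_OMNILINGUAL_CARD = "omniASR_CTC_1B_v2"
# Assumption: "tiny" stands in for RealtimeSTT's default Whisper model value.
WHISPER_DEFAULT_MODEL = "tiny"


def resolve_model_card(model: str | None = None, engine_options: dict | None = None) -> str:
    """Pick the Omnilingual card following the documented rules (sketch)."""
    options = engine_options or {}
    # transcription_engine_options["model_card"] takes precedence over model.
    if options.get("model_card"):
        return options["model_card"]
    # Leaving model at the Whisper default (or empty) selects the recommended card.
    if not model or model == WHISPER_DEFAULT_MODEL:
        return DEFAULT_OMNILINGUAL_CARD
    return model
```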

Default Model

The recommended default model is omniASR_CTC_1B_v2. It needs more VRAM and a longer startup than omniASR_CTC_300M_v2, but it is the validated default for this integration. The RealtimeSTT extra requires omnilingual-asr>=0.2.0 because omnilingual-asr==0.1.0 only included older non-v2 cards.

Available v2 Model Cards

| Model card | Type |
| --- | --- |
| omniASR_CTC_300M_v2 | CTC |
| omniASR_CTC_1B_v2 | CTC (recommended default) |
| omniASR_CTC_3B_v2 | CTC |
| omniASR_CTC_7B_v2 | CTC |
| omniASR_LLM_300M_v2 | LLM |
| omniASR_LLM_1B_v2 | LLM |
| omniASR_LLM_3B_v2 | LLM |
| omniASR_LLM_7B_v2 | LLM |
| omniASR_LLM_7B_ZS_v2 | LLM |
| omniASR_LLM_Unlimited_300M_v2 | LLM Unlimited |
| omniASR_LLM_Unlimited_1B_v2 | LLM Unlimited |
| omniASR_LLM_Unlimited_3B_v2 | LLM Unlimited |
| omniASR_LLM_Unlimited_7B_v2 | LLM Unlimited |
Do not silently fall back to older non-v2 cards when a v2 card is unknown. A ModelNotKnownError for a v2 card means the installed omnilingual-asr package is older than expected — upgrade to omnilingual-asr>=0.2.0.
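If your own code accepts model card names from users or config files, you can validate them against the v2 table up front. Unknown names are then refused instead of being replaced with an older card. The helper below is hypothetical. It classifies a card into the Type column above and raises ValueError for anything not in the documented list. ValueError is an assumed choice and is distinct from the package's own ModelNotKnownError.

```python
# The documented v2 cards from the table above.
V2_CARDS = {
    "omniASR_CTC_300M_v2", "omniASR_CTC_1B_v2", "omniASR_CTC_3B_v2", "omniASR_CTC_7B_v2",
    "omniASR_LLM_300M_v2", "omniASR_LLM_1B_v2", "omniASR_LLM_3B_v2", "omniASR_LLM_7B_v2",
    "omniASR_LLM_7B_ZS_v2",
    "omniASR_LLM_Unlimited_300M_v2", "omniASR_LLM_Unlimited_1B_v2",
    "omniASR_LLM_Unlimited_3B_v2", "omniASR_LLM_Unlimited_7B_v2",
}


def card_family(card: str) -> str:
    """Return 'CTC', 'LLM', or 'LLM Unlimited' for a documented v2 card.

    Raises instead of substituting a non-v2 card (no silent fallback).
    """
    if card not in V2_CARDS:
        raise ValueError(
            f"{card!r} is not a documented v2 card; refusing to fall back to a non-v2 card. "
            "If omnilingual-asr reports ModelNotKnownError for a v2 card, run: "
            "pip install 'omnilingual-asr>=0.2.0'"
        )
    if card.startswith("omniASR_CTC_"):
        return "CTC"
    if card.startswith("omniASR_LLM_Unlimited_"):
        return "LLM Unlimited"
    return "LLM"
```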

Language Support

CTC model cards ignore the language parameter entirely — the adapter removes lang before calling the CTC pipeline. LLM model cards accept Omnilingual language IDs such as eng_Latn:
recorder = AudioToTextRecorder(
    transcription_engine="omnilingual_asr",
    model="omniASR_LLM_1B_v2",
    language="eng_Latn",
    device="cuda",
    compute_type="float16",
)
Common ISO 639-1 short codes are automatically mapped to Omnilingual language IDs:
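| Short code | Omnilingual ID |
| --- | --- |
| ar | arb_Arab |
| de | deu_Latn |
| en | eng_Latn |
| es | spa_Latn |
| fr | fra_Latn |
| hi | hin_Deva |
| it | ita_Latn |
| ja | jpn_Jpan |
| ko | kor_Hang |
| nl | nld_Latn |
| pl | pol_Latn |
| pt | por_Latn |
| ru | rus_Cyrl |
| tr | tur_Latn |
| uk | ukr_Cyrl |
| zh | zho_Hans |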
Additional aliases can be supplied through transcription_engine_options["language_aliases"].
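The language rules on this page can be combined into one resolution step:

- CTC cards drop lang.
- The language parameter takes precedence over transcription_engine_options["language"] / ["lang"].
- Short codes map through the table.
- language_aliases adds entries.

The sketch below is a hypothetical restatement of those documented rules, not the adapter's source. Two of its behaviors are assumptions: user aliases override the built-in table, and unknown values pass through unchanged.

```python
from __future__ import annotations

# Short-code table from this page.
SHORT_CODE_TO_OMNI = {
    "ar": "arb_Arab", "de": "deu_Latn", "en": "eng_Latn", "es": "spa_Latn",
    "fr": "fra_Latn", "hi": "hin_Deva", "it": "ita_Latn", "ja": "jpn_Jpan",
    "ko": "kor_Hang", "nl": "nld_Latn", "pl": "pol_Latn", "pt": "por_Latn",
    "ru": "rus_Cyrl", "tr": "tur_Latn", "uk": "ukr_Cyrl", "zh": "zho_Hans",
}


def resolve_lang(model_card: str, language: str | None = None,
                 engine_options: dict | None = None) -> str | None:
    """Return the lang value to send to the pipeline, or None when it should be omitted."""
    if model_card.startswith("omniASR_CTC_"):
        return None  # CTC cards ignore language, so lang is dropped
    options = engine_options or {}
    # The language parameter wins; engine options are the fallback.
    lang = language or options.get("language") or options.get("lang")
    if not lang:
        return None  # LLM cards may give poor or empty output without a language
    aliases = {**SHORT_CODE_TO_OMNI, **options.get("language_aliases", {})}
    # Unknown values (for example full IDs such as eng_Latn) pass through unchanged.
    return aliases.get(lang, lang)
```

With a real recorder, you would supply the same aliases as transcription_engine_options={"language_aliases": {"english": "eng_Latn"}}. The "english" key is only an example.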

Realtime Transcription

For realtime previews, use the validated CTC model as a single model shared between final and realtime transcription:
recorder = AudioToTextRecorder(
    transcription_engine="omnilingual_asr",
    model="omniASR_CTC_1B_v2",
    enable_realtime_transcription=True,
    realtime_transcription_engine="omnilingual_asr",
    realtime_model_type="omniASR_CTC_1B_v2",
    use_main_model_for_realtime=True,
    device="cuda",
    compute_type="float16",
)
use_main_model_for_realtime=True keeps one shared model in memory, which is the safest first test for VRAM. If you use separate final and realtime models, validate VRAM headroom before increasing model size.
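A rough way to reason about VRAM headroom is to count weight memory alone: parameters times bytes per parameter. At float16 (2 bytes per parameter), a nominal 1B model needs about 1.86 GiB just for its weights. Two separate 1B lanes need about 3.73 GiB. The parameter count comes from the card name and is only an approximation. Activations, caches, and the CUDA context add to this, so treat the result as a lower bound rather than a sizing guarantee.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights alone: parameters x bytes per parameter, in GiB."""
    return num_params * bytes_per_param / 2**30


if __name__ == "__main__":
    one_lane = weight_memory_gib(1e9)  # nominal 1B card at float16 (2 bytes/param)
    print(f"shared lane: {one_lane:.2f} GiB, two separate 1B lanes: {2 * one_lane:.2f} GiB")
```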

Configuration Options

| Option | Meaning |
| --- | --- |
| model | Omnilingual model card. |
| transcription_engine_options["model_card"] | Overrides model. |
| device | Passed to ASRInferencePipeline; cuda becomes cuda:<gpu_device_index>. |
| compute_type | Mapped to a torch dtype. Defaults to FP16 on CUDA and FP32 on CPU. |
| transcription_engine_options["dtype"] / ["torch_dtype"] | Explicit torch dtype string: float16, bfloat16, or float32. |
| transcription_engine_options["sample_rate"] | Sample rate for in-memory audio. Defaults to 16000. |
| batch_size | Used when greater than 0; otherwise the adapter defaults to batch size 1. |
| transcription_engine_options["batch_size"] | Overrides the RealtimeSTT batch_size value. |
| transcription_engine_options["pipeline"] | Extra keyword arguments for ASRInferencePipeline. |
| transcription_engine_options["transcribe"] | Extra keyword arguments for pipeline.transcribe(...). |
| transcription_engine_options["max_audio_seconds"] | In-memory audio duration guard. Defaults to 39.9 seconds. Set to False (false in JSON) to disable. |
| transcription_engine_options["language"] / ["lang"] | Language used when no language parameter is set. |
| transcription_engine_options["language_aliases"] | Extra short-code to Omnilingual language ID mappings. |
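The device, dtype, batch size, and sample-rate rows of the table can be read as one resolution function. This is a hypothetical sketch of the documented rules, not the adapter's code. It returns dtype names as strings so that it runs without PyTorch; real code could turn them into torch dtypes with getattr(torch, name). Two behaviors are assumptions: an explicit dtype option takes precedence over compute_type, and unrecognized compute_type values fall back to the device default.

```python
from __future__ import annotations

VALID_DTYPES = {"float16", "bfloat16", "float32"}


def resolve_runtime(
    device: str = "cuda",
    gpu_device_index: int = 0,
    compute_type: str = "default",
    batch_size: int = 0,
    engine_options: dict | None = None,
) -> dict:
    """Resolve device, dtype, batch size, and sample rate per the options table (sketch)."""
    options = engine_options or {}

    # "cuda" is expanded to an indexed device; values like "cpu" or "cuda:1" pass through.
    resolved_device = f"cuda:{gpu_device_index}" if device == "cuda" else device

    # Assumed precedence: explicit dtype option, then compute_type, then device default.
    explicit = options.get("dtype") or options.get("torch_dtype")
    if explicit and explicit not in VALID_DTYPES:
        raise ValueError(f"unsupported dtype {explicit!r}; use one of {sorted(VALID_DTYPES)}")
    if explicit:
        dtype = explicit
    elif compute_type in VALID_DTYPES:
        dtype = compute_type
    else:
        # Documented defaults: FP16 on CUDA, FP32 on CPU.
        dtype = "float16" if resolved_device.startswith("cuda") else "float32"

    # transcription_engine_options["batch_size"] wins; otherwise batch_size > 0, else 1.
    if "batch_size" in options:
        resolved_batch = int(options["batch_size"])
    else:
        resolved_batch = batch_size if batch_size > 0 else 1

    return {
        "device": resolved_device,
        "dtype": dtype,
        "batch_size": resolved_batch,
        "sample_rate": int(options.get("sample_rate", 16000)),
    }
```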

File Smoke Test

To test the engine without a source checkout, download the standalone smoke script from the release branch and run it from Linux or WSL2 with Python 3.11.x:
curl -L https://raw.githubusercontent.com/KoljaB/RealtimeSTT/release/v1.0.1/tests/realtimestt_omnilingual_test.py \
  -o realtimestt_omnilingual_test.py
python realtimestt_omnilingual_test.py --file-smoke --device cuda --cache-home .home-omnilingual-smoke
The script downloads a small public-domain LJ Speech audio fixture and expects the recognized text to contain "in being". From a source checkout, you can run the script directly:
python tests/realtimestt_omnilingual_test.py --file-smoke --device cuda
Omnilingual model assets are large. The 1B-family smoke test can download several GiB of model and cache files on first run.
For an interactive microphone check after the file smoke test passes:
python realtimestt_omnilingual_test.py --microphone --device cuda

Model Cache Behavior

The Omnilingual package downloads and caches model assets through its underlying fairseq2/Hugging Face tooling in the Linux user's normal cache locations. RealtimeSTT does not move or delete those files.

The adapter passes in-memory audio as a predecoded object so that the Omnilingual package does not treat a raw NumPy float array as encoded audio bytes:
{"waveform": waveform_np, "sample_rate": sample_rate}
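If you feed audio to the pipeline yourself, the predecoded object has the shape shown above. The sketch below builds it from raw audio and assumes 16-bit little-endian mono PCM input. The helper name and the input format are assumptions; only the waveform/sample_rate dictionary is documented. Scaling by 1/32768 converts int16 samples to float32 values in [-1.0, 1.0).

```python
import numpy as np


def pcm16_to_predecoded(pcm: bytes, sample_rate: int = 16000) -> dict:
    """Wrap 16-bit little-endian mono PCM bytes as a predecoded audio dict."""
    # "<i2" = little-endian int16; dividing by 32768 maps samples into [-1.0, 1.0).
    waveform_np = np.frombuffer(pcm, dtype="<i2").astype(np.float32) / 32768.0
    return {"waveform": waveform_np, "sample_rate": sample_rate}
```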

FastAPI Recipe

Run the FastAPI server from a source checkout in WSL2/Linux when using Omnilingual:
PYTHONPATH=. python example_fastapi_server/server.py \
  --host 0.0.0.0 \
  --port 8010 \
  --engine omnilingual_asr \
  --model omniASR_CTC_1B_v2 \
  --realtime-engine omnilingual_asr \
  --realtime-model omniASR_CTC_1B_v2 \
  --use-main-model-for-realtime \
  --device cuda \
  --compute-type float16 \
  --realtime-processing-pause 0.05 \
  --engine-options '{"batch_size":1,"sample_rate":16000}'
Open http://localhost:8010/ from a Windows browser. WSL2 forwards localhost for the default setup. For a browser on another device, connect to the Windows host’s LAN address and ensure the firewall and WSL networking allow the selected port.
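Quoting JSON by hand inside a shell command is easy to get wrong. One option is to generate the --engine-options value with json.dumps and let shlex.join (Python 3.8+) quote each argument for a POSIX shell. The example below uses only a subset of the server flags above.

```python
import json
import shlex

engine_options = {"batch_size": 1, "sample_rate": 16000}

command = [
    "python", "example_fastapi_server/server.py",
    "--engine", "omnilingual_asr",
    "--model", "omniASR_CTC_1B_v2",
    "--engine-options", json.dumps(engine_options),
]
# shlex.join quotes the JSON argument so the shell passes it through intact.
print("PYTHONPATH=. " + shlex.join(command))
```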

Troubleshooting

Missing dependency errors mean omnilingual_asr, PyTorch, fairseq2, or fairseq2n is not importable in the active Linux/WSL environment. Reinstall with pip install "RealtimeSTT[omnilingual]" and confirm you are in a Python 3.11.x environment.
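To find out which of these modules is missing without importing the heavy packages, you can ask the import system whether each top-level module can be found. This only checks presence. A module that is found can still fail at import time, for example with the libcudart.so error covered in this section.

```python
from __future__ import annotations

from importlib.util import find_spec

# Top-level import names for the dependencies listed above.
REQUIRED_MODULES = ("omnilingual_asr", "torch", "fairseq2", "fairseq2n")


def missing_modules(names=REQUIRED_MODULES) -> list[str]:
    """Return the names that cannot be found on the current interpreter's import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("missing:", missing or "none")
```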
Native Windows installs intentionally skip the Omnilingual runtime because fairseq2n currently has no Windows wheel. Run the Omnilingual runtime inside WSL2 or on a Linux machine.
omnilingual-asr==0.2.0 declares Requires-Python: <=3.12,>=3.10, which makes normal Python 3.12 patch releases fail pip’s dependency resolver. Use Python 3.11.x until upstream package metadata changes.
A missing libcudart.so error usually means torch and torchaudio were resolved from incompatible builds. Install matching versions in the same environment. The Omnilingual extra constrains torch==2.8.0 and torchaudio==2.8.0.
A ModelNotKnownError for a v2 card means the installed omnilingual-asr package does not ship that card. Upgrade or pin the package to a release that includes the documented v2 cards: pip install "omnilingual-asr>=0.2.0".
If GPU memory runs out, enable one shared model with use_main_model_for_realtime=True, reduce concurrent server sessions, or evaluate a smaller validated model such as omniASR_CTC_300M_v2 before changing the default.
If an LLM model card returns poor or empty output, set an Omnilingual language code such as eng_Latn. CTC models ignore language, but LLM models require it.
The tested non-streaming Omnilingual pipeline requires audio shorter than 40 seconds, so keep realtime and final utterances below that limit. The max_audio_seconds option (default 39.9) enforces this guard by raising an error before oversized audio is sent to the pipeline.
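The guard amounts to comparing the sample count against the sample rate. The sketch below is a hypothetical helper that mirrors the documented default of 39.9 seconds, which is 638,400 samples at 16000 Hz. It also honors the option to disable the guard with False. Raising ValueError is an assumed choice of exception.

```python
from __future__ import annotations


def check_audio_duration(num_samples: int, sample_rate: int = 16000,
                         max_audio_seconds: float | bool | None = 39.9) -> float:
    """Return the duration in seconds, raising before oversized audio reaches the pipeline."""
    duration = num_samples / sample_rate
    # False (or None) disables the guard, mirroring the documented option.
    if max_audio_seconds is False or max_audio_seconds is None:
        return duration
    if duration > max_audio_seconds:
        raise ValueError(
            f"audio is {duration:.2f} s, over the {max_audio_seconds} s limit; split the utterance"
        )
    return duration
```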
