faster-whisper Engine: CTranslate2 Backend for RealtimeSTT

faster_whisper is the default RealtimeSTT transcription engine. It wraps the faster-whisper package, which runs Whisper models through the CTranslate2 inference library. It supports the familiar Whisper model names alongside local CTranslate2 model directories, and covers both GPU and CPU inference through the same interface.

Install

Install the faster-whisper extra for RealtimeSTT:

pip install "RealtimeSTT[faster-whisper]"

If you are working from a source checkout:

python -m pip install -e ".[faster-whisper]"

Basic Usage

GPU (CUDA)

from RealtimeSTT import AudioToTextRecorder

recorder = AudioToTextRecorder(
    transcription_engine="faster_whisper",
    model="small.en",
    device="cuda",
    compute_type="default",
)

CPU

from RealtimeSTT import AudioToTextRecorder

recorder = AudioToTextRecorder(
    model="tiny.en",
    device="cpu",
    compute_type="int8",
)

Model Names

Known model names are downloaded automatically by faster-whisper. Use download_root to control the cache directory:

recorder = AudioToTextRecorder(
    model="small.en",
    download_root="models/faster-whisper",
)

You can also pass a path to a locally converted CTranslate2 model directory as model.

Model name	Notes
`tiny`	Smallest multilingual model
`tiny.en`	English-only, smallest
`base`	Multilingual base
`base.en`	English-only base
`small`	Multilingual small
`small.en`	English-only small
`medium`	Multilingual medium
`medium.en`	English-only medium
`large-v1`	Large multilingual v1
`large-v2`	Large multilingual v2
`large-v3`	Large multilingual v3
`distil-*` variants	Distilled models (e.g. `distil-small.en`, `distil-medium.en`, `distil-large-v3`)
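
Because model accepts either a known name or a local directory, it can help to check which one a given value is before handing it to the recorder. The following is a minimal sketch, not part of RealtimeSTT: classify_model_argument and KNOWN_MODEL_NAMES are hypothetical names, the name list is copied from the table above, and distil-* entries are matched by prefix only.

```python
from pathlib import Path

# Names copied from the table above (assumption: this list is illustrative,
# not the authoritative list shipped with faster-whisper).
KNOWN_MODEL_NAMES = {
    "tiny", "tiny.en", "base", "base.en", "small", "small.en",
    "medium", "medium.en", "large-v1", "large-v2", "large-v3",
}


def classify_model_argument(model: str) -> str:
    """Return "name", "local_dir", or "unknown" for a `model` value."""
    # Known names and distil-* variants are treated as downloadable names.
    if model in KNOWN_MODEL_NAMES or model.startswith("distil-"):
        return "name"
    # Anything else is only usable if it points at an existing directory.
    if Path(model).is_dir():
        return "local_dir"
    return "unknown"
```

A value classified as "unknown" is most likely a typo in the model name or a path that does not exist yet; this sketch does not check that a directory actually contains a converted CTranslate2 model.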

Compute Types

compute_type controls CTranslate2 precision and quantization. Choose based on your hardware:

`compute_type`	Best for	Notes
`default`	GPU or CPU	Keeps the precision the model was converted with; CTranslate2 falls back to a supported type if the device cannot run it
`float16`	GPU	Half-precision; requires sufficient VRAM
`int8_float16`	GPU	INT8 weights, float16 compute; reduces VRAM usage
`int8`	CPU	Integer quantization; fast on CPU
`float32`	CPU reference / debugging	Full precision; slowest on CPU
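
The table can be condensed into a small selection helper. This is a sketch under stated assumptions: suggest_compute_type and its low_vram flag are hypothetical, and the choices simply mirror the "Best for" column rather than any measurement.

```python
def suggest_compute_type(device: str, low_vram: bool = False) -> str:
    """Suggest a compute_type string based on the table above."""
    if device == "cuda":
        # int8_float16 trades some precision for lower VRAM usage.
        return "int8_float16" if low_vram else "float16"
    if device == "cpu":
        # Integer quantization is the fast option on CPU.
        return "int8"
    # For any other device string, let CTranslate2 decide.
    return "default"
```

The result can be passed directly as compute_type, for example compute_type=suggest_compute_type("cuda", low_vram=True).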

GPU Setup

Use device="cuda" for GPU inference. gpu_device_index accepts an integer or a list of GPU ids for compatible multi-GPU loading:

recorder = AudioToTextRecorder(
    model="small.en",
    device="cuda",
    compute_type="float16",
    gpu_device_index=0,
)

If CUDA libraries fail to load, reinstall PyTorch and torchaudio for the CUDA version present on your machine before reinstalling faster-whisper.

Engine-Specific Options

The table below maps RealtimeSTT parameters to their underlying faster-whisper counterparts:

RealtimeSTT parameter	faster-whisper mapping
`model`	`WhisperModel(model_size_or_path=...)`
`download_root`	`WhisperModel(download_root=...)`
`device`	`WhisperModel(device=...)`
`compute_type`	`WhisperModel(compute_type=...)`
`gpu_device_index`	`WhisperModel(device_index=...)`
`beam_size`	`model.transcribe(beam_size=...)`
`batch_size`	Enables `BatchedInferencePipeline` when greater than `0`
`language`	Passed as the transcription language when set
`initial_prompt`	Passed as `initial_prompt`
`suppress_tokens`	Passed as `suppress_tokens`
`faster_whisper_vad_filter`	Passed as `vad_filter`
`normalize_audio`	Normalizes audio before transcription when enabled
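
To make the mapping concrete, the sketch below splits a dictionary of RealtimeSTT-style options into keyword arguments for the WhisperModel constructor and for model.transcribe. It is an illustration of the table, not RealtimeSTT's actual implementation: split_engine_options and the two mapping dictionaries are hypothetical, and batch_size and normalize_audio are left out because the table describes them as behavior switches rather than one-to-one keyword mappings.

```python
from typing import Any, Dict, Tuple

# RealtimeSTT parameter -> WhisperModel(...) keyword, per the table above.
CONSTRUCTOR_MAP = {
    "model": "model_size_or_path",
    "download_root": "download_root",
    "device": "device",
    "compute_type": "compute_type",
    "gpu_device_index": "device_index",
}

# RealtimeSTT parameter -> model.transcribe(...) keyword, per the table above.
TRANSCRIBE_MAP = {
    "beam_size": "beam_size",
    "language": "language",
    "initial_prompt": "initial_prompt",
    "suppress_tokens": "suppress_tokens",
    "faster_whisper_vad_filter": "vad_filter",
}


def split_engine_options(
    options: Dict[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split options into (constructor kwargs, transcribe kwargs)."""
    constructor = {
        CONSTRUCTOR_MAP[key]: value
        for key, value in options.items()
        if key in CONSTRUCTOR_MAP
    }
    # Unset (None) transcription options are dropped, e.g. language.
    transcribe = {
        TRANSCRIBE_MAP[key]: value
        for key, value in options.items()
        if key in TRANSCRIBE_MAP and value is not None
    }
    return constructor, transcribe
```

Options that appear in neither mapping, such as batch_size, are ignored by this helper.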

VAD Filter

faster-whisper includes a built-in voice activity detection filter. Enable it with faster_whisper_vad_filter:

recorder = AudioToTextRecorder(
    model="small.en",
    faster_whisper_vad_filter=True,
)

The VAD filter can reduce hallucinations on silent segments, but RealtimeSTT’s own VAD already gates audio before it reaches the engine. Enable faster_whisper_vad_filter only if you observe spurious output on near-silent segments.

Realtime Configuration

Use a smaller realtime_model_type than the final model to keep realtime updates responsive:

recorder = AudioToTextRecorder(
    model="small.en",
    enable_realtime_transcription=True,
    realtime_model_type="tiny.en",
    realtime_processing_pause=0.15,
)

To share a single model between final and realtime transcription, set use_main_model_for_realtime=True. This saves memory but can reduce responsiveness when final and realtime requests contend for the same model.
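
The two setups described above, a separate realtime model versus a shared one, can be expressed as a single helper that builds the recorder keyword arguments. This is a minimal sketch: realtime_kwargs is a hypothetical function, and treating a missing realtime_model as "share the main model" is a design choice made for the example.

```python
from typing import Any, Dict, Optional


def realtime_kwargs(
    final_model: str,
    realtime_model: Optional[str] = None,
    pause: float = 0.15,
) -> Dict[str, Any]:
    """Build AudioToTextRecorder kwargs for realtime transcription."""
    kwargs: Dict[str, Any] = {
        "model": final_model,
        "enable_realtime_transcription": True,
        "realtime_processing_pause": pause,
    }
    if realtime_model is None:
        # No separate realtime model: share the main one to save memory.
        kwargs["use_main_model_for_realtime"] = True
    else:
        # Separate, usually smaller, model for responsive updates.
        kwargs["realtime_model_type"] = realtime_model
    return kwargs
```

For example, AudioToTextRecorder(**realtime_kwargs("small.en", "tiny.en")) would correspond to the configuration shown earlier in this section.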

Troubleshooting

CUDA libraries fail to load

Reinstall PyTorch and torchaudio for the CUDA version on your machine, then reinstall faster-whisper. Verify with torch.cuda.is_available().
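
The check can be wrapped so that a script falls back to CPU instead of failing at startup. This is a sketch, assuming PyTorch may or may not be installed; detect_device is a hypothetical helper name.

```python
def detect_device() -> str:
    """Return "cuda" if PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper works without PyTorch
    except (ImportError, OSError):
        # PyTorch is missing or its shared libraries could not be loaded.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

The returned string can be passed as device. If it unexpectedly returns "cpu" on a machine with a GPU, that points to the PyTorch and CUDA version mismatch described above.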

Model downloads fail

Set download_root to a writable directory and verify network access to the Hugging Face Hub. You can also pre-download models and pass the local CTranslate2 directory as model.

Realtime text lags behind speech

Use a smaller realtime_model_type, lower beam_size_realtime to 1, increase realtime_processing_pause, or switch realtime to a CPU-friendly engine such as whisper_cpp.
