whisper.cpp Engine: CPU Speech Recognition for RealtimeSTT

whisper_cpp uses the optional pywhispercpp package to run Whisper models through the whisper.cpp C++ runtime. It is useful when you want local CPU transcription with ggml model files and a smaller Python dependency surface than PyTorch-based engines. No CUDA installation is required.

Install

pip install "RealtimeSTT[whisper-cpp]"

Basic Usage

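from RealtimeSTT import AudioToTextRecorder

if __name__ == "__main__":
    # The guard matters because RealtimeSTT starts worker processes
    recorder = AudioToTextRecorder(
        transcription_engine="whisper_cpp",
        model="tiny.en",
        device="cpu",
    )
    # Block until one utterance has been recorded and transcribed
    print(recorder.text())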

Model Handling

model can be a name or path accepted by pywhispercpp. For known model names, pywhispercpp may download the matching ggml model automatically. Use download_root to keep model files in a predictable directory:

recorder = AudioToTextRecorder(
    transcription_engine="whisper_cpp",
    model="small.en-q5_1",
    download_root="models/whispercpp",
    device="cpu",
)

If you download model files manually, pass the model path directly or configure pywhispercpp’s models_dir option through transcription_engine_options:

recorder = AudioToTextRecorder(
    transcription_engine="whisper_cpp",
    model="tiny.en",
    transcription_engine_options={
        "model": {
            "models_dir": "/path/to/your/ggml/models",
        },
    },
)
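You can choose between a manually downloaded file and a model name yourself before creating the recorder. The helper below is a hypothetical sketch, not part of RealtimeSTT or pywhispercpp. It assumes whisper.cpp's usual ggml-<name>.bin file naming and falls back to the plain name, which pywhispercpp may then download:

```python
from pathlib import Path
from typing import Optional


def resolve_whisper_cpp_model(model: str, models_dir: Optional[str] = None) -> str:
    """Return a local ggml file path if one exists, otherwise the model name.

    Hypothetical helper; assumes the common ``ggml-<name>.bin`` naming.
    """
    candidate = Path(model).expanduser()
    if candidate.is_file():
        # An explicit path to a ggml file: use it (made absolute for clarity)
        return str(candidate.resolve())
    if models_dir is not None:
        local = Path(models_dir).expanduser() / f"ggml-{model}.bin"
        if local.is_file():
            # A file downloaded manually into the models directory
            return str(local.resolve())
    # Fall back to the name and let pywhispercpp handle it
    return model
```

The result can then be passed as model=resolve_whisper_cpp_model("small.en-q5_1", "models/whispercpp").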

CPU Tuning

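For realtime CPU transcription, use greedy decoding (beam_size_realtime=1) and streaming-friendly pywhispercpp options for the realtime engine. The final engine can afford a wider beam because it runs once per utterance, after speech ends. A complete configuration with both final and realtime engines looks like this: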

recorder = AudioToTextRecorder(
    # Final engine
    transcription_engine="whisper_cpp",
    model="tiny.en",
    device="cpu",
    beam_size=5,
    transcription_engine_options={
        "model": {
            "n_threads": 8,  # adjust to your CPU
            "redirect_whispercpp_logs_to": None,
        },
    },
    # Realtime engine: greedy decoding for low latency
    enable_realtime_transcription=True,
    realtime_transcription_engine="whisper_cpp",
    realtime_model_type="tiny.en",
    beam_size_realtime=1,
    realtime_processing_pause=0.15,
    realtime_transcription_engine_options={
        "model": {
            "n_threads": 8,
            "redirect_whispercpp_logs_to": None,
        },
        "transcribe": {
            # One segment per update, without earlier text as decoder context
            "single_segment": True,
            "no_context": True,
            "print_timestamps": False,
        },
    },
)
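Hard-coding n_threads=8 can oversubscribe smaller machines. The minimal sketch below derives the value from os.cpu_count() instead. The helper names and the choice to reserve one CPU for audio capture are assumptions. Also note that os.cpu_count() counts logical CPUs, not physical cores:

```python
import os
from typing import Any, Dict, Optional


def pick_n_threads(reserve: int = 1, cpu_count: Optional[int] = None) -> int:
    """Pick a whisper.cpp thread count, leaving `reserve` logical CPUs free."""
    # os.cpu_count() reports logical CPUs and may return None
    available = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, available - reserve)


def whisper_cpp_model_options(n_threads: int) -> Dict[str, Any]:
    """Build a "model" bucket like the one in the configuration above."""
    return {
        "model": {
            "n_threads": n_threads,
            # Same value as above; see pywhispercpp's docs for its exact effect
            "redirect_whispercpp_logs_to": None,
        },
    }
```

The result can be passed as transcription_engine_options=whisper_cpp_model_options(pick_n_threads()). When the final and realtime engines run at the same time they compete for the same CPU. Splitting the threads between them may therefore work better than giving each engine the full count.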

Good starting profiles:

Profile	Model	Final `beam_size`	Realtime `beam_size_realtime`
Fast	`tiny.en` or `base.en-q5_1`	`1`	`1`
Balanced	`small.en-q5_1`	`3`	`1`
More accurate CPU	`small.en`	`5`	`1`

medium.en and larger models can be too slow for interactive CPU use.
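If you switch between these profiles often, you can keep them in a dictionary and unpack one into the recorder. The sketch below is hypothetical and copies only the values from the table. For the "fast" profile it picks tiny.en, although base.en-q5_1 is listed as an alternative:

```python
from typing import Any, Dict

# Starting profiles from the table above
CPU_PROFILES: Dict[str, Dict[str, Any]] = {
    "fast": {"model": "tiny.en", "beam_size": 1, "beam_size_realtime": 1},
    "balanced": {"model": "small.en-q5_1", "beam_size": 3, "beam_size_realtime": 1},
    "accurate": {"model": "small.en", "beam_size": 5, "beam_size_realtime": 1},
}


def profile_kwargs(name: str) -> Dict[str, Any]:
    """Return a copy of one profile's recorder keyword arguments."""
    try:
        return dict(CPU_PROFILES[name])
    except KeyError:
        raise ValueError(
            f"unknown profile {name!r}; choose from {sorted(CPU_PROFILES)}"
        ) from None
```

Usage: AudioToTextRecorder(transcription_engine="whisper_cpp", device="cpu", **profile_kwargs("balanced")). The table does not specify a realtime model, so set realtime_model_type separately.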

Engine-Specific Options

Pass backend-specific configuration through transcription_engine_options:

Option bucket	Meaning
`transcription_engine_options["model"]`	Passed to `pywhispercpp.model.Model`.
`transcription_engine_options["transcribe"]`	Merged into `Model.transcribe(...)`.
`download_root`	Passed to `pywhispercpp.model.Model` as `models_dir`.
`beam_size`	Uses whisper.cpp beam search when greater than `1`; otherwise greedy decoding.
`initial_prompt`	String prompts become `initial_prompt`; token iterables become prompt token fields.
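The first three rows describe how one options dictionary is split into two sets of keyword arguments. The function below is a hypothetical illustration of that split, not the adapter's actual code. In particular, it assumes that an explicit models_dir in the "model" bucket takes precedence over download_root:

```python
from typing import Any, Dict, Optional, Tuple


def split_engine_options(
    options: Optional[Dict[str, Any]],
    download_root: Optional[str] = None,
    transcribe_defaults: Optional[Dict[str, Any]] = None,
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split engine options into (Model kwargs, transcribe kwargs)."""
    options = options or {}
    model_kwargs = dict(options.get("model", {}))
    if download_root is not None:
        # Assumption: an explicit models_dir in the "model" bucket wins
        model_kwargs.setdefault("models_dir", download_root)
    transcribe_kwargs = dict(transcribe_defaults or {})
    # Entries from the "transcribe" bucket are merged over the defaults
    transcribe_kwargs.update(options.get("transcribe", {}))
    return model_kwargs, transcribe_kwargs
```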

Current Adapter Limitations

compute_type, batch_size, faster_whisper_vad_filter, and suppress_tokens do not map to equivalent whisper.cpp behavior and are ignored.
Language probabilities are not reported the way faster-whisper reports them; when a language is set explicitly, it is returned with probability 1.0.
Native whisper.cpp output may still appear in the console depending on package behavior and options.
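When you port a faster-whisper configuration, it helps to flag settings that this adapter will ignore. The small sketch below assumes you keep recorder settings in a plain dict before unpacking them into AudioToTextRecorder. The helper is hypothetical:

```python
from typing import Any, List, Mapping

# Options listed above as having no whisper.cpp equivalent
IGNORED_BY_WHISPER_CPP = frozenset(
    {"compute_type", "batch_size", "faster_whisper_vad_filter", "suppress_tokens"}
)


def ignored_options(recorder_kwargs: Mapping[str, Any]) -> List[str]:
    """Return, sorted, the keys that the whisper_cpp adapter ignores."""
    return sorted(key for key in recorder_kwargs if key in IGNORED_BY_WHISPER_CPP)
```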

When to Use whisper.cpp

Good fit

CPU-only machines without a CUDA-capable GPU
Environments where you want to avoid PyTorch as a dependency
Low-memory setups using quantized ggml models (e.g. q5_1 variants)
Containers or servers where binary size matters

Consider faster-whisper instead

GPU inference with CTranslate2 quantization
Production use cases requiring language probability scores
Batched inference pipelines (batch_size > 0)
Richer option pass-through (VAD filter, suppress_tokens, etc.)

Troubleshooting

Import fails

Ensure pywhispercpp is installed in the active environment: pip install pywhispercpp.
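To confirm which interpreter is active and whether it can see the package, you can use the standard library's importlib.util.find_spec. It locates a top-level package without importing it. A successful lookup does not prove that the native build will load:

```python
import importlib.util
import sys


def pywhispercpp_available() -> bool:
    """Return True if pywhispercpp is discoverable in this interpreter.

    find_spec only locates the package; a broken native build can still
    fail later, when the package is actually imported.
    """
    return importlib.util.find_spec("pywhispercpp") is not None


if __name__ == "__main__":
    print("Python:", sys.executable)
    print("pywhispercpp found:", pywhispercpp_available())
```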

Model cannot be found

Set download_root to a writable directory, or pass an absolute path to the ggml model file as model.
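You can create the directory and check that it is writable before starting the recorder. The sketch below uses a hypothetical helper:

```python
import os
from pathlib import Path


def ensure_writable_dir(path: str) -> str:
    """Create `path` if needed, check it is writable, and return it as an absolute path."""
    directory = Path(path).expanduser().resolve()
    directory.mkdir(parents=True, exist_ok=True)
    if not os.access(directory, os.W_OK):
        raise PermissionError(f"{directory} is not writable")
    return str(directory)
```

Usage: download_root=ensure_writable_dir("models/whispercpp").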

Realtime updates fall behind speech

Reduce model size, increase n_threads up to the CPU's useful limit (often around the number of physical cores), keep beam_size_realtime=1, and increase realtime_processing_pause.
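To check whether a model and thread setting can keep up, you can measure the real-time factor: processing time divided by audio duration. Values above 1.0 mean transcription is slower than speech. The minimal sketch below is hypothetical. Wrap whatever transcription call you are benchmarking in timed() and run it on a clip of known length:

```python
import time
from typing import Any, Callable, Tuple


def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Processing time divided by audio duration; above 1.0 means falling behind."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def timed(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Run fn() and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start
```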
