WhisperKit provides a powerful command-line interface for transcribing audio files, streaming from the microphone, and testing models outside of Xcode.

Installation

brew install whisperkit-cli

Available Commands

WhisperKit CLI provides three main commands:

  • transcribe: Transcribe audio files or streams
  • tts: Text-to-speech generation
  • serve: Start a local server (requires BUILD_ALL=1)

Transcribe Command

Basic Usage

Transcribe an audio file:
swift run whisperkit-cli transcribe \
  --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" \
  --audio-path "audio.wav"

Command-Line Options

--audio-path (string[]): Paths to audio files to transcribe
--audio-folder (string): Path to a folder containing audio files (all supported formats will be transcribed)
--model-path (string): Path to local model files
--model (string): Model to download if no model-path is provided (e.g., tiny, base, small, medium, large-v3)
--model-prefix (string, default: "openai"): Model variant prefix: openai or distil
--task (string, default: "transcribe"): Task to perform: transcribe or translate
--language (string): Source language code (e.g., en, es, ja, zh)
--verbose (boolean): Enable verbose output with progress tracking

Audio Processing Options

--temperature (float, default: 0.0): Sampling temperature (0.0-1.0); higher values increase randomness
--temperature-increment-on-fallback (float, default: 0.2): Temperature increase on decoding failures
--temperature-fallback-count (int, default: 5): Number of times to increase the temperature
--best-of (int, default: 5): Number of candidates when sampling with non-zero temperature (topK)

Prompt and Prefix Options

--prompt (string): Text to condition the model on; useful for guiding transcription style
--prefix (string): Force prefix text when decoding
--use-prefill-prompt (boolean): Force initial prompt tokens based on language, task, and timestamp options
--use-prefill-cache (boolean): Use decoder prefill data for faster initial decoding

Timestamp Options

--word-timestamps (boolean): Add timestamps for each word in the output
--without-timestamps (boolean): Force no timestamps when decoding
--clip-timestamps (float[]): List of timestamps used to split the audio into segments

Quality Thresholds

--compression-ratio-threshold (float, default: 2.4): Gzip compression ratio threshold for decoding failure
--logprob-threshold (float, default: -1.0): Average log probability threshold for decoding failure
--first-token-logprob-threshold (float, default: -1.5): Log probability threshold for first-token decoding failure
--no-speech-threshold (float, default: 0.6): Probability threshold to consider a segment as silence

Performance Options

--audio-encoder-compute-units (string, default: "cpuAndNeuralEngine"): Compute units for the audio encoder: all, cpuOnly, cpuAndGPU, cpuAndNeuralEngine
--text-decoder-compute-units (string, default: "cpuAndNeuralEngine"): Compute units for the text decoder: all, cpuOnly, cpuAndGPU, cpuAndNeuralEngine
--concurrent-worker-count (int, default: 4): Maximum concurrent inference workers (0 = unlimited)
--chunking-strategy (string, default: "vad"): Audio chunking strategy: none or vad (voice activity detection)

Streaming Options

--stream (boolean): Process audio directly from the microphone in real time
--stream-simulated (boolean): Simulate streaming transcription using an input audio file

Output Options

--report (boolean): Generate SRT and JSON report files
--report-path (string, default: "."): Directory in which to save reports
--skip-special-tokens (boolean): Skip special tokens in the output

Usage Examples

Basic Transcription

swift run whisperkit-cli transcribe \
  --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" \
  --audio-path "audio.wav"

Translation

swift run whisperkit-cli transcribe \
  --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" \
  --audio-path "french_audio.wav" \
  --task "translate" \
  --verbose

Streaming Transcription

swift run whisperkit-cli transcribe \
  --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" \
  --stream

Word Timestamps

swift run whisperkit-cli transcribe \
  --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" \
  --audio-path "audio.wav" \
  --word-timestamps \
  --report \
  --report-path "outputs/"

Using Prompts

swift run whisperkit-cli transcribe \
  --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" \
  --audio-path "meeting.wav" \
  --prompt "This is a technical discussion about machine learning and neural networks." \
  --language "en"

Clipping Audio

swift run whisperkit-cli transcribe \
  --model-path "Models/whisperkit-coreml/openai_whisper-base" \
  --audio-path "long_audio.wav" \
  --clip-timestamps 0 30.5 60.0 90.5 \
  --verbose
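Assuming the timestamp list pairs up as alternating start/end times (as in the original Whisper CLI's clip_timestamps option; this pairing is an assumption, not verified against the WhisperKit source), the four values above describe two clips. Pairing such a list can be sketched as:

```shell
# Pair a flat list of clip timestamps into (start, end) clips,
# assuming alternating start/end values (an assumption, not verified
# against the WhisperKit source).
echo "0 30.5 60.0 90.5" | awk '{
  for (i = 1; i < NF; i += 2)
    printf "clip %d: %s -> %s\n", (i + 1) / 2, $i, $(i + 1)
}'
```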

Performance Tuning

swift run whisperkit-cli transcribe \
  --model-path "Models/whisperkit-coreml/openai_whisper-small" \
  --audio-path "audio.wav" \
  --audio-encoder-compute-units cpuAndGPU \
  --text-decoder-compute-units cpuAndGPU

Using Distil Models

swift run whisperkit-cli transcribe \
  --model "large-v3" \
  --model-prefix "distil" \
  --audio-path "audio.wav" \
  --verbose

Model Management

Downloading Models

# Download specific model
make download-model MODEL=large-v3

# Download all available models
make download-models

Model Locations

Downloaded models are stored in:
Models/whisperkit-coreml/openai_whisper-{MODEL_NAME}/

Supported audio formats: wav, mp3, m4a, flac, aiff, aac
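As a rough shell equivalent of what --audio-folder scans for, the supported extensions can be matched with find (the recordings/ folder name here is hypothetical):

```shell
# List all supported audio files under a folder, roughly what
# --audio-folder scans for (the "recordings" folder is hypothetical).
AUDIO_DIR="recordings"
mkdir -p "$AUDIO_DIR"
find "$AUDIO_DIR" -type f \( \
  -iname '*.wav'  -o -iname '*.mp3' -o -iname '*.m4a' -o \
  -iname '*.flac' -o -iname '*.aiff' -o -iname '*.aac' \) -print
```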

Progress Tracking

When using --verbose, the CLI displays:
  • Model loading time (encoder, decoder, tokenizer)
  • Real-time progress bar with ETA
  • Tokens per second
  • Real-time factor (transcription time / audio duration)
  • Speed factor (audio duration / transcription time, the inverse of the real-time factor)
[==========================] 100% | Elapsed Time: 12.45 s | Remaining: 0.00 s

Transcription Performance:
  - Tokens per second: 124.56
  - Real-time factor: 0.31
  - Speed factor: 3.22
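Both factors are simple ratios that can be recomputed by hand; a sketch using hypothetical numbers (12.45 s of processing for 40 s of audio):

```shell
# Recompute the performance ratios from a run's numbers.
# The values below are hypothetical, not from a real run.
ELAPSED=12.45        # seconds spent transcribing
AUDIO_DURATION=40.0  # seconds of input audio
awk -v e="$ELAPSED" -v d="$AUDIO_DURATION" 'BEGIN {
  rtf = e / d   # real-time factor: below 1.0 means faster than real time
  printf "Real-time factor: %.2f\n", rtf
  printf "Speed factor: %.2f\n", d / e
}'
```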

Output Formats

Console Output

By default, the CLI prints the transcribed text to stdout:
swift run whisperkit-cli transcribe --audio-path audio.wav
# Output: "This is the transcribed text."

Report Files

With the --report flag, the CLI generates:

SRT Subtitle Format

1
00:00:00,000 --> 00:00:03,450
This is the transcribed text.

2
00:00:03,450 --> 00:00:07,890
With timestamps for each segment.
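The comma-separated millisecond timestamps follow the standard SRT HH:MM:SS,mmm layout; converting a segment time in seconds to that layout can be sketched as:

```shell
# Convert a time in seconds to an SRT-style timestamp (HH:MM:SS,mmm).
to_srt() {
  awk -v t="$1" 'BEGIN {
    h  = int(t / 3600)
    m  = int((t % 3600) / 60)
    s  = int(t % 60)
    ms = int((t - int(t)) * 1000 + 0.5)
    printf "%02d:%02d:%02d,%03d\n", h, m, s, ms
  }'
}
to_srt 3.45   # prints 00:00:03,450
```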

JSON Metadata

{
  "text": "Complete transcription...",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 3.45,
      "text": "This is the transcribed text.",
      "tokens": [1234, 5678],
      "words": [
        {"word": "This", "start": 0.0, "end": 0.34},
        {"word": "is", "start": 0.34, "end": 0.56}
      ]
    }
  ],
  "timings": {
    "modelLoading": 2.34,
    "tokensPerSecond": 124.56,
    "realTimeFactor": 0.31
  }
}
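For quick scripting against the JSON report, a field can be pulled out with sed. This is a rough sketch only; a real JSON tool such as jq is more robust, and the inline string below stands in for a generated report file:

```shell
# Extract the top-level "text" field from a report (rough sed sketch;
# the inline JSON here stands in for a generated report file).
REPORT='{"text": "Complete transcription...", "segments": []}'
echo "$REPORT" | sed -n 's/.*"text": "\([^"]*\)".*/\1/p'
# prints: Complete transcription...
```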

Troubleshooting

  • Model not found: Download the model first:
      make download-model MODEL=large-v3
    then pass the full path:
      --model-path "Models/whisperkit-coreml/openai_whisper-large-v3"
  • Unsupported language: Check the supported languages in the error message, or see Constants.swift for valid codes.
  • Microphone access denied: Grant microphone access in System Settings → Privacy & Security → Microphone.
  • Slow or memory-constrained transcription:
      • Use smaller models (tiny, base)
      • Reduce the concurrent worker count: --concurrent-worker-count 1
      • Use CPU-only compute units: --audio-encoder-compute-units cpuOnly

Next Steps

  • Local Server: Run WhisperKit as an API server
  • Performance Optimization: Optimize transcription speed
