Installation
- Homebrew
- From Source
Available Commands
WhisperKit CLI provides three main commands:
- transcribe: Transcribe audio files or streams
- tts: Text-to-speech generation
- serve: Start a local server (requires BUILD_ALL=1)
Transcribe Command
Basic Usage
Transcribe an audio file:
Command-Line Options
Paths to audio files to transcribe
Path to folder containing audio files (will transcribe all supported formats)
Path to local model files
Model to download if no model path is provided (e.g., tiny, base, small, medium, large-v3)
Model variant prefix: openai or distil
Task to perform: transcribe or translate
Source language code (e.g., en, es, ja, zh)
Enable verbose output with progress tracking
Audio Processing Options
Sampling temperature (0.0-1.0). Higher values increase randomness.
Temperature increase on decoding failures
Number of times to increase temperature
Number of candidates when sampling with non-zero temperature (topK)
Prompt and Prefix Options
Text to condition the model on. Useful for guiding transcription style.
Force prefix text when decoding
Force initial prompt tokens based on language, task, and timestamp options
Use decoder prefill data for faster initial decoding
Timestamp Options
Add timestamps for each word in output
Force no timestamps when decoding
List of timestamps to split audio into segments
Quality Thresholds
Gzip compression ratio threshold for decoding failure
Average log probability threshold for decoding failure
Log probability threshold for first token decoding failure
Probability threshold to consider segment as silence
Performance Options
Compute units for audio encoder: all, cpuOnly, cpuAndGPU, cpuAndNeuralEngine
Compute units for text decoder: all, cpuOnly, cpuAndGPU, cpuAndNeuralEngine
Maximum concurrent inference workers (0 = unlimited)
Audio chunking strategy: none or vad (voice activity detection)
Streaming Options
Process audio directly from microphone in real-time
Simulate streaming transcription using input audio file
Output Options
Generate SRT and JSON report files
Directory to save reports
Skip special tokens in output
Usage Examples
Basic Transcription
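A minimal invocation, assuming whisperkit-cli is on your PATH and the flag names match the options listed above:

```
# Transcribe a local file with the small model (downloaded on first use)
whisperkit-cli transcribe --audio-path recording.m4a --model small
```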
Translation
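A sketch of translating non-English speech into English text, combining the task and language options described earlier:

```
# Translate Spanish speech to English text
whisperkit-cli transcribe --audio-path entrevista.wav --task translate --language es
```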
Streaming Transcription
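Live microphone transcription can be sketched as follows (the session prompts for microphone permission on first run):

```
# Transcribe live microphone input in real time
whisperkit-cli transcribe --stream --model base --verbose
```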
Word Timestamps
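Per-word timing, assuming the flag corresponding to the "Add timestamps for each word in output" option above is --word-timestamps:

```
# Emit per-word timestamps alongside the transcription
whisperkit-cli transcribe --audio-path talk.wav --word-timestamps
```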
Using Prompts
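A hedged sketch of conditioning the model on text to guide style and vocabulary; the flag name --prompt is an assumption here and may differ in your CLI version (check --help):

```
# Bias output toward domain vocabulary and spelling
whisperkit-cli transcribe --audio-path meeting.wav --prompt "WhisperKit, CoreML, Argmax"
```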
Clipping Audio
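A sketch of splitting audio with the timestamp-list option above; the flag name --clip-timestamps and its argument format (start/end values in seconds) are assumptions, so verify against --help:

```
# Transcribe only the portion between 30 s and 90 s
whisperkit-cli transcribe --audio-path lecture.wav --clip-timestamps 30 90
```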
Performance Tuning
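The compute-unit and concurrency options above can be combined; one possible tuning for Apple Silicon, as a sketch:

```
# Pin both models to CPU + Neural Engine and cap concurrency
whisperkit-cli transcribe --audio-path long.wav \
  --audio-encoder-compute-units cpuAndNeuralEngine \
  --text-decoder-compute-units cpuAndNeuralEngine \
  --concurrent-worker-count 4
```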
Using Distil Models
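Selecting a distil variant via the model-prefix option described above; the exact flag name --model-prefix is an assumption based on that option's description:

```
# Use a distil-whisper variant instead of the openai default
whisperkit-cli transcribe --audio-path podcast.mp3 --model-prefix distil --model large-v3
```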
Model Management
Downloading Models
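Per the options above, passing --model without --model-path triggers a download of that variant on first use; a sketch:

```
# Downloads large-v3 if it is not already cached, then transcribes
whisperkit-cli transcribe --audio-path test.wav --model large-v3 --verbose
```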
Model Locations
Downloaded models are stored in:
Supported audio formats: wav, mp3, m4a, flac, aiff, aac
Progress Tracking
When using --verbose, the CLI displays:
- Model loading time (encoder, decoder, tokenizer)
- Real-time progress bar with ETA
- Tokens per second
- Real-time factor (transcription time / audio duration)
- Speed factor (inverse of the real-time factor; values above 1 mean faster than real time)
Output Formats
Console Output
By default, the CLI prints the transcription text to stdout.
Report Files
With the --report flag, the CLI generates:
SRT Subtitle Format
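The SRT file follows the standard SubRip layout of numbered cues with start and end times; the content below is purely illustrative:

```
1
00:00:00,000 --> 00:00:04,000
Hello and welcome to the show.

2
00:00:04,000 --> 00:00:07,500
Today we are talking about on-device transcription.
```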
JSON Metadata
Troubleshooting
Model not found
Download the model first:
Then use the full path:
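Both steps in one sketch; the model directory in the second command is a placeholder for wherever your models were actually saved:

```
# 1. Trigger a download by model name
whisperkit-cli transcribe --audio-path check.wav --model tiny
# 2. Point at the downloaded files directly (placeholder path)
whisperkit-cli transcribe --audio-path check.wav --model-path /path/to/downloaded/model
```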
Invalid language code
Check supported languages in the error message or see Constants.swift for valid codes.
Microphone permission denied
Grant microphone access in System Settings → Privacy & Security → Microphone
Out of memory errors
- Use smaller models (tiny, base)
- Reduce concurrent worker count: --concurrent-worker-count 1
- Use CPU-only compute units: --audio-encoder-compute-units cpuOnly
Next Steps
Local Server
Run WhisperKit as an API server
Performance Optimization
Optimize transcription speed