WhisperKit provides a powerful command-line interface for transcribing audio files, streaming from the microphone, and testing models outside of Xcode.

Installation

brew install whisperkit-cli

Available Commands

WhisperKit CLI provides three main commands:

  • transcribe: Transcribe audio files or streams
  • tts: Text-to-speech generation
  • serve: Start a local server (requires BUILD_ALL=1)

Transcribe Command

Basic Usage

Transcribe an audio file:
swift run whisperkit-cli transcribe \
  --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" \
  --audio-path "audio.wav"

Command-Line Options

--audio-path (string[]): Paths to audio files to transcribe
--audio-folder (string): Path to a folder containing audio files (all supported formats will be transcribed)
--model-path (string): Path to local model files
--model (string): Model to download if no model-path is provided (e.g., tiny, base, small, medium, large-v3)
--model-prefix (string, default: "openai"): Model variant prefix: openai or distil
--task (string, default: "transcribe"): Task to perform: transcribe or translate
--language (string): Source language code (e.g., en, es, ja, zh)
--verbose (boolean): Enable verbose output with progress tracking

Audio Processing Options

--temperature (float, default: 0.0): Sampling temperature (0.0-1.0); higher values increase randomness
--temperature-increment-on-fallback (float, default: 0.2): Temperature increase on decoding failures
--temperature-fallback-count (int, default: 5): Number of times to increase the temperature
--best-of (int, default: 5): Number of candidates when sampling with non-zero temperature (topK)

Prompt and Prefix Options

--prompt (string): Text to condition the model on; useful for guiding transcription style
--prefix (string): Force prefix text when decoding
--use-prefill-prompt (boolean): Force initial prompt tokens based on language, task, and timestamp options
--use-prefill-cache (boolean): Use decoder prefill data for faster initial decoding

Timestamp Options

--word-timestamps (boolean): Add timestamps for each word in the output
--without-timestamps (boolean): Force no timestamps when decoding
--clip-timestamps (float[]): List of timestamps used to split the audio into segments

Quality Thresholds

--compression-ratio-threshold (float, default: 2.4): Gzip compression ratio threshold for decoding failure
--logprob-threshold (float, default: -1.0): Average log probability threshold for decoding failure
--first-token-logprob-threshold (float, default: -1.5): Log probability threshold for first-token decoding failure
--no-speech-threshold (float, default: 0.6): Probability threshold to consider a segment as silence

Performance Options

--audio-encoder-compute-units (string, default: "cpuAndNeuralEngine"): Compute units for the audio encoder: all, cpuOnly, cpuAndGPU, cpuAndNeuralEngine
--text-decoder-compute-units (string, default: "cpuAndNeuralEngine"): Compute units for the text decoder: all, cpuOnly, cpuAndGPU, cpuAndNeuralEngine
--concurrent-worker-count (int, default: 4): Maximum concurrent inference workers (0 = unlimited)
--chunking-strategy (string, default: "vad"): Audio chunking strategy: none or vad (voice activity detection)

Streaming Options

--stream (boolean): Process audio directly from the microphone in real time
--stream-simulated (boolean): Simulate streaming transcription using an input audio file

Output Options

--report (boolean): Generate SRT and JSON report files
--report-path (string, default: "."): Directory in which to save reports
--skip-special-tokens (boolean): Skip special tokens in the output

Usage Examples

Basic Transcription

swift run whisperkit-cli transcribe \
  --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" \
  --audio-path "audio.wav"

Translation

swift run whisperkit-cli transcribe \
  --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" \
  --audio-path "french_audio.wav" \
  --task "translate" \
  --verbose

Streaming Transcription

swift run whisperkit-cli transcribe \
  --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" \
  --stream

Word Timestamps

swift run whisperkit-cli transcribe \
  --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" \
  --audio-path "audio.wav" \
  --word-timestamps \
  --report \
  --report-path "outputs/"

Using Prompts

swift run whisperkit-cli transcribe \
  --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" \
  --audio-path "meeting.wav" \
  --prompt "This is a technical discussion about machine learning and neural networks." \
  --language "en"

Clipping Audio

swift run whisperkit-cli transcribe \
  --model-path "Models/whisperkit-coreml/openai_whisper-base" \
  --audio-path "long_audio.wav" \
  --clip-timestamps 0 30.5 60.0 90.5 \
  --verbose
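Assuming the timestamp list pairs up as alternating start/end times (as in the original Whisper CLI's clip_timestamps option; this pairing is an assumption, not verified against the WhisperKit source), the four values above describe two clips. Pairing such a list can be sketched as:

```shell
# Pair a flat list of clip timestamps into (start, end) clips,
# assuming alternating start/end values (an assumption, not verified
# against the WhisperKit source).
echo "0 30.5 60.0 90.5" | awk '{
  for (i = 1; i < NF; i += 2)
    printf "clip %d: %s -> %s\n", (i + 1) / 2, $i, $(i + 1)
}'
```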

Performance Tuning

swift run whisperkit-cli transcribe \
  --model-path "Models/whisperkit-coreml/openai_whisper-small" \
  --audio-path "audio.wav" \
  --audio-encoder-compute-units cpuAndGPU \
  --text-decoder-compute-units cpuAndGPU

Using Distil Models

swift run whisperkit-cli transcribe \
  --model "large-v3" \
  --model-prefix "distil" \
  --audio-path "audio.wav" \
  --verbose

Model Management

Downloading Models

# Download specific model
make download-model MODEL=large-v3

# Download all available models
make download-models

Model Locations

Downloaded models are stored in:
Models/whisperkit-coreml/openai_whisper-{MODEL_NAME}/

Supported audio formats: wav, mp3, m4a, flac, aiff, aac
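As a rough shell equivalent of what --audio-folder scans for, the supported extensions can be matched with find (the recordings/ folder name here is hypothetical):

```shell
# List all supported audio files under a folder, roughly what
# --audio-folder scans for (the "recordings" folder is hypothetical).
AUDIO_DIR="recordings"
mkdir -p "$AUDIO_DIR"
find "$AUDIO_DIR" -type f \( \
  -iname '*.wav'  -o -iname '*.mp3' -o -iname '*.m4a' -o \
  -iname '*.flac' -o -iname '*.aiff' -o -iname '*.aac' \) -print
```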

Progress Tracking

When using --verbose, the CLI displays:
  • Model loading time (encoder, decoder, tokenizer)
  • Real-time progress bar with ETA
  • Tokens per second
  • Real-time factor (transcription time / audio duration)
  • Speed factor (audio duration / transcription time, the inverse of the real-time factor)
[==========================] 100% | Elapsed Time: 12.45 s | Remaining: 0.00 s

Transcription Performance:
  - Tokens per second: 124.56
  - Real-time factor: 0.31
  - Speed factor: 3.22
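Both factors are simple ratios that can be recomputed by hand; a sketch using hypothetical numbers (12.45 s of processing for 40 s of audio):

```shell
# Recompute the performance ratios from a run's numbers.
# The values below are hypothetical, not from a real run.
ELAPSED=12.45        # seconds spent transcribing
AUDIO_DURATION=40.0  # seconds of input audio
awk -v e="$ELAPSED" -v d="$AUDIO_DURATION" 'BEGIN {
  rtf = e / d   # real-time factor: below 1.0 means faster than real time
  printf "Real-time factor: %.2f\n", rtf
  printf "Speed factor: %.2f\n", d / e
}'
```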

Output Formats

Console Output

By default, the CLI prints the transcribed text to stdout:
swift run whisperkit-cli transcribe --audio-path audio.wav
# Output: "This is the transcribed text."

Report Files

With the --report flag, the CLI generates:

SRT Subtitle Format

1
00:00:00,000 --> 00:00:03,450
This is the transcribed text.

2
00:00:03,450 --> 00:00:07,890
With timestamps for each segment.
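The comma-separated millisecond timestamps follow the standard SRT HH:MM:SS,mmm layout; converting a segment time in seconds to that layout can be sketched as:

```shell
# Convert a time in seconds to an SRT-style timestamp (HH:MM:SS,mmm).
to_srt() {
  awk -v t="$1" 'BEGIN {
    h  = int(t / 3600)
    m  = int((t % 3600) / 60)
    s  = int(t % 60)
    ms = int((t - int(t)) * 1000 + 0.5)
    printf "%02d:%02d:%02d,%03d\n", h, m, s, ms
  }'
}
to_srt 3.45   # prints 00:00:03,450
```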

JSON Metadata

{
  "text": "Complete transcription...",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 3.45,
      "text": "This is the transcribed text.",
      "tokens": [1234, 5678],
      "words": [
        {"word": "This", "start": 0.0, "end": 0.34},
        {"word": "is", "start": 0.34, "end": 0.56}
      ]
    }
  ],
  "timings": {
    "modelLoading": 2.34,
    "tokensPerSecond": 124.56,
    "realTimeFactor": 0.31
  }
}
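For quick scripting against the JSON report, a field can be pulled out with sed. This is a rough sketch only; a real JSON tool such as jq is more robust, and the inline string below stands in for a generated report file:

```shell
# Extract the top-level "text" field from a report (rough sed sketch;
# the inline JSON here stands in for a generated report file).
REPORT='{"text": "Complete transcription...", "segments": []}'
echo "$REPORT" | sed -n 's/.*"text": "\([^"]*\)".*/\1/p'
# prints: Complete transcription...
```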

Troubleshooting

  • Model not found: Download the model first:
      make download-model MODEL=large-v3
    then pass the full path:
      --model-path "Models/whisperkit-coreml/openai_whisper-large-v3"
  • Unsupported language: Check the supported languages in the error message, or see Constants.swift for valid codes.
  • Microphone access denied: Grant microphone access in System Settings → Privacy & Security → Microphone.
  • Slow or memory-constrained transcription:
      • Use smaller models (tiny, base)
      • Reduce the concurrent worker count: --concurrent-worker-count 1
      • Use CPU-only compute units: --audio-encoder-compute-units cpuOnly

Next Steps

  • Local Server: Run WhisperKit as an API server
  • Performance Optimization: Optimize transcription speed
