
General

WhisperKit is a framework for deploying state-of-the-art speech-to-text systems (like OpenAI’s Whisper) on Apple devices. It provides on-device transcription with features like:
  • Real-time streaming transcription
  • Word-level timestamps
  • Voice activity detection
  • Multiple language support
  • Custom model deployment
All processing happens on-device using Apple’s CoreML and Neural Engine.
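A minimal end-to-end sketch, using the same WhisperKit initializer and transcribe(audioPath:) call that appear in the examples further down this page (audio.wav is a placeholder path):

```swift
import WhisperKit

// Minimal sketch: one-shot file transcription on-device.
// The default initializer selects a model and downloads it on
// first run; weights are cached locally afterwards.
let pipe = try await WhisperKit()
let result = try await pipe.transcribe(audioPath: "audio.wav")
print(result.text)
```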
TTSKit is an on-device text-to-speech framework that runs Qwen3 TTS models entirely on Apple Silicon. It offers:
  • Real-time streaming playback
  • Multiple voices and languages
  • Style control (1.7B model)
  • No server required
  • macOS and iOS support
WhisperKit is ideal for getting started with on-device speech-to-text; it's free and open-source. Argmax Pro SDK is designed for production deployments requiring:
  • 9x faster transcription with Nvidia Parakeet V3
  • Real-time speaker diarization
  • Deepgram-compatible WebSocket server
  • Commercial support and SLAs
Start with a 14-day trial when you’re ready to scale.
WhisperKit is free for both commercial and personal use: it's released under the MIT License. See the License for details.
WhisperKit does not need an internet connection to transcribe. After downloading models (one-time), all processing happens on-device. However:
  • Initial model download requires internet
  • Model updates require internet
  • Benchmark result uploads require internet (optional)

Installation & Setup

WhisperKit:
  • macOS 14.0+ or iOS 16.0+
  • Apple Silicon (M1/M2/M3/M4) or A12 Bionic+
  • Xcode 16.0+
TTSKit:
  • macOS 15.0+ or iOS 18.0+
  • Apple Silicon required
See Supported Devices for details.
Intel Macs are not supported. WhisperKit requires the Apple Neural Engine, which on Macs is only available on Apple Silicon (M-series) chips.
The iOS Simulator doesn't support the Neural Engine, so you must test on physical devices or use Mac Catalyst.
Use Swift Package Manager:
  1. In Xcode: File > Add Package Dependencies
  2. Enter: https://github.com/argmaxinc/whisperkit
  3. Select WhisperKit and/or TTSKit products
Or via command line:
brew install whisperkit-cli
See Installation for detailed instructions.
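For projects managed outside Xcode, the same dependency can be declared in Package.swift. This is a sketch; the version requirement and target name below are illustrative, not official pins:

```swift
// Package.swift fragment (version requirement is illustrative)
dependencies: [
    .package(url: "https://github.com/argmaxinc/whisperkit", from: "0.9.0"),
],
targets: [
    .executableTarget(
        name: "MyApp",
        dependencies: [
            .product(name: "WhisperKit", package: "whisperkit")
        ]
    )
]
```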
Model sizes vary:
Model            Size
tiny             ~40 MB
base             ~75 MB
small            ~250 MB
medium           ~800 MB
large-v3         ~1.6 GB
distil-large-v3  ~800 MB
Models are cached locally after first download.

Models & Performance

It depends on your device and requirements.
For real-time performance:
  • iPhone 15 Pro: medium or smaller
  • iPhone 14 Pro: small or smaller
  • M1 Mac+: All models including large-v3
For best accuracy:
  • Use large-v3 or distil-large-v3
For fastest inference:
  • Use tiny or base
See Model Catalog for detailed comparisons.
Distilled models (e.g., distil-large-v3) are smaller, faster versions that maintain most of the accuracy of larger models. They're created through knowledge distillation.
Benefits:
  • 50% smaller than full models
  • 2-3x faster inference
  • ~95% of original accuracy
  • Better for resource-constrained devices
Custom and fine-tuned models are supported. Use whisperkittools to:
  1. Fine-tune Whisper on your data
  2. Convert to CoreML format
  3. Upload to HuggingFace
  4. Load in WhisperKit:
let config = WhisperKitConfig(
    model: "large-v3",
    modelRepo: "username/your-model-repo"
)
let pipe = try await WhisperKit(config)
Performance varies by device and model. Real-time means a Real-Time Factor (RTF, processing time divided by audio duration) below 1.0.
Example RTFs on iPhone 15 Pro:
  • tiny: ~0.1 (10x faster than real-time)
  • small: ~0.3
  • medium: ~0.7
  • large-v3: ~1.5 (not real-time)
Check Benchmarks for your device.
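Since RTF is just processing time divided by audio duration, it's easy to measure for your own workload. A minimal sketch:

```swift
// Real-Time Factor = processing time / audio duration.
// RTF < 1.0 means transcription keeps up with live audio.
let processingTime = 6.0  // seconds spent transcribing (measured)
let audioDuration = 60.0  // seconds of audio
let rtf = processingTime / audioDuration
print(rtf)  // 0.1 → 10x faster than real-time
```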
WhisperKit uses the same Whisper models as many cloud services. Accuracy is comparable to:
  • OpenAI Whisper API (same models)
  • Deepgram (similar performance)
  • AssemblyAI (competitive)
Advantages of on-device:
  • No network latency
  • Complete privacy
  • Works offline
  • No per-minute costs

Features & Usage

WhisperKit supports:
  • WAV (recommended)
  • MP3
  • M4A
  • FLAC
  • Raw audio buffers
  • Microphone input
Audio is automatically resampled to 16kHz mono for processing.
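Files aren't the only input: a raw buffer can be passed directly. This sketch assumes an audioArray overload of transcribe alongside the audioPath variant used elsewhere on this page, with samples supplied as 16 kHz mono Float values:

```swift
import WhisperKit

// Sketch (assumed API): transcribe a raw PCM buffer directly.
// `loadSamples()` is a hypothetical decoder returning [Float]
// at 16 kHz mono.
let decodedSamples: [Float] = loadSamples()
let pipe = try await WhisperKit()
let result = try await pipe.transcribe(audioArray: decodedSamples)
print(result.text)
```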
Configure DecodingOptions with word timestamps:
var options = DecodingOptions()
options.wordTimestamps = true

let result = try await pipe.transcribe(
    audioPath: "audio.wav",
    decodeOptions: options
)

// Access timestamps
for segment in result.segments {
    for word in segment.words {
        print("\(word.word): \(word.start) - \(word.end)")
    }
}
WhisperKit supports 99+ languages. Specify the language:
var options = DecodingOptions()
options.language = "es" // Spanish

let result = try await pipe.transcribe(
    audioPath: "audio.wav",
    decodeOptions: options
)
Or let WhisperKit auto-detect the language by not specifying one.
Use the streaming API with microphone input:
let pipe = try await WhisperKit()

// Start streaming
try await pipe.startRecording()

// Receive callbacks with transcription updates
pipe.onTranscriptionUpdate = { result in
    print(result.text)
}

// Stop when done
try await pipe.stopRecording()
See Streaming Guide for details.
Basic speaker detection is available in WhisperKit, but for production-grade speaker diarization, use Argmax Pro SDK which includes pyannoteAI’s flagship model.
WhisperKit can translate speech into English. Use the translation task:
var options = DecodingOptions()
options.task = .translate

let result = try await pipe.transcribe(
    audioPath: "spanish_audio.wav",
    decodeOptions: options
)
// Result is in English

TTSKit Specific

0.6B Model:
  • Runs on macOS and iOS
  • ~1 GB download
  • Fast inference
  • 9 voices, 10 languages
1.7B Model:
  • macOS only
  • ~2.2 GB download
  • Higher quality
  • Supports style instructions
  • Same voices and languages
TTSKit includes 9 built-in voices:
  • .ryan - Male, clear and professional
  • .aiden - Male, warm and friendly
  • .onoAnna - Female, bright and energetic
  • .sohee - Female, calm and soothing
  • .eric - Male, deep and authoritative
  • .dylan - Male, young and casual
  • .serena - Female, elegant and refined
  • .vivian - Female, confident and dynamic
  • .uncleFu - Male, wise and mature
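This page doesn't show the voice-selection API itself. Assuming GenerationOptions exposes a voice field taking the identifiers above (an assumption, patterned after the instruction field shown in the style example), selecting a voice might look like:

```swift
// Hypothetical sketch: `.qwen3TTS_0_6b` and the `voice` field are
// assumed names, mirroring the 1.7B style example on this page.
let tts = try await TTSKit(TTSKitConfig(model: .qwen3TTS_0_6b))

var options = GenerationOptions()
options.voice = .serena  // Female, elegant and refined

let result = try await tts.generate(
    text: "Hello from TTSKit.",
    options: options
)
```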
Style instructions are only supported on the 1.7B model:
let tts = try await TTSKit(TTSKitConfig(model: .qwen3TTS_1_7b))

var options = GenerationOptions()
options.instruction = "Speak slowly and warmly, like a storyteller."

let result = try await tts.generate(
    text: "Once upon a time...",
    options: options
)
For real-time streaming playback, use the play method:
try await tts.play(text: "This starts playing immediately.")
Audio begins playing before generation completes.

Local Server

whisperkit-cli can run a local HTTP server that implements the OpenAI Audio API, allowing you to use OpenAI SDK clients with WhisperKit:
BUILD_ALL=1 swift run whisperkit-cli serve
Compatible with OpenAI Python SDK and other clients.
Use the OpenAI Python SDK:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:50060/v1")

result = client.audio.transcriptions.create(
    file=open("audio.wav", "rb"),
    model="tiny"
)
print(result.text)
See Local Server Examples for more.
Streaming transcription responses are supported. Use the stream=True parameter:
with client.audio.transcriptions.create(
    file=open("audio.wav", "rb"),
    model="tiny",
    stream=True
) as stream:
    for chunk in stream:
        print(chunk.text, end="", flush=True)

Troubleshooting

If model downloads fail, try these common solutions:
  1. Install git-lfs:
    brew install git-lfs
    git lfs install
    
  2. Check disk space - Ensure sufficient storage for models
  3. Try a smaller model - Start with tiny to verify setup
  4. Clear cache - Delete ~/.cache/whisperkit/ and retry
If transcription is slow, try these optimizations:
  1. Use a smaller model - Switch from large to medium/small
  2. Use distilled models - Try distil-large-v3
  3. Adjust compute units - Configure CoreML compute units
  4. Check thermal throttling - Device may be overheating
  5. Reduce precision - Use quantized models if available
See Performance Guide for details.
To improve transcription accuracy:
  1. Use a larger model - large-v3 is most accurate
  2. Specify the language - Don’t rely on auto-detection for best results
  3. Provide context - Use prompt parameter for domain-specific content
  4. Check audio quality - Ensure clear audio, low background noise
  5. Adjust VAD settings - Fine-tune voice activity detection
If you hit build errors, common fixes:
  1. Update Xcode - Ensure Xcode 16.0+
  2. Clean build folder - ⌘⇧K in Xcode
  3. Reset package cache - File > Packages > Reset Package Caches
  4. Check deployment target - macOS 14.0+, iOS 16.0+
  5. Update dependencies - File > Packages > Update to Latest Package Versions
If transcription fails or crashes on device, check:
  1. Memory usage - Large models may exceed device memory
  2. iOS version - Ensure iOS 16.0+ (18.0+ for TTSKit)
  3. Model size - Use smaller model for older devices
  4. Background processing - Check app lifecycle handling
  5. Permissions - Verify microphone permissions

Support & Community

Support is available through multiple channels, including GitHub issues and the community Discord.
We welcome contributions!
  • Fix bugs and add features
  • Improve documentation
  • Submit benchmark results
  • Share example projects
See Contributing Guide to get started.
WhisperKit can be used in commercial apps: it's released under the MIT License, which permits commercial use without restrictions.
Open an issue on GitHub:
  1. Check if the issue already exists
  2. Include:
    • Device and OS version
    • WhisperKit version
    • Steps to reproduce
    • Sample code if possible
  3. Add relevant logs

Next Steps

Quick Start

Get started with WhisperKit

Model Catalog

Explore available models

Guides

Learn advanced features

Join Discord

Get help from the community
