WhisperKit is an Argmax framework for deploying state-of-the-art speech-to-text systems on Apple devices with advanced features like real-time streaming, word timestamps, voice activity detection, and more.

What is WhisperKit?

The WhisperKit package ships two frameworks that bring speech processing directly to Apple devices:
  • WhisperKit: Speech-to-text powered by OpenAI’s Whisper models, optimized for on-device inference
  • TTSKit: Text-to-speech using Qwen3 TTS models for natural-sounding voice synthesis
Both frameworks run entirely on-device using Core ML, ensuring privacy and low latency without requiring network connectivity.

Key Features

WhisperKit (Speech Recognition)

  • Real-time streaming transcription with live audio input
  • Word-level timestamps for precise alignment
  • Voice activity detection (VAD) to optimize processing
  • Multilingual support with automatic language detection
  • Flexible model selection from tiny to large-v3 variants
  • Custom model deployment via HuggingFace integration
  • OpenAI-compatible API server for easy integration
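The features above come together in a short transcription loop. The sketch below assumes the `WhisperKit(model:)` convenience initializer and the `transcribe(audioPath:)` method; exact signatures vary between WhisperKit versions, so treat this as an illustration rather than a reference.

```swift
import WhisperKit

Task {
    // Downloads (if needed) and loads the named model variant.
    // The initializer label is an assumption; some versions take
    // a WhisperKitConfig instead.
    let pipe = try await WhisperKit(model: "base")

    // Transcribe a local audio file. Results carry segment text,
    // and word-level timestamps when enabled via DecodingOptions.
    let results = try await pipe.transcribe(audioPath: "path/to/audio.wav")
    print(results.map(\.text).joined(separator: " "))
}
```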

TTSKit (Text-to-Speech)

  • Real-time streaming playback as audio generates
  • 9 built-in voices with natural prosody
  • Support for 10 languages, including English, Chinese, Japanese, and Korean
  • Style instructions for controlling speech characteristics (1.7B model)
  • Automatic text chunking for long-form content
  • Concurrent generation for improved throughput
  • Prompt caching for 90% faster subsequent generations
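A typical TTSKit call might look like the following. This is a hypothetical sketch: the `TTSKit` type, the `model:` label, and the `speak(_:)` method are assumed names for illustration, not the verified API.

```swift
import TTSKit

Task {
    // Hypothetical initializer -- the model identifiers mirror the
    // variants documented below, but the API surface is assumed.
    let tts = try await TTSKit(model: .qwen3TTS_0_6b)

    // Long text is chunked automatically; audio can stream to the
    // speaker as it is generated rather than waiting for the full clip.
    try await tts.speak("Hello from on-device text-to-speech.")
}
```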

Use Cases

Live Transcription

Real-time meeting transcription, lecture notes, accessibility features

Voice Commands

On-device voice control with low latency and full privacy

Content Creation

Generate narration, audiobooks, and voiceovers without cloud services

Accessibility

Text-to-speech for screen readers, speech-to-text for hearing assistance

Platform Support

WhisperKit

  • iOS 16.0+
  • macOS 13.0+
  • watchOS 10.0+
  • visionOS 1.0+

TTSKit

  • iOS 18.0+
  • macOS 15.0+

Architecture

WhisperKit uses a modular, protocol-based architecture where each component can be swapped:
```swift
public var audioProcessor: any AudioProcessing
public var featureExtractor: any FeatureExtracting
public var audioEncoder: any AudioEncoding
public var textDecoder: any TextDecoding
public var segmentSeeker: any SegmentSeeking
```
This design allows you to customize individual components while maintaining compatibility with the rest of the pipeline.
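As a sketch of that customization point, the example below subclasses the framework's default `AudioProcessor` and swaps it in. The exact `AudioProcessing` requirements and whether `AudioProcessor` is open for subclassing are assumptions; check the protocol definitions in your WhisperKit version.

```swift
import WhisperKit

// Assumed to subclass the default processor; reimplementing the
// AudioProcessing protocol from scratch is the alternative.
final class LoggingAudioProcessor: AudioProcessor {
    // Override the relevant requirements to observe or modify
    // audio buffers before feature extraction.
}

Task {
    let pipe = try await WhisperKit()
    // The pipeline properties are public vars, so a conforming
    // component can be assigned after initialization.
    pipe.audioProcessor = LoggingAudioProcessor()
}
```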

Model Variants

WhisperKit Models

Models are automatically recommended based on your device capabilities:
  • tiny - Fastest, lowest accuracy (~150 MB)
  • base - Good balance for mobile devices (~290 MB)
  • small - Better accuracy, still fast (~967 MB)
  • medium - High accuracy (~3.1 GB)
  • large-v3 - Best accuracy (~6.2 GB)
  • distil-large-v3 - Distilled version with similar accuracy to large-v3 but faster
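Selecting a variant can be left to the framework or pinned explicitly. The `WhisperKit.recommendedModels()` helper appears in recent versions, but its return shape and the `model:` initializer label are assumptions here.

```swift
import WhisperKit

Task {
    // Ask the framework which variant suits this device...
    let recommendation = await WhisperKit.recommendedModels()
    print("Recommended:", recommendation.default)

    // ...or pin a specific variant by name.
    let pipe = try await WhisperKit(model: "distil-large-v3")
    _ = pipe
}
```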

TTSKit Models

  • qwen3TTS_0_6b - Fast, runs on all platforms (~1 GB)
  • qwen3TTS_1_7b - Higher quality, macOS only, supports style instructions (~2.2 GB)

Performance

WhisperKit is optimized for Apple Silicon with:
  • Neural Engine acceleration for maximum performance
  • Automatic compute unit selection based on device capabilities
  • Model specialization for reduced latency on first run
  • Efficient memory usage with streaming processing
For production deployments requiring real-time transcription and speaker diarization at scale, consider Argmax Pro SDK with 9x faster models and advanced features.
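Compute unit selection can also be steered manually when the automatic choice isn't what you want. `ModelComputeOptions` and its parameter labels below are drawn from WhisperKit's public API but may differ by version; treat them as assumptions.

```swift
import WhisperKit

// Prefer the Neural Engine for both pipeline stages; Core ML
// falls back to CPU where an op is unsupported.
let compute = ModelComputeOptions(
    audioEncoderCompute: .cpuAndNeuralEngine,
    textDecoderCompute: .cpuAndNeuralEngine
)

Task {
    // The computeOptions: label is an assumption; some versions
    // take these options via WhisperKitConfig instead.
    let pipe = try await WhisperKit(computeOptions: compute)
    _ = pipe
}
```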

Open Source

WhisperKit is open source under the MIT License.

Next Steps

Installation

Add WhisperKit to your Xcode project

Quick Start

Get started with your first transcription and TTS generation
