WhisperKit is an Argmax framework for deploying state-of-the-art speech-to-text systems on Apple devices with advanced features like real-time streaming, word timestamps, voice activity detection, and more.

What is WhisperKit?

The WhisperKit package ships two frameworks that bring speech processing directly to Apple devices:
  • WhisperKit: Speech-to-text powered by OpenAI’s Whisper models, optimized for on-device inference
  • TTSKit: Text-to-speech using Qwen3 TTS models for natural-sounding voice synthesis
Both frameworks run entirely on-device using Core ML, ensuring privacy and low latency without requiring network connectivity.

Key Features

WhisperKit (Speech Recognition)

  • Real-time streaming transcription with live audio input
  • Word-level timestamps for precise alignment
  • Voice activity detection (VAD) to optimize processing
  • Multilingual support with automatic language detection
  • Flexible model selection from tiny to large-v3 variants
  • Custom model deployment via HuggingFace integration
  • OpenAI-compatible API server for easy integration
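The features above come together in a short transcription loop. The sketch below assumes the `WhisperKit(model:)` convenience initializer and the `transcribe(audioPath:)` method; exact signatures vary between WhisperKit versions, so treat this as an illustration rather than a reference.

```swift
import WhisperKit

Task {
    // Downloads (if needed) and loads the named model variant.
    // The initializer label is an assumption; some versions take
    // a WhisperKitConfig instead.
    let pipe = try await WhisperKit(model: "base")

    // Transcribe a local audio file. Results carry segment text,
    // and word-level timestamps when enabled via DecodingOptions.
    let results = try await pipe.transcribe(audioPath: "path/to/audio.wav")
    print(results.map(\.text).joined(separator: " "))
}
```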

TTSKit (Text-to-Speech)

  • Real-time streaming playback as audio generates
  • 9 built-in voices with natural prosody
  • Support for 10 languages, including English, Chinese, Japanese, and Korean
  • Style instructions for controlling speech characteristics (1.7B model)
  • Automatic text chunking for long-form content
  • Concurrent generation for improved throughput
  • Prompt caching for 90% faster subsequent generations
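A typical TTSKit call might look like the following. This is a hypothetical sketch: the `TTSKit` type, the `model:` label, and the `speak(_:)` method are assumed names for illustration, not the verified API.

```swift
import TTSKit

Task {
    // Hypothetical initializer -- the model identifiers mirror the
    // variants documented below, but the API surface is assumed.
    let tts = try await TTSKit(model: .qwen3TTS_0_6b)

    // Long text is chunked automatically; audio can stream to the
    // speaker as it is generated rather than waiting for the full clip.
    try await tts.speak("Hello from on-device text-to-speech.")
}
```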

Use Cases

Live Transcription

Real-time meeting transcription, lecture notes, accessibility features

Voice Commands

On-device voice control with low latency and full privacy

Content Creation

Generate narration, audiobooks, and voiceovers without cloud services

Accessibility

Text-to-speech for screen readers, speech-to-text for hearing assistance

Platform Support

WhisperKit

  • iOS 16.0+
  • macOS 13.0+
  • watchOS 10.0+
  • visionOS 1.0+

TTSKit

  • iOS 18.0+
  • macOS 15.0+

Architecture

WhisperKit uses a modular, protocol-based architecture where each component can be swapped:
```swift
public var audioProcessor: any AudioProcessing
public var featureExtractor: any FeatureExtracting
public var audioEncoder: any AudioEncoding
public var textDecoder: any TextDecoding
public var segmentSeeker: any SegmentSeeking
```
This design allows you to customize individual components while maintaining compatibility with the rest of the pipeline.
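As a sketch of that customization point, the example below subclasses the framework's default `AudioProcessor` and swaps it in. The exact `AudioProcessing` requirements and whether `AudioProcessor` is open for subclassing are assumptions; check the protocol definitions in your WhisperKit version.

```swift
import WhisperKit

// Assumed to subclass the default processor; reimplementing the
// AudioProcessing protocol from scratch is the alternative.
final class LoggingAudioProcessor: AudioProcessor {
    // Override the relevant requirements to observe or modify
    // audio buffers before feature extraction.
}

Task {
    let pipe = try await WhisperKit()
    // The pipeline properties are public vars, so a conforming
    // component can be assigned after initialization.
    pipe.audioProcessor = LoggingAudioProcessor()
}
```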

Model Variants

WhisperKit Models

Models are automatically recommended based on your device capabilities:
  • tiny - Fastest, lowest accuracy (~150 MB)
  • base - Good balance for mobile devices (~290 MB)
  • small - Better accuracy, still fast (~967 MB)
  • medium - High accuracy (~3.1 GB)
  • large-v3 - Best accuracy (~6.2 GB)
  • distil-large-v3 - Distilled version with similar accuracy to large-v3 but faster
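Selecting a variant can be left to the framework or pinned explicitly. The `WhisperKit.recommendedModels()` helper appears in recent versions, but its return shape and the `model:` initializer label are assumptions here.

```swift
import WhisperKit

Task {
    // Ask the framework which variant suits this device...
    let recommendation = await WhisperKit.recommendedModels()
    print("Recommended:", recommendation.default)

    // ...or pin a specific variant by name.
    let pipe = try await WhisperKit(model: "distil-large-v3")
    _ = pipe
}
```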

TTSKit Models

  • qwen3TTS_0_6b - Fast, runs on all platforms (~1 GB)
  • qwen3TTS_1_7b - Higher quality, macOS only, supports style instructions (~2.2 GB)

Performance

WhisperKit is optimized for Apple Silicon with:
  • Neural Engine acceleration for maximum performance
  • Automatic compute unit selection based on device capabilities
  • Model specialization for reduced latency on first run
  • Efficient memory usage with streaming processing
For production deployments requiring real-time transcription and speaker diarization at scale, consider Argmax Pro SDK with 9x faster models and advanced features.
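Compute unit selection can also be steered manually when the automatic choice isn't what you want. `ModelComputeOptions` and its parameter labels below are drawn from WhisperKit's public API but may differ by version; treat them as assumptions.

```swift
import WhisperKit

// Prefer the Neural Engine for both pipeline stages; Core ML
// falls back to CPU where an op is unsupported.
let compute = ModelComputeOptions(
    audioEncoderCompute: .cpuAndNeuralEngine,
    textDecoderCompute: .cpuAndNeuralEngine
)

Task {
    // The computeOptions: label is an assumption; some versions
    // take these options via WhisperKitConfig instead.
    let pipe = try await WhisperKit(computeOptions: compute)
    _ = pipe
}
```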

Open Source

WhisperKit is open source under the MIT License.

Next Steps

Installation

Add WhisperKit to your Xcode project

Quick Start

Get started with your first transcription and TTS generation
