
General

WhisperKit is a framework for deploying state-of-the-art speech-to-text systems (like OpenAI’s Whisper) on Apple devices. It provides on-device transcription with features like:
  • Real-time streaming transcription
  • Word-level timestamps
  • Voice activity detection
  • Multiple language support
  • Custom model deployment
All processing happens on-device using Apple’s CoreML and Neural Engine.
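A minimal end-to-end sketch, using the same WhisperKit initializer and transcribe(audioPath:) call that appear in the examples further down this page (audio.wav is a placeholder path):

```swift
import WhisperKit

// Minimal sketch: one-shot file transcription on-device.
// The default initializer selects a model and downloads it on
// first run; weights are cached locally afterwards.
let pipe = try await WhisperKit()
let result = try await pipe.transcribe(audioPath: "audio.wav")
print(result.text)
```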
TTSKit is an on-device text-to-speech framework that runs Qwen3 TTS models entirely on Apple Silicon. It offers:
  • Real-time streaming playback
  • Multiple voices and languages
  • Style control (1.7B model)
  • No server required
  • macOS and iOS support
WhisperKit is ideal for getting started with on-device speech-to-text; it's free and open-source. Argmax Pro SDK is designed for production deployments requiring:
  • 9x faster transcription with Nvidia Parakeet V3
  • Real-time speaker diarization
  • Deepgram-compatible WebSocket server
  • Commercial support and SLAs
Start with a 14-day trial when you’re ready to scale.
WhisperKit is free for both commercial and personal use: it's released under the MIT License. See the License for details.
WhisperKit does not need an internet connection to transcribe. After downloading models (one-time), all processing happens on-device. However:
  • Initial model download requires internet
  • Model updates require internet
  • Benchmark result uploads require internet (optional)

Installation & Setup

WhisperKit:
  • macOS 14.0+ or iOS 16.0+
  • Apple Silicon (M1/M2/M3/M4) or A12 Bionic+
  • Xcode 16.0+
TTSKit:
  • macOS 15.0+ or iOS 18.0+
  • Apple Silicon required
See Supported Devices for details.
Intel Macs are not supported. WhisperKit requires the Apple Neural Engine, which on Macs is only available on Apple Silicon (M-series) chips.
The iOS Simulator doesn't support the Neural Engine, so you must test on physical devices or use Mac Catalyst.
Use Swift Package Manager:
  1. In Xcode: File > Add Package Dependencies
  2. Enter: https://github.com/argmaxinc/whisperkit
  3. Select WhisperKit and/or TTSKit products
Or via command line:
brew install whisperkit-cli
See Installation for detailed instructions.
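For projects managed outside Xcode, the same dependency can be declared in Package.swift. This is a sketch; the version requirement and target name below are illustrative, not official pins:

```swift
// Package.swift fragment (version requirement is illustrative)
dependencies: [
    .package(url: "https://github.com/argmaxinc/whisperkit", from: "0.9.0"),
],
targets: [
    .executableTarget(
        name: "MyApp",
        dependencies: [
            .product(name: "WhisperKit", package: "whisperkit")
        ]
    )
]
```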
Model sizes vary:
Model            Size
tiny             ~40 MB
base             ~75 MB
small            ~250 MB
medium           ~800 MB
large-v3         ~1.6 GB
distil-large-v3  ~800 MB
Models are cached locally after first download.

Models & Performance

It depends on your device and requirements.
For real-time performance:
  • iPhone 15 Pro: medium or smaller
  • iPhone 14 Pro: small or smaller
  • M1 Mac+: All models including large-v3
For best accuracy:
  • Use large-v3 or distil-large-v3
For fastest inference:
  • Use tiny or base
See Model Catalog for detailed comparisons.
Distilled models (e.g., distil-large-v3) are smaller, faster versions that maintain most of the accuracy of larger models. They're created through knowledge distillation.
Benefits:
  • 50% smaller than full models
  • 2-3x faster inference
  • ~95% of original accuracy
  • Better for resource-constrained devices
Custom and fine-tuned models are supported. Use whisperkittools to:
  1. Fine-tune Whisper on your data
  2. Convert to CoreML format
  3. Upload to HuggingFace
  4. Load in WhisperKit:
let config = WhisperKitConfig(
    model: "large-v3",
    modelRepo: "username/your-model-repo"
)
let pipe = try await WhisperKit(config)
Performance varies by device and model. Real-time means a Real-Time Factor (RTF, processing time divided by audio duration) below 1.0.
Example RTFs on iPhone 15 Pro:
  • tiny: ~0.1 (10x faster than real-time)
  • small: ~0.3
  • medium: ~0.7
  • large-v3: ~1.5 (not real-time)
Check Benchmarks for your device.
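Since RTF is just processing time divided by audio duration, it's easy to measure for your own workload. A minimal sketch:

```swift
// Real-Time Factor = processing time / audio duration.
// RTF < 1.0 means transcription keeps up with live audio.
let processingTime = 6.0  // seconds spent transcribing (measured)
let audioDuration = 60.0  // seconds of audio
let rtf = processingTime / audioDuration
print(rtf)  // 0.1 → 10x faster than real-time
```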
WhisperKit uses the same Whisper models as many cloud services. Accuracy is comparable to:
  • OpenAI Whisper API (same models)
  • Deepgram (similar performance)
  • AssemblyAI (competitive)
Advantages of on-device:
  • No network latency
  • Complete privacy
  • Works offline
  • No per-minute costs

Features & Usage

WhisperKit supports:
  • WAV (recommended)
  • MP3
  • M4A
  • FLAC
  • Raw audio buffers
  • Microphone input
Audio is automatically resampled to 16kHz mono for processing.
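Files aren't the only input: a raw buffer can be passed directly. This sketch assumes an audioArray overload of transcribe alongside the audioPath variant used elsewhere on this page, with samples supplied as 16 kHz mono Float values:

```swift
import WhisperKit

// Sketch (assumed API): transcribe a raw PCM buffer directly.
// `loadSamples()` is a hypothetical decoder returning [Float]
// at 16 kHz mono.
let decodedSamples: [Float] = loadSamples()
let pipe = try await WhisperKit()
let result = try await pipe.transcribe(audioArray: decodedSamples)
print(result.text)
```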
Configure DecodingOptions with word timestamps:
var options = DecodingOptions()
options.wordTimestamps = true

let result = try await pipe.transcribe(
    audioPath: "audio.wav",
    decodeOptions: options
)

// Access timestamps
for segment in result.segments {
    for word in segment.words {
        print("\(word.word): \(word.start) - \(word.end)")
    }
}
WhisperKit supports 99+ languages. Specify the language:
var options = DecodingOptions()
options.language = "es" // Spanish

let result = try await pipe.transcribe(
    audioPath: "audio.wav",
    decodeOptions: options
)
Or let WhisperKit auto-detect the language by not specifying one.
Use the streaming API with microphone input:
let pipe = try await WhisperKit()

// Start streaming
try await pipe.startRecording()

// Receive callbacks with transcription updates
pipe.onTranscriptionUpdate = { result in
    print(result.text)
}

// Stop when done
try await pipe.stopRecording()
See Streaming Guide for details.
Basic speaker detection is available in WhisperKit, but for production-grade speaker diarization, use Argmax Pro SDK which includes pyannoteAI’s flagship model.
WhisperKit can translate speech into English. Use the translation task:
var options = DecodingOptions()
options.task = .translate

let result = try await pipe.transcribe(
    audioPath: "spanish_audio.wav",
    decodeOptions: options
)
// Result is in English

TTSKit Specific

0.6B Model:
  • Runs on macOS and iOS
  • ~1 GB download
  • Fast inference
  • 9 voices, 10 languages
1.7B Model:
  • macOS only
  • ~2.2 GB download
  • Higher quality
  • Supports style instructions
  • Same voices and languages
TTSKit includes 9 built-in voices:
  • .ryan - Male, clear and professional
  • .aiden - Male, warm and friendly
  • .onoAnna - Female, bright and energetic
  • .sohee - Female, calm and soothing
  • .eric - Male, deep and authoritative
  • .dylan - Male, young and casual
  • .serena - Female, elegant and refined
  • .vivian - Female, confident and dynamic
  • .uncleFu - Male, wise and mature
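This page doesn't show the voice-selection API itself. Assuming GenerationOptions exposes a voice field taking the identifiers above (an assumption, patterned after the instruction field shown in the style example), selecting a voice might look like:

```swift
// Hypothetical sketch: `.qwen3TTS_0_6b` and the `voice` field are
// assumed names, mirroring the 1.7B style example on this page.
let tts = try await TTSKit(TTSKitConfig(model: .qwen3TTS_0_6b))

var options = GenerationOptions()
options.voice = .serena  // Female, elegant and refined

let result = try await tts.generate(
    text: "Hello from TTSKit.",
    options: options
)
```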
Style instructions are only supported on the 1.7B model:
let tts = try await TTSKit(TTSKitConfig(model: .qwen3TTS_1_7b))

var options = GenerationOptions()
options.instruction = "Speak slowly and warmly, like a storyteller."

let result = try await tts.generate(
    text: "Once upon a time...",
    options: options
)
For real-time streaming playback, use the play method:
try await tts.play(text: "This starts playing immediately.")
Audio begins playing before generation completes.

Local Server

whisperkit-cli can run a local HTTP server that implements the OpenAI Audio API, allowing you to use OpenAI SDK clients with WhisperKit:
BUILD_ALL=1 swift run whisperkit-cli serve
Compatible with OpenAI Python SDK and other clients.
Use the OpenAI Python SDK:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:50060/v1")

result = client.audio.transcriptions.create(
    file=open("audio.wav", "rb"),
    model="tiny"
)
print(result.text)
See Local Server Examples for more.
Streaming transcription responses are supported. Use the stream=True parameter:
with client.audio.transcriptions.create(
    file=open("audio.wav", "rb"),
    model="tiny",
    stream=True
) as stream:
    for chunk in stream:
        print(chunk.text, end="", flush=True)

Troubleshooting

If model downloads fail, try these common solutions:
  1. Install git-lfs:
    brew install git-lfs
    git lfs install
    
  2. Check disk space - Ensure sufficient storage for models
  3. Try a smaller model - Start with tiny to verify setup
  4. Clear cache - Delete ~/.cache/whisperkit/ and retry
If transcription is slow, try these optimizations:
  1. Use a smaller model - Switch from large to medium/small
  2. Use distilled models - Try distil-large-v3
  3. Adjust compute units - Configure CoreML compute units
  4. Check thermal throttling - Device may be overheating
  5. Reduce precision - Use quantized models if available
See Performance Guide for details.
To improve transcription accuracy:
  1. Use a larger model - large-v3 is most accurate
  2. Specify the language - Don’t rely on auto-detection for best results
  3. Provide context - Use prompt parameter for domain-specific content
  4. Check audio quality - Ensure clear audio, low background noise
  5. Adjust VAD settings - Fine-tune voice activity detection
If you hit build errors, common fixes:
  1. Update Xcode - Ensure Xcode 16.0+
  2. Clean build folder - ⌘⇧K in Xcode
  3. Reset package cache - File > Packages > Reset Package Caches
  4. Check deployment target - macOS 14.0+, iOS 16.0+
  5. Update dependencies - File > Packages > Update to Latest Package Versions
If transcription fails or crashes on device, check:
  1. Memory usage - Large models may exceed device memory
  2. iOS version - Ensure iOS 16.0+ (18.0+ for TTSKit)
  3. Model size - Use smaller model for older devices
  4. Background processing - Check app lifecycle handling
  5. Permissions - Verify microphone permissions

Support & Community

Support is available through multiple channels, including GitHub issues and the community Discord.
We welcome contributions!
  • Fix bugs and add features
  • Improve documentation
  • Submit benchmark results
  • Share example projects
See Contributing Guide to get started.
WhisperKit can be used in commercial apps: it's released under the MIT License, which permits commercial use without restrictions.
Open an issue on GitHub:
  1. Check if the issue already exists
  2. Include:
    • Device and OS version
    • WhisperKit version
    • Steps to reproduce
    • Sample code if possible
  3. Add relevant logs

Next Steps

Quick Start

Get started with WhisperKit

Model Catalog

Explore available models

Guides

Learn advanced features

Join Discord

Get help from the community
