This guide will walk you through creating your first app with WhisperKit and TTSKit. We’ll cover both speech-to-text and text-to-speech functionality.
WhisperKit: Speech-to-Text
Basic Transcription
Transcribe an audio file with just a few lines of code:
import WhisperKit
Task {
    // Initialize WhisperKit with default settings
    guard let pipe = try? await WhisperKit() else { return }

    // Transcribe audio file
    let result = try? await pipe.transcribe(
        audioPath: "path/to/your/audio.wav"
    )

    // Print transcription
    print(result?.first?.text ?? "")
}
WhisperKit supports multiple audio formats: .wav, .mp3, .m4a, and .flac.
Model Selection
By default, WhisperKit automatically selects the best model for your device. To use a specific model:
// Use a specific model by name
let pipe = try await WhisperKit(
    model: "large-v3"
)
// Use glob patterns for fuzzy matching
let pipe = try await WhisperKit(
    model: "distil*large-v3" // Matches distil-large-v3
)
Available models from the Hugging Face repo:

Model            Description                 Size
tiny             Fastest, lowest accuracy    ~150 MB
base             Good for mobile             ~290 MB
small            Balanced performance        ~967 MB
medium           High accuracy               ~3.1 GB
large-v3         Best accuracy               ~6.2 GB
distil-large-v3  Distilled large-v3, faster  ~3.8 GB
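If you are unsure which model a given device can handle, WhisperKit also exposes discovery helpers. The sketch below assumes the `recommendedModels()` and `fetchAvailableModels()` helpers; their exact names and signatures may differ between releases, so verify against the API reference for your version:

```swift
import WhisperKit

// Ask WhisperKit which model it would pick for this device
// (assumed helper; check the API reference for your version)
let support = WhisperKit.recommendedModels()
print("Default model for this device: \(support.default)")

// List all models published in the hosting repo
// (assumed helper; network access required)
let available = try await WhisperKit.fetchAvailableModels()
print("Available: \(available)")
```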
Advanced Transcription Options
Detailed Options
import WhisperKit
Task {
    let pipe = try await WhisperKit(model: "large-v3")

    // Configure transcription options
    var options = DecodingOptions()
    options.language = "en"           // Specify language
    options.temperature = 0.0         // Lower = more deterministic
    options.withoutTimestamps = false // Include word timestamps
    options.clipTimestamps = []       // Process specific time ranges
    options.verbose = true            // Enable logging

    let results = try await pipe.transcribe(
        audioPath: "audio.wav",
        decodeOptions: options
    ) { progress in
        // Optional progress callback
        print("Progress: \(progress.timings.tokensPerSecond) tokens/sec")
        return true // Return false to cancel
    }

    // Access detailed results
    if let result = results.first {
        print("Text: \(result.text)")
        print("Language: \(result.language)")
        // Word-level timestamps
        for segment in result.segments {
            print("[\(segment.start)s - \(segment.end)s]: \(segment.text)")
        }
    }
}
Language Detection
Automatically detect the language of audio:
let pipe = try await WhisperKit()

// Detect language from file
let (language, langProbs) = try await pipe.detectLanguage(
    audioPath: "audio.wav"
)
print("Detected language: \(language)")
print("Confidence scores: \(langProbs)")
// Output: Detected language: en
// Confidence scores: ["en": 0.98, "es": 0.01, ...]
Language detection only works with multilingual models. English-only models will throw an error.
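If your app may load an English-only variant (Whisper model names ending in ".en"), it is worth guarding the call. A minimal sketch, assuming "base.en" resolves to an English-only model in the hosting repo:

```swift
import WhisperKit

// English-only variants cannot detect language, so wrap the call
let pipe = try await WhisperKit(model: "base.en")
do {
    let (language, _) = try await pipe.detectLanguage(audioPath: "audio.wav")
    print("Detected: \(language)")
} catch {
    // Thrown when the loaded model is English-only
    print("Language detection unavailable: \(error.localizedDescription)")
}
```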
Processing Multiple Files
Transcribe multiple audio files concurrently:
let pipe = try await WhisperKit()
let audioPaths = [
    "audio1.wav",
    "audio2.mp3",
    "audio3.m4a"
]

// Transcribe all files with concurrent processing
var options = DecodingOptions()
options.concurrentWorkerCount = 3 // Process 3 files at once

let results = await pipe.transcribe(
    audioPaths: audioPaths,
    decodeOptions: options
)

// Results maintain input order
for (path, result) in zip(audioPaths, results) {
    if let transcription = result?.first?.text {
        print("\(path): \(transcription)")
    }
}
Custom Models
Deploy your own fine-tuned models:
import WhisperKit
// Load from custom HuggingFace repo
let config = WhisperKitConfig(
    model: "large-v3",
    modelRepo: "username/your-model-repo" // Your HF repo
)
let pipe = try await WhisperKit(config)
Use whisperkittools to convert and upload your models to Hugging Face.
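As a rough sketch of that workflow (the flag names below come from the whisperkittools README and may change between releases, so verify against your installed version):

```shell
# Install the conversion toolkit
pip install whisperkittools

# Convert a Whisper checkpoint to Core ML and save it locally;
# upload the resulting folder to your Hugging Face repo afterwards
whisperkit-generate-model \
  --model-version tiny \
  --output-dir ./models
```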
TTSKit: Text-to-Speech
Basic Speech Generation
Generate speech from text:
import TTSKit
Task {
    // Initialize TTSKit (auto-downloads 0.6B model on first run)
    let tts = try await TTSKit()

    // Generate speech
    let result = try await tts.generate(
        text: "Hello from TTSKit!"
    )
    print("Generated \(result.audioDuration)s of audio")
    print("Sample rate: \(result.sampleRate) Hz")
    print("Audio samples: \(result.audio.count)")
}
Model Selection
TTSKit provides two model variants:
0.6B Model (Default)
import TTSKit
// Fast model that runs on all platforms (~1 GB download)
let tts = try await TTSKit(
    model: .qwen3TTS_0_6b
)
// Good for: iOS, watchOS, lower-end devices
// Supports: 9 voices, 10 languages
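The 1.7B variant is selected the same way. A sketch, reusing the `.qwen3TTS_1_7b` case that appears in the Style Instructions section of this guide (the device guidance in the comments is an assumption based on its larger size):

```swift
import TTSKit

// Higher-quality model; larger download, heavier compute
let tts = try await TTSKit(
    model: .qwen3TTS_1_7b
)
// Good for: macOS and recent devices; required for style instructions
```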
Voice and Language Selection
Choose from 9 built-in voices and 10 languages:
import TTSKit
let tts = try await TTSKit()

// Generate with specific voice and language
let result = try await tts.generate(
    text: "こんにちは世界", // "Hello, world"
    speaker: .onoAnna,
    language: .japanese
)
Available Voices:
.ryan - Male, neutral (default)
.aiden - Male, energetic
.onoAnna - Female, warm
.sohee - Female, clear
.eric - Male, professional
.dylan - Male, casual
.serena - Female, smooth
.vivian - Female, bright
.uncleFu - Male, deep
Available Languages:
.english, .chinese, .japanese, .korean, .german, .french, .russian, .portuguese, .spanish, .italian
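Any voice can be paired with any of the listed languages using the same call pattern. For example (a sketch mirroring the generate call above, with illustrative text):

```swift
import TTSKit

let tts = try await TTSKit()

// Same speaker, two languages
let en = try await tts.generate(
    text: "Hello, world",
    speaker: .serena,
    language: .english
)
let fr = try await tts.generate(
    text: "Bonjour le monde",
    speaker: .serena,
    language: .french
)
print("en: \(en.audioDuration)s, fr: \(fr.audioDuration)s")
```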
Real-Time Streaming Playback
Play audio as it’s being generated:
import TTSKit
let tts = try await TTSKit()

// Starts playing before generation finishes
try await tts.play(
    text: "This starts playing immediately as audio is generated."
)

// Control buffering strategy
try await tts.play(
    text: "Long passage with custom buffering...",
    playbackStrategy: .auto // Auto-calculates optimal buffer
)
Playback Strategies:

.auto (Recommended): Automatically measures generation speed and buffers just enough to prevent gaps:
try await tts.play(
    text: "Hello world",
    playbackStrategy: .auto
)

.stream: Immediate playback with no buffering (lowest latency):
try await tts.play(
    text: "Hello world",
    playbackStrategy: .stream
)

.buffered: Fixed pre-buffer duration:
try await tts.play(
    text: "Hello world",
    playbackStrategy: .buffered(seconds: 2.0)
)

.generateFirst: Generate all audio before playing:
try await tts.play(
    text: "Hello world",
    playbackStrategy: .generateFirst
)
Generation Options
Customize sampling, chunking, and performance:
import TTSKit
let tts = try await TTSKit()

// Configure generation options
var options = GenerationOptions()
options.temperature = 0.9        // Randomness (0.0-1.0)
options.topK = 50                // Top-K sampling
options.repetitionPenalty = 1.05 // Reduce repetition
options.maxNewTokens = 245       // Max length per chunk

// Text chunking for long content
options.chunkingStrategy = .sentence // Split at sentence boundaries
options.concurrentWorkerCount = nil  // Auto-select based on device

let result = try await tts.generate(
    text: "Long article or book chapter...",
    options: options
)
Style Instructions (1.7B Only)
Control prosody with natural language instructions:
import TTSKit
let tts = try await TTSKit(model: .qwen3TTS_1_7b) // Requires 1.7B model

var options = GenerationOptions()
options.instruction = "Speak slowly and warmly, like a storyteller."

let result = try await tts.generate(
    text: "Once upon a time...",
    speaker: .ryan,
    options: options
)
Style instructions only work with the 1.7B model. The 0.6B model ignores the instruction parameter.
Save Generated Audio
Export audio to WAV or M4A format:
import TTSKit
import Foundation
let tts = try await TTSKit()
let result = try await tts.generate(text: "Save me!")

// Get documents directory
let outputDir = FileManager.default.urls(
    for: .documentDirectory,
    in: .userDomainMask
)[0]

// Save as WAV (lossless)
try await AudioOutput.saveAudio(
    result.audio,
    toFolder: outputDir,
    filename: "output",
    format: .wav
)

// Save as M4A (AAC compressed, smaller file)
try await AudioOutput.saveAudio(
    result.audio,
    toFolder: outputDir,
    filename: "output",
    format: .m4a
)

print("Saved to: \(outputDir.path)/output.m4a")
Progress Callbacks
Receive per-step updates during generation:
import TTSKit
let tts = try await TTSKit()

let result = try await tts.generate(
    text: "Hello from TTSKit!"
) { progress in
    print("Audio chunk: \(progress.audio.count) samples")
    // First step timing (useful for estimating total time)
    if let stepTime = progress.stepTime {
        print("First step took \(stepTime)s")
    }
    // Chunk progress (when using multi-chunk generation)
    if let chunk = progress.chunkIndex, let total = progress.totalChunks {
        print("Chunk \(chunk + 1)/\(total)")
    }
    return true // Return false to cancel generation
}

print("Final timings:")
print("  Total: \(result.timings.fullPipeline)s")
print("  Time to first audio: \(result.timings.timeToFirstBuffer)s")
print("  Decoding: \(result.timings.decodingLoop)s")
Complete Examples
Combined Speech-to-Speech
Transcribe audio and generate speech response:
import SwiftUI
import WhisperKit
import TTSKit
struct ContentView: View {
    @State private var whisper: WhisperKit?
    @State private var tts: TTSKit?
    @State private var transcription = ""
    @State private var isProcessing = false

    var body: some View {
        VStack(spacing: 20) {
            Text(transcription)
                .padding()
            Button("Transcribe & Respond") {
                Task {
                    await processAudio()
                }
            }
            .disabled(isProcessing)
        }
        .task {
            // Initialize both systems
            whisper = try? await WhisperKit()
            tts = try? await TTSKit()
        }
    }

    func processAudio() async {
        isProcessing = true
        defer { isProcessing = false }
        do {
            // Transcribe input
            let result = try await whisper?.transcribe(
                audioPath: "input.wav"
            )
            transcription = result?.first?.text ?? ""
            // Generate speech response
            let response = "You said: \(transcription)"
            try await tts?.play(text: response)
        } catch {
            transcription = "Error: \(error.localizedDescription)"
        }
    }
}
Command Line Usage
Both WhisperKit and TTSKit are available via the CLI:
Speech Recognition
# Transcribe a file
whisperkit-cli transcribe \
  --audio-path audio.wav \
  --model large-v3

# Stream from microphone
whisperkit-cli transcribe --stream

# Detect language
whisperkit-cli transcribe \
  --audio-path audio.wav \
  --detect-language

Text-to-Speech
# Generate and play
whisperkit-cli tts \
  --text "Hello from the command line" \
  --play

# Save to file
whisperkit-cli tts \
  --text "Save me" \
  --output-path output.wav

# Use a specific voice and language ("日本語テスト" = "Japanese test")
whisperkit-cli tts \
  --text "日本語テスト" \
  --speaker ono-anna \
  --language japanese \
  --play

# High quality with style
whisperkit-cli tts \
  --text-file article.txt \
  --model 1.7b \
  --instruction "Read cheerfully" \
  --play
Prewarm Models
Use prewarm to compile models in the background:
let config = WhisperKitConfig(
    model: "large-v3",
    prewarm: true, // Compile models without loading weights
    load: false    // Load later with loadModels()
)
let pipe = try await WhisperKit(config)
// Do other setup work...
// Now load the models
try await pipe.loadModels()
Reuse Instances
Initialize once and reuse:
class SpeechService {
    let whisper: WhisperKit
    let tts: TTSKit

    init() async throws {
        whisper = try await WhisperKit()
        tts = try await TTSKit()
    }

    func transcribe(_ path: String) async throws -> String {
        let result = try await whisper.transcribe(audioPath: path)
        return result.first?.text ?? ""
    }
}
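A caller can then construct the service once and reuse it across requests, for example (file names here are illustrative):

```swift
let service = try await SpeechService()
let part1 = try await service.transcribe("meeting_part1.wav")
let part2 = try await service.transcribe("meeting_part2.wav")
print(part1, part2)
```

This avoids paying model download and load time on every request.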
Process Long Audio Efficiently
Use chunking and VAD for long recordings:
var options = DecodingOptions()
options.chunkingStrategy = .vad   // Voice activity detection
options.concurrentWorkerCount = 2 // Process chunks in parallel

let results = try await pipe.transcribe(
    audioPath: "long_recording.wav",
    decodeOptions: options
)
Cache TTS Prompts
Build the prompt cache once for 90% faster subsequent generations:
let tts = try await TTSKit()

// Build cache (slow, ~1s)
try await tts.buildPromptCache(
    voice: "ryan",
    language: "english"
)

// Subsequent generations reuse the cache (fast, ~0.1s overhead)
let result1 = try await tts.generate(text: "First sentence")
let result2 = try await tts.generate(text: "Second sentence") // Much faster!
Next Steps
API Reference Explore detailed API documentation
Advanced Features Learn about streaming, VAD, and custom models
Example Apps Browse complete example applications
Best Practices Optimization tips and production guidelines