This guide will walk you through creating your first app with WhisperKit and TTSKit. We’ll cover both speech-to-text and text-to-speech functionality.

WhisperKit: Speech-to-Text

Basic Transcription

Transcribe an audio file with just a few lines of code:
import WhisperKit

Task {
    // Initialize WhisperKit with default settings
    guard let pipe = try? await WhisperKit() else { return }
    
    // Transcribe the audio file
    let result = try? await pipe.transcribe(
        audioPath: "path/to/your/audio.wav"
    )
    
    // Print the transcription
    print(result?.first?.text ?? "")
}
WhisperKit supports multiple audio formats: .wav, .mp3, .m4a, and .flac.

Model Selection

By default, WhisperKit automatically selects the best model for your device. To use a specific model:
// Use a specific model by name
let pipe = try await WhisperKit(
    model: "large-v3"
)

// Or use glob patterns for fuzzy matching
let distilPipe = try await WhisperKit(
    model: "distil*large-v3"  // Matches distil-large-v3
)
Available models from the HuggingFace repo:
  • tiny - Fastest, lowest accuracy (~150 MB)
  • base - Good for mobile (~290 MB)
  • small - Balanced performance (~967 MB)
  • medium - High accuracy (~3.1 GB)
  • large-v3 - Best accuracy (~6.2 GB)
  • distil-large-v3 - Distilled large-v3, faster (~3.8 GB)
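If you want to pick from the list above at runtime rather than rely on WhisperKit's automatic selection, one option is to key the choice off device memory. A minimal sketch; the thresholds below are illustrative assumptions, not official recommendations:

```swift
import Foundation
import WhisperKit

// Pick a model name from the list above based on available RAM.
// These cutoffs are assumptions for illustration only.
func recommendedModel() -> String {
    let gb = Double(ProcessInfo.processInfo.physicalMemory) / 1_073_741_824
    switch gb {
    case ..<4:  return "base"             // low-memory mobile devices
    case ..<8:  return "small"            // balanced performance
    case ..<16: return "distil-large-v3"  // faster distilled variant
    default:    return "large-v3"         // best accuracy
    }
}

let pipe = try await WhisperKit(model: recommendedModel())
```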

Advanced Transcription Options

import WhisperKit

Task {
    let pipe = try await WhisperKit(model: "large-v3")
    
    // Configure transcription options
    var options = DecodingOptions()
    options.language = "en"              // Specify language
    options.temperature = 0.0            // Lower = more deterministic
    options.withoutTimestamps = false    // Include word timestamps
    options.clipTimestamps = []          // Process specific time ranges
    options.verbose = true               // Enable logging
    
    let results = try await pipe.transcribe(
        audioPath: "audio.wav",
        decodeOptions: options
    ) { progress in
        // Optional progress callback
        print("Progress: \(progress.timings.tokensPerSecond) tokens/sec")
        return true  // return false to cancel
    }
    
    // Access detailed results
    if let result = results.first {
        print("Text: \(result.text)")
        print("Language: \(result.language)")
        
        // Word-level timestamps
        for segment in result.segments {
            print("[\(segment.start)s - \(segment.end)s]: \(segment.text)")
        }
    }
}

Language Detection

Automatically detect the language of audio:
let pipe = try await WhisperKit()

// Detect language from file
let (language, langProbs) = try await pipe.detectLanguage(
    audioPath: "audio.wav"
)

print("Detected language: \(language)")
print("Confidence scores: \(langProbs)")
// Output: Detected language: en
// Confidence scores: ["en": 0.98, "es": 0.01, ...]
Language detection only works with multilingual models. English-only models will throw an error.
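Since English-only models throw here, it can be worth wrapping detection in a do/catch and falling back to English before decoding. A minimal sketch (the fallback policy is an assumption):

```swift
import WhisperKit

let pipe = try await WhisperKit()

// Fall back to English if detection is unavailable
// (e.g. when an English-only model is loaded)
var detectedLanguage = "en"
do {
    let (language, _) = try await pipe.detectLanguage(audioPath: "audio.wav")
    detectedLanguage = language
} catch {
    print("Language detection unavailable: \(error)")
}

var options = DecodingOptions()
options.language = detectedLanguage
```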

Processing Multiple Files

Transcribe multiple audio files concurrently:
let pipe = try await WhisperKit()

let audioPaths = [
    "audio1.wav",
    "audio2.mp3",
    "audio3.m4a"
]

// Transcribe all files with concurrent processing
var options = DecodingOptions()
options.concurrentWorkerCount = 3  // Process 3 files at once

let results = await pipe.transcribe(
    audioPaths: audioPaths,
    decodeOptions: options
)

// Results maintain input order
for (path, result) in zip(audioPaths, results) {
    if let transcription = result?.first?.text {
        print("\(path): \(transcription)")
    }
}

Custom Models

Deploy your own fine-tuned models:
import WhisperKit

// Load from custom HuggingFace repo
let config = WhisperKitConfig(
    model: "large-v3",
    modelRepo: "username/your-model-repo"  // Your HF repo
)

let pipe = try await WhisperKit(config)
Use whisperkittools to convert and upload your models to HuggingFace.

TTSKit: Text-to-Speech

Basic Speech Generation

Generate speech from text:
import TTSKit

Task {
    // Initialize TTSKit (auto-downloads 0.6B model on first run)
    let tts = try await TTSKit()
    
    // Generate speech
    let result = try await tts.generate(
        text: "Hello from TTSKit!"
    )
    
    print("Generated \(result.audioDuration)s of audio")
    print("Sample rate: \(result.sampleRate)Hz")
    print("Audio samples: \(result.audio.count)")
}

Model Selection

TTSKit provides two model variants:
import TTSKit

// Fast model that runs on all platforms (~1 GB download)
// Good for: iOS, watchOS, lower-end devices
// Supports: 9 voices, 10 languages
let tts = try await TTSKit(
    model: .qwen3TTS_0_6b
)

// Larger model that also supports style instructions
// (see Style Instructions below)
let expressiveTTS = try await TTSKit(
    model: .qwen3TTS_1_7b
)

Voice and Language Selection

Choose from 9 built-in voices and 10 languages:
import TTSKit

let tts = try await TTSKit()

// Generate with specific voice and language
let result = try await tts.generate(
    text: "こんにちは世界",
    speaker: .onoAnna,
    language: .japanese
)
Available Voices:
  • .ryan - Male, neutral (default)
  • .aiden - Male, energetic
  • .onoAnna - Female, warm
  • .sohee - Female, clear
  • .eric - Male, professional
  • .dylan - Male, casual
  • .serena - Female, smooth
  • .vivian - Female, bright
  • .uncleFu - Male, deep
Available Languages: .english, .chinese, .japanese, .korean, .german, .french, .russian, .portuguese, .spanish, .italian
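To audition the voices, you can loop over the speaker cases from the list above and generate a short sample with each. A sketch, assuming the speaker cases share a type (called Speaker here; the actual type name may differ) and using an arbitrary sample sentence:

```swift
import TTSKit

let tts = try await TTSKit()

// Speaker cases taken from the list above; the Speaker type
// name is an assumption for illustration.
let voices: [Speaker] = [.ryan, .aiden, .onoAnna, .sohee, .eric,
                         .dylan, .serena, .vivian, .uncleFu]

for voice in voices {
    let result = try await tts.generate(
        text: "This is a short voice sample.",
        speaker: voice
    )
    print("\(voice): \(result.audioDuration)s generated")
}
```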

Real-Time Streaming Playback

Play audio as it’s being generated:
import TTSKit

let tts = try await TTSKit()

// Starts playing before generation finishes
try await tts.play(
    text: "This starts playing immediately as audio is generated."
)

// Control buffering strategy
try await tts.play(
    text: "Long passage with custom buffering...",
    playbackStrategy: .auto  // Auto-calculates optimal buffer
)
Playback strategies control how much audio is buffered before playback begins; .auto calculates an optimal buffer automatically.

Generation Options

Customize sampling, chunking, and performance:
import TTSKit

let tts = try await TTSKit()

// Configure generation options
var options = GenerationOptions()
options.temperature = 0.9           // Randomness (0.0-1.0)
options.topK = 50                   // Top-K sampling
options.repetitionPenalty = 1.05    // Reduce repetition
options.maxNewTokens = 245          // Max length per chunk

// Text chunking for long content
options.chunkingStrategy = .sentence  // Split at sentence boundaries
options.concurrentWorkerCount = nil   // Auto-select based on device

let result = try await tts.generate(
    text: "Long article or book chapter...",
    options: options
)

Style Instructions (1.7B Only)

Control prosody with natural language instructions:
import TTSKit

let tts = try await TTSKit(model: .qwen3TTS_1_7b)  // Requires 1.7B model

var options = GenerationOptions()
options.instruction = "Speak slowly and warmly, like a storyteller."

let result = try await tts.generate(
    text: "Once upon a time...",
    speaker: .ryan,
    options: options
)
Style instructions only work with the 1.7B model. The 0.6B model ignores the instruction parameter.

Save Generated Audio

Export audio to WAV or M4A format:
import TTSKit
import Foundation

let tts = try await TTSKit()
let result = try await tts.generate(text: "Save me!")

// Get documents directory
let outputDir = FileManager.default.urls(
    for: .documentDirectory, 
    in: .userDomainMask
)[0]

// Save as WAV (lossless)
try await AudioOutput.saveAudio(
    result.audio,
    toFolder: outputDir,
    filename: "output",
    format: .wav
)

// Save as M4A (AAC compressed, smaller file)
try await AudioOutput.saveAudio(
    result.audio,
    toFolder: outputDir,
    filename: "output",
    format: .m4a
)

print("Saved to: \(outputDir.path)/output.m4a")

Progress Callbacks

Receive per-step updates during generation:
import TTSKit

let tts = try await TTSKit()

let result = try await tts.generate(
    text: "Hello from TTSKit!"
) { progress in
    print("Audio chunk: \(progress.audio.count) samples")
    
    // First step timing (useful for estimating total time)
    if let stepTime = progress.stepTime {
        print("First step took \(stepTime)s")
    }
    
    // Chunk progress (when using multi-chunk generation)
    if let chunk = progress.chunkIndex, let total = progress.totalChunks {
        print("Chunk \(chunk + 1)/\(total)")
    }
    
    return true  // Return false to cancel generation
}

print("Final timings:")
print("  Total: \(result.timings.fullPipeline)s")
print("  Time to first audio: \(result.timings.timeToFirstBuffer)s")
print("  Decoding: \(result.timings.decodingLoop)s")
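The chunk counters above are enough to sketch a rough progress indicator for multi-chunk generations. The linear extrapolation is a naive assumption; real chunk times vary:

```swift
import TTSKit

let tts = try await TTSKit()

let result = try await tts.generate(
    text: "A longer passage that will be split into several chunks..."
) { progress in
    // Rough fraction complete, when chunk info is available
    if let chunk = progress.chunkIndex, let total = progress.totalChunks, total > 0 {
        let fraction = Double(chunk + 1) / Double(total)
        print(String(format: "~%.0f%% complete", fraction * 100))
    }
    return true
}

print("Generated \(result.audioDuration)s of audio")
```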

Complete Examples

Combined Speech-to-Speech

Transcribe audio and generate speech response:
import SwiftUI
import WhisperKit
import TTSKit

struct ContentView: View {
    @State private var whisper: WhisperKit?
    @State private var tts: TTSKit?
    @State private var transcription = ""
    @State private var isProcessing = false
    
    var body: some View {
        VStack(spacing: 20) {
            Text(transcription)
                .padding()
            
            Button("Transcribe & Respond") {
                Task {
                    await processAudio()
                }
            }
            .disabled(isProcessing)
        }
        .task {
            // Initialize both systems
            whisper = try? await WhisperKit()
            tts = try? await TTSKit()
        }
    }
    
    func processAudio() async {
        isProcessing = true
        defer { isProcessing = false }
        
        do {
            // Transcribe input
            let result = try await whisper?.transcribe(
                audioPath: "input.wav"
            )
            
            transcription = result?.first?.text ?? ""
            
            // Generate speech response
            let response = "You said: \(transcription)"
            try await tts?.play(text: response)
            
        } catch {
            transcription = "Error: \(error.localizedDescription)"
        }
    }
}

Command Line Usage

Both WhisperKit and TTSKit are available via the CLI:
# Transcribe a file
whisperkit-cli transcribe \
  --audio-path audio.wav \
  --model large-v3

# Stream from microphone
whisperkit-cli transcribe --stream

# Detect language
whisperkit-cli transcribe \
  --audio-path audio.wav \
  --detect-language

Performance Tips

Use prewarm to compile models in the background:
let config = WhisperKitConfig(
    model: "large-v3",
    prewarm: true,  // Compile models without loading weights
    load: false     // Load later with loadModels()
)
let pipe = try await WhisperKit(config)

// Do other setup work...

// Now load the models
try await pipe.loadModels()
Initialize once and reuse:
class SpeechService {
    let whisper: WhisperKit
    let tts: TTSKit
    
    init() async throws {
        whisper = try await WhisperKit()
        tts = try await TTSKit()
    }
    
    func transcribe(_ path: String) async throws -> String {
        let result = try await whisper.transcribe(audioPath: path)
        return result.first?.text ?? ""
    }
}
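Usage is then a one-time initialization followed by cheap calls; where you store the shared instance (app delegate, environment, singleton) is up to you:

```swift
// Initialize once, e.g. at app launch
let service = try await SpeechService()

// Reuse the same instance for every request,
// avoiding repeated model loads
let first = try await service.transcribe("first.wav")
let second = try await service.transcribe("second.wav")
```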
Use chunking and VAD for long recordings:
var options = DecodingOptions()
options.chunkingStrategy = .vad  // Voice activity detection
options.concurrentWorkerCount = 2  // Process chunks in parallel

let results = try await pipe.transcribe(
    audioPath: "long_recording.wav",
    decodeOptions: options
)
Build prompt cache once for 90% faster subsequent generations:
let tts = try await TTSKit()

// Build cache (slow, ~1s)
try await tts.buildPromptCache(
    voice: "ryan",
    language: "english"
)

// Subsequent generations reuse cache (fast, ~0.1s overhead)
let result1 = try await tts.generate(text: "First sentence")
let result2 = try await tts.generate(text: "Second sentence")  // Much faster!

Next Steps

API Reference

Explore detailed API documentation

Advanced Features

Learn about streaming, VAD, and custom models

Example Apps

Browse complete example applications

Best Practices

Optimization tips and production guidelines
