
Overview

TTSKit provides on-device text-to-speech using Qwen3 TTS models running entirely on Apple silicon with real-time streaming playback.
TTSKit requires macOS 15.0+ or iOS 18.0+.

Quick Start

Basic Speech Generation

import TTSKit

Task {
    let tts = try await TTSKit()
    let result = try await tts.generate(text: "Hello from TTSKit!")
    print("Generated \(result.audioDuration)s of audio at \(result.sampleRate)Hz")
}
TTSKit() automatically downloads the default 0.6B model on first run and loads all necessary components.
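The duration and sample rate printed above are related by simple arithmetic: duration is the raw sample count divided by the sample rate. The helper below is illustrative only, not part of the TTSKit API:

```swift
// Illustrative only (not TTSKit API): a result's duration is its raw
// sample count divided by the sample rate.
func audioSeconds(sampleCount: Int, sampleRate: Int) -> Double {
    Double(sampleCount) / Double(sampleRate)
}

// e.g. 48,000 samples at a 24 kHz sample rate is 2 seconds of audio
```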

Real-Time Streaming Playback

Play audio as it’s being generated:
try await tts.play(text: "This starts playing before generation finishes.")
The audio begins playing immediately, streaming frame-by-frame as it’s generated.

Model Selection

TTSKit offers two model sizes:
// Fast, works on all platforms (~1 GB)
let tts = try await TTSKit(
    TTSKitConfig(model: .qwen3TTS_0_6b)
)

// Higher quality, macOS only (~2.2 GB, supports style instructions)
let tts = try await TTSKit(
    TTSKitConfig(model: .qwen3TTS_1_7b)
)
Models are hosted on HuggingFace and cached locally after first download.

Voice Selection

Choose from 9 built-in voices:
let result = try await tts.generate(
    text: "Hello world",
    speaker: .ryan,    // or .aiden, .onoAnna, .sohee, .eric, .dylan, .serena, .vivian, .uncleFu
    language: .english
)

Available Voices

  • ryan - Clear, neutral male voice
  • aiden - Warm male voice
  • onoAnna - Professional female voice
  • sohee - Friendly female voice
  • eric - Authoritative male voice
  • dylan - Casual male voice
  • serena - Calm female voice
  • vivian - Energetic female voice
  • uncleFu - Character voice

Multi-Language Support

TTSKit supports 10 languages:
// Japanese
let result = try await tts.generate(
    text: "こんにちは世界",
    speaker: .onoAnna,
    language: .japanese
)

// Spanish
let result = try await tts.generate(
    text: "Hola mundo",
    speaker: .serena,
    language: .spanish
)

Supported Languages

.english, .chinese, .japanese, .korean, .german, .french, .russian, .portuguese, .spanish, .italian

Generation Options

Customize the generation behavior:
var options = GenerationOptions()
options.temperature = 0.9
options.topK = 50
options.repetitionPenalty = 1.05
options.maxNewTokens = 245

// Auto-split long text at sentence boundaries
options.chunkingStrategy = .sentence

// Number of concurrent workers for chunked generation (nil = auto)
options.concurrentWorkerCount = nil

let result = try await tts.generate(
    text: longArticle,
    speaker: .ryan,
    language: .english,
    options: options
)
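To make the `.sentence` chunking strategy concrete, here is a rough sketch of what a sentence-boundary splitter might do. TTSKit's actual splitter is internal; this standalone `sentenceChunks` function is an assumption for illustration only:

```swift
import Foundation

// Sketch of sentence-boundary chunking (illustrative; TTSKit's real
// splitter is internal). Splits on terminal punctuation, trims
// surrounding whitespace, and drops empty fragments.
func sentenceChunks(_ text: String) -> [String] {
    text.split(whereSeparator: { ".!?".contains($0) })
        .map { $0.trimmingCharacters(in: .whitespacesAndNewlines) }
        .filter { !$0.isEmpty }
}
```

Each chunk would then be generated independently, which is what makes the concurrent workers below useful for long text.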

Style Instructions (1.7B Model Only)

The 1.7B model accepts natural-language style instructions:
var options = GenerationOptions()
options.instruction = "Speak slowly and warmly, like a storyteller."

let result = try await tts.generate(
    text: "Once upon a time...",
    speaker: .ryan,
    options: options
)
Style instructions only work with the 1.7B model. They are ignored by the 0.6B model.

Playback Strategies

Control how audio is buffered and played:
// Auto: Measures first step, buffers just enough to avoid underruns
try await tts.play(
    text: "Long passage...",
    playbackStrategy: .auto
)

// Stream: Immediate playback, no pre-buffer
try await tts.play(
    text: "Real-time speech",
    playbackStrategy: .stream
)

// Buffered: Fixed pre-buffer duration
try await tts.play(
    text: "Smooth playback",
    playbackStrategy: .buffered(seconds: 2.0)
)

// GenerateFirst: Generate all audio first, then play
try await tts.play(
    text: "Complete before play",
    playbackStrategy: .generateFirst
)

Saving Audio Files

Export generated audio to disk:
let result = try await tts.generate(text: "Save me!")
let outputDir = FileManager.default.urls(
    for: .documentDirectory,
    in: .userDomainMask
)[0]

// Save as WAV
try await AudioOutput.saveAudio(
    result.audio,
    toFolder: outputDir,
    filename: "output",
    format: .wav
)

// Save as M4A (AAC)
try await AudioOutput.saveAudio(
    result.audio,
    toFolder: outputDir,
    filename: "output",
    format: .m4a
)

Progress Callbacks

Receive per-step audio during generation:
let result = try await tts.generate(
    text: "Hello!"
) { progress in
    print("Audio chunk: \(progress.audio.count) samples")
    
    if let stepTime = progress.stepTime {
        print("First step took \(stepTime)s")
    }
    
    // Return false to cancel early
    return true
}
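Because the callback's return value controls cancellation, one common pattern is capping output length: accumulate the per-step sample counts and stop once a budget is exceeded. The budget check itself is pure arithmetic; `shouldContinue` and the 24 kHz rate below are my assumptions, not TTSKit API:

```swift
// Illustrative cancellation budget (not TTSKit API): return false from
// the progress callback once accumulated audio exceeds a time cap.
func shouldContinue(totalSamples: Int, sampleRate: Int, maxSeconds: Double) -> Bool {
    Double(totalSamples) / Double(sampleRate) < maxSeconds
}

// Inside the callback you might sum progress.audio.count into `total`, then:
// return shouldContinue(totalSamples: total, sampleRate: 24_000, maxSeconds: 30)
```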

Complete SwiftUI Example

import SwiftUI
import TTSKit
import AVFoundation  // for AVAudioPlayer used in TTSViewModel

struct TTSView: View {
    @StateObject private var viewModel = TTSViewModel()
    @State private var inputText = "Hello from TTSKit!"
    @State private var selectedSpeaker: Qwen3Speaker = .ryan
    @State private var selectedLanguage: Qwen3Language = .english
    
    var body: some View {
        VStack(spacing: 20) {
            // Input
            TextEditor(text: $inputText)
                .frame(height: 100)
                .border(Color.gray, width: 1)
            
            // Voice selection
            HStack {
                Text("Speaker:")
                Picker("Speaker", selection: $selectedSpeaker) {
                    Text("Ryan").tag(Qwen3Speaker.ryan)
                    Text("Aiden").tag(Qwen3Speaker.aiden)
                    Text("Ono Anna").tag(Qwen3Speaker.onoAnna)
                    Text("Sohee").tag(Qwen3Speaker.sohee)
                }
            }
            
            // Language selection
            HStack {
                Text("Language:")
                Picker("Language", selection: $selectedLanguage) {
                    Text("English").tag(Qwen3Language.english)
                    Text("Japanese").tag(Qwen3Language.japanese)
                    Text("Spanish").tag(Qwen3Language.spanish)
                }
            }
            
            // Controls
            HStack {
                Button("Generate") {
                    Task {
                        await viewModel.generate(
                            text: inputText,
                            speaker: selectedSpeaker,
                            language: selectedLanguage
                        )
                    }
                }
                .disabled(viewModel.isGenerating)
                
                Button("Play") {
                    Task {
                        await viewModel.playGenerated()
                    }
                }
                .disabled(viewModel.audioSamples.isEmpty || viewModel.isPlaying)
            }
            
            // Waveform visualization
            if !viewModel.waveform.isEmpty {
                WaveformView(samples: viewModel.waveform)
                    .frame(height: 100)
            }
            
            // Status
            Text(viewModel.statusMessage)
                .foregroundColor(.secondary)
            
            if viewModel.isGenerating {
                ProgressView()
            }
        }
        .padding()
        .task {
            await viewModel.loadModel()
        }
    }
}

@MainActor
class TTSViewModel: ObservableObject {
    @Published var statusMessage = "Loading model..."
    @Published var isGenerating = false
    @Published var isPlaying = false
    @Published var audioSamples: [Float] = []
    @Published var waveform: [Float] = []
    @Published var audioDuration: TimeInterval = 0
    
    private var ttsKit: TTSKit?
    private var audioPlayer: AVAudioPlayer?
    
    func loadModel() async {
        do {
            ttsKit = try await TTSKit(TTSKitConfig(model: .qwen3TTS_0_6b))
            statusMessage = "Ready"
        } catch {
            statusMessage = "Failed to load model: \(error)"
        }
    }
    
    func generate(text: String, speaker: Qwen3Speaker, language: Qwen3Language) async {
        guard let ttsKit = ttsKit else { return }
        
        isGenerating = true
        statusMessage = "Generating..."
        waveform = []
        
        do {
            let result = try await ttsKit.generate(
                text: text,
                speaker: speaker,
                language: language
            ) { progress in
                // Collect waveform peaks
                let peak = progress.audio.reduce(Float(0)) { max($0, abs($1)) }
                Task { @MainActor in
                    self.waveform.append(peak)
                }
                return true
            }
            
            audioSamples = result.audio
            audioDuration = result.audioDuration
            statusMessage = String(format: "Generated %.1fs (RTF: %.2f)",
                                 result.audioDuration,
                                 result.timings.realTimeFactor)
        } catch {
            statusMessage = "Error: \(error)"
        }
        
        isGenerating = false
    }
    
    func playGenerated() async {
        guard let ttsKit = ttsKit, !audioSamples.isEmpty else { return }
        
        do {
            // Save to temporary file
            let tempURL = FileManager.default.temporaryDirectory
                .appendingPathComponent("temp_audio.wav")
            
            try await AudioOutput.saveAudio(
                audioSamples,
                toFolder: tempURL.deletingLastPathComponent(),
                filename: "temp_audio",
                format: .wav
            )
            
            // Play with AVAudioPlayer
            audioPlayer = try AVAudioPlayer(contentsOf: tempURL)
            audioPlayer?.play()
            isPlaying = true
            statusMessage = "Playing..."
            
            // Wait for playback to finish
            try await Task.sleep(for: .seconds(audioDuration))
            
            isPlaying = false
            statusMessage = "Playback complete"
            
        } catch {
            statusMessage = "Playback error: \(error)"
            isPlaying = false
        }
    }
}

struct WaveformView: View {
    let samples: [Float]
    
    var body: some View {
        GeometryReader { geometry in
            HStack(spacing: 1) {
                ForEach(samples.indices, id: \.self) { index in
                    RoundedRectangle(cornerRadius: 2)
                        .fill(Color.blue)
                        .frame(
                            width: 2,
                            height: CGFloat(samples[index]) * geometry.size.height
                        )
                }
            }
        }
    }
}
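The view model above stores one peak per generation step, which works for live progress. To render a waveform for audio that has already been generated, you can re-bucket the full sample array instead. This helper is a self-contained sketch, not TTSKit API:

```swift
// Reduce a raw sample buffer to per-bucket peak values suitable for a
// waveform view like the one above. Pure Swift; bucket count is arbitrary.
func waveformPeaks(_ samples: [Float], buckets: Int) -> [Float] {
    guard buckets > 0, !samples.isEmpty else { return [] }
    let bucketSize = max(1, samples.count / buckets)
    return stride(from: 0, to: samples.count, by: bucketSize).map { start in
        let end = min(start + bucketSize, samples.count)
        // Peak = largest absolute amplitude in the bucket
        return samples[start..<end].reduce(0) { max($0, abs($1)) }
    }
}
```

For example, `waveformPeaks(result.audio, buckets: 100)` would yield roughly 100 bars regardless of audio length.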

Command Line Usage

TTSKit is available through the whisperkit-cli tool:
# Generate and play speech
swift run whisperkit-cli tts \
    --text "Hello from the command line" \
    --play

# Save to file
swift run whisperkit-cli tts \
    --text "Save to file" \
    --output-path output.wav

# Japanese with specific speaker
swift run whisperkit-cli tts \
    --text "日本語テスト" \
    --speaker ono-anna \
    --language japanese

# Use 1.7B model with style instruction
swift run whisperkit-cli tts \
    --text-file article.txt \
    --model 1.7b \
    --instruction "Read cheerfully"

# See all options
swift run whisperkit-cli tts --help

Performance Optimization

Compute Units

Optimize for your device:
let config = TTSKitConfig(
    model: .qwen3TTS_0_6b,
    computeOptions: ComputeOptions(
        embedderComputeUnits: .cpuOnly,
        codeDecoderComputeUnits: .cpuAndNeuralEngine,
        multiCodeDecoderComputeUnits: .cpuAndNeuralEngine,
        speechDecoderComputeUnits: .cpuAndNeuralEngine
    )
)

let tts = try await TTSKit(config)

Concurrent Workers

Adjust concurrency for long text:
var options = GenerationOptions()

// Use 4 concurrent workers
options.concurrentWorkerCount = 4

// Or let TTSKit decide (recommended)
options.concurrentWorkerCount = nil
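If you do pick a worker count manually, a reasonable heuristic is to cap it at both the number of text chunks and the number of CPU cores. This is my sketch of such a heuristic, not TTSKit's actual auto-selection logic:

```swift
// Illustrative heuristic (not TTSKit's internal auto logic): never spawn
// more workers than there are chunks or CPU cores, and always at least one.
func workerCount(chunkCount: Int, coreCount: Int) -> Int {
    max(1, min(chunkCount, coreCount))
}
```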

Demo App

The TTSKitExample app showcases:
  • Real-time streaming playback
  • Model management UI
  • Waveform visualization
  • Generation history
  • macOS and iOS support
Build and run to explore all features!

Next Steps

Basic Transcription

Learn speech-to-text with WhisperKit

Real-Time Streaming

Transcribe audio in real-time
