TTSKit is an on-device text-to-speech framework built on Core ML. It runs Qwen3 TTS models entirely on Apple silicon with real-time streaming playback, no server required.

Quick Start

import TTSKit

Task {
    let tts = try await TTSKit()
    let result = try await tts.generate(text: "Hello from TTSKit!")
    print("Generated \(result.audioDuration)s of audio at \(result.sampleRate)Hz")
}
TTSKit() automatically downloads the default 0.6B model on first run, loads the tokenizer and all six Core ML model components concurrently, and is then ready to generate.

Requirements

  • macOS 15.0 or later
  • iOS 18.0 or later
  • Xcode 16.0 or later

Features

Real-Time Streaming

Generate and play audio frame-by-frame with adaptive buffering

Multiple Voices

9 built-in voices across 10 languages

Concurrent Generation

Automatic text chunking with parallel generation

Style Control

Natural-language prosody instructions (1.7B model)
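As a sketch of how voice selection and style control might combine in one call (the `voice:` and `instruction:` parameter labels and the `.chelsie` voice identifier are assumptions for illustration, not confirmed API):

```swift
import TTSKit

Task {
    // Style instructions require the 1.7B model (macOS only)
    let tts = try await TTSKit(TTSKitConfig(model: .qwen3TTS_1_7b))
    let result = try await tts.generate(
        text: "Welcome back!",
        voice: .chelsie,  // assumed voice identifier
        instruction: "Speak warmly, at a relaxed pace"  // assumed prosody parameter
    )
    print("Generated \(result.audioDuration)s of audio")
}
```

Check the Voices & Languages and Generate Speech pages for the exact parameter names and available voices.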

Model Variants

TTSKit ships two model sizes:
Model | Size    | Platforms  | Features
------|---------|------------|-----------------------------------
0.6B  | ~1 GB   | macOS, iOS | Fast, runs on all devices
1.7B  | ~2.2 GB | macOS only | Higher quality, style instructions
// Fast, runs on all platforms
let tts = try await TTSKit(TTSKitConfig(model: .qwen3TTS_0_6b))

// Higher quality, macOS only
let tts = try await TTSKit(TTSKitConfig(model: .qwen3TTS_1_7b))
Models are hosted on Hugging Face and cached locally after the first download.

Architecture

TTSKit follows the same component-based architecture as WhisperKit. The pipeline consists of six model components:
public class TTSKit {
    // Model components (protocol-typed, swappable)
    public var textProjector: any TextProjecting
    public var codeEmbedder: any CodeEmbedding
    public var multiCodeEmbedder: any MultiCodeEmbedding
    public var codeDecoder: any CodeDecoding
    public var multiCodeDecoder: any MultiCodeDecoding
    public var speechDecoder: any SpeechDecoding
    public var tokenizer: (any Tokenizer)?
}
Each component can be swapped at runtime:
let config = TTSKitConfig(load: false)
let tts = try await TTSKit(config)
tts.codeDecoder = MyOptimizedCodeDecoder()
try await tts.loadModels()
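A custom component only needs to conform to the relevant pipeline protocol. A minimal sketch, assuming a hypothetical `decode(codes:)` requirement on CodeDecoding (the protocol's real requirements are not shown in this document and may differ):

```swift
import TTSKit
import CoreML

// Sketch only: `decode(codes:)` is a hypothetical method signature used
// to illustrate conforming a custom component to a pipeline protocol.
final class MyOptimizedCodeDecoder: CodeDecoding {
    func decode(codes: MLMultiArray) async throws -> MLMultiArray {
        // e.g. run a hand-tuned Core ML model here, or pass through for testing
        return codes
    }
}
```

Because components are protocol-typed, the rest of the pipeline is unaffected by the swap as long as the replacement honors the protocol's contract.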

Model Lifecycle

TTSKit provides fine-grained control over model loading:
// Auto-load on init (default)
let tts = try await TTSKit()

// Manual control
let config = TTSKitConfig(load: false)
let tts = try await TTSKit(config)

// Prewarm: compile models sequentially to cap peak memory
try await tts.prewarmModels()

// Load: load all models concurrently
try await tts.loadModels()

// Unload to free memory
await tts.unloadModels()
The modelState property tracks the current lifecycle state:
public enum ModelState {
    case unloaded
    case downloading
    case downloaded
    case loading
    case loaded
    case prewarming
    case prewarmed
    case unloading
}
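The state enum makes it straightforward to defer work until the models are ready. A minimal sketch, assuming `modelState` is a synchronously readable property on TTSKit:

```swift
import TTSKit

// Guard generation on lifecycle state: load on demand, then generate.
func generateWhenReady(_ tts: TTSKit, text: String) async throws {
    if tts.modelState != .loaded {
        try await tts.loadModels()  // brings state to .loaded
    }
    let result = try await tts.generate(text: text)
    print("Generated \(result.audioDuration)s of audio")
}
```

In a UI, the same property can drive a progress indicator during the `.downloading` and `.loading` phases.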

Next Steps

Generate Speech

Learn about generation options and chunking

Playback

Stream audio with real-time playback strategies

Voices & Languages

Explore available voices and language support

Configuration

Configure compute units and model variants
