TTSKit

Overview

TTSKit is the main entry point for text-to-speech synthesis. It orchestrates text chunking, concurrent generation, crossfade, and audio playback. The class follows the WhisperKit pattern, exposing each model component as a protocol-typed public property that can be swapped at runtime.

open class TTSKit: @unchecked Sendable

Initialization

init(_:)

Create a TTSKit instance from a TTSKitConfig.

public init(_ config: TTSKitConfig = TTSKitConfig()) async throws

config

TTSKitConfig

default:"TTSKitConfig()"

Pipeline configuration containing model variant, paths, compute units, component overrides, and lifecycle flags.

Throws: TTSError if the model family is unsupported or component instantiation fails.

Example:

let tts = try await TTSKit()

init(model:modelFolder:…)

Convenience initializer that exposes all configuration fields as individual parameters.

public convenience init(
    model: TTSModelVariant = .qwen3TTS_0_6b,
    modelFolder: URL? = nil,
    downloadBase: URL? = nil,
    modelRepo: String = Qwen3TTSConstants.defaultModelRepo,
    tokenizerFolder: URL? = nil,
    modelToken: String? = nil,
    computeOptions: ComputeOptions? = nil,
    textProjector: (any TextProjecting)? = nil,
    codeEmbedder: (any CodeEmbedding)? = nil,
    multiCodeEmbedder: (any MultiCodeEmbedding)? = nil,
    codeDecoder: (any CodeDecoding)? = nil,
    multiCodeDecoder: (any MultiCodeDecoding)? = nil,
    speechDecoder: (any SpeechDecoding)? = nil,
    verbose: Bool = false,
    logLevel: Logging.LogLevel = .debug,
    prewarm: Bool? = nil,
    load: Bool? = nil,
    download: Bool = true,
    useBackgroundDownloadSession: Bool = false,
    seed: UInt64? = nil
) async throws

model

TTSModelVariant

default:".qwen3TTS_0_6b"

Model variant to use.

modelFolder

URL?

default:"nil"

Explicit local folder URL. When provided, download is skipped.

downloadBase

URL?

default:"nil"

Base URL for Hub cache.

modelRepo

String

default:"Qwen3TTSConstants.defaultModelRepo"

HuggingFace repo ID.

tokenizerFolder

URL?

default:"nil"

Local tokenizer folder path.

modelToken

String?

default:"nil"

HuggingFace API token.

computeOptions

ComputeOptions?

default:"nil"

Per-component CoreML compute unit configuration.

textProjector

(any TextProjecting)?

default:"nil"

Custom text projector implementation.

codeEmbedder

(any CodeEmbedding)?

default:"nil"

Custom code embedder implementation.

multiCodeEmbedder

(any MultiCodeEmbedding)?

default:"nil"

Custom multi-code embedder implementation.

codeDecoder

(any CodeDecoding)?

default:"nil"

Custom code decoder implementation.

multiCodeDecoder

(any MultiCodeDecoding)?

default:"nil"

Custom multi-code decoder implementation.

speechDecoder

(any SpeechDecoding)?

default:"nil"

Custom speech decoder implementation.

verbose

Bool

default:"false"

Enable diagnostic logging.

logLevel

Logging.LogLevel

default:".debug"

Logging level when verbose is true.

prewarm

Bool?

default:"nil"

Enable model prewarming to serialize compilation.

load

Bool?

default:"nil"

Load models immediately after init. When nil, models are loaded only if modelFolder is non-nil.

download

Bool

default:"true"

Download models if not already available locally.

useBackgroundDownloadSession

Bool

default:"false"

Use a background URLSession for model downloads.

seed

UInt64?

default:"nil"

Optional seed for reproducible generation.
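A minimal sketch of the convenience initializer, assuming the models are already on disk. The module name in the import and the folder passed by the caller are assumptions; only parameters listed above are used, in declaration order.

```swift
import Foundation
import TTSKit  // assumption: the module is named after the class

/// Creates a pipeline from a local model folder without touching the network.
/// `localModels` is a hypothetical folder supplied by the caller.
func makeOfflineTTS(localModels: URL) async throws -> TTSKit {
    try await TTSKit(
        model: .qwen3TTS_0_6b,
        modelFolder: localModels,  // download is skipped when this is set
        verbose: true,
        download: false,
        seed: 42                   // fixed seed for reproducible generation
    )
}
```

Because load is left as nil and modelFolder is non-nil, the models should be loaded as part of initialization, per the load parameter description above.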

Properties

Model Components

textProjector

any TextProjecting

Text token to embedding converter. Swappable at runtime.

codeEmbedder

any CodeEmbedding

Codec-0 token to embedding converter.

multiCodeEmbedder

any MultiCodeEmbedding

Multi-code token to embedding converter.

codeDecoder

any CodeDecoding

Autoregressive code-0 decoder.

multiCodeDecoder

any MultiCodeDecoding

Per-frame decoder.

speechDecoder

any SpeechDecoding

RVQ codes to audio waveform converter.

tokenizer

(any Tokenizer)?

Tokenizer instance. nil before the first loadModels() call or after unloadModels().

State

modelState

ModelState

Current lifecycle state of the loaded models. Read-only.

Transitions: .unloaded → .downloading → .downloaded → .loading → .loaded

Or: .unloaded → .prewarming → .prewarmed
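The documented paths can be written down as a small transition table, for example to sanity-check state changes in UI code. This is an illustrative sketch only: SketchModelState is a hypothetical stand-in for the real ModelState type, and the table contains only the transitions named on this page (including the .unloading step described under unloadModels()). The library may allow others.

```swift
/// Hypothetical mirror of the lifecycle states named in this reference.
/// The real `ModelState` type may declare more cases.
enum SketchModelState: String {
    case unloaded, downloading, downloaded, loading, loaded
    case prewarming, prewarmed, unloading
}

/// Forward transitions taken from the two documented paths, plus the
/// `.unloading` step that `unloadModels()` passes through.
let documentedTransitions: [SketchModelState: Set<SketchModelState>] = [
    .unloaded: [.downloading, .prewarming],
    .downloading: [.downloaded],
    .downloaded: [.loading],
    .loading: [.loaded],
    .loaded: [.unloading],
    .unloading: [.unloaded],
    .prewarming: [.prewarmed],
]

/// Returns true when `from -> to` appears in the table above.
func isDocumentedTransition(from: SketchModelState, to: SketchModelState) -> Bool {
    documentedTransitions[from]?.contains(to) ?? false
}
```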

config

TTSKitConfig

Pipeline configuration.

modelFolder

URL?

Direct accessor for the resolved local model folder. Backed by config.modelFolder.

useBackgroundDownloadSession

Bool

Whether to use a background URLSession for model downloads. Backed by config.useBackgroundDownloadSession.

currentTimings

SpeechTimings

Cumulative timings for the most recent pipeline run. Read-only.

modelLoadTime

TimeInterval

Wall-clock seconds for the most recent full model load. Read-only.

tokenizerLoadTime

TimeInterval

Wall-clock seconds for the most recent tokenizer load. Read-only.

audioOutput

AudioOutput

Audio output instance used by play(). Read-only.

promptCache

TTSPromptCache?

Cached prefix state for the most recently used voice/language/instruction. Automatically built on the first generate call and reused for subsequent calls with the same parameters. Set to nil to force a full prefill.

modelStateCallback

ModelStateCallback?

Invoked whenever modelState changes.

seed

UInt64?

Seed for reproducible generation. Read-only.

Static Methods

recommendedModels()

Returns the recommended model variant for the current platform.

public static func recommendedModels() -> TTSModelVariant

returns

TTSModelVariant

The best default variant for the current platform.

fetchAvailableModels(from:matching:downloadBase:token:endpoint:)

Fetch all available model variants from the HuggingFace Hub.

public static func fetchAvailableModels(
    from repo: String = Qwen3TTSConstants.defaultModelRepo,
    matching: [String] = ["*"],
    downloadBase: URL? = nil,
    token: String? = nil,
    endpoint: String = Qwen3TTSConstants.defaultEndpoint
) async throws -> [String]

repo

String

default:"Qwen3TTSConstants.defaultModelRepo"

HuggingFace repo ID to query.

matching

[String]

default:"[\"*\"]"

Glob patterns to filter returned variant names.

downloadBase

URL?

default:"nil"

Optional base URL for Hub downloads.

token

String?

default:"nil"

HuggingFace API token.

endpoint

String

default:"Qwen3TTSConstants.defaultEndpoint"

HuggingFace Hub endpoint URL.

returns

[String]

Display names of available model variants matching the given patterns.

Throws: TTSError if the Hub request fails.

download(variant:downloadBase:useBackgroundSession:from:token:endpoint:revision:additionalPatterns:progressCallback:)

Download models for a specific variant from HuggingFace Hub.

open class func download(
    variant: TTSModelVariant = .defaultForCurrentPlatform,
    downloadBase: URL? = nil,
    useBackgroundSession: Bool = false,
    from repo: String = Qwen3TTSConstants.defaultModelRepo,
    token: String? = nil,
    endpoint: String = Qwen3TTSConstants.defaultEndpoint,
    revision: String? = nil,
    additionalPatterns: [String] = [],
    progressCallback: (@Sendable (Progress) -> Void)? = nil
) async throws -> URL

variant

TTSModelVariant

default:".defaultForCurrentPlatform"

The model variant to download.

downloadBase

URL?

default:"nil"

Base URL for the local cache.

useBackgroundSession

Bool

default:"false"

Use a background URLSession for the download.

repo

String

default:"Qwen3TTSConstants.defaultModelRepo"

HuggingFace repo ID.

token

String?

default:"nil"

HuggingFace API token.

endpoint

String

default:"Qwen3TTSConstants.defaultEndpoint"

HuggingFace Hub endpoint URL.

revision

String?

default:"nil"

Specific git revision (commit SHA, tag, or branch) to download.

additionalPatterns

[String]

default:"[]"

Extra glob patterns to include alongside the default component patterns.

progressCallback

(@Sendable (Progress) -> Void)?

default:"nil"

Optional closure receiving download progress updates.

returns

URL

Local URL of the downloaded model folder.

Throws: TTSError if the Hub download fails.
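A sketch of how the static methods above might be combined: list the available variants, download the recommended one with progress logging, then hand the resulting folder to the initializer. The import name is an assumption; the progress formatting uses Foundation's Progress.fractionCompleted.

```swift
import Foundation
import TTSKit  // assumption: the module is named after the class

/// Downloads the platform's recommended variant and builds a pipeline from it.
func makeTTSFromFreshDownload() async throws -> TTSKit {
    // Display names of everything the default repo offers.
    let available = try await TTSKit.fetchAvailableModels()
    print("Available variants: \(available)")

    // Download the recommended variant, logging progress as a percentage.
    let variant = TTSKit.recommendedModels()
    let folder = try await TTSKit.download(
        variant: variant,
        progressCallback: { progress in
            print("Download: \(Int(progress.fractionCompleted * 100))%")
        }
    )

    // Passing modelFolder means the initializer skips its own download step.
    return try await TTSKit(model: variant, modelFolder: folder)
}
```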

download(config:progressCallback:)

Download models using a full TTSKitConfig.

open class func download(
    config: TTSKitConfig = TTSKitConfig(),
    progressCallback: (@Sendable (Progress) -> Void)? = nil
) async throws -> URL

config

TTSKitConfig

default:"TTSKitConfig()"

Pipeline configuration containing modelRepo, modelToken, downloadRevision, downloadAdditionalPatterns, and variant settings.

progressCallback

(@Sendable (Progress) -> Void)?

default:"nil"

Optional closure receiving download progress updates.

returns

URL

Local URL of the downloaded model folder.

Throws: TTSError if the Hub download fails.

Instance Methods

Model Lifecycle

setupModels(model:downloadBase:modelRepo:modelToken:modelFolder:download:endpoint:)

Resolve the local model folder, downloading from HuggingFace Hub if needed.

open func setupModels(
    model: TTSModelVariant? = nil,
    downloadBase: URL? = nil,
    modelRepo: String? = nil,
    modelToken: String? = nil,
    modelFolder: URL? = nil,
    download: Bool,
    endpoint: String = Qwen3TTSConstants.defaultEndpoint
) async throws

model

TTSModelVariant?

default:"nil"

Model variant to download. nil uses config.model.

downloadBase

URL?

default:"nil"

Base URL for Hub cache. nil uses the Hub library default.

modelRepo

String?

default:"nil"

HuggingFace repo ID. nil uses config.modelRepo.

modelToken

String?

default:"nil"

HuggingFace API token. nil uses config.modelToken.

modelFolder

URL?

default:"nil"

Explicit local folder URL. When non-nil the download is skipped.

download

Bool

required

When true and modelFolder is nil, download from the resolved repo.

endpoint

String

default:"Qwen3TTSConstants.defaultEndpoint"

HuggingFace Hub endpoint URL.

Throws: TTSError if the download fails or the model folder cannot be resolved.

prewarmModels()

Prewarm all CoreML models by compiling them sequentially, then discarding weights.

open func prewarmModels() async throws

Serializes CoreML compilation to cap peak memory. Call before loadModels() on first launch or after a model update.

Throws: TTSError if model compilation fails.

loadModels(prewarmMode:)

Load all models and the tokenizer.

open func loadModels(prewarmMode: Bool = false) async throws

prewarmMode

Bool

default:"false"

When true, compile models one at a time and discard weights to limit peak memory (prewarm). When false (default), load all concurrently.

Expects config.modelFolder to be set (call setupModels first if needed).

Throws: TTSError if model compilation or tokenizer loading fails.
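The lifecycle methods above can be driven one step at a time. The sketch below defers both download and load at init, then runs setup, prewarm, and load in the order this page describes. The import name is an assumption, and whether a separate prewarm pass is worthwhile depends on the device.

```swift
import TTSKit  // assumption: the module is named after the class

/// Brings a pipeline up step by step instead of doing everything in init.
func coldStart() async throws -> TTSKit {
    // Defer downloading and loading so each step is explicit.
    let tts = try await TTSKit(load: false, download: false)

    // Resolve the local model folder, downloading from the Hub if needed.
    try await tts.setupModels(download: true)

    // Compile models one at a time to cap peak memory, then load them.
    try await tts.prewarmModels()
    try await tts.loadModels()

    print("Model load took \(tts.modelLoadTime) s")
    return tts
}
```

When memory needs to be reclaimed later, unloadModels() releases the weights and the tokenizer, and loadModels() can be called again afterwards.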

loadTokenizerIfNeeded()

Load the tokenizer only if it has not been loaded yet.

open func loadTokenizerIfNeeded() async throws

Skips loading when tokenizer is already set.

Throws: TTSError if tokenizer loading fails.

loadTokenizer()

Load the tokenizer from config.tokenizerSource.

open func loadTokenizer() async throws -> any Tokenizer

Checks for a local tokenizer.json file first; falls back to downloading from the Hugging Face Hub if no local file is found.

returns

any Tokenizer

The loaded tokenizer instance.

Throws: TTSError if tokenizer loading fails.

unloadModels()

Release all model weights and the tokenizer from memory.

open func unloadModels() async

Transitions through .unloading before reaching .unloaded.

clearState()

Reset all accumulated timing statistics.

open func clearState()

Call between generation runs when you want fresh per-run timing data.

Pipeline Setup

setupPipeline(for:config:)

Configure the model-specific component properties for the active model family.

open func setupPipeline(for variant: TTSModelVariant, config: TTSKitConfig)

variant

TTSModelVariant

required

Model variant to configure.

config

TTSKitConfig

required

Configuration containing component overrides.

Uses the component overrides in config if set; otherwise instantiates the default components for the given variant’s model family.

setupGenerateTask(currentTimings:progress:tokenizer:sampler:)

Setup the generate task used for speech synthesis.

open func setupGenerateTask(
    currentTimings: SpeechTimings,
    progress: Progress,
    tokenizer: any Tokenizer,
    sampler: any TokenSampling
) throws -> any SpeechGenerating

currentTimings

SpeechTimings

required

Timing accumulator for the current run.

progress

Progress

required

Progress tracking instance.

tokenizer

any Tokenizer

required

Tokenizer instance.

sampler

any TokenSampling

required

Token sampling strategy.

returns

any SpeechGenerating

A configured generation task.

Subclasses may override to provide custom behavior.

Throws: TTSError if task setup fails.

createTask(progress:)

Create a fresh generation task with the guard/seed/counter boilerplate.

open func createTask(progress: Progress? = nil) throws -> any SpeechGenerating

progress

Progress?

default:"nil"

Optional progress tracking instance.

returns

any SpeechGenerating

An independent task with its own sampler seed and per-task buffers.

Throws: TTSError if the tokenizer is not loaded.

Speech Generation

generate(text:voice:language:options:callback:)

Synthesize speech from text and return the complete audio result.

open func generate(
    text: String,
    voice: String? = nil,
    language: String? = nil,
    options: GenerationOptions = GenerationOptions(),
    callback: SpeechCallback = nil
) async throws -> SpeechResult

text

String

required

The text to synthesize.

voice

String?

default:"nil"

Voice/speaker identifier. Format is model-specific (e.g., "ryan" for Qwen3 TTS).

language

String?

default:"nil"

Language identifier. Format is model-specific (e.g., "english" for Qwen3 TTS).

options

GenerationOptions

default:"GenerationOptions()"

Sampling and generation options.

callback

SpeechCallback

default:"nil"

Optional per-step callback receiving decoded audio chunks. Return false to cancel; nil or true to continue.

returns

SpeechResult

A SpeechResult containing the raw audio samples and timing breakdown.

Handles text chunking, optional prompt caching, and concurrent multi-chunk generation.

Throws: TTSError if text is empty, models are not loaded, or generation fails.

generate(text:speaker:language:options:callback:)

Generate speech from text using typed Qwen3 speaker and language enums.

open func generate(
    text: String,
    speaker: Qwen3Speaker,
    language: Qwen3Language = .english,
    options: GenerationOptions = GenerationOptions(),
    callback: SpeechCallback = nil
) async throws -> SpeechResult

text

String

required

Input text to synthesize.

speaker

Qwen3Speaker

required

The Qwen3Speaker voice to use.

language

Qwen3Language

default:".english"

The Qwen3Language to synthesize in.

options

GenerationOptions

default:"GenerationOptions()"

Generation options controlling sampling, chunking, and concurrency.

callback

SpeechCallback

default:"nil"

Per-step callback receiving decoded audio chunks. Return false to cancel.

returns

SpeechResult

The assembled SpeechResult.

Throws: TTSError on generation failure or task cancellation.
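Since generate throws on empty text and on generation failure, callers may want a thin wrapper that guards the input and handles errors in one place. The following is a sketch under that assumption: the import name is assumed, and the error is caught generically because the individual TTSError cases are not listed on this page.

```swift
import Foundation
import TTSKit  // assumption: the module is named after the class

/// Synthesizes `text` with the typed Qwen3 overload.
/// Returns nil for blank input or when generation fails.
func synthesize(_ text: String, with tts: TTSKit) async -> SpeechResult? {
    // Blank input would make generate throw, so reject it up front.
    let trimmed = text.trimmingCharacters(in: .whitespacesAndNewlines)
    guard !trimmed.isEmpty else { return nil }

    do {
        return try await tts.generate(text: trimmed, speaker: .ryan, language: .english)
    } catch {
        print("Synthesis failed: \(error)")
        return nil
    }
}
```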

play(text:voice:language:options:playbackStrategy:callback:)

Generate speech and stream it through the audio output in real time.

open func play(
    text: String,
    voice: String? = nil,
    language: String? = nil,
    options: GenerationOptions = GenerationOptions(),
    playbackStrategy: PlaybackStrategy = .auto,
    callback: SpeechCallback = nil
) async throws -> SpeechResult

text

String

required

The text to synthesize.

voice

String?

default:"nil"

Voice/speaker identifier.

language

String?

default:"nil"

Language identifier.

options

GenerationOptions

default:"GenerationOptions()"

Sampling and generation options.

playbackStrategy

PlaybackStrategy

default:".auto"

Controls how audio is buffered before playback begins.

callback

SpeechCallback

default:"nil"

Optional per-step callback.

returns

SpeechResult

A SpeechResult with the complete audio and timing breakdown.

For streaming strategies (auto, stream, buffered) chunking is forced to sequential (concurrentWorkerCount = 1) so frames can be enqueued in order.

Throws: TTSError on generation failure or task cancellation.

play(text:speaker:language:options:playbackStrategy:callback:)

Generate speech and stream playback using typed Qwen3 speaker and language enums.

open func play(
    text: String,
    speaker: Qwen3Speaker,
    language: Qwen3Language = .english,
    options: GenerationOptions = GenerationOptions(),
    playbackStrategy: PlaybackStrategy = .auto,
    callback: SpeechCallback = nil
) async throws -> SpeechResult

text

String

required

Input text to synthesize.

speaker

Qwen3Speaker

required

The Qwen3Speaker voice to use.

language

Qwen3Language

default:".english"

The Qwen3Language to synthesize in.

options

GenerationOptions

default:"GenerationOptions()"

Generation options controlling sampling, chunking, and concurrency.

playbackStrategy

PlaybackStrategy

default:".auto"

Controls how much audio is buffered before playback begins.

callback

SpeechCallback

default:"nil"

Per-step callback receiving decoded audio chunks. Return false to cancel.

returns

SpeechResult

The assembled SpeechResult.

Throws: TTSError on generation failure or task cancellation.

Prompt Cache Management

buildPromptCache(voice:language:instruction:)

Build a prompt cache for the given voice/language/instruction combination.

open func buildPromptCache(
    voice: String? = nil,
    language: String? = nil,
    instruction: String? = nil
) async throws -> TTSPromptCache

voice

String?

default:"nil"

Voice/speaker identifier. nil uses the model’s defaultVoice.

language

String?

default:"nil"

Language identifier. nil uses the model’s defaultLanguage.

instruction

String?

default:"nil"

Optional style instruction prepended to the TTS prompt.

returns

TTSPromptCache

The built TTSPromptCache that can be passed to subsequent generate calls.

Pre-computes the invariant prefix embeddings and prefills them through the CodeDecoder, returning a reusable cache that eliminates ~90% of prefill cost on subsequent generate calls.

Throws: TTSError if the model is not loaded or prompt caching is unsupported.

buildPromptCache(speaker:language:instruction:)

Build a prompt cache using typed Qwen3 speaker and language enums.

open func buildPromptCache(
    speaker: Qwen3Speaker,
    language: Qwen3Language,
    instruction: String? = nil
) async throws -> TTSPromptCache

speaker

Qwen3Speaker

required

The Qwen3Speaker to pre-warm the cache for.

language

Qwen3Language

required

The Qwen3Language to pre-warm the cache for.

instruction

String?

default:"nil"

Optional style instruction (1.7B only).

returns

TTSPromptCache

A TTSPromptCache for the given parameters.

Throws: TTSError on generation failure.

savePromptCache()

Save the current prompt cache to disk under the model’s embeddings directory.

public func savePromptCache() throws

The file is saved at <modelFolder>/embeddings/<voice>_<language>.promptcache.

Throws: TTSError if saving fails or modelFolder is not set.
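For tooling that needs to locate or delete a saved cache, the documented path pattern can be reproduced with Foundation alone. This is an illustrative helper, not part of TTSKit; promptCacheURL is a hypothetical name, and how an instruction affects the file name is not documented here, so it is left out.

```swift
import Foundation

/// Builds `<modelFolder>/embeddings/<voice>_<language>.promptcache`,
/// following the pattern documented for savePromptCache().
func promptCacheURL(modelFolder: URL, voice: String, language: String) -> URL {
    modelFolder
        .appendingPathComponent("embeddings", isDirectory: true)
        .appendingPathComponent("\(voice)_\(language).promptcache")
}

// Example with a hypothetical model folder.
let exampleCacheURL = promptCacheURL(
    modelFolder: URL(fileURLWithPath: "/tmp/models"),
    voice: "ryan",
    language: "english"
)
print(exampleCacheURL.lastPathComponent)
```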

loadPromptCache(voice:language:instruction:)

Load a prompt cache from disk if one exists for the given parameters.

public func loadPromptCache(
    voice: String,
    language: String,
    instruction: String? = nil
) -> TTSPromptCache?

voice

String

required

Voice/speaker identifier.

language

String

required

Language identifier.

instruction

String?

default:"nil"

Optional style instruction.

returns

TTSPromptCache?

The loaded cache, or nil if not found.

Returns nil if no cached file exists. Also stores the loaded cache on self.promptCache for automatic reuse.
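The prompt cache methods fit a load-or-build pattern: try the disk first, and only pay the prefill cost on a miss. A minimal sketch follows; the import name is an assumption. The built cache is assigned to promptCache explicitly because this page does not state whether buildPromptCache stores it on the instance.

```swift
import TTSKit  // assumption: the module is named after the class

/// Ensures `tts.promptCache` is populated for the given voice and language.
func warmPromptCache(on tts: TTSKit, voice: String, language: String) async throws {
    // A hit is stored on tts.promptCache by loadPromptCache itself.
    if tts.loadPromptCache(voice: voice, language: language) != nil {
        return
    }

    // Miss: build the cache, install it, and persist it for the next launch.
    tts.promptCache = try await tts.buildPromptCache(voice: voice, language: language)
    try tts.savePromptCache()
}
```

Calling this once after loadModels(), for example with "ryan" and "english", lets later generate calls with the same parameters reuse the cached prefix.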

Logging

loggingCallback(_:)

Install a custom logging callback, or pass nil to restore the default logger.

open func loggingCallback(_ callback: Logging.LoggingCallback?)

callback

Logging.LoggingCallback?

required

Custom logging callback. Pass nil to restore the default print-based logger.

SpeechModel Conformance

sampleRate

Int

The output sample rate of the currently loaded speech decoder.
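One common use of sampleRate is converting a sample count into a duration. The helper below is a sketch that assumes the result's audio is a flat, single-channel sample array, as the sample count printed in Basic Generation suggests. durationSeconds is a hypothetical name and the sample rate in the example is illustrative.

```swift
import Foundation

/// Converts a sample count to seconds at the given sample rate.
/// Returns 0 for a non-positive sample rate instead of dividing by zero.
func durationSeconds(sampleCount: Int, sampleRate: Int) -> TimeInterval {
    guard sampleRate > 0 else { return 0 }
    return TimeInterval(sampleCount) / TimeInterval(sampleRate)
}

// Example with illustrative numbers: 48,000 samples at 24,000 Hz.
let exampleDuration = durationSeconds(sampleCount: 48_000, sampleRate: 24_000)
print(exampleDuration)
```

With a live pipeline the call would look like durationSeconds(sampleCount: result.audio.count, sampleRate: tts.sampleRate).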

Example Usage

Basic Generation

let tts = try await TTSKit()
let result = try await tts.generate(
    text: "Hello, world!",
    voice: "ryan",
    language: "english"
)
print("Generated \(result.audio.count) samples")

With Custom Configuration

let config = TTSKitConfig(
    model: .qwen3TTS_0_6b,
    verbose: true,
    seed: 42
)
let tts = try await TTSKit(config)

Real-time Playback

let result = try await tts.play(
    text: "This is streaming audio.",
    speaker: .ryan,
    playbackStrategy: .auto
)

Component Swapping

let config = TTSKitConfig(load: false)
let tts = try await TTSKit(config)
tts.codeDecoder = MyOptimizedCodeDecoder()
try await tts.loadModels()
