
Overview

TTSKit is the main entry point for text-to-speech synthesis. It orchestrates text chunking, concurrent generation, crossfade, and audio playback. The class follows the WhisperKit pattern, exposing each model component as a protocol-typed public property that can be swapped at runtime.
open class TTSKit: @unchecked Sendable

Initialization

init(_:)

Create a TTSKit instance from a TTSKitConfig.
public init(_ config: TTSKitConfig = TTSKitConfig()) async throws
config
TTSKitConfig
default:"TTSKitConfig()"
Pipeline configuration containing model variant, paths, compute units, component overrides, and lifecycle flags.
Throws: TTSError if the model family is unsupported or component instantiation fails. Example:
let tts = try await TTSKit()

init(model:modelFolder:…)

Convenience initializer that exposes all configuration fields as individual parameters.
public convenience init(
    model: TTSModelVariant = .qwen3TTS_0_6b,
    modelFolder: URL? = nil,
    downloadBase: URL? = nil,
    modelRepo: String = Qwen3TTSConstants.defaultModelRepo,
    tokenizerFolder: URL? = nil,
    modelToken: String? = nil,
    computeOptions: ComputeOptions? = nil,
    textProjector: (any TextProjecting)? = nil,
    codeEmbedder: (any CodeEmbedding)? = nil,
    multiCodeEmbedder: (any MultiCodeEmbedding)? = nil,
    codeDecoder: (any CodeDecoding)? = nil,
    multiCodeDecoder: (any MultiCodeDecoding)? = nil,
    speechDecoder: (any SpeechDecoding)? = nil,
    verbose: Bool = false,
    logLevel: Logging.LogLevel = .debug,
    prewarm: Bool? = nil,
    load: Bool? = nil,
    download: Bool = true,
    useBackgroundDownloadSession: Bool = false,
    seed: UInt64? = nil
) async throws
model
TTSModelVariant
default:".qwen3TTS_0_6b"
Model variant to use.
modelFolder
URL?
default:"nil"
Explicit local folder URL. When provided, download is skipped.
downloadBase
URL?
default:"nil"
Base URL for Hub cache.
modelRepo
String
default:"Qwen3TTSConstants.defaultModelRepo"
HuggingFace repo ID.
tokenizerFolder
URL?
default:"nil"
Local tokenizer folder path.
modelToken
String?
default:"nil"
HuggingFace API token.
computeOptions
ComputeOptions?
default:"nil"
Per-component CoreML compute unit configuration.
textProjector
(any TextProjecting)?
default:"nil"
Custom text projector implementation.
codeEmbedder
(any CodeEmbedding)?
default:"nil"
Custom code embedder implementation.
multiCodeEmbedder
(any MultiCodeEmbedding)?
default:"nil"
Custom multi-code embedder implementation.
codeDecoder
(any CodeDecoding)?
default:"nil"
Custom code decoder implementation.
multiCodeDecoder
(any MultiCodeDecoding)?
default:"nil"
Custom multi-code decoder implementation.
speechDecoder
(any SpeechDecoding)?
default:"nil"
Custom speech decoder implementation.
verbose
Bool
default:"false"
Enable diagnostic logging.
logLevel
Logging.LogLevel
default:".debug"
Logging level when verbose is true.
prewarm
Bool?
default:"nil"
Enable model prewarming to serialize compilation.
load
Bool?
default:"nil"
Load models immediately after init. nil loads when modelFolder is non-nil.
download
Bool
default:"true"
Download models if not already available locally.
useBackgroundDownloadSession
Bool
default:"false"
Use a background URLSession for model downloads.
seed
UInt64?
default:"nil"
Optional seed for reproducible generation.
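As a hedged sketch of the convenience initializer above (the model folder path is a placeholder, not a real location), a caller can pin a local folder and a seed for reproducible output:

```swift
import Foundation

// Placeholder path: point this at an already-downloaded model directory.
// When modelFolder is provided, the download step is skipped.
let tts = try await TTSKit(
    model: .qwen3TTS_0_6b,
    modelFolder: URL(fileURLWithPath: "/path/to/models"),
    verbose: true,
    logLevel: .debug,
    seed: 42
)
```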

Properties

Model Components

textProjector
any TextProjecting
Text token to embedding converter. Swappable at runtime.
codeEmbedder
any CodeEmbedding
Codec-0 token to embedding converter.
multiCodeEmbedder
any MultiCodeEmbedding
Multi-code token to embedding converter.
codeDecoder
any CodeDecoding
Autoregressive code-0 decoder.
multiCodeDecoder
any MultiCodeDecoding
Per-frame decoder.
speechDecoder
any SpeechDecoding
RVQ codes to audio waveform converter.
tokenizer
(any Tokenizer)?
Tokenizer instance. nil before the first loadModels() call or after unloadModels().

State

modelState
ModelState
Current lifecycle state of the loaded models. Read-only. Transitions: .unloaded → .downloading → .downloaded → .loading → .loaded, or: .unloaded → .prewarming → .prewarmed.
config
TTSKitConfig
Pipeline configuration.
modelFolder
URL?
Direct accessor for the resolved local model folder. Backed by config.modelFolder.
useBackgroundDownloadSession
Bool
Whether to use a background URLSession for model downloads. Backed by config.useBackgroundDownloadSession.
currentTimings
SpeechTimings
Cumulative timings for the most recent pipeline run. Read-only.
modelLoadTime
TimeInterval
Wall-clock seconds for the most recent full model load. Read-only.
tokenizerLoadTime
TimeInterval
Wall-clock seconds for the most recent tokenizer load. Read-only.
audioOutput
AudioOutput
Audio output instance used by play(). Read-only.
promptCache
TTSPromptCache?
Cached prefix state for the most recently used voice/language/instruction. Automatically built on the first generate call and reused for subsequent calls with the same parameters. Set to nil to force a full prefill.
modelStateCallback
ModelStateCallback?
Invoked whenever modelState changes.
seed
UInt64?
Seed for reproducible generation. Read-only.
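To observe the state transitions listed above, a sketch using modelStateCallback (the exact callback parameter shape is an assumption; it is hedged here as receiving the new state):

```swift
let tts = try await TTSKit(TTSKitConfig(load: false))

// Assumption: the callback is invoked with the new ModelState.
tts.modelStateCallback = { newState in
    print("Model state changed to \(newState)")
}

// Triggers transitions through .loading to .loaded.
try await tts.loadModels()
```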

Static Methods

recommendedModels()

Returns the recommended model variant for the current platform.
public static func recommendedModels() -> TTSModelVariant
returns
TTSModelVariant
The best default variant for the current platform.

fetchAvailableModels(from:matching:downloadBase:token:endpoint:)

Fetch all available model variants from the HuggingFace Hub.
public static func fetchAvailableModels(
    from repo: String = Qwen3TTSConstants.defaultModelRepo,
    matching: [String] = ["*"],
    downloadBase: URL? = nil,
    token: String? = nil,
    endpoint: String = Qwen3TTSConstants.defaultEndpoint
) async throws -> [String]
repo
String
default:"Qwen3TTSConstants.defaultModelRepo"
HuggingFace repo ID to query.
matching
[String]
default:"[\"*\"]"
Glob patterns to filter returned variant names.
downloadBase
URL?
default:"nil"
Optional base URL for Hub downloads.
token
String?
default:"nil"
HuggingFace API token.
endpoint
String
default:"Qwen3TTSConstants.defaultEndpoint"
HuggingFace Hub endpoint URL.
returns
[String]
Display names of available model variants matching the given patterns.
Throws: TTSError if the Hub request fails.
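A minimal query sketch; the glob pattern "*0_6b*" is illustrative, not a documented variant name:

```swift
// Query the default repo, filtering to variants whose names match a pattern.
let variants = try await TTSKit.fetchAvailableModels(
    matching: ["*0_6b*"]
)
for name in variants {
    print("Available: \(name)")
}
```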

download(variant:downloadBase:useBackgroundSession:from:token:endpoint:revision:additionalPatterns:progressCallback:)

Download models for a specific variant from HuggingFace Hub.
open class func download(
    variant: TTSModelVariant = .defaultForCurrentPlatform,
    downloadBase: URL? = nil,
    useBackgroundSession: Bool = false,
    from repo: String = Qwen3TTSConstants.defaultModelRepo,
    token: String? = nil,
    endpoint: String = Qwen3TTSConstants.defaultEndpoint,
    revision: String? = nil,
    additionalPatterns: [String] = [],
    progressCallback: (@Sendable (Progress) -> Void)? = nil
) async throws -> URL
variant
TTSModelVariant
default:".defaultForCurrentPlatform"
The model variant to download.
downloadBase
URL?
default:"nil"
Base URL for the local cache.
useBackgroundSession
Bool
default:"false"
Use a background URLSession for the download.
repo
String
default:"Qwen3TTSConstants.defaultModelRepo"
HuggingFace repo ID.
token
String?
default:"nil"
HuggingFace API token.
endpoint
String
default:"Qwen3TTSConstants.defaultEndpoint"
HuggingFace Hub endpoint URL.
revision
String?
default:"nil"
Specific git revision (commit SHA, tag, or branch) to download.
additionalPatterns
[String]
default:"[]"
Extra glob patterns to include alongside the default component patterns.
progressCallback
(@Sendable (Progress) -> Void)?
default:"nil"
Optional closure receiving download progress updates.
returns
URL
Local URL of the downloaded model folder.
Throws: TTSError if the Hub download fails.
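A download sketch with progress reporting, using Foundation's Progress.fractionCompleted:

```swift
// Download the default variant for this platform, logging progress.
let folder = try await TTSKit.download(
    variant: .defaultForCurrentPlatform,
    progressCallback: { progress in
        print("Download: \(Int(progress.fractionCompleted * 100))%")
    }
)
print("Models at \(folder.path)")
```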

download(config:progressCallback:)

Download models using a full TTSKitConfig.
open class func download(
    config: TTSKitConfig = TTSKitConfig(),
    progressCallback: (@Sendable (Progress) -> Void)? = nil
) async throws -> URL
config
TTSKitConfig
default:"TTSKitConfig()"
Pipeline configuration containing modelRepo, modelToken, downloadRevision, downloadAdditionalPatterns, and variant settings.
progressCallback
(@Sendable (Progress) -> Void)?
default:"nil"
Optional closure receiving download progress updates.
returns
URL
Local URL of the downloaded model folder.
Throws: TTSError if the Hub download fails.

Instance Methods

Model Lifecycle

setupModels(model:downloadBase:modelRepo:modelToken:modelFolder:download:endpoint:)

Resolve the local model folder, downloading from HuggingFace Hub if needed.
open func setupModels(
    model: TTSModelVariant? = nil,
    downloadBase: URL? = nil,
    modelRepo: String? = nil,
    modelToken: String? = nil,
    modelFolder: URL? = nil,
    download: Bool,
    endpoint: String = Qwen3TTSConstants.defaultEndpoint
) async throws
model
TTSModelVariant?
default:"nil"
Model variant to download. nil uses config.model.
downloadBase
URL?
default:"nil"
Base URL for Hub cache. nil uses the Hub library default.
modelRepo
String?
default:"nil"
HuggingFace repo ID. nil uses config.modelRepo.
modelToken
String?
default:"nil"
HuggingFace API token. nil uses config.modelToken.
modelFolder
URL?
default:"nil"
Explicit local folder URL. When non-nil the download is skipped.
download
Bool
required
When true and modelFolder is nil, download from the resolved repo.
endpoint
String
default:"Qwen3TTSConstants.defaultEndpoint"
HuggingFace Hub endpoint URL.
Throws: TTSError if the download fails or the model folder cannot be resolved.

prewarmModels()

Prewarm all CoreML models by compiling them sequentially, then discarding weights.
open func prewarmModels() async throws
Serializes CoreML compilation to cap peak memory usage. Call before loadModels() on first launch or after a model update. Throws: TTSError if model compilation fails.
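A first-launch flow combining setup, prewarm, and load, as described above (a sketch; error handling is elided):

```swift
// Defer loading so we control the order: resolve/download, compile
// sequentially to cap peak memory, then load for use.
let tts = try await TTSKit(TTSKitConfig(load: false))
try await tts.setupModels(download: true)
try await tts.prewarmModels()
try await tts.loadModels()
```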

loadModels(prewarmMode:)

Load all models and the tokenizer.
open func loadModels(prewarmMode: Bool = false) async throws
prewarmMode
Bool
default:"false"
When true, compile models one at a time and discard weights to limit peak memory (prewarm). When false (default), load all concurrently.
Expects config.modelFolder to be set (call setupModels first if needed). Throws: TTSError if model compilation or tokenizer loading fails.

loadTokenizerIfNeeded()

Load the tokenizer only if it has not been loaded yet.
open func loadTokenizerIfNeeded() async throws
Skips loading when tokenizer is already set. Throws: TTSError if tokenizer loading fails.

loadTokenizer()

Load the tokenizer from config.tokenizerSource.
open func loadTokenizer() async throws -> any Tokenizer
Checks for a local tokenizer.json file first; falls back to downloading from the HuggingFace Hub if no local file is found.
returns
any Tokenizer
The loaded tokenizer instance.
Throws: TTSError if tokenizer loading fails.

unloadModels()

Release all model weights and the tokenizer from memory.
open func unloadModels() async
Transitions through .unloading before reaching .unloaded.

clearState()

Reset all accumulated timing statistics.
open func clearState()
Call between generation runs when you want fresh per-run timing data.

Pipeline Setup

setupPipeline(for:config:)

Configure the model-specific component properties for the active model family.
open func setupPipeline(for variant: TTSModelVariant, config: TTSKitConfig)
variant
TTSModelVariant
required
Model variant to configure.
config
TTSKitConfig
required
Configuration containing component overrides.
Uses the component overrides in config if set; otherwise instantiates the default components for the given variant’s model family.

setupGenerateTask(currentTimings:progress:tokenizer:sampler:)

Setup the generate task used for speech synthesis.
open func setupGenerateTask(
    currentTimings: SpeechTimings,
    progress: Progress,
    tokenizer: any Tokenizer,
    sampler: any TokenSampling
) throws -> any SpeechGenerating
currentTimings
SpeechTimings
required
Timing accumulator for the current run.
progress
Progress
required
Progress tracking instance.
tokenizer
any Tokenizer
required
Tokenizer instance.
sampler
any TokenSampling
required
Token sampling strategy.
returns
any SpeechGenerating
A configured generation task.
Subclasses may override to provide custom behavior. Throws: TTSError if task setup fails.

createTask(progress:)

Create a fresh generation task with the guard/seed/counter boilerplate.
open func createTask(progress: Progress? = nil) throws -> any SpeechGenerating
progress
Progress?
default:"nil"
Optional progress tracking instance.
returns
any SpeechGenerating
An independent task with its own sampler seed and per-task buffers.
Throws: TTSError if the tokenizer is not loaded.

Speech Generation

generate(text:voice:language:options:callback:)

Synthesize speech from text and return the complete audio result.
open func generate(
    text: String,
    voice: String? = nil,
    language: String? = nil,
    options: GenerationOptions = GenerationOptions(),
    callback: SpeechCallback = nil
) async throws -> SpeechResult
text
String
required
The text to synthesize.
voice
String?
default:"nil"
Voice/speaker identifier. Format is model-specific (e.g., "ryan" for Qwen3 TTS).
language
String?
default:"nil"
Language identifier. Format is model-specific (e.g., "english" for Qwen3 TTS).
options
GenerationOptions
default:"GenerationOptions()"
Sampling and generation options.
callback
SpeechCallback
default:"nil"
Optional per-step callback receiving decoded audio chunks. Return false to cancel; nil or true to continue.
returns
SpeechResult
A SpeechResult containing the raw audio samples and timing breakdown.
Handles text chunking, optional prompt caching, and concurrent multi-chunk generation. Throws: TTSError if the text is empty, models are not loaded, or generation fails.
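A generation sketch with a per-step callback; the callback's parameter shape is an assumption here, but the cancellation contract (return false to stop) is as documented above:

```swift
let tts = try await TTSKit()

// Assumption: the callback receives per-step info about the decoded chunk.
let result = try await tts.generate(
    text: "Hello from the callback example.",
    voice: "ryan",
    language: "english"
) { _ in
    // Return false to cancel early; nil or true continues generation.
    return true
}
print("Got \(result.audio.count) samples")
```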

generate(text:speaker:language:options:callback:)

Generate speech from text using typed Qwen3 speaker and language enums.
open func generate(
    text: String,
    speaker: Qwen3Speaker,
    language: Qwen3Language = .english,
    options: GenerationOptions = GenerationOptions(),
    callback: SpeechCallback = nil
) async throws -> SpeechResult
text
String
required
Input text to synthesize.
speaker
Qwen3Speaker
required
The Qwen3Speaker voice to use.
language
Qwen3Language
default:".english"
The Qwen3Language to synthesize in.
options
GenerationOptions
default:"GenerationOptions()"
Generation options controlling sampling, chunking, and concurrency.
callback
SpeechCallback
default:"nil"
Per-step callback receiving decoded audio chunks. Return false to cancel.
returns
SpeechResult
The assembled SpeechResult.
Throws: TTSError on generation failure or task cancellation.

play(text:voice:language:options:playbackStrategy:callback:)

Generate speech and stream it through the audio output in real time.
open func play(
    text: String,
    voice: String? = nil,
    language: String? = nil,
    options: GenerationOptions = GenerationOptions(),
    playbackStrategy: PlaybackStrategy = .auto,
    callback: SpeechCallback = nil
) async throws -> SpeechResult
text
String
required
The text to synthesize.
voice
String?
default:"nil"
Voice/speaker identifier.
language
String?
default:"nil"
Language identifier.
options
GenerationOptions
default:"GenerationOptions()"
Sampling and generation options.
playbackStrategy
PlaybackStrategy
default:".auto"
Controls how audio is buffered before playback begins.
callback
SpeechCallback
default:"nil"
Optional per-step callback.
returns
SpeechResult
A SpeechResult with the complete audio and timing breakdown.
For streaming strategies (auto, stream, buffered), chunking is forced to sequential (concurrentWorkerCount = 1) so frames can be enqueued in order. Throws: TTSError on generation failure or task cancellation.

play(text:speaker:language:options:playbackStrategy:callback:)

Generate speech and stream playback using typed Qwen3 speaker and language enums.
open func play(
    text: String,
    speaker: Qwen3Speaker,
    language: Qwen3Language = .english,
    options: GenerationOptions = GenerationOptions(),
    playbackStrategy: PlaybackStrategy = .auto,
    callback: SpeechCallback = nil
) async throws -> SpeechResult
text
String
required
Input text to synthesize.
speaker
Qwen3Speaker
required
The Qwen3Speaker voice to use.
language
Qwen3Language
default:".english"
The Qwen3Language to synthesize in.
options
GenerationOptions
default:"GenerationOptions()"
Generation options controlling sampling, chunking, and concurrency.
playbackStrategy
PlaybackStrategy
default:".auto"
Controls how much audio is buffered before playback begins.
callback
SpeechCallback
default:"nil"
Per-step callback receiving decoded audio chunks. Return false to cancel.
returns
SpeechResult
The assembled SpeechResult.
Throws: TTSError on generation failure or task cancellation.

Prompt Cache Management

buildPromptCache(voice:language:instruction:)

Build a prompt cache for the given voice/language/instruction combination.
open func buildPromptCache(
    voice: String? = nil,
    language: String? = nil,
    instruction: String? = nil
) async throws -> TTSPromptCache
voice
String?
default:"nil"
Voice/speaker identifier. nil uses the model’s defaultVoice.
language
String?
default:"nil"
Language identifier. nil uses the model’s defaultLanguage.
instruction
String?
default:"nil"
Optional style instruction prepended to the TTS prompt.
returns
TTSPromptCache
The built TTSPromptCache that can be passed to subsequent generate calls.
Pre-computes the invariant prefix embeddings and prefills them through the CodeDecoder, returning a reusable cache that eliminates ~90% of prefill cost on subsequent generate calls. Throws: TTSError if the model is not loaded or prompt caching is unsupported.
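Because the built cache is stored on promptCache and reused automatically (per the promptCache property above), a sketch of warming the cache ahead of the first call:

```swift
let tts = try await TTSKit()

// Pre-compute the invariant prefix once; subsequent generate calls with
// the same voice/language reuse it via self.promptCache automatically.
_ = try await tts.buildPromptCache(voice: "ryan", language: "english")

let result = try await tts.generate(
    text: "Cached prefix in action.",
    voice: "ryan",
    language: "english"
)
print("Samples: \(result.audio.count)")
```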

buildPromptCache(speaker:language:instruction:)

Build a prompt cache using typed Qwen3 speaker and language enums.
open func buildPromptCache(
    speaker: Qwen3Speaker,
    language: Qwen3Language,
    instruction: String? = nil
) async throws -> TTSPromptCache
speaker
Qwen3Speaker
required
The Qwen3Speaker to pre-warm the cache for.
language
Qwen3Language
required
The Qwen3Language to pre-warm the cache for.
instruction
String?
default:"nil"
Optional style instruction (1.7B only).
returns
TTSPromptCache
A TTSPromptCache for the given parameters.
Throws: TTSError on generation failure.

savePromptCache()

Save the current prompt cache to disk under the model’s embeddings directory.
public func savePromptCache() throws
The file is saved at <modelFolder>/embeddings/<voice>_<language>.promptcache. Throws: TTSError if saving fails or modelFolder is not set.

loadPromptCache(voice:language:instruction:)

Load a prompt cache from disk if one exists for the given parameters.
public func loadPromptCache(
    voice: String,
    language: String,
    instruction: String? = nil
) -> TTSPromptCache?
voice
String
required
Voice/speaker identifier.
language
String
required
Language identifier.
instruction
String?
default:"nil"
Optional style instruction.
returns
TTSPromptCache?
The loaded cache, or nil if not found.
Returns nil if no cached file exists. Also stores the loaded cache on self.promptCache for automatic reuse.
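A save/restore round trip sketch combining the two methods above (a first run builds and persists the cache; a later run restores it to skip the prefill):

```swift
let tts = try await TTSKit()

// First run: build and persist the cache to
// <modelFolder>/embeddings/ryan_english.promptcache.
_ = try await tts.buildPromptCache(voice: "ryan", language: "english")
try tts.savePromptCache()

// Later run: restore from disk. On success, promptCache is set and
// reused by generate automatically; nil means no cached file exists.
if tts.loadPromptCache(voice: "ryan", language: "english") == nil {
    print("No cache on disk; generate will prefill from scratch.")
}
```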

Logging

loggingCallback(_:)

Register a custom log sink for all Logging output from TTSKit.
open func loggingCallback(_ callback: Logging.LoggingCallback?)
callback
Logging.LoggingCallback?
required
Custom logging callback. Pass nil to restore the default print-based logger.
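A log-sink sketch; the callback's parameter shape is an assumption (hedged here as receiving a message string):

```swift
let tts = try await TTSKit(TTSKitConfig(load: false))

// Assumption: the callback receives the formatted log message.
// Route it into your own logging system instead of stdout as needed.
tts.loggingCallback { message in
    print("[TTSKit] \(message)")
}

// Pass nil to restore the default print-based logger.
tts.loggingCallback(nil)
```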

SpeechModel Conformance

sampleRate
Int
The output sample rate of the currently loaded speech decoder.

Example Usage

Basic Generation

let tts = try await TTSKit()
let result = try await tts.generate(
    text: "Hello, world!",
    voice: "ryan",
    language: "english"
)
print("Generated \(result.audio.count) samples")

With Custom Configuration

let config = TTSKitConfig(
    model: .qwen3TTS_0_6b,
    verbose: true,
    seed: 42
)
let tts = try await TTSKit(config)

Real-time Playback

let result = try await tts.play(
    text: "This is streaming audio.",
    speaker: .ryan,
    playbackStrategy: .auto
)

Component Swapping

let config = TTSKitConfig(load: false)
let tts = try await TTSKit(config)
tts.codeDecoder = MyOptimizedCodeDecoder()
try await tts.loadModels()
