Overview
TTSKit is the main entry point for text-to-speech synthesis. It orchestrates text chunking, concurrent generation, crossfade, and audio playback. The class follows the WhisperKit pattern, exposing each model component as a protocol-typed public property that can be swapped at runtime.
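The crossfade step mentioned above can be pictured with a minimal sketch. It assumes a linear fade over a fixed number of overlapping samples; the function name `crossfade` and the linear curve are illustrative assumptions, not TTSKit's actual implementation.

```swift
import Foundation

/// Joins two audio chunks with a linear crossfade over `overlap` samples.
/// Hypothetical helper for illustration; not part of TTSKit.
func crossfade(_ a: [Float], _ b: [Float], overlap: Int) -> [Float] {
    // Clamp the overlap so it never exceeds either chunk.
    let n = max(0, min(overlap, a.count, b.count))
    guard n > 0 else { return a + b }

    // Keep the part of `a` that is not blended.
    var out = Array(a.dropLast(n))
    for i in 0..<n {
        // The weight for `b` rises from 0 toward 1 across the overlap.
        let t = Float(i) / Float(n)
        out.append(a[a.count - n + i] * (1 - t) + b[i] * t)
    }
    // Append the part of `b` that is not blended.
    out.append(contentsOf: b.dropFirst(n))
    return out
}

let joined = crossfade([1, 1, 1, 1], [0, 0, 0, 0], overlap: 2)
print(joined) // [1.0, 1.0, 1.0, 0.5, 0.0, 0.0]
```

The joined output is shorter than the two chunks combined by exactly `overlap` samples, which is why chunk boundaries do not produce audible clicks or gaps.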
Initialization
init(_:)
Create a TTSKit instance from a TTSKitConfig.
Pipeline configuration containing model variant, paths, compute units, component overrides, and lifecycle flags.
Throws: TTSError if the model family is unsupported or component instantiation fails.
Example:
init(model:modelFolder:…)
Convenience initializer that exposes all configuration fields as individual parameters.
Model variant to use.
Explicit local folder URL. When provided, download is skipped.
Base URL for Hub cache.
HuggingFace repo ID.
Local tokenizer folder path.
HuggingFace API token.
Per-component CoreML compute unit configuration.
Custom text projector implementation.
Custom code embedder implementation.
Custom multi-code embedder implementation.
Custom code decoder implementation.
Custom multi-code decoder implementation.
Custom speech decoder implementation.
Enable diagnostic logging.
Logging level when verbose is true.
Enable model prewarming to serialize compilation.
Load models immediately after init. When nil, models are loaded only if modelFolder is non-nil.
Download models if not already available locally.
Use a background URLSession for model downloads.
Optional seed for reproducible generation.
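The tri-state behaviour of the load flag described in the parameter list can be sketched as follows. The helper name `shouldLoadImmediately` is hypothetical and only mirrors the documented rule; it is not a TTSKit API.

```swift
import Foundation

/// Resolves the optional `load` flag: an explicit value wins, and
/// `nil` means "load only if a local model folder was supplied".
/// Hypothetical helper, not part of TTSKit.
func shouldLoadImmediately(load: Bool?, modelFolder: URL?) -> Bool {
    load ?? (modelFolder != nil)
}

// Placeholder path used only for illustration.
let folder = URL(fileURLWithPath: "/tmp/example-model")
print(shouldLoadImmediately(load: nil, modelFolder: folder)) // true
print(shouldLoadImmediately(load: nil, modelFolder: nil))    // false
```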
Properties
Model Components
Text token to embedding converter. Swappable at runtime.
Codec-0 token to embedding converter.
Multi-code token to embedding converter.
Autoregressive code-0 decoder.
Per-frame decoder.
RVQ codes to audio waveform converter.
Tokenizer instance. nil before the first loadModels() call or after unloadModels().
State
Current lifecycle state of the loaded models. Read-only.
Transitions: .unloaded → .downloading → .downloaded → .loading → .loaded
Or: .unloaded → .prewarming → .prewarmed
Pipeline configuration.
Direct accessor for the resolved local model folder. Backed by config.modelFolder.
Whether to use a background URLSession for model downloads. Backed by config.useBackgroundDownloadSession.
Cumulative timings for the most recent pipeline run. Read-only.
Wall-clock seconds for the most recent full model load. Read-only.
Wall-clock seconds for the most recent tokenizer load. Read-only.
Audio output instance used by play(). Read-only.
Cached prefix state for the most recently used voice/language/instruction. Automatically built on the first generate call and reused for subsequent calls with the same parameters. Set to nil to force a full prefill.
Invoked whenever modelState changes.
Seed for reproducible generation. Read-only.
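The two lifecycle chains listed under State can be modeled as a small transition table. This is a sketch only: the enum `LifecycleState` and the helper below are stand-ins written for illustration, and they cover just the transitions named in this section, not every transition TTSKit may perform.

```swift
/// Stand-in for the lifecycle state described under State.
/// Hypothetical type; case names mirror the documented states.
enum LifecycleState: Hashable {
    case unloaded, downloading, downloaded, loading, loaded, prewarming, prewarmed
}

/// Only the transitions listed in this section.
let documentedTransitions: [LifecycleState: Set<LifecycleState>] = [
    .unloaded: [.downloading, .prewarming],
    .downloading: [.downloaded],
    .downloaded: [.loading],
    .loading: [.loaded],
    .prewarming: [.prewarmed],
]

/// Returns true when `from → to` appears in the documented chains.
func isDocumentedTransition(from: LifecycleState, to: LifecycleState) -> Bool {
    documentedTransitions[from]?.contains(to) ?? false
}

print(isDocumentedTransition(from: .unloaded, to: .prewarming)) // true
print(isDocumentedTransition(from: .unloaded, to: .loaded))     // false
```

A state-change callback can use a table like this to log or assert on unexpected jumps during development.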
Static Methods
recommendedModels()
Returns the recommended model variant for the current platform.
Returns: The best default variant for the current platform.
fetchAvailableModels(from:matching:downloadBase:token:endpoint:)
Fetch all available model variants from the HuggingFace Hub.
from: HuggingFace repo ID to query.
matching: Glob patterns to filter returned variant names.
downloadBase: Optional base URL for Hub downloads.
token: HuggingFace API token.
endpoint: HuggingFace Hub endpoint URL.
Returns: Display names of available model variants matching the given patterns.
Throws: TTSError if the Hub request fails.
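To illustrate what filtering variant names by glob patterns means, here is a minimal sketch that uses the POSIX `fnmatch` function. The helper name `filterVariants` and the variant names are made up for the example; TTSKit's real matching rules may differ.

```swift
import Foundation

/// Keeps the names that match at least one glob pattern.
/// Hypothetical helper built on POSIX fnmatch; not part of TTSKit.
func filterVariants(_ names: [String], matching patterns: [String]) -> [String] {
    names.filter { name in
        // fnmatch returns 0 when the name matches the pattern.
        patterns.contains { pattern in fnmatch(pattern, name, 0) == 0 }
    }
}

// Placeholder variant names, not real TTSKit model identifiers.
let available = ["alpha-small", "alpha-large", "beta-small"]
print(filterVariants(available, matching: ["alpha-*"]))
```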
download(variant:downloadBase:useBackgroundSession:from:token:endpoint:revision:additionalPatterns:progressCallback:)
Download models for a specific variant from HuggingFace Hub.
variant: The model variant to download.
downloadBase: Base URL for the local cache.
useBackgroundSession: Use a background URLSession for the download.
from: HuggingFace repo ID.
token: HuggingFace API token.
endpoint: HuggingFace Hub endpoint URL.
revision: Specific git revision (commit SHA, tag, or branch) to download.
additionalPatterns: Extra glob patterns to include alongside the default component patterns.
progressCallback: Optional closure receiving download progress updates.
Returns: Local URL of the downloaded model folder.
Throws: TTSError if the Hub download fails.
download(config:progressCallback:)
Download models using a full TTSKitConfig.
config: Pipeline configuration containing modelRepo, modelToken, downloadRevision, downloadAdditionalPatterns, and variant settings.
progressCallback: Optional closure receiving download progress updates.
Returns: Local URL of the downloaded model folder.
Throws: TTSError if the Hub download fails.
Instance Methods
Model Lifecycle
setupModels(model:downloadBase:modelRepo:modelToken:modelFolder:download:endpoint:)
Resolve the local model folder, downloading from HuggingFace Hub if needed.
model: Model variant to download. nil uses config.model.
downloadBase: Base URL for Hub cache. nil uses the Hub library default.
modelRepo: HuggingFace repo ID. nil uses config.modelRepo.
modelToken: HuggingFace API token. nil uses config.modelToken.
modelFolder: Explicit local folder URL. When non-nil the download is skipped.
download: When true and modelFolder is nil, download from the resolved repo.
endpoint: HuggingFace Hub endpoint URL.
Throws: TTSError if the download fails or the model folder cannot be resolved.
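The fallback rules above (an explicit folder wins, nil arguments fall back to the configuration, and a download happens only when requested) can be sketched with stand-in types. Everything here, including `SetupDefaults`, `FolderResolution`, and `resolve`, is hypothetical and only mirrors the documented decision order.

```swift
import Foundation

/// Stand-in for the configuration fields used as fallbacks. Hypothetical type.
struct SetupDefaults {
    var model: String
    var modelRepo: String
}

/// Possible outcomes of resolving the model folder. Hypothetical type.
enum FolderResolution: Equatable {
    case useLocal(URL)
    case download(model: String, repo: String)
    case unresolved
}

func resolve(model: String?, modelRepo: String?, modelFolder: URL?,
             download: Bool, defaults: SetupDefaults) -> FolderResolution {
    // An explicit local folder skips the download entirely.
    if let folder = modelFolder { return .useLocal(folder) }
    // Without a folder, a download is attempted only when requested.
    guard download else { return .unresolved }
    // nil arguments fall back to the configured values.
    return .download(model: model ?? defaults.model,
                     repo: modelRepo ?? defaults.modelRepo)
}

// Placeholder values used only for illustration.
let defaults = SetupDefaults(model: "example-variant", modelRepo: "example-org/example-repo")
print(resolve(model: nil, modelRepo: nil, modelFolder: nil, download: true, defaults: defaults))
```

The `unresolved` case corresponds to the documented error condition in which the model folder cannot be resolved.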
prewarmModels()
Prewarm all CoreML models by compiling them sequentially, then discarding weights.
Call before loadModels() on first launch or after a model update.
Throws: TTSError if model compilation fails.
loadModels(prewarmMode:)
Load all models and the tokenizer.
prewarmMode: When true, compile models one at a time and discard weights to limit peak memory (prewarm). When false (default), load all concurrently.
Requires config.modelFolder to be set (call setupModels first if needed).
Throws: TTSError if model compilation or tokenizer loading fails.
loadTokenizerIfNeeded()
Load the tokenizer only if it has not been loaded yet.
No-op if tokenizer is already set.
Throws: TTSError if tokenizer loading fails.
loadTokenizer()
Load the tokenizer from config.tokenizerSource.
Checks for a local tokenizer.json file first; falls back to downloading from the Hugging Face Hub if no local file is found.
Returns: The loaded tokenizer instance.
Throws: TTSError if tokenizer loading fails.
unloadModels()
Release all model weights and the tokenizer from memory.
Transitions through .unloading before reaching .unloaded.
clearState()
Reset all accumulated timing statistics.
Pipeline Setup
setupPipeline(for:config:)
Configure the model-specific component properties for the active model family.
for: Model variant to configure.
config: Configuration containing component overrides.
Uses the component overrides from config if set; otherwise instantiates the default components for the given variant’s model family.
setupGenerateTask(currentTimings:progress:tokenizer:sampler:)
Set up the generate task used for speech synthesis.
currentTimings: Timing accumulator for the current run.
progress: Progress tracking instance.
tokenizer: Tokenizer instance.
sampler: Token sampling strategy.
Returns: A configured generation task.
Throws: TTSError if task setup fails.
createTask(progress:)
Create a fresh generation task with the guard/seed/counter boilerplate.
progress: Optional progress tracking instance.
Returns: An independent task with its own sampler seed and per-task buffers.
Throws: TTSError if the tokenizer is not loaded.
Speech Generation
generate(text:voice:language:options:callback:)
Synthesize speech from text and return the complete audio result.
text: The text to synthesize.
voice: Voice/speaker identifier. Format is model-specific (e.g., "ryan" for Qwen3 TTS).
language: Language identifier. Format is model-specific (e.g., "english" for Qwen3 TTS).
options: Sampling and generation options.
callback: Optional per-step callback receiving decoded audio chunks. Return false to cancel; nil or true to continue.
Returns: A SpeechResult containing the raw audio samples and timing breakdown.
Throws: TTSError if text is empty, models are not loaded, or generation fails.
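The callback contract (false cancels, nil or true continues) can be sketched with a stand-in loop. The `ChunkCallback` alias and the `deliver` function are hypothetical; the real callback in TTSKit may receive a richer progress type than a plain sample array.

```swift
/// Stand-in for a per-step callback: receives a decoded chunk and may
/// return `false` to cancel. Hypothetical alias, not the TTSKit type.
typealias ChunkCallback = ([Float]) -> Bool?

/// Delivers chunks in order and stops as soon as the callback returns `false`.
/// Returns how many chunks were delivered.
func deliver(chunks: [[Float]], callback: ChunkCallback?) -> Int {
    var delivered = 0
    for chunk in chunks {
        delivered += 1
        // Only an explicit `false` cancels; `nil` and `true` both continue.
        if callback?(chunk) == false { break }
    }
    return delivered
}

let chunks: [[Float]] = [[0.1], [0.2], [0.3]]
var seen = 0
let stopAfterTwo: ChunkCallback = { _ in
    seen += 1
    return seen < 2
}
print(deliver(chunks: chunks, callback: stopAfterTwo)) // 2
```

Treating nil as "continue" lets a callback that only observes progress omit an explicit return value decision.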
generate(text:speaker:language:options:callback:)
Generate speech from text using typed Qwen3 speaker and language enums.
text: Input text to synthesize.
speaker: The Qwen3Speaker voice to use.
language: The Qwen3Language to synthesize in.
options: Generation options controlling sampling, chunking, and concurrency.
callback: Per-step callback receiving decoded audio chunks. Return false to cancel.
Returns: The assembled SpeechResult.
Throws: TTSError on generation failure or task cancellation.
play(text:voice:language:options:playbackStrategy:callback:)
Generate speech and stream it through the audio output in real time.
text: The text to synthesize.
voice: Voice/speaker identifier.
language: Language identifier.
options: Sampling and generation options.
playbackStrategy: Controls how audio is buffered before playback begins.
callback: Optional per-step callback.
Returns: A SpeechResult with the complete audio and timing breakdown.
Generation runs sequentially (concurrentWorkerCount = 1) so frames can be enqueued in order.
Throws: TTSError on generation failure or task cancellation.
play(text:speaker:language:options:playbackStrategy:callback:)
Generate speech and stream playback using typed Qwen3 speaker and language enums.
text: Input text to synthesize.
speaker: The Qwen3Speaker voice to use.
language: The Qwen3Language to synthesize in.
options: Generation options controlling sampling, chunking, and concurrency.
playbackStrategy: Controls how much audio is buffered before playback begins.
callback: Per-step callback receiving decoded audio chunks. Return false to cancel.
Returns: The assembled SpeechResult.
Throws: TTSError on generation failure or task cancellation.
Prompt Cache Management
buildPromptCache(voice:language:instruction:)
Build a prompt cache for the given voice/language/instruction combination.
voice: Voice/speaker identifier. nil uses the model’s defaultVoice.
language: Language identifier. nil uses the model’s defaultLanguage.
instruction: Optional style instruction prepended to the TTS prompt.
Returns: The built TTSPromptCache that can be passed to subsequent generate calls.
Throws: TTSError if the model is not loaded or prompt caching is unsupported.
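The reuse rule for the cached prefix (reuse when voice, language, and instruction all match, rebuild otherwise, and force a rebuild by clearing the cache) can be sketched with stand-in types. `PromptKey`, `CachedPrefix`, and `PrefixCacheHolder` are hypothetical names for illustration, and the empty payload stands in for whatever state the real cache holds.

```swift
/// Parameters that identify a cached prefix. Hypothetical type.
struct PromptKey: Equatable {
    var voice: String
    var language: String
    var instruction: String?
}

/// Stand-in for a built prompt cache. Hypothetical type.
struct CachedPrefix {
    var key: PromptKey
    var payload: [Float]
}

final class PrefixCacheHolder {
    /// Set to nil to force a full prefill on the next request.
    var cache: CachedPrefix?
    /// Counts how often the (expensive) prefill step ran.
    private(set) var prefillCount = 0

    func prefix(for key: PromptKey) -> CachedPrefix {
        // Reuse the cache only when every parameter matches.
        if let cached = cache, cached.key == key { return cached }
        prefillCount += 1
        let built = CachedPrefix(key: key, payload: [])
        cache = built
        return built
    }
}

let holder = PrefixCacheHolder()
let ryan = PromptKey(voice: "ryan", language: "english", instruction: nil)
_ = holder.prefix(for: ryan)
_ = holder.prefix(for: ryan)
print(holder.prefillCount) // 1
```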
buildPromptCache(speaker:language:instruction:)
Build a prompt cache using typed Qwen3 speaker and language enums.
speaker: The Qwen3Speaker to pre-warm the cache for.
language: The Qwen3Language to pre-warm the cache for.
instruction: Optional style instruction (1.7B only).
Returns: A TTSPromptCache for the given parameters.
Throws: TTSError on generation failure.
savePromptCache()
Save the current prompt cache to disk under the model’s embeddings directory.
The file is written to <modelFolder>/embeddings/<voice>_<language>.promptcache.
Throws: TTSError if saving fails or modelFolder is not set.
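The documented file location can be assembled with Foundation's URL API, as in the sketch below. The helper name `promptCacheURL` and the model folder path are assumptions for illustration. How an instruction affects the file name is not stated here, so the sketch leaves it out.

```swift
import Foundation

/// Builds <modelFolder>/embeddings/<voice>_<language>.promptcache.
/// Hypothetical helper mirroring the documented layout; not part of TTSKit.
func promptCacheURL(modelFolder: URL, voice: String, language: String) -> URL {
    modelFolder
        .appendingPathComponent("embeddings")
        .appendingPathComponent("\(voice)_\(language).promptcache")
}

// Placeholder folder used only for illustration.
let cacheURL = promptCacheURL(
    modelFolder: URL(fileURLWithPath: "/tmp/example-model"),
    voice: "ryan",
    language: "english"
)
print(cacheURL.path) // /tmp/example-model/embeddings/ryan_english.promptcache
```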
loadPromptCache(voice:language:instruction:)
Load a prompt cache from disk if one exists for the given parameters.
voice: Voice/speaker identifier.
language: Language identifier.
instruction: Optional style instruction.
Returns: The loaded cache, or nil if no cached file exists.
Also stores the loaded cache on self.promptCache for automatic reuse.
Logging
loggingCallback(_:)
Register a custom log sink for all Logging output from TTSKit.
Custom logging callback. Pass nil to restore the default print-based logger.
SpeechModel Conformance
The output sample rate of the currently loaded speech decoder.