The generate method synthesizes speech from text and returns the complete audio result.
Basic Usage
import TTSKit

let tts = try await TTSKit()
let result = try await tts.generate(text: "Hello from TTSKit!")
// Access the result
print("Audio duration: \(result.audioDuration)s")
print("Sample rate: \(result.sampleRate) Hz")
print("Samples: \(result.audio.count)")
print("Timings: \(result.timings)")
Voices and Languages
Specify the speaker voice and language:
let result = try await tts.generate(
    text: "こんにちは世界",
    speaker: .onoAnna,
    language: .japanese
)
See Voices & Languages for the complete list of available voices and languages.
Generation Options
Customize sampling, chunking, and concurrency via GenerationOptions:
var options = GenerationOptions()

// Sampling parameters (recommended defaults from Qwen)
options.temperature = 0.9
options.topK = 50
options.repetitionPenalty = 1.05
options.maxNewTokens = 245

// Text chunking
options.chunkingStrategy = .sentence  // .none, .sentence, or .token
options.targetChunkSize = 200         // tokens per chunk
options.minChunkSize = 50             // minimum chunk size

// Concurrency
options.concurrentWorkerCount = 0     // 0 = max concurrency, 1 = sequential

let result = try await tts.generate(text: longText, options: options)
Sampling Parameters
temperature: Sampling temperature. Higher values (e.g., 1.0) increase randomness; lower values (e.g., 0.5) make output more deterministic.
topK: Top-k sampling: only sample from the k most likely tokens at each step.
repetitionPenalty: Penalty for repeated tokens. Values > 1.0 discourage repetition.
maxNewTokens: Maximum number of tokens to generate per chunk.
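To make the topK option concrete, the filtering step can be sketched in isolation. This is an illustrative implementation of top-k filtering, not TTSKit's internal sampler:

```swift
import Foundation

// Keep only the k highest logits; everything else is masked out so it
// can never be sampled. Illustrative sketch, not TTSKit's internal sampler.
func topKFilter(_ logits: [Float], k: Int) -> [Float] {
    // The k-th largest logit is the cutoff; values below it are excluded.
    let threshold = logits.sorted(by: >)[min(k, logits.count) - 1]
    return logits.map { $0 >= threshold ? $0 : -Float.infinity }
}

let logits: [Float] = [2.0, 0.5, 3.0, 1.0]
let filtered = topKFilter(logits, k: 2)
// Only the two largest logits (3.0 and 2.0) survive; the rest become -inf.
```

With `options.topK = 50`, the same cutoff is applied over the model's full vocabulary at each decoding step.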
Chunking Strategy
Long text is automatically split into chunks for efficient generation:
public enum ChunkingStrategy {
    case none      // No chunking; generate the entire text as one chunk
    case sentence  // Split at sentence boundaries (default)
    case token     // Split by token count only
}
How it works:
1. The TextChunker tokenizes the input and splits it based on the strategy.
2. Each chunk is generated independently (optionally in parallel).
3. Audio chunks are assembled with a crossfade at chunk boundaries.
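The crossfade assembly can be sketched as a linear blend over an overlap region. This is an illustrative sketch; TTSKit's actual boundary handling (overlap length, fade curve) may differ:

```swift
import Foundation

// Blend the tail of `a` into the head of `b` over `overlap` samples,
// producing one continuous buffer. Linear crossfade for illustration only;
// the library's real fade curve and overlap length may differ.
func crossfade(_ a: [Float], _ b: [Float], overlap: Int) -> [Float] {
    let n = min(overlap, a.count, b.count)
    var out = Array(a.dropLast(n))
    for i in 0..<n {
        let t = Float(i + 1) / Float(n + 1)  // fade-in weight for `b`
        out.append(a[a.count - n + i] * (1 - t) + b[i] * t)
    }
    out.append(contentsOf: b.dropFirst(n))
    return out
}

let joined = crossfade([1, 1, 1, 1], [0, 0, 0, 0], overlap: 2)
// Total length: 4 + 4 - 2 = 6 samples, ramping smoothly from 1 down to 0.
```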
options.chunkingStrategy = .sentence
options.targetChunkSize = 200  // Target 200 tokens per chunk
options.minChunkSize = 50      // Don't create chunks smaller than 50 tokens

let result = try await tts.generate(text: longArticle, options: options)
Sentence-based chunking preserves natural prosody boundaries. Token-based chunking may split mid-sentence for very long passages.
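The boundary rule behind sentence-based chunking can be sketched as follows. This simplified version treats `.`, `!`, and `?` as terminators and ignores token budgets, which the real TextChunker also enforces:

```swift
import Foundation

// Split text at sentence-final punctuation, keeping each terminator with
// its sentence. Simplified sketch: the real chunker additionally groups
// sentences to satisfy targetChunkSize/minChunkSize in tokens.
func splitSentences(_ text: String) -> [String] {
    var sentences: [String] = []
    var current = ""
    for ch in text {
        current.append(ch)
        if ch == "." || ch == "!" || ch == "?" {
            sentences.append(current.trimmingCharacters(in: .whitespaces))
            current = ""
        }
    }
    let rest = current.trimmingCharacters(in: .whitespaces)
    if !rest.isEmpty { sentences.append(rest) }
    return sentences
}

let parts = splitSentences("Hello there! How are you? Fine.")
// → ["Hello there!", "How are you?", "Fine."]
```

Because every split lands at a sentence boundary, each chunk is a prosodically complete unit, which is why this strategy sounds more natural than raw token splitting.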
Concurrent Generation
Control how many chunks generate in parallel:
// Maximum concurrency (default)
options.concurrentWorkerCount = 0

// Sequential (one chunk at a time)
options.concurrentWorkerCount = 1

// Fixed concurrency (2 chunks at a time)
options.concurrentWorkerCount = 2

let result = try await tts.generate(text: text, options: options)
Higher concurrency increases memory usage (each worker holds its own KV cache). For memory-constrained devices, use concurrentWorkerCount = 1.
Style Instructions (1.7B Only)
The 1.7B model accepts natural-language style instructions to control prosody:
let config = TTSKitConfig(model: .qwen3TTS_1_7b)
let tts = try await TTSKit(config)

var options = GenerationOptions()
options.instruction = "Speak slowly and warmly, like a storyteller."

let result = try await tts.generate(
    text: "Once upon a time...",
    speaker: .ryan,
    options: options
)
Style instructions are only supported by the 1.7B model. The 0.6B model ignores this parameter.
Progress Callbacks
Receive per-step audio during generation:
let result = try await tts.generate(text: "Hello!") { progress in
    print("Audio chunk: \(progress.audio.count) samples")

    // First step includes timing info
    if let stepTime = progress.stepTime {
        print("First step took \(stepTime)s")
    }

    // Chunk progress
    if let chunkIndex = progress.chunkIndex {
        print("Chunk \(chunkIndex + 1)/\(progress.totalChunks ?? 1)")
    }

    // Decoding steps
    print("Steps: \(progress.stepsCompleted)/\(progress.totalSteps ?? 0)")

    return true  // Return false to cancel
}
audio: PCM audio samples generated in this step.
timings: Cumulative timing breakdown for the current generation.
stepTime: Wall-clock time for the first decoding step (only set on the first callback).
chunkIndex: Index of the current chunk (when using chunked generation).
totalChunks: Total number of chunks (when using chunked generation).
stepsCompleted: Number of decoding steps completed so far.
totalSteps: Estimated total number of decoding steps.
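Returning false from the callback cancels generation. The contract can be sketched with a self-contained decode loop that honors a Bool-returning callback; this mirrors the cancellation behavior, not TTSKit's internals:

```swift
// Simulate a decode loop that stops as soon as the per-step callback
// returns false, mirroring how the progress callback cancels generation.
// Illustrative sketch only, not TTSKit's actual loop.
func runSteps(total: Int, onStep: (Int) -> Bool) -> Int {
    var completed = 0
    for step in 1...total {
        completed = step
        if !onStep(step) { break }  // callback returned false: cancel
    }
    return completed
}

let budget = 3
let completed = runSteps(total: 10) { $0 < budget }
// The loop stops at step 3, the first step where the callback returns false.
```

In real usage the same pattern might cap work with something like `progress.stepsCompleted < maxSteps` inside the generate callback.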
Prompt Caching
TTSKit automatically caches the invariant prefix embeddings for each voice/language combination:
// First call builds and caches the prompt for (ryan, english)
let result1 = try await tts.generate(text: "First sentence.", speaker: .ryan, language: .english)

// Second call reuses the cached prefix (~90% faster prefill)
let result2 = try await tts.generate(text: "Second sentence.", speaker: .ryan, language: .english)

// A different voice builds a new cache
let result3 = try await tts.generate(text: "Third sentence.", speaker: .aiden, language: .english)
Manual Cache Management
You can also build and save prompt caches explicitly:
// Build and save the cache
let cache = try await tts.buildPromptCache(speaker: .ryan, language: .english)
try tts.savePromptCache()

// Load the cache from disk
if let cache = tts.loadPromptCache(voice: "ryan", language: "english") {
    print("Loaded cached prefix of \(cache.prefixLength) tokens")
}

// Clear the cache to force a fresh prefill
tts.promptCache = nil
Caches are saved to <modelFolder>/embeddings/<voice>_<language>.promptcache.
Speech Result
The SpeechResult contains the generated audio and detailed timing breakdown:
public struct SpeechResult {
    public let audio: [Float]               // PCM samples (mono, Float32)
    public let timings: SpeechTimings       // Timing breakdown
    public let sampleRate: Int              // Sample rate (24000 Hz)
    public var audioDuration: TimeInterval  // Computed duration
}
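Since audioDuration is a computed property, its presumed relationship to the stored fields can be shown with plain arithmetic (an assumption from the mono PCM layout, not a quote of TTSKit's source):

```swift
import Foundation

// Duration in seconds = sample count / sample rate, for mono PCM.
// Presumed relationship behind `audioDuration`; illustration only.
let audio = [Float](repeating: 0, count: 36_000)  // 1.5 s of silence at 24 kHz
let sampleRate = 24_000
let audioDuration: TimeInterval = Double(audio.count) / Double(sampleRate)
// → 1.5 seconds
```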
Timing Breakdown
The SpeechTimings struct provides detailed performance metrics:
print("Model load: \(result.timings.modelLoading)s")
print("Tokenizer load: \(result.timings.tokenizerLoadTime)s")
print("Time to first buffer: \(result.timings.timeToFirstBuffer)s")
print("Total decoding loops: \(result.timings.totalDecodingLoops)")
print("Full pipeline: \(result.timings.fullPipeline)s")
Saving Audio
Generated audio can be saved to WAV or M4A:
let result = try await tts.generate(text: "Save me!")
let outputDir = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]

// Save as WAV
try await AudioOutput.saveAudio(
    result.audio,
    toFolder: outputDir,
    filename: "output.wav",
    sampleRate: result.sampleRate,
    format: .wav
)

// Save as M4A (AAC)
try await AudioOutput.saveAudio(
    result.audio,
    toFolder: outputDir,
    filename: "output.m4a",
    sampleRate: result.sampleRate,
    format: .m4a
)
M4A export is not available on watchOS and automatically falls back to WAV.
Next Steps
Playback: Stream audio with real-time playback.
Voices & Languages: Explore available voices and languages.