The generate method synthesizes speech from text and returns the complete audio result.

Basic Usage

import TTSKit

let tts = try await TTSKit()
let result = try await tts.generate(text: "Hello from TTSKit!")

// Access the result
print("Audio duration: \(result.audioDuration)s")
print("Sample rate: \(result.sampleRate)Hz")
print("Samples: \(result.audio.count)")
print("Timings: \(result.timings)")

Voices and Languages

Specify the speaker voice and language:
let result = try await tts.generate(
    text: "こんにちは世界",
    speaker: .onoAnna,
    language: .japanese
)
See Voices & Languages for the complete list of available voices and languages.

Generation Options

Customize sampling, chunking, and concurrency via GenerationOptions:
var options = GenerationOptions()

// Sampling parameters (recommended defaults from Qwen)
options.temperature = 0.9
options.topK = 50
options.repetitionPenalty = 1.05
options.maxNewTokens = 245

// Text chunking
options.chunkingStrategy = .sentence  // .none, .sentence, or .token
options.targetChunkSize = 200         // tokens per chunk
options.minChunkSize = 50             // minimum chunk size

// Concurrency
options.concurrentWorkerCount = 0     // 0 = max concurrency, 1 = sequential

let result = try await tts.generate(text: longText, options: options)

Sampling Parameters

temperature (Float, default 0.9): Sampling temperature. Higher values (e.g., 1.0) increase randomness; lower values (e.g., 0.5) make output more deterministic.
topK (Int, default 50): Top-k sampling; only the k most likely tokens are considered at each step.
repetitionPenalty (Float, default 1.05): Penalty applied to repeated tokens. Values greater than 1.0 discourage repetition.
maxNewTokens (Int, default 245): Maximum number of tokens to generate per chunk.
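As a reference for how temperature and topK interact, here is a minimal top-k sampling sketch in plain Swift. This is illustrative only: TTSKit's internal sampler is not shown in this document, and `sampleTopK` is a hypothetical helper.

```swift
import Foundation

// Illustrative top-k sampling with temperature, over raw model logits.
// Returns the index of the sampled token.
func sampleTopK(logits: [Float], temperature: Float, topK: Int) -> Int {
    // Scale logits by temperature: lower temperature sharpens the distribution.
    let scaled = logits.map { $0 / max(temperature, 1e-6) }

    // Keep only the k highest-scoring token indices.
    let topIndices = scaled.indices.sorted { scaled[$0] > scaled[$1] }.prefix(topK)

    // Softmax over the surviving logits (shifted by the max for stability).
    let maxLogit = topIndices.map { scaled[$0] }.max() ?? 0
    let exps = topIndices.map { expf(scaled[$0] - maxLogit) }
    let total = exps.reduce(0, +)

    // Draw from the renormalized distribution.
    var r = Float.random(in: 0..<total)
    for (offset, index) in topIndices.enumerated() {
        r -= exps[offset]
        if r <= 0 { return index }
    }
    return topIndices.last ?? 0
}
```

With a very low temperature the distribution collapses onto the most likely token; with topK = 1 sampling is fully deterministic.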

Chunking Strategy

Long text is automatically split into chunks for efficient generation:
public enum ChunkingStrategy {
    case none      // No chunking, generate entire text as one chunk
    case sentence  // Split at sentence boundaries (default)
    case token     // Split by token count only
}
How it works:
  1. The TextChunker tokenizes the input and splits it based on the strategy
  2. Each chunk is generated independently (optionally in parallel)
  3. Audio chunks are assembled with crossfade at boundaries
options.chunkingStrategy = .sentence
options.targetChunkSize = 200  // Target 200 tokens per chunk
options.minChunkSize = 50      // Don't create chunks smaller than 50 tokens

let result = try await tts.generate(text: longArticle, options: options)
Sentence-based chunking preserves natural prosody boundaries. Token-based chunking may split mid-sentence for very long passages.
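Step 3 above (assembling chunks with a crossfade) can be sketched as a generic equal-power crossfade in plain Swift. This is not TTSKit's actual boundary handling, and the `overlap` sample count is an assumed parameter for illustration.

```swift
import Foundation

// Equal-power crossfade: the last `overlap` samples of `a` fade out while the
// first `overlap` samples of `b` fade in, avoiding an audible seam.
func crossfade(_ a: [Float], _ b: [Float], overlap: Int) -> [Float] {
    let n = min(overlap, a.count, b.count)
    var out = Array(a.dropLast(n))
    let tail = Array(a.suffix(n))
    for i in 0..<n {
        let t = Float(i + 1) / Float(n + 1)   // ramps 0 → 1 across the overlap
        let gainOut = cosf(t * .pi / 2)       // fade out the first chunk
        let gainIn = sinf(t * .pi / 2)        // fade in the second chunk
        out.append(tail[i] * gainOut + b[i] * gainIn)
    }
    out.append(contentsOf: b.dropFirst(n))
    return out
}
```

The output is `overlap` samples shorter than the two inputs concatenated, since the overlapping regions are merged.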

Concurrent Generation

Control how many chunks generate in parallel:
// Maximum concurrency (default)
options.concurrentWorkerCount = 0

// Sequential (one chunk at a time)
options.concurrentWorkerCount = 1

// Fixed concurrency (2 chunks at a time)
options.concurrentWorkerCount = 2

let result = try await tts.generate(text: text, options: options)
Higher concurrency increases memory usage (each worker holds its own KV cache). For memory-constrained devices, use concurrentWorkerCount = 1.
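One way to act on this guidance is a small heuristic that picks a worker count from available RAM. `suggestedWorkerCount` is a hypothetical helper, not a TTSKit API; tune the thresholds for your model size and target devices.

```swift
import Foundation

// Hypothetical heuristic: map physical memory to a concurrentWorkerCount.
// Each concurrent worker holds its own KV cache, so constrained devices
// should fall back to sequential generation.
func suggestedWorkerCount() -> Int {
    let gigabytes = Double(ProcessInfo.processInfo.physicalMemory) / 1_073_741_824
    if gigabytes < 6 { return 1 }    // sequential: one KV cache at a time
    if gigabytes < 12 { return 2 }   // bounded parallelism
    return 0                         // 0 = let TTSKit use max concurrency
}
```

Usage: `options.concurrentWorkerCount = suggestedWorkerCount()`.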

Style Instructions (1.7B Only)

The 1.7B model accepts natural-language style instructions to control prosody:
let config = TTSKitConfig(model: .qwen3TTS_1_7b)
let tts = try await TTSKit(config)

var options = GenerationOptions()
options.instruction = "Speak slowly and warmly, like a storyteller."

let result = try await tts.generate(
    text: "Once upon a time...",
    speaker: .ryan,
    options: options
)
Style instructions are only supported by the 1.7B model. The 0.6B model ignores this parameter.

Progress Callbacks

Receive per-step audio during generation:
let result = try await tts.generate(text: "Hello!") { progress in
    print("Audio chunk: \(progress.audio.count) samples")
    
    // First step includes timing info
    if let stepTime = progress.stepTime {
        print("First step took \(stepTime)s")
    }
    
    // Chunk progress
    if let chunkIndex = progress.chunkIndex {
        print("Chunk \(chunkIndex + 1)/\(progress.totalChunks ?? 1)")
    }
    
    // Decoding steps
    print("Steps: \(progress.stepsCompleted)/\(progress.totalSteps ?? 0)")
    
    return true  // Return false to cancel
}
Each callback receives a progress value with the following fields:

audio ([Float]): PCM audio samples generated in this step.
timings (SpeechTimings): Cumulative timing breakdown for the current generation.
stepTime (TimeInterval?): Wall-clock time for the first decoding step (set only on the first callback).
chunkIndex (Int?): Index of the current chunk (when using chunked generation).
totalChunks (Int?): Total number of chunks (when using chunked generation).
stepsCompleted (Int): Number of decoding steps completed so far.
totalSteps (Int?): Estimated total decoding steps.

Prompt Caching

TTSKit automatically caches the invariant prefix embeddings for each voice/language combination:
// First call builds and caches the prompt for (ryan, english)
let result1 = try await tts.generate(text: "First sentence.", speaker: .ryan, language: .english)

// Second call reuses the cached prefix (~90% faster prefill)
let result2 = try await tts.generate(text: "Second sentence.", speaker: .ryan, language: .english)

// Different voice builds a new cache
let result3 = try await tts.generate(text: "Third sentence.", speaker: .aiden, language: .english)

Manual Cache Management

You can also build and save prompt caches explicitly:
// Build and save cache
let cache = try await tts.buildPromptCache(speaker: .ryan, language: .english)
try tts.savePromptCache()

// Load cache from disk
if let cache = tts.loadPromptCache(voice: "ryan", language: "english") {
    print("Loaded cached prefix of \(cache.prefixLength) tokens")
}

// Clear cache to force fresh prefill
tts.promptCache = nil
Caches are saved to <modelFolder>/embeddings/<voice>_<language>.promptcache.
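Given that path convention, you can check for an existing cache on disk before deciding whether to prebuild one. `promptCacheExists` is an illustrative helper (not part of TTSKit), assuming `modelFolder` points at your downloaded model directory.

```swift
import Foundation

// Builds the cache path from the documented convention:
// <modelFolder>/embeddings/<voice>_<language>.promptcache
func promptCacheExists(modelFolder: URL, voice: String, language: String) -> Bool {
    let cacheURL = modelFolder
        .appendingPathComponent("embeddings")
        .appendingPathComponent("\(voice)_\(language).promptcache")
    return FileManager.default.fileExists(atPath: cacheURL.path)
}
```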

Speech Result

The SpeechResult contains the generated audio and detailed timing breakdown:
public struct SpeechResult {
    public let audio: [Float]              // PCM samples (mono, Float32)
    public let timings: SpeechTimings      // Timing breakdown
    public let sampleRate: Int             // Sample rate (24000 Hz)
    
    public var audioDuration: TimeInterval // Computed duration
}

Timing Breakdown

The SpeechTimings struct provides detailed performance metrics:
print("Model load: \(result.timings.modelLoading)s")
print("Tokenizer load: \(result.timings.tokenizerLoadTime)s")
print("Time to first buffer: \(result.timings.timeToFirstBuffer)s")
print("Total decoding loops: \(result.timings.totalDecodingLoops)")
print("Full pipeline: \(result.timings.fullPipeline)s")
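A useful metric to derive from these timings is the real-time factor: wall-clock generation time divided by audio duration, where values below 1.0 mean generation ran faster than playback. This helper is a sketch, not a TTSKit API:

```swift
// Real-time factor: seconds spent generating per second of audio produced.
// RTF < 1.0 means the pipeline ran faster than real time.
func realTimeFactor(pipelineSeconds: Double, audioSeconds: Double) -> Double {
    audioSeconds > 0 ? pipelineSeconds / audioSeconds : .infinity
}
```

For example, `realTimeFactor(pipelineSeconds: result.timings.fullPipeline, audioSeconds: result.audioDuration)`.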

Saving Audio

Generated audio can be saved to WAV or M4A:
let result = try await tts.generate(text: "Save me!")
let outputDir = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]

// Save as WAV
try await AudioOutput.saveAudio(
    result.audio,
    toFolder: outputDir,
    filename: "output.wav",
    sampleRate: result.sampleRate,
    format: .wav
)

// Save as M4A (AAC)
try await AudioOutput.saveAudio(
    result.audio,
    toFolder: outputDir,
    filename: "output.m4a",
    sampleRate: result.sampleRate,
    format: .m4a
)
M4A export is not available on watchOS and automatically falls back to WAV.
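If you want to see what the WAV export contains (or need a writer independent of AudioOutput), a minimal 16-bit PCM WAV file is a 44-byte RIFF header followed by little-endian samples. The sketch below assumes mono 16-bit output and is not TTSKit's actual implementation:

```swift
import Foundation

// Writes mono Float samples as a 16-bit PCM WAV file.
func writeWAV(_ samples: [Float], sampleRate: Int, to url: URL) throws {
    // Clamp to [-1, 1] and convert to 16-bit integers.
    let pcm = samples.map { Int16(max(-1, min(1, $0)) * 32767) }
    let dataSize = pcm.count * 2
    var bytes = Data()
    func append(_ s: String) { bytes.append(contentsOf: Array(s.utf8)) }
    func append32(_ v: UInt32) { withUnsafeBytes(of: v.littleEndian) { bytes.append(contentsOf: $0) } }
    func append16(_ v: UInt16) { withUnsafeBytes(of: v.littleEndian) { bytes.append(contentsOf: $0) } }

    append("RIFF"); append32(UInt32(36 + dataSize)); append("WAVE")
    append("fmt "); append32(16)
    append16(1)                          // audio format: PCM
    append16(1)                          // channels: mono
    append32(UInt32(sampleRate))
    append32(UInt32(sampleRate * 2))     // byte rate = rate * channels * 2
    append16(2)                          // block align = channels * 2
    append16(16)                         // bits per sample
    append("data"); append32(UInt32(dataSize))
    for s in pcm { withUnsafeBytes(of: s.littleEndian) { bytes.append(contentsOf: $0) } }
    try bytes.write(to: url)
}
```

For TTSKit output you would pass `result.audio` and `result.sampleRate` (24000 Hz).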

Next Steps

Playback

Stream audio with real-time playback

Voices & Languages

Explore available voices and languages
