The generate method synthesizes speech from text and returns the complete audio result.

Basic Usage

import TTSKit

let tts = try await TTSKit()
let result = try await tts.generate(text: "Hello from TTSKit!")

// Access the result
print("Audio duration: \(result.audioDuration)s")
print("Sample rate: \(result.sampleRate)Hz")
print("Samples: \(result.audio.count)")
print("Timings: \(result.timings)")

Voices and Languages

Specify the speaker voice and language:
let result = try await tts.generate(
    text: "こんにちは世界",
    speaker: .onoAnna,
    language: .japanese
)
See Voices & Languages for the complete list of available voices and languages.

Generation Options

Customize sampling, chunking, and concurrency via GenerationOptions:
var options = GenerationOptions()

// Sampling parameters (recommended defaults from Qwen)
options.temperature = 0.9
options.topK = 50
options.repetitionPenalty = 1.05
options.maxNewTokens = 245

// Text chunking
options.chunkingStrategy = .sentence  // .none, .sentence, or .token
options.targetChunkSize = 200         // tokens per chunk
options.minChunkSize = 50             // minimum chunk size

// Concurrency
options.concurrentWorkerCount = 0     // 0 = max concurrency, 1 = sequential

let result = try await tts.generate(text: longText, options: options)

Sampling Parameters

temperature (Float, default 0.9): Sampling temperature. Higher values (e.g., 1.0) increase randomness; lower values (e.g., 0.5) make output more deterministic.
topK (Int, default 50): Top-k sampling; only the k most likely tokens are considered at each step.
repetitionPenalty (Float, default 1.05): Penalty applied to repeated tokens. Values greater than 1.0 discourage repetition.
maxNewTokens (Int, default 245): Maximum number of tokens to generate per chunk.
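As a reference for how temperature and topK interact, here is a minimal top-k sampling sketch in plain Swift. This is illustrative only: TTSKit's internal sampler is not shown in this document, and `sampleTopK` is a hypothetical helper.

```swift
import Foundation

// Illustrative top-k sampling with temperature, over raw model logits.
// Returns the index of the sampled token.
func sampleTopK(logits: [Float], temperature: Float, topK: Int) -> Int {
    // Scale logits by temperature: lower temperature sharpens the distribution.
    let scaled = logits.map { $0 / max(temperature, 1e-6) }

    // Keep only the k highest-scoring token indices.
    let topIndices = scaled.indices.sorted { scaled[$0] > scaled[$1] }.prefix(topK)

    // Softmax over the surviving logits (shifted by the max for stability).
    let maxLogit = topIndices.map { scaled[$0] }.max() ?? 0
    let exps = topIndices.map { expf(scaled[$0] - maxLogit) }
    let total = exps.reduce(0, +)

    // Draw from the renormalized distribution.
    var r = Float.random(in: 0..<total)
    for (offset, index) in topIndices.enumerated() {
        r -= exps[offset]
        if r <= 0 { return index }
    }
    return topIndices.last ?? 0
}
```

With a very low temperature the distribution collapses onto the most likely token; with topK = 1 sampling is fully deterministic.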

Chunking Strategy

Long text is automatically split into chunks for efficient generation:
public enum ChunkingStrategy {
    case none      // No chunking, generate entire text as one chunk
    case sentence  // Split at sentence boundaries (default)
    case token     // Split by token count only
}
How it works:
  1. The TextChunker tokenizes the input and splits it based on the strategy
  2. Each chunk is generated independently (optionally in parallel)
  3. Audio chunks are assembled with crossfade at boundaries
options.chunkingStrategy = .sentence
options.targetChunkSize = 200  // Target 200 tokens per chunk
options.minChunkSize = 50      // Don't create chunks smaller than 50 tokens

let result = try await tts.generate(text: longArticle, options: options)
Sentence-based chunking preserves natural prosody boundaries. Token-based chunking may split mid-sentence for very long passages.
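Step 3 above (assembling chunks with a crossfade) can be sketched as a generic equal-power crossfade in plain Swift. This is not TTSKit's actual boundary handling, and the `overlap` sample count is an assumed parameter for illustration.

```swift
import Foundation

// Equal-power crossfade: the last `overlap` samples of `a` fade out while the
// first `overlap` samples of `b` fade in, avoiding an audible seam.
func crossfade(_ a: [Float], _ b: [Float], overlap: Int) -> [Float] {
    let n = min(overlap, a.count, b.count)
    var out = Array(a.dropLast(n))
    let tail = Array(a.suffix(n))
    for i in 0..<n {
        let t = Float(i + 1) / Float(n + 1)   // ramps 0 → 1 across the overlap
        let gainOut = cosf(t * .pi / 2)       // fade out the first chunk
        let gainIn = sinf(t * .pi / 2)        // fade in the second chunk
        out.append(tail[i] * gainOut + b[i] * gainIn)
    }
    out.append(contentsOf: b.dropFirst(n))
    return out
}
```

The output is `overlap` samples shorter than the two inputs concatenated, since the overlapping regions are merged.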

Concurrent Generation

Control how many chunks generate in parallel:
// Maximum concurrency (default)
options.concurrentWorkerCount = 0

// Sequential (one chunk at a time)
options.concurrentWorkerCount = 1

// Fixed concurrency (2 chunks at a time)
options.concurrentWorkerCount = 2

let result = try await tts.generate(text: text, options: options)
Higher concurrency increases memory usage (each worker holds its own KV cache). For memory-constrained devices, use concurrentWorkerCount = 1.
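One way to act on this guidance is a small heuristic that picks a worker count from available RAM. `suggestedWorkerCount` is a hypothetical helper, not a TTSKit API; tune the thresholds for your model size and target devices.

```swift
import Foundation

// Hypothetical heuristic: map physical memory to a concurrentWorkerCount.
// Each concurrent worker holds its own KV cache, so constrained devices
// should fall back to sequential generation.
func suggestedWorkerCount() -> Int {
    let gigabytes = Double(ProcessInfo.processInfo.physicalMemory) / 1_073_741_824
    if gigabytes < 6 { return 1 }    // sequential: one KV cache at a time
    if gigabytes < 12 { return 2 }   // bounded parallelism
    return 0                         // 0 = let TTSKit use max concurrency
}
```

Usage: `options.concurrentWorkerCount = suggestedWorkerCount()`.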

Style Instructions (1.7B Only)

The 1.7B model accepts natural-language style instructions to control prosody:
let config = TTSKitConfig(model: .qwen3TTS_1_7b)
let tts = try await TTSKit(config)

var options = GenerationOptions()
options.instruction = "Speak slowly and warmly, like a storyteller."

let result = try await tts.generate(
    text: "Once upon a time...",
    speaker: .ryan,
    options: options
)
Style instructions are only supported by the 1.7B model. The 0.6B model ignores this parameter.

Progress Callbacks

Receive per-step audio during generation:
let result = try await tts.generate(text: "Hello!") { progress in
    print("Audio chunk: \(progress.audio.count) samples")
    
    // First step includes timing info
    if let stepTime = progress.stepTime {
        print("First step took \(stepTime)s")
    }
    
    // Chunk progress
    if let chunkIndex = progress.chunkIndex {
        print("Chunk \(chunkIndex + 1)/\(progress.totalChunks ?? 1)")
    }
    
    // Decoding steps
    print("Steps: \(progress.stepsCompleted)/\(progress.totalSteps ?? 0)")
    
    return true  // Return false to cancel
}
Each callback receives a progress value with the following fields:

audio ([Float]): PCM audio samples generated in this step.
timings (SpeechTimings): Cumulative timing breakdown for the current generation.
stepTime (TimeInterval?): Wall-clock time for the first decoding step (set only on the first callback).
chunkIndex (Int?): Index of the current chunk (when using chunked generation).
totalChunks (Int?): Total number of chunks (when using chunked generation).
stepsCompleted (Int): Number of decoding steps completed so far.
totalSteps (Int?): Estimated total decoding steps.

Prompt Caching

TTSKit automatically caches the invariant prefix embeddings for each voice/language combination:
// First call builds and caches the prompt for (ryan, english)
let result1 = try await tts.generate(text: "First sentence.", speaker: .ryan, language: .english)

// Second call reuses the cached prefix (~90% faster prefill)
let result2 = try await tts.generate(text: "Second sentence.", speaker: .ryan, language: .english)

// Different voice builds a new cache
let result3 = try await tts.generate(text: "Third sentence.", speaker: .aiden, language: .english)

Manual Cache Management

You can also build and save prompt caches explicitly:
// Build and save cache
let cache = try await tts.buildPromptCache(speaker: .ryan, language: .english)
try tts.savePromptCache()

// Load cache from disk
if let cache = tts.loadPromptCache(voice: "ryan", language: "english") {
    print("Loaded cached prefix of \(cache.prefixLength) tokens")
}

// Clear cache to force fresh prefill
tts.promptCache = nil
Caches are saved to <modelFolder>/embeddings/<voice>_<language>.promptcache.
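Given that path convention, you can check for an existing cache on disk before deciding whether to prebuild one. `promptCacheExists` is an illustrative helper (not part of TTSKit), assuming `modelFolder` points at your downloaded model directory.

```swift
import Foundation

// Builds the cache path from the documented convention:
// <modelFolder>/embeddings/<voice>_<language>.promptcache
func promptCacheExists(modelFolder: URL, voice: String, language: String) -> Bool {
    let cacheURL = modelFolder
        .appendingPathComponent("embeddings")
        .appendingPathComponent("\(voice)_\(language).promptcache")
    return FileManager.default.fileExists(atPath: cacheURL.path)
}
```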

Speech Result

The SpeechResult contains the generated audio and detailed timing breakdown:
public struct SpeechResult {
    public let audio: [Float]              // PCM samples (mono, Float32)
    public let timings: SpeechTimings      // Timing breakdown
    public let sampleRate: Int             // Sample rate (24000 Hz)
    
    public var audioDuration: TimeInterval // Computed duration
}

Timing Breakdown

The SpeechTimings struct provides detailed performance metrics:
print("Model load: \(result.timings.modelLoading)s")
print("Tokenizer load: \(result.timings.tokenizerLoadTime)s")
print("Time to first buffer: \(result.timings.timeToFirstBuffer)s")
print("Total decoding loops: \(result.timings.totalDecodingLoops)")
print("Full pipeline: \(result.timings.fullPipeline)s")
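A useful metric to derive from these timings is the real-time factor: wall-clock generation time divided by audio duration, where values below 1.0 mean generation ran faster than playback. This helper is a sketch, not a TTSKit API:

```swift
// Real-time factor: seconds spent generating per second of audio produced.
// RTF < 1.0 means the pipeline ran faster than real time.
func realTimeFactor(pipelineSeconds: Double, audioSeconds: Double) -> Double {
    audioSeconds > 0 ? pipelineSeconds / audioSeconds : .infinity
}
```

For example, `realTimeFactor(pipelineSeconds: result.timings.fullPipeline, audioSeconds: result.audioDuration)`.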

Saving Audio

Generated audio can be saved to WAV or M4A:
let result = try await tts.generate(text: "Save me!")
let outputDir = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]

// Save as WAV
try await AudioOutput.saveAudio(
    result.audio,
    toFolder: outputDir,
    filename: "output.wav",
    sampleRate: result.sampleRate,
    format: .wav
)

// Save as M4A (AAC)
try await AudioOutput.saveAudio(
    result.audio,
    toFolder: outputDir,
    filename: "output.m4a",
    sampleRate: result.sampleRate,
    format: .m4a
)
M4A export is not available on watchOS and automatically falls back to WAV.
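If you want to see what the WAV export contains (or need a writer independent of AudioOutput), a minimal 16-bit PCM WAV file is a 44-byte RIFF header followed by little-endian samples. The sketch below assumes mono 16-bit output and is not TTSKit's actual implementation:

```swift
import Foundation

// Writes mono Float samples as a 16-bit PCM WAV file.
func writeWAV(_ samples: [Float], sampleRate: Int, to url: URL) throws {
    // Clamp to [-1, 1] and convert to 16-bit integers.
    let pcm = samples.map { Int16(max(-1, min(1, $0)) * 32767) }
    let dataSize = pcm.count * 2
    var bytes = Data()
    func append(_ s: String) { bytes.append(contentsOf: Array(s.utf8)) }
    func append32(_ v: UInt32) { withUnsafeBytes(of: v.littleEndian) { bytes.append(contentsOf: $0) } }
    func append16(_ v: UInt16) { withUnsafeBytes(of: v.littleEndian) { bytes.append(contentsOf: $0) } }

    append("RIFF"); append32(UInt32(36 + dataSize)); append("WAVE")
    append("fmt "); append32(16)
    append16(1)                          // audio format: PCM
    append16(1)                          // channels: mono
    append32(UInt32(sampleRate))
    append32(UInt32(sampleRate * 2))     // byte rate = rate * channels * 2
    append16(2)                          // block align = channels * 2
    append16(16)                         // bits per sample
    append("data"); append32(UInt32(dataSize))
    for s in pcm { withUnsafeBytes(of: s.littleEndian) { bytes.append(contentsOf: $0) } }
    try bytes.write(to: url)
}
```

For TTSKit output you would pass `result.audio` and `result.sampleRate` (24000 Hz).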

Next Steps

Playback

Stream audio with real-time playback

Voices & Languages

Explore available voices and languages
