The play method synthesizes speech and streams it to the device speakers frame by frame as it is generated.

Basic Usage

import TTSKit

let tts = try await TTSKit()
try await tts.play(text: "This starts playing before generation finishes.")
The call suspends until both generation and playback have completed.

Playback Strategies

You can control how much audio is buffered before playback begins:
public enum PlaybackStrategy {
    case auto                   // Adaptive: measure first step, compute required buffer
    case stream                 // Immediate: start playing as soon as first frame arrives
    case buffered(TimeInterval) // Fixed: pre-buffer specified seconds
    case generateFirst          // Generate all audio first, then play
}

Auto (Default)

Measures the first generation step and pre-buffers just enough to avoid underruns:
try await tts.play(
    text: "Long passage...",
    playbackStrategy: .auto
)
How it works:
  1. Start generation without starting the audio engine
  2. Measure the time for the first decoding step
  3. Compute required buffer: buffer = (stepTime - audioPerStep) * maxNewTokens
  4. Accumulate audio until buffer threshold is met
  5. Start playback
This ensures smooth playback on any device speed without over-buffering.
The auto strategy is recommended for most use cases. It balances latency and smoothness.
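The arithmetic behind the steps above can be sketched as follows. The function name and parameters mirror the step descriptions, not the actual TTSKit internals, which may differ:

```swift
import Foundation

/// Sketch of the adaptive pre-buffer computation described above.
/// If each decoding step takes `stepTime` seconds of wall-clock time but
/// yields only `audioPerStep` seconds of audio, the deficit accumulates
/// over the remaining steps; pre-buffering that deficit avoids underruns.
func requiredBuffer(stepTime: TimeInterval,
                    audioPerStep: TimeInterval,
                    maxNewTokens: Int) -> TimeInterval {
    let deficitPerStep = stepTime - audioPerStep
    // If generation is faster than real time, no pre-buffer is needed.
    return max(0, deficitPerStep * Double(maxNewTokens))
}

// Example: 60 ms per step, 50 ms of audio per step, 1200 steps
// yields roughly 12 seconds of pre-buffer.
let buffer = requiredBuffer(stepTime: 0.060, audioPerStep: 0.050, maxNewTokens: 1200)
```

On a fast device where stepTime is below audioPerStep, the computed buffer is zero and playback starts almost immediately, which is how auto avoids over-buffering.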

Stream

Start playing as soon as the first frame arrives (lowest latency):
try await tts.play(
    text: "Real-time response",
    playbackStrategy: .stream
)
Stream mode may stutter on slower devices if generation can’t keep up with playback. Use .auto or .buffered for guaranteed smooth playback.

Buffered

Pre-buffer a fixed duration before starting playback:
// Buffer 2 seconds before playing
try await tts.play(
    text: "High-latency but smooth",
    playbackStrategy: .buffered(2.0)
)
Useful when you know generation is slower than real-time on the target device.
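One way to pick the fixed duration is from a measured real-time factor. This helper is illustrative only, not part of the TTSKit API; `rtf` is assumed to mean seconds of wall-clock generation per second of audio produced:

```swift
import Foundation

/// Sketch: choose a fixed pre-buffer from a measured real-time factor.
/// rtf > 1 means generation is slower than playback; the shortfall over
/// the whole utterance is what must be pre-buffered.
func fixedBuffer(estimatedAudioSeconds: Double, rtf: Double) -> TimeInterval {
    // Playback consumes audio at 1x; generation produces it at 1/rtf.
    max(0, estimatedAudioSeconds * (rtf - 1.0))
}

// A 10-second utterance generated at 1.2x real time needs about 2 s
// buffered, matching the .buffered(2.0) example above.
let seconds = fixedBuffer(estimatedAudioSeconds: 10, rtf: 1.2)
```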

Generate First

Generate all audio before playing anything (highest latency, but allows concurrent chunk generation):
try await tts.play(
    text: "Generate everything first",
    playbackStrategy: .generateFirst
)
This is the only strategy that respects concurrentWorkerCount > 1, since playback doesn’t start until all chunks are ready.
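The reason only this strategy tolerates concurrency can be sketched with a task group: chunks are rendered in parallel, then reassembled by index before playback, so order is restored after the fact rather than enforced during generation. `renderChunk` below is a placeholder for the real synthesis call:

```swift
import Foundation

// Placeholder: real code would run the TTS model for one text chunk.
func renderChunk(_ index: Int) async -> [Float] {
    [Float](repeating: Float(index), count: 4)
}

/// Sketch of generate-first concurrency: render all chunks in parallel,
/// then return them in their original order for sequential playback.
func generateAll(chunkCount: Int) async -> [[Float]] {
    await withTaskGroup(of: (Int, [Float]).self) { group in
        for i in 0..<chunkCount {
            group.addTask { (i, await renderChunk(i)) }
        }
        var chunks = [[Float]?](repeating: nil, count: chunkCount)
        // Results arrive in completion order; the index restores sequence.
        for await (i, audio) in group { chunks[i] = audio }
        return chunks.compactMap { $0 }
    }
}
```

A streaming strategy cannot do this, because it must enqueue each frame the moment it is produced, which forces sequential generation.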

Voices and Languages

All voice and language options work with play:
try await tts.play(
    text: "Bonjour le monde",
    voice: "serena",
    language: "french",
    playbackStrategy: .auto
)
Or using typed enums:
try await tts.play(
    text: "Bonjour le monde",
    speaker: .serena,
    language: .french,
    playbackStrategy: .auto
)

Generation Options

All generation options are supported:
var options = GenerationOptions()
options.temperature = 0.9
options.chunkingStrategy = .sentence
options.instruction = "Speak slowly and warmly"  // 1.7B only

try await tts.play(
    text: longText,
    options: options,
    playbackStrategy: .auto
)
For streaming strategies (.auto, .stream, .buffered), chunking is forced to sequential (concurrentWorkerCount = 1) so frames can be enqueued in order. .generateFirst respects the caller’s concurrency setting.

Progress Callbacks

Receive per-step updates during playback:
try await tts.play(text: "Hello!", playbackStrategy: .auto) { progress in
    print("Playing chunk: \(progress.audio.count) samples")
    
    if let stepTime = progress.stepTime {
        print("First step took \(stepTime)s")
    }
    
    return true  // Return false to cancel
}
The callback receives SpeechProgress with the same fields as in generation callbacks.

Playback Control

The AudioOutput class manages playback:
// Access the audio output
let audioOut = tts.audioOutput

// Check current playback position
let position = audioOut.currentPlaybackTime  // seconds

// Check buffering status
let remaining = audioOut.silentBufferRemaining  // seconds until playback starts

// Stop playback early
await audioOut.stopPlayback(waitForCompletion: false)

Audio Format

TTSKit outputs mono PCM Float32 audio at 24 kHz:
print("Sample rate: \(audioOut.sampleRate)Hz")       // 24000
print("Format: \(audioOut.audioFormat.commonFormat)") // .pcmFormatFloat32

Crossfading

When generating long text with chunking, audio chunks are automatically crossfaded at boundaries:
// Chunks are crossfaded with 100ms overlap (2400 samples at 24kHz)
let crossfadeSamples = tts.audioOutput.sampleRate / 10
The AudioOutput.crossfade method uses equal-power curves (cos/sin) to preserve energy through the overlap region.
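An equal-power crossfade of the kind described can be sketched as below. This follows the cos/sin curves named above; TTSKit's internal implementation may differ in detail:

```swift
import Foundation

/// Sketch of an equal-power crossfade over the overlap region.
/// The outgoing chunk fades out along cos, the incoming chunk fades in
/// along sin; since cos^2 + sin^2 = 1, total power stays constant.
func crossfade(tail: [Float], head: [Float]) -> [Float] {
    precondition(tail.count == head.count)
    let n = tail.count
    return (0..<n).map { i in
        let t = Double(i) / Double(max(n - 1, 1))  // 0 -> 1 across overlap
        let fadeOut = Float(cos(t * .pi / 2))      // outgoing chunk
        let fadeIn  = Float(sin(t * .pi / 2))      // incoming chunk
        return tail[i] * fadeOut + head[i] * fadeIn
    }
}

// 100 ms overlap at 24 kHz = 2400 samples, as noted above.
let overlap = 24_000 / 10
```

Equal-power curves are preferred over a linear crossfade here because a linear ramp dips perceived loudness by about 3 dB at the midpoint of the overlap.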

Edge Fading

The audio output applies automatic fade-in/fade-out at discontinuities:
  • Fade-in: Applied at the start of playback, start of a chunk, or after an underrun
  • Fade-out: Applied at the end of playback, end of a chunk, or before an underrun
Interior frames of contiguous playback are untouched. Fades are 256 samples (~10.7 ms at 24 kHz): imperceptible on continuous audio, but enough to eliminate clicks at gaps.
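A minimal sketch of such an edge fade, applied in place to the start of a frame (fade-out is the mirror image). The 256-sample length matches the text above; the actual curve shape TTSKit uses is not specified here, so a linear ramp is assumed:

```swift
/// Sketch: ramp the first `length` samples of a frame from 0 toward
/// full amplitude to remove the click a hard discontinuity would cause.
func fadeIn(_ frame: inout [Float], length: Int = 256) {
    let n = min(length, frame.count)
    for i in 0..<n {
        frame[i] *= Float(i) / Float(n)  // linear ramp 0 -> ~1
    }
}

// 256 samples / 24000 Hz is about 10.7 ms: short enough to be
// inaudible on continuous audio, long enough to suppress clicks.
```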

Underrun Detection

The audio output detects underruns using wall-clock timing:
  1. Track when all scheduled audio should finish playing (expectedPlaybackEnd)
  2. If current time > expectedPlaybackEnd, the player has drained
  3. Apply fade-out to the last frame before the gap
  4. Apply fade-in to the next frame after the gap
This ensures smooth transitions even when generation can’t keep up with playback.
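The bookkeeping in steps 1 and 2 can be sketched as a small value type. The type and its members are illustrative, not the actual TTSKit internals:

```swift
import Foundation

/// Sketch of wall-clock underrun detection following the steps above.
/// `expectedPlaybackEnd` advances as audio is scheduled; once "now"
/// passes it, the player has drained and the next frame needs a fade-in.
struct UnderrunDetector {
    private var expectedPlaybackEnd: TimeInterval = 0

    mutating func didSchedule(seconds: TimeInterval, now: TimeInterval) {
        // Scheduling extends the end of queued audio; if playback had
        // already drained, it restarts from "now" rather than the past.
        expectedPlaybackEnd = max(expectedPlaybackEnd, now) + seconds
    }

    func hasUnderrun(now: TimeInterval) -> Bool {
        now > expectedPlaybackEnd
    }
}
```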

iOS Audio Session

On iOS, the audio session is automatically configured for playback:
#if os(iOS)
let session = AVAudioSession.sharedInstance()
try session.setCategory(.playback, mode: .default, options: [])
try session.setActive(true)
#endif
This ensures audio routes to the correct output (speaker, headphones, etc.).

Example: Real-Time Streaming

Here’s a complete example with progress updates and playback control:
import TTSKit

Task {
    let tts = try await TTSKit()
    
    let startTime = Date()
    
    let result = try await tts.play(
        text: "The quick brown fox jumps over the lazy dog. This sentence demonstrates real-time streaming playback with adaptive buffering.",
        speaker: .ryan,
        language: .english,
        playbackStrategy: .auto
    ) { progress in
        if let stepTime = progress.stepTime {
            let buffer = tts.audioOutput.silentBufferRemaining
            print("First step: \(Int(stepTime * 1000))ms, buffering \(String(format: "%.2f", buffer))s")
        }
        
        let elapsed = Date().timeIntervalSince(startTime)
        let position = tts.audioOutput.currentPlaybackTime
        print("Elapsed: \(String(format: "%.2f", elapsed))s, Playing: \(String(format: "%.2f", position))s")
        
        return true
    }
    
    print("Done! Generated \(result.audioDuration)s of audio")
    print("Time to first buffer: \(result.timings.timeToFirstBuffer)s")
    print("Full pipeline: \(result.timings.fullPipeline)s")
}

Platform Support

  • macOS: Full support for all playback strategies
  • iOS: Full support for all playback strategies
  • watchOS: M4A export is not available; use WAV format instead

Next Steps

Generation

Learn about generation options and chunking

Voices & Languages

Explore available voices and language support
