Overview

AudioOutput handles audio export to file and real-time streaming playback via AVAudioEngine. It supports adaptive pre-buffering and edge-fading to prevent audible clicks during playback.
public class AudioOutput: @unchecked Sendable

Key Features

  • Pre-buffering: Accumulates audio frames until a threshold is reached before flushing to the player, preventing underruns on slower devices
  • Edge-fading: Applies fade-in/fade-out only at actual audio discontinuities (session start/end, chunk boundaries, underruns)
  • Underrun detection: Uses wall-clock timing to detect when the player has drained and needs fade-in on the next frame
  • File export: Supports M4A and WAV formats with optional metadata embedding

Initialization

public init(sampleRate: Int = 24000)
sampleRate
Int
default:"24000"
Output sample rate in Hz. Defaults to 24000 (Qwen3 TTS).

Properties

sampleRate
Int
Output sample rate in Hz. Read-only. Updated by TTSKit.loadModels() to match the loaded speech decoder’s actual sample rate.
audioFormat
AVAudioFormat
The audio format used for playback and export (derived from sampleRate). Read-only.
scheduledAudioDuration
TimeInterval
Cumulative duration (seconds) of real audio that has been scheduled via scheduleWithFades. Read-only. The silent sentinel buffer used for drain detection is not included.
currentPlaybackTime
TimeInterval
Current playback position in seconds, based on the audio engine’s render timeline. Read-only. Returns 0 if the player is not active, no audio has been scheduled yet, or the player hasn’t started rendering. Clamped to scheduledAudioDuration so the position never advances into silence gaps between chunks or past the last real audio frame.
silentBufferRemaining
TimeInterval
How many seconds of audio still need to accumulate in the pre-buffer before the next chunk flushes and playback resumes. Read-only. Non-zero only while in buffering mode (bufferThresholdMet == false and a positive bufferDuration is set).

Static Properties

fadeLengthSamples
Int
Number of samples for the fade-in/fade-out ramp. Value: 256. At 24 kHz, 256 samples ≈ 10.7 ms: imperceptible on contiguous audio, but enough to smoothly eliminate clicks at discontinuities.
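As a quick sanity check on the ramp length, the sample count converts to a ramp duration like so (a standalone sketch; the literal values mirror the documented defaults):

```swift
// Convert the fade ramp length in samples to milliseconds at the
// default 24 kHz output rate.
let fadeLengthSamples = 256       // AudioOutput.fadeLengthSamples
let sampleRate = 24_000.0         // default Qwen3 TTS rate
let rampMs = Double(fadeLengthSamples) / sampleRate * 1000.0
// rampMs ≈ 10.67, i.e. roughly 10.7 ms
```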

Instance Methods

Configuration

configure(sampleRate:)

Update the sample rate to match the loaded speech decoder.
public func configure(sampleRate newRate: Int)
newRate
Int
required
The new sample rate in Hz.
Must be called before startPlayback().
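A typical call site looks like this (a sketch; the 24 000 value is the Qwen3 TTS default from above, but in practice the rate comes from the loaded decoder):

```swift
let audioOutput = AudioOutput()
// Match the player's rate to the decoder before any playback starts.
audioOutput.configure(sampleRate: 24_000)
try audioOutput.startPlayback()
```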

Playback Control

startPlayback(deferEngineStart:)

Start the audio engine for streaming playback.
public func startPlayback(deferEngineStart: Bool = false) throws
deferEngineStart
Bool
default:"false"
When true, the audio engine is created and connected but not started. The engine will start automatically on the first enqueueAudioChunk call. This avoids the render thread contending with model predictions during the critical time-to-first-buffer path.
Resets all buffering, fade, and timing state. After calling this, configure the buffer threshold via setBufferDuration(_:). Throws: TTSError if the audio engine fails to start.
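For latency-sensitive startup, the deferred form keeps the render thread out of the way until audio actually arrives (a sketch of the intended pattern):

```swift
// Defer the engine start so model prediction owns the CPU during the
// time-to-first-buffer window; the engine spins up on the first enqueue.
let audioOutput = AudioOutput()
try audioOutput.startPlayback(deferEngineStart: true)
audioOutput.setBufferDuration(0.3)  // then configure the pre-buffer
```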

setBufferDuration(_:)

Configure the pre-buffer duration.
public func setBufferDuration(_ seconds: TimeInterval)
seconds
TimeInterval
required
Duration of audio to accumulate before flushing. Pass 0 for immediate streaming (fast devices).
Call after startPlayback().
  • If seconds == 0: immediately flushes any pending frames and switches to direct streaming (no buffering)
  • If seconds > 0: sets the threshold. If enough audio has already accumulated, flushes immediately
  • Can be called multiple times (e.g., per-chunk reassessment). Any held tail frame from the previous chunk is committed with fade-out first
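One way to use the per-chunk reassessment described above, assuming the caller measures its own real-time factor (the rtf value here is a hypothetical measurement, not part of the API):

```swift
// Re-pick the pre-buffer each chunk based on a measured real-time
// factor: generation time divided by audio duration. Below 1.0,
// generation outpaces playback, so stream directly with no buffer.
let rtf = 0.8  // hypothetical measurement
audioOutput.setBufferDuration(rtf < 1.0 ? 0 : 1.0)
```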

enqueueAudioChunk(_:)

Enqueue a chunk of audio samples for playback.
public func enqueueAudioChunk(_ samples: [Float])
samples
[Float]
required
Mono Float32 PCM samples to enqueue.
In streaming mode, detects underruns via wall-clock timing: if the player has drained since the last buffer, the held tail is committed with fade-out (it was the last frame before the gap) and the incoming frame is marked for fade-in. On contiguous playback, no fades are applied to interior frames.
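The wall-clock underrun check can be sketched as follows. This is a simplified illustration, not the library's actual implementation; the type and property names are hypothetical:

```swift
import Foundation

// Hypothetical sketch of wall-clock underrun detection: track the
// instant the last scheduled audio will finish playing. If a new chunk
// arrives after that instant, the player drained and the next frame
// needs a fade-in (and the held tail a fade-out).
struct UnderrunDetector {
    private var playbackEnd = Date.distantPast

    /// Returns true if the player drained before this chunk arrived.
    mutating func willEnqueue(duration: TimeInterval, now: Date = Date()) -> Bool {
        let underrun = now > playbackEnd
        let base = underrun ? now : playbackEnd  // restart clock after a gap
        playbackEnd = base.addingTimeInterval(duration)
        return underrun
    }
}
```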

stopPlayback(waitForCompletion:)

Stop playback and tear down the audio engine.
public func stopPlayback(waitForCompletion: Bool = true) async
waitForCompletion
Bool
default:"true"
When true, waits for any remaining scheduled buffers to finish playing before tearing down the engine.
The held tail frame is committed with fade-out (it’s the last frame of the session). Any remaining pending frames are flushed first.

File Export

saveAudio(_:toFolder:filename:sampleRate:format:metadataProvider:)

Save audio samples to a file.
public static func saveAudio(
    _ samples: [Float],
    toFolder folder: URL,
    filename: String,
    sampleRate: Int = 24000,
    format: AudioFileFormat? = nil,
    metadataProvider: (@Sendable () throws -> [AVMetadataItem])? = nil
) async throws -> URL
samples
[Float]
required
Mono Float32 PCM samples.
folder
URL
required
Destination directory. Created if it doesn’t exist.
filename
String
required
File name, with or without extension. Any extension already present in filename is stripped before writing.
sampleRate
Int
default:"24000"
Sample rate in Hz.
format
AudioFileFormat?
default:"nil"
Output format. Inferred from filename extension when nil. Defaults to .m4a if no extension found.
metadataProvider
(@Sendable () throws -> [AVMetadataItem])?
default:"nil"
Optional callback providing metadata items to embed into the file container. Applies to the M4A format only.
returns
URL
The URL of the written file.
For M4A with metadata: writes PCM → AAC to a temp file, then uses AVAssetExportSession passthrough to remux with embedded metadata atoms (no re-encode). For WAV or metadata-free M4A: writes directly. On watchOS, .m4a automatically falls back to .wav. Throws: TTSError if audio encoding or export fails. Example:
let outputURL = try await AudioOutput.saveAudio(
    result.audio,
    toFolder: URL(fileURLWithPath: "/tmp"),
    filename: "speech.m4a",
    sampleRate: 24000
)

duration(of:)

Return the playback duration of an audio file in seconds.
public static func duration(of url: URL) async throws -> TimeInterval
url
URL
required
URL to the audio file.
returns
TimeInterval
Duration in seconds.
Throws: Error if the file cannot be read.
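Example (the path is illustrative):

```swift
let url = URL(fileURLWithPath: "/tmp/speech.m4a")
let seconds = try await AudioOutput.duration(of: url)
print("Duration: \(seconds)s")
```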

Crossfade Assembly

crossfade(_:fadeLength:)

Assemble multiple audio chunks into one array with equal-power crossfades at each boundary.
public static func crossfade(_ chunks: [[Float]], fadeLength: Int) -> [Float]
chunks
[[Float]]
required
Ordered audio chunks to concatenate.
fadeLength
Int
required
Number of overlap samples for each crossfade.
returns
[Float]
Single concatenated audio array with crossfades applied at chunk boundaries.
Uses cos(t*pi/2) fade-out and sin(t*pi/2) fade-in so that energy is preserved through the overlap region. Fade curves are pre-computed once via Accelerate (vDSP_vramp + vvcosf/vvsinf) and reused at every chunk boundary. Example:
let chunks = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 0.9, 0.8]
]
let fused = AudioOutput.crossfade(chunks, fadeLength: 2)
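The equal-power property can be verified directly: at every point in the overlap, the squared fade curves sum to 1, so total energy stays constant. A standalone check in plain Swift (no Accelerate, for illustration only):

```swift
import Foundation

// Verify cos²(tπ/2) + sin²(tπ/2) = 1 across a 256-sample overlap,
// the property that makes the crossfade "equal-power".
let fadeLength = 256
for i in 0..<fadeLength {
    let t = Double(i) / Double(fadeLength - 1)
    let fadeOut = cos(t * .pi / 2)
    let fadeIn  = sin(t * .pi / 2)
    assert(abs(fadeOut * fadeOut + fadeIn * fadeIn - 1.0) < 1e-9)
}
```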

AudioFileFormat

Supported audio export formats.
public enum AudioFileFormat: String, Sendable {
    case m4a
    case wav
}

Properties

fileExtension
String
The file extension for this format (e.g., “m4a”, “wav”).

Static Methods

resolve(_:)

Resolve the effective format for the current platform.
public static func resolve(_ preferred: AudioFileFormat = .m4a) -> AudioFileFormat
preferred
AudioFileFormat
default:".m4a"
Preferred format.
returns
AudioFileFormat
The resolved format. On watchOS, M4A is not supported, so this falls back to WAV with a warning.
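Example (a sketch; on non-watchOS platforms this simply returns the preferred format):

```swift
// Prefer M4A but accept the platform fallback (WAV on watchOS).
let format = AudioFileFormat.resolve(.m4a)
let filename = "speech." + format.fileExtension
```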

Buffer Lifecycle

The buffer lifecycle for streaming playback follows these steps:
  1. startPlayback() - resets all state; frames accumulate until configured
  2. setBufferDuration(_:) - configures threshold (call after start)
  3. enqueueAudioChunk(_:) - pushes frames through the buffer/tail pipeline
  4. stopPlayback() - commits the tail with fade-out, waits, tears down

Example Usage

Save to File

let result = try await tts.generate(
    text: "Hello, world!",
    voice: "ryan"
)

let outputURL = try await AudioOutput.saveAudio(
    result.audio,
    toFolder: URL(fileURLWithPath: "/tmp"),
    filename: "output.m4a"
)
print("Saved to \(outputURL.path)")

Save with Metadata

let metadata = [
    AVMetadataItem.makeMetadata(.commonIdentifierTitle, value: "My Speech"),
    AVMetadataItem.makeMetadata(.commonIdentifierArtist, value: "TTSKit")
]

let outputURL = try await AudioOutput.saveAudio(
    result.audio,
    toFolder: outputFolder,
    filename: "speech.m4a",
    metadataProvider: { metadata }
)

Stream with Custom Buffer

let audioOutput = AudioOutput()
try audioOutput.startPlayback()
audioOutput.setBufferDuration(0.5)  // 500ms buffer

// Enqueue chunks as they're generated
for chunk in audioChunks {
    audioOutput.enqueueAudioChunk(chunk)
}

await audioOutput.stopPlayback(waitForCompletion: true)

Monitor Playback Position

try audioOutput.startPlayback()
audioOutput.setBufferDuration(0)

// Monitor playback position
Task {
    while audioOutput.currentPlaybackTime < totalDuration {
        print("Position: \(audioOutput.currentPlaybackTime)s")
        try await Task.sleep(for: .milliseconds(100))
    }
}

audioOutput.enqueueAudioChunk(samples)
await audioOutput.stopPlayback(waitForCompletion: true)

Crossfade Multiple Chunks

let chunk1 = try await tts.generate(text: "First sentence.", voice: "ryan")
let chunk2 = try await tts.generate(text: "Second sentence.", voice: "ryan")
let chunk3 = try await tts.generate(text: "Third sentence.", voice: "ryan")

let chunks = [chunk1.audio, chunk2.audio, chunk3.audio]
let fused = AudioOutput.crossfade(chunks, fadeLength: 2400)  // 100ms at 24kHz

try await AudioOutput.saveAudio(
    fused,
    toFolder: outputFolder,
    filename: "combined.m4a"
)
