Overview

AudioOutput handles audio export to file and real-time streaming playback via AVAudioEngine. It supports adaptive pre-buffering and edge-fading to prevent audible clicks during playback.
public class AudioOutput: @unchecked Sendable

Key Features

  • Pre-buffering: Accumulates audio frames until a threshold is reached before flushing to the player, preventing underruns on slower devices
  • Edge-fading: Applies fade-in/fade-out only at actual audio discontinuities (session start/end, chunk boundaries, underruns)
  • Underrun detection: Uses wall-clock timing to detect when the player has drained and needs fade-in on the next frame
  • File export: Supports M4A and WAV formats with optional metadata embedding

Initialization

public init(sampleRate: Int = 24000)
sampleRate
Int
default:"24000"
Output sample rate in Hz. Defaults to 24000 (Qwen3 TTS).

Properties

sampleRate
Int
Output sample rate in Hz. Read-only. Updated by TTSKit.loadModels() to match the loaded speech decoder’s actual sample rate.
audioFormat
AVAudioFormat
The audio format used for playback and export (derived from sampleRate). Read-only.
scheduledAudioDuration
TimeInterval
Cumulative duration (seconds) of real audio that has been scheduled via scheduleWithFades. Read-only. The silent sentinel buffer used for drain detection is not included.
currentPlaybackTime
TimeInterval
Current playback position in seconds, based on the audio engine’s render timeline. Read-only. Returns 0 if the player is not active, no audio has been scheduled yet, or the player hasn’t started rendering. Clamped to scheduledAudioDuration so the position never advances into silence gaps between chunks or past the last real audio frame.
silentBufferRemaining
TimeInterval
How many seconds of audio still need to accumulate in the pre-buffer before the next chunk flushes and playback resumes. Read-only. Non-zero only while in buffering mode (bufferThresholdMet == false and a positive bufferDuration is set).

Static Properties

fadeLengthSamples
Int
Number of samples for the fade-in/fade-out ramp. Value: 256. At 24 kHz, 256 samples ≈ 10.7 ms: imperceptible on contiguous audio, but enough to smoothly eliminate clicks at discontinuities.
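As a quick sanity check on the ramp length, the sample count converts to a ramp duration like so (a standalone sketch; the literal values mirror the documented defaults):

```swift
// Convert the fade ramp length in samples to milliseconds at the
// default 24 kHz output rate.
let fadeLengthSamples = 256       // AudioOutput.fadeLengthSamples
let sampleRate = 24_000.0         // default Qwen3 TTS rate
let rampMs = Double(fadeLengthSamples) / sampleRate * 1000.0
// rampMs ≈ 10.67, i.e. roughly 10.7 ms
```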

Instance Methods

Configuration

configure(sampleRate:)

Update the sample rate to match the loaded speech decoder.
public func configure(sampleRate newRate: Int)
newRate
Int
required
The new sample rate in Hz.
Must be called before startPlayback().
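A typical call site looks like this (a sketch; the 24 000 value is the Qwen3 TTS default from above, but in practice the rate comes from the loaded decoder):

```swift
let audioOutput = AudioOutput()
// Match the player's rate to the decoder before any playback starts.
audioOutput.configure(sampleRate: 24_000)
try audioOutput.startPlayback()
```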

Playback Control

startPlayback(deferEngineStart:)

Start the audio engine for streaming playback.
public func startPlayback(deferEngineStart: Bool = false) throws
deferEngineStart
Bool
default:"false"
When true, the audio engine is created and connected but not started. The engine will start automatically on the first enqueueAudioChunk call. This avoids the render thread contending with model predictions during the critical time-to-first-buffer path.
Resets all buffering, fade, and timing state. After calling this, configure the buffer threshold via setBufferDuration(_:). Throws: TTSError if the audio engine fails to start.
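For latency-sensitive startup, the deferred form keeps the render thread out of the way until audio actually arrives (a sketch of the intended pattern):

```swift
// Defer the engine start so model prediction owns the CPU during the
// time-to-first-buffer window; the engine spins up on the first enqueue.
let audioOutput = AudioOutput()
try audioOutput.startPlayback(deferEngineStart: true)
audioOutput.setBufferDuration(0.3)  // then configure the pre-buffer
```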

setBufferDuration(_:)

Configure the pre-buffer duration.
public func setBufferDuration(_ seconds: TimeInterval)
seconds
TimeInterval
required
Duration of audio to accumulate before flushing. Pass 0 for immediate streaming (fast devices).
Call after startPlayback().
  • If seconds == 0: immediately flushes any pending frames and switches to direct streaming (no buffering)
  • If seconds > 0: sets the threshold. If enough audio has already accumulated, flushes immediately
  • Can be called multiple times (e.g., per-chunk reassessment). Any held tail frame from the previous chunk is committed with fade-out first
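One way to use the per-chunk reassessment described above, assuming the caller measures its own real-time factor (the rtf value here is a hypothetical measurement, not part of the API):

```swift
// Re-pick the pre-buffer each chunk based on a measured real-time
// factor: generation time divided by audio duration. Below 1.0,
// generation outpaces playback, so stream directly with no buffer.
let rtf = 0.8  // hypothetical measurement
audioOutput.setBufferDuration(rtf < 1.0 ? 0 : 1.0)
```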

enqueueAudioChunk(_:)

Enqueue a chunk of audio samples for playback.
public func enqueueAudioChunk(_ samples: [Float])
samples
[Float]
required
Mono Float32 PCM samples to enqueue.
In streaming mode, detects underruns via wall-clock timing: if the player has drained since the last buffer, the held tail is committed with fade-out (it was the last frame before the gap) and the incoming frame is marked for fade-in. On contiguous playback, no fades are applied to interior frames.
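The wall-clock underrun check can be sketched as follows. This is a simplified illustration, not the library's actual implementation; the type and property names are hypothetical:

```swift
import Foundation

// Hypothetical sketch of wall-clock underrun detection: track the
// instant the last scheduled audio will finish playing. If a new chunk
// arrives after that instant, the player drained and the next frame
// needs a fade-in (and the held tail a fade-out).
struct UnderrunDetector {
    private var playbackEnd = Date.distantPast

    /// Returns true if the player drained before this chunk arrived.
    mutating func willEnqueue(duration: TimeInterval, now: Date = Date()) -> Bool {
        let underrun = now > playbackEnd
        let base = underrun ? now : playbackEnd  // restart clock after a gap
        playbackEnd = base.addingTimeInterval(duration)
        return underrun
    }
}
```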

stopPlayback(waitForCompletion:)

Stop playback and tear down the audio engine.
public func stopPlayback(waitForCompletion: Bool = true) async
waitForCompletion
Bool
default:"true"
When true, waits for any remaining scheduled buffers to finish playing before tearing down the engine.
The held tail frame is committed with fade-out (it’s the last frame of the session). Any remaining pending frames are flushed first.

File Export

saveAudio(_:toFolder:filename:sampleRate:format:metadataProvider:)

Save audio samples to a file.
public static func saveAudio(
    _ samples: [Float],
    toFolder folder: URL,
    filename: String,
    sampleRate: Int = 24000,
    format: AudioFileFormat? = nil,
    metadataProvider: (@Sendable () throws -> [AVMetadataItem])? = nil
) async throws -> URL
samples
[Float]
required
Mono Float32 PCM samples.
folder
URL
required
Destination directory. Created if it doesn’t exist.
filename
String
required
File name, with or without extension. Any extension already present in filename is stripped before writing.
sampleRate
Int
default:"24000"
Sample rate in Hz.
format
AudioFileFormat?
default:"nil"
Output format. Inferred from filename extension when nil. Defaults to .m4a if no extension found.
metadataProvider
(@Sendable () throws -> [AVMetadataItem])?
default:"nil"
Optional callback providing metadata items to embed into the file container. Applies to the M4A format only.
returns
URL
The URL of the written file.
For M4A with metadata: writes PCM → AAC to a temp file, then uses AVAssetExportSession passthrough to remux with embedded metadata atoms (no re-encode). For WAV or metadata-free M4A: writes directly. On watchOS, .m4a automatically falls back to .wav. Throws: TTSError if audio encoding or export fails. Example:
let outputURL = try await AudioOutput.saveAudio(
    result.audio,
    toFolder: URL(fileURLWithPath: "/tmp"),
    filename: "speech.m4a",
    sampleRate: 24000
)

duration(of:)

Return the playback duration of an audio file in seconds.
public static func duration(of url: URL) async throws -> TimeInterval
url
URL
required
URL to the audio file.
returns
TimeInterval
Duration in seconds.
Throws: Error if the file cannot be read.
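Example (the path is illustrative):

```swift
let url = URL(fileURLWithPath: "/tmp/speech.m4a")
let seconds = try await AudioOutput.duration(of: url)
print("Duration: \(seconds)s")
```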

Crossfade Assembly

crossfade(_:fadeLength:)

Assemble multiple audio chunks into one array with equal-power crossfades at each boundary.
public static func crossfade(_ chunks: [[Float]], fadeLength: Int) -> [Float]
chunks
[[Float]]
required
Ordered audio chunks to concatenate.
fadeLength
Int
required
Number of overlap samples for each crossfade.
returns
[Float]
Single concatenated audio array with crossfades applied at chunk boundaries.
Uses cos(t*pi/2) fade-out and sin(t*pi/2) fade-in so that energy is preserved through the overlap region. Fade curves are pre-computed once via Accelerate (vDSP_vramp + vvcosf/vvsinf) and reused at every chunk boundary. Example:
let chunks = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 0.9, 0.8]
]
let fused = AudioOutput.crossfade(chunks, fadeLength: 2)
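The equal-power property can be verified directly: at every point in the overlap, the squared fade curves sum to 1, so total energy stays constant. A standalone check in plain Swift (no Accelerate, for illustration only):

```swift
import Foundation

// Verify cos²(tπ/2) + sin²(tπ/2) = 1 across a 256-sample overlap,
// the property that makes the crossfade "equal-power".
let fadeLength = 256
for i in 0..<fadeLength {
    let t = Double(i) / Double(fadeLength - 1)
    let fadeOut = cos(t * .pi / 2)
    let fadeIn  = sin(t * .pi / 2)
    assert(abs(fadeOut * fadeOut + fadeIn * fadeIn - 1.0) < 1e-9)
}
```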

AudioFileFormat

Supported audio export formats.
public enum AudioFileFormat: String, Sendable {
    case m4a
    case wav
}

Properties

fileExtension
String
The file extension for this format (e.g., “m4a”, “wav”).

Static Methods

resolve(_:)

Resolve the effective format for the current platform.
public static func resolve(_ preferred: AudioFileFormat = .m4a) -> AudioFileFormat
preferred
AudioFileFormat
default:".m4a"
Preferred format.
returns
AudioFileFormat
The resolved format. On watchOS, M4A is not supported, so this falls back to WAV with a warning.
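Example (a sketch; on non-watchOS platforms this simply returns the preferred format):

```swift
// Prefer M4A but accept the platform fallback (WAV on watchOS).
let format = AudioFileFormat.resolve(.m4a)
let filename = "speech." + format.fileExtension
```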

Buffer Lifecycle

The buffer lifecycle for streaming playback follows these steps:
  1. startPlayback() - resets all state; frames accumulate until configured
  2. setBufferDuration(_:) - configures threshold (call after start)
  3. enqueueAudioChunk(_:) - pushes frames through the buffer/tail pipeline
  4. stopPlayback() - commits the tail with fade-out, waits, tears down

Example Usage

Save to File

let result = try await tts.generate(
    text: "Hello, world!",
    voice: "ryan"
)

let outputURL = try await AudioOutput.saveAudio(
    result.audio,
    toFolder: URL(fileURLWithPath: "/tmp"),
    filename: "output.m4a"
)
print("Saved to \(outputURL.path)")

Save with Metadata

let metadata = [
    AVMetadataItem.makeMetadata(.commonIdentifierTitle, value: "My Speech"),
    AVMetadataItem.makeMetadata(.commonIdentifierArtist, value: "TTSKit")
]

let outputURL = try await AudioOutput.saveAudio(
    result.audio,
    toFolder: outputFolder,
    filename: "speech.m4a",
    metadataProvider: { metadata }
)

Stream with Custom Buffer

let audioOutput = AudioOutput()
try audioOutput.startPlayback()
audioOutput.setBufferDuration(0.5)  // 500ms buffer

// Enqueue chunks as they're generated
for chunk in audioChunks {
    audioOutput.enqueueAudioChunk(chunk)
}

await audioOutput.stopPlayback(waitForCompletion: true)

Monitor Playback Position

try audioOutput.startPlayback()
audioOutput.setBufferDuration(0)

// Monitor playback position
Task {
    while audioOutput.currentPlaybackTime < totalDuration {
        print("Position: \(audioOutput.currentPlaybackTime)s")
        try await Task.sleep(for: .milliseconds(100))
    }
}

audioOutput.enqueueAudioChunk(samples)
await audioOutput.stopPlayback(waitForCompletion: true)

Crossfade Multiple Chunks

let chunk1 = try await tts.generate(text: "First sentence.", voice: "ryan")
let chunk2 = try await tts.generate(text: "Second sentence.", voice: "ryan")
let chunk3 = try await tts.generate(text: "Third sentence.", voice: "ryan")

let chunks = [chunk1.audio, chunk2.audio, chunk3.audio]
let fused = AudioOutput.crossfade(chunks, fadeLength: 2400)  // 100ms at 24kHz

try await AudioOutput.saveAudio(
    fused,
    toFolder: outputFolder,
    filename: "combined.m4a"
)
