Overview
TTSKit provides on-device text-to-speech using Qwen3 TTS models running entirely on Apple silicon with real-time streaming playback.
TTSKit requires macOS 15.0+ or iOS 18.0+.
Quick Start
Basic Speech Generation
import TTSKit

Task {
    let tts = try await TTSKit()
    let result = try await tts.generate(text: "Hello from TTSKit!")
    print("Generated \(result.audioDuration)s of audio at \(result.sampleRate) Hz")
}
TTSKit() automatically downloads the default 0.6B model on first run and loads all necessary components.
Real-Time Streaming Playback
Play audio as it’s being generated:
try await tts.play(text: "This starts playing before generation finishes.")
The audio begins playing immediately, streaming frame-by-frame as it’s generated.
Model Selection
TTSKit offers two model sizes:
// Fast, works on all platforms (~1 GB)
let tts = try await TTSKit(
    TTSKitConfig(model: .qwen3TTS_0_6b)
)

// Higher quality, macOS only (~2.2 GB, supports style instructions)
let tts = try await TTSKit(
    TTSKitConfig(model: .qwen3TTS_1_7b)
)
Models are hosted on HuggingFace and cached locally after first download.
Voice Selection
Choose from 9 built-in voices:
let result = try await tts.generate(
    text: "Hello world",
    speaker: .ryan,  // or .aiden, .onoAnna, .sohee, .eric, .dylan, .serena, .vivian, .uncleFu
    language: .english
)
Available Voices
ryan - Clear, neutral male voice
aiden - Warm male voice
onoAnna - Professional female voice
sohee - Friendly female voice
eric - Authoritative male voice
dylan - Casual male voice
serena - Calm female voice
vivian - Energetic female voice
uncleFu - Character voice
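To audition the voices programmatically, you can loop over the speaker cases using the generate API shown above. This is a sketch that assumes Qwen3Speaker conforms to CaseIterable (unverified); if it does not, substitute an explicit array of cases:

```swift
import TTSKit

// Sketch: generate a short sample with each built-in voice.
// Assumes Qwen3Speaker conforms to CaseIterable (an assumption);
// otherwise use an explicit array: [.ryan, .aiden, .onoAnna, ...].
Task {
    let tts = try await TTSKit()
    for speaker in Qwen3Speaker.allCases {
        let result = try await tts.generate(
            text: "Hello, this is a voice preview.",
            speaker: speaker,
            language: .english
        )
        print("\(speaker): \(result.audioDuration)s")
    }
}
```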
Multi-Language Support
TTSKit supports 10 languages:
// Japanese
let result = try await tts. generate (
text : "こんにちは世界" ,
speaker : . onoAnna ,
language : . japanese
)
// Spanish
let result = try await tts. generate (
text : "Hola mundo" ,
speaker : . serena ,
language : . spanish
)
Supported Languages
.english, .chinese, .japanese, .korean, .german, .french, .russian, .portuguese, .spanish, .italian
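Because language is just a parameter to generate, producing speech in several languages is a simple loop. The text/speaker/language pairings below are illustrative, built only from the API shown above:

```swift
import TTSKit

// Sketch: generate the same greeting in multiple languages.
// The (text, speaker, language) tuples are illustrative examples.
Task {
    let tts = try await TTSKit()
    let samples: [(String, Qwen3Speaker, Qwen3Language)] = [
        ("Hello world", .ryan, .english),
        ("こんにちは世界", .onoAnna, .japanese),
        ("Hola mundo", .serena, .spanish),
    ]
    for (text, speaker, language) in samples {
        let result = try await tts.generate(
            text: text,
            speaker: speaker,
            language: language
        )
        print("\(language): \(result.audioDuration)s of audio")
    }
}
```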
Generation Options
Customize the generation behavior:
var options = GenerationOptions()
options.temperature = 0.9
options.topK = 50
options.repetitionPenalty = 1.05
options.maxNewTokens = 245

// Auto-split long text at sentence boundaries
options.chunkingStrategy = .sentence

// Run all chunks concurrently (nil = auto)
options.concurrentWorkerCount = nil

let result = try await tts.generate(
    text: longArticle,
    speaker: .ryan,
    language: .english,
    options: options
)
Style Instructions (1.7B Model Only)
The 1.7B model accepts natural-language style instructions:
var options = GenerationOptions()
options.instruction = "Speak slowly and warmly, like a storyteller."

let result = try await tts.generate(
    text: "Once upon a time...",
    speaker: .ryan,
    options: options
)
Style instructions work only with the 1.7B model; the 0.6B model ignores them.
Playback Strategies
Control how audio is buffered and played:
// Auto: Measures first step, buffers just enough to avoid underruns
try await tts.play(
    text: "Long passage...",
    playbackStrategy: .auto
)

// Stream: Immediate playback, no pre-buffer
try await tts.play(
    text: "Real-time speech",
    playbackStrategy: .stream
)

// Buffered: Fixed pre-buffer duration
try await tts.play(
    text: "Smooth playback",
    playbackStrategy: .buffered(seconds: 2.0)
)

// GenerateFirst: Generate all audio first, then play
try await tts.play(
    text: "Complete before play",
    playbackStrategy: .generateFirst
)
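As a rough guideline, .stream suits short interactive utterances, while long passages benefit from .buffered or .generateFirst. A length-based heuristic is sketched below; note that the PlaybackStrategy type name and the 200-character threshold are assumptions for illustration, not TTSKit recommendations:

```swift
import TTSKit

// Illustrative heuristic: pick a playback strategy by text length.
// Assumes the strategy enum is named PlaybackStrategy (an assumption);
// the 200-character threshold is arbitrary.
func strategy(for text: String) -> PlaybackStrategy {
    text.count < 200 ? .stream : .buffered(seconds: 2.0)
}

try await tts.play(
    text: longArticle,
    playbackStrategy: strategy(for: longArticle)
)
```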
Saving Audio Files
Export generated audio to disk:
let result = try await tts.generate(text: "Save me!")

let outputDir = FileManager.default.urls(
    for: .documentDirectory,
    in: .userDomainMask
)[0]

// Save as WAV
try await AudioOutput.saveAudio(
    result.audio,
    toFolder: outputDir,
    filename: "output",
    format: .wav
)

// Save as M4A (AAC)
try await AudioOutput.saveAudio(
    result.audio,
    toFolder: outputDir,
    filename: "output",
    format: .m4a
)
Progress Callbacks
Receive per-step audio during generation:
let result = try await tts.generate(
    text: "Hello!"
) { progress in
    print("Audio chunk: \(progress.audio.count) samples")
    if let stepTime = progress.stepTime {
        print("First step took \(stepTime)s")
    }
    // Return false to cancel early
    return true
}
Complete SwiftUI Example
import SwiftUI
import TTSKit
import AVFoundation  // for AVAudioPlayer

struct TTSView: View {
    @StateObject private var viewModel = TTSViewModel()
    @State private var inputText = "Hello from TTSKit!"
    @State private var selectedSpeaker: Qwen3Speaker = .ryan
    @State private var selectedLanguage: Qwen3Language = .english

    var body: some View {
        VStack(spacing: 20) {
            // Input
            TextEditor(text: $inputText)
                .frame(height: 100)
                .border(Color.gray, width: 1)

            // Voice selection
            HStack {
                Text("Speaker:")
                Picker("Speaker", selection: $selectedSpeaker) {
                    Text("Ryan").tag(Qwen3Speaker.ryan)
                    Text("Aiden").tag(Qwen3Speaker.aiden)
                    Text("Ono Anna").tag(Qwen3Speaker.onoAnna)
                    Text("Sohee").tag(Qwen3Speaker.sohee)
                }
            }

            // Language selection
            HStack {
                Text("Language:")
                Picker("Language", selection: $selectedLanguage) {
                    Text("English").tag(Qwen3Language.english)
                    Text("Japanese").tag(Qwen3Language.japanese)
                    Text("Spanish").tag(Qwen3Language.spanish)
                }
            }

            // Controls
            HStack {
                Button("Generate") {
                    Task {
                        await viewModel.generate(
                            text: inputText,
                            speaker: selectedSpeaker,
                            language: selectedLanguage
                        )
                    }
                }
                .disabled(viewModel.isGenerating)

                Button("Play") {
                    Task {
                        await viewModel.playGenerated()
                    }
                }
                .disabled(viewModel.audioSamples.isEmpty || viewModel.isPlaying)
            }

            // Waveform visualization
            if !viewModel.waveform.isEmpty {
                WaveformView(samples: viewModel.waveform)
                    .frame(height: 100)
            }

            // Status
            Text(viewModel.statusMessage)
                .foregroundColor(.secondary)

            if viewModel.isGenerating {
                ProgressView()
            }
        }
        .padding()
        .task {
            await viewModel.loadModel()
        }
    }
}

@MainActor
class TTSViewModel: ObservableObject {
    @Published var statusMessage = "Loading model..."
    @Published var isGenerating = false
    @Published var isPlaying = false
    @Published var audioSamples: [Float] = []
    @Published var waveform: [Float] = []
    @Published var audioDuration: TimeInterval = 0

    private var ttsKit: TTSKit?
    private var audioPlayer: AVAudioPlayer?

    func loadModel() async {
        do {
            ttsKit = try await TTSKit(TTSKitConfig(model: .qwen3TTS_0_6b))
            statusMessage = "Ready"
        } catch {
            statusMessage = "Failed to load model: \(error)"
        }
    }

    func generate(text: String, speaker: Qwen3Speaker, language: Qwen3Language) async {
        guard let ttsKit = ttsKit else { return }
        isGenerating = true
        statusMessage = "Generating..."
        waveform = []
        do {
            let result = try await ttsKit.generate(
                text: text,
                speaker: speaker,
                language: language
            ) { progress in
                // Collect waveform peaks
                let peak = progress.audio.reduce(Float(0)) { max($0, abs($1)) }
                Task { @MainActor in
                    self.waveform.append(peak)
                }
                return true
            }
            audioSamples = result.audio
            audioDuration = result.audioDuration
            statusMessage = String(
                format: "Generated %.1fs (RTF: %.2f)",
                result.audioDuration,
                result.timings.realTimeFactor
            )
        } catch {
            statusMessage = "Error: \(error)"
        }
        isGenerating = false
    }

    func playGenerated() async {
        guard ttsKit != nil, !audioSamples.isEmpty else { return }
        do {
            // Save to a temporary file
            let tempURL = FileManager.default.temporaryDirectory
                .appendingPathComponent("temp_audio.wav")
            try await AudioOutput.saveAudio(
                audioSamples,
                toFolder: tempURL.deletingLastPathComponent(),
                filename: "temp_audio",
                format: .wav
            )

            // Play with AVAudioPlayer
            audioPlayer = try AVAudioPlayer(contentsOf: tempURL)
            audioPlayer?.play()
            isPlaying = true
            statusMessage = "Playing..."

            // Wait for playback to finish
            try await Task.sleep(for: .seconds(audioDuration))
            isPlaying = false
            statusMessage = "Playback complete"
        } catch {
            statusMessage = "Playback error: \(error)"
            isPlaying = false
        }
    }
}

struct WaveformView: View {
    let samples: [Float]

    var body: some View {
        GeometryReader { geometry in
            HStack(spacing: 1) {
                ForEach(samples.indices, id: \.self) { index in
                    RoundedRectangle(cornerRadius: 2)
                        .fill(Color.blue)
                        .frame(
                            width: 2,
                            height: CGFloat(samples[index]) * geometry.size.height
                        )
                }
            }
        }
    }
}
Command Line Usage
TTSKit is available through the whisperkit-cli tool:
# Generate and play speech
swift run whisperkit-cli tts \
--text "Hello from the command line" \
--play
# Save to file
swift run whisperkit-cli tts \
--text "Save to file" \
--output-path output.wav
# Japanese with specific speaker
swift run whisperkit-cli tts \
--text "日本語テスト" \
--speaker ono-anna \
--language japanese
# Use 1.7B model with style instruction
swift run whisperkit-cli tts \
--text-file article.txt \
--model 1.7b \
--instruction "Read cheerfully"
# See all options
swift run whisperkit-cli tts --help
Compute Units
Optimize for your device:
let config = TTSKitConfig(
    model: .qwen3TTS_0_6b,
    computeOptions: ComputeOptions(
        embedderComputeUnits: .cpuOnly,
        codeDecoderComputeUnits: .cpuAndNeuralEngine,
        multiCodeDecoderComputeUnits: .cpuAndNeuralEngine,
        speechDecoderComputeUnits: .cpuAndNeuralEngine
    )
)
let tts = try await TTSKit(config)
Concurrent Workers
Adjust concurrency for long text:
var options = GenerationOptions()

// Use 4 concurrent workers
options.concurrentWorkerCount = 4

// Or let TTSKit decide (recommended)
options.concurrentWorkerCount = nil
Demo App
The TTSKitExample app showcases:
Real-time streaming playback
Model management UI
Waveform visualization
Generation history
macOS and iOS support
Build and run to explore all features!
Next Steps
Basic Transcription: Learn speech-to-text with WhisperKit
Real-Time Streaming: Transcribe audio in real-time