Welcome to VibeVoice
VibeVoice is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.
Key Features
Real-Time TTS
Produces initial audible speech in ~300 milliseconds with streaming text input support
Long-Form Generation
Synthesizes conversational speech up to 90 minutes with up to 4 distinct speakers
High Fidelity
Ultra-low frame rate (7.5 Hz) continuous speech tokenizers preserve audio quality
Lightweight
0.5B-parameter real-time model enables deployment-friendly applications
Model Variants
VibeVoice currently includes two model variants:
VibeVoice-Realtime-0.5B
A lightweight real-time text-to-speech model supporting:
- Streaming text input
- Robust long-form speech generation (~10 minutes)
- Real-time TTS with ~300 millisecond latency
- Single speaker support
- 8K context length
The realtime model uses an interleaved, windowed design that incrementally encodes incoming text chunks while continuing diffusion-based acoustic latent generation from prior context.
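The interleaving described above can be pictured with a toy scheduler. This is only a control-flow sketch under stated assumptions: `interleave`, the event names, and `frames_per_window` are hypothetical and are not part of the VibeVoice API, and no model is run. It shows only that latent generation can begin after the first text chunk arrives instead of waiting for the full text.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[str, int]  # (event name, number of text chunks encoded so far)


def interleave(text_chunks: Iterable[str], frames_per_window: int = 2) -> Iterator[Event]:
    """Alternate between encoding one text chunk and generating a window of latents.

    Hypothetical illustration: the strings yielded here stand in for real
    text encoding and diffusion-based acoustic latent generation.
    """
    seen: List[str] = []
    for chunk in text_chunks:
        # Incrementally "encode" the newly arrived chunk.
        seen.append(chunk)
        yield ("encode_text", len(seen))
        # Continue "generating" latents from all context received so far.
        for _ in range(frames_per_window):
            yield ("generate_latent", len(seen))


events = list(interleave(["Hello", "world"]))
for event in events:
    print(event)
```

Because `interleave` is a generator, it pulls the next text chunk only after the current window has been consumed, which mirrors the streaming-input behavior in spirit.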
Long-Form Multi-Speaker Model
Synthesizes conversational or single-speaker speech:
- Up to 90 minutes of continuous audio
- Up to 4 distinct speakers
- Natural turn-taking and speaker consistency
- Expressive and natural-sounding output
Core Innovation
VibeVoice employs a next-token diffusion framework, leveraging:
- Large Language Model (LLM) - Understands textual context and dialogue flow (based on Qwen2.5)
- Diffusion Head - Generates high-fidelity acoustic details
- Continuous Speech Tokenizers - Acoustic and semantic tokenizers operating at 7.5 Hz for efficient processing
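To give a sense of what a 7.5 Hz frame rate means for sequence length, the arithmetic below counts the acoustic frames needed for the durations quoted on this page. It is a back-of-the-envelope sketch: `frames_for` is a hypothetical helper, and the count ignores text tokens and any other per-sequence overhead.

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate stated on this page


def frames_for(duration_seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `duration_seconds` of audio."""
    return round(duration_seconds * frame_rate_hz)


ten_minutes = frames_for(10 * 60)     # 600 s * 7.5 Hz = 4500 frames
ninety_minutes = frames_for(90 * 60)  # 5400 s * 7.5 Hz = 40500 frames
print(ten_minutes, ninety_minutes)
```

At this rate, each minute of audio costs 450 frames, which is why long-form generation stays within a manageable sequence length.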
Performance
VibeVoice-Realtime-0.5B achieves competitive performance on benchmark datasets:

| Benchmark | WER (%) | Speaker Similarity |
|---|---|---|
| LibriSpeech test-clean | 2.00 | 0.695 |
| SEED test-en | 2.05 | 0.633 |
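For readers unfamiliar with the WER column, the sketch below computes word error rate using its standard definition: word-level edit distance divided by the number of reference words. This is not the evaluation script behind the table. This page does not say which transcription system or text normalization was used, and `word_error_rate` with its lowercase-and-split normalization is an assumption made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"{100 * wer:.2f}%")
```

A WER of 2.00% therefore corresponds to roughly two word errors per hundred reference words. Speaker similarity is a separate metric and is not covered by this sketch.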
Quick Links
Installation
Get started with installation and setup
Quickstart
Generate your first speech in minutes
GitHub Repository
View source code and contribute
Hugging Face
Download models and explore demos
Use Cases
- Real-time TTS services - Build applications with streaming text-to-speech
- Live data narration - Narrate live data streams and feeds
- LLM speech output - Let language models speak from their first tokens
- Podcast generation - Create multi-speaker conversational content
- Long-form content - Generate extended audio narration and conversations
What’s Next?
Install VibeVoice
Follow the installation guide to set up your environment
Run Your First Example
Try the quickstart tutorial to generate speech from text