Welcome to VibeVoice

VibeVoice is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.

Key Features

Real-Time TTS

Produces the first audible speech in ~300 milliseconds and supports streaming text input

Long-Form Generation

Synthesizes up to 90 minutes of conversational speech with up to 4 distinct speakers

High Fidelity

Ultra-low frame rate (7.5 Hz) continuous speech tokenizers preserve audio quality

Lightweight

0.5B-parameter real-time model enables deployment-friendly applications

Model Variants

VibeVoice currently includes two model variants:

VibeVoice-Realtime-0.5B

A lightweight real-time text-to-speech model supporting:
  • Streaming text input
  • Robust long-form speech generation (~10 minutes)
  • Real-time TTS with ~300 millisecond latency
  • Single speaker support
  • 8K context length
The realtime model uses an interleaved, windowed design that incrementally encodes incoming text chunks while continuing diffusion-based acoustic latent generation from prior context.
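
The interleaving described above can be pictured with a toy scheduler. This is only a sketch of the control flow, not VibeVoice code: the function names, the window size, and the number of frames per window are all hypothetical, and the "frames" are placeholder indices rather than acoustic latents. It shows that generation can start once the first text window is available, without waiting for the full text.

```python
from typing import Iterable, Iterator, List, Tuple


def window_text(tokens: List[str], window_size: int) -> Iterator[List[str]]:
    """Split a token list into consecutive fixed-size windows (last one may be shorter)."""
    for start in range(0, len(tokens), window_size):
        yield tokens[start:start + window_size]


def interleave(
    text_windows: Iterable[List[str]], frames_per_window: int
) -> Iterator[Tuple[str, object]]:
    """Alternate between reading one text window and emitting placeholder frames.

    Both the event names and frames_per_window are illustrative assumptions.
    """
    frame_index = 0
    for window in text_windows:
        # Step 1: take in the next chunk of incoming text.
        yield ("encode_text", window)
        # Step 2: continue generation from everything seen so far.
        for _ in range(frames_per_window):
            yield ("generate_frame", frame_index)
            frame_index += 1


tokens = "the quick brown fox jumps".split()
events = list(interleave(window_text(tokens, window_size=2), frames_per_window=3))
for event in events:
    print(event)
```

In this toy schedule the first `generate_frame` event appears immediately after the first `encode_text` event, which is the property that makes low first-audio latency possible with streaming input.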

Long-Form Multi-Speaker Model

Synthesizes conversational or single-speaker speech:
  • Up to 90 minutes of continuous audio
  • Up to 4 distinct speakers
  • Natural turn-taking and speaker consistency
  • Expressive and natural-sounding output

Core Innovation

VibeVoice employs a next-token diffusion framework, leveraging:
  • Large Language Model (LLM) - Understands textual context and dialogue flow (based on Qwen2.5)
  • Diffusion Head - Generates high-fidelity acoustic details
  • Continuous Speech Tokenizers - Acoustic and semantic tokenizers operating at 7.5 Hz for efficient processing
This architecture makes long sequences considerably cheaper to process while preserving audio fidelity.
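
To see why a 7.5 Hz frame rate helps with long sequences, it is useful to count frames. The sketch below is plain arithmetic on the figures quoted on this page (7.5 Hz, 90 minutes, ~10 minutes). The 50 Hz comparison rate is a hypothetical value chosen only for contrast and does not describe any particular tokenizer; `frames_for_duration` is an illustrative helper, not part of VibeVoice.

```python
def frames_for_duration(minutes: float, frame_rate_hz: float) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


# Figures taken from this page.
long_form_frames = frames_for_duration(90, 7.5)
realtime_frames = frames_for_duration(10, 7.5)

# Hypothetical higher frame rate, for comparison only.
comparison_frames = frames_for_duration(90, 50.0)

print(long_form_frames)   # 40500
print(realtime_frames)    # 4500
print(comparison_frames)  # 270000
```

At 7.5 Hz, 90 minutes of audio is 40,500 frames, while the same duration at the assumed 50 Hz rate would be 270,000 frames, roughly 6.7 times as many positions for the model to process.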

Performance

VibeVoice-Realtime-0.5B achieves competitive performance on benchmark datasets:

Benchmark                 WER (%)   Speaker Similarity
LibriSpeech test-clean    2.00      0.695
SEED test-en              2.05      0.633

Installation

Get started with installation and setup

Quickstart

Generate your first speech in minutes

GitHub Repository

View source code and contribute

Hugging Face

Download models and explore demos

Use Cases

  • Real-time TTS services - Build applications with streaming text-to-speech
  • Live data narration - Narrate live data streams and feeds
  • LLM speech output - Let language models speak from their first tokens
  • Podcast generation - Create multi-speaker conversational content
  • Long-form content - Generate extended audio narration and conversations
VibeVoice is intended for research and development purposes only. This model should not be used in commercial or real-world applications without further testing and development. Always disclose the use of AI when sharing AI-generated content.

What’s Next?

1. Install VibeVoice

Follow the installation guide to set up your environment

2. Run Your First Example

Try the quickstart tutorial to generate speech from text

3. Explore Advanced Features

Learn about different model variants and customization options