Welcome to VibeVoice

VibeVoice is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.

Key Features

Real-Time TTS

Produces the first audible speech in ~300 milliseconds and supports streaming text input

Long-Form Generation

Synthesizes conversational speech up to 90 minutes with up to 4 distinct speakers

High Fidelity

Ultra-low frame rate (7.5 Hz) continuous speech tokenizers preserve audio quality

Lightweight

The 0.5B-parameter real-time model is small enough for deployment-friendly applications

Model Variants

VibeVoice currently includes two model variants:

VibeVoice-Realtime-0.5B

A lightweight real-time text-to-speech model supporting:

Streaming text input
Robust long-form speech generation (~10 minutes)
Real-time TTS with ~300 millisecond latency
Single speaker support
8K context length

The realtime model uses an interleaved, windowed design that incrementally encodes incoming text chunks while continuing diffusion-based acoustic latent generation from prior context.
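The interleaving described above can be pictured as a simple scheduling loop. The following is a toy sketch only, not VibeVoice's implementation: the function name `interleaved_schedule`, the `frames_per_chunk` and `window` parameters, and their values are all hypothetical, and no real encoding or diffusion takes place. It shows only the order of operations: each incoming text chunk is added to a bounded context window, then a few acoustic frames are generated from that context before the next chunk arrives.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Tuple

# (event kind, index of the next acoustic frame, text chunks currently in the window)
Event = Tuple[str, int, List[str]]


def interleaved_schedule(
    text_chunks: Iterable[str],
    frames_per_chunk: int = 2,  # hypothetical value, chosen for illustration
    window: int = 2,            # hypothetical value, chosen for illustration
) -> Iterator[Event]:
    """Yield the order of 'encode' and 'generate' steps for a stream of text chunks."""
    context: Deque[str] = deque(maxlen=window)  # oldest chunks fall out of the window
    frame_index = 0
    for chunk in text_chunks:
        # Step 1: incrementally encode the newly arrived text chunk.
        context.append(chunk)
        yield ("encode", frame_index, list(context))
        # Step 2: continue generating acoustic frames from the context so far.
        for _ in range(frames_per_chunk):
            yield ("generate", frame_index, list(context))
            frame_index += 1


if __name__ == "__main__":
    for event in interleaved_schedule(["Hello", "world", "again"]):
        print(event)
```

Because generation starts after the first chunk rather than after the full text, audio can begin while later text is still arriving, which is the property the ~300 millisecond first-audio figure depends on.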

Long-Form Multi-Speaker Model

Synthesizes conversational or single-speaker speech with:

Up to 90 minutes of continuous audio
Up to 4 distinct speakers
Natural turn-taking and speaker consistency
Expressive and natural-sounding output

Core Innovation

VibeVoice employs a next-token diffusion framework, leveraging:

Large Language Model (LLM) - Understands textual context and dialogue flow (based on Qwen2.5)
Diffusion Head - Generates high-fidelity acoustic details
Continuous Speech Tokenizers - Acoustic and semantic tokenizers operating at 7.5 Hz for efficient processing

This architecture significantly improves computational efficiency on long sequences while preserving audio fidelity.
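To make the efficiency claim concrete, the arithmetic below counts how many tokenizer frames are needed at 7.5 Hz for the durations quoted on this page (about 10 minutes for the real-time model, up to 90 minutes for the long-form model). It is a back-of-the-envelope sketch: the helper `latent_frames` is a hypothetical name, and the 50 Hz comparison rate is an assumption chosen only to show the scaling, not a figure from this documentation.

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate stated in this documentation


def latent_frames(duration_minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover the given audio duration."""
    return round(duration_minutes * 60 * frame_rate_hz)


if __name__ == "__main__":
    for minutes in (10, 90):
        print(f"{minutes} min -> {latent_frames(minutes)} frames at {FRAME_RATE_HZ} Hz")

    # Hypothetical comparison rate (assumption, not from the documentation).
    comparison_rate_hz = 50.0
    print(f"90 min -> {latent_frames(90, comparison_rate_hz)} frames at {comparison_rate_hz} Hz")
```

At 7.5 Hz, 90 minutes of audio corresponds to 40,500 frames, and 10 minutes to 4,500 frames. The sequence length grows linearly with the frame rate, so a lower rate directly shortens the sequence the LLM has to process.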

Performance

VibeVoice-Realtime-0.5B achieves competitive performance on benchmark datasets:

Benchmark	WER (%)	Speaker Similarity
LibriSpeech test-clean	2.00	0.695
SEED test-en	2.05	0.633
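For readers unfamiliar with the WER column, the sketch below computes word error rate in its standard form: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. This is a minimal illustration of the metric, not the evaluation pipeline behind the table. The function name is hypothetical, and the text normalization and the speech recognizer used to transcribe the generated audio are not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only one previous row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Lower WER means the synthesized speech was transcribed more accurately. A WER of 2.00 corresponds to roughly two word errors per hundred reference words.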

Quick Links

Installation

Get started with installation and setup

Quickstart

Generate your first speech in minutes

GitHub Repository

View source code and contribute

Hugging Face

Download models and explore demos

Use Cases

Real-time TTS services - Build applications with streaming text-to-speech
Live data narration - Narrate live data streams and feeds
LLM speech output - Let language models speak from their first tokens
Podcast generation - Create multi-speaker conversational content
Long-form content - Generate extended audio narration and conversations

VibeVoice is intended for research and development purposes only. This model should not be used in commercial or real-world applications without further testing and development. Always disclose the use of AI when sharing AI-generated content.

What’s Next?

Install VibeVoice

Follow the installation guide to set up your environment

Run Your First Example

Try the quickstart tutorial to generate speech from text

Explore Advanced Features

Learn about different model variants and customization options
