
Overview

The VibeVoice long-form multi-speaker model is designed for generating expressive, long-form, multi-speaker conversational audio such as podcasts from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.
This model can synthesize up to 90 minutes of conversational or single-speaker speech with up to 4 distinct speakers, surpassing the typical 1-2 speaker limit of many prior models.

Model Specifications

  • Base Model: Qwen2.5 1.5B
  • Max Duration: 90 minutes
  • Max Speakers: 4 distinct speakers
  • Frame Rate: 7.5 Hz (ultra-low)
  • Primary Languages: English, Chinese
  • Use Cases: Podcasts, audiobooks, long conversations

Key Features

Generates speech with up to 4 distinct speakers, maintaining consistent voice characteristics and natural turn-taking throughout conversations. This surpasses traditional TTS systems limited to 1-2 speakers.
Capable of generating up to 90 minutes of continuous, coherent speech without degradation in quality or speaker consistency. Perfect for podcasts, audiobooks, and extended narratives.
Produces natural, expressive speech with appropriate prosody, emotion, and conversational dynamics. The model understands dialogue flow and context to generate realistic turn-taking.
Can generate spontaneous singing and other expressive vocalizations when contextually appropriate, adding naturalness to generated content.

Architecture

The long-form model employs a novel framework using continuous speech tokenizers and next-token diffusion.

Dual Tokenizer System

Unlike the realtime variant, the long-form model uses both tokenizers:

Acoustic Tokenizer

  • vae_dim (integer, default: 64): VAE latent dimension for acoustic features
  • encoder_ratios (array, default: [8,5,5,4,2,2]): Downsampling ratios for each encoder layer
  • encoder_depths (string, default: "3-3-3-3-3-3-8"): Depth configuration for encoder layers
  • encoder_n_filters (integer, default: 32): Number of filters in the encoder
  • causal (boolean, default: true): Whether to use causal convolutions
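As a sanity check, the cumulative downsampling implied by the default encoder_ratios can be computed directly. Assuming a 24 kHz input sample rate (an assumption, not stated above), the ratios multiply out to exactly the model's 7.5 Hz frame rate:

```python
from math import prod

encoder_ratios = [8, 5, 5, 4, 2, 2]        # default per-layer downsampling ratios
total_downsampling = prod(encoder_ratios)  # cumulative factor across the encoder
sample_rate = 24_000                       # assumed input sample rate in Hz

frame_rate = sample_rate / total_downsampling
print(total_downsampling, frame_rate)      # 3200x downsampling -> 7.5 Hz
```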

Semantic Tokenizer

  • vae_dim (integer, default: 64): VAE latent dimension for semantic features (can differ from acoustic)
  • fix_std (float, default: 0): Fixed standard deviation (0 for the semantic tokenizer)
  • std_dist_type (string, default: "none"): Standard deviation distribution type (none for deterministic encoding)

Diffusion Head Configuration

  • hidden_size (integer, default: 768): Hidden dimension size for the diffusion head
  • head_layers (integer, default: 4): Number of transformer layers in the diffusion head
  • head_ffn_ratio (float, default: 3.0): Feed-forward network expansion ratio
  • latent_size (integer, default: 64): Size of the latent representation for diffusion
  • ddpm_num_steps (integer, default: 1000): Number of diffusion training steps
  • ddpm_num_inference_steps (integer, default: 20): Number of inference steps for speech generation
  • ddpm_beta_schedule (string, default: "cosine"): Beta schedule for the diffusion process (cosine or linear)
  • ddpm_batch_mul (integer, default: 4): Batch multiplier for diffusion training
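For reference, the default cosine beta schedule can be sketched as below. This follows the standard cosine schedule from the diffusion literature (Nichol & Dhariwal, 2021), not the exact VibeVoice implementation, so treat it as illustrative:

```python
import math

def cosine_beta_schedule(num_steps: int, s: float = 0.008) -> list:
    """Standard cosine noise schedule: alpha_bar follows a squared-cosine
    curve, and per-step betas are derived from consecutive ratios."""
    def alpha_bar(t: float) -> float:
        return math.cos((t / num_steps + s) / (1 + s) * math.pi / 2) ** 2

    betas = []
    for i in range(num_steps):
        # Clip to 0.999 to avoid a degenerate final step
        betas.append(min(1 - alpha_bar(i + 1) / alpha_bar(i), 0.999))
    return betas

betas = cosine_beta_schedule(1000)  # matches the ddpm_num_steps default
```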

Language Model Backbone

Built on Qwen2.5 1.5B with customizations for speech generation:
  • Context Understanding: LLM component understands textual context and dialogue flow
  • Speaker Modeling: Handles multi-speaker turn-taking and consistency
  • Diffusion Integration: Diffusion head generates high-fidelity acoustic details

Technical Innovation

Ultra-Low Frame Rate Tokenizers

A core innovation of VibeVoice is the use of continuous speech tokenizers operating at an ultra-low frame rate of 7.5 Hz. This design:
  • Preserves Audio Fidelity: Maintains high-quality audio despite low frame rate
  • Boosts Efficiency: Significantly reduces computational requirements for long sequences
  • Enables Long-Form: Makes 90-minute generation computationally feasible
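To make the efficiency gain concrete, here is a back-of-the-envelope sequence-length comparison. The 50 Hz reference rate is an assumed figure for a typical neural audio codec, not a number from the text above:

```python
def num_frames(minutes: float, frame_rate_hz: float) -> int:
    """Number of speech tokens needed to cover a session at a given frame rate."""
    return int(minutes * 60 * frame_rate_hz)

vibevoice_frames = num_frames(90, 7.5)   # full 90-minute session at 7.5 Hz
typical_frames = num_frames(90, 50.0)    # same session at an assumed 50 Hz codec
print(vibevoice_frames, typical_frames)  # 40500 vs 270000 tokens
```

At 7.5 Hz, a full 90-minute session fits in roughly 40,500 speech tokens, well within a long-context LLM window.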

Next-Token Diffusion Framework

Combines the strengths of:
  • LLM: Understands text, context, dialogue structure, and speaker turns
  • Diffusion Models: Generates high-fidelity, natural-sounding acoustic details
This hybrid approach outperforms pure autoregressive or pure diffusion models.
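Conceptually, the hybrid generation loop interleaves the two components: one autoregressive LLM step per speech frame, followed by a short diffusion refinement of a continuous latent. The sketch below uses toy stand-ins for both components; all names are illustrative, not the actual VibeVoice API:

```python
def generate_latents(llm_step, denoise, num_frames: int, steps: int = 20) -> list:
    """Toy next-token diffusion loop: each iteration the LLM produces one
    conditioning vector, and the diffusion head refines a continuous latent
    from it over `steps` denoising iterations."""
    latents, state = [], 0.0
    for _ in range(num_frames):
        cond, state = llm_step(state)         # autoregressive LLM step
        latents.append(denoise(cond, steps))  # diffusion-refined latent
    return latents

# Toy stand-ins for the real model components:
toy_llm = lambda state: (state + 1.0, state + 1.0)
toy_head = lambda cond, steps: cond / steps
latents = generate_latents(toy_llm, toy_head, num_frames=3)
```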

Performance

Quality Metrics

In MOS (Mean Opinion Score) preference testing, VibeVoice achieves higher naturalness and quality ratings than existing baseline TTS systems.

Capabilities Demonstrated

  • Natural multi-speaker dialogues
  • Appropriate prosody and emotion
  • Smooth turn-taking
  • Speaker consistency over long durations

Usage

Installation

# Clone the repository
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice/

# Install dependencies
pip install -e .

Basic Generation

from transformers import AutoModelForCausalLM
from vibevoice.modular.modular_vibevoice_text_tokenizer import VibeVoiceTextTokenizerFast

# Load the long-form model
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/VibeVoice",  # Long-form model path
    trust_remote_code=True
)

tokenizer = VibeVoiceTextTokenizerFast.from_pretrained(
    "microsoft/VibeVoice"
)

# Prepare multi-speaker conversation
conversation = """
Speaker 1: Welcome to our podcast about AI.
Speaker 2: Thanks for having me! I'm excited to discuss this.
Speaker 1: Let's start with the basics...
"""

input_ids = tokenizer(conversation, return_tensors="pt").input_ids

# Generate speech
output = model.generate(
    inputs=input_ids,
    tokenizer=tokenizer,
    max_new_tokens=10000,  # For long-form generation
    return_speech=True
)

# Access generated audio
audio = output.speech_outputs[0]
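To persist the result, the waveform can be written to disk with the standard library. This assumes the output is a mono float array in [-1, 1] at a 24 kHz sample rate (both assumptions; check the repository for the model's actual output format):

```python
import wave

import numpy as np

def save_wav(path: str, audio, sample_rate: int = 24_000) -> None:
    """Write a mono float waveform in [-1, 1] to a 16-bit PCM WAV file."""
    pcm = (np.clip(np.asarray(audio), -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)       # mono, matching the tokenizer's channels=1
        f.setsampwidth(2)       # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())
```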

Multi-Speaker Configuration

# Define speaker roles
speakers = {
    "host": "Speaker1",
    "guest": "Speaker2"
}

conversation = f"""
{speakers['host']}: Welcome to the show.
{speakers['guest']}: Thank you for the invitation.
"""

Limitations

Research Model: Not recommended for commercial or real-world applications without further testing and development. This model is intended for research and development purposes only.

Language Support

  • English: Full support
  • Chinese: Full support
  • Other Languages: May produce unexpected audio outputs

Technical Limitations

The model focuses solely on speech synthesis and does not handle:
  • Background noise
  • Music (except spontaneous singing)
  • Sound effects
  • Environmental sounds
The current model does not explicitly model or generate overlapping speech segments; all speakers take distinct turns. Speaker handling is also constrained:
  • Maximum of 4 distinct speakers
  • Each speaker must maintain a consistent role throughout
  • New speakers cannot be added mid-generation
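These constraints can be checked before generation. The validator below assumes each turn is marked by a leading "Name:" prefix, as in the usage examples above (an assumption about the transcript format, not part of the model API):

```python
import re

MAX_SPEAKERS = 4  # the model's documented speaker limit

def validate_speakers(script: str) -> set:
    """Collect distinct speaker labels from 'Name:' line prefixes and
    raise if the script exceeds the 4-speaker limit."""
    speakers = {m.group(1).strip() for m in re.finditer(r"^(.+?):", script, re.M)}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(f"{len(speakers)} speakers exceeds limit of {MAX_SPEAKERS}")
    return speakers
```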

Model Inheritance

VibeVoice inherits any biases, errors, or omissions from its base model (Qwen2.5 1.5B). The model may produce outputs that are unexpected, biased, or inaccurate.

Responsible AI Considerations

Deepfake and Disinformation Risks: High-quality synthetic speech can be misused to create convincing fake audio content for impersonation, fraud, or spreading disinformation.

User Responsibilities

Users must:
  1. Ensure Transcript Reliability: Verify that input text is accurate and appropriate
  2. Check Content Accuracy: Validate generated content before distribution
  3. Avoid Misleading Use: Do not use generated content in deceptive ways
  4. Disclose AI Usage: Best practice to disclose when content is AI-generated
  5. Legal Compliance: Deploy in full compliance with all applicable laws and regulations
  6. Lawful Deployment: Use generated content in a lawful manner

Ethical Guidelines

Always disclose the use of AI when sharing AI-generated audio content. Listeners have the right to know when content is synthetic.
Implement verification mechanisms to prevent misuse, especially in contexts where voice authenticity is critical (e.g., news, official communications).
Conduct thorough testing for bias, accuracy, and safety before any deployment, especially in sensitive domains.

Model Architecture Details

Tokenizer Configuration

Both acoustic and semantic tokenizers share common parameters:
  • channels (integer, default: 1): Number of audio channels (mono)
  • corpus_normalize (float, default: 0.0): Corpus-level normalization factor
  • mixer_layer (string, default: "depthwise_conv"): Type of mixing layer
  • conv_norm (string, default: "none"): Convolution normalization type
  • layernorm (string, default: "RMSNorm"): Layer normalization type
  • layernorm_eps (float, default: 1e-5): Epsilon for layer normalization
  • conv_bias (boolean, default: true): Whether to use bias in convolutions
  • layer_scale_init_value (float, default: 1e-6): Initial value for layer scaling

Use Cases

Podcast Generation

Ideal for creating synthetic podcasts:
  • Multiple hosts and guests
  • Natural conversational flow
  • Extended episode lengths (up to 90 minutes)
  • Consistent speaker identities

Audiobook Production

Generate audiobooks with:
  • Single narrator or multiple character voices
  • Long-form narration without quality degradation
  • Expressive reading with appropriate emotion

Educational Content

Create educational materials:
  • Dialogue-based learning scenarios
  • Multi-speaker explanations
  • Interview-style educational content

Content Creation

Support content creators with:
  • Synthetic voice-overs for videos
  • Character voices for animation
  • Placeholder audio for production workflows

Next Steps

Explore the VibeVoice ecosystem:
