
Overview

The VibeVoice long-form multi-speaker model is designed for generating expressive, long-form, multi-speaker conversational audio such as podcasts from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.
This model can synthesize up to 90 minutes of conversational or single-speaker speech with up to 4 distinct speakers, surpassing the typical 1-2 speaker limit of many prior models.

Model Specifications

  • Base Model: Qwen2.5 1.5B
  • Max Duration: 90 minutes
  • Max Speakers: 4 distinct speakers
  • Frame Rate: 7.5 Hz (ultra-low)
  • Primary Languages: English, Chinese
  • Use Cases: Podcasts, audiobooks, long conversations

Key Features

Generates speech with up to 4 distinct speakers, maintaining consistent voice characteristics and natural turn-taking throughout conversations. This surpasses traditional TTS systems limited to 1-2 speakers.
Capable of generating up to 90 minutes of continuous, coherent speech without degradation in quality or speaker consistency. Perfect for podcasts, audiobooks, and extended narratives.
Produces natural, expressive speech with appropriate prosody, emotion, and conversational dynamics. The model understands dialogue flow and context to generate realistic turn-taking.
Can generate spontaneous singing and other expressive vocalizations when contextually appropriate, adding naturalness to generated content.

Architecture

The long-form model employs a novel framework using continuous speech tokenizers and next-token diffusion.

Dual Tokenizer System

Unlike the realtime variant, the long-form model uses both tokenizers:

Acoustic Tokenizer

  • vae_dim (integer, default: 64): VAE latent dimension for acoustic features
  • encoder_ratios (array, default: [8,5,5,4,2,2]): Downsampling ratios for each encoder layer
  • encoder_depths (string, default: "3-3-3-3-3-3-8"): Depth configuration for encoder layers
  • encoder_n_filters (integer, default: 32): Number of filters in the encoder
  • causal (boolean, default: true): Whether to use causal convolutions
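As a sanity check, the cumulative downsampling implied by the default encoder_ratios can be computed directly. Assuming a 24 kHz input sample rate (an assumption, not stated above), the ratios multiply out to exactly the model's 7.5 Hz frame rate:

```python
from math import prod

encoder_ratios = [8, 5, 5, 4, 2, 2]        # default per-layer downsampling ratios
total_downsampling = prod(encoder_ratios)  # cumulative factor across the encoder
sample_rate = 24_000                       # assumed input sample rate in Hz

frame_rate = sample_rate / total_downsampling
print(total_downsampling, frame_rate)      # 3200x downsampling -> 7.5 Hz
```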

Semantic Tokenizer

  • vae_dim (integer, default: 64): VAE latent dimension for semantic features (can differ from acoustic)
  • fix_std (float, default: 0): Fixed standard deviation (0 for the semantic tokenizer)
  • std_dist_type (string, default: "none"): Standard deviation distribution type (none for deterministic encoding)

Diffusion Head Configuration

  • hidden_size (integer, default: 768): Hidden dimension size for the diffusion head
  • head_layers (integer, default: 4): Number of transformer layers in the diffusion head
  • head_ffn_ratio (float, default: 3.0): Feed-forward network expansion ratio
  • latent_size (integer, default: 64): Size of the latent representation for diffusion
  • ddpm_num_steps (integer, default: 1000): Number of diffusion training steps
  • ddpm_num_inference_steps (integer, default: 20): Number of inference steps for speech generation
  • ddpm_beta_schedule (string, default: "cosine"): Beta schedule for the diffusion process (cosine or linear)
  • ddpm_batch_mul (integer, default: 4): Batch multiplier for diffusion training
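For reference, the default cosine beta schedule can be sketched as below. This follows the standard cosine schedule from the diffusion literature (Nichol & Dhariwal, 2021), not the exact VibeVoice implementation, so treat it as illustrative:

```python
import math

def cosine_beta_schedule(num_steps: int, s: float = 0.008) -> list:
    """Standard cosine noise schedule: alpha_bar follows a squared-cosine
    curve, and per-step betas are derived from consecutive ratios."""
    def alpha_bar(t: float) -> float:
        return math.cos((t / num_steps + s) / (1 + s) * math.pi / 2) ** 2

    betas = []
    for i in range(num_steps):
        # Clip to 0.999 to avoid a degenerate final step
        betas.append(min(1 - alpha_bar(i + 1) / alpha_bar(i), 0.999))
    return betas

betas = cosine_beta_schedule(1000)  # matches the ddpm_num_steps default
```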

Language Model Backbone

Built on Qwen2.5 1.5B with customizations for speech generation:
  • Context Understanding: LLM component understands textual context and dialogue flow
  • Speaker Modeling: Handles multi-speaker turn-taking and consistency
  • Diffusion Integration: Diffusion head generates high-fidelity acoustic details

Technical Innovation

Ultra-Low Frame Rate Tokenizers

A core innovation of VibeVoice is the use of continuous speech tokenizers operating at an ultra-low frame rate of 7.5 Hz. This design:
  • Preserves Audio Fidelity: Maintains high-quality audio despite low frame rate
  • Boosts Efficiency: Significantly reduces computational requirements for long sequences
  • Enables Long-Form: Makes 90-minute generation computationally feasible
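To make the efficiency gain concrete, here is a back-of-the-envelope sequence-length comparison. The 50 Hz reference rate is an assumed figure for a typical neural audio codec, not a number from the text above:

```python
def num_frames(minutes: float, frame_rate_hz: float) -> int:
    """Number of speech tokens needed to cover a session at a given frame rate."""
    return int(minutes * 60 * frame_rate_hz)

vibevoice_frames = num_frames(90, 7.5)   # full 90-minute session at 7.5 Hz
typical_frames = num_frames(90, 50.0)    # same session at an assumed 50 Hz codec
print(vibevoice_frames, typical_frames)  # 40500 vs 270000 tokens
```

At 7.5 Hz, a full 90-minute session fits in roughly 40,500 speech tokens, well within a long-context LLM window.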

Next-Token Diffusion Framework

Combines the strengths of:
  • LLM: Understands text, context, dialogue structure, and speaker turns
  • Diffusion Models: Generates high-fidelity, natural-sounding acoustic details
This hybrid approach outperforms pure autoregressive or pure diffusion models.
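Conceptually, the hybrid generation loop interleaves the two components: one autoregressive LLM step per speech frame, followed by a short diffusion refinement of a continuous latent. The sketch below uses toy stand-ins for both components; all names are illustrative, not the actual VibeVoice API:

```python
def generate_latents(llm_step, denoise, num_frames: int, steps: int = 20) -> list:
    """Toy next-token diffusion loop: each iteration the LLM produces one
    conditioning vector, and the diffusion head refines a continuous latent
    from it over `steps` denoising iterations."""
    latents, state = [], 0.0
    for _ in range(num_frames):
        cond, state = llm_step(state)         # autoregressive LLM step
        latents.append(denoise(cond, steps))  # diffusion-refined latent
    return latents

# Toy stand-ins for the real model components:
toy_llm = lambda state: (state + 1.0, state + 1.0)
toy_head = lambda cond, steps: cond / steps
latents = generate_latents(toy_llm, toy_head, num_frames=3)
```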

Performance

Quality Metrics

In MOS (Mean Opinion Score) preference testing, VibeVoice achieves higher naturalness and quality ratings than existing baseline TTS systems.

Capabilities Demonstrated

  • Natural multi-speaker dialogues
  • Appropriate prosody and emotion
  • Smooth turn-taking
  • Speaker consistency over long durations

Usage

Installation

# Clone the repository
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice/

# Install dependencies
pip install -e .

Basic Generation

from transformers import AutoModelForCausalLM
from vibevoice.modular.modular_vibevoice_text_tokenizer import VibeVoiceTextTokenizerFast

# Load the long-form model
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/VibeVoice",  # Long-form model path
    trust_remote_code=True
)

tokenizer = VibeVoiceTextTokenizerFast.from_pretrained(
    "microsoft/VibeVoice"
)

# Prepare multi-speaker conversation
conversation = """
Speaker 1: Welcome to our podcast about AI.
Speaker 2: Thanks for having me! I'm excited to discuss this.
Speaker 1: Let's start with the basics...
"""

input_ids = tokenizer(conversation, return_tensors="pt").input_ids

# Generate speech
output = model.generate(
    inputs=input_ids,
    tokenizer=tokenizer,
    max_new_tokens=10000,  # For long-form generation
    return_speech=True
)

# Access generated audio
audio = output.speech_outputs[0]
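To persist the result, the waveform can be written to disk with the standard library. This assumes the output is a mono float array in [-1, 1] at a 24 kHz sample rate (both assumptions; check the repository for the model's actual output format):

```python
import wave

import numpy as np

def save_wav(path: str, audio, sample_rate: int = 24_000) -> None:
    """Write a mono float waveform in [-1, 1] to a 16-bit PCM WAV file."""
    pcm = (np.clip(np.asarray(audio), -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)       # mono, matching the tokenizer's channels=1
        f.setsampwidth(2)       # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())
```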

Multi-Speaker Configuration

# Define speaker roles
speakers = {
    "host": "Speaker1",
    "guest": "Speaker2"
}

conversation = f"""
{speakers['host']}: Welcome to the show.
{speakers['guest']}: Thank you for the invitation.
"""

Limitations

Research Model: Not recommended for commercial or real-world applications without further testing and development. This model is intended for research and development purposes only.

Language Support

  • English: Full support
  • Chinese: Full support
  • Other Languages: May produce unexpected audio outputs

Technical Limitations

The model focuses solely on speech synthesis and does not handle:
  • Background noise
  • Music (except spontaneous singing)
  • Sound effects
  • Environmental sounds
The current model does not explicitly model or generate overlapping speech segments; all speakers take distinct turns. Speaker handling is also constrained:
  • Maximum of 4 distinct speakers
  • Each speaker must maintain a consistent role throughout
  • New speakers cannot be added mid-generation
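These constraints can be checked before generation. The validator below assumes each turn is marked by a leading "Name:" prefix, as in the usage examples above (an assumption about the transcript format, not part of the model API):

```python
import re

MAX_SPEAKERS = 4  # the model's documented speaker limit

def validate_speakers(script: str) -> set:
    """Collect distinct speaker labels from 'Name:' line prefixes and
    raise if the script exceeds the 4-speaker limit."""
    speakers = {m.group(1).strip() for m in re.finditer(r"^(.+?):", script, re.M)}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(f"{len(speakers)} speakers exceeds limit of {MAX_SPEAKERS}")
    return speakers
```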

Model Inheritance

VibeVoice inherits any biases, errors, or omissions from its base model (Qwen2.5 1.5B). The model may produce outputs that are unexpected, biased, or inaccurate.

Responsible AI Considerations

Deepfake and Disinformation Risks: High-quality synthetic speech can be misused to create convincing fake audio content for impersonation, fraud, or spreading disinformation.

User Responsibilities

Users must:
  1. Ensure Transcript Reliability: Verify that input text is accurate and appropriate
  2. Check Content Accuracy: Validate generated content before distribution
  3. Avoid Misleading Use: Do not use generated content in deceptive ways
  4. Disclose AI Usage: Best practice to disclose when content is AI-generated
  5. Legal Compliance: Deploy in full compliance with all applicable laws and regulations
  6. Lawful Deployment: Use generated content in a lawful manner

Ethical Guidelines

Always disclose the use of AI when sharing AI-generated audio content. Listeners have the right to know when content is synthetic.
Implement verification mechanisms to prevent misuse, especially in contexts where voice authenticity is critical (e.g., news, official communications).
Conduct thorough testing for bias, accuracy, and safety before any deployment, especially in sensitive domains.

Model Architecture Details

Tokenizer Configuration

Both acoustic and semantic tokenizers share common parameters:
  • channels (integer, default: 1): Number of audio channels (mono)
  • corpus_normalize (float, default: 0.0): Corpus-level normalization factor
  • mixer_layer (string, default: "depthwise_conv"): Type of mixing layer
  • conv_norm (string, default: "none"): Convolution normalization type
  • layernorm (string, default: "RMSNorm"): Layer normalization type
  • layernorm_eps (float, default: 1e-5): Epsilon for layer normalization
  • conv_bias (boolean, default: true): Whether to use bias in convolutions
  • layer_scale_init_value (float, default: 1e-6): Initial value for layer scaling

Use Cases

Podcast Generation

Ideal for creating synthetic podcasts:
  • Multiple hosts and guests
  • Natural conversational flow
  • Extended episode lengths (up to 90 minutes)
  • Consistent speaker identities

Audiobook Production

Generate audiobooks with:
  • Single narrator or multiple character voices
  • Long-form narration without quality degradation
  • Expressive reading with appropriate emotion

Educational Content

Create educational materials:
  • Dialogue-based learning scenarios
  • Multi-speaker explanations
  • Interview-style educational content

Content Creation

Support content creators with:
  • Synthetic voice-overs for videos
  • Character voices for animation
  • Placeholder audio for production workflows

Next Steps

Explore the VibeVoice ecosystem:
