Overview
The VibeVoice long-form multi-speaker model generates expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly scalability, speaker consistency, and natural turn-taking. The model can synthesize conversational or single-speaker speech up to 90 minutes long with up to 4 distinct speakers, surpassing the 1-2 speaker limits typical of prior models.
Model Specifications
| Specification | Value |
|---|---|
| Base Model | Qwen2.5 1.5B |
| Max Duration | 90 minutes |
| Max Speakers | 4 distinct speakers |
| Frame Rate | 7.5 Hz (ultra-low) |
| Primary Languages | English, Chinese |
| Use Cases | Podcasts, audiobooks, long conversations |
Key Features
Extended Multi-Speaker Support
Generates speech with up to 4 distinct speakers, maintaining consistent voice characteristics and natural turn-taking throughout conversations. This surpasses traditional TTS systems limited to 1-2 speakers.
Ultra-Long Generation
Capable of generating up to 90 minutes of continuous, coherent speech without degradation in quality or speaker consistency. Perfect for podcasts, audiobooks, and extended narratives.
Expressive Speech
Produces natural, expressive speech with appropriate prosody, emotion, and conversational dynamics. The model understands dialogue flow and context to generate realistic turn-taking.
Spontaneous Capabilities
Can generate spontaneous singing and other expressive vocalizations when contextually appropriate, adding naturalness to generated content.
Architecture
The long-form model employs a novel framework using continuous speech tokenizers and next-token diffusion.
Dual Tokenizer System
Unlike the realtime variant, the long-form model uses both tokenizers:
Acoustic Tokenizer
VAE latent dimension for acoustic features
Downsampling ratios for each encoder layer
Depth configuration for encoder layers
Number of filters in encoder
Whether to use causal convolutions
Semantic Tokenizer
VAE latent dimension for semantic features (can differ from acoustic)
Fixed standard deviation (0 for semantic tokenizer)
Standard deviation distribution type (none for deterministic encoding)
Diffusion Head Configuration
Hidden dimension size for the diffusion head
Number of transformer layers in diffusion head
Feed-forward network expansion ratio
Size of latent representation for diffusion
Number of diffusion training steps
Number of inference steps for speech generation
Beta schedule for diffusion process (cosine or linear)
Batch multiplier for diffusion training
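The knobs listed above might be collected in a configuration along these lines. Note that the key names and example values below are illustrative assumptions for this sketch, not VibeVoice's actual config schema:

```python
# Illustrative diffusion-head configuration; key names and values are
# assumptions for this sketch, not the real VibeVoice config fields.
diffusion_head_config = {
    "hidden_size": 768,         # hidden dimension of the diffusion head
    "num_layers": 4,            # transformer layers in the head
    "ffn_ratio": 4,             # feed-forward expansion ratio
    "latent_size": 64,          # size of the latent fed to diffusion
    "train_steps": 1000,        # diffusion steps used during training
    "inference_steps": 10,      # denoising steps at generation time
    "beta_schedule": "cosine",  # "cosine" or "linear"
    "batch_mul": 4,             # batch multiplier for diffusion training
}

# Inference typically uses far fewer steps than training.
assert diffusion_head_config["inference_steps"] < diffusion_head_config["train_steps"]
```

The large train/inference step gap is typical of diffusion models: training covers the full noise schedule, while inference uses an accelerated sampler with only a handful of steps.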
Language Model Backbone
Built on Qwen2.5 1.5B with customizations for speech generation:
- Context Understanding: LLM component understands textual context and dialogue flow
- Speaker Modeling: Handles multi-speaker turn-taking and consistency
- Diffusion Integration: Diffusion head generates high-fidelity acoustic details
Technical Innovation
Ultra-Low Frame Rate Tokenizers
A core innovation of VibeVoice is the use of continuous speech tokenizers operating at an ultra-low frame rate of 7.5 Hz. This design:
- Preserves Audio Fidelity: Maintains high-quality audio despite the low frame rate
- Boosts Efficiency: Significantly reduces computational requirements for long sequences
- Enables Long-Form: Makes 90-minute generation computationally feasible
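To see concretely why the frame rate matters, compare the sequence lengths implied by different tokenizer rates over a full 90-minute session. The 50 Hz comparison figure below is an assumption chosen as a typical neural-codec rate, not a number from the VibeVoice paper:

```python
def frames_for(minutes: float, frame_rate_hz: float) -> int:
    """Number of tokenizer frames needed to cover a given duration."""
    return int(minutes * 60 * frame_rate_hz)

# VibeVoice's ultra-low 7.5 Hz rate keeps a 90-minute session tractable.
vibevoice_frames = frames_for(90, 7.5)   # 40,500 frames
baseline_frames = frames_for(90, 50.0)   # 270,000 frames at a typical codec rate

print(vibevoice_frames, baseline_frames)  # prints "40500 270000"
```

Since attention cost grows quadratically with sequence length, the ~6.7x shorter sequence translates into a much larger saving in compute, which is what makes 90-minute generation feasible.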
Next-Token Diffusion Framework
Combines the strengths of:
- LLM: Understands text, context, dialogue structure, and speaker turns
- Diffusion Models: Generates high-fidelity, natural-sounding acoustic details
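A toy sketch of the hybrid loop may help: the LLM side advances one token position per step, and a diffusion head runs several denoising iterations to turn that position's conditioning signal into an acoustic latent. Every function here is a simplified stand-in, not VibeVoice's API:

```python
import random

def llm_hidden_state(context: list[float], step: int) -> float:
    """Stand-in for the LLM: summarize prior latents into a conditioning signal."""
    return sum(context) / (len(context) + 1) + step

def denoise(noise: float, cond: float, inference_steps: int) -> float:
    """Stand-in diffusion head: iteratively pull noise toward the condition."""
    x = noise
    for _ in range(inference_steps):
        x = x + 0.5 * (cond - x)  # one toy denoising update
    return x

def generate(num_frames: int, inference_steps: int = 10) -> list[float]:
    random.seed(0)
    latents: list[float] = []
    for step in range(num_frames):                 # next-token loop (LLM side)
        cond = llm_hidden_state(latents, step)
        x = denoise(random.gauss(0, 1), cond, inference_steps)  # diffusion side
        latents.append(x)                          # latent becomes new context
    return latents

frames = generate(5)  # five acoustic latents, one per token position
```

The key structural point the sketch preserves is the nesting: one autoregressive step per frame on the outside, a short denoising loop on the inside, with each generated latent fed back as context for the next step.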
Performance
Quality Metrics
The model achieves superior performance compared to existing TTS systems. Based on MOS (Mean Opinion Score) preference testing, VibeVoice demonstrates higher naturalness and quality ratings than baseline models.
Capabilities Demonstrated
- English Conversations
- Chinese Conversations
- Cross-Lingual
- Special Features
- Natural multi-speaker dialogues
- Appropriate prosody and emotion
- Smooth turn-taking
- Speaker consistency over long durations
Usage
Installation
Basic Generation
Multi-Speaker Configuration
- 2 Speakers
- 4 Speakers
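Multi-speaker scripts are typically written as one line per turn with a speaker-label prefix. The helper below is an illustrative formatter; the exact label format and speaker-indexing convention VibeVoice expects may differ:

```python
def format_script(turns: list[tuple[int, str]], max_speakers: int = 4) -> str:
    """Render (speaker_index, text) turns as a labeled transcript.

    Enforces the model's 4-speaker limit; the "Speaker N:" label format
    is an assumption for this sketch.
    """
    speakers = {idx for idx, _ in turns}
    if len(speakers) > max_speakers:
        raise ValueError(f"at most {max_speakers} distinct speakers supported")
    return "\n".join(f"Speaker {idx}: {text}" for idx, text in turns)

script = format_script([
    (1, "Welcome back to the show."),
    (2, "Thanks for having me!"),
    (1, "Let's dive right in."),
])
print(script)
```

Validating the speaker count up front mirrors the hard limit in the table above: the model cannot handle a fifth distinct voice, so it is better to fail before synthesis starts.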
Limitations
Language Support
Supported Languages
- English: Full support
- Chinese: Full support
- Other Languages: May produce unexpected audio outputs
Technical Limitations
Non-Speech Audio
The model focuses solely on speech synthesis and does not handle:
- Background noise
- Music (except spontaneous singing)
- Sound effects
- Environmental sounds
Overlapping Speech
The current model does not explicitly model or generate overlapping speech segments in conversations. All speakers take distinct turns.
Speaker Limits
- Maximum: 4 distinct speakers
- Each speaker must maintain consistent role throughout
- Cannot dynamically add new speakers mid-generation
Model Inheritance
VibeVoice inherits any biases, errors, or omissions from its base model (Qwen2.5 1.5B). The model may produce outputs that are unexpected, biased, or inaccurate.
Responsible AI Considerations
User Responsibilities
Users must:
- Ensure Transcript Reliability: Verify that input text is accurate and appropriate
- Check Content Accuracy: Validate generated content before distribution
- Avoid Misleading Use: Do not use generated content in deceptive ways
- Disclose AI Usage: Best practice to disclose when content is AI-generated
- Legal Compliance: Deploy in full compliance with all applicable laws and regulations
- Lawful Deployment: Use generated content in a lawful manner
Ethical Guidelines
Transparency
Always disclose the use of AI when sharing AI-generated audio content. Listeners have the right to know when content is synthetic.
Verification
Implement verification mechanisms to prevent misuse, especially in contexts where voice authenticity is critical (e.g., news, official communications).
Testing
Conduct thorough testing for bias, accuracy, and safety before any deployment, especially in sensitive domains.
Model Architecture Details
Tokenizer Configuration
Both acoustic and semantic tokenizers share common parameters:
Number of audio channels (mono)
Corpus-level normalization factor
Type of mixing layer (depthwise_conv)
Convolution normalization type
Layer normalization type
Epsilon for layer normalization
Whether to use bias in convolutions
Initial value for layer scaling
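The shared settings above could be expressed as a config fragment like the following. The key names and most values are illustrative assumptions; only the mono channel count and the depthwise_conv mixing layer come from the list above:

```python
# Shared tokenizer settings; key names are illustrative placeholders,
# not the actual VibeVoice config fields. Values other than "channels"
# and "mixer_layer" are assumed examples.
tokenizer_common = {
    "channels": 1,                    # mono audio
    "corpus_norm_factor": 1.0,        # corpus-level normalization factor
    "mixer_layer": "depthwise_conv",  # type of mixing layer
    "conv_norm": "none",              # convolution normalization type
    "layer_norm": "rms",              # layer normalization type
    "layer_norm_eps": 1e-5,           # epsilon for layer normalization
    "conv_bias": True,                # whether convolutions use bias
    "layer_scale_init": 1e-6,         # initial value for layer scaling
}
```

Keeping these in one shared block means the acoustic and semantic tokenizers only need to override the fields where they genuinely differ, such as the VAE latent dimension.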
Use Cases
Podcast Generation
Ideal for creating synthetic podcasts:
- Multiple hosts and guests
- Natural conversational flow
- Extended episode lengths (up to 90 minutes)
- Consistent speaker identities
Audiobook Production
Generate audiobooks with:
- Single narrator or multiple character voices
- Long-form narration without quality degradation
- Expressive reading with appropriate emotion
Educational Content
Create educational materials:
- Dialogue-based learning scenarios
- Multi-speaker explanations
- Interview-style educational content
Content Creation
Support content creators with:
- Synthetic voice-overs for videos
- Character voices for animation
- Placeholder audio for production workflows
Next Steps
Explore the VibeVoice ecosystem:
- VibeVoice-Realtime-0.5B - For real-time, streaming TTS applications
- API Reference - Detailed API documentation
- Guides - Inference examples and use cases