Real-Time Performance
VibeVoice-Realtime-0.5B is optimized for low-latency streaming applications: the model produces initial audible speech in approximately 300 milliseconds (hardware dependent), enabling true real-time conversational experiences.
Hardware Requirements
- Tested configurations: NVIDIA T4, Mac M4 Pro
- Parameter size: 0.5B (deployment-friendly)
- Frame rate: ultra-low 7.5 Hz for efficient processing
- Context length: 8K tokens
- Generation length: Up to 10 minutes
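The frame rate, context length, and generation length above fit together; a quick back-of-the-envelope check (assuming roughly one context token per speech frame, which is an assumption, not a documented token accounting) shows why a 10-minute generation fits in an 8K context:

```python
# Does a 10-minute generation fit in an 8K-token context at 7.5 Hz?
# Assumes ~1 context token per speech frame (illustrative assumption).
FRAME_RATE_HZ = 7.5
MAX_MINUTES = 10
CONTEXT_TOKENS = 8 * 1024

speech_frames = FRAME_RATE_HZ * MAX_MINUTES * 60  # 7.5 * 600 = 4500 frames
print(speech_frames)                   # 4500.0
print(speech_frames < CONTEXT_TOKENS)  # True: headroom left for text tokens
```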
Due to network latency, the perceived time to first audio may exceed the ~300 ms it takes to generate the first speech chunk.
Zero-Shot TTS Benchmarks
VibeVoice-Realtime achieves competitive performance on standard TTS benchmarks while focusing on long-form speech generation.

LibriSpeech test-clean
Performance on the LibriSpeech test-clean dataset:

| Model | WER (%) ↓ | Speaker Similarity ↑ |
|---|---|---|
| VALL-E 2 | 2.40 | 0.643 |
| Voicebox | 1.90 | 0.662 |
| MELLE | 2.10 | 0.625 |
| VibeVoice-Realtime-0.5B | 2.00 | 0.695 |
Understanding WER
Word Error Rate (WER) measures transcription accuracy. Lower values indicate better intelligibility and pronunciation quality. VibeVoice achieves 2.00% WER, competitive with much larger models.
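WER is the word-level edit distance between a reference transcript and the recognized transcript of the generated audio, divided by the number of reference words. A minimal standalone sketch (not the benchmark's actual scorer):

```python
# Minimal WER: word-level Levenshtein distance / number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / len(ref)

# One substitution ("the" -> "a") over 6 reference words ~ 16.7% WER.
print(round(wer("the cat sat on the mat", "the cat sat on a mat"), 3))
```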
Understanding Speaker Similarity
Speaker Similarity measures how closely the generated voice matches the target speaker. VibeVoice achieves 0.695, the highest score among compared models, indicating excellent voice cloning capability.
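Speaker similarity is typically computed as the cosine similarity between speaker embeddings of the reference and generated audio; the embedding model itself (usually a speaker-verification network) is outside this sketch, and the vectors below are toy values:

```python
import math

# Cosine similarity between two speaker embeddings. Real embeddings are
# typically 192-512 dimensional; these 3-d vectors are illustrative only.
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

ref_emb = [0.9, 0.1, 0.4]  # embedding of the reference speaker (toy)
gen_emb = [0.8, 0.2, 0.5]  # embedding of the generated audio (toy)
print(round(cosine_similarity(ref_emb, gen_emb), 3))
```

A score of 1.0 means identical embedding directions; benchmark scores like 0.695 reflect how close the cloned voice is to the target on average.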
SEED test-en
Performance on the SEED test-en dataset:

| Model | WER (%) ↓ | Speaker Similarity ↑ |
|---|---|---|
| MaskGCT | 2.62 | 0.714 |
| Seed-TTS | 2.25 | 0.762 |
| FireRedTTS | 3.82 | 0.460 |
| SparkTTS | 1.98 | 0.584 |
| CosyVoice2 | 2.57 | 0.652 |
| VibeVoice-Realtime-0.5B | 2.05 | 0.633 |
Scalability & Long-Form Generation
VibeVoice’s architecture enables unprecedented scalability for long-form content:

Long-Form Multi-Speaker Model
- Generation length: Up to 90 minutes of continuous speech
- Speaker support: Up to 4 distinct speakers
- Use cases: Podcasts, audiobooks, conversational content
Technical Innovations
VibeVoice uses continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz, efficiently preserving audio fidelity while significantly boosting computational efficiency.
- Next-token diffusion framework: Combines LLM understanding with diffusion-based acoustic generation
- Streaming support: Real-time model supports streaming text input
- Interleaved windowed design: Incrementally encodes text chunks while generating audio from prior context
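The interleaved windowed design can be sketched as a loop that alternates between encoding newly arrived text and generating audio from the context accumulated so far, so playback can begin before the full text is available. Every name below is illustrative; the real VibeVoice API may differ:

```python
from collections import deque

# Hypothetical sketch of interleaved windowed streaming: text chunks are
# encoded incrementally into a sliding context window, and each step emits
# audio generated from that prior context.
def stream_tts(text_chunks, encode, generate_audio, window=3):
    context = deque(maxlen=window)  # sliding window of encoded text chunks
    for chunk in text_chunks:
        context.append(encode(chunk))        # incrementally encode new text
        yield generate_audio(list(context))  # emit audio from prior context

# Stub encoder/generator just to show the control flow (not real models).
audio = list(stream_tts(
    ["Hello", "world", "from", "VibeVoice"],
    encode=str.upper,
    generate_audio=lambda ctx: "|".join(ctx),
))
print(audio)
```

The key property is that each yielded "audio" chunk depends only on text already seen, which is what makes streaming text input possible.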
Model Variants Comparison
| Feature | Realtime-0.5B | Long-Form Multi-Speaker |
|---|---|---|
| Parameter Size | 0.5B | 1.5B (Qwen2.5 base) |
| Latency | ~300ms | Standard |
| Max Speakers | 1 | 4 |
| Max Duration | ~10 min | 90 min |
| Streaming Input | Yes | No |
| Primary Use Case | Real-time TTS, live narration | Podcasts, conversations |
Performance Optimization Tips
For optimal performance, use an NVIDIA Deep Learning Container to manage the CUDA environment. Container releases 24.07, 24.10, and 24.12 are verified compatible.
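A minimal setup using the NGC PyTorch image (24.07 shown; substitute 24.10 or 24.12 as needed — the mount path is an illustrative choice, not a project requirement):

```shell
# Pull a verified NGC PyTorch container release
docker pull nvcr.io/nvidia/pytorch:24.07-py3

# Run with GPU access, mounting the current directory as the workspace
docker run --gpus all -it --rm \
    -v "$(pwd)":/workspace \
    nvcr.io/nvidia/pytorch:24.07-py3
```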
Recommended Setup
- Use GPU acceleration (NVIDIA T4 or better)
- Enable Flash Attention for improved inference speed
- Optimize batch sizes based on available VRAM
- Consider network latency for WebSocket deployments
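The last point can be made concrete with a simple latency budget. Only the ~300 ms generation figure comes from this document; the network and buffering numbers below are illustrative placeholders:

```python
# Perceived time-to-first-audio over a WebSocket is roughly the sum of
# generation latency plus network and client-side buffering overheads.
generation_ms = 300    # first speech chunk (hardware dependent, per docs)
network_rtt_ms = 50    # client <-> server round trip (assumed)
jitter_buffer_ms = 40  # client-side playback buffer (assumed)

time_to_first_audio_ms = generation_ms + network_rtt_ms + jitter_buffer_ms
print(time_to_first_audio_ms)  # 390
```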
Quality Metrics Summary
VibeVoice excels across multiple quality dimensions:

- Intelligibility: 2.00-2.05% WER on standard benchmarks
- Voice Similarity: 0.633-0.695 speaker similarity scores
- Naturalness: High MOS preference scores (see project page)
- Expressiveness: Supports spontaneous singing and emotional speech
- Consistency: Maintains speaker identity across long-form content