VibeVoice delivers state-of-the-art performance across multiple benchmarks while maintaining efficient real-time generation capabilities.

Real-Time Performance

VibeVoice-Realtime-0.5B is optimized for low-latency streaming applications: it produces initial audible speech in approximately 300 milliseconds (hardware dependent), enabling true real-time conversational experiences.

Hardware Requirements

  • Tested configurations: NVIDIA T4, Mac M4 Pro
  • Parameter size: 0.5B (deployment-friendly)
  • Frame rate: ultra-low 7.5 Hz for efficient processing
  • Context length: 8K tokens
  • Generation length: Up to 10 minutes
Because of network latency, audio playback may begin later than the ~300 ms it takes to generate the first speech chunk.
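The figures above can be sanity-checked with simple arithmetic. This sketch assumes 8K means 8192 tokens and, purely for illustration, one context token per speech frame (the model's actual token accounting may differ):

```python
# Sanity check: at 7.5 Hz, how many speech frames does the 10-minute
# generation limit produce, and does that fit in the 8K context window?
FRAME_RATE_HZ = 7.5      # speech tokenizer frame rate
CONTEXT_TOKENS = 8192    # 8K context length (assumed to mean 8192)
MAX_MINUTES = 10         # stated generation limit

frames_for_max_duration = FRAME_RATE_HZ * MAX_MINUTES * 60
print(frames_for_max_duration)                    # 4500.0 frames
print(frames_for_max_duration < CONTEXT_TOKENS)   # True
```

Ten minutes of audio occupies roughly 4500 frames, comfortably inside the 8K window with room left for text tokens — one reason the ultra-low 7.5 Hz frame rate matters.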

Zero-Shot TTS Benchmarks

VibeVoice-Realtime achieves competitive performance on standard TTS benchmarks while focusing on long-form speech generation.

LibriSpeech test-clean

Performance on the LibriSpeech test-clean dataset:
Model                      WER (%) ↓    Speaker Similarity ↑
VALL-E 2                   2.40         0.643
Voicebox                   1.90         0.662
MELLE                      2.10         0.625
VibeVoice-Realtime-0.5B    2.00         0.695
Word Error Rate (WER) measures transcription accuracy. Lower values indicate better intelligibility and pronunciation quality. VibeVoice achieves 2.00% WER, competitive with much larger models.
Speaker Similarity measures how closely the generated voice matches the target speaker. VibeVoice achieves 0.695, the highest score among compared models, indicating excellent voice cloning capability.
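For readers unfamiliar with the metric, WER is the word-level edit distance between a transcript of the generated audio and the reference text, divided by the reference length. A minimal pure-Python illustration (benchmark pipelines typically use a dedicated ASR model plus a scoring library, not this toy function):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / len(ref)

print(round(wer("the cat sat on the mat", "the cat sat on a mat"), 3))
# → 0.167 (1 substitution out of 6 reference words)
```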

SEED test-en

Performance on the SEED test-en dataset:
Model                      WER (%) ↓    Speaker Similarity ↑
MaskGCT                    2.62         0.714
Seed-TTS                   2.25         0.762
FireRedTTS                 3.82         0.460
SparkTTS                   1.98         0.584
CosyVoice2                 2.57         0.652
VibeVoice-Realtime-0.5B    2.05         0.633
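Speaker similarity in these benchmarks is commonly computed as the cosine similarity between speaker embeddings extracted from the reference and generated audio. The embedding vectors below are made-up toy values; real pipelines use a pretrained speaker-verification model producing vectors with hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy 4-dimensional "speaker embeddings" for illustration only.
ref_emb = [0.2, 0.9, 0.1, 0.4]    # from the reference voice prompt
gen_emb = [0.25, 0.85, 0.05, 0.5] # from the generated speech
print(round(cosine_similarity(ref_emb, gen_emb), 3))  # → 0.992
```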

Scalability & Long-Form Generation

VibeVoice’s architecture enables unprecedented scalability for long-form content:

Long-Form Multi-Speaker Model

  • Generation length: Up to 90 minutes of continuous speech
  • Speaker support: Up to 4 distinct speakers
  • Use cases: Podcasts, audiobooks, conversational content

Technical Innovations

VibeVoice uses continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz, efficiently preserving audio fidelity while significantly boosting computational efficiency.
Key architectural features:
  • Next-token diffusion framework: Combines LLM understanding with diffusion-based acoustic generation
  • Streaming support: Real-time model supports streaming text input
  • Interleaved windowed design: Incrementally encodes text chunks while generating audio from prior context
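The interleaved windowed design can be sketched conceptually as a generator that alternates between encoding newly arrived text and emitting audio frames from the context gathered so far. This is a toy model of the control flow, not the actual VibeVoice API; chunk sizes, frame counts, and function names are invented for illustration:

```python
def interleaved_stream(text_chunks, frames_per_chunk=4):
    """Toy sketch: encode each incoming text chunk, then generate audio
    frames from the accumulated context, so synthesis starts after the
    first chunk rather than waiting for the full text."""
    context = []
    for chunk in text_chunks:
        context.append(chunk)                 # incrementally encode new text
        for i in range(frames_per_chunk):     # generate audio from prior context
            yield f"frame({chunk}:{i})"

frames = list(interleaved_stream(["Hello there,", "how are you?"]))
print(len(frames))   # 8 — audio frames appear after the first chunk arrives
print(frames[0])     # frame(Hello there,:0)
```

The key property the sketch shows is that output begins before input ends, which is what makes streaming text input possible.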

Model Variants Comparison

Feature              Realtime-0.5B                    Long-Form Multi-Speaker
Parameter Size       0.5B                             1.5B (Qwen2.5 base)
Latency              ~300 ms                          Standard
Max Speakers         1                                4
Max Duration         ~10 min                          90 min
Streaming Input      Yes                              No
Primary Use Case     Real-time TTS, live narration    Podcasts, conversations

Performance Optimization Tips

For optimal performance, use an NVIDIA Deep Learning Container to manage the CUDA environment. Container releases 24.07, 24.10, and 24.12 are verified compatible.
  1. Use GPU acceleration (NVIDIA T4 or better)
  2. Enable Flash Attention for improved inference speed
  3. Optimize batch sizes based on available VRAM
  4. Consider network latency for WebSocket deployments
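For tip 4, a back-of-the-envelope budget helps set expectations for WebSocket deployments. The network and buffering numbers below are illustrative assumptions, not measured values:

```python
# Time-to-first-audio budget for a WebSocket deployment.
# Only generation_ms comes from the docs; the rest are assumed examples.
generation_ms = 300       # first audible chunk (hardware dependent)
network_one_way_ms = 40   # assumed server -> client transit time
client_buffer_ms = 60     # assumed jitter buffer before playback starts

time_to_first_audio = generation_ms + network_one_way_ms + client_buffer_ms
print(time_to_first_audio)  # 400 ms heard by the user vs. 300 ms generated
```

Measuring the network and buffering terms in your own deployment tells you how much of the perceived latency is outside the model's control.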

Quality Metrics Summary

VibeVoice excels across multiple quality dimensions:
  • Intelligibility: 2.00-2.05% WER on standard benchmarks
  • Voice Similarity: 0.633-0.695 speaker similarity scores
  • Naturalness: High MOS preference scores (see project page)
  • Expressiveness: Supports spontaneous singing and emotional speech
  • Consistency: Maintains speaker identity across long-form content
The benchmarks above use short sentences, but the model is designed primarily for long-form speech generation; performance on very short inputs (three words or fewer) may degrade.
