Overview
VibeVoice-Realtime-0.5B is a lightweight real-time text-to-speech (TTS) model that supports streaming text input and robust long-form speech generation. It produces initial audible speech in approximately 300 milliseconds (hardware dependent) and can generate speech up to 10 minutes long. The model is optimized for real-time applications: streaming text input lets an LLM start speaking from its first tokens, before the complete response has been generated.
Model Specifications
| Specification | Value |
|---|---|
| Parameter Size | 0.5B |
| Base Model | Qwen2.5 0.5B |
| Context Length | 8K tokens |
| Max Generation Length | ~10 minutes |
| First Chunk Latency | ~300ms |
| Frame Rate | 7.5 Hz (ultra-low) |
| Speakers | Single speaker |
| Primary Language | English |
Key Features
Streaming Text Input
The model supports incremental text encoding, allowing you to feed text chunks while audio is being generated. This enables real-time TTS services and live data stream narration.
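As a sketch of this incremental pattern, audio windows can be emitted as soon as enough text has accumulated. The class and method names below are illustrative only, not the model's actual API:

```python
# Illustrative sketch of incremental text feeding (hypothetical names, not
# the model's actual API): text chunks arrive while audio is being emitted.

class StreamingNarrator:
    """Buffers incoming text and 'speaks' fixed-size windows as they fill."""

    def __init__(self, window_tokens=8):
        self.window_tokens = window_tokens
        self.buffer = []
        self.audio_chunks = []

    def feed(self, text_chunk):
        # Tokenize naively by whitespace; a real system uses the model tokenizer.
        self.buffer.extend(text_chunk.split())
        while len(self.buffer) >= self.window_tokens:
            window = self.buffer[:self.window_tokens]
            self.buffer = self.buffer[self.window_tokens:]
            self.audio_chunks.append(self._synthesize(window))

    def flush(self):
        # Speak whatever remains when the text stream ends.
        if self.buffer:
            self.audio_chunks.append(self._synthesize(self.buffer))
            self.buffer = []

    def _synthesize(self, tokens):
        # Stand-in for the diffusion-based acoustic generation step.
        return f"<audio for {len(tokens)} tokens>"

narrator = StreamingNarrator(window_tokens=4)
for chunk in ["Hello there, this is", "a streaming narration", "demo."]:
    narrator.feed(chunk)
narrator.flush()
```

The point of the pattern is that `feed` can be called from an LLM's token callback, so synthesis starts before the full response exists.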
Ultra-Low Latency
Produces initial audible speech in ~300ms using an interleaved, windowed design that processes text chunks in parallel with diffusion-based acoustic generation.
Long-Form Generation
Unlike traditional TTS models, VibeVoice-Realtime can generate robust long-form speech up to 10 minutes, maintaining consistency throughout the entire generation.
Efficient Architecture
Uses only an acoustic tokenizer (no semantic tokenizer) operating at 7.5 Hz frame rate, making it deployment-friendly with just 0.5B parameters.
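A back-of-the-envelope check (assuming one acoustic token per 7.5 Hz frame) shows why long-form generation fits comfortably within the 8K context:

```python
# Token budget at the 7.5 Hz acoustic frame rate, assuming one acoustic
# token per frame (an approximation for illustration).
FRAME_RATE_HZ = 7.5

tokens_per_minute = FRAME_RATE_HZ * 60       # 450 tokens per minute
tokens_for_10_min = tokens_per_minute * 10   # 4500 tokens for 10 minutes
print(tokens_for_10_min)  # 4500.0 — well inside the 8K context window
```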
Architecture
The model uses an interleaved, windowed design with the following components:
Speech Tokenizer
- Acoustic Tokenizer: Ultra-low frame rate (7.5 Hz) continuous speech tokenizer
- VAE Dimension: 64
- No Semantic Tokenizer: Removed for efficiency in streaming scenarios
Text Backbone
The decoder is divided into two components:
- Lower Transformer Layers: Used exclusively for encoding text
- Upper Transformer Layers (tts_backbone_num_hidden_layers=20): Used for encoding text and generating speech; this setting gives the number of upper Transformer layers dedicated to TTS generation
Diffusion Head
The diffusion head is configured by the following settings:
- Hidden dimension size for the diffusion head
- Number of layers in the diffusion prediction head
- Number of diffusion training steps
- Number of inference steps for speech generation
- Beta schedule type for the diffusion process
- Type of prediction used in diffusion (v_prediction or epsilon)
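For reference, the two prediction types follow the standard DDPM formulation. The sketch below uses generic diffusion notation for a single scalar sample; it is not the model's internal implementation:

```python
import math

# Standard DDPM prediction targets for one scalar sample (generic notation,
# not the model's internals). alpha_bar is the cumulative noise schedule value.
def targets(x0, eps, alpha_bar):
    a = math.sqrt(alpha_bar)        # signal coefficient
    s = math.sqrt(1.0 - alpha_bar)  # noise coefficient
    x_t = a * x0 + s * eps          # noised sample at this timestep
    v = a * eps - s * x0            # v-prediction target
    return x_t, eps, v              # epsilon-prediction target is eps itself

x_t, eps_target, v_target = targets(x0=1.0, eps=0.5, alpha_bar=0.64)
```

With v-prediction the network regresses a mix of signal and noise, which is often more stable across the schedule than predicting epsilon alone.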
Performance Benchmarks
The model achieves competitive performance on standard TTS benchmarks while being optimized for long-form generation.
LibriSpeech test-clean (Zero-shot)
| Model | WER (%) ↓ | Speaker Similarity ↑ |
|---|---|---|
| VALL-E 2 | 2.40 | 0.643 |
| Voicebox | 1.90 | 0.662 |
| MELLE | 2.10 | 0.625 |
| VibeVoice-Realtime-0.5B | 2.00 | 0.695 |
SEED test-en (Zero-shot)
| Model | WER (%) ↓ | Speaker Similarity ↑ |
|---|---|---|
| MaskGCT | 2.62 | 0.714 |
| Seed-TTS | 2.25 | 0.762 |
| FireRedTTS | 3.82 | 0.460 |
| SparkTTS | 1.98 | 0.584 |
| CosyVoice2 | 2.57 | 0.652 |
| VibeVoice-Realtime-0.5B | 2.05 | 0.633 |
The model achieves satisfactory performance on short-sentence benchmarks while being specifically optimized for long-form speech generation.
Usage
Installation
Real-time WebSocket Demo
- Launch Demo
- Inference from File
- Google Colab
In our tests, an NVIDIA T4 GPU and a Mac M4 Pro achieve real-time performance; other devices may require optimization.
Python API
Configuration Parameters
Generation Parameters
- Classifier-free guidance scale for speech diffusion. Higher values increase adherence to text conditioning but may reduce diversity.
- Whether to decode and return speech audio. Set to false to return only token sequences.
- Maximum number of tokens to generate. Defaults to max_position_embeddings - input_length.
- Whether to display a progress bar during generation showing text/speech token counts.
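The guidance scale follows the standard classifier-free guidance combination. The sketch below shows the generic formula, not the model's internal implementation:

```python
# Standard classifier-free guidance: blend unconditional and conditional
# diffusion predictions (generic formulation, not the model's internals).
def apply_cfg(pred_uncond, pred_cond, scale):
    # scale = 1.0 -> purely conditional; larger values push the result
    # harder toward the text conditioning.
    return [u + scale * (c - u) for u, c in zip(pred_uncond, pred_cond)]

guided = apply_cfg([0.0, 0.2], [1.0, 0.6], scale=1.5)  # ~ [1.5, 0.8]
```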
Windowing Parameters
The model uses fixed window sizes for streaming:
- Number of text tokens processed in each window step
- Number of speech tokens generated per text window
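A minimal simulation of the interleaved schedule (the window sizes below are illustrative, not the model's actual values):

```python
# Simulate the interleaved windowed schedule: consume a fixed number of text
# tokens per step and emit a fixed number of speech tokens. Window sizes
# here are illustrative placeholders, not the model's actual configuration.
def interleave(num_text_tokens, text_window=8, speech_window=16):
    schedule = []
    consumed = 0
    while consumed < num_text_tokens:
        step_text = min(text_window, num_text_tokens - consumed)
        consumed += step_text
        schedule.append((step_text, speech_window))
    return schedule

steps = interleave(20)
print(steps)  # [(8, 16), (8, 16), (4, 16)]
```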
Limitations
Language Support
- Primary: English only for production use
- Experimental: DE, FR, IT, JP, KR, NL, PL, PT, ES (untested, use with caution)
- Unsupported: Other languages may produce unexpected results
Content Limitations
Non-Speech Audio
The model focuses solely on speech synthesis and does not handle:
- Background noise
- Music
- Sound effects
Special Characters
Currently does not support reading:
- Code snippets
- Mathematical formulas
- Uncommon symbols
Very Short Inputs
When input text is extremely short (three words or fewer), the model’s stability may degrade.
Technical Limitations
- Single Speaker: Only supports one speaker (unlike the multi-speaker long-form variant)
- No Overlapping Speech: Does not model simultaneous speakers
- Batch Size: Current implementation only supports batch size = 1
- Voice Customization: Voice prompts are embedded; custom voices require contacting the team
Responsible AI Considerations
Model Biases
VibeVoice-Realtime inherits any biases, errors, or omissions from its base model (Qwen2.5 0.5B). Outputs may be:
- Unexpected
- Biased
- Inaccurate