Overview

VibeVoice-Realtime-0.5B is a lightweight real-time text-to-speech model supporting streaming text input and robust long-form speech generation. The model produces initial audible speech in approximately 300 milliseconds (hardware dependent) and can generate speech up to 10 minutes in length.
Optimized for real-time applications, the model accepts text as it streams in, so an LLM-driven agent can start speaking from its first generated tokens rather than waiting for the complete response.

Model Specifications

Specification            Value
Parameter Size           0.5B
Base Model               Qwen2.5 0.5B
Context Length           8K tokens
Max Generation Length    ~10 minutes
First Chunk Latency      ~300 ms
Frame Rate               7.5 Hz (ultra-low)
Speakers                 Single speaker
Primary Language         English
The model is primarily built for English. While experimental multilingual support exists for 9 additional languages (DE, FR, IT, JP, KR, NL, PL, PT, ES), these have not been extensively tested and should be used with caution.

Key Features

The model supports incremental text encoding, allowing you to feed text chunks while audio is being generated. This enables real-time TTS services and live data-stream narration (see the sketch below).
An interleaved, windowed design processes text chunks in parallel with diffusion-based acoustic generation, producing initial audible speech in ~300 ms.
Unlike traditional TTS models, VibeVoice-Realtime can generate robust long-form speech up to 10 minutes long, maintaining consistency throughout the entire generation.
It uses only an acoustic tokenizer (no semantic tokenizer) operating at a 7.5 Hz frame rate, making it deployment-friendly at just 0.5B parameters.
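
As an illustration of the incremental pattern, here is a minimal sketch of driving a streaming TTS session from an LLM token stream. The StreamingTTS protocol and its feed_text/read_audio/end_text methods are hypothetical stand-ins, not the actual VibeVoice API; see demo/vibevoice_realtime_demo.py for the shipped interface.

from typing import Iterable, Iterator, Protocol

class StreamingTTS(Protocol):
    # Assumed shape of a streaming session; NOT the actual VibeVoice API.
    def feed_text(self, text: str) -> None: ...
    def read_audio(self) -> Iterator[bytes]: ...
    def end_text(self) -> None: ...

def narrate(llm_tokens: Iterable[str], tts: StreamingTTS, play) -> None:
    # Feed LLM tokens as they arrive and drain finished audio chunks.
    for token in llm_tokens:
        tts.feed_text(token)            # incremental text encoding
        for chunk in tts.read_audio():  # audio generated in parallel
            play(chunk)
    tts.end_text()                      # signal end of input
    for chunk in tts.read_audio():      # flush remaining speech
        play(chunk)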

Architecture

The model uses an interleaved, windowed design with the following components:

Speech Tokenizer

  • Acoustic Tokenizer: Ultra-low frame rate (7.5 Hz) continuous speech tokenizer
  • VAE Dimension: 64
  • No Semantic Tokenizer: Removed for efficiency in streaming scenarios

Text Backbone

The decoder is divided into two components:
  1. Lower Transformer Layers: Used exclusively for encoding text
  2. Upper Transformer Layers (tts_backbone_num_hidden_layers=20): Used for encoding text and generating speech
tts_backbone_num_hidden_layers (integer, default: 20)
  Number of upper Transformer layers dedicated to TTS generation.
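
A quick way to verify the split is to read it off the checkpoint config. Only tts_backbone_num_hidden_layers is documented above; other field names may differ, hence the defensive getattr:

from transformers import AutoConfig

# Inspect the documented layer-split field from the checkpoint config.
config = AutoConfig.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B", trust_remote_code=True
)
print(getattr(config, "tts_backbone_num_hidden_layers", None))  # expected: 20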

Diffusion Head

hidden_size (integer, default: 768)
  Hidden dimension size for the diffusion head.
head_layers (integer, default: 4)
  Number of layers in the diffusion prediction head.
ddpm_num_steps (integer, default: 1000)
  Number of diffusion training steps.
ddpm_num_inference_steps (integer, default: 20)
  Number of inference steps for speech generation.
ddpm_beta_schedule (string, default: "cosine")
  Beta schedule type for the diffusion process.
prediction_type (string, default: "v_prediction")
  Type of prediction used in diffusion (v_prediction or epsilon).
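
Of these, only ddpm_num_inference_steps is normally changed after training; the model exposes a setter for it (used again in the Python API below). Fewer steps trade some audio fidelity for lower per-chunk latency:

# `model` is loaded as shown in the Usage section below.
model.set_ddpm_inference_steps(10)  # faster, potentially lower fidelity
model.set_ddpm_inference_steps(20)  # the documented default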

Performance Benchmarks

The model achieves competitive performance on standard TTS benchmarks while being optimized for long-form generation.

LibriSpeech test-clean (Zero-shot)

Model                      WER (%) ↓   Speaker Similarity ↑
VALL-E 2                   2.40        0.643
Voicebox                   1.90        0.662
MELLE                      2.10        0.625
VibeVoice-Realtime-0.5B    2.00        0.695

SEED test-en (Zero-shot)

Model                      WER (%) ↓   Speaker Similarity ↑
MaskGCT                    2.62        0.714
Seed-TTS                   2.25        0.762
FireRedTTS                 3.82        0.460
SparkTTS                   1.98        0.584
CosyVoice2                 2.57        0.652
VibeVoice-Realtime-0.5B    2.05        0.633
The model achieves satisfactory performance on short-sentence benchmarks while being specifically optimized for long-form speech generation.

Usage

Installation

# Launch NVIDIA PyTorch Container (24.07 / 24.10 / 24.12 verified)
sudo docker run --privileged --net=host --ipc=host \
  --ulimit memlock=-1:-1 --ulimit stack=-1:-1 --gpus all \
  --rm -it nvcr.io/nvidia/pytorch:24.07-py3

# Install flash attention if needed
pip install flash-attn --no-build-isolation

# Clone and install VibeVoice
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice/
pip install -e .
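
A quick import check confirms the editable install and GPU visibility before running the demos (torch ships with the container):

# Minimal post-install sanity check.
import torch
import vibevoice  # installed above with `pip install -e .`

print("vibevoice OK; CUDA available:", torch.cuda.is_available())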

Real-time WebSocket Demo

python demo/vibevoice_realtime_demo.py \
  --model_path microsoft/VibeVoice-Realtime-0.5B
In tests, an NVIDIA T4 and a Mac M4 Pro achieve real-time performance; other devices may require optimization.

Python API

from transformers import AutoModelForCausalLM
from vibevoice.modular.modular_vibevoice_text_tokenizer import VibeVoiceTextTokenizerFast

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B",
    trust_remote_code=True
)
tokenizer = VibeVoiceTextTokenizerFast.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B"
)

# Set inference steps
model.set_ddpm_inference_steps(20)  # Default: 20

# Generate speech with streaming
text = "Your text to synthesize goes here."
input_ids = tokenizer(text, return_tensors="pt").input_ids

output = model.generate(
    inputs=input_ids,
    tokenizer=tokenizer,
    cfg_scale=3.0,  # Classifier-free guidance scale
    return_speech=True,
    show_progress_bar=True
)

# Access generated audio
audio_waveform = output.speech_outputs[0]
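
To keep the result, write the waveform to disk. The 24 kHz sample rate below is an assumption (confirm the actual rate from the checkpoint's processor or config), and soundfile must be installed separately:

import soundfile as sf

# SAMPLE_RATE is an assumption; check the checkpoint's config for the
# true output rate before relying on it.
SAMPLE_RATE = 24000
waveform = audio_waveform.squeeze().float().cpu().numpy()
sf.write("output.wav", waveform, SAMPLE_RATE)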

Configuration Parameters

Generation Parameters

cfg_scale (float, default: 3.0)
  Classifier-free guidance scale for speech diffusion. Higher values increase adherence to text conditioning but may reduce diversity.
return_speech (boolean, default: true)
  Whether to decode and return speech audio. Set to false to only return token sequences.
max_new_tokens (integer, default: auto)
  Maximum number of tokens to generate. Defaults to max_position_embeddings - input_length.
show_progress_bar (boolean, default: true)
  Display a progress bar during generation showing text/speech token counts.
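
Combining these, a variant of the earlier generate call that skips audio decoding and caps output length (the values are illustrative):

# Token-only generation with an explicit cap; reuses model, tokenizer,
# and input_ids from the Python API section above.
output = model.generate(
    inputs=input_ids,
    tokenizer=tokenizer,
    cfg_scale=2.0,          # lower guidance: more diversity
    return_speech=False,    # skip audio decoding, return tokens only
    max_new_tokens=2048,    # explicit cap instead of the auto default
    show_progress_bar=False,
)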

Windowing Parameters

The model uses fixed window sizes for streaming:
TTS_TEXT_WINDOW_SIZE (integer, default: 5)
  Number of text tokens processed in each window step.
TTS_SPEECH_WINDOW_SIZE (integer, default: 6)
  Number of speech tokens generated per text window.
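
At the 7.5 Hz frame rate, these constants imply about 0.8 seconds of audio per window step, roughly the granularity at which speech becomes available downstream:

# Back-of-envelope throughput from the fixed windowing constants.
FRAME_RATE_HZ = 7.5          # acoustic tokens per second of audio
TTS_TEXT_WINDOW_SIZE = 5     # text tokens consumed per window step
TTS_SPEECH_WINDOW_SIZE = 6   # speech tokens produced per window step

audio_per_window = TTS_SPEECH_WINDOW_SIZE / FRAME_RATE_HZ
print(f"{audio_per_window:.2f} s of audio per {TTS_TEXT_WINDOW_SIZE}-token window")
# -> 0.80 s of audio per 5-token window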

Limitations

Research Model: Not recommended for commercial or real-world applications without further testing and development. Intended for research purposes only.

Language Support

  • Primary: English only for production use
  • Experimental: DE, FR, IT, JP, KR, NL, PL, PT, ES (untested, use with caution)
  • Unsupported: Other languages may produce unexpected results

Content Limitations

The model focuses solely on speech synthesis and does not handle:
  • Background noise
  • Music
  • Sound effects
Currently does not support reading:
  • Code snippets
  • Mathematical formulas
  • Uncommon symbols
Pre-process input text to remove or normalize such content.
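
A minimal normalization pass along these lines can help; the patterns are illustrative, not exhaustive:

import re

def normalize_for_tts(text: str) -> str:
    # Illustrative pre-processing: strip content the model cannot read.
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)  # fenced code
    text = re.sub(r"\$[^$]*\$", " ", text)                   # inline math
    text = re.sub(r"[^\w\s.,!?;:'-]", " ", text)             # uncommon symbols
    return re.sub(r"\s+", " ", text).strip()

print(normalize_for_tts("Euler's identity $e^{i\\pi} + 1 = 0$ is famous."))
# -> "Euler's identity is famous."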
When input text is extremely short (three words or fewer), the model’s stability may degrade.

Technical Limitations

  • Single Speaker: Only supports one speaker (unlike the multi-speaker long-form variant)
  • No Overlapping Speech: Does not model simultaneous speakers
  • Batch Size: Current implementation only supports batch size = 1
  • Voice Customization: Voice prompts are embedded; custom voices require contacting the team

Responsible AI Considerations

Deepfake Risk: High-quality synthetic speech can be misused for impersonation, fraud, or disinformation. Users must:
  • Ensure transcripts are reliable
  • Check content accuracy
  • Avoid misleading use of generated content
  • Disclose AI usage when sharing generated content
  • Deploy in compliance with applicable laws and regulations

Model Biases

VibeVoice-Realtime inherits any biases, errors, or omissions from its base model (Qwen2.5 0.5B). Outputs may be:
  • Unexpected
  • Biased
  • Inaccurate
Thorough testing and validation is required before any production deployment.
