Overview
VibeVoice-Realtime-0.5B is a lightweight real-time text-to-speech (TTS) model that supports streaming text input and robust long-form speech generation. It produces initial audible speech in approximately 300 milliseconds (hardware dependent) and can generate speech up to 10 minutes long. The model is optimized for real-time applications: streaming text input lets an LLM start speaking from its first tokens, before the complete response has been generated.
Model Specifications
| Specification | Value |
|---|---|
| Parameter Size | 0.5B |
| Base Model | Qwen2.5 0.5B |
| Context Length | 8K tokens |
| Max Generation Length | ~10 minutes |
| First Chunk Latency | ~300ms |
| Frame Rate | 7.5 Hz (ultra-low) |
| Speakers | Single speaker |
| Primary Language | English |
Key Features
Streaming Text Input
The model supports incremental text encoding, allowing you to feed text chunks while audio is being generated. This enables real-time TTS services and live data stream narration.
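As a sketch of this incremental pattern, audio windows can be emitted as soon as enough text has accumulated. The class and method names below are illustrative only, not the model's actual API:

```python
# Illustrative sketch of incremental text feeding (hypothetical names, not
# the model's actual API): text chunks arrive while audio is being emitted.

class StreamingNarrator:
    """Buffers incoming text and 'speaks' fixed-size windows as they fill."""

    def __init__(self, window_tokens=8):
        self.window_tokens = window_tokens
        self.buffer = []
        self.audio_chunks = []

    def feed(self, text_chunk):
        # Tokenize naively by whitespace; a real system uses the model tokenizer.
        self.buffer.extend(text_chunk.split())
        while len(self.buffer) >= self.window_tokens:
            window = self.buffer[:self.window_tokens]
            self.buffer = self.buffer[self.window_tokens:]
            self.audio_chunks.append(self._synthesize(window))

    def flush(self):
        # Speak whatever remains when the text stream ends.
        if self.buffer:
            self.audio_chunks.append(self._synthesize(self.buffer))
            self.buffer = []

    def _synthesize(self, tokens):
        # Stand-in for the diffusion-based acoustic generation step.
        return f"<audio for {len(tokens)} tokens>"

narrator = StreamingNarrator(window_tokens=4)
for chunk in ["Hello there, this is", "a streaming narration", "demo."]:
    narrator.feed(chunk)
narrator.flush()
```

The point of the pattern is that `feed` can be called from an LLM's token callback, so synthesis starts before the full response exists.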
Ultra-Low Latency
Produces initial audible speech in ~300ms using an interleaved, windowed design that processes text chunks in parallel with diffusion-based acoustic generation.
Long-Form Generation
Unlike traditional TTS models, VibeVoice-Realtime can generate robust long-form speech up to 10 minutes, maintaining consistency throughout the entire generation.
Efficient Architecture
Uses only an acoustic tokenizer (no semantic tokenizer) operating at 7.5 Hz frame rate, making it deployment-friendly with just 0.5B parameters.
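A back-of-the-envelope check (assuming one acoustic token per 7.5 Hz frame) shows why long-form generation fits comfortably within the 8K context:

```python
# Token budget at the 7.5 Hz acoustic frame rate, assuming one acoustic
# token per frame (an approximation for illustration).
FRAME_RATE_HZ = 7.5

tokens_per_minute = FRAME_RATE_HZ * 60       # 450 tokens per minute
tokens_for_10_min = tokens_per_minute * 10   # 4500 tokens for 10 minutes
print(tokens_for_10_min)  # 4500.0 — well inside the 8K context window
```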
Architecture
The model uses an interleaved, windowed design with the following components:
Speech Tokenizer
- Acoustic Tokenizer: Ultra-low frame rate (7.5 Hz) continuous speech tokenizer
- VAE Dimension: 64
- No Semantic Tokenizer: Removed for efficiency in streaming scenarios
Text Backbone
The decoder is divided into two components:
- Lower Transformer Layers: Used exclusively for encoding text
- Upper Transformer Layers (tts_backbone_num_hidden_layers=20): Used for encoding text and generating speech; this setting gives the number of upper Transformer layers dedicated to TTS generation
Diffusion Head
The diffusion head is configured by the following settings:
- Hidden dimension size for the diffusion head
- Number of layers in the diffusion prediction head
- Number of diffusion training steps
- Number of inference steps for speech generation
- Beta schedule type for the diffusion process
- Type of prediction used in diffusion (v_prediction or epsilon)
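For reference, the two prediction types follow the standard DDPM formulation. The sketch below uses generic diffusion notation for a single scalar sample; it is not the model's internal implementation:

```python
import math

# Standard DDPM prediction targets for one scalar sample (generic notation,
# not the model's internals). alpha_bar is the cumulative noise schedule value.
def targets(x0, eps, alpha_bar):
    a = math.sqrt(alpha_bar)        # signal coefficient
    s = math.sqrt(1.0 - alpha_bar)  # noise coefficient
    x_t = a * x0 + s * eps          # noised sample at this timestep
    v = a * eps - s * x0            # v-prediction target
    return x_t, eps, v              # epsilon-prediction target is eps itself

x_t, eps_target, v_target = targets(x0=1.0, eps=0.5, alpha_bar=0.64)
```

With v-prediction the network regresses a mix of signal and noise, which is often more stable across the schedule than predicting epsilon alone.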
Performance Benchmarks
The model achieves competitive performance on standard TTS benchmarks while being optimized for long-form generation.
LibriSpeech test-clean (Zero-shot)
| Model | WER (%) ↓ | Speaker Similarity ↑ |
|---|---|---|
| VALL-E 2 | 2.40 | 0.643 |
| Voicebox | 1.90 | 0.662 |
| MELLE | 2.10 | 0.625 |
| VibeVoice-Realtime-0.5B | 2.00 | 0.695 |
SEED test-en (Zero-shot)
| Model | WER (%) ↓ | Speaker Similarity ↑ |
|---|---|---|
| MaskGCT | 2.62 | 0.714 |
| Seed-TTS | 2.25 | 0.762 |
| FireRedTTS | 3.82 | 0.460 |
| SparkTTS | 1.98 | 0.584 |
| CosyVoice2 | 2.57 | 0.652 |
| VibeVoice-Realtime-0.5B | 2.05 | 0.633 |
The model achieves satisfactory performance on short-sentence benchmarks while being specifically optimized for long-form speech generation.
Usage
Installation
Real-time WebSocket Demo
- Launch Demo
- Inference from File
- Google Colab
In our tests, an NVIDIA T4 GPU and a Mac M4 Pro achieve real-time performance; other devices may require optimization.
Python API
Configuration Parameters
Generation Parameters
- Classifier-free guidance scale for speech diffusion. Higher values increase adherence to text conditioning but may reduce diversity.
- Whether to decode and return speech audio. Set to false to return only token sequences.
- Maximum number of tokens to generate. Defaults to max_position_embeddings - input_length.
- Whether to display a progress bar during generation showing text/speech token counts.
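The guidance scale follows the standard classifier-free guidance combination. The sketch below shows the generic formula, not the model's internal implementation:

```python
# Standard classifier-free guidance: blend unconditional and conditional
# diffusion predictions (generic formulation, not the model's internals).
def apply_cfg(pred_uncond, pred_cond, scale):
    # scale = 1.0 -> purely conditional; larger values push the result
    # harder toward the text conditioning.
    return [u + scale * (c - u) for u, c in zip(pred_uncond, pred_cond)]

guided = apply_cfg([0.0, 0.2], [1.0, 0.6], scale=1.5)  # ~ [1.5, 0.8]
```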
Windowing Parameters
The model uses fixed window sizes for streaming:
- Number of text tokens processed in each window step
- Number of speech tokens generated per text window
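A minimal simulation of the interleaved schedule (the window sizes below are illustrative, not the model's actual values):

```python
# Simulate the interleaved windowed schedule: consume a fixed number of text
# tokens per step and emit a fixed number of speech tokens. Window sizes
# here are illustrative placeholders, not the model's actual configuration.
def interleave(num_text_tokens, text_window=8, speech_window=16):
    schedule = []
    consumed = 0
    while consumed < num_text_tokens:
        step_text = min(text_window, num_text_tokens - consumed)
        consumed += step_text
        schedule.append((step_text, speech_window))
    return schedule

steps = interleave(20)
print(steps)  # [(8, 16), (8, 16), (4, 16)]
```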
Limitations
Language Support
- Primary: English only for production use
- Experimental: DE, FR, IT, JP, KR, NL, PL, PT, ES (untested, use with caution)
- Unsupported: Other languages may produce unexpected results
Content Limitations
Non-Speech Audio
The model focuses solely on speech synthesis and does not handle:
- Background noise
- Music
- Sound effects
Special Characters
Currently does not support reading:
- Code snippets
- Mathematical formulas
- Uncommon symbols
Very Short Inputs
When input text is extremely short (three words or fewer), the model’s stability may degrade.
Technical Limitations
- Single Speaker: Only supports one speaker (unlike the multi-speaker long-form variant)
- No Overlapping Speech: Does not model simultaneous speakers
- Batch Size: Current implementation only supports batch size = 1
- Voice Customization: Voice prompts are embedded; custom voices require contacting the team
Responsible AI Considerations
Model Biases
VibeVoice-Realtime inherits any biases, errors, or omissions from its base model (Qwen2.5 0.5B). Outputs may be:
- Unexpected
- Biased
- Inaccurate