Get up and running with VibeVoice to generate high-quality speech from text in just a few minutes.
Before starting, make sure you’ve installed VibeVoice.

Choose Your Method

  • WebSocket Demo: real-time streaming TTS with low latency
  • File Inference: generate speech from text files

WebSocket Demo

Launch a real-time WebSocket server for streaming TTS:
1. Start the server

   python demo/vibevoice_realtime_demo.py --model_path microsoft/VibeVoice-Realtime-0.5B

   The server starts on http://localhost:3000 by default.

2. Open the web interface

   Navigate to http://localhost:3000 in your browser to access the interactive demo.

3. Generate speech

   Type your text and click Generate to hear real-time speech synthesis with ~300 ms first-chunk latency.

The WebSocket demo supports streaming text input: you can start generating speech before you finish typing!
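Streaming text input means a client sends text incrementally rather than all at once. The demo's exact WebSocket message format isn't documented on this page, so the helper below is purely illustrative: it builds the kind of JSON chunk a streaming TTS client might send (the field names `type`, `text`, `speaker`, and `flush` are hypothetical, not the demo's actual protocol):

```python
import json

def build_tts_message(text, speaker="Carter", flush=False):
    """Build a hypothetical JSON payload for a streaming TTS server.

    The real demo's wire format may differ; this only illustrates
    sending incremental text chunks over a WebSocket, with a final
    flush to signal the end of input.
    """
    return json.dumps({"type": "text", "text": text,
                       "speaker": speaker, "flush": flush})

# Example: send a sentence in two chunks, then flush.
messages = [build_tts_message("Welcome to "),
            build_tts_message("VibeVoice.", flush=True)]
print(messages)
```

Each payload would be sent over the open WebSocket connection as it is produced, which is what lets synthesis begin before typing finishes.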

File Inference

Generate speech from text files for longer content:
1. Prepare your text

   Create a text file with your content, or use the provided examples:

   # Example text files are available in demo/text_examples/
   ls demo/text_examples/
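Any plain UTF-8 text file works as input. A minimal way to create one from Python (the filename `my_script.txt` is just an example):

```python
from pathlib import Path

# Write a short script to a new text file; pass its path to the
# inference script via --txt_path.
script = "Welcome to VibeVoice, an open-source frontier voice AI framework."
path = Path("my_script.txt")
path.write_text(script, encoding="utf-8")
print(path.read_text(encoding="utf-8"))
```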
2. Run inference

   python demo/realtime_model_inference_from_file.py \
     --model_path microsoft/VibeVoice-Realtime-0.5B \
     --txt_path demo/text_examples/1p_vibevoice.txt \
     --speaker_name Carter \
     --output_dir ./outputs

3. Find your output

   The generated audio is saved to ./outputs/ as a WAV file.
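To sanity-check an output file, you can read its duration with Python's standard wave module. Since this page can't run the model, the sketch below writes a one-second silent WAV as a stand-in for a generated file (the 24 kHz rate here is arbitrary, chosen only for the demo; real outputs may use a different sample rate):

```python
import wave

def wav_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

# Create a one-second silent mono WAV as a stand-in for ./outputs/*.wav
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)          # 16-bit samples
    w.setframerate(24000)
    w.writeframes(b"\x00\x00" * 24000)  # one second of silence

print(wav_duration("demo.wav"))  # 1.0
```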

Command-Line Arguments

| Argument | Default | Description |
| --- | --- | --- |
| --model_path | microsoft/VibeVoice-Realtime-0.5B | HuggingFace model path |
| --txt_path | demo/text_examples/1p_vibevoice.txt | Input text file path |
| --speaker_name | Wayne | Voice preset name |
| --output_dir | ./outputs | Output directory for audio files |
| --device | Auto-detected | Device: cuda, mps, or cpu |
| --cfg_scale | 1.5 | Classifier-Free Guidance scale |
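The defaults in the table map naturally onto an argparse setup. The sketch below is only an illustration of those defaults, not the demo script's actual source:

```python
import argparse

# Mirror the documented CLI arguments and their defaults.
parser = argparse.ArgumentParser(description="VibeVoice file inference (sketch)")
parser.add_argument("--model_path", default="microsoft/VibeVoice-Realtime-0.5B")
parser.add_argument("--txt_path", default="demo/text_examples/1p_vibevoice.txt")
parser.add_argument("--speaker_name", default="Wayne")
parser.add_argument("--output_dir", default="./outputs")
parser.add_argument("--device", default=None,
                    help="cuda, mps, or cpu; auto-detected when omitted")
parser.add_argument("--cfg_scale", type=float, default=1.5)

args = parser.parse_args([])  # parse with no CLI input to inspect defaults
print(args.speaker_name, args.cfg_scale)
```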

Python API Usage

Use VibeVoice directly in your Python code:
import torch
from vibevoice import (
    VibeVoiceStreamingForConditionalGenerationInference,
    VibeVoiceStreamingProcessor
)

# Load model and processor
processor = VibeVoiceStreamingProcessor.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B"
)

model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2"
)

# Set inference steps
model.set_ddpm_inference_steps(num_steps=5)

# Prepare text input
text = "Welcome to VibeVoice, an open-source frontier voice AI framework."

# Load voice prompt
voice_sample = torch.load("demo/voices/streaming_model/Carter.pt")

# Process inputs
inputs = processor.process_input_with_cached_prompt(
    text=text,
    cached_prompt=voice_sample,
    padding=True,
    return_tensors="pt"
)

# Move to device
for k, v in inputs.items():
    if torch.is_tensor(v):
        inputs[k] = v.to("cuda")

# Generate speech
outputs = model.generate(
    **inputs,
    max_new_tokens=None,
    cfg_scale=1.5,
    tokenizer=processor.tokenizer,
    all_prefilled_outputs=voice_sample
)

# Save audio
processor.save_audio(
    outputs.speech_outputs[0],
    output_path="output.wav"
)

Understanding the Output

After generation, VibeVoice provides performance metrics:
  • Generation time: Total time to generate audio
  • Audio duration: Length of generated audio
  • RTF (Real-Time Factor): Ratio of generation time to audio duration
    • RTF < 1.0 means faster than real-time
    • RTF = 1.0 means real-time
    • RTF > 1.0 means slower than real-time
  • Prefilling text tokens: Number of input text tokens
  • Generated speech tokens: Number of acoustic tokens generated
  • Total tokens: Sum of all tokens processed
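The RTF metric above is simply generation time divided by audio duration; a small helper makes the relationship concrete (the example numbers are illustrative, not measured):

```python
def real_time_factor(generation_time_s, audio_duration_s):
    """RTF = generation time / audio duration; < 1.0 is faster than real time."""
    return generation_time_s / audio_duration_s

# e.g. 4.2 s of compute to produce 10 s of audio:
rtf = real_time_factor(4.2, 10.0)
print(f"RTF = {rtf:.2f}")  # RTF = 0.42, i.e. faster than real-time
```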

Advanced Configuration

Adjusting CFG Scale

Control the strength of classifier-free guidance:
# Higher CFG = stronger conditioning (more expressive but less stable)
python demo/realtime_model_inference_from_file.py --cfg_scale 2.0

# Lower CFG = weaker conditioning (more stable but less expressive)
python demo/realtime_model_inference_from_file.py --cfg_scale 1.0
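To see why higher scales are more expressive but less stable: classifier-free guidance conventionally extrapolates from an unconditional prediction toward the conditional one, so scale 1.0 reproduces the conditional prediction and larger scales push past it. This is a generic sketch of the standard formula, not necessarily VibeVoice's exact internals:

```python
def cfg_combine(uncond, cond, scale):
    """Standard classifier-free guidance combination:
    guided = uncond + scale * (cond - uncond).
    scale = 1.0 returns the conditional prediction unchanged;
    scale > 1.0 amplifies the conditioning direction."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

uncond = [0.0, 0.5]   # toy unconditional prediction
cond = [1.0, 1.5]     # toy conditional prediction
print(cfg_combine(uncond, cond, 1.0))  # [1.0, 1.5]: pure conditional
print(cfg_combine(uncond, cond, 2.0))  # [2.0, 2.5]: pushed past it
```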

Changing Diffusion Steps

Adjust the number of diffusion inference steps:
# More steps = higher quality but slower
model.set_ddpm_inference_steps(num_steps=10)

# Fewer steps = faster but lower quality
model.set_ddpm_inference_steps(num_steps=3)
The default of 5 steps provides a good balance between quality and speed.

Troubleshooting

Out of memory or device errors

  • Use CPU instead of GPU: --device cpu
  • Reduce batch size or text length
  • Use float32 instead of bfloat16 on MPS devices

Slow generation

  • Ensure you're using CUDA with flash_attention_2
  • Reduce diffusion steps: model.set_ddpm_inference_steps(num_steps=3)
  • Check that your GPU drivers are up to date

Voice not found

  • Check available voices in demo/voices/streaming_model/
  • Use the exact voice name from the .pt files
  • Default voices include Carter, Wayne, and others

Next Steps

  • WebSocket Guide: build real-time TTS applications
  • Custom Voices: learn about voice prompts
  • API Reference: explore the full API
  • Advanced Config: fine-tune your setup
