Get up and running with VibeVoice to generate high-quality speech from text in just a few minutes.
Before starting, make sure you’ve installed VibeVoice.

Choose Your Method

  • WebSocket Demo: real-time streaming TTS with low latency
  • File Inference: generate speech from text files

WebSocket Demo

Launch a real-time WebSocket server for streaming TTS:
1. Start the server

   python demo/vibevoice_realtime_demo.py --model_path microsoft/VibeVoice-Realtime-0.5B

   The server starts on http://localhost:3000 by default.

2. Open the web interface

   Navigate to http://localhost:3000 in your browser to access the interactive demo.

3. Generate speech

   Type your text and click Generate to hear real-time speech synthesis with ~300 ms first-chunk latency.

The WebSocket demo supports streaming text input: you can start generating speech before you finish typing!
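Streaming text input means a client sends text incrementally rather than all at once. The demo's exact WebSocket message format isn't documented on this page, so the helper below is purely illustrative: it builds the kind of JSON chunk a streaming TTS client might send (the field names `type`, `text`, `speaker`, and `flush` are hypothetical, not the demo's actual protocol):

```python
import json

def build_tts_message(text, speaker="Carter", flush=False):
    """Build a hypothetical JSON payload for a streaming TTS server.

    The real demo's wire format may differ; this only illustrates
    sending incremental text chunks over a WebSocket, with a final
    flush to signal the end of input.
    """
    return json.dumps({"type": "text", "text": text,
                       "speaker": speaker, "flush": flush})

# Example: send a sentence in two chunks, then flush.
messages = [build_tts_message("Welcome to "),
            build_tts_message("VibeVoice.", flush=True)]
print(messages)
```

Each payload would be sent over the open WebSocket connection as it is produced, which is what lets synthesis begin before typing finishes.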

File Inference

Generate speech from text files for longer content:
1. Prepare your text

   Create a text file with your content, or use the provided examples:

   # Example text files are available in demo/text_examples/
   ls demo/text_examples/
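Any plain UTF-8 text file works as input. A minimal way to create one from Python (the filename `my_script.txt` is just an example):

```python
from pathlib import Path

# Write a short script to a new text file; pass its path to the
# inference script via --txt_path.
script = "Welcome to VibeVoice, an open-source frontier voice AI framework."
path = Path("my_script.txt")
path.write_text(script, encoding="utf-8")
print(path.read_text(encoding="utf-8"))
```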
2. Run inference

   python demo/realtime_model_inference_from_file.py \
     --model_path microsoft/VibeVoice-Realtime-0.5B \
     --txt_path demo/text_examples/1p_vibevoice.txt \
     --speaker_name Carter \
     --output_dir ./outputs

3. Find your output

   The generated audio is saved to ./outputs/ as a WAV file.
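To sanity-check an output file, you can read its duration with Python's standard wave module. Since this page can't run the model, the sketch below writes a one-second silent WAV as a stand-in for a generated file (the 24 kHz rate here is arbitrary, chosen only for the demo; real outputs may use a different sample rate):

```python
import wave

def wav_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

# Create a one-second silent mono WAV as a stand-in for ./outputs/*.wav
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)          # 16-bit samples
    w.setframerate(24000)
    w.writeframes(b"\x00\x00" * 24000)  # one second of silence

print(wav_duration("demo.wav"))  # 1.0
```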

Command-Line Arguments

| Argument | Default | Description |
| --- | --- | --- |
| --model_path | microsoft/VibeVoice-Realtime-0.5B | HuggingFace model path |
| --txt_path | demo/text_examples/1p_vibevoice.txt | Input text file path |
| --speaker_name | Wayne | Voice preset name |
| --output_dir | ./outputs | Output directory for audio files |
| --device | Auto-detected | Device: cuda, mps, or cpu |
| --cfg_scale | 1.5 | Classifier-Free Guidance scale |
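The defaults in the table map naturally onto an argparse setup. The sketch below is only an illustration of those defaults, not the demo script's actual source:

```python
import argparse

# Mirror the documented CLI arguments and their defaults.
parser = argparse.ArgumentParser(description="VibeVoice file inference (sketch)")
parser.add_argument("--model_path", default="microsoft/VibeVoice-Realtime-0.5B")
parser.add_argument("--txt_path", default="demo/text_examples/1p_vibevoice.txt")
parser.add_argument("--speaker_name", default="Wayne")
parser.add_argument("--output_dir", default="./outputs")
parser.add_argument("--device", default=None,
                    help="cuda, mps, or cpu; auto-detected when omitted")
parser.add_argument("--cfg_scale", type=float, default=1.5)

args = parser.parse_args([])  # parse with no CLI input to inspect defaults
print(args.speaker_name, args.cfg_scale)
```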

Python API Usage

Use VibeVoice directly in your Python code:
import torch
from vibevoice import (
    VibeVoiceStreamingForConditionalGenerationInference,
    VibeVoiceStreamingProcessor
)

# Load model and processor
processor = VibeVoiceStreamingProcessor.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B"
)

model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2"
)

# Set inference steps
model.set_ddpm_inference_steps(num_steps=5)

# Prepare text input
text = "Welcome to VibeVoice, an open-source frontier voice AI framework."

# Load voice prompt
voice_sample = torch.load("demo/voices/streaming_model/Carter.pt")

# Process inputs
inputs = processor.process_input_with_cached_prompt(
    text=text,
    cached_prompt=voice_sample,
    padding=True,
    return_tensors="pt"
)

# Move to device
for k, v in inputs.items():
    if torch.is_tensor(v):
        inputs[k] = v.to("cuda")

# Generate speech
outputs = model.generate(
    **inputs,
    max_new_tokens=None,
    cfg_scale=1.5,
    tokenizer=processor.tokenizer,
    all_prefilled_outputs=voice_sample
)

# Save audio
processor.save_audio(
    outputs.speech_outputs[0],
    output_path="output.wav"
)

Understanding the Output

After generation, VibeVoice provides performance metrics:
  • Generation time: Total time to generate audio
  • Audio duration: Length of generated audio
  • RTF (Real-Time Factor): Ratio of generation time to audio duration
    • RTF < 1.0 means faster than real-time
    • RTF = 1.0 means real-time
    • RTF > 1.0 means slower than real-time
  • Prefilling text tokens: Number of input text tokens
  • Generated speech tokens: Number of acoustic tokens generated
  • Total tokens: Sum of all tokens processed
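The RTF metric above is simply generation time divided by audio duration; a small helper makes the relationship concrete (the example numbers are illustrative, not measured):

```python
def real_time_factor(generation_time_s, audio_duration_s):
    """RTF = generation time / audio duration; < 1.0 is faster than real time."""
    return generation_time_s / audio_duration_s

# e.g. 4.2 s of compute to produce 10 s of audio:
rtf = real_time_factor(4.2, 10.0)
print(f"RTF = {rtf:.2f}")  # RTF = 0.42, i.e. faster than real-time
```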

Advanced Configuration

Adjusting CFG Scale

Control the strength of classifier-free guidance:
# Higher CFG = stronger conditioning (more expressive but less stable)
python demo/realtime_model_inference_from_file.py --cfg_scale 2.0

# Lower CFG = weaker conditioning (more stable but less expressive)
python demo/realtime_model_inference_from_file.py --cfg_scale 1.0
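To see why higher scales are more expressive but less stable: classifier-free guidance conventionally extrapolates from an unconditional prediction toward the conditional one, so scale 1.0 reproduces the conditional prediction and larger scales push past it. This is a generic sketch of the standard formula, not necessarily VibeVoice's exact internals:

```python
def cfg_combine(uncond, cond, scale):
    """Standard classifier-free guidance combination:
    guided = uncond + scale * (cond - uncond).
    scale = 1.0 returns the conditional prediction unchanged;
    scale > 1.0 amplifies the conditioning direction."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

uncond = [0.0, 0.5]   # toy unconditional prediction
cond = [1.0, 1.5]     # toy conditional prediction
print(cfg_combine(uncond, cond, 1.0))  # [1.0, 1.5]: pure conditional
print(cfg_combine(uncond, cond, 2.0))  # [2.0, 2.5]: pushed past it
```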

Changing Diffusion Steps

Adjust the number of diffusion inference steps:
# More steps = higher quality but slower
model.set_ddpm_inference_steps(num_steps=10)

# Fewer steps = faster but lower quality
model.set_ddpm_inference_steps(num_steps=3)
The default of 5 steps provides a good balance between quality and speed.

Troubleshooting

Out of memory or device errors

  • Use CPU instead of GPU: --device cpu
  • Reduce batch size or text length
  • Use float32 instead of bfloat16 on MPS devices

Slow generation

  • Ensure you're using CUDA with flash_attention_2
  • Reduce diffusion steps: model.set_ddpm_inference_steps(num_steps=3)
  • Check that your GPU drivers are up to date

Voice not found

  • Check available voices in demo/voices/streaming_model/
  • Use the exact voice name from the .pt files
  • Default voices include Carter, Wayne, and others

Next Steps

  • WebSocket Guide: build real-time TTS applications
  • Custom Voices: learn about voice prompts
  • API Reference: explore the full API
  • Advanced Config: fine-tune your setup
