This guide shows you how to generate speech from text using the VibeVoice streaming model.

Prerequisites

Before running inference, ensure you have:
  • Installed VibeVoice and its dependencies
  • Downloaded a model, or have access to one (e.g., microsoft/VibeVoice-Realtime-0.5B)
  • Voice prompt files in .pt format (located in demo/voices/streaming_model/)

Basic Usage

1. Prepare Your Text File

Create a text file with the content you want to convert to speech:
demo/text_examples/1p_vibevoice.txt
Hello, this is a test of the VibeVoice text-to-speech system.
2. Run Inference

Use the realtime_model_inference_from_file.py script:
python demo/realtime_model_inference_from_file.py \
  --model_path microsoft/VibeVoice-Realtime-0.5B \
  --txt_path demo/text_examples/1p_vibevoice.txt \
  --speaker_name Wayne \
  --output_dir ./outputs
3. Check Output

The generated audio will be saved to the output directory:
ls outputs/
# 1p_vibevoice_generated.wav

Command-Line Arguments

  • model_path (string, default: "microsoft/VibeVoice-Realtime-0.5B"): Path to the Hugging Face model directory or model ID.
  • txt_path (string, default: "demo/text_examples/1p_vibevoice.txt"): Path to the text file containing the script to synthesize.
  • speaker_name (string, default: "Wayne"): Name of the speaker voice to use; must match a voice file in demo/voices/streaming_model/.
  • output_dir (string, default: "./outputs"): Directory where the generated audio files will be saved.
  • device (string, default: "auto"): Device for inference. Options: cuda, mps, or cpu; "auto" picks CUDA if available, otherwise MPS or CPU.
  • cfg_scale (float, default: 1.5): Classifier-Free Guidance (CFG) scale. Higher values increase adherence to the input prompt.
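The flags above can be sketched as an argparse parser. This is an illustrative reconstruction of the documented defaults only; the real script may define additional options:

```python
import argparse

# Illustrative reconstruction of the documented flags and defaults.
# The actual realtime_model_inference_from_file.py may differ.
def build_parser():
    parser = argparse.ArgumentParser(description="VibeVoice file-to-speech inference")
    parser.add_argument("--model_path", default="microsoft/VibeVoice-Realtime-0.5B",
                        help="Hugging Face model directory or model ID")
    parser.add_argument("--txt_path", default="demo/text_examples/1p_vibevoice.txt",
                        help="Text file containing the script to synthesize")
    parser.add_argument("--speaker_name", default="Wayne",
                        help="Voice name; must match a file in demo/voices/streaming_model/")
    parser.add_argument("--output_dir", default="./outputs",
                        help="Directory for generated audio files")
    parser.add_argument("--device", default="auto", choices=["auto", "cuda", "mps", "cpu"],
                        help="Inference device; 'auto' prefers CUDA, then MPS, then CPU")
    parser.add_argument("--cfg_scale", type=float, default=1.5,
                        help="Classifier-Free Guidance scale")
    return parser

args = build_parser().parse_args([])
print(args.model_path, args.cfg_scale)
```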

Device-Specific Configuration

CUDA (NVIDIA GPUs)

python demo/realtime_model_inference_from_file.py \
  --device cuda \
  --txt_path demo/text_examples/1p_vibevoice.txt
CUDA devices use bfloat16 dtype and flash_attention_2 for optimal performance.

MPS (Apple Silicon)

python demo/realtime_model_inference_from_file.py \
  --device mps \
  --txt_path demo/text_examples/1p_vibevoice.txt
MPS requires the float32 dtype and uses the SDPA attention implementation, since flash_attention_2 is not supported on MPS.

CPU

python demo/realtime_model_inference_from_file.py \
  --device cpu \
  --txt_path demo/text_examples/1p_vibevoice.txt
CPU inference is significantly slower than GPU inference and should only be used for testing.
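The device-specific settings above can be summarized in a small mapping. The strings follow Hugging Face Transformers naming ("flash_attention_2", "sdpa"); treat this as an illustrative summary of the guide's statements, not the script's literal configuration (CPU settings are not specified above, so only cuda and mps are listed):

```python
# Illustrative summary of the per-device settings described above.
DEVICE_SETTINGS = {
    "cuda": {"dtype": "bfloat16", "attn_implementation": "flash_attention_2"},
    "mps":  {"dtype": "float32",  "attn_implementation": "sdpa"},
}

def settings_for(device):
    """Return the documented dtype/attention pair for a device, or None."""
    return DEVICE_SETTINGS.get(device)

print(settings_for("mps"))
```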

Understanding the Output

After generation completes, you’ll see a summary with performance metrics:
==================================================
GENERATION SUMMARY
==================================================
Input file: demo/text_examples/1p_vibevoice.txt
Output file: ./outputs/1p_vibevoice_generated.wav
Speaker names: Wayne
Prefilling text tokens: 42
Generated speech tokens: 1250
Total tokens: 1292
Generation time: 3.45 seconds
Audio duration: 5.20 seconds
RTF (Real Time Factor): 0.66x
==================================================

Key Metrics

  • Prefilling text tokens: Number of input text tokens processed
  • Generated speech tokens: Number of speech tokens generated by the model
  • RTF (Real Time Factor): Generation time divided by audio duration. Values < 1.0 indicate faster than real-time generation
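The RTF in the example summary is easy to reproduce from the reported timings:

```python
def real_time_factor(generation_time_s: float, audio_duration_s: float) -> float:
    # RTF = generation time / audio duration; values below 1.0 mean
    # the model generates audio faster than real time.
    return generation_time_s / audio_duration_s

# Numbers from the example summary above.
rtf = real_time_factor(3.45, 5.20)
print(f"RTF (Real Time Factor): {rtf:.2f}x")  # prints "RTF (Real Time Factor): 0.66x"
```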

Advanced Configuration

Adjusting CFG Scale

The CFG scale controls how closely the model follows the input prompt:
python demo/realtime_model_inference_from_file.py \
  --cfg_scale 1.0 \
  --txt_path demo/text_examples/1p_vibevoice.txt
Start with the default CFG scale of 1.5 and adjust based on your audio quality preferences.

Python API Usage

You can also use VibeVoice programmatically:
import torch
from vibevoice import (
    VibeVoiceStreamingForConditionalGenerationInference,
    VibeVoiceStreamingProcessor
)

# Load processor and model
processor = VibeVoiceStreamingProcessor.from_pretrained("microsoft/VibeVoice-Realtime-0.5B")
model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2"
)
model.eval()
model.set_ddpm_inference_steps(num_steps=5)

# Load voice prompt
voice_prompt = torch.load("demo/voices/streaming_model/Wayne.pt", map_location="cuda", weights_only=False)

# Prepare inputs
text = "Hello, this is VibeVoice."
inputs = processor.process_input_with_cached_prompt(
    text=text,
    cached_prompt=voice_prompt,
    padding=True,
    return_tensors="pt",
    return_attention_mask=True
)

# Move to device
for k, v in inputs.items():
    if torch.is_tensor(v):
        inputs[k] = v.to("cuda")

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=None,
    cfg_scale=1.5,
    tokenizer=processor.tokenizer,
    generation_config={'do_sample': False},
    verbose=True,
    all_prefilled_outputs=voice_prompt
)

# Save audio
processor.save_audio(outputs.speech_outputs[0], output_path="output.wav")

Troubleshooting

Flash Attention Errors

If you encounter errors with flash_attention_2, the model will automatically fall back to SDPA:
Error loading the model. Trying to use SDPA. However, note that only 
flash_attention_2 has been fully tested, and using SDPA may result in 
lower audio quality.
For best results, install flash-attention: pip install flash-attn --no-build-isolation

Voice File Not Found

If your specified speaker name doesn’t match any voice files:
Warning: No voice preset found for 'InvalidName', using default voice
List available voices by checking demo/voices/streaming_model/ directory.
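One way to enumerate valid speaker names is to scan that directory for .pt files. A minimal sketch, assuming speaker names match the file stems (as with Wayne.pt):

```python
from pathlib import Path

def available_voices(voice_dir="demo/voices/streaming_model"):
    # Each .pt file is a cached voice prompt; the filename stem
    # (e.g., "Wayne" from Wayne.pt) is the name to pass via --speaker_name.
    return sorted(p.stem for p in Path(voice_dir).glob("*.pt"))

print(available_voices())
```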
