This guide shows you how to generate speech from text using the VibeVoice streaming model.

Prerequisites

Before running inference, ensure you have:
  • Installed VibeVoice and its dependencies
  • Downloaded a model, or have access to one (e.g., microsoft/VibeVoice-Realtime-0.5B)
  • Voice prompt files in .pt format (located in demo/voices/streaming_model/)

Basic Usage

1. Prepare Your Text File

Create a text file with the content you want to convert to speech:
demo/text_examples/1p_vibevoice.txt
Hello, this is a test of the VibeVoice text-to-speech system.
2. Run Inference

Use the realtime_model_inference_from_file.py script:
python demo/realtime_model_inference_from_file.py \
  --model_path microsoft/VibeVoice-Realtime-0.5B \
  --txt_path demo/text_examples/1p_vibevoice.txt \
  --speaker_name Wayne \
  --output_dir ./outputs
3. Check Output

The generated audio will be saved to the output directory:
ls outputs/
# 1p_vibevoice_generated.wav

Command-Line Arguments

  • model_path (string, default: "microsoft/VibeVoice-Realtime-0.5B"): Path to the Hugging Face model directory or model ID.
  • txt_path (string, default: "demo/text_examples/1p_vibevoice.txt"): Path to the text file containing the script to synthesize.
  • speaker_name (string, default: "Wayne"): Name of the speaker voice to use; must match a voice file in demo/voices/streaming_model/.
  • output_dir (string, default: "./outputs"): Directory where the generated audio files will be saved.
  • device (string, default: "auto"): Device for inference. Options: cuda, mps, or cpu; "auto" picks CUDA if available, otherwise MPS or CPU.
  • cfg_scale (float, default: 1.5): Classifier-Free Guidance (CFG) scale. Higher values increase adherence to the input prompt.
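The flags above can be sketched as an argparse parser. This is an illustrative reconstruction of the documented defaults only; the real script may define additional options:

```python
import argparse

# Illustrative reconstruction of the documented flags and defaults.
# The actual realtime_model_inference_from_file.py may differ.
def build_parser():
    parser = argparse.ArgumentParser(description="VibeVoice file-to-speech inference")
    parser.add_argument("--model_path", default="microsoft/VibeVoice-Realtime-0.5B",
                        help="Hugging Face model directory or model ID")
    parser.add_argument("--txt_path", default="demo/text_examples/1p_vibevoice.txt",
                        help="Text file containing the script to synthesize")
    parser.add_argument("--speaker_name", default="Wayne",
                        help="Voice name; must match a file in demo/voices/streaming_model/")
    parser.add_argument("--output_dir", default="./outputs",
                        help="Directory for generated audio files")
    parser.add_argument("--device", default="auto", choices=["auto", "cuda", "mps", "cpu"],
                        help="Inference device; 'auto' prefers CUDA, then MPS, then CPU")
    parser.add_argument("--cfg_scale", type=float, default=1.5,
                        help="Classifier-Free Guidance scale")
    return parser

args = build_parser().parse_args([])
print(args.model_path, args.cfg_scale)
```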

Device-Specific Configuration

CUDA (NVIDIA GPUs)

python demo/realtime_model_inference_from_file.py \
  --device cuda \
  --txt_path demo/text_examples/1p_vibevoice.txt
CUDA devices use bfloat16 dtype and flash_attention_2 for optimal performance.

MPS (Apple Silicon)

python demo/realtime_model_inference_from_file.py \
  --device mps \
  --txt_path demo/text_examples/1p_vibevoice.txt
MPS requires the float32 dtype and uses the SDPA attention implementation, since flash_attention_2 is not supported on MPS.

CPU

python demo/realtime_model_inference_from_file.py \
  --device cpu \
  --txt_path demo/text_examples/1p_vibevoice.txt
CPU inference is significantly slower than GPU inference and should only be used for testing.
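The device-specific settings above can be summarized in a small mapping. The strings follow Hugging Face Transformers naming ("flash_attention_2", "sdpa"); treat this as an illustrative summary of the guide's statements, not the script's literal configuration (CPU settings are not specified above, so only cuda and mps are listed):

```python
# Illustrative summary of the per-device settings described above.
DEVICE_SETTINGS = {
    "cuda": {"dtype": "bfloat16", "attn_implementation": "flash_attention_2"},
    "mps":  {"dtype": "float32",  "attn_implementation": "sdpa"},
}

def settings_for(device):
    """Return the documented dtype/attention pair for a device, or None."""
    return DEVICE_SETTINGS.get(device)

print(settings_for("mps"))
```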

Understanding the Output

After generation completes, you’ll see a summary with performance metrics:
==================================================
GENERATION SUMMARY
==================================================
Input file: demo/text_examples/1p_vibevoice.txt
Output file: ./outputs/1p_vibevoice_generated.wav
Speaker names: Wayne
Prefilling text tokens: 42
Generated speech tokens: 1250
Total tokens: 1292
Generation time: 3.45 seconds
Audio duration: 5.20 seconds
RTF (Real Time Factor): 0.66x
==================================================

Key Metrics

  • Prefilling text tokens: Number of input text tokens processed
  • Generated speech tokens: Number of speech tokens generated by the model
  • RTF (Real Time Factor): Generation time divided by audio duration. Values < 1.0 indicate faster than real-time generation
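The RTF in the example summary is easy to reproduce from the reported timings:

```python
def real_time_factor(generation_time_s: float, audio_duration_s: float) -> float:
    # RTF = generation time / audio duration; values below 1.0 mean
    # the model generates audio faster than real time.
    return generation_time_s / audio_duration_s

# Numbers from the example summary above.
rtf = real_time_factor(3.45, 5.20)
print(f"RTF (Real Time Factor): {rtf:.2f}x")  # prints "RTF (Real Time Factor): 0.66x"
```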

Advanced Configuration

Adjusting CFG Scale

The CFG scale controls how closely the model follows the input prompt:
python demo/realtime_model_inference_from_file.py \
  --cfg_scale 1.0 \
  --txt_path demo/text_examples/1p_vibevoice.txt
Start with the default CFG scale of 1.5 and adjust based on your audio quality preferences.

Python API Usage

You can also use VibeVoice programmatically:
import torch
from vibevoice import (
    VibeVoiceStreamingForConditionalGenerationInference,
    VibeVoiceStreamingProcessor
)

# Load processor and model
processor = VibeVoiceStreamingProcessor.from_pretrained("microsoft/VibeVoice-Realtime-0.5B")
model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2"
)
model.eval()
model.set_ddpm_inference_steps(num_steps=5)

# Load voice prompt
voice_prompt = torch.load("demo/voices/streaming_model/Wayne.pt", map_location="cuda", weights_only=False)

# Prepare inputs
text = "Hello, this is VibeVoice."
inputs = processor.process_input_with_cached_prompt(
    text=text,
    cached_prompt=voice_prompt,
    padding=True,
    return_tensors="pt",
    return_attention_mask=True
)

# Move to device
for k, v in inputs.items():
    if torch.is_tensor(v):
        inputs[k] = v.to("cuda")

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=None,
    cfg_scale=1.5,
    tokenizer=processor.tokenizer,
    generation_config={'do_sample': False},
    verbose=True,
    all_prefilled_outputs=voice_prompt
)

# Save audio
processor.save_audio(outputs.speech_outputs[0], output_path="output.wav")

Troubleshooting

Flash Attention Errors

If you encounter errors with flash_attention_2, the model will automatically fall back to SDPA:
Error loading the model. Trying to use SDPA. However, note that only 
flash_attention_2 has been fully tested, and using SDPA may result in 
lower audio quality.
For best results, install flash-attention: pip install flash-attn --no-build-isolation

Voice File Not Found

If your specified speaker name doesn’t match any voice files:
Warning: No voice preset found for 'InvalidName', using default voice
List available voices by checking demo/voices/streaming_model/ directory.
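One way to enumerate valid speaker names is to scan that directory for .pt files. A minimal sketch, assuming speaker names match the file stems (as with Wayne.pt):

```python
from pathlib import Path

def available_voices(voice_dir="demo/voices/streaming_model"):
    # Each .pt file is a cached voice prompt; the filename stem
    # (e.g., "Wayne" from Wayne.pt) is the name to pass via --speaker_name.
    return sorted(p.stem for p in Path(voice_dir).glob("*.pt"))

print(available_voices())
```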
