This guide shows you how to generate speech from text using the VibeVoice streaming model.
Prerequisites
Before running inference, ensure you have:
Installed VibeVoice and its dependencies
Downloaded a model, or have access to one (e.g., microsoft/VibeVoice-Realtime-0.5B)
Voice prompt files in .pt format (located in demo/voices/streaming_model/)
Basic Usage
Prepare Your Text File
Create a text file with the content you want to convert to speech: demo/text_examples/1p_vibevoice.txt
Hello, this is a test of the VibeVoice text-to-speech system.
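If you prefer to script this step, the same file can be created from Python; the path and contents below are taken directly from this guide:

```python
from pathlib import Path

# Create the example script file used throughout this guide.
text_path = Path("demo/text_examples/1p_vibevoice.txt")
text_path.parent.mkdir(parents=True, exist_ok=True)
text_path.write_text("Hello, this is a test of the VibeVoice text-to-speech system.\n")
```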
Run Inference
Use the realtime_model_inference_from_file.py script: python demo/realtime_model_inference_from_file.py \
--model_path microsoft/VibeVoice-Realtime-0.5B \
--txt_path demo/text_examples/1p_vibevoice.txt \
--speaker_name Wayne \
--output_dir ./outputs
Check Output
The generated audio will be saved to the output directory: ls outputs/
# 1p_vibevoice_generated.wav
Command-Line Arguments
model_path
string
default: "microsoft/VibeVoice-Realtime-0.5B"
Path to the HuggingFace model directory or model ID
txt_path
string
default: "demo/text_examples/1p_vibevoice.txt"
Path to the text file containing the script to synthesize
speaker_name
string
Name of the speaker voice to use. Must match a voice file in demo/voices/streaming_model/
output_dir
string
default: "./outputs"
Directory where the generated audio files will be saved
device
string
Device for inference. Options: cuda, mps, or cpu. Defaults to CUDA if available, otherwise MPS, then CPU
cfg_scale
float
default: 1.5
CFG (Classifier-Free Guidance) scale for generation. Higher values increase adherence to the input prompt
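The documented device fallback (CUDA if available, otherwise MPS, then CPU) can be sketched as a small helper. The function name is ours, not part of VibeVoice; in a real script you would pass torch.cuda.is_available() and torch.backends.mps.is_available() for the two flags:

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Mirror the documented default: CUDA if available, otherwise MPS, otherwise CPU.

    In practice, pass torch.cuda.is_available() and
    torch.backends.mps.is_available() for the two flags.
    """
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"
```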
Device-Specific Configuration
CUDA (NVIDIA GPUs)
python demo/realtime_model_inference_from_file.py \
--device cuda \
--txt_path demo/text_examples/1p_vibevoice.txt
CUDA devices use bfloat16 dtype and flash_attention_2 for optimal performance.
MPS (Apple Silicon)
python demo/realtime_model_inference_from_file.py \
--device mps \
--txt_path demo/text_examples/1p_vibevoice.txt
MPS requires the float32 dtype and uses the SDPA attention implementation, since flash_attention_2 is not supported there.
CPU
python demo/realtime_model_inference_from_file.py \
--device cpu \
--txt_path demo/text_examples/1p_vibevoice.txt
CPU inference is significantly slower than GPU inference and should only be used for testing.
Understanding the Output
After generation completes, you’ll see a summary with performance metrics:
==================================================
GENERATION SUMMARY
==================================================
Input file: demo/text_examples/1p_vibevoice.txt
Output file: ./outputs/1p_vibevoice_generated.wav
Speaker names: Wayne
Prefilling text tokens: 42
Generated speech tokens: 1250
Total tokens: 1292
Generation time: 3.45 seconds
Audio duration: 5.20 seconds
RTF (Real Time Factor): 0.66x
==================================================
Key Metrics
Prefilling text tokens: Number of input text tokens processed
Generated speech tokens: Number of speech tokens generated by the model
RTF (Real Time Factor): Generation time divided by audio duration. Values < 1.0 indicate faster-than-real-time generation
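The RTF in the summary is just the ratio of the two timing lines; a quick check with the numbers from the sample summary above:

```python
# Values from the sample generation summary above.
generation_time = 3.45  # seconds spent generating
audio_duration = 5.20   # seconds of audio produced

rtf = generation_time / audio_duration
print(f"RTF (Real Time Factor): {rtf:.2f}x")  # 0.66x, i.e. faster than real time
```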
Advanced Configuration
Adjusting CFG Scale
The CFG scale controls how closely the model follows the input prompt:
Conservative (lower adherence): cfg_scale below the default, e.g. 1.0
Balanced (default): cfg_scale 1.5
Aggressive (higher adherence): cfg_scale above 1.5
For example, to run with a conservative setting:
python demo/realtime_model_inference_from_file.py \
--cfg_scale 1.0 \
--txt_path demo/text_examples/1p_vibevoice.txt
Start with the default CFG scale of 1.5 and adjust based on your audio quality preferences.
Python API Usage
You can also use VibeVoice programmatically:
import torch
from vibevoice import (
    VibeVoiceStreamingForConditionalGenerationInference,
    VibeVoiceStreamingProcessor,
)

# Load processor and model
processor = VibeVoiceStreamingProcessor.from_pretrained("microsoft/VibeVoice-Realtime-0.5B")
model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2",
)
model.eval()
model.set_ddpm_inference_steps(num_steps=5)

# Load voice prompt
voice_prompt = torch.load(
    "demo/voices/streaming_model/Wayne.pt", map_location="cuda", weights_only=False
)

# Prepare inputs
text = "Hello, this is VibeVoice."
inputs = processor.process_input_with_cached_prompt(
    text=text,
    cached_prompt=voice_prompt,
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)

# Move tensors to the target device
for k, v in inputs.items():
    if torch.is_tensor(v):
        inputs[k] = v.to("cuda")

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=None,
    cfg_scale=1.5,
    tokenizer=processor.tokenizer,
    generation_config={"do_sample": False},
    verbose=True,
    all_prefilled_outputs=voice_prompt,
)

# Save audio
processor.save_audio(outputs.speech_outputs[0], output_path="output.wav")
Troubleshooting
Flash Attention Errors
If you encounter errors with flash_attention_2, the model will automatically fall back to SDPA:
Error loading the model. Trying to use SDPA. However, note that only
flash_attention_2 has been fully tested, and using SDPA may result in
lower audio quality.
For best results, install flash-attention: pip install flash-attn --no-build-isolation
Voice File Not Found
If your specified speaker name doesn’t match any voice files:
Warning: No voice preset found for 'InvalidName', using default voice
List available voices by checking the demo/voices/streaming_model/ directory.
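Since each voice is a .pt file named after its speaker, a small helper can enumerate them; the function name here is ours, not part of the VibeVoice API:

```python
from pathlib import Path

def list_voices(voice_dir: str = "demo/voices/streaming_model") -> list[str]:
    """Return the speaker names available as .pt voice prompt files in voice_dir."""
    return sorted(p.stem for p in Path(voice_dir).glob("*.pt"))
```

Each returned name (e.g. Wayne) can be passed directly to --speaker_name.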