Before starting, make sure you’ve installed VibeVoice.
Choose Your Method
WebSocket Demo
Real-time streaming TTS with low latency
File Inference
Generate speech from text files
WebSocket Demo
Launch a real-time WebSocket server for streaming TTS:
Open the web interface
Navigate to http://localhost:3000 in your browser to access the interactive demo.
File Inference
Generate speech from text files for longer content:
Command-Line Arguments
| Argument | Default | Description |
|---|---|---|
| --model_path | microsoft/VibeVoice-Realtime-0.5B | HuggingFace model path |
| --txt_path | demo/text_examples/1p_vibevoice.txt | Input text file path |
| --speaker_name | Wayne | Voice preset name |
| --output_dir | ./outputs | Output directory for audio files |
| --device | Auto-detected | Device: cuda, mps, or cpu |
| --cfg_scale | 1.5 | Classifier-Free Guidance scale |
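To make the defaults in the table concrete, here is a minimal sketch of an argument parser that mirrors them. It is not the actual VibeVoice inference script: the `build_parser` function is hypothetical, and representing the auto-detected device as `None` is an assumption made for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(description="File inference arguments (illustrative sketch)")
    parser.add_argument("--model_path", default="microsoft/VibeVoice-Realtime-0.5B",
                        help="HuggingFace model path")
    parser.add_argument("--txt_path", default="demo/text_examples/1p_vibevoice.txt",
                        help="Input text file path")
    parser.add_argument("--speaker_name", default="Wayne",
                        help="Voice preset name")
    parser.add_argument("--output_dir", default="./outputs",
                        help="Output directory for audio files")
    # Assumption: None stands in for "auto-detected" in this sketch.
    parser.add_argument("--device", default=None, choices=["cuda", "mps", "cpu"],
                        help="Device: cuda, mps, or cpu")
    parser.add_argument("--cfg_scale", type=float, default=1.5,
                        help="Classifier-Free Guidance scale")
    return parser


# Override two arguments and keep the documented defaults for the rest.
args = build_parser().parse_args(["--speaker_name", "Carter", "--device", "cpu"])
print(args.speaker_name, args.device, args.cfg_scale)
```

Any argument that is not passed keeps the default listed in the table.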
Python API Usage
Use VibeVoice directly in your Python code:
Understanding the Output
After generation, VibeVoice provides performance metrics:
Generation Metrics
- Generation time: Total time to generate audio
- Audio duration: Length of generated audio
- RTF (Real-Time Factor): Ratio of generation time to audio duration
  - RTF < 1.0 means faster than real-time
  - RTF = 1.0 means real-time
  - RTF > 1.0 means slower than real-time
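The RTF definition above can be checked by hand. The sketch below computes it from the two reported timings; the function names `real_time_factor` and `describe_rtf` are hypothetical helpers written for this example and are not part of VibeVoice.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """Return generation time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return generation_seconds / audio_seconds


def describe_rtf(rtf):
    """Map an RTF value to the interpretation given in the list above."""
    if rtf < 1.0:
        return "faster than real-time"
    if rtf == 1.0:
        return "real-time"
    return "slower than real-time"


# Example: 2 seconds of compute for 4 seconds of audio.
rtf = real_time_factor(generation_seconds=2.0, audio_seconds=4.0)
print(rtf, describe_rtf(rtf))
```

For streaming use, an RTF below 1.0 is the practical requirement: audio is produced faster than it is played back.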
Token Metrics
- Prefilling text tokens: Number of input text tokens
- Generated speech tokens: Number of acoustic tokens generated
- Total tokens: Sum of all tokens processed
Advanced Configuration
Adjusting CFG Scale
Control the strength of classifier-free guidance:
Changing Diffusion Steps
Adjust the number of diffusion inference steps:
The default of 5 steps provides a good balance between quality and speed.
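For intuition about what the CFG scale controls, the following sketch shows the standard classifier-free guidance combination on plain Python lists. It illustrates the general technique only; how VibeVoice applies guidance internally is not described here, and the `apply_cfg` function is hypothetical.

```python
def apply_cfg(conditional, unconditional, cfg_scale=1.5):
    """Standard classifier-free guidance: move the unconditional prediction
    toward the conditional one, scaled by cfg_scale."""
    return [u + cfg_scale * (c - u) for c, u in zip(conditional, unconditional)]


# With cfg_scale=1.0 the result equals the conditional prediction;
# larger values push further in the direction of the conditioning signal.
print(apply_cfg([2.0, 4.0], [1.0, 2.0], cfg_scale=1.0))
print(apply_cfg([2.0, 4.0], [1.0, 2.0], cfg_scale=1.5))
```

Under this formulation, raising the scale above 1.0 strengthens adherence to the conditioning input, while values that are too high can degrade naturalness, so the documented default of 1.5 is a reasonable starting point.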
Troubleshooting
Out of memory errors
- Use CPU instead of GPU: --device cpu
- Reduce batch size or text length
- Use float32 instead of bfloat16 on MPS devices
Slow generation
- Ensure you’re using CUDA with flash_attention_2
- Reduce diffusion steps: model.set_ddpm_inference_steps(num_steps=3)
- Check that your GPU drivers are up to date
Voice file not found
- Check available voices in demo/voices/streaming_model/
- Use exact voice name from the .pt files
- Default voices include: Carter, Wayne, and others
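The voice check above can be scripted. This is a minimal sketch that lists the `.pt` files in the voices directory; `list_voice_names` is a hypothetical helper, and treating the file name without its extension as the voice name is an assumption.

```python
from pathlib import Path


def list_voice_names(voices_dir="demo/voices/streaming_model"):
    """Return the sorted names of .pt voice files, without the extension.

    Assumption: the voice preset name is the file stem.
    """
    directory = Path(voices_dir)
    if not directory.is_dir():
        # Missing directory: report no voices instead of raising.
        return []
    return sorted(path.stem for path in directory.glob("*.pt"))


# Run from the repository root so the relative path resolves.
print(list_voice_names())
```

If the list is empty, confirm that the script is running from the repository root and that the voice files have been downloaded.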
Next Steps
WebSocket Guide
Build real-time TTS applications
Custom Voices
Learn about voice prompts
API Reference
Explore the full API
Advanced Config
Fine-tune your setup