
Installation

This guide will walk you through installing VibeVoice and its dependencies for real-time text-to-speech generation.

Prerequisites

VibeVoice requires Python 3.9 or higher and is optimized for NVIDIA GPUs with CUDA support. It also supports Apple Silicon (MPS) and CPU inference.

System Requirements

  • Python: 3.9 or higher
  • Operating System: Linux, macOS, or Windows
  • Hardware:
    • Recommended: NVIDIA GPU (T4 or better)
    • Supported: Apple Silicon (M4 Pro or better), CPU
  • CUDA: For GPU acceleration (optional but recommended)

Installation Methods

Step 1: Choose Your Environment

Select the installation method appropriate for your system.

Option A: Docker (Recommended for GPU)

For GPU users, we recommend the NVIDIA Deep Learning Container to manage the CUDA environment:
# NVIDIA PyTorch Container 24.07 / 24.10 / 24.12 verified
# Later versions are also compatible
sudo docker run --privileged --net=host --ipc=host \
  --ulimit memlock=-1:-1 --ulimit stack=-1:-1 \
  --gpus all --rm -it \
  nvcr.io/nvidia/pytorch:24.07-py3
If flash attention is not included in your docker environment, install it manually:
pip install flash-attn --no-build-isolation
See the flash-attention project for more details.

Option B: Local Environment

For local installations without Docker, ensure you have:
  • Python 3.9+
  • PyTorch with appropriate CUDA support (if using GPU)
  • pip package manager
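A quick sanity check for a local environment can be run before installing anything else. This is a minimal sketch: the PyTorch probe is optional and only reports what is already present.

```python
import sys

# VibeVoice requires Python 3.9 or higher.
assert sys.version_info >= (3, 9), "Python 3.9 or higher is required"

# PyTorch is optional at this stage; `pip install -e .` will pull it in later.
try:
    import torch
    print(f"PyTorch: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
except ImportError:
    print("PyTorch not installed yet; it will be installed with VibeVoice")
```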

Step 2: Clone the Repository

Clone the VibeVoice repository from GitHub:
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice/

Step 3: Install Dependencies

Install VibeVoice and all required dependencies:
pip install -e .
This will install the following key dependencies:
  • torch - PyTorch deep learning framework
  • transformers==4.51.3 - Hugging Face Transformers (specific version required)
  • accelerate==1.6.0 - Model acceleration utilities
  • diffusers - Diffusion model components
  • gradio - Web UI components
  • librosa, scipy, numpy - Audio processing
  • fastapi, uvicorn - Web server for real-time demos
VibeVoice is developed with transformers==4.51.3. Later versions may not be compatible.
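Because of this version pin, it can be useful to confirm what is actually installed. Below is a hedged sketch; the `PINNED` constant and the warning text are this example's own, not part of VibeVoice.

```python
from importlib.metadata import version, PackageNotFoundError

PINNED = "4.51.3"  # the transformers version VibeVoice is developed with

try:
    installed = version("transformers")
    if installed == PINNED:
        print(f"transformers {installed} matches the pinned version")
    else:
        print(f"Warning: transformers {installed} installed; {PINNED} is recommended")
except PackageNotFoundError:
    print("transformers is not installed yet")
```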

Step 4: Verify Installation

Verify that VibeVoice is installed correctly:
import vibevoice
from vibevoice import (
    VibeVoiceStreamingForConditionalGenerationInference,
    VibeVoiceStreamingProcessor
)

print("VibeVoice installed successfully!")

Device-Specific Configuration

For optimal performance with NVIDIA GPUs:
# Install flash-attention for better performance
pip install flash-attn --no-build-isolation
The model will automatically use:
  • torch.bfloat16 precision
  • flash_attention_2 implementation
  • CUDA device mapping
NVIDIA T4 GPUs and newer achieve real-time performance (~300 ms first-chunk latency).
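The CUDA defaults above can also be spelled out explicitly. This is only a sketch: the keyword arguments are standard Hugging Face `from_pretrained` options, not VibeVoice-specific API, and the class name is taken from the verification step earlier in this guide.

```python
# Standard Hugging Face from_pretrained options mirroring the defaults above.
load_kwargs = dict(
    torch_dtype="bfloat16",                   # bf16 precision
    attn_implementation="flash_attention_2",  # requires flash-attn to be installed
    device_map="cuda",                        # place weights on the GPU
)

try:
    from vibevoice import VibeVoiceStreamingForConditionalGenerationInference
    model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
        "microsoft/VibeVoice-Realtime-0.5B", **load_kwargs
    )
except ImportError:
    print("vibevoice not installed; the kwargs above show the intended configuration")
```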

Download Models

VibeVoice models are hosted on Hugging Face and will be automatically downloaded when you run inference:
from vibevoice import VibeVoiceStreamingProcessor

# Model will be downloaded automatically
processor = VibeVoiceStreamingProcessor.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B"
)

  • VibeVoice-Realtime-0.5B: real-time TTS model (0.5B parameters)
  • Model Collection: browse all VibeVoice models

Troubleshooting

Flash Attention Installation

If you encounter errors with flash attention:
  1. Ensure you have a compatible CUDA version
  2. Try installing without build isolation:
    pip install flash-attn --no-build-isolation
    
  3. If flash attention installation fails, the model will fall back to SDPA (may reduce performance)
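That fallback decision can be sketched as a small helper. `pick_attn` is hypothetical, not part of VibeVoice's API; it just returns the `attn_implementation` string to pass to `from_pretrained`.

```python
def pick_attn(flash_available: bool) -> str:
    """Return the attn_implementation string: flash-attn if present, else SDPA."""
    return "flash_attention_2" if flash_available else "sdpa"

def flash_attn_installed() -> bool:
    """Check whether the flash-attn package can be imported."""
    try:
        import flash_attn  # noqa: F401
        return True
    except ImportError:
        return False

impl = pick_attn(flash_attn_installed())
print(f"Using attention implementation: {impl}")
```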

Transformers Version

If you experience compatibility issues:
# Ensure exact version is installed
pip install transformers==4.51.3 --force-reinstall

MPS Device Issues

On macOS, if MPS is not detected:
import torch
print(f"MPS available: {torch.backends.mps.is_available()}")
print(f"MPS built: {torch.backends.mps.is_built()}")
Ensure you have PyTorch with MPS support installed.
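A common pattern is to fall back through devices in order of preference: CUDA, then MPS, then CPU. `pick_device` below is a hypothetical helper for illustration, not part of VibeVoice.

```python
def pick_device(cuda_ok: bool, mps_ok: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then CPU."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"

try:
    import torch
    device = pick_device(torch.cuda.is_available(), torch.backends.mps.is_available())
except ImportError:
    device = "cpu"  # PyTorch not installed; nothing to accelerate
print(f"Selected device: {device}")
```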

Next Steps

  • Quickstart Tutorial: generate your first speech with VibeVoice
