VibeVoice uses cached voice prompts to generate speech with specific voice characteristics. This guide explains how voice prompts work and how to use them effectively.

Understanding Voice Prompts

Voice prompts in VibeVoice are pre-computed representations of voice characteristics stored as .pt (PyTorch) files. These files contain:
  • tts_lm: Language model key-value cache for the voice
  • lm: Base language model hidden states
These cached prompts enable fast, consistent voice cloning without recomputing voice embeddings for each request.

Voice File Structure

Voice files are stored in the following location:
demo/
└── voices/
    └── streaming_model/
        ├── Wayne.pt
        ├── en-WHTest_man.pt
        ├── Speaker01.pt
        └── Speaker02.pt
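
A directory like this can be scanned for presets with a few lines of standard-library code. This is a minimal sketch (the function name scan_voice_presets is ours, not part of VibeVoice) that maps each voice name, taken from the file stem, to its .pt path:

```python
from pathlib import Path

def scan_voice_presets(voices_dir="demo/voices/streaming_model"):
    """Map voice names (file stems) to their .pt file paths."""
    presets = {}
    for pt_file in sorted(Path(voices_dir).glob("*.pt")):
        # "Wayne.pt" becomes the voice name "Wayne"
        presets[pt_file.stem] = str(pt_file)
    return presets
```

Adding a new voice is then just a matter of dropping another .pt file into the directory.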

Using Existing Voices

Command-Line Usage

Specify a voice using the --speaker_name argument:
python demo/realtime_model_inference_from_file.py \
  --model_path microsoft/VibeVoice-Realtime-0.5B \
  --txt_path demo/text_examples/1p_vibevoice.txt \
  --speaker_name Wayne

Voice Name Resolution

The VoiceMapper class handles voice name resolution with flexible matching:
1. Exact Match

First, it tries to find an exact match for the voice name:
if speaker_name in self.voice_presets:
    return self.voice_presets[speaker_name]
2. Partial Match

If no exact match, it performs case-insensitive partial matching:
speaker_lower = speaker_name.lower()
for preset_name, path in self.voice_presets.items():
    if preset_name.lower() in speaker_lower or speaker_lower in preset_name.lower():
        return path
3. Default Fallback

If no match is found, it uses the first available voice:
default_voice = list(self.voice_presets.values())[0]
print(f"Warning: No voice preset found for '{speaker_name}', using default voice")
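
The three steps above can be combined into one self-contained resolution function. This is an illustrative sketch (resolve_voice is our name, not the actual VoiceMapper API), mirroring the exact-match, partial-match, and fallback logic shown:

```python
def resolve_voice(speaker_name, voice_presets):
    """Resolve a speaker name to a voice file path: exact match first,
    then case-insensitive partial match, then the first available voice."""
    # 1. Exact match
    if speaker_name in voice_presets:
        return voice_presets[speaker_name]
    # 2. Case-insensitive partial match, in either direction
    speaker_lower = speaker_name.lower()
    for preset_name, path in voice_presets.items():
        if preset_name.lower() in speaker_lower or speaker_lower in preset_name.lower():
            return path
    # 3. Fall back to the first available voice
    print(f"Warning: No voice preset found for '{speaker_name}', using default voice")
    return next(iter(voice_presets.values()))
```

For example, resolve_voice("wayne", presets) matches the "Wayne" preset via the case-insensitive partial match, while an unknown name falls through to the first preset.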

Listing Available Voices

The voice mapper automatically scans and lists available voices on startup:
voice_mapper = VoiceMapper()
# Output: Found 4 voice files in demo/voices/streaming_model
# Output: Available voices: Wayne, en-WHTest_man, Speaker01, Speaker02

Loading Voice Prompts Programmatically

Basic Loading

import torch

# Load voice prompt
voice_prompt = torch.load(
    "demo/voices/streaming_model/Wayne.pt",
    map_location="cuda",
    weights_only=False
)

# The voice_prompt contains:
# - voice_prompt['tts_lm']['last_hidden_state']: TTS language model cache
# - voice_prompt['lm']['last_hidden_state']: Base language model cache

Device-Aware Loading

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

voice_prompt = torch.load(
    "demo/voices/streaming_model/Wayne.pt",
    map_location=device,
    weights_only=False
)
Always use weights_only=False when loading voice prompts, as they contain complex nested structures beyond simple tensors.

Using Voice Prompts with the Processor

The processor’s process_input_with_cached_prompt method handles voice prompts:
from vibevoice import VibeVoiceStreamingProcessor
import torch

processor = VibeVoiceStreamingProcessor.from_pretrained("microsoft/VibeVoice-Realtime-0.5B")

# Load voice prompt
voice_prompt = torch.load("demo/voices/streaming_model/Wayne.pt", map_location="cuda", weights_only=False)

# Process input with cached prompt
inputs = processor.process_input_with_cached_prompt(
    text="Hello, this is a test.",
    cached_prompt=voice_prompt,
    padding=True,
    return_tensors="pt",
    return_attention_mask=True
)

Understanding the Processing

The processor extracts information from the cached prompt:
# Get input lengths from cached prompt
input_id_length = cached_prompt['lm']['last_hidden_state'].size(1)
tts_lm_input_id_length = cached_prompt['tts_lm']['last_hidden_state'].size(1)

# Create pseudo input IDs (actual values don't matter, only length)
input_ids = [processor.tokenizer.pad_id] * input_id_length
tts_lm_input_ids = [processor.tokenizer.pad_id] * tts_lm_input_id_length
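
The length-only role of these pseudo IDs can be seen without loading a real checkpoint. In this sketch, FakeHidden stands in for a torch.Tensor of shape [1, seq_len, hidden_dim] and pad_id for the tokenizer's pad token; both names are ours for illustration:

```python
class FakeHidden:
    """Stand-in for a torch.Tensor with shape [1, seq_len, hidden_dim]."""
    def __init__(self, seq_len):
        self._seq_len = seq_len

    def size(self, dim):
        # Mimic torch.Tensor.size(dim); dim=1 is the sequence length
        return self._seq_len if dim == 1 else 1

def build_pseudo_input_ids(cached_prompt, pad_id=0):
    """Create pad-filled ID lists whose lengths match the cached states."""
    input_id_length = cached_prompt["lm"]["last_hidden_state"].size(1)
    tts_lm_length = cached_prompt["tts_lm"]["last_hidden_state"].size(1)
    return [pad_id] * input_id_length, [pad_id] * tts_lm_length
```

Only the lengths reach the model; the pad values themselves are never attended to.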

Voice Caching in the WebSocket Demo

The WebSocket demo implements intelligent voice caching:
import torch
from typing import Dict

class StreamingTTSService:
    def __init__(self, model_path, device="cuda", inference_steps=5):
        self._voice_cache: Dict[str, object] = {}
        # ...

    def _ensure_voice_cached(self, key: str):
        """Load and cache the voice prompt if not already cached"""
        if key not in self._voice_cache:
            preset_path = self.voice_presets[key]
            print(f"Loading voice preset {key} from {preset_path}")
            prefilled_outputs = torch.load(
                preset_path,
                map_location=self._torch_device,
                weights_only=False
            )
            self._voice_cache[key] = prefilled_outputs

        return self._voice_cache[key]
Voice caching significantly improves performance by avoiding repeated file I/O for the same voice.

Default Voice Selection

The WebSocket demo uses this priority for selecting the default voice:
1. Environment Variable

Check for VOICE_PRESET environment variable:
preset_name = os.environ.get("VOICE_PRESET")
2. Hardcoded Default

Fall back to en-WHTest_man if it exists:
default_key = "en-WHTest_man"
if default_key in self.voice_presets:
    return default_key
3. First Available

Use the first voice in the sorted list:
first_key = next(iter(self.voice_presets))
return first_key
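
The three-step priority above can be sketched as one function. This is an illustrative version (pick_default_voice is our name, not the demo's actual API), with the env-var lookup, hardcoded default, and first-key fallback in order:

```python
import os

def pick_default_voice(voice_presets, env_var="VOICE_PRESET",
                       hardcoded_default="en-WHTest_man"):
    """Select the default voice: env var, then hardcoded name, then first key."""
    # 1. Environment variable, if set and pointing at a known preset
    preset_name = os.environ.get(env_var)
    if preset_name and preset_name in voice_presets:
        return preset_name
    # 2. Hardcoded default, if present
    if hardcoded_default in voice_presets:
        return hardcoded_default
    # 3. First available voice
    return next(iter(voice_presets))
```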

Generation with Voice Prompts

When generating, always pass a deep copy of the voice prompt:
import copy

outputs = model.generate(
    **inputs,
    max_new_tokens=None,
    cfg_scale=1.5,
    tokenizer=processor.tokenizer,
    generation_config={'do_sample': False},
    all_prefilled_outputs=copy.deepcopy(voice_prompt)
)
The copy.deepcopy() ensures the cached voice prompt isn’t modified during generation, allowing it to be reused.
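
The effect is easy to demonstrate with plain dictionaries. Here generate_mutating is a stand-in for model.generate, not the real API; it mutates the state it receives, just as generation advances the cached key-value state:

```python
import copy

def generate_mutating(prompt):
    """Stand-in for model.generate: mutates the prompt state it receives."""
    prompt["tts_lm"]["step"] = prompt["tts_lm"].get("step", 0) + 1
    return prompt

voice_prompt = {"tts_lm": {"step": 0}, "lm": {}}

# Passing a deep copy leaves the cached prompt untouched
generate_mutating(copy.deepcopy(voice_prompt))
assert voice_prompt["tts_lm"]["step"] == 0

# Passing the original would corrupt the cache for later requests
generate_mutating(voice_prompt)
assert voice_prompt["tts_lm"]["step"] == 1
```

A shallow copy would not be enough: the nested tts_lm and lm dicts would still be shared with the cache.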

WebSocket Voice Selection

Select a voice via the WebSocket API using the voice query parameter:
const ws = new WebSocket(
  'ws://localhost:3000/stream?text=Hello&voice=Wayne'
);
Retrieve available voices from the config endpoint:
curl http://localhost:3000/config
{
  "voices": ["Wayne", "en-WHTest_man", "Speaker01", "Speaker02"],
  "default_voice": "en-WHTest_man"
}

Voice Prompt File Format

Voice prompts are PyTorch checkpoint files with this structure:
{
    'tts_lm': {
        'last_hidden_state': torch.Tensor,  # Shape: [1, seq_len, hidden_dim]
        # Additional TTS language model states...
    },
    'lm': {
        'last_hidden_state': torch.Tensor,  # Shape: [1, seq_len, hidden_dim]
        # Additional language model states...
    }
}
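
A loaded prompt can be sanity-checked against this structure before use. The following is a minimal sketch (validate_voice_prompt is our helper, not part of VibeVoice) that verifies both sections exist and hold a 3-D hidden-state tensor:

```python
def validate_voice_prompt(voice_prompt):
    """Check that a loaded voice prompt has the expected nested structure."""
    for section in ("tts_lm", "lm"):
        if section not in voice_prompt:
            return False, f"missing '{section}' section"
        hidden = voice_prompt[section].get("last_hidden_state")
        # Expect a tensor-like value with a 3-D shape [1, seq_len, hidden_dim]
        if hidden is None or len(getattr(hidden, "shape", ())) != 3:
            return False, f"'{section}.last_hidden_state' missing or not 3-D"
    return True, "ok"
```

Running this right after torch.load gives a clearer error than a shape mismatch deep inside generation.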

Best Practices

Use Descriptive Names

Name voice files descriptively (e.g., en-male-young.pt, es-female-professional.pt)

Cache Voices

Cache frequently used voices in memory to avoid repeated disk I/O

Deep Copy Prompts

Always use copy.deepcopy() when passing prompts to generation

Organize by Language

Group voice files by language or accent for easier management

Troubleshooting

Voice File Not Loading

# Check if file exists
import os
voice_path = "demo/voices/streaming_model/Wayne.pt"
if not os.path.exists(voice_path):
    print(f"Voice file not found: {voice_path}")

Device Mismatch

Ensure the voice prompt is on the same device as the model:
# Load voice to correct device
device = "cuda"
voice_prompt = torch.load(
    "demo/voices/streaming_model/Wayne.pt",
    map_location=device,  # Match model device
    weights_only=False
)

Voice Directory Not Found

voices_dir = "demo/voices/streaming_model"
if not os.path.exists(voices_dir):
    print(f"Warning: Voices directory not found at {voices_dir}")
    # Create directory or update path
