
VibeVoiceStreamingForConditionalGenerationInference

The main inference model for VibeVoice streaming text-to-speech generation. It produces speech in real time by interleaving text processing with audio generation, streaming audio output as it is produced.

Class Signature

class VibeVoiceStreamingForConditionalGenerationInference(
    VibeVoiceStreamingPreTrainedModel, 
    GenerationMixin
)

Initialization

import torch
from vibevoice import VibeVoiceStreamingForConditionalGenerationInference

model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2"
)
config
VibeVoiceStreamingConfig
Configuration object containing model architecture settings

Key Properties

noise_scheduler
DPMSolverMultistepScheduler
The noise scheduler used for diffusion-based speech generation
prediction_head
VibeVoiceDiffusionHead
The diffusion head that predicts noise during speech token sampling
speech_scaling_factor
torch.Tensor
Scaling factor applied to speech latents before decoding
speech_bias_factor
torch.Tensor
Bias factor applied to speech latents before decoding
acoustic_tokenizer
VibeVoiceAcousticTokenizer
The acoustic tokenizer that decodes speech latents to audio waveforms

Methods

from_pretrained

Load a pretrained model from the Hugging Face Hub or a local directory.
model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2"
)
pretrained_model_name_or_path
str
required
Path to pretrained model or model identifier from huggingface.co/models
torch_dtype
torch.dtype
Data type for model weights. Use torch.bfloat16 for CUDA, torch.float32 for MPS/CPU
device_map
str
Device placement strategy. Options: "cuda", "cpu", "mps", or "auto"
attn_implementation
str
Attention implementation. Options: "flash_attention_2" (recommended for CUDA), "sdpa"

generate

Generate speech from text inputs with streaming support.
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    cfg_scale=1.5,
    tokenizer=processor.tokenizer,
    generation_config={'do_sample': False},
    all_prefilled_outputs=cached_prompt,
    verbose=True
)
inputs
torch.Tensor
Prompt input IDs (typically from processor output)
generation_config
GenerationConfig
Configuration for generation. Set do_sample=False for deterministic output
audio_streamer
AudioStreamer | AsyncAudioStreamer
Optional streamer to receive audio chunks during generation
tts_text_ids
torch.LongTensor
Full text tokens to stream in windows during generation
cfg_scale
float
default:"1.0"
Classifier-free guidance scale for speech diffusion. Higher values (1.5-3.0) increase adherence to conditioning
return_speech
bool
default:"True"
Whether to concatenate and return speech audio tensors
stop_check_fn
Callable[[], bool]
Optional callback function that returns True to halt generation early
tokenizer
VibeVoiceTextTokenizer
required
Tokenizer instance (from processor.tokenizer)
all_prefilled_outputs
Dict[str, Any]
Cached prompt outputs containing KV caches for lm, tts_lm, neg_lm, and neg_tts_lm
max_new_tokens
int
Maximum number of new tokens to generate. If None, uses max_position_embeddings
verbose
bool
default:"False"
Whether to print generation progress information
VibeVoiceGenerationOutput
object
Generation output containing:
  • sequences (torch.LongTensor): Generated token IDs
  • speech_outputs (List[torch.FloatTensor]): List of audio waveforms for each sample
  • reach_max_step_sample (torch.BoolTensor): Flags indicating samples that reached max length
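Because stop_check_fn is just a zero-argument callable, a simple wall-clock timeout can serve as an early-stop hook. This is an illustrative sketch; the helper name and timeout value are hypothetical, not part of the library:

```python
import time

def make_timeout_stop_check(max_seconds: float):
    """Build a stop_check_fn that halts generation after a wall-clock budget.

    Returns a zero-argument callable (the Callable[[], bool] shape documented
    above); generation stops once it returns True.
    """
    deadline = time.monotonic() + max_seconds
    return lambda: time.monotonic() >= deadline

# Hypothetical usage:
# outputs = model.generate(**inputs, stop_check_fn=make_timeout_stop_check(10.0), ...)
```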

set_ddpm_inference_steps

Set the number of diffusion denoising steps for speech generation.
model.set_ddpm_inference_steps(num_steps=5)
num_steps
int
Number of inference steps for diffusion sampling. Default is taken from the config. Lower values (e.g., 5) are faster but may reduce quality

set_speech_tokenizers

Set the acoustic tokenizer used for encoding and decoding speech.
model.set_speech_tokenizers(acoustic_tokenizer=custom_tokenizer)
acoustic_tokenizer
VibeVoiceAcousticTokenizer
Custom acoustic tokenizer instance

forward_lm

Single forward pass through the base language model (text encoding).
outputs = model.forward_lm(
    input_ids=input_ids,
    attention_mask=attention_mask,
    past_key_values=past_key_values,
    use_cache=True
)
input_ids
torch.LongTensor
Input token IDs of shape (batch_size, sequence_length)
attention_mask
torch.Tensor
Attention mask of shape (batch_size, sequence_length)
past_key_values
Tuple[Tuple[torch.FloatTensor]]
Cached key-value states from previous forward passes
use_cache
bool
Whether to return key-value cache for next iteration
cache_position
torch.LongTensor
Positions for cached tokens
BaseModelOutputWithPast
object
Output containing:
  • last_hidden_state (torch.FloatTensor): Hidden states from final layer
  • past_key_values (Tuple): Cached attention states
  • attentions (Tuple, optional): Attention weights
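The past_key_values flow above follows the usual incremental-decoding pattern: the first call encodes the full prompt, and each later call passes only the newly generated token plus the cache returned by the previous call. A toy sketch of that contract, using plain Python values as stand-ins for the real tensors:

```python
def forward_step(token, cache):
    """Toy model of incremental decoding with a KV cache.

    Each call processes only the new token and appends its state to the
    cache, mirroring how past_key_values lets forward_lm avoid re-encoding
    the prefix on every step.
    """
    cache = cache + [token]   # reuse cached states, append the new one
    hidden = sum(cache)       # stand-in for attention over all positions
    return hidden, cache

# First call "prefills" the prompt; later calls feed one token at a time.
```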

forward_tts_lm

Single forward pass through the TTS language model (text + speech encoding).
outputs = model.forward_tts_lm(
    input_ids=tts_input_ids,
    attention_mask=attention_mask,
    lm_last_hidden_state=lm_hidden_states,
    tts_text_masks=text_masks,
    past_key_values=past_key_values,
    use_cache=True
)
input_ids
torch.LongTensor
Input token IDs of shape (batch_size, sequence_length)
lm_last_hidden_state
torch.FloatTensor
Hidden states from base LM to splice into input embeddings, shape (batch_size, K, hidden_size)
tts_text_masks
torch.BoolTensor
Mask indicating text (1) vs speech (0) positions, shape (batch_size, 1)
attention_mask
torch.Tensor
Attention mask of shape (batch_size, sequence_length)
past_key_values
Tuple[Tuple[torch.FloatTensor]]
Cached key-value states from previous forward passes
VibeVoiceCausalLMOutputWithPast
object
Output containing:
  • logits (torch.FloatTensor): EOS prediction logits from binary classifier
  • last_hidden_state (torch.FloatTensor): Hidden states from final layer
  • past_key_values (Tuple): Cached attention states

sample_speech_tokens

Sample speech latent tokens using diffusion with classifier-free guidance.
speech_latent = model.sample_speech_tokens(
    condition=positive_condition,
    neg_condition=negative_condition,
    cfg_scale=1.5
)
condition
torch.Tensor
Positive conditioning from TTS LM hidden states
neg_condition
torch.Tensor
Negative (unconditional) conditioning from TTS LM
cfg_scale
float
default:"3.0"
Classifier-free guidance scale
speech_tokens
torch.Tensor
Sampled speech latent vectors of shape (batch_size, acoustic_vae_dim)
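Classifier-free guidance combines the two conditioning branches by extrapolating from the negative (unconditional) prediction toward the positive one; with cfg_scale = 1.0 the result equals the conditional prediction alone. A minimal sketch of the standard CFG formula on plain lists (the real model applies this to noise predictions inside the diffusion loop):

```python
def cfg_combine(cond_pred, uncond_pred, cfg_scale):
    """Classifier-free guidance: start from the unconditional prediction
    and move cfg_scale times the gap toward the conditional prediction.
    Values above 1.0 amplify adherence to the conditioning."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond_pred, uncond_pred)]
```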

Usage Example

import torch
from vibevoice import (
    VibeVoiceStreamingForConditionalGenerationInference,
    VibeVoiceStreamingProcessor
)

# Load model and processor
model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2"
)
processor = VibeVoiceStreamingProcessor.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B"
)

# Set inference steps
model.eval()
model.set_ddpm_inference_steps(num_steps=5)

# Load voice prompt
voice_prompt = torch.load("voice_prompt.pt", map_location="cuda")

# Process input
inputs = processor.process_input_with_cached_prompt(
    text="Hello, this is a test of VibeVoice streaming synthesis.",
    cached_prompt=voice_prompt,
    padding=True,
    return_tensors="pt"
)

# Move to device
for k, v in inputs.items():
    if torch.is_tensor(v):
        inputs[k] = v.to("cuda")

# Generate speech
outputs = model.generate(
    **inputs,
    cfg_scale=1.5,
    tokenizer=processor.tokenizer,
    generation_config={'do_sample': False},
    all_prefilled_outputs=voice_prompt,
    verbose=True
)

# Save audio
processor.save_audio(
    outputs.speech_outputs[0],
    output_path="output.wav"
)

Notes

  • The model currently only supports batch size of 1
  • Text is processed in windows of 5 tokens (TTS_TEXT_WINDOW_SIZE)
  • Speech is generated in windows of 6 tokens (TTS_SPEECH_WINDOW_SIZE)
  • The forward() method is intentionally disabled; use forward_lm(), forward_tts_lm(), or generate() instead
  • For CUDA, use flash_attention_2 and torch.bfloat16 for best performance
  • For MPS (Apple Silicon), use sdpa attention and torch.float32
  • For CPU, use sdpa attention and torch.float32
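The device guidance in the notes above can be folded into a small helper. This is a sketch; the function name is hypothetical, and the dtype/attention pairings simply mirror the bullets:

```python
def select_load_options(device: str):
    """Pick torch_dtype and attn_implementation names per device:
    bfloat16 + flash_attention_2 on CUDA, float32 + sdpa on MPS and CPU."""
    if device == "cuda":
        return "bfloat16", "flash_attention_2"
    return "float32", "sdpa"

# Hypothetical usage before from_pretrained:
# dtype_name, attn_impl = select_load_options("cuda")
```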

VibeVoiceGenerationOutput

Output dataclass returned by the generate() method.

Fields

sequences
torch.LongTensor
Generated token sequences of shape (batch_size, sequence_length) containing both input and generated tokens
speech_outputs
List[torch.FloatTensor]
List of generated speech waveforms. Each tensor has shape (1, num_samples) and contains audio at a 24 kHz sample rate. None if return_speech=False
reach_max_step_sample
torch.BoolTensor
Boolean flags of shape (batch_size,) indicating which samples stopped due to reaching maximum generation length

Example

outputs = model.generate(**inputs, ...)

# Access generated sequences
token_ids = outputs.sequences  # torch.LongTensor

# Access generated audio
audio_waveform = outputs.speech_outputs[0]  # First batch item
sample_rate = 24000
audio_duration = audio_waveform.shape[-1] / sample_rate

# Check if generation was truncated
if outputs.reach_max_step_sample[0]:
    print("Generation reached maximum length")
