
VibeVoiceStreamingForConditionalGenerationInference

The main inference model for VibeVoice streaming text-to-speech generation. It produces speech in real time by interleaving text processing with audio generation, streaming audio output as it is produced.

Class Signature

class VibeVoiceStreamingForConditionalGenerationInference(
    VibeVoiceStreamingPreTrainedModel, 
    GenerationMixin
)

Initialization

import torch
from vibevoice import VibeVoiceStreamingForConditionalGenerationInference

model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2"
)
config
VibeVoiceStreamingConfig
Configuration object containing model architecture settings

Key Properties

noise_scheduler
DPMSolverMultistepScheduler
The noise scheduler used for diffusion-based speech generation
prediction_head
VibeVoiceDiffusionHead
The diffusion head that predicts noise during speech token sampling
speech_scaling_factor
torch.Tensor
Scaling factor applied to speech latents before decoding
speech_bias_factor
torch.Tensor
Bias factor applied to speech latents before decoding
acoustic_tokenizer
VibeVoiceAcousticTokenizer
The acoustic tokenizer that decodes speech latents to audio waveforms

Methods

from_pretrained

Load a pretrained model from the Hugging Face Hub or a local directory.
model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2"
)
pretrained_model_name_or_path
str
required
Path to pretrained model or model identifier from huggingface.co/models
torch_dtype
torch.dtype
Data type for model weights. Use torch.bfloat16 for CUDA, torch.float32 for MPS/CPU
device_map
str
Device placement strategy. Options: "cuda", "cpu", "mps", or "auto"
attn_implementation
str
Attention implementation. Options: "flash_attention_2" (recommended for CUDA), "sdpa"

generate

Generate speech from text inputs with streaming support.
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    cfg_scale=1.5,
    tokenizer=processor.tokenizer,
    generation_config={'do_sample': False},
    all_prefilled_outputs=cached_prompt,
    verbose=True
)
inputs
torch.Tensor
Prompt input IDs (typically from processor output)
generation_config
GenerationConfig
Configuration for generation. Set do_sample=False for deterministic output
audio_streamer
AudioStreamer | AsyncAudioStreamer
Optional streamer to receive audio chunks during generation
tts_text_ids
torch.LongTensor
Full text tokens to stream in windows during generation
cfg_scale
float
default:"1.0"
Classifier-free guidance scale for speech diffusion. Higher values (1.5-3.0) increase adherence to conditioning
return_speech
bool
default:"True"
Whether to concatenate and return speech audio tensors
stop_check_fn
Callable[[], bool]
Optional callback function that returns True to halt generation early
tokenizer
VibeVoiceTextTokenizer
required
Tokenizer instance (from processor.tokenizer)
all_prefilled_outputs
Dict[str, Any]
Cached prompt outputs containing KV caches for lm, tts_lm, neg_lm, and neg_tts_lm
max_new_tokens
int
Maximum number of new tokens to generate. If None, uses max_position_embeddings
verbose
bool
default:"False"
Whether to print generation progress information
VibeVoiceGenerationOutput
object
Generation output containing:
  • sequences (torch.LongTensor): Generated token IDs
  • speech_outputs (List[torch.FloatTensor]): List of audio waveforms for each sample
  • reach_max_step_sample (torch.BoolTensor): Flags indicating samples that reached max length
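Because stop_check_fn is just a zero-argument callable, a simple wall-clock timeout can serve as an early-stop hook. This is an illustrative sketch; the helper name and timeout value are hypothetical, not part of the library:

```python
import time

def make_timeout_stop_check(max_seconds: float):
    """Build a stop_check_fn that halts generation after a wall-clock budget.

    Returns a zero-argument callable (the Callable[[], bool] shape documented
    above); generation stops once it returns True.
    """
    deadline = time.monotonic() + max_seconds
    return lambda: time.monotonic() >= deadline

# Hypothetical usage:
# outputs = model.generate(**inputs, stop_check_fn=make_timeout_stop_check(10.0), ...)
```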

set_ddpm_inference_steps

Set the number of diffusion denoising steps for speech generation.
model.set_ddpm_inference_steps(num_steps=5)
num_steps
int
Number of inference steps for diffusion sampling. Default is taken from the config. Lower values (e.g., 5) are faster but may reduce quality

set_speech_tokenizers

Set the acoustic tokenizer used for encoding and decoding speech.
model.set_speech_tokenizers(acoustic_tokenizer=custom_tokenizer)
acoustic_tokenizer
VibeVoiceAcousticTokenizer
Custom acoustic tokenizer instance

forward_lm

Single forward pass through the base language model (text encoding).
outputs = model.forward_lm(
    input_ids=input_ids,
    attention_mask=attention_mask,
    past_key_values=past_key_values,
    use_cache=True
)
input_ids
torch.LongTensor
Input token IDs of shape (batch_size, sequence_length)
attention_mask
torch.Tensor
Attention mask of shape (batch_size, sequence_length)
past_key_values
Tuple[Tuple[torch.FloatTensor]]
Cached key-value states from previous forward passes
use_cache
bool
Whether to return key-value cache for next iteration
cache_position
torch.LongTensor
Positions for cached tokens
BaseModelOutputWithPast
object
Output containing:
  • last_hidden_state (torch.FloatTensor): Hidden states from final layer
  • past_key_values (Tuple): Cached attention states
  • attentions (Tuple, optional): Attention weights
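The past_key_values flow above follows the usual incremental-decoding pattern: the first call encodes the full prompt, and each later call passes only the newly generated token plus the cache returned by the previous call. A toy sketch of that contract, using plain Python values as stand-ins for the real tensors:

```python
def forward_step(token, cache):
    """Toy model of incremental decoding with a KV cache.

    Each call processes only the new token and appends its state to the
    cache, mirroring how past_key_values lets forward_lm avoid re-encoding
    the prefix on every step.
    """
    cache = cache + [token]   # reuse cached states, append the new one
    hidden = sum(cache)       # stand-in for attention over all positions
    return hidden, cache

# First call "prefills" the prompt; later calls feed one token at a time.
```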

forward_tts_lm

Single forward pass through the TTS language model (text + speech encoding).
outputs = model.forward_tts_lm(
    input_ids=tts_input_ids,
    attention_mask=attention_mask,
    lm_last_hidden_state=lm_hidden_states,
    tts_text_masks=text_masks,
    past_key_values=past_key_values,
    use_cache=True
)
input_ids
torch.LongTensor
Input token IDs of shape (batch_size, sequence_length)
lm_last_hidden_state
torch.FloatTensor
Hidden states from base LM to splice into input embeddings, shape (batch_size, K, hidden_size)
tts_text_masks
torch.BoolTensor
Mask indicating text (1) vs speech (0) positions, shape (batch_size, 1)
attention_mask
torch.Tensor
Attention mask of shape (batch_size, sequence_length)
past_key_values
Tuple[Tuple[torch.FloatTensor]]
Cached key-value states from previous forward passes
VibeVoiceCausalLMOutputWithPast
object
Output containing:
  • logits (torch.FloatTensor): EOS prediction logits from binary classifier
  • last_hidden_state (torch.FloatTensor): Hidden states from final layer
  • past_key_values (Tuple): Cached attention states

sample_speech_tokens

Sample speech latent tokens using diffusion with classifier-free guidance.
speech_latent = model.sample_speech_tokens(
    condition=positive_condition,
    neg_condition=negative_condition,
    cfg_scale=1.5
)
condition
torch.Tensor
Positive conditioning from TTS LM hidden states
neg_condition
torch.Tensor
Negative (unconditional) conditioning from TTS LM
cfg_scale
float
default:"3.0"
Classifier-free guidance scale
speech_tokens
torch.Tensor
Sampled speech latent vectors of shape (batch_size, acoustic_vae_dim)
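Classifier-free guidance combines the two conditioning branches by extrapolating from the negative (unconditional) prediction toward the positive one; with cfg_scale = 1.0 the result equals the conditional prediction alone. A minimal sketch of the standard CFG formula on plain lists (the real model applies this to noise predictions inside the diffusion loop):

```python
def cfg_combine(cond_pred, uncond_pred, cfg_scale):
    """Classifier-free guidance: start from the unconditional prediction
    and move cfg_scale times the gap toward the conditional prediction.
    Values above 1.0 amplify adherence to the conditioning."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond_pred, uncond_pred)]
```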

Usage Example

import torch
from vibevoice import (
    VibeVoiceStreamingForConditionalGenerationInference,
    VibeVoiceStreamingProcessor
)

# Load model and processor
model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2"
)
processor = VibeVoiceStreamingProcessor.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B"
)

# Set inference steps
model.eval()
model.set_ddpm_inference_steps(num_steps=5)

# Load voice prompt
voice_prompt = torch.load("voice_prompt.pt", map_location="cuda")

# Process input
inputs = processor.process_input_with_cached_prompt(
    text="Hello, this is a test of VibeVoice streaming synthesis.",
    cached_prompt=voice_prompt,
    padding=True,
    return_tensors="pt"
)

# Move to device
for k, v in inputs.items():
    if torch.is_tensor(v):
        inputs[k] = v.to("cuda")

# Generate speech
outputs = model.generate(
    **inputs,
    cfg_scale=1.5,
    tokenizer=processor.tokenizer,
    generation_config={'do_sample': False},
    all_prefilled_outputs=voice_prompt,
    verbose=True
)

# Save audio
processor.save_audio(
    outputs.speech_outputs[0],
    output_path="output.wav"
)

Notes

  • The model currently only supports batch size of 1
  • Text is processed in windows of 5 tokens (TTS_TEXT_WINDOW_SIZE)
  • Speech is generated in windows of 6 tokens (TTS_SPEECH_WINDOW_SIZE)
  • The forward() method is intentionally disabled; use forward_lm(), forward_tts_lm(), or generate() instead
  • For CUDA, use flash_attention_2 and torch.bfloat16 for best performance
  • For MPS (Apple Silicon), use sdpa attention and torch.float32
  • For CPU, use sdpa attention and torch.float32
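The device guidance in the notes above can be folded into a small helper. This is a sketch; the function name is hypothetical, and the dtype/attention pairings simply mirror the bullets:

```python
def select_load_options(device: str):
    """Pick torch_dtype and attn_implementation names per device:
    bfloat16 + flash_attention_2 on CUDA, float32 + sdpa on MPS and CPU."""
    if device == "cuda":
        return "bfloat16", "flash_attention_2"
    return "float32", "sdpa"

# Hypothetical usage before from_pretrained:
# dtype_name, attn_impl = select_load_options("cuda")
```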

VibeVoiceGenerationOutput

Output dataclass returned by the generate() method.

Fields

sequences
torch.LongTensor
Generated token sequences of shape (batch_size, sequence_length) containing both input and generated tokens
speech_outputs
List[torch.FloatTensor]
List of generated speech waveforms. Each tensor has shape (1, num_samples) and contains audio at a 24 kHz sample rate. None if return_speech=False
reach_max_step_sample
torch.BoolTensor
Boolean flags of shape (batch_size,) indicating which samples stopped due to reaching maximum generation length

Example

outputs = model.generate(**inputs, ...)

# Access generated sequences
token_ids = outputs.sequences  # torch.LongTensor

# Access generated audio
audio_waveform = outputs.speech_outputs[0]  # First batch item
sample_rate = 24000
audio_duration = audio_waveform.shape[-1] / sample_rate

# Check if generation was truncated
if outputs.reach_max_step_sample[0]:
    print("Generation reached maximum length")
