
VibeVoiceStreamingProcessor

Processor that wraps a text tokenizer and an audio processor into a single interface for VibeVoice streaming models.

Class Signature

class VibeVoiceStreamingProcessor:
    def __init__(
        self,
        tokenizer=None,
        audio_processor=None,
        speech_tok_compress_ratio=3200,
        db_normalize=True,
        **kwargs
    )

Initialization

tokenizer
VibeVoiceTextTokenizer | VibeVoiceTextTokenizerFast
The tokenizer for text processing
audio_processor
VibeVoiceTokenizerProcessor
The audio processor for speech processing
speech_tok_compress_ratio
int
default:"3200"
Compression ratio for speech tokenization (samples per token)
db_normalize
bool
default:"True"
Whether to apply decibel normalization to audio inputs
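At the default compression ratio of 3200 samples per token and 24 kHz audio, each speech token covers roughly 133 ms. The token-count arithmetic can be sketched as follows; this is an illustrative sketch, and the ceiling rounding for trailing partial frames is an assumption, not something stated by this reference:

```python
import math

def num_speech_tokens(num_samples: int, compress_ratio: int = 3200) -> int:
    """Estimate speech token count for an audio clip.

    Each speech token covers `compress_ratio` audio samples; we round up
    (an assumption) so a trailing partial frame still gets a token.
    """
    return math.ceil(num_samples / compress_ratio)

# One second at 24 kHz: 24000 / 3200 = 7.5, rounded up to 8 tokens
tokens = num_speech_tokens(24000)
```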

Methods

from_pretrained

Load processor from a pretrained model directory.
from vibevoice import VibeVoiceStreamingProcessor

processor = VibeVoiceStreamingProcessor.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B"
)
pretrained_model_name_or_path
str
required
Model identifier from huggingface.co/models or path to local directory
processor
VibeVoiceStreamingProcessor
Initialized processor instance with loaded tokenizer and audio processor

save_pretrained

Save processor configuration to a directory.
processor.save_pretrained("./my_processor")
save_directory
str | os.PathLike
required
Directory where the processor configuration will be saved

process_input_with_cached_prompt

Main method to process text input with a cached voice prompt. Currently supports single examples only.
inputs = processor.process_input_with_cached_prompt(
    text="Hello world!",
    cached_prompt=voice_prompt,
    padding=True,
    return_tensors="pt"
)
text
str
required
The input text to process
cached_prompt
Dict[str, Any]
required
Cached prompt dictionary containing KV cache of the voice prompt. Must include keys: 'lm', 'tts_lm', 'neg_lm', 'neg_tts_lm'
padding
bool | str | PaddingStrategy
default:"True"
Whether to pad sequences to the same length
truncation
bool | str | TruncationStrategy
default:"False"
Whether to truncate sequences
max_length
int
Maximum length of returned sequences
return_tensors
str | TensorType
Type of tensors to return. Use "pt" for PyTorch tensors
return_attention_mask
bool
default:"True"
Whether to return attention masks
BatchEncoding
object
A BatchEncoding with the following fields:
  • input_ids: Token IDs for base LM
  • attention_mask: Attention mask for base LM
  • tts_lm_input_ids: Token IDs for TTS LM
  • tts_lm_attention_mask: Attention mask for TTS LM
  • tts_text_ids: Token IDs for TTS text input (to be streamed)
  • speech_tensors: Padded speech inputs (if voice samples provided)
  • speech_masks: Speech masks (if voice samples provided)
  • speech_input_mask: Boolean masks indicating speech token positions

prepare_speech_inputs

Prepare speech inputs for model consumption with proper padding.
speech_dict = processor.prepare_speech_inputs(
    speech_inputs=[audio_array],
    return_tensors="pt"
)
speech_inputs
List[np.ndarray]
required
List of speech arrays
return_tensors
str | TensorType
Output tensor type. Use "pt" for PyTorch
device
str | torch.device
Device to place tensors on
dtype
torch.dtype
Data type for tensors
dict
object
Dictionary with keys:
  • padded_speeches: Padded audio arrays
  • speech_masks: Boolean masks for valid speech regions
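The padding behavior can be sketched with a minimal pure-Python helper. `pad_speech_inputs` below is a hypothetical illustration of the shape of the result, not the library's implementation, which returns tensors rather than lists:

```python
def pad_speech_inputs(speech_inputs):
    """Zero-pad variable-length speech arrays to a common length.

    Sketch of the padding contract: `padded_speeches` are all padded to the
    longest clip, and `speech_masks` flag which samples are real (True)
    versus padding (False).
    """
    max_len = max(len(s) for s in speech_inputs)
    padded, masks = [], []
    for s in speech_inputs:
        pad = max_len - len(s)
        padded.append(list(s) + [0.0] * pad)           # zero-pad to max length
        masks.append([True] * len(s) + [False] * pad)  # True marks real samples
    return {"padded_speeches": padded, "speech_masks": masks}

batch = pad_speech_inputs([[0.1, 0.2, 0.3], [0.5]])
```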

save_audio

Save generated audio to a WAV file.
processor.save_audio(
    audio=outputs.speech_outputs[0],
    output_path="output.wav",
    sampling_rate=24000
)
audio
torch.Tensor | np.ndarray | List
required
Audio data to save. Can be a single tensor/array or list of them
output_path
str
default:"output.wav"
Path where the audio file will be saved
sampling_rate
int
Sampling rate for the audio. If None, uses processor’s default (24000 Hz)
normalize
bool
default:"False"
Whether to normalize audio before saving
batch_prefix
str
default:"audio_"
Prefix for batch audio files when saving multiple files
output_path
str
Path to the saved audio file

decode

Decode token IDs back to text.
text = processor.decode(token_ids, skip_special_tokens=True)
Forwards all arguments to the tokenizer’s decode() method.

batch_decode

Decode multiple sequences of token IDs.
texts = processor.batch_decode(token_ids_batch, skip_special_tokens=True)
Forwards all arguments to the tokenizer’s batch_decode() method.

Properties

model_input_names
List[str]
List of input names accepted by the model, combining tokenizer and audio processor inputs plus "speech_inputs" and "speech_input_mask"

Usage Example

from vibevoice import VibeVoiceStreamingProcessor
import torch

# Load processor
processor = VibeVoiceStreamingProcessor.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B"
)

# Load cached voice prompt
voice_prompt = torch.load("voice_prompt.pt")

# Process text with voice prompt
inputs = processor.process_input_with_cached_prompt(
    text="This is a test of VibeVoice.",
    cached_prompt=voice_prompt,
    padding=True,
    return_tensors="pt",
    return_attention_mask=True
)

# Inputs are ready for model.generate()
print(inputs.keys())
# dict_keys(['input_ids', 'attention_mask', 'tts_lm_input_ids', 
#            'tts_lm_attention_mask', 'tts_text_ids', 'speech_tensors',
#            'speech_masks', 'speech_input_mask'])

VibeVoiceTokenizerProcessor

Audio processor for VibeVoice acoustic tokenizer models. Handles audio preprocessing including format conversion and normalization.

Class Signature

class VibeVoiceTokenizerProcessor(FeatureExtractionMixin):
    def __init__(
        self,
        sampling_rate: int = 24000,
        normalize_audio: bool = True,
        target_dB_FS: float = -25,
        eps: float = 1e-6,
        **kwargs
    )

Initialization

sampling_rate
int
default:"24000"
Expected sampling rate for audio inputs
normalize_audio
bool
default:"True"
Whether to normalize audio to target dB FS level
target_dB_FS
float
default:"-25"
Target dB FS level for audio normalization
eps
float
default:"1e-6"
Small value for numerical stability in normalization

Methods

__call__

Process audio for VibeVoice models.
processed = audio_processor(
    audio=audio_array,
    sampling_rate=24000,
    return_tensors="pt"
)
audio
str | np.ndarray | List[float] | List[np.ndarray]
required
Audio input(s) to process. Can be:
  • Path to audio file (str)
  • NumPy array
  • List of floats
  • List of arrays (batch)
sampling_rate
int
Sampling rate of input audio. If None, uses processor’s default
return_tensors
str
Type of tensors to return ("pt" for PyTorch, "np" for NumPy)

Features

  • Stereo to Mono Conversion: Automatically converts stereo audio to mono
  • Audio Normalization: Normalizes audio to target dB FS level while avoiding clipping
  • Streaming Support: Designed to support infinite-length audio streams
  • Batch Processing: Can process multiple audio files simultaneously
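The stereo-to-mono conversion can be sketched as simple channel averaging. This is an assumption about the conversion strategy (averaging rather than, say, taking one channel), shown here as a standalone helper rather than the processor's actual code:

```python
def stereo_to_mono(stereo):
    """Collapse stereo audio to mono by averaging the two channels.

    `stereo` is a list of (left, right) sample pairs; averaging is an
    assumed strategy, not confirmed by this reference.
    """
    return [(left + right) / 2.0 for left, right in stereo]

mono = stereo_to_mono([(0.2, 0.4), (-0.5, 0.5)])
```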

AudioNormalizer

Helper class for audio normalization used by VibeVoiceTokenizerProcessor.
class AudioNormalizer:
    def __init__(self, target_dB_FS: float = -25, eps: float = 1e-6)
target_dB_FS
float
default:"-25"
Target dB FS level for audio normalization
eps
float
default:"1e-6"
Small value to avoid division by zero

Methods

  • tailor_dB_FS(audio): Adjust audio to the target dB FS level
  • avoid_clipping(audio, scalar): Prevent audio clipping by rescaling
  • __call__(audio): Apply the full normalization pipeline
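The dB FS normalization math can be sketched in pure Python. The functions below are illustrative re-creations of the methods above under stated assumptions (RMS-based level measurement, a 0.99 clipping ceiling), not the library's implementation:

```python
import math

def tailor_db_fs(audio, target_db_fs=-25.0, eps=1e-6):
    """Scale audio so its RMS level sits at `target_db_fs` dB FS."""
    rms = math.sqrt(sum(x * x for x in audio) / len(audio))
    # 10^(dB/20) converts the target dB FS level back to linear amplitude
    scalar = 10 ** (target_db_fs / 20) / (rms + eps)
    return [x * scalar for x in audio], scalar

def avoid_clipping(audio, eps=1e-6):
    """Rescale if any sample exceeds an assumed 0.99 ceiling."""
    peak = max(abs(x) for x in audio)
    if peak > 0.99:
        audio = [x * 0.99 / (peak + eps) for x in audio]
    return audio

# A loud 440 Hz tone at 24 kHz, normalized down to -25 dB FS
signal = [0.9 * math.sin(2 * math.pi * 440 * t / 24000) for t in range(2400)]
normalized, _ = tailor_db_fs(signal)
normalized = avoid_clipping(normalized)
rms_db = 20 * math.log10(math.sqrt(sum(x * x for x in normalized) / len(normalized)))
```

After normalization, `rms_db` lands at the -25 dB FS target.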

Usage Example

from vibevoice.processor import VibeVoiceTokenizerProcessor
import numpy as np

# Create processor
audio_processor = VibeVoiceTokenizerProcessor(
    sampling_rate=24000,
    normalize_audio=True,
    target_dB_FS=-25
)

# Load audio (example)
audio_array = np.random.randn(24000)  # 1 second at 24kHz

# Process audio
processed = audio_processor(
    audio=audio_array,
    return_tensors="pt"
)

print(processed['input_features'].shape)
