
VibeVoiceStreamingProcessor

Processor that wraps a text tokenizer and an audio processor into a single interface for VibeVoice streaming models.

Class Signature

class VibeVoiceStreamingProcessor:
    def __init__(
        self,
        tokenizer=None,
        audio_processor=None,
        speech_tok_compress_ratio=3200,
        db_normalize=True,
        **kwargs
    )

Initialization

tokenizer
VibeVoiceTextTokenizer | VibeVoiceTextTokenizerFast
The tokenizer for text processing
audio_processor
VibeVoiceTokenizerProcessor
The audio processor for speech processing
speech_tok_compress_ratio
int
default:"3200"
Compression ratio for speech tokenization (samples per token)
db_normalize
bool
default:"True"
Whether to apply decibel normalization to audio inputs
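At the default compression ratio of 3200 samples per token and 24 kHz audio, each speech token covers roughly 133 ms. The token-count arithmetic can be sketched as follows; this is an illustrative sketch, and the ceiling rounding for trailing partial frames is an assumption, not something stated by this reference:

```python
import math

def num_speech_tokens(num_samples: int, compress_ratio: int = 3200) -> int:
    """Estimate speech token count for an audio clip.

    Each speech token covers `compress_ratio` audio samples; we round up
    (an assumption) so a trailing partial frame still gets a token.
    """
    return math.ceil(num_samples / compress_ratio)

# One second at 24 kHz: 24000 / 3200 = 7.5, rounded up to 8 tokens
tokens = num_speech_tokens(24000)
```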

Methods

from_pretrained

Load processor from a pretrained model directory.
from vibevoice import VibeVoiceStreamingProcessor

processor = VibeVoiceStreamingProcessor.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B"
)
pretrained_model_name_or_path
str
required
Model identifier from huggingface.co/models or path to local directory
processor
VibeVoiceStreamingProcessor
Initialized processor instance with loaded tokenizer and audio processor

save_pretrained

Save processor configuration to a directory.
processor.save_pretrained("./my_processor")
save_directory
str | os.PathLike
required
Directory where the processor configuration will be saved

process_input_with_cached_prompt

Main method to process text input with a cached voice prompt. Currently supports single examples only.
inputs = processor.process_input_with_cached_prompt(
    text="Hello world!",
    cached_prompt=voice_prompt,
    padding=True,
    return_tensors="pt"
)
text
str
required
The input text to process
cached_prompt
Dict[str, Any]
required
Cached prompt dictionary containing KV cache of the voice prompt. Must include keys: 'lm', 'tts_lm', 'neg_lm', 'neg_tts_lm'
padding
bool | str | PaddingStrategy
default:"True"
Whether to pad sequences to the same length
truncation
bool | str | TruncationStrategy
default:"False"
Whether to truncate sequences
max_length
int
Maximum length of returned sequences
return_tensors
str | TensorType
Type of tensors to return. Use "pt" for PyTorch tensors
return_attention_mask
bool
default:"True"
Whether to return attention masks
BatchEncoding
object
A BatchEncoding with the following fields:
  • input_ids: Token IDs for base LM
  • attention_mask: Attention mask for base LM
  • tts_lm_input_ids: Token IDs for TTS LM
  • tts_lm_attention_mask: Attention mask for TTS LM
  • tts_text_ids: Token IDs for TTS text input (to be streamed)
  • speech_tensors: Padded speech inputs (if voice samples provided)
  • speech_masks: Speech masks (if voice samples provided)
  • speech_input_mask: Boolean masks indicating speech token positions

prepare_speech_inputs

Prepare speech inputs for model consumption with proper padding.
speech_dict = processor.prepare_speech_inputs(
    speech_inputs=[audio_array],
    return_tensors="pt"
)
speech_inputs
List[np.ndarray]
required
List of speech arrays
return_tensors
str | TensorType
Output tensor type. Use "pt" for PyTorch
device
str | torch.device
Device to place tensors on
dtype
torch.dtype
Data type for tensors
dict
object
Dictionary with keys:
  • padded_speeches: Padded audio arrays
  • speech_masks: Boolean masks for valid speech regions
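The padding behavior can be sketched with a minimal pure-Python helper. `pad_speech_inputs` below is a hypothetical illustration of the shape of the result, not the library's implementation, which returns tensors rather than lists:

```python
def pad_speech_inputs(speech_inputs):
    """Zero-pad variable-length speech arrays to a common length.

    Sketch of the padding contract: `padded_speeches` are all padded to the
    longest clip, and `speech_masks` flag which samples are real (True)
    versus padding (False).
    """
    max_len = max(len(s) for s in speech_inputs)
    padded, masks = [], []
    for s in speech_inputs:
        pad = max_len - len(s)
        padded.append(list(s) + [0.0] * pad)           # zero-pad to max length
        masks.append([True] * len(s) + [False] * pad)  # True marks real samples
    return {"padded_speeches": padded, "speech_masks": masks}

batch = pad_speech_inputs([[0.1, 0.2, 0.3], [0.5]])
```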

save_audio

Save generated audio to a WAV file.
processor.save_audio(
    audio=outputs.speech_outputs[0],
    output_path="output.wav",
    sampling_rate=24000
)
audio
torch.Tensor | np.ndarray | List
required
Audio data to save. Can be a single tensor/array or list of them
output_path
str
default:"output.wav"
Path where the audio file will be saved
sampling_rate
int
Sampling rate for the audio. If None, uses processor’s default (24000 Hz)
normalize
bool
default:"False"
Whether to normalize audio before saving
batch_prefix
str
default:"audio_"
Prefix for batch audio files when saving multiple files
output_path
str
Path to the saved audio file

decode

Decode token IDs back to text.
text = processor.decode(token_ids, skip_special_tokens=True)
Forwards all arguments to the tokenizer’s decode() method.

batch_decode

Decode multiple sequences of token IDs.
texts = processor.batch_decode(token_ids_batch, skip_special_tokens=True)
Forwards all arguments to the tokenizer’s batch_decode() method.

Properties

model_input_names
List[str]
List of input names accepted by the model, combining tokenizer and audio processor inputs plus "speech_inputs" and "speech_input_mask"

Usage Example

from vibevoice import VibeVoiceStreamingProcessor
import torch

# Load processor
processor = VibeVoiceStreamingProcessor.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B"
)

# Load cached voice prompt
voice_prompt = torch.load("voice_prompt.pt")

# Process text with voice prompt
inputs = processor.process_input_with_cached_prompt(
    text="This is a test of VibeVoice.",
    cached_prompt=voice_prompt,
    padding=True,
    return_tensors="pt",
    return_attention_mask=True
)

# Inputs are ready for model.generate()
print(inputs.keys())
# dict_keys(['input_ids', 'attention_mask', 'tts_lm_input_ids', 
#            'tts_lm_attention_mask', 'tts_text_ids', 'speech_tensors',
#            'speech_masks', 'speech_input_mask'])

VibeVoiceTokenizerProcessor

Audio processor for VibeVoice acoustic tokenizer models. Handles audio preprocessing including format conversion and normalization.

Class Signature

class VibeVoiceTokenizerProcessor(FeatureExtractionMixin):
    def __init__(
        self,
        sampling_rate: int = 24000,
        normalize_audio: bool = True,
        target_dB_FS: float = -25,
        eps: float = 1e-6,
        **kwargs
    )

Initialization

sampling_rate
int
default:"24000"
Expected sampling rate for audio inputs
normalize_audio
bool
default:"True"
Whether to normalize audio to target dB FS level
target_dB_FS
float
default:"-25"
Target dB FS level for audio normalization
eps
float
default:"1e-6"
Small value for numerical stability in normalization

Methods

__call__

Process audio for VibeVoice models.
processed = audio_processor(
    audio=audio_array,
    sampling_rate=24000,
    return_tensors="pt"
)
audio
str | np.ndarray | List[float] | List[np.ndarray]
required
Audio input(s) to process. Can be:
  • Path to audio file (str)
  • NumPy array
  • List of floats
  • List of arrays (batch)
sampling_rate
int
Sampling rate of input audio. If None, uses processor’s default
return_tensors
str
Type of tensors to return ("pt" for PyTorch, "np" for NumPy)

Features

  • Stereo to Mono Conversion: Automatically converts stereo audio to mono
  • Audio Normalization: Normalizes audio to target dB FS level while avoiding clipping
  • Streaming Support: Designed to support infinite-length audio streams
  • Batch Processing: Can process multiple audio files simultaneously
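The stereo-to-mono conversion can be sketched as simple channel averaging. This is an assumption about the conversion strategy (averaging rather than, say, taking one channel), shown here as a standalone helper rather than the processor's actual code:

```python
def stereo_to_mono(stereo):
    """Collapse stereo audio to mono by averaging the two channels.

    `stereo` is a list of (left, right) sample pairs; averaging is an
    assumed strategy, not confirmed by this reference.
    """
    return [(left + right) / 2.0 for left, right in stereo]

mono = stereo_to_mono([(0.2, 0.4), (-0.5, 0.5)])
```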

AudioNormalizer

Helper class for audio normalization used by VibeVoiceTokenizerProcessor.
class AudioNormalizer:
    def __init__(self, target_dB_FS: float = -25, eps: float = 1e-6)
target_dB_FS
float
default:"-25"
Target dB FS level for audio normalization
eps
float
default:"1e-6"
Small value to avoid division by zero

Methods

  • tailor_dB_FS(audio): Adjust audio to the target dB FS level
  • avoid_clipping(audio, scalar): Prevent audio clipping by rescaling
  • __call__(audio): Apply the full normalization pipeline
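The dB FS normalization math can be sketched in pure Python. The functions below are illustrative re-creations of the methods above under stated assumptions (RMS-based level measurement, a 0.99 clipping ceiling), not the library's implementation:

```python
import math

def tailor_db_fs(audio, target_db_fs=-25.0, eps=1e-6):
    """Scale audio so its RMS level sits at `target_db_fs` dB FS."""
    rms = math.sqrt(sum(x * x for x in audio) / len(audio))
    # 10^(dB/20) converts the target dB FS level back to linear amplitude
    scalar = 10 ** (target_db_fs / 20) / (rms + eps)
    return [x * scalar for x in audio], scalar

def avoid_clipping(audio, eps=1e-6):
    """Rescale if any sample exceeds an assumed 0.99 ceiling."""
    peak = max(abs(x) for x in audio)
    if peak > 0.99:
        audio = [x * 0.99 / (peak + eps) for x in audio]
    return audio

# A loud 440 Hz tone at 24 kHz, normalized down to -25 dB FS
signal = [0.9 * math.sin(2 * math.pi * 440 * t / 24000) for t in range(2400)]
normalized, _ = tailor_db_fs(signal)
normalized = avoid_clipping(normalized)
rms_db = 20 * math.log10(math.sqrt(sum(x * x for x in normalized) / len(normalized)))
```

After normalization, `rms_db` lands at the -25 dB FS target.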

Usage Example

from vibevoice.processor import VibeVoiceTokenizerProcessor
import numpy as np

# Create processor
audio_processor = VibeVoiceTokenizerProcessor(
    sampling_rate=24000,
    normalize_audio=True,
    target_dB_FS=-25
)

# Load audio (example)
audio_array = np.random.randn(24000)  # 1 second at 24kHz

# Process audio
processed = audio_processor(
    audio=audio_array,
    return_tensors="pt"
)

print(processed['input_features'].shape)
