VibeVoiceStreamingProcessor
Processor that wraps a tokenizer and an audio processor into a single interface for VibeVoice streaming models.
Class Signature
Initialization
The tokenizer for text processing
The audio processor for speech processing
Compression ratio for speech tokenization (samples per token)
Whether to apply decibel normalization to audio inputs
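The compression ratio can be read as a simple conversion between audio length and speech-token count. The sketch below illustrates that arithmetic only; the helper name `speech_token_count`, the ratio of 3200 samples per token, and the choice to round up are assumptions for illustration, not values confirmed by this page. The 24000 Hz rate is the default mentioned under save_audio.

```python
import math


def speech_token_count(num_samples: int, compress_ratio: int) -> int:
    """Number of speech tokens needed to cover num_samples audio samples."""
    if compress_ratio <= 0:
        raise ValueError("compress_ratio must be positive")
    # Rounding up is an assumption: a partial trailing chunk still needs a token.
    return math.ceil(num_samples / compress_ratio)


# Assumed values for illustration only.
SAMPLING_RATE = 24_000   # samples per second
COMPRESS_RATIO = 3_200   # samples per speech token (hypothetical)

tokens_for_two_seconds = speech_token_count(2 * SAMPLING_RATE, COMPRESS_RATIO)
tokens_per_second = SAMPLING_RATE / COMPRESS_RATIO
```

Under these assumed numbers, a larger ratio means fewer speech tokens per second of audio and therefore shorter sequences for the language model.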
Methods
from_pretrained
Load a processor from a pretrained model directory.
Model identifier from huggingface.co/models or path to a local directory
Initialized processor instance with loaded tokenizer and audio processor
save_pretrained
Save the processor configuration to a directory.
Directory where the processor configuration will be saved
process_input_with_cached_prompt
Main method to process text input with a cached voice prompt. Currently supports single examples only.
The input text to process
Cached prompt dictionary containing the KV cache of the voice prompt. Must include the keys 'lm', 'tts_lm', 'neg_lm', and 'neg_tts_lm'
Whether to pad sequences to the same length
Whether to truncate sequences
Maximum length of returned sequences
Type of tensors to return. Use "pt" for PyTorch tensors
Whether to return attention masks
A BatchEncoding with the following fields:
- input_ids: Token IDs for base LM
- attention_mask: Attention mask for base LM
- tts_lm_input_ids: Token IDs for TTS LM
- tts_lm_attention_mask: Attention mask for TTS LM
- tts_text_ids: Token IDs for TTS text input (to be streamed)
- speech_tensors: Padded speech inputs (if voice samples provided)
- speech_masks: Speech masks (if voice samples provided)
- speech_input_mask: Boolean masks indicating speech token positions
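Because the cached prompt must carry four specific keys, it can help to check the dictionary before handing it to the processor. The following is a minimal sketch of such a check; `missing_cache_keys` is a hypothetical helper, not part of the library, and the placeholder objects stand in for real KV-cache entries.

```python
REQUIRED_CACHE_KEYS = ("lm", "tts_lm", "neg_lm", "neg_tts_lm")


def missing_cache_keys(cached_prompt: dict) -> list:
    """Return the required keys that are absent from cached_prompt."""
    return [key for key in REQUIRED_CACHE_KEYS if key not in cached_prompt]


# Placeholder values stand in for the real KV-cache objects.
complete = {key: object() for key in REQUIRED_CACHE_KEYS}
incomplete = {"lm": object(), "tts_lm": object()}

missing = missing_cache_keys(incomplete)
if missing:
    print(f"cached prompt is missing keys: {missing}")
```

Running a check like this first gives a clear error message instead of a failure deep inside the processing call.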
prepare_speech_inputs
Prepare speech inputs for model consumption with proper padding.
List of speech arrays
Output tensor type. Use "pt" for PyTorch
Device to place tensors on
Data type for tensors
Dictionary with keys:
- padded_speeches: Padded audio arrays
- speech_masks: Boolean masks for valid speech regions
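The padding-plus-mask idea can be sketched in plain Python. This is an illustration only, not the library's implementation: `pad_speeches` is a hypothetical name, right-padding with zeros is an assumption, and the masks here are per sample, whereas the real method may build them at speech-token resolution.

```python
def pad_speeches(speeches, pad_value=0.0):
    """Right-pad each waveform to the longest length and mark valid samples."""
    max_len = max((len(s) for s in speeches), default=0)
    padded, masks = [], []
    for s in speeches:
        pad = max_len - len(s)
        padded.append(list(s) + [pad_value] * pad)
        # True marks real audio, False marks padding.
        masks.append([True] * len(s) + [False] * pad)
    return {"padded_speeches": padded, "speech_masks": masks}


batch = pad_speeches([[0.1, 0.2, 0.3], [0.5]])
```

The mask lets downstream code ignore the padded tail of the shorter waveform.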
save_audio
Save generated audio to a WAV file.
Audio data to save. Can be a single tensor/array or a list of them
Path where the audio file will be saved
Sampling rate for the audio. If None, uses processor’s default (24000 Hz)
Whether to normalize audio before saving
Prefix for batch audio files when saving multiple files
Path to the saved audio file
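To make the behaviour concrete, here is a rough standard-library sketch of writing mono audio to a WAV file at the 24000 Hz default. It is not the processor's own code: `save_wav` is a hypothetical function, and both the 16-bit PCM format and the peak-based normalization are assumptions.

```python
import struct
import wave


def save_wav(path, samples, sampling_rate=24_000, normalize=False):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    samples = list(samples)
    if normalize:
        # Peak normalization (an assumption): scale so the loudest sample is 1.0.
        peak = max((abs(x) for x in samples), default=0.0)
        if peak > 0:
            samples = [x / peak for x in samples]
    # Clamp to the valid range, then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, x)) * 32767) for x in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return path
```

A call such as `save_wav("out.wav", samples)` returns the path it wrote to, mirroring the return value described above.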
decode
Decode token IDs back to text. See the tokenizer's decode() method.
batch_decode
Decode multiple sequences of token IDs. See the tokenizer's batch_decode() method.
Properties
List of input names accepted by the model, combining tokenizer and audio processor inputs plus "speech_inputs" and "speech_input_mask"
Usage Example
VibeVoiceTokenizerProcessor
Audio processor for VibeVoice acoustic tokenizer models. Handles audio preprocessing, including format conversion and normalization.
Class Signature
Initialization
Expected sampling rate for audio inputs
Whether to normalize audio to target dB FS level
Target dB FS level for audio normalization
Small value for numerical stability in normalization
Methods
__call__
Process audio for VibeVoice models.
Audio input(s) to process. Can be:
- Path to audio file (str)
- NumPy array
- List of floats
- List of arrays (batch)
Sampling rate of input audio. If None, uses processor’s default
Type of tensors to return ("pt" for PyTorch, "np" for NumPy)
Features
- Stereo to Mono Conversion: Automatically converts stereo audio to mono
- Audio Normalization: Normalizes audio to target dB FS level while avoiding clipping
- Streaming Support: Designed to support infinite-length audio streams
- Batch Processing: Can process multiple audio files simultaneously
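As an illustration of the stereo-to-mono step, the sketch below averages the channels of a channels-first signal. The function name `to_mono`, the channels-first layout, and averaging (rather than picking one channel) are all assumptions; this page does not say how the processor performs the conversion.

```python
def to_mono(audio):
    """Average channels of channels-first audio; pass mono through unchanged."""
    if audio and isinstance(audio[0], (list, tuple)):
        num_channels = len(audio)
        # zip(*audio) walks the channels in lockstep, one frame at a time.
        return [sum(frame) / num_channels for frame in zip(*audio)]
    return list(audio)


stereo = [[0.25, 0.5, -0.75], [0.25, 0.0, 0.25]]
mono = to_mono(stereo)
```

Mono input is returned as a plain list, so the same call can be applied to every incoming clip.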
AudioNormalizer
Helper class for audio normalization used by VibeVoiceTokenizerProcessor.
Target dB FS level for audio normalization
Small value to avoid division by zero
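One common way to normalize to a dB FS target while avoiding clipping is to scale by the RMS level and then rescale if the peak exceeds 1.0. The sketch below shows that approach under stated assumptions: the function names, the RMS-based level measurement, the -25 dB FS default, and the 1e-6 epsilon are illustrative choices, not values confirmed by this page.

```python
import math


def normalize_db_fs(audio, target_db_fs=-25.0, eps=1e-6):
    """Scale audio so its RMS level sits at target_db_fs, then guard against clipping."""
    if not audio:
        return []
    rms = math.sqrt(sum(x * x for x in audio) / len(audio))
    # eps keeps the division finite for silent input.
    gain = 10 ** (target_db_fs / 20) / (rms + eps)
    scaled = [x * gain for x in audio]
    peak = max(abs(x) for x in scaled)
    if peak > 1.0:
        # Rescale so the loudest sample stays just inside [-1, 1].
        scaled = [x / (peak + eps) for x in scaled]
    return scaled


def rms_db_fs(audio):
    """RMS level of non-silent audio in dB relative to full scale."""
    rms = math.sqrt(sum(x * x for x in audio) / len(audio))
    return 20 * math.log10(rms)
```

When the clipping guard triggers, the output ends up quieter than the requested target, which is the trade-off for keeping every sample within range.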