This guide covers advanced configuration options for VibeVoice, including model settings, generation parameters, and performance tuning.
Model Configuration
VibeVoiceStreamingConfig
The main configuration class for VibeVoice streaming models:
from vibevoice import VibeVoiceStreamingConfig

config = VibeVoiceStreamingConfig(
    acoustic_tokenizer_config=None,    # Uses defaults
    decoder_config=None,               # Uses defaults
    diffusion_head_config=None,        # Uses defaults
    tts_backbone_num_hidden_layers=20
)
Configuration Components
The configuration is composed of three sub-configurations:
- Acoustic Tokenizer: handles audio encoding and decoding with a VAE architecture
- Decoder: language model backbone (Qwen2) for text and speech processing
- Diffusion Head: diffusion-based audio generation component
Processor Configuration
VibeVoiceStreamingProcessor
The processor handles text and audio preprocessing:
from vibevoice import VibeVoiceStreamingProcessor

processor = VibeVoiceStreamingProcessor(
    tokenizer=tokenizer,
    audio_processor=audio_processor,
    speech_tok_compress_ratio=3200,
    db_normalize=True
)
speech_tok_compress_ratio
Compression ratio for speech tokenization. Determines how many audio samples map to one speech token.
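For intuition, the default ratio of 3200 at the default 24 kHz sampling rate works out to 7.5 speech tokens per second of audio:

sampling_rate = 24000              # Hz (VibeVoice default)
speech_tok_compress_ratio = 3200   # audio samples per speech token
tokens_per_second = sampling_rate / speech_tok_compress_ratio  # 7.5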
db_normalize
Whether to apply decibel normalization to audio inputs for consistent volume levels.
Audio Processor Settings
from vibevoice import VibeVoiceTokenizerProcessor

audio_processor = VibeVoiceTokenizerProcessor(
    sampling_rate=24000,
    normalize_audio=True,
    target_dB_FS=-25,
    eps=1e-6
)
sampling_rate
Audio sampling rate in Hz. VibeVoice uses 24 kHz by default.
normalize_audio
Enable audio normalization to the target dB level.
target_dB_FS
Target level in decibels relative to full scale (dB FS) for audio normalization. Controls output volume.
eps
Small epsilon value for numerical stability in normalization.
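For intuition, here is a minimal sketch of what RMS-based dB FS normalization typically looks like, assuming the level is measured as RMS relative to full scale (the processor's exact implementation may differ):

import numpy as np

def normalize_to_db_fs(audio, target_db_fs=-25.0, eps=1e-6):
    # Current level in dB relative to full scale, measured as RMS.
    rms = np.sqrt(np.mean(audio ** 2))
    current_db_fs = 20.0 * np.log10(rms + eps)  # eps avoids log(0)
    # Linear gain that shifts the RMS level to the target.
    gain = 10.0 ** ((target_db_fs - current_db_fs) / 20.0)
    return audio * gain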
Generation Parameters
Diffusion Inference Steps
Control the quality/speed tradeoff:
model.set_ddpm_inference_steps(num_steps=5)

Fast (lower quality): model.set_ddpm_inference_steps(num_steps=3)
Balanced (default): model.set_ddpm_inference_steps(num_steps=5)
High quality (slower): model.set_ddpm_inference_steps(num_steps=10)
More steps generally improve quality but increase generation latency. 5 steps provides a good balance for real-time applications.
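One way to choose a step count is to time generation at each setting. This sketch assumes model and inputs are prepared as in the earlier examples:

import time

for steps in (3, 5, 10):
    model.set_ddpm_inference_steps(num_steps=steps)
    start = time.perf_counter()
    outputs = model.generate(**inputs)
    print(f"{steps} steps: {time.perf_counter() - start:.2f}s")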
Noise Scheduler Configuration
Customize the diffusion noise scheduler:
model.model.noise_scheduler = model.model.noise_scheduler.from_config(
    model.model.noise_scheduler.config,
    algorithm_type="sde-dpmsolver++",
    beta_schedule="squaredcos_cap_v2"
)
algorithm_type (string, default: "sde-dpmsolver++")
Diffusion solver algorithm. Options include sde-dpmsolver++, dpmsolver, and euler.
beta_schedule (string, default: "squaredcos_cap_v2")
Noise schedule for the diffusion process. Affects generation characteristics.
CFG Scale (Classifier-Free Guidance)
outputs = model.generate(
    **inputs,
    cfg_scale=1.5,  # Default
    # ...
)
CFG Scale | Effect             | Use Case
1.0       | Minimal guidance   | More creative, diverse outputs
1.5       | Balanced (default) | General purpose
2.0       | Strong guidance    | Higher prompt adherence
2.5+      | Very strong        | Maximum control, may reduce naturalness
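For reference, classifier-free guidance conventionally blends a conditional and an unconditional prediction as below; this is the standard formulation, not necessarily VibeVoice's internal code:

import torch

def apply_cfg(cond_pred: torch.Tensor, uncond_pred: torch.Tensor,
              cfg_scale: float) -> torch.Tensor:
    # cfg_scale=1.0 reduces to the conditional prediction;
    # larger values push the output harder toward the prompt.
    return uncond_pred + cfg_scale * (cond_pred - uncond_pred)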
Sampling Parameters
outputs = model.generate(
    **inputs,
    generation_config={
        'do_sample': True,    # Enable sampling
        'temperature': 0.9,   # Randomness (0.0-1.0)
        'top_p': 0.9,         # Nucleus sampling threshold
    },
    # ...
)
do_sample
Whether to use sampling or greedy decoding. Use False for deterministic output.
temperature
Sampling temperature. Higher values increase randomness. Only used if do_sample=True.
top_p
Nucleus sampling threshold: sampling is restricted to the smallest set of tokens whose cumulative probability exceeds top_p. Only used if do_sample=True.
For deterministic, reproducible output, use do_sample=False (default).
Device-Specific Optimization
CUDA Configuration
import torch

device = "cuda"
load_dtype = torch.bfloat16
attn_implementation = "flash_attention_2"

model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    model_path,
    torch_dtype=load_dtype,
    device_map="cuda",
    attn_implementation=attn_implementation
)
For NVIDIA GPUs with compute capability ≥ 8.0 (A100, RTX 30/40 series), use flash_attention_2 for optimal performance.
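If you want to select the implementation automatically, a small helper (a sketch using PyTorch's device-capability query) can encode this rule:

import torch

def pick_attn_implementation() -> str:
    # Compute capability >= 8.0 (Ampere or newer) supports flash_attention_2.
    if torch.cuda.is_available():
        major, _minor = torch.cuda.get_device_capability()
        if major >= 8:
            return "flash_attention_2"
    return "sdpa"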
MPS Configuration (Apple Silicon)
import torch

device = "mps"
load_dtype = torch.float32    # MPS requires float32
attn_implementation = "sdpa"  # flash_attention_2 not supported

model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    model_path,
    torch_dtype=load_dtype,
    attn_implementation=attn_implementation,
    device_map=None  # Don't use device_map with MPS
)
model.to("mps")
MPS requires float32 dtype. Using bfloat16 will cause errors.
CPU Configuration
import torch

device = "cpu"
load_dtype = torch.float32
attn_implementation = "sdpa"

model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    model_path,
    torch_dtype=load_dtype,
    device_map="cpu",
    attn_implementation=attn_implementation
)
Attention Implementation
Flash Attention 2 (Recommended)
model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    model_path,
    attn_implementation="flash_attention_2"
)
Benefits:
- Faster inference
- Lower memory usage
- Better audio quality (fully tested)
Requirements:
- CUDA-compatible GPU
- Flash Attention installed: pip install flash-attn --no-build-isolation
SDPA (Scaled Dot-Product Attention)
model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    model_path,
    attn_implementation="sdpa"
)
Use Cases:
- CPU inference
- MPS (Apple Silicon)
- Systems without Flash Attention support
SDPA is the automatic fallback if Flash Attention 2 fails to load.
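If you prefer an explicit fallback over the automatic one, a try/except around loading is a straightforward pattern (a sketch; narrow the exception type to whatever your transformers version raises):

try:
    model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
        model_path, attn_implementation="flash_attention_2"
    )
except Exception:
    # Fall back to SDPA when flash-attn is missing or unsupported.
    model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
        model_path, attn_implementation="sdpa"
    )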
TTS Backbone Configuration
Layer Partitioning
The decoder is divided into text encoding and TTS layers:
config = VibeVoiceStreamingConfig(
    tts_backbone_num_hidden_layers=20
)
tts_backbone_num_hidden_layers
Number of upper Transformer layers used for TTS; the remaining lower layers perform text-only encoding.
Architecture:
- Lower layers: text encoding only
- Upper tts_backbone_num_hidden_layers layers: text + speech generation
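As a worked example, assuming a hypothetical 28-layer decoder (the real depth depends on the checkpoint):

num_hidden_layers = 28               # assumed total decoder depth
tts_backbone_num_hidden_layers = 20  # upper layers used for TTS
text_only_layers = num_hidden_layers - tts_backbone_num_hidden_layers
print(text_only_layers)              # 8 lower layers encode text only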
Streaming Configuration
Audio Streamer Settings
from vibevoice.modular.streamer import AudioStreamer
audio_streamer = AudioStreamer(
    batch_size=1,
    stop_signal=None,
    timeout=None
)
Stop Event Handling
import threading

stop_event = threading.Event()

outputs = model.generate(
    **inputs,
    stop_check_fn=stop_event.is_set,
    # ...
)

# To stop generation:
stop_event.set()
Use stop events to implement user-cancellable generation in interactive applications.
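In practice, generation runs on a worker thread so the main thread stays free to set the event. A minimal sketch, assuming model and inputs from the earlier examples:

import threading

stop_event = threading.Event()

def run_generation():
    model.generate(**inputs, stop_check_fn=stop_event.is_set)

worker = threading.Thread(target=run_generation)
worker.start()
# ... later, e.g. when the user clicks "cancel":
stop_event.set()
worker.join()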
Advanced Generation Options
Refresh Negative Prompt
outputs = model.generate(
    **inputs,
    refresh_negative=True,  # Regenerate negative prompt each time
    # ...
)
refresh_negative
Whether to regenerate the negative prompt for CFG on each call. Set to False for faster repeated generations.
Verbose Output
outputs = model.generate(
    **inputs,
    verbose=True,  # Print generation progress
    # ...
)
Max New Tokens
outputs = model.generate(
    **inputs,
    max_new_tokens=None,  # Generate until a natural stopping point
    # max_new_tokens=2000,  # Or cap the token count
    # ...
)
Setting max_new_tokens=None lets the model determine the appropriate length based on input text.
Saving and Loading Configurations
Save Processor Configuration
processor.save_pretrained("./my_processor")
This saves preprocessor_config.json with all processor settings.
Load Custom Configuration
processor = VibeVoiceStreamingProcessor.from_pretrained("./my_processor")
Configuration File Example
{
  "processor_class": "VibeVoiceStreamingProcessor",
  "speech_tok_compress_ratio": 3200,
  "db_normalize": true,
  "audio_processor": {
    "feature_extractor_type": "VibeVoiceTokenizerProcessor",
    "sampling_rate": 24000,
    "normalize_audio": true,
    "target_dB_FS": -25,
    "eps": 1e-6
  }
}
Performance Tuning
Optimize for Latency
# Minimal inference steps
model.set_ddpm_inference_steps(num_steps=3)

# Disable sampling
generation_config = {'do_sample': False}

# Lower CFG scale
cfg_scale = 1.0
Optimize for Quality
# More inference steps
model.set_ddpm_inference_steps(num_steps=10)

# Enable sampling with moderate temperature
generation_config = {
    'do_sample': True,
    'temperature': 0.7,
    'top_p': 0.9
}

# Higher CFG scale
cfg_scale = 2.0
Optimize for Memory
# Use gradient checkpointing (if training)
model.gradient_checkpointing_enable()
# Use smaller batch sizes
batch_size = 1
# Clear CUDA cache between generations
import torch
torch.cuda.empty_cache()
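Running generation under torch.inference_mode() also skips autograd bookkeeping; a sketch combining it with cache clearing:

import torch

with torch.inference_mode():   # no autograd state is recorded
    outputs = model.generate(**inputs)
torch.cuda.empty_cache()       # release cached blocks afterwards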
Troubleshooting
Out of Memory
Reduce Inference Steps
model.set_ddpm_inference_steps(num_steps=3)
Use Lower Precision
# Use float16 instead of bfloat16 (if supported)
torch_dtype = torch.float16
Slow Generation
- Install Flash Attention 2 for CUDA GPUs
- Reduce num_steps in diffusion inference
- Use do_sample=False for deterministic generation
- Ensure the model is on the GPU, not the CPU (see the check below)
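To confirm the model really is on the GPU (the last point above), check a parameter's device:

print(next(model.parameters()).device)  # expect cuda:0, not cpu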
Poor Audio Quality
- Increase diffusion inference steps to 7-10
- Ensure Flash Attention 2 is being used (check the logs)
- Adjust the CFG scale (try the 1.5-2.0 range)
- Verify the voice prompt is appropriate for the target language