This guide covers advanced configuration options for VibeVoice, including model settings, generation parameters, and performance tuning.

Model Configuration

VibeVoiceStreamingConfig

The main configuration class for VibeVoice streaming models:
from vibevoice import VibeVoiceStreamingConfig

config = VibeVoiceStreamingConfig(
    acoustic_tokenizer_config=None,  # Uses defaults
    decoder_config=None,              # Uses defaults
    diffusion_head_config=None,       # Uses defaults
    tts_backbone_num_hidden_layers=20
)

Configuration Components

The configuration is composed of three sub-configurations:

  • Acoustic Tokenizer: handles audio encoding and decoding with a VAE architecture
  • Decoder: language model backbone (Qwen2) for text and speech processing
  • Diffusion Head: diffusion-based audio generation component
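
A minimal sketch of inspecting the sub-configurations, assuming they are exposed as attributes named after the constructor arguments above:
from vibevoice import VibeVoiceStreamingConfig

config = VibeVoiceStreamingConfig()

# Attribute names below are assumptions mirroring the constructor arguments.
print(config.tts_backbone_num_hidden_layers)   # 20
print(type(config.acoustic_tokenizer_config))  # acoustic tokenizer sub-config
print(type(config.decoder_config))             # decoder sub-config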

Processor Configuration

VibeVoiceStreamingProcessor

The processor handles text and audio preprocessing:
from vibevoice import VibeVoiceStreamingProcessor

processor = VibeVoiceStreamingProcessor(
    tokenizer=tokenizer,
    audio_processor=audio_processor,
    speech_tok_compress_ratio=3200,
    db_normalize=True
)
speech_tok_compress_ratio
integer
default:"3200"
Compression ratio for speech tokenization. Determines how many audio samples map to one speech token.
db_normalize
boolean
default:"true"
Whether to apply decibel normalization to audio inputs for consistent volume levels.
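
To make the compression ratio concrete: at the default 24 kHz sampling rate, a ratio of 3200 means one speech token per 3200 audio samples, or 7.5 tokens per second of audio:
sampling_rate = 24000             # Hz
speech_tok_compress_ratio = 3200  # audio samples per speech token

tokens_per_second = sampling_rate / speech_tok_compress_ratio
print(tokens_per_second)  # 7.5, so a 10-second clip spans roughly 75 speech tokens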

Audio Processor Settings

from vibevoice import VibeVoiceTokenizerProcessor

audio_processor = VibeVoiceTokenizerProcessor(
    sampling_rate=24000,
    normalize_audio=True,
    target_dB_FS=-25,
    eps=1e-6
)
sampling_rate
integer
default:"24000"
Audio sampling rate in Hz. VibeVoice uses 24kHz by default.
normalize_audio
boolean
default:"true"
Enable audio normalization to target dB level.
target_dB_FS
float
default:"-25"
Target decibel Full Scale for audio normalization. Controls output volume.
eps
float
default:"1e-6"
Small epsilon value for numerical stability in normalization.
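
Conceptually, dB FS normalization rescales the waveform so its RMS level sits at target_dB_FS, with eps guarding the logarithm on near-silent input. A minimal sketch of the idea, not the library's exact implementation:
import numpy as np

def normalize_to_db_fs(audio, target_db_fs=-25.0, eps=1e-6):
    # Current RMS level in dB relative to full scale (1.0)
    rms = np.sqrt(np.mean(audio ** 2))
    current_db_fs = 20 * np.log10(rms + eps)
    # Linear gain that moves the signal to the target level
    gain = 10 ** ((target_db_fs - current_db_fs) / 20)
    return audio * gain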

Generation Parameters

Diffusion Inference Steps

Control the quality/speed tradeoff:
model.set_ddpm_inference_steps(num_steps=5)  # balanced quality and latency
model.set_ddpm_inference_steps(num_steps=3)  # lower latency, slightly lower quality
More steps generally improve quality but increase generation latency; five steps is a good balance for real-time applications.

Noise Scheduler Configuration

Customize the diffusion noise scheduler:
model.model.noise_scheduler = model.model.noise_scheduler.from_config(
    model.model.noise_scheduler.config,
    algorithm_type="sde-dpmsolver++",
    beta_schedule="squaredcos_cap_v2"
)
algorithm_type
string
default:"sde-dpmsolver++"
Diffusion solver algorithm. Options include sde-dpmsolver++, dpmsolver, euler.
beta_schedule
string
default:"squaredcos_cap_v2"
Noise schedule for the diffusion process. Affects generation characteristics.

CFG Scale (Classifier-Free Guidance)

outputs = model.generate(
    **inputs,
    cfg_scale=1.5,  # Default
    # ...
)
CFG Scale | Effect             | Use Case
1.0       | Minimal guidance   | More creative, diverse outputs
1.5       | Balanced (default) | General purpose
2.0       | Strong guidance    | Higher prompt adherence
2.5+      | Very strong        | Maximum control, may reduce naturalness
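
Conceptually, classifier-free guidance combines a conditional prediction and an unconditional (negative-prompt) prediction at each diffusion step, and cfg_scale sets how far the result is pushed toward the conditional branch. An illustrative sketch, not VibeVoice's internal code:
def apply_cfg(cond_pred, uncond_pred, cfg_scale):
    # cfg_scale = 1.0 reduces to the conditional prediction alone;
    # larger values extrapolate further from the unconditional branch.
    return uncond_pred + cfg_scale * (cond_pred - uncond_pred)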

Sampling Parameters

outputs = model.generate(
    **inputs,
    generation_config={
        'do_sample': True,        # Enable sampling
        'temperature': 0.9,       # Randomness (0.0-1.0)
        'top_p': 0.9,            # Nucleus sampling threshold
    },
    # ...
)
do_sample
boolean
default:"false"
Whether to use sampling or greedy decoding. False for deterministic output.
temperature
float
default:"0.9"
Sampling temperature. Higher values increase randomness. Only used if do_sample=True.
top_p
float
default:"0.9"
Nucleus sampling threshold. Sampling is restricted to the smallest set of tokens whose cumulative probability exceeds top_p. Only used if do_sample=True.
For deterministic, reproducible output, use do_sample=False (default).
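
For intuition about top_p, nucleus sampling keeps the smallest set of tokens whose cumulative probability exceeds the threshold, renormalizes within that set, and samples from it. An illustrative sketch:
import numpy as np

def nucleus_sample(probs, top_p=0.9, seed=None):
    rng = np.random.default_rng(seed)
    order = np.argsort(probs)[::-1]                  # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # smallest set exceeding top_p
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()     # renormalize within the nucleus
    return rng.choice(kept, p=kept_probs)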

Device-Specific Optimization

CUDA Configuration

import torch

device = "cuda"
load_dtype = torch.bfloat16
attn_implementation = "flash_attention_2"

model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    model_path,
    torch_dtype=load_dtype,
    device_map="cuda",
    attn_implementation=attn_implementation
)
For NVIDIA GPUs with compute capability ≥ 8.0 (A100, RTX 30/40 series), use flash_attention_2 for optimal performance.
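
You can choose the attention implementation at runtime by checking the GPU's compute capability, since Flash Attention 2 requires 8.0 or newer:
import torch

if torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 0):
    attn_implementation = "flash_attention_2"
else:
    attn_implementation = "sdpa"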

MPS Configuration (Apple Silicon)

import torch

device = "mps"
load_dtype = torch.float32  # MPS requires float32
attn_implementation = "sdpa"  # flash_attention_2 not supported

model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    model_path,
    torch_dtype=load_dtype,
    attn_implementation=attn_implementation,
    device_map=None  # Don't use device_map with MPS
)
model.to("mps")
MPS requires float32 dtype. Using bfloat16 will cause errors.

CPU Configuration

import torch

device = "cpu"
load_dtype = torch.float32
attn_implementation = "sdpa"

model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    model_path,
    torch_dtype=load_dtype,
    device_map="cpu",
    attn_implementation=attn_implementation
)
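
The three recipes above collapse into one helper that picks a device, dtype, and attention implementation together:
import torch

def pick_device_settings():
    if torch.cuda.is_available():
        return "cuda", torch.bfloat16, "flash_attention_2"
    if torch.backends.mps.is_available():
        return "mps", torch.float32, "sdpa"  # MPS requires float32
    return "cpu", torch.float32, "sdpa"

device, load_dtype, attn_implementation = pick_device_settings()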

Attention Implementation

Flash Attention 2

model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    model_path,
    attn_implementation="flash_attention_2"
)
Benefits:
  • Faster inference
  • Lower memory usage
  • Better audio quality (the fully tested configuration)
Requirements:
  • CUDA-compatible GPU
  • Flash Attention installed: pip install flash-attn --no-build-isolation

SDPA (Scaled Dot-Product Attention)

model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    model_path,
    attn_implementation="sdpa"
)
Use Cases:
  • CPU inference
  • MPS (Apple Silicon)
  • Systems without Flash Attention support
SDPA is the automatic fallback if Flash Attention 2 fails to load.
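
To make the fallback explicit rather than automatic, attempt Flash Attention 2 and drop to SDPA on failure; the exact exception raised when flash-attn is missing depends on your transformers version:
try:
    model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
        model_path,
        attn_implementation="flash_attention_2"
    )
except (ImportError, ValueError):
    # flash-attn not installed, or unsupported on this hardware
    model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
        model_path,
        attn_implementation="sdpa"
    )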

TTS Backbone Configuration

Layer Partitioning

The decoder is divided into text encoding and TTS layers:
config = VibeVoiceStreamingConfig(
    tts_backbone_num_hidden_layers=20
)
tts_backbone_num_hidden_layers
integer
default:"20"
Number of upper Transformer layers used for TTS. Lower layers are text-only encoding.
Architecture:
  • Lower layers: text encoding only
  • Upper tts_backbone_num_hidden_layers layers: text and speech generation
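
As a worked example with a hypothetical 28-layer decoder (check your model's decoder config for the real depth), the default setting leaves the lower 8 layers text-only:
total_decoder_layers = 28  # hypothetical depth, not a VibeVoice constant
tts_layers = 20            # tts_backbone_num_hidden_layers
text_only_layers = total_decoder_layers - tts_layers
print(text_only_layers)    # 8 lower layers handle text encoding only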

Streaming Configuration

Audio Streamer Settings

from vibevoice.modular.streamer import AudioStreamer

audio_streamer = AudioStreamer(
    batch_size=1,
    stop_signal=None,
    timeout=None
)
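
In practice, generation runs on a background thread while the main thread consumes audio chunks as they arrive. A sketch of that pattern; the audio_streamer keyword on generate and the get_stream method are assumptions to verify against your installed version:
import threading

def run_generation():
    model.generate(**inputs, audio_streamer=audio_streamer)  # assumed keyword

thread = threading.Thread(target=run_generation, daemon=True)
thread.start()

# Consume chunks for batch item 0 as they are produced
for audio_chunk in audio_streamer.get_stream(0):
    handle_chunk(audio_chunk)  # hypothetical consumer: play or buffer

thread.join()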

Stop Event Handling

import threading

stop_event = threading.Event()

outputs = model.generate(
    **inputs,
    stop_check_fn=stop_event.is_set,
    # ...
)

# To stop generation:
stop_event.set()
Use stop events to implement user-cancellable generation in interactive applications.

Advanced Generation Options

Refresh Negative Prompt

outputs = model.generate(
    **inputs,
    refresh_negative=True,  # Regenerate negative prompt each time
    # ...
)
refresh_negative
boolean
default:"true"
Whether to regenerate the negative prompt for CFG. Set to False for faster repeated generations.

Verbose Output

outputs = model.generate(
    **inputs,
    verbose=True,  # Print generation progress
    # ...
)

Max New Tokens

outputs = model.generate(
    **inputs,
    max_new_tokens=None,  # Generate until natural stopping point
    # max_new_tokens=2000,  # Or limit token count
    # ...
)
Setting max_new_tokens=None lets the model determine the appropriate length based on input text.

Saving and Loading Configurations

Save Processor Configuration

processor.save_pretrained("./my_processor")
This saves:
  • preprocessor_config.json with all processor settings

Load Custom Configuration

processor = VibeVoiceStreamingProcessor.from_pretrained("./my_processor")

Configuration File Example

preprocessor_config.json
{
  "processor_class": "VibeVoiceStreamingProcessor",
  "speech_tok_compress_ratio": 3200,
  "db_normalize": true,
  "audio_processor": {
    "feature_extractor_type": "VibeVoiceTokenizerProcessor",
    "sampling_rate": 24000,
    "normalize_audio": true,
    "target_dB_FS": -25,
    "eps": 1e-6
  }
}

Performance Tuning

Optimize for Latency

# Minimal inference steps
model.set_ddpm_inference_steps(num_steps=3)

# Disable sampling
generation_config = {'do_sample': False}

# Lower CFG scale
cfg_scale = 1.0

Optimize for Quality

# More inference steps
model.set_ddpm_inference_steps(num_steps=10)

# Enable sampling with moderate temperature
generation_config = {
    'do_sample': True,
    'temperature': 0.7,
    'top_p': 0.9
}

# Higher CFG scale
cfg_scale = 2.0
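
Either profile's settings feed into a single generate call; for example, the quality-oriented profile:
model.set_ddpm_inference_steps(num_steps=10)

outputs = model.generate(
    **inputs,
    cfg_scale=2.0,
    generation_config={
        'do_sample': True,
        'temperature': 0.7,
        'top_p': 0.9,
    },
)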

Optimize for Memory

# Use gradient checkpointing (if training)
model.gradient_checkpointing_enable()

# Use smaller batch sizes
batch_size = 1

# Clear CUDA cache between generations
import torch
torch.cuda.empty_cache()

Troubleshooting

Out of Memory

1. Reduce inference steps

model.set_ddpm_inference_steps(num_steps=3)

2. Use lower precision

# Use float16 instead of bfloat16 (if supported)
torch_dtype = torch.float16

3. Clear cache

torch.cuda.empty_cache()

Slow Generation

  • Install Flash Attention 2 for CUDA GPUs
  • Reduce num_steps in diffusion inference
  • Use do_sample=False (greedy decoding avoids sampling overhead)
  • Ensure model is on GPU, not CPU

Poor Audio Quality

  • Increase diffusion inference steps to 7-10
  • Ensure Flash Attention 2 is being used (check logs)
  • Adjust CFG scale (try 1.5-2.0 range)
  • Verify voice prompt is appropriate for target language
