VibeVoiceStreamingConfig
Configuration class for VibeVoice streaming inference models. This is a composition configuration that combines acoustic tokenizer, decoder (language model), and diffusion head configurations.

Class Signature
Initialization
acoustic_tokenizer_config: Configuration for the acoustic tokenizer VAE. Can be a config object or a dictionary.
decoder_config: Configuration for the decoder language model. Currently supports Qwen2. Can be a config object or a dictionary.
diffusion_head_config: Configuration for the diffusion prediction head. Can be a config object or a dictionary.
tts_backbone_num_hidden_layers: Number of upper Transformer layers used for TTS. The decoder is split into two components:
- Lower layers: text encoding only
- Upper layers (TTS backbone): text encoding + speech generation
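The lower/upper split follows directly from `tts_backbone_num_hidden_layers`. A minimal sketch of the arithmetic; the helper function and the 28-layer decoder total are illustrative assumptions, not part of the actual API:

```python
def split_decoder_layers(num_hidden_layers: int, tts_backbone_num_hidden_layers: int):
    """Return (lower, upper) layer counts for the two-stage decoder.

    Illustrative helper, not part of the VibeVoice API: the lower layers
    encode text only, while the upper `tts_backbone_num_hidden_layers`
    layers form the TTS backbone that also generates speech.
    """
    if tts_backbone_num_hidden_layers > num_hidden_layers:
        raise ValueError("TTS backbone cannot have more layers than the decoder")
    lower = num_hidden_layers - tts_backbone_num_hidden_layers
    return lower, tts_backbone_num_hidden_layers

# e.g. a hypothetical 28-layer Qwen2 decoder with the default 20 TTS layers
print(split_decoder_layers(28, 20))  # (8, 20)
```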
Sub-Configurations
The config manages three sub-configurations. Dictionary mapping configuration keys to their config classes:

- "acoustic_tokenizer_config": VibeVoiceAcousticTokenizerConfig
- "decoder_config": Qwen2Config
- "diffusion_head_config": VibeVoiceDiffusionHeadConfig
Properties
Dimension of the acoustic VAE latent space. Read from acoustic_tokenizer_config.vae_dim (default: 64).
Number of upper decoder layers used for TTS generation (default: 20).
Tensor parallel plan for the base Qwen2 model, defining how to shard attention and MLP layers across devices:

- layers.*.self_attn.{q,k,v}_proj: column-wise parallel
- layers.*.self_attn.o_proj: row-wise parallel
- layers.*.mlp.{gate,up}_proj: column-wise parallel
- layers.*.mlp.down_proj: row-wise parallel
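Such a plan can be viewed as a mapping from glob-style module patterns to shard styles. A sketch using `fnmatch` to show how a concrete module name matches a pattern; the string values here are descriptive stand-ins, not the framework's real sharding objects:

```python
from fnmatch import fnmatch

# Descriptive sketch of the base-model tensor parallel plan listed above.
base_model_tp_plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.k_proj": "colwise",
    "layers.*.self_attn.v_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    "layers.*.mlp.gate_proj": "colwise",
    "layers.*.mlp.up_proj": "colwise",
    "layers.*.mlp.down_proj": "rowwise",
}

def shard_style(module_name):
    """Return the shard style for a module, or None if it is replicated."""
    for pattern, style in base_model_tp_plan.items():
        if fnmatch(module_name, pattern):
            return style
    return None

print(shard_style("layers.3.self_attn.o_proj"))  # rowwise
```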
Methods
from_pretrained
Load configuration from a pretrained model.
Model identifier from huggingface.co/models or path to a directory containing config.json.
Loaded configuration instance
save_pretrained
Save configuration to a directory.
Directory where config.json will be saved.
to_dict
Convert the configuration to a dictionary.
Dictionary representation of the configuration, including all sub-configs.
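Because this is a composition config, to_dict must recurse into the sub-configs so the result round-trips through config.json. A simplified stand-in showing that recursion (not the real implementation):

```python
class SubConfig:
    """Stand-in for a sub-config such as the diffusion head config."""
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def to_dict(self):
        return dict(self.__dict__)

class ComposedConfig:
    """Stand-in mimicking how a composition config serializes itself."""
    def __init__(self, acoustic_tokenizer_config, decoder_config, diffusion_head_config):
        self.acoustic_tokenizer_config = acoustic_tokenizer_config
        self.decoder_config = decoder_config
        self.diffusion_head_config = diffusion_head_config

    def to_dict(self):
        # Serialize each sub-config recursively, producing plain nested dicts.
        return {k: v.to_dict() for k, v in self.__dict__.items()}

cfg = ComposedConfig(
    SubConfig(vae_dim=64),
    SubConfig(num_hidden_layers=28),
    SubConfig(hidden_size=1024),
)
print(cfg.to_dict()["acoustic_tokenizer_config"])  # {'vae_dim': 64}
```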
Configuration Structure
The configuration has a hierarchical structure.

Usage Example
- Load from Pretrained
- Create Custom Config
- Modify Existing Config
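The code behind these three tabs was lost in extraction. A hedged reconstruction of what each pattern typically looks like for a Hugging Face-style config; the import path and the model identifier are assumptions, not confirmed by this page:

```python
# Assumed import path; adjust to wherever VibeVoiceStreamingConfig
# lives in your installation.
from vibevoice import VibeVoiceStreamingConfig

# 1. Load from Pretrained: a hub model id or a local directory
#    containing config.json (identifier below is hypothetical).
config = VibeVoiceStreamingConfig.from_pretrained("microsoft/VibeVoice-1.5B")

# 2. Create Custom Config: sub-configs may be passed as dictionaries.
custom = VibeVoiceStreamingConfig(
    acoustic_tokenizer_config={"vae_dim": 64},
    decoder_config={"num_hidden_layers": 28},
    diffusion_head_config={},
    tts_backbone_num_hidden_layers=20,
)

# 3. Modify Existing Config: edit attributes, then persist.
config.tts_backbone_num_hidden_layers = 16
config.save_pretrained("./my_vibevoice_config")
```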
Architecture Notes
The VibeVoice architecture uses a two-stage decoder approach:
- Base LM (Lower Layers): Encodes text only, produces hidden states
- TTS LM (Upper Layers): Encodes both text and speech, using tts_backbone_num_hidden_layers layers
Attention Implementation
The configuration sets _attn_implementation_autoset = False to prevent automatic attention implementation selection. Users should explicitly specify the attention type:
- flash_attention_2: recommended for CUDA (best performance)
- sdpa: for MPS and CPU, or when Flash Attention is unavailable
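Since automatic selection is disabled, callers must choose the implementation themselves. A small helper encoding the recommendation above; the helper itself is illustrative, not part of the library:

```python
def pick_attn_implementation(device_type: str) -> str:
    """Follow the recommendation above: Flash Attention 2 on CUDA,
    SDPA everywhere else (MPS, CPU, or when flash-attn is unavailable)."""
    if device_type == "cuda":
        try:
            import flash_attn  # noqa: F401  (optional dependency)
            return "flash_attention_2"
        except ImportError:
            return "sdpa"
    return "sdpa"

print(pick_attn_implementation("cpu"))  # sdpa
```

The chosen string is then typically passed when loading the model, e.g. `attn_implementation="sdpa"`.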