VibeVoiceStreamingConfig

Configuration class for VibeVoice streaming inference models. This is a composition configuration that combines acoustic tokenizer, decoder (language model), and diffusion head configurations.

Class Signature

class VibeVoiceStreamingConfig(PretrainedConfig):
    model_type = "vibevoice_streaming"
    is_composition = True

Initialization

from vibevoice import VibeVoiceStreamingConfig

config = VibeVoiceStreamingConfig(
    acoustic_tokenizer_config=acoustic_config,
    decoder_config=decoder_config,
    diffusion_head_config=diffusion_config,
    tts_backbone_num_hidden_layers=20
)
Parameters

acoustic_tokenizer_config (VibeVoiceAcousticTokenizerConfig | dict)
  Configuration for the acoustic tokenizer VAE. Can be a config object or a dictionary.

decoder_config (Qwen2Config | dict)
  Configuration for the decoder language model; currently Qwen2 is supported. Can be a config object or a dictionary.

diffusion_head_config (VibeVoiceDiffusionHeadConfig | dict)
  Configuration for the diffusion prediction head. Can be a config object or a dictionary.

tts_backbone_num_hidden_layers (int, default: 20)
  Number of upper Transformer layers used for TTS. The decoder is split into two components:
  • Lower layers: text encoding only
  • Upper layers (TTS backbone): text encoding and speech generation

Sub-Configurations

The config manages three sub-configurations:
sub_configs (dict)
  Dictionary mapping configuration keys to their config classes:
  • "acoustic_tokenizer_config": VibeVoiceAcousticTokenizerConfig
  • "decoder_config": Qwen2Config
  • "diffusion_head_config": VibeVoiceDiffusionHeadConfig
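As a rough sketch of why this mapping exists: a composition config can normalize dictionary inputs into config objects by looking up the target class for each key. ToyConfig and normalize below are hypothetical stand-ins for illustration, not part of the vibevoice API.

```python
# Hypothetical sketch: a composition config can use a sub_configs
# mapping to turn dict inputs into config objects.
class ToyConfig:
    """Stand-in for a real sub-config class (not part of vibevoice)."""
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

sub_configs = {"decoder_config": ToyConfig}

def normalize(key, value):
    # Leave config objects alone; build one from a dict otherwise.
    cls = sub_configs[key]
    return cls(**value) if isinstance(value, dict) else value

cfg = normalize("decoder_config", {"hidden_size": 1536})
assert cfg.hidden_size == 1536
```

This is why both a config object and a plain dictionary are accepted for each sub-configuration parameter.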

Properties

acoustic_vae_dim (int)
  Dimension of the acoustic VAE latent space. Read from acoustic_tokenizer_config.vae_dim (default: 64).

tts_backbone_num_hidden_layers (int)
  Number of upper decoder layers used for TTS generation (default: 20).

base_model_tp_plan (dict)
  Tensor-parallel plan for the base Qwen2 model, defining how attention and MLP layers are sharded across devices:
  • layers.*.self_attn.{q,k,v}_proj: column-wise parallel
  • layers.*.self_attn.o_proj: row-wise parallel
  • layers.*.mlp.{gate,up}_proj: column-wise parallel
  • layers.*.mlp.down_proj: row-wise parallel
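The plan above can be pictured as a plain dictionary mapping wildcard layer names to sharding strategies. The "colwise" / "rowwise" value strings below follow the common Hugging Face transformers convention and are illustrative assumptions, not copied from the vibevoice source.

```python
# Illustrative shape of the tensor-parallel plan; the "colwise" /
# "rowwise" strings follow the Hugging Face transformers convention
# and are assumptions, not copied from the source.
base_model_tp_plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.k_proj": "colwise",
    "layers.*.self_attn.v_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    "layers.*.mlp.gate_proj": "colwise",
    "layers.*.mlp.up_proj": "colwise",
    "layers.*.mlp.down_proj": "rowwise",
}
```

Column-wise projections split the output dimension across devices, while the row-wise projections that follow them split the input dimension, so activations only need to be reduced once per attention or MLP block.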

Methods

from_pretrained

Load configuration from a pretrained model.
config = VibeVoiceStreamingConfig.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B"
)
pretrained_model_name_or_path (str, required)
  Model identifier from huggingface.co/models or a path to a directory containing config.json.

Returns config (VibeVoiceStreamingConfig): the loaded configuration instance.

save_pretrained

Save configuration to a directory.
config.save_pretrained("./my_model")
save_directory (str | os.PathLike, required)
  Directory where config.json will be saved.
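A minimal standard-library sketch of the round trip that save_pretrained and from_pretrained perform: the configuration dictionary is written out as config.json and can be read back unchanged. The config_dict shown here is abbreviated; a real one also contains the nested sub-configs.

```python
import json
import os
import tempfile

# Abbreviated config dict; a real one also holds the sub-configs.
config_dict = {
    "model_type": "vibevoice_streaming",
    "tts_backbone_num_hidden_layers": 20,
}

save_directory = tempfile.mkdtemp()
path = os.path.join(save_directory, "config.json")
with open(path, "w") as f:
    json.dump(config_dict, f, indent=2)

with open(path) as f:
    reloaded = json.load(f)
assert reloaded == config_dict
```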

to_dict

Convert configuration to a dictionary.
config_dict = config.to_dict()
Returns a dict: the dictionary representation of the configuration, including all sub-configs.
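A toy illustration of how a composition to_dict can fold each sub-config into the parent dictionary under its key. The Sub and Composite classes are hypothetical stand-ins, not the vibevoice implementation.

```python
# Hypothetical classes showing how a composition to_dict can nest
# each sub-config under its key (not the actual vibevoice code).
class Sub:
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def to_dict(self):
        return dict(self.__dict__)

class Composite:
    def __init__(self, decoder_config):
        self.decoder_config = decoder_config

    def to_dict(self):
        return {
            "model_type": "vibevoice_streaming",
            "decoder_config": self.decoder_config.to_dict(),
        }

config_dict = Composite(Sub(hidden_size=1536)).to_dict()
assert config_dict["decoder_config"]["hidden_size"] == 1536
```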

Configuration Structure

The configuration has a hierarchical structure:
VibeVoiceStreamingConfig
├── acoustic_tokenizer_config (VibeVoiceAcousticTokenizerConfig)
│   ├── vae_dim: 64
│   ├── encoder_config: {...}
│   └── decoder_config: {...}
├── decoder_config (Qwen2Config)
│   ├── hidden_size: 1536
│   ├── num_hidden_layers: 28
│   ├── num_attention_heads: 12
│   ├── max_position_embeddings: 32768
│   └── ...
├── diffusion_head_config (VibeVoiceDiffusionHeadConfig)
│   ├── ddpm_num_inference_steps: 5
│   ├── prediction_type: "epsilon"
│   └── ...
└── tts_backbone_num_hidden_layers: 20

Usage Example

from vibevoice import VibeVoiceStreamingConfig

# Load existing config
config = VibeVoiceStreamingConfig.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B"
)

# Inspect config
print(f"Acoustic VAE dim: {config.acoustic_vae_dim}")
print(f"Decoder hidden size: {config.decoder_config.hidden_size}")
print(f"TTS layers: {config.tts_backbone_num_hidden_layers}")

Architecture Notes

The VibeVoice architecture uses a two-stage decoder approach:
  1. Base LM (Lower Layers): Encodes text only, produces hidden states
  2. TTS LM (Upper Layers): Encodes both text and speech, using tts_backbone_num_hidden_layers layers
The split allows efficient text processing before entering the more complex TTS generation phase.
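Assuming the split is a simple subtraction (the source does not spell out the formula), the layer counts from the example configuration work out as follows:

```python
# Layer split implied by the example values; the subtraction is an
# assumption, not stated explicitly in the source.
num_hidden_layers = 28                 # decoder_config.num_hidden_layers
tts_backbone_num_hidden_layers = 20    # upper TTS backbone layers

lower_text_layers = num_hidden_layers - tts_backbone_num_hidden_layers
assert lower_text_layers == 8          # text-only lower layers
```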

Attention Implementation

The configuration sets _attn_implementation_autoset = False to prevent automatic attention implementation selection. Users should explicitly specify an attention implementation:
  • flash_attention_2: Recommended for CUDA (best performance)
  • sdpa: For MPS and CPU, or when Flash Attention is unavailable
from vibevoice import VibeVoiceStreamingForConditionalGenerationInference

model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B",
    attn_implementation="flash_attention_2"  # Explicit selection
)
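One way to keep the selection explicit while still adapting to the host is a small helper that prefers Flash Attention 2 when CUDA is available and falls back to SDPA otherwise. pick_attn_implementation is a hypothetical helper for illustration, not part of the library.

```python
# Hypothetical helper (not part of vibevoice): prefer Flash
# Attention 2 on CUDA, fall back to SDPA elsewhere.
def pick_attn_implementation(cuda_available: bool) -> str:
    return "flash_attention_2" if cuda_available else "sdpa"
```

In practice the flag would come from torch.cuda.is_available(), and the result would be passed as attn_implementation to from_pretrained as shown above.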

Configuration File Format

When saved, the configuration is stored as a JSON file with nested sub-configurations:
{
  "model_type": "vibevoice_streaming",
  "tts_backbone_num_hidden_layers": 20,
  "acoustic_vae_dim": 64,
  "acoustic_tokenizer_config": {
    "model_type": "vibevoice_acoustic_tokenizer",
    "vae_dim": 64,
    ...
  },
  "decoder_config": {
    "model_type": "qwen2",
    "hidden_size": 1536,
    "num_hidden_layers": 28,
    ...
  },
  "diffusion_head_config": {
    "model_type": "vibevoice_diffusion_head",
    "ddpm_num_inference_steps": 5,
    ...
  }
}
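Because the saved file is ordinary JSON, nested sub-config fields can be read with the standard library alone. The snippet below parses an abbreviated version of the file shown above.

```python
import json

# Abbreviated version of the saved config.json shown above.
raw = """{
  "model_type": "vibevoice_streaming",
  "tts_backbone_num_hidden_layers": 20,
  "decoder_config": {"model_type": "qwen2", "hidden_size": 1536}
}"""

config = json.loads(raw)
assert config["decoder_config"]["hidden_size"] == 1536
```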
