
Overview

VibeVoice’s speech tokenizers convert between continuous audio waveforms and compact latent representations. Operating at an ultra-low 7.5 Hz frame rate, they enable efficient processing of long-form speech while preserving audio quality.

Acoustic Tokenizer

The acoustic tokenizer is a VAE-based model that encodes and decodes speech at the waveform level.

Architecture

Decoder-Only

Only the decoder is used for generation from latents

Hierarchical Upsampling

Multi-stage upsampling with residual blocks

Causal Support

Supports streaming with causal convolutions

Frame Rate Calculation

The 7.5 Hz frame rate is achieved through hierarchical downsampling ratios:
# From configuration
ratios = [8, 5, 4, 2]  # Downsampling at each stage
total_downsample = 8 * 5 * 4 * 2  # = 320

# A 7.5 Hz frame rate at a 24 kHz sample rate requires a total factor of:
# 24000 / 7.5 = 3200 samples per token
# The [8, 5, 4, 2] ladder above yields 320 (a 75 Hz latent rate), so the
# 7.5 Hz model uses a larger ratio stack; see the model config for the
# exact stages.
The exact downsampling ratios depend on the model configuration. The key insight is that the frame rate is roughly 10x lower than that of typical speech codecs, which operate at 75 Hz or higher.
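As a sanity check, the token-to-sample arithmetic can be sketched in plain Python, using the 24 kHz sample rate and 7.5 Hz frame rate stated on this page (the helper names are illustrative, not part of the VibeVoice API):

```python
SAMPLE_RATE = 24_000   # Hz, audio sample rate
FRAME_RATE = 7.5       # Hz, latent tokens per second

# Audio samples represented by one latent token
samples_per_token = int(SAMPLE_RATE / FRAME_RATE)   # 3200

def tokens_for_duration(seconds: float) -> int:
    """Number of latent tokens needed to cover `seconds` of audio."""
    return int(seconds * FRAME_RATE)

def audio_samples(num_tokens: int) -> int:
    """Audio samples produced when decoding `num_tokens` tokens."""
    return num_tokens * samples_per_token

print(samples_per_token)         # 3200
print(tokens_for_duration(60))   # 450 tokens for one minute of audio
print(audio_samples(100))        # 320000 samples (~13.3 s)
```

At this rate, a 90-minute recording needs only 90 * 60 * 7.5 = 40,500 tokens, which is what makes long-form generation tractable.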

Decoder Architecture

From modular_vibevoice_tokenizer.py:687-823, the decoder consists of:

1. Upsampling Layers

# Stem: Initial projection
SConv1d(dimension, n_filters * 2^(len(depths)-1), kernel_size)

# Hierarchical upsampling
for i in range(len(ratios)):
    SConvTranspose1d(
        in_channels=n_filters * 2^(len(depths)-1-i),
        out_channels=n_filters * 2^(len(depths)-2-i),
        kernel_size=ratios[i] * 2,
        stride=ratios[i]
    )
Each SConvTranspose1d layer upsamples by its corresponding ratio. With the [8, 5, 4, 2] ladder (320x total, i.e. a 75 Hz latent rate), the cascade is:
  • Ratio 8: 8x upsampling (75 Hz → 600 Hz)
  • Ratio 5: 5x upsampling (600 Hz → 3,000 Hz)
  • Ratio 4: 4x upsampling (3,000 Hz → 12,000 Hz)
  • Ratio 2: 2x upsampling (12,000 Hz → 24,000 Hz)
A 7.5 Hz configuration follows the same pattern with a larger total factor (3200x).
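The cumulative rate at each stage can be traced with a short loop (a sketch using the [8, 5, 4, 2] ladder shown above; starting rate derived from the 24 kHz sample rate):

```python
from math import prod

SAMPLE_RATE = 24_000
ratios = [8, 5, 4, 2]              # per-stage upsampling factors

rate = SAMPLE_RATE / prod(ratios)  # latent rate for this ladder: 75.0 Hz
for r in ratios:
    new_rate = rate * r
    print(f"{rate:g} Hz --x{r}--> {new_rate:g} Hz")
    rate = new_rate

print(rate)  # 24000.0: back at the audio sample rate
```

Whatever the ratio stack, the product of the stages must equal sample_rate / frame_rate, so the decoder always lands exactly back at the waveform rate.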

2. Residual Blocks (Block1D)

Between upsampling stages, Block1D modules refine features:
class Block1D(nn.Module):
    def __init__(self, dim, kernel_size=7, mixer_layer='conv', ...):
        self.norm = ConvRMSNorm(dim)  # or ConvLayerNorm
        self.mixer = Convlayer(dim, dim, kernel_size, ...)
        self.ffn_norm = ConvRMSNorm(dim)
        self.ffn = FFN(dim, ffn_expansion * dim)
Each block performs:
  1. Normalization (RMSNorm or LayerNorm)
  2. Depthwise Convolution (mixing temporal information)
  3. Feed-Forward Network (channel-wise processing)
  4. Residual Connections (gradient flow)
The full constructor (condensed from the source) selects these components:
def __init__(self, dim, kernel_size=7, drop_path=0.,
             mixer_layer='depthwise_conv', layernorm='RMSNorm',
             ffn_expansion=4, layer_scale_init_value=1e-6):
    
    # Normalization
    if layernorm == 'LN':
        self.norm = ConvLayerNorm(dim)
        self.ffn_norm = ConvLayerNorm(dim)
    elif layernorm == 'RMSNorm':
        self.norm = ConvRMSNorm(dim)
        self.ffn_norm = ConvRMSNorm(dim)
    
    # Mixer (depthwise or standard conv)
    if mixer_layer == 'depthwise_conv':
        self.mixer = Convlayer(dim, dim, kernel_size, groups=dim, ...)
    
    # FFN with SwiGLU-like activation
    self.ffn = FFN(dim, ffn_expansion * dim)
    
    # Layer scaling for stability
    self.gamma = nn.Parameter(layer_scale_init_value * torch.ones(dim))
    self.ffn_gamma = nn.Parameter(layer_scale_init_value * torch.ones(dim))

3. Final Projection

# Final normalization and projection to audio channels
self.norm = ConvRMSNorm(in_ch)
self.head = SConv1d(in_ch, channels, kernel_size=last_kernel_size)
Outputs waveform with channels=1 for mono audio.

Streaming Support

VibeVoice supports real-time streaming through causal convolutions with caching.

Streaming Cache

From modular_vibevoice_tokenizer.py:193-256, the VibeVoiceTokenizerStreamingCache maintains state:
class VibeVoiceTokenizerStreamingCache:
    def __init__(self):
        self.cache = {}  # Maps (layer_id, sample_idx) -> state tensor
    
    def get(self, layer_id, sample_indices):
        """Retrieve cached states for continuation"""
        
    def set(self, layer_id, sample_indices, states):
        """Store new states for next chunk"""

Causal Convolution (SConv1d)

The SConv1d layer handles both streaming and non-streaming modes:
def _forward_streaming(self, x, cache, sample_indices, debug=False):
    B, C, T = x.shape
    
    # 1. Retrieve cached context
    cached_states = cache.get(self.layer_id, sample_indices)
    if cached_states is None:
        # Initialize with zeros
        cached_states = torch.zeros(B, C, self.context_size, ...)
    
    # 2. Concatenate cache with new input
    input_with_context = torch.cat([cached_states, x], dim=2)
    
    # 3. Apply convolution (no extra padding needed)
    output = self.conv(input_with_context)
    
    # 4. Update cache with most recent context_size samples
    new_cache = input_with_context[:, :, -self.context_size:]
    cache.set(self.layer_id, sample_indices, new_cache)
    
    return output
Context Size: (kernel_size - 1) * dilation - (stride - 1)

For kernel_size=7, dilation=1, stride=1:
  • context_size = (7 - 1) * 1 - 0 = 6 samples
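For other layer shapes the cached context can be computed the same way (a small helper built from the formula above; the function name is illustrative):

```python
def context_size(kernel_size: int, dilation: int = 1, stride: int = 1) -> int:
    """Samples of left context a causal SConv1d must cache between chunks."""
    return (kernel_size - 1) * dilation - (stride - 1)

print(context_size(7))               # 6   (kernel_size=7, dilation=1, stride=1)
print(context_size(7, dilation=2))   # 12  (dilated layer needs more context)
print(context_size(8, stride=2))     # 6   (stride absorbs one sample of context)
```

These small values are why the streaming cache stays tiny: each layer carries only a handful of samples between chunks.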

Transposed Convolution Streaming

SConvTranspose1d also supports streaming for the decoder:
def _forward_streaming(self, x, cache, sample_indices, debug=False):
    B, C, T = x.shape
    
    # Retrieve cached input (zeros on the first chunk)
    cached_input = cache.get(self.layer_id, sample_indices)
    if cached_input is None:
        cached_input = torch.zeros(B, C, self.context_size, ...)
    
    # Concatenate and apply transposed convolution
    full_input = torch.cat([cached_input, x], dim=2)
    full_output = self.convtr(full_input)
    
    # Extract only the NEW output samples
    expected_new_output = T * self.stride
    output = full_output[:, :, -expected_new_output:]
    
    # Store the most recent input samples for the next chunk
    cache.set(self.layer_id, sample_indices, full_input[:, :, -self.context_size:])
    
    return output
Streaming adds minimal latency (~300ms for first chunk) because the model only needs a small context window due to the low 7.5 Hz frame rate.
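The ~300 ms figure can be sanity-checked from the frame rate alone: each latent token spans 1/7.5 s ≈ 133 ms of audio, so a first chunk of two or three tokens (hypothetical chunk sizes) already lands in that range:

```python
FRAME_RATE = 7.5  # Hz, latent tokens per second

def chunk_latency_ms(tokens_per_chunk: int) -> float:
    """Audio duration covered by one chunk of latent tokens, in milliseconds."""
    return tokens_per_chunk / FRAME_RATE * 1000

for n in (1, 2, 3):
    print(n, round(chunk_latency_ms(n), 1), "ms")
# 1 133.3 ms
# 2 266.7 ms
# 3 400.0 ms
```

This is audio-duration latency only; actual wall-clock latency also includes model compute time per chunk.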

Normalization Layers

VibeVoice uses RMSNorm for efficiency:

RMSNorm vs LayerNorm

class RMSNorm(nn.Module):
    def _norm(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
    
    def forward(self, x):
        output = self._norm(x.float()).type_as(x)
        if self.weight is not None:
            output = output * self.weight
        return output
Advantages:
  • No mean subtraction (faster)
  • Fewer parameters (no bias)
  • Better numerical stability
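The normalization itself is easy to verify without PyTorch. This plain-Python sketch reimplements the `_norm` step above on a single vector:

```python
import math

def rms_norm(x, eps=1e-6):
    """x * 1/sqrt(mean(x^2) + eps), as in RMSNorm._norm (weight omitted)."""
    ms = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(ms + eps)
    return [v * scale for v in x]

x = [3.0, -4.0]            # mean square = 12.5
y = rms_norm(x)
print([round(v, 4) for v in y])

# The output's root-mean-square is ~1 by construction
rms = math.sqrt(sum(v * v for v in y) / len(y))
print(round(rms, 6))  # ~1.0
```

Note there is no mean subtraction anywhere, which is exactly the work LayerNorm does that RMSNorm skips.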

ConvRMSNorm

Adapted for convolutional inputs:
class ConvRMSNorm(RMSNorm):
    def forward(self, x):
        x = x.transpose(1, 2)  # b c t -> b t c
        output = self._norm(x.float()).type_as(x)
        if self.weight is not None:
            output = output * self.weight
        output = output.transpose(1, 2)  # b t c -> b c t
        return output
If APEX is available and OPTIMIZE_FOR_SPEED=1, VibeVoice uses fused RMSNorm for faster computation (modular_vibevoice_tokenizer.py:78-91).

Configuration Parameters

From VibeVoiceAcousticTokenizerConfig:
{
  "vae_dim": 512,              # Latent dimension
  "channels": 1,                # Audio channels (mono)
  "n_filters": 64,              # Base filter count
  "decoder_n_filters": 64,      # Decoder filters
  "ratios": [8, 5, 4, 2],       # Upsampling ratios
  "decoder_ratios": [8, 5, 4, 2],
  "depths": [2, 2, 4, 4],       # Blocks per stage
  "decoder_depths": [4, 4, 2, 2], # Reversed for decoder
  "kernel_size": 7,
  "last_kernel_size": 7,
  "causal": true,               # Enable streaming
  "pad_mode": "reflect",
  "layernorm": "RMSNorm",
  "mixer_layer": "depthwise_conv",
  "layer_scale_init_value": 0.0
}
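Given such a config, the effective frame rate follows directly from the ratios. A hedged sketch (note the [8, 5, 4, 2] ratios shown here give 75 Hz; a 7.5 Hz model needs a total factor of 3200):

```python
from math import prod

config = {
    "ratios": [8, 5, 4, 2],
    # ...other fields as above...
}
SAMPLE_RATE = 24_000

total_downsample = prod(config["ratios"])    # 320 for these ratios
frame_rate = SAMPLE_RATE / total_downsample  # 75.0 Hz
print(total_downsample, frame_rate)

# A 7.5 Hz model requires a downsampling factor of:
print(SAMPLE_RATE / 7.5)  # 3200.0
```

Reading the frame rate out of the config this way avoids hard-coding samples-per-token anywhere downstream.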

Decode Function

The main interface for converting latents to audio:
@torch.no_grad()
def decode(self, latents, cache=None, sample_indices=None, 
           use_cache=False, debug=False):
    """Convert latent representations back to audio
    
    Args:
        latents: [batch, vae_dim, time] or [batch, time, vae_dim]
        cache: VibeVoiceTokenizerStreamingCache for streaming
        sample_indices: Batch sample IDs for cache management
        use_cache: Enable streaming mode
    
    Returns:
        audio: [batch, channels, samples]
    """
    if latents.shape[1] != self.config.vae_dim:
        latents = latents.permute(0, 2, 1)
    
    audio = self.decoder(latents, cache=cache, 
                        sample_indices=sample_indices,
                        use_cache=use_cache, debug=debug)
    return audio
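The layout check at the top of decode can be illustrated with shapes alone (a hypothetical helper; VAE_DIM mirrors config.vae_dim):

```python
VAE_DIM = 512

def needs_permute(shape):
    """True when latents arrive as [batch, time, vae_dim] and must be
    permuted to [batch, vae_dim, time], mirroring the check in decode().
    If dim 1 already equals VAE_DIM, no permute happens (as in decode)."""
    batch, d1, d2 = shape
    return d1 != VAE_DIM and d2 == VAE_DIM

print(needs_permute((1, 512, 100)))  # False: already [B, vae_dim, T]
print(needs_permute((1, 100, 512)))  # True: apply permute(0, 2, 1) first
```

The ambiguous case where the time dimension also equals 512 resolves to "no permute", matching the `latents.shape[1] != self.config.vae_dim` test in the source.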

Performance Characteristics

| Aspect | Value | Benefit |
|---|---|---|
| Frame Rate | 7.5 Hz | 10x sequence length reduction |
| Latent Dim | 512 | Compact representation |
| Streaming Latency | ~300 ms | Real-time capable |
| Context Window | 6-12 samples | Minimal memory overhead |
| Normalization | RMSNorm | Faster than LayerNorm |

Usage Example

from transformers import AutoModel
import torch

# Load acoustic tokenizer
tokenizer = AutoModel.from_pretrained(
    "microsoft/vibevoice-realtime-0.5b",
    subfolder="acoustic_tokenizer"
)

# Decode latents to audio (non-streaming)
latents = torch.randn(1, 512, 100)  # [batch, vae_dim, num_tokens]
audio = tokenizer.decode(latents)
print(audio.shape)  # [1, 1, 320000] (100 tokens * 3200 samples/token at 7.5 Hz)

# Streaming mode
from vibevoice.modular import VibeVoiceTokenizerStreamingCache

cache = VibeVoiceTokenizerStreamingCache()
sample_indices = torch.tensor([0])  # Batch index

# Process chunks
for chunk in latent_chunks:  # [1, 512, chunk_size]
    audio_chunk = tokenizer.decode(
        chunk, 
        cache=cache,
        sample_indices=sample_indices,
        use_cache=True
    )
    # Stream audio_chunk to output
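When the latents come from one long tensor rather than an incremental generator, a small helper can produce the chunk boundaries (hypothetical; a chunk of 4 tokens covers ~533 ms of audio at 7.5 Hz):

```python
def chunk_ranges(num_tokens: int, chunk_size: int):
    """Yield (start, end) index pairs covering num_tokens latent frames."""
    for start in range(0, num_tokens, chunk_size):
        yield start, min(start + chunk_size, num_tokens)

# 100 tokens in chunks of 4 -> 25 chunks; each chunk would be sliced as
# latents[:, :, start:end] before tokenizer.decode(..., use_cache=True)
ranges = list(chunk_ranges(100, 4))
print(len(ranges))            # 25
print(ranges[0], ranges[-1])  # (0, 4) (96, 100)
```

Because the streaming cache carries the causal context between calls, decoding chunk-by-chunk this way should produce the same waveform as a single non-streaming decode.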

Key Takeaways

Ultra-Low Frame Rate

7.5 Hz enables 90-minute generation by drastically reducing sequence length

Streaming-Ready

Causal convolutions with caching support real-time processing

Hierarchical Design

Multi-stage upsampling with residual blocks for quality

Efficient Normalization

RMSNorm reduces computation vs. LayerNorm

Next Steps

Diffusion Head

Learn how the diffusion head generates acoustic tokens using the next-token diffusion framework
