Overview
VibeVoice’s speech tokenizers convert between continuous audio waveforms and compact latent representations. Operating at an ultra-low 7.5 Hz frame rate, they enable efficient processing of long-form speech while preserving audio quality.
Acoustic Tokenizer
The acoustic tokenizer is a VAE-based model that encodes and decodes speech at the waveform level.
Architecture
Decoder-Only: Only the decoder is used for generation from latents
Hierarchical Upsampling: Multi-stage upsampling with residual blocks
Causal Support: Supports streaming with causal convolutions
Frame Rate Calculation
The 7.5 Hz frame rate is determined by the product of the hierarchical downsampling ratios:
# At 24 kHz, each 7.5 Hz latent token covers a fixed number of samples:
samples_per_token = 24000 / 7.5  # = 3200
# so the per-stage downsampling ratios must multiply to 3200.
# For comparison, ratios = [8, 5, 4, 2] multiply to only
# 8 * 5 * 4 * 2 = 320, which would give 24000 / 320 = 75 Hz.
The exact per-stage ratios depend on the model configuration. The key insight is that the 7.5 Hz frame rate is roughly 10x lower than that of typical speech tokenizers (75 Hz or higher), shrinking sequence length by the same factor.
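The arithmetic above can be verified directly; a quick sketch, assuming the stated 24 kHz sample rate:

```python
from math import prod

sample_rate = 24_000

# Ratios [8, 5, 4, 2] multiply to 320, which corresponds to 75 Hz:
assert prod([8, 5, 4, 2]) == 320
assert sample_rate / 320 == 75.0

# A 7.5 Hz frame rate instead requires a total factor of 3200,
# i.e. each latent token covers 3200 waveform samples:
samples_per_token = sample_rate / 7.5
assert samples_per_token == 3200
```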
Decoder Architecture
From modular_vibevoice_tokenizer.py:687-823, the decoder consists of:
1. Upsampling Layers
# Stem: initial projection from latent dim to the widest feature map
SConv1d(dimension, n_filters * 2 ** (len(depths) - 1), kernel_size)

# Hierarchical upsampling: channel count halves as temporal rate grows
for i in range(len(ratios)):
    SConvTranspose1d(
        in_channels=n_filters * 2 ** (len(depths) - 1 - i),
        out_channels=n_filters * 2 ** (len(depths) - 2 - i),
        kernel_size=ratios[i] * 2,
        stride=ratios[i],
    )
Each SConvTranspose1d layer upsamples by its corresponding ratio. With ratios [8, 5, 4, 2] and a 75 Hz latent rate, for example:
Ratio 8: 8x upsampling (75 Hz → 600 Hz)
Ratio 5: 5x upsampling (600 Hz → 3000 Hz)
Ratio 4: 4x upsampling (3000 Hz → 12000 Hz)
Ratio 2: 2x upsampling (12000 Hz → 24000 Hz)
(These four ratios multiply to 320; a 7.5 Hz model needs a total upsampling factor of 3200, so its configuration uses more stages or larger ratios.)
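Tracing the per-stage rates is a one-liner; this sketch uses the 75 Hz starting rate from the example above:

```python
rate = 75.0  # Hz latent rate in the example above
stage_rates = []
for ratio in [8, 5, 4, 2]:
    rate *= ratio
    stage_rates.append(rate)

# each stage's output rate, ending at the 24 kHz waveform rate
assert stage_rates == [600.0, 3000.0, 12000.0, 24000.0]
```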
2. Residual Blocks (Block1D)
Between upsampling stages, Block1D modules refine features:
class Block1D(nn.Module):
    def __init__(self, dim, kernel_size=7, mixer_layer='conv', ...):
        self.norm = ConvRMSNorm(dim)  # or ConvLayerNorm
        self.mixer = Convlayer(dim, dim, kernel_size, ...)
        self.ffn_norm = ConvRMSNorm(dim)
        self.ffn = FFN(dim, ffn_expansion * dim)
Each block performs:
Normalization (RMSNorm or LayerNorm)
Depthwise Convolution (mixing temporal information)
Feed-Forward Network (channel-wise processing)
Residual Connections (gradient flow)
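The control flow of the four steps above can be sketched as a standard pre-norm residual block. Placeholder callables stand in for the real norm/mixer/FFN modules, and the pre-norm ordering is an assumption based on the layer list:

```python
def block1d_forward(x, norm, mixer, ffn_norm, ffn):
    # temporal mixing with a residual connection
    x = x + mixer(norm(x))
    # channel-wise FFN with a residual connection
    x = x + ffn(ffn_norm(x))
    return x

# identity/doubling placeholders make the residual structure visible
identity = lambda v: v
double = lambda v: 2 * v
out = block1d_forward(1.0, identity, double, identity, double)
assert out == 9.0  # (1 + 2*1) = 3, then (3 + 2*3) = 9
```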
Layer Configuration from modular_vibevoice_tokenizer.py:620-684
def __init__(self, dim, kernel_size=7, drop_path=0.,
             mixer_layer='conv', layer_scale_init_value=1e-6):
    # (abridged excerpt; layernorm and ffn_expansion also come from the config)

    # Normalization
    if layernorm == 'LN':
        self.norm = ConvLayerNorm(dim)
        self.ffn_norm = ConvLayerNorm(dim)
    elif layernorm == 'RMSNorm':
        self.norm = ConvRMSNorm(dim)
        self.ffn_norm = ConvRMSNorm(dim)

    # Mixer (depthwise or standard conv)
    if mixer_layer == 'depthwise_conv':
        self.mixer = Convlayer(dim, dim, groups=dim, ...)

    # FFN with SwiGLU-like activation
    self.ffn = FFN(dim, ffn_expansion * dim)

    # Layer scaling for stability
    self.gamma = nn.Parameter(layer_scale_init_value * torch.ones(dim))
    self.ffn_gamma = nn.Parameter(layer_scale_init_value * torch.ones(dim))
3. Final Projection
# Final normalization and projection to audio channels
self.norm = ConvRMSNorm(in_ch)
self.head = SConv1d(in_ch, channels, kernel_size=last_kernel_size)
The head outputs a waveform with channels=1 for mono audio.
Streaming Support
VibeVoice supports real-time streaming through causal convolutions with caching.
Streaming Cache
From modular_vibevoice_tokenizer.py:193-256, the VibeVoiceTokenizerStreamingCache maintains state:
class VibeVoiceTokenizerStreamingCache:
    def __init__(self):
        self.cache = {}  # Maps (layer_id, sample_idx) -> state tensor

    def get(self, layer_id, sample_indices):
        """Retrieve cached states for continuation"""

    def set(self, layer_id, sample_indices, states):
        """Store new states for next chunk"""
Causal Convolution (SConv1d)
The SConv1d layer handles both streaming and non-streaming modes:
Streaming Forward Pass (modular_vibevoice_tokenizer.py:327-382)
def _forward_streaming(self, x, cache, sample_indices, debug=False):
    B, C, T = x.shape

    # 1. Retrieve cached context
    cached_states = cache.get(self.layer_id, sample_indices)
    if cached_states is None:
        # Initialize with zeros
        cached_states = torch.zeros(B, C, self.context_size, ...)

    # 2. Concatenate cache with new input
    input_with_context = torch.cat([cached_states, x], dim=2)

    # 3. Apply convolution (no extra padding needed)
    output = self.conv(input_with_context)

    # 4. Update cache with the most recent context_size samples
    new_cache = input_with_context[:, :, -self.context_size:]
    cache.set(self.layer_id, sample_indices, new_cache)
    return output
Context Size: (kernel_size - 1) * dilation - (stride - 1)

For kernel_size=7, dilation=1, stride=1:
context_size = (7 - 1) * 1 - (1 - 1) = 6 samples
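The cache-and-concatenate scheme reproduces full-sequence causal convolution exactly. A pure-Python sketch (hypothetical, single-channel scalars) demonstrates the equivalence for the 6-sample context computed above:

```python
def causal_conv1d(x, kernel):
    """Causal conv: left-pad with zeros so output length == input length."""
    K = len(kernel)
    padded = [0.0] * (K - 1) + list(x)
    return [sum(kernel[j] * padded[i + j] for j in range(K))
            for i in range(len(x))]

def streaming_conv1d(chunks, kernel):
    """Process chunks with a cached context of K-1 samples (cf. SConv1d)."""
    K = len(kernel)
    cache = [0.0] * (K - 1)           # zero-initialized, like the cache-miss path
    out = []
    for chunk in chunks:
        window = cache + list(chunk)  # concatenate cache with new input
        # valid (unpadded) convolution over the extended window
        out += [sum(kernel[j] * window[i + j] for j in range(K))
                for i in range(len(chunk))]
        cache = window[-(K - 1):]     # keep the most recent K-1 samples
    return out

kernel = [0.5, -1.0, 2.0, 0.25, 1.0, -0.5, 0.1]   # kernel_size = 7
signal = [float(i % 5) for i in range(32)]
full = causal_conv1d(signal, kernel)
chunked = streaming_conv1d([signal[:10], signal[10:17], signal[17:]], kernel)
assert full == chunked  # a context of K-1 = 6 samples is sufficient
```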
Transposed Convolution Streaming
SConvTranspose1d also supports streaming for the decoder:
def _forward_streaming(self, x, cache, sample_indices, debug=False):
    B, C, T = x.shape

    # Retrieve cached input
    cached_input = cache.get(self.layer_id, sample_indices)

    # Concatenate and apply transposed convolution
    full_input = torch.cat([cached_input, x], dim=2)
    full_output = self.convtr(full_input)

    # Extract only the NEW output samples
    expected_new_output = T * self.stride
    output = full_output[:, :, -expected_new_output:]
    return output
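The shape arithmetic can be sanity-checked with the standard transposed-convolution length formula, assuming no implicit padding and kernel_size = 2 * stride as in the decoder layers above:

```python
def convtr_out_len(in_len, stride, kernel_size):
    # standard transposed-conv output length (no padding, dilation 1)
    return (in_len - 1) * stride + kernel_size

stride, kernel_size = 8, 16          # e.g. the ratio-8 stage
cached, new = 1, 10                  # 1 cached frame + 10 new latent frames
full_len = convtr_out_len(cached + new, stride, kernel_size)
expected_new_output = new * stride   # only these trailing samples are emitted
assert full_len == 96
assert expected_new_output == 80
```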
Streaming adds minimal latency (~300ms for first chunk) because the model only needs a small context window due to the low 7.5 Hz frame rate.
Normalization Layers
VibeVoice uses RMSNorm for efficiency:
RMSNorm vs LayerNorm
class RMSNorm(nn.Module):
    def _norm(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x):
        output = self._norm(x.float()).type_as(x)
        if self.weight is not None:
            output = output * self.weight
        return output
Advantages :
No mean subtraction (faster)
Fewer parameters (no bias)
Better numerical stability
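A tiny numeric sketch (pure Python, no learned weight) makes the "no mean subtraction" difference concrete:

```python
import math

def rms_norm(x, eps=1e-6):
    # rescale to unit RMS; no mean subtraction
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def layer_norm(x, eps=1e-6):
    # subtract the mean, then rescale to unit variance
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

x = [1.0, 2.0, 3.0, 4.0]
y = rms_norm(x)
# RMSNorm keeps the input's sign pattern and direction, only rescaling:
assert abs(sum(v * v for v in y) / len(y) - 1.0) < 1e-3
# LayerNorm re-centers, so its output sums to (approximately) zero:
assert abs(sum(layer_norm(x))) < 1e-6
```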
ConvRMSNorm
Adapted for convolutional inputs:
class ConvRMSNorm(RMSNorm):
    def forward(self, x):
        x = x.transpose(1, 2)  # b c t -> b t c
        output = self._norm(x.float()).type_as(x)
        if self.weight is not None:
            output = output * self.weight
        output = output.transpose(1, 2)  # b t c -> b c t
        return output
If APEX is available and OPTIMIZE_FOR_SPEED=1, VibeVoice uses fused RMSNorm for faster computation (modular_vibevoice_tokenizer.py:78-91).
Configuration Parameters
From VibeVoiceAcousticTokenizerConfig:
{
  "vae_dim": 512,                  # Latent dimension
  "channels": 1,                   # Audio channels (mono)
  "n_filters": 64,                 # Base filter count
  "decoder_n_filters": 64,         # Decoder filters
  "ratios": [8, 5, 4, 2],          # Encoder downsampling ratios
  "decoder_ratios": [8, 5, 4, 2],  # Decoder upsampling ratios
  "depths": [2, 2, 4, 4],          # Blocks per stage
  "decoder_depths": [4, 4, 2, 2],  # Reversed for decoder
  "kernel_size": 7,
  "last_kernel_size": 7,
  "causal": true,                  # Enable streaming
  "pad_mode": "reflect",
  "layernorm": "RMSNorm",
  "mixer_layer": "depthwise_conv",
  "layer_scale_init_value": 0.0
}
Decode Function
The main interface for converting latents to audio:
@torch.no_grad()
def decode(self, latents, cache=None, sample_indices=None,
           use_cache=False, debug=False):
    """Convert latent representations back to audio

    Args:
        latents: [batch, vae_dim, time] or [batch, time, vae_dim]
        cache: VibeVoiceTokenizerStreamingCache for streaming
        sample_indices: Batch sample IDs for cache management
        use_cache: Enable streaming mode

    Returns:
        audio: [batch, channels, samples]
    """
    if latents.shape[1] != self.config.vae_dim:
        latents = latents.permute(0, 2, 1)
    audio = self.decoder(latents, cache=cache,
                         sample_indices=sample_indices,
                         use_cache=use_cache, debug=debug)
    return audio
Aspect | Value | Benefit
--- | --- | ---
Frame Rate | 7.5 Hz | 10x sequence length reduction
Latent Dim | 512 | Compact representation
Streaming Latency | ~300 ms | Real-time capable
Context Window | 6-12 samples | Minimal memory overhead
Normalization | RMSNorm | Faster than LayerNorm
Usage Example
from transformers import AutoModel
import torch

# Load acoustic tokenizer
tokenizer = AutoModel.from_pretrained(
    "microsoft/vibevoice-realtime-0.5b",
    subfolder="acoustic_tokenizer"
)

# Decode latents to audio (non-streaming)
latents = torch.randn(1, 512, 100)  # [batch, vae_dim, 100 tokens]
audio = tokenizer.decode(latents)
print(audio.shape)  # [1, 1, 320000] (100 tokens * 3200 samples/token at 7.5 Hz)

# Streaming mode
from vibevoice.modular import VibeVoiceTokenizerStreamingCache

cache = VibeVoiceTokenizerStreamingCache()
sample_indices = torch.tensor([0])  # Batch index

# Process latent chunks as they arrive, each shaped [1, 512, chunk_size]
for chunk in latent_chunks:
    audio_chunk = tokenizer.decode(
        chunk,
        cache=cache,
        sample_indices=sample_indices,
        use_cache=True
    )
    # Stream audio_chunk to output
Key Takeaways
Ultra-Low Frame Rate: 7.5 Hz enables 90-minute generation by drastically reducing sequence length
Streaming-Ready: Causal convolutions with caching support real-time processing
Hierarchical Design: Multi-stage upsampling with residual blocks preserves quality
Efficient Normalization: RMSNorm reduces computation vs. LayerNorm
Next Steps
Diffusion Head: Learn how the diffusion head generates acoustic tokens using the next-token diffusion framework