
DPMSolverMultistepScheduler

A fast, dedicated high-order solver for diffusion ODEs, used by VibeVoice for speech token sampling. This scheduler implements the DPM-Solver++ algorithm for efficient few-step diffusion inference.

Class Signature

class DPMSolverMultistepScheduler(SchedulerMixin, ConfigMixin):
    order = 1

Initialization

from vibevoice.schedule import DPMSolverMultistepScheduler

scheduler = DPMSolverMultistepScheduler(
    num_train_timesteps=1000,
    beta_start=0.0001,
    beta_end=0.02,
    beta_schedule="linear",
    solver_order=2,
    prediction_type="epsilon",
    algorithm_type="dpmsolver++"
)

Parameters

num_train_timesteps
int
default:"1000"
The number of diffusion steps to train the model
beta_start
float
default:"0.0001"
The starting beta value for the noise schedule
beta_end
float
default:"0.02"
The final beta value for the noise schedule
beta_schedule
str
default:"linear"
The beta schedule type. Options:
  • "linear": Linear schedule
  • "scaled_linear": Scaled linear (for latent diffusion)
  • "squaredcos_cap_v2" or "cosine": Cosine schedule
  • "cauchy": Cauchy distribution schedule
  • "laplace": Laplace distribution schedule
trained_betas
np.ndarray | List[float]
Pass an array of betas directly to bypass beta_start and beta_end
solver_order
int
default:"2"
The DPMSolver order (1, 2, or 3). Recommended:
  • Order 2 for guided sampling
  • Order 3 for unconditional sampling
prediction_type
str
default:"epsilon"
Prediction type of the scheduler. Options:
  • "epsilon": Predicts noise
  • "sample": Predicts noisy sample directly
  • "v_prediction": Predicts velocity
thresholding
bool
default:"False"
Whether to use dynamic thresholding (not suitable for latent-space models)
dynamic_thresholding_ratio
float
default:"0.995"
The ratio for dynamic thresholding. Only used when thresholding=True
sample_max_value
float
default:"1.0"
The threshold value for dynamic thresholding
algorithm_type
str
default:"dpmsolver++"
Algorithm type for the solver. Options:
  • "dpmsolver++": Recommended for most cases
  • "dpmsolver": Original DPMSolver
  • "sde-dpmsolver++": SDE variant with stochasticity
  • "sde-dpmsolver": SDE variant of original
solver_type
str
default:"midpoint"
Solver type for second-order solver. Options:
  • "midpoint": Recommended
  • "heun": Alternative second-order method
lower_order_final
bool
default:"True"
Whether to use lower-order solvers in the final steps. Helps stabilize sampling when using fewer than 15 inference steps
euler_at_final
bool
default:"False"
Whether to use Euler’s method in the final step (trade-off between stability and detail)
use_karras_sigmas
bool
default:"False"
Whether to use Karras sigmas for step sizes in the noise schedule
use_lu_lambdas
bool
default:"False"
Whether to use uniform-logSNR for step sizes (Lu’s DPM-Solver)
final_sigmas_type
str
default:"zero"
The final sigma value. Options:
  • "zero": Final sigma is 0
  • "sigma_min": Final sigma is the same as the last sigma in the training schedule
lambda_min_clipped
float
default:"-inf"
Clipping threshold for minimum lambda(t) for numerical stability (critical for cosine schedules)
variance_type
str
Set to "learned" or "learned_range" for models that predict variance
timestep_spacing
str
default:"linspace"
How timesteps should be scaled. Options:
  • "linspace": Linear spacing
  • "leading": Leading spacing
  • "trailing": Trailing spacing
steps_offset
int
default:"0"
Offset added to inference steps (required by some model families)
rescale_betas_zero_snr
bool
default:"False"
Whether to rescale betas to have zero terminal SNR (enables very bright/dark generation)
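
The beta parameters above define the training noise schedule from which everything else is derived. As a minimal, library-independent sketch (pure Python, using the documented defaults), the linear schedule and its cumulative signal level ᾱ_t look like this:

```python
# Default linear schedule: betas spaced evenly from beta_start to beta_end
num_train_timesteps = 1000
beta_start, beta_end = 0.0001, 0.02
betas = [beta_start + (beta_end - beta_start) * i / (num_train_timesteps - 1)
         for i in range(num_train_timesteps)]

# alphas_cumprod[t] = prod_{s<=t} (1 - beta_s): the squared signal level at step t
alphas_cumprod = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alphas_cumprod.append(prod)

# Signal level starts near 1 and decays toward ~0 at the end of the schedule
print(alphas_cumprod[0], alphas_cumprod[-1])
```

The scheduler's internal construction may differ in details, but this is the standard DDPM-style derivation the parameters describe.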

Methods

set_timesteps

Set the discrete timesteps used for the diffusion chain.
scheduler.set_timesteps(num_inference_steps=5, device="cuda")
num_inference_steps
int
required
Number of diffusion steps for generation. Fewer steps = faster but potentially lower quality
device
str | torch.device
Device to place timesteps tensor on
timesteps
List[int]
Custom timesteps to use instead of automatic generation. Cannot be used with num_inference_steps
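
For intuition, with the default "linspace" spacing and 1000 training steps, five inference steps are spread evenly over [0, 999] and traversed in descending order. A rough sketch of the idea (not the library's exact rounding):

```python
num_train_timesteps = 1000
num_inference_steps = 5

# Evenly spaced points over [0, num_train_timesteps - 1], reversed for denoising
timesteps = [round(i * (num_train_timesteps - 1) / (num_inference_steps - 1))
             for i in range(num_inference_steps)][::-1]

print(timesteps)  # e.g. [999, 749, 500, 250, 0]
```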

step

Predict the sample from the previous timestep using the multistep DPMSolver.
result = scheduler.step(
    model_output=noise_pred,
    timestep=t,
    sample=latent,
    return_dict=True
)
prev_sample = result.prev_sample
model_output
torch.Tensor
required
Direct output from the learned diffusion model
timestep
int
required
Current discrete timestep in the diffusion chain
sample
torch.Tensor
required
Current instance of sample created by diffusion process
generator
torch.Generator
Random number generator for stochastic sampling
variance_noise
torch.Tensor
Alternative to generating noise with generator (for reproducibility)
return_dict
bool
default:"True"
Whether to return a SchedulerOutput or tuple
SchedulerOutput
object
Contains:
  • prev_sample (torch.Tensor): Computed sample at previous timestep
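
Conceptually, each step() in the "dpmsolver++" data-prediction formulation moves the sample toward the model's x0 estimate. The following is a hedged first-order sketch of that update in scalar form (the library's multistep solver also reuses previous model outputs for higher-order accuracy):

```python
import math

def dpmpp_first_order_step(x_t, x0_pred, alpha_t, sigma_t, alpha_s, sigma_s):
    """One first-order DPM-Solver++ update from noise level t to level s (s closer to data)."""
    lam_t = math.log(alpha_t / sigma_t)   # half log-SNR at t
    lam_s = math.log(alpha_s / sigma_s)
    h = lam_s - lam_t                     # step size in log-SNR space
    return (sigma_s / sigma_t) * x_t - alpha_s * (math.exp(-h) - 1.0) * x0_pred

# Toy check: build x_t from a known x0 and noise, then step to a near-zero noise level
x0, eps = 0.7, -1.2
alpha_t, sigma_t = 0.6, 0.8              # alpha^2 + sigma^2 = 1 (VP schedule)
x_t = alpha_t * x0 + sigma_t * eps
x_s = dpmpp_first_order_step(x_t, x0, alpha_t, sigma_t, 1.0, 1e-6)
print(x_s)  # ~= 0.7: a perfect x0 prediction is recovered in one step
```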

add_noise

Add noise to original samples at given timesteps.
noisy_samples = scheduler.add_noise(
    original_samples=clean_latents,
    noise=noise,
    timesteps=timesteps
)
original_samples
torch.Tensor
required
Clean samples to add noise to
noise
torch.Tensor
required
Noise to add to samples
timesteps
torch.IntTensor
required
Timesteps indicating noise levels
noisy_samples
torch.Tensor
Noisy samples at specified timesteps
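
add_noise implements the standard forward diffusion q(x_t | x_0). A minimal scalar sketch of the formula it applies, assuming the usual variance-preserving parameterization (add_noise_scalar is illustrative, not a library function):

```python
import math

def add_noise_scalar(x0, noise, alpha_cumprod_t):
    """noisy = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    return math.sqrt(alpha_cumprod_t) * x0 + math.sqrt(1.0 - alpha_cumprod_t) * noise

# Limiting cases: clean sample at a_bar = 1, pure noise at a_bar = 0
print(add_noise_scalar(0.5, 2.0, 1.0))  # 0.5 (clean sample)
print(add_noise_scalar(0.5, 2.0, 0.0))  # 2.0 (pure noise)
```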

get_velocity

Get velocity for v-prediction formulation.
velocity = scheduler.get_velocity(
    original_samples=clean_latents,
    noise=noise,
    timesteps=timesteps
)
original_samples
torch.Tensor
required
Clean samples
noise
torch.Tensor
required
Noise tensor
timesteps
torch.IntTensor
required
Timesteps
velocity
torch.Tensor
Velocity: alpha_t * noise - sigma_t * original_samples
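
The velocity target combines noise and data so that either can be recovered from it. A quick scalar check of the identity x0 = alpha_t * x_t - sigma_t * v, which holds whenever alpha_t^2 + sigma_t^2 = 1 (a sketch of the math, not library code):

```python
alpha_t, sigma_t = 0.6, 0.8              # alpha^2 + sigma^2 = 1
x0, noise = 0.3, -0.9

x_t = alpha_t * x0 + sigma_t * noise     # noisy sample
v = alpha_t * noise - sigma_t * x0       # velocity target, as documented above

x0_recovered = alpha_t * x_t - sigma_t * v
print(x0_recovered)  # 0.3 (up to float error)
```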

convert_model_output

Convert model output to the type needed by DPMSolver algorithm.
converted = scheduler.convert_model_output(
    model_output=noise_pred,
    sample=current_sample
)
model_output
torch.Tensor
required
Direct output from learned diffusion model
sample
torch.Tensor
required
Current sample instance
converted_output
torch.Tensor
Converted output (either x0_pred for dpmsolver++ or epsilon for dpmsolver)
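
For the default epsilon prediction under "dpmsolver++", the conversion simply rearranges the forward process: x0 = (x_t - sigma_t * eps) / alpha_t. A scalar sketch:

```python
alpha_t, sigma_t = 0.6, 0.8
x0, eps = 1.1, 0.4

x_t = alpha_t * x0 + sigma_t * eps         # forward process
x0_pred = (x_t - sigma_t * eps) / alpha_t  # epsilon -> x0 conversion
print(x0_pred)  # ~1.1 (exact inverse when eps is the true noise)
```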

Properties

timesteps
torch.Tensor
Discrete timesteps used for inference (set by set_timesteps())
step_index
int
Current step index counter (increases after each step() call)
begin_index
int
Beginning index for scheduler (set with set_begin_index())
sigmas
torch.Tensor
Noise levels (sigma values) for each timestep
init_noise_sigma
float
Standard deviation of initial noise distribution (always 1.0)
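
The sigmas property is derived from the training schedule; assuming the usual DPM-Solver++ parameterization, sigma_t = sqrt((1 - a_bar_t) / a_bar_t), which grows monotonically with t. A pure-Python sketch using the default linear beta schedule:

```python
import math

betas = [0.0001 + (0.02 - 0.0001) * i / 999 for i in range(1000)]

sigmas, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b                       # running a_bar_t
    sigmas.append(math.sqrt((1.0 - prod) / prod))

# Noise level increases monotonically toward the end of the schedule
print(sigmas[0], sigmas[-1])
```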

Usage in VibeVoice

The scheduler is used internally by VibeVoice models for speech token diffusion sampling:
# Inside VibeVoiceStreamingForConditionalGenerationInference
def sample_speech_tokens(self, condition, neg_condition, cfg_scale=3.0):
    self.model.noise_scheduler.set_timesteps(self.ddpm_inference_steps)
    condition = torch.cat([condition, neg_condition], dim=0).to(self.model.prediction_head.device)
    speech = torch.randn(condition.shape[0], self.config.acoustic_vae_dim).to(condition)
    
    for t in self.model.noise_scheduler.timesteps:
        half = speech[: len(speech) // 2]
        combined = torch.cat([half, half], dim=0)
        eps = self.model.prediction_head(combined, t.repeat(combined.shape[0]).to(combined), condition=condition)
        cond_eps, uncond_eps = torch.split(eps, len(eps) // 2, dim=0)
        half_eps = uncond_eps + cfg_scale * (cond_eps - uncond_eps)
        eps = torch.cat([half_eps, half_eps], dim=0)
        speech = self.model.noise_scheduler.step(eps, t, speech).prev_sample
    
    return speech[: len(speech) // 2]
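
The classifier-free guidance line above blends the conditional and unconditional noise predictions. A scalar sketch of that blend (cfg_blend is an illustrative helper, not part of the VibeVoice API):

```python
def cfg_blend(cond_eps, uncond_eps, cfg_scale):
    """uncond + scale * (cond - uncond): scale 0 -> unconditional, 1 -> conditional."""
    return uncond_eps + cfg_scale * (cond_eps - uncond_eps)

print(cfg_blend(1.0, 0.2, 0.0))  # pure unconditional prediction (0.2)
print(cfg_blend(1.0, 0.2, 1.0))  # pure conditional prediction (1.0)
print(cfg_blend(1.0, 0.2, 3.0))  # extrapolates past the conditional (~2.6)
```

Values above 1 push the sample further toward the condition, which is why higher cfg_scale improves conditioning adherence.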

Example: Standalone Usage

from vibevoice.schedule import DPMSolverMultistepScheduler
import torch

# Initialize scheduler
scheduler = DPMSolverMultistepScheduler(
    num_train_timesteps=1000,
    beta_schedule="linear",
    solver_order=2,
    prediction_type="epsilon",
    algorithm_type="dpmsolver++"
)

# Set inference timesteps
scheduler.set_timesteps(num_inference_steps=5, device="cuda")

# Initialize with noise
sample = torch.randn(1, 64).to("cuda")

# Denoise iteratively
for t in scheduler.timesteps:
    # Predict noise (epsilon); `model` here stands in for your trained denoising network
    model_output = model(sample, t)

    
    # Scheduler step
    result = scheduler.step(model_output, t, sample)
    sample = result.prev_sample

# sample now contains denoised output

Beta Schedule Types

The beta_schedule argument selects how the noise schedule is generated; "linear" (shown below) is the default, with "scaled_linear", "squaredcos_cap_v2"/"cosine", "cauchy", and "laplace" also available:

scheduler = DPMSolverMultistepScheduler(
    beta_schedule="linear",
    beta_start=0.0001,
    beta_end=0.02
)
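
The schedules differ only in how the beta values are generated. A hedged sketch of the two linear variants in pure Python (the cosine, cauchy, and laplace options use different closed forms):

```python
import math

beta_start, beta_end, n = 0.0001, 0.02, 1000

# "linear": betas spaced evenly between beta_start and beta_end
linear = [beta_start + (beta_end - beta_start) * i / (n - 1) for i in range(n)]

# "scaled_linear": even spacing in sqrt(beta), then squared (latent-diffusion style)
scaled = [(math.sqrt(beta_start)
           + (math.sqrt(beta_end) - math.sqrt(beta_start)) * i / (n - 1)) ** 2
          for i in range(n)]

# Both share the same endpoints but distribute noise differently in between
print(linear[0], scaled[0], linear[-1], scaled[-1])
```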

Algorithm Types Comparison

Algorithm Type     Use Case                     Stochastic   Speed
dpmsolver++        General (recommended)        No           Fast
dpmsolver          Original formulation         No           Fast
sde-dpmsolver++    High quality, small steps    Yes          Fast
sde-dpmsolver      Original SDE variant         Yes          Fast

Notes

  • VibeVoice typically uses 5 inference steps for real-time performance
  • Higher cfg_scale values (1.5-3.0) improve conditioning adherence
  • The scheduler supports both epsilon and v-prediction formulations
  • For best results with few steps, use solver_order=2 and lower_order_final=True
