
DPMSolverMultistepScheduler

A fast, dedicated high-order solver for diffusion ODEs, used by VibeVoice for speech token sampling. This scheduler implements the DPM-Solver++ algorithm for efficient few-step diffusion inference.

Class Signature

class DPMSolverMultistepScheduler(SchedulerMixin, ConfigMixin):
    order = 1

Initialization

from vibevoice.schedule import DPMSolverMultistepScheduler

scheduler = DPMSolverMultistepScheduler(
    num_train_timesteps=1000,
    beta_start=0.0001,
    beta_end=0.02,
    beta_schedule="linear",
    solver_order=2,
    prediction_type="epsilon",
    algorithm_type="dpmsolver++"
)

Parameters

num_train_timesteps
int
default:"1000"
The number of diffusion steps to train the model
beta_start
float
default:"0.0001"
The starting beta value for the noise schedule
beta_end
float
default:"0.02"
The final beta value for the noise schedule
beta_schedule
str
default:"linear"
The beta schedule type. Options:
  • "linear": Linear schedule
  • "scaled_linear": Scaled linear (for latent diffusion)
  • "squaredcos_cap_v2" or "cosine": Cosine schedule
  • "cauchy": Cauchy distribution schedule
  • "laplace": Laplace distribution schedule
trained_betas
np.ndarray | List[float]
Pass an array of betas directly to bypass beta_start and beta_end
solver_order
int
default:"2"
The DPMSolver order (1, 2, or 3). Recommended:
  • Order 2 for guided sampling
  • Order 3 for unconditional sampling
prediction_type
str
default:"epsilon"
Prediction type of the scheduler. Options:
  • "epsilon": Predicts noise
  • "sample": Predicts noisy sample directly
  • "v_prediction": Predicts velocity
thresholding
bool
default:"False"
Whether to use dynamic thresholding (not suitable for latent-space models)
dynamic_thresholding_ratio
float
default:"0.995"
The ratio for dynamic thresholding. Only used when thresholding=True
sample_max_value
float
default:"1.0"
The threshold value for dynamic thresholding
algorithm_type
str
default:"dpmsolver++"
Algorithm type for the solver. Options:
  • "dpmsolver++": Recommended for most cases
  • "dpmsolver": Original DPMSolver
  • "sde-dpmsolver++": SDE variant with stochasticity
  • "sde-dpmsolver": SDE variant of original
solver_type
str
default:"midpoint"
Solver type for second-order solver. Options:
  • "midpoint": Recommended
  • "heun": Alternative second-order method
lower_order_final
bool
default:"True"
Whether to use lower-order solvers in the final steps. Helps stabilize sampling when using fewer than 15 inference steps
euler_at_final
bool
default:"False"
Whether to use Euler’s method in the final step (trade-off between stability and detail)
use_karras_sigmas
bool
default:"False"
Whether to use Karras sigmas for step sizes in the noise schedule
use_lu_lambdas
bool
default:"False"
Whether to use uniform-logSNR for step sizes (Lu’s DPM-Solver)
final_sigmas_type
str
default:"zero"
The final sigma value. Options:
  • "zero": Final sigma is 0
  • "sigma_min": Final sigma is the same as the last sigma in the training schedule
lambda_min_clipped
float
default:"-inf"
Clipping threshold for minimum lambda(t) for numerical stability (critical for cosine schedules)
variance_type
str
Set to "learned" or "learned_range" for models that predict variance
timestep_spacing
str
default:"linspace"
How timesteps should be scaled. Options:
  • "linspace": Linear spacing
  • "leading": Leading spacing
  • "trailing": Trailing spacing
steps_offset
int
default:"0"
Offset added to inference steps (required by some model families)
rescale_betas_zero_snr
bool
default:"False"
Whether to rescale betas to have zero terminal SNR (enables very bright/dark generation)
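
The beta parameters above define the training noise schedule from which everything else is derived. As a minimal, library-independent sketch (pure Python, using the documented defaults), the linear schedule and its cumulative signal level ᾱ_t look like this:

```python
# Default linear schedule: betas spaced evenly from beta_start to beta_end
num_train_timesteps = 1000
beta_start, beta_end = 0.0001, 0.02
betas = [beta_start + (beta_end - beta_start) * i / (num_train_timesteps - 1)
         for i in range(num_train_timesteps)]

# alphas_cumprod[t] = prod_{s<=t} (1 - beta_s): the squared signal level at step t
alphas_cumprod = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alphas_cumprod.append(prod)

# Signal level starts near 1 and decays toward ~0 at the end of the schedule
print(alphas_cumprod[0], alphas_cumprod[-1])
```

The scheduler's internal construction may differ in details, but this is the standard DDPM-style derivation the parameters describe.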

Methods

set_timesteps

Set the discrete timesteps used for the diffusion chain.
scheduler.set_timesteps(num_inference_steps=5, device="cuda")
num_inference_steps
int
required
Number of diffusion steps for generation. Fewer steps = faster but potentially lower quality
device
str | torch.device
Device to place timesteps tensor on
timesteps
List[int]
Custom timesteps to use instead of automatic generation. Cannot be used with num_inference_steps
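
For intuition, with the default "linspace" spacing and 1000 training steps, five inference steps are spread evenly over [0, 999] and traversed in descending order. A rough sketch of the idea (not the library's exact rounding):

```python
num_train_timesteps = 1000
num_inference_steps = 5

# Evenly spaced points over [0, num_train_timesteps - 1], reversed for denoising
timesteps = [round(i * (num_train_timesteps - 1) / (num_inference_steps - 1))
             for i in range(num_inference_steps)][::-1]

print(timesteps)  # e.g. [999, 749, 500, 250, 0]
```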

step

Predict the sample from the previous timestep using the multistep DPMSolver.
result = scheduler.step(
    model_output=noise_pred,
    timestep=t,
    sample=latent,
    return_dict=True
)
prev_sample = result.prev_sample
model_output
torch.Tensor
required
Direct output from the learned diffusion model
timestep
int
required
Current discrete timestep in the diffusion chain
sample
torch.Tensor
required
Current instance of sample created by diffusion process
generator
torch.Generator
Random number generator for stochastic sampling
variance_noise
torch.Tensor
Alternative to generating noise with generator (for reproducibility)
return_dict
bool
default:"True"
Whether to return a SchedulerOutput or tuple
SchedulerOutput
object
Contains:
  • prev_sample (torch.Tensor): Computed sample at previous timestep
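
Conceptually, each step() in the "dpmsolver++" data-prediction formulation moves the sample toward the model's x0 estimate. The following is a hedged first-order sketch of that update in scalar form (the library's multistep solver also reuses previous model outputs for higher-order accuracy):

```python
import math

def dpmpp_first_order_step(x_t, x0_pred, alpha_t, sigma_t, alpha_s, sigma_s):
    """One first-order DPM-Solver++ update from noise level t to level s (s closer to data)."""
    lam_t = math.log(alpha_t / sigma_t)   # half log-SNR at t
    lam_s = math.log(alpha_s / sigma_s)
    h = lam_s - lam_t                     # step size in log-SNR space
    return (sigma_s / sigma_t) * x_t - alpha_s * (math.exp(-h) - 1.0) * x0_pred

# Toy check: build x_t from a known x0 and noise, then step to a near-zero noise level
x0, eps = 0.7, -1.2
alpha_t, sigma_t = 0.6, 0.8              # alpha^2 + sigma^2 = 1 (VP schedule)
x_t = alpha_t * x0 + sigma_t * eps
x_s = dpmpp_first_order_step(x_t, x0, alpha_t, sigma_t, 1.0, 1e-6)
print(x_s)  # ~= 0.7: a perfect x0 prediction is recovered in one step
```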

add_noise

Add noise to original samples at given timesteps.
noisy_samples = scheduler.add_noise(
    original_samples=clean_latents,
    noise=noise,
    timesteps=timesteps
)
original_samples
torch.Tensor
required
Clean samples to add noise to
noise
torch.Tensor
required
Noise to add to samples
timesteps
torch.IntTensor
required
Timesteps indicating noise levels
noisy_samples
torch.Tensor
Noisy samples at specified timesteps
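
add_noise implements the standard forward diffusion q(x_t | x_0). A minimal scalar sketch of the formula it applies, assuming the usual variance-preserving parameterization (add_noise_scalar is illustrative, not a library function):

```python
import math

def add_noise_scalar(x0, noise, alpha_cumprod_t):
    """noisy = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    return math.sqrt(alpha_cumprod_t) * x0 + math.sqrt(1.0 - alpha_cumprod_t) * noise

# Limiting cases: clean sample at a_bar = 1, pure noise at a_bar = 0
print(add_noise_scalar(0.5, 2.0, 1.0))  # 0.5 (clean sample)
print(add_noise_scalar(0.5, 2.0, 0.0))  # 2.0 (pure noise)
```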

get_velocity

Get velocity for v-prediction formulation.
velocity = scheduler.get_velocity(
    original_samples=clean_latents,
    noise=noise,
    timesteps=timesteps
)
original_samples
torch.Tensor
required
Clean samples
noise
torch.Tensor
required
Noise tensor
timesteps
torch.IntTensor
required
Timesteps
velocity
torch.Tensor
Velocity: alpha_t * noise - sigma_t * original_samples
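
The velocity target combines noise and data so that either can be recovered from it. A quick scalar check of the identity x0 = alpha_t * x_t - sigma_t * v, which holds whenever alpha_t^2 + sigma_t^2 = 1 (a sketch of the math, not library code):

```python
alpha_t, sigma_t = 0.6, 0.8              # alpha^2 + sigma^2 = 1
x0, noise = 0.3, -0.9

x_t = alpha_t * x0 + sigma_t * noise     # noisy sample
v = alpha_t * noise - sigma_t * x0       # velocity target, as documented above

x0_recovered = alpha_t * x_t - sigma_t * v
print(x0_recovered)  # 0.3 (up to float error)
```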

convert_model_output

Convert model output to the type needed by DPMSolver algorithm.
converted = scheduler.convert_model_output(
    model_output=noise_pred,
    sample=current_sample
)
model_output
torch.Tensor
required
Direct output from learned diffusion model
sample
torch.Tensor
required
Current sample instance
converted_output
torch.Tensor
Converted output (either x0_pred for dpmsolver++ or epsilon for dpmsolver)
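
For the default epsilon prediction under "dpmsolver++", the conversion simply rearranges the forward process: x0 = (x_t - sigma_t * eps) / alpha_t. A scalar sketch:

```python
alpha_t, sigma_t = 0.6, 0.8
x0, eps = 1.1, 0.4

x_t = alpha_t * x0 + sigma_t * eps         # forward process
x0_pred = (x_t - sigma_t * eps) / alpha_t  # epsilon -> x0 conversion
print(x0_pred)  # ~1.1 (exact inverse when eps is the true noise)
```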

Properties

timesteps
torch.Tensor
Discrete timesteps used for inference (set by set_timesteps())
step_index
int
Current step index counter (increases after each step() call)
begin_index
int
Beginning index for scheduler (set with set_begin_index())
sigmas
torch.Tensor
Noise levels (sigma values) for each timestep
init_noise_sigma
float
Standard deviation of initial noise distribution (always 1.0)
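
The sigmas property is derived from the training schedule; assuming the usual DPM-Solver++ parameterization, sigma_t = sqrt((1 - a_bar_t) / a_bar_t), which grows monotonically with t. A pure-Python sketch using the default linear beta schedule:

```python
import math

betas = [0.0001 + (0.02 - 0.0001) * i / 999 for i in range(1000)]

sigmas, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b                       # running a_bar_t
    sigmas.append(math.sqrt((1.0 - prod) / prod))

# Noise level increases monotonically toward the end of the schedule
print(sigmas[0], sigmas[-1])
```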

Usage in VibeVoice

The scheduler is used internally by VibeVoice models for speech token diffusion sampling:
# Inside VibeVoiceStreamingForConditionalGenerationInference
def sample_speech_tokens(self, condition, neg_condition, cfg_scale=3.0):
    self.model.noise_scheduler.set_timesteps(self.ddpm_inference_steps)
    condition = torch.cat([condition, neg_condition], dim=0).to(self.model.prediction_head.device)
    speech = torch.randn(condition.shape[0], self.config.acoustic_vae_dim).to(condition)
    
    for t in self.model.noise_scheduler.timesteps:
        half = speech[: len(speech) // 2]
        combined = torch.cat([half, half], dim=0)
        eps = self.model.prediction_head(combined, t.repeat(combined.shape[0]).to(combined), condition=condition)
        cond_eps, uncond_eps = torch.split(eps, len(eps) // 2, dim=0)
        half_eps = uncond_eps + cfg_scale * (cond_eps - uncond_eps)
        eps = torch.cat([half_eps, half_eps], dim=0)
        speech = self.model.noise_scheduler.step(eps, t, speech).prev_sample
    
    return speech[: len(speech) // 2]
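
The classifier-free guidance line above blends the conditional and unconditional noise predictions. A scalar sketch of that blend (cfg_blend is an illustrative helper, not part of the VibeVoice API):

```python
def cfg_blend(cond_eps, uncond_eps, cfg_scale):
    """uncond + scale * (cond - uncond): scale 0 -> unconditional, 1 -> conditional."""
    return uncond_eps + cfg_scale * (cond_eps - uncond_eps)

print(cfg_blend(1.0, 0.2, 0.0))  # pure unconditional prediction (0.2)
print(cfg_blend(1.0, 0.2, 1.0))  # pure conditional prediction (1.0)
print(cfg_blend(1.0, 0.2, 3.0))  # extrapolates past the conditional (~2.6)
```

Values above 1 push the sample further toward the condition, which is why higher cfg_scale improves conditioning adherence.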

Example: Standalone Usage

from vibevoice.schedule import DPMSolverMultistepScheduler
import torch

# Initialize scheduler
scheduler = DPMSolverMultistepScheduler(
    num_train_timesteps=1000,
    beta_schedule="linear",
    solver_order=2,
    prediction_type="epsilon",
    algorithm_type="dpmsolver++"
)

# Set inference timesteps
scheduler.set_timesteps(num_inference_steps=5, device="cuda")

# Initialize with noise
sample = torch.randn(1, 64).to("cuda")

# Denoise iteratively
for t in scheduler.timesteps:
    # Predict noise (epsilon); `model` here stands in for your trained denoising network
    model_output = model(sample, t)

    
    # Scheduler step
    result = scheduler.step(model_output, t, sample)
    sample = result.prev_sample

# sample now contains denoised output

Beta Schedule Types

The beta_schedule argument selects how the noise schedule is generated; "linear" (shown below) is the default, with "scaled_linear", "squaredcos_cap_v2"/"cosine", "cauchy", and "laplace" also available:

scheduler = DPMSolverMultistepScheduler(
    beta_schedule="linear",
    beta_start=0.0001,
    beta_end=0.02
)
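
The schedules differ only in how the beta values are generated. A hedged sketch of the two linear variants in pure Python (the cosine, cauchy, and laplace options use different closed forms):

```python
import math

beta_start, beta_end, n = 0.0001, 0.02, 1000

# "linear": betas spaced evenly between beta_start and beta_end
linear = [beta_start + (beta_end - beta_start) * i / (n - 1) for i in range(n)]

# "scaled_linear": even spacing in sqrt(beta), then squared (latent-diffusion style)
scaled = [(math.sqrt(beta_start)
           + (math.sqrt(beta_end) - math.sqrt(beta_start)) * i / (n - 1)) ** 2
          for i in range(n)]

# Both share the same endpoints but distribute noise differently in between
print(linear[0], scaled[0], linear[-1], scaled[-1])
```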

Algorithm Types Comparison

Algorithm Type     Use Case                     Stochastic   Speed
dpmsolver++        General (recommended)        No           Fast
dpmsolver          Original formulation         No           Fast
sde-dpmsolver++    High quality, small steps    Yes          Fast
sde-dpmsolver      Original SDE variant         Yes          Fast

Notes

  • VibeVoice typically uses 5 inference steps for real-time performance
  • Higher cfg_scale values (1.5-3.0) improve conditioning adherence
  • The scheduler supports both epsilon and v-prediction formulations
  • For best results with few steps, use solver_order=2 and lower_order_final=True
