DPMSolverMultistepScheduler
Fast dedicated high-order solver for diffusion ODEs, used in VibeVoice speech token sampling. This scheduler implements the DPM-Solver++ algorithm for efficient diffusion inference.
Class Signature
Initialization
Parameters
The number of diffusion steps to train the model
The starting beta value for the noise schedule
The final beta value for the noise schedule
The beta schedule type. Options:
- "linear": Linear schedule
- "scaled_linear": Scaled linear (for latent diffusion)
- "squaredcos_cap_v2" or "cosine": Cosine schedule
- "cauchy": Cauchy distribution schedule
- "laplace": Laplace distribution schedule
Pass an array of betas directly to bypass beta_start and beta_end
The DPMSolver order (1, 2, or 3). Recommended:
- Order 2 for guided sampling
- Order 3 for unconditional sampling
Prediction type of the scheduler. Options:
- "epsilon": Predicts noise
- "sample": Predicts noisy sample directly
- "v_prediction": Predicts velocity
Whether to use dynamic thresholding (not suitable for latent-space models)
The ratio for dynamic thresholding. Only used when thresholding=True
The threshold value for dynamic thresholding
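Dynamic thresholding, as commonly described, clips the predicted clean sample to a percentile of its absolute values and then rescales it. The following is a minimal pure-Python sketch of that idea for a single flat sample. The function names, the `ratio` and `max_value` arguments, and their defaults are illustrative assumptions, not this scheduler's actual signature.

```python
def quantile(values, q):
    """Linearly interpolated quantile of a non-empty list, with q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def dynamic_threshold(x0_pred, ratio=0.995, max_value=1.0):
    """Clip a predicted clean sample to [-s, s], then divide by s.

    s is the `ratio` quantile of the absolute values, kept within
    [1, max_value]. Both argument names are hypothetical.
    """
    s = quantile([abs(v) for v in x0_pred], ratio)
    s = min(max(s, 1.0), max_value)
    return [min(max(v, -s), s) / s for v in x0_pred]
```

With `max_value=1.0` the threshold is always 1, so the operation reduces to plain clipping to [-1, 1].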
Algorithm type for the solver. Options:
- "dpmsolver++": Recommended for most cases
- "dpmsolver": Original DPMSolver
- "sde-dpmsolver++": SDE variant with stochasticity
- "sde-dpmsolver": SDE variant of original
Solver type for second-order solver. Options:
- "midpoint": Recommended
- "heun": Alternative second-order method
Whether to use lower-order solvers in the final steps. Helps stabilize sampling with fewer than 15 inference steps
Whether to use Euler’s method in the final step (trade-off between stability and detail)
Whether to use Karras sigmas for step sizes in the noise schedule
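For reference, the Karras noise schedule interpolates between the largest and smallest sigma in sigma^(1/rho) space, which concentrates steps at low noise levels. This is a sketch of that published formula, not code taken from this scheduler. The function name, argument names, and `rho=7.0` default are assumptions for illustration.

```python
def karras_sigmas(sigma_min, sigma_max, num_steps, rho=7.0):
    """Return num_steps noise levels running from sigma_max down to sigma_min."""
    if num_steps == 1:
        return [sigma_max]
    min_inv_rho = sigma_min ** (1.0 / rho)
    max_inv_rho = sigma_max ** (1.0 / rho)
    sigmas = []
    for i in range(num_steps):
        ramp = i / (num_steps - 1)  # 0 at the noisiest step, 1 at the cleanest
        sigmas.append((max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho)
    return sigmas
```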
Whether to use uniform-logSNR for step sizes (Lu’s DPM-Solver)
The final sigma value. Options:
- "zero": Final sigma is 0
- "sigma_min": Final sigma is same as last training sigma
Clipping threshold for minimum lambda(t) for numerical stability (critical for cosine schedules)
Set to "learned" or "learned_range" for models that predict variance
How timesteps should be scaled. Options:
- "linspace": Linear spacing
- "leading": Leading spacing
- "trailing": Trailing spacing
Offset added to inference steps (required by some model families)
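To make the three spacing options and the offset concrete, here is a simplified sketch of one common convention for each. The scheduler's own rounding and endpoint handling may differ, and `make_timesteps` and all of its argument names are hypothetical.

```python
def make_timesteps(num_train_timesteps, num_inference_steps,
                   spacing="linspace", steps_offset=0):
    """Return descending integer timesteps for the chosen spacing rule."""
    T, n = num_train_timesteps, num_inference_steps
    if spacing == "linspace":
        # Evenly spaced over [0, T - 1], both endpoints included.
        if n == 1:
            return [T - 1]
        return [round((T - 1) * i / (n - 1)) for i in reversed(range(n))]
    if spacing == "leading":
        # Integer stride starting at 0, so the largest timestep is below T - 1.
        stride = T // n
        return [i * stride + steps_offset for i in reversed(range(n))]
    if spacing == "trailing":
        # Stride counted back from T, so the first timestep is T - 1.
        stride = T / n
        return [round(T - i * stride) - 1 for i in range(n)]
    raise ValueError(f"unknown spacing: {spacing}")
```

In this sketch the offset only applies to "leading" spacing, where the schedule would otherwise start at timestep 0.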
Whether to rescale betas to have zero terminal SNR (enables very bright/dark generation)
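Zero-terminal-SNR rescaling is usually described as shifting and scaling the square root of the cumulative alphas so that the last value becomes exactly zero while the first is preserved. The sketch below follows that description in pure Python. The function name is an assumption, and the input needs at least two betas, each below 1.

```python
import math


def rescale_zero_terminal_snr(betas):
    """Rescale betas so the final cumulative alpha, and thus the SNR, is zero."""
    # Square root of the cumulative product of (1 - beta).
    sqrt_ab = []
    prod = 1.0
    for b in betas:
        prod *= 1.0 - b
        sqrt_ab.append(math.sqrt(prod))
    first, last = sqrt_ab[0], sqrt_ab[-1]
    # Shift so the last value is 0, then scale so the first is unchanged.
    sqrt_ab = [(v - last) * first / (first - last) for v in sqrt_ab]
    alphas_bar = [v * v for v in sqrt_ab]
    # Recover per-step alphas from consecutive cumulative products.
    alphas = [alphas_bar[0]] + [
        alphas_bar[i] / alphas_bar[i - 1] for i in range(1, len(alphas_bar))
    ]
    return [1.0 - a for a in alphas]
```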
Methods
set_timesteps
Set the discrete timesteps used for the diffusion chain.
Number of diffusion steps for generation. Fewer steps run faster but may lower quality
Device to place timesteps tensor on
Custom timesteps to use instead of automatic generation. Cannot be used with num_inference_steps
step
Predict the sample from the previous timestep using the multistep DPMSolver.
Direct output from the learned diffusion model
Current discrete timestep in the diffusion chain
Current instance of sample created by diffusion process
Random number generator for stochastic sampling
Alternative to generating noise with generator (for reproducibility)
Whether to return a SchedulerOutput or tuple
Contains:
prev_sample(torch.Tensor): Computed sample at previous timestep
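As a rough picture of what a single step computes, the first-order DPM-Solver++ update moves a sample from noise level s to a lower level t using the predicted clean sample. The sketch below shows only that first-order case on plain Python lists. The multistep scheduler also reuses earlier model outputs for orders 2 and 3, which is omitted here. The function name and arguments are assumptions.

```python
import math


def dpmsolver_pp_first_order(sample, x0_pred, alpha_s, sigma_s, alpha_t, sigma_t):
    """One first-order DPM-Solver++ step from noise level s to a lower level t."""
    lambda_s = math.log(alpha_s) - math.log(sigma_s)  # half log-SNR at s
    lambda_t = math.log(alpha_t) - math.log(sigma_t)  # half log-SNR at t
    h = lambda_t - lambda_s
    return [
        (sigma_t / sigma_s) * x - alpha_t * (math.exp(-h) - 1.0) * x0
        for x, x0 in zip(sample, x0_pred)
    ]
```

When `x0_pred` equals the true clean sample, this update lands exactly on the sample at level t with the same noise, which is a convenient sanity check.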
add_noise
Add noise to original samples at given timesteps.
Clean samples to add noise to
Noise to add to samples
Timesteps indicating noise levels
Noisy samples at specified timesteps
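The forward noising operation is conventionally a weighted sum of the clean sample and the noise, with weights taken from the cumulative alpha at the chosen timestep. A minimal sketch on plain lists follows. It assumes the caller has already looked up the cumulative alpha, and the `alpha_cumprod_t` argument name is hypothetical.

```python
import math


def add_noise(original, noise, alpha_cumprod_t):
    """Forward-diffuse a clean sample to the noise level given by alpha_cumprod_t."""
    alpha_t = math.sqrt(alpha_cumprod_t)        # signal scale
    sigma_t = math.sqrt(1.0 - alpha_cumprod_t)  # noise scale
    return [alpha_t * x + sigma_t * n for x, n in zip(original, noise)]
```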
get_velocity
Get velocity for v-prediction formulation.
Clean samples
Noise tensor
Timesteps
Velocity: alpha_t * noise - sigma_t * original_samples
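The velocity formula above can be written out directly. This sketch assumes alpha_t and sigma_t are the square roots of the cumulative alpha and of its complement, which is the usual variance-preserving convention. The function signature is illustrative only.

```python
import math


def get_velocity(original, noise, alpha_cumprod_t):
    """Velocity target v = alpha_t * noise - sigma_t * original."""
    alpha_t = math.sqrt(alpha_cumprod_t)
    sigma_t = math.sqrt(1.0 - alpha_cumprod_t)
    return [alpha_t * n - sigma_t * x for x, n in zip(original, noise)]
```

At zero noise the velocity equals the noise tensor, and at full noise it equals the negated clean sample.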
convert_model_output
Convert model output to the type needed by DPMSolver algorithm.
Direct output from learned diffusion model
Current sample instance
Converted output (either x0_pred for dpmsolver++ or epsilon for dpmsolver)
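For the dpmsolver++ case, the conversion to x0_pred can be sketched from the standard relations between noise, velocity, and the clean sample. The helper below is a hypothetical illustration, not the scheduler's method. It assumes alpha_t squared plus sigma_t squared equals 1, and it treats a "sample" prediction as already being the clean-sample estimate.

```python
def to_x0(model_output, sample, alpha_t, sigma_t, prediction_type="epsilon"):
    """Convert a model output into a clean-sample estimate x0_pred."""
    if prediction_type == "epsilon":
        # sample = alpha_t * x0 + sigma_t * eps, solved for x0
        return [(x - sigma_t * e) / alpha_t for x, e in zip(sample, model_output)]
    if prediction_type == "sample":
        # Assumed here to be the x0 estimate already.
        return list(model_output)
    if prediction_type == "v_prediction":
        # v = alpha_t * eps - sigma_t * x0, so x0 = alpha_t * sample - sigma_t * v
        return [alpha_t * x - sigma_t * v for x, v in zip(sample, model_output)]
    raise ValueError(f"unknown prediction_type: {prediction_type}")
```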
Properties
Discrete timesteps used for inference (set by set_timesteps())
Current step index counter (increases after each step() call)
Beginning index for scheduler (set with set_begin_index())
Noise levels (sigma values) for each timestep
Standard deviation of initial noise distribution (always 1.0)
Usage in VibeVoice
The scheduler is used internally by VibeVoice models for speech token diffusion sampling.
Example: Standalone Usage
Beta Schedule Types
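As a rough guide to how three of the schedule types differ, the sketch below computes linear, scaled-linear, and cosine betas with the standard library only. The function names and default values are assumptions chosen for illustration, the linear variants need at least two steps, and the "cauchy" and "laplace" schedules are not covered.

```python
import math


def linear_betas(n, beta_start=1e-4, beta_end=0.02):
    """Betas evenly spaced between beta_start and beta_end."""
    step = (beta_end - beta_start) / (n - 1)
    return [beta_start + i * step for i in range(n)]


def scaled_linear_betas(n, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced in sqrt(beta), then squared."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    step = (hi - lo) / (n - 1)
    return [(lo + i * step) ** 2 for i in range(n)]


def cosine_betas(n, max_beta=0.999):
    """Betas derived from a squared-cosine cumulative alpha curve."""
    def alpha_bar(t):
        return math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2

    return [
        min(1.0 - alpha_bar((i + 1) / n) / alpha_bar(i / n), max_beta)
        for i in range(n)
    ]
```

The scaled-linear schedule stays below the linear one in the middle of the range, so it adds noise more slowly early on.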
Algorithm Types Comparison
| Algorithm Type | Use Case | Stochastic | Speed |
|---|---|---|---|
| dpmsolver++ | General (recommended) | No | Fast |
| dpmsolver | Original formulation | No | Fast |
| sde-dpmsolver++ | High quality, small steps | Yes | Fast |
| sde-dpmsolver | Original SDE variant | Yes | Fast |
Notes
- VibeVoice typically uses 5 inference steps for real-time performance
- Higher cfg_scale values (1.5-3.0) improve conditioning adherence
- The scheduler supports both epsilon and v-prediction formulations
- For best results with few steps, use solver_order=2 and lower_order_final=True
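The page does not say how cfg_scale is applied inside VibeVoice. Under the common classifier-free guidance formulation, the guided output extrapolates from the unconditional prediction toward the conditional one. The sketch below shows that formulation only, and `apply_guidance` is a hypothetical helper.

```python
def apply_guidance(cond, uncond, cfg_scale):
    """Classifier-free guidance: extrapolate from the unconditional output."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]
```

A scale of 1.0 returns the conditional output unchanged, and larger values push further in the conditional direction.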