Overview
The inference API provides a simplified interface for generating videos from WorldStereo. It handles the complexity of memory management, diffusion scheduling, and camera control, allowing you to focus on creative applications.
This API is designed for production use cases and research experiments. For low-level control, use the WorldStereo model directly.
Function: generate_video
Main function for generating multi-view-consistent videos.
Parameters
Initialized WorldStereo model instance.
Conditioning image for video generation. Can be:
- File path (str)
- PIL Image object
- Torch tensor with shape (C, H, W) or (1, C, H, W)
Camera trajectory defining viewpoints for each frame. See Camera Control for creation methods.
Number of frames to generate. Must match the trajectory length.
Output video height in pixels.
Output video width in pixels.
Number of denoising steps. More steps typically improve quality but increase computation time.
- 25-50: fast generation
- 50-100: high quality
Classifier-free guidance scale. Higher values increase fidelity to the condition image but may reduce diversity.
- Typical range: 5.0-15.0
- Lower values (5.0-7.5): More creative
- Higher values (10.0-15.0): More faithful to input
Optional initial point cloud for geometric conditioning, shape (N, 3) or (N, 6) with RGB.
Whether to use the global geometric memory module.
Whether to use the spatial-stereo memory module.
Random seed for reproducibility. If None, uses a random seed.
Output format:
- “tensor”: PyTorch tensor, shape (T, C, H, W)
- “numpy”: NumPy array, shape (T, H, W, C)
- “pil”: List of PIL Images
- “video”: MP4 file (requires save_path)
Path to save the output video. Required if output_format="video".
Optional callback function called after each denoising step: callback(step: int, total_steps: int, latent: torch.Tensor) -> None
Returns
Generated video in the specified output format.
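The effect of guidance_scale can be seen in the standard classifier-free guidance update. This is a hedged sketch: whether WorldStereo combines predictions exactly this way is an assumption, numpy stands in for torch, and apply_guidance is a hypothetical name used only for illustration.

```python
import numpy as np

def apply_guidance(uncond: np.ndarray, cond: np.ndarray, scale: float) -> np.ndarray:
    """Classifier-free guidance: guided = uncond + scale * (cond - uncond)."""
    return uncond + scale * (cond - uncond)

uncond = np.zeros(2)              # unconditional noise prediction
cond = np.array([1.0, 2.0])       # conditional noise prediction
print(apply_guidance(uncond, cond, 1.0))  # scale 1.0 recovers the conditional prediction
print(apply_guidance(uncond, cond, 7.5))  # larger scales extrapolate toward the condition
```

Scales above 1.0 push the prediction further toward the conditioning signal, which is why higher values increase fidelity but can reduce diversity.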
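The “tensor” and “numpy” output formats differ only in layout: channels-first (T, C, H, W) versus channels-last (T, H, W, C). A minimal sketch of the relationship, with numpy standing in for torch and an illustrative helper name:

```python
import numpy as np

def tensor_to_numpy_layout(video: np.ndarray) -> np.ndarray:
    """Convert a channels-first (T, C, H, W) video to channels-last (T, H, W, C)."""
    return np.transpose(video, (0, 2, 3, 1))

video = np.zeros((16, 3, 256, 256), dtype=np.float32)  # 16 RGB frames
print(tensor_to_numpy_layout(video).shape)  # (16, 256, 256, 3)
```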
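A callback matching the documented signature can be used to track denoising progress. The sketch below simulates the scheduler loop with a plain range so it runs without torch; passing latent=None stands in for the real latent tensor.

```python
messages = []

def progress_callback(step: int, total_steps: int, latent) -> None:
    # A real callback could inspect `latent`; here we just record progress.
    messages.append(f"step {step + 1}/{total_steps}")

# Simulated denoising loop standing in for generate_video's scheduler:
for step in range(3):
    progress_callback(step, 3, latent=None)

print(messages)  # ['step 1/3', 'step 2/3', 'step 3/3']
```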
Function: generate_multi_view
Generate multiple videos from different camera trajectories with shared memory.
Parameters
WorldStereo model instance.
Conditioning image.
List of camera trajectories for different viewpoints.
If True, memory modules are shared and updated across all trajectories for consistency.
Additional arguments passed to generate_video().
Returns
List of generated videos, one per trajectory.
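The consistency benefit of sharing memory can be illustrated with a toy sketch. This is not the real API: GeometricMemory and generate_for_trajectories are hypothetical stand-ins showing why a shared, cumulatively updated memory keeps later trajectories consistent with earlier ones.

```python
class GeometricMemory:
    """Hypothetical stand-in for a WorldStereo memory module."""
    def __init__(self):
        self.observations = []

    def update(self, view):
        self.observations.append(view)

def generate_for_trajectories(trajectories, share_memory=True):
    shared = GeometricMemory()
    results = []
    for traj in trajectories:
        memory = shared if share_memory else GeometricMemory()
        # A real implementation would condition generation on memory.observations.
        memory.update(f"views from {traj}")
        results.append((traj, len(memory.observations)))
    return results

# With sharing, each trajectory sees all previously accumulated observations:
print(generate_for_trajectories(["orbit", "dolly"], share_memory=True))
```

With share_memory=False each trajectory starts from an empty memory, so the generated videos are mutually independent.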
Function: reconstruct_3d
Generate video and reconstruct the 3D scene.
Parameters
WorldStereo model instance.
Conditioning image.
Camera trajectory for video generation.
3D reconstruction method:
- “point_cloud”: Dense point cloud
- “mesh”: Triangle mesh
- “neural_field”: Neural radiance field
Number of frames to generate.
Additional arguments for generation.
Returns
Generated video.
3D reconstruction with:
- points: Point cloud coordinates (N, 3)
- colors: RGB colors (N, 3) (if available)
- normals: Surface normals (N, 3) (if available)
- mesh: Triangle mesh (if reconstruction_method="mesh")
- quality_metrics: Reconstruction quality metrics
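The per-point arrays in the reconstruction result should all share the same first dimension N. A small illustrative validator, assuming the keys listed above (check_reconstruction itself is a hypothetical helper, not part of the API):

```python
import numpy as np

def check_reconstruction(result: dict) -> int:
    """Validate per-point array shapes in a reconstruction result and return N."""
    n = result["points"].shape[0]
    assert result["points"].shape == (n, 3), "points must be (N, 3)"
    for key in ("colors", "normals"):
        if key in result:
            assert result[key].shape == (n, 3), f"{key} must align with points"
    return n

# Synthetic result with 100 points:
result = {"points": np.zeros((100, 3)), "colors": np.ones((100, 3))}
print(check_reconstruction(result))  # 100
```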
Pipeline Class
For more control, use the WorldStereoPipeline class.
Class: WorldStereoPipeline
Path to pretrained model weights or Hugging Face model ID.
Device to run inference on.
Data type for model weights (float32 or float16).
Use memory-efficient attention from xFormers.
Offload model to CPU when not in use (saves GPU memory).
Methods
The pipeline exposes the same methods as the functional API: generate_video(), generate_multi_view(), and reconstruct_3d().
Examples
Advanced Configuration
Memory Management
Controlling Memory Usage
Optimization
Performance Optimization
Batch Processing
Batch Generation
Quality Tips
For Best Quality
- Use 50-100 inference steps
- Set guidance_scale to 10.0-15.0
- Enable both memory modules
- Use higher resolution (768x768 or 1024x1024)
For Speed
- Use 25-50 inference steps
- Use float16 weights with memory-efficient attention
For 3D Reconstruction
- Provide an initial point cloud if available
- Enable the global geometric memory module
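The tips above can be captured as keyword-argument presets for generate_video(). This is a hedged sketch: apart from guidance_scale, the parameter names (num_inference_steps, height, width) are assumptions, not confirmed by this page.

```python
# Hypothetical presets; only guidance_scale is a name confirmed by this page.
PRESETS = {
    "quality": {"num_inference_steps": 75, "guidance_scale": 12.0,
                "height": 768, "width": 768},
    "fast": {"num_inference_steps": 30, "guidance_scale": 7.5,
             "height": 512, "width": 512},
}

# Hypothetical usage:
#   generate_video(model, image, trajectory, **PRESETS["quality"])
```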
Related Documentation
- WorldStereo Model: Low-level model API for advanced use cases
- Memory Modules: Configure geometric memory components
- Camera Control: Create and customize camera trajectories
- 3D Reconstruction: Detailed reconstruction pipeline