Overview
The WorldStereo class is the core model that bridges camera-guided video generation and 3D scene reconstruction. It integrates a Video Diffusion Model (VDM) backbone with geometric memory modules to generate multi-view-consistent videos under precise camera control.
Model Architecture
WorldStereo consists of three primary components:
- VDM Backbone: a distribution-matching-distilled Video Diffusion Model
- Global Geometric Memory: Injects coarse structural priors via point clouds
- Spatial-Stereo Memory: Constrains attention with 3D correspondence for fine details
The control branch design enables efficient integration without requiring joint training with the VDM backbone.
Class: WorldStereo
Constructor
Parameters:
- Configuration for the Video Diffusion Model backbone, covering:
  - model architecture (layers, dimensions, attention heads)
  - temporal modeling settings
  - diffusion process parameters (timesteps, noise schedule)
- Configuration for the global geometric memory module. See Global Geometric Memory for details.
- Configuration for the spatial-stereo memory module. See Spatial-Stereo Memory for details.
- Path to pretrained model weights. If omitted, the model is initialized with random weights.
- Device to run the model on ("cuda" or "cpu").
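A minimal instantiation sketch. The keyword names below (`vdm_config`, `num_layers`, `checkpoint_path`, and so on) are illustrative assumptions based on the parameter descriptions above, not the library's actual schema; the constructor call is shown commented out because the import path is not specified on this page.

```python
# Hypothetical configuration dictionaries; all key names are assumptions.
vdm_config = {
    "num_layers": 24,
    "hidden_dim": 1024,
    "num_heads": 16,
    "num_timesteps": 1000,
    "noise_schedule": "cosine",
}
global_memory_config = {"voxel_size": 0.05}
stereo_memory_config = {"bank_size": 4096}

# model = WorldStereo(
#     vdm_config=vdm_config,
#     global_memory_config=global_memory_config,
#     stereo_memory_config=stereo_memory_config,
#     checkpoint_path="weights/worldstereo.pt",  # omit for random init
#     device="cuda",
# )
```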
Methods
forward
Performs a forward pass through the model for training.

Parameters:
- Input images with shape (B, T, C, H, W), where:
  - B: batch size
  - T: number of frames
  - C: channels (3 for RGB)
  - H: height
  - W: width
- Camera parameters, including:
  - intrinsics (focal length, principal point)
  - extrinsics (rotation, translation)
  - trajectory information for multi-view generation
- Diffusion timesteps for the forward process, shape (B,).
- Optional point cloud for global geometric memory initialization, shape (N, 3), or (N, 6) with RGB.
- Whether to return a ModelOutput object or a plain tuple.

Returns:
- Training loss value.
- Predicted noise from the diffusion process, shape (B, T, C, H, W).
- Features extracted from both memory modules for analysis.
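A minimal stand-in for the training objective, assuming the standard diffusion setup implied above: the model predicts the noise added at a given timestep, and the loss is the mean squared error between predicted and true noise. The function name is illustrative.

```python
import random

# Toy MSE between predicted and true noise vectors (flattened here to
# plain Python lists for clarity; the real tensors are (B, T, C, H, W)).
def diffusion_mse(pred_noise, true_noise):
    n = len(pred_noise)
    return sum((p - q) ** 2 for p, q in zip(pred_noise, true_noise)) / n

true_noise = [random.gauss(0.0, 1.0) for _ in range(8)]
# A perfect noise prediction yields zero loss.
assert diffusion_mse(true_noise, true_noise) == 0.0
```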
generate
Generate videos from input conditioning. See Inference API for detailed usage.

Parameters:
- Conditioning image (perspective or panoramic), shape (C, H, W) or (B, C, H, W).
- Camera trajectory defining the viewpoints for video generation.
- Number of frames to generate in the output video.
- Number of denoising steps in the diffusion process.
- Classifier-free guidance scale for controlling generation fidelity.
- Optional initial point cloud for geometric conditioning.
- Random generator for reproducibility.

Returns:
- Generated video frames, shape (num_frames, C, H, W).

update_geometric_memory
Incrementally update the global geometric memory with new observations.

Parameters:
- New point cloud observations to integrate, shape (N, 3) or (N, 6).
- Camera pose for the new observations, a (4, 4) transformation matrix.
- Strategy for merging point clouds: "incremental", "replace", or "weighted".
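A toy sketch of the three merge strategies, using Python lists in place of the real (N, 3)/(N, 6) point arrays. The function name is illustrative, and the "weighted" branch is a placeholder, since the actual blending rule is not specified on this page.

```python
# Illustrative merge behavior only; not the module's real implementation.
def merge_points(memory, new_points, strategy="incremental"):
    if strategy == "incremental":
        return memory + new_points       # append new observations
    if strategy == "replace":
        return list(new_points)          # discard the old memory
    if strategy == "weighted":
        return memory + new_points       # placeholder for weighted blending
    raise ValueError(f"unknown strategy: {strategy}")

memory = [(0.0, 0.0, 0.0)]
new_points = [(1.0, 0.0, 0.0)]
assert len(merge_points(memory, new_points, "incremental")) == 2
assert merge_points(memory, new_points, "replace") == new_points
```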
extract_features
Extract feature representations from the model at different stages.

Parameters:
- Input images, shape (B, T, C, H, W).
- Camera parameters for the input views.
- Specific layer indices to extract features from. If None, extracts from all layers.

Returns:
- Dictionary mapping layer names to feature tensors.
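An illustrative picture of the return value: a dictionary mapping layer names to feature tensors (represented here by shape tuples). The layer names and shapes are made up for the example.

```python
# Hypothetical extract_features() result; names and shapes are assumptions.
features = {
    "vdm_layer_4": (2, 16, 1024),    # (B, T, dim)
    "vdm_layer_12": (2, 16, 1024),
    "global_memory": (2, 4096, 256),
}
# Downstream analysis can filter by layer name.
vdm_features = {name: shape for name, shape in features.items()
                if name.startswith("vdm_layer")}
```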
Properties
memory_state
Access the current state of the geometric memory modules.

Returns:
- Current state of global geometric memory, including:
  - point cloud data
  - feature embeddings
  - spatial index structure
- Current state of spatial-stereo memory, including:
  - memory bank contents
  - correspondence mappings
  - attention masks
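A sketch of what the property's return value might look like as a nested dictionary; the field names are assumptions derived from the listed contents, not a verified schema.

```python
# Hypothetical memory_state structure; all key names are assumptions.
memory_state = {
    "global": {
        "point_cloud": [],      # accumulated 3D points
        "embeddings": [],       # per-point feature embeddings
        "spatial_index": None,  # e.g. a voxel hash or octree
    },
    "stereo": {
        "memory_bank": [],      # stored features for attention
        "correspondences": {},  # 3D correspondence mappings
        "attention_masks": [],  # masks constraining attention
    },
}
```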
config
Model configuration dictionary.

Example Usage
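An end-to-end call pattern for generation. The method and argument names below are assumed from this page's descriptions, not a verified API; the generate() call is shown commented out because the import path and input preparation are covered in the Inference API documentation.

```python
# Hypothetical generation arguments; names and values are assumptions.
generation_args = {
    "num_frames": 16,
    "num_inference_steps": 50,
    "guidance_scale": 7.5,
}
# video = model.generate(
#     image=cond_image,              # (C, H, W) conditioning image
#     camera_trajectory=trajectory,  # viewpoints for each frame
#     point_cloud=initial_points,    # optional geometric conditioning
#     generator=rng,                 # for reproducibility
#     **generation_args,
# )
# video then has shape (num_frames, C, H, W).
```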
Related Documentation
Memory Modules
Detailed documentation for geometric memory components
Inference API
High-level interface for video generation
Camera Control
Camera parameters and trajectory configuration
3D Reconstruction
Reconstruct 3D scenes from generated videos