This API documentation is provisional. WorldStereo code and model weights are not yet publicly released. The API described here represents the expected interface based on the research framework.

Overview

The WorldStereo class is the core model that bridges camera-guided video generation and 3D scene reconstruction. It integrates a Video Diffusion Model (VDM) backbone with geometric memory modules to generate multi-view-consistent videos under precise camera control.

Model Architecture

WorldStereo consists of three primary components:
  1. VDM Backbone: Distribution matching distilled Video Diffusion Model
  2. Global Geometric Memory: Injects coarse structural priors via point clouds
  3. Spatial-Stereo Memory: Constrains attention with 3D correspondence for fine details
The control branch design enables efficient integration without requiring joint training with the VDM backbone.
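The control-branch pattern can be sketched in a few lines of plain Python: a frozen backbone plus a lightweight trainable branch whose output is added to the backbone's, so the backbone itself never needs retraining. Everything below is an illustrative assumption (ControlNet-style additive injection), not the released implementation:

```python
def frozen_vdm_backbone(features):
    # Stand-in for the pretrained (and frozen) VDM backbone.
    return [2.0 * f for f in features]

def control_branch(features, memory_bias):
    # Lightweight trainable branch injecting geometric memory;
    # `memory_bias` stands in for features from the memory modules.
    return [f + memory_bias for f in features]

def worldstereo_step(features, memory_bias):
    # Additive injection (an assumption): the branch's output is added
    # to the frozen backbone's, so no joint training is required.
    base = frozen_vdm_backbone(features)
    ctrl = control_branch(features, memory_bias)
    return [b + c for b, c in zip(base, ctrl)]

out = worldstereo_step([1.0, -1.0], memory_bias=0.5)
```

Because the backbone's weights never receive gradients in this design, only the small branch is trained, which is what makes the integration cheap.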

Class: WorldStereo

Constructor

WorldStereo(
    vdm_config: Dict[str, Any],
    global_memory_config: Dict[str, Any],
    spatial_memory_config: Dict[str, Any],
    pretrained_path: Optional[str] = None,
    device: str = "cuda"
)
vdm_config (Dict[str, Any], required)
Configuration for the Video Diffusion Model backbone. Includes parameters for:
  • Model architecture (layers, dimensions, attention heads)
  • Temporal modeling settings
  • Diffusion process parameters (timesteps, noise schedule)

global_memory_config (Dict[str, Any], required)
Configuration for the global geometric memory module. See Global Geometric Memory for details.

spatial_memory_config (Dict[str, Any], required)
Configuration for the spatial-stereo memory module. See Spatial-Stereo Memory for details.

pretrained_path (Optional[str], default: None)
Path to pretrained model weights. If not provided, the model is initialized with random weights.

device (str, default: "cuda")
Device to run the model on ("cuda" or "cpu").

Methods

forward

Performs a forward pass through the model for training.
forward(
    images: torch.Tensor,
    camera_params: CameraParameters,
    timesteps: torch.Tensor,
    point_cloud: Optional[torch.Tensor] = None,
    return_dict: bool = True
) -> Union[ModelOutput, Tuple]
images (torch.Tensor, required)
Input images with shape (B, T, C, H, W), where:
  • B: batch size
  • T: number of frames
  • C: channels (3 for RGB)
  • H: height
  • W: width

camera_params (CameraParameters, required)
Camera parameters including:
  • Intrinsics (focal length, principal point)
  • Extrinsics (rotation, translation)
  • Trajectory information for multi-view generation

timesteps (torch.Tensor, required)
Diffusion timesteps for the forward process, shape (B,).

point_cloud (Optional[torch.Tensor], default: None)
Optional point cloud for global geometric memory initialization. Shape (N, 3), or (N, 6) with RGB.

return_dict (bool, default: True)
Whether to return a ModelOutput object or a plain tuple.

Returns

loss (torch.Tensor)
Training loss value.

predicted_noise (torch.Tensor)
Predicted noise from the diffusion process, shape (B, T, C, H, W).

memory_features (Dict[str, torch.Tensor])
Features extracted from both memory modules for analysis.
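Since the model predicts noise, the returned loss presumably follows the standard denoising-diffusion training objective: mean squared error between predicted and true noise. A minimal stdlib-Python sketch of that objective (the exact loss used by WorldStereo is an assumption, since the code is unreleased; tensors are represented as flat lists of floats):

```python
def noise_prediction_loss(predicted_noise, true_noise):
    """MSE between predicted and true noise, the standard
    denoising-diffusion training objective."""
    assert len(predicted_noise) == len(true_noise)
    return sum((p - t) ** 2 for p, t in zip(predicted_noise, true_noise)) / len(true_noise)

# A perfect prediction gives zero loss.
loss = noise_prediction_loss([0.1, -0.2, 0.3], [0.1, -0.2, 0.3])
```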

generate

Generate videos from input conditioning. See Inference API for detailed usage.
generate(
    condition_image: torch.Tensor,
    camera_trajectory: CameraTrajectory,
    num_frames: int = 16,
    num_inference_steps: int = 50,
    guidance_scale: float = 7.5,
    point_cloud: Optional[torch.Tensor] = None,
    generator: Optional[torch.Generator] = None
) -> torch.Tensor
condition_image (torch.Tensor, required)
Conditioning image (perspective or panoramic), shape (C, H, W) or (B, C, H, W).

camera_trajectory (CameraTrajectory, required)
Camera trajectory defining the viewpoints for video generation.

num_frames (int, default: 16)
Number of frames to generate in the output video.

num_inference_steps (int, default: 50)
Number of denoising steps in the diffusion process.

guidance_scale (float, default: 7.5)
Classifier-free guidance scale for controlling generation fidelity.

point_cloud (Optional[torch.Tensor], default: None)
Optional initial point cloud for geometric conditioning.

generator (Optional[torch.Generator], default: None)
Random generator for reproducibility.

Returns

output (torch.Tensor)
Generated video frames, shape (num_frames, C, H, W).
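The guidance_scale parameter usually controls the standard classifier-free guidance extrapolation (whether WorldStereo applies it in exactly this form is an assumption). A per-element sketch in plain Python:

```python
def apply_guidance(noise_uncond, noise_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one. Elements are plain floats
    for illustration; real tensors behave the same way."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]

# guidance_scale = 1.0 returns the conditional prediction unchanged;
# larger values push the sample further toward the conditioning signal.
guided = apply_guidance([0.0, 0.0], [1.0, 2.0], 7.5)
```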

update_geometric_memory

Incrementally update the global geometric memory with new observations.
update_geometric_memory(
    new_point_cloud: torch.Tensor,
    camera_pose: torch.Tensor,
    merge_strategy: str = "incremental"
) -> None
new_point_cloud (torch.Tensor, required)
New point cloud observations to integrate, shape (N, 3) or (N, 6).

camera_pose (torch.Tensor, required)
Camera pose for the new observations, as a (4, 4) transformation matrix.

merge_strategy (str, default: "incremental")
Strategy for merging point clouds: "incremental", "replace", or "weighted".
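The three merge strategies can be sketched in plain Python. The exact semantics of each strategy are assumptions (the released module will operate on tensors, not tuples): "incremental" appends, "replace" discards the old memory, and "weighted" is sketched here as averaging points that collide at the same coarsely quantized location:

```python
def merge_point_clouds(memory, new_points, strategy="incremental"):
    """Illustrative merge semantics; points are (x, y, z) tuples."""
    if strategy == "incremental":
        return memory + new_points          # append new observations
    if strategy == "replace":
        return list(new_points)             # discard old memory
    if strategy == "weighted":
        # Average coordinates of points that fall in the same
        # quantization cell; keep isolated points as-is.
        cells = {}
        for p in memory + new_points:
            key = tuple(round(c, 1) for c in p)
            cells.setdefault(key, []).append(p)
        return [tuple(sum(c) / len(ps) for c in zip(*ps)) for ps in cells.values()]
    raise ValueError(f"unknown merge strategy: {strategy}")

merged = merge_point_clouds([(0.0, 0.0, 0.0)], [(1.0, 1.0, 1.0)], "incremental")
```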

extract_features

Extract feature representations from the model at different stages.
extract_features(
    images: torch.Tensor,
    camera_params: CameraParameters,
    layer_indices: Optional[List[int]] = None
) -> Dict[str, torch.Tensor]
images (torch.Tensor, required)
Input images, shape (B, T, C, H, W).

camera_params (CameraParameters, required)
Camera parameters for the input views.

layer_indices (Optional[List[int]], default: None)
Specific layer indices to extract features from. If None, extracts features from all layers.

Returns

features (Dict[str, torch.Tensor])
Dictionary mapping layer names to feature tensors.
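The layer_indices selection semantics can be sketched in plain Python. Key names like "layer_0" are assumptions; the real method returns feature tensors rather than the lists used here:

```python
def select_layer_features(all_features, layer_indices=None):
    """None selects every layer; otherwise only the requested
    indices are returned, mirroring extract_features."""
    if layer_indices is None:
        return dict(all_features)
    return {f"layer_{i}": all_features[f"layer_{i}"] for i in layer_indices}

features = {"layer_0": [0.1], "layer_1": [0.2], "layer_2": [0.3]}
subset = select_layer_features(features, layer_indices=[0, 2])
```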

Properties

memory_state

Access the current state of geometric memory modules.
@property
memory_state() -> Dict[str, Any]
Returns

global_memory (Dict[str, Any])
Current state of the global geometric memory, including:
  • Point cloud data
  • Feature embeddings
  • Spatial index structure

spatial_memory (Dict[str, Any])
Current state of the spatial-stereo memory, including:
  • Memory bank contents
  • Correspondence mappings
  • Attention masks
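The nested state can be inspected like any dictionary. A sketch with a hypothetical snapshot (every key and value below is an assumption mirroring the documented contents, not the actual schema):

```python
# Hypothetical snapshot of model.memory_state.
memory_state = {
    "global_memory": {
        "point_cloud": [(0.0, 0.0, 1.0)],          # accumulated points
        "feature_embeddings": [[0.1, 0.2]],
        "spatial_index": {"type": "octree", "depth": 4},
    },
    "spatial_memory": {
        "memory_bank": [[0.3, 0.4]],
        "correspondences": {0: 1},
        "attention_masks": [[True, False]],
    },
}

num_points = len(memory_state["global_memory"]["point_cloud"])
```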

config

Model configuration dictionary.
@property
config() -> Dict[str, Any]

Example Usage

import torch
from worldstereo import WorldStereo, CameraParameters

# Initialize model
model = WorldStereo(
    vdm_config={
        "layers": 12,
        "dim": 768,
        "num_heads": 12,
        "timesteps": 1000
    },
    global_memory_config={
        "point_cloud_dim": 256,
        "num_neighbors": 32,
        "update_frequency": "per_frame"
    },
    spatial_memory_config={
        "memory_size": 1024,
        "num_correspondences": 64,
        "attention_window": 7
    },
    pretrained_path="path/to/weights.pth",
    device="cuda"
)

Memory Modules

Detailed documentation for geometric memory components

Inference API

High-level interface for video generation

Camera Control

Camera parameters and trajectory configuration

3D Reconstruction

Reconstruct 3D scenes from generated videos
