Overview
The WorldStereo class is the core model that bridges camera-guided video generation and 3D scene reconstruction. It integrates a Video Diffusion Model (VDM) backbone with geometric memory modules to generate multi-view-consistent videos under precise camera control.
Model Architecture
WorldStereo consists of three primary components:
- VDM Backbone: a distribution-matching-distilled Video Diffusion Model
- Global Geometric Memory: Injects coarse structural priors via point clouds
- Spatial-Stereo Memory: Constrains attention with 3D correspondence for fine details
The control branch design enables efficient integration without requiring joint training with the VDM backbone.
Class: WorldStereo
Constructor
Parameters:
- Configuration for the Video Diffusion Model backbone, covering:
  - model architecture (layers, dimensions, attention heads)
  - temporal modeling settings
  - diffusion process parameters (timesteps, noise schedule)
- Configuration for the global geometric memory module. See Global Geometric Memory for details.
- Configuration for the spatial-stereo memory module. See Spatial-Stereo Memory for details.
- Path to pretrained model weights. If omitted, the model is initialized with random weights.
- Device to run the model on ("cuda" or "cpu").
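A minimal instantiation sketch. The keyword names below (`vdm_config`, `num_layers`, `checkpoint_path`, and so on) are illustrative assumptions based on the parameter descriptions above, not the library's actual schema; the constructor call is shown commented out because the import path is not specified on this page.

```python
# Hypothetical configuration dictionaries; all key names are assumptions.
vdm_config = {
    "num_layers": 24,
    "hidden_dim": 1024,
    "num_heads": 16,
    "num_timesteps": 1000,
    "noise_schedule": "cosine",
}
global_memory_config = {"voxel_size": 0.05}
stereo_memory_config = {"bank_size": 4096}

# model = WorldStereo(
#     vdm_config=vdm_config,
#     global_memory_config=global_memory_config,
#     stereo_memory_config=stereo_memory_config,
#     checkpoint_path="weights/worldstereo.pt",  # omit for random init
#     device="cuda",
# )
```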
Methods
forward
Performs a forward pass through the model for training.

Parameters:
- Input images with shape (B, T, C, H, W), where:
  - B: batch size
  - T: number of frames
  - C: channels (3 for RGB)
  - H: height
  - W: width
- Camera parameters, including:
  - intrinsics (focal length, principal point)
  - extrinsics (rotation, translation)
  - trajectory information for multi-view generation
- Diffusion timesteps for the forward process, shape (B,).
- Optional point cloud for global geometric memory initialization, shape (N, 3), or (N, 6) with RGB.
- Whether to return a ModelOutput object or a plain tuple.

Returns:
- Training loss value.
- Predicted noise from the diffusion process, shape (B, T, C, H, W).
- Features extracted from both memory modules for analysis.
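A minimal stand-in for the training objective, assuming the standard diffusion setup implied above: the model predicts the noise added at a given timestep, and the loss is the mean squared error between predicted and true noise. The function name is illustrative.

```python
import random

# Toy MSE between predicted and true noise vectors (flattened here to
# plain Python lists for clarity; the real tensors are (B, T, C, H, W)).
def diffusion_mse(pred_noise, true_noise):
    n = len(pred_noise)
    return sum((p - q) ** 2 for p, q in zip(pred_noise, true_noise)) / n

true_noise = [random.gauss(0.0, 1.0) for _ in range(8)]
# A perfect noise prediction yields zero loss.
assert diffusion_mse(true_noise, true_noise) == 0.0
```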
generate
Generate videos from input conditioning. See Inference API for detailed usage.

Parameters:
- Conditioning image (perspective or panoramic), shape (C, H, W) or (B, C, H, W).
- Camera trajectory defining the viewpoints for video generation.
- Number of frames to generate in the output video.
- Number of denoising steps in the diffusion process.
- Classifier-free guidance scale for controlling generation fidelity.
- Optional initial point cloud for geometric conditioning.
- Random generator for reproducibility.

Returns:
- Generated video frames, shape (num_frames, C, H, W).

update_geometric_memory
Incrementally update the global geometric memory with new observations.

Parameters:
- New point cloud observations to integrate, shape (N, 3) or (N, 6).
- Camera pose for the new observations, a (4, 4) transformation matrix.
- Strategy for merging point clouds: "incremental", "replace", or "weighted".
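A toy sketch of the three merge strategies, using Python lists in place of the real (N, 3)/(N, 6) point arrays. The function name is illustrative, and the "weighted" branch is a placeholder, since the actual blending rule is not specified on this page.

```python
# Illustrative merge behavior only; not the module's real implementation.
def merge_points(memory, new_points, strategy="incremental"):
    if strategy == "incremental":
        return memory + new_points       # append new observations
    if strategy == "replace":
        return list(new_points)          # discard the old memory
    if strategy == "weighted":
        return memory + new_points       # placeholder for weighted blending
    raise ValueError(f"unknown strategy: {strategy}")

memory = [(0.0, 0.0, 0.0)]
new_points = [(1.0, 0.0, 0.0)]
assert len(merge_points(memory, new_points, "incremental")) == 2
assert merge_points(memory, new_points, "replace") == new_points
```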
extract_features
Extract feature representations from the model at different stages.

Parameters:
- Input images, shape (B, T, C, H, W).
- Camera parameters for the input views.
- Specific layer indices to extract features from. If None, extracts from all layers.

Returns:
- Dictionary mapping layer names to feature tensors.
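An illustrative picture of the return value: a dictionary mapping layer names to feature tensors (represented here by shape tuples). The layer names and shapes are made up for the example.

```python
# Hypothetical extract_features() result; names and shapes are assumptions.
features = {
    "vdm_layer_4": (2, 16, 1024),    # (B, T, dim)
    "vdm_layer_12": (2, 16, 1024),
    "global_memory": (2, 4096, 256),
}
# Downstream analysis can filter by layer name.
vdm_features = {name: shape for name, shape in features.items()
                if name.startswith("vdm_layer")}
```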
Properties
memory_state
Access the current state of the geometric memory modules.

Returns:
- Current state of global geometric memory, including:
  - point cloud data
  - feature embeddings
  - spatial index structure
- Current state of spatial-stereo memory, including:
  - memory bank contents
  - correspondence mappings
  - attention masks
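A sketch of what the property's return value might look like as a nested dictionary; the field names are assumptions derived from the listed contents, not a verified schema.

```python
# Hypothetical memory_state structure; all key names are assumptions.
memory_state = {
    "global": {
        "point_cloud": [],      # accumulated 3D points
        "embeddings": [],       # per-point feature embeddings
        "spatial_index": None,  # e.g. a voxel hash or octree
    },
    "stereo": {
        "memory_bank": [],      # stored features for attention
        "correspondences": {},  # 3D correspondence mappings
        "attention_masks": [],  # masks constraining attention
    },
}
```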
config
Model configuration dictionary.

Example Usage
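An end-to-end call pattern for generation. The method and argument names below are assumed from this page's descriptions, not a verified API; the generate() call is shown commented out because the import path and input preparation are covered in the Inference API documentation.

```python
# Hypothetical generation arguments; names and values are assumptions.
generation_args = {
    "num_frames": 16,
    "num_inference_steps": 50,
    "guidance_scale": 7.5,
}
# video = model.generate(
#     image=cond_image,              # (C, H, W) conditioning image
#     camera_trajectory=trajectory,  # viewpoints for each frame
#     point_cloud=initial_points,    # optional geometric conditioning
#     generator=rng,                 # for reproducibility
#     **generation_args,
# )
# video then has shape (num_frames, C, H, W).
```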
Related Documentation
Memory Modules
Detailed documentation for geometric memory components
Inference API
High-level interface for video generation
Camera Control
Camera parameters and trajectory configuration
3D Reconstruction
Reconstruct 3D scenes from generated videos