Global-Geometric Memory

Overview

The global-geometric memory is a core component of WorldStereo. It maintains an incrementally updated point cloud of the scene's 3D structure and uses it for two purposes: precise camera control, and injecting coarse structural priors into the video generation process.

As the coarse structural backbone of WorldStereo, it supplies the geometric guidance that keeps generated videos consistent with the underlying 3D scene structure.

Purpose and Function

Primary Objectives

The global-geometric memory module has three primary objectives:

Camera Control: Enables precise control over camera trajectories during video generation.
Structural Priors: Injects coarse 3D structural information to guide the generation process.
View Consistency: Maintains geometric consistency across different viewpoints.

Incremental Point Cloud Updates

A key innovation of the global-geometric memory is its use of incrementally updated point clouds rather than static representations.

How Incremental Updates Work

Point Cloud Initialization

The point cloud is initialized from the input image (perspective or panoramic) using depth estimation or explicit depth information. This initial point cloud captures the basic 3D structure visible in the starting view.
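
As a rough illustration of the initialization step for a perspective input, the sketch below back-projects a depth map into camera-space points with a pinhole model. The function name `unproject_depth` and the intrinsics parameters `fx`, `fy`, `cx`, `cy` are assumptions made for this example, not WorldStereo's actual interface. A panoramic input would need a spherical unprojection instead.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def unproject_depth(depth: List[List[float]], fx: float, fy: float,
                    cx: float, cy: float) -> List[Point]:
    """Back-project a depth map into camera-space 3D points (pinhole model)."""
    points: List[Point] = []
    for v, row in enumerate(depth):          # v: pixel row
        for u, z in enumerate(row):          # u: pixel column, z: depth
            if z <= 0.0:                     # skip missing or invalid depth
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Hypothetical 2x2 depth map; the zero entry marks a pixel without depth.
depth = [[2.0, 2.0],
         [0.0, 4.0]]
cloud = unproject_depth(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

Each valid pixel yields one point, so the initial cloud covers only surfaces visible from the starting view.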

Incremental Growth

As new frames are generated along a camera trajectory, the point cloud is incrementally updated:

New observations from generated frames add points to previously unobserved regions
Existing points are refined with additional observations from different viewpoints
Confidence values are updated based on multi-view consistency

This incremental approach allows the 3D representation to grow and improve as more of the scene is explored.
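
The three update rules above can be sketched with a voxel-hashed store. This is a minimal illustration under stated assumptions, not the method WorldStereo uses: the names `Cell`, `voxel_key`, and `integrate` are hypothetical, "refinement" is a running mean of positions, and confidence is simply the observation count, a crude stand-in for a multi-view consistency score.

```python
import math
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

Vec3 = Tuple[float, float, float]
VoxelKey = Tuple[int, int, int]


@dataclass
class Cell:
    position: Vec3      # running mean of all observations in this voxel
    confidence: float   # assumption: observation count as a consistency proxy


def voxel_key(p: Vec3, voxel_size: float) -> VoxelKey:
    """Quantize a 3D position to the integer index of its voxel."""
    return (math.floor(p[0] / voxel_size),
            math.floor(p[1] / voxel_size),
            math.floor(p[2] / voxel_size))


def integrate(memory: Dict[VoxelKey, Cell], observations: Iterable[Vec3],
              voxel_size: float = 0.5) -> None:
    """Merge new observations into the memory in place."""
    for obs in observations:
        key = voxel_key(obs, voxel_size)
        cell = memory.get(key)
        if cell is None:
            # Previously unobserved region: add a new point.
            memory[key] = Cell(position=obs, confidence=1.0)
            continue
        # Already observed: refine the position and raise the confidence.
        n = cell.confidence
        cell.position = tuple((n * a + b) / (n + 1.0)
                              for a, b in zip(cell.position, obs))
        cell.confidence = n + 1.0


memory: Dict[VoxelKey, Cell] = {}
integrate(memory, [(0.1, 0.1, 1.1)])                    # first frame
integrate(memory, [(0.3, 0.1, 1.1), (2.1, 0.1, 1.1)])   # a later frame
```

Only the voxels touched by the new frame are visited, which is what makes the update incremental.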

Efficient Memory Management

Incremental updates enable efficient memory usage:

Only relevant regions of the scene are stored in detail
Points can be pruned based on confidence or relevance
The representation scales gracefully with scene complexity
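
Confidence-based pruning could look like the following sketch. The threshold, the point budget, and the function name `prune` are illustrative assumptions; the source does not specify the actual pruning criteria.

```python
from typing import Dict


def prune(confidences: Dict[int, float], min_confidence: float,
          max_points: int) -> Dict[int, float]:
    """Drop low-confidence points, then keep at most `max_points` of the rest."""
    kept = {pid: c for pid, c in confidences.items() if c >= min_confidence}
    if len(kept) > max_points:
        # Over budget: keep the highest-confidence points only.
        ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)
        kept = dict(ranked[:max_points])
    return kept


# Hypothetical point ids mapped to confidence scores.
scores = {0: 0.9, 1: 0.2, 2: 0.6, 3: 0.7}
survivors = prune(scores, min_confidence=0.5, max_points=2)
```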

Benefits of Incremental Representation

Computational Efficiency: Incremental updates avoid recomputing the entire 3D representation from scratch at each frame, which reduces computational overhead.

Additional advantages include:

Real-time capability: Supports streaming video generation scenarios
Memory efficiency: Only stores relevant geometric information
Progressive refinement: Quality improves with more observations
Adaptivity: Naturally handles scene exploration and discovery

Coarse Structural Priors

The global-geometric memory provides coarse structural priors that guide the video diffusion process.

What Are Structural Priors?

Structural priors encode:

Geometric Information:
  - Surface positions in 3D space
  - Overall scene layout and topology
  - Depth relationships between objects
  - Spatial extent of scene elements

Coarse Level:
  - General shapes rather than fine details
  - Object boundaries and major surfaces
  - Spatial relationships between elements
  - Overall geometric consistency

Injection into Generation Process

The point-cloud-based structural priors influence video generation by:

Conditioning the diffusion model on geometric features derived from the point cloud
Constraining plausible outputs to those consistent with the 3D structure
Guiding attention mechanisms to geometrically relevant regions
Ensuring multi-view consistency through shared 3D representation

While the global-geometric memory handles coarse structure, the spatial-stereo memory complements it by focusing on fine-grained details. This division of labor enables efficient processing at multiple geometric scales.

Precise Camera Control

The global-geometric memory is the foundation for WorldStereo’s precise camera control capabilities.

Camera Trajectory Integration

Camera control is achieved through:

Camera Pose Encoding: Camera extrinsics (position and orientation) and intrinsics (such as focal length, which sets the field of view) are encoded and provided to the generation model.
View-Dependent Rendering: The point cloud is projected to the target camera view, creating view-specific geometric features that condition the generation.
Trajectory Consistency: Sequential frames along a trajectory stay consistent because they share one point cloud representation.
Viewpoint Transitions: Smooth camera movements are supported by continuous point cloud queries across viewpoints.
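
To make the view-dependent rendering step concrete, here is a minimal sketch that projects world-space points into a target camera and keeps the nearest point per pixel (a simple z-buffer). The world-to-camera convention `X_cam = R @ X_world + t`, the function name `project_points`, and all parameter names are assumptions for illustration; the actual renderer may differ.

```python
import math
from typing import Dict, Sequence, Tuple

Pixel = Tuple[int, int]


def project_points(
    points: Sequence[Sequence[float]],
    R: Sequence[Sequence[float]],   # 3x3 world-to-camera rotation (assumed)
    t: Sequence[float],             # world-to-camera translation (assumed)
    fx: float, fy: float, cx: float, cy: float,
    width: int, height: int,
) -> Dict[Pixel, Tuple[float, int]]:
    """Map each covered pixel to (depth, index) of its nearest visible point."""
    zbuffer: Dict[Pixel, Tuple[float, int]] = {}
    for idx, p in enumerate(points):
        # World -> camera coordinates.
        xc, yc, zc = (
            sum(R[r][k] * p[k] for k in range(3)) + t[r] for r in range(3)
        )
        if zc <= 0.0:                              # behind the camera
            continue
        u = math.floor(fx * xc / zc + cx + 0.5)    # nearest pixel column
        v = math.floor(fy * yc / zc + cy + 0.5)    # nearest pixel row
        if not (0 <= u < width and 0 <= v < height):
            continue                               # outside the image
        best = zbuffer.get((u, v))
        if best is None or zc < best[0]:           # keep the closest point
            zbuffer[(u, v)] = (zc, idx)
    return zbuffer


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pts = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.0), (10.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
visible = project_points(pts, identity, [0.0, 0.0, 0.0],
                         fx=10.0, fy=10.0, cx=2.0, cy=2.0, width=5, height=5)
```

Because every frame on a trajectory queries the same point list with a different `R` and `t`, the frames share one geometric source, which is the basis of the trajectory consistency described above.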

Supporting Complex Camera Motions

The global-geometric memory enables various camera motion types:

Orbital movements: Rotating around objects or scenes
Forward/backward motion: Moving through the scene
Sideways translation: Parallel camera movements
Complex trajectories: Arbitrary 6-DOF camera paths
Zoom operations: Changing field of view with geometric awareness
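
As one example of such a trajectory, the sketch below generates an orbital path: camera positions on a horizontal circle, each paired with a unit forward vector aimed at the orbit center. The function `orbit_poses` and its parameters are hypothetical, and the output is a simplified pose (position plus viewing direction) rather than a full extrinsics matrix.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def orbit_poses(center: Vec3, radius: float, height: float,
                num_frames: int) -> List[Tuple[Vec3, Vec3]]:
    """Return (eye, forward) pairs on a horizontal circle around `center`."""
    poses = []
    for i in range(num_frames):
        angle = 2.0 * math.pi * i / num_frames
        eye = (center[0] + radius * math.cos(angle),
               center[1] + height,
               center[2] + radius * math.sin(angle))
        # Forward vector: from the camera toward the orbit center, normalized.
        direction = tuple(c - e for c, e in zip(center, eye))
        norm = math.sqrt(sum(d * d for d in direction))
        forward = tuple(d / norm for d in direction)
        poses.append((eye, forward))
    return poses


poses = orbit_poses(center=(0.0, 0.0, 0.0), radius=2.0, height=0.0, num_frames=4)
```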

Point Cloud Representation

Point Cloud Structure

Each point in the global-geometric memory stores:

Point Attributes:
  position: [x, y, z]           # 3D world coordinates
  color: [r, g, b]              # RGB appearance
  normal: [nx, ny, nz]          # Surface orientation
  confidence: float              # Multi-view consistency score
  feature: [f1, f2, ..., fn]    # Learned geometric features

The learned geometric features encode higher-level structural information that helps the diffusion model generate consistent content.
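
The attribute list above maps naturally onto a small record type. The dataclass below is a sketch only: the class name `GeometricPoint`, the value ranges, and the feature length are assumptions, since the source gives field names but not concrete types or dimensions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class GeometricPoint:
    position: Vec3            # 3D world coordinates
    color: Vec3               # RGB appearance (assumed to be in [0, 1])
    normal: Vec3              # surface orientation (assumed unit length)
    confidence: float         # multi-view consistency score
    feature: List[float] = field(default_factory=list)  # learned features


point = GeometricPoint(
    position=(0.0, 1.0, 2.0),
    color=(0.5, 0.5, 0.5),
    normal=(0.0, 0.0, 1.0),
    confidence=0.8,
)
```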

Point Cloud Processing

The point cloud undergoes several processing steps:

Projection: Points are projected to the target camera view
Feature extraction: Geometric and appearance features are computed
Aggregation: Multiple points may contribute to the same image region
Encoding: Processed features are encoded for the diffusion model
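
The aggregation step can be illustrated with a confidence-weighted average of all points that land on the same pixel. This is one plausible scheme chosen for the example, not a description of the actual aggregation in WorldStereo; the function name `aggregate` and the sample layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Pixel = Tuple[int, int]
Color = Tuple[float, float, float]


def aggregate(samples: Iterable[Tuple[Pixel, Color, float]]) -> Dict[Pixel, Color]:
    """Blend the colors of all points hitting a pixel, weighted by confidence."""
    # Per pixel: [weighted r, weighted g, weighted b, total weight]
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])
    for pixel, color, confidence in samples:
        acc = sums[pixel]
        for k in range(3):
            acc[k] += confidence * color[k]
        acc[3] += confidence
    return {
        pixel: (acc[0] / acc[3], acc[1] / acc[3], acc[2] / acc[3])
        for pixel, acc in sums.items()
        if acc[3] > 0.0   # ignore pixels whose points all have zero weight
    }


# Two points fall on pixel (0, 0); one point falls on pixel (1, 0).
samples = [
    ((0, 0), (1.0, 0.0, 0.0), 3.0),
    ((0, 0), (0.0, 1.0, 0.0), 1.0),
    ((1, 0), (0.0, 0.0, 1.0), 2.0),
]
blended = aggregate(samples)
```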

Integration with Video Diffusion

The global-geometric memory integrates with the video diffusion model (VDM) backbone through the control branch.

Control Signal Generation

The point cloud produces control signals that:

Modulate the diffusion model’s attention patterns
Provide geometric conditioning at multiple layers
Ensure geometric consistency across frames
Guide content generation in previously unobserved regions

Comparison with Alternative Approaches

vs. Implicit Neural Representations

Point clouds offer:

Faster query and update times
More interpretable representations
Easier integration with traditional 3D reconstruction pipelines
Better support for incremental updates

Trade-offs:

Point clouds may require more memory for very dense representations
Implicit representations can be more compact for smooth surfaces

vs. Voxel Grids

Point clouds offer:

Memory efficiency for sparse scenes
No fixed resolution limitations
Better handling of unbounded scenes
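
A back-of-the-envelope calculation shows why sparsity matters. The numbers below (a 512-cubed dense grid at 4 bytes per voxel, one million points at 24 bytes each) are purely illustrative assumptions and are not measurements from WorldStereo.

```python
def dense_grid_bytes(resolution: int, bytes_per_voxel: int) -> int:
    """Storage for a dense cubic grid: every voxel is stored, occupied or not."""
    return resolution ** 3 * bytes_per_voxel


def point_cloud_bytes(num_points: int, bytes_per_point: int) -> int:
    """Storage for a point cloud: cost grows only with the stored points."""
    return num_points * bytes_per_point


grid = dense_grid_bytes(resolution=512, bytes_per_voxel=4)
cloud = point_cloud_bytes(num_points=1_000_000, bytes_per_point=24)
ratio = grid / cloud
```

Under these assumptions the dense grid needs roughly 22 times more memory, and doubling its resolution multiplies the cost by eight, while the point cloud grows only when new surfaces are observed.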

Trade-offs:

Voxel grids provide more regular structure
Point clouds require spatial indexing for efficient queries
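
The spatial indexing mentioned in the trade-offs can be as simple as a uniform hash grid. The sketch below is one common choice, shown here as an assumption; the source does not say which index WorldStereo uses, and the class name `GridIndex` is hypothetical.

```python
import math
from collections import defaultdict
from typing import DefaultDict, List, Tuple

Vec3 = Tuple[float, float, float]
CellKey = Tuple[int, int, int]


class GridIndex:
    """Uniform hash grid supporting radius queries over 3D points."""

    def __init__(self, cell_size: float) -> None:
        self.cell_size = cell_size
        self.cells: DefaultDict[CellKey, List[Vec3]] = defaultdict(list)

    def _key(self, p: Vec3) -> CellKey:
        s = self.cell_size
        return (math.floor(p[0] / s), math.floor(p[1] / s), math.floor(p[2] / s))

    def insert(self, p: Vec3) -> None:
        self.cells[self._key(p)].append(p)

    def query(self, center: Vec3, radius: float) -> List[Vec3]:
        """Return all stored points within `radius` of `center`."""
        lo = self._key((center[0] - radius, center[1] - radius, center[2] - radius))
        hi = self._key((center[0] + radius, center[1] + radius, center[2] + radius))
        hits: List[Vec3] = []
        # Visit only the cells overlapping the query's bounding box.
        for i in range(lo[0], hi[0] + 1):
            for j in range(lo[1], hi[1] + 1):
                for k in range(lo[2], hi[2] + 1):
                    # .get avoids creating empty cells during lookups
                    for p in self.cells.get((i, j, k), ()):
                        if math.dist(p, center) <= radius:
                            hits.append(p)
        return hits


index = GridIndex(cell_size=1.0)
for pt in [(0.1, 0.1, 0.1), (0.4, 0.1, 0.1), (5.0, 5.0, 5.0)]:
    index.insert(pt)
nearby = index.query((0.0, 0.0, 0.0), radius=0.5)
```

A query touches only the cells near the query point instead of scanning the whole cloud, at the cost of maintaining the index as points are added.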

vs. Mesh Representations

Point clouds offer:

Easier updates and modifications
No need for explicit topology
Better handling of incomplete or partial observations

Trade-offs:

Meshes provide explicit surface connectivity
Point clouds may have gaps between points

Technical Advantages

Scalability

The global-geometric memory scales effectively:

Scene size: Works with small objects to large environments
Trajectory length: Supports short clips to long video sequences
Resolution: Adapts point density to quality requirements

Robustness

The incremental update mechanism provides robustness:

Handles partial observations gracefully
Recovers from temporary inconsistencies
Improves with additional observations

Flexibility

The design supports:

Different input modalities (perspective, panoramic)
Various scene types and complexities
Integration with different VDM backbones

Relationship with Spatial-Stereo Memory

The global-geometric and spatial-stereo memories work in tandem: