Spatial-Stereo Memory

Overview

The spatial-stereo memory is WorldStereo’s mechanism for preserving and enforcing fine-grained details across multiple viewpoints. It operates by constraining the model’s attention receptive fields using 3D correspondence information, ensuring that local details remain consistent when viewed from different camera angles.

While the global-geometric memory handles coarse structural priors, the spatial-stereo memory specializes in fine-grained detail preservation—textures, edges, local surface variations, and other high-frequency visual information.

Core Concept: 3D Correspondence

What is 3D Correspondence?

3D correspondence refers to the relationship between image regions across different viewpoints that observe the same physical 3D point or surface patch.

When two pixels in different views correspond to the same 3D point, they should have consistent appearance (color, texture) once differences in lighting and viewing angle are accounted for.
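To make the definition concrete, the sketch below projects a single 3D point into two pinhole cameras and reads off the pair of corresponding pixels. The intrinsics `K`, the two camera poses, and the helper name `project` are illustrative assumptions, not part of WorldStereo.

```python
import numpy as np

def project(point_world, K, R, t):
    """Project a world-space point into a pinhole camera (x_cam = R @ X + t)."""
    p_cam = R @ point_world + t
    uvw = K @ p_cam
    return uvw[:2] / uvw[2], p_cam[2]  # pixel (u, v) and depth

# Assumed intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# View A sits at the world origin; view B is shifted 0.5 units along +x.
R_a, t_a = np.eye(3), np.zeros(3)
R_b, t_b = np.eye(3), np.array([-0.5, 0.0, 0.0])

point = np.array([0.2, -0.1, 4.0])
pixel_a, depth_a = project(point, K, R_a, t_a)
pixel_b, depth_b = project(point, K, R_b, t_b)

# pixel_a and pixel_b observe the same 3D point, so they are a correspondence.
print(pixel_a, pixel_b)
```

Because the camera shift is purely horizontal, the two pixels share a row and differ only by the stereo disparity: focal length times baseline divided by depth, here 500 * 0.5 / 4 = 62.5 pixels.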

Why 3D Correspondence Matters

For multi-view-consistent video generation, 3D correspondence provides:

Detail Consistency: Ensures textures and fine details appear the same across views
Geometric Accuracy: Maintains precise spatial relationships at the pixel level
Reconstruction Quality: Enables high-quality 3D reconstruction by providing reliable correspondences
Temporal Stability: Prevents flickering and inconsistencies in generated video sequences

Memory Bank Architecture

The spatial-stereo memory operates through a memory bank that stores fine-grained visual information with associated 3D correspondence data.

Memory Bank Components

Visual Features

High-resolution feature representations extracted from previously generated frames:

Feature Storage:
  - Multi-scale feature pyramids
  - Texture and appearance descriptors
  - Edge and corner information
  - Local pattern encodings

These features capture the fine-grained visual details that need to be preserved across views.

Geometric Metadata

For each stored feature, the memory bank maintains:

3D position: The world-space location of the feature
Surface normal: Orientation of the local surface
View information: Which views have observed this feature
Confidence scores: Reliability of the stored information

This metadata enables accurate correspondence matching across viewpoints.
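One way to picture a single memory-bank entry is as a small record holding a feature vector plus the metadata listed above. The class name `MemoryEntry`, its field names and types, and the running-mean confidence update are all assumptions made for illustration; the source does not specify a storage layout.

```python
from __future__ import annotations

from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    """Hypothetical record for one stored feature and its geometric metadata."""
    feature: list[float]                    # visual feature vector
    position: tuple[float, float, float]    # world-space 3D position
    normal: tuple[float, float, float]      # local surface normal
    view_ids: set[int] = field(default_factory=set)  # views that observed it
    confidence: float = 0.0                 # reliability of the entry

    def observe(self, view_id: int, confidence: float) -> None:
        """Record a new view and keep a running mean of the confidence."""
        if view_id in self.view_ids:
            return  # this view has already been counted
        n = len(self.view_ids)
        self.view_ids.add(view_id)
        self.confidence = (self.confidence * n + confidence) / (n + 1)

entry = MemoryEntry(feature=[0.1, 0.2], position=(0.0, 0.0, 1.0),
                    normal=(0.0, 0.0, -1.0))
entry.observe(view_id=0, confidence=0.8)
entry.observe(view_id=1, confidence=0.4)
```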

Correspondence Maps

Explicit or implicit mappings that define which image regions across different views correspond to the same 3D structure:

Dense correspondence fields: Pixel-to-pixel mappings
Sparse keypoint correspondences: Distinctive feature locations
Patch-level associations: Groups of related pixels

These maps guide the attention mechanism during generation.

Memory Bank Updates

As new frames are generated:

Feature extraction: Visual features are extracted from the new frame
Correspondence computation: 3D correspondences are established with existing memory
Memory insertion: New features and correspondences are added to the bank
Consistency refinement: Existing entries are refined with new observations

The memory bank grows incrementally as more views are generated, continuously improving the quality and coverage of stored correspondences.
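The update steps above can be sketched as a small class. This is a minimal illustration, assuming features arrive already extracted and paired with 3D positions, that correspondence is a nearest-neighbour match within a fixed radius, and that refinement is a running mean. The class name `MemoryBank` and the `match_radius` parameter are hypothetical.

```python
import numpy as np

class MemoryBank:
    """Toy memory bank: one feature vector per stored 3D position."""

    def __init__(self, feature_dim, match_radius=0.05):
        self.positions = np.zeros((0, 3))
        self.features = np.zeros((0, feature_dim))
        self.counts = np.zeros((0,), dtype=int)  # observations per entry
        self.match_radius = match_radius

    def update(self, positions, features):
        """Merge one frame: positions is [N, 3], features is [N, D]."""
        for pos, feat in zip(positions, features):
            if len(self.positions) > 0:
                # Correspondence computation: nearest stored 3D point.
                dists = np.linalg.norm(self.positions - pos, axis=1)
                j = int(np.argmin(dists))
                if dists[j] <= self.match_radius:
                    # Consistency refinement: running mean over observations.
                    n = self.counts[j]
                    self.features[j] = (self.features[j] * n + feat) / (n + 1)
                    self.counts[j] = n + 1
                    continue
            # Memory insertion: no match found, so append a new entry.
            self.positions = np.vstack([self.positions, pos])
            self.features = np.vstack([self.features, feat])
            self.counts = np.append(self.counts, 1)
```

A linear scan per feature is only reasonable for a toy example; a spatial index would be the natural replacement at realistic scale.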

Attention Constraint Mechanism

The spatial-stereo memory’s primary function is to constrain the model’s attention receptive fields based on 3D correspondence.

Traditional Attention in Video Diffusion

Standard video diffusion models use attention mechanisms that:

Attend to all spatial and temporal locations
Learn attention patterns from training data
May produce inconsistent correspondences across views

Correspondence-Constrained Attention

Spatial-stereo memory introduces geometric constraints:

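# Conceptual attention constraint (Python-style sketch).
# get_correspondence_mask is a placeholder for the memory bank's 3D lookup.
import numpy as np
from scipy.special import softmax

def constrained_attention(query, key, value, query_3d_pos, key_3d_pos):
    d = query.shape[-1]

    # Standard scaled dot-product scores
    scores = query @ key.T / np.sqrt(d)

    # Geometric constraint from spatial-stereo memory (True = corresponding)
    correspondence_mask = get_correspondence_mask(query_3d_pos, key_3d_pos)

    # Apply the constraint before normalizing, so the weights over the
    # corresponding regions still sum to 1 (assumes each query has a match)
    scores = np.where(correspondence_mask, scores, -np.inf)
    constrained_weights = softmax(scores, axis=-1)

    # Compute output with geometrically-aware attention
    return constrained_weights @ value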

By constraining attention to geometrically corresponding regions, the spatial-stereo memory ensures that the model focuses on relevant details from previous views when generating new frames.
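For a version that runs end to end, the following NumPy sketch takes the correspondence mask as an input and handles queries that have no corresponding key at all. Returning zeros for such rows is an assumption made here for simplicity, and the function name `masked_attention` is hypothetical.

```python
import numpy as np

def masked_attention(query, key, value, mask):
    """Scaled dot-product attention restricted to entries where mask is True.

    query: [Q, d], key: [K, d], value: [K, dv], mask: [Q, K] boolean.
    Rows of the mask with no True entry produce an all-zero output row.
    """
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)

    # Subtract the row maximum for numerical stability; fully masked rows
    # have a maximum of -inf, which is replaced by 0 to avoid inf - inf.
    row_max = np.max(scores, axis=-1, keepdims=True)
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)
    weights = np.exp(scores - row_max)  # masked entries become exactly 0

    # Normalize only the rows that have at least one allowed key.
    denom = weights.sum(axis=-1, keepdims=True)
    weights = np.divide(weights, denom, out=np.zeros_like(weights),
                        where=denom > 0)
    return weights @ value, weights
```

Masking the scores before the softmax, rather than multiplying the weights afterwards, keeps each row of weights summing to 1 over the allowed keys.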

Benefits of Constrained Attention

Reduced Ambiguity: Limits attention to regions that are geometrically relevant, reducing confusion
Detail Preservation: Ensures fine details are copied from appropriate source views
Consistency Enforcement: Prevents the model from hallucinating inconsistent details
Efficient Computation: Sparse attention patterns reduce computational requirements

Fine-Grained Detail Preservation

The spatial-stereo memory excels at preserving fine-grained details that are critical for visual quality and 3D reconstruction accuracy.

Types of Details Preserved

Texture Patterns

Surface textures like:

Fabric weaves
Wood grain
Stone patterns
Paint details

These high-frequency patterns must remain consistent across viewpoints for realistic generation and accurate reconstruction.

Edges and Boundaries

Sharp transitions such as:

Object silhouettes
Shadow boundaries
Material transitions
Occlusion edges

Precise edge localization across views is essential for geometric accuracy.

Local Surface Variation

Small-scale geometry including:

Surface bumps and indentations
Wrinkles and folds
Fine geometric details
Relief patterns

These variations create realistic appearance under different viewing angles.

Specular Highlights

View-dependent effects like:

Reflections
Glossy highlights
Transparent surface appearance

While view-dependent, these effects must change consistently with viewing angle.

Detail-Preserving Mechanism

The spatial-stereo memory preserves details through:

High-resolution feature storage: Maintains detailed feature representations
Precise correspondence: Accurately maps details across views
Attention guidance: Directs the model to copy details from appropriate sources
Multi-view consistency checking: Validates detail consistency across multiple observations

Integration with Global-Geometric Memory

The spatial-stereo and global-geometric memories form a complementary hierarchy:

Division of Responsibilities

Global-Geometric

Scale: Scene-level
Focus: Coarse geometry, camera paths
Representation: Point clouds
Updates: Incremental 3D structure

Spatial-Stereo

Scale: Local patches
Focus: Fine details, textures
Representation: Feature memory bank
Updates: Correspondence refinement

This hierarchical design allows WorldStereo to efficiently process geometric information at multiple scales, allocating computational resources appropriately for both coarse structure and fine details.

Attention Receptive Field Control

What Are Attention Receptive Fields?

In video diffusion models, attention receptive fields define which spatial and temporal regions each location can attend to during generation.

Unconstrained receptive fields:

Can attend to any location in space and time
Learn patterns from training data
May produce geometrically inconsistent attention

Spatially-constrained receptive fields:

Attend only to geometrically corresponding regions
Guided by 3D correspondence from spatial-stereo memory
Enforce multi-view consistency through geometry

How Constraints Are Applied

The spatial-stereo memory applies constraints through:

Correspondence-Based Masking

Attention masks are generated based on 3D correspondence:

# For each query position
mask[query_position] = {
  key_position: is_corresponding(query_position, key_position)
  for key_position in all_positions
}

Only corresponding positions receive non-zero attention weights.
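The source does not say how the correspondence test itself is computed. A simple stand-in, assuming every query and key position carries a 3D coordinate, is to treat two positions as corresponding when their 3D points lie within a small radius of each other. The function name `distance_mask` and the `radius` parameter are hypothetical.

```python
import numpy as np

def distance_mask(query_xyz, key_xyz, radius=0.05):
    """Boolean [Q, K] mask: True where query and key 3D points nearly coincide.

    query_xyz: [Q, 3], key_xyz: [K, 3], both in the same world frame.
    """
    diff = query_xyz[:, None, :] - key_xyz[None, :, :]  # [Q, K, 3]
    dist = np.linalg.norm(diff, axis=-1)                # [Q, K]
    return dist <= radius

query_xyz = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0]])
key_xyz = np.array([[0.0, 0.0, 1.01], [5.0, 5.0, 5.0]])
mask = distance_mask(query_xyz, key_xyz)
```

In this example only the first query has a corresponding key; the second query's row is entirely False, which is the case where some fallback behaviour is needed.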

Spatial Warping

Features from previous views are warped to the current view using correspondence:

Retrieve features from memory bank
Use 3D correspondence to warp to current view
Use warped features as keys/values in attention

This provides geometrically-aligned features for attention computation.
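The warping steps can be sketched by projecting stored 3D positions into the current camera and scattering their features onto its pixel grid, with a depth test so nearer points win. The pinhole camera model, nearest-pixel rounding, and the function name `warp_to_view` are assumptions for illustration; a production system would more likely use a differentiable splatting or sampling operation.

```python
import numpy as np

def warp_to_view(positions, features, K, R, t, height, width):
    """Scatter memory features into the current view.

    positions: [N, 3] world points, features: [N, D].
    Camera model (assumed): x_cam = R @ X + t, pixel = K @ x_cam.
    Returns a [H, W, D] feature map and a [H, W] boolean validity mask.
    """
    cam = positions @ R.T + t
    keep = cam[:, 2] > 1e-6              # drop points behind the camera
    cam, feats = cam[keep], features[keep]

    uvw = cam @ K.T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)

    warped = np.zeros((height, width, features.shape[1]))
    depth = np.full((height, width), np.inf)
    for i in np.flatnonzero(inside):
        # Z-buffer test: keep only the nearest point per pixel.
        if cam[i, 2] < depth[v[i], u[i]]:
            depth[v[i], u[i]] = cam[i, 2]
            warped[v[i], u[i]] = feats[i]
    return warped, np.isfinite(depth)
```

The validity mask marks pixels that received a feature; the warped map can then serve as keys and values for those pixels only.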

Adaptive Receptive Fields

Receptive field sizes adapt based on:

Geometric certainty: Larger fields where correspondence is uncertain
Detail level: Smaller fields for fine details
View angle: Adjusted for foreshortening and perspective effects

This adaptivity balances geometric constraints with generation flexibility.
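The three factors can be combined in many ways, and the source gives no formula. The sketch below shows one possible heuristic, purely as an assumption: widen the radius as confidence drops, halve it at the finest detail level, and scale it by the cosine of the viewing angle to approximate foreshortening. The function name, every constant, and the clamping range are hypothetical.

```python
import math

def receptive_radius(confidence, detail_level, view_angle_rad,
                     base_radius=4.0, min_radius=1.0, max_radius=16.0):
    """Heuristic attention radius in pixels (illustrative only).

    confidence:     0 (uncertain correspondence) to 1 (certain)
    detail_level:   0 (coarse region) to 1 (very fine detail)
    view_angle_rad: angle between the view ray and the surface normal
    """
    confidence = min(max(confidence, 0.0), 1.0)
    detail_level = min(max(detail_level, 0.0), 1.0)

    uncertainty_scale = 2.0 - confidence       # 1x when certain, 2x when not
    detail_scale = 1.0 - 0.5 * detail_level    # 1x when coarse, 0.5x when fine
    foreshortening = max(math.cos(view_angle_rad), 0.0)

    radius = base_radius * uncertainty_scale * detail_scale * foreshortening
    return min(max(radius, min_radius), max_radius)
```

For example, a certain, coarse, front-facing region gets the base radius of 4 pixels, while the same region with zero confidence gets 8.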

Benefits for 3D Reconstruction

The spatial-stereo memory directly improves 3D reconstruction quality:

Reliable Correspondences

Accurate 3D reconstruction requires reliable correspondences across views. The spatial-stereo memory ensures generated videos have the precise correspondences that reconstruction algorithms depend on.

Reconstruction algorithms benefit from:

Dense, accurate point correspondences
Consistent textures across views
Precise edge localization
Reliable feature matching

High-Frequency Detail Recovery

Traditional multi-view reconstruction often struggles with fine details. Spatial-stereo memory:

Ensures details are consistently generated across views
Provides reliable high-frequency information
Enables reconstruction of textures and small geometric features
Reduces smoothing artifacts in final 3D models

Reduced Reconstruction Artifacts

Common reconstruction problems addressed:

Floating Artifacts: Prevented by consistent depth cues across views
Holes and Gaps: Reduced through complete, consistent coverage
Texture Blur: Avoided by preserving high-frequency details
Geometric Inconsistencies: Eliminated through correspondence constraints

Technical Implementation Details

Memory Efficiency

The memory bank implements efficient storage:

Sparse representation: Only stores relevant features
Hierarchical indexing: Fast correspondence queries
Pruning strategies: Removes redundant or low-confidence entries
Compression: Efficient encoding of visual features
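As an illustration of the pruning strategy, the sketch below drops low-confidence entries and then keeps only the most confident entry per voxel of 3D space. The tuple layout, the thresholds, and the function name `prune` are assumptions; the source does not describe the actual pruning rule.

```python
import math

def prune(entries, min_confidence=0.3, voxel_size=0.05):
    """Keep the most confident entry per voxel, dropping weak entries.

    entries: iterable of (position, confidence, feature) tuples, where
    position is an (x, y, z) tuple in world space.
    """
    best = {}
    for position, confidence, feature in entries:
        if confidence < min_confidence:
            continue  # low-confidence entry: remove
        cell = tuple(math.floor(c / voxel_size) for c in position)
        # Redundant entries share a voxel; keep the most reliable one.
        if cell not in best or confidence > best[cell][1]:
            best[cell] = (position, confidence, feature)
    return list(best.values())

entries = [
    ((0.01, 0.01, 1.02), 0.9, "a"),
    ((0.02, 0.02, 1.03), 0.5, "b"),  # same voxel as "a", less confident
    ((1.00, 0.00, 1.00), 0.1, "c"),  # below the confidence threshold
    ((1.00, 0.00, 2.00), 0.8, "d"),
]
kept = prune(entries)
```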

Computational Efficiency

Correspondence-constrained attention can be more efficient than full attention:

Sparse attention patterns: Fewer operations required
Early termination: Skip irrelevant regions
Batched correspondence queries: Efficient geometry processing
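The saving from sparse attention patterns can be seen in a gather-based sketch: if each query attends to only k candidate keys, the score computation costs Q * k dot products instead of Q * K. The function name `sparse_attention` and the `neighbor_idx` input (a precomputed list of candidate keys per query) are assumptions for illustration.

```python
import numpy as np

def sparse_attention(query, key, value, neighbor_idx):
    """Attention where each query only sees its own k candidate keys.

    query: [Q, d], key: [K, d], value: [K, dv],
    neighbor_idx: [Q, k] integer indices into key/value.
    """
    d = query.shape[-1]
    k_sel = key[neighbor_idx]      # [Q, k, d]  gathered keys
    v_sel = value[neighbor_idx]    # [Q, k, dv] gathered values

    # One dot product per (query, candidate) pair: Q * k instead of Q * K.
    scores = np.einsum("qd,qkd->qk", query, k_sel) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return np.einsum("qk,qkv->qv", weights, v_sel)
```

When `neighbor_idx` lists every key for every query, the result matches dense attention; with a single candidate per query, the output is simply that candidate's value.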

Robustness Considerations

The system handles challenging scenarios:

Occlusions: Gracefully handles temporarily occluded regions
View-dependent appearance: Accounts for lighting and reflectance changes
Partial observations: Works with incomplete correspondence information
Ambiguous regions: Falls back to global-geometric guidance when local correspondence is uncertain

Comparison with Alternative Approaches

vs. Explicit Reprojection

Spatial-stereo memory offers:

Learned feature representations beyond raw pixels
Handling of view-dependent effects
Soft constraints that allow generation flexibility

Trade-offs:

More complex than simple pixel reprojection
Requires feature extraction and storage

vs. Learned Correspondence Networks

Spatial-stereo memory offers:

Explicit geometric constraints based on 3D structure
Better generalization to novel viewpoints
Direct integration with point cloud representation

Trade-offs:

Requires 3D geometric information
Pure learning-based approaches may handle some cases more flexibly

vs. Global Attention Only

Spatial-stereo memory offers:

Geometric consistency enforced through explicit correspondence constraints
Preservation of fine details
More efficient sparse attention patterns

Trade-offs:

Additional memory and computation for correspondence
Less flexibility in handling novel content generation

Next Steps

Explore the Global-Geometric Memory for coarse structure control
Learn about the Video Diffusion Model backbone
See the complete Architecture Overview
