Overview
WorldStereo employs two dedicated geometric memory modules that enable multi-view-consistent video generation:
- Global Geometric Memory: Provides coarse structural priors through incrementally updated point clouds
- Spatial-Stereo Memory: Constrains attention receptive fields with 3D correspondence for fine-grained details
The memory modules operate as control branches that attach to the video diffusion model (VDM) backbone without requiring joint training, which keeps them efficient and modular.
Global Geometric Memory
The global geometric memory module maintains a dynamic 3D representation of the scene through incrementally updated point clouds. It provides coarse structural priors that guide the video generation process.
Class: GlobalGeometricMemory
Constructor
Dimension of point cloud feature embeddings.
Dimension of output features to inject into the VDM.
Number of nearest neighbors to consider for each query point.
How often to update the point cloud: “per_frame”, “per_step”, or “manual”.
Spatial indexing structure for efficient nearest neighbor search: “kd_tree”, “ball_tree”, or “cuda_knn”.
Maximum number of points to maintain in memory (oldest points are removed).
Methods
initialize
Initialize or reset the memory with a point cloud.
Initial point cloud coordinates, shape (N, 3).
Optional per-point features, shape (N, D). If not provided, features are computed from coordinates.
Optional RGB colors for each point, shape (N, 3).
update
Incrementally update the memory with new observations.
New point cloud observations, shape (M, 3).
Camera pose for the observations, shape (4, 4) transformation matrix.
Optional features for new points, shape (M, D).
Distance threshold for merging nearby points (in scene units).
Number of new points added to memory.
Number of points merged with existing points.
Total number of points in memory after update.
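The merge-or-append behavior described for update can be illustrated with a minimal sketch. This is not the GlobalGeometricMemory implementation: the function name, the dictionary keys, and the choice to average merged points are all assumptions made for illustration, and the points are assumed to be already expressed in world coordinates (the camera pose transform is left out).

```python
import math


def update_point_cloud(memory, new_points, merge_threshold):
    """Merge new points into `memory` in place and report what happened.

    A new point closer than `merge_threshold` to an existing point is
    averaged into that point; otherwise it is appended as a new entry.
    (Hypothetical helper, not part of the WorldStereo API.)
    """
    num_added = 0
    num_merged = 0
    for p in new_points:
        # Brute-force nearest neighbor; a spatial index would replace this loop.
        nearest_idx, nearest_dist = None, math.inf
        for i, q in enumerate(memory):
            d = math.dist(p, q)
            if d < nearest_dist:
                nearest_idx, nearest_dist = i, d
        if nearest_idx is not None and nearest_dist < merge_threshold:
            q = memory[nearest_idx]
            memory[nearest_idx] = tuple((a + b) / 2.0 for a, b in zip(q, p))
            num_merged += 1
        else:
            memory.append(tuple(p))
            num_added += 1
    return {
        "num_added": num_added,
        "num_merged": num_merged,
        "total_points": len(memory),
    }


memory = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
stats = update_point_cloud(
    memory, [(0.0, 0.0, 0.02), (5.0, 5.0, 5.0)], merge_threshold=0.05
)
print(stats)
```

The three counters mirror the three return values listed above: points added, points merged, and the total after the update.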
query
Query the memory for features at specific 3D locations.
3D query locations, shape (B, Q, 3) where Q is the number of query points.
Current camera parameters for view-dependent feature computation.
How to aggregate neighbor features: “mean”, “max”, “weighted”, or “attention”.
Aggregated features for query points, shape (B, Q, feature_dim).
inject_to_latent
Inject geometric memory features into the VDM latent representation.
VDM latent representation, shape (B, T, C, H, W).
Camera parameters for current viewpoint.
Method for feature injection: “cross_attention”, “addition”, or “concatenation”.
Latent representation with injected geometric features, shape (B, T, C, H, W).
Properties
Example Usage
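A minimal, self-contained sketch of what a nearest-neighbor query with aggregation could look like. It does not call the GlobalGeometricMemory class; the function query_features and its arguments are hypothetical, the "weighted" mode is assumed to mean inverse-distance weighting, and the "attention" mode is omitted.

```python
import math


def query_features(points, features, query, num_neighbors=2, aggregation="mean"):
    """Aggregate the features of the `num_neighbors` points nearest to `query`.

    Hypothetical stand-in for a memory query, using plain Python lists.
    """
    # Sort point indices by distance to the query and keep the k closest.
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    nearest = order[:num_neighbors]
    dim = len(features[0])
    if aggregation == "mean":
        return [sum(features[i][d] for i in nearest) / len(nearest) for d in range(dim)]
    if aggregation == "max":
        return [max(features[i][d] for i in nearest) for d in range(dim)]
    if aggregation == "weighted":
        # Assumed inverse-distance weights; the epsilon avoids division by zero.
        weights = [1.0 / (math.dist(points[i], query) + 1e-8) for i in nearest]
        total = sum(weights)
        return [
            sum(w * features[i][d] for w, i in zip(weights, nearest)) / total
            for d in range(dim)
        ]
    raise ValueError(f"unknown aggregation: {aggregation}")


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
features = [[1.0, 0.0], [3.0, 4.0], [100.0, 100.0]]
query = (0.5, 0.0, 0.0)

mean_feat = query_features(points, features, query, num_neighbors=2, aggregation="mean")
max_feat = query_features(points, features, query, num_neighbors=2, aggregation="max")
print(mean_feat, max_feat)
```

With num_neighbors=2 the distant third point never contributes, which is the sense in which a larger neighbor count trades speed for broader geometric context.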
Spatial-Stereo Memory
The spatial-stereo memory module maintains a memory bank of fine-grained visual features with 3D correspondence information. It constrains the model’s attention receptive fields to focus on geometrically consistent regions.
Class: SpatialStereoMemory
Constructor
Maximum number of feature vectors in the memory bank.
Dimension of feature vectors.
Number of 3D correspondences to maintain per memory entry.
Spatial window size for constrained attention (in pixels).
Confidence threshold for accepting 3D correspondences.
Memory replacement strategy: “fifo”, “lru”, or “importance_based”.
Methods
add_to_memory
Add new features and correspondences to the memory bank.
Feature vectors to add, shape (N, feature_dim).
3D correspondence locations, shape (N, num_correspondences, 3).
Confidence scores for correspondences, shape (N, num_correspondences).
Optional metadata (camera pose, frame index, etc.).
Memory indices where features were stored.
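To make the returned memory indices concrete, here is a small sketch of a fixed-capacity bank with “fifo” replacement. The class FifoMemoryBank is hypothetical and stores only feature vectors; correspondences, confidences, and metadata are left out, and the actual SpatialStereoMemory storage layout may differ.

```python
class FifoMemoryBank:
    """Fixed-capacity bank that overwrites its oldest slot when full.

    Hypothetical illustration of FIFO replacement, not the library class.
    """

    def __init__(self, memory_size):
        self.memory_size = memory_size
        self.slots = [None] * memory_size
        self.next_slot = 0  # index the next write will use
        self.count = 0      # number of occupied slots

    def add(self, features):
        """Store each feature vector and return the slot indices used."""
        indices = []
        for f in features:
            self.slots[self.next_slot] = f
            indices.append(self.next_slot)
            # Wrap around so the oldest entry is the next to be replaced.
            self.next_slot = (self.next_slot + 1) % self.memory_size
            self.count = min(self.count + 1, self.memory_size)
        return indices


bank = FifoMemoryBank(memory_size=3)
first = bank.add([[0.1], [0.2]])
second = bank.add([[0.3], [0.4]])
print(first, second, bank.slots)
```

Once the bank is full, the fourth vector reuses slot 0, so indices returned earlier can later point at newer data.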
constrained_attention
Compute attention with spatial-stereo constraints.
Query features, shape (B, N, feature_dim).
3D locations of query points, shape (B, N, 3).
Current camera parameters.
Whether to apply correspondence-based attention masking.
Attention output, shape (B, N, feature_dim).
Attention weights for visualization, shape (B, N, memory_size).
retrieve
Retrieve relevant memory entries based on 3D proximity.
3D query locations, shape (B, N, 3).
Camera parameters for view-dependent retrieval.
Number of nearest memory entries to retrieve.
Retrieved feature vectors, shape (B, N, top_k, feature_dim).
Indices of retrieved memory entries, shape (B, N, top_k).
clear
Clear the memory bank.
Properties
Example Usage
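A minimal, self-contained sketch of attention with a geometric mask, for a single query. It does not call SpatialStereoMemory. The function and its parameters are hypothetical: the mask here is assumed to combine a confidence threshold with a 3D distance radius, whereas the module itself describes its window in pixels.

```python
import math


def constrained_attention(query_feat, query_xyz, mem_feats, mem_xyz, mem_conf,
                          radius=1.0, confidence_threshold=0.5):
    """Softmax attention over memory entries that pass a geometric mask.

    Returns (output, weights). Hypothetical sketch for one query vector.
    """
    scale = 1.0 / math.sqrt(len(query_feat))
    scores = []
    for feat, xyz, conf in zip(mem_feats, mem_xyz, mem_conf):
        allowed = conf >= confidence_threshold and math.dist(query_xyz, xyz) <= radius
        if allowed:
            scores.append(scale * sum(q * k for q, k in zip(query_feat, feat)))
        else:
            scores.append(-math.inf)  # masked entries end up with zero weight
    if all(s == -math.inf for s in scores):
        # Nothing passes the mask: return zeros instead of dividing by zero.
        return [0.0] * len(query_feat), [0.0] * len(scores)
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract peak for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    output = [
        sum(w * feat[d] for w, feat in zip(weights, mem_feats))
        for d in range(len(query_feat))
    ]
    return output, weights


mem_feats = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
mem_xyz = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (10.0, 0.0, 0.0)]
mem_conf = [0.9, 0.2, 0.9]

output, weights = constrained_attention(
    [1.0, 0.0], (0.1, 0.0, 0.0), mem_feats, mem_xyz, mem_conf
)
print(output, weights)
```

In this example the second entry is rejected for low confidence and the third for distance, so all attention weight falls on the first entry.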
Memory Integration
The two memory modules work together during video generation:
Global Geometric Memory
Provides coarse 3D structure from the point cloud, guiding overall scene geometry and camera control.
Spatial-Stereo Memory
Refines generation with fine-grained details by constraining attention to geometrically consistent regions.
The memory modules operate independently, so they can be queried in parallel. The global memory focuses on structure, while the spatial-stereo memory handles texture and detail.
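Because neither module depends on the other's result, the two lookups can be issued concurrently. The sketch below shows one way this might be arranged with Python's standard thread pool; both query functions are hypothetical stubs, not WorldStereo calls, and threads only help in practice when the underlying work releases the interpreter lock (for example, GPU kernels or native code).

```python
from concurrent.futures import ThreadPoolExecutor


def query_global_memory(xyz):
    # Hypothetical stand-in for a coarse structural lookup.
    return {"source": "global", "xyz": xyz}


def query_spatial_memory(xyz):
    # Hypothetical stand-in for a fine-grained feature lookup.
    return {"source": "spatial", "xyz": xyz}


def query_both(xyz):
    """Submit both lookups at once and wait for the two results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        coarse = pool.submit(query_global_memory, xyz)
        fine = pool.submit(query_spatial_memory, xyz)
        return coarse.result(), fine.result()


coarse, fine = query_both((0.0, 1.0, 2.0))
print(coarse["source"], fine["source"])
```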
Performance Considerations
Memory Management
- Global memory uses spatial indexing (KD-tree) for efficient nearest neighbor search
- Spatial memory implements FIFO or LRU caching to maintain bounded size
- Point cloud merging reduces redundancy and memory footprint
- GPU-accelerated operations for real-time performance
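For readers unfamiliar with KD-trees, the following is a compact pure-Python sketch of building one over 3D points and running a nearest-neighbor search with branch pruning. It is illustrative only: the function names are hypothetical, and a production system would use an optimized library or GPU implementation instead.

```python
import math


def build_kd_tree(points, depth=0):
    """Recursively build a KD-tree as nested dicts; None for an empty set."""
    if not points:
        return None
    axis = depth % 3  # cycle through x, y, z
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return {
        "point": ordered[mid],
        "axis": axis,
        "left": build_kd_tree(ordered[:mid], depth + 1),
        "right": build_kd_tree(ordered[mid + 1:], depth + 1),
    }


def nearest(node, target, best=None):
    """Return (distance, point) for the stored point closest to `target`."""
    if node is None:
        return best
    d = math.dist(node["point"], target)
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = target[node["axis"]] - node["point"][node["axis"]]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = nearest(near, target, best)
    # Search the far side only if the splitting plane is closer than the best match.
    if abs(diff) < best[0]:
        best = nearest(far, target, best)
    return best


points = [
    (2.0, 3.0, 1.0), (5.0, 4.0, 2.0), (9.0, 6.0, 3.0),
    (4.0, 7.0, 4.0), (8.0, 1.0, 5.0), (7.0, 2.0, 6.0),
]
tree = build_kd_tree(points)
target = (8.0, 1.5, 5.0)
dist, point = nearest(tree, target)
print(dist, point)
```

The pruning test is what makes the search cheaper than a linear scan on average: whole subtrees are skipped when they cannot contain a closer point.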
Scalability
- Global memory can handle 100K+ points efficiently
- Spatial memory bank size is configurable based on available GPU memory
- Batch processing for multiple queries reduces overhead
- Incremental updates avoid full recomputation
Quality vs. Efficiency Trade-offs
- Increase num_neighbors for better geometric conditioning (slower)
- Increase memory_size for more detailed spatial memory (more GPU memory)
- Larger attention_window improves detail but reduces speed
- More num_correspondences increases accuracy but also computation cost
Related Documentation
WorldStereo Model
Main model class that integrates memory modules
Inference API
High-level generation interface
Global Geometric Memory
Conceptual overview
Spatial-Stereo Memory
Conceptual overview