Overview

WorldStereo is evaluated on both camera-guided video generation and 3D reconstruction benchmarks, where it demonstrates state-of-the-art multi-view consistency and geometric accuracy.
These experiments support its use as a world model for diverse scene generation tasks with high-fidelity 3D results.

Evaluation Domains

WorldStereo’s performance is assessed across two primary domains:

Camera-Guided Video Generation

Evaluates the quality and controllability of generated videos following specified camera trajectories

3D Scene Reconstruction

Measures the geometric accuracy and completeness of reconstructed 3D scenes from generated multi-view videos

Camera-Guided Video Generation

Key Metrics

Camera-guided video generation performance is evaluated using multiple complementary metrics:

Visual Quality Metrics

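  • FVD (Fréchet Video Distance): Distance between feature distributions of generated and real videos; reflects overall video quality and temporal consistency
  • FID (Fréchet Inception Distance): Distance between Inception feature distributions of generated and real frames; reflects per-frame image quality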
  • LPIPS (Learned Perceptual Image Patch Similarity): Assesses perceptual similarity to ground truth
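
FVD and FID share the same core computation: fit a Gaussian to each set of features and take the Fréchet distance between the two. The sketch below illustrates that computation under a simplifying assumption of diagonal covariances, so it needs only the Python standard library. The standard metrics use full covariance matrices and a matrix square root, with features taken from a pretrained network; the function name and the toy feature vectors here are hypothetical and are not part of WorldStereo's evaluation code.

```python
import math
from statistics import fmean, pvariance


def diagonal_frechet_distance(feats_a, feats_b):
    """Fréchet distance between two Gaussians fitted to feature sets.

    Assumption: covariances are diagonal, so the distance reduces to a
    per-dimension sum of (mu_a - mu_b)^2 + var_a + var_b - 2*sqrt(var_a*var_b).
    """
    dims = len(feats_a[0])
    total = 0.0
    for d in range(dims):
        col_a = [f[d] for f in feats_a]
        col_b = [f[d] for f in feats_b]
        mu_a, mu_b = fmean(col_a), fmean(col_b)
        var_a, var_b = pvariance(col_a), pvariance(col_b)
        total += (mu_a - mu_b) ** 2 + var_a + var_b - 2.0 * math.sqrt(var_a * var_b)
    return total


# Toy features: the second set is the first one shifted by 3 along dimension 0.
real = [[0.0, 0.0], [2.0, 2.0]]
generated = [[3.0, 0.0], [5.0, 2.0]]
distance = diagonal_frechet_distance(real, generated)
```

Lower values mean the generated feature distribution is closer to the reference distribution; identical feature sets give a distance of zero.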

Camera Control Metrics

  • Camera Trajectory Accuracy: Measures adherence to specified camera paths
  • Pose Estimation Error: Evaluates the recoverability of camera poses from generated videos
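
As an illustration of how pose errors of this kind are commonly computed, the sketch below measures the geodesic angle between an estimated and a reference rotation, plus the Euclidean distance between camera translations. The function names are hypothetical, and the exact protocol used for WorldStereo (for example, how estimated trajectories are aligned or rescaled before comparison) is not specified here.

```python
import math


def rotation_error_deg(R_est, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_est^T @ R_gt) equals the sum of element-wise products.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error(t_est, t_gt):
    """Euclidean distance between two camera translation vectors."""
    return math.dist(t_est, t_gt)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Rotation by 90 degrees about the z axis.
rot_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

angle = rotation_error_deg(identity, rot_z_90)
offset = translation_error([0.0, 0.0, 0.0], [3.0, 4.0, 0.0])
```

Averaging these two quantities over all frames of a trajectory gives a simple summary of how closely a generated video follows the requested camera path.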

Multi-View Consistency Metrics

  • Cross-View Consistency: Measures geometric consistency between different viewpoints
  • Temporal Stability: Evaluates smoothness and coherence across video frames

WorldStereo demonstrates superior performance in multi-view consistency metrics compared to baseline VDM approaches, thanks to its geometric memory modules.

Benchmark Datasets

Evaluation is conducted on standard video generation datasets:

Perspective Scenes

Standard benchmark datasets featuring conventional perspective camera views

Panoramic Scenes

360-degree panoramic scene datasets for immersive environment evaluation

Performance Highlights

Precise Camera Control

  • Global-Geometric Memory: Enables accurate following of complex camera trajectories
  • Trajectory Flexibility: Supports diverse camera motion patterns (orbiting, forward motion, complex paths)
  • Pose Consistency: Generated videos maintain geometric consistency with input camera poses

Visual Quality

  • VDM Backbone Benefits: Inherits high-quality generation from foundational Video Diffusion Models
  • Detail Preservation: Spatial-stereo memory maintains fine-grained details across views
  • Temporal Coherence: Smooth transitions between frames without flickering or artifacts

The distribution matching approach allows WorldStereo to maintain the visual quality of state-of-the-art VDMs while adding precise geometric control.

3D Reconstruction Benchmarks

Reconstruction Pipeline

WorldStereo’s 3D reconstruction capability is evaluated using generated multi-view videos:
  1. Multi-View Generation: Generate videos from controlled camera trajectories
  2. Feature Matching: Extract and match features across generated views
  3. Structure from Motion: Recover camera poses and sparse point clouds
  4. Dense Reconstruction: Build dense 3D models by exploiting multi-view consistency

Evaluation Metrics

Geometric Accuracy

  • Chamfer Distance: Measures point cloud similarity to ground truth geometry
  • Point-to-Mesh Distance: Evaluates accuracy of reconstructed mesh surfaces
  • Normal Consistency: Assesses correctness of surface orientations
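
For reference, a minimal sketch of a symmetric Chamfer distance between two point clouds follows. It is a brute-force version written for clarity, and conventions vary between papers (squared versus unsquared distances, summing versus averaging the two directions), so the variant below is an assumption rather than the definition used in WorldStereo's evaluation.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both ways.

    Brute force, O(len(a) * len(b)); real evaluations typically use a KD-tree.
    """
    def mean_nearest(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a)


reconstructed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(reconstructed, ground_truth)
```

The first direction penalizes reconstructed points that lie far from the true surface, while the second penalizes ground truth regions that the reconstruction misses.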

Completeness

  • Coverage: Percentage of ground truth geometry successfully reconstructed
  • Hole Detection: Identifies gaps or missing regions in reconstruction
  • Outlier Ratio: Measures spurious geometric elements
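
Coverage and outlier ratio can both be expressed as thresholded nearest-neighbour tests between the reconstructed and ground truth point clouds. The sketch below assumes a single distance threshold chosen by the evaluator; the function names and the threshold value are hypothetical and are not taken from the WorldStereo benchmarks.

```python
import math


def coverage(gt_points, recon_points, threshold):
    """Fraction of ground truth points with a reconstructed point within threshold."""
    covered = sum(
        1 for g in gt_points
        if min(math.dist(g, r) for r in recon_points) <= threshold
    )
    return covered / len(gt_points)


def outlier_ratio(recon_points, gt_points, threshold):
    """Fraction of reconstructed points farther than threshold from all ground truth."""
    outliers = sum(
        1 for r in recon_points
        if min(math.dist(r, g) for g in gt_points) > threshold
    )
    return outliers / len(recon_points)


gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
recon = [(0.0, 0.0, 0.01), (1.0, 0.0, 0.02), (10.0, 0.0, 0.0)]

cov = coverage(gt, recon, threshold=0.05)
out = outlier_ratio(recon, gt, threshold=0.05)
```

In this toy case half of the ground truth points are covered, and one of the three reconstructed points is a spurious outlier.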

Multi-View Consistency

  • Reprojection Error: Measures consistency of 3D points across different views
  • Depth Consistency: Evaluates agreement of depth estimates from multiple viewpoints
  • Photometric Consistency: Assesses color/appearance consistency across views

WorldStereo’s geometric memory modules specifically target multi-view consistency, resulting in superior 3D reconstruction quality compared to standard video generation models.
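
To make the reprojection error metric concrete, the sketch below projects a single 3D point into several views with a pinhole camera model and averages the pixel distance to the observed 2D locations. It assumes a world-to-camera pose convention (X_cam = R · X_world + t) and intrinsics shared across views; all names and values are illustrative and not taken from WorldStereo's code.

```python
import math


def project(point_world, R, t, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates with a pinhole camera."""
    # Assumed convention: X_cam = R @ X_world + t (world-to-camera).
    x, y, z = (
        sum(R[i][j] * point_world[j] for j in range(3)) + t[i]
        for i in range(3)
    )
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def mean_reprojection_error(point_world, views, fx, fy, cx, cy):
    """Mean pixel distance between projected and observed 2D locations.

    `views` is a list of (R, t, observed_uv) tuples, one per camera.
    """
    errors = [
        math.dist(project(point_world, R, t, fx, fy, cx, cy), observed_uv)
        for R, t, observed_uv in views
    ]
    return sum(errors) / len(errors)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
views = [
    (identity, (0.0, 0.0, 0.0), (75.0, 90.0)),   # exact observation
    (identity, (1.0, 0.0, 0.0), (103.0, 94.0)),  # observation off by 5 pixels
]
err = mean_reprojection_error((1.0, 2.0, 4.0), views, fx=100.0, fy=100.0, cx=50.0, cy=40.0)
```

A low mean error across many points and views indicates that the generated frames agree on a single underlying 3D structure.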

Benchmark Datasets

Synthetic Datasets

Controlled environments with ground truth 3D geometry for quantitative evaluation

Real-World Scenes

Challenging real-world scenarios to test generalization and robustness

Performance Highlights

High-Fidelity 3D Results

  • Geometric Memory Integration: Both global and spatial memory modules contribute to reconstruction quality
  • Fine-Grained Details: Spatial-stereo memory preserves detailed geometric features
  • Structural Coherence: Global-geometric memory ensures overall structural correctness

Diverse Scene Support

WorldStereo demonstrates strong performance across various scene types:
  • Perspective Scenes: Standard 3D environments captured from perspective cameras
  • Panoramic Scenes: Full 360-degree environments from panoramic inputs
  • Complex Geometry: Challenging structures with intricate details and occlusions

Comparative Analysis

Baseline Comparisons

WorldStereo is compared against several baseline approaches:

Standard VDMs

Foundational video diffusion models without geometric controls

Camera-Controlled Generation

Existing methods for camera-guided video synthesis

Multi-View Synthesis

Traditional multi-view image generation approaches

3D-Aware Generation

3D-aware generative models for novel view synthesis

Key Advantages

Over Standard VDMs

  • Precise Camera Control: WorldStereo provides accurate trajectory following vs. limited or no control
  • Multi-View Consistency: Geometric memory modules ensure consistency vs. view-independent generation
  • 3D Reconstruction: Enables high-quality reconstruction vs. inconsistent geometry

Over Camera-Controlled Methods

  • Geometric Awareness: Explicit 3D memory modules vs. implicit control mechanisms
  • Fine-Grained Consistency: Spatial-stereo memory preserves details vs. coarse consistency
  • Reconstruction Quality: Superior 3D output quality due to geometric constraints

Efficiency Benefits

WorldStereo’s control-branch architecture avoids joint training, which makes it faster to develop and deploy than end-to-end trained alternatives.

World Model Capabilities

Task Diversity

Extensive experiments demonstrate WorldStereo’s versatility as a world model:
  • Novel View Synthesis: Generate new viewpoints from single or sparse inputs
  • Scene Completion: Infer and generate unseen regions of 3D environments
  • Camera Path Planning: Support arbitrary camera trajectories for exploration
  • Multi-Modal Input: Handle both perspective and panoramic image inputs

Qualitative Results

WorldStereo’s outputs are:
  • Visually Plausible: High-quality videos that match foundational VDM output quality
  • Geometrically Consistent: Multi-view outputs suitable for accurate 3D reconstruction
  • Detail-Preserving: Fine-grained textures and structures maintained across views
  • Temporally Smooth: Coherent video sequences without artifacts or inconsistencies

The combination of visual quality from VDM backbones and geometric consistency from memory modules positions WorldStereo as a powerful foundation for 3D-aware video generation tasks.

Experimental Validation

Ablation Studies

Extensive ablation studies validate the contribution of each component:
  • Global-Geometric Memory: Critical for camera control and coarse structure
  • Spatial-Stereo Memory: Essential for fine-grained multi-view consistency
  • Combined Architecture: Synergistic benefits from both memory modules

Generalization Tests

  • Cross-Dataset: Performance on unseen datasets and scene types
  • Camera Trajectory Variety: Robustness to diverse camera motion patterns
  • Scene Complexity: Handling of intricate geometric structures and textures

Future Benchmarking

As WorldStereo continues to develop, additional benchmark evaluations will be conducted:
  • Large-Scale Scenes: Evaluation on extensive environments and urban scenes
  • Dynamic Content: Extension to scenes with moving objects and temporal dynamics
  • Real-Time Performance: Speed and efficiency benchmarks for interactive applications
  • User Studies: Perceptual quality evaluation through human assessments
