WorldStereo
Overview
WorldStereo is a framework from researchers at Zhejiang University and Tencent Hunyuan for camera-guided video generation and 3D scene reconstruction. It is built on Video Diffusion Models (VDMs). Recent foundational VDMs achieve remarkable visual quality, but reconstructing consistent 3D scenes from their outputs has remained challenging because camera control is limited and content is inconsistent across viewpoints. WorldStereo addresses this with two dedicated geometric memory modules that enable both precise camera control and high-quality 3D reconstruction.
Key Features
Global-Geometric Memory
Enables precise camera control while injecting coarse structural priors through incrementally updated point clouds
Spatial-Stereo Memory
Constrains attention receptive fields with 3D correspondence to focus on fine-grained details from the memory bank
Multi-View Consistency
Generate videos under precise camera control with multi-view consistency for high-quality 3D reconstruction
Flexible World Model
Tackle diverse scene generation tasks from perspective or panoramic images with high-fidelity 3D results
Core Architecture
Global-Geometric Memory
The global-geometric memory module provides precise camera control by maintaining and updating point clouds that encode coarse structural priors. This module operates incrementally, building up a geometric understanding of the scene as video frames are generated. The point cloud representation allows WorldStereo to maintain spatial consistency across different camera trajectories while guiding the diffusion process.
Spatial-Stereo Memory
The spatial-stereo memory works at a finer granularity, constraining the model’s attention mechanisms using 3D correspondences. By focusing attention on relevant fine-grained details stored in the memory bank, this module ensures that generated content remains geometrically consistent when viewed from different angles. This is crucial for enabling high-quality 3D reconstruction from the generated videos.
How It Works
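To make the idea of correspondence-constrained attention in the spatial-stereo memory more concrete, the following is a minimal NumPy sketch, not WorldStereo's implementation (the code is not yet released). It assumes that each query token already has a 2D location in the memory view, obtained by reprojecting its 3D point, and it only lets the query attend to memory tokens within a pixel radius of that location. The names `correspondence_mask`, `masked_attention`, and `radius` are hypothetical and chosen for illustration only.

```python
import numpy as np


def correspondence_mask(query_xy, key_xy, radius):
    """Boolean (Q, K) mask: True where a memory token lies within `radius`
    pixels of the location the query's 3D point reprojects to (assumed given)."""
    dist = np.linalg.norm(query_xy[:, None, :] - key_xy[None, :, :], axis=-1)
    return dist <= radius


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to the entries allowed by `mask`.
    Queries with no allowed key receive an all-zero output."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)            # suppress disallowed keys
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores) * mask                  # zero out disallowed keys exactly
    denom = weights.sum(axis=1, keepdims=True)
    weights = np.divide(weights, denom, out=np.zeros_like(weights), where=denom > 0)
    return weights @ v, weights


# Toy example: 2 query tokens, 3 memory tokens, feature dimension 4.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))
k = rng.normal(size=(3, 4))
v = rng.normal(size=(3, 4))
query_xy = np.array([[0.0, 0.0], [10.0, 10.0]])      # reprojected query locations
key_xy = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])

mask = correspondence_mask(query_xy, key_xy, radius=1.5)
out, weights = masked_attention(q, k, v, mask)
```

In this toy setup the first query can only draw on the two nearby memory tokens, and the second query copies the single token at its reprojected location. A real system would compute the correspondences from depth and camera poses rather than take them as input.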
WorldStereo uses a control-branch architecture that integrates with a distribution-matching-distilled VDM backbone. This design keeps the approach efficient, because the geometric memory modules can be added without jointly training the entire system. The framework takes input images (either perspective or panoramic) and camera trajectories, and generates multi-view-consistent videos that can be used directly for 3D reconstruction.
WorldStereo has been evaluated on both camera-guided video generation and 3D reconstruction benchmarks, where it is reported to reach state-of-the-art performance in generating geometrically consistent content.
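As a rough illustration of the incrementally updated point cloud used by the global-geometric memory, the sketch below lifts depth maps into a shared world-space point cloud and renders that cloud into a new camera as a sparse depth map, the kind of coarse structural prior a control branch could consume. This is a generic NumPy sketch under stated assumptions (pinhole intrinsics `K`, 4x4 camera-to-world poses, known depth), not WorldStereo's code; `PointCloudMemory`, `unproject`, and `render_depth` are hypothetical names.

```python
import numpy as np


def unproject(depth, K, cam_to_world):
    """Lift an (H, W) depth map to (H*W, 3) world-space points."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                  # camera-space rays with z = 1
    cam_pts = rays * depth.reshape(-1, 1)
    return cam_pts @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]


def render_depth(points, K, cam_to_world, hw):
    """Z-buffer world points into an (H, W) depth map; empty pixels stay inf."""
    h, w = hw
    world_to_cam = np.linalg.inv(cam_to_world)
    cam_pts = points @ world_to_cam[:3, :3].T + world_to_cam[:3, 3]
    cam_pts = cam_pts[cam_pts[:, 2] > 1e-6]          # keep points in front of the camera
    proj = cam_pts @ K.T
    u = np.round(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.round(proj[:, 1] / proj[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth = np.full((h, w), np.inf)
    np.minimum.at(depth, (v[inside], u[inside]), cam_pts[inside, 2])  # nearest point wins
    return depth


class PointCloudMemory:
    """Hypothetical container that grows as new frames (with depth) arrive."""

    def __init__(self):
        self.points = np.empty((0, 3))

    def update(self, depth, K, cam_to_world):
        self.points = np.vstack([self.points, unproject(depth, K, cam_to_world)])

    def render(self, K, cam_to_world, hw):
        return render_depth(self.points, K, cam_to_world, hw)


# Toy usage: a flat wall 2 units away, seen by an 8x8 camera at the origin.
K = np.array([[4.0, 0.0, 4.0], [0.0, 4.0, 4.0], [0.0, 0.0, 1.0]])
memory = PointCloudMemory()
memory.update(np.full((8, 8), 2.0), K, np.eye(4))

same_view = memory.render(K, np.eye(4), (8, 8))      # the view that filled the memory
shifted_pose = np.eye(4)
shifted_pose[0, 3] = 1.0                             # move the camera 1 unit along +x
shifted_view = memory.render(K, shifted_pose, (8, 8))
```

Rendering from the shifted pose leaves some pixels empty (`inf`), which marks the regions a generator would have to synthesize. Once those frames exist, their geometry can be added with another `update` call, which is the incremental behavior described above.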
Research Background
WorldStereo was developed through collaboration between:
- Zhejiang University: Leading research in computer vision and 3D reconstruction
- Tencent Hunyuan: Advancing foundational models and AI infrastructure
Next Steps
Installation
Learn about installation requirements and pre-release status
Quick Start
Get started with WorldStereo once the code is released
Research Paper
Read the full technical paper on arXiv
GitHub Repository
Follow development and star the project