Overview
WorldStereo is a framework that bridges camera-guided video generation and 3D reconstruction via geometric memory modules. Its aim is multi-view-consistent video generation under precise camera control. The paper presents WorldStereo as a world model that handles diverse scene generation tasks with high-fidelity 3D results.
Authors
The WorldStereo project is a collaboration between Zhejiang University and Tencent Hunyuan.
Research Team
Yisu Zhang
Zhejiang University & Tencent Hunyuan
Equal Contribution
Chenjie Cao
Tencent Hunyuan
Equal Contribution
Tengfei Wang
Tencent Hunyuan
Project Lead
Xuhui Zuo
Tencent Hunyuan
Junta Wu
Tencent Hunyuan
Jianke Zhu
Zhejiang University
Corresponding Author
Chunchao Guo
Tencent Hunyuan
Institutional Affiliations
- Zhejiang University: Leading research university in China, home to the Computer Vision and Graphics Lab
- Tencent Hunyuan: Tencent’s AI research division focusing on multimodal foundation models
Abstract
Recent advances in foundational Video Diffusion Models (VDMs) have yielded significant progress. Yet, despite the remarkable visual quality of generated videos, reconstructing consistent 3D scenes from these outputs remains challenging, due to limited camera controllability and inconsistent generated content when viewed from distinct camera trajectories. In this paper, we propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules.
Key Innovation
Formally, the global-geometric memory enables precise camera control while injecting coarse structural priors through incrementally updated point clouds. Moreover, the spatial-stereo memory constrains the model’s attention receptive fields with 3D correspondence to focus on fine-grained details from the memory bank.
Results
These components enable WorldStereo to generate multi-view-consistent videos under precise camera control, facilitating high-quality 3D reconstruction. Furthermore, the flexible control branch-based WorldStereo shows impressive efficiency, benefiting from the distribution matching distilled VDM backbone without joint training. Extensive experiments across both camera-guided video generation and 3D reconstruction benchmarks demonstrate the effectiveness of our approach.
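To make the idea of an incrementally updated point cloud more concrete, the sketch below shows one generic way such a geometric memory could be maintained. It is an illustration only and not the authors' implementation. It assumes a pinhole camera, per-frame depth maps, and 4x4 camera poses, none of which are specified in this summary. The names `unproject`, `project`, and `PointCloudMemory` are hypothetical. Each new frame is lifted to world-space points and appended to the memory, and the accumulated points can then be projected into a target camera to see which pixels already have a coarse structural prior.

```python
import numpy as np


def unproject(depth, K, cam_to_world):
    """Lift a depth map of shape (H, W) to world-space points of shape (H*W, 3).

    Assumes a pinhole camera with 3x3 intrinsics K and a 4x4 camera-to-world pose.
    """
    h, w = depth.shape
    # Pixel grid: u is the column index, v is the row index.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    # Back-project through the inverse intrinsics, then scale by depth.
    rays = pix @ np.linalg.inv(K).T
    pts_cam = rays * depth.reshape(-1, 1)
    # Transform from camera coordinates to world coordinates.
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]


def project(points, K, world_to_cam):
    """Project world-space points into a camera; returns pixel coords and a validity flag."""
    pts_h = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)
    pts_cam = (pts_h @ world_to_cam.T)[:, :3]
    in_front = pts_cam[:, 2] > 1e-6
    uvw = pts_cam @ K.T
    # Replace non-positive depths with 1 so the division stays finite;
    # those points are excluded by the in_front flag anyway.
    z = np.where(in_front, uvw[:, 2], 1.0)
    return uvw[:, :2] / z[:, None], in_front


class PointCloudMemory:
    """Hypothetical container that accumulates points across generated frames."""

    def __init__(self):
        self.points = np.empty((0, 3))

    def update(self, depth, K, cam_to_world):
        # Incremental update: append the points observed in the new frame.
        new_points = unproject(depth, K, cam_to_world)
        self.points = np.concatenate([self.points, new_points], axis=0)

    def render_mask(self, K, world_to_cam, height, width):
        # Boolean image marking the pixels that some stored point projects onto.
        uv, in_front = project(self.points, K, world_to_cam)
        cols = np.round(uv[:, 0]).astype(int)
        rows = np.round(uv[:, 1]).astype(int)
        ok = in_front & (cols >= 0) & (cols < width) & (rows >= 0) & (rows < height)
        mask = np.zeros((height, width), dtype=bool)
        mask[rows[ok], cols[ok]] = True
        return mask
```

In a camera-guided pipeline, a mask or rendering of this kind could serve as a conditioning signal for the next trajectory, with uncovered pixels marking regions the generator still has to fill in. How WorldStereo actually encodes and injects its point clouds is described in the paper itself.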
Paper Access
arXiv Preprint
Read the full research paper on arXiv