README.md

April 24, 2026 · View on GitHub

WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories

Yisu Zhang^1,2* Chenjie Cao^2* Tengfei Wang^2† Xuhui Zuo² Junta Wu² Jianke Zhu^1‡ Chunchao Guo²

¹Zhejiang University ²Tencent Hunyuan
*Equal Contribution †Project Lead ‡Corresponding Author

📅 News

[2026.02] 🎉 WorldStereo is accepted by CVPR 2026!
[2026.03] 📄 Paper is now available on arXiv: https://arxiv.org/abs/2603.02049
[2026.04] 🚀 Code and model weights of WorldStereo 2.0 are now released!
[2026.04] 🚀 HY-World 2.0 are now released: https://github.com/Tencent-Hunyuan/HY-World-2.0 !

✅ TODO List

Release inference code and model weights of WorldStereo 2.0
Release data pre-processing pipelines for panoramic and multi-trajectory scenes

📖 Abstract

We propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules:

Global-Geometric Memory (GGM) enables precise camera control while injecting coarse structural priors through incrementally updated point clouds via a ControlNet branch.
Spatial-Stereo Memory (SSM) constrains the model's attention receptive fields with 3D correspondences to focus on fine-grained details from the memory bank.

Together, these components enable WorldStereo to generate multi-view-consistent videos under precise camera control, facilitating high-quality 3D reconstruction. Furthermore, WorldStereo shows impressive efficiency by leveraging a distribution-matching distilled (DMD) VDM backbone without joint training.

🎬 Results

3D Reconstruction from a Single Image

Given a single reference image, WorldStereo generates multi-view consistent videos and reconstructs a dense 3D point cloud. Below are example results on two scenes.

Scene: Kitchen | Input image → Point cloud (5 views)

Kitchen input Kitchen pcd 1 Kitchen pcd 2 Kitchen pcd 3 Kitchen pcd 4 Kitchen pcd 5

Camera Control

Methods	Camera Metrics			Visual Quality
Methods	RotErr ↓	TransErr ↓	ATE ↓	Q-Align ↑	CLIP-IQA+ ↑	Laion-Aes ↑	CLIP-I ↑
SEVA	1.690	1.578	2.879	3.232	0.479	4.623	77.16
Gen3C	0.944	1.580	2.789	3.353	0.489	4.863	82.33
WorldStereo	0.762	1.245	2.141	4.149	0.547	5.257	89.05
WorldStereo 2.0	0.492	0.968	1.768	4.205	0.544	5.266	89.43

Single-View-Generated Reconstruction

Methods	Tanks-and-Temples				MipNeRF360
Methods	Precision ↑	Recall ↑	F1-Score ↑	AUC ↑	Precision ↑	Recall ↑	F1-Score ↑	AUC ↑
SEVA	33.59	35.34	36.73	51.03	22.38	55.63	28.75	46.81
Gen3C	46.73	25.51	31.24	42.44	23.28	75.37	35.26	52.10
Lyra	50.38	28.67	32.54	43.05	30.02	58.60	36.05	49.89
FlashWorld	26.58	20.72	22.29	30.45	35.97	53.77	42.60	53.86
WorldStereo 2.0	43.62	41.02	41.43	58.19	43.19	65.32	51.27	65.79
WorldStereo 2.0 (DMD)	40.41	44.41	43.16	60.09	42.34	64.83	50.52	65.64

🆕 WorldStereo 2.0 vs. 1.0

WorldStereo 2.0 introduces four key improvements over the original version:

	WorldStereo 1.0	WorldStereo 2.0
Latent Space	Standard video latent space	Keyframe latent space — encodes each frame independently, substantially improving visual quality of generated novel views and completely supporting parallel encoding/decoding
Memory Mechanism	Cross-attention to retrieved reference frames	Stereo stitching in the main branch — reference views are spatially concatenated with target frames along the width dimension in the main DiT branch, enabling stronger and more direct memory fusion
Backbone Fine-tuning	Frozen backbone	Partial backbone fine-tuning — backbone weights are selectively updated to adapt to the keyframe latent space and improve overall generation quality
Training Data	Limited camera trajectories	Expanded UE rendering data — significantly more Unreal Engine rendered scenes with diverse and precise camera motions, leading to stronger camera control and memory capabilities

More details of WorldStereo 2.0 are shown in HY-World 2.0.

⚙️ Installation

1. Clone the repository:

git clone https://github.com/FuchengSu/WorldStereo.git
cd WorldStereo

2. Install core dependencies:

conda create -n worldstereo python=3.11
conda activate worldstereo
pip install -r requirements.txt

3. Install PyTorch3D (required for point cloud rendering):

pip install --no-build-isolation "git+https://github.com/facebookresearch/pytorch3d.git@stable"

4. Install MoGe (monocular depth estimation):

pip install git+https://github.com/microsoft/MoGe.git@0286b495230a074aadf1c76cc5c679e943e5d1c6

5. (Optional) Install third-party reconstruction module for WorldMirror reconstruction:

mkdir third_party
cd third_party
git clone https://github.com/Tencent-Hunyuan/HY-World-2.0.git
pip install -r HY-World-2.0/requirements.txt

Note: third_party/HY-World-2.0 is required only for apply_worldmirror post-processing (multi-view depth consistency and Gaussian Splatting reconstruction). You can skip it for basic video generation.

🚀 Quick Start

Model Variants

WorldStereo ships three model variants, each suited to a different use case:

Model Type	Entry Point	Description
`worldstereo-camera`	`run_camera_control.py`	Camera control only; single-view input
`worldstereo-memory`	`run_camera_control.py` / `run_multi_traj.py`	Full model with GGM + SSM; multi-view consistent generation; best quality
`worldstereo-memory-dmd`	`run_camera_control.py` / `run_multi_traj.py`	DMD distillation variant; 4-step inference, fastest

Models are automatically downloaded from HuggingFace Hub (hanshanxue/WorldStereo) on first run.

Single-View Camera Control

Generate a video from a single image along a specified camera trajectory:

python run_camera_control.py \
    --model_type worldstereo-camera \
    --input_path examples/images \
    --output_path outputs \
    --seed 1024

Multi-GPU Inference (Sequence Parallel)

Scale to multiple GPUs using Sequence Parallelism (SP) and FSDP:

torchrun --nproc_per_node=8 run_camera_control.py \
    --model_type worldstereo-memory \
    --input_path examples/panorama \
    --output_path outputs \
    --fsdp

Multi-Trajectory Inference (Panorama / Reconstruction)

For panoramic scene generation or 3D reconstruction from multiple trajectories:

# Panoramic scene generation
torchrun --nproc_per_node=8 run_multi_traj.py \
    --model_type worldstereo-memory \
    --task_type panorama \
    --input_path examples/panorama \
    --output_path outputs \
    --fsdp

# Panoramic scene generation (DMD fast variant)
torchrun --nproc_per_node=8 run_multi_traj.py \
    --model_type worldstereo-memory-dmd \
    --task_type panorama \
    --input_path examples/panorama \
    --output_path outputs \
    --fsdp

# 3D scene reconstruction
torchrun --nproc_per_node=8 run_multi_traj.py \
    --model_type worldstereo-memory \
    --task_type reconstruction \
    --input_path examples/reconstruction \
    --output_path outputs \
    --fsdp

# 3D scene reconstruction (DMD fast variant)
torchrun --nproc_per_node=8 run_multi_traj.py \
    --model_type worldstereo-memory-dmd \
    --task_type reconstruction \
    --input_path examples/reconstruction \
    --output_path outputs \
    --fsdp

WorldMirror 3D Reconstruction (Optional)

After running run_multi_traj.py, the memory bank is automatically exported to a WorldMirror-compatible format under <output_path>/<scene>/world_mirror_data/<model_type>/. You can then run feedforward 3D reconstruction with HY-World 2.0:

# Requires: pip install -r third_party/HY-World-2.0/requirements.txt

cd third_party/HY-World-2.0
torchrun --nproc_per_node=8 -m hyworld2.worldrecon.pipeline --input_path ../../outputs/<scene>/world_mirror_data/<model_type>/images \
          --prior_cam_path ../../outputs/<scene>/world_mirror_data/<model_type>/cameras.json \
          --strict_output_path ../../outputs/<scene>/world_mirror_data/<model_type>/results \
          --target_size 832 --use_fsdp --enable_bf16 --no_save_normal --no_save_gs --no_sky_mask \
          --apply_edge_mask --apply_confidence_mask --confidence_percentile 15.0 --compress_pts --no_interactive \
          --disable_heads gs points

This produces metric-scale depth, surface normals, camera poses, a dense point cloud (.ply), and optionally Gaussian Splat renderings from the generated multi-view frames.

Python API

import torch
from models.worldstereo_wrapper import WorldStereo

device = torch.device("cuda:0")

worldstereo = WorldStereo.from_pretrained(
    "hanshanxue/WorldStereo",
    subfolder="worldstereo-memory",   # or "worldstereo-camera" / "worldstereo-memory-dmd"
    sp_world_size=1,
    fsdp=False,
    device=device,
)

output = worldstereo(**pipeline_inputs)

CLI Reference

run_camera_control.py

Flag	Default	Description
`--model_type`	`worldstereo-camera`	Model variant to use
`--input_path`	`examples/images`	Input scene directory
`--output_path`	`outputs`	Output directory
`--local_files_only`	`False`	Use locally cached weights instead of downloading
`--fsdp`	`False`	Enable FSDP model sharding
`--seed`	`1024`	Random seed

run_multi_traj.py (additional flags)

Flag	Default	Description
`--task_type`	`panorama`	`panorama` or `reconstruction`
`--align_nframe`	`8`	Frames per clip saved for updating the memory bank

📂 Input Data Format

Camera-Only Inference (`examples/images/`)

<scene>/
├── image.png                 # reference image
├── prompt.json               # text descriptions at three verbosity levels
│   # {"short caption": ..., "medium caption": ..., "long caption": ...}
└── camera.json               # camera trajectory, extrinsic is [4x4] w2c in the opencv coordinate
    # {"motion_list": [...], "extrinsic": [...], "intrinsic": [...]}

Memory-Augmented Multi-Trajectory (`examples/panorama/`, `examples/reconstruction/`)

<scene>/
├── panorama.png              # (optional) full panorama — triggers VLM single-path inference
├── meta_info.json            # {"scene_type": "perspective" | "panorama"}
├── start_frame.png           # reference start image for depth initialization
└── render_results/
    └── <view_id>/
        └── <traj_id>/
            ├── render.mp4         # pre-rendered geometry video (point cloud warp)
            ├── render_mask.mp4    # binary occlusion mask video
            └── camera.json        # {"extrinsic": [...], "intrinsic": [...]}, extrinsic is [4x4] w2c in the opencv coordinate

🔧 Architecture

Model Variants

WorldStereo defines two transformer architectures in models/worldstereo.py, both extending WanTransformer3DModel from diffusers:

WorldStereoModel — Wan DiT backbone + ControlNet. Used by worldstereo-camera. The ControlNet encodes rendered point cloud geometry and camera embeddings, injecting residuals at each transformer block.
WorldStereoRefSModel — Extends WorldStereoModel with WanTransformerSparseSpatialBlock layers. These SSM blocks perform sparse attention over retrieved reference frames, guided by 3D correspondences. Used by worldstereo-memory and worldstereo-memory-dmd.

Inference Pipelines

Three pipelines are provided under models/pipelines/, selected automatically based on model_type in the config:

Pipeline	Class	Mode
`pipeline_pcd_keyframe.py`	`KFPCDControllerPipeline`	Camera; standard DDIM sampling
`pipeline_ref_keyframe.py`	`KFPCDControllerRefPipeline`	Camera + GGM + SSM; standard DDIM sampling
`pipeline_dmd_keyframe.py`	`RefKFDMDGeneratorPipeline`	Camera + GGM + SSM; 4-step DMD distillation

3D Memory Bank

The memory bank (src/retrieval_wm.py) manages the growing 3D representation across trajectories:

Init — MoGe depth estimation on the start frame lifts it to a point cloud.
Retrieve — For each new target trajectory, the most relevant reference frames are selected via FOV-overlap scoring combined with DINOv2 image features and quality-aware furthest-point sampling.
Update — After generation, new frames and their estimated depths are appended to the bank.
Reconstruction — Feedforward reconstruction via HY-World 2.0 WorldMirror enforces multi-view depth consistency; final global alignment produces a unified point cloud.

Distributed Inference

WorldStereo supports two parallelism strategies:

Sequence Parallel (SP) — The sequence dimension is sharded across the SP group at each attention layer (models/attention.py). Controlled by torchrun --nproc_per_node.
FSDP — Full-Sharded Data Parallel wraps both the transformer and auxiliary encoders. Enabled with --fsdp. Requires a device_mesh with ("rep", "shard") dimensions.

🤝 Acknowledgements

WorldStereo builds upon the following excellent works:

Wan — Video DiT backbone
HunyuanVideo-1.5 — Components of sequence parallel and video generation model
MoGe — Monocular geometry estimation
HY-World 2.0 — WorldMirror reconstruction module
diffusers — Pipeline and model utilities

📝 Citation

If you find WorldStereo useful in your research, please cite:

@article{zhang2026worldstereo,
  title={WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories},
  author={Zhang, Yisu and Cao, Chenjie and Wang, Tengfei and Zuo, Xuhui and Wu, Junta and Zhu, Jianke and Guo, Chunchao},
  journal={arXiv preprint arXiv:2603.02049},
  year={2026}
}