README.md

April 24, 2026 · View on GitHub

WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories

Yisu Zhang1,2*   Chenjie Cao2*   Tengfei Wang2†   Xuhui Zuo2   Junta Wu2   Jianke Zhu1‡   Chunchao Guo2

1Zhejiang University    2Tencent Hunyuan
*Equal Contribution   †Project Lead   ‡Corresponding Author

arXiv HuggingFace CVPR 2026


📅 News


✅ TODO List

  • Release inference code and model weights of WorldStereo 2.0
  • Release data pre-processing pipelines for panoramic and multi-trajectory scenes

📖 Abstract

We propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules:

  • Global-Geometric Memory (GGM) enables precise camera control while injecting coarse structural priors through incrementally updated point clouds via a ControlNet branch.
  • Spatial-Stereo Memory (SSM) constrains the model's attention receptive fields with 3D correspondences to focus on fine-grained details from the memory bank.

Together, these components enable WorldStereo to generate multi-view-consistent videos under precise camera control, facilitating high-quality 3D reconstruction. Furthermore, WorldStereo shows impressive efficiency by leveraging a distribution-matching distilled (DMD) VDM backbone without joint training.


🎬 Results

3D Reconstruction from a Single Image

Given a single reference image, WorldStereo generates multi-view consistent videos and reconstructs a dense 3D point cloud. Below are example results on two scenes.

Scene: Kitchen   |   Input image → Point cloud (5 views)

Kitchen input   Kitchen pcd 1 Kitchen pcd 2 Kitchen pcd 3 Kitchen pcd 4 Kitchen pcd 5

Camera Control

Methods Camera Metrics Visual Quality
RotErr ↓TransErr ↓ATE ↓ Q-Align ↑CLIP-IQA+ ↑Laion-Aes ↑CLIP-I ↑
SEVA1.6901.5782.8793.2320.4794.62377.16
Gen3C0.9441.5802.7893.3530.4894.86382.33
WorldStereo0.7621.2452.1414.1490.5475.25789.05
WorldStereo 2.00.4920.9681.7684.2050.5445.26689.43

Single-View-Generated Reconstruction

Methods Tanks-and-Temples MipNeRF360
Precision ↑ Recall ↑ F1-Score ↑ AUC ↑ Precision ↑ Recall ↑ F1-Score ↑ AUC ↑
SEVA 33.59 35.34 36.73 51.03 22.38 55.63 28.75 46.81
Gen3C 46.73 25.51 31.24 42.44 23.28 75.37 35.26 52.10
Lyra 50.38 28.67 32.54 43.05 30.02 58.60 36.05 49.89
FlashWorld 26.58 20.72 22.29 30.45 35.97 53.77 42.60 53.86
WorldStereo 2.0 43.62 41.02 41.43 58.19 43.19 65.32 51.27 65.79
WorldStereo 2.0 (DMD) 40.41 44.41 43.16 60.09 42.34 64.83 50.52 65.64

🆕 WorldStereo 2.0 vs. 1.0

WorldStereo 2.0 introduces four key improvements over the original version:

WorldStereo 1.0WorldStereo 2.0
Latent SpaceStandard video latent spaceKeyframe latent space — encodes each frame independently, substantially improving visual quality of generated novel views and completely supporting parallel encoding/decoding
Memory MechanismCross-attention to retrieved reference framesStereo stitching in the main branch — reference views are spatially concatenated with target frames along the width dimension in the main DiT branch, enabling stronger and more direct memory fusion
Backbone Fine-tuningFrozen backbonePartial backbone fine-tuning — backbone weights are selectively updated to adapt to the keyframe latent space and improve overall generation quality
Training DataLimited camera trajectoriesExpanded UE rendering data — significantly more Unreal Engine rendered scenes with diverse and precise camera motions, leading to stronger camera control and memory capabilities

More details of WorldStereo 2.0 are shown in HY-World 2.0.


⚙️ Installation

1. Clone the repository:

git clone https://github.com/FuchengSu/WorldStereo.git
cd WorldStereo

2. Install core dependencies:

conda create -n worldstereo python=3.11
conda activate worldstereo
pip install -r requirements.txt

3. Install PyTorch3D (required for point cloud rendering):

pip install --no-build-isolation "git+https://github.com/facebookresearch/pytorch3d.git@stable"

4. Install MoGe (monocular depth estimation):

pip install git+https://github.com/microsoft/MoGe.git@0286b495230a074aadf1c76cc5c679e943e5d1c6

5. (Optional) Install third-party reconstruction module for WorldMirror reconstruction:

mkdir third_party
cd third_party
git clone https://github.com/Tencent-Hunyuan/HY-World-2.0.git
pip install -r HY-World-2.0/requirements.txt

Note: third_party/HY-World-2.0 is required only for apply_worldmirror post-processing (multi-view depth consistency and Gaussian Splatting reconstruction). You can skip it for basic video generation.


🚀 Quick Start

Model Variants

WorldStereo ships three model variants, each suited to a different use case:

Model TypeEntry PointDescription
worldstereo-camerarun_camera_control.pyCamera control only; single-view input
worldstereo-memoryrun_camera_control.py / run_multi_traj.pyFull model with GGM + SSM; multi-view consistent generation; best quality
worldstereo-memory-dmdrun_camera_control.py / run_multi_traj.pyDMD distillation variant; 4-step inference, fastest

Models are automatically downloaded from HuggingFace Hub (hanshanxue/WorldStereo) on first run.


Single-View Camera Control

Generate a video from a single image along a specified camera trajectory:

python run_camera_control.py \
    --model_type worldstereo-camera \
    --input_path examples/images \
    --output_path outputs \
    --seed 1024

Multi-GPU Inference (Sequence Parallel)

Scale to multiple GPUs using Sequence Parallelism (SP) and FSDP:

torchrun --nproc_per_node=8 run_camera_control.py \
    --model_type worldstereo-memory \
    --input_path examples/panorama \
    --output_path outputs \
    --fsdp

Multi-Trajectory Inference (Panorama / Reconstruction)

For panoramic scene generation or 3D reconstruction from multiple trajectories:

# Panoramic scene generation
torchrun --nproc_per_node=8 run_multi_traj.py \
    --model_type worldstereo-memory \
    --task_type panorama \
    --input_path examples/panorama \
    --output_path outputs \
    --fsdp

# Panoramic scene generation (DMD fast variant)
torchrun --nproc_per_node=8 run_multi_traj.py \
    --model_type worldstereo-memory-dmd \
    --task_type panorama \
    --input_path examples/panorama \
    --output_path outputs \
    --fsdp

# 3D scene reconstruction
torchrun --nproc_per_node=8 run_multi_traj.py \
    --model_type worldstereo-memory \
    --task_type reconstruction \
    --input_path examples/reconstruction \
    --output_path outputs \
    --fsdp

# 3D scene reconstruction (DMD fast variant)
torchrun --nproc_per_node=8 run_multi_traj.py \
    --model_type worldstereo-memory-dmd \
    --task_type reconstruction \
    --input_path examples/reconstruction \
    --output_path outputs \
    --fsdp

WorldMirror 3D Reconstruction (Optional)

After running run_multi_traj.py, the memory bank is automatically exported to a WorldMirror-compatible format under <output_path>/<scene>/world_mirror_data/<model_type>/. You can then run feedforward 3D reconstruction with HY-World 2.0:

# Requires: pip install -r third_party/HY-World-2.0/requirements.txt

cd third_party/HY-World-2.0
torchrun --nproc_per_node=8 -m hyworld2.worldrecon.pipeline --input_path ../../outputs/<scene>/world_mirror_data/<model_type>/images \
          --prior_cam_path ../../outputs/<scene>/world_mirror_data/<model_type>/cameras.json \
          --strict_output_path ../../outputs/<scene>/world_mirror_data/<model_type>/results \
          --target_size 832 --use_fsdp --enable_bf16 --no_save_normal --no_save_gs --no_sky_mask \
          --apply_edge_mask --apply_confidence_mask --confidence_percentile 15.0 --compress_pts --no_interactive \
          --disable_heads gs points

This produces metric-scale depth, surface normals, camera poses, a dense point cloud (.ply), and optionally Gaussian Splat renderings from the generated multi-view frames.


Python API

import torch
from models.worldstereo_wrapper import WorldStereo

device = torch.device("cuda:0")

worldstereo = WorldStereo.from_pretrained(
    "hanshanxue/WorldStereo",
    subfolder="worldstereo-memory",   # or "worldstereo-camera" / "worldstereo-memory-dmd"
    sp_world_size=1,
    fsdp=False,
    device=device,
)

output = worldstereo(**pipeline_inputs)

CLI Reference

run_camera_control.py

FlagDefaultDescription
--model_typeworldstereo-cameraModel variant to use
--input_pathexamples/imagesInput scene directory
--output_pathoutputsOutput directory
--local_files_onlyFalseUse locally cached weights instead of downloading
--fsdpFalseEnable FSDP model sharding
--seed1024Random seed

run_multi_traj.py (additional flags)

FlagDefaultDescription
--task_typepanoramapanorama or reconstruction
--align_nframe8Frames per clip saved for updating the memory bank

📂 Input Data Format

Camera-Only Inference (examples/images/)

<scene>/
├── image.png                 # reference image
├── prompt.json               # text descriptions at three verbosity levels
│   # {"short caption": ..., "medium caption": ..., "long caption": ...}
└── camera.json               # camera trajectory, extrinsic is [4x4] w2c in the opencv coordinate
    # {"motion_list": [...], "extrinsic": [...], "intrinsic": [...]}

Memory-Augmented Multi-Trajectory (examples/panorama/, examples/reconstruction/)

<scene>/
├── panorama.png              # (optional) full panorama — triggers VLM single-path inference
├── meta_info.json            # {"scene_type": "perspective" | "panorama"}
├── start_frame.png           # reference start image for depth initialization
└── render_results/
    └── <view_id>/
        └── <traj_id>/
            ├── render.mp4         # pre-rendered geometry video (point cloud warp)
            ├── render_mask.mp4    # binary occlusion mask video
            └── camera.json        # {"extrinsic": [...], "intrinsic": [...]}, extrinsic is [4x4] w2c in the opencv coordinate

🔧 Architecture

Model Variants

WorldStereo defines two transformer architectures in models/worldstereo.py, both extending WanTransformer3DModel from diffusers:

  • WorldStereoModel — Wan DiT backbone + ControlNet. Used by worldstereo-camera. The ControlNet encodes rendered point cloud geometry and camera embeddings, injecting residuals at each transformer block.
  • WorldStereoRefSModel — Extends WorldStereoModel with WanTransformerSparseSpatialBlock layers. These SSM blocks perform sparse attention over retrieved reference frames, guided by 3D correspondences. Used by worldstereo-memory and worldstereo-memory-dmd.

Inference Pipelines

Three pipelines are provided under models/pipelines/, selected automatically based on model_type in the config:

PipelineClassMode
pipeline_pcd_keyframe.pyKFPCDControllerPipelineCamera; standard DDIM sampling
pipeline_ref_keyframe.pyKFPCDControllerRefPipelineCamera + GGM + SSM; standard DDIM sampling
pipeline_dmd_keyframe.pyRefKFDMDGeneratorPipelineCamera + GGM + SSM; 4-step DMD distillation

3D Memory Bank

The memory bank (src/retrieval_wm.py) manages the growing 3D representation across trajectories:

  1. Init — MoGe depth estimation on the start frame lifts it to a point cloud.
  2. Retrieve — For each new target trajectory, the most relevant reference frames are selected via FOV-overlap scoring combined with DINOv2 image features and quality-aware furthest-point sampling.
  3. Update — After generation, new frames and their estimated depths are appended to the bank.
  4. Reconstruction — Feedforward reconstruction via HY-World 2.0 WorldMirror enforces multi-view depth consistency; final global alignment produces a unified point cloud.

Distributed Inference

WorldStereo supports two parallelism strategies:

  • Sequence Parallel (SP) — The sequence dimension is sharded across the SP group at each attention layer (models/attention.py). Controlled by torchrun --nproc_per_node.
  • FSDP — Full-Sharded Data Parallel wraps both the transformer and auxiliary encoders. Enabled with --fsdp. Requires a device_mesh with ("rep", "shard") dimensions.

🤝 Acknowledgements

WorldStereo builds upon the following excellent works:

  • Wan — Video DiT backbone
  • HunyuanVideo-1.5 — Components of sequence parallel and video generation model
  • MoGe — Monocular geometry estimation
  • HY-World 2.0 — WorldMirror reconstruction module
  • diffusers — Pipeline and model utilities

📝 Citation

If you find WorldStereo useful in your research, please cite:

@article{zhang2026worldstereo,
  title={WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories},
  author={Zhang, Yisu and Cao, Chenjie and Wang, Tengfei and Zuo, Xuhui and Wu, Junta and Zhu, Jianke and Guo, Chunchao},
  journal={arXiv preprint arXiv:2603.02049},
  year={2026}
}