April 5, 2025
You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale
Baorui Ma*, Huachen Gao*, Haoge Deng*, Zhengxiong Luo, Tiejun Huang, Lulu Tang†, Xinlong Wang†
Beijing Academy of Artificial Intelligence, BAAI
* Equal Contribution, † Corresponding Author
Paper | Project Page | Models (Google Drive) | 🤗 HF Models | Dataset
Benefiting from the proposed web-scale dataset WebVi3D, See3D enables both object- and scene-level 3D creation, including sparse-view-to-3D, (text-) image-to-3D, and 3D editing. It can also be used for Gaussian Splatting to extract meshes or render images.
Highlights
- We present See3D, a scalable visual-conditional multi-view diffusion (MVD) model for open-world 3D creation, which can be trained on web-scale video collections without pose annotations.
- We curate WebVi3D, a multi-view image dataset containing static scenes with sufficient multi-view observations, and establish an automated pipeline for video data curation to train the MVD model.
- We introduce a novel warping-based 3D generation framework with See3D, which supports long-sequence generation with complex camera trajectories.
- We achieve state-of-the-art results in single-view and sparse-view reconstruction, demonstrating remarkable zero-shot and open-world generation capability and offering a novel perspective on scalable 3D generation.
News
[04/05/2025] :fire: See3D has been selected as a Highlight presentation!
[03/11/2025] :rocket: We have released the WebVi3D curation pipeline. Please refer to this guide.
[02/27/2025] :fire: See3D has been accepted to CVPR 2025.
[12/13/2024] :rocket: We have released the pretrained models and example test data on Hugging Face 🤗.
[12/10/2024] :rocket: We have released the pretrained models and inference code. You can download the models and example test data here.
Installation
git clone https://github.com/baaivision/See3D.git
cd See3D
pip install -r requirements.txt
Inference Code
We provide inference code for multi-view generation from single-view and sparse-view inputs. Add or remove the --super_resolution parameter as needed. The multi-view super-resolution model upscales the default 512 resolution to 1024 consistently across views, at the cost of more inference time and GPU memory. Please download the example test data here and put it in the dataset folder.
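Before launching inference, it can help to confirm that the example data actually landed in the dataset folder. The following is a minimal sketch, not part of the See3D repository: `check_dataset_dir` is a hypothetical helper, and because the internal layout of the folder is not described here, it only checks that the folder exists and is non-empty.

```shell
# Pre-flight check (hypothetical helper, not shipped with See3D).
# Assumption: the data lives in a folder named "dataset" relative to the
# current directory; pass a different path as the first argument if not.
check_dataset_dir() {
    local dir="${1:-dataset}"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir (download the example test data first)" >&2
        return 1
    fi
    # ls -A lists every entry except . and .., so empty output means an empty folder
    if [ -z "$(ls -A "$dir")" ]; then
        echo "empty: $dir" >&2
        return 1
    fi
    echo "ok: $dir"
}
```

Running `check_dataset_dir` from the repository root prints `ok: dataset` when the folder is populated and returns a non-zero status otherwise, so it can gate the inference scripts with `&&`.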
Generation Based on Single View Input
bash single_infer.sh
Generation Based on Sparse Views Input
bash sparse_infer.sh
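Since super-resolution trades inference time for output resolution, it may be useful to measure that cost on your own hardware. The sketch below is an assumption-laden convenience, not part of the repository: `run_timed` is a hypothetical wrapper that runs any command, reports wall-clock seconds on stderr, and passes the command's exit status through.

```shell
# Hypothetical timing wrapper; not part of the See3D repository.
run_timed() {
    local start end status
    status=0
    start=$(date +%s)
    # Run the wrapped command; remember a non-zero exit status instead of aborting
    "$@" || status=$?
    end=$(date +%s)
    echo "elapsed: $((end - start))s (exit status $status): $*" >&2
    return "$status"
}
```

For example, `run_timed bash single_infer.sh` can be run once with --super_resolution present and once with it removed to compare the two timings.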
Data Curation Pipeline of WebVi3D
We provide demo code for running the data curation pipeline on your own video dataset. Please refer to this dataset README for more details.
TODO List
- Release pretrained models.
- Release inference code.
- Release data curation pipeline from Internet Video.
- Release training scripts.
- Release 3D generation framework utilizing the warping-based pipeline.
- Release the evaluation code.
Acknowledgement
See3D is built using the awesome open-source projects: Stable Diffusion, MVDream, ViewCrafter, FrozenRecon.
Thanks to the maintainers of these projects for their contribution to the community!
Citation
If you find See3D helpful, please consider citing:
@inproceedings{Ma2025See3D,
  title = {You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale},
  author = {Baorui Ma and Huachen Gao and Haoge Deng and Zhengxiong Luo and Tiejun Huang and Lulu Tang and Xinlong Wang},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year = {2025}
}