April 5, 2025

You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale

Baorui Ma*, Huachen Gao*, Haoge Deng*, Zhengxiong Luo, Tiejun Huang, Lulu Tang†, Xinlong Wang†

Beijing Academy of Artificial Intelligence, BAAI
* Equal Contribution, † Corresponding Author

Paper | Project Page | Models (Google Drive) | 🤗 HF Models | Dataset

We present See3D, a visual-conditional multi-view diffusion (MVD) model trained on large-scale Internet videos for open-world 3D creation. The model aims to Get 3D knowledge by solely Seeing the visual contents from the vast and rapidly growing video data: You See it, You Got it. To achieve this, we first scale up the training data using a proposed data curation pipeline that automatically filters out multi-view inconsistencies and insufficient observations from source videos. This results in a high-quality, richly diverse, large-scale dataset of multi-view images, termed WebVi3D, containing 320M frames from 16M video clips. Nevertheless, learning generic 3D priors from videos without explicit 3D geometry or camera pose annotations is nontrivial, and annotating poses for web-scale videos is prohibitively expensive. To eliminate the need for pose conditions, we introduce a visual-condition: a purely 2D-inductive visual signal generated by adding time-dependent noise to the masked video data. Finally, we introduce a visual-conditional 3D generation framework that integrates See3D into a warping-based pipeline for high-fidelity 3D generation. Our numerical and visual comparisons on single-view and sparse-view reconstruction benchmarks show that See3D, trained on cost-effective and scalable video data, achieves notable zero-shot and open-world generation capabilities, markedly outperforming models trained on costly and constrained 3D datasets. Additionally, our model naturally supports other image-conditioned 3D creation tasks, such as 3D editing, without further fine-tuning.
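To make the idea of a visual-condition more concrete, here is a minimal sketch of "adding time-dependent noise to masked video data". It is an illustration under stated assumptions, not the See3D implementation: the function name `make_visual_condition`, the linear noise schedule, and the flat-list frame representation are all hypothetical, and the actual formulation is defined in the paper and the released code.

```python
import math
import random


def make_visual_condition(frames, masks, t, num_steps=1000, seed=0):
    """Corrupt masked frames with noise whose strength grows with timestep t.

    frames: list of frames, each a flat list of pixel values in [-1, 1]
    masks:  same shape as frames; 1 keeps a pixel, 0 hides it
    t:      diffusion timestep in [0, num_steps]
    """
    rng = random.Random(seed)
    # Hypothetical linear schedule (an assumption for illustration only):
    # no noise at t = 0, pure noise at t = num_steps.
    alpha = 1.0 - t / num_steps
    signal_scale = math.sqrt(alpha)
    noise_scale = math.sqrt(1.0 - alpha)

    condition = []
    for frame, mask in zip(frames, masks):
        noisy = [
            # Mask first, then blend the masked pixel with Gaussian noise.
            signal_scale * (pixel * keep) + noise_scale * rng.gauss(0.0, 1.0)
            for pixel, keep in zip(frame, mask)
        ]
        condition.append(noisy)
    return condition
```

In this sketch the condition at small `t` is close to the masked video, while at large `t` it carries almost no pixel information. No camera pose enters the computation at any point, which is the sense in which such a signal is purely 2D.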

Benefiting from the proposed web-scale dataset WebVi3D, See3D enables both object- and scene-level 3D creation, including sparse-view-to-3D, (text-) image-to-3D, and 3D editing. It can also be used for Gaussian Splatting to extract meshes or render images.

Highlights

We present See3D, a scalable visual-conditional MVD model for open-world 3D creation, which can be trained on web-scale video collections without pose annotations.
We curate WebVi3D, a multi-view image dataset containing static scenes with sufficient multi-view observations, and establish an automated pipeline for video data curation to train the MVD model.
We introduce a novel warping-based 3D generation framework with See3D, which supports long-sequence generation with complex camera trajectories.
We achieve state-of-the-art results in single-view and sparse-view reconstruction, demonstrating remarkable zero-shot and open-world generation capability and offering a novel perspective on scalable 3D generation.

News

[04/05/2025] :fire: See3D has been selected as a Highlight presentation!

[03/11/2025] :rocket: We have released the WebVi3D curation pipeline. Please refer to this guide.

[02/27/2025] :fire: See3D has been accepted to CVPR 2025.

[12/13/2024] :rocket: We have released the pretrained models and example test data on Hugging Face 🤗.

[12/10/2024] :rocket: We have released the pretrained models and inference code. You can download the models and example test data here.

Installation

git clone https://github.com/baaivision/See3D.git
cd See3D

pip install -r requirements.txt
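After installing, it can be useful to confirm that every package listed in requirements.txt was actually installed into the active environment. The following is a small sketch using only the Python standard library; the helper name `missing_requirements` is hypothetical and not part of the See3D repository, and it checks only whether a package is present, not whether its version matches.

```python
import re
from importlib import metadata


def missing_requirements(path="requirements.txt"):
    """Return the requirement names listed in `path` that are not installed."""
    missing = []
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            # Drop trailing comments and surrounding whitespace.
            line = raw.split("#", 1)[0].strip()
            # Skip blanks, pip options (-r, --index-url, ...) and direct URLs.
            if not line or line.startswith("-") or "://" in line:
                continue
            # The distribution name is the leading run of name characters,
            # before any version specifier or extras bracket.
            match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
            if match is None:
                continue
            name = match.group(0)
            try:
                metadata.version(name)
            except metadata.PackageNotFoundError:
                missing.append(name)
    return missing
```

Calling `missing_requirements()` from the repository root returns an empty list when every listed package is found.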

Inference Code

We provide inference code for multi-view generation from single-view and sparse-view inputs. Add or remove the --super_resolution parameter as needed. The multi-view super-resolution model upscales the default 512 resolution to 1024 consistently across views, at the cost of more inference time and GPU memory. Please download the example test data here and put it in the dataset folder.
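As a rough guide to why super-resolution costs more, the arithmetic below compares pixel counts per view at the two resolutions. It assumes square views (512×512 and 1024×1024), and pixel count is only a proxy: actual inference time and GPU memory depend on the super-resolution model itself, which runs in addition to the base generation. The function names are illustrative.

```python
def pixels_per_view(resolution):
    """Pixel count of one square view with the given side length."""
    return resolution * resolution


def super_resolution_pixel_ratio(base=512, upscaled=1024):
    """How many times more pixels each view has after upscaling."""
    return pixels_per_view(upscaled) / pixels_per_view(base)


print(super_resolution_pixel_ratio())  # 4.0
```

Doubling the side length quadruples the number of pixels in every generated view, so it is reasonable to leave --super_resolution off when GPU memory is tight.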

Generation Based on Single-View Input

bash single_infer.sh

Generation Based on Sparse-View Input

bash sparse_infer.sh
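Both scripts expect the example test data to be in the dataset folder. The sketch below wraps a script call with a simple pre-flight check. The helpers `dataset_ready` and `run_inference` are hypothetical and not part of the repository, and the default folder name `dataset` is taken from the instructions above; adjust it if your layout differs.

```python
import subprocess
from pathlib import Path


def dataset_ready(dataset_dir="dataset"):
    """Return True if the dataset folder exists and has at least one entry."""
    path = Path(dataset_dir)
    return path.is_dir() and any(path.iterdir())


def run_inference(script, dataset_dir="dataset"):
    """Run one of the provided shell scripts after checking the data folder."""
    if not dataset_ready(dataset_dir):
        raise FileNotFoundError(
            f"Put the example test data in '{dataset_dir}' before running {script}"
        )
    # check=True raises CalledProcessError if the script exits with an error.
    subprocess.run(["bash", script], check=True)
```

For example, `run_inference("single_infer.sh")` or `run_inference("sparse_infer.sh")`, called from the repository root.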

Data Curation Pipeline of WebVi3D

We provide demo code for running the data curation pipeline on your own video dataset. Please refer to this dataset README for more details.

TODO List

Release pretrained models.
Release inference code.
Release the data curation pipeline for Internet videos.
Release training scripts.
Release 3D generation framework utilizing the warping-based pipeline.
Release the evaluation code.

Acknowledgement

See3D is built using the awesome open-source projects: Stable Diffusion, MVDream, ViewCrafter, FrozenRecon.

Thanks to the maintainers of these projects for their contribution to the community!

Citation

If you find See3D helpful, please consider citing:

@inproceedings{Ma2025See3D,
    title = {You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale},
    author = {Baorui Ma and Huachen Gao and Haoge Deng and Zhengxiong Luo and Tiejun Huang and Lulu Tang and Xinlong Wang},
    booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year = {2025}
}