July 2, 2026
Direct 3D-Aware Object Insertion via Decomposed Visual Proxies
ICML 2026
Jingbo Gong, Yikai Wang, Yushi Lan, Yuhao Wan, Ziheng Ouyang, Rui Zhao, Ming-Ming Cheng, Qibin Hou, Chen Change Loy
DIRECT enables pose-controllable object insertion with explicit geometric guidance from a reconstructed 3D proxy.
For more visual results, please check out our project page.
News
- [2026.07] Release training dataset, training code, and preprocessing code.
- [2026.06] Release inference code, interactive demo, and model weights.
- [2026.05] DIRECT was accepted by ICML 2026! The repository and project page are now available.
TODO
- [x] Release inference code and interactive demo.
- [x] Release dataset.
- [x] Release training and preprocessing code.
Overview
Installation
The environment is tested with Python 3.10.18, PyTorch 2.4.0, and CUDA 11.8.
git clone https://github.com/Gong1130/DIRECT.git
cd DIRECT
conda create -n direct python=3.10.18 -y
conda activate direct
Install PyTorch for CUDA 11.8:
pip install torch==2.4.0+cu118 torchvision==0.19.0+cu118 --index-url https://download.pytorch.org/whl/cu118
Install the remaining dependencies:
pip install --no-build-isolation -r requirements.txt
pip install -e .
Some dependencies are compiled CUDA extensions. If the build cannot find CUDA, set CUDA_HOME to your local CUDA 11.8 toolkit path before installing the requirements.
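Before building the CUDA extensions, it can help to confirm that the environment matches the tested versions and that a CUDA toolkit is discoverable. The following is a minimal sketch, not part of the DIRECT repository: the helper names `installed_version` and `find_cuda_home` are hypothetical, and the fallback of locating the toolkit through `nvcc` on `PATH` is an assumption about a typical install layout.

```python
import os
import shutil
import sys
from importlib import metadata
from pathlib import Path

# Versions quoted in the installation notes above.
TESTED = {"torch": "2.4.0+cu118", "torchvision": "0.19.0+cu118"}


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_cuda_home(env=None):
    """Return a likely CUDA toolkit root, or None if nothing is found.

    Prefers CUDA_HOME when it contains bin/nvcc, then falls back to the
    directory two levels above an nvcc found on PATH (assumed layout).
    """
    env = os.environ if env is None else env
    cuda_home = env.get("CUDA_HOME")
    if cuda_home and (Path(cuda_home) / "bin" / "nvcc").exists():
        return cuda_home
    nvcc = shutil.which("nvcc", path=env.get("PATH"))
    if nvcc:
        # nvcc normally lives at <root>/bin/nvcc.
        return str(Path(nvcc).resolve().parent.parent)
    return None


if __name__ == "__main__":
    print("python", sys.version.split()[0], "(tested: 3.10.18)")
    for name, wanted in TESTED.items():
        print(name, installed_version(name), f"(tested: {wanted})")
    print("CUDA toolkit root:", find_cuda_home() or "not found; set CUDA_HOME")
```

If the last line reports that no toolkit was found, export CUDA_HOME to point at the CUDA 11.8 install before running the requirements step.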
Interactive Demo
Run the demo with:
python demo/demo.py --gradio_port 7860 --viser_port 8081
On the first run, the demo automatically downloads DIRECT, FLUX.1-Fill-dev, TRELLIS-image-large, SigLIP2, and RMBG-2.0 from Hugging Face. FLUX.1-Fill-dev and RMBG-2.0 are gated models, so accept their licenses on Hugging Face and authenticate with huggingface-cli login or by setting the HF_TOKEN environment variable before running the demo.
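A quick pre-flight check can confirm that some form of Hugging Face authentication is in place before the gated downloads start. This is a sketch under assumptions: the function name `find_hf_token` is hypothetical, and the token file location (`token` inside `HF_HOME`, defaulting to `~/.cache/huggingface`) is the usual default for `huggingface-cli login` but may differ if other cache variables are set. It reports only where a token was found and never prints the token itself.

```python
import os
from pathlib import Path


def find_hf_token(env=None, home=None):
    """Report where a Hugging Face token is available, or None.

    Returns "HF_TOKEN" if the environment variable is set, "token file" if a
    non-empty token file exists in the assumed default location, else None.
    """
    env = os.environ if env is None else env
    if env.get("HF_TOKEN", "").strip():
        return "HF_TOKEN"
    default_home = Path(home) if home is not None else Path.home()
    # Assumed default cache root; HF_HOME overrides it when set.
    hf_home = Path(env.get("HF_HOME") or default_home / ".cache" / "huggingface")
    token_file = hf_home / "token"
    if token_file.is_file() and token_file.read_text().strip():
        return "token file"
    return None


if __name__ == "__main__":
    source = find_hf_token()
    if source is None:
        print("No token found: run huggingface-cli login or set HF_TOKEN.")
    else:
        print(f"Token available via {source}.")
```

A found token does not prove that the model licenses were accepted; that step still has to be done on the model pages.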
Open the Gradio interface at http://localhost:7860. The Viser 3D viewer runs on http://localhost:8081 and is embedded inside the Gradio page.
After launching the demo, the interactive interface appears in the browser.
If you run the demo on a remote server, forward both ports:
ssh -L 7860:localhost:7860 -L 8081:localhost:8081 <user>@<server>
After port forwarding, open http://localhost:7860 in your local browser to use the full demo.
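If the page loads but the embedded 3D viewer stays blank, one of the two ports may not be reachable. The sketch below checks both from the local machine using only the standard library. The helper name `port_is_open` is hypothetical; the port numbers are the ones used in the demo command above.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Ports taken from the demo command: Gradio on 7860, Viser on 8081.
    for name, port in {"Gradio": 7860, "Viser": 8081}.items():
        state = "reachable" if port_is_open("localhost", port) else "NOT reachable"
        print(f"{name} (localhost:{port}): {state}")
```

Both ports should be reported as reachable while the demo and the SSH tunnel are running.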
Dataset Download
DIRECT training uses the released DIRECT dataset and the mask templates from MISATO for Shape-Decomposed Mask Augmentation, described in Section 3.4 of our paper.
Download and extract the DIRECT dataset:
cd <path-to-DIRECT-dataset>
for t in MVImgNet/*.tar; do
tar -xf "$t" -C MVImgNet
rm "$t"
done
for t in SA1B/*.tar; do
tar -xf "$t" -C SA1B
rm "$t"
done
Download MISATO and keep only the object mask templates used by DIRECT:
cd <path-to-MISATO>
unzip asuka_training_mask.zip \
'asuka_training_mask/object_masks/*' \
-x 'asuka_training_mask/object_masks/humanparsing_masks/*'
find . -mindepth 1 -maxdepth 1 ! -name 'asuka_training_mask' -exec rm -rf {} +
After downloading the datasets, update dataset_root and mask_template_path in dataset_config/direct_stage1_512.yaml and dataset_config/direct_stage2_1024.yaml to match your local paths.
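The path update can also be scripted. The sketch below rewrites `key: value` lines in place using only the standard library, so comments and ordering elsewhere in the file are preserved. It assumes each key sits on its own line as a simple scalar, which has not been checked against the actual config files; the function names and the example paths are hypothetical placeholders. Any trailing comment on a rewritten line is dropped.

```python
import json
import re
from pathlib import Path


def set_yaml_scalar(text, key, value):
    """Replace the value on every `key: ...` line, keeping its indentation."""
    pattern = re.compile(rf"^([ \t]*{re.escape(key)}[ \t]*:)[^\n]*$", re.MULTILINE)
    # A JSON string is also a valid YAML double-quoted scalar.
    quoted = json.dumps(str(value))
    new_text, count = pattern.subn(lambda m: f"{m.group(1)} {quoted}", text)
    if count == 0:
        raise KeyError(key)
    return new_text


def update_config(path, updates):
    """Apply several key updates to one config file on disk."""
    path = Path(path)
    text = path.read_text()
    for key, value in updates.items():
        text = set_yaml_scalar(text, key, value)
    path.write_text(text)


if __name__ == "__main__":
    # Dry run on an inline sample; the paths are placeholders.
    sample = "dataset_root: /old/path\nmask_template_path: /old/masks\n"
    sample = set_yaml_scalar(sample, "dataset_root", "/data/DIRECT-dataset")
    sample = set_yaml_scalar(sample, "mask_template_path", "/data/MISATO")
    print(sample, end="")
```

To edit the real files, call `update_config` once for each of the two config files named above, passing a dictionary with your own `dataset_root` and `mask_template_path` values.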
Training
We train DIRECT with Accelerate. Training is divided into two stages.
Stage 1 trains at 512 resolution:
bash training/train_direct_stage1.sh
Stage 2 trains at 1024 resolution and initializes from the Stage 1 checkpoint:
bash training/train_direct_stage2.sh
In our experiments, we train Stage 1 with 4 GPUs and Stage 2 with 8 GPUs.
Preprocess
We provide example preprocessing code for Geometric Alignment, described in Section 3.4 of our paper.
Given an object image, Geometric Alignment estimates the object's 6D pose (3D rotation and 3D translation) in the coordinate frame of the TRELLIS-generated 3D object. This pipeline can be used as a reference for preparing DIRECT training data on other datasets.
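For readers unfamiliar with the term, a 6D pose is a rigid transform with three rotational and three translational degrees of freedom. The sketch below only illustrates what such a pose represents by applying one to a 3D point; it does not show how DIRECT estimates the pose. The Euler-angle convention (yaw about z, pitch about y, roll about x, composed as Rz·Ry·Rx) and all function names are assumptions chosen for illustration.

```python
import math


def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_from_euler(yaw, pitch, roll):
    """Build R = Rz(yaw) @ Ry(pitch) @ Rx(roll); angles in radians."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    rz = [[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]]
    ry = [[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]]
    rx = [[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]]
    return matmul(matmul(rz, ry), rx)


def apply_pose(rotation, translation, point):
    """Map an object-space point to the target frame: R @ p + t."""
    return tuple(
        sum(rotation[i][k] * point[k] for k in range(3)) + translation[i]
        for i in range(3)
    )


if __name__ == "__main__":
    # Rotate 90 degrees about z, then shift 2 units along z.
    rotation = rotation_from_euler(math.pi / 2, 0.0, 0.0)
    moved = apply_pose(rotation, (0.0, 0.0, 2.0), (1.0, 0.0, 0.0))
    print([round(v, 6) for v in moved])
```

In this picture, estimating the pose means finding the rotation and translation under which the 3D object, once rendered, lines up with the object in the image.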
Please see the preprocess directory for details.
BibTeX
If you find DIRECT useful for your research, please consider citing our paper:
@inproceedings{gong2026direct,
title = {Direct 3D-Aware Object Insertion via Decomposed Visual Proxies},
author = {Jingbo Gong and Yikai Wang and Yushi Lan and Yuhao Wan and Ziheng Ouyang and Rui Zhao and Ming-Ming Cheng and Qibin Hou and Chen Change Loy},
booktitle = {ICML},
year = {2026}
}
Acknowledgements
This codebase builds on TRELLIS, FLUX, EasyControl, and the Hugging Face Diffusers ecosystem.
Contact
If you have any questions, please feel free to contact us at jingbogong@mail.nankai.edu.cn. We are actively improving DIRECT and welcome reports of failure cases and any other feedback from your use of it!