July 2, 2026

Direct 3D-Aware Object Insertion via Decomposed Visual Proxies

🔥 ICML 2026

Jingbo Gong¹,³ Yikai Wang²,✉ Yushi Lan² Yuhao Wan¹ Ziheng Ouyang¹
Rui Zhao⁴ Ming-Ming Cheng¹,⁵,✉ Qibin Hou¹ Chen Change Loy²

¹VCIP, NKU ²S-Lab, NTU ³ZGCA ⁴SenseTime Research ⁵NKIARI, Shenzhen Futian

DIRECT enables pose-controllable object insertion with explicit geometric guidance from a reconstructed 3D proxy.

[Figure: DIRECT teaser]

For more visual results, please check out our project page.

📬 News

[2026.07] Release training dataset, training code, and preprocessing code.
[2026.06] Release inference code, interactive demo, and model weights.
[2026.05] DIRECT was accepted to ICML 2026! The repository and project page are now available.

📅 TODO

[x] Release inference code and interactive demo.
[x] Release dataset.
[x] Release training and preprocessing code.

🔧 Installation

The environment is tested with Python 3.10.18, PyTorch 2.4.0, and CUDA 11.8.

git clone https://github.com/Gong1130/DIRECT.git
cd DIRECT

conda create -n direct python=3.10.18 -y
conda activate direct

Install PyTorch for CUDA 11.8:

pip install torch==2.4.0+cu118 torchvision==0.19.0+cu118 --index-url https://download.pytorch.org/whl/cu118

Install the remaining dependencies:

pip install --no-build-isolation -r requirements.txt
pip install -e .

Some dependencies are compiled CUDA extensions. If the build cannot find CUDA, set CUDA_HOME to your local CUDA 11.8 toolkit path before installing the requirements.
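If the extension build fails, it can help to confirm that CUDA_HOME really points at a toolkit before re-running pip. The following is a minimal sketch using only the Python standard library. The helper name find_nvcc is ours and is not part of DIRECT, and the check assumes only the usual toolkit layout in which the compiler lives at bin/nvcc.

```python
import os
from pathlib import Path


def find_nvcc(cuda_home):
    """Return the path to nvcc under cuda_home, or None if it is missing.

    Hypothetical helper for illustration; not part of the DIRECT codebase.
    """
    if not cuda_home:
        return None
    # Standard toolkit layout: <CUDA_HOME>/bin/nvcc (nvcc.exe on Windows).
    for name in ("nvcc", "nvcc.exe"):
        candidate = Path(cuda_home) / "bin" / name
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    cuda_home = os.environ.get("CUDA_HOME")
    nvcc = find_nvcc(cuda_home)
    if nvcc is None:
        print(f"CUDA_HOME={cuda_home!r} does not contain bin/nvcc; "
              "point it at your CUDA 11.8 toolkit before installing.")
    else:
        print(f"Found nvcc at {nvcc}")
```

The script only reports whether a compiler is present. It does not verify that the toolkit version is 11.8, which you can check separately with nvcc --version.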

🪄 Interactive Demo

Run the demo with:

python demo/demo.py --gradio_port 7860 --viser_port 8081

On the first run, the demo automatically downloads DIRECT, FLUX.1-Fill-dev, TRELLIS-image-large, SigLIP2, and RMBG-2.0 from Hugging Face. FLUX.1-Fill-dev and RMBG-2.0 are gated models, so accept their licenses on Hugging Face and authenticate before running the demo, either with huggingface-cli login or by setting the HF_TOKEN environment variable.
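Because the gated downloads fail without credentials, a quick preflight check can save a long wait. The sketch below uses only the standard library and is not part of DIRECT. The helper name hf_token_available is hypothetical, and it assumes the default token location used by huggingface-cli login (a file named token under HF_HOME, which itself defaults to ~/.cache/huggingface). Other overrides of that location are not handled here.

```python
import os
from pathlib import Path


def hf_token_available(env=None, home=None):
    """Return True if a Hugging Face token appears to be configured.

    Hypothetical helper: checks HF_TOKEN first, then the assumed default
    token file written by `huggingface-cli login`.
    """
    env = os.environ if env is None else env
    if env.get("HF_TOKEN"):
        return True
    hf_home = env.get("HF_HOME")
    if hf_home:
        base = Path(hf_home)
    else:
        # Assumed default cache location; adjust if your setup differs.
        base = Path(home or Path.home()) / ".cache" / "huggingface"
    token_file = base / "token"
    return token_file.is_file() and token_file.read_text().strip() != ""


if __name__ == "__main__":
    if hf_token_available():
        print("Hugging Face token found.")
    else:
        print("No token found: run `huggingface-cli login` or set HF_TOKEN.")
```

A token being present does not guarantee access. The licenses for FLUX.1-Fill-dev and RMBG-2.0 still have to be accepted with the same account.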

Once the demo is running, open the Gradio interface at http://localhost:7860. The Viser 3D viewer runs at http://localhost:8081 and is embedded in the Gradio page. The interface looks like this:

[Figure: DIRECT interactive demo]

If you run the demo on a remote server, forward both ports:

ssh -L 7860:localhost:7860 -L 8081:localhost:8081 <user>@<server>

After port forwarding, open http://localhost:7860 in your local browser to use the full demo.

📦 Dataset Download

Training DIRECT requires two downloads: the released DIRECT dataset, and the mask templates from MISATO, which are used for Shape-Decomposed Mask Augmentation (described in Section 3.4 of our paper).

Download and extract the DIRECT dataset:

https://huggingface.co/datasets/superGong/DIRECT-dataset

cd <path-to-DIRECT-dataset>

# Extract each archive in place; delete it only if extraction succeeded.
for t in MVImgNet/*.tar; do
  tar -xf "$t" -C MVImgNet && rm "$t"
done

for t in SA1B/*.tar; do
  tar -xf "$t" -C SA1B && rm "$t"
done

Download MISATO and keep only the object mask templates used by DIRECT:

https://huggingface.co/datasets/yikaiwang/MISATO

cd <path-to-MISATO>

unzip asuka_training_mask.zip \
  'asuka_training_mask/object_masks/*' \
  -x 'asuka_training_mask/object_masks/humanparsing_masks/*'

find . -mindepth 1 -maxdepth 1 ! -name 'asuka_training_mask' -exec rm -rf {} +

After downloading the datasets, update dataset_root and mask_template_path in dataset_config/direct_stage1_512.yaml and dataset_config/direct_stage2_1024.yaml to match your local paths.
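If you prefer to script this step, the sketch below rewrites the two keys with a plain text substitution, so it needs no YAML library and leaves the rest of the file untouched. It is an illustration rather than part of DIRECT. The function names are ours, and the code assumes each key appears on its own line as key: value. Any trailing comment on a rewritten line is dropped.

```python
import re
from pathlib import Path


def set_yaml_path(text, key, path):
    """Replace the value on every `key: ...` line, keeping its indentation.

    Hypothetical helper; assumes one `key: value` pair per line.
    """
    # Single-quote the path so characters such as '#' or ':' stay literal.
    quoted = "'" + path.replace("'", "''") + "'"
    pattern = re.compile(
        rf"^([ \t]*{re.escape(key)}[ \t]*:[ \t]*).*$", re.MULTILINE
    )
    new_text, count = pattern.subn(lambda m: m.group(1) + quoted, text)
    if count == 0:
        raise KeyError(f"{key!r} not found in config")
    return new_text


def update_config(config_path, dataset_root, mask_template_path):
    """Rewrite both path keys in one config file on disk."""
    config_path = Path(config_path)
    text = config_path.read_text()
    text = set_yaml_path(text, "dataset_root", dataset_root)
    text = set_yaml_path(text, "mask_template_path", mask_template_path)
    config_path.write_text(text)
```

For example, calling update_config once for dataset_config/direct_stage1_512.yaml and once for dataset_config/direct_stage2_1024.yaml, with your own local paths as arguments, updates both stages. Check the resulting files by eye, since the exact layout of the configs is not assumed beyond the two key names.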

🏋️ Training

We train DIRECT with Accelerate. Training is divided into two stages.

Stage 1 trains at 512 resolution:

bash training/train_direct_stage1.sh

Stage 2 trains at 1024 resolution and initializes from the Stage 1 checkpoint:

bash training/train_direct_stage2.sh

In our experiments, we train Stage 1 with 4 GPUs and Stage 2 with 8 GPUs.

🧩 Preprocess

We provide example preprocessing code for Geometric Alignment, described in Section 3.4 of our paper.

Given an object image, Geometric Alignment estimates the object's 6D pose relative to the TRELLIS-generated 3D object. This pipeline can serve as a reference for preparing DIRECT training data from other datasets.
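For readers unfamiliar with the term, a 6D pose combines a 3D rotation with a 3D translation. The sketch below shows one common way to represent and apply such a pose in pure Python. It is not the Geometric Alignment code from this repository: the function names, the yaw/pitch/roll parameterization, and the axis conventions are all assumptions chosen for illustration.

```python
import math


def rotation_matrix(yaw, pitch, roll):
    """Return R = Rz(yaw) @ Ry(pitch) @ Rx(roll), angles in radians.

    Illustrative convention only; the repository may use a different one.
    """
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]


def apply_pose(point, rotation, translation):
    """Map a 3D point through the pose: x' = R x + t."""
    return tuple(
        sum(r * x for r, x in zip(row, point)) + t
        for row, t in zip(rotation, translation)
    )


if __name__ == "__main__":
    # Rotate a point on the x-axis by 90 degrees about z, then lift it by 1.
    R = rotation_matrix(math.pi / 2, 0.0, 0.0)
    print(apply_pose((1.0, 0.0, 0.0), R, (0.0, 0.0, 1.0)))
```

The three angles and the three translation components together make up the six degrees of freedom.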

Please see preprocess for details.

📝 BibTeX

If you find DIRECT useful for your research, please consider citing our paper:

@inproceedings{gong2026direct,
  title     = {Direct 3D-Aware Object Insertion via Decomposed Visual Proxies},
  author    = {Jingbo Gong and Yikai Wang and Yushi Lan and Yuhao Wan and Ziheng Ouyang and Rui Zhao and Ming-Ming Cheng and Qibin Hou and Chen Change Loy},
  booktitle = {ICML},
  year      = {2026}
}

👏 Acknowledgements

This codebase builds on TRELLIS, FLUX, EasyControl, and the Hugging Face Diffusers ecosystem.

✉️ Contact

If you have any questions, please feel free to contact us at jingbogong@mail.nankai.edu.cn. We are actively improving DIRECT and welcome reports of failure cases and any other feedback!