README.md

June 25, 2025

BOOTPLACE: Bootstrapped Object Placement with Detection Transformers

BOOTPLACE is a paradigm that formulates object placement as a placement-by-detection problem. It first identifies suitable regions of interest for object placement by training a specialized detection transformer on object-subtracted backgrounds with multi-object supervision. It then semantically associates each target compositing object with the detected regions based on their complementary characteristics. Through a bootstrapped training approach applied to randomly object-subtracted images, it enforces meaningful placements via extensive paired data augmentation.
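The association step described above can be sketched as follows. This is an illustrative sketch only, not the actual BOOTPLACE implementation: the function name and the assumption that objects and detected regions are compared via feature similarity are hypothetical simplifications.

```python
import numpy as np

def associate_objects_to_regions(object_feats, region_feats):
    """Hypothetical sketch of the object-region association step:
    each compositing object is matched to the detected placement region
    whose feature embedding is most similar to the object's.

    object_feats: (num_objects, d) array of object embeddings
    region_feats: (num_regions, d) array of detected-region embeddings
    Returns: best region index for each object.
    """
    # cosine similarity between every object and every detected region
    obj = object_feats / np.linalg.norm(object_feats, axis=1, keepdims=True)
    reg = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    sim = obj @ reg.T  # shape (num_objects, num_regions)
    return sim.argmax(axis=1)
```

In the real model the association is learned end-to-end; this snippet only conveys the shape of the matching problem.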

Check out our Project Page for more visual demos!

โฉ Updates

03/20/2025

  • Release training code and pretrained models.

06/24/2025

  • Release inference code and data.

📦 Installation

Prerequisites

  • System: The code is currently tested only on Linux.

  • Hardware: An NVIDIA GPU with at least 16GB of memory is necessary. The code has been verified on NVIDIA A6000 GPUs.

  • Software:

    • Conda is recommended for managing dependencies.
    • Python version 3.6 or higher is required.

    Create a new conda environment named BOOTPLACE and install the dependencies:

    conda env create --file=BOOTPLACE.yml
    

Download the DETR-R50 pretrained model for fine-tuning here and place it at weights/detr-r50-e632da11.pth.
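For example, the checkpoint can be fetched from the public DETR release. The URL below is an assumption based on the standard DETR distribution, not a link from this README; prefer the link above if it differs.

```shell
# Create the weights directory and fetch the DETR-R50 checkpoint.
# URL is the standard public DETR release location (assumption).
mkdir -p weights
wget -q https://dl.fbaipublicfiles.com/detr/detr-r50-e632da11.pth \
     -O weights/detr-r50-e632da11.pth
```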

🤖 Pretrained Models

We provide the following pretrained models:

| Model | Description | #Params | Download |
| --- | --- | --- | --- |
| BOOTPLACE_Cityscapes | Multiple supervision | 523M | Download |

📚 Dataset

We provide a large-scale street-scene vehicle placement dataset, curated from Cityscapes, available via the Download link. The file structure is:

├── train
    ├── backgrounds:
        ├── imgID.png
        ├── ……
    ├── objects:
        ├── imgID:
            ├── object_name_ID.png
            ├── ……
        ├── ……
    ├── location:
        ├── imgID:
            ├── object_name_ID.txt
            ├── ……
        ├── ……
    ├── annotations.json
├── test
    ├── backgrounds:
        ├── imgID.png
        ├── ……
    ├── backgrounds_single:
        ├── imgID.png
        ├── ……
    ├── objects:
        ├── imgID:
            ├── object_name_ID.png
            ├── ……
        ├── ……
    ├── objects_single:
        ├── imgID:
            ├── object_name_ID.png
            ├── ……
        ├── ……
    ├── location:
        ├── imgID:
            ├── object_name_ID.txt
            ├── ……
        ├── ……
    ├── location_single:
        ├── imgID:
            ├── object_name_ID.txt
            ├── ……
        ├── ……
    ├── annotations.json
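A minimal sketch for indexing one split of this layout, pairing each background with its per-image object crops and location files. The function name is hypothetical, and parsing of annotations.json is omitted since its schema is not described here.

```python
from pathlib import Path

def index_split(split_dir):
    """Index one dataset split (train/ or test/) following the layout above.

    Returns {img_id: {"background": Path,
                      "objects": [Path, ...],
                      "locations": [Path, ...]}}.
    """
    split = Path(split_dir)
    samples = {}
    for bg in sorted((split / "backgrounds").glob("*.png")):
        img_id = bg.stem  # backgrounds/imgID.png -> imgID
        obj_dir = split / "objects" / img_id
        loc_dir = split / "location" / img_id
        samples[img_id] = {
            "background": bg,
            "objects": sorted(obj_dir.glob("*.png")) if obj_dir.is_dir() else [],
            "locations": sorted(loc_dir.glob("*.txt")) if loc_dir.is_dir() else [],
        }
    return samples
```

The same helper works for the `*_single` variants of the test split by pointing it at directories renamed accordingly.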

Training

To train a model on Cityscapes:

python -m main \
    --epochs 200 \
    --batch_size 2 \
    --save_freq 10 \
    --set_cost_class 1 \
    --ce_loss_coef 1 \
    --num_queries 120 \
    --eos_coef 0.1 \
    --lr 1e-4 \
    --data_path data/Cityscapes \
    --output_dir results/Cityscapes_ckpt \
    --resume weights/detr-r50-e632da11.pth

Inference

To run inference with a trained checkpoint:

python test.py \
    --num_queries 120 \
    --data_path data/Cityscapes \
    --pretrained_model 'results/Cityscapes_ckpt/checkpoint.pth' \
    --im_root 'data/Cityscapes/test' \
    --output_dir 'results/Cityscapes_inference'

โš–๏ธ License

This project is licensed under the terms of the MIT license.

📜 Citation

If you find this work helpful, please consider citing our paper:

@inproceedings{zhou2025bootplace,
  title={BOOTPLACE: Bootstrapped Object Placement with Detection Transformers},
  author={Zhou, Hang and Zuo, Xinxin and Ma, Rui and Cheng, Li},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={19294--19303},
  year={2025}
}