NADA

December 10, 2024 ยท View on GitHub

Official code for No Annotations for Object Detection in Art through Stable Diffusion (WACV 2025)

[๐Ÿ“– Paper] [๐Ÿ–ฅ๏ธ Project Page]

Setup

This repository is composed of three folders corresponding to different parts of training or evaluting NADA. The code is organized this way to prevent conflicting dependicies.

  • prompt-to-prompt

    This folder contains code for the class proposers not based on LLaVA and the class-conditioned detector. This uses code from Google's prompt-to-prompt repository and DAAM.

    Create a Python virtual environment and pip install the corresponding requirements file to set up the folder.

    • For the class-conditioned detector and weakly-supervised class proposer

      cd prompt-to-prompt
      python -m venv env
      source env/bin/activate
      pip install -r requirements.txt
      
    • For the non-LLaVA zero-shot class proposers

      cd prompt-to-prompt
      python -m venv cp_env
      source cp_env/bin/activate
      pip install -r cp_requirements.txt
      
  • detectron2

    Code for evaluating predictions made by NADA. Bounding boxes are saved in the COCO format, so we use Meta's Detectron2 library to evaluate them.

    Create a virtual environment and pip install from requirements.txt to set it up.

    cd detectron2
    python -m venv env
    source env/bin/activate
    pip install -r requirements.txt
    
  • LLaVA

    Code for generating outputs with LLaVA. We use LLaVA for our zero-shot class proposer and for caption prompt construction. This uses code from the official LLaVA repository.

    Create a Python environment and install from folder to set it up.

    cd LLaVA
    python -m venv env
    source env/bin/activate
    pip install -e .
    

Preparing data

Download ArtDL and IconArt and place the ArtDL and IconArt_v1 folders in a data folder at the root of the repository.

Using NADA

Using the class proposer

Weakly-supervised class proposer

Run prompt-to-prompt/classify/fc.py to train and perform inference (to create labels for use with the class-conditioned detector) with the weakly-supervised class proposer.

cd prompt-to-prompt
python classify/fc.py \
--dataset {artdl, iconart} \
--classification-type {single, multi} \
--data-type images \
--modes {train, eval, label} \
--num-layers {2, 3} \
--checkpoint checkpoints/{artdl, iconart}/checkpoint.ckpt \
--save-dir labels/{ex. artdl_wscp}

Specify --eval-label-split {} when eval or label (inference) is includes in --modes. Refer to prompt-to-prompt/data/classify_with_labels.py for the splits per dataset. Items in {} are options/examples.

Zero-shot class proposer

Run LLaVA/classify.py to train the zero-shot class proposer.

cd LLaVA
python classify.py \
--dataset {artdl, iconart} \
--prompt {who, score}
--dataset-split {}
--save-dir ../prompt-to-prompt/labels/{ex. artdl_zscp}

Use --prompt who (the choice prompt in the paper) for artdl and --prompt score (the score prompt in the paper) for iconart.

Using the class-conditioned detector

The class-conditioned detector uses the labels inferred by the class proposer to perform detection requires no training. The detector relies on a text prompt, and we support two kinds of prompt construction.

Template prompt construction

Template prompt construction inserts the labels into templates ร  la CLIP. Run prompt-to-prompt/generate.py:

cd prompt-to-prompt
python generate.py \
--dataset {artdl, iconart} \
--dataset-split {} \
--prompt-type {} \
--save-dir annotations/{ex. artdl_wscp} \
--label-dir labels/{ex. artdl_wscp}

In the paper, we use --prompt-type wikipedia for artdl and --prompt-type custom_1 for iconart.

Caption prompt construction

Caption prompt construction uses a caption containing the label as a prompt. First, create captions using LLaVA/caption.py:

cd LLaVA
python caption.py \
--dataset {artdl, iconart \
--dataset-split {} \
--prompt-type \
--label-dir {ex. ../prompt-to-prompt/labels/artdl_wscp} \
--save-dir {ex. ../prompt-prompt/captions/artdl_wscp}

Then run LLaVA/check_captions.py to check if the captions contain the labels at indices within the maximum input length of the diffusion model, and modify them if necessary.

--dataset {artdl, iconart \
--dataset-split {} \
--prompt-type \
--save-dir {ex. ../prompt-prompt/captions/artdl_wscp}

Once the captions are ready, use prompt-to-prompt/generate.py like in template prompt construction, but instead of --label-dir, use --caption-dir.

Evaluation

Use the nada_eval.ipynb notebook in LLaVA.

Citation

@InProceedings{Ramos_2025_WACV,
    author    = {Ramos, Patrick and Gonthier, Nicolas and Khan, Selina and Nakashima, Yuta and Garcia, Noa},
    title     = {No Annotations for Object Detection in Art through Stable Diffusion},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025}
}