Application to CLIP

December 23, 2023

Installation

The code for applying CLIM to the CLIP model is adapted from OpenCLIP-v2.16.0. Run the following commands to install the package:

cd CLIM/
pip install -e . -v

Data Preparation

The main experiments are conducted using images from COCO and CC3M. Please prepare the datasets and organize them as follows:

CLIM/
├── data
    ├── coco
        ├── annotations
            ├── panoptic_val2017.json
            ├── panoptic_val2017     # panoptic masks
        ├── wusize
            ├── captions_train2017_tags_allcaps.json
        ├── train2017
        ├── val2017
    ├── cc3m
        ├── cc3m_captions_train.json
        ├── train

The JSON file captions_train2017_tags_allcaps.json for the COCO captions can be obtained from Google Drive. For the CC3M dataset, please download the images using the CSV file from the official website, then generate a JSON file following the COCO format. The file cc3m_captions_train.json might look like:

{"images":
  [
    {"id": 1, "file_name": "train/0/0.jpg", "captions": ["a very typical bus station"]},
    {"id": 4, "file_name": "train/3/3.jpg", "captions": ["interior design of modern living room with fireplace in a new house"]}
  ]
}
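One way to produce a file in the format shown above is sketched below. This is an illustration under stated assumptions, not the authors' conversion script: the function name `build_captions_json` is hypothetical, the input is assumed to be (file_name, caption) pairs that you have already derived from the downloaded images, and the id scheme (1-based, in order of first appearance) is an assumption. The ids in the example above are not consecutive, so the actual scheme may differ.

```python
import json


def build_captions_json(pairs):
    """Build an {"images": [...]} dict from (file_name, caption) pairs.

    Captions that share a file_name are merged into a single entry.
    The id scheme (1-based, order of first appearance) is an assumption.
    """
    entries = {}
    for file_name, caption in pairs:
        if file_name not in entries:
            entries[file_name] = {
                "id": len(entries) + 1,
                "file_name": file_name,
                "captions": [],
            }
        entries[file_name]["captions"].append(caption)
    return {"images": list(entries.values())}


if __name__ == "__main__":
    # Example pairs taken from the sample entries above.
    pairs = [
        ("train/0/0.jpg", "a very typical bus station"),
        ("train/3/3.jpg", "interior design of modern living room with fireplace in a new house"),
    ]
    print(json.dumps(build_captions_json(pairs), indent=2))
```

To save the result, pass the returned dict to `json.dump` with a file opened at data/cc3m/cc3m_captions_train.json.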

Run

Original Models

To run CLIM, first obtain the original models using these links, and put them under checkpoints/ like the following:

CLIM/
├── checkpoints
    ├── ViT-B-16.pt
    ├── RN50x64.pt

Applying CLIM

We provide scripts to run CLIM. For example, to refine ViT-B/16 on the COCO dataset, run:

bash scripts/train_clim_coco_100e_openai_vitb16.sh

We also provide checkpoints of the models trained by CLIM on Google Drive.

Open-Vocabulary Object Detection

To build open-vocabulary detectors using the models trained by CLIM, please refer to the instructions in this README.