F-ViT: Build Open-Vocabulary Object Detectors Upon Frozen CLIP ViTs

October 3, 2023

Requirements

The detection framework is built upon MMDetection 2.x. To install MMCV and MMDetection 2.x from source, run:

cd ~/your/project/directory
git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout v1.7.0
MMCV_WITH_OPS=1 pip install -e . -v
cd ..
git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection
git checkout v2.28.1
pip install -e . -v

For other installation methods, refer to the official websites of MMCV and MMDetection.
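After installation, a quick sanity check can confirm that Python sees the pinned versions. The sketch below uses only the standard library. The distribution names are assumptions: MMCV 1.x built with `MMCV_WITH_OPS=1` is expected to register as `mmcv-full`, and MMDetection as `mmdet`. `check_versions` is a hypothetical helper, not part of either library.

```python
from importlib import metadata

# Pinned versions from the install commands above. The distribution names
# are assumptions: MMCV 1.x built with MMCV_WITH_OPS=1 is expected to
# register as "mmcv-full", and MMDetection as "mmdet".
EXPECTED = {"mmcv-full": "1.7.0", "mmdet": "2.28.1"}


def check_versions(expected):
    """Map each package to 'ok', 'missing', or 'found X, expected Y'."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if found == wanted:
            report[name] = "ok"
        else:
            report[name] = f"found {found}, expected {wanted}"
    return report


if __name__ == "__main__":
    for name, status in check_versions(EXPECTED).items():
        print(f"{name}: {status}")
```

If a package is reported as missing under these names, `pip list` shows the name it was actually registered under.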

Data Preparation

The main experiments are conducted on the COCO and LVIS datasets. We also perform transfer evaluation on Objects365v1. Prepare the datasets and organize them as follows:

CLIPSelf/F-ViT
├── data         # use a soft link to save disk space
    ├── coco
        ├── annotations
            ├── instances_val2017.json       # for transfer evaluation
        ├── train2017
        ├── val2017
        ├── zero-shot         # obtain the files from the drive 
            ├── instances_val2017_all_2.json
            ├── instances_train2017_seen_2_65_cat.json
    ├── lvis_v1
        ├── annotations
            ├── lvis_v1_train_seen_1203_cat.json  # obtain the files from the drive 
            ├── lvis_v1_val.json 
        ├── train2017    # same as coco
        ├── val2017      # same as coco
    ├── Objects365v1
        ├── objects365_reorder_val.json         # obtain the files from the drive 
        ├── val

For open-vocabulary detection, we provide some preprocessed json files in Drive. Put instances_val2017_all_2.json and instances_train2017_seen_2_65_cat.json under data/coco/zero-shot/, lvis_v1_train_seen_1203_cat.json under data/lvis_v1/annotations/, and objects365_reorder_val.json under data/Objects365v1/.
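Before launching a job, it can help to verify that the layout above is complete. The following is a minimal sketch, assuming it is run from `CLIPSelf/F-ViT` with the datasets under `data/`. `missing_entries` is a hypothetical helper, not part of the repository; the paths are copied from the directory tree above.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
REQUIRED = [
    "coco/annotations/instances_val2017.json",
    "coco/train2017",
    "coco/val2017",
    "coco/zero-shot/instances_val2017_all_2.json",
    "coco/zero-shot/instances_train2017_seen_2_65_cat.json",
    "lvis_v1/annotations/lvis_v1_train_seen_1203_cat.json",
    "lvis_v1/annotations/lvis_v1_val.json",
    "lvis_v1/train2017",
    "lvis_v1/val2017",
    "Objects365v1/objects365_reorder_val.json",
    "Objects365v1/val",
]


def missing_entries(data_root):
    """Return the required paths that do not exist under data_root."""
    root = Path(data_root)
    # Path.exists() follows symbolic links, so soft-linked datasets count.
    return [rel for rel in REQUIRED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries("data")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print(f"  data/{rel}")
    else:
        print("Dataset layout looks complete.")
```

The check only tests for existence; it does not validate the contents of the annotation files or image folders.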

CLIPSelf Checkpoints

Obtain the checkpoints from Drive and organize them as follows:

CLIPSelf/F-ViT/
├── checkpoints  # use a soft link to save disk space
    ├── eva_vitb16_coco_clipself_patches.pt     # 1
    ├── eva_vitb16_coco_clipself_proposals.pt   # 2
    ├── eva_vitb16_coco_regionclip.pt           # 3
    ├── eva_vitl14_coco_clipself_patches.pt     # 4
    ├── eva_vitl14_coco_clipself_proposals.pt   # 5
    ├── eva_vitl14_coco_regionclip.pt           # 6
    ├── eva_vitb16_lvis_clipself_patches.pt     # 7
    ├── eva_vitl14_lvis_clipself_patches.pt     # 8

Detectors

The detectors on OV-COCO are summarized as follows:

#	Backbone	CLIP Refinement	Proposals	AP50 (novel)	Config	Checkpoint
1	ViT-B/16	CLIPSelf	-	33.6	cfg	model
2	ViT-B/16	CLIPSelf	+	37.6	cfg	model
3	ViT-B/16	RegionCLIP	+	34.4	cfg	model
4	ViT-L/14	CLIPSelf	-	38.4	cfg	model
5	ViT-L/14	CLIPSelf	+	44.3	cfg	model
6	ViT-L/14	RegionCLIP	+	38.7	cfg	model

The detectors on OV-LVIS are summarized as follows:

#	Backbone	CLIP Refinement	Proposals	mAP (rare)	Config	Checkpoint
7	ViT-B/16	CLIPSelf	-	25.3	cfg	model
8	ViT-L/14	CLIPSelf	-	34.9	cfg	model

Test

We provide the checkpoints of the object detectors in Drive. Organize them as follows:

CLIPSelf/F-ViT/
├── checkpoints  # use a soft link to save disk space
    ├── fvit_eva_vitb16_ovcoco_clipself_patches.pth     # 1
    ├── fvit_eva_vitb16_ovcoco_clipself_proposals.pth   # 2
    ├── fvit_eva_vitb16_ovcoco_regionclip.pth           # 3
    ├── fvit_eva_vitb16_ovlvis_clipself_patches.pth     # 7
    ├── fvit_eva_vitl14_ovcoco_clipself_patches.pth     # 4
    ├── fvit_eva_vitl14_ovcoco_clipself_proposals.pth   # 5
    ├── fvit_eva_vitl14_ovcoco_regionclip.pth           # 6
    ├── fvit_eva_vitl14_ovlvis_clipself_patches.pth     # 8

An example of evaluation on OV-COCO:

bash dist_test.sh configs/ov_coco/fvit_vitb16_upsample_fpn_bs64_3e_ovcoco_eva_clipself_proposals.py \
     checkpoints/fvit_eva_vitb16_ovcoco_clipself_proposals.pth 8 \
     --work-dir your/working/directory --eval bbox

An example of evaluation on OV-LVIS:

bash dist_test.sh configs/ov_lvis/fvit_vitl14_upsample_fpn_bs64_4x_ovlvis_eva_clipself_patches.py \
     checkpoints/fvit_eva_vitl14_ovlvis_clipself_patches.pth 8 \
     --work-dir your/working/directory --eval segm

Transfer

Transfer evaluation on COCO:

bash dist_test.sh configs/transfer/fvit_vitl14_upsample_fpn_transfer2coco.py \
     checkpoints/fvit_eva_vitl14_ovlvis_clipself_patches.pth 8 \
     --work-dir your/working/directory --eval bbox

Transfer evaluation on Objects365v1:

bash dist_test.sh configs/transfer/fvit_vitl14_upsample_fpn_transfer2objects365v1.py \
     checkpoints/fvit_eva_vitl14_ovlvis_clipself_patches.pth 8 \
     --work-dir your/working/directory --eval bbox
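
The evaluation and transfer commands above all follow one pattern: config, checkpoint, GPU count, then extra flags. A minimal sketch of assembling that command programmatically, for example to run on a different number of GPUs, is given below. `build_test_command` is a hypothetical helper, not part of the repository, and it assumes `dist_test.sh` takes its arguments in exactly the order used above.

```python
import shlex


def build_test_command(config, checkpoint, gpus=8,
                       work_dir="your/working/directory", metric="bbox"):
    """Build the argument list for dist_test.sh in the order shown above."""
    return [
        "bash", "dist_test.sh",
        config,
        checkpoint,
        str(gpus),            # number of GPUs on this machine
        "--work-dir", work_dir,
        "--eval", metric,     # "bbox" for OV-COCO/transfer, "segm" for OV-LVIS
    ]


if __name__ == "__main__":
    cmd = build_test_command(
        "configs/transfer/fvit_vitl14_upsample_fpn_transfer2coco.py",
        "checkpoints/fvit_eva_vitl14_ovlvis_clipself_patches.pth",
    )
    # Print the command for inspection; pass cmd to subprocess.run to execute it.
    print(shlex.join(cmd))
```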

Train

Prepare the CLIPSelf/RegionCLIP checkpoints as described in the CLIPSelf Checkpoints section. An example of training on OV-COCO:

bash dist_train.sh configs/ov_coco/fvit_vitb16_upsample_fpn_bs64_3e_ovcoco_eva_clipself_proposals.py \
                  8 --work-dir your/working/directory

An example of training on OV-LVIS:

bash dist_train.sh configs/ov_lvis/fvit_vitl14_upsample_fpn_bs64_4x_ovlvis_eva_clipself_patches.py \
                  8 --work-dir your/working/directory

To use multiple machines (e.g., 2x8=16 GPUs) to speed up training on OV-LVIS, refer to the MMDetection tutorial. The config files set auto_scale_lr = dict(enable=True, base_batch_size=64), so the learning rate is adjusted automatically when the total batch size differs from 64.
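
To make the effect of `base_batch_size=64` concrete, the sketch below applies the linear scaling rule, which is what MMDetection's automatic learning-rate scaling is understood to follow. The base learning rate of `1e-4` and the value of 8 samples per GPU are illustrative assumptions (the latter inferred from "bs64" on 8 GPUs), not values read from the config files. `scaled_lr` is a hypothetical helper.

```python
import math


def scaled_lr(base_lr, num_gpus, samples_per_gpu, base_batch_size=64):
    """Linear scaling rule: the learning rate grows with the total batch size."""
    total_batch_size = num_gpus * samples_per_gpu
    return base_lr * total_batch_size / base_batch_size


if __name__ == "__main__":
    base_lr = 1e-4  # hypothetical value, for illustration only

    # 8 GPUs x 8 samples = 64, equal to base_batch_size: lr is unchanged.
    assert math.isclose(scaled_lr(base_lr, 8, 8), 1e-4)

    # 16 GPUs x 8 samples = 128, twice base_batch_size: lr is doubled.
    assert math.isclose(scaled_lr(base_lr, 16, 8), 2e-4)

    # 16 GPUs x 4 samples = 64: same total batch size, lr is unchanged.
    assert math.isclose(scaled_lr(base_lr, 16, 4), 1e-4)

    print("linear scaling checks passed")
```

Under this rule, moving from 8 to 16 GPUs without changing the per-GPU batch size doubles the learning rate, while halving the per-GPU batch size keeps it the same.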