F-ViT: Build Open-Vocabulary Object Detectors Upon Frozen CLIP ViTs
October 3, 2023
Requirements
The detection framework is built upon MMDetection 2.x. To install MMCV 1.7.0 (with its compiled ops) and MMDetection 2.28.1 from source, run:
```bash
cd ~/your/project/directory
git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout v1.7.0
MMCV_WITH_OPS=1 pip install -e . -v
cd ..
git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection
git checkout v2.28.1
pip install -e . -v
```
For other installation methods, please refer to the official MMCV and MMDetection documentation.
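To confirm that Python picks up the source builds pinned above, you can compare the reported versions and try to import MMCV's compiled ops. This is a minimal sketch: `installed_versions`, `check_versions` and `mmcv_ops_available` are hypothetical helpers (not part of MMCV or MMDetection), and the expected versions are simply the checkout tags used above.

```python
import importlib

# Versions pinned by the git checkouts above (v1.7.0 and v2.28.1).
EXPECTED = {"mmcv": "1.7.0", "mmdet": "2.28.1"}


def installed_versions(modules):
    """Return {module: __version__ or None} for each module name."""
    versions = {}
    for name in modules:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None  # not installed / not importable
            continue
        versions[name] = getattr(module, "__version__", None)
    return versions


def check_versions(installed, expected=EXPECTED):
    """Return a list of human-readable problems (empty if everything matches)."""
    problems = []
    for name, want in expected.items():
        have = installed.get(name)
        if have is None:
            problems.append(f"{name} is not installed")
        elif have != want:
            problems.append(f"{name} {have} found, expected {want}")
    return problems


def mmcv_ops_available():
    """True if MMCV's compiled ops (built with MMCV_WITH_OPS=1) can be imported."""
    try:
        importlib.import_module("mmcv.ops")
    except (ImportError, OSError):
        return False
    return True


if __name__ == "__main__":
    issues = check_versions(installed_versions(EXPECTED))
    if not mmcv_ops_available():
        issues.append("mmcv.ops could not be imported (were the ops compiled?)")
    print("\n".join(issues) if issues else "MMCV 1.7.0 (with ops) and MMDetection 2.28.1 found.")
```

Run it from the same environment you installed into; an empty problem list means both packages report the pinned versions.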
Data Preparation
The main experiments are conducted on the COCO and LVIS datasets, and we also perform transfer evaluation on Objects365v1. Prepare the datasets and organize them as follows:
```
CLIPSelf/F-ViT
├── data                 # use soft link to save storage on the disk
    ├── coco
        ├── annotations
            ├── instances_val2017.json            # for transfer evaluation
        ├── train2017
        ├── val2017
        ├── zero-shot                             # obtain the files from Drive
            ├── instances_val2017_all_2.json
            ├── instances_train2017_seen_2_65_cat.json
    ├── lvis_v1
        ├── annotations
            ├── lvis_v1_train_seen_1203_cat.json  # obtain the files from Drive
            ├── lvis_v1_val.json
        ├── train2017                             # same as coco
        ├── val2017                               # same as coco
    ├── Objects365v1
        ├── objects365_reorder_val.json           # obtain the files from Drive
        ├── val
```
For open-vocabulary detection, we provide preprocessed JSON files on Drive. Put instances_val2017_all_2.json and instances_train2017_seen_2_65_cat.json under data/coco/zero-shot/, lvis_v1_train_seen_1203_cat.json under data/lvis_v1/annotations/, and objects365_reorder_val.json under data/Objects365v1/.
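Because LVIS v1 reuses the COCO 2017 images, lvis_v1/train2017 and lvis_v1/val2017 can be soft links to the COCO folders rather than copies. The sketch below creates those links and lists any path from the layout above that is still missing. The helper names `link_lvis_images` and `missing_paths` are hypothetical, and the script assumes it is given the CLIPSelf/F-ViT directory as `root`.

```python
from pathlib import Path

# Paths from the data layout above, relative to CLIPSelf/F-ViT.
REQUIRED = [
    "data/coco/annotations/instances_val2017.json",
    "data/coco/train2017",
    "data/coco/val2017",
    "data/coco/zero-shot/instances_val2017_all_2.json",
    "data/coco/zero-shot/instances_train2017_seen_2_65_cat.json",
    "data/lvis_v1/annotations/lvis_v1_train_seen_1203_cat.json",
    "data/lvis_v1/annotations/lvis_v1_val.json",
    "data/lvis_v1/train2017",
    "data/lvis_v1/val2017",
    "data/Objects365v1/objects365_reorder_val.json",
    "data/Objects365v1/val",
]


def link_lvis_images(root):
    """Soft-link data/lvis_v1/{train2017,val2017} to the COCO image folders.

    Entries that already exist (as folders or links) are left untouched.
    """
    root = Path(root)
    for split in ("train2017", "val2017"):
        target = (root / "data" / "coco" / split).resolve()
        link = root / "data" / "lvis_v1" / split
        if not link.exists() and not link.is_symlink():
            link.parent.mkdir(parents=True, exist_ok=True)
            link.symlink_to(target, target_is_directory=True)


def missing_paths(root):
    """Return the required paths that do not exist under root."""
    root = Path(root)
    return [p for p in REQUIRED if not (root / p).exists()]


if __name__ == "__main__":
    link_lvis_images(".")
    for path in missing_paths("."):
        print("missing:", path)
```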
CLIPSelf Checkpoints
Obtain the checkpoints from Drive and organize them as follows:
```
CLIPSelf/F-ViT/
├── checkpoints          # use soft link to save storage on the disk
    ├── eva_vitb16_coco_clipself_patches.pt       # 1
    ├── eva_vitb16_coco_clipself_proposals.pt     # 2
    ├── eva_vitb16_coco_regionclip.pt             # 3
    ├── eva_vitl14_coco_clipself_patches.pt       # 4
    ├── eva_vitl14_coco_clipself_proposals.pt     # 5
    ├── eva_vitl14_coco_regionclip.pt             # 6
    ├── eva_vitb16_lvis_clipself_patches.pt       # 7
    ├── eva_vitl14_lvis_clipself_patches.pt       # 8
```
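If the checkpoints are stored on a larger disk, the checkpoints folder can be a single soft link, and you can then check which files are still missing before training. This is a sketch under stated assumptions: `link_checkpoint_dir` and `missing_checkpoints` are hypothetical helpers, and the numeric keys are the "#" labels from the layout above, which match the rows of the detector tables.

```python
from pathlib import Path

# CLIPSelf/RegionCLIP checkpoints from the layout above, keyed by the "#" label.
CLIP_CHECKPOINTS = {
    1: "eva_vitb16_coco_clipself_patches.pt",
    2: "eva_vitb16_coco_clipself_proposals.pt",
    3: "eva_vitb16_coco_regionclip.pt",
    4: "eva_vitl14_coco_clipself_patches.pt",
    5: "eva_vitl14_coco_clipself_proposals.pt",
    6: "eva_vitl14_coco_regionclip.pt",
    7: "eva_vitb16_lvis_clipself_patches.pt",
    8: "eva_vitl14_lvis_clipself_patches.pt",
}


def link_checkpoint_dir(project_root, storage_dir):
    """Make project_root/checkpoints a soft link to storage_dir unless it exists."""
    link = Path(project_root) / "checkpoints"
    if not link.exists() and not link.is_symlink():
        link.symlink_to(Path(storage_dir).resolve(), target_is_directory=True)
    return link


def missing_checkpoints(project_root, wanted=None):
    """Return {number: filename} for the wanted checkpoints not found on disk."""
    ckpt_dir = Path(project_root) / "checkpoints"
    numbers = CLIP_CHECKPOINTS.keys() if wanted is None else wanted
    return {
        n: CLIP_CHECKPOINTS[n]
        for n in numbers
        if not (ckpt_dir / CLIP_CHECKPOINTS[n]).is_file()
    }
```

For example, `missing_checkpoints(".", wanted=[2])` checks only the checkpoint needed for detector #2.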
Detectors
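The detectors on OV-COCO are summarized as follows. In the Proposals column, + means that region proposals were used for the CLIP refinement and - means that only image patches were used (cf. the _proposals and _patches checkpoint names):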
| # | Backbone | CLIP Refinement | Proposals | AP50 (novel) | Config | Checkpoint |
|---|---|---|---|---|---|---|
| 1 | ViT-B/16 | CLIPSelf | - | 33.6 | cfg | model |
| 2 | ViT-B/16 | CLIPSelf | + | 37.6 | cfg | model |
| 3 | ViT-B/16 | RegionCLIP | + | 34.4 | cfg | model |
| 4 | ViT-L/14 | CLIPSelf | - | 38.4 | cfg | model |
| 5 | ViT-L/14 | CLIPSelf | + | 44.3 | cfg | model |
| 6 | ViT-L/14 | RegionCLIP | + | 38.7 | cfg | model |
The detectors on OV-LVIS are summarized as follows:
| # | Backbone | CLIP Refinement | Proposals | mAPr | Config | Checkpoint |
|---|---|---|---|---|---|---|
| 7 | ViT-B/16 | CLIPSelf | - | 25.3 | cfg | model |
| 8 | ViT-L/14 | CLIPSelf | - | 34.9 | cfg | model |
Test
We provide the checkpoints of the object detectors on Drive. Organize them as follows:
```
CLIPSelf/F-ViT/
├── checkpoints          # use soft link to save storage on the disk
    ├── fvit_eva_vitb16_ovcoco_clipself_patches.pth      # 1
    ├── fvit_eva_vitb16_ovcoco_clipself_proposals.pth    # 2
    ├── fvit_eva_vitb16_ovcoco_regionclip.pth            # 3
    ├── fvit_eva_vitl14_ovcoco_clipself_patches.pth      # 4
    ├── fvit_eva_vitl14_ovcoco_clipself_proposals.pth    # 5
    ├── fvit_eva_vitl14_ovcoco_regionclip.pth            # 6
    ├── fvit_eva_vitb16_ovlvis_clipself_patches.pth      # 7
    ├── fvit_eva_vitl14_ovlvis_clipself_patches.pth      # 8
```
An example of evaluation on OV-COCO with 8 GPUs:

```bash
bash dist_test.sh configs/ov_coco/fvit_vitb16_upsample_fpn_bs64_3e_ovcoco_eva_clipself_proposals.py \
    checkpoints/fvit_eva_vitb16_ovcoco_clipself_proposals.pth 8 \
    --work-dir your/working/directory --eval bbox
```
An example of evaluation on OV-LVIS with 8 GPUs:

```bash
bash dist_test.sh configs/ov_lvis/fvit_vitl14_upsample_fpn_bs64_4x_ovlvis_eva_clipself_patches.py \
    checkpoints/fvit_eva_vitl14_ovlvis_clipself_patches.pth 8 \
    --work-dir your/working/directory --eval segm
```
Transfer
Both transfer evaluations use the ViT-L/14 detector trained on OV-LVIS (#8). Transfer evaluation on COCO:

```bash
bash dist_test.sh configs/transfer/fvit_vitl14_upsample_fpn_transfer2coco.py \
    checkpoints/fvit_eva_vitl14_ovlvis_clipself_patches.pth 8 \
    --work-dir your/working/directory --eval bbox
```
Transfer evaluation on Objects365v1:

```bash
bash dist_test.sh configs/transfer/fvit_vitl14_upsample_fpn_transfer2objects365v1.py \
    checkpoints/fvit_eva_vitl14_ovlvis_clipself_patches.pth 8 \
    --work-dir your/working/directory --eval bbox
```
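The evaluation commands in the Test and Transfer sections differ only in config, checkpoint and metric (bbox, or segm for OV-LVIS), so they can be generated from one table. A minimal sketch: `build_test_command` and `eval_commands` are hypothetical helpers, and giving each run its own sub-directory named after the config is an assumed convention that keeps their logs apart, not something the repository requires.

```python
import shlex
from pathlib import Path

# (config, checkpoint, --eval metric) for the evaluation commands above.
JOBS = [
    ("configs/ov_coco/fvit_vitb16_upsample_fpn_bs64_3e_ovcoco_eva_clipself_proposals.py",
     "checkpoints/fvit_eva_vitb16_ovcoco_clipself_proposals.pth", "bbox"),
    ("configs/ov_lvis/fvit_vitl14_upsample_fpn_bs64_4x_ovlvis_eva_clipself_patches.py",
     "checkpoints/fvit_eva_vitl14_ovlvis_clipself_patches.pth", "segm"),
    ("configs/transfer/fvit_vitl14_upsample_fpn_transfer2coco.py",
     "checkpoints/fvit_eva_vitl14_ovlvis_clipself_patches.pth", "bbox"),
    ("configs/transfer/fvit_vitl14_upsample_fpn_transfer2objects365v1.py",
     "checkpoints/fvit_eva_vitl14_ovlvis_clipself_patches.pth", "bbox"),
]


def build_test_command(config, checkpoint, gpus, work_dir, metric):
    """argv for dist_test.sh in the order used above: CONFIG CHECKPOINT GPUS [options]."""
    return ["bash", "dist_test.sh", config, checkpoint, str(gpus),
            "--work-dir", work_dir, "--eval", metric]


def eval_commands(jobs=JOBS, gpus=8, work_root="your/working/directory"):
    """One command per job, each writing to its own sub-directory of work_root."""
    return [
        build_test_command(config, checkpoint, gpus,
                           str(Path(work_root) / Path(config).stem), metric)
        for config, checkpoint, metric in jobs
    ]


if __name__ == "__main__":
    for cmd in eval_commands():
        print(shlex.join(cmd))
```

To launch the runs instead of printing them, pass each list to `subprocess.run(cmd, check=True)` from the F-ViT directory.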
Train
Prepare the CLIPSelf/RegionCLIP checkpoints as described in the CLIPSelf Checkpoints section. An example of training on OV-COCO with 8 GPUs:

```bash
bash dist_train.sh configs/ov_coco/fvit_vitb16_upsample_fpn_bs64_3e_ovcoco_eva_clipself_proposals.py \
    8 --work-dir your/working/directory
```
An example of training on OV-LVIS:

```bash
bash dist_train.sh configs/ov_lvis/fvit_vitl14_upsample_fpn_bs64_4x_ovlvis_eva_clipself_patches.py \
    8 --work-dir your/working/directory
```
To speed up training on OV-LVIS with multiple machines (e.g., 2x8 = 16 GPUs), refer to the MMDetection tutorial on multi-machine training. The config files set `auto_scale_lr = dict(enable=True, base_batch_size=64)`, so the learning rate is scaled automatically according to the total batch size.
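With `auto_scale_lr` enabled, MMDetection 2.x applies the linear scaling rule: the total batch size is the number of GPUs times the per-GPU batch, and the config's learning rate is multiplied by that total divided by `base_batch_size`. The function below is a minimal re-statement of that rule, not MMDetection's own code; the per-GPU batch of 8 (64 images over 8 GPUs, as the bs64 in the config names suggests) and the base learning rate of 1e-4 are placeholder assumptions, not values read from the configs.

```python
def auto_scaled_lr(base_lr, num_gpus, samples_per_gpu, base_batch_size=64):
    """Linear scaling rule: lr * (num_gpus * samples_per_gpu) / base_batch_size."""
    if num_gpus <= 0 or samples_per_gpu <= 0 or base_batch_size <= 0:
        raise ValueError("GPU count, per-GPU batch and base batch size must be positive")
    batch_size = num_gpus * samples_per_gpu
    return (batch_size / base_batch_size) * base_lr


if __name__ == "__main__":
    base_lr = 1e-4  # hypothetical placeholder, not the value used in the configs
    for gpus in (8, 16):
        print(f"{gpus} GPUs x 8 images -> lr {auto_scaled_lr(base_lr, gpus, 8):.1e}")
```

With 2x8 = 16 GPUs and an unchanged per-GPU batch, the total batch grows from 64 to 128 and the learning rate doubles; with the 8-GPU setup used above it stays unchanged.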