Distillation Pyramid for Multimodal Open-Vocabulary Object Detection

January 6, 2026

Installation

Prerequisites

Python 3.11
CUDA 11.8
PyTorch 2.1.0

Environment Setup

Create a conda environment

conda create -n mmdet3 python=3.11 -y
conda activate mmdet3

Install PyTorch

pip install torch==2.1.0+cu118 torchvision==0.16.0+cu118 torchaudio==2.1.0+cu118 --index-url https://download.pytorch.org/whl/cu118

Install MMDetection ecosystem using OpenMIM

pip install -U openmim
mim install mmengine==0.10.5
mim install mmcv==2.1.0
mim install mmdet==3.3.0

Install other dependencies

pip install -r requirements.txt
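Once the steps above have run, it can help to confirm that the pinned versions are the ones actually installed. The following is a minimal sketch, not part of the repository: the helper name `check_versions` is hypothetical, and it assumes that a local build tag such as `+cu118` should be ignored when comparing versions.

```python
from importlib import metadata

# Versions pinned in the installation steps above.
EXPECTED = {
    "torch": "2.1.0",
    "torchvision": "0.16.0",
    "torchaudio": "2.1.0",
    "mmengine": "0.10.5",
    "mmcv": "2.1.0",
    "mmdet": "3.3.0",
}


def check_versions(expected):
    """Return {package: (status, installed_version_or_None)}.

    status is "ok", "mismatch", or "missing".
    """
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = ("missing", None)
            continue
        # Assumption: drop a local build tag such as "+cu118" before comparing.
        base = installed.split("+", 1)[0]
        report[name] = ("ok" if base == wanted else "mismatch", installed)
    return report


if __name__ == "__main__":
    for name, (status, installed) in check_versions(EXPECTED).items():
        print(f"{name:12s} {status:9s} installed={installed}")
```

Run it inside the activated `mmdet3` environment; any line that does not report `ok` points to the install step worth repeating.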

Preparation

Datasets

The expected directory structure for datasets:

data/
├── coco
│   ├── annotations
│   ├── mdetr_annotations
│   ├── train2014
│   ├── train2017
│   └── val2017
├── flickr30k_entities
│   ├── flickr30k_images
│   └── flickr_train_vg7.jsonl
├── gqa
│   ├── gqa_train_vg7.jsonl
│   └── images
├── mmovod
│   ├── merged.json
│   ├── pseudo_list.pth
│   └── samples
├── objects365
│   ├── annotations
│   └── train
├── qwen
│   ├── annotations
│   └── features
├── retrival
│   └── object_detection.json
└── v3det
    ├── annotations
    └── images
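The layout above can be checked before launching a long training job. This is a small sketch under the assumption that `data/` sits in the repository root; `missing_paths` is a hypothetical helper, and the path list is copied from the tree as written (including the `retrival` spelling).

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED_PATHS = [
    "coco/annotations",
    "coco/mdetr_annotations",
    "coco/train2014",
    "coco/train2017",
    "coco/val2017",
    "flickr30k_entities/flickr30k_images",
    "flickr30k_entities/flickr_train_vg7.jsonl",
    "gqa/gqa_train_vg7.jsonl",
    "gqa/images",
    "mmovod/merged.json",
    "mmovod/pseudo_list.pth",
    "mmovod/samples",
    "objects365/annotations",
    "objects365/train",
    "qwen/annotations",
    "qwen/features",
    "retrival/object_detection.json",
    "v3det/annotations",
    "v3det/images",
]


def missing_paths(root, expected=EXPECTED_PATHS):
    """Return the entries of `expected` that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths("data")
    if missing:
        print("Missing entries under data/:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected dataset paths are present.")
```

The check only tests for existence; it does not validate the contents of any annotation file or image folder.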

Dataset Annotations

All required annotations have been uploaded to Google Drive. After downloading, extract and place them in the corresponding directories as shown in the structure above.

Pretrained Models

Download the pretrained MM-Grounding-DINO models:

mm_grounding_dino/
├── grounding_dino_swin-b_pretrain_obj365_goldg_v3de-f83eef00.pth
├── grounding_dino_swin-l_pretrain_obj365_goldg-34dcdc53.pth
└── grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth

You can download them using:

wget https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth
wget https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-b_pretrain_obj365_goldg_v3det/grounding_dino_swin-b_pretrain_obj365_goldg_v3de-f83eef00.pth
wget https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-l_pretrain_obj365_goldg/grounding_dino_swin-l_pretrain_obj365_goldg-34dcdc53.pth
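Large checkpoint downloads are occasionally truncated, so a quick integrity check may be worthwhile. The sketch below assumes the common OpenMMLab publishing convention in which the eight hex characters before `.pth` are the first eight characters of the file's SHA-256 digest; that convention is an assumption here, not something this README states, and `suffix_matches` is a hypothetical helper.

```python
import hashlib
import re
from pathlib import Path


def sha256_prefix(path, length=8, chunk_size=1 << 20):
    """Return the first `length` hex characters of the file's SHA-256."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so multi-gigabyte checkpoints do not fill memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:length]


def suffix_matches(path):
    """True if the '-xxxxxxxx.pth' suffix equals the file's SHA-256 prefix.

    Assumption: the suffix follows the SHA-256-prefix naming convention.
    """
    match = re.search(r"-([0-9a-f]{8})\.pth$", Path(path).name)
    if match is None:
        raise ValueError(f"no 8-hex-digit suffix in {path}")
    return sha256_prefix(path) == match.group(1)


if __name__ == "__main__":
    for ckpt in sorted(Path("mm_grounding_dino").glob("*.pth")):
        print(ckpt.name, "OK" if suffix_matches(ckpt) else "MISMATCH")
```

A `MISMATCH` most likely means an incomplete download, but it could also mean that a given file does not follow the assumed naming convention.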

OADP Features

Distillation requires features that are extracted offline in advance. Use the code in the main branch for the extraction. The specific command is:

Training

Hardware Requirements

All experiments were conducted on 8× NVIDIA RTX 4090 (24 GB) GPUs.

Text-based Model Training

Train the text-based distillation model using EVA-CLIP features:

PYTHONPATH=$(pwd):$PYTHONPATH torchrun --nproc_per_node=8 tools/train.py \
    configs/ov_distill_shortest_edge.py \
    --work-dir work_dirs/ov_distill_0.025_0.25_0.025_eva-clip_shortest_edge \
    --cfg-options \
        model.obj_loss_weight=0.025 \
        model.block_loss_weight=0.25 \
        model.global_loss_weight=0.025 \
        load_from=mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
    --resume \
    --launcher pytorch
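The command above encodes the three loss weights twice: once in `--cfg-options` and once in the work-directory name. When sweeping weights, generating both from a single source avoids drift between them. The following is an illustrative sketch only; `text_distill_command` is a hypothetical helper, and the naming pattern is inferred from the single example above.

```python
import shlex

CHECKPOINT = (
    "mm_grounding_dino/"
    "grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth"
)


def text_distill_command(obj_w, block_w, global_w, gpus=8):
    """Build (work_dir, argv) for the text-based training run.

    Weights are passed as strings so they appear verbatim in the
    work-directory name and in the config overrides.
    """
    # Assumption: naming pattern inferred from the example command above.
    work_dir = (
        f"work_dirs/ov_distill_{obj_w}_{block_w}_{global_w}"
        "_eva-clip_shortest_edge"
    )
    argv = [
        "torchrun", f"--nproc_per_node={gpus}", "tools/train.py",
        "configs/ov_distill_shortest_edge.py",
        "--work-dir", work_dir,
        "--cfg-options",
        f"model.obj_loss_weight={obj_w}",
        f"model.block_loss_weight={block_w}",
        f"model.global_loss_weight={global_w}",
        f"load_from={CHECKPOINT}",
        "--resume",
        "--launcher", "pytorch",
    ]
    return work_dir, argv


if __name__ == "__main__":
    work_dir, argv = text_distill_command("0.025", "0.25", "0.025")
    # PYTHONPATH still has to be set in the shell, as in the command above.
    print(shlex.join(argv))
```

The printed line can be pasted into a shell after the `PYTHONPATH=$(pwd):$PYTHONPATH` prefix, or `argv` can be handed to `subprocess.run` with a suitable `env`.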

Image-based Model Training

Train the image-based distillation model using LLM-extracted features:

PYTHONPATH=$(pwd):$PYTHONPATH torchrun --nproc_per_node=8 tools/train.py \
    configs/fs_llm_features_distill.py \
    --work-dir work_dirs/fs_distill_0.03_0.8_0.8 \
    --cfg-options \
        model.w_distill=0.03 \
        model.w_global=0.8 \
        model.w_structure=0.8 \
        load_from=mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
    --resume \
    --launcher pytorch

Evaluation

Text-based Model Evaluation

Evaluate the text-based distillation model on LVIS open-vocabulary detection:

PYTHONPATH=$(pwd):$PYTHONPATH torchrun --nproc_per_node=8 tools/test.py \
    configs/evaluation/lvis_val_ov.py \
    work_dirs/ov_distill_0.025_0.25_0.025_eva-clip_shortest_edge/iter_150000.pth \
    --work-dir work_dirs/ov_distill_0.025_0.25_0.025_eva-clip_shortest_edge/150000 \
    --launcher pytorch

Image-based Model Evaluation

Evaluate the image-based distillation model on LVIS validation:

PYTHONPATH=$(pwd):$PYTHONPATH torchrun --nproc_per_node=8 tools/test.py \
    configs/evaluation/lvis_val.py \
    work_dirs/fs_distill_0.03_0.8_0.8/iter_16000.pth \
    --work-dir work_dirs/fs_distill_0.03_0.8_0.8/iter_16000 \
    --launcher pytorch
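Both evaluation commands take a specific `iter_<N>.pth` checkpoint. To see which iterations a work directory actually contains, a short listing helper can be useful. This is a sketch that assumes checkpoints follow the `iter_<N>.pth` naming seen in the commands above; `list_checkpoints` is a hypothetical name.

```python
import re
from pathlib import Path

_ITER_RE = re.compile(r"^iter_(\d+)\.pth$")


def list_checkpoints(work_dir):
    """Return [(iteration, path), ...] sorted by iteration number."""
    found = []
    for path in Path(work_dir).glob("iter_*.pth"):
        match = _ITER_RE.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    # Sort numerically: a plain string sort would put iter_16000 before iter_2000.
    return sorted(found, key=lambda item: item[0])


if __name__ == "__main__":
    for iteration, path in list_checkpoints("work_dirs/fs_distill_0.03_0.8_0.8"):
        print(iteration, path)
```

The last entry of the returned list is the most recent checkpoint, which is often, but not necessarily, the one to evaluate.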