MM Grounding DINO

February 5, 2024

An Open and Comprehensive Pipeline for Unified Object Grounding and Detection

Abstract

Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC). Its effectiveness has led to its widespread adoption as a mainstream architecture for various downstream applications. However, despite its significance, the original Grounding-DINO model lacks comprehensive public technical details due to the unavailability of its training code. To bridge this gap, we present MM-Grounding-DINO, an open-source, comprehensive, and user-friendly baseline, which is built with the MMDetection toolbox. It adopts abundant vision datasets for pre-training and various detection and grounding datasets for fine-tuning. We give a comprehensive analysis of each reported result and detailed settings for reproduction. The extensive experiments on the benchmarks mentioned demonstrate that our MM-Grounding-DINO-Tiny outperforms the Grounding-DINO-Tiny baseline. We release all our models to the research community.

Dataset Preparation

Please refer to dataset_prepare.md or its Chinese version (中文版数据准备).

✨ What's New

💎 We have released the pre-trained weights for Swin-B and Swin-L. You are welcome to try them and give feedback.

Usage

Please refer to usage.md or its Chinese version (中文版用法说明).

Zero-Shot COCO Results and Models

| Model | Backbone | Style | COCO mAP | Pre-Train Data | Config | Download |
| --- | --- | --- | --- | --- | --- | --- |
| GDINO-T | Swin-T | Zero-shot | 46.7 | O365 | | |
| GDINO-T | Swin-T | Zero-shot | 48.1 | O365,GoldG | | |
| GDINO-T | Swin-T | Zero-shot | 48.4 | O365,GoldG,Cap4M | config | model |
| MM-GDINO-T | Swin-T | Zero-shot | 48.5(+1.8) | O365 | config | |
| MM-GDINO-T | Swin-T | Zero-shot | 50.4(+2.3) | O365,GoldG | config | model / log |
| MM-GDINO-T | Swin-T | Zero-shot | 50.5(+2.1) | O365,GoldG,GRIT | config | model / log |
| MM-GDINO-T | Swin-T | Zero-shot | 50.6(+2.2) | O365,GoldG,V3Det | config | model / log |
| MM-GDINO-T | Swin-T | Zero-shot | 50.4(+2.0) | O365,GoldG,GRIT,V3Det | config | model / log |
| MM-GDINO-B | Swin-B | Zero-shot | 52.5 | O365,GoldG,V3Det | config | model / log |
| MM-GDINO-B* | Swin-B | - | 59.5 | O365,ALL | config | model / log |
| MM-GDINO-L | Swin-L | Zero-shot | 53.0 | O365V2,OpenImageV6,GoldG | config | model / log |
| MM-GDINO-L* | Swin-L | - | 60.3 | O365V2,OpenImageV6,ALL | config | model / log |
  • An asterisk (*) indicates that the model has not been fully trained yet. The final weights will be released in the future.
  • ALL: GoldG,V3Det,COCO2017,LVISV1,COCO2014,GRIT,RefCOCO,RefCOCO+,RefCOCOg,gRefCOCO.

Zero-Shot LVIS Results

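| Model | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP | Val1.0 APr | Val1.0 APc | Val1.0 APf | Val1.0 AP | Pre-Train Data |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GDINO-T | 18.8 | 24.2 | 34.7 | 28.8 | 10.1 | 15.3 | 29.9 | 20.1 | O365,GoldG,Cap4M |
| MM-GDINO-T | 28.1 | 30.2 | 42.0 | 35.7(+6.9) | 17.1 | 22.4 | 36.5 | 27.0(+6.9) | O365,GoldG |
| MM-GDINO-T | 26.6 | 32.4 | 41.8 | 36.5(+7.7) | 17.3 | 22.6 | 36.4 | 27.1(+7.0) | O365,GoldG,GRIT |
| MM-GDINO-T | 33.0 | 36.0 | 45.9 | 40.5(+11.7) | 21.5 | 25.5 | 40.2 | 30.6(+10.5) | O365,GoldG,V3Det |
| MM-GDINO-T | 34.2 | 37.4 | 46.2 | 41.4(+12.6) | 23.6 | 27.6 | 40.5 | 31.9(+11.8) | O365,GoldG,GRIT,V3Det |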

Zero-Shot ODinW (Object Detection in the Wild) Results

Results and models of ODinW13

| Method | GDINO-T (O365,GoldG,Cap4M) | MM-GDINO-T (O365,GoldG) | MM-GDINO-T (O365,GoldG,GRIT) | MM-GDINO-T (O365,GoldG,V3Det) | MM-GDINO-T (O365,GoldG,GRIT,V3Det) |
| --- | --- | --- | --- | --- | --- |
| AerialMaritimeDrone | 0.173 | 0.133 | 0.155 | 0.177 | 0.151 |
| Aquarium | 0.195 | 0.252 | 0.261 | 0.266 | 0.283 |
| CottontailRabbits | 0.799 | 0.771 | 0.810 | 0.778 | 0.786 |
| EgoHands | 0.608 | 0.499 | 0.537 | 0.506 | 0.519 |
| NorthAmericaMushrooms | 0.507 | 0.331 | 0.462 | 0.669 | 0.767 |
| Packages | 0.687 | 0.707 | 0.687 | 0.710 | 0.706 |
| PascalVOC | 0.563 | 0.565 | 0.580 | 0.556 | 0.566 |
| pistols | 0.726 | 0.585 | 0.709 | 0.671 | 0.729 |
| pothole | 0.215 | 0.136 | 0.285 | 0.199 | 0.243 |
| Raccoon | 0.549 | 0.469 | 0.511 | 0.553 | 0.535 |
| ShellfishOpenImages | 0.393 | 0.321 | 0.437 | 0.519 | 0.488 |
| thermalDogsAndPeople | 0.657 | 0.556 | 0.603 | 0.493 | 0.542 |
| VehiclesOpenImages | 0.613 | 0.566 | 0.603 | 0.614 | 0.615 |
| Average | 0.514 | 0.453 | 0.511 | 0.516 | 0.533 |
  • The MM-GDINO-T config file is odinw13
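
As a quick sanity check on how the Average row relates to the per-dataset rows, the sketch below recomputes it for the GDINO-T (O365,GoldG,Cap4M) column. It assumes the Average is a plain unweighted mean over the 13 datasets, which is consistent with the numbers in that column; the variable and function names are illustrative and not part of MMDetection.

```python
# Per-dataset scores copied from the GDINO-T (O365,GoldG,Cap4M) column
# of the ODinW13 table above.
gdino_t_odinw13 = {
    "AerialMaritimeDrone": 0.173,
    "Aquarium": 0.195,
    "CottontailRabbits": 0.799,
    "EgoHands": 0.608,
    "NorthAmericaMushrooms": 0.507,
    "Packages": 0.687,
    "PascalVOC": 0.563,
    "pistols": 0.726,
    "pothole": 0.215,
    "Raccoon": 0.549,
    "ShellfishOpenImages": 0.393,
    "thermalDogsAndPeople": 0.657,
    "VehiclesOpenImages": 0.613,
}


def unweighted_mean(scores: dict) -> float:
    """Average the per-dataset scores, giving each dataset equal weight."""
    return sum(scores.values()) / len(scores)


average = unweighted_mean(gdino_t_odinw13)
print(f"{average:.3f}")  # prints 0.514, matching the Average row
```

Because each dataset counts equally regardless of its size, a large swing on one small dataset (for example NorthAmericaMushrooms) moves the average as much as the same swing on a large one.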

Results and models of ODinW35

| Method | GDINO-T (O365,GoldG,Cap4M) | MM-GDINO-T (O365,GoldG) | MM-GDINO-T (O365,GoldG,GRIT) | MM-GDINO-T (O365,GoldG,V3Det) | MM-GDINO-T (O365,GoldG,GRIT,V3Det) |
| --- | --- | --- | --- | --- | --- |
| AerialMaritimeDrone_large | 0.173 | 0.133 | 0.155 | 0.177 | 0.151 |
| AerialMaritimeDrone_tiled | 0.206 | 0.170 | 0.225 | 0.184 | 0.206 |
| AmericanSignLanguageLetters | 0.002 | 0.016 | 0.020 | 0.011 | 0.007 |
| Aquarium | 0.195 | 0.252 | 0.261 | 0.266 | 0.283 |
| BCCD | 0.161 | 0.069 | 0.118 | 0.083 | 0.077 |
| boggleBoards | 0.000 | 0.002 | 0.001 | 0.001 | 0.002 |
| brackishUnderwater | 0.021 | 0.033 | 0.021 | 0.025 | 0.025 |
| ChessPieces | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| CottontailRabbits | 0.806 | 0.771 | 0.810 | 0.778 | 0.786 |
| dice | 0.004 | 0.002 | 0.005 | 0.001 | 0.001 |
| DroneControl | 0.042 | 0.047 | 0.097 | 0.088 | 0.074 |
| EgoHands_generic | 0.608 | 0.527 | 0.537 | 0.506 | 0.519 |
| EgoHands_specific | 0.002 | 0.001 | 0.005 | 0.007 | 0.003 |
| HardHatWorkers | 0.046 | 0.048 | 0.070 | 0.070 | 0.108 |
| MaskWearing | 0.004 | 0.009 | 0.004 | 0.011 | 0.009 |
| MountainDewCommercial | 0.430 | 0.453 | 0.465 | 0.194 | 0.430 |
| NorthAmericaMushrooms | 0.471 | 0.331 | 0.462 | 0.669 | 0.767 |
| openPoetryVision | 0.000 | 0.001 | 0.000 | 0.000 | 0.000 |
| OxfordPets_by_breed | 0.003 | 0.002 | 0.004 | 0.006 | 0.004 |
| OxfordPets_by_species | 0.011 | 0.019 | 0.016 | 0.020 | 0.015 |
| PKLot | 0.001 | 0.004 | 0.002 | 0.008 | 0.007 |
| Packages | 0.695 | 0.707 | 0.687 | 0.710 | 0.706 |
| PascalVOC | 0.563 | 0.565 | 0.580 | 0.566 | 0.566 |
| pistols | 0.726 | 0.585 | 0.709 | 0.671 | 0.729 |
| plantdoc | 0.005 | 0.005 | 0.007 | 0.008 | 0.011 |
| pothole | 0.215 | 0.136 | 0.219 | 0.077 | 0.168 |
| Raccoons | 0.549 | 0.469 | 0.511 | 0.553 | 0.535 |
| selfdrivingCar | 0.089 | 0.091 | 0.076 | 0.094 | 0.083 |
| ShellfishOpenImages | 0.393 | 0.321 | 0.437 | 0.519 | 0.488 |
| ThermalCheetah | 0.087 | 0.063 | 0.081 | 0.030 | 0.045 |
| thermalDogsAndPeople | 0.657 | 0.556 | 0.603 | 0.493 | 0.543 |
| UnoCards | 0.006 | 0.012 | 0.010 | 0.009 | 0.005 |
| VehiclesOpenImages | 0.613 | 0.566 | 0.603 | 0.614 | 0.615 |
| WildfireSmoke | 0.134 | 0.106 | 0.154 | 0.042 | 0.127 |
| websiteScreenshots | 0.012 | 0.02 | 0.016 | 0.016 | 0.016 |
| Average | 0.227 | 0.202 | 0.228 | 0.214 | 0.284 |
  • The MM-GDINO-T config file is odinw35

Zero-Shot Referring Expression Comprehension Results

| Method | GDINO-T (O365,GoldG,Cap4M) | MM-GDINO-T (O365,GoldG) | MM-GDINO-T (O365,GoldG,GRIT) | MM-GDINO-T (O365,GoldG,V3Det) | MM-GDINO-T (O365,GoldG,GRIT,V3Det) |
| --- | --- | --- | --- | --- | --- |
| RefCOCO val @1,5,10 | 50.8/89.5/94.9 | 53.1/89.9/94.7 | 53.4/90.3/95.5 | 52.1/89.8/95.0 | 53.1/89.7/95.1 |
| RefCOCO testA @1,5,10 | 57.4/91.3/95.6 | 59.7/91.5/95.9 | 58.8/91.7/96.2 | 58.4/86.8/95.6 | 59.1/91.0/95.5 |
| RefCOCO testB @1,5,10 | 45.0/86.5/92.9 | 46.4/86.9/92.2 | 46.8/87.7/93.3 | 45.4/86.2/92.6 | 46.8/87.8/93.6 |
| RefCOCO+ val @1,5,10 | 51.6/86.4/92.6 | 53.1/87.0/92.8 | 53.5/88.0/93.7 | 52.5/86.8/93.2 | 52.7/87.7/93.5 |
| RefCOCO+ testA @1,5,10 | 57.3/86.7/92.7 | 58.9/87.3/92.9 | 59.0/88.1/93.7 | 58.1/86.7/93.5 | 58.7/87.2/93.1 |
| RefCOCO+ testB @1,5,10 | 46.4/84.1/90.7 | 47.9/84.3/91.0 | 47.9/85.5/92.7 | 46.9/83.7/91.5 | 48.4/85.8/92.1 |
| RefCOCOg val @1,5,10 | 60.4/92.1/96.2 | 61.2/92.6/96.1 | 62.7/93.3/97.0 | 61.7/92.9/96.6 | 62.9/93.3/97.2 |
| RefCOCOg test @1,5,10 | 59.7/92.1/96.3 | 61.1/93.3/96.7 | 62.6/94.9/97.1 | 61.0/93.1/96.8 | 62.9/93.9/97.4 |

| Method | thresh_score | GDINO-T (O365,GoldG,Cap4M) | MM-GDINO-T (O365,GoldG) | MM-GDINO-T (O365,GoldG,GRIT) | MM-GDINO-T (O365,GoldG,V3Det) | MM-GDINO-T (O365,GoldG,GRIT,V3Det) |
| --- | --- | --- | --- | --- | --- | --- |
| gRefCOCO val Pr@(F1=1, IoU≥0.5),N-acc | 0.5 | 39.3/70.4 | | | | 39.4/67.5 |
| gRefCOCO val Pr@(F1=1, IoU≥0.5),N-acc | 0.6 | 40.5/83.8 | | | | 40.6/83.1 |
| gRefCOCO val Pr@(F1=1, IoU≥0.5),N-acc | 0.7 | 41.3/91.8 | 39.8/84.7 | 40.7/89.7 | 40.3/88.8 | 41.0/91.3 |
| gRefCOCO val Pr@(F1=1, IoU≥0.5),N-acc | 0.8 | 41.5/96.8 | | | | 41.1/96.4 |
| gRefCOCO testA Pr@(F1=1, IoU≥0.5),N-acc | 0.5 | 31.9/70.4 | | | | 33.1/69.5 |
| gRefCOCO testA Pr@(F1=1, IoU≥0.5),N-acc | 0.6 | 29.3/82.9 | | | | 29.2/84.3 |
| gRefCOCO testA Pr@(F1=1, IoU≥0.5),N-acc | 0.7 | 27.2/90.2 | 26.3/89.0 | 26.0/91.9 | 25.4/91.8 | 26.1/93.0 |
| gRefCOCO testA Pr@(F1=1, IoU≥0.5),N-acc | 0.8 | 25.1/96.3 | | | | 23.8/97.2 |
| gRefCOCO testB Pr@(F1=1, IoU≥0.5),N-acc | 0.5 | 30.9/72.5 | | | | 33.0/69.6 |
| gRefCOCO testB Pr@(F1=1, IoU≥0.5),N-acc | 0.6 | 30.0/86.1 | | | | 31.6/96.7 |
| gRefCOCO testB Pr@(F1=1, IoU≥0.5),N-acc | 0.7 | 29.7/93.5 | 31.3/84.8 | 30.6/90.2 | 30.7/89.9 | 30.4/92.3 |
| gRefCOCO testB Pr@(F1=1, IoU≥0.5),N-acc | 0.8 | 29.1/97.4 | | | | 29.5/84.2 |
  • The MM-GDINO-T config file is here

Zero-Shot Description Detection Dataset (DOD)

pip install ddd-dataset

| Method | mode | GDINO-T (O365,GoldG,Cap4M) | MM-GDINO-T (O365,GoldG) | MM-GDINO-T (O365,GoldG,GRIT) | MM-GDINO-T (O365,GoldG,V3Det) | MM-GDINO-T (O365,GoldG,GRIT,V3Det) |
| --- | --- | --- | --- | --- | --- | --- |
| FULL/short/middle/long/very long | concat | 17.2/18.0/18.7/14.8/16.3 | 15.6/17.3/16.7/14.3/13.1 | 17.0/17.7/18.0/15.7/15.7 | 16.2/17.4/16.8/14.9/15.4 | 17.5/23.4/18.3/14.7/13.8 |
| FULL/short/middle/long/very long | parallel | 22.3/28.2/24.8/19.1/13.9 | 21.7/24.7/24.0/20.2/13.7 | 22.5/25.6/25.1/20.5/14.9 | 22.3/25.6/24.5/20.6/14.7 | 22.9/28.1/25.4/20.4/14.4 |
| PRES/short/middle/long/very long | concat | 17.8/18.3/19.2/15.2/17.3 | 16.4/18.4/17.3/14.5/14.2 | 17.9/19.0/18.3/16.5/17.5 | 16.6/18.8/17.1/15.1/15.0 | 18.0/23.7/18.6/15.4/13.3 |
| PRES/short/middle/long/very long | parallel | 21.0/27.0/22.8/17.5/12.5 | 21.3/25.5/22.8/19.2/12.9 | 21.5/25.2/23.0/19.0/15.0 | 21.6/25.7/23.0/19.5/14.8 | 21.9/27.4/23.2/19.1/14.2 |
| ABS/short/middle/long/very long | concat | 15.4/17.1/16.4/13.6/14.9 | 13.4/13.4/14.5/13.5/11.9 | 14.5/13.1/16.7/13.6/13.3 | 14.8/12.5/15.6/14.3/15.8 | 15.9/22.2/17.1/12.5/14.4 |
| ABS/short/middle/long/very long | parallel | 26.0/32.0/33.0/23.6/15.5 | 22.8/22.2/28.7/22.9/14.7 | 25.6/26.8/33.9/24.5/14.7 | 24.1/24.9/30.7/23.8/14.7 | 26.0/30.3/34.1/23.9/14.6 |

Note:

  1. Inter-scenario evaluation is temporarily not supported because it takes very long and performance is low. The metrics above are for Intra-scenario.
  2. concat is the default inference mode for Grounding DINO: it concatenates multiple sub-sentences with "." into a single sentence for inference. parallel instead runs inference on each sub-sentence separately in a for loop.
  3. The MM-GDINO-T config files are concat_dod and parallel_dod.
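
The difference between the two inference modes in note 2 can be sketched as follows. This is a minimal illustration, not the MMDetection implementation: `detect` is a hypothetical stand-in for a model call, `fake_detect` only echoes its prompt, and the exact separator spacing (" . ") is an assumption.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def build_concat_prompt(sub_sentences: List[str], sep: str = " . ") -> str:
    """Join sub-sentences into one prompt separated by '.' (spacing is assumed)."""
    cleaned = [s.strip().rstrip(".").strip() for s in sub_sentences if s.strip()]
    if not cleaned:
        return ""
    return sep.join(cleaned) + " ."


def run_concat(sub_sentences: List[str], detect: Callable[[str], T]) -> T:
    """concat mode: one model call on the single concatenated sentence."""
    return detect(build_concat_prompt(sub_sentences))


def run_parallel(sub_sentences: List[str], detect: Callable[[str], T]) -> List[T]:
    """parallel mode: one model call per sub-sentence, in a for loop."""
    results = []
    for sentence in sub_sentences:
        results.append(detect(sentence))
    return results


def fake_detect(prompt: str) -> str:
    """Hypothetical detector used only for illustration; echoes the prompt."""
    return f"<detections for: {prompt}>"


sub_sentences = ["a red car", "a dog on grass."]
print(run_concat(sub_sentences, fake_detect))
print(run_parallel(sub_sentences, fake_detect))
```

concat needs a single forward pass but lets the sub-sentences share one text context, while parallel costs one forward pass per sub-sentence and keeps each description isolated.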

Pretrain Flickr30k Results

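| Model | Pre-Train Data | Val R@1 | Val R@5 | Val R@10 | Test R@1 | Test R@5 | Test R@10 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GLIP-T | O365,GoldG | 84.9 | 94.9 | 96.3 | 85.6 | 95.4 | 96.7 |
| GLIP-T | O365,GoldG,CC3M,SBU | 85.3 | 95.5 | 96.9 | 86.0 | 95.9 | 97.2 |
| GDINO-T | O365,GoldG,Cap4M | 87.8 | 96.6 | 98.0 | 88.1 | 96.9 | 98.2 |
| MM-GDINO-T | O365,GoldG | 85.5 | 95.6 | 97.2 | 86.2 | 95.7 | 97.4 |
| MM-GDINO-T | O365,GoldG,GRIT | 86.7 | 95.8 | 97.6 | 87.0 | 96.2 | 97.7 |
| MM-GDINO-T | O365,GoldG,V3Det | 85.9 | 95.7 | 97.4 | 86.3 | 95.7 | 97.4 |
| MM-GDINO-T | O365,GoldG,GRIT,V3Det | 86.7 | 96.0 | 97.6 | 87.2 | 96.2 | 97.7 |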

Note:

  1. @1,5,10 refers to precision at the top 1, 5, and 10 positions in a predicted ranked list.
  2. The MM-GDINO-T config file is here
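
To make the @1, @5, @10 columns concrete, the sketch below scores a phrase as a hit at k when any of its top-k ranked boxes overlaps the ground-truth box. It is a simplified illustration rather than the evaluator used for the table: the IoU threshold of 0.5, the one-box-per-phrase ground truth, and all function names are assumptions.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def hit_at_k(ranked_boxes: Sequence[Box], gt_box: Box, k: int,
             iou_thr: float = 0.5) -> bool:
    """True if any of the k highest-ranked boxes matches the ground truth."""
    return any(box_iou(box, gt_box) >= iou_thr for box in ranked_boxes[:k])


def score_at_k(samples: List[Tuple[Sequence[Box], Box]], k: int,
               iou_thr: float = 0.5) -> float:
    """Fraction of phrases with a hit among their top-k predictions."""
    hits = sum(hit_at_k(ranked, gt, k, iou_thr) for ranked, gt in samples)
    return hits / len(samples)


# Two toy phrases: predictions are sorted by descending confidence.
samples = [
    ([(0, 0, 10, 10), (50, 50, 60, 60)], (0, 0, 10, 10)),          # correct at rank 1
    ([(100, 100, 110, 110), (20, 20, 30, 30)], (20, 20, 30, 30)),  # correct at rank 2
]
print(score_at_k(samples, 1), score_at_k(samples, 5))  # prints 0.5 1.0
```

The score can only stay flat or rise as k grows, which is why the @5 and @10 columns are always at least as high as @1.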

Validating the generalization of a pre-trained model through fine-tuning

RTTS

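| Architecture | Backbone | Lr schd | box AP |
| --- | --- | --- | --- |
| Faster R-CNN | R-50 | 1x | 48.1 |
| Cascade R-CNN | R-50 | 1x | 50.8 |
| ATSS | R-50 | 1x | 48.2 |
| TOOD | R-50 | 1x | 50.8 |
| MM-GDINO(zero-shot) | Swin-T | | 49.8 |
| MM-GDINO | Swin-T | 1x | 69.1 |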

RUOD

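| Architecture | Backbone | Lr schd | box AP |
| --- | --- | --- | --- |
| Faster R-CNN | R-50 | 1x | 52.4 |
| Cascade R-CNN | R-50 | 1x | 55.3 |
| ATSS | R-50 | 1x | 55.7 |
| TOOD | R-50 | 1x | 57.4 |
| MM-GDINO(zero-shot) | Swin-T | | 29.8 |
| MM-GDINO | Swin-T | 1x | 65.5 |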

Brain Tumor

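| Architecture | Backbone | Lr schd | box AP |
| --- | --- | --- | --- |
| Faster R-CNN | R-50 | 50e | 43.5 |
| Cascade R-CNN | R-50 | 50e | 46.2 |
| DINO | R-50 | 50e | 46.4 |
| Cascade-DINO | R-50 | 50e | 48.6 |
| MM-GDINO | Swin-T | 50e | 47.5 |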

Cityscapes

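| Architecture | Backbone | Lr schd | box AP |
| --- | --- | --- | --- |
| Faster R-CNN | R-50 | 50e | 30.1 |
| Cascade R-CNN | R-50 | 50e | 31.8 |
| DINO | R-50 | 50e | 34.5 |
| Cascade-DINO | R-50 | 50e | 34.8 |
| MM-GDINO(zero-shot) | Swin-T | | 34.2 |
| MM-GDINO | Swin-T | 50e | 51.5 |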

People in Painting

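| Architecture | Backbone | Lr schd | box AP |
| --- | --- | --- | --- |
| Faster R-CNN | R-50 | 50e | 17.0 |
| Cascade R-CNN | R-50 | 50e | 18.0 |
| DINO | R-50 | 50e | 12.0 |
| Cascade-DINO | R-50 | 50e | 13.4 |
| MM-GDINO(zero-shot) | Swin-T | | 23.1 |
| MM-GDINO | Swin-T | 50e | 38.9 |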

COCO

(1) Closed-set performance

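| Architecture | Backbone | Lr schd | box AP |
| --- | --- | --- | --- |
| Faster R-CNN | R-50 | 1x | 37.4 |
| Cascade R-CNN | R-50 | 1x | 40.3 |
| ATSS | R-50 | 1x | 39.4 |
| TOOD | R-50 | 1x | 42.4 |
| DINO | R-50 | 1x | 50.1 |
| GLIP(zero-shot) | Swin-T | | 46.6 |
| GDINO(zero-shot) | Swin-T | | 48.5 |
| MM-GDINO(zero-shot) | Swin-T | | 50.4 |
| GLIP | Swin-T | 1x | 55.4 |
| GDINO | Swin-T | 1x | 58.1 |
| MM-GDINO | Swin-T | 1x | 58.2 |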
  • The MM-GDINO-T config file is here

(2) Open-set continuing pretraining performance

| Architecture | Backbone | Lr schd | box AP |
| --- | --- | --- | --- |
| GLIP(zero-shot) | Swin-T | | 46.7 |
| GDINO(zero-shot) | Swin-T | | 48.5 |
| MM-GDINO(zero-shot) | Swin-T | | 50.4 |
| MM-GDINO | Swin-T | 1x | 54.7 |
  • The MM-GDINO-T config file is here
  • Because the COCO dataset is small, continuing pretraining solely on COCO can easily lead to overfitting. The results shown above are from the third epoch. We do not recommend training with this approach.

(3) Open vocabulary performance

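| Architecture | Backbone | Lr schd | box AP | Base box AP | Novel box AP | box AP@50 | Base box AP@50 | Novel box AP@50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MM-GDINO(zero-shot) | Swin-T | | 51.1 | 48.4 | 58.9 | 66.7 | 64.0 | 74.2 |
| MM-GDINO | Swin-T | 1x | 57.2 | 56.1 | 60.4 | 73.6 | 73.0 | 75.3 |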
  • The MM-GDINO-T config file is here

LVIS 1.0

(1) Open-set continuing pretraining performance

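| Architecture | Backbone | Lr schd | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP | Val1.0 APr | Val1.0 APc | Val1.0 APf | Val1.0 AP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GLIP(zero-shot) | Swin-T | | 18.1 | 21.2 | 33.1 | 26.7 | 10.8 | 14.7 | 29.0 | 19.6 |
| GDINO(zero-shot) | Swin-T | | 18.8 | 24.2 | 34.7 | 28.8 | 10.1 | 15.3 | 29.9 | 20.1 |
| MM-GDINO(zero-shot) | Swin-T | | 34.2 | 37.4 | 46.2 | 41.4 | 23.6 | 27.6 | 40.5 | 31.9 |
| MM-GDINO | Swin-T | 1x | 50.7 | 58.8 | 60.1 | 58.7 | 45.2 | 50.2 | 56.1 | 51.7 |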
  • The MM-GDINO-T config file is here

(2) Open vocabulary performance

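| Architecture | Backbone | Lr schd | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP |
| --- | --- | --- | --- | --- | --- | --- |
| MM-GDINO(zero-shot) | Swin-T | | 34.2 | 37.4 | 46.2 | 41.4 |
| MM-GDINO | Swin-T | 1x | 43.2 | 57.4 | 59.3 | 57.1 |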
  • The MM-GDINO-T config file is here

RefEXP

RefCOCO

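| Architecture | Backbone | Lr schd | val @1 | val @5 | val @10 | testA @1 | testA @5 | testA @10 | testB @1 | testB @5 | testB @10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GDINO(zero-shot) | Swin-T | | 50.8 | 89.5 | 94.9 | 57.5 | 91.3 | 95.6 | 45.0 | 86.5 | 92.9 |
| MM-GDINO(zero-shot) | Swin-T | | 53.1 | 89.7 | 95.1 | 59.1 | 91.0 | 95.5 | 46.8 | 87.8 | 93.6 |
| GDINO | Swin-T | UNK | 89.2 | | | 91.9 | | | 86.0 | | |
| MM-GDINO | Swin-T | 5e | 89.5 | 98.6 | 99.4 | 91.4 | 99.2 | 99.8 | 86.6 | 97.9 | 99.1 |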
  • The MM-GDINO-T config file is here

RefCOCO+

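| Architecture | Backbone | Lr schd | val @1 | val @5 | val @10 | testA @1 | testA @5 | testA @10 | testB @1 | testB @5 | testB @10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GDINO(zero-shot) | Swin-T | | 51.6 | 86.4 | 92.6 | 57.3 | 86.7 | 92.7 | 46.4 | 84.1 | 90.7 |
| MM-GDINO(zero-shot) | Swin-T | | 52.7 | 87.7 | 93.5 | 58.7 | 87.2 | 93.1 | 48.4 | 85.8 | 92.1 |
| GDINO | Swin-T | UNK | 81.1 | | | 87.4 | | | 74.7 | | |
| MM-GDINO | Swin-T | 5e | 82.1 | 97.8 | 99.2 | 87.5 | 99.2 | 99.7 | 74.0 | 96.3 | 96.4 |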
  • The MM-GDINO-T config file is here

RefCOCOg

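| Architecture | Backbone | Lr schd | val @1 | val @5 | val @10 | test @1 | test @5 | test @10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GDINO(zero-shot) | Swin-T | | 60.4 | 92.1 | 96.2 | 59.7 | 92.1 | 96.3 |
| MM-GDINO(zero-shot) | Swin-T | | 62.9 | 93.3 | 97.2 | 62.9 | 93.9 | 97.4 |
| GDINO | Swin-T | UNK | 84.2 | | | 84.9 | | |
| MM-GDINO | Swin-T | 5e | 85.5 | 98.4 | 99.4 | 85.8 | 98.6 | 99.4 |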
  • The MM-GDINO-T config file is here

gRefCOCO

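| Architecture | Backbone | Lr schd | val Pr@(F1=1, IoU≥0.5) | val N-acc | testA Pr@(F1=1, IoU≥0.5) | testA N-acc | testB Pr@(F1=1, IoU≥0.5) | testB N-acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GDINO(zero-shot) | Swin-T | | 41.3 | 91.8 | 27.2 | 90.2 | 29.7 | 93.5 |
| MM-GDINO(zero-shot) | Swin-T | | 41.0 | 91.3 | 26.1 | 93.0 | 30.4 | 92.3 |
| MM-GDINO | Swin-T | 5e | 45.1 | 64.7 | 42.5 | 65.5 | 40.3 | 63.2 |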
  • The MM-GDINO-T config file is here

Citation

If you find this project useful in your research, please consider citing:

@article{zhao2024open,
  title={An Open and Comprehensive Pipeline for Unified Object Grounding and Detection},
  author={Zhao, Xiangyu and Chen, Yicheng and Xu, Shilin and Li, Xiangtai and Wang, Xinjiang and Li, Yining and Huang, Haian},
  journal={arXiv preprint arXiv:2401.02361},
  year={2024}
}