Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation

August 12, 2025

This is the repository for the paper "Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation".

Evaluation Radar Results:

Models

Last updated on 2025/04/09

VLM Detection Models

First Type

[CVPR 2022] Grounded Language-Image Pre-training [Paper][Code]
[CVPR 2022] RegionCLIP: Region-based Language-Image Pretraining [Paper][Code]
[ECCV 2022] Open Vocabulary Object Detection with Pseudo Bounding-Box Labels [Paper][Code]
[NeurIPS 2022] DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection [Paper]
[ECCV 2022] Simple Open-Vocabulary Object Detection with Vision Transformers [Paper][Code]
[NeurIPS 2023] Scaling Open-Vocabulary Object Detection [Paper][Code]
[CVPR 2023] DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment [Paper]
[CVPR 2024] DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection [Paper]
[ECCV 2024] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection [Paper][Code]
[CVPR 2024] YOLO-World: Real-Time Open-Vocabulary Object Detection [Paper][Code]
[arXiv] OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion [Paper][Code]

Second Type

[ECCV 2022] Detecting Twenty-thousand Classes using Image-level Supervision [Paper][Code]
[ICLR 2023] Learning Object-Language Alignments for Open-Vocabulary Object Detection [Paper][Code]
[CVPR 2022] Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model [Paper][Code]
[ECCV 2022] Open-Vocabulary DETR with Conditional Matching [Paper][Code]
[ICLR 2022] Open-vocabulary Object Detection via Vision and Language Knowledge Distillation [Paper][Code]
[CVPR 2022] Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation [Paper][Code]
[ECCV 2022] PromptDet: Towards Open-vocabulary Detection using Uncurated Images [Paper][Code]
[CVPR 2023] Aligning Bag of Regions for Open-Vocabulary Object Detection [Paper][Code]
[NeurIPS 2023] CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection [Paper][Code]
[CVPR 2023] CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching [Paper][Code]
[ICCV 2023] Distilling DETR with Visual-Linguistic Knowledge for Open-Vocabulary Object Detection [Paper][Code]
[arXiv] DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection [Paper][Code]
[ICCV 2023] EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment [Paper]
[ICLR 2023] F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models [Paper][Code]
[ICML 2023] Multi-Modal Classifiers for Open-Vocabulary Object Detection [Paper][Code]
[CVPR 2023] Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection [Paper][Code]
[arXiv] Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection [Paper]
[CVPR 2023] Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers [Paper][Code]
[CVPR 2024] Taming Self-Training for Open-Vocabulary Object Detection [Paper][Code]
[ICLR 2024] CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction [Paper][Code]
[WACV 2024] LP-OVOD: Open-Vocabulary Object Detection by Linear Probing [Paper][Code]

VLM Segmentation Models

[ICLR 2022] Language-driven Semantic Segmentation [Paper][Code]
[CVPR 2024] CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation [Paper][Code]
[CVPR 2023] Side Adapter Network for Open-Vocabulary Semantic Segmentation [Paper][Code]
[ECCV 2022] A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model [Paper][Code]
[ICML 2023] Open-Vocabulary Universal Image Segmentation with MaskCLIP [Paper][Code]
[ICCV 2023] Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network [Paper][Code]
[NeurIPS 2023] Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP [Paper][Code]
[NeurIPS 2023] Learning Mask-aware CLIP Representations for Zero-Shot Segmentation [Paper][Code]
[CVPR 2024] SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation [Paper][Code]
[CVPR 2024] Open-Vocabulary Segmentation with Semantic-Assisted Calibration [Paper][Code]
[CVPR 2024] Transferable and Principled Efficiency for Open-Vocabulary Segmentation [Paper][Code]
[CVPR 2024] Open-Vocabulary Semantic Segmentation with Image Embedding Balancing [Paper][Code]
[CVPR 2022] Decoupling Zero-Shot Semantic Segmentation [Paper][Code]
[CVPR 2023] ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation [Paper][Code]
[CVPR 2023] Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation [Paper][Code]
[ICML 2024] Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation [Paper][Code]
[ICML 2024] SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation [Paper][Code]
[CVPR 2023] Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs [Paper][Code]

Datasets

Datasets for Detection

Dataset	Year	Classes	Training	Testing	Evaluation Metric	Project
COCO 2014 Detection	2014	80	83,000	41,000	Box mAP	Project
COCO 2017 Detection	2017	80	118,287	5,000	Box mAP	Project
PASCAL VOC	2012	20	5,717	5,823	Box mAP	Project
LVIS	2019	1203	100,170	19,809	Box mAP	Project
ODinW	2022	314	132,413	20,070	Box mAP	Project
Objects365	2019	365	600,000	38,000	Box mAP	Project
Stanford Dogs	2011	120	12,000	8,580	Box mAP	Project
CUB-200-2011	2011	200	5,994	5,794	Box mAP	Project
Cityscapes	2016	8	2,975	500	Box mAP	Project
Foggy Cityscapes	2018	8	2,975	500	Box mAP	Project
WaterColor	2018	6	1,000	-	Box mAP	Project
Comic	2018	6	1,000	-	Box mAP	Project
KITTI	2012	1	7,481	-	Box mAP	Project
Sim10K	2016	1	10,000	-	Box mAP	Project
VOC-C	2019	20	543,115	553,185	Box mAP	Project
COCO-C	2019	80	11,237,265	475,000	Box mAP	Project
Cityscapes-C	2019	8	282,625	47,500	Box mAP	Project
CrowdHuman	2018	1	15,000	4,370	Box mAP	Project
OCHuman	2019	1	-	2,500	Box mAP	Project
WiderPerson	2019	1	7,891	1,000	Box mAP	Project
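
Every dataset in the table above is scored with Box mAP, which is built on the intersection-over-union (IoU) between a predicted box and a ground-truth box. The following is a minimal sketch of that building block only, not the evaluation code used in the paper. It assumes boxes are given in `(x1, y1, x2, y2)` corner format, and the function name `box_iou` is illustrative.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format (assumed)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) boxes.
    return inter / union if union > 0 else 0.0
```

A full mAP computation additionally matches predictions to ground truth at one or more IoU thresholds and averages precision over recall levels and classes; each benchmark's official toolkit defines the exact protocol.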

Datasets for Segmentation

Dataset	Year	Classes	Training	Testing	Evaluation Metric	Project
COCO-Stuff	2018	172	118k	20k	mIoU	Project
PASCAL VOC 2012	2012	20	1,464	1,449	mIoU	Project
PASCAL Context	2014	459	4,998	5,105	mIoU	Project
Cityscapes	2016	19	2,975	500	mIoU	Project
ADE20k	2017	150	25,574	2,000	mIoU	Project
MESS*	2023	-	-	-	mIoU	Project
PASCAL-Part	2023	116	8,432	851	mIoU	Project
ADE20k-Part-234	2023	234	7,348	1,017	mIoU	Project
PASCAL-5i**	2015	20	-	-	mIoU, FB-IoU	Project
COCO-20i**	2014	80	-	-	mIoU, FB-IoU	Project
FSS-1000	2020	1000	5,200	2,400	mIoU, FB-IoU	Project
OCHuman	2019	1	-	2,231	AP, AP50, AP75	Project
CIS	2023	1	-	459	AP, AP50, AP75	Project
COCO-OCC	2021	80	-	1,005	AP, AP50, AP75	Project
CamVid	2008	11	467	233	mIoU	Project
UAVid	2018	9	200	100	mIoU	Project
UDD6	2018	12	205	45	mIoU	Project

*The benchmark includes a wide range of domain-specific datasets.

**The benchmark has different training and testing sets under various settings.
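
Most segmentation datasets listed above report mIoU, the per-class intersection-over-union averaged over classes. The following is a minimal sketch, not the evaluation code used in the paper. It assumes predictions and labels are flattened sequences of integer class ids, that the label value 255 marks pixels to ignore (a common convention, not stated in this repository), and that classes absent from both prediction and label are skipped in the average. The function name `mean_iou` is illustrative.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes for flat sequences of integer class ids."""
    inter = [0] * num_classes  # pixels where prediction and label agree on class c
    union = [0] * num_classes  # pixels where prediction or label is class c

    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixel: excluded from the metric
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[t] += 1
            if 0 <= p < num_classes:
                union[p] += 1

    # Average only over classes that occur in the prediction or the label.
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0
```

Benchmarks usually accumulate the intersection and union counts over the whole test set before dividing, so passing the concatenated pixels of all images matches that convention more closely than averaging per-image scores.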

BibTeX

If our work is helpful for your research, please consider citing the following BibTeX entry.

@article{feng2025vision,
  title={Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation},
  author={Feng, Yongchao and Liu, Yajie and Yang, Shuai and Cai, Wenrui and Zhang, Jinqing and Zhan, Qiqi and Huang, Ziyue and Yan, Hongxi and Wan, Qiao and Liu, Chenguang and others},
  journal={arXiv preprint arXiv:2504.09480},
  year={2025}
}