Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation

August 12, 2025

This is the repository for the paper "Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation".

Evaluation Radar Results:

Models

Last updated on 2025/04/09

VLM Detection Models

First Type

  • [CVPR 2022] Grounded Language-Image Pre-training [Paper][Code]
  • [CVPR 2022] RegionCLIP: Region-based Language-Image Pretraining [Paper][Code]
  • [ECCV 2022] Open Vocabulary Object Detection with Pseudo Bounding-Box Labels [Paper][Code]
  • [NeurIPS 2022] DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection [Paper]
  • [ECCV 2022] Simple Open-Vocabulary Object Detection with Vision Transformers [Paper][Code]
  • [NeurIPS 2023] Scaling Open-Vocabulary Object Detection [Paper][Code]
  • [CVPR 2023] DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment [Paper]
  • [CVPR 2024] DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection [Paper]
  • [ECCV 2024] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection [Paper][Code]
  • [CVPR 2024] YOLO-World: Real-Time Open-Vocabulary Object Detection [Paper][Code]
  • [arxiv] OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion [Paper][Code]

Second Type

  • [ECCV 2022] Detecting Twenty-thousand Classes using Image-level Supervision [Paper][Code]

  • [ICLR 2023] Learning Object-Language Alignments for Open-Vocabulary Object Detection [Paper][Code]

  • [CVPR 2022] Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model [Paper][Code]

  • [ECCV 2022] Open-Vocabulary DETR with Conditional Matching [Paper][Code]

  • [ICLR 2022] Open-vocabulary Object Detection via Vision and Language Knowledge Distillation [Paper][Code]

  • [CVPR 2022] Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation [Paper][Code]

  • [ECCV 2022] PromptDet: Towards Open-vocabulary Detection using Uncurated Images [Paper][Code]

  • [CVPR 2023] Aligning Bag of Regions for Open-Vocabulary Object Detection [Paper][Code]

  • [NeurIPS 2023] CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection [Paper][Code]

  • [CVPR 2023] CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching [Paper][Code]

  • [ICCV 2023] Distilling DETR with Visual-Linguistic Knowledge for Open-Vocabulary Object Detection [Paper][Code]

  • [arxiv] DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection [Paper][Code]

  • [ICCV 2023] EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment [Paper]

  • [ICLR 2023] F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models [Paper][Code]

  • [ICML 2023] Multi-Modal Classifiers for Open-Vocabulary Object Detection [Paper][Code]

  • [CVPR 2023] Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection [Paper][Code]

  • [arxiv] Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection [Paper]

  • [CVPR 2023] Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers [Paper][Code]

  • [CVPR 2024] Taming Self-Training for Open-Vocabulary Object Detection [Paper][Code]

  • [ICLR 2024] CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction [Paper][Code]

  • [WACV 2024] LP-OVOD: Open-Vocabulary Object Detection by Linear Probing [Paper][Code]

VLM Segmentation Models

  • [ICLR 2022] Language-driven Semantic Segmentation [Paper][Code]
  • [CVPR 2024] CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation [Paper][Code]
  • [CVPR 2023] Side Adapter Network for Open-Vocabulary Semantic Segmentation [Paper][Code]
  • [ECCV 2022] A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model [Paper][Code]
  • [ICML 2023] Open-Vocabulary Universal Image Segmentation with MaskCLIP [Paper][Code]
  • [ICCV 2023] Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network [Paper][Code]
  • [NeurIPS 2023] Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP [Paper][Code]
  • [NeurIPS 2023] Learning Mask-aware CLIP Representations for Zero-Shot Segmentation [Paper][Code]
  • [CVPR 2024] SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation [Paper][Code]
  • [CVPR 2024] Open-Vocabulary Segmentation with Semantic-Assisted Calibration [Paper][Code]
  • [CVPR 2024] Transferable and Principled Efficiency for Open-Vocabulary Segmentation [Paper][Code]
  • [CVPR 2024] Open-Vocabulary Semantic Segmentation with Image Embedding Balancing [Paper][Code]
  • [CVPR 2022] Decoupling Zero-Shot Semantic Segmentation [Paper][Code]
  • [CVPR 2023] ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation [Paper][Code]
  • [CVPR 2023] Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation [Paper][Code]
  • [ICML 2024] Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation [Paper][Code]
  • [ICML 2024] SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation [Paper][Code]
  • [CVPR 2023] Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs [Paper][Code]

Datasets

Datasets for Detection

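| Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
| --- | --- | --- | --- | --- | --- | --- |
| COCO 2014 Detection | 2014 | 80 | 83,000 | 41,000 | Box mAP | Project |
| COCO 2017 Detection | 2017 | 80 | 118,287 | 5,000 | Box mAP | Project |
| PASCAL VOC | 2012 | 20 | 5,717 | 5,823 | Box mAP | Project |
| LVIS | 2019 | 1203 | 100,170 | 19,809 | Box mAP | Project |
| ODinW | 2022 | 314 | 132,413 | 20,070 | Box mAP | Project |
| Objects365 | 2019 | 365 | 600,000 | 38,000 | Box mAP | Project |
| Stanford Dogs | 2011 | 120 | 12,000 | 8,580 | Box mAP | Project |
| CUB-200-2011 | 2011 | 200 | 5,994 | 5,794 | Box mAP | Project |
| Cityscapes | 2016 | 8 | 2,975 | 500 | Box mAP | Project |
| Foggy Cityscapes | 2018 | 8 | 2,975 | 500 | Box mAP | Project |
| WaterColor | 2018 | 6 | 1,000 | - | Box mAP | Project |
| Comic | 2018 | 6 | 1,000 | - | Box mAP | Project |
| KITTI | 2012 | 1 | 7,481 | - | Box mAP | Project |
| Sim10K | 2016 | 1 | 10,000 | - | Box mAP | Project |
| VOC-C | 2019 | 20 | 543,115 | 553,185 | Box mAP | Project |
| COCO-C | 2019 | 80 | 11,237,265 | 475,000 | Box mAP | Project |
| Cityscapes-C | 2019 | 8 | 282,625 | 47,500 | Box mAP | Project |
| CrowdHuman | 2018 | 1 | 15,000 | 4,370 | Box mAP | Project |
| OCHuman | 2019 | 1 | - | 2,500 | Box mAP | Project |
| WiderPerson | 2019 | 1 | 7,891 | 1,000 | Box mAP | Project |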
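
Every detection benchmark above reports Box mAP, which is built on the intersection-over-union (IoU) between a predicted box and a ground-truth box. The following is a minimal sketch of that overlap computation, not this repository's evaluation code; the function name `box_iou` and the `(x1, y1, x2, y2)` corner format are assumptions made for illustration.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    Hypothetical helper for illustration; the corner format is an assumption.
    """
    # Corners of the intersection rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes yield an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

A prediction is typically counted as a true positive when its IoU with a ground-truth box of the same class exceeds a threshold; COCO-style mAP then averages precision over IoU thresholds from 0.5 to 0.95.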

Datasets for Segmentation

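| Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
| --- | --- | --- | --- | --- | --- | --- |
| COCO-Stuff | 2018 | 172 | 118k | 20k | mIoU | Project |
| PASCAL VOC 2012 | 2012 | 20 | 1,464 | 1,449 | mIoU | Project |
| PASCAL Context | 2014 | 459 | 4,998 | 5,105 | mIoU | Project |
| Cityscapes | 2016 | 19 | 2,975 | 500 | mIoU | Project |
| ADE20k | 2017 | 150 | 25,574 | 2,000 | mIoU | Project |
| MESS* | 2023 | - | - | - | mIoU | Project |
| PASCAL-Part | 2023 | 116 | 8432 | 851 | mIoU | Project |
| ADE20k-Part-234 | 2023 | 234 | 7347 | 1017 | mIoU | Project |
| PASCAL-5i** | 2015 | 20 | - | - | mIoU, FB-IoU | Project |
| COCO-20i** | 2014 | 80 | - | - | mIoU, FB-IoU | Project |
| FSS-1000 | 2020 | 1000 | 5200 | 2400 | mIoU, FB-IoU | Project |
| OCHuman | 2019 | 1 | - | 2231 | AP, AP50, AP75 | Project |
| CIS | 2023 | 1 | - | 459 | AP, AP50, AP75 | Project |
| COCO-OCC | 2021 | 80 | - | 1005 | AP, AP50, AP75 | Project |
| CamVid | 2008 | 11 | 467 | 233 | mIoU | Project |
| UAVid | 2018 | 9 | 200 | 100 | mIoU | Project |
| UDD6 | 2018 | 12 | 205 | 45 | mIoU | Project |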

*The benchmark includes a wide range of domain-specific datasets.

**The benchmark has different training and testing sets under various settings.
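
Most segmentation benchmarks above are scored with mIoU, the per-class intersection-over-union averaged over classes. The sketch below shows one way to compute it on flattened label maps. It is not the evaluation code used for this review; the function name `mean_iou` and the choice to skip classes that appear in neither prediction nor ground truth are assumptions made for illustration.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over flat sequences of integer class labels.

    Hypothetical helper for illustration. Classes absent from both
    `pred` and `target` are skipped rather than counted as zero.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # Correct pixel: adds to intersection and union of one class.
            inter[p] += 1
            union[p] += 1
        else:
            # Wrong pixel: enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Class 0 has IoU 1/2 and class 1 has IoU 2/3.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

Benchmark protocols usually accumulate intersections and unions over the whole test set before dividing, and may exclude an ignore label, so a per-image average like this one can differ from reported numbers.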

BibTeX

If our work is helpful for your research, please consider citing the following BibTeX entry.

@article{feng2025vision,
  title={Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation},
  author={Feng, Yongchao and Liu, Yajie and Yang, Shuai and Cai, Wenrui and Zhang, Jinqing and Zhan, Qiqi and Huang, Ziyue and Yan, Hongxi and Wan, Qiao and Liu, Chenguang and others},
  journal={arXiv preprint arXiv:2504.09480},
  year={2025}
}