Introduction

November 18, 2025

Towards Visual Grounding: A Survey

Accepted by TPAMI, 2025
Linhui Xiao · Xiaoshan Yang · Xiangyuan Lan · Yaowei Wang · Changsheng Xu

An Illustration of Visual Grounding

A Decade of Visual Grounding

This repo is used for recording, tracking, and benchmarking several recent visual grounding methods to supplement our Grounding Survey.

Links: TPAMI, arXiv, Chinese-language commentary by 机器之心 (Jiqizhixin)

Please leave a STAR ⭐ if you like this project!

🔥 Add Your Paper in our Repo and Survey!

  • If you find any work missing or have any suggestions (papers, implementations, and other resources), feel free to open a pull request. We will add the missing papers to this repo as soon as possible.

  • You are welcome to open an issue or PR (pull request) for your visual grounding related work!

  • Note: Given the huge number of papers on arXiv, we are sorry that we cannot cover all of them in our survey. You can submit a PR to this repo directly, and we will record it for the next version of our survey.

🔥 New

  • 🔥🔥🔥 Our Grounding survey (TPAMI, arXiv, Project) was accepted by TPAMI on October 30, 2025!!!

  • 🔥 We made our survey paper public and created this repository on December 28, 2024.

  • Our advanced one-tower grounding work OneRef (Paper, Code) was accepted by the top conference NeurIPS 2024 in October 2024!

  • Our advanced grounding work HiVG (Paper, Code) was accepted by the top conference ACM MM 2024 in July 2024!

  • Our grounding work CLIP-VG (Paper, Code) was accepted by the top journal TMM in September 2023!

🔥 Highlight!!

  • A comprehensive survey of Visual Grounding, including Referring Expression Comprehension and Phrase Grounding.

  • It covers newly emerging concepts, such as Grounding Multi-modal LLMs, Generalized Visual Grounding, and VLP-based grounding transfer works.

  • We list detailed results for the most representative works and give a fairer and clearer comparison of different approaches.

  • We provide a list of future research insights.

Introduction

This is the first survey in the past five years to systematically track and summarize the development of visual grounding over the last decade. By extracting common technical details, this review covers the most representative work in each subtopic.

This survey is also currently the most comprehensive review in the field of visual grounding. We aim for this article to serve as a valuable resource not only for beginners seeking an introduction to grounding but also for researchers with an established foundation, enabling them to navigate and stay up-to-date with the latest advancements.

A Decade of Visual Grounding

Mainstream Settings in Visual Grounding

Typical Framework Architectures for Visual Grounding

Our Paper Structure

Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@article{xiao2025groundingsurvey,
      title={Towards Visual Grounding: A Survey}, 
      author={Linhui Xiao and Xiaoshan Yang and Xiangyuan Lan and Yaowei Wang and Changsheng Xu},
      journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, 
      year={2025},
      volume={1},
      number={1},
      pages={1-20},
      doi={10.1109/TPAMI.2025.3630635}
}

Note that, due to the journal's typesetting restrictions, there are small typesetting differences between the arXiv version and the review version.

The relevant grounding papers covered in this paper, with their associated code links, are listed below:

Summary of Contents

This content corresponds to the main text.

1. Methods: A Survey

1.1 Fully Supervised Setting

A. Traditional CNN-based Methods

| Year | Venue | Work Name | Paper Title / Paper Link | Code / Project |
| --- | --- | --- | --- | --- |
| 2016 | CVPR | NMI | Generation and Comprehension of Unambiguous Object Descriptions | Code |
| 2016 | ECCV | SNLE | Segmentation from Natural Language Expressions | N/A |
| 2018 | TPAMI | Similarity Network | Learning Two-Branch Neural Networks for Image-Text Matching Tasks | N/A |
| 2018 | ECCV | CITE | Conditional Image-Text Embedding Networks | Code |
| 2018 | IJCAI | DDPN | Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding | Code |
| 2014 | EMNLP | Referitgame | Referitgame: Referring to objects in photographs of natural scenes | Code |
| 2015 | CVPR | DMSM | From captions to visual concepts and back | Project |
| 2016 | CVPR | SCRC | Natural language object retrieval | Code |
| 2018 | ACCV | PIRC | Pirc net: Using proposal indexing, relationships and context for phrase grounding | N/A |
| 2016 | ECCV | Visdif | Modeling context in referring expressions | Data |
| 2018 | CVPR | Mattnet | Mattnet: Modular attention network for referring expression comprehension | Code |
| 2020 | AAAI | CMCC | Learning cross-modal context graph for visual grounding | Code |
| 2016 | CVPR | YOLO | You only look once: Unified, real-time object detection | Project |
| 2018 | CVPR | YOLOv3 | Yolov3: An incremental improvement | Project |
| 2017 | ICCV | Attribute | Referring Expression Generation and Comprehension via Attributes | N/A |
| 2017 | CVPR | CG | Comprehension-guided referring expressions | N/A |
| 2017 | CVPR | CMN | Modeling relationships in referential expressions with compositional modular networks | Code |
| 2018 | CVPR | PLAN | Parallel attention: A unified framework for visual object discovery through dialogs and queries | N/A |
| 2018 | CVPR | VC | Grounding Referring Expressions in Images by Variational Context | Code |
| 2018 | arXiv | SSG | Real-time referring expression comprehension by single-stage grounding network | N/A |
| 2018 | CVPR | A-ATT | Visual grounding via accumulated attention | N/A |
| 2019 | ICCV | DGA | Dynamic Graph Attention for Referring Expression Comprehension | N/A |
| 2020 | CVPR | RCCF | A real-time cross-modality correlation filtering method for referring expression comprehension | N/A |
| 2021 | CVPR | LBYL | Look before you leap: Learning landmark features for one-stage visual grounding | Code |
| 2019 | CVPR | CM-Att-E | Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing | N/A |
| 2019 | ICCV | FAOA | A Fast and Accurate One-Stage Approach to Visual Grounding | N/A |
| 2016 | ECCV | Neg Bag | Modeling context between objects for referring expression understanding | N/A |
| 2020 | ECCV | ReSC | Improving one-stage visual grounding by recursive sub-query construction | Code |

B. Transformer-based Methods

| Year | Venue | Work Name | Paper Title / Paper Link | Code / Project |
| --- | --- | --- | --- | --- |
| 2021 | ICCV | TransVG | Transvg: End-to-end Visual Grounding with Transformers | Code |
| 2023 | TPAMI | TransVG++ | TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer | N/A |
| 2022 | CVPR | QRNet | Shifting More Attention to Visual Backbone: Query-Modulated Refinement Networks for End-to-End Visual Grounding | Code |
| 2024 | ACM MM | MMCA | Visual grounding with multimodal conditional adaptation | Code |

C. VLP-based Methods

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
| --- | --- | --- | --- | --- |
| 2023 | TMM | CLIP-VG | CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding | Code |
| 2023 | TPAMI | D-MDETR | Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding | Code |
| 2022 | TNNLS | Word2Pix | Word2Pix: Word to Pixel Cross-Attention Transformer in Visual Grounding | Code |
| 2023 | AAAI | LADS | Referring Expression Comprehension Using Language Adaptive Inference | N/A |
| 2023 | TIM | JMRI | Visual Grounding With Joint Multimodal Representation and Interaction | N/A |
| 2024 | ACM MM | HiVG | HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding | Code |
| 2023 | AAAI | DQ-DETR | DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding | Code |
| 2022 | NeurIPS | FIBER | Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Code |
| 2022 | EMNLP | mPLUG | mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | Code |
| 2022 | CVPR | Cris | Cris: Clip driven referring image segmentation | Code |
| 2024 | NAACL | RISCLIP | Extending clip’s image-text alignment to referring image segmentation | N/A |

D. Grounding-oriented Pre-training

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
| --- | --- | --- | --- | --- |
| 2021 | ICCV | MDETR | Transvg: End-to-end Visual Grounding with Transformers | Code |
| 2022 | ICML | OFA | OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | Code |
| 2022 | ECCV | UniTAB | UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling | Code |
| 2024 | ECCV | GVC | Llava-grounding: Grounded visual chat with large multimodal models | N/A |
| 2022 | CVPR | GLIP | Grounded language-image pretraining | Code |
| 2021 | CVPR | OVR-CNN | Open-vocabulary object detection using captions | Code |
| 2021 | CVPR | MDETR | MDETR - Modulated Detection for End-to-End Multi-Modal Understanding | Code |
| 2024 | NeurIPS | OneRef | OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling | Code |
| 2022 | ICML | OFA | OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | Code |
| 2020 | ECCV | UNITER | UNITER: UNiversal Image-TExt Representation Learning | Code |
| 2020 | NeurIPS | VILLA | Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Code |
| 2022 | NeurIPS | Glipv2 | Glipv2: Unifying localization and vision-language understanding | Code |
| 2024 | NeurIPS | HIPIE | Hierarchical open-vocabulary universal image segmentation | Code |
| 2023 | CVPR | UNINEXT | Universal instance perception as object discovery and retrieval | Code |
| 2019 | NeurIPS | Vilbert | Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks | Code |
| 2020 | ICLR | Vl-bert | Vl-bert: Pre-training of generic visual-linguistic representations | Code / Project |
| 2023 | arXiv | ONE-PEACE | One-peace: Exploring one general representation model toward unlimited modalities | Code |
| 2022 | FTCGV | N/A | Vision-language pre-training: Basics, recent advances, and future trends | N/A |
| 2023 | MIR | N/A | Large-scale multi-modal pre-trained models: A comprehensive survey | N/A |

E. Grounding Multimodal LLMs

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
| --- | --- | --- | --- | --- |
| 2024 | arXiv | Inst-IT | Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | Code / Project |
| 2023 | arXiv | Shikra | Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | Code |
| 2022 | NeurIPS | Chinchilla | Training Compute-Optimal Large Language Models | N/A |
| 2019 | OpenAI | GPT-2 | Language Models are Unsupervised Multitask Learners | N/A |
| 2020 | NeurIPS | GPT-3 | Language Models are Few-Shot Learners | N/A |
| 2024 | ICLR | Ferret | Ferret: Refer And Ground Anything Anywhere At Any Granularity | Code |
| 2024 | CVPR | LION | LION: Empowering Multimodal Large Language Model With Dual-Level Visual Knowledge | Code |
| 2022 | ECCV | YORO | YORO - Lightweight End to End Visual Grounding | Code |
| 2022 | NeurIPS | Adaptformer | Adaptformer: Adapting vision transformers for scalable visual recognition | Code |
| 2023 | ICML | Blip-2 | Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models | Code |
| 2022 | CVPR | Glamm | Glamm: Pixel grounding large multimodal model | Code |
| 2024 | CVPR | Lisa | Lisa: Reasoning segmentation via large language model | GitHub |
| 2024 | CVPR | GSVA | GSVA: Generalized segmentation via multimodal large language models | GitHub |
| 2024 | CoRR | UnifiedMLLM | UnifiedMLLM: Enabling unified representation for multi-modal multi-tasks with large language model | GitHub |
| 2024 | arXiv | F-LMM | F-LMM: Grounding frozen large multimodal models | GitHub |
| 2024 | arXiv | Vigor | Vigor: Improving visual grounding of large vision language models with fine-grained reward modeling | GitHub |
| 2023 | arXiv | BuboGPT | BuboGPT: Enabling visual grounding in multi-modal LLMs | GitHub |
| 2024 | ICLR | MiniGPT-4 | MiniGPT-4: Enhancing vision-language understanding with advanced large language models | GitHub |
| 2024 | CVPR | RegionGPT | RegionGPT: Towards region understanding vision language model | GitHub |
| 2024 | arXiv | TextHawk | TextHawk: Exploring efficient fine-grained perception of multimodal large language models | GitHub |
| 2024 | ACM TMM | PEAR | Multimodal PEAR: Chain-of-thought reasoning for multimodal sentiment analysis | GitHub |
| 2024 | ECCV | Grounding DINO | Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection | GitHub |
| 2023 | CVPR | Polyformer | Polyformer: Referring image segmentation as sequential polygon generation | GitHub |
| 2024 | ACM TMM | UniQRNet | UniQRNet: Unifying referring expression grounding and segmentation with QRNet | GitHub |
| 2022 | CVPR | LAVT | LAVT: Language-aware vision transformer for referring image segmentation | GitHub |
| 2024 | NeurIPS | SimVG | SimVG: A simple framework for visual grounding with decoupled multi-modal fusion | GitHub |
| 2024 | ICLR | KOSMOS-2 | Grounding Multimodal Large Language Models to the World | Code |
| 2019 | OpenAI | GPT-2 | Language Models are Unsupervised Multitask Learners | N/A |
| 2020 | NeurIPS | GPT-3 | Language Models are Few-Shot Learners | N/A |
| 2023 | arXiv | Qwen-VL | Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond | Code |
| 2023 | arXiv | Lenna | Lenna: Language enhanced reasoning detection assistant | Code |
| 2023 | arXiv | u-LLaVA | u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model | Code |
| 2024 | arXiv | Cogvlm | Cogvlm: Visual expert for pretrained language models | Code |
| 2024 | CVPR | VistaLLM | Jack of All Tasks Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model | N/A |
| 2024 | CoRR | VisCoT | Visual cot: Unleashing chain-of-thought reasoning in multimodal language models | Code |
| 2024 | ICLR | Ferret | Ferret: Refer And Ground Anything Anywhere At Any Granularity | Code |
| 2024 | CVPR | LION | LION: Empowering Multimodal Large Language Model With Dual-Level Visual Knowledge | Code |
| 2024 | COLM | Ferret-v2 | Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models | N/A |
| 2022 | ECCV | YORO | YORO - Lightweight End to End Visual Grounding | |
| 2023 | arXiv | NExT-Chat | NExT-Chat: An LMM for Chat, Detection and Segmentation | Code |
| 2023 | arXiv | MiniGPT-v2 | MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | Code |
| 2024 | ACL | G-GPT | GroundingGPT: Language Enhanced Multi-modal Grounding Model | Code |
| 2024 | ECCV | Groma | Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models | Code |
| 2023 | NeurIPS | VisionLLM | Visionllm: Large language model is also an open-ended decoder for vision-centric tasks | Code |
| 2022 | NeurIPS | InstructGPT | Training language models to follow instructions with human feedback | Code |
| 2023 | arXiv | GPT-4 | Gpt-4 technical report | Code |
| 2023 | arXiv | Llama | Llama: Open and efficient foundation language models | Code |
| 2023 | JMLR | Palm | Palm: Scaling language modeling with pathways | Code |
| 2023 | N/A | Alpaca | Stanford alpaca: An instruction-following llama model | Code / Project |
| 2023 | arXiv | N/A | Instruction tuning with gpt-4 | Code / Project |
| 2023 | NeurIPS | KOSMOS-1 | Language is not all you need: Aligning perception with language models | Code |
| 2024 | TMLR | Dinov2 | Dinov2: Learning robust visual features without supervision | Code |

1.2 Weakly Supervised Setting

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
| --- | --- | --- | --- | --- |
| 2016 | ECCV | GroundR | Grounding of Textual Phrases in Images by Reconstruction | N/A |
| 2017 | CVPR | N/A | Weakly-supervised Visual Grounding of Phrases with Linguistic Structures | N/A |
| 2014 | EMNLP | Glove | GloVe: Global Vectors for Word Representation | Project |
| 2015 | CVPR | N/A | Deep Visual-Semantic Alignments for Generating Image Descriptions | Project / Code |
| 2016 | ECCV | GroundR | Grounding of textual phrases in images by reconstruction | N/A |
| 2017 | ICCV | Mask R-CNN | Mask R-CNN | Code |
| 2017 | ICCV | Grad-CAM | Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization | Code |
| 2018 | CVPR | KAC | Knowledge Aided Consistency for Weakly Supervised Phrase Grounding | Code |
| 2018 | arXiv | CPC | Representation learning with contrastive predictive coding | Code |
| 2019 | ACM MM | KPRN | Knowledge-guided pairwise reconstruction network for weakly supervised referring expression grounding | Code |
| 2021 | ICCV | GbS | Detector-free weakly supervised grounding by separation | Code |
| 2021 | TPAMI | DTWREG | Discriminative Triad Matching and Reconstruction for Weakly Referring Expression Grounding | Code |
| 2021 | CVPR | ReIR | Relation-aware Instance Refinement for Weakly Supervised Visual Grounding | Code |
| 2022 | ICML | BLIP | BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | Code |
| 2022 | CVPR | Mask2Former | Masked-attention Mask Transformer for Universal Image Segmentation | Project / Code |
| 2023 | ACM MM | CACMRN | Client-adaptive cross-model reconstruction network for modality-incomplete multimodal federated learning | N/A |
| 2023 | CVPR | g++ | Similarity Maps for Self-Training Weakly-Supervised Phrase Grounding | Code |
| 2023 | CVPR | RefCLIP | RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension | Code |
| 2024 | TOMM | UR | Universal Relocalizer for Weakly Supervised Referring Expression Grounding | N/A |
| 2024 | ICASSP | VPT-WSVG | Visual prompt tuning for weakly supervised phrase grounding | N/A |
| 2024 | MMM | PPT | Part-Aware Prompt Tuning for Weakly Supervised Referring Expression Grounding | N/A |
| 2016 | ECCV | GroundR | Grounding of Textual Phrases in Images by Reconstruction | N/A |
| 2018 | CVPR | MATN | Weakly Supervised Phrase Localization With Multi-Scale Anchored Transformer Network | N/A |
| 2019 | ICCV | ARN | Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding | Code |
| 2019 | ICCV | Align2Ground | Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment | N/A |
| 2020 | ECCV | info-ground | Contrastive Learning for Weakly Supervised Phrase Grounding | Project |
| 2020 | EMNLP | MAF | MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding | Code |
| 2020 | NeurIPS | CCL | Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding | N/A |
| 2021 | CVPR | NCE-Distill | Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation | N/A |
| 2022 | TPAMI | EARN | Entity-Enhanced Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding | Code |
| 2022 | ICML | X-VLM | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | Code |
| 2023 | TMM | DRLF | A Dual Reinforcement Learning Framework for Weakly Supervised Phrase Grounding | N/A |
| 2023 | TIP | Cycle | Cycle-Consistent Weakly Supervised Visual Grounding With Individual and Contextual Representations | Code |
| 2023 | ICRA | TGKD | Weakly Supervised Referring Expression Grounding via Target-Guided Knowledge Distillation | Code |
| 2023 | ICCV | CPL | Confidence-aware Pseudo-label Learning for Weakly Supervised Visual Grounding | Code |
| 2024 | CVPR | RSMPL | Regressor-Segmenter Mutual Prompt Learning for Crowd Counting | Code |
| 2024 | TCSVT | PSRN | Progressive Semantic Reconstruction Network for Weakly Supervised Referring Expression Grounding | Code |
| 2024 | ACM MM | QueryMatch | QueryMatch: A Query-based Contrastive Learning Framework for Weakly Supervised Visual Grounding | Code |

1.3 Semi-supervised Setting

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
| --- | --- | --- | --- | --- |
| 2023 | ICASSP | PQG-Distil | Pseudo-Query Generation For Semi-Supervised Visual Grounding With Knowledge Distillation | N/A |
| 2021 | WACV | LSEP | Utilizing Every Image Object for Semi-supervised Phrase Grounding | N/A |
| 2022 | CRV | SS-Ground | Semi-supervised Grounding Alignment for Multi-modal Feature Learning | N/A |
| 2021 | AAAI | Curriculum Labeling | Curriculum labeling: Revisiting pseudo-labeling for semi-supervised learning | Code |
| 2023 | ICASSP | PQG-Distil | Pseudo-Query Generation For Semi-Supervised Visual Grounding With Knowledge Distillation | N/A |
| 2024 | CoRR | ACTRESS | Actress: Active retraining for semi-supervised visual grounding | N/A |
| 2021 | WACV | LSEP | Utilizing Every Image Object for Semi-supervised Phrase Grounding | N/A |
| 2022 | CRV | SS-Ground | Semi-supervised Grounding Alignment for Multi-modal Feature Learning | N/A |
| 2019 | IJCAI | N/A | Learning unsupervised visual grounding through semantic self-supervision | N/A |
| 2019 | ICCV | N/A | Phrase Localization Without Paired Training Examples | N/A |
| 2022 | CVPR | Pseudo-Q | Pseudo-q: Generating pseudo language queries for visual grounding | Code |
| 2023 | Neurocomputing | BiCM | Unpaired referring expression grounding via bidirectional cross-modal matching | N/A |
| 2024 | Neurocomputing | N/A | Self-training: A survey | N/A |
| 2024 | CVPR | Omni-q | Omni-q: Omni-directional scene understanding for unsupervised visual grounding | N/A |
| 2018 | CVPR | N/A | Unsupervised Textual Grounding: Linking Words to Image Concepts | N/A |

1.4 Unsupervised Setting

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
| --- | --- | --- | --- | --- |
| 2022 | CVPR | Pseudo-Q | Pseudo-Q: Generating pseudo language queries for visual grounding | Code |
| 2018 | CVPR | N/A | Unsupervised Textual Grounding: Linking Words to Image Concepts | N/A |
| 2023 | TMM | CLIP-VG | CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding | Code |
| 2024 | ICME | VG-annotator | VG-Annotator: Vision-Language Models as Query Annotators for Unsupervised Visual Grounding | N/A |
| 2023 | TMM | CLIPREC | CLIPREC: Graph-Based Domain Adaptive Network for Zero-Shot Referring Expression Comprehension | N/A |
| 2019 | IJCAI | N/A | Learning unsupervised visual grounding through semantic self-supervision | N/A |
| 2019 | ICCV | N/A | Phrase Localization Without Paired Training Examples | N/A |
| 2022 | CVPR | Pseudo-Q | Pseudo-q: Generating pseudo language queries for visual grounding | Code |
| 2023 | Neurocomputing | BiCM | Unpaired referring expression grounding via bidirectional cross-modal matching | N/A |
| 2024 | Neurocomputing | N/A | Self-training: A survey | N/A |
| 2024 | CVPR | Omni-q | Omni-q: Omni-directional scene understanding for unsupervised visual grounding | N/A |
| 2018 | CVPR | N/A | Unsupervised Textual Grounding: Linking Words to Image Concepts | N/A |

1.5 Zero-shot Setting

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
| --- | --- | --- | --- | --- |
| 2019 | ICCV | ZSGNet | Zero-shot Grounding of Objects from Natural Language Queries | Code |
| 2022 | ACL | ReCLIP | ReCLIP: A Strong Zero-shot Baseline for Referring Expression Comprehension | Code |
| 2024 | Neurocomputing | OV-VG | OV-VG: A Benchmark for Open-Vocabulary Visual Grounding | Code |
| 2023 | TMM | CLIPREC | CLIPREC: Graph-Based Domain Adaptive Network for Zero-Shot Referring Expression Comprehension | N/A |
| 2024 | Neurocomputing | N/A | Zero-shot visual grounding via coarse-to-fine representation learning | Code |
| 2022 | arXiv | adapting-CLIP | Adapting CLIP For Phrase Localization Without Further Training | Code |
| 2023 | ICLR | ChatRef | Language models can do zero-shot visual referring expression comprehension | Code |
| 2024 | AI Open | Cpt | CPT: Colorful Prompt Tuning for pre-trained vision-language models | Code |
| 2021 | CVPR | VinVL | VinVL: Revisiting Visual Representations in Vision-Language Models | Code |
| 2024 | CVPR | VR-VLA | Zero-shot referring expression comprehension via structural similarity between images and captions | Code |
| 2024 | AAAI | GroundVLP | Groundvlp: Harnessing zero-shot visual grounding from vision-language pre-training and open-vocabulary object detection | Code |
| 2024 | TCSVT | MCCE-REC | MCCE-REC: MLLM-driven Cross-modal Contrastive Entropy Model for Zero-shot Referring Expression Comprehension | N/A |
| 2024 | ECCV | CRG | Contrastive Region Guidance: Improving Grounding in Vision-Language Models Without Training | Code |
| 2024 | IJCNN | PSAIR | Psair: A neurosymbolic approach to zero-shot visual grounding | N/A |
| 2024 | TPAMI | TransCP | Context disentangling and prototype inheriting for robust visual grounding | Code |
| 2024 | TPAMI | N/A | Towards Open Vocabulary Learning: A Survey | Code |
| 2024 | CVPR | GEM | Grounding everything: Emerging localization properties in vision-language transformers | Code |
| 2023 | arXiv | GRILL | Grill: Grounded vision-language pre-training via aligning text and image regions | N/A |
| 2017 | ICCV | Grad-CAM | Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization | Code |
| 2019 | ICCV | ZSGNet | Zero-shot grounding of objects from natural language queries | Code |
| 2022 | ACL | ReCLIP | Reclip: A strong zero-shot baseline for referring expression comprehension | Code |
| 2022 | CVPR | GLIP | Grounded language-image pretraining | Code |
| 2022 | AAAI | MMKG | Improving Zero-Shot Phrase Grounding via Reasoning on External Knowledge and Spatial Relations | N/A |
| 2021 | CVPR | OVR-CNN | Open-vocabulary object detection using captions | Code |
| 2024 | ICLR | KOSMOS-2 | Grounding Multimodal Large Language Models to the World | Code |
| 2024 | Neurocomputing | OV-VG | OV-VG: A Benchmark for Open-Vocabulary Visual Grounding | Code |

1.6 Multi-task Setting

A. REC with REG Multi-task Setting

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
| --- | --- | --- | --- | --- |
| 2024 | arXiv | VLM-VG | Learning visual grounding from generative vision and language model | N/A |
| 2024 | arXiv | EEVG | An efficient and effective transformer decoder-based framework for multi-task visual grounding | Code |
| 2006 | INLGC | N/A | Building a Semantically Transparent Corpus for the Generation of Referring Expressions | Project |
| 2010 | ACL | N/A | Natural reference to objects in a visual domain | Code |
| 2012 | CL | Survey | Computational generation of referring expressions: A survey | N/A |
| 2013 | NAACL | N/A | Generating expressions that refer to visible object | Code |
| 2016 | CVPR | NMI | Generation and comprehension of unambiguous object descriptions | Code |
| 2017 | ICCV | Attribute | Referring Expression Generation and Comprehension via Attributes | N/A |
| 2017 | CVPR | SLR | A Joint Speaker-Listener-Reinforcer Model for Referring Expressions | N/A |
| 2017 | CVPR | CG | Comprehension-guided referring expressions | N/A |
| 2024 | AAAI | CyCo | A Joint Speaker-Listener-Reinforcer Model for Referring Expressions | N/A |

B. REC with RES Multi-task Setting

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
| --- | --- | --- | --- | --- |
| 2020 | CVPR | MCN | Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation | Code |
| 2021 | NeurIPS | RefTR | Referring Transformer: A One-step Approach to Multi-task Visual Grounding | Code |
| 2022 | ECCV | SeqTR | SeqTR: A Simple yet Universal Network for Visual Grounding | Code |
| 2023 | CVPR | VG-LAW | Language Adaptive Weight Generation for Multi-task Visual Grounding | Code |
| 2024 | Neurocomputing | M2IF | Improving visual grounding with multi-modal interaction and auto-regressive vertex generation | Code |

C. Other Multi-task Setting

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
| --- | --- | --- | --- | --- |
| 2016 | EMNLP | MCB | Multimodal compact bilinear pooling for visual question answering and visual grounding | Code |
| 2024 | CVPR | RefCount | Referring expression counting | Code |
| 2022 | CVPR | VizWiz-VQA-Grounding | Grounding Answers for Visual Questions Asked by Visually Impaired People | Project |
| 2022 | ECCV | N/A | Weakly supervised grounding for VQA in vision-language transformers | Code |
| 2020 | ACL | N/A | A Negative Case Analysis of Visual Grounding Methods for VQA | Code |
| 2024 | arXiv | TrueVG | Uncovering the Full Potential of Visual Grounding Methods in VQA | Code |
| 2020 | IVC | N/A | Explaining VQA predictions using visual grounding and a knowledge base | N/A |
| 2019 | CVPR | N/A | Multi-task Learning of Hierarchical Vision-Language Representation | N/A |

1.7 Generalized Visual Grounding

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
| --- | --- | --- | --- | --- |
| 2021 | CVPR | OVR-CNN | Open-Vocabulary Object Detection Using Captions | Code |
| 2021 | ICCV | VLT | Vision-Language Transformer and Query Generation for Referring Segmentation | Code |
| 2023 | arXiv | GREC | GREC: Generalized Referring Expression Comprehension | Code |
| 2024 | EMNLP | RECANTFormer | Recantformer: Referring expression comprehension with varying numbers of targets | N/A |
| 2023 | CVPR | gRefCOCO | GRES: Generalized Referring Expression Segmentation | Code |
| 2024 | ICCV | Ref-ZOM | Beyond One-to-One: Rethinking the Referring Image Segmentation | Code |

2. Advanced Topics

2.1 NLP Language Structure Parsing in Visual Grounding

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
| --- | --- | --- | --- | --- |
| 2019 | ICCV | NMTree | Learning to assemble neural module tree networks for visual grounding | N/A |
| 2017 | CVPR | CMN | Modeling relationships in referential expressions with compositional modular networks | Code |
| 2015 | EMNLP | N/A | An improved non-monotonic transition system for dependency parsing | N/A |
| 2014 | EMNLP | N/A | A fast and accurate dependency parser using neural networks | N/A |
| 2020 | NSP | NLPPython | Natural language processing with Python and spaCy: A practical introduction | N/A |
| 2020 | arXiv | Stanza | Stanza: A Python Natural Language Processing Toolkit for Many Human Languages | Project |
| 2016 | ECCV | N/A | Structured matching for phrase localization | N/A |
| 2017 | ICCV | N/A | Phrase localization and visual relationship detection with comprehensive image-language cues | Code |
| 2022 | CVPR | GLIP | Grounded language-image pretraining | Code |
| 2017 | ICCV | QRC Net | Query-guided regression network with context policy for phrase grounding | N/A |
| 2006 | ACL | NLTK | Nltk: the natural language toolkit | Code |
| 2019 | SNAMS | OpenNLP | A Replicable Comparison Study of NER Software: StanfordNLP, NLTK, OpenNLP, SpaCy, Gate | N/A |
| 2018 | Packt | Gensim | Natural Language Processing and Computational Linguistics: A practical guide to text analysis with Python, Gensim, spaCy, and Keras | N/A |
| 2013 | ACL | Keras | Parsing with compositional vector grammars | N/A |
| 2018 | AAAI | GroundNet | Using Syntax to Ground Referring Expressions in Natural Images | Code |
| 2019 | TPAMI | RVGTree | Learning to Compose and Reason with Language Tree Structures for Visual Grounding | N/A |
| 2024 | CVPR | ARPGrounding | Investigating Compositional Challenges in Vision-Language Models for Visual Grounding | N/A |

2.2 Spatial Relation and Graph Networks

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
| --- | --- | --- | --- | --- |
| 2023 | TMM | CLIPREC | CLIPREC: Graph-Based Domain Adaptive Network for Zero-Shot Referring Expression Comprehension | N/A |
| 2024 | ACM MM | ResVG | ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding | Code |
| 2023 | arXiv | Shikra | Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | Code |
| 2023 | ACM MM | TAGRL | Towards adaptable graph representation learning: An adaptive multi-graph contrastive transformer | N/A |
| 2020 | AAAI | CMCC | Learning cross-modal context graph for visual grounding | Code |
| 2019 | CVPR | LGRANs | Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks | N/A |
| 2019 | CVPR | CMRIN | Cross-Modal Relationship Inference for Grounding Referring Expressions | N/A |
| 2019 | ICCV | DGA | Dynamic Graph Attention for Referring Expression Comprehension | N/A |
| 2024 | TPAMI | N/A | A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective | N/A |

2.3 Modular Grounding

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
| --- | --- | --- | --- | --- |
| 2018 | CVPR | Mattnet | Mattnet: Modular attention network for referring expression comprehension | Code |
| 2017 | CVPR | CMN | Modeling relationships in referential expressions with compositional modular networks | Code |
| 2016 | CVPR | NMN | Neural Module Networks | Code |
| 2019 | CVPR | MTGCR | Modularized Textual Grounding for Counterfactual Resilience | N/A |

3. Applications

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
| --- | --- | --- | --- | --- |
| 2019 | CVPR | CAGDC | Context and Attribute Grounded Dense Captioning | N/A |

3.1 Grounded Object Detection

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
| --- | --- | --- | --- | --- |
| 2024 | NeurIPS | MQ-Det | Multi-modal queried object detection in the wild | Code |
| 2023 | arXiv | Shikra | Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | Code |
| 2022 | CVPR | GLIP | Grounded language-image pretraining | Code |
| 2024 | CVPR | ScanFormer | ScanFormer: Referring Expression Comprehension by Iteratively Scanning | N/A |
| 2024 | arXiv | Ref-L4 | Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models | Code |

3.2 Referring Counting

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
| --- | --- | --- | --- | --- |
| 2024 | CVPR | RefCount | Referring expression counting | Code |

3.3 Remote Sensing Visual Grounding

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
| --- | --- | --- | --- | --- |
| 2024 | TGRS | Rrsis | Rrsis: Referring remote sensing image segmentation | Code |
| 2024 | TGRS | LQVG | Language query based transformer with multi-scale cross-modal alignment for visual grounding on remote sensing images | Code |
| 2024 | TGRS | RINet | A regionally indicated visual grounding network for remote sensing images | Code |
| 2024 | GRSL | MSAM | Multi-stage synergistic aggregation network for remote sensing visual grounding | Code |
| 2024 | GRSL | VSMR | Visual selection and multi-stage reasoning for rsvg | N/A |
| 2024 | TGRS | LPVA | Language-guided progressive attention for visual grounding in remote sensing images | Code |
| 2024 | arXiv | GeoGround | GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding | Code |
| 2023 | TGRS | RSVG | RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data | N/A |
| 2022 | ACM MM | RSVG | Visual grounding in remote sensing images | Code |

3.4 Medical Visual Grounding

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
| --- | --- | --- | --- | --- |
| 2023 | MICCAI | MedRPG | Medical Grounding with Region-Phrase Context Contrastive Alignment | N/A |
| 2024 | arXiv | PFMVG | Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding | unavailable |
| 2022 | ECCV | CXR-BERT | Making the most of text semantics to improve biomedical vision–language processing | Code |
| 2017 | CVPR | ChestX-ray8 | Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases | N/A |
| 2019 | arXiv | MIMIC-CXR-JPG | MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs | Code |
| 2024 | arXiv | MedRG | MedRG: Medical Report Grounding with Multi-modal Large Language Model | N/A |
| 2024 | arXiv | VividMed | VividMed: Vision Language Model with Versatile Visual Grounding for Medicine | Code |
| 2023 | arXiv | ViLaM | ViLaM: A Vision-Language Model with Enhanced Visual Grounding and Generalization Capability | Code |

3.5 3D Visual Grounding

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
| --- | --- | --- | --- | --- |
| 2022 | CVPR | 3D-SPS | 3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection | Code |
| 2021 | ACM MM | TransRefer3D | TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding | Code |
| 2020 | ECCV | Scanrefer | Scanrefer: 3d object localization in rgb-d scans using natural language | Code |
| 2020 | ECCV | ReferIt3D | ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes | Code |
| 2024 | arXiv | - | A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions | N/A |

3.6 Video Object Grounding

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
| --- | --- | --- | --- | --- |
| 2020 | CVPR | VOGNet | Video object grounding using semantic roles in language description | Code |
| 2024 | arXiv | - | Described Spatial-Temporal Video Detection | unavailable |
| 2023 | TOMM | - | A survey on temporal sentence grounding in videos | N/A |
| 2023 | TPAMI | - | Temporal sentence grounding in videos: A survey and future directions | N/A |
| 2024 | CVPR | MC-TTA | Modality-Collaborative Test-Time Adaptation for Action Recognition | N/A |
| 2023 | CVPR | TransRMOT | Referring multi-object tracking | Code |

3.7 Robotic and Multimodal Agent Applications

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
|------|-------|------|--------------------------|----------------|
| 2018 | CVPR | VLN | Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments | Data |
| 2019 | RAS | Dynamic-SLAM | Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment | Code |
| 2019 | WCSP | N/A | Integrated Wearable Indoor Positioning System Based On Visible Light Positioning And Inertial Navigation Using Unscented Kalman Filter | N/A |
| 2019 | ICRA | Ground then Navigate | Ground then Navigate: Language-guided Navigation in Dynamic Scenes | Code |
| 2023 | MEAS SCI TECHNOL | FDO-Calibr | FDO-Calibr: visual-aided IMU calibration based on frequency-domain optimization | N/A |
| 2024 | arxiv | HiFi-CS | Towards Open Vocabulary Visual Grounding For Robotic Grasping Using Vision-Language Models | N/A |
| 2025 | ECCV | Ferret-UI | Grounded Mobile UI Understanding with Multimodal LLMs | N/A |

3.8 Multi-image Grounding

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
|------|-------|------|--------------------------|----------------|
| 2025 | ACL | Migician | Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models | Code / Project |
| 2025 | arxiv | UniVG-R1 | UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning | Code / Project |

4. Datasets and Benchmarks

4.1 The Five Datasets for Classical Visual Grounding

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
|------|-------|------|--------------------------|----------------|
| 2010 | CVIU | N/A | The segmented and annotated iapr tc-12 benchmark | N/A |
| 2014 | ECCV | MS COCO | Microsoft COCO: Common Objects in Context | Project |
| 2014 | TACL | N/A | From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions | N/A |
| 2015 | ICCV | Flickr30k Entities | Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models | Code |
| 2016 | ECCV | RefCOCOg-umd | Modeling context between objects for referring expression understanding | N/A |
| 2016 | CVPR | RefCOCOg-g | Generation and comprehension of unambiguous object descriptions | Code |
| 2016 | ECCV | RefCOCO/+ | Modeling context in referring expressions | Data |
| 2017 | IJCV | Visual genome | Visual genome: Connecting language and vision using crowdsourced dense image annotations | N/A |
| 2019 | CVPR | TD-SDR | TOUCHDOWN: Natural Language Navigation and Spatial Reasoning in Visual Street Environments | Code |
| 2019 | CVPR | CLEVR | CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning | Code |
| 2020 | CVPR | REVERIE | REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments | Code |
| 2020 | CVPR | PANDA | PANDA: A Gigapixel-level Human-centric Video Dataset | Code |
| 2024 | arxiv | DINO-X | DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding | Code |
| 2024 | arxiv | MC-Bench | MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs | Code |
| 2025 | arxiv | T-Rex2 | T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy | Code |

4.2 The Other Datasets for Classical Visual Grounding

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
|------|-------|------|--------------------------|----------------|
| 2024 | Arxiv | VLM-VG | Learning visual grounding from generative vision and language model | N/A |
| 2011 | NeurIPS | SBU | Im2text: Describing images using 1 million captioned photographs | N/A |
| 2016 | CVPR | Visual7W | Visual7W: Grounded Question Answering in Images | Code |
| 2017 | CVPR | GuessWhat?! | GuessWhat?! Visual object discovery through multi-modal dialogue | |
| 2018 | ACL | CC3M | Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning | Code |
| 2019 | CVPR | Clevr-ref+ | Clevr-ref+: Diagnosing visual reasoning with referring expressions | Code |
| 2019 | arxiv | Object365 | Objects as Points | Code |
| 2020 | IJCV | Open Image | The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale | Code |
| 2020 | CVPR | Cops-ref | Cops-ref: A new dataset and task on compositional referring expression comprehension | Code |
| 2020 | ACL | Refer360 | Refer360: A referring expression recognition dataset in 360 images | Code |
| 2021 | CVPR | CC12M | Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts | Code |
| 2023 | ICCV | SAM | Segment Anything | Code |
| 2024 | EMNLP | FineCops-Ref | FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension | Code |

4.3 Dataset for the Newly Curated Scenarios

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
|------|-------|------|--------------------------|----------------|
| 2024 | NeurIPS | D³ | Described Object Detection: Liberating Object Detection with Flexible Expressions | Code |

A. Dataset for Generalized Visual Grounding

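| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
|------|-------|------|--------------------------|----------------|
| 2023 | CVPR | gRefCOCO | GRES: Generalized Referring Expression Segmentation | Code |
| 2024 | ICCV | Ref-ZOM | Beyond One-to-One: Rethinking the Referring Image Segmentation | Code |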

B. Datasets and Benchmarks for GMLLMs

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
|------|-------|------|--------------------------|----------------|
| 2024 | Arxiv | Inst-IT | Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | Code / Project |
| 2024 | NeurIPS | HC-RefLoCo | A Large-Scale Human-Centric Benchmark for Referring Expression Comprehension in the LMM Era | Code |
| 2024 | ECCV | GVC | Llava-grounding: Grounded visual chat with large multimodal models | N/A |
| 2024 | ICLR | KOSMOS-2 | Grounding Multimodal Large Language Models to the World | Code |

C. Dataset for Other Newly Curated Scenarios

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
|------|-------|------|--------------------------|----------------|
| 2024 | CVPR | GigaGround | When visual grounding meets gigapixel-level large-scale scenes: Benchmark and approach | Code |
| 2025 | ACL | MGrounding-630k/MIG-Bench | Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models | Code / Project |
| 2025 | Arxiv | PixMMVP, PixCV-Bench (Pixel-level VG) | PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models? | Code / Project |
| 2025 | Arxiv | MoCentric-Bench (Pixel-level VG) | PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding? | Code / Project |

5. Challenges And Outlook

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
|------|-------|------|--------------------------|----------------|
| 2024 | Arxiv | - | AI Models Collapse When Trained on Recursively Generated Data | N/A |
| 2024 | CVPR | RefCount | Referring expression counting | Code |
| 2024 | CVPR | GigaGround | When visual grounding meets gigapixel-level large-scale scenes: Benchmark and approach | Code |
| 2022 | CVPR | GLIP | Grounded language-image pretraining | Code |
| 2025 | ACL | Migician | Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models | Code / Project |

6. Other Valuable Surveys and Projects

| Year | Venue | Name | Paper Title / Paper Link | Code / Project |
|------|-------|------|--------------------------|----------------|
| 2018 | TPAMI | N/A | Multimodal machine learning: A survey and taxonomy | N/A |
| 2020 | TMM | N/A | Referring expression comprehension: A survey of methods and datasets | N/A |
| 2021 | Github | awesome-grounding | N/A | Project |
| 2023 | TPAMI | Awesome-Open-Vocabulary | Towards Open Vocabulary Learning: A Survey | Project |
| 2023 | TPAMI | N/A | Multimodal learning with transformers: A survey | N/A |
| 2024 | Github | awesome-described-object-detection | N/A | awesome-described-object-detection |

Acknowledgement

This survey took half a year to complete, and the process was laborious and burdensome.

Building this GitHub repository also required significant effort. We would like to thank the following individuals, among others, for their contributions to this project: Baochen Xiong, Yifan Xu, Yaguang Song, Menghao Hu, Han Jiang, Hao Liu, Chenlin Zhao, Fang Peng, Xudong Yao, Zibo Shao, Kaichen Li, Jianhao Huang, Xianbing Yang, Shuaitong Li, Jisheng Yin, Yupeng Wu, and Shaobo Xie.

Contact

Email: xiaolinhui16@mails.ucas.ac.cn. Discussions of any kind are welcome!

Star History

Star History Chart