Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

February 20, 2026


This repository is designed to collect and categorize papers related to Multimodal Retrieval-Augmented Generation (RAG) according to our survey paper: Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation. Given the rapid growth in this field, we will continuously update both the paper and this repository to serve as a resource for researchers working on future projects.

News

  • January 9, 2026: Happy New Year! We've added some new papers in the field.
  • September 19, 2025: We've just added new papers to our repository.
  • August 20, 2025: The poster and slides for this survey paper have been added to the repository.
  • August 1, 2025: We've just added new papers to our repository; a major update!
  • June 2, 2025: A new enhanced version of our paper is out now on arXiv! This update also includes new related papers and covers new topics such as agentic interaction and audio-centric retrieval.
  • May 15, 2025: This paper has been accepted to the Findings of ACL 2025.
  • April 18, 2025: Our website for this topic is now live.
  • February 17, 2025: We released the first survey on Multimodal Retrieval-Augmented Generation. Feel free to cite, contribute, or open a pull request to add recent related papers!

List of Contents


General Pipeline

[Figure: general pipeline of Multimodal RAG]

Taxonomy of Recent Advances and Enhancements

[Figure: taxonomy of recent advances and enhancements in Multimodal RAG]

Taxonomy of Application Domains

[Figure: taxonomy of Multimodal RAG application domains]

Abstract

Large Language Models (LLMs) suffer from hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external dynamic information for improved factual grounding. With advances in multimodal learning, Multimodal RAG extends this approach by incorporating multiple modalities such as text, images, audio, and video to enhance the generated outputs. However, cross-modal alignment and reasoning introduce unique challenges beyond those in unimodal RAG.

This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, benchmarks, metrics, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We also review training strategies, robustness enhancements, loss functions, and agent-based approaches, and explore diverse Multimodal RAG scenarios. In addition, we outline open challenges and future directions to guide research in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal dynamic external knowledge bases.

Image-Text

| Name | Statistics and Description | Modalities | Link |
|---|---|---|---|
| MAVIS | 157K visual QA instances, where each answer is annotated with fact-level citations referring to multimodal documents. | Image, Text | MAVIS |
| M4-RAG | 80,000 culturally diverse image-question pairs for evaluating retrieval-augmented VQA across languages and modalities. | Image, Text | M4-RAG |
| LAION-400M | 200M image–text pairs; used for pre-training multimodal models. | Image, Text | LAION-400M |
| Conceptual-Captions (CC) | 15M image–caption pairs; multilingual English–German image descriptions. | Image, Text | Conceptual Captions |
| CIRR | 36,554 triplets from 21,552 images; focuses on natural image relationships. | Image, Text | CIRR |
| MS-COCO | 330K images with captions; used for caption-to-image and image-to-caption generation. | Image, Text | MS-COCO |
| Flickr30K | 31K images annotated with five English captions per image. | Image, Text | Flickr30K |
| Multi30K | 30K German captions from native speakers and human-translated captions. | Image, Text | Multi30K |
| NoCaps | For zero-shot image captioning evaluation; 15K images. | Image, Text | NoCaps |
| LAION-5B | 5B image–text pairs used as external memory for retrieval. | Image, Text | LAION-5B |
| COCO-CN | 20,341 images for cross-lingual tagging and captioning with Chinese sentences. | Image, Text | COCO-CN |
| CIRCO | 1,020 queries with an average of 4.53 ground truths per query; for composed image retrieval. | Image, Text | CIRCO |

Video-Text

| Name | Statistics and Description | Modalities | Link |
|---|---|---|---|
| BDD-X | 77 hours of driving videos with expert textual explanations; for explainable driving behavior. | Video, Text | BDD-X |
| YouCook2 | 2,000 cooking videos with aligned descriptions; focused on video–text tasks. | Video, Text | YouCook2 |
| ActivityNet | 20,000 videos with multiple captions; used for video understanding and captioning. | Video, Text | ActivityNet |
| SoccerNet | Videos and metadata for 550 soccer games; includes transcribed commentary and key event annotations. | Video, Text | SoccerNet |
| MSR-VTT | 10,000 videos with 20 captions each; a large video description dataset. | Video, Text | MSR-VTT |
| MSVD | 1,970 videos with approximately 40 captions per video. | Video, Text | MSVD |
| LSMDC | 118,081 video–text pairs from 202 movies; a movie description dataset. | Video, Text | LSMDC |
| DiDemo | 10,000 videos with four concatenated captions per video; with temporal localization of events. | Video, Text | DiDemo |
| Breakfast | 1,712 videos of breakfast preparation; one of the largest fully annotated video datasets. | Video, Text | Breakfast |
| COIN | 11,827 instructional YouTube videos across 180 tasks; for comprehensive instructional video analysis. | Video, Text | COIN |
| MSRVTT-QA | Video question answering benchmark. | Video, Text | MSRVTT-QA |
| MSVD-QA | 1,970 video clips with approximately 50.5K QA pairs; video QA dataset. | Video, Text | MSVD-QA |
| ActivityNet-QA | 58,000 human-annotated QA pairs on 5,800 videos; benchmark for video QA models. | Video, Text | ActivityNet-QA |
| EpicKitchens-100 | 700 videos (100 hours of cooking activities) for online action prediction; egocentric vision dataset. | Video, Text | EPIC-KITCHENS-100 |
| Ego4D | 4.3M video–text pairs for egocentric videos; massive-scale egocentric video dataset. | Video, Text | Ego4D |
| HowTo100M | 136M video clips with captions from 1.2M YouTube videos; for learning text–video embeddings. | Video, Text | HowTo100M |
| CharadesEgo | 68,536 activity instances from ego–exo videos; used for evaluation. | Video, Text | Charades-Ego |
| ActivityNet Captions | 20K videos with 3.7 temporally localized sentences per video; dense-captioning events in videos. | Video, Text | ActivityNet Captions |
| VATEX | 34,991 videos, each with multiple captions; a multilingual video-and-language dataset. | Video, Text | VATEX |
| Charades | 9,848 video clips with textual descriptions; a multimodal research dataset. | Video, Text | Charades |
| WebVid | 10M video–text pairs (refined to WebVid-Refined-1M). | Video, Text | WebVid |
| Youku-mPLUG | Chinese dataset with 10M video–text pairs (refined to Youku-Refined-1M). | Video, Text | Youku-mPLUG |

Audio-Text

| Name | Statistics and Description | Modalities | Link |
|---|---|---|---|
| LibriSpeech | 1,000 hours of read English speech with corresponding text; ASR corpus based on audiobooks. | Audio, Text | LibriSpeech |
| SpeechBrown | 55K paired speech-text samples; 15 categories covering diverse topics from religion to fiction. | Audio, Text | SpeechBrown |
| AudioCaps | 46K audio clips paired with human-written text captions. | Audio, Text | AudioCaps |
| AudioSet | 2M human-labeled sound clips from YouTube across diverse audio event classes (e.g., music or environmental). | Audio | AudioSet |

Medical

| Name | Statistics and Description | Modalities | Link |
|---|---|---|---|
| MIMIC-CXR | 125,417 labeled chest X-rays with reports; widely used for medical imaging research. | Image, Text | MIMIC-CXR |
| CheXpert | 224,316 chest radiographs of 65,240 patients; focused on medical analysis. | Image, Text | CheXpert |
| MIMIC-III | Health-related data from over 40K patients; includes clinical notes and structured data. | Text | MIMIC-III |
| IU-Xray | 7,470 pairs of chest X-rays and corresponding diagnostic reports. | Image, Text | IU-Xray |
| PubLayNet | 100,000 training samples and 2,160 test samples built from PubLayNet for document layout analysis. | Image, Text | PubLayNet |

Fashion

| Name | Statistics and Description | Modalities | Link |
|---|---|---|---|
| Fashion-IQ | 77,684 images across three categories; evaluated with Recall@10 and Recall@50 metrics. | Image, Text | Fashion-IQ |
| FashionGen | 260.5K image–text pairs of fashion images and item descriptions. | Image, Text | FashionGen |
| VITON-HD | 83K images for virtual try-on; high-resolution clothing items dataset. | Image, Text | VITON-HD |
| Fashionpedia | 48,000 fashion images annotated with segmentation masks and fine-grained attributes. | Image, Text | Fashionpedia |
| DeepFashion | Approximately 800K diverse fashion images for pseudo triplet generation. | Image, Text | DeepFashion |
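
The Fashion-IQ entry above is evaluated with Recall@10 and Recall@50. As a rough illustration only, the sketch below computes Recall@K under the assumption that each query has exactly one ground-truth target and that the retriever returns a ranked list of item IDs. The function name `recall_at_k` and the toy IDs are hypothetical and are not taken from any benchmark toolkit.

```python
from typing import Sequence


def recall_at_k(
    ranked_ids: Sequence[Sequence[str]],
    target_ids: Sequence[str],
    k: int,
) -> float:
    """Fraction of queries whose single target appears in the top-k results."""
    if not target_ids:
        return 0.0
    # A query counts as a hit when its target is among the first k ranked IDs.
    hits = sum(
        target in ranked[:k] for ranked, target in zip(ranked_ids, target_ids)
    )
    return hits / len(target_ids)


# Toy data (hypothetical): two queries, three retrieved items each.
rankings = [["a", "b", "c"], ["x", "y", "z"]]
targets = ["b", "z"]

recall_at_2 = recall_at_k(rankings, targets, k=2)
recall_at_3 = recall_at_k(rankings, targets, k=3)
```

With this toy data, only the first query finds its target within the top two results, while both do within the top three. Benchmarks with several ground truths per query, such as CIRCO, would need a different definition.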

QA

| Name | Statistics and Description | Modalities | Link |
|---|---|---|---|
| VQA | 400K QA pairs with images for visual question-answering tasks. | Image, Text | VQA |
| PAQ | 65M text-based QA pairs; a large-scale dataset for open-domain QA tasks. | Text | PAQ |
| ELI5 | 270K complex questions augmented with web pages and images; designed for long-form QA tasks. | Text | ELI5 |
| OK-VQA | 14K questions requiring external knowledge for visual question answering tasks. | Image, Text | OK-VQA |
| WebQA | 46K queries requiring reasoning across text and images; multimodal QA dataset. | Text, Image | WebQA |
| Infoseek | Fine-grained visual knowledge retrieval using a Wikipedia-based knowledge base (~6M passages). | Image, Text | Infoseek |
| ClueWeb22 | 10 billion web pages organized into subsets; a large-scale web corpus for retrieval tasks. | Text | ClueWeb22 |
| MOCHEG | 15,601 claims annotated with truthfulness labels and accompanied by textual and image evidence. | Text, Image | MOCHEG |
| VQA v2 | 1.1M questions (augmented with VG-QA questions) for fine-tuning VQA models. | Image, Text | VQA v2 |
| A-OKVQA | Benchmark for visual question answering using world knowledge; around 25K questions. | Image, Text | A-OKVQA |
| XL-HeadTags | 415K news headline-article pairs spanning 20 languages across six diverse language families. | Text | XL-HeadTags |
| SEED-Bench | 19K multiple-choice questions with accurate human annotations across 12 evaluation dimensions. | Text | SEED-Bench |

Other

| Name | Statistics and Description | Modalities | Link |
|---|---|---|---|
| ImageNet | 14M labeled images across thousands of categories; used as a benchmark in computer vision research. | Image | ImageNet |
| Oxford Flowers102 | Dataset of flowers with 102 categories for fine-grained image classification tasks. | Image | Oxford Flowers102 |
| Stanford Cars | Images of different car models (five examples per model); used for fine-grained categorization tasks. | Image | Stanford Cars |
| GeoDE | 61,940 images from 40 classes across six world regions; emphasizes geographic diversity in object recognition. | Image | GeoDE |

Papers

Retrieval Strategies Advances

Efficient-Search and Similarity Retrieval

Maximum Inner Product Search-MIPS
Multi-Modal Encoders

Modality-Centric Retrieval

Text-Centric
Vision-Centric
Video-Centric
Audio-Centric
Document Retrieval and Layout Understanding

Re-ranking Strategies

Optimized Example Selection
Relevance Score Evaluation
Filtering Mechanisms

Fusion Mechanisms

Score Fusion and Alignment

Attention-Based Mechanisms

Unified Frameworks

Augmentation Techniques

Context-Enrichment

Adaptive and Iterative Retrieval

Generation Techniques

In-Context Learning

Reasoning

Instruction Tuning

Source Attribution and Evidence Transparency

Agentic Generation and Interaction

Training Strategies and Loss Function

Robustness and Noise Management

Tasks Addressed by Multimodal RAGs

Healthcare and Medicine

Software Engineering

Fashion and E-Commerce

Entertainment and Social Computing

Emerging Applications

Evaluation Metrics

Retrieval Performance

min(+P, Se): the minimum of precision (+P) and sensitivity (Se), providing a balanced measure of model performance.
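
As a small illustration of the metric described above, the sketch below takes the minimum of precision (+P) and sensitivity (Se) computed from confusion-matrix counts. The function name `min_precision_sensitivity`, the zero-division convention, and the example counts are assumptions made for this sketch, not part of the survey.

```python
def min_precision_sensitivity(tp: int, fp: int, fn: int) -> float:
    """Return min(+P, Se) from true-positive, false-positive, and false-negative counts."""
    # +P (precision) = TP / (TP + FP); defined as 0.0 here when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Se (sensitivity, i.e. recall) = TP / (TP + FN); defined as 0.0 here when there are no positives.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return min(precision, sensitivity)


# Hypothetical counts: +P = 80 / 100, Se = 80 / 90, so the score is limited by +P.
score = min_precision_sensitivity(tp=80, fp=20, fn=10)
```

Because the score is capped by the weaker of the two quantities, a system cannot raise it by trading precision for sensitivity or the reverse.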

Fluency and Readability

Relevance and Accuracy

Text Similarity and Overlap Metrics

Statistical Metrics

Efficiency and Computational Performance

Domain-Specific Metrics


This README is a work in progress and will be completed soon. Stay tuned for more updates!


Citations

If you find our paper or repository useful, please cite the paper:

@misc{abootorabi2025askmodalitycomprehensivesurvey,
      title={Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation}, 
      author={Mohammad Mahdi Abootorabi and Amirhosein Zobeiri and Mahdi Dehghani and Mohammadali Mohammadkhani and Bardia Mohammadi and Omid Ghahroodi and Mahdieh Soleymani Baghshah and Ehsaneddin Asgari},
      year={2025},
      eprint={2502.08826},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.08826}, 
}

Contact

If you have questions, please send an email to mahdi.abootorabi2@gmail.com.
