Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

February 20, 2026

This repository is designed to collect and categorize papers related to Multimodal Retrieval-Augmented Generation (RAG) according to our survey paper: Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation. Given the rapid growth in this field, we will continuously update both the paper and this repository to serve as a resource for researchers working on future projects.

📢 News

January 9, 2026: Happy New Year! We’ve added some new papers in the field.
September 19, 2025: We've just added new papers to our repository.
August 20, 2025: The poster and slides for this survey paper have been added to the repository.
August 1, 2025: We've just added new papers to our repository; a major update!
June 2, 2025: A new enhanced version of our paper is out now on arXiv! This update also includes new related papers and covers new topics such as agentic interaction and audio-centric retrieval.
May 15, 2025: This paper has been accepted for publication in the Findings of ACL 2025.
April 18, 2025: Our website for this topic is up now.
February 17, 2025: We release the first survey for Multimodal Retrieval-Augmented Generation. Feel free to cite, contribute, or open a pull request to add recent related papers!

📑 List of Contents

🔎 General Pipeline
🌿 Taxonomy of Recent Advances and Enhancements
⚙ Taxonomy of Application Domains
📝 Abstract
📊 Overview of Popular Datasets
📄 Papers
🔗 Citations
📧 Contact

🔎 General Pipeline

[Figure: General pipeline of Multimodal RAG]

🌿 Taxonomy of Recent Advances and Enhancements

[Figure: Taxonomy of recent advances and enhancements in Multimodal RAG]

⚙ Taxonomy of Application Domains

[Figure: Taxonomy of application domains]

📝 Abstract

Large Language Models (LLMs) suffer from hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external dynamic information for improved factual grounding. With advances in multimodal learning, Multimodal RAG extends this approach by incorporating multiple modalities such as text, images, audio, and video to enhance the generated outputs. However, cross-modal alignment and reasoning introduce unique challenges beyond those in unimodal RAG.

This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, benchmarks, metrics, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We review in detail training strategies, robustness enhancements, loss functions, and agent-based approaches, while also exploring diverse Multimodal RAG scenarios. In addition, we outline open challenges and future directions to guide research in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal dynamic external knowledge bases.
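
The retrieve-then-augment flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the survey's method: it assumes every item has already been embedded into one shared vector space, and the `Document` class, the toy embeddings, and the prompt template are all hypothetical. The generation step is omitted because it depends on the chosen model.

```python
import math
from dataclasses import dataclass


@dataclass
class Document:
    """One entry in a toy multimodal knowledge base (hypothetical structure)."""
    modality: str           # e.g. "text", "image", "audio"
    content: str            # text, or a caption/file name standing in for raw media
    embedding: list[float]  # vector in a shared embedding space (assumed precomputed)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, store, k=2):
    """Return the k documents most similar to the query, regardless of modality."""
    ranked = sorted(store, key=lambda d: cosine(query_embedding, d.embedding), reverse=True)
    return ranked[:k]


def build_prompt(question, retrieved):
    """Augmentation step: prepend retrieved items, tagged by modality, to the question."""
    context = "\n".join(f"[{d.modality}] {d.content}" for d in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Toy knowledge base with made-up entries and hand-written embeddings.
store = [
    Document("text", "Catalog note: the dress is red with long sleeves.", [0.9, 0.1, 0.0]),
    Document("image", "item_042.jpg (caption: red dress on a mannequin)", [0.8, 0.3, 0.1]),
    Document("audio", "clip_017.wav (caption: heavy rain)", [0.0, 0.1, 0.9]),
]

query_embedding = [1.0, 0.0, 0.0]  # stands in for an encoded user question
top = retrieve(query_embedding, store, k=2)
prompt = build_prompt("What color is the dress?", top)
print(prompt)
```

In a real system the embeddings would come from a multimodal encoder and the prompt would be passed to a generator; both are outside the scope of this sketch.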

📊 Overview of Popular Datasets

🖼 Image-Text

Name	Statistics and Description	Modalities	Link
MAVIS	157K visual QA instances, where each answer is annotated with fact-level citations referring to multimodal documents	Image, Text	MAVIS
M4-RAG	80,000 culturally diverse image-question pairs for evaluating retrieval-augmented VQA across languages and modalities	Image, Text	M4-RAG
LAION-400M	400M image–text pairs; used for pre-training multimodal models.	Image, Text	LAION-400M
Conceptual-Captions (CC)	15M image–caption pairs; multilingual English–German image descriptions.	Image, Text	Conceptual Captions
CIRR	36,554 triplets from 21,552 images; focuses on natural image relationships.	Image, Text	CIRR
MS-COCO	330K images with captions; used for caption-to-image and image-to-caption generation.	Image, Text	MS-COCO
Flickr30K	31K images annotated with five English captions per image.	Image, Text	Flickr30K
Multi30K	30K German captions from native speakers and human-translated captions.	Image, Text	Multi30K
NoCaps	For zero-shot image captioning evaluation; 15K images.	Image, Text	NoCaps
LAION-5B	5B image–text pairs used as external memory for retrieval.	Image, Text	LAION-5B
COCO-CN	20,341 images for cross-lingual tagging and captioning with Chinese sentences.	Image, Text	COCO-CN
CIRCO	1,020 queries with an average of 4.53 ground truths per query; for composed image retrieval.	Image, Text	CIRCO

🎞 Video-Text

Name	Statistics and Description	Modalities	Link
BDD-X	77 hours of driving videos with expert textual explanations; for explainable driving behavior.	Video, Text	BDD-X
YouCook2	2,000 cooking videos with aligned descriptions; focused on video–text tasks.	Video, Text	YouCook2
ActivityNet	20,000 videos with multiple captions; used for video understanding and captioning.	Video, Text	ActivityNet
SoccerNet	Videos and metadata for 550 soccer games; includes transcribed commentary and key event annotations.	Video, Text	SoccerNet
MSR-VTT	10,000 videos with 20 captions each; a large video description dataset.	Video, Text	MSR-VTT
MSVD	1,970 videos with approximately 40 captions per video.	Video, Text	MSVD
LSMDC	118,081 video–text pairs from 202 movies; a movie description dataset.	Video, Text	LSMDC
DiDemo	10,000 videos with four concatenated captions per video; with temporal localization of events.	Video, Text	DiDemo
Breakfast	1,712 videos of breakfast preparation; one of the largest fully annotated video datasets.	Video, Text	Breakfast
COIN	11,827 instructional YouTube videos across 180 tasks; for comprehensive instructional video analysis.	Video, Text	COIN
MSRVTT-QA	Video question answering benchmark.	Video, Text	MSRVTT-QA
MSVD-QA	1,970 video clips with approximately 50.5K QA pairs; video QA dataset.	Video, Text	MSVD-QA
ActivityNet-QA	58,000 human-annotated QA pairs on 5,800 videos; benchmark for video QA models.	Video, Text	ActivityNet-QA
EpicKitchens-100	700 videos (100 hours of cooking activities) for online action prediction; egocentric vision dataset.	Video, Text	EPIC-KITCHENS-100
Ego4D	4.3M video–text pairs for egocentric videos; massive-scale egocentric video dataset.	Video, Text	Ego4D
HowTo100M	136M video clips with captions from 1.2M YouTube videos; for learning text–video embeddings.	Video, Text	HowTo100M
CharadesEgo	68,536 activity instances from ego–exo videos; used for evaluation.	Video, Text	Charades-Ego
ActivityNet Captions	20K videos with 3.7 temporally localized sentences per video; dense-captioning events in videos.	Video, Text	ActivityNet Captions
VATEX	34,991 videos, each with multiple captions; a multilingual video-and-language dataset.	Video, Text	VATEX
Charades	9,848 video clips with textual descriptions; a multimodal research dataset.	Video, Text	Charades
WebVid	10M video–text pairs (refined to WebVid-Refined-1M).	Video, Text	WebVid
Youku-mPLUG	Chinese dataset with 10M video–text pairs (refined to Youku-Refined-1M).	Video, Text	Youku-mPLUG

🔊 Audio-Text

Name	Statistics and Description	Modalities	Link
LibriSpeech	1,000 hours of read English speech with corresponding text; ASR corpus based on audiobooks.	Audio, Text	LibriSpeech
SpeechBrown	55K paired speech-text samples; 15 categories covering diverse topics from religion to fiction.	Audio, Text	SpeechBrown
AudioCaps	46K audio clips paired with human-written text captions.	Audio, Text	AudioCaps
AudioSet	2M human-labeled sound clips from YouTube across diverse audio event classes (e.g., music or environmental).	Audio	AudioSet

🩺 Medical

Name	Statistics and Description	Modalities	Link
MIMIC-CXR	125,417 labeled chest X-rays with reports; widely used for medical imaging research.	Image, Text	MIMIC-CXR
CheXpert	224,316 chest radiographs of 65,240 patients; focused on medical analysis.	Image, Text	CheXpert
MIMIC-III	Health-related data from over 40K patients; includes clinical notes and structured data.	Text	MIMIC-III
IU-Xray	7,470 pairs of chest X-rays and corresponding diagnostic reports.	Image, Text	IU-Xray
PubLayNet	100,000 training samples and 2,160 test samples built from PubLayNet for document layout analysis.	Image, Text	PubLayNet

👗 Fashion

Name	Statistics and Description	Modalities	Link
Fashion-IQ	77,684 images across three categories; evaluated with Recall@10 and Recall@50 metrics.	Image, Text	Fashion-IQ
FashionGen	260.5K image–text pairs of fashion images and item descriptions.	Image, Text	FashionGen
VITON-HD	83K images for virtual try-on; high-resolution clothing items dataset.	Image, Text	VITON-HD
Fashionpedia	48,000 fashion images annotated with segmentation masks and fine-grained attributes.	Image, Text	Fashionpedia
DeepFashion	Approximately 800K diverse fashion images for pseudo triplet generation.	Image, Text	DeepFashion
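
The Fashion-IQ row above mentions Recall@10 and Recall@50. As a rough illustration, Recall@K can be computed as below. This sketch uses one common definition (the fraction of a query's relevant items found in the top K results, averaged over queries); the function names are hypothetical, and a benchmark's official evaluation script may differ in details. When each query has a single target image, the per-query score reduces to 1 if the target is in the top K and 0 otherwise.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear among the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = relevant & set(ranked_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(rankings, ground_truths, k):
    """Average Recall@K over a list of queries."""
    scores = [recall_at_k(r, g, k) for r, g in zip(rankings, ground_truths)]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example: two queries, each with one target item.
rankings = [["a", "b", "c"], ["x", "y", "z"]]
ground_truths = [{"b"}, {"z"}]

for k in (1, 2, 3):
    print(k, mean_recall_at_k(rankings, ground_truths, k))
```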

💡 QA

Name	Statistics and Description	Modalities	Link
VQA	400K QA pairs with images for visual question-answering tasks.	Image, Text	VQA
PAQ	65M text-based QA pairs; a large-scale dataset for open-domain QA tasks.	Text	PAQ
ELI5	270K complex questions augmented with web pages and images; designed for long-form QA tasks.	Text	ELI5
OK-VQA	14K questions requiring external knowledge for visual question answering tasks.	Image, Text	OK-VQA
WebQA	46K queries requiring reasoning across text and images; multimodal QA dataset.	Text, Image	WebQA
Infoseek	Fine-grained visual knowledge retrieval using a Wikipedia-based knowledge base (~6M passages).	Image, Text	Infoseek
ClueWeb22	10 billion web pages organized into subsets; a large-scale web corpus for retrieval tasks.	Text	ClueWeb22
MOCHEG	15,601 claims annotated with truthfulness labels and accompanied by textual and image evidence.	Text, Image	MOCHEG
VQA v2	1.1M questions (augmented with VG-QA questions) for fine-tuning VQA models.	Image, Text	VQA v2
A-OKVQA	Benchmark for visual question answering using world knowledge; around 25K questions.	Image, Text	A-OKVQA
XL-HeadTags	415K news headline-article pairs spanning 20 languages across six diverse language families.	Text	XL-HeadTags
SEED-Bench	19K multiple-choice questions with accurate human annotations across 12 evaluation dimensions.	Text	SEED-Bench

🌎 Other

Name	Statistics and Description	Modalities	Link
ImageNet	14M labeled images across thousands of categories; used as a benchmark in computer vision research.	Image	ImageNet
Oxford Flowers102	Dataset of flowers with 102 categories for fine-grained image classification tasks.	Image	Oxford Flowers102
Stanford Cars	Images of different car models (five examples per model); used for fine-grained categorization tasks.	Image	Stanford Cars
GeoDE	61,940 images from 40 classes across six world regions; emphasizes geographic diversity in object recognition.	Image	GeoDE

[GenderBias-VL: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing](https://arxiv.org/pdf/2503.13563)
MES-RAG: Bringing Multi-modal, Entity-Storage, and Secure Enhancements to RAG
AlzheimerRAG: Multimodal Retrieval Augmented Generation for PubMed articles
Quantifying the Gaps Between Translation and Native Perception in Training for Multimodal, Multilingual Retrieval
RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training
MORE: Multi-mOdal REtrieval Augmented Generative Commonsense Reasoning
RAGTrans: Retrieval-Augmented Hypergraph for Multimodal Social Media Popularity Prediction
MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training (RagVL)
RA-CM3: Retrieval-Augmented Multimodal Language Modeling
WFGY Problem Map: Semantic Firewall for RAG Failure Modes

Fluency (FL):

✅ Relevance and Accuracy

Accuracy:

Fréchet Inception Distance (FID), CLIP Score, Kernel Inception Distance (KID), and Inception Score (IS):
Consensus-Based Image Description Evaluation (CIDEr):
SPICE:
- UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models
SPIDEr:
- C3Net: Compound Conditioned ControlNet for Multimodal Content Generation

Fréchet Audio Distance (FAD), Overall Quality (OVL), and Text Relevance (REL):
- C3Net: Compound Conditioned ControlNet for Multimodal Content Generation
- AudioGen: Textually Guided Audio Generation
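
For reference, FID and FAD listed above are both built on the Fréchet distance between two Gaussians fitted to embeddings of real and generated samples; they differ mainly in the embedding model (image features for FID, audio features for FAD). The notation below is chosen here for illustration and is not taken from the survey: $\mu_r, \Sigma_r$ are the mean and covariance of the real-sample embeddings, and $\mu_g, \Sigma_g$ those of the generated samples.

```latex
d^2\big((\mu_r, \Sigma_r), (\mu_g, \Sigma_g)\big)
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values indicate that the generated distribution is closer to the real one; the distance is zero when both means and covariances coincide.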

🔗 Text Similarity and Overlap Metrics

This README is a work in progress and will be completed soon. Stay tuned for more updates!

🔗 Citations

If you find our paper or repository useful, please cite the paper:

@misc{abootorabi2025askmodalitycomprehensivesurvey,
      title={Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation}, 
      author={Mohammad Mahdi Abootorabi and Amirhosein Zobeiri and Mahdi Dehghani and Mohammadali Mohammadkhani and Bardia Mohammadi and Omid Ghahroodi and Mahdieh Soleymani Baghshah and Ehsaneddin Asgari},
      year={2025},
      eprint={2502.08826},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.08826}, 
}

📧 Contact

If you have questions, please send an email to mahdi.abootorabi2@gmail.com.