A curated list of Visual Language Models papers and resources for Earth Observation (VLM4EO) <https://github.com/geoaigroup/awesome-vision-language-models-for-earth-observation/>

March 23, 2025

This list is created and maintained by Ali Koteich and Hasan Moughnieh from the GEOspatial Artificial Intelligence (GEOAI) research group at the National Center for Remote Sensing - CNRS, Lebanon.

We encourage you to contribute to this project according to the following guidelines.

---

If you find this repository useful, please consider giving it a ⭐


Foundation Models

Year	Title	Paper	Code
2025	Beyond the Visible: Multispectral Vision-Language Learning for Earth Observation	paper
2024	EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain	paper
2024	RemoteCLIP: A Vision Language Foundation Model for Remote Sensing	paper	code
2024	Remote Sensing ChatGPT: Solving Remote Sensing Tasks with ChatGPT and Visual Models	paper	code
2024	SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model	paper	code
2024	VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis	paper	code
2023	GeoChat: Grounded Large Vision-Language Model for Remote Sensing	paper	code
2023	Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment	paper

Image Captioning

Year	Title	Paper	Code	Venue
2024	A Lightweight Transformer for Remote Sensing Image Change Captioning	paper	code
2024	RSCaMa: Remote Sensing Image Change Captioning with State Space Model	paper	code
2023	Captioning Remote Sensing Images Using Transformer Architecture	paper		International Conference on Artificial Intelligence in Information and Communication
2023	Multi-Source Interactive Stair Attention for Remote Sensing Image Captioning	paper		MDPI Remote Sensing
2023	Progressive Scale-aware Network for Remote sensing Image Change Captioning	paper
2023	Towards Unsupervised Remote Sensing Image Captioning and Retrieval with Pre-Trained Language Models	paper		Proceedings of the Japanese Association for Natural Language Processing
2022	A Joint-Training Two-Stage Method for Remote Sensing Image Captioning	paper		IEEE TGRS
2022	A Mask-Guided Transformer Network with Topic Token for Remote Sensing Image Captioning	paper		MDPI Remote Sensing
2022	Change Captioning: A New Paradigm for Multitemporal Remote Sensing Image Analysis	paper		IEEE TGRS
2022	Exploring Transformer and Multilabel Classification for Remote Sensing Image Captioning	paper	code	IEEE GRSL
2022	Generating the captions for remote sensing images: A spatial-channel attention based memory-guided transformer approach	paper	code	Engineering Applications of Artificial Intelligence
2022	Global Visual Feature and Linguistic State Guided Attention for Remote Sensing Image Captioning	paper		IEEE TGRS
2022	High-Resolution Remote Sensing Image Captioning Based on Structured Attention	paper		IEEE TGRS
2022	Meta captioning: A meta learning based remote sensing image captioning framework	paper	code	Elsevier PHOTO
2022	Multiscale Multiinteraction Network for Remote Sensing Image Captioning	paper		IEEE JSTARS
2022	NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning	paper	code	IEEE TGRS
2022	Recurrent Attention and Semantic Gate for Remote Sensing Image Captioning	paper		IEEE TGRS
2022	Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset	paper		IEEE TGRS
2022	Transforming remote sensing images to textual descriptions	paper		Int J Appl Earth Obs Geoinf
2022	Using Neural Encoder-Decoder Models with Continuous Outputs for Remote Sensing Image Captioning	paper		IEEE Access
2021	A Novel SVM-Based Decoder for Remote Sensing Image Captioning	paper		IEEE TGRS
2021	SD-RSIC: Summarization Driven Deep Remote Sensing Image Captioning	paper	code	IEEE TGRS
2021	Truncation Cross Entropy Loss for Remote Sensing Image Captioning	paper		IEEE TGRS
2021	Word-Sentence Framework for Remote Sensing Image Captioning	paper		IEEE TGRS
2020	A multi-level attention model for remote sensing image captions	paper		MDPI Remote Sensing
2020	Remote sensing image captioning via Variational Autoencoder and Reinforcement Learning	paper		Elsevier Knowledge-Based Systems
2020	Toward Remote Sensing Image Retrieval Under a Deep Image Captioning Perspective	paper		IEEE JSTARS
2019	LAM: Remote sensing image captioning with attention-based language model	paper		IEEE TGRS
2019	Learning to Caption Remote Sensing Images by Geospatial Feature Driven Attention Mechanism	paper		IEEE JSTARS
2019	Remote Sensing Image Captioning by Deep Reinforcement Learning with Geospatial Features	paper		IEEE TGRS

Text-Image Retrieval

Year	Title	Paper	Code	Venue
2024	Composed Image Retrieval for Remote Sensing	paper	code
2024	Multi-Spectral Remote Sensing Image Retrieval using Geospatial Foundation Models	paper	code
2024	Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval	paper	code
2023	A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval	paper	code	ACM MM 2023 (Oral)
2023	A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text–Image Retrieval in Remote Sensing	paper		MDPI Remote Sensing
2023	An End-to-End Framework Based on Vision-Language Fusion for Remote Sensing Cross-Modal Text-Image Retrieval	paper		MDPI Mathematics
2023	Contrasting Dual Transformer Architectures for Multi-Modal Remote Sensing Image Retrieval	paper		MDPI Applied Sciences
2023	Hypersphere-Based Remote Sensing Cross-Modal Text–Image Retrieval via Curriculum Learning	paper	code	IEEE TGRS
2023	Parameter-Efficient Transfer Learning for Remote Sensing Image-Text Retrieval	paper		IEEE TGRS
2023	Reducing Semantic Confusion: Scene-aware Aggregation Network for Remote Sensing Cross-modal Retrieval	paper	code	ICMR'23
2022	A Lightweight Multi-Scale Crossmodal Text-Image Retrieval Method in Remote Sensing	paper	code	IEEE TGRS
2022	An Unsupervised Cross-Modal Hashing Method Robust to Noisy Training Image-Text Correspondences in Remote Sensing	paper	code	IEEE ICIP
2022	CLIP-RS: A Cross-modal Remote Sensing Image Retrieval Based on CLIP, a Northern Virginia Case Study	paper		Virginia Polytechnic Institute and State University
2022	Knowledge-Aware Cross-Modal Text-Image Retrieval for Remote Sensing Images	paper
2022	MCRN: A Multi-source Cross-modal Retrieval Network for remote sensing	paper	code	Int J Appl Earth Obs Geoinf
2022	Multilanguage Transformer for Improved Text to Remote Sensing Image Retrieval	paper		IEEE JSTARS
2022	Multisource Data Reconstruction-Based Deep Unsupervised Hashing for Unisource Remote Sensing Image Retrieval	paper	code	IEEE TGRS
2022	Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information	paper	code	IEEE TGRS
2022	Unsupervised Contrastive Hashing for Cross-Modal Retrieval in Remote Sensing	paper	code	IEEE ICASSP
2021	Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval	paper	code	IEEE TGRS
2020	Deep unsupervised embedding for remote sensing image retrieval using textual cues	paper		MDPI Applied Sciences
2020	TextRS: Deep bidirectional triplet network for matching text to remote sensing images	paper		MDPI Remote Sensing
2020	Toward Remote Sensing Image Retrieval under a Deep Image Captioning Perspective	paper		IEEE JSTARS

Visual Grounding

Year	Title	Paper	Code	Venue
2024	GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding	paper	code
2023	LaLGA: Multi-Scale Language-Aware Visual Grounding on Remote Sensing Data	paper	code
2023	Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Visual Foundation Models	paper	code
2022	RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data	paper	code	IEEE TGRS
2022	Visual Grounding in Remote Sensing Images	paper		ACM MM

Visual Question Answering

Year	Title	Paper	Code	Venue
2023	A Spatial Hierarchical Reasoning Network for Remote Sensing Visual Question Answering	paper		IEEE TGRS
2023	EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering	paper	code	AAAI 2024
2023	LIT-4-RSVQA: Lightweight Transformer-based Visual Question Answering in Remote Sensing	paper	code	IEEE IGARSS
2023	Multistep Question-Driven Visual Question Answering for Remote Sensing	paper	code	IEEE TGRS
2023	RSGPT: A Remote Sensing Vision Language Model and Benchmark	paper	code
2023	RSAdapter: Adapting Multimodal Models for Remote Sensing Visual Question Answering	paper	code
2022	Bi-Modal Transformer-Based Approach for Visual Question Answering in Remote Sensing Imagery	paper		IEEE TGRS
2022	Change Detection Meets Visual Question Answering	paper	code	IEEE TGRS
2022	From Easy to Hard: Learning Language-guided Curriculum for Visual Question Answering on Remote Sensing Data	paper	code	IEEE TGRS
2022	Language Transformers for Remote Sensing Visual Question Answering	paper		IEEE IGARSS
2022	Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing	paper	code	SPIE Image and Signal Processing for Remote Sensing
2022	Mutual Attention Inception Network for Remote Sensing Visual Question Answering	paper	code	IEEE TGRS
2022	Prompt-RSVQA: Prompting visual context to a language model for Remote Sensing Visual Question Answering	paper		CVPRW
2021	How to find a good image-text embedding for remote sensing visual question answering?	paper		CEUR Workshop Proceedings
2021	RSVQA meets BigEarthNet: a new, large-scale, visual question answering dataset for remote sensing	paper	code	IEEE IGARSS
2020	RSVQA: Visual Question Answering for Remote Sensing Data	paper	code	IEEE TGRS

Vision-Language Remote Sensing Datasets

Name	Link	Paper Link	Description
RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model	Link	Paper Link	Size: 5 million remote sensing images with English descriptions; Resolution: 256 x 256; Platforms: 11 publicly available image-text paired datasets
SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing	Link	Paper Link	Size: 5.2 million remote sensing image-text pairs in total, covering more than 29K distinct semantic tags
Remote Sensing Visual Question Answering Low Resolution Dataset (RSVQA LR)	Link	Paper Link	Size: 772 images and 77,232 questions and answers; Resolution: 256 x 256; Platforms: Sentinel-2 and Open Street Map; Use: Remote Sensing Visual Question Answering
Remote Sensing Visual Question Answering High Resolution Dataset (RSVQA HR)	Link	Paper Link	Size: 10,659 images and 955,664 questions and answers; Resolution: 512 x 512; Platforms: USGS and Open Street Map; Use: Remote Sensing Visual Question Answering
Remote Sensing Visual Question Answering BigEarthNet Dataset (RSVQA x BEN)	Link	Paper Link	Size: 140,758,150 image/question/answer triplets; Resolution: High-resolution (15cm); Platforms: Sentinel-2, BigEarthNet and Open Street Map; Use: Remote Sensing Visual Question Answering
Remote Sensing Image Visual Question Answering (RSIVQA)	Link	Paper Link	Size: 37,264 images and 111,134 image-question-answer triplets; a small part of RSIVQA is annotated by humans, and the rest is generated automatically from existing scene classification and object detection datasets; Use: Remote Sensing Visual Question Answering
FloodNet Visual Question Answering Dataset	Link	Paper Link	Size: 11,000 question-image pairs; Resolution: 224 x 224; Platforms: UAV-DJI Mavic Pro quadcopters, after Hurricane Harvey; Use: Remote Sensing Visual Question Answering
Change Detection-Based Visual Question Answering Dataset	Link	Paper Link	Size: 2,968 pairs of multitemporal images and more than 122,000 question–answer pairs; Classes: 6; Resolution: 512 x 512 pixels; Platforms: based on the semantic change detection dataset (SECOND); Use: Remote Sensing Visual Question Answering
LAION-EO	Link	Paper Link	Size: 24,933 samples, 40.1% with English captions and the rest in other common languages from LAION-5B; Resolution: mean height of 633.0 pixels (up to 9,999) and mean width of 843.7 pixels (up to 19,687); Platforms: based on LAION-5B
CapERA: Captioning Events in Aerial Videos	Link	Paper Link	Size: 2,864 videos and 14,320 captions, where each video is paired with five unique captions
Remote Sensing Image Captioning Dataset (RSICap)	Link	Paper Link	RSICap comprises 2,585 human-annotated captions with rich and high-quality information. It offers detailed descriptions for each image, encompassing scene descriptions (e.g., residential area, airport, or farmland) as well as object information (e.g., color, shape, quantity, absolute position, etc.)
Remote Sensing Image Captioning Evaluation Dataset (RSIEval)	Link	Paper Link	100 human-annotated captions and 936 visual question-answer pairs with rich information and open-ended questions and answers. Can be used for Image Captioning and Visual Question-Answering tasks
Revised Remote Sensing Image Captioning Dataset (RSICD)	Link	Paper Link	Size: 10,921 images with five captions per image; Number of Classes: 30; Resolution: 224 x 224; Platforms: Google Earth, Baidu Map, MapABC and Tianditu; Use: Remote Sensing Image Captioning
Revised University of California Merced dataset (UCM-Captions)	Link	Paper Link	Size: 2,100 images with five captions per image; Number of Classes: 21; Resolution: 256 x 256; Platforms: USGS National Map Urban Area Imagery collection; Use: Remote Sensing Image Captioning
Revised Sydney-Captions Dataset	Link	Paper Link	Size: 613 images with five captions per image; Number of Classes: 7; Resolution: 500 x 500; Platforms: Google Earth; Use: Remote Sensing Image Captioning
LEVIR-CC dataset	Link	Paper Link	Size: 10,077 pairs of RS images and 50,385 corresponding sentences; Number of Classes: 10; Resolution: 1024 x 1024 pixels; Platforms: Beihang University; Use: Remote Sensing Image Captioning
NWPU-Captions dataset	images_Link, info_Link	Paper Link	Size: 31,500 images with 157,500 sentences; Number of Classes: 45; Resolution: 256 x 256 pixels; Platforms: based on NWPU-RESISC45 dataset; Use: Remote Sensing Image Captioning
Remote sensing Image-Text Match dataset (RSITMD)	Link	Paper Link	Size: 23,715 captions for 4,743 images; Number of Classes: 32; Resolution: 500 x 500; Platforms: RSICD and Google Earth; Use: Remote Sensing Image-Text Retrieval
PatternNet	Link	Paper Link	Size: 30,400 images; Number of Classes: 38; Resolution: 256 x 256; Platforms: Google Earth imagery and the Google Map API; Use: Remote Sensing Image Retrieval
Dense Labeling Remote Sensing Dataset (DLRSD)	Link	Paper Link	Size: 2,100 images; Number of Classes: 21; Resolution: 256 x 256; Platforms: extension of the UC Merced dataset; Use: Remote Sensing Image Retrieval (RSIR), Classification and Semantic Segmentation
Dior-Remote Sensing Visual Grounding Dataset (RSVGD)	Link	Paper Link	Size: 38,320 RS image-query pairs and 17,402 RS images; Number of Classes: 20; Resolution: 800 x 800; Platforms: DIOR dataset; Use: Remote Sensing Visual Grounding
OPT-RSVG Dataset	Link	Paper Link	Size: 25,452 images and 48,952 expressions in English and Chinese; Number of Classes: 14; Resolution: 800 x 800
Visual Grounding in Remote Sensing Images	Link	Paper Link	Size: 4,239 images including 5,994 object instances and 7,933 referring expressions; Resolution: 1024 x 1024 pixels; Platforms: multiple sensors and platforms (e.g. Google Earth)
Remote Sensing Image Scene Classification (NWPU-RESISC45)	Link	Paper Link	Size: 31,500 images; Number of Classes: 45; Resolution: 256 x 256 pixels; Platforms: Google Earth; Use: Remote Sensing Image Scene Classification

---

Stay tuned for continuous updates and improvements! 🚀