# RSGPT: A Remote Sensing Vision Language Model and Benchmark
Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Xiang Li☨
☨ Corresponding author
This is an ongoing project. We are working on increasing the dataset size.
## Related Projects
- **RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model**
  Congcong Wen*, Yiting Lin*, Xiaokang Qu, Nan Li, Yong Liao, Hui Lin, Xiang Li
- **FedRSCLIP: Federated learning for remote sensing scene classification using vision-language models**
  Hui Lin*, Chao Zhang*, Danfeng Hong, Kexin Dong, and Congcong Wen☨
- **RS-MoE: A Vision–Language Model With Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering**
  Hui Lin*, Danfeng Hong*, Shuhang Ge*, Chuyao Luo, Kai Jiang, Hao Jin, and Congcong Wen☨
- **VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding**
  Xiang Li, Jian Ding, Mohamed Elhoseiny
- **Vision-language models in remote sensing: Current progress and future trends**
  Xiang Li*☨, Congcong Wen*, Yuan Hu*, Zhenghang Yuan, Xiao Xiang Zhu
- **RS-CLIP: Zero Shot Remote Sensing Scene Classification via Contrastive Vision-Language Supervision**
  Xiang Li, Congcong Wen, Yuan Hu, Nan Zhou
## :fire: Updates
- [2025.05.08] We release the code for training and testing RSGPT.
- [2024.12.18] We release the manual scoring results for RSIEval.
- [2024.06.19] We release VRSBench, a Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding. VRSBench contains 29,614 images with 29,614 human-verified detailed captions, 52,472 object references, and 123,221 question-answer pairs. Check out the VRSBench Project Page.
- [2024.05.23] We release the RSICap dataset. Please fill out this form to get both the RSICap and RSIEval datasets.
- [2023.11.10] We release RSVLM, our survey of vision-language models in remote sensing.
- [2023.10.22] The RSICap dataset and code will be released upon paper acceptance.
- [2023.10.22] We release the evaluation dataset RSIEval. Please fill out this form to get the RSIEval dataset.
## Dataset
- RSICap: 2,585 image-text pairs with high-quality human-annotated captions.
- RSIEval: 100 high-quality human-annotated captions with 936 open-ended visual question-answer pairs.
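
As a quick orientation, the sketch below shows one way to peek at the RSICap annotations after downloading. The file path and the `filename`/`caption` fields are assumptions for illustration, not the dataset's documented schema; check the released archive for the actual layout.

```python
# Minimal sketch: inspect RSICap-style image-text annotations.
# NOTE: the path and the "filename"/"caption" keys are assumptions,
# not the dataset's documented schema.
import json

with open("RSICap/captions.json") as f:  # hypothetical path
    records = json.load(f)

print(f"{len(records)} image-text pairs")
for rec in records[:3]:
    print(rec["filename"], "->", rec["caption"][:80])
```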
## Code
The fine-tuning recipe for our vision-language model is borrowed from MiniGPT-4. RSGPT is built by fine-tuning InstructBLIP on our RSICap dataset.
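
For context, here is a minimal inference sketch using a stock InstructBLIP checkpoint loaded through LAVIS. This is not RSGPT's own entry point (use `test.py` below for that); the model variant, image path, and prompt are illustrative.

```python
# Minimal sketch: caption an image with stock InstructBLIP via LAVIS.
# This illustrates the base model that RSGPT fine-tunes; it is not the
# RSGPT checkpoint itself.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_vicuna_instruct", model_type="vicuna7b", is_eval=True, device=device
)

image = Image.open("demo.png").convert("RGB")  # hypothetical image
image = vis_processors["eval"](image).unsqueeze(0).to(device)
caption = model.generate({"image": image, "prompt": "Describe this remote sensing image in detail."})
print(caption)
```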
## 🚀 Installation

Set up a conda environment using the provided `environment.yml` file:

```bash
# Step 1: Create the environment
conda env create -f environment.yml

# Step 2: Activate the environment
conda activate rsgpt
```
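
After activating the environment, a quick generic check (not a repo-specific script) confirms that the CUDA build of PyTorch sees your GPUs:

```python
# Generic sanity check: confirm PyTorch detects the GPUs.
import torch

print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
```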
## Training

Training is launched with torchrun; adjust `--nproc_per_node` to match the number of available GPUs:

```bash
torchrun --nproc_per_node=8 train.py --cfg-path train_configs/rsgpt_train.yaml
```
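
Before launching a run, it can help to dump the resolved training configuration. The sketch below assumes the YAML follows the OmegaConf conventions used by LAVIS, which this repository builds on; the exact keys vary by setup.

```python
# Minimal sketch: print the training config before launching torchrun.
# Assumes an OmegaConf/LAVIS-style YAML; actual keys depend on the repo.
from omegaconf import OmegaConf

cfg = OmegaConf.load("train_configs/rsgpt_train.yaml")
print(OmegaConf.to_yaml(cfg))
```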
## Testing

Test image captioning:

```bash
python test.py --cfg-path eval_configs/rsgpt_eval.yaml --gpu-id 0 --out-path rsgpt/output --task ic
```

Test visual question answering:

```bash
python test.py --cfg-path eval_configs/rsgpt_eval.yaml --gpu-id 0 --out-path rsgpt/output --task vqa
```
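
Generated captions can then be scored against RSIEval references with standard captioning metrics via pycocoevalcap. In this sketch the file paths and the id-to-caption JSON layout are assumptions, not the repository's actual output format.

```python
# Minimal sketch: score predicted captions against references with
# pycocoevalcap. Paths and the {image_id: caption(s)} JSON layout are
# assumptions; adapt them to the actual test.py output.
import json

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge

preds = json.load(open("rsgpt/output/ic_predictions.json"))  # hypothetical path
refs = json.load(open("RSIEval/captions.json"))              # hypothetical path

# Both metrics expect dicts mapping an image id to a list of sentences.
res = {k: [v.lower()] for k, v in preds.items()}
gts = {k: [r.lower() for r in v] for k, v in refs.items()}

bleu, _ = Bleu(4).compute_score(gts, res)
rouge, _ = Rouge().compute_score(gts, res)
cider, _ = Cider().compute_score(gts, res)
print(f"BLEU-4 {bleu[3]:.3f} | ROUGE-L {rouge:.3f} | CIDEr {cider:.3f}")
```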
## Licensing Information

Our images are borrowed from the DOTA dataset. All images and their associated annotations in DOTA may be used for academic purposes only; any commercial use is prohibited.
## Acknowledgement

- MiniGPT-4. A popular open-source vision-language model.
- InstructBLIP. The model architecture of RSGPT follows InstructBLIP. Don't forget to check out this great open-source work if you haven't seen it before!
- Lavis. This repository is built upon Lavis!
- Vicuna. The language ability of Vicuna with only 13B parameters is amazing, and it is open-source!
## Citation

If you use RSGPT in your research or applications, please cite it using this BibTeX:
```bibtex
@article{hu2025rsgpt,
  title={RSGPT: A remote sensing vision language model and benchmark},
  author={Hu, Yuan and Yuan, Jianlong and Wen, Congcong and Lu, Xiaonan and Liu, Yu and Li, Xiang},
  journal={ISPRS Journal of Photogrammetry and Remote Sensing},
  volume={224},
  pages={272--286},
  year={2025},
  publisher={Elsevier}
}
```