# RSGPT: A Remote Sensing Vision Language Model and Benchmark
Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Xiang Li☨
☨ Corresponding author
This is an ongoing project. We are working on increasing the dataset size.
## Related Projects
- **RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model**
  Congcong Wen*, Yiting Lin*, Xiaokang Qu, Nan Li, Yong Liao, Hui Lin, Xiang Li
- **FedRSCLIP: Federated learning for remote sensing scene classification using vision-language models**
  Hui Lin*, Chao Zhang*, Danfeng Hong, Kexin Dong, and Congcong Wen☨
- **RS-MoE: A Vision–Language Model With Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering**
  Hui Lin*, Danfeng Hong*, Shuhang Ge*, Chuyao Luo, Kai Jiang, Hao Jin, and Congcong Wen☨
- **VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding**
  Xiang Li, Jian Ding, Mohamed Elhoseiny
- **Vision-language models in remote sensing: Current progress and future trends**
  Xiang Li*☨, Congcong Wen*, Yuan Hu*, Zhenghang Yuan, Xiao Xiang Zhu
- **RS-CLIP: Zero Shot Remote Sensing Scene Classification via Contrastive Vision-Language Supervision**
  Xiang Li, Congcong Wen, Yuan Hu, Nan Zhou
## :fire: Updates
- [2025.05.08] We release the code for training and testing RSGPT.
- [2024.12.18] We release the manual scoring results for RSIEval.
- [2024.06.19] We release VRSBench, a Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding. VRSBench contains 29,614 images with 29,614 human-verified detailed captions, 52,472 object references, and 123,221 question-answer pairs. Check out the VRSBench Project Page.
- [2024.05.23] We release the RSICap dataset. Please fill out this form to get both the RSICap and RSIEval datasets.
- [2023.11.10] We release RSVLM, our survey of vision-language models in remote sensing.
- [2023.10.22] The RSICap dataset and code will be released upon paper acceptance.
- [2023.10.22] We release the evaluation dataset RSIEval. Please fill out this form to get the RSIEval dataset.
## Dataset
- RSICap: 2,585 image-text pairs with high-quality human-annotated captions.
- RSIEval: 100 high-quality human-annotated captions with 936 open-ended visual question-answer pairs.
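
As a quick orientation, the sketch below shows one way to peek at the RSICap annotations after downloading. The file path and the `filename`/`caption` fields are assumptions for illustration, not the dataset's documented schema; check the released archive for the actual layout.

```python
# Minimal sketch: inspect RSICap-style image-text annotations.
# NOTE: the path and the "filename"/"caption" keys are assumptions,
# not the dataset's documented schema.
import json

with open("RSICap/captions.json") as f:  # hypothetical path
    records = json.load(f)

print(f"{len(records)} image-text pairs")
for rec in records[:3]:
    print(rec["filename"], "->", rec["caption"][:80])
```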
## Code
The fine-tuning recipe for our vision-language model is borrowed from MiniGPT-4. RSGPT is built by fine-tuning InstructBLIP on our RSICap dataset.
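
For context, here is a minimal inference sketch using a stock InstructBLIP checkpoint loaded through LAVIS. This is not RSGPT's own entry point (use `test.py` below for that); the model variant, image path, and prompt are illustrative.

```python
# Minimal sketch: caption an image with stock InstructBLIP via LAVIS.
# This illustrates the base model that RSGPT fine-tunes; it is not the
# RSGPT checkpoint itself.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_vicuna_instruct", model_type="vicuna7b", is_eval=True, device=device
)

image = Image.open("demo.png").convert("RGB")  # hypothetical image
image = vis_processors["eval"](image).unsqueeze(0).to(device)
caption = model.generate({"image": image, "prompt": "Describe this remote sensing image in detail."})
print(caption)
```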
## 🚀 Installation

Set up a conda environment using the provided `environment.yml` file:

```bash
# Step 1: Create the environment
conda env create -f environment.yml

# Step 2: Activate the environment
conda activate rsgpt
```
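
After activating the environment, a quick generic check (not a repo-specific script) confirms that the CUDA build of PyTorch sees your GPUs:

```python
# Generic sanity check: confirm PyTorch detects the GPUs.
import torch

print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
```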
## Training

Training is launched with torchrun; adjust `--nproc_per_node` to match the number of available GPUs:

```bash
torchrun --nproc_per_node=8 train.py --cfg-path train_configs/rsgpt_train.yaml
```
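
Before launching a run, it can help to dump the resolved training configuration. The sketch below assumes the YAML follows the OmegaConf conventions used by LAVIS, which this repository builds on; the exact keys vary by setup.

```python
# Minimal sketch: print the training config before launching torchrun.
# Assumes an OmegaConf/LAVIS-style YAML; actual keys depend on the repo.
from omegaconf import OmegaConf

cfg = OmegaConf.load("train_configs/rsgpt_train.yaml")
print(OmegaConf.to_yaml(cfg))
```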
## Testing

Test image captioning:

```bash
python test.py --cfg-path eval_configs/rsgpt_eval.yaml --gpu-id 0 --out-path rsgpt/output --task ic
```

Test visual question answering:

```bash
python test.py --cfg-path eval_configs/rsgpt_eval.yaml --gpu-id 0 --out-path rsgpt/output --task vqa
```
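
Generated captions can then be scored against RSIEval references with standard captioning metrics via pycocoevalcap. In this sketch the file paths and the id-to-caption JSON layout are assumptions, not the repository's actual output format.

```python
# Minimal sketch: score predicted captions against references with
# pycocoevalcap. Paths and the {image_id: caption(s)} JSON layout are
# assumptions; adapt them to the actual test.py output.
import json

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge

preds = json.load(open("rsgpt/output/ic_predictions.json"))  # hypothetical path
refs = json.load(open("RSIEval/captions.json"))              # hypothetical path

# Both metrics expect dicts mapping an image id to a list of sentences.
res = {k: [v.lower()] for k, v in preds.items()}
gts = {k: [r.lower() for r in v] for k, v in refs.items()}

bleu, _ = Bleu(4).compute_score(gts, res)
rouge, _ = Rouge().compute_score(gts, res)
cider, _ = Cider().compute_score(gts, res)
print(f"BLEU-4 {bleu[3]:.3f} | ROUGE-L {rouge:.3f} | CIDEr {cider:.3f}")
```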
## Licensing Information

Our images are borrowed from the DOTA dataset. All images and their associated annotations in DOTA may be used for academic purposes only; any commercial use is prohibited.
## Acknowledgement

- MiniGPT-4. A popular open-source vision-language model.
- InstructBLIP. The model architecture of RSGPT follows InstructBLIP. Don't forget to check out this great open-source work if you haven't seen it before!
- Lavis. This repository is built upon Lavis!
- Vicuna. The language ability of Vicuna with only 13B parameters is amazing, and it is open-source!
## Citation

If you use RSGPT in your research or applications, please cite it using this BibTeX:
```bibtex
@article{hu2025rsgpt,
  title={RSGPT: A remote sensing vision language model and benchmark},
  author={Hu, Yuan and Yuan, Jianlong and Wen, Congcong and Lu, Xiaonan and Liu, Yu and Li, Xiang},
  journal={ISPRS Journal of Photogrammetry and Remote Sensing},
  volume={224},
  pages={272--286},
  year={2025},
  publisher={Elsevier}
}
```