RS-MoE: A Vision-Language Model With Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering
August 25, 2025
Authors: Hui Lin*, Danfeng Hong*, Shuhang Ge*, Chuyao Luo, Kai Jiang, Hao Jin, and Congcong Wen†
* Equal Contribution    † Corresponding Author.
Journal: IEEE Transactions on Geoscience and Remote Sensing (TGRS), 2025
Abstract
Remote Sensing Image Captioning (RSIC) presents unique challenges and plays a critical role in applications such as environmental monitoring, urban planning, and disaster management. Traditional RSIC methods often struggle to produce rich and diverse descriptions. Recently, with significant advancements in Vision-Language Models (VLMs), efforts have emerged to integrate these models into the remote sensing domain and to introduce richly descriptive datasets specifically designed to enhance VLM training. However, most current RSIC models apply only fine-tuning to these datasets without developing models tailored to the unique characteristics of remote sensing imagery. This paper proposes RS-MoE, the first Mixture-of-Experts-based VLM specifically customized for the remote sensing domain. Unlike traditional MoE models, the core of RS-MoE is the MoE Block, which incorporates a novel Instruction Router and multiple lightweight Large Language Models (LLMs) as expert models. The Instruction Router is designed to generate specific prompts tailored for each corresponding LLM, guiding them to focus on distinct aspects of the RSIC task. This design not only allows each expert LLM to concentrate on a specific subset of the task, thereby enhancing the specificity and accuracy of the generated captions, but also improves the scalability of the model by facilitating parallel processing of sub-tasks. Additionally, we present a two-stage training strategy for tuning our RS-MoE model to prevent performance degradation due to sparsity. We fine-tuned our model on the RSICap dataset using our proposed training strategy. Experimental results on the RSICap dataset, along with evaluations on other traditional datasets where no additional fine-tuning was applied, demonstrate that our model achieves state-of-the-art performance in generating precise and contextually relevant captions.
Notably, our RS-MoE-1B variant achieves performance comparable to 13B VLMs, demonstrating the efficiency of our model design. Moreover, our model demonstrates promising generalization capabilities by consistently achieving state-of-the-art performance on the Remote Sensing Visual Question Answering (RSVQA) task.
Contributions
- We are the first to introduce the MoE framework to multimodal remote sensing, leveraging its task decomposition concept with specialized expert models to effectively address the complexity and diversity of remote sensing data across both visual and textual modalities.
- We propose RS-MoE, a novel Vision-Language Model specifically designed for remote sensing image captioning, with an Instruction Router that dynamically generates task-specific prompts and lightweight LLMs as expert models to enhance both effectiveness and efficiency.
- We present a two-stage training strategy for RS-MoE, incorporating proper initialization to mitigate sparsity-induced degradation and employing LoRA to reduce trainable parameters, improving both the efficiency and the manageability of the model during training.
- Extensive experiments demonstrate that, with fine-tuning on a single dataset, our model achieves state-of-the-art performance on five RSIC datasets and strong generalization on two RSVQA datasets. Notably, the lightweight RS-MoE-1B matches the performance of larger 13B VLMs while being significantly more efficient.
Please cite the paper if this code is useful for your research.
@article{rsmoe,
title={{RS-MoE}: A vision-language model with mixture of experts for remote sensing image captioning and visual question answering},
author={Lin, Hui and Hong, Danfeng and Ge, Shuhang and Luo, Chuyao and Jiang, Kai and Jin, Hao and Wen, Congcong},
journal={IEEE Transactions on Geoscience and Remote Sensing},
year={2025},
volume={63},
pages={1-18},
publisher={IEEE}
}
Implementation Guide
RS-MoE Project Organization
This folder contains all the essential files needed to run the InstructBLIP training pipeline based on run_finetune_instructblip_experiments.sh.
Directory Structure
scripts/
Core training scripts and configurations
- train.py - Main training script (entry point)
- run_finetune_instructblip_experiments.sh - Bash script to run experiments
- finetune_instructblip_iconqa_33.yaml - Main configuration file for IconQA training
- Other training configuration files (*.yaml)
models/
All model-related files
- blip2_models/ - BLIP2 architecture implementations
  - blip2_vicuna_instruct_lora.py - Main model used in training (Vicuna-13B with LoRA)
  - blip2.py - Base BLIP2 implementation
  - Qformer.py - Q-Former implementation
  - modeling_llama.py - Llama/Vicuna model implementations
  - Other BLIP2 variants
- blip_models/ - Original BLIP models
- clip_models/ - CLIP vision encoder
- eva_vit.py - EVA vision transformer
- base_model.py - Base model class
lavis/
Core LAVIS framework components
common/
- config.py - Configuration management
- registry.py - Model/task/dataset registry
- dist_utils.py - Distributed training utilities
- logger.py - Logging utilities
- optims.py - Optimizers and schedulers
- utils.py - General utilities
- vqa_tools/ - VQA evaluation tools
datasets/
- builders/ - Dataset builder classes
  - vqa_builder.py - VQA dataset builders (includes IconQA)
- datasets/ - Dataset implementations
  - iconqa_datasets.py - IconQA dataset class
  - base_dataset.py - Base dataset class
- data_utils.py - Data processing utilities
tasks/
- base_task.py - Base task class
- vqa.py - VQA task implementation (includes IconQATask)
- captioning.py - Image captioning tasks
- Other task implementations
processors/
- blip_processors.py - Image and text processors for BLIP
- base_processor.py - Base processor class
- Image preprocessing and text tokenization
runners/
- runner_base.py - Main training runner
- runner_iter.py - Iteration-based runner
- Training loop implementations
Config Files
- defaults.yaml - IconQA dataset configuration
- blip2_instruct_vicuna13b_lora.yaml - Model configuration
- __init__.py - Package initialization
requirements/
requirements.txt- Python package dependencies
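The registry listed under common/ reflects a pattern common to LAVIS-style codebases: classes are registered under a string name, and the configuration file selects one by that name (for example `blip2_vicuna_instruct_lora`). The sketch below is a minimal, self-contained illustration of that pattern. The `Registry` class, its method names, and `DummyInstructModel` are hypothetical and are not copied from registry.py.

```python
from typing import Callable, Dict, Type


class Registry:
    """Hypothetical minimal name-to-class registry (not the actual registry.py API)."""

    def __init__(self) -> None:
        self._models: Dict[str, Type] = {}

    def register_model(self, name: str) -> Callable[[Type], Type]:
        """Return a class decorator that stores the class under `name`."""
        def decorator(cls: Type) -> Type:
            if name in self._models:
                raise KeyError(f"model '{name}' is already registered")
            self._models[name] = cls
            return cls
        return decorator

    def get_model_class(self, name: str) -> Type:
        """Look up a previously registered class by its config name."""
        try:
            return self._models[name]
        except KeyError:
            raise KeyError(f"unknown model '{name}'") from None


registry = Registry()


@registry.register_model("blip2_vicuna_instruct_lora")
class DummyInstructModel:
    """Stand-in for the real model class; it only stores the LoRA rank."""

    def __init__(self, lora_rank: int = 1) -> None:
        self.lora_rank = lora_rank


# A training script can then build the model from the name given in the config.
model_cls = registry.get_model_class("blip2_vicuna_instruct_lora")
model = model_cls(lora_rank=1)
```

The benefit of this indirection is that adding a new model variant only requires registering a class; the training entry point does not need to change.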
Key Model Architecture
Training uses the BLIP2-Vicuna-Instruct-LoRA architecture:
- Visual Encoder: EVA-CLIP-G (frozen)
- Q-Former: 32 query tokens for visual-text alignment
- Language Model: Vicuna-13B with LoRA fine-tuning
- Dataset: IconQA for remote sensing VQA
- Training: LoRA applied to FFN layers only (rank=1)
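To give a sense of scale for rank-1 LoRA on the FFN layers, the arithmetic below counts the adapter parameters added to the language model. It assumes the standard LLaMA-13B shape for Vicuna-13B (hidden size 5120, FFN intermediate size 13824, 40 decoder layers) and one bias-free `A`/`B` pair per adapted projection. These values are assumptions and are not read from this repository's configuration.

```python
# Assumed Vicuna-13B / LLaMA-13B dimensions (not taken from this repository).
HIDDEN_SIZE = 5120
INTERMEDIATE_SIZE = 13824
NUM_LAYERS = 40
LORA_RANK = 1  # rank stated in the list above

# (in_features, out_features) of the three FFN projections in a LLaMA block.
FFN_PROJECTIONS = {
    "gate_proj": (HIDDEN_SIZE, INTERMEDIATE_SIZE),
    "up_proj": (HIDDEN_SIZE, INTERMEDIATE_SIZE),
    "down_proj": (INTERMEDIATE_SIZE, HIDDEN_SIZE),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * in_features + out_features * rank


per_layer = sum(lora_params(i, o, LORA_RANK) for i, o in FFN_PROJECTIONS.values())
total = per_layer * NUM_LAYERS

print(f"LoRA parameters per decoder layer: {per_layer:,}")
print(f"LoRA parameters in the language model: {total:,}")
```

Under these assumptions the adapters add about 2.3 million parameters to the language model, which is why rank-1 LoRA keeps fine-tuning of a 13B model manageable.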
How to Use
- Set up environment:

      pip install -r requirements/requirements.txt

- Run training:

      # From the scripts folder
      bash run_finetune_instructblip_experiments.sh <benchmark> <experiment>

  Or directly:

      python train.py --cfg-path finetune_instructblip_iconqa_33.yaml

- Key configuration parameters (in finetune_instructblip_iconqa_33.yaml):
  - Model: blip2_vicuna_instruct_lora
  - Base LLM: vicuna13b
  - LoRA rank: 1
  - LoRA target: FFN layers (gate_proj, up_proj, down_proj)
  - Batch size: 1
  - Learning rate: 1e-5
  - Epochs: 50
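The LoRA target setting above restricts adapters to the FFN projections. A minimal sketch of how such a restriction can be expressed by name matching follows. The module names mimic the Hugging Face LLaMA layout and the helpers `is_lora_target` and `select_lora_modules` are hypothetical, so the actual selection logic in blip2_vicuna_instruct_lora.py may differ.

```python
from typing import Iterable, List

# Target module names from the configuration list above.
LORA_TARGETS = ("gate_proj", "up_proj", "down_proj")


def is_lora_target(module_name: str, targets: Iterable[str] = LORA_TARGETS) -> bool:
    """True if the last component of the dotted module name is a target."""
    return module_name.rsplit(".", 1)[-1] in set(targets)


def select_lora_modules(module_names: Iterable[str]) -> List[str]:
    """Keep only the modules that would receive a LoRA adapter."""
    return [name for name in module_names if is_lora_target(name)]


# Illustrative module names for one decoder layer (assumed naming scheme).
layer0 = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
]

selected = select_lora_modules(layer0)
print(selected)  # attention projections are skipped; only the FFN ones remain
```

Matching on the last name component rather than on a substring avoids accidentally adapting modules whose path merely contains one of the target strings.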
Dependencies
This is a complete, self-contained copy of all files needed to run the training pipeline. No additional files from the original RS-MoE repository are required.
For MoE Implementation
The current setup is the baseline architecture, which still needs to be extended with Mixture of Experts components as described in the RS-MoE paper. The MoE architecture is not yet implemented in these files.
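As a possible starting point for that extension, the sketch below illustrates the MoE Block idea from the abstract: an Instruction Router produces one prompt per expert, and the experts process their sub-tasks in parallel. Everything here is an assumption made for illustration. The paper describes a learned router and lightweight LLM experts, whereas this sketch uses fixed prompt templates and plain functions; the aspect names `theme` and `objects` and the way outputs are collected are likewise hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List

# An expert maps a prompt to text. Real experts would be lightweight LLMs;
# plain functions are used so the sketch runs without model weights.
Expert = Callable[[str], str]


@dataclass
class InstructionRouter:
    """Hypothetical router that fills one fixed template per expert.

    Assumption: templates stand in for the learned router of the paper.
    """

    templates: Dict[str, str]

    def route(self, instruction: str) -> Dict[str, str]:
        return {
            name: template.format(instruction=instruction)
            for name, template in self.templates.items()
        }


@dataclass
class MoEBlock:
    """Hypothetical MoE block: route prompts, then run experts in parallel."""

    router: InstructionRouter
    experts: Dict[str, Expert]

    def __call__(self, instruction: str) -> List[str]:
        prompts = self.router.route(instruction)
        # Sub-tasks are independent, so each expert can run concurrently.
        with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
            futures = {
                name: pool.submit(self.experts[name], prompt)
                for name, prompt in prompts.items()
            }
            # Collect in router order so the result is deterministic.
            return [futures[name].result() for name in prompts]


router = InstructionRouter(
    templates={
        "theme": "{instruction} Focus on the overall scene.",
        "objects": "{instruction} Focus on the visible objects.",
    }
)
experts: Dict[str, Expert] = {
    "theme": lambda prompt: f"[theme] {prompt}",
    "objects": lambda prompt: f"[objects] {prompt}",
}

block = MoEBlock(router=router, experts=experts)
outputs = block("Describe the image.")
print(outputs)
```

In a real extension the expert callables would wrap LLM generation conditioned on the Q-Former visual tokens, and the per-expert outputs would be merged into one caption according to the method in the paper.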