README.md
February 16, 2026 ยท View on GitHub
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
Method | Benchmark | Getting Started | Inference | Citation
2026/2/14๐ Code and model weights are released.2025/6/26]๐ Our paper is accepted by ICCV 2025!2025/3/11๐ LRS-VQA Benchmark is now released!
This project focuses on efficient perception of Large Remote Sensing Images (RSIs) using Large Vision-Language Models (LVLMs) under limited resources, covering the following key aspects:
- Coarse-to-Fine Focusing & Pruning: An iterative process that zooms from coarse, low-resolution overviews into fine-grained, high-resolution views to focus and analyze text-related regions.
- Region Focus Module (RFM): Learns text-aware key region localization capabilities from LVLM through attention distillation, enabling focused analysis on critical image tiles.
- LRS-VQA: A new benchmark for Large RSI perception, featuring 7,333 QA pairs across 8 categories, with images reaching up to 27,328 pixels in length and an average size of 7,099ร6,329 pixels.
๐ TODO
- Release benchmark.
- Release code and model weights.
- Release training script.
๐ ๏ธ Method
Our method introduces: (i) a Region Focus Module (RFM) that leverages text-aware region localization capability to identify critical vision tokens, and (ii) a coarse-to-fine image tile selection and vision token pruning strategy based on DIP, which is guided by RFM outputs and avoids directly processing the entire large imagery.
1. Region Focus Module (RFM)

Schematic illustration of the Region Focus Module (RFM).
The RFM aims at learning text-aware key vision tokens localization from the LLM part of LVLM by employing attention distillation, which allows it to focus on the most relevant parts of an image for detailed analysis.
The phenomenon within the LVLMs was first observed in small-size general images:

2. Coarse-to-Fine Token Pruning

The overall pipeline of our proposed method.
Initially, the DIP is constructed based on the input large RSI. At the low-resolution DIP level, the RFM provides attention distribution for the initial vision tokens, which guides the retrieval of corresponding image tiles from higher-resolution DIP levels or trigger token pruning at the current level. This iterative process could continue through the pyramid until reaching the original resolution.
๐ LRS-VQA Benchmark
Construction process of LRS-VQA.
MME-RealWorld has provided a high-quality benchmark for multiple domains. In the field of remote sensing, we aim to further enrich the types of tasks and reflect the challenges of large RSI perception. LRS-VQA includes 1,657 images ranging in length from 1,024 to 27,328 pixels, covering 8 different types of questions, and contains 7,333 QA pairs.
The accuracy trends of Qwen2-VL across varying input maximum pixels. This demonstrates that accuracy on both the manually annotated MME-RealWorld-RS and our proposed LRS-VQA exhibit a positive correlation with resolution improvement, proving the effectiveness of LRS-VQA in evaluating LVLM's high-resolution RSI perception capabilities.
Download and Evaluation
To get started with the dataset and evaluation scripts, follow these steps:
-
Source Images: The source images are collected from multiple datasets:
- FAIR1M (train set)
- GLH-Bridge (test set)
- STAR (test set)
The LRS-VQA dataset can be found at https://huggingface.co/datasets/ll-13/LRS-VQA.
-
Evaluation: To evaluate your model on LRS-VQA, please refer to our provided script for generating results:
Once you have generated the result file using
llava_eval_LRSVQA.py, you can use the following script to evaluate your model's performance:
Results
Leaderboard and performance comparision. Average Accuracy is reported for each dataset.
Efficiency comparision.
Detailed Result
Click to view detailed results in LRS-FAIR, LRS-Bridge, and LRS-STAR
Detailed results in LRS-FAIR.
Detailed results in LRS-Bridge.
Detailed results in LRS-STAR.
๐ Getting Started
Model Weights
We provide the model weights on ModelScope and Hugging Face.
| Model Base | ModelScope | Hugging Face |
|---|---|---|
| LLaVA-NeXT-7B (Qwen2) | ||
| LLaVA-1.5-7B (Vicuna) |
1. Environment Setup
We recommend using Conda to manage the environment. This project requires Python 3.10 and is tested on NVIDIA A100/A800 GPUs.
# Create and activate the conda environment
conda create -n lrsvqa python=3.10 -y
conda activate lrsvqa
# Upgrade pip
pip install --upgrade pip
2. Install PyTorch
The required PyTorch version depends on your NVIDIA Driver Version (check with nvidia-smi).
- For Modern Drivers (Version >= 525.60):
pip install torch==2.1.2 torchvision==0.16.2 --index-url [https://download.pytorch.org/whl/cu121](https://download.pytorch.org/whl/cu121)
- For Older Drivers (Version < 525.60, e.g., 470.xx):
If you encounter a
RuntimeErrorrelated to an old NVIDIA driver, you must use the CUDA 11.8 compiled version:
pip install torch==2.1.2+cu118 torchvision==0.16.2+cu118 --extra-index-url [https://download.pytorch.org/whl/cu118](https://download.pytorch.org/whl/cu118)
3. Install Dependencies
cd LRS-VQA-Code
# Install base and training requirements
pip install -e .
pip install -e ".[train]"
# Install specific versions of other key packages
pip install flash-attn==2.6.3 --no-build-isolation
pip install bitsandbytes==0.43.3 safetensors==0.4.5 pydantic==2.9.1 peft==0.3.0
โ๏ธ Inference
Step 1: Apply Environment Patch (Crucial)
โ ๏ธ Important: Our model includes custom modifications that require patching your local transformers library installation. Before running inference, please copy our modified modeling files into your conda environment.
# Get the site-packages path of your conda environment
SITE_PACKAGES=$(python -c 'import site; print(site.getsitepackages()[0])')
# Copy the patched files (Execute from project root)
cp ./LRS-VQA-Code/llava/model/multimodal_encoder/transformers/models/llama/modeling_llama.py $SITE_PACKAGES/transformers/models/llama/
cp ./LRS-VQA-Code/llava/model/multimodal_encoder/transformers/models/qwen2/modeling_qwen2.py $SITE_PACKAGES/transformers/models/qwen2/
cp ./LRS-VQA-Code/llava/model/multimodal_encoder/transformers/modeling_outputs.py $SITE_PACKAGES/transformers/
Step 2: Prepare Data
Depending on which benchmark you wish to evaluate, follow the corresponding instructions below.
Option A: For MME-RealWorld (Remote Sensing) Evaluation
- Download: Obtain the dataset from the official MME-RealWorld repository.
- Organize: Arrange your files as shown below. The evaluation script will need the path to this main data directory.
/path/to/your/MME_RealWorld/
โโโ MME_RealWorld_RS.json
โโโ evaluation/
โ โโโ eval_your_results.py
โโโ remote_sensing/
โโโ 03553_Toronto.png
โโโ 03555_Toronto.png
โโโ ...
Option B: For LRS-VQA Benchmark Evaluation
- Download: Download the dataset from our Hugging Face repository.
- Extract: Place all downloaded parts in the same directory. Use
7-Zip(Windows) orp7zip(Linux) to extract the archive by running the command on the first file (.001).
# Example using p7zip on Linux
7z x LRS_VQA.7z.001
- Verify Structure: After extraction, you will have an
LRS_VQAfolder. The evaluation script expects the path to its parent directory. Ensure your final structure looks like this:
/path/to/dataset_parent_directory/
โโโ LRS_VQA/
โโโ LRS_VQA_merged.jsonl
โโโ image/
โโโ 15565.tif
โโโ 9043.tif
โโโ ...
Step 3: Run Inference
We provide shell scripts to simplify the evaluation process.
To evaluate on LRS-VQA:
bash LRS-VQA-Code/eval_lrs_vqa.sh
To evaluate on MME-RealWorld-RS:
bash LRS-VQA-Code/eval_mme-realworld-rs.sh
Citation
If you find this work helpful for your research, please consider giving this repo a star โญ and citing our paper:
@InProceedings{Luo_2025_ICCV,
title={When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning},
author={Luo, Junwei and Zhang, Yingying and Yang, Xue and Wu, Kang and Zhu, Qi and Liang, Lei and Chen, Jingdong and Li, Yansheng},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month={October},
year={2025},
pages={9206-9217}
}
@article{li2024scene,
title={STAR: A First-Ever Dataset and A Large-Scale Benchmark for Scene Graph Generation in Large-Size Satellite Imagery},
author={Li, Yansheng and Wang, Linlin and Wang, Tingzhu and Yang, Xue and Luo, Junwei and Wang, Qi and Deng, Youming and Wang, Wenbin and Sun, Xian and Li, Haifeng and Dang, Bo and Zhang, Yongjun and Yu, Yi and Yan Junchi},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2024},
publisher={IEEE}}
@article{li2024learning,
title={Learning to Holistically Detect Bridges From Large-Size VHR Remote Sensing Imagery},
author={Li, Yansheng and Luo, Junwei and Zhang, Yongjun and Tan, Yihua and Yu, Jin-Gang and Bai, Song},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
volume={44},
number={11},
pages={7778--7796},
year={2024},
publisher={IEEE}
}
Acknowledgement
We thank the authors of MME-RealWorld and PyramidDrop for their great works and codebases.