Language-Guided Progressive Attention for Visual Grounding in Remote Sensing Images

June 10, 2025

This repository is the official implementation of the IEEE TGRS 2024 paper "Language-Guided Progressive Attention for Visual Grounding in Remote Sensing Images", available at: IEEE TGRS.

OPT-RSVG Dataset

The dataset contains 25,452 RS images and 48,952 image-query pairs. The table below lists the number of training, validation, and test samples per class in the OPT-RSVG dataset.

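| No. | Class Name | Training | Validation | Test |
|-----|------------|----------|------------|------|
| C01 | airplane | 979 | 230 | 1142 |
| C02 | ground track field | 1600 | 365 | 2066 |
| C03 | tennis court | 1093 | 284 | 1313 |
| C04 | bridge | 1699 | 452 | 2212 |
| C05 | basketball court | 1036 | 263 | 1385 |
| C06 | storage tank | 1050 | 271 | 1264 |
| C07 | ship | 1084 | 243 | 1241 |
| C08 | baseball diamond | 1477 | 361 | 1744 |
| C09 | T junction | 1663 | 425 | 2055 |
| C10 | crossroad | 1670 | 405 | 2088 |
| C11 | parking lot | 1049 | 268 | 1368 |
| C12 | harbor | 758 | 209 | 953 |
| C13 | vehicle | 3294 | 811 | 4083 |
| C14 | swimming pool | 1128 | 308 | 1563 |
| - | Total | 19580 | 4895 | 24477 |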

Our OPT-RSVG dataset is openly available: Google Drive, Baidu Netdisk (extraction code: 92yk)

The DIOR-RSVG dataset used in our experiments is available at: Baidu Netdisk (extraction code: DIOR)

LPVA Framework

The proposed LPVA framework consists of five components: (1) a Linguistic Backbone, which extracts linguistic features from referring expressions; (2) a Progressive Attention module, which generates dynamic weights and biases for the visual backbone conditioned on the specific expression; (3) a Visual Backbone, which extracts visual features from raw images and whose attention can be modified by the language-adaptive weights; (4) a Multi-Level Feature Enhancement Decoder, which aggregates visual contextual information to enhance feature uniqueness; and (5) a Localization Module, which predicts the bounding box.
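To make the conditioning idea in component (2) concrete, here is a minimal NumPy sketch of language-conditioned feature modulation. It is not the authors' Progressive Attention module: it assumes a simple channel-wise scale and shift predicted from a pooled sentence embedding by two linear maps, and the names `language_modulate`, `w_scale`, and `w_shift` are hypothetical.

```python
import numpy as np


def language_modulate(visual, text, w_scale, w_shift):
    """Apply a language-conditioned channel-wise scale and shift.

    visual:  (C, H, W) visual feature map
    text:    (D,) pooled sentence embedding
    w_scale: (C, D) hypothetical projection producing per-channel weights
    w_shift: (C, D) hypothetical projection producing per-channel biases
    """
    scale = w_scale @ text  # (C,) dynamic weights derived from the expression
    shift = w_shift @ text  # (C,) dynamic biases derived from the expression
    # Broadcast over the spatial dimensions; "1 + scale" keeps the identity
    # mapping when both projections output zeros.
    return visual * (1.0 + scale)[:, None, None] + shift[:, None, None]


rng = np.random.default_rng(0)
visual = rng.standard_normal((256, 20, 20))       # toy feature map
text = rng.standard_normal(768)                   # toy BERT-sized embedding
w_scale = 0.01 * rng.standard_normal((256, 768))  # toy projection weights
w_shift = 0.01 * rng.standard_normal((256, 768))
out = language_modulate(visual, text, w_scale, w_shift)
print(out.shape)  # (256, 20, 20)
```

The same expression-dependent parameters change how every spatial location of the feature map is weighted, which is the sense in which the visual features become conditioned on the query.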

Performance Comparison

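Comparison of LPVA with SOTA methods on the OPT-RSVG test set (all metrics in %)

| Methods | Venue | Visual Encoder | Language Encoder | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 | meanIoU | cumIoU |
|---------|-------|----------------|------------------|--------|--------|--------|--------|--------|---------|--------|
| One-stage: | | | | | | | | | | |
| ZSGNet | ICCV'19 | ResNet-50 | BiLSTM | 48.64 | 47.32 | 43.85 | 27.69 | 6.33 | 43.01 | 47.71 |
| FAOA | ICCV'19 | DarkNet-53 | BERT | 68.13 | 64.30 | 57.15 | 41.83 | 15.33 | 58.79 | 65.20 |
| ReSC | ECCV'20 | DarkNet-53 | BERT | 69.12 | 64.63 | 58.20 | 43.01 | 14.85 | 60.18 | 65.84 |
| LBYL-Net | CVPR'21 | DarkNet-53 | BERT | 70.22 | 65.39 | 58.65 | 37.54 | 9.46 | 60.57 | 70.28 |
| Transformer-based: | | | | | | | | | | |
| TransVG | CVPR'21 | ResNet-50 | BERT | 69.96 | 64.17 | 54.68 | 38.01 | 12.75 | 59.80 | 69.31 |
| QRNet | CVPR'22 | Swin | BERT | 72.03 | 65.94 | 56.90 | 40.70 | 13.35 | 60.82 | 75.39 |
| VLTVG | CVPR'22 | ResNet-50 | BERT | 71.84 | 66.54 | 57.79 | 41.63 | 14.62 | 60.78 | 70.69 |
| VLTVG | CVPR'22 | ResNet-101 | BERT | 73.50 | 68.13 | 59.93 | 43.45 | 15.31 | 62.48 | 73.86 |
| MGVLF | TGRS'23 | ResNet-50 | BERT | 72.19 | 66.86 | 58.02 | 42.51 | 15.30 | 61.51 | 71.80 |
| Ours: | | | | | | | | | | |
| LPVA | - | ResNet-50 | BERT | 78.03 | 73.32 | 62.22 | 49.60 | 25.61 | 66.20 | 76.30 |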

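Comparison of LPVA with SOTA methods on the DIOR-RSVG test set (all metrics in %)

| Methods | Venue | Visual Encoder | Language Encoder | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 | meanIoU | cumIoU |
|---------|-------|----------------|------------------|--------|--------|--------|--------|--------|---------|--------|
| One-stage: | | | | | | | | | | |
| ZSGNet | ICCV'19 | ResNet-50 | BiLSTM | 51.67 | 48.13 | 42.30 | 32.41 | 10.15 | 44.12 | 51.65 |
| FAOA | ICCV'19 | DarkNet-53 | BERT | 67.21 | 64.18 | 59.23 | 50.87 | 34.44 | 59.76 | 63.14 |
| ReSC | ECCV'20 | DarkNet-53 | BERT | 72.71 | 68.92 | 63.01 | 53.70 | 33.37 | 64.24 | 68.10 |
| LBYL-Net | CVPR'21 | DarkNet-53 | BERT | 73.78 | 69.22 | 65.56 | 47.89 | 15.69 | 65.92 | 76.37 |
| Transformer-based: | | | | | | | | | | |
| TransVG | CVPR'21 | ResNet-50 | BERT | 72.41 | 67.38 | 60.05 | 49.10 | 27.84 | 63.56 | 76.27 |
| QRNet | CVPR'22 | Swin | BERT | 75.84 | 70.82 | 62.27 | 49.63 | 25.69 | 66.80 | 83.02 |
| VLTVG | CVPR'22 | ResNet-50 | BERT | 69.41 | 65.16 | 58.44 | 46.56 | 24.37 | 59.96 | 71.97 |
| VLTVG | CVPR'22 | ResNet-101 | BERT | 75.79 | 72.22 | 66.33 | 55.17 | 33.11 | 66.32 | 77.85 |
| MGVLF | TGRS'23 | ResNet-50 | BERT | 75.98 | 72.06 | 65.23 | 54.89 | 35.65 | 67.48 | 78.63 |
| Ours: | | | | | | | | | | |
| LPVA | - | ResNet-50 | BERT | 82.27 | 77.44 | 72.25 | 60.98 | 39.55 | 72.35 | 85.11 |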
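The Pr@X, meanIoU, and cumulative IoU columns in the tables above can be computed from predicted and ground-truth boxes. The following is a minimal sketch, assuming the definitions commonly used in visual grounding: Pr@X is the percentage of samples whose IoU reaches threshold X, meanIoU averages the per-sample IoU, and cumulative IoU divides the total intersection area by the total union area. The box format `(x1, y1, x2, y2)`, the `>=` comparison, and the function names are assumptions for illustration and are not taken from this repository's evaluation code.

```python
def box_areas(pred, gt):
    """Return (intersection, union) areas for two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    # Clamp at zero so non-overlapping boxes yield an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter, area_pred + area_gt - inter


def grounding_metrics(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Compute Pr@X, meanIoU, and cumIoU (all in %) over paired boxes."""
    pairs = [box_areas(p, g) for p, g in zip(preds, gts)]
    if not pairs:
        raise ValueError("at least one prediction is required")
    ious = [inter / union if union > 0 else 0.0 for inter, union in pairs]
    n = len(ious)
    metrics = {
        f"Pr@{t}": 100.0 * sum(iou >= t for iou in ious) / n for t in thresholds
    }
    metrics["meanIoU"] = 100.0 * sum(ious) / n
    total_inter = sum(inter for inter, _ in pairs)
    total_union = sum(union for _, union in pairs)
    metrics["cumIoU"] = 100.0 * total_inter / total_union if total_union > 0 else 0.0
    return metrics
```

Because cumulative IoU pools areas before dividing, large objects contribute more to it than small ones, whereas meanIoU weights every sample equally. This is one reason the two columns can rank methods differently.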

Requirements

  • Python 3.6.13
  • PyTorch 1.9.0
  • NumPy 1.19.2
  • CUDA 11.1
  • OpenCV 4.5.5
  • torchvision
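A quick way to compare a local environment against the versions listed above is sketched below. This helper is hypothetical and not part of the repository; it assumes the usual import names `torch`, `numpy`, `cv2`, and `torchvision`, and only reports what it finds without enforcing anything.

```python
import importlib
import sys

# Versions listed in this README; None means no version is specified.
EXPECTED = {"torch": "1.9.0", "numpy": "1.19.2", "cv2": "4.5.5", "torchvision": None}


def check_environment(expected=EXPECTED):
    """Return {name: (found_version, listed_version)} for Python and each package."""
    report = {"python": (sys.version.split()[0], "3.6.13")}
    for name, wanted in expected.items():
        try:
            module = importlib.import_module(name)
            found = getattr(module, "__version__", "unknown")
        except ImportError:
            found = None  # package is not installed
        report[name] = (found, wanted)
    return report
```

Calling `check_environment()` returns a dictionary that can be printed or logged before training. Installed version strings may carry a build suffix (for example a CUDA tag on PyTorch), so compare prefixes rather than requiring exact equality.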

Citation

If you find this code useful, please cite the paper. You are welcome to 👍Fork and Star👍 the repository; we will let you know when we update it.

@ARTICLE{10584552,
  author={Li, Ke and Wang, Di and Xu, Haojie and Zhong, Haodi and Wang, Cong},
  journal={IEEE Transactions on Geoscience and Remote Sensing}, 
  title={Language-Guided Progressive Attention for Visual Grounding in Remote Sensing Images}, 
  year={2024},
  volume={62},
  pages={1-13},
  doi={10.1109/TGRS.2024.3423663}}