README.md

March 30, 2026 ยท View on GitHub

[EMNLP 2025] Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval

Tianlu Zheng, Yifan Zhang, Xiang An, Ziyong Feng, Kaicheng Yang, Qichuan Ding,

๐Ÿ“„ Paper | ๐Ÿค— Web-Person Dataset

๐ŸŽบ News

  • [2026/03/11]: โœจWe update the code of GA-DMS.
  • [2025/09/12]: โœจWe public the paper of GA-DMS.
  • [2025/09/10]: โœจWe release the Web-Person Dataset in ๐Ÿค— Huggingface
  • [2025/08/21]: โœจGA-DMS has been accepted by EMNLP2025 Main.

๐Ÿ’ก Highlights

This work advances CLIP for person representation learning through synergistic improvements in data curation and model architecture. First, we develop a noise-resistant data construction pipeline that leverages the in-context learning capabilities of MLLMs to automatically filter and caption web-sourced images. This yields WebPerson, a large-scale dataset of 5M high-quality person-centric image-text pairs. Second, we introduce the GA-DMS (Gradient-Attention Guided Dual-Masking Synergetic) framework, which improves cross-modal alignment by adaptively masking noisy textual tokens based on the gradient-attention similarity score. Additionally, we incorporate masked token prediction objectives that compel the model to predict informative text tokens, enhancing fine-grained semantic representation learning. Extensive experiments show that GA-DMS achieves state-of-the-art performance across multiple benchmarks.

Our Proposed Pipeline

We utilize the COYO700M dataset, a large-scale dataset that contains 747M image-text pairs collected from CommonCrawl, as our web-crawled images source. The following is the details of person-centric image filtering and synthetic caption generation pipeline for constructing our WebPerson dataset.

Our Proposed Dataset Pipeline

WebPerson Dataset

The WebPerson dataset can be downloaded here, which includes both 5M and 1M scales. Both the images and their corresponding textual descriptions are available from this source.

Prepare Downstream Datasets

Download the CUHK-PEDES dataset from here, ICFG-PEDES dataset from here and RSTPReid dataset form here.

Environment installation

conda create -n ga_dms python=3.10 -y
conda activate ga_dms

pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt

Pretrain Model Checkpoints

We release the Pretrain Model Checkpoints at here. key=ztlk
To pertrain model,you can simply run sh run_ddp.sh

To fine-tune model, you can simply run sh finetune.sh --finetune checkpoint.pth. After the model training is completed, it will provide the performance of fine-tune setting.

Acknowledgements

This project is based on MLLM4Text-ReID, and IRRA, thanks for their works.

๐Ÿ“– Citation

If you find this repository useful, please use the following BibTeX entry for citation.

@misc{zheng2025gradientattentionguideddualmaskingsynergetic,
      title={Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval}, 
      author={Tianlu Zheng and Yifan Zhang and Xiang An and Ziyong Feng and Kaicheng Yang and Qichuan Ding},
      year={2025},
      eprint={2509.09118},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.09118}, 
}