March 30, 2026
[EMNLP 2025] Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval
Tianlu Zheng, Yifan Zhang, Xiang An, Ziyong Feng, Kaicheng Yang, Qichuan Ding
Paper | Web-Person Dataset
News
- [2026/03/11]: We updated the GA-DMS code.
- [2025/09/12]: We released the GA-DMS paper.
- [2025/09/10]: We released the Web-Person Dataset on Hugging Face.
- [2025/08/21]: GA-DMS was accepted to the EMNLP 2025 main conference.
Highlights
This work advances CLIP for person representation learning through synergistic improvements in data curation and model architecture. First, we develop a noise-resistant data construction pipeline that leverages the in-context learning capabilities of MLLMs to automatically filter and caption web-sourced images. This yields WebPerson, a large-scale dataset of 5M high-quality person-centric image-text pairs. Second, we introduce the GA-DMS (Gradient-Attention Guided Dual-Masking Synergetic) framework, which improves cross-modal alignment by adaptively masking noisy textual tokens based on the gradient-attention similarity score. Additionally, we incorporate masked token prediction objectives that compel the model to predict informative text tokens, enhancing fine-grained semantic representation learning. Extensive experiments show that GA-DMS achieves state-of-the-art performance across multiple benchmarks.
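The dual-masking idea described above can be illustrated with a small sketch. It is not the GA-DMS implementation: the per-token scores are taken as given (in the paper they come from the gradient-attention similarity score), and the selection rule used here, a fixed fraction of the lowest-scoring tokens treated as noisy and a fixed fraction of the highest-scoring tokens treated as informative, is an assumption made purely for illustration. The names `dual_mask`, `noisy_ratio`, `informative_ratio`, and the `[MASK]` placeholder are hypothetical.

```python
from typing import List, Sequence, Tuple

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the repository


def dual_mask(
    tokens: Sequence[str],
    scores: Sequence[float],
    noisy_ratio: float = 0.25,
    informative_ratio: float = 0.25,
) -> Tuple[List[str], List[str], List[int]]:
    """Build two masked views of a caption from per-token scores.

    Assumption: a higher score means the token is better grounded in the image.
    Returns (alignment_view, prediction_view, prediction_targets).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if noisy_ratio < 0 or informative_ratio < 0 or noisy_ratio + informative_ratio > 1:
        raise ValueError("ratios must be non-negative and sum to at most 1")

    n = len(tokens)
    order = sorted(range(n), key=lambda i: scores[i])  # ascending by score
    n_noisy = int(n * noisy_ratio)
    n_info = int(n * informative_ratio)

    noisy_idx = set(order[:n_noisy])                           # lowest scores
    info_idx = set(order[n - n_info:]) if n_info else set()    # highest scores

    # View 1: hide likely-noisy tokens before cross-modal alignment.
    alignment_view = [MASK if i in noisy_idx else t for i, t in enumerate(tokens)]
    # View 2: hide informative tokens so the model has to predict them.
    prediction_view = [MASK if i in info_idx else t for i, t in enumerate(tokens)]
    return alignment_view, prediction_view, sorted(info_idx)


if __name__ == "__main__":
    toks = ["a", "man", "wearing", "red", "jacket", "uh", "photo", "stock"]
    scs = [0.10, 0.80, 0.50, 0.90, 0.85, 0.05, 0.20, 0.15]
    for view in dual_mask(toks, scs):
        print(view)
```

A hard top-k/bottom-k split keeps the sketch deterministic and easy to inspect; the actual framework masks adaptively, so treat the ratios here as stand-ins rather than as the method's hyperparameters.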
We use the COYO700M dataset, a large-scale collection of 747M image-text pairs gathered from CommonCrawl, as our source of web-crawled images. The WebPerson dataset is built from these images through a pipeline of person-centric image filtering and synthetic caption generation.
WebPerson Dataset
The WebPerson dataset can be downloaded here at both the 5M and 1M scales. The download includes the images and their corresponding textual descriptions.
Prepare Downstream Datasets
Download the CUHK-PEDES dataset from here, the ICFG-PEDES dataset from here, and the RSTPReid dataset from here.
Environment installation
conda create -n ga_dms python=3.10 -y
conda activate ga_dms
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
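After installation, a quick check that the pinned versions actually landed in the environment can save a failed training launch. The following is a minimal sketch using only the Python standard library; the helper name `check_versions` is hypothetical and is not part of this repository. It assumes that wheels installed from the cu118 index may report a local version suffix such as `+cu118`, which is stripped before comparison.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Versions pinned in the installation commands above.
EXPECTED = {"torch": "2.1.2", "torchvision": "0.16.2", "torchaudio": "2.1.2"}


def check_versions(expected: Dict[str, str]) -> Dict[str, Tuple[Optional[str], bool]]:
    """Map each package name to (installed_version_or_None, matches_expected)."""
    report: Dict[str, Tuple[Optional[str], bool]] = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        # Drop a local build suffix (for example "+cu118") before comparing.
        base = installed.split("+", 1)[0]
        report[name] = (installed, base == wanted)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_versions(EXPECTED).items():
        status = "OK" if ok else "MISMATCH or missing"
        print(f"{name}: installed={installed} expected={EXPECTED[name]} -> {status}")
```

Run it inside the activated `ga_dms` environment; any line reporting a mismatch points to a package that should be reinstalled with the commands above.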
Pretrained Model Checkpoints
The pretrained model checkpoints are available here (key: ztlk).
To pretrain the model, run sh run_ddp.sh
To fine-tune the model, run sh finetune.sh --finetune checkpoint.pth. Once training completes, the script reports performance under the fine-tuning setting.
Acknowledgements
This project is based on MLLM4Text-ReID and IRRA. We thank the authors for their work.
Citation
If you find this repository useful, please use the following BibTeX entry for citation.
@misc{zheng2025gradientattentionguideddualmaskingsynergetic,
title={Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval},
author={Tianlu Zheng and Yifan Zhang and Xiang An and Ziyong Feng and Kaicheng Yang and Qichuan Ding},
year={2025},
eprint={2509.09118},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.09118},
}