May 25, 2026
[ACL 2026] ⚓ TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval
1 School of Software, Shandong University
2 Department of Computing, Hong Kong Polytechnic University
3 School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)
✉ Corresponding author
Official Repository: This is an open-source implementation of the paper "TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval".
📌 Introduction
TEMA (Text-oriented Entity Mapping Architecture) is the first Composed Image Retrieval (CIR) framework designed explicitly for multi-modification scenarios while seamlessly accommodating simple modifications. Prevailing CIR setups rely on simple modification texts, which typically cover only a limited range of salient changes. This induces two critical limitations highly relevant to practical applications: Insufficient Entity Coverage and Clause-Entity Misalignment.
To bring CIR closer to real-world use, we introduce two instruction-rich multi-modification datasets: M-FashionIQ and M-CIRR. Through MMT parsing and entity mapping, TEMA actively perceives and structurally models these complex modifications to achieve precise retrieval.
📢 News
- [2026.04.07] 🔥 TEMA was accepted to ACL 2026!
- [2026.04.06] 🚀 Released all training and evaluation code.
✨ Key Contributions
- 📊 New Benchmarks (M-FashionIQ & M-CIRR): We construct two instruction-intensive datasets that replace short, simplistic texts with Multi-Modification Texts (MMT). These are generated by an MLLM and verified by human annotators so that they explicitly present constraint structures with multiple entities and clauses.
- 🧠 MMT Parsing Assistant (PA): Designed to address "Insufficient Entity Coverage". During training, it uses an LLM-based text summarizer and a Consistency Detector to improve the exposure and coverage of modified entities.
- 🔗 MMT-oriented Entity Mapping (EM): Tackles the "Clause-Entity Misalignment" issue. It introduces learnable queries to consolidate multiple clauses of the same entity on the text side and align them with corresponding visual entities on the image side, stabilizing "one-to-many" relationship modeling.
- 🏆 Superior Performance: Extensive experiments on four benchmark datasets demonstrate TEMA's superiority in both original and multi-modification scenarios.
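This README does not spell out how the learnable queries in EM are computed. As a rough illustration only, learnable-query modules are commonly realized with cross-attention; whether TEMA uses exactly this form is an assumption, and all symbols below ($Q$, $T$, $W_K$, $W_V$, $N_q$, $L$, $d$) are illustrative names, not taken from the paper.

```latex
% Assumed notation (hypothetical, for illustration only):
%   Q \in R^{N_q x d} : learnable queries, one slot per candidate entity
%   T \in R^{L x d}   : token features of the multi-modification text
%   W_K, W_V          : key / value projection matrices
\begin{aligned}
A &= \operatorname{softmax}\!\left(\frac{Q\,(T W_K)^{\top}}{\sqrt{d}}\right) \in \mathbb{R}^{N_q \times L}, \\
E^{\text{text}} &= A\,(T W_V) \in \mathbb{R}^{N_q \times d}.
\end{aligned}
```

Each row of $A$ sums to 1, so every query pools the tokens it attends to. Under this sketch, several clauses about the same entity can end up consolidated in one query slot, and applying the same queries to image patch features would give per-slot visual entities to align against.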
🏗️ Architecture
1. Data Generation Pipeline
2. TEMA Framework
🚀 Experimental Results
Table of Contents
- Introduction
- Key Contributions
- Architecture
- Experimental Results
- Installation
- Data Preparation
- Quick Start
- Acknowledgement
- Citation
📦 Installation
1. Clone the repository
git clone https://github.com/lee-zixu/ACL26-TEMA
cd ACL26-TEMA
2. Setup Python Environment
The code was tested with Python 3.10.8 and PyTorch 2.5.1 on an NVIDIA A40 GPU (48 GB). We recommend using Anaconda to create an isolated virtual environment:
conda create -n tema python=3.10.8
conda activate tema
# Install PyTorch
pip install torch==2.5.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install core dependencies
pip install transformers==4.25.0
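After installation, it can help to confirm that the environment matches the versions listed above. The following is a minimal sketch, not part of the repository: `check_versions` and `EXPECTED` are hypothetical names, and the check relies only on the standard library's `importlib.metadata`.

```python
from importlib import metadata

# Versions stated in this README; torchvision and torchaudio are unpinned there.
EXPECTED = {"torch": "2.5.1", "transformers": "4.25.0"}


def check_versions(expected):
    """Return {package: status} comparing installed versions with `expected`."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # Local build tags such as "+cu121" are ignored for the comparison.
        base = installed.split("+")[0]
        if base == wanted:
            report[name] = "ok"
        else:
            report[name] = f"found {installed}, expected {wanted}"
    return report


if __name__ == "__main__":
    for package, status in check_versions(EXPECTED).items():
        print(f"{package}: {status}")
```

Run it inside the activated `tema` environment; any line that does not end in `ok` points at a package to reinstall.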
📂 Data Preparation
We evaluated our framework on our newly proposed M-FashionIQ and M-CIRR datasets. Please prepare the data by following the steps below:
1. Fashion-domain Dataset: M-FashionIQ
First, download the FashionIQ dataset following the instructions in the official repository. After downloading, to obtain our proposed M-FashionIQ dataset, replace the captions folder with our provided mmt_captions.
Ensure the folder structure matches the following:
├── M-FashionIQ
│ ├── mmt_captions
│ │ ├── cap.dress.[train | val].mmt.json
│ │ ├── cap.toptee.[train | val].mmt.json
│ │ ├── cap.shirt.[train | val].mmt.json
│ ├── image_splits
│ │ ├── split.dress.[train | val | test].json
│ │ ├── split.toptee.[train | val | test].json
│ │ ├── split.shirt.[train | val | test].json
│ ├── dress
│ │ ├── [B000ALGQSY.jpg | B000AY2892.jpg | B000AYI3L4.jpg |...]
│ ├── shirt
│ │ ├── [B00006M009.jpg | B00006M00B.jpg | B00006M6IH.jpg | ...]
│ ├── toptee
│ │ ├── [B0000DZQD6.jpg | B000A33FTU.jpg | B000AS2OVA.jpg | ...]
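To check a local copy against the tree above, a small script along these lines may be useful. It is a sketch under the assumption that the layout is exactly as drawn; `expected_fashioniq_paths` and `missing_paths` are hypothetical helpers, not functions from this repository.

```python
from pathlib import Path

CATEGORIES = ("dress", "toptee", "shirt")


def expected_fashioniq_paths(root):
    """List the files and folders shown in the M-FashionIQ tree."""
    root = Path(root)
    paths = []
    for cat in CATEGORIES:
        paths.append(root / cat)  # image folder for this category
        for split in ("train", "val"):
            paths.append(root / "mmt_captions" / f"cap.{cat}.{split}.mmt.json")
        for split in ("train", "val", "test"):
            paths.append(root / "image_splits" / f"split.{cat}.{split}.json")
    return paths


def missing_paths(root):
    """Return the expected paths that do not exist under `root`."""
    return [p for p in expected_fashioniq_paths(root) if not p.exists()]


if __name__ == "__main__":
    # "M-FashionIQ" is assumed to sit in the current working directory.
    for path in missing_paths("M-FashionIQ"):
        print("missing:", path)
```

The script only checks that paths exist; it does not validate the JSON contents or the images.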
2. Open-domain Dataset: M-CIRR
First, download the CIRR dataset following the instructions in the official repository. After downloading, to obtain our proposed M-CIRR dataset, replace the captions folder with our provided mmt_captions.
Ensure the folder structure matches the following:
├── M-CIRR
│ ├── train
│ │ ├── [0 | 1 | 2 | ...]
│ │ │ ├── [train-10108-0-img0.png | train-10108-0-img1.png | ...]
│ ├── dev
│ │ ├── [dev-0-0-img0.png | dev-0-0-img1.png | ...]
│ ├── test1
│ │ ├── [test1-0-0-img0.png | test1-0-0-img1.png | ...]
│ ├── mcirr
│ │ ├── mmt_captions
│ │ │ ├── cap.rc2.[train | val | test1].mmt.json
│ │ ├── image_splits
│ │ │ ├── split.rc2.[train | val | test1].json
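The M-CIRR layout can be checked in a similar way. The sketch below assumes the tree above is exact; `check_mcirr` is a hypothetical helper that reports missing entries and counts PNG files per image folder (searching recursively, since training images sit in numbered subfolders).

```python
from pathlib import Path

IMAGE_DIRS = ("train", "dev", "test1")
SPLITS = ("train", "val", "test1")


def check_mcirr(root):
    """Return (missing, png_counts) for the M-CIRR tree."""
    root = Path(root)
    expected = [root / d for d in IMAGE_DIRS]
    for split in SPLITS:
        expected.append(root / "mcirr" / "mmt_captions" / f"cap.rc2.{split}.mmt.json")
        expected.append(root / "mcirr" / "image_splits" / f"split.rc2.{split}.json")
    missing = [p for p in expected if not p.exists()]

    counts = {}
    for d in IMAGE_DIRS:
        folder = root / d
        # Count PNGs at any depth; a missing folder counts as zero images.
        counts[d] = sum(1 for _ in folder.rglob("*.png")) if folder.is_dir() else 0
    return missing, counts


if __name__ == "__main__":
    # "M-CIRR" is assumed to sit in the current working directory.
    missing, counts = check_mcirr("M-CIRR")
    for path in missing:
        print("missing:", path)
    print("png files per folder:", counts)
```

A zero count for a folder that exists usually means the images were extracted to a different location.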
🚀 Quick Start
Training Phase
To start training TEMA on your prepared datasets, execute the following command:
python3 train.py
🤝 Acknowledgement
Our implementation is based on the LAVIS framework. We sincerely thank its authors for their open-source contributions!
🔗 Related Projects
Ecosystem & Other Works from our Team
- ConeSep (CVPR'26): Paper | Project | Code | Blog Post (Chinese)
- Air-Know (CVPR'26): Paper | Project | Code | Blog Post (Chinese)
- ReTrack (AAAI'26): Paper | Project | Code
- INTENT (AAAI'26): Paper | Project | Code
- HUD (ACM MM'25): Paper | Project | Code
- OFFSET (ACM MM'25): Paper | Project | Code
- ENCODER (AAAI'25): Paper | Project | Code
- HABIT (AAAI'26): Paper | Project | Code
📝 Citation
If you find our paper, the M-FashionIQ/M-CIRR datasets, or this codebase useful in your research, please consider citing our work:
@inproceedings{TEMA,
title={TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval},
author={Li, Zixu and Hu, Yupeng and Fu, Zhiheng and Chen, Zhiwei and Li, Yongqi and Nie, Liqiang},
booktitle={Proceedings of the Association for Computational Linguistics (ACL)},
year={2026}
}
🫡 Support & Contributing
We welcome all forms of contribution! If you have questions or ideas, or if you find a bug, please feel free to:
- Open an Issue for discussions or bug reports.
- Submit a Pull Request to improve the codebase.
📄 License
This project is released under the terms of the LICENSE file included in this repository.







