Dynamic Modality Interaction Modeling for Image-Text Retrieval (DIME)
April 5, 2026
PyTorch implementation of the SIGIR'21 paper "Dynamic Modality Interaction Modeling for Image-Text Retrieval", which uses a dynamic routing mechanism for image-text retrieval.
Authors
Leigang Qu1, Meng Liu2*, Jianlong Wu1, Zan Gao3, Liqiang Nie1*
1 Shandong University
2 Shandong Jianzhu University
3 Shandong Artificial Intelligence Institute
* Corresponding author
Links
- Paper: SIGIR 2021
- Code Repository: GitHub
- Pretrained Models: Google Drive
- Dataset: SCAN Project
Table of Contents
- Introduction
- Highlights
- Method / Framework
- Project Structure
- Installation
- Checkpoints / Models
- Dataset / Benchmark
- Vocabulary Preparation
- Usage
- Citation
- Acknowledgement
- License
Introduction
This repository provides the official PyTorch implementation of the paper "Dynamic Modality Interaction Modeling for Image-Text Retrieval" (DIME), built on top of VSRN and CAMERA.
Problem: Image-text retrieval is a fundamental task in information retrieval, but remains challenging due to the difficulty of intra-modal reasoning and cross-modal alignment. Existing modality interaction methods heavily rely on expert experience and empirical feedback for interaction pattern design, lacking flexibility.
Solution: DIME develops a novel modality interaction modeling network based on a routing mechanism, which is the first unified and dynamic multimodal interaction framework for image-text retrieval. The framework:
- Designs four types of cells as basic units to explore different levels of modality interactions
- Connects them in a dense strategy to construct a routing space
- Integrates a dynamic router in each cell for pattern exploration
- Learns different activated paths dynamically based on input, enabling adaptive cross-modal alignment
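The routing idea in the list above can be illustrated with a small, dependency-free Python 3 sketch. It is not the code in models/Router.py or models/Cells.py: the `ToyRouter` class, the two toy cells, the sigmoid gating, and the random weights are all assumptions made for illustration. The point is only that the path weights are computed from the input, so different inputs blend the cell outputs differently.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mean_pool(features):
    """Average a list of equal-length feature vectors into a single vector."""
    return [sum(column) / len(features) for column in zip(*features)]


class ToyRouter:
    """Hypothetical router: one gate per candidate path, computed from the input."""

    def __init__(self, dim, num_paths, seed=0):
        rng = random.Random(seed)
        # Random weights stand in for parameters that would normally be learned.
        self.weights = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                        for _ in range(num_paths)]

    def path_probs(self, features):
        pooled = mean_pool(features)
        gates = [sigmoid(sum(w * x for w, x in zip(row, pooled)))
                 for row in self.weights]
        total = sum(gates)  # always > 0 because sigmoid outputs are positive
        return [g / total for g in gates]


def identity_cell(features):
    """Toy cell that passes features through unchanged."""
    return [list(vec) for vec in features]


def context_cell(features):
    """Toy interaction cell: mix every vector with the mean of all vectors."""
    pooled = mean_pool(features)
    return [[0.5 * x + 0.5 * m for x, m in zip(vec, pooled)] for vec in features]


def routed_forward(features, cells, router):
    """Run all cells and blend their outputs with input-dependent path weights."""
    probs = router.path_probs(features)
    outputs = [cell(features) for cell in cells]
    blended = [
        [sum(p * out[i][d] for p, out in zip(probs, outputs))
         for d in range(len(features[0]))]
        for i in range(len(features))
    ]
    return blended, probs


regions = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]  # two toy region features
router = ToyRouter(dim=3, num_paths=2)
blended, probs = routed_forward(regions, [identity_cell, context_cell], router)
print(probs)
```

Calling `routed_forward` on a different `regions` list generally yields different `probs`, which is the sense in which the activated paths depend on the input.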
Key Features:
- Dynamic route decision capability conditioned on inputs
- Comprehensive evaluation on Flickr30K and MS-COCO benchmarks
- Significant performance improvements over state-of-the-art baselines
- Provides training, evaluation, and inference scripts
Highlights
- Dynamic routing mechanism for flexible modality interaction
- Four types of interaction cells for different levels of cross-modal understanding
- Support for image-to-text (i2t) and text-to-image (t2i) retrieval tasks
- Evaluation on two benchmark datasets: Flickr30K and MS-COCO
- Pre-trained model checkpoints available for download
- Comprehensive training, inference, and evaluation scripts
Method / Framework
Framework Overview
DIME employs a novel routing-based approach to dynamically select interaction paths for different data instances.

Figure 1. Overall framework of DIME with dynamic routing mechanism for multimodal interaction.
The model consists of:
- Visual Encoder: Processes image features with bottom-up attention
- Text Encoder: BERT-based text encoding
- Interaction Cells: Four types of cells (Self-Attention, Cross-Attention, etc.) for different interaction patterns
- Dynamic Router: Routes inputs through different cell combinations
- Similarity Computation: Final matching scores for retrieval
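As a rough illustration of the last component, the sketch below scores image and text embeddings with cosine similarity and ranks the texts for one image. Plain cosine similarity over single vectors is an assumption chosen for brevity; the similarity used in this repository may be computed differently inside the interaction modules. The function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def similarity_matrix(image_embs, text_embs):
    """Return sims[i][j], the cosine similarity of image i and text j."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(img, txt)) for txt in texts]
            for img in images]


def rank_texts(sims, image_index):
    """Text indices sorted from most to least similar to the given image."""
    row = sims[image_index]
    return sorted(range(len(row)), key=lambda j: row[j], reverse=True)


image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
sims = similarity_matrix(image_embs, text_embs)
print(rank_texts(sims, 0))  # text 0 first, then text 2, then text 1
```

Image-to-text retrieval ranks along rows of the matrix; text-to-image retrieval ranks along its columns.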
Project Structure
.
├── models/ # Core model implementations
│ ├── BERT.py # BERT-based text encoder
│ ├── VisNet.py # Visual feature encoder
│ ├── TextNet.py # Text feature encoder
│ ├── SelfAttention.py # Self-attention cells
│ ├── DynamicInteraction.py # Dynamic interaction module
│ ├── Router.py # Dynamic routing mechanism
│ ├── Cells.py # Interaction cell implementations
│ ├── Refinement.py # Feature refinement
│ └── InteractionModule.py # Interaction module
├── misc/ # Utilities
│ ├── cocoeval.py # COCO evaluation metrics
│ ├── rewards.py # Reward computation
│ └── utils.py # Helper functions
├── fig/ # Model framework figure
├── data.py # Data loading utilities
├── evaluation.py # Evaluation functions
├── evaluate_models.py # Evaluation script
├── train.py # Training script
├── loss.py # Loss functions
├── tokenization.py # Text tokenization
├── model.py # Main model definition
├── requirement.txt # Dependencies
└── README.md
Installation
1. Clone the repository
git clone https://github.com/iLearn-Lab/SIGIR21-DIME
cd SIGIR21-DIME
2. Create environment
python -m venv .venv
source .venv/bin/activate # Linux / Mac
# .venv\Scripts\activate # Windows
3. Install dependencies
pip install -r requirement.txt
Requirements:
- Python >= 2.7
- PyTorch >= 1.0.1
- NumPy >= 1.16.5
- TensorBoard
- pycocotools
- torchvision
- matplotlib
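A quick way to confirm that the environment has the packages listed above is to check whether each one can be found by the import system. This is a minimal sketch, not part of the repository; the import names in `REQUIRED_MODULES` are assumptions (in particular, the code may import TensorBoard under a different module name).

```python
import importlib.util

# Assumed import names for the packages listed in the requirements.
REQUIRED_MODULES = ["torch", "numpy", "tensorboard", "pycocotools",
                    "torchvision", "matplotlib"]


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All listed modules were found.")
```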
Checkpoints / Models
Pretrained models are available for download:
- Pretrained Checkpoint: Download
After downloading, place the model in a checkpoints/ directory:
mkdir -p checkpoints
# Place downloaded model files here
Dataset / Benchmark
Dataset Preparation
We use splits produced by Andrej Karpathy. The evaluation is performed on two benchmark datasets:
- Flickr30K: 31,000 images with 5 captions per image
- MS-COCO: 123,287 images in total (training, validation, and test splits combined) with 5 captions per image
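Because every image has 5 captions, caption and image indices are related by simple integer arithmetic when captions are stored contiguously per image. The sketch below assumes that layout (captions 0 to 4 belong to image 0, captions 5 to 9 to image 1, and so on); the helper names are hypothetical and the actual loader in data.py may organize the data differently.

```python
CAPTIONS_PER_IMAGE = 5


def image_index_for_caption(caption_index, captions_per_image=CAPTIONS_PER_IMAGE):
    """Index of the image that a caption describes (assumes contiguous storage)."""
    return caption_index // captions_per_image


def caption_indices_for_image(image_index, captions_per_image=CAPTIONS_PER_IMAGE):
    """Indices of all captions that describe the given image."""
    start = image_index * captions_per_image
    return list(range(start, start + captions_per_image))


print(image_index_for_caption(7))    # caption 7 belongs to image 1
print(caption_indices_for_image(2))  # captions 10 through 14
```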
Download Pre-extracted Features
All precomputed image features can be downloaded from SCAN:
wget https://scanproject.blob.core.windows.net/scan-data/data.zip
unzip data.zip
Or from Google Drive: SCAN Dataset
We refer to the extracted data path as $DATA_PATH.
BERT Model Setup
Convert the Google BERT model to PyTorch format following this guide:
# Refer to the link above for detailed conversion steps
# Save the converted model to $BERT_PATH
For more details on data pre-processing, see SCAN preprocessing guide.
Vocabulary Preparation
Text vocabularies are required for tokenization and model training. You have two options:
Option 1: Download Pre-built Vocabularies (Recommended)
Pre-built vocabulary files for both Flickr30K and MS-COCO are available in the SCAN project dataset:
# Download vocab.zip from the SCAN project
# https://www.kaggle.com/datasets/kuanghueilee/scan-features
# Extract to the vocabulary directory
unzip vocab.zip -d ./vocab
The extracted directory should contain:
- coco_precomp_vocab.json - Vocabulary for MS-COCO dataset
- f30k_precomp_vocab.json - Vocabulary for Flickr30K dataset
We refer to the vocabulary path as $VOCAB_PATH (typically ./vocab).
Option 2: Generate Vocabularies Locally
If you prefer to build vocabularies from scratch or are working with a custom dataset, you can use the vocabulary generation script from the SCAN repository:
1. Clone or download the SCAN repository:
git clone https://github.com/kuanghuei/SCAN.git
cd SCAN
2. Generate vocabulary for your dataset:
# For Flickr30K
python vocab.py --data_path $DATA_PATH --data_name f30k_precomp
# For MS-COCO
python vocab.py --data_path $DATA_PATH --data_name coco_precomp
Replace $DATA_PATH with the path to your extracted dataset features.
3. Copy generated vocabularies to DIME directory:
# Copy from SCAN vocab/ directory to DIME project
cp -r SCAN/vocab/ /path/to/DIME/vocab/
Vocabulary File Structure:
Each vocabulary JSON file contains:
- word2idx: Mapping from words to integer indices
- idx2word: Mapping from indices to words
- Other metadata used for encoding/decoding text
For more details on vocabulary generation and format, refer to the SCAN data preparation documentation.
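To make the word2idx and idx2word mappings concrete, here is a small sketch that loads a vocabulary from JSON text and encodes a caption. The toy vocabulary, the `<unk>` token for out-of-vocabulary words, and the lowercase whitespace tokenization are assumptions for illustration; the real files and the tokenizer used by SCAN and this repository may differ. JSON object keys are always strings, so the sketch converts idx2word keys back to integers.

```python
import json


def load_vocab(json_text):
    """Parse vocabulary JSON and return (word2idx, idx2word) dictionaries."""
    data = json.loads(json_text)
    word2idx = data["word2idx"]
    # JSON stores object keys as strings; restore integer indices.
    idx2word = {int(index): word for index, word in data["idx2word"].items()}
    return word2idx, idx2word


def encode(caption, word2idx, unk_token="<unk>"):
    """Map a caption to indices; unknown words fall back to the assumed <unk> token."""
    tokens = caption.lower().split()
    return [word2idx.get(token, word2idx[unk_token]) for token in tokens]


def decode(indices, idx2word):
    """Map indices back to a space-separated string of words."""
    return " ".join(idx2word[i] for i in indices)


# Toy vocabulary standing in for a real *_precomp_vocab.json file.
toy_vocab = {
    "word2idx": {"<unk>": 0, "a": 1, "dog": 2, "runs": 3},
    "idx2word": {"0": "<unk>", "1": "a", "2": "dog", "3": "runs"},
}
word2idx, idx2word = load_vocab(json.dumps(toy_vocab))
ids = encode("A dog jumps", word2idx)
print(ids, decode(ids, idx2word))
```

To use a real file, read its contents with `open(path).read()` and pass the text to `load_vocab`.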
Usage
Training
Train the DIME model for each dataset and retrieval direction:
COCO - Image-to-Text (i2t):
python train.py \
--data_path $DATA_PATH \
--bert_path $BERT_PATH \
--data_name coco_precomp \
--logger_name runs/coco_DIME_i2t \
--direction i2t \
--extra_stc 1 \
--lambda_softmax 4
COCO - Text-to-Image (t2i):
python train.py \
--data_path $DATA_PATH \
--bert_path $BERT_PATH \
--data_name coco_precomp \
--logger_name runs/coco_DIME_t2i \
--direction t2i \
--extra_img 1 \
--lambda_softmax 9
Flickr30K - Image-to-Text (i2t):
python train.py \
--data_path $DATA_PATH \
--bert_path $BERT_PATH \
--data_name f30k_precomp \
--logger_name runs/flickr_DIME_i2t \
--direction i2t \
--extra_stc 1 \
--lambda_softmax 4
Flickr30K - Text-to-Image (t2i):
python train.py \
--data_path $DATA_PATH \
--bert_path $BERT_PATH \
--data_name f30k_precomp \
--logger_name runs/flickr_DIME_t2i \
--direction t2i \
--extra_img 1 \
--lambda_softmax 9
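The four commands above differ only in the dataset name, the direction, and two direction-specific flags. The sketch below generates the same argument lists from a small table, which can help when scripting several runs. The flag names and values are copied from the commands above; the `build_train_command` helper and the example paths are hypothetical.

```python
# Direction-specific flags, copied from the training commands above.
DIRECTION_FLAGS = {
    "i2t": ["--extra_stc", "1", "--lambda_softmax", "4"],
    "t2i": ["--extra_img", "1", "--lambda_softmax", "9"],
}

# Prefix used in --logger_name for each dataset, as in the commands above.
RUN_PREFIX = {"coco_precomp": "coco", "f30k_precomp": "flickr"}


def build_train_command(data_name, direction, data_path, bert_path):
    """Return the train.py invocation as an argument list (usable with subprocess.run)."""
    logger_name = "runs/{}_DIME_{}".format(RUN_PREFIX[data_name], direction)
    return [
        "python", "train.py",
        "--data_path", data_path,
        "--bert_path", bert_path,
        "--data_name", data_name,
        "--logger_name", logger_name,
        "--direction", direction,
    ] + DIRECTION_FLAGS[direction]


for data_name in RUN_PREFIX:
    for direction in DIRECTION_FLAGS:
        command = build_train_command(data_name, direction,
                                      "/path/to/data", "/path/to/bert")
        print(" ".join(command))
```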
Evaluation
To evaluate trained models:
- Update the model path and data path in evaluate_models.py
- Run the evaluation script:
python evaluate_models.py
Options:
- fold5=True: Evaluate on COCO 1K test set
- fold5=False: Evaluate on COCO 5K test set
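The difference between the two settings can be sketched with Recall@K on a toy similarity matrix: the full-set score ranks each query against every candidate, while the fold-based score splits the test set into equal folds, ranks within each fold, and averages the results. This is a simplified illustration and not the code in evaluation.py. It assumes one caption per image (caption i matches image i) for brevity, whereas the real data has five captions per image, and all function names are hypothetical.

```python
def recall_at_k(ranks, k):
    """Percentage of queries whose correct match has a 0-based rank below k."""
    return 100.0 * sum(1 for r in ranks if r < k) / len(ranks)


def i2t_ranks(sims):
    """Rank of the matching caption for each image (caption i matches image i)."""
    return [sum(1 for s in row if s > row[i]) for i, row in enumerate(sims)]


def fold_averaged_recall(sims, k, num_folds):
    """Split the test set into folds, rank within each fold, and average Recall@K."""
    size = len(sims) // num_folds
    scores = []
    for fold in range(num_folds):
        idx = range(fold * size, (fold + 1) * size)
        sub = [[sims[i][j] for j in idx] for i in idx]
        scores.append(recall_at_k(i2t_ranks(sub), k))
    return sum(scores) / len(scores)


# Toy 4 x 4 similarity matrix: sims[i][j] scores image i against caption j.
sims = [
    [0.9, 0.1, 0.95, 0.2],
    [0.2, 0.8, 0.1, 0.3],
    [0.1, 0.2, 0.7, 0.6],
    [0.3, 0.1, 0.9, 0.5],
]
full_r1 = recall_at_k(i2t_ranks(sims), 1)            # rank against all 4 captions
fold_r1 = fold_averaged_recall(sims, 1, num_folds=2) # rank within folds of 2
print(full_r1, fold_r1)
```

In this toy case the fold-averaged score is higher than the full-set score because each query competes against fewer candidates, which is also why 1K results are typically higher than 5K results.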
Citation
If you find this code useful for your research, please cite the paper:
@inproceedings{qu2021dynamic,
title={Dynamic Modality Interaction Modeling for Image-Text Retrieval},
author={Qu, Leigang and Liu, Meng and Wu, Jianlong and Gao, Zan and Nie, Liqiang},
booktitle={Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages={1104--1113},
year={2021},
organization={ACM}
}
Acknowledgement
Thanks to:
- The authors of VSRN and CAMERA for foundational work
- SCAN Project for providing pre-extracted image features
- Bottom-up Attention for visual feature extraction
- BERT-pytorch for the BERT implementation
License
This project is licensed under the MIT License. See the LICENSE file for details.