Dynamic Modality Interaction Modeling for Image-Text Retrieval (DIME)

April 5, 2026

PyTorch implementation of the SIGIR'21 paper "Dynamic Modality Interaction Modeling for Image-Text Retrieval", which uses a dynamic routing mechanism for image-text retrieval.

Authors

Leigang Qu¹, Meng Liu²*, Jianlong Wu¹, Zan Gao³, Liqiang Nie¹*

¹ Shandong University
² Shandong Jianzhu University
³ Shandong Artificial Intelligence Institute

* Corresponding author

Introduction

This repository provides the official PyTorch implementation of the paper "Dynamic Modality Interaction Modeling for Image-Text Retrieval" (DIME), built on top of VSRN and CAMERA.

Problem: Image-text retrieval is a fundamental task in information retrieval, but it remains challenging because both intra-modal reasoning and cross-modal alignment are difficult. Existing methods rely heavily on expert experience and empirical feedback to hand-design interaction patterns, which limits their flexibility.

Solution: DIME develops a novel modality interaction modeling network based on a routing mechanism, which is the first unified and dynamic multimodal interaction framework for image-text retrieval. The framework:

Designs four types of cells as basic units to explore different levels of modality interactions
Connects them in a dense strategy to construct a routing space
Integrates a dynamic router in each cell for pattern exploration
Learns different activated paths dynamically based on input, enabling adaptive cross-modal alignment

Key Features:

Dynamic route decision capability conditioned on inputs
Comprehensive evaluation on Flickr30K and MS-COCO benchmarks
Significant performance improvements over state-of-the-art baselines
Provides training, evaluation, and inference scripts

Highlights

Dynamic routing mechanism for flexible modality interaction
Four types of interaction cells for different levels of cross-modal understanding
Support for image-to-text (i2t) and text-to-image (t2i) retrieval tasks
Evaluation on two benchmark datasets: Flickr30K and MS-COCO
Pre-trained model checkpoints available for download
Comprehensive training, inference, and evaluation scripts

Method / Framework

Framework Overview

DIME employs a novel routing-based approach to dynamically select interaction paths for different data instances.

Figure 1. Overall framework of DIME with dynamic routing mechanism for multimodal interaction.

The model consists of:

Visual Encoder: Processes image features with bottom-up attention
Text Encoder: BERT-based text encoding
Interaction Cells: Four types of cells (Self-Attention, Cross-Attention, etc.) for different interaction patterns
Dynamic Router: Routes inputs through different cell combinations
Similarity Computation: Final matching scores for retrieval
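
The routing idea in the list above can be illustrated with a toy example. This is a minimal sketch in plain Python, not the code in models/Router.py or models/Cells.py: the names ToyRouter, identity_cell, context_cell, and route are hypothetical, the two cells are simple stand-ins, and softmax is used here as a convenient way to get path weights, which may differ from the gating function the paper uses. It only shows how path weights can depend on the input and how cell outputs are then mixed.

```python
import math
import random


def mean_pool(tokens):
    """Average a list of equal-length feature vectors into one vector."""
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ToyRouter:
    """Hypothetical router: one linear layer from the pooled input to path scores."""

    def __init__(self, dim, num_paths, seed=0):
        rng = random.Random(seed)
        self.weights = [
            [rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_paths)
        ]

    def __call__(self, tokens):
        pooled = mean_pool(tokens)
        scores = [sum(w * x for w, x in zip(row, pooled)) for row in self.weights]
        return softmax(scores)  # one weight per candidate path, summing to 1


def identity_cell(tokens):
    """Stand-in cell that passes features through unchanged."""
    return [list(tok) for tok in tokens]


def context_cell(tokens):
    """Stand-in cell that mixes every token with the mean of all tokens."""
    ctx = mean_pool(tokens)
    return [[0.5 * t + 0.5 * c for t, c in zip(tok, ctx)] for tok in tokens]


def route(tokens, cells, router):
    """Run every cell, then blend their outputs with input-dependent weights."""
    probs = router(tokens)
    outputs = [cell(tokens) for cell in cells]
    dim = len(tokens[0])
    mixed = [
        [sum(p * out[i][d] for p, out in zip(probs, outputs)) for d in range(dim)]
        for i in range(len(tokens))
    ]
    return mixed, probs


cells = [identity_cell, context_cell]
router = ToyRouter(dim=3, num_paths=len(cells))
tokens_a = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
tokens_b = [[-2.0, 3.0, 0.5], [1.0, -1.0, 0.0]]
out_a, probs_a = route(tokens_a, cells, router)
out_b, probs_b = route(tokens_b, cells, router)
print("path weights for input A:", probs_a)
print("path weights for input B:", probs_b)
```

The two inputs receive different path weights from the same router, which is the sense in which the activated path is conditioned on the input.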

Project Structure

.
├── models/                # Core model implementations
│   ├── BERT.py           # BERT-based text encoder
│   ├── VisNet.py         # Visual feature encoder
│   ├── TextNet.py        # Text feature encoder
│   ├── SelfAttention.py   # Self-attention cells
│   ├── DynamicInteraction.py  # Dynamic interaction module
│   ├── Router.py         # Dynamic routing mechanism
│   ├── Cells.py          # Interaction cell implementations
│   ├── Refinement.py     # Feature refinement
│   └── InteractionModule.py  # Interaction module
├── misc/                  # Utilities
│   ├── cocoeval.py       # COCO evaluation metrics
│   ├── rewards.py        # Reward computation
│   └── utils.py          # Helper functions
├── fig/                   # Model framework figure
├── data.py               # Data loading utilities
├── evaluation.py         # Evaluation functions
├── evaluate_models.py    # Evaluation script
├── train.py              # Training script
├── loss.py               # Loss functions
├── tokenization.py       # Text tokenization
├── model.py              # Main model definition
├── requirement.txt       # Dependencies
└── README.md

Installation

1. Clone the repository

git clone https://github.com/iLearn-Lab/SIGIR21-DIME
cd SIGIR21-DIME

2. Create environment

python -m venv .venv
source .venv/bin/activate   # Linux / Mac
# .venv\Scripts\activate    # Windows

3. Install dependencies

pip install -r requirement.txt

Requirements:

Python >= 2.7
PyTorch >= 1.0.1
NumPy >= 1.16.5
TensorBoard
pycocotools
torchvision
matplotlib

Checkpoints / Models

Pretrained models are available for download:

Pretrained Checkpoint: Download

After downloading, place the model in a checkpoints/ directory:

mkdir -p checkpoints
# Place downloaded model files here

Dataset / Benchmark

Dataset Preparation

We use splits produced by Andrej Karpathy. The evaluation is performed on two benchmark datasets:

Flickr30K: roughly 31,000 images with 5 captions per image
MS-COCO: 123,287 images in total (training, validation, and test combined) with 5 captions per image

Download Pre-extracted Features

All precomputed image features can be downloaded from SCAN:

wget https://scanproject.blob.core.windows.net/scan-data/data.zip
unzip data.zip

Or from Google Drive: SCAN Dataset

We refer to the extracted data path as $DATA_PATH.

BERT Model Setup

Convert the Google BERT model to PyTorch format following this guide:

# Refer to the link above for detailed conversion steps
# Save the converted model to $BERT_PATH

For more details on data pre-processing, see SCAN preprocessing guide.

Vocabulary Preparation

Text vocabularies are required for tokenization and model training. You have two options:

Option 1: Download Pre-built Vocabularies (Recommended)

Pre-built vocabulary files for both Flickr30K and MS-COCO are available in the SCAN project dataset:

# Download vocab.zip from the SCAN project
# https://www.kaggle.com/datasets/kuanghueilee/scan-features

# Extract to the vocabulary directory
unzip vocab.zip -d ./vocab

The extracted directory should contain:

coco_precomp_vocab.json - Vocabulary for MS-COCO dataset
f30k_precomp_vocab.json - Vocabulary for Flickr30K dataset

We refer to the vocabulary path as $VOCAB_PATH (typically ./vocab).

Option 2: Generate Vocabularies Locally

If you prefer to build vocabularies from scratch or are working with a custom dataset, you can use the vocabulary generation script from the SCAN repository:

Clone or download the SCAN repository:

git clone https://github.com/kuanghuei/SCAN.git
cd SCAN

Generate vocabulary for your dataset:

# For Flickr30K
python vocab.py --data_path $DATA_PATH --data_name f30k_precomp

# For MS-COCO
python vocab.py --data_path $DATA_PATH --data_name coco_precomp

Replace $DATA_PATH with the path to your extracted dataset features.

Copy generated vocabularies to DIME directory:

# Run from inside the SCAN directory: copy the contents of its vocab/ folder into the DIME project
mkdir -p /path/to/DIME/vocab
cp -r vocab/. /path/to/DIME/vocab/

Vocabulary File Structure:

Each vocabulary JSON file contains:

word2idx: Mapping from words to integer indices
idx2word: Mapping from indices to words
Other metadata used for encoding/decoding text
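
As an illustration of how the word2idx mapping is used, here is a minimal sketch that loads a vocabulary file and turns a caption into integer indices. It is not the repository's loader: load_word2idx and encode_caption are hypothetical helpers, whitespace splitting is a simplification of real tokenization, the "<unk>" token name is an assumption about the file contents, and the toy vocabulary is invented so the example runs without any downloads.

```python
import json
import os
import tempfile


def load_word2idx(vocab_file):
    """Read a vocabulary JSON file and return its word -> index mapping."""
    with open(vocab_file, encoding="utf-8") as f:
        return json.load(f)["word2idx"]


def encode_caption(caption, word2idx, unk_token="<unk>"):
    """Map each lowercased, whitespace-separated word to its index.

    Words missing from the vocabulary fall back to the index of unk_token
    (the token name is an assumption, check your own vocabulary file).
    """
    unk = word2idx[unk_token]
    return [word2idx.get(word, unk) for word in caption.lower().split()]


# Toy vocabulary written to a temporary file so the sketch is self-contained.
toy_vocab = {
    "word2idx": {"<unk>": 0, "a": 1, "dog": 2, "runs": 3},
    "idx2word": {"0": "<unk>", "1": "a", "2": "dog", "3": "runs"},
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_vocab.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(toy_vocab, f)
    word2idx = load_word2idx(path)

ids = encode_caption("A dog runs quickly", word2idx)
print(ids)  # [1, 2, 3, 0]
```

To try it on a real file, point load_word2idx at one of the JSON files under $VOCAB_PATH.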

For more details on vocabulary generation and format, refer to the SCAN data preparation documentation.

Usage

Training

Train the DIME model for each dataset and retrieval direction:

COCO - Image-to-Text (i2t):

python train.py \
  --data_path $DATA_PATH \
  --bert_path $BERT_PATH \
  --data_name coco_precomp \
  --logger_name runs/coco_DIME_i2t \
  --direction i2t \
  --extra_stc 1 \
  --lambda_softmax 4

COCO - Text-to-Image (t2i):

python train.py \
  --data_path $DATA_PATH \
  --bert_path $BERT_PATH \
  --data_name coco_precomp \
  --logger_name runs/coco_DIME_t2i \
  --direction t2i \
  --extra_img 1 \
  --lambda_softmax 9

Flickr30K - Image-to-Text (i2t):

python train.py \
  --data_path $DATA_PATH \
  --bert_path $BERT_PATH \
  --data_name f30k_precomp \
  --logger_name runs/flickr_DIME_i2t \
  --direction i2t \
  --extra_stc 1 \
  --lambda_softmax 4

Flickr30K - Text-to-Image (t2i):

python train.py \
  --data_path $DATA_PATH \
  --bert_path $BERT_PATH \
  --data_name f30k_precomp \
  --logger_name runs/flickr_DIME_t2i \
  --direction t2i \
  --extra_img 1 \
  --lambda_softmax 9
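
The four commands above differ only in the dataset name, the run name, and two direction-specific flags. The sketch below makes that pattern explicit by assembling the same argument lists in Python. build_train_command, DIRECTION_FLAGS, and RUN_PREFIX are hypothetical names introduced for illustration; every flag and value is copied from the commands above, and nothing here is part of train.py.

```python
# Direction-specific flags, copied from the training commands above.
DIRECTION_FLAGS = {
    "i2t": ["--extra_stc", "1", "--lambda_softmax", "4"],
    "t2i": ["--extra_img", "1", "--lambda_softmax", "9"],
}

# Short dataset names used in the --logger_name values above.
RUN_PREFIX = {"coco_precomp": "coco", "f30k_precomp": "flickr"}


def build_train_command(data_name, direction,
                        data_path="$DATA_PATH", bert_path="$BERT_PATH"):
    """Return the train.py argument list for one dataset and direction."""
    if data_name not in RUN_PREFIX:
        raise ValueError(f"unknown data_name: {data_name}")
    if direction not in DIRECTION_FLAGS:
        raise ValueError(f"unknown direction: {direction}")
    return [
        "python", "train.py",
        "--data_path", data_path,
        "--bert_path", bert_path,
        "--data_name", data_name,
        "--logger_name", f"runs/{RUN_PREFIX[data_name]}_DIME_{direction}",
        "--direction", direction,
    ] + DIRECTION_FLAGS[direction]


# Print all four configurations as shell-style command lines.
for data_name in RUN_PREFIX:
    for direction in DIRECTION_FLAGS:
        print(" ".join(build_train_command(data_name, direction)))
```

With real paths substituted for the placeholders, the returned list could be passed to subprocess.run to launch the runs one after another.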

Evaluation

To evaluate trained models:

Update the model path and data path in evaluate_models.py
Run the evaluation script:

python evaluate_models.py

Options:

fold5=True: Evaluate on the COCO 1K test set (results averaged over five folds of 1K images)
fold5=False: Evaluate on the full COCO 5K test set
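
Retrieval quality on these benchmarks is usually reported as Recall@K. The following is a minimal sketch of that metric in plain Python, assuming the common layout where caption j belongs to image j // caps_per_image; it is not the code in evaluation.py. The function names and the toy similarity matrix are hypothetical, and average_over_folds only illustrates the idea of averaging per-fold scores in the 1K setting.

```python
def recall_at_k_i2t(sims, k, caps_per_image=5):
    """Image-to-text Recall@K in percent.

    sims[i][j] is the similarity between image i and caption j. An image
    counts as a hit if any of its own captions is among its top-k captions.
    """
    hits = 0
    for i, row in enumerate(sims):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if any(j // caps_per_image == i for j in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(sims)


def recall_at_k_t2i(sims, k, caps_per_image=5):
    """Text-to-image Recall@K in percent.

    A caption counts as a hit if its source image is among its top-k images.
    """
    num_caps = len(sims[0])
    hits = 0
    for j in range(num_caps):
        column = [row[j] for row in sims]
        ranked = sorted(range(len(column)), key=lambda i: column[i], reverse=True)
        if j // caps_per_image in ranked[:k]:
            hits += 1
    return 100.0 * hits / num_caps


def average_over_folds(fold_scores):
    """Mean of per-fold scores, as done when reporting a multi-fold result."""
    return sum(fold_scores) / len(fold_scores)


# Toy example: 2 images, 2 captions per image (4 captions in total).
toy_sims = [
    [0.9, 0.1, 0.3, 0.2],  # image 0 vs captions 0..3
    [0.8, 0.2, 0.1, 0.7],  # image 1 vs captions 0..3
]
print(recall_at_k_i2t(toy_sims, k=1, caps_per_image=2))  # 50.0
print(recall_at_k_i2t(toy_sims, k=2, caps_per_image=2))  # 100.0
print(recall_at_k_t2i(toy_sims, k=1, caps_per_image=2))  # 50.0
```

In the toy matrix, image 1 ranks a caption of image 0 first, so it only counts as a hit once K reaches 2.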

Citation

If you find this code useful for your research, please cite the paper:

@inproceedings{qu2021dynamic,
  title={Dynamic Modality Interaction Modeling for Image-Text Retrieval},
  author={Qu, Leigang and Liu, Meng and Wu, Jianlong and Gao, Zan and Nie, Liqiang},
  booktitle={Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  pages={1104--1113},
  year={2021},
  organization={ACM}
}

Acknowledgement

Thanks to:

The authors of VSRN and CAMERA for foundational work
SCAN Project for providing pre-extracted image features
Bottom-up Attention for visual feature extraction
BERT-pytorch for the BERT implementation

License

This project is licensed under the MIT License. See the LICENSE file for details.