Dynamic Modality Interaction Modeling for Image-Text Retrieval (DIME)

April 5, 2026

PyTorch implementation of the SIGIR'21 paper "Dynamic Modality Interaction Modeling for Image-Text Retrieval" (DIME), which uses a dynamic routing mechanism for image-text retrieval.

Authors

Leigang Qu1, Meng Liu2*, Jianlong Wu1, Zan Gao3, Liqiang Nie1*

1 Shandong University
2 Shandong Jianzhu University
3 Shandong Artificial Intelligence Institute

* Corresponding author



Introduction

This repository provides the official PyTorch implementation of the paper "Dynamic Modality Interaction Modeling for Image-Text Retrieval" (DIME), built on top of VSRN and CAMERA.

Problem: Image-text retrieval is a fundamental task in information retrieval, but remains challenging due to the difficulty of intra-modal reasoning and cross-modal alignment. Existing modality interaction methods heavily rely on expert experience and empirical feedback for interaction pattern design, lacking flexibility.

Solution: DIME develops a novel modality interaction modeling network based on a routing mechanism, which is the first unified and dynamic multimodal interaction framework for image-text retrieval. The framework:

  • Designs four types of cells as basic units to explore different levels of modality interactions
  • Connects them in a dense strategy to construct a routing space
  • Integrates a dynamic router in each cell for pattern exploration
  • Learns different activated paths dynamically based on input, enabling adaptive cross-modal alignment
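
The routing idea in the list above can be illustrated with a minimal, framework-free sketch. It is not the repository's implementation: the two toy cells, the linear `router`, and the ReLU(tanh(x)) gate are assumptions chosen for illustration, and a single router per layer stands in for the per-cell routers. Each cell transforms the input features, the router maps the pooled input to one non-negative gate per path, and the layer output is the gated sum of the cell outputs.

```python
import math

def gate(x):
    """Soft gate in [0, 1): ReLU(tanh(x)). Negative scores close the path."""
    return max(0.0, math.tanh(x))

def mean_pool(features):
    """Average a list of feature vectors into one vector."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]

def router(features, weights, bias):
    """Hypothetical linear router: one gate value per path."""
    pooled = mean_pool(features)
    return [gate(sum(w * p for w, p in zip(row, pooled)) + b)
            for row, b in zip(weights, bias)]

# Two toy "cells" standing in for the four interaction cells.
def identity_cell(features):
    return [list(v) for v in features]

def scale_cell(features):
    return [[2.0 * x for x in v] for v in features]

def routed_layer(features, cells, weights, bias):
    """Run every cell, then mix their outputs with input-dependent gates."""
    gates = router(features, weights, bias)
    outputs = [cell(features) for cell in cells]
    mixed = [[0.0] * len(features[0]) for _ in features]
    for g, out in zip(gates, outputs):
        for i, vec in enumerate(out):
            for j, x in enumerate(vec):
                mixed[i][j] += g * x
    return gates, mixed

features = [[1.0, 0.0], [0.0, 1.0]]   # two regions/words, 2-d features
weights = [[1.0, 1.0], [-1.0, -1.0]]  # one row of router weights per path
bias = [0.0, 0.0]
gates, mixed = routed_layer(features, [identity_cell, scale_cell], weights, bias)
```

With these toy weights the second gate is exactly zero, so only the identity cell contributes. A different input would yield different gates, which is the sense in which the activated path depends on the input.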

Key Features:

  • Dynamic route decision capability conditioned on inputs
  • Comprehensive evaluation on Flickr30K and MS-COCO benchmarks
  • Significant performance improvements over state-of-the-art baselines
  • Provides training, evaluation, and inference scripts

Highlights

  • Dynamic routing mechanism for flexible modality interaction
  • Four types of interaction cells for different levels of cross-modal understanding
  • Support for image-to-text (i2t) and text-to-image (t2i) retrieval tasks
  • Evaluation on two benchmark datasets: Flickr30K and MS-COCO
  • Pre-trained model checkpoints available for download
  • Comprehensive training, inference, and evaluation scripts

Method / Framework

Framework Overview

DIME employs a novel routing-based approach to dynamically select interaction paths for different data instances.

DIME Model Architecture

Figure 1. Overall framework of DIME with dynamic routing mechanism for multimodal interaction.

The model consists of:

  • Visual Encoder: Processes image features with bottom-up attention
  • Text Encoder: BERT-based text encoding
  • Interaction Cells: Four types of cells (Self-Attention, Cross-Attention, etc.) for different interaction patterns
  • Dynamic Router: Routes inputs through different cell combinations
  • Similarity Computation: Final matching scores for retrieval
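
As a rough illustration of the last component, the sketch below scores every image against every caption and ranks the captions for one image. Cosine similarity between pooled embeddings and the toy vectors are assumptions for illustration; the repository may compute its matching scores differently.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def similarity_matrix(image_embs, text_embs):
    """Return sims[i][j] = cosine similarity of image i and caption j."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(img, txt)) for txt in txts]
            for img in imgs]

def rank_captions(sims, image_index):
    """Caption indices sorted from best to worst match for one image."""
    row = sims[image_index]
    return sorted(range(len(row)), key=lambda j: row[j], reverse=True)

image_embs = [[1.0, 0.0], [0.0, 2.0]]
text_embs = [[0.0, 1.0], [3.0, 0.0], [1.0, 1.0]]
sims = similarity_matrix(image_embs, text_embs)
best_for_first_image = rank_captions(sims, 0)
```

Text-to-image retrieval uses the same matrix read column by column instead of row by row.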

Project Structure

.
├── models/                # Core model implementations
│   ├── BERT.py           # BERT-based text encoder
│   ├── VisNet.py         # Visual feature encoder
│   ├── TextNet.py        # Text feature encoder
│   ├── SelfAttention.py   # Self-attention cells
│   ├── DynamicInteraction.py  # Dynamic interaction module
│   ├── Router.py         # Dynamic routing mechanism
│   ├── Cells.py          # Interaction cell implementations
│   ├── Refinement.py     # Feature refinement
│   └── InteractionModule.py  # Interaction module
├── misc/                  # Utilities
│   ├── cocoeval.py       # COCO evaluation metrics
│   ├── rewards.py        # Reward computation
│   └── utils.py          # Helper functions
├── fig/                   # Model framework figure
├── data.py               # Data loading utilities
├── evaluation.py         # Evaluation functions
├── evaluate_models.py    # Evaluation script
├── train.py              # Training script
├── loss.py               # Loss functions
├── tokenization.py       # Text tokenization
├── model.py              # Main model definition
├── requirement.txt       # Dependencies
└── README.md

Installation

1. Clone the repository

git clone https://github.com/iLearn-Lab/SIGIR21-DIME
cd SIGIR21-DIME

2. Create environment

python -m venv .venv
source .venv/bin/activate   # Linux / Mac
# .venv\Scripts\activate    # Windows

3. Install dependencies

pip install -r requirement.txt

Requirements:

  • Python >= 2.7
  • PyTorch >= 1.0.1
  • NumPy >= 1.16.5
  • TensorBoard
  • pycocotools
  • torchvision
  • matplotlib

Checkpoints / Models

Pretrained models are available for download:

After downloading, place the model in a checkpoints/ directory:

mkdir -p checkpoints
# Place downloaded model files here

Dataset / Benchmark

Dataset Preparation

We use splits produced by Andrej Karpathy. The evaluation is performed on two benchmark datasets:

  • Flickr30K: 31,000 images with 5 captions per image
  • MS-COCO: 123,287 images with 5 captions per image (the Karpathy split uses 113,287 for training, 5,000 for validation, and 5,000 for testing)

Download Pre-extracted Features

All precomputed image features can be downloaded from SCAN:

wget https://scanproject.blob.core.windows.net/scan-data/data.zip
unzip data.zip

Or from Google Drive: SCAN Dataset

We refer to the extracted data path as $DATA_PATH.
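
Before training, it can help to confirm that the extracted features are where the scripts expect them. The sketch below assumes the layout used by the SCAN release, with `<data_name>/<split>_ims.npy` and `<data_name>/<split>_caps.txt` under `$DATA_PATH`. These file names and the helper `missing_feature_files` are assumptions and should be checked against `data.py`.

```python
import os
from pathlib import Path

def missing_feature_files(data_path, data_name, splits=("train", "dev", "test")):
    """List expected feature/caption files that are absent (assumed SCAN layout)."""
    root = Path(data_path) / data_name
    expected = []
    for split in splits:
        expected.append(root / f"{split}_ims.npy")   # pre-extracted region features
        expected.append(root / f"{split}_caps.txt")  # one caption per line
    return [str(p) for p in expected if not p.is_file()]

data_path = os.environ.get("DATA_PATH", "./data")  # fallback path is a placeholder
for name in ("f30k_precomp", "coco_precomp"):
    missing = missing_feature_files(data_path, name)
    print(name, "OK" if not missing else f"missing {len(missing)} file(s)")
    for path in missing:
        print("  ", path)
```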

BERT Model Setup

Convert the Google BERT model to PyTorch format following this guide:

# Refer to the link above for detailed conversion steps
# Save the converted model to $BERT_PATH

For more details on data pre-processing, see SCAN preprocessing guide.

Vocabulary Preparation

Text vocabularies are required for tokenization and model training. You have two options:

Option 1: Download Pre-built Vocabularies

Pre-built vocabulary files for both Flickr30K and MS-COCO are available in the SCAN project dataset:

# Download vocab.zip from the SCAN project
# https://www.kaggle.com/datasets/kuanghueilee/scan-features

# Extract to the vocabulary directory
unzip vocab.zip -d ./vocab

The extracted directory should contain:

  • coco_precomp_vocab.json - Vocabulary for MS-COCO dataset
  • f30k_precomp_vocab.json - Vocabulary for Flickr30K dataset

We refer to the vocabulary path as $VOCAB_PATH (typically ./vocab).

Option 2: Generate Vocabularies Locally

If you prefer to build vocabularies from scratch or are working with a custom dataset, you can use the vocabulary generation script from the SCAN repository:

  1. Clone or download the SCAN repository:

    git clone https://github.com/kuanghuei/SCAN.git
    cd SCAN
    
  2. Generate vocabulary for your dataset:

    # For Flickr30K
    python vocab.py --data_path $DATA_PATH --data_name f30k_precomp
    
    # For MS-COCO
    python vocab.py --data_path $DATA_PATH --data_name coco_precomp
    

    Replace $DATA_PATH with the path to your extracted dataset features.

  3. Copy generated vocabularies to DIME directory:

    # Run from inside the SCAN directory (step 1 ends with `cd SCAN`)
    mkdir -p /path/to/DIME/vocab
    cp vocab/*.json /path/to/DIME/vocab/
    

Vocabulary File Structure:

Each vocabulary JSON file contains:

  • word2idx: Mapping from words to integer indices
  • idx2word: Mapping from indices to words
  • Other metadata used for encoding/decoding text
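
To make the structure concrete, the sketch below builds a tiny in-memory vocabulary with the `word2idx` and `idx2word` fields listed above and encodes a caption with it. The special tokens `<start>`, `<end>`, and `<unk>`, the whitespace tokenizer, and the helper names are assumptions for illustration; inspect the actual JSON files for the tokens they define.

```python
import json

def load_vocab(path):
    """Read a vocabulary JSON file with `word2idx` and `idx2word` fields."""
    with open(path, "r", encoding="utf-8") as f:
        vocab = json.load(f)
    # JSON object keys are always strings, so convert index keys back to int.
    vocab["idx2word"] = {int(k): v for k, v in vocab["idx2word"].items()}
    return vocab

def encode(caption, word2idx, unk="<unk>", start="<start>", end="<end>"):
    """Map a caption to indices, wrapping it in assumed start/end tokens."""
    tokens = [start] + caption.lower().split() + [end]
    return [word2idx.get(tok, word2idx[unk]) for tok in tokens]

def decode(indices, idx2word):
    """Map indices back to their words."""
    return [idx2word[i] for i in indices]

# Toy vocabulary (hypothetical contents, same field names as above).
words = ["<pad>", "<start>", "<end>", "<unk>", "a", "dog", "runs"]
toy = {"word2idx": {w: i for i, w in enumerate(words)},
       "idx2word": {i: w for i, w in enumerate(words)}}

ids = encode("A dog jumps", toy["word2idx"])
tokens = decode(ids, toy["idx2word"])
```

The word "jumps" is not in the toy vocabulary, so it is mapped to the index of `<unk>`.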

For more details on vocabulary generation and format, refer to the SCAN data preparation documentation.


Usage

Training

Train the DIME model separately for each dataset and retrieval direction:

COCO - Image-to-Text (i2t):

python train.py \
  --data_path $DATA_PATH \
  --bert_path $BERT_PATH \
  --data_name coco_precomp \
  --logger_name runs/coco_DIME_i2t \
  --direction i2t \
  --extra_stc 1 \
  --lambda_softmax 4

COCO - Text-to-Image (t2i):

python train.py \
  --data_path $DATA_PATH \
  --bert_path $BERT_PATH \
  --data_name coco_precomp \
  --logger_name runs/coco_DIME_t2i \
  --direction t2i \
  --extra_img 1 \
  --lambda_softmax 9

Flickr30K - Image-to-Text (i2t):

python train.py \
  --data_path $DATA_PATH \
  --bert_path $BERT_PATH \
  --data_name f30k_precomp \
  --logger_name runs/flickr_DIME_i2t \
  --direction i2t \
  --extra_stc 1 \
  --lambda_softmax 4

Flickr30K - Text-to-Image (t2i):

python train.py \
  --data_path $DATA_PATH \
  --bert_path $BERT_PATH \
  --data_name f30k_precomp \
  --logger_name runs/flickr_DIME_t2i \
  --direction t2i \
  --extra_img 1 \
  --lambda_softmax 9
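
The four commands differ only in the dataset name, the direction, and two direction-specific flags. The sketch below assembles them from a small table so the settings are easier to compare. It reuses only the flags shown above; the helper name `build_train_command` and the fallback paths are hypothetical.

```python
import os
import shlex

# Direction-specific flags, copied from the commands above.
DIRECTION_FLAGS = {
    "i2t": ["--extra_stc", "1", "--lambda_softmax", "4"],
    "t2i": ["--extra_img", "1", "--lambda_softmax", "9"],
}
# Dataset name -> prefix used in the logger directory above.
LOGGER_PREFIX = {"coco_precomp": "coco", "f30k_precomp": "flickr"}

def build_train_command(data_name, direction, data_path, bert_path):
    """Return the argument list for one training run."""
    return [
        "python", "train.py",
        "--data_path", data_path,
        "--bert_path", bert_path,
        "--data_name", data_name,
        "--logger_name", f"runs/{LOGGER_PREFIX[data_name]}_DIME_{direction}",
        "--direction", direction,
        *DIRECTION_FLAGS[direction],
    ]

data_path = os.environ.get("DATA_PATH", "./data")  # fallback is a placeholder
bert_path = os.environ.get("BERT_PATH", "./bert")  # fallback is a placeholder
for data_name in LOGGER_PREFIX:
    for direction in DIRECTION_FLAGS:
        cmd = build_train_command(data_name, direction, data_path, bert_path)
        print(shlex.join(cmd))  # pass `cmd` to subprocess.run to launch it
```

Printing the commands instead of running them keeps the sketch safe to execute; `shlex.join` requires Python 3.8 or later.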

Evaluation

To evaluate trained models:

  1. Update the model path and data path in evaluate_models.py
  2. Run the evaluation script:
python evaluate_models.py

Options:

  • fold5=True: Evaluate on the COCO 1K test set (results averaged over five folds of 1K test images)
  • fold5=False: Evaluate on the full COCO 5K test set
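
Retrieval quality on these test sets is commonly reported as Recall@K. The sketch below computes image-to-text Recall@K from a similarity matrix and averages it over disjoint folds, which is how a 1K score is usually obtained from 5K test images. The caption layout (a fixed number of consecutive captions per image) and the function names are assumptions for illustration, not the code in `evaluation.py`.

```python
def i2t_recall_at_k(sims, k, caps_per_image=5):
    """Fraction of images with at least one ground-truth caption in the top k.

    sims[i][j] is the score of image i and caption j. Captions with
    j // caps_per_image == i are the ground truth for image i.
    """
    hits = 0
    for i, row in enumerate(sims):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if any(j // caps_per_image == i for j in top_k):
            hits += 1
    return hits / len(sims)

def fold_average(sims, k, folds=5, caps_per_image=5):
    """Average Recall@K over equal, disjoint folds of images and their captions."""
    size = len(sims) // folds
    scores = []
    for f in range(folds):
        rows = range(f * size, (f + 1) * size)
        cols = range(f * size * caps_per_image, (f + 1) * size * caps_per_image)
        block = [[sims[i][j] for j in cols] for i in rows]
        scores.append(i2t_recall_at_k(block, k, caps_per_image))
    return sum(scores) / folds

# Toy example: 2 images with 2 captions each. Image 0 is ranked correctly;
# image 1 scores a caption of image 0 highest.
sims = [[0.9, 0.8, 0.1, 0.2],
        [0.7, 0.1, 0.3, 0.2]]
r1 = i2t_recall_at_k(sims, k=1, caps_per_image=2)
r2 = i2t_recall_at_k(sims, k=2, caps_per_image=2)
```

Restricting each fold to its own images and captions shrinks the candidate pool, which is why 1K scores are typically higher than 5K scores for the same model.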

Citation

If you find this code useful for your research, please cite the paper:

@inproceedings{qu2021dynamic,
  title={Dynamic Modality Interaction Modeling for Image-Text Retrieval},
  author={Qu, Leigang and Liu, Meng and Wu, Jianlong and Gao, Zan and Nie, Liqiang},
  booktitle={Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  pages={1104--1113},
  year={2021},
  organization={ACM}
}

Acknowledgement

Thanks to:


License

This project is licensed under the MIT License. See the LICENSE file for details.