
May 25, 2026

📹 (ACM MM 2025) HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval

Zhiwei Chen¹, Yupeng Hu¹✉, Zixu Li¹, Zhiheng Fu¹, Haokun Wen², Weili Guan²

¹School of Software, Shandong University
²School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)
✉Corresponding author

Accepted by ACM MM 2025: A novel framework tackling both the 🎬 Composed Video Retrieval (CVR) and 🌁 Composed Image Retrieval (CIR) tasks by leveraging the disparity in information density between modalities.

📖 Introduction

HUD is an open-source PyTorch framework designed to improve multi-modal query understanding. It is the first framework to explicitly leverage the disparity in information density between video and text, and it uses that disparity to address two problems: ambiguity about which subject the modification text refers to, and limited focus on detailed semantics. It achieves state-of-the-art (SOTA) performance on both Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR) benchmarks.


📢 News

[2026-03-19] 🚀 We migrated all training and evaluation code of HUD from Google Drive to a GitHub repository.
[2025-07-05] 🔥 Our paper "HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval" has been accepted by ACM MM 2025!


✨ Key Features

🎯 Holistic Pronoun Disambiguation: Exploits overlapping semantics through holistic cross-modal interaction to indirectly disambiguate the referents of pronouns in the modification text.
🔍 Atomistic Uncertainty Modeling: Leverages cross-modal interactions from an atomistic perspective to identify key detail semantics via uncertainty modeling, enhancing the model's focus on fine-grained visual details.
⚖️ Holistic-to-Atomistic Alignment: Adaptively aligns the composed query representation with the target video/image by incorporating a learnable similarity bias between the holistic and atomistic levels.
🧩 Unified Framework: Seamlessly supports both video (CVR) and image (CIR) retrieval tasks with strong generalization capabilities.


🏗️ Architecture

HUD architecture

Figure 1. The overall framework of HUD consists of three key modules: (a) Holistic Pronoun Disambiguation, (b) Atomistic Uncertainty Modeling, and (c) Holistic-to-Atomistic Alignment.


🏃‍♂️ Experiment Results

CVR Task Performance

Table 1. Performance comparison on the test set of the CVR dataset WebVid-CoVR in terms of R@k (%). The overall best results are in bold, and the best results among the baselines are underlined.


CIR Task Performance

Table 2. Performance comparison on the CIR datasets FashionIQ and CIRR in terms of R@k (%). The overall best results are in bold, and the best results among the baselines are underlined.
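For readers new to the metric: R@k is the percentage of queries whose ground-truth target appears among the top-k retrieved candidates. The snippet below is a minimal, dependency-free sketch of that definition. The function name `recall_at_k` and the toy similarity scores are illustrative assumptions and are not taken from the HUD codebase, which may compute the metric differently (for example, with batched tensor operations).

```python
from typing import Sequence


def recall_at_k(
    similarities: Sequence[Sequence[float]],
    target_indices: Sequence[int],
    k: int,
) -> float:
    """Return R@k (%) given a query-by-candidate similarity matrix.

    similarities[i][j] is the score between query i and candidate j;
    target_indices[i] is the index of the ground-truth candidate for query i.
    """
    if not target_indices or len(similarities) != len(target_indices):
        raise ValueError("exactly one target index is required per query")
    hits = 0
    for scores, target in zip(similarities, target_indices):
        # Rank candidates from most to least similar and keep the top k.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(target_indices)


if __name__ == "__main__":
    # Toy example: 3 queries, 4 candidates (values are illustrative only).
    sims = [
        [0.9, 0.1, 0.3, 0.2],  # target 0 is ranked 1st
        [0.2, 0.4, 0.8, 0.5],  # target 1 is ranked 3rd
        [0.1, 0.2, 0.3, 0.7],  # target 2 is ranked 2nd
    ]
    targets = [0, 1, 2]
    for k in (1, 2, 3):
        print(f"R@{k} = {recall_at_k(sims, targets, k):.1f}%")
```

On this toy input, one query is correct at rank 1, a second at rank 2, and the last at rank 3, so R@1, R@2, and R@3 come out to 33.3%, 66.7%, and 100.0%.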


Table of Contents

Introduction
News
Key Features
Architecture
Experiment Results
Quick Start & Installation
Repository Structure
Configuration Overview
Data Preparation
Training
Evaluation / Testing
Output & Checkpoints
Acknowledgements
Contact
Citation
Support & Contributing

🚀 Quick Start & Installation

We recommend using Anaconda to manage your environment, following the setup of CoVR-Project. Note: This project was developed and tested with Python 3.8.10, PyTorch 2.1.0, and an NVIDIA A40 (48 GB) GPU.

# 1. Clone the repository
git clone https://github.com/ZivChen-Ty/HUD
cd HUD

# 2. Create a virtual environment
conda create -n hud python=3.8.10 -y
conda activate hud

# 3. Install PyTorch (Adjust CUDA version based on your hardware)
conda install pytorch==2.1.0 torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

# 4. Install other dependencies
pip install -r requirements.txt
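
After installation, a quick sanity check can confirm that the interpreter and PyTorch versions match the ones listed above. The script below is a small sketch and not part of the repository; the helper name `describe_environment` is hypothetical. It only imports `torch` if the package can be found, so it also runs in an environment where PyTorch is not yet installed.

```python
import importlib.util
import platform
from typing import Dict, Optional


def describe_environment() -> Dict[str, Optional[str]]:
    """Collect the Python version and, if PyTorch is installed, its version and CUDA status."""
    info: Dict[str, Optional[str]] = {
        "python": platform.python_version(),
        "torch": None,
        "cuda_available": None,
    }
    # Only import torch when it is actually installed, so the check never
    # fails with ImportError on a fresh environment.
    if importlib.util.find_spec("torch") is not None:
        import torch

        info["torch"] = torch.__version__
        info["cuda_available"] = str(torch.cuda.is_available())
    return info


if __name__ == "__main__":
    for name, value in describe_environment().items():
        print(f"{name}: {value if value is not None else 'n/a'}")
```

If `torch` is reported as `n/a`, or `cuda_available` is `False` on a GPU machine, revisit step 3 and pick the CUDA build that matches your driver.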


📂 Repository Structure

Our codebase is highly modular. Here is a brief overview of the core files and directories:

HUD/
├── configs/               # ⚙️ Hydra configuration files (data, model, trainer, etc.)
├── src/                   # 🧠 Source code (dataloaders, model implementations, testing)
├── train_CVR.py           # 🎥 Training entry point for WebVid-CoVR
├── train_CIR.py           # 🌃 Training entry point for FashionIQ & CIRR
├── test.py                # 🧪 Evaluation entry point
└── requirements.txt       # 📦 Project dependencies


⚙️ Configuration Overview

All hyperparameters and paths are managed by Hydra under the configs/ directory. The key configuration groups are:

configs/data/ — Dataset loaders and dataset-specific path definitions.
configs/model/ — Model architecture, checkpoints, optimizers, schedulers, and loss functions.
configs/trainer/ — Lightning Fabric training settings (devices, precision, checkpointing).
configs/machine/ — Hardware/Machine settings (batch size, num workers, default root paths).
configs/test/ — Evaluation presets across different test splits.


🗃️ Data Preparation

By default, the datasets are expected to be placed under a common root directory.

💡 Path Configuration: You must adjust these paths for your local setup. There are two recommended ways to do this:

Edit YAML directly (Preferred): Modify configs/machine/default.yaml or the specific files in configs/data/*.yaml.

Override via CLI: Append machine.default.datasets_dir=/path/to/data to your run commands.
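
To make the CLI override concrete: a dotted key such as `machine.default.datasets_dir` addresses a nested entry in the composed configuration, and the text after `=` becomes its new value. The pure-Python sketch below only mimics that behavior for illustration. It is not Hydra's implementation, the helper `apply_override` is hypothetical, and the starting config is a placeholder rather than the repository's actual YAML.

```python
from typing import Any, Dict


def apply_override(config: Dict[str, Any], override: str) -> Dict[str, Any]:
    """Apply one `dotted.key=value` override to a nested dict, in place."""
    dotted_key, sep, value = override.partition("=")
    if not sep or not dotted_key:
        raise ValueError(f"expected 'key=value', got {override!r}")
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Create intermediate levels on demand. Hydra itself is stricter and
        # needs a '+' prefix to add keys that do not exist yet.
        node = node.setdefault(name, {})
    node[leaf] = value
    return config


if __name__ == "__main__":
    # Placeholder config mirroring the key named in the text above.
    cfg = {"machine": {"default": {"datasets_dir": "./datasets"}}}
    apply_override(cfg, "machine.default.datasets_dir=/path/to/data")
    print(cfg["machine"]["default"]["datasets_dir"])  # /path/to/data
```

Only the addressed leaf changes; sibling keys in the same group keep their YAML values, which is why a single override is enough to relocate the dataset root.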

1. Composed Video Retrieval (CVR)

Dataset: WebVid-CoVR

Expected directory structure:

datasets_dir/
└── WebVid-CoVR/
    ├── videos/
    │   ├── 2M/
    │   └── 8M/
    └── annotation/
        ├── webvid2m-covr_train.csv
        ├── webvid8m-covr_val.csv
        └── webvid8m-covr_test.csv

2. Composed Image Retrieval (CIR)

Datasets: FashionIQ and CIRR

Expected directory structure:

datasets_dir/
├── FashionIQ/
│   ├── captions/
│   │   ├── cap.dress.[train|val|test].json
│   │   └── ...
│   ├── image_splits/
│   │   ├── split.dress.[train|val|test].json
│   │   └── ...
│   ├── dress/
│   ├── shirt/
│   └── toptee/
└── CIRR/
    ├── train/
    ├── dev/
    ├── test1/
    └── cirr/
        ├── captions/
        │   └── cap.rc2.[train|val|test1].json
        └── image_splits/
            └── split.rc2.[train|val|test1].json
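
Before launching a run, it can help to confirm that the folders and annotation files shown in the trees above actually exist. The checker below is a sketch, not a script shipped with the repository: the names `EXPECTED_LAYOUT` and `find_missing` are hypothetical, and it only tests the paths that are spelled out in full above (per-split JSON files and elided entries are not checked).

```python
from pathlib import Path
from typing import Dict, List

# Paths copied from the directory trees above, relative to datasets_dir.
EXPECTED_LAYOUT: Dict[str, List[str]] = {
    "WebVid-CoVR": [
        "WebVid-CoVR/videos/2M",
        "WebVid-CoVR/videos/8M",
        "WebVid-CoVR/annotation/webvid2m-covr_train.csv",
        "WebVid-CoVR/annotation/webvid8m-covr_val.csv",
        "WebVid-CoVR/annotation/webvid8m-covr_test.csv",
    ],
    "FashionIQ": [
        "FashionIQ/captions",
        "FashionIQ/image_splits",
        "FashionIQ/dress",
        "FashionIQ/shirt",
        "FashionIQ/toptee",
    ],
    "CIRR": [
        "CIRR/train",
        "CIRR/dev",
        "CIRR/test1",
        "CIRR/cirr/captions",
        "CIRR/cirr/image_splits",
    ],
}


def find_missing(datasets_dir: str, dataset: str) -> List[str]:
    """Return the expected paths for `dataset` that do not exist on disk."""
    root = Path(datasets_dir)
    return [rel for rel in EXPECTED_LAYOUT[dataset] if not (root / rel).exists()]


if __name__ == "__main__":
    import sys

    # Usage: python check_layout.py /path/to/datasets_dir
    base = sys.argv[1] if len(sys.argv) > 1 else "."
    for name in EXPECTED_LAYOUT:
        missing = find_missing(base, name)
        status = "OK" if not missing else "missing: " + ", ".join(missing)
        print(f"{name}: {status}")
```

You only need the datasets for the task you plan to run, so a "missing" report for an unused dataset is harmless.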


🧨 Training

You can easily override hyperparameters, datasets, and paths directly from the command line using Hydra syntax.

Train CVR Model (WebVid-CoVR)

python train_CVR.py

Train CIR Model (FashionIQ or CIRR)

python train_CIR.py

⚠️ Before running CIR training, make sure to update the dataset selection in configs/train_CIR.yaml (the data and test entries under defaults) to your target dataset (e.g. fashioniq or cirr).

For example:
defaults:
  - data: fashioniq
  - test: fashioniq
or:
defaults:
  - data: cirr
  - test: cirr-all
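
Hydra's standard `group=option` syntax can, in general, select a config group from the command line instead of editing the YAML. Whether this repository's configs accept such overrides for `data` and `test` is an assumption that has not been verified here, so treat the following as a sketch. The helper `build_train_command` is hypothetical; it only assembles the argument list and prints it.

```python
import shlex
import sys
from typing import Dict, List


def build_train_command(script: str, overrides: Dict[str, str]) -> List[str]:
    """Build an argument list such as [python, train_CIR.py, data=cirr, ...]."""
    command = [sys.executable, script]
    # Hydra reads each `key=value` token as one override.
    command += [f"{key}={value}" for key, value in overrides.items()]
    return command


if __name__ == "__main__":
    cmd = build_train_command("train_CIR.py", {"data": "cirr", "test": "cirr-all"})
    # Print the command for inspection. To launch it, pass `cmd` to
    # subprocess.run(cmd, check=True) from the repository root.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems when an override value contains spaces or special characters.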


🧪 Evaluation / Testing

To evaluate a trained model, use test.py and specify the target benchmark.

python3 test.py

Make sure to specify the dataset and the path to your trained checkpoint, either via config overrides or by updating the relevant configs/test/*.yaml file.


📌 Output & Checkpoints

Hydra creates a dedicated run directory for each experiment, which holds its logs and weights.

Outputs are systematically written to: outputs/<dataset>/<model>/<ckpt>/<experiment>/<run_name>/.
Checkpoints are saved inside the run directory as ckpt_last.ckpt (or ckpt_<epoch>.ckpt if configured).
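
Because the run directory is nested several levels deep, locating the newest checkpoint by hand can be tedious. The helper below is a sketch that searches for the `ckpt_last.ckpt` file name mentioned above; the function name `find_latest_checkpoint` is hypothetical, and the default `outputs` root assumes you run it from the repository root.

```python
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(outputs_dir: str = "outputs") -> Optional[Path]:
    """Return the most recently modified ckpt_last.ckpt under outputs_dir, if any."""
    root = Path(outputs_dir)
    if not root.is_dir():
        return None
    candidates = list(root.rglob("ckpt_last.ckpt"))
    if not candidates:
        return None
    # Pick the newest file by modification time.
    return max(candidates, key=lambda path: path.stat().st_mtime)


if __name__ == "__main__":
    latest = find_latest_checkpoint()
    print(latest if latest is not None else "no checkpoint found")
```

The printed path can then be supplied as the checkpoint location when configuring evaluation.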


🤝 Acknowledgements

Our implementation builds on CoVR-2, which provides the foundational Composed Video Retrieval baselines and datasets, and on LAVIS, which provides robust vision-language models such as BLIP-2. We sincerely thank the authors for their great open-source projects.


✉️ Contact

For any questions, issues, or feedback, please reach out to me at zivczw@gmail.com ☺️


Ecosystem & Other Works from our Team

TEMA (ACL'26): Paper | Project | Code
ConeSep (CVPR'26): Paper | Project | Code | Blog Post (Chinese)
Air-Know (CVPR'26): Paper | Project | Code | Blog Post (Chinese)
HABIT (AAAI'26): Project | Code | Paper
ReTrack (AAAI'26): Project | Code | Paper
INTENT (AAAI'26): Project | Code | Paper
OFFSET (ACM MM'25): Project | Code | Paper
ENCODER (AAAI'25): Project | Code | Paper

📝⭐️ Citation

If you find our work or this code useful in your research, please consider leaving a Star⭐️ or Citing📝 our paper 🥰. Your support is our greatest motivation!

@inproceedings{HUD, 
  title = {HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval}, 
  author = {Chen, Zhiwei and Hu, Yupeng and Li, Zixu and Fu, Zhiheng and Wen, Haokun and Guan, Weili}, 
  booktitle = {Proceedings of the ACM International Conference on Multimedia}, 
  pages = {6143--6152},
  year = {2025} 
}


🫡 Support & Contributing

We welcome all forms of contributions! If you have questions or ideas, or if you find a bug, please feel free to:

Open an Issue for discussions or bug reports.
Submit a Pull Request to improve the codebase.


📄 License

This project is released under the terms of the LICENSE file included in this repository.
