(ACM MM 2025) HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval
¹School of Software, Shandong University   ²School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)
Corresponding author
Accepted by ACM MM 2025: A novel framework tackling both the Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR) tasks by leveraging the disparity in information density between modalities.
Introduction
HUD is an advanced open-source PyTorch framework designed to improve multi-modal query understanding. It is the first framework that explicitly leverages the disparity in information density between video and text to address modification subject referring ambiguity and limited detailed semantic focus. It achieves state-of-the-art (SOTA) performance across both Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR) benchmarks.
News
- [2026-03-19] We migrated all training and evaluation code for HUD from Google Drive to a GitHub repository.
- [2025-07-05] Our paper "HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval" has been accepted by ACM MM 2025!
Key Features
- Holistic Pronoun Disambiguation: Exploits overlapping semantics through holistic cross-modal interaction to indirectly disambiguate the referents of pronouns in the modification text.
- Atomistic Uncertainty Modeling: Leverages cross-modal interactions at the atomistic perspective to discern key detail semantics via uncertainty modeling, enhancing the model's focus on fine-grained visual details.
- Holistic-to-Atomistic Alignment: Adaptively aligns the composed query representation with the target video/image by incorporating a learnable similarity bias between the holistic and atomistic levels.
- Unified Framework: Seamlessly supports both video (CVR) and image (CIR) retrieval tasks with strong generalization capabilities.
Architecture
Experiment Results
CVR Task Performance
Table 1. Performance comparison on the test set of the CVR dataset WebVid-CoVR, measured by R@k (%). The overall best results are in bold, and the best results among the baselines are underlined.
CIR Task Performance
Table 2. Performance comparison on the CIR datasets FashionIQ and CIRR, measured by R@k (%). The overall best results are in bold, and the best results among the baselines are underlined.
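Both tables report R@k, i.e. Recall@K: the percentage of queries whose ground-truth target appears among the top-K retrieved candidates. The sketch below is a minimal pure-Python illustration of that metric, assuming one ground-truth target per query. The function name `recall_at_k` and the toy scores are hypothetical and are not taken from the HUD codebase.

```python
from typing import Sequence


def recall_at_k(scores: Sequence[Sequence[float]], targets: Sequence[int], k: int) -> float:
    """Return Recall@K in percent.

    scores[i][j] is the similarity between query i and candidate j;
    targets[i] is the index of the single ground-truth candidate for query i.
    """
    if len(scores) != len(targets) or not targets:
        raise ValueError("scores and targets must be non-empty and equally long")
    hits = 0
    for row, target in zip(scores, targets):
        # Rank candidate indices by descending similarity and keep the top k.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += target in top_k
    return 100.0 * hits / len(targets)


# Toy example: 3 queries, 4 candidates each (made-up scores).
toy_scores = [
    [0.9, 0.1, 0.3, 0.2],  # target 0 is ranked 1st
    [0.2, 0.4, 0.8, 0.6],  # target 3 is ranked 2nd
    [0.5, 0.1, 0.4, 0.3],  # target 1 is ranked 4th
]
toy_targets = [0, 3, 1]

r1 = recall_at_k(toy_scores, toy_targets, k=1)
r2 = recall_at_k(toy_scores, toy_targets, k=2)
print(f"R@1 = {r1:.2f}%, R@2 = {r2:.2f}%")
```

In this toy case one of three queries is a hit at K=1 and two of three at K=2, so larger K can only keep or raise the score.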
Table of Contents
- Introduction
- News
- Key Features
- Architecture
- Experiment Results
- Quick Start & Installation
- Repository Structure
- Configuration Overview
- Data Preparation
- Training
- Evaluation / Testing
- Output & Checkpoints
- Acknowledgements
- Contact
- Citation
- Support & Contributing
Quick Start & Installation
Following the CoVR project, we recommend using Anaconda to manage your environment. Note: This project was developed and tested with Python 3.8.10, PyTorch 2.1.0, and an NVIDIA A40 (48 GB) GPU.
# 1. Clone the repository
git clone https://github.com/ZivChen-Ty/HUD
cd HUD
# 2. Create a virtual environment
conda create -n hud python=3.8.10 -y
conda activate hud
# 3. Install PyTorch (Adjust CUDA version based on your hardware)
conda install pytorch==2.1.0 torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
# 4. Install other dependencies
pip install -r requirements.txt
Repository Structure
Our codebase is highly modular. Here is a brief overview of the core files and directories:
HUD/
├── configs/            # Hydra configuration files (data, model, trainer, etc.)
├── src/                # Source code (dataloaders, model implementations, testing)
├── train_CVR.py        # Training entry point for WebVid-CoVR
├── train_CIR.py        # Training entry point for FashionIQ & CIRR
├── test.py             # Evaluation entry point
└── requirements.txt    # Project dependencies
Configuration Overview
All hyperparameters and paths are managed by Hydra under the configs/ directory. The key configuration groups are:
- configs/data/: Dataset loaders and dataset-specific path definitions.
- configs/model/: Model architecture, checkpoints, optimizers, schedulers, and loss functions.
- configs/trainer/: Lightning Fabric training settings (devices, precision, checkpointing).
- configs/machine/: Hardware/machine settings (batch size, num workers, default root paths).
- configs/test/: Evaluation presets across different test splits.
Data Preparation
By default, the datasets are expected to be placed under a common root directory.
Path Configuration: You must adjust these paths for your local setup. There are two recommended ways to do this:
- Edit YAML directly (preferred): Modify configs/machine/default.yaml or the specific files in configs/data/*.yaml.
- Override via CLI: Append machine.default.datasets_dir=/path/to/data to your run commands.
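To make the CLI option concrete: a Hydra-style override is a dotted key followed by `=` and a value, and it replaces the matching nested entry in the composed configuration. The sketch below mimics that behaviour with plain dictionaries. It is an illustration only, not Hydra's implementation, and the helper `apply_override` and the default path `/data` are hypothetical.

```python
from copy import deepcopy
from typing import Any, Dict


def apply_override(config: Dict[str, Any], override: str) -> Dict[str, Any]:
    """Return a copy of config with one 'a.b.c=value' override applied."""
    dotted_key, sep, value = override.partition("=")
    if not sep or not dotted_key:
        raise ValueError(f"expected 'key=value', got {override!r}")
    updated = deepcopy(config)
    node = updated
    *parents, leaf = dotted_key.split(".")
    for name in parents:
        # Walk down (or create) the nested dictionaries named by the key.
        node = node.setdefault(name, {})
    node[leaf] = value
    return updated


# Hypothetical stand-in for the values defined in configs/machine/default.yaml.
base = {"machine": {"default": {"datasets_dir": "/data"}}}
cfg = apply_override(base, "machine.default.datasets_dir=/path/to/data")
print(cfg["machine"]["default"]["datasets_dir"])
```

The original dictionary is left untouched, which mirrors the idea that a CLI override affects only the current run and not the YAML files on disk.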
1. Composed Video Retrieval (CVR)
Dataset: WebVid-CoVR
Expected directory structure:
datasets_dir/
└── WebVid-CoVR/
    ├── videos/
    │   ├── 2M/
    │   └── 8M/
    └── annotation/
        ├── webvid2m-covr_train.csv
        ├── webvid8m-covr_val.csv
        └── webvid8m-covr_test.csv
2. Composed Image Retrieval (CIR)
Expected directory structure:
datasets_dir/
├── FashionIQ/
│   ├── captions/
│   │   ├── cap.dress.[train|val|test].json
│   │   └── ...
│   ├── image_splits/
│   │   ├── split.dress.[train|val|test].json
│   │   └── ...
│   ├── dress/
│   ├── shirt/
│   └── toptee/
└── CIRR/
    ├── train/
    ├── dev/
    ├── test1/
    └── cirr/
        ├── captions/
        │   └── cap.rc2.[train|val|test1].json
        └── image_splits/
            └── split.rc2.[train|val|test1].json
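Before launching training, it can help to confirm that the directory layouts above are in place. The following is a minimal sketch using only the standard library. The helper name `missing_paths` and the `EXPECTED` table are hypothetical, and only entries spelled out in the trees above are checked (the elided `...` files and the `[train|val|test]` patterns are left out).

```python
import tempfile
from pathlib import Path
from typing import Dict, List

# Relative paths copied from the directory trees in this section.
EXPECTED: Dict[str, List[str]] = {
    "WebVid-CoVR": [
        "videos/2M",
        "videos/8M",
        "annotation/webvid2m-covr_train.csv",
        "annotation/webvid8m-covr_val.csv",
        "annotation/webvid8m-covr_test.csv",
    ],
    "FashionIQ": ["captions", "image_splits", "dress", "shirt", "toptee"],
    "CIRR": ["train", "dev", "test1", "cirr/captions", "cirr/image_splits"],
}


def missing_paths(datasets_dir: Path, dataset: str) -> List[str]:
    """List the expected entries of `dataset` that are absent under datasets_dir."""
    root = datasets_dir / dataset
    return [rel for rel in EXPECTED[dataset] if not (root / rel).exists()]


# Demo on a throwaway directory that contains only part of the CIRR layout.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    for rel in ("train", "dev", "cirr/captions"):
        (demo_root / "CIRR" / rel).mkdir(parents=True)
    cirr_missing = missing_paths(demo_root, "CIRR")
    print("Missing CIRR entries:", cirr_missing)
```

On a real dataset root, an empty list from `missing_paths(Path("/path/to/data"), "WebVid-CoVR")` would indicate that every listed entry is present.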
Training
You can easily override hyperparameters, datasets, and paths directly from the command line using Hydra syntax.
Train CVR Model (WebVid-CoVR)
python train_CVR.py
Train CIR Model (FashionIQ or CIRR)
python train_CIR.py
Warning: Before running CIR training, make sure to update the dataset selection in configs/train_CIR.yaml (data and test in defaults) to your target dataset (e.g. fashioniq or cirr). For example:
defaults:
  - data: fashioniq
  - test: fashioniq
or:
defaults:
  - data: cirr
  - test: cirr-all
Evaluation / Testing
To evaluate a trained model, use test.py and specify the target benchmark.
python3 test.py
Make sure to specify the dataset and the path to your trained checkpoint, either via config overrides or by updating the relevant configs/test/*.yaml file.
Output & Checkpoints
Hydra automatically manages your experiment logs and weights.
- Outputs are written to outputs/<dataset>/<model>/<ckpt>/<experiment>/<run_name>/.
- Checkpoints are saved inside the run directory as ckpt_last.ckpt (or ckpt_<epoch>.ckpt if configured).
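Given the output layout above, a finished run's checkpoint can be located by searching outputs/ for ckpt_last.ckpt. The sketch below assumes exactly that file name and picks the most recently modified match. `find_latest_checkpoint` is a hypothetical helper, not part of the repository, and the run directory names in the demo are made up.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(outputs_dir: Path, name: str = "ckpt_last.ckpt") -> Optional[Path]:
    """Return the most recently modified checkpoint called `name`, or None."""
    candidates = list(outputs_dir.rglob(name))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Demo with two fake runs; the directory names below are made up.
with tempfile.TemporaryDirectory() as tmp:
    outputs = Path(tmp) / "outputs"
    old_run = outputs / "dataset_a" / "model_a" / "ckpt_a" / "exp_a" / "run_old"
    new_run = outputs / "dataset_a" / "model_a" / "ckpt_a" / "exp_a" / "run_new"
    for run_dir, mtime in ((old_run, 1_000_000_000), (new_run, 1_000_000_100)):
        run_dir.mkdir(parents=True)
        ckpt = run_dir / "ckpt_last.ckpt"
        ckpt.touch()
        # Set explicit modification times so the ordering is deterministic.
        os.utime(ckpt, (mtime, mtime))
    latest = find_latest_checkpoint(outputs)
    latest_run_name = latest.parent.name if latest else None
    print("Latest checkpoint is in:", latest_run_name)
```

The returned path could then be supplied to the evaluation config, in whichever way the relevant configs/test/*.yaml file expects a checkpoint path.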
Acknowledgements
Our implementation builds on CoVR-2, which provides the foundational Composed Video Retrieval baselines and datasets, and on LAVIS, which provides robust vision-language models such as BLIP-2. We sincerely thank the authors for their great open-source projects.
Contact
For any questions, issues, or feedback, please reach out to me at zivczw@gmail.com.
Related Projects
Ecosystem & Other Works from our Team
- TEMA (ACL'26): Paper | Project | Code
- ConeSep (CVPR'26): Paper | Project | Code | Blog Post (Chinese)
- Air-Know (CVPR'26): Paper | Project | Code | Blog Post (Chinese)
- HABIT (AAAI'26): Project | Code | Paper
- ReTrack (AAAI'26): Project | Code | Paper
- INTENT (AAAI'26): Project | Code | Paper
- OFFSET (ACM MM'25): Project | Code | Paper
- ENCODER (AAAI'25): Project | Code | Paper
Citation
If you find our work or this code useful in your research, please consider leaving a star or citing our paper. Your support is our greatest motivation!
@inproceedings{HUD,
title = {HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval},
author = {Chen, Zhiwei and Hu, Yupeng and Li, Zixu and Fu, Zhiheng and Wen, Haokun and Guan, Weili},
booktitle = {Proceedings of the ACM International Conference on Multimedia},
pages = {6143--6152},
year = {2025}
}
Support & Contributing
We welcome all forms of contributions! If you have any questions, ideas, or find a bug, please feel free to:
- Open an Issue for discussions or bug reports.
- Submit a Pull Request to improve the codebase.
License
This project is released under the terms of the LICENSE file included in this repository.







