UPRet
March 17, 2025
Official implementation of Uncertainty-aware Sign Language Video Retrieval with Probability Distribution Modeling
Introduction
Illustration of (a) uncertainty, (b) previous methods, and (c) our method.
Sign language video retrieval is crucial for helping the hearing-impaired community access information. Although significant progress has been made in video-text retrieval, the complexity and inherent uncertainty of sign language make it difficult to apply these techniques directly. Previous methods have attempted to map sign language videos to text through fine-grained modality alignment. However, because fine-grained annotations are scarce, the uncertainty in sign language videos has been underestimated, which has limited further progress in sign language retrieval.
Framework overview.
To address this challenge, we propose the Uncertainty-aware Probability Distribution Retrieval (UPRet) method. This method treats the mapping process between sign language videos and text as a matching of probability distributions. It explores their potential relationships through dynamic semantic alignment, achieving flexible mapping. We model sign language videos and text using multivariate Gaussian distributions, allowing us to explore their correspondences in a broader semantic space. This approach more accurately captures the uncertainty and polysemy of sign language. Through Monte Carlo sampling, we thoroughly explore the structure and associations of the distributions and employ Optimal Transport to achieve fine-grained cross-modal alignment.
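The pipeline described above (Gaussian embeddings, Monte Carlo sampling, Optimal Transport alignment) can be illustrated with a small standard-library sketch. This is not the code in this repository. It assumes diagonal covariances, a cosine cost and a Sinkhorn solver with uniform marginals, and every function name and parameter below is hypothetical. The actual model may use a different cost, sample count or solver.

```python
import math
import random


def sample_gaussian(mean, log_var, n_samples, rng):
    """Draw samples from N(mean, diag(exp(log_var))) via mean + std * noise."""
    std = [math.exp(0.5 * lv) for lv in log_var]
    return [[m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]
            for _ in range(n_samples)]


def cosine_cost(xs, ys):
    """Cost matrix C[i][j] = 1 - cosine similarity of xs[i] and ys[j]."""
    def norm(v):
        return math.sqrt(sum(a * a for a in v)) or 1.0
    return [[1.0 - sum(a * b for a, b in zip(x, y)) / (norm(x) * norm(y))
             for y in ys] for x in xs]


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularized OT plan between uniform marginals (assumed solver)."""
    n, m = len(cost), len(cost[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_distance(video_dist, text_dist, n_samples=8, seed=0):
    """Transport cost between samples drawn from two (mean, log_var) pairs."""
    rng = random.Random(seed)
    video_samples = sample_gaussian(*video_dist, n_samples, rng)
    text_samples = sample_gaussian(*text_dist, n_samples, rng)
    cost = cosine_cost(video_samples, text_samples)
    plan = sinkhorn(cost)
    return sum(plan[i][j] * cost[i][j]
               for i in range(n_samples) for j in range(n_samples))


# Toy 4-dimensional embeddings: (mean, log-variance) per modality.
video = ([1.0, 0.0, 0.0, 0.0], [-4.0] * 4)
text_match = ([1.0, 0.1, 0.0, 0.0], [-4.0] * 4)
text_other = ([-1.0, 0.0, 0.0, 0.0], [-4.0] * 4)
print(ot_distance(video, text_match), ot_distance(video, text_other))
```

In this toy setup the matched text distribution yields a lower transport cost than the unrelated one. A retrieval model would rank candidates by such a cost, or by a similarity derived from it.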
Performance
Environment
conda create --name yourEnv python=3.7
conda activate yourEnv
conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
pip install ftfy regex tqdm
pip install opencv-python boto3 requests pandas
pip install -r requirements.txt
Training
cd CLCL
python -m torch.distributed.launch --nproc_per_node=4 main_task_retrieval.py --do_train
Citations
@inproceedings{wu2024uncertainty,
title={Uncertainty-aware sign language video retrieval with probability distribution modeling},
author={Wu, Xuan and Li, Hongxiang and Luo, Yuanjiang and Cheng, Xuxin and Zhuang, Xianwei and Cao, Meng and Fu, Keren},
year={2024},
booktitle={European Conference on Computer Vision},
}
Acknowledgment
This code is based on CiCo.
| Dataset | T2V R@1 | T2V R@5 | T2V R@10 | T2V MedR | T2V MnR | V2T R@1 | V2T R@5 | V2T R@10 | V2T MedR | V2T MnR |
|---|---|---|---|---|---|---|---|---|---|---|
| How2Sign | 59.1 | 71.5 | 75.7 | 1.0 | 54.4 | 53.4 | 65.4 | 70.0 | 1.0 | 76.4 |
| PHOENIX2014T | 72.0 | 89.1 | 94.1 | 1.0 | 4.4 | 72.0 | 89.4 | 93.3 | 1.0 | 4.6 |
| CSL-Daily | 78.4 | 89.1 | 92.0 | 1.0 | 6.7 | 77.0 | 89.2 | 92.7 | 1.0 | 5.5 |
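The columns in the table above are standard retrieval metrics. R@K is the percentage of queries whose correct match ranks in the top K, and MedR and MnR are the median and mean rank of the correct match. The sketch below shows one way to compute them from a similarity matrix. It assumes a single ground-truth candidate per query, located on the diagonal, and the function name `retrieval_metrics` is hypothetical. The evaluation code in this repository may differ.

```python
import statistics


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """sim[i][j] scores query i against candidate j; the true match is j == i."""
    ranks = []
    for i, row in enumerate(sim):
        # Rank is 1 plus the number of candidates scoring strictly higher.
        ranks.append(1 + sum(1 for score in row if score > row[i]))
    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["MedR"] = statistics.median(ranks)
    metrics["MnR"] = sum(ranks) / n
    return metrics


# Toy text-to-video similarity matrix: rows are text queries, columns are videos.
sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.1, 0.7],
]
t2v = retrieval_metrics(sim)
# Video-to-text retrieval uses the transposed matrix.
v2t = retrieval_metrics([list(col) for col in zip(*sim)])
print(t2v)
print(v2t)
```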