UPRet

March 17, 2025

Official implementation of "Uncertainty-aware Sign Language Video Retrieval with Probability Distribution Modeling".

Introduction

Figure: Illustration of (a) uncertainty, (b) previous methods, and (c) our method.

Sign language video retrieval is crucial for helping the hearing-impaired community access information. Although video-text retrieval has made significant progress, the complexity and inherent uncertainty of sign language make it difficult to apply these techniques directly. Previous methods have attempted to map sign language videos to text through fine-grained modality alignment. However, because fine-grained annotations are scarce, these methods underestimate the uncertainty in sign language videos, which has limited further progress in sign language retrieval.

Framework overview.

To address this challenge, we propose the Uncertainty-aware Probability Distribution Retrieval (UPRet) method. This method treats the mapping process between sign language videos and text as a matching of probability distributions. It explores their potential relationships through dynamic semantic alignment, achieving flexible mapping. We model sign language videos and text using multivariate Gaussian distributions, allowing us to explore their correspondences in a broader semantic space. This approach more accurately captures the uncertainty and polysemy of sign language. Through Monte Carlo sampling, we thoroughly explore the structure and associations of the distributions and employ Optimal Transport to achieve fine-grained cross-modal alignment.
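
The pipeline described above (Gaussian embeddings, Monte Carlo sampling, Optimal Transport alignment) can be illustrated with a small standard-library sketch. This is not the repository's code: the diagonal covariance, the cosine-based cost, the entropic Sinkhorn solver, and every function name and default value here are assumptions chosen for illustration only.

```python
import math
import random


def sample_gaussian(mean, log_var, num_samples, rng):
    """Draw samples from a diagonal Gaussian via the reparameterization trick."""
    return [
        [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, log_var)]
        for _ in range(num_samples)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)


def sinkhorn(cost, epsilon=0.1, num_iters=50):
    """Entropy-regularized OT between uniform marginals; returns the transport plan."""
    n, m = len(cost), len(cost[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(num_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def distribution_similarity(video, text, num_samples=8, seed=0):
    """Similarity between two (mean, log_var) Gaussians (hypothetical helper)."""
    rng = random.Random(seed)
    v_samples = sample_gaussian(video[0], video[1], num_samples, rng)
    t_samples = sample_gaussian(text[0], text[1], num_samples, rng)
    sim = [[cosine(x, y) for y in t_samples] for x in v_samples]
    cost = [[1.0 - s for s in row] for row in sim]  # assumed cost: 1 - cosine
    plan = sinkhorn(cost)
    # Aggregate sample-level similarities, weighted by the transport plan.
    return sum(p * s for p_row, s_row in zip(plan, sim) for p, s in zip(p_row, s_row))


if __name__ == "__main__":
    video = ([1.0, 0.0, 0.0, 0.0], [-6.0] * 4)
    text = ([1.0, 0.0, 0.0, 0.0], [-6.0] * 4)
    print(distribution_similarity(video, text))
```

In this sketch the transport plan decides how much each video sample should be matched to each text sample, so the final score reflects the best soft alignment between the two sample sets rather than a single point-to-point comparison.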

Performance

Environment

conda create --name yourEnv python=3.7
conda activate yourEnv
conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
pip install ftfy regex tqdm
pip install opencv-python boto3 requests pandas
pip install -r requirements.txt

Training

cd CLCL
python -m torch.distributed.launch --nproc_per_node=4 main_task_retrieval.py --do_train

Citations

@inproceedings{wu2024uncertainty,
      title={Uncertainty-aware sign language video retrieval with probability distribution modeling}, 
      author={Wu, Xuan and Li, Hongxiang and Luo, Yuanjiang and Cheng, Xuxin and Zhuang, Xianwei and Cao, Meng and Fu, Keren},
      year={2024},
      booktitle={European Conference on Computer Vision},
}

Acknowledgment

This code is based on CiCo.

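Retrieval results on three datasets (T2V: text-to-video retrieval; V2T: video-to-text retrieval; R@K in %, higher is better; MedR and MnR are median and mean rank, lower is better):

| Dataset | T2V R@1 | T2V R@5 | T2V R@10 | T2V MedR | T2V MnR | V2T R@1 | V2T R@5 | V2T R@10 | V2T MedR | V2T MnR |
|---|---|---|---|---|---|---|---|---|---|---|
| How2Sign | 59.1 | 71.5 | 75.7 | 1.0 | 54.4 | 53.4 | 65.4 | 70.0 | 1.0 | 76.4 |
| PHOENIX2014T | 72.0 | 89.1 | 94.1 | 1.0 | 4.4 | 72.0 | 89.4 | 93.3 | 1.0 | 4.6 |
| CSL-Daily | 78.4 | 89.1 | 92.0 | 1.0 | 6.7 | 77.0 | 89.2 | 92.7 | 1.0 | 5.5 |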
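
For readers unfamiliar with the R@K, MedR, and MnR columns in the results, the sketch below shows how such metrics are commonly computed from a query-by-candidate similarity matrix. It is not taken from this repository's evaluation script: the function name `retrieval_metrics`, the assumption that the ground-truth match of query `i` is candidate `i`, and the tie-handling rule are all illustrative assumptions.

```python
from statistics import mean, median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K, MedR and MnR; sim[i][j] scores query i against candidate j.

    Assumes the correct candidate for query i sits at index i (hypothetical setup).
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of candidates scored strictly higher than the true match.
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["MedR"] = float(median(ranks))
    metrics["MnR"] = float(mean(ranks))
    return metrics


if __name__ == "__main__":
    # Toy text-to-video similarity matrix: 3 text queries, 3 videos.
    t2v = [
        [0.9, 0.1, 0.2],
        [0.8, 0.3, 0.5],
        [0.1, 0.2, 0.7],
    ]
    print(retrieval_metrics(t2v))
    # Video-to-text metrics use the transposed matrix.
    v2t = [list(col) for col in zip(*t2v)]
    print(retrieval_metrics(v2t))
```

Under this convention, a MedR of 1.0 means that at least half of the queries retrieve their correct match at the top position, while MnR is more sensitive to a few queries with very poor ranks.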