September 4, 2024
Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
1 Great Bay University
2 Harbin Institute of Technology, Shenzhen
3 University of Oxford
4 Shenzhen Campus of Sun Yat-sen University
*Corresponding author
News :loudspeaker:
- [09/2024] We have released the code for fine-tuning and ADPO.
- [07/2024] We have released the collected AVinstruct dataset.
- [07/2024] Our work has been accepted by ECCV 2024!
- [03/2024] arXiv paper released.
- [03/2024] Project page released.
Introduction :bulb:
We introduce CAT, which enhances multimodal large language models (MLLMs) in three ways:
1) We design a clue aggregator that gathers question-related clues in dynamic audio-visual scenarios, enriching the detailed knowledge available to the large language model.
2) CAT is trained on a mixed multimodal dataset, allowing direct application in audio-visual scenarios. Notably, we collect an audio-visual joint instruction dataset named AVinstruct to further enhance CAT's capacity to model cross-semantic correlations.
3) We propose AI-assisted ambiguity-aware direct preference optimization (ADPO), a strategy specialized in retraining the model to favor non-ambiguous responses and to improve its ability to localize specific audio-visual objects.
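For readers unfamiliar with direct preference optimization, the sketch below shows the standard DPO loss for a single preference pair, where the "chosen" response would be the non-ambiguous one and the "rejected" response the ambiguous one. This is only an illustration of the generic objective, not the ADPO implementation in this repository (see ADPO.md for that); the function names, parameters, and the default `beta` are assumptions made for the example.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical default; controls how far the policy may drift
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    # How much more the policy prefers the chosen response than the rejected one.
    policy_margin = policy_chosen_logp - policy_rejected_logp
    # The same preference margin under the frozen reference model.
    ref_margin = ref_chosen_logp - ref_rejected_logp
    # The loss falls as the policy's margin grows beyond the reference margin.
    return -log_sigmoid(beta * (policy_margin - ref_margin))


if __name__ == "__main__":
    # Illustrative log-probabilities only; not taken from any experiment.
    print(dpo_loss(-12.0, -15.0, -13.0, -13.5))
```

When the policy and the reference model agree exactly, the loss equals log 2; it drops below that value once the policy favors the chosen response more strongly than the reference does.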
Demo
Training & Validation
We have collected an audio-visual joint instruction dataset named AVinstruct; details are in Data.md.
The fine-tuning process is described in SFT.md.
The ADPO process is described in ADPO.md.
Citation
If you find this work useful for your research, please kindly cite our paper and star our repo.
@misc{ye2024cat,
  title={CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios},
  author={Qilang Ye and Zitong Yu and Rui Shao and Xinyu Xie and Philip Torr and Xiaochun Cao},
  year={2024},
  eprint={2403.04640},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}