AURELIA: Test‑time Reasoning Distillation in Audio‑Visual LLMs

October 14, 2025

📄 Paper · 🌐 Project Page

AURELIA: Test‑time Reasoning Distillation in Audio‑Visual LLMs has been accepted to ICCV 2025! 🎉🎉


This is the official code for the paper "AURELIA: Test‑time Reasoning Distillation in Audio‑Visual LLMs".

We introduce AURELIA, a novel actor-critic-based audio-visual (AV) reasoning framework that distills structured, step-by-step reasoning into AV-LLMs at test time. This improves their ability to process complex multi-modal inputs without additional training or fine-tuning.


Table of Contents

  1. Quick Start
  2. Generating Reasoning Data
  3. Evaluation
  4. Citation

Quick Start

Clone the repo & install dependencies

git clone https://github.com/schowdhury671/aurelia
cd aurelia
conda create -n aurelia python=3.10 -y
conda activate aurelia
pip install openai==0.28.0
pip install google-generativeai
pip install -q -U pytube moviepy
# ffmpeg is a system package; prefix with sudo if you are not root
apt-get install -y ffmpeg

To run the multi‑agent pipeline, you must export valid keys for both the OpenAI and Google Gemini APIs:

export OPENAI_API_KEY="..."
export GOOGLE_API_KEY="..."
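
Before launching the pipeline, it can help to confirm that both variables are actually visible to Python. The snippet below is a minimal sketch of such a check; `missing_keys` and `REQUIRED_KEYS` are hypothetical helper names introduced here for illustration and are not part of the AURELIA code.

```python
import os

# The two environment variables named in the export commands above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def missing_keys(env=None):
    """Return the required API key names that are unset or blank in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]


# Report which keys, if any, still need to be exported in this shell.
missing = missing_keys()
print("Missing API keys:", ", ".join(missing) if missing else "none")
```

This only checks that the variables are set; it does not verify that the keys are valid for either API.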

Generating Reasoning Data

Run the data generation pipeline

cd data_generation
python gen_data.py --save_path "reason_data.json" \
                   --video_path "sample.mp4" \
                   --audio_path "sample.mp3" \
                   --query "What is the most popular food of the country where the loudest instrument originates from?" \
                   --max_tries 5

gen_data.py iteratively calls the chosen LLMs, synthesizes the reasoning caption, and stops once the evaluator score reaches the threshold τ (score ≥ τ). The --max_tries flag presumably caps the number of iterations.
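
The generate-then-evaluate loop described above can be sketched as follows. This is an illustration under stated assumptions, not the code in gen_data.py: the function names (`refine_reasoning`, `toy_generate`, `toy_evaluate`), the idea of passing evaluator feedback back to the generator, and keeping the best-scoring attempt are all assumptions made here. The toy stand-ins replace the real LLM calls so the example runs offline.

```python
from typing import Callable, Optional, Tuple


def refine_reasoning(
    query: str,
    generate: Callable[[str, Optional[str]], str],
    evaluate: Callable[[str, str], Tuple[float, str]],
    tau: float,
    max_tries: int,
) -> Tuple[str, float, int]:
    """Regenerate reasoning until the score reaches tau or tries run out.

    Returns (best_reasoning, best_score, tries_used).
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    best_reasoning, best_score = "", float("-inf")
    feedback: Optional[str] = None
    for attempt in range(1, max_tries + 1):
        reasoning = generate(query, feedback)         # actor step
        score, feedback = evaluate(query, reasoning)  # critic step
        if score > best_score:
            best_reasoning, best_score = reasoning, score
        if score >= tau:                              # stopping rule: score >= tau
            break
    return best_reasoning, best_score, attempt


# Toy stand-ins (assumptions) so the sketch runs without any API calls.
def toy_generate(query, feedback):
    # Each round of feedback adds one more reasoning step.
    steps = 1 if feedback is None else int(feedback) + 1
    return " -> ".join(f"step {i}" for i in range(1, steps + 1))


def toy_evaluate(query, reasoning):
    # Score grows with the number of steps; feedback carries the step count.
    steps = reasoning.count("->") + 1
    return min(1.0, 0.25 * steps), str(steps)


result = refine_reasoning("toy query", toy_generate, toy_evaluate, tau=0.75, max_tries=5)
print(result)  # ('step 1 -> step 2 -> step 3', 0.75, 3)
```

With the toy evaluator, the score rises by 0.25 per step, so the loop stops on the third attempt when the score first reaches τ = 0.75.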


Evaluation

We benchmark various AV-LLMs following the settings mentioned in Section 5 of the paper.

The table below links to the code and checkpoints of the AV-LLMs we evaluate.

| Family | Size | Code & Checkpoint |
|---|---|---|
| Video‑LLaMA | 7B | https://github.com/DAMO-NLP-SG/Video-LLaMA |
| Video‑LLaMA 2 | 7B | https://github.com/DAMO-NLP-SG/VideoLLaMA2 |
| Unified‑IO‑2 | 1B, 3B | https://github.com/allenai/unified-io-2 |
| PandaGPT | 13B | https://github.com/yxuansu/PandaGPT |
| Macaw‑LLM | 7B | https://github.com/lyuchenyang/Macaw-LLM |
| ImageBind‑LLM | 7B | https://github.com/OpenGVLab/LLaMA-Adapter/tree/main/imagebind_LLM |
| X‑InstructBLIP | 13B | https://github.com/salesforce/LAVIS/tree/main/projects/xinstructblip |
| OneLLM | 7B | https://github.com/csuhan/OneLLM |
| CREMA | 4B | https://github.com/Yui010206/CREMA |
| AnyGPT | 7B | https://github.com/OpenMOSS/AnyGPT |
| NExT‑GPT | 7B | https://github.com/NExT-GPT/NExT-GPT |
| VITA | 7B | https://github.com/VITA-MLLM/VITA |
| Bay‑CAT | 7B | https://github.com/rikeilong/Bay-CAT |
| video‑SALMONN | 7B | https://github.com/bytedance/SALMONN/tree/videosalmonn |
| AVicuna | 7B | https://github.com/yunlong10/AVicuna |
| Gemini-1.5-Pro | - | https://ai.google.dev/gemini |
| Reka Core | - | https://reka.ai |

🎬 Data Preparation

AURELIA is evaluated on AVReasonBench, a curated collection of 4,500 audio-visual QA samples paired with gold reasoning chains. It spans six diverse tasks; AV-QA draws on two source datasets, so it occupies two rows below:

| 🎯 Task | 📚 Source | 🔢 #Samples |
|---|---|---|
| AV-QA (Music-AVQA) | 🎵 Music-AVQA dataset | 1000 |
| AV-QA (AVSD) | 🎥 AVSD dataset | 1000 |
| AV-Captioning | 📝 VALOR dataset | 1000 |
| AV-Compositional | 🌐 Web-scraped pairs | 1000 |
| AV-GeoIQ | 🗺️ Manually authored | 200 |
| AV-Meme | 🎭 AV-Odyssey Bench + augmentation | 100 |
| Dance–Music Match | 💃 DM-Match | 200 |
| Total | | 4500 |
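
As a quick sanity check on the split above, the per-row counts can be summed to confirm they match the stated total. The dictionary below simply copies the numbers from the table; the variable names are illustrative and do not come from the AURELIA code.

```python
# Per-row sample counts copied from the table above.
SAMPLES = {
    "AV-QA (Music-AVQA)": 1000,
    "AV-QA (AVSD)": 1000,
    "AV-Captioning": 1000,
    "AV-Compositional": 1000,
    "AV-GeoIQ": 200,
    "AV-Meme": 100,
    "Dance-Music Match": 200,
}

# Overall benchmark size, and the AV-QA share across its two source datasets.
total = sum(SAMPLES.values())
av_qa = sum(n for task, n in SAMPLES.items() if task.startswith("AV-QA"))
print(total, av_qa)  # 4500 2000
```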

📦 Download the dataset: ➡️ Click here

Citation

If you find AURELIA useful in your research, please cite:

@article{chowdhury2025aurelia,
  title={AURELIA: Test-time Reasoning Distillation in Audio-Visual LLMs},
  author={Chowdhury, Sanjoy and Ghani, Hanan and Anand, Nishit and Nag, Sayan and Gao, Ruohan and Elhoseiny, Mohamed and Khan, Salman and Manocha, Dinesh},
  journal={arXiv preprint arXiv:2503.23219},
  year={2025}
}