AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description

May 16, 2026

Junyu Xie1, Tengda Han1, Max Bain1, Arsha Nagrani1, Gül Varol1 2, Weidi Xie1 3, Andrew Zisserman1

1 Visual Geometry Group, Department of Engineering Science, University of Oxford
2 LIGM, École des Ponts, Univ Gustave Eiffel, CNRS
3 CMIC, Shanghai Jiao Tong University

Project page · Dataset

Requirements

  • Basic Dependencies: pytorch=2.0.0, Pillow, pandas, decord, opencv, moviepy=1.0.3, transformers=4.37.2, accelerate==0.26.1

  • VideoLLaMA2: After installation, update the sys.path.append("/path/to/VideoLLaMA2") line in stage1/main.py and stage1/utils.py so that it points to your VideoLLaMA2 directory. Please download the VideoLLaMA2-7B checkpoint here.

  • Set the model cache path (for LLaMA3, etc.) by modifying os.environ['TRANSFORMERS_CACHE'] = "/path/to/cache/" in stage1/main.py and stage2/main.py.
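
The two path settings above can be sketched together as follows. This is only an illustration of the lines to edit, not a copy of the repository code; both paths are placeholders, and VIDEOLLAMA2_ROOT and CACHE_DIR are names chosen here for readability.

```python
import os
import sys

# Placeholder locations: replace with your own paths.
VIDEOLLAMA2_ROOT = "/path/to/VideoLLaMA2"
CACHE_DIR = "/path/to/cache/"

# Set the cache location before importing transformers, which is the
# safe order for the library to pick it up.
os.environ["TRANSFORMERS_CACHE"] = CACHE_DIR

# Make the VideoLLaMA2 source tree importable.
if VIDEOLLAMA2_ROOT not in sys.path:
    sys.path.append(VIDEOLLAMA2_ROOT)
```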

Datasets

In this work, we evaluate our model on CMD-AD, MAD-Eval, and TV-AD.

Video Frames

  • CMD-AD can be downloaded here.
  • MAD-Eval can be downloaded here.
  • TV-AD adopts a subset of TV-QA as visual sources (3fps), and can be downloaded here. Each folder containing .jpg video frames needs to be converted to a .tar file. This can be done with the script provided in tools/compress_subdir.py.
    For example,
    # --root_dir: downloaded raw files (.jpg folders) from TVQA
    # --save_dir: output directory for the compressed tar files
    python tools/compress_subdir.py \
    --root_dir="resources/example_file_structures/tvad_raw/" \
    --save_dir="resources/example_file_structures/tvad/"
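
For readers who want to see what the folder-to-tar conversion amounts to, here is a minimal sketch using only the Python standard library. It is not the code in tools/compress_subdir.py: the function name compress_subdirs is hypothetical, and the archive layout (one uncompressed tar per folder, frames stored under their bare file names) is an assumption that may differ from what the provided script produces.

```python
import tarfile
from pathlib import Path


def compress_subdirs(root_dir: str, save_dir: str) -> list:
    """Write one <folder>.tar into save_dir for each subfolder of root_dir."""
    root, out = Path(root_dir), Path(save_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        tar_path = out / f"{folder.name}.tar"
        # Plain (uncompressed) tar; assumed layout: frames at the archive root.
        with tarfile.open(tar_path, "w") as tar:
            for frame in sorted(folder.glob("*.jpg")):
                tar.add(frame, arcname=frame.name)
        written.append(tar_path)
    return written
```

When in doubt, prefer the provided script, since the inference code expects whatever layout it writes.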
    

Ground Truth AD Annotations

  • All annotations can be found in resources/annotations

Results

  • The AutoAD-Zero predictions can be downloaded here.

Character Recognition

The pre-computed character recognition results (for both train and eval sets) are available in resources/annotations (e.g. resources/annotations/cmdad_anno_with_face_0.2_0.4.csv), and can be fed directly into Stage I (next step).
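
Before passing an annotation file to Stage I, it can help to check its header and size. The sketch below uses only the standard library; summarize_csv is a hypothetical helper, and because the column names of the annotation files are not listed in this README, the grouping column is left for the reader to supply after inspecting the header.

```python
import csv
from collections import Counter


def summarize_csv(csv_path: str, group_column: str = ""):
    """Return (column names, row count, per-value counts for group_column)."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        columns = list(reader.fieldnames or [])
        counts = Counter()
        n_rows = 0
        for row in reader:
            n_rows += 1
            if group_column:
                # Rows missing the column are counted under the empty string.
                counts[row.get(group_column) or ""] += 1
    return columns, n_rows, dict(counts)
```

For example, summarize_csv("resources/annotations/cmdad_anno_with_face_0.2_0.4.csv") returns the header and the number of annotation rows without loading the whole file into memory.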

It is also possible to run the character recognition code from scratch. Please refer to the char_recog folder for more details.

Inference

Stage I: VLM-Based Dense Video Description

# e.g. --dataset="cmdad"
# e.g. --anno_path="resources/annotations/cmdad_anno_with_face_0.2_0.4.csv"
# e.g. --charbank_path="resources/charbanks/cmdad_charbank.json"
python stage1/main.py \
--dataset={dataset} \
--video_dir={video_dir} \
--anno_path={anno_path} \
--charbank_path={charbank_path} \
--model_path={videollama2_ckpt_path} \
--output_dir={output_dir}

--dataset: choices are cmdad, madeval, and tvad.
--video_dir: directory of video datasets. Example file structures can be found in resources/example_file_structures (the files are empty and serve as references only).
--anno_path: path to AD annotations (with predicted face IDs and bboxes), available in resources/annotations.
--charbank_path: path to external character banks, available in resources/charbanks.
--model_path: path to the VideoLLaMA2 checkpoint.
--output_dir: directory in which to save the output CSV.
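
When running Stage I over several datasets, it may be convenient to assemble the command in Python and validate the dataset name up front. The sketch below only builds the argument list from the flags documented above; build_stage1_cmd is a hypothetical helper, and the video, checkpoint, and output paths in the usage lines are placeholders.

```python
import shlex
import sys

DATASETS = ("cmdad", "madeval", "tvad")  # choices listed for --dataset


def build_stage1_cmd(dataset, video_dir, anno_path, charbank_path,
                     model_path, output_dir):
    """Return the Stage I command as an argument list for subprocess.run."""
    if dataset not in DATASETS:
        raise ValueError(f"dataset must be one of {DATASETS}, got {dataset!r}")
    return [
        sys.executable, "stage1/main.py",
        f"--dataset={dataset}",
        f"--video_dir={video_dir}",
        f"--anno_path={anno_path}",
        f"--charbank_path={charbank_path}",
        f"--model_path={model_path}",
        f"--output_dir={output_dir}",
    ]


cmd = build_stage1_cmd(
    dataset="cmdad",
    video_dir="/path/to/cmdad/videos",          # placeholder
    anno_path="resources/annotations/cmdad_anno_with_face_0.2_0.4.csv",
    charbank_path="resources/charbanks/cmdad_charbank.json",
    model_path="/path/to/VideoLLaMA2-7B",       # placeholder
    output_dir="/path/to/output",               # placeholder
)
print(shlex.join(cmd))  # shell-quoted command, for copy-pasting or logging
```

Passing the list to subprocess.run(cmd, check=True) from the repository root would launch the run.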

Stage II: LLM-Based AD Summary

# e.g. --dataset="cmdad"
python stage2/main.py \
--dataset={dataset} \
--pred_path={stage1_result_path}

--dataset: choices are cmdad, madeval, and tvad.
--pred_path: path to the CSV file saved by Stage I.
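
Because Stage II consumes the CSV written by Stage I, a small guard that checks the file exists before launching can save a failed run. The following is a sketch under that assumption; run_stage2 and its dry_run flag are hypothetical and not part of the repository.

```python
import subprocess
import sys
from pathlib import Path

DATASETS = ("cmdad", "madeval", "tvad")  # choices listed for --dataset


def run_stage2(dataset: str, pred_path: str, dry_run: bool = False) -> list:
    """Check that the Stage I CSV exists, then launch stage2/main.py on it."""
    if dataset not in DATASETS:
        raise ValueError(f"dataset must be one of {DATASETS}, got {dataset!r}")
    if not Path(pred_path).is_file():
        raise FileNotFoundError(f"Stage I output not found: {pred_path}")
    cmd = [
        sys.executable, "stage2/main.py",
        f"--dataset={dataset}",
        f"--pred_path={pred_path}",
    ]
    if not dry_run:
        # Run from the repository root so that stage2/main.py resolves.
        subprocess.run(cmd, check=True)
    return cmd
```

Calling run_stage2("cmdad", "/path/to/stage1_output.csv", dry_run=True) returns the command without executing it, which is useful for checking paths first.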

Inference with GPT-4o via OpenAI API

Note: Before starting, insert your OpenAI API key into the corresponding main.py file (stage1_gpt/main.py or stage2_gpt/main.py).
Note: This setup was not officially tested or reported in the original paper. You may want to adjust the text prompts to get improved or more robust outputs.
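
As an alternative to pasting the key into the source file, the key can be read from an environment variable. This is only a sketch and assumes you edit the relevant main.py yourself to call it: load_openai_api_key is a hypothetical helper, and OPENAI_API_KEY is the conventional variable name rather than something this repository defines.

```python
import os


def load_openai_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, failing early if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

Keeping the key out of the file also avoids committing it to version control by accident.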

Stage I: VLM-Based Dense Video Description

# e.g. --dataset="cmdad"
# e.g. --anno_path="resources/annotations/cmdad_anno_with_face_0.2_0.4.csv"
# e.g. --charbank_path="resources/charbanks/cmdad_charbank.json"
python stage1_gpt/main.py \
--dataset={dataset} \
--video_dir={video_dir} \
--anno_path={anno_path} \
--charbank_path={charbank_path} \
--output_dir={output_dir}

Stage II: LLM-Based AD Summary

# e.g. --dataset="cmdad"
python stage2_gpt/main.py \
--dataset={dataset} \
--pred_path={stage1_result_path}

Citation

If you find this repository helpful, please consider citing our work:

@InProceedings{xie2024autoad0,
	title={AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description},
	author={Junyu Xie and Tengda Han and Max Bain and Arsha Nagrani and G{\"u}l Varol and Weidi Xie and Andrew Zisserman},
	booktitle={ACCV},
	year={2024}
}

References

VideoLLaMA2: https://github.com/DAMO-NLP-SG/VideoLLaMA2
LLaMA3: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct