Creation of the Modality Pools

August 30, 2025

Our Retrieval-Augmented Generation (RAG) framework needs two pools: a pool of training-set prototypes, used as keys, and a pool of test-set embeddings, used as queries. We use ImageBind to generate the embeddings, as it maps data from different modalities into a unified embedding space where similar data points are close to each other. This section provides a detailed explanation of how to construct the two embedding pools for each dataset.
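
To make the key/query roles concrete, here is a minimal sketch of how a test-set query could be matched against training-set prototypes. It assumes cosine similarity as the closeness measure, which this guide does not specify, and uses toy 3-dimensional vectors in place of real ImageBind embeddings. The function names are hypothetical and are not part of this repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query, keys, k=2):
    """Return the indices of the k keys most similar to the query, best first."""
    scored = [(cosine_similarity(query, key), idx) for idx, key in enumerate(keys)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy vectors standing in for real embeddings (assumption, for illustration only).
prototype_keys = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
test_query = [0.9, 0.1, 0.0]

print(retrieve_top_k(test_query, prototype_keys, k=2))
```

With real pools, the keys would come from the training-set .pt files and the queries from the test-set .pt files.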

1 - Create the Prototypes

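For each dataset you want to test, create the two pools with the scripts below. Each script takes one split per run (a train or test annotation file, or a train/test mode), so run it once on the training split to build the keys and once on the test split to build the queries. The resulting embeddings are saved as .pt files. Refer to the Data Structure Guide for the expected structure of each dataset.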

Music AVQA

For Music AVQA, create the two pools by running:

# --data_path:   path to the train or test annotation file
# --root:        path to the folder with video/audio files
# --answer_path: directory for saving .pt files
python collect_IB_embeddings_music_avqa_SF.py \
  --data_path <PATH> \
  --root <ROOT_PATH> \
  --answer_path <OUTPUT_PATH> \
  --batch_size <BATCH_SIZE>
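
Because both pools come from the same script, it can help to assemble the two invocations in one place. The sketch below only builds and prints the commands, using the flags documented above. All paths are hypothetical, and writing each split to its own output directory is an assumption, not something the script requires.

```python
import shlex


def build_music_avqa_command(data_path, root, answer_path, batch_size):
    """Assemble the argument list for one run of the Music AVQA script."""
    return [
        "python", "collect_IB_embeddings_music_avqa_SF.py",
        "--data_path", data_path,
        "--root", root,
        "--answer_path", answer_path,
        "--batch_size", str(batch_size),
    ]


# Hypothetical annotation files, one per split (illustration only).
splits = {
    "train": "annotations/train.json",
    "test": "annotations/test.json",
}

commands = [
    build_music_avqa_command(path, "data/music_avqa", f"pools/music_avqa/{split}", 8)
    for split, path in splits.items()
]

for cmd in commands:
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

To execute a command instead of printing it, pass the list to `subprocess.run(cmd, check=True)`.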

Valor

For Valor, create the two pools by running:

# --data_path:   path to the train or test annotation file
# --root:        path to the folder with video/audio files
# --answer_path: directory for saving .pt files
python collect_IB_embeddings_valor_SF.py \
  --data_path <PATH> \
  --root <ROOT_PATH> \
  --answer_path <OUTPUT_PATH> \
  --batch_size <BATCH_SIZE>

CharadesEGO

For CharadesEGO, create the two pools by running:

# --data_path:   path to the train or test annotation file
# --video_path:  path to the video files
# --audio_path:  path to the audio files
# --answer_path: directory for saving .pt files
python collect_IB_embeddings_valor_SF.py \
  --data_path <PATH> \
  --video_path <VIDEO_PATH> \
  --audio_path <AUDIO_PATH> \
  --answer_path <OUTPUT_PATH> \
  --batch_size <BATCH_SIZE>

MOSI

For MOSI, create the two pools by running:

# --root:        path to the folder with the dataset files
# --mode:        "train" or "test"
# --answer_path: directory for saving .pt files
python collect_IB_embeddings_MOSI_SF.py \
  --root <PATH> \
  --mode <MODE> \
  --answer_path <OUTPUT_PATH> \
  --batch_size <BATCH_SIZE>

MOSEI

For MOSEI, create the two pools by running:

# --root:        path to the folder with the dataset files
# --mode:        "train" or "test"
# --answer_path: directory for saving .pt files
python collect_IB_embeddings_MOSEI_SF.py \
  --root <PATH> \
  --mode <MODE> \
  --answer_path <OUTPUT_PATH> \
  --batch_size <BATCH_SIZE>
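
MOSI and MOSEI select the split with --mode rather than with an annotation file, so the four runs (two datasets, two modes) can be enumerated mechanically. The following is a sketch under stated assumptions: the directory names are hypothetical, and the helper `build_command` is not part of this repository.

```python
from itertools import product

# Script names as listed in this guide.
SCRIPTS = {
    "MOSI": "collect_IB_embeddings_MOSI_SF.py",
    "MOSEI": "collect_IB_embeddings_MOSEI_SF.py",
}
MODES = ("train", "test")


def build_command(dataset, mode, root, answer_path, batch_size):
    """Assemble the argument list for one MOSI/MOSEI run."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    return [
        "python", SCRIPTS[dataset],
        "--root", root,
        "--mode", mode,
        "--answer_path", answer_path,
        "--batch_size", str(batch_size),
    ]


# Hypothetical dataset and output folders (illustration only).
jobs = [
    build_command(dataset, mode, f"data/{dataset}", f"pools/{dataset}", 16)
    for dataset, mode in product(SCRIPTS, MODES)
]

for job in jobs:
    print(" ".join(job))
```

Rejecting an unknown mode early avoids launching a long embedding run with a mistyped split name.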

2 - Create the `.h5` files

Create the .h5 files of Music AVQA, Valor and CharadesEGO by running:

python read_IB_embeddings.py

Create the .h5 files of MOSI and MOSEI by running:

python read_IB_embeddings_mosi.py

Before running either script, edit the Python file to set the correct dataset, the correct split (train or test), and the data_dir parameter. data_dir must correspond to the answer_path used earlier when creating that dataset's pool.
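
The three settings are easy to get out of sync, so a small check before launching a reader script may be useful. This is a hedged sketch: `data_dir` is the parameter named above, while `dataset`, `split`, the `ReaderSettings` class, and `check_matches_pool` are hypothetical names and may not match the variables inside the actual scripts.

```python
import os
from dataclasses import dataclass

# Which reader script handles which dataset, as described in this section.
READER_SCRIPTS = {
    "Music AVQA": "read_IB_embeddings.py",
    "Valor": "read_IB_embeddings.py",
    "CharadesEGO": "read_IB_embeddings.py",
    "MOSI": "read_IB_embeddings_mosi.py",
    "MOSEI": "read_IB_embeddings_mosi.py",
}


@dataclass(frozen=True)
class ReaderSettings:
    dataset: str
    split: str
    data_dir: str

    def __post_init__(self):
        # Fail fast on values the reader scripts are not described as supporting.
        if self.dataset not in READER_SCRIPTS:
            raise ValueError(f"unknown dataset: {self.dataset!r}")
        if self.split not in ("train", "test"):
            raise ValueError(f"split must be 'train' or 'test', got {self.split!r}")

    @property
    def script(self):
        """Name of the reader script to run for this dataset."""
        return READER_SCRIPTS[self.dataset]


def check_matches_pool(settings, answer_path):
    """True when data_dir points at the directory the pool was written to."""
    return os.path.normpath(settings.data_dir) == os.path.normpath(answer_path)


# Hypothetical output directory (illustration only).
settings = ReaderSettings(dataset="MOSI", split="test", data_dir="pools/MOSI/")
print(settings.script, check_matches_pool(settings, "pools/MOSI"))
```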