MQA-RefAVS

May 22, 2026

Official implementation of Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation.

We introduce Mask Quality Assessment under the Ref-AVS context (MQA-RefAVS), a new task that aims to automatically infer the quality of candidate segmentation masks without access to ground-truth annotations. Given a video, its audio, a referring expression, a frame image, and a candidate segmentation mask, the task is to estimate the mask's IoU with the unseen ground truth, predict its mask type, and predict an audit action.
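
For reference, the IoU that the auditor is asked to estimate is the standard intersection-over-union between a candidate mask and the ground-truth mask. The sketch below computes it for two small binary masks. The function name `mask_iou`, the list-of-lists mask format, and the convention for two empty masks are illustrative assumptions and are not taken from this repository. At inference time the ground-truth mask is unavailable, which is why the value has to be predicted rather than computed.

```python
def mask_iou(candidate, ground_truth):
    """Intersection-over-union of two binary masks given as 2D lists of 0/1."""
    same_rows = len(candidate) == len(ground_truth)
    if not same_rows or any(len(c) != len(g) for c, g in zip(candidate, ground_truth)):
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for cand_row, gt_row in zip(candidate, ground_truth):
        for c, g in zip(cand_row, gt_row):
            c, g = bool(c), bool(g)
            intersection += int(c and g)
            union += int(c or g)

    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if union == 0 else intersection / union


if __name__ == "__main__":
    ground_truth = [[0, 1, 1],
                    [0, 1, 1]]
    # A toy candidate that covers only half of the object.
    candidate = [[0, 1, 0],
                 [0, 1, 0]]
    print(mask_iou(candidate, ground_truth))
```

In this toy case the candidate covers two of the four object pixels and adds none outside the object, so the intersection is 2 and the union is 4.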

This project is released with MQ-RAVSBench and can be viewed as a sister project to TGS-Agent, focusing on reference-free mask quality assessment after language-referred audio-visual segmentation.

Paper: https://arxiv.org/pdf/2602.03892
Dataset: https://huggingface.co/datasets/Jinxing1/MQ-RAVSBench
Pretrained weights: https://huggingface.co/Jinxing1/MQ-Auditor

Repository Structure

MQA-RefAVS/
  configs/                 # Dataclass argument definitions
  dataset/                 # Dataset loaders and collators; MQ-RAVSBench is downloaded separately
  deepspeed/               # DeepSpeed configs
  models/                  # Llama-based multimodal auditor modules
  pretrained_weights/      # Local upstream model/checkpoint placeholders
  scripts/finetune/        # MQ-Auditor training and evaluation entry points
  scripts/pretrain/        # Optional audio/visual pretraining scripts
  utils/                   # Training, evaluation, and checkpoint utilities

Paths

Run commands from the repository root. The default layout is:

parent_dir/
  MQA-RefAVS/
    pretrained_weights/
    checkpoints/
      MQ-Auditor/
  MQ-RAVSBench/

Default paths:

PRETRAINED_WEIGHTS_DIR=pretrained_weights
MQ_RAVSBENCH_DIR=../MQ-RAVSBench
MQ_AUDITOR_CKPT_DIR=checkpoints/MQ-Auditor

If your files are stored elsewhere, edit these variables at the top of scripts/finetune/finetune_hyperlora.sh and scripts/finetune/inference_hyper_lora.sh.
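
Because the defaults are relative, they only resolve correctly when commands are run from the repository root. The sketch below shows how the three defaults map to absolute locations for a given root. The helper `resolve_paths` and the example root `/data/parent_dir/MQA-RefAVS` are hypothetical and not part of the repository; the real scripts read these variables in shell.

```python
import posixpath

# Default values quoted from this README; each is relative to the repository root.
DEFAULT_PATHS = {
    "PRETRAINED_WEIGHTS_DIR": "pretrained_weights",
    "MQ_RAVSBENCH_DIR": "../MQ-RAVSBench",
    "MQ_AUDITOR_CKPT_DIR": "checkpoints/MQ-Auditor",
}


def resolve_paths(repo_root, overrides=None):
    """Return absolute POSIX paths for the defaults, with optional overrides.

    Hypothetical helper for illustration only. An absolute override replaces
    the default outright; a relative one is resolved against repo_root.
    """
    merged = {**DEFAULT_PATHS, **(overrides or {})}
    return {
        name: posixpath.normpath(posixpath.join(repo_root, value))
        for name, value in merged.items()
    }


if __name__ == "__main__":
    # Example root is an assumption; substitute your own checkout location.
    for name, path in resolve_paths("/data/parent_dir/MQA-RefAVS").items():
        print(f"{name}={path}")
```

With the example root, `MQ_RAVSBENCH_DIR` resolves to a sibling of the repository, matching the layout shown above.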

Installation

conda create -n mqa python=3.10 -y
conda activate mqa
pip install -r requirements.txt

Data

Download MQ-RAVSBench from Hugging Face and place it next to this repository:

cd MQA-RefAVS
huggingface-cli download Jinxing1/MQ-RAVSBench \
  --repo-type dataset \
  --local-dir ../MQ-RAVSBench

The default scripts expect:

../MQ-RAVSBench/train_test_meta_files/metadata.csv
../MQ-RAVSBench/train_test_meta_files/train_audit_only_filtered.json
../MQ-RAVSBench/train_test_meta_files/test_s_image_filtered.json
../MQ-RAVSBench/train_test_meta_files/test_u_image_filtered.json
../MQ-RAVSBench/train_test_meta_files/test_s_video_filtered.json
../MQ-RAVSBench/train_test_meta_files/test_u_video_filtered.json
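
Before launching a run, it can help to confirm that the download produced these files. The following is a minimal sketch that reports which of the six files listed above are missing. The function name `missing_meta_files` is an illustrative assumption and is not part of the repository.

```python
from pathlib import Path

# File names copied from the list above.
EXPECTED_META_FILES = [
    "metadata.csv",
    "train_audit_only_filtered.json",
    "test_s_image_filtered.json",
    "test_u_image_filtered.json",
    "test_s_video_filtered.json",
    "test_u_video_filtered.json",
]


def missing_meta_files(bench_dir="../MQ-RAVSBench"):
    """Return the expected metadata files that are absent under bench_dir."""
    meta_dir = Path(bench_dir) / "train_test_meta_files"
    return [name for name in EXPECTED_META_FILES if not (meta_dir / name).is_file()]


if __name__ == "__main__":
    # Run from the repository root so the relative default resolves correctly.
    missing = missing_meta_files()
    if missing:
        print("Missing metadata files:", ", ".join(missing))
    else:
        print("All expected metadata files found.")
```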

Pretrained Weights

This repository keeps only empty placeholder directories under pretrained_weights/. Download the required upstream files locally before running training or evaluation.

By default, place upstream weights under MQA-RefAVS/pretrained_weights/:

pretrained_weights/
  Llama-2-7b-chat-hf/
  clip-vit-large-patch14/
  google-bert-base-uncased/
  BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt
  audio_pretrain.bin
  visual_pretrain.bin

The first three entries are directories containing the local Hugging Face model files for Llama-2-7B-Chat, CLIP ViT-L/14, and BERT-base. BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt is the BEATs checkpoint. audio_pretrain.bin and visual_pretrain.bin are used by the MQ-Auditor training script to initialize the audio and visual projectors.

For upstream checkpoint preparation, also refer to the setup used by TGS-Agent.

Download the released MQ-Auditor checkpoint separately:

huggingface-cli download Jinxing1/MQ-Auditor \
  --local-dir checkpoints/MQ-Auditor

The evaluation script expects:

checkpoints/MQ-Auditor/
  non_lora_trainables.bin
  checkpoint-960/
    finetune_weights.bin

Evaluation

Run the released checkpoint on MQ-RAVSBench:

cd MQA-RefAVS
bash scripts/finetune/inference_hyper_lora.sh

Configure evaluation options at the top of scripts/finetune/inference_hyper_lora.sh:

TEST_NAME=test_s                 # test_s or test_u
REFAVS_EVAL_MODE=image           # image or video
REFAVS_MASK_TYPE_FILTER=perfect  # perfect, merge, full_neg, cutout, erode, or dilate
REFAVS_MASK_RANK_FILTER=-1       # -1 for all masks; 1 for Hard; 2 for Medium hard
DEVICE=cuda:0

Use the following fixed mask-type/rank settings for evaluation:

perfect: -1
merge: -1
full_neg: -1
cutout: 1 or 2
erode: 1 or 2
dilate: 1 or 2

For cutout, erode, and dilate, rank 1 denotes Hard (H) samples and rank 2 denotes Medium hard (M) samples. For perfect, merge, and full_neg, rank -1 evaluates all masks of that type.
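
The allowed combinations above can be written down as a small lookup table. The sketch below validates a mask-type/rank pair and enumerates every combination of test split, evaluation mode, mask type, and rank listed in this section. The names `ALLOWED_RANKS`, `check_setting`, and `evaluation_grid` are hypothetical. Whether every combination is reported in the paper is not stated here, so the grid simply enumerates the documented options.

```python
from itertools import product

# Mask type -> permitted REFAVS_MASK_RANK_FILTER values, as listed above.
ALLOWED_RANKS = {
    "perfect": (-1,),
    "merge": (-1,),
    "full_neg": (-1,),
    "cutout": (1, 2),
    "erode": (1, 2),
    "dilate": (1, 2),
}


def check_setting(mask_type, rank):
    """Raise ValueError if the mask-type/rank pair is not one of the fixed settings."""
    if mask_type not in ALLOWED_RANKS:
        raise ValueError(f"unknown mask type: {mask_type!r}")
    if rank not in ALLOWED_RANKS[mask_type]:
        raise ValueError(
            f"rank {rank} is not valid for {mask_type!r}; use one of {ALLOWED_RANKS[mask_type]}"
        )


def evaluation_grid():
    """Enumerate all documented option combinations as dicts of script variables."""
    runs = []
    for test_name, eval_mode in product(("test_s", "test_u"), ("image", "video")):
        for mask_type, ranks in ALLOWED_RANKS.items():
            for rank in ranks:
                runs.append({
                    "TEST_NAME": test_name,
                    "REFAVS_EVAL_MODE": eval_mode,
                    "REFAVS_MASK_TYPE_FILTER": mask_type,
                    "REFAVS_MASK_RANK_FILTER": rank,
                })
    return runs


if __name__ == "__main__":
    for run in evaluation_grid():
        print(" ".join(f"{key}={value}" for key, value in run.items()))
```

Each printed line corresponds to one set of values to place at the top of scripts/finetune/inference_hyper_lora.sh.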

Training

The reference fine-tuning entry point is:

cd MQA-RefAVS
bash scripts/finetune/finetune_hyperlora.sh

The released model was trained with the following main setting:

epochs96_lr1e-4_bs4_gradacc8_lora_r32alpha64_pos0.5_ioulosswei0
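
The setting string packs several hyperparameters into one name. The sketch below splits it into fields with a regular expression. The field interpretations are assumptions inferred from the name alone (for example, `bs` as a per-device batch size, `gradacc` as gradient accumulation steps, and `ioulosswei` as an IoU loss weight); check scripts/finetune/finetune_hyperlora.sh for the authoritative meaning. The function `parse_setting` is hypothetical and not part of the repository.

```python
import re

SETTING = "epochs96_lr1e-4_bs4_gradacc8_lora_r32alpha64_pos0.5_ioulosswei0"

# Pattern written for this specific naming scheme; field meanings are assumed.
SETTING_PATTERN = re.compile(
    r"epochs(?P<epochs>\d+)"
    r"_lr(?P<lr>[\d.eE+-]+)"
    r"_bs(?P<bs>\d+)"
    r"_gradacc(?P<gradacc>\d+)"
    r"_lora_r(?P<lora_r>\d+)alpha(?P<lora_alpha>\d+)"
    r"_pos(?P<pos>[\d.]+)"
    r"_ioulosswei(?P<ioulosswei>[\d.]+)"
)


def parse_setting(name):
    """Split a setting string of the form above into typed fields."""
    match = SETTING_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized setting string: {name!r}")
    fields = match.groupdict()
    return {
        "epochs": int(fields["epochs"]),
        "lr": float(fields["lr"]),
        "bs": int(fields["bs"]),
        "gradacc": int(fields["gradacc"]),
        "lora_r": int(fields["lora_r"]),
        "lora_alpha": int(fields["lora_alpha"]),
        "pos": float(fields["pos"]),
        "ioulosswei": float(fields["ioulosswei"]),
    }


if __name__ == "__main__":
    cfg = parse_setting(SETTING)
    print(cfg)
    # Under the assumed meanings: samples per optimizer step on one device.
    print("bs * gradacc =", cfg["bs"] * cfg["gradacc"])
    # Standard LoRA scaling factor alpha / r.
    print("lora_alpha / lora_r =", cfg["lora_alpha"] / cfg["lora_r"])
```

If `bs` and `gradacc` carry the assumed meanings, each device accumulates 4 × 8 = 32 samples per optimizer step. The total across devices depends on the GPU count, which this README does not state.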

Default MQ-RAVSBench training inputs:

../MQ-RAVSBench/train_test_meta_files/metadata.csv
../MQ-RAVSBench/train_test_meta_files/train_audit_only_filtered.json

Default mask input mode:

mask_and_masked_frame

License

The MQ-Auditor source code is released under the MIT License. MQ-RAVSBench and the released MQ-Auditor weights are provided for non-commercial research purposes only under CC BY-NC-SA 4.0-style terms. Since MQ-RAVSBench incorporates videos and annotations from previous datasets, including Ref-AVSBench and AVSBench, users must also comply with the licenses and terms of the original datasets.

Citation

@article{zhou2026audit,
  title={Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation},
  author={Zhou, Jinxing and Zhou, Yanghao and Wang, Yaoting and Han, Zongyan and Ma, Jiaqi and Ding, Henghui and Anwer, Rao Muhammad and Cholakkal, Hisham},
  journal={arXiv preprint arXiv:2602.03892},
  year={2026}
}