ACCM: Mitigating Information Loss under High Pruning Rates for Efficient Large Vision Language Models
March 20, 2026
Our work has been accepted to ACM Multimedia 2025. [Paper]
Notes (New)
We have improved the original ACCM by replacing the supplemental captions with scene graphs, which are generated by a scene graph model and filtered by a parameter-free module. The weights of the scene graph model are provided here. You need to modify the corresponding path in clip_encoder.py (a sketch of the change follows below). All other usage is the same as in ACCM.
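As a rough illustration, the change in clip_encoder.py typically amounts to pointing one path at the downloaded weights. This is a minimal sketch: the variable name SCENE_GRAPH_MODEL_PATH is an assumption for illustration, not necessarily the identifier used in the file.

```python
# clip_encoder.py (sketch) -- point this at the downloaded scene graph weights.
# The variable name is illustrative; edit the path variable actually defined
# in clip_encoder.py.
SCENE_GRAPH_MODEL_PATH = "/path/to/scene_graph_model_weights"
```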
Install
- Prepare the environment following LLaVA-1.5.
- Replace the original utils.py under the '/xxx/anaconda3/envs/env_name/lib/python3.10/site-packages/transformers/generation/' folder with the utils.py from our Huggingface homepage.
- Replace the original modeling_clip.py under the '/xxx/anaconda3/envs/env_name/lib/python3.10/site-packages/transformers/models/clip/' folder with the modeling_clip.py from our Huggingface homepage (see the sketch after this list).
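If you prefer to script the two substitutions, the sketch below copies the patched files into the installed transformers package, locating it programmatically instead of hard-coding the environment path. It assumes utils.py and modeling_clip.py have already been downloaded into the current working directory.

```python
from pathlib import Path
import shutil

import transformers

# Locate the installed transformers package (avoids hard-coding the env path).
pkg_dir = Path(transformers.__file__).parent

# Overwrite the stock files with the patched versions from our Huggingface
# homepage; assumes both files sit in the current working directory.
shutil.copy("utils.py", pkg_dir / "generation" / "utils.py")
shutil.copy("modeling_clip.py", pkg_dir / "models" / "clip" / "modeling_clip.py")
```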
Benchmarks
We use seven benchmarks: MME, MMBench, POPE, MMVP, SEED, GQA, and Flickr30k. MMVP is included in this project, and Flickr30k can be downloaded from our Huggingface homepage. For preparing the remaining benchmarks, please refer to LLaVA-1.5.
Evaluation
- Download the checkpoints from our Huggingface homepage and modify the path to the scene graph model in clip_encoder.py (see the Notes section above).
- Set the visual token number in clip_encoder.py (a sketch follows the example below).
- Run the scripts under the eval folder. Remember to modify MODEL_PATH, CKPT_NAME, and the dataset paths in the scripts. For example, the evaluation for MME is:
bash scripts/v1_5/eval/mme.sh
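The visual token number from the second step controls how many visual tokens survive pruning, so smaller values correspond to higher pruning rates. A minimal sketch, assuming the count is exposed as a module-level constant in clip_encoder.py (the actual name and value are illustrative):

```python
# clip_encoder.py (sketch) -- number of visual tokens kept after pruning.
# Smaller values mean higher pruning rates; the name and value below are
# illustrative, not the file's actual identifier.
NUM_VISUAL_TOKENS = 64
```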