VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning

May 10, 2026

The official repository for "VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning".

🤗 VCM-7B (Coming Soon) | 🤗 VCM-13B (Coming Soon) | 📑 Paper

Introduction

VCM (Vision Concept Modeling) is a framework for improving the efficiency of Large Multimodal Models (LMMs). It applies adaptive vision token compression during instruction fine-tuning: the model dynamically identifies and preserves the vision tokens that carry essential visual concepts and reduces redundant ones. This lowers computational overhead while maintaining performance on downstream multimodal tasks, as reported in the paper.

Installation and Setup

VCM is built upon the LLaVA framework. To use VCM, please follow these steps:

Clone the official LLaVA repository:

```
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
```

Install the environment: Follow the original LLaVA installation instructions to set up your Python environment and dependencies.
Apply VCM Modifications: Replace the original llava_arch.py file in the LLaVA source code with the one provided in this repository:
```
cp path/to/vcm/llava_arch.py llava/model/llava_arch.py
```
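
Because the copy overwrites a file inside the LLaVA checkout, it can help to keep the original around. The following is a minimal sketch of that idea, not part of VCM or LLaVA: `apply_vcm_arch` is a hypothetical helper name, and the `.orig` backup suffix is an arbitrary choice.

```shell
# Sketch: replace LLaVA's llava_arch.py with the VCM version, keeping a backup.
# apply_vcm_arch is a hypothetical helper name, not shipped by VCM or LLaVA.
apply_vcm_arch() {
    local vcm_file="$1"      # path to the llava_arch.py provided by VCM
    local llava_root="$2"    # root of the cloned LLaVA repository
    local target="$llava_root/llava/model/llava_arch.py"

    # Refuse to continue if either file is missing.
    if [ ! -f "$vcm_file" ]; then
        echo "VCM file not found: $vcm_file" >&2
        return 1
    fi
    if [ ! -f "$target" ]; then
        echo "LLaVA file not found: $target" >&2
        return 1
    fi

    # Back up the original only once, so repeated runs keep the pristine copy.
    if [ ! -f "$target.orig" ]; then
        cp "$target" "$target.orig"
    fi

    cp "$vcm_file" "$target"

    # cmp -s exits 0 only if the two files are byte-identical.
    cmp -s "$vcm_file" "$target"
}
```

From inside the LLaVA directory this would be called as `apply_vcm_arch path/to/vcm/llava_arch.py .`; restoring the stock architecture is then a matter of copying `llava/model/llava_arch.py.orig` back.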

Training and Inference

Once you have replaced the architecture file, you can follow the standard LLaVA training and inference pipelines. VCM will automatically handle the adaptive token compression based on the Vision Concept Modeling logic during the forward pass.

Refer to the LLaVA Documentation for detailed commands on:

- Pre-training (Feature Alignment)
- Visual Instruction Tuning
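
As a rough illustration of how the two stages map onto the LLaVA checkout, the sketch below assumes the LLaVA v1.5 layout, where the launch scripts live at `scripts/v1_5/pretrain.sh` and `scripts/v1_5/finetune.sh`. `run_llava_stage` is a hypothetical wrapper, not part of either project. The scripts themselves contain data and checkpoint paths that need to be edited for your setup before they will run.

```shell
# Sketch: launch one LLaVA training stage from a (VCM-patched) checkout.
# run_llava_stage is a hypothetical helper; the script paths assume the
# LLaVA v1.5 repository layout.
run_llava_stage() {
    local llava_root="$1"    # root of the cloned LLaVA repository
    local stage="$2"         # "pretrain" or "finetune"
    local script

    case "$stage" in
        pretrain) script="scripts/v1_5/pretrain.sh" ;;   # feature alignment
        finetune) script="scripts/v1_5/finetune.sh" ;;   # visual instruction tuning
        *)
            echo "unknown stage: $stage" >&2
            return 2
            ;;
    esac

    if [ ! -f "$llava_root/$script" ]; then
        echo "script not found: $llava_root/$script" >&2
        return 1
    fi

    # Run from the repository root in a subshell so the caller's
    # working directory is left unchanged.
    (cd "$llava_root" && bash "$script")
}
```

For example, `run_llava_stage . pretrain` followed by `run_llava_stage . finetune` would run the two stages in order from inside the LLaVA directory.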

Evaluation

We use the lmms-eval toolkit for benchmarking.
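
A typical lmms-eval invocation for a LLaVA-style model looks like the command printed by the sketch below. Several things here are assumptions rather than documented VCM usage: `build_eval_cmd` is a hypothetical helper, the checkpoint path is a placeholder (the VCM weights are not yet released), `mme` is just one example task, and we assume lmms-eval's `llava` model adapter picks up the patched LLaVA installation.

```shell
# Sketch: print an lmms-eval command for a LLaVA-style checkpoint.
# build_eval_cmd is a hypothetical helper; it only prints the command,
# it does not run the evaluation.
build_eval_cmd() {
    local checkpoint="$1"             # local path or hub id of the model weights
    local tasks="${2:-mme}"           # comma-separated lmms-eval task names
    local output_dir="${3:-./logs/}"  # where lmms-eval writes its results

    printf 'python3 -m lmms_eval --model llava --model_args pretrained=%s --tasks %s --batch_size 1 --log_samples --output_path %s\n' \
        "$checkpoint" "$tasks" "$output_dir"
}
```

Printing the command instead of running it makes it easy to inspect the arguments first; once they look right, the printed line can be pasted into a shell where lmms-eval and the patched LLaVA are installed.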

Citation

If you find VCM useful for your research, please cite our paper:

```
@article{vcm2025,
  title={VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning},
  author={Run Luo and Renke Shan and Longze Chen and Ziqiang Liu and Lu Wang and Min Yang and Xiaobo Xia},
  journal={arXiv preprint arXiv:2504.19627},
  year={2025}
}
```