README.md
April 5, 2026
FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers
ICCV 2025
1Harbin Institute of Technology, Shenzhen
2Huawei Noah's Ark Lab
†Corresponding author
If you find this work useful for your research, please kindly cite our paper and star our repo.
Updates
- [01/2026] :fire: The extended paper of FALCON++ is released on TechRxiv.
- [12/2025] :fire: Checkpoint released. Enjoy it!
- [07/2025] :fire: The code and project page are released. Enjoy it!
- [06/2025] :fire: The arXiv paper is updated.
- [06/2025] FALCON is accepted to ICCV 2025!
- [01/2025] arXiv paper released.
Introduction
This is the GitHub repository for FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers. FALCON introduces a novel visual register technique that addresses both visual redundancy and visual fragmentation in the high-resolution visual encoding of multimodal large language models (MLLMs).
Installation
- Clone this repository and navigate to the folder
git clone git@github.com:iLearn-Lab/ICCV25-FALCON.git
cd ICCV25-FALCON
- Install Package
conda create -n falcon python=3.10 -y
conda activate falcon
pip install --upgrade pip
pip install -e .
- Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
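After the steps above, a quick import check can confirm that the environment is usable. The following is only a minimal sketch: `check_packages` is a hypothetical helper that is not part of this repository, `jiutian` is the import name used in the Quick Start example, and `flash_attn` is assumed to be the import name provided by the `flash-attn` package (needed for training only).

```python
import importlib.util


def check_packages(names=("jiutian", "flash_attn")):
    """Report whether each top-level module can be found in the current environment.

    Hypothetical helper for illustration; it only looks the modules up and
    does not import them.
    """
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

A missing `flash_attn` entry matters only if you plan to run the training scripts.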
Quick Start
The JiutianHDInfer class in jiutian/eval/model_infer.py wraps model inference behind a single interface.
Below is an example of how to use it. Calling the inference method with an image path and a question runs the model on that input.
from jiutian.eval.model_infer import JiutianHDInfer

model_infer = JiutianHDInfer(
    model_path='/path/to/ckpt',
    model_base='/path/to/base_ckpt or None',  # path to the base checkpoint, or None
    conv_mode='llama_3_1',
)

image_file = '/path/to/image'
question = 'question'
model_infer.inference(image_file, question)
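To ask several questions in one run, the call above can be wrapped in a small loop. This is a sketch under assumptions, not part of the repository: `run_batch` is a hypothetical helper, and it assumes that `inference(image_file, question)` returns the model's answer rather than only printing it.

```python
from typing import Any, Iterable, List, Tuple


def run_batch(model_infer: Any, samples: Iterable[Tuple[str, str]]) -> List[dict]:
    """Call model_infer.inference on each (image_file, question) pair.

    Hypothetical helper; assumes inference() returns the answer.
    """
    results = []
    for image_file, question in samples:
        answer = model_infer.inference(image_file, question)
        results.append({"image": image_file, "question": question, "answer": answer})
    return results


# Usage with the model_infer object created in the Quick Start example:
# samples = [('/path/to/image', 'question')]
# for row in run_batch(model_infer, samples):
#     print(row["image"], row["answer"])
```

Because the helper only relies on the `inference` method, it can be exercised with a stub object before loading a checkpoint.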
Evaluations
See docs/Evaluation.md for details.
Training
Please refer to the scripts in scripts/jiutian/train.
Citation
If you find this work useful for your research, please kindly cite our paper:
@inproceedings{zhang2025falcon,
title={Falcon: Resolving visual redundancy and fragmentation in high-resolution multimodal large language models via visual registers},
author={Zhang, Renshan and Shao, Rui and Chen, Gongwei and Zhang, Miao and Zhou, Kaiwen and Guan, Weili and Nie, Liqiang},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={23530--23540},
year={2025}
}