LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

Gongwei Chen, Leyang Shen, Rui Shao*, Xiang Deng, Liqiang Nie*

School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
*Corresponding author

IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2024

[Paper] [Project Page] [Video(YouTube)] [Video(bilibili)]

:fire: Details will be released. Stay tuned :beers: :+1:

If you find this work useful for your research, please kindly cite our paper and star our repo.

Updates

[07/2024] Code and checkpoints are released.
[02/2024] LION has been accepted by CVPR 2024.
[11/2023] arXiv paper released.
[11/2023] Project page released.

This is the GitHub repository for LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge. In this work, we enhance MLLMs by integrating fine-grained spatial-aware visual knowledge and high-level semantic visual evidence, which boosts their capabilities and alleviates hallucinations.

The framework of the proposed LION model:

Installation

Download

git clone https://github.com/JiuTian-VL/JiuTian-LION.git
cd JiuTian-LION

Environment

conda create -n LION python=3.12
conda activate LION
conda install pip
pip install -r requirements.txt

Checkpoints

Version	Checkpoint
LION-FlanT5-XL	daybreaksly/LION-FlanT5-XL
LION-FlanT5-XXL	daybreaksly/LION-FlanT5-XXL

Usage

Prepare models

Download the pre-trained ViT model eva_vit_g.
Download the pre-trained RAM model ram_swin_large_14m.
Download the pre-trained FlanT5 model FlanT5-XL.
Download the pre-trained BERT model bert-base-uncased.
Enter the paths to these models in the corresponding fields of the config file configs/models/lion_flant5xl.yaml.

Inference

We provide inference examples for Image-Level and Region-Level tasks in playground.ipynb.

Training

We provide a training script and instructions for stage-4 training as an example.

Download the dataset from Hugging Face.
Download the images and organize them in one folder:

Please download the following datasets:

Training images
- OCR-VQA
- coco-2014
- coco-2017
- okvqa-2014
- textcaps
- vqav2-2014
- visual_genome

After downloading, place all these folders under a single directory.
For example:

/path/to/data/images/
├── OCR-VQA/images
├── coco/images/train2014
├── coco_2017/train2017
├── okvqa/images/train/train2014
├── textcaps/images/train_images
├── vqav2/images/train2014
├── visual_genome/VG_100K
└── visual_genome/VG_100K_2
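
Before training, it can help to confirm that the image folders are where the example tree expects them. The following is a minimal sketch, not part of the repository: the function name `find_missing_subdirs` is hypothetical, and the sub-directory list is copied from the example layout above, so adjust it if your layout differs.

```python
from pathlib import Path

# Sub-directories copied from the example tree above (an assumption about
# your layout; edit this list if you organized the images differently).
EXPECTED_SUBDIRS = [
    "OCR-VQA/images",
    "coco/images/train2014",
    "coco_2017/train2017",
    "okvqa/images/train/train2014",
    "textcaps/images/train_images",
    "vqav2/images/train2014",
    "visual_genome/VG_100K",
    "visual_genome/VG_100K_2",
]


def find_missing_subdirs(image_root, expected=EXPECTED_SUBDIRS):
    """Return the expected sub-directories that are absent under image_root."""
    root = Path(image_root)
    return [sub for sub in expected if not (root / sub).is_dir()]
```

For example, `find_missing_subdirs("/path/to/data/images")` returns an empty list when every folder in the example tree is present, and otherwise lists the relative paths that still need to be downloaded or moved.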

---

In your config file, add the unified image folder path:

train_datasets:
  - ann_path: "/path/to/image_level_data.json"
    vis_root: "/path/to/image_folder"
    is_train: true
    sample_ratio: 1
  - ann_path: "/path/to/region_level_data.json"
    vis_root: "/path/to/image_folder"
    is_train: true
    sample_ratio: 1
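
A quick sanity check of the `train_datasets` entries can catch wrong paths before a long job starts. The sketch below is illustrative only: it assumes the config has already been parsed into a list of dictionaries (for instance with PyYAML's `yaml.safe_load`), the function name `check_train_datasets` is hypothetical, and treating all four keys from the example as required is an assumption rather than something the training code is known to enforce.

```python
from pathlib import Path

# Keys that appear in the example entries above. Whether the training code
# requires every one of them is an assumption made for this sketch.
EXAMPLE_KEYS = ("ann_path", "vis_root", "is_train", "sample_ratio")


def check_train_datasets(entries):
    """Return a list of human-readable problems found in train_datasets entries."""
    problems = []
    for i, entry in enumerate(entries):
        # Report any key from the example that the entry does not define.
        for key in EXAMPLE_KEYS:
            if key not in entry:
                problems.append(f"entry {i}: missing key '{key}'")
        # The annotation file should exist on disk.
        ann_path = entry.get("ann_path")
        if ann_path is not None and not Path(ann_path).is_file():
            problems.append(f"entry {i}: ann_path not found: {ann_path}")
        # The image root should be an existing directory.
        vis_root = entry.get("vis_root")
        if vis_root is not None and not Path(vis_root).is_dir():
            problems.append(f"entry {i}: vis_root is not a directory: {vis_root}")
    return problems
```

An empty return value means every entry defines the example keys and points at paths that exist. Any other result lists what to fix, one message per problem.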

Configure training with configs/lion_train_stage4.yaml (update model paths and dataset paths)
Run multi‑GPU training:

cd JiuTian-LION
bash scripts/start_train.sh

Or manually:

CUDA_VISIBLE_DEVICES=0,1,2,3 TOKENIZERS_PARALLELISM=true \
  torchrun --master_port 12345 --nproc_per_node=4 \
  train.py --cfg-path configs/lion_train_stage4.yaml
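
In the manual command, `--nproc_per_node` matches the number of GPUs listed in `CUDA_VISIBLE_DEVICES`. When using a different number of GPUs, both values need to change together. The sketch below shows one way to keep them in sync. The helper `build_torchrun_command` is hypothetical and not part of the repository, and the port and config path defaults are simply the values from the command above.

```python
import os


def build_torchrun_command(gpu_ids,
                           cfg_path="configs/lion_train_stage4.yaml",
                           master_port=12345):
    """Return (env, argv) for a single-node torchrun launch, one process per GPU."""
    if not gpu_ids:
        raise ValueError("gpu_ids must contain at least one GPU index")
    # Copy the current environment and add the two variables from the README command.
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    env["TOKENIZERS_PARALLELISM"] = "true"
    # nproc_per_node is derived from the GPU list so the two cannot drift apart.
    argv = [
        "torchrun",
        "--master_port", str(master_port),
        f"--nproc_per_node={len(gpu_ids)}",
        "train.py",
        "--cfg-path", cfg_path,
    ]
    return env, argv
```

The returned pair can be passed to `subprocess.run(argv, env=env, check=True)` from the repository root. For example, `build_torchrun_command([0, 1])` yields a two-process launch restricted to GPUs 0 and 1.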

Outputs and checkpoints are written to outputs/lion_stage4/<timestamp>/ by default.
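
Since each run writes to its own timestamped folder, a small helper can locate the most recent one. This is a rough sketch under stated assumptions: the name `latest_run_dir` is hypothetical, the exact timestamp format is not documented here, so the helper picks the newest folder by modification time instead of parsing names, and it makes no assumption about the checkpoint file names inside.

```python
from pathlib import Path


def latest_run_dir(output_root="outputs/lion_stage4"):
    """Return the most recently modified sub-directory of output_root, or None."""
    root = Path(output_root)
    if not root.is_dir():
        return None
    # Each run is assumed to live in its own sub-directory of output_root.
    run_dirs = [p for p in root.iterdir() if p.is_dir()]
    if not run_dirs:
        return None
    # Choose by modification time rather than by parsing the folder name.
    return max(run_dirs, key=lambda p: p.stat().st_mtime)
```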

Evaluation results

For image-level tasks, we focus on image captioning and Visual Question Answering (VQA). For region-level tasks, we evaluate LION on three referring expression comprehension (REC) datasets: RefCOCO, RefCOCO+, and RefCOCOg. The results, detailed in Tables 1 and 2, show LION's stronger performance compared with baseline models.

Score

Image-level Region-level

We further evaluate LION on an object hallucination benchmark (POPE) and the most popular MLLM benchmark (MMBench). The results in Tables 1 and 2 show that LION performs strongly across various skills and is resistant to hallucinations, particularly in the popular and adversarial settings of POPE.

MMBench POPE

@inproceedings{chen2024lion,
    title={LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge}, 
    author={Chen, Gongwei and Shen, Leyang and Shao, Rui and Deng, Xiang and Nie, Liqiang},
    booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2024}
}