Welcome to Griffon

April 17, 2026

Welcome to the official repository of the Griffon series, which includes Griffon v1, v2, G, R, and the Vision-R1 reinforcement learning framework. Griffon starts from fine-grained perception and localization, achieving state-of-the-art performance in visual grounding and referring expression comprehension (REC) and rivaling expert object detection models. Beyond these visual strengths, Griffon also handles general-purpose question answering and can identify the regions relevant to a given question in order to reason about them. Griffon continues to evolve to tackle increasingly complex vision-language tasks, and we are actively maintaining and open-sourcing our progress. Feel free to follow the project and open an issue if you have questions or feedback!

Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models

📕Paper 🌀Usage

Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning (CVPR 2026)

📕Paper 🌀Usage 🤗Model 🤗Data

Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models

📕Paper 🌀Usage 🤗Model 🤗Data🔥

Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring (ICCV 2025)

📕Paper 🌀Intro 🤗Data🔥

Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models (ECCV 2024)

📕Paper 🌀Usage 🤗Model

Release

What can Griffon do now?

Griffon-G demonstrates advanced performance across multimodal benchmarks, general VQAs, and text-rich VQAs, achieving new state-of-the-art results in REC and object detection. More quantitative evaluation results can be found in our paper.

Get Started with Griffon

💡 Looking for Vision-R1? If you are here for the RL training on Qwen3-VL/Qwen2.5-VL, please navigate to the Vision-R1 Directory after cloning this repository.

1. Clone & Install

git clone git@github.com:jefferyZhan/Griffon.git
cd Griffon
pip install -e .

Tip: If you encounter errors while installing the packages, you can download the corresponding wheel files (*.whl), which we have verified, and install them manually.

2. Download the Griffon and CLIP models to the checkpoints folder.

Model	Links
Griffon-G-9B	`🤗HuggingFace`
Griffon-G-27B	`🤗HuggingFace`
clip-vit-large-path14	`🤗HuggingFace`
clip-vit-large-path14-336_to_1022	`🤗HuggingFace`
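After downloading, it can help to confirm that every model folder is in place before running inference. The sketch below assumes each model sits in a subfolder of the checkpoints folder named exactly as in the table above. That layout and the `check_checkpoints` helper are assumptions for illustration, not something the repository prescribes.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report which model folders exist under a checkpoints directory.
# Assumes one subfolder per model, named as in the table (an assumption, not a repo rule).
# Usage: check_checkpoints ROOT_DIR MODEL_NAME...
check_checkpoints() {
  local root="${1:-checkpoints}"
  if [ "$#" -gt 0 ]; then shift; fi
  local missing=0
  local name
  for name in "$@"; do
    if [ -d "$root/$name" ]; then
      echo "found:   $root/$name"
    else
      echo "missing: $root/$name"
      missing=$((missing + 1))
    fi
  done
  # Exit status is the number of missing folders (0 means everything was found).
  return "$missing"
}

# Example: check for the 9B model and the two CLIP encoders.
check_checkpoints checkpoints Griffon-G-9B clip-vit-large-path14 clip-vit-large-path14-336_to_1022 || true
```

The non-zero exit status makes the helper usable as a guard, for example `check_checkpoints checkpoints Griffon-G-9B && bash run_inference.sh ...`.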

3. Training

Please refer to the Training README.

4. Inference

# 4.1 Modify the instruction in run_inference.sh.

# 4.2.1 DO NOT USE Visual Prompt
bash run_inference.sh [CUDA_ID] [CHECKPOINTS_PATH] [IMAGE_PATH]

# 4.2.2 USE Visual Prompt for COUNTING: Input both the query image and the prompt image, separated by a comma, and include the <region> placeholder in the instruction
bash run_inference.sh [CUDA_ID] [CHECKPOINTS_PATH] [IMAGE_PATH,PROMPT_PATH]

Notice: Please pay attention to the singular and plural expressions of objects.
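The two invocation forms in step 4.2 differ only in the third argument. As a rough sketch, the helper below prints the command line for either form so it can be inspected before running. The function name `build_inference_cmd` and the example paths are hypothetical, and the comma check assumes that the script splits its third argument on commas, as the comma-separated form suggests.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the run_inference.sh command for a given setup.
# Usage: build_inference_cmd CUDA_ID CHECKPOINTS_PATH IMAGE_PATH [PROMPT_PATH]
# Pass PROMPT_PATH only when the instruction contains the <region> placeholder.
build_inference_cmd() {
  local cuda_id="$1" ckpt="$2" image="$3" prompt="${4:-}"
  # A comma inside a path would be ambiguous with the image,prompt separator.
  case "$image$prompt" in
    *,*) echo "error: image paths must not contain commas" >&2; return 1 ;;
  esac
  if [ -n "$prompt" ]; then
    image="$image,$prompt"   # query image first, prompt image second
  fi
  echo "bash run_inference.sh $cuda_id $ckpt $image"
}

# Without a visual prompt (paths are placeholders).
build_inference_cmd 0 checkpoints/Griffon-G-9B demo/query.jpg
# With a visual prompt for counting.
build_inference_cmd 0 checkpoints/Griffon-G-9B demo/query.jpg demo/prompt.jpg
```

Printing the command instead of executing it keeps the sketch side-effect free. Once the output looks right, it can be copied into a terminal.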

5. Evaluation

5.1 Multimodal Benchmark Evaluation

Please refer to LLaVA Evaluation or use VLMEvalKit.

5.2 COCO Detection Evaluation

# Single Node
torchrun --nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr 127.0.0.1 --master_port 12457 -m griffon.eval.eval_detection --model-path PATH/TO/MODEL --image-folder PATH/TO/coco2017/val2017 --dataset PATH/TO/instances_val2017.json

# Multiple Node
## NODE 0
torchrun --nproc_per_node 8 --nnodes N --node_rank 0 --master_addr MASTER_ADDR --master_port MASTER_PORT -m griffon.eval.eval_detection --model-path PATH/TO/MODEL --image-folder PATH/TO/coco2017/val2017 --dataset PATH/TO/instances_val2017.json --init tcp://MASTER_ADDR:MASTER_PORT
## NODE K(1 to N-1)
torchrun --nproc_per_node 8 --nnodes N --node_rank K --master_addr MASTER_ADDR --master_port MASTER_PORT -m griffon.eval.eval_detection --model-path PATH/TO/MODEL --image-folder PATH/TO/coco2017/val2017 --dataset PATH/TO/instances_val2017.json --init tcp://MASTER_ADDR:MASTER_PORT
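The per-node commands above differ only in the node rank, so they can be generated rather than edited by hand. The sketch below prints the command for one node. The helper name `detection_eval_cmd` and the example address are hypothetical, and the `PATH/TO/...` placeholders are kept verbatim from the commands above and still need to be filled in.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the detection-eval command for one node of an N-node job.
# Usage: detection_eval_cmd NNODES NODE_RANK MASTER_ADDR MASTER_PORT
detection_eval_cmd() {
  local nnodes="$1" rank="$2" addr="$3" port="$4"
  if [ "$rank" -lt 0 ] || [ "$rank" -ge "$nnodes" ]; then
    echo "error: node rank must be in [0, $((nnodes - 1))]" >&2
    return 1
  fi
  local cmd="torchrun --nproc_per_node 8 --nnodes $nnodes --node_rank $rank"
  cmd="$cmd --master_addr $addr --master_port $port -m griffon.eval.eval_detection"
  cmd="$cmd --model-path PATH/TO/MODEL --image-folder PATH/TO/coco2017/val2017"
  cmd="$cmd --dataset PATH/TO/instances_val2017.json"
  # The multi-node commands above pass --init; the single-node command does not.
  if [ "$nnodes" -gt 1 ]; then
    cmd="$cmd --init tcp://$addr:$port"
  fi
  echo "$cmd"
}

# Print the commands for a 2-node job (address and port are placeholder values).
for k in 0 1; do
  detection_eval_cmd 2 "$k" 10.0.0.1 12457
done
```

Each printed line is meant to be run on the node with the matching rank, with the same master address and port on every node.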

5.3 REC Evaluation

The processed RefCOCO annotation set can be downloaded from this link.

# Single Node
torchrun --nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr 127.0.0.1 --master_port 12457 -m griffon.eval.eval_rec --model-path PATH/TO/MODEL --image-folder PATH/TO/COCO/train2014 --dataset PATH/TO/REF_COCO_ANN

# Multiple Node
## NODE 0
torchrun --nproc_per_node 8 --nnodes N --node_rank 0 --master_addr MASTER_ADDR --master_port MASTER_PORT -m griffon.eval.eval_rec --model-path PATH/TO/MODEL --image-folder PATH/TO/COCO/train2014 --dataset PATH/TO/REF_COCO_ANN --init tcp://MASTER_ADDR:MASTER_PORT
## NODE K(1 to N-1)
torchrun --nproc_per_node 8 --nnodes N --node_rank K --master_addr MASTER_ADDR --master_port MASTER_PORT -m griffon.eval.eval_rec --model-path PATH/TO/MODEL --image-folder PATH/TO/COCO/train2014 --dataset PATH/TO/REF_COCO_ANN --init tcp://MASTER_ADDR:MASTER_PORT

Acknowledgement

LLaVA provides the base code and pre-trained models.
Shikra provides insight into how to organize datasets, along with some processed base annotations.
Llama provides the large language model.
Gemma2 provides the large language model.
volgachen provides the basic environment configuration.

Citation

If you find Griffon useful for your research and applications, please cite using this BibTeX:

@inproceedings{zhan2025griffonv1,
  title={Griffon: Spelling out all object locations at any granularity with large language models},
  author={Zhan, Yufei and Zhu, Yousong and Chen, Zhiyang and Yang, Fan and Tang, Ming and Wang, Jinqiao},
  booktitle={European Conference on Computer Vision},
  pages={405--422},
  year={2025},
  organization={Springer}
}

@inproceedings{zhan2025griffon,
  title={Griffon v2: Advancing multimodal perception with high-resolution scaling and visual-language co-referring},
  author={Zhan, Yufei and Zheng, Shurong and Zhu, Yousong and Zhao, Hongyin and Yang, Fan and Tang, Ming and Wang, Jinqiao},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={22947--22957},
  year={2025}
}

@article{zhan2024griffon-G,
  title={Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models},
  author={Zhan, Yufei and Zhao, Hongyin and Zhu, Yousong and Yang, Fan and Tang, Ming and Wang, Jinqiao},
  journal={arXiv preprint arXiv:2410.16163},
  year={2024}
}

@article{zhan2025understand,
  title={Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models},
  author={Zhan, Yufei and Zhao, Hongyin and Zhu, Yousong and Zheng, Shurong and Yang, Fan and Tang, Ming and Wang, Jinqiao},
  journal={arXiv preprint arXiv:2505.20753},
  year={2025}
}

@misc{zhan2025visionr1evolvinghumanfreealignment,
      title={Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning}, 
      author={Yufei Zhan and Yousong Zhu and Shurong Zheng and Hongyin Zhao and Fan Yang and Ming Tang and Jinqiao Wang},
      year={2025},
      eprint={2503.18013},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.18013}, 
}

License

The data and checkpoints are licensed for research use only. They are also restricted to uses that follow the license agreements of LLaVA, LLaMA, Gemma2, and GPT-4. The dataset is licensed under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.