ControlMLLM
July 17, 2025
This repository contains the code for the paper ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models (NeurIPS 2024).
@article{wu2024controlmllm,
title={Controlmllm: Training-free visual prompt learning for multimodal large language models},
author={Wu, Mingrui and Cai, Xinyue and Ji, Jiayi and Li, Jiale and Huang, Oucheng and Luo, Gen and Fei, Hao and Jiang, Guannan and Sun, Xiaoshuai and Ji, Rongrong},
journal={Advances in Neural Information Processing Systems},
volume={37},
pages={45206--45234},
year={2024}
}
Features
- Training-free method that runs on a single RTX 3090 24GB GPU.
- Provides visualization tools in `utils.py` for interpretability.
News
- 2025/5/26: We release the code of ControlMLLM++, an extension of ControlMLLM that introduces a new optimization strategy for better test-time stability and convergence. The technical report is coming soon.
- 2024/9/26: ControlMLLM is accepted by NeurIPS 2024.
- 2024/8/21: We release the eval pipeline for the ROC and RTC tasks.
- 2024/8/8: We release a demo on InstructBLIP.
- 2024/8/2: We release a demo on LLaVA v1.5.
Project Structure
| Folder / File | Description |
|---|---|
| `controlmllm/` | Original ControlMLLM implementation. Includes demo scripts, ROC & RTC tasks. |
| `controlmllm++/` | Enhanced ControlMLLM++. Supports multi-model pipelines & RD task. |
| `datasets.md` | Unified dataset preparation guide (ROC, RTC, RefCOCOg, ScreenSpot). |
Setup and Usage Instructions
- For ControlMLLM (ROC, RTC, demo), see `controlmllm/RUN.md`.
- For ControlMLLM++, each model has its own setup and run instructions.
Data Preparation
Please follow the instructions in DATASETS.md to prepare all datasets.
Supported Models
- Qwen2.5-VL
- LLaVA v1.5 (version <= `05ae243`)
- InstructBLIP
- LLaVA-HR
- More coming soon
Demo
python controlmllm/llava/llava_demo.py

Tips: Because LLaVA 1.5 crops images during preprocessing, referring to regions near the edges of the image may be unreliable. If a referral does not work, try slightly adjusting the visual prompt or the text prompt; small changes can noticeably change the result.
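To judge whether a referred region is at risk from the cropping mentioned in the tip, the following sketch estimates how much of a bounding box survives preprocessing. It assumes a CLIP-style pipeline (resize the shorter side to 336 pixels, then center-crop a 336x336 square); the function name `box_fraction_after_center_crop` is hypothetical and not part of this repo, and the actual preprocessing depends on the LLaVA configuration in use.

```python
def box_fraction_after_center_crop(img_w, img_h, box, crop_size=336):
    """Return the fraction of a box's area kept after resize + center crop.

    box is (x0, y0, x1, y1) in original image pixels.
    Assumption: shorter side is resized to crop_size, then a centered
    crop_size x crop_size window is taken (not confirmed by this repo).
    """
    x0, y0, x1, y1 = box
    scale = crop_size / min(img_w, img_h)

    # Image size after resizing the shorter side to crop_size
    resized_w, resized_h = img_w * scale, img_h * scale

    # Top-left corner of the centered crop window, in resized coordinates
    left = (resized_w - crop_size) / 2
    top = (resized_h - crop_size) / 2

    # Box in resized coordinates
    bx0, by0, bx1, by1 = x0 * scale, y0 * scale, x1 * scale, y1 * scale

    # Intersection of the box with the crop window
    ix0, iy0 = max(bx0, left), max(by0, top)
    ix1, iy1 = min(bx1, left + crop_size), min(by1, top + crop_size)
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)

    area = (bx1 - bx0) * (by1 - by0)
    return inter / area if area > 0 else 0.0


# A 672x336 image: the crop keeps only the central 336 columns.
print(box_fraction_after_center_crop(672, 336, (0, 0, 168, 336)))    # 0.0 (fully cropped out)
print(box_fraction_after_center_crop(672, 336, (84, 0, 252, 336)))   # 0.5 (half survives)
print(box_fraction_after_center_crop(672, 336, (168, 0, 504, 336)))  # 1.0 (fully kept)
```

A low fraction suggests moving the visual prompt toward the image center or resizing the input before running the demo.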
Results
Results of combining ControlMLLM and ControlMLLM++ with different MLLMs on the ROC and RTC tasks.
| MODELS | ROC | RTC |
|---|---|---|
| LLAVA-1.5 | 54.72 | 57.42 |
| LLAVA-1.5 + CONTROLMLLM | 60.59 | 63.06 |
| LLAVA-1.5 + CONTROLMLLM++ | 71.19 | 74.66 |
| LLAVA-HR | 53.81 | 57.00 |
| LLAVA-HR + CONTROLMLLM | 58.92 | 66.89 |
| LLAVA-HR + CONTROLMLLM++ | 69.06 | 82.68 |
| QWEN2.5-VL | 78.81 | 81.91 |
| QWEN2.5-VL + CONTROLMLLM | 79.20 | 86.43 |
| QWEN2.5-VL + CONTROLMLLM++ | 79.20 | 88.23 |
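For a quick read of the ROC/RTC table above, the absolute gain of each method over its base model can be computed directly from the reported numbers. The snippet below is only a convenience sketch; the dictionary layout and the `gains` helper are illustrative and not part of the repo.

```python
# (ROC, RTC) scores copied from the table above
results = {
    "LLaVA-1.5": {
        "base": (54.72, 57.42),
        "ControlMLLM": (60.59, 63.06),
        "ControlMLLM++": (71.19, 74.66),
    },
    "LLaVA-HR": {
        "base": (53.81, 57.00),
        "ControlMLLM": (58.92, 66.89),
        "ControlMLLM++": (69.06, 82.68),
    },
    "Qwen2.5-VL": {
        "base": (78.81, 81.91),
        "ControlMLLM": (79.20, 86.43),
        "ControlMLLM++": (79.20, 88.23),
    },
}


def gains(model):
    """Return {method: (ROC gain, RTC gain)} relative to the base model."""
    base_roc, base_rtc = results[model]["base"]
    return {
        method: (round(roc - base_roc, 2), round(rtc - base_rtc, 2))
        for method, (roc, rtc) in results[model].items()
        if method != "base"
    }


for model in results:
    print(model, gains(model))
```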

| MODELS | REF-COCOG (IN-DOMAIN) | | | | SCREENSHOT (OUT-OF-DOMAIN) | | | |
|---|---|---|---|---|---|---|---|---|
| | B@4 | M | C | S | B@4 | M | C | S |
| LLAVA-1.5 | 5.02 | 13.15 | 55.61 | 17.61 | 0.32 | 3.96 | 9.80 | 3.58 |
| LLAVA-1.5 + CONTROLMLLM | 5.53 | 14.00 | 59.75 | 19.08 | 0.45 | 5.08 | 19.74 | 5.81 |
| LLAVA-1.5 + CONTROLMLLM++ | 6.24 | 15.05 | 67.37 | 21.46 | 0.57 | 6.53 | 40.01 | 9.14 |
| LLAVA-HR | 5.28 | 13.45 | 56.29 | 18.55 | 0.29 | 4.27 | 10.88 | 4.59 |
| LLAVA-HR + CONTROLMLLM | 6.32 | 15.00 | 68.82 | 21.55 | 0.64 | 6.79 | 37.10 | 8.54 |
| LLAVA-HR + CONTROLMLLM++ | 7.50 | 16.11 | 78.42 | 24.02 | 0.98 | 9.18 | 66.96 | 13.83 |
| QWEN2.5-VL | 5.22 | 16.86 | 56.78 | 20.18 | 1.09 | 4.56 | 34.32 | 7.15 |
| QWEN2.5-VL + CONTROLMLLM | 5.33 | 16.91 | 58.20 | 20.12 | 4.26 | 9.91 | 86.35 | 15.27 |
| QWEN2.5-VL + CONTROLMLLM++ | 5.45 | 16.53 | 59.50 | 19.95 | 9.05 | 16.04 | 141.36 | 25.08 |
Acknowledgement
Layout-Guidance, ml-ferret, Transformers, SeeClick and Visualizer.