ControlMLLM

July 17, 2025

This repository contains the code for the paper ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models (NeurIPS 2024).

@article{wu2024controlmllm,
  title={Controlmllm: Training-free visual prompt learning for multimodal large language models},
  author={Wu, Mingrui and Cai, Xinyue and Ji, Jiayi and Li, Jiale and Huang, Oucheng and Luo, Gen and Fei, Hao and Jiang, Guannan and Sun, Xiaoshuai and Ji, Rongrong},
  journal={Advances in Neural Information Processing Systems},
  volume={37},
  pages={45206--45234},
  year={2024}
}

Features

Training-free method that runs on a single RTX 3090 GPU (24 GB).
Provides visualization tools in utils.py for interpretability.

News

2025/5/26: We release the code of ControlMLLM++, an extension of ControlMLLM, which introduces a new optimization strategy for better test-time stability and convergence. The technical report is coming soon.
2024/9/26: ControlMLLM is accepted by NeurIPS 2024.
2024/8/21: We release the evaluation pipeline for the ROC and RTC tasks.
2024/8/8: We release a demo for InstructBLIP.
2024/8/2: We release a demo for LLaVA v1.5.

Project Structure

Folder / File	Description
`controlmllm/`	Original ControlMLLM implementation. Includes demo scripts, ROC & RTC tasks.
`controlmllm++/`	Enhanced ControlMLLM++. Supports multi-model pipelines & RD task.
`datasets.md`	Unified dataset preparation guide (ROC, RTC, RefCOCOg, ScreenSpot).

Setup and Usage Instructions

For ControlMLLM (ROC, RTC, demo), see controlmllm/RUN.md
For ControlMLLM++, each model has its own setup and run instructions:
- controlmllm++/llava/RUN.md
- controlmllm++/qwen2_5_vl/RUN.md

Data Preparation

Please follow the instructions in DATASETS.md to prepare all datasets.

Supported Models

Qwen2.5-VL
LLaVA v1.5 (version <= '05ae243')
InstructBLIP
LLaVA-HR
More coming soon

Tips: LLaVA 1.5 crops images during preprocessing, so referring to a region near the edge of the image may be unreliable. If a referring prompt does not work, try slightly adjusting the visual prompt or the text prompt; small changes can produce surprisingly different results.
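The edge effect described in the tip can be illustrated with a small sketch. It assumes a CLIP-style pipeline that resizes the shorter image side to the crop size and then takes a center crop. The function name `box_after_center_crop` and the default crop size of 336 are illustrative assumptions, not part of this repository, and the actual preprocessing may differ between versions.

```python
def box_after_center_crop(box, image_size, crop_size=336):
    """Map an (x0, y0, x1, y1) pixel box through resize + center crop.

    Hypothetical helper: assumes the shorter side is resized to crop_size
    and a centered crop_size x crop_size window is kept.
    Returns the clipped box in crop coordinates and the fraction of the
    box area that is still visible after cropping.
    """
    x0, y0, x1, y1 = box
    width, height = image_size

    # Resize so that the shorter side equals crop_size (aspect ratio kept).
    scale = crop_size / min(width, height)
    new_w, new_h = width * scale, height * scale

    # Offsets of the centered crop window inside the resized image.
    off_x = (new_w - crop_size) / 2
    off_y = (new_h - crop_size) / 2

    # Box in crop coordinates, before clipping.
    sx0, sy0 = x0 * scale - off_x, y0 * scale - off_y
    sx1, sy1 = x1 * scale - off_x, y1 * scale - off_y

    # Clip the box to the crop window.
    cx0, cy0 = max(sx0, 0.0), max(sy0, 0.0)
    cx1, cy1 = min(sx1, crop_size), min(sy1, crop_size)

    full_area = (sx1 - sx0) * (sy1 - sy0)
    kept_area = max(cx1 - cx0, 0.0) * max(cy1 - cy0, 0.0)
    visible = kept_area / full_area if full_area > 0 else 0.0
    return (cx0, cy0, cx1, cy1), visible


if __name__ == "__main__":
    # A box touching the left edge of a 640x480 image loses most of its area.
    _, visible = box_after_center_crop((0, 100, 100, 200), (640, 480))
    print(f"visible fraction: {visible:.2f}")
```

Under these assumptions, a visible fraction well below 1.0 suggests that the referred region is partly outside the model's view, so moving the visual prompt toward the image center is worth trying.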

Results

Results of combining ControlMLLM and ControlMLLM++ with different MLLMs on the ROC and RTC tasks.

MODELS	ROC	RTC
LLAVA-1.5	54.72	57.42
LLAVA-1.5 + CONTROLMLLM	60.59	63.06
LLAVA-1.5 + CONTROLMLLM++	71.19	74.66
LLAVA-HR	53.81	57.00
LLAVA-HR + CONTROLMLLM	58.92	66.89
LLAVA-HR + CONTROLMLLM++	69.06	82.68
QWEN2.5-VL	78.81	81.91
QWEN2.5-VL + CONTROLMLLM	79.20	86.43
QWEN2.5-VL + CONTROLMLLM++	79.20	88.23
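To make the table easier to read, the following sketch computes the absolute gain of each method over its base model. The scores are copied from the table above; the dictionary layout and the `gains` helper are illustrative and not part of this repository.

```python
# (ROC, RTC) scores copied from the table above.
RESULTS = {
    "LLaVA-1.5": {
        "base": (54.72, 57.42),
        "ControlMLLM": (60.59, 63.06),
        "ControlMLLM++": (71.19, 74.66),
    },
    "LLaVA-HR": {
        "base": (53.81, 57.00),
        "ControlMLLM": (58.92, 66.89),
        "ControlMLLM++": (69.06, 82.68),
    },
    "Qwen2.5-VL": {
        "base": (78.81, 81.91),
        "ControlMLLM": (79.20, 86.43),
        "ControlMLLM++": (79.20, 88.23),
    },
}


def gains(results):
    """Return absolute (ROC, RTC) gains in points over each base model."""
    out = {}
    for model, rows in results.items():
        base_roc, base_rtc = rows["base"]
        for method, (roc, rtc) in rows.items():
            if method == "base":
                continue
            out[(model, method)] = (
                round(roc - base_roc, 2),
                round(rtc - base_rtc, 2),
            )
    return out


if __name__ == "__main__":
    for (model, method), (d_roc, d_rtc) in gains(RESULTS).items():
        print(f"{model} + {method}: ROC +{d_roc:.2f}, RTC +{d_rtc:.2f}")
```

The gains are largest for the LLaVA models and smaller on ROC for Qwen2.5-VL, whose baseline is already much higher.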

Referring description performance on the RefCOCOg and Screenshot datasets. Metrics are BLEU-4 (B@4), METEOR (M), CIDEr (C), and SPICE (S). Our method equips non-referring models with effective grounding ability and also complements modern referring-capable MLLMs by improving their generalization and precision.

MODELS	REF-COCOG (IN-DOMAIN)				SCREENSHOT (OUT-OF-DOMAIN)
MODELS	B@4	M	C	S	B@4	M	C	S
LLAVA-1.5	5.02	13.15	55.61	17.61	0.32	3.96	9.80	3.58
LLAVA-1.5 + CONTROLMLLM	5.53	14.00	59.75	19.08	0.45	5.08	19.74	5.81
LLAVA-1.5 + CONTROLMLLM++	6.24	15.05	67.37	21.46	0.57	6.53	40.01	9.14
LLAVA-HR	5.28	13.45	56.29	18.55	0.29	4.27	10.88	4.59
LLAVA-HR + CONTROLMLLM	6.32	15.00	68.82	21.55	0.64	6.79	37.10	8.54
LLAVA-HR + CONTROLMLLM++	7.50	16.11	78.42	24.02	0.98	9.18	66.96	13.83
QWEN2.5-VL	5.22	16.86	56.78	20.18	1.09	4.56	34.32	7.15
QWEN2.5-VL + CONTROLMLLM	5.33	16.91	58.20	20.12	4.26	9.91	86.35	15.27
QWEN2.5-VL + CONTROLMLLM++	5.45	16.53	59.50	19.95	9.05	16.04	141.36	25.08

Acknowledgement

This project builds on Layout-Guidance, ml-ferret, Transformers, SeeClick, and Visualizer.