DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding

August 14, 2025

This is the official repository for DyFo (Dynamic Focus), a training-free visual search method that enhances fine-grained visual understanding in LMMs/MLLMs by simulating human dynamic visual focus.

🔥 Update

[2025-08-11]: 🚀 Updated for compatibility with the latest vLLM. Merged the DyFo and expert environments for easier setup.
[2025-05-15]: 🚀 Code released.
[2025-04-21]: ⭐️ DyFo was selected as a Poster Highlight at CVPR 2025 (top 13.5% of accepted papers); check out this link for details.

🎯 Overview

We introduce DyFo (Dynamic Focus), a training-free visual search method that dynamically adjusts focus regions to enhance fine-grained visual understanding in large multimodal models (LMMs).
The focus adjustment is guided by a bidirectional interaction between LMMs and visual experts, optimized via a Monte Carlo Tree Search (MCTS) algorithm.
DyFo effectively filters out irrelevant content while avoiding the need for additional training or specialized localization modules, leading to improved fine-grained visual understanding and reduced hallucination in LMMs.
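To make the MCTS idea more concrete, here is a minimal sketch of the selection and backpropagation steps of a generic UCT-style search over candidate focus regions. It is not DyFo's actual implementation: the `FocusNode` class, the box representation, the exploration constant, and the reward values are all illustrative assumptions. In a real system the reward would presumably come from the LMM and visual expert agreeing on a region.

```python
import math
from dataclasses import dataclass, field


@dataclass
class FocusNode:
    """Hypothetical search node: one candidate focus region of the image."""
    box: tuple                      # (x0, y0, x1, y1) in pixels, illustrative only
    visits: int = 0
    total_reward: float = 0.0
    children: list = field(default_factory=list)


def ucb1(parent: FocusNode, child: FocusNode, c: float = 1.4) -> float:
    """Standard UCB1 score: mean reward plus an exploration bonus."""
    if child.visits == 0:
        return math.inf             # always try unvisited regions first
    mean = child.total_reward / child.visits
    return mean + c * math.sqrt(math.log(parent.visits) / child.visits)


def select_child(parent: FocusNode) -> FocusNode:
    """Pick the child region with the highest UCB1 score."""
    return max(parent.children, key=lambda ch: ucb1(parent, ch))


def backpropagate(path: list, reward: float) -> None:
    """Add one visit and the observed reward to every node on the path."""
    for node in path:
        node.visits += 1
        node.total_reward += reward


# Toy usage: a full image with two candidate sub-regions.
root = FocusNode((0, 0, 100, 100))
left = FocusNode((0, 0, 50, 100))
right = FocusNode((50, 0, 100, 100))
root.children = [left, right]

backpropagate([root, left], reward=1.0)   # assumed reward for a useful region
next_region = select_child(root)          # the unvisited region is explored next
```

The exploration bonus is what lets the search revisit regions that looked unpromising after only a few trials, instead of committing to the first region that scored well.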

🕹️ Usage

1. Environment Setup

DyFo combines two components for collaborative inference: (1) a large multimodal model (LMM) such as Qwen2-VL or LLaVA-1.5, served with vLLM, and (2) a visual expert such as Lang_SAM (this link).

Note

If you encounter network issues accessing GitHub or HuggingFace during installation, you can try using these mirror sites:

GitHub Mirror: this link
HuggingFace Mirror: this link

1. Python environment setup:

conda create -n dyfo python=3.11
conda activate dyfo
pip install -r requirements.txt

2. Visual Expert (Lang_SAM) installation:

Clone the GitHub repository:

git clone https://github.com/luca-medeiros/lang-segment-anything && cd lang-segment-anything

(Manual Action) Modify line 41 in lang_sam/models/gdino.py to support batch inference:

inputs = self.processor(images=images_pil, text=texts_prompt, return_tensors="pt", padding=True).to(self.model.device)

(Manual Action) Modify line 47 in lang_sam/models/gdino.py for compatibility with the latest transformers (version 4.55):

threshold=box_threshold,
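Since line numbers can drift between upstream commits, it may be worth printing the two lines before editing them. The helper below is a small sketch for that purpose; `show_lines` is a hypothetical name, and the relative path assumes you are inside the cloned lang-segment-anything directory.

```python
from pathlib import Path


def show_lines(path, line_numbers):
    """Return {line_number: text} for the requested 1-indexed lines of a file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {n: lines[n - 1] for n in line_numbers if 1 <= n <= len(lines)}


# Assumed location, relative to the cloned repository root.
target = Path("lang_sam/models/gdino.py")
if target.exists():
    for number, text in show_lines(target, [41, 47]).items():
        print(f"{number}: {text.strip()}")
```

If the printed lines are not the processor call and the threshold argument, the upstream file has probably changed and the edits belong on different lines.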

2. Data Preparation

Download the dataset from this link and unzip dataset.zip to get the following directory structure:

.
├── dyfo
│   ├── scripts
│   └── src
└── playground (dataset)
    └── data
        └── eval
            ├── pope
            └── vstar
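A quick way to confirm the layout after unzipping is to check that the expected directories exist. This is only a convenience sketch: `missing_dirs` is a hypothetical helper, and it assumes the dataset folder is literally named `playground` and sits next to `dyfo`, as in the tree above.

```python
from pathlib import Path

# Directories taken from the tree above; the root is wherever dyfo/ lives.
EXPECTED_DIRS = [
    "dyfo/scripts",
    "dyfo/src",
    "playground/data/eval/pope",
    "playground/data/eval/vstar",
]


def missing_dirs(root="."):
    """Return the expected directories that are absent under `root`."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


missing = missing_dirs(".")
if missing:
    print("Missing directories:", ", ".join(missing))
else:
    print("Directory layout looks complete.")
```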

3. Evaluation

1. Starting Servers

To start both LMM and Visual Expert servers:

# Start the LMM server (tmux recommended)
conda activate dyfo
CUDA_VISIBLE_DEVICES=0 bash dyfo/scripts/lmm_server/<qwen/llava>_server.sh

# Start the Visual Expert server (tmux recommended)
conda activate dyfo
CUDA_VISIBLE_DEVICES=1 bash dyfo/scripts/expert_server/start_server.sh
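Both servers can take a while to load model weights, so it may help to wait until they accept connections before launching inference. The sketch below polls a TCP port using only the standard library. The host and port values are placeholders, not values taken from this repository; check the server scripts for the ports they actually use.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to (host, port) succeeds, else False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)    # server not ready yet, retry shortly
    return False


# Hypothetical usage; replace the placeholder ports with the real ones:
# lmm_ready = wait_for_port("127.0.0.1", 8000, timeout=600)
# expert_ready = wait_for_port("127.0.0.1", 8001, timeout=600)
```

An open port only shows that the process is listening; it does not guarantee that the model has finished loading.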

2. Collaborative Inference

For POPE evaluation:

Batch testing (all 9 sub-datasets, about 6 to 7 hours):

conda activate dyfo
CUDA_VISIBLE_DEVICES=0 bash dyfo/scripts/pope/<qwen/llava>_batch.sh

Single-dataset testing (about 40 to 50 minutes):

# Example: gqa_random
# other datasets: <coco/aokvqa/gqa>/<coco/aokvqa/gqa>_pope_<random/popular/adversarial>
conda activate dyfo
CUDA_VISIBLE_DEVICES=0 bash dyfo/scripts/pope/stream_pope_<qwen/llava>.sh mcts False gqa/gqa_pope_random
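The dataset pattern above expands to nine names. The sketch below enumerates them and builds the matching single-dataset commands as strings. It assumes that both placeholders in the pattern take the same source name, as in the `gqa/gqa_pope_random` example; `pope_subsets` and `pope_commands` are hypothetical helpers, not part of the repository.

```python
from itertools import product

SOURCES = ("coco", "aokvqa", "gqa")
SPLITS = ("random", "popular", "adversarial")


def pope_subsets():
    """List every <source>/<source>_pope_<split> dataset name."""
    return [f"{src}/{src}_pope_{split}" for src, split in product(SOURCES, SPLITS)]


def pope_commands(model):
    """Build one evaluation command per sub-dataset; model is 'qwen' or 'llava'."""
    script = f"dyfo/scripts/pope/stream_pope_{model}.sh"
    return [f"bash {script} mcts False {name}" for name in pope_subsets()]


for command in pope_commands("qwen"):
    print(command)
```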

For V* evaluation (about 30 minutes):

conda activate dyfo
CUDA_VISIBLE_DEVICES=0 bash dyfo/scripts/vstar/stream_vstar_<qwen/llava>.sh mcts False

🏅 Experiments

The experimental results for the new version are shown below:

Dataset	Type	Model	Accuracy↑	Precision	Recall	F1Score↑
MSCOCO	random	LLaVA1.5	92.03	93.94	89.87	91.86
		Qwen2-VL	92.33	96.49	87.87	91.97
	popular	LLaVA1.5	88.77	87.69	90.20	88.93
		Qwen2-VL	89.20	90.50	87.60	89.02
	adversarial	LLaVA1.5	83.33	79.66	89.53	84.31
		Qwen2-VL	86.87	86.62	87.20	86.91
A-OKVQA	random	LLaVA1.5	90.43	87.42	94.47	90.80
		Qwen2-VL	92.33	92.05	92.67	92.36
	popular	LLaVA1.5	84.83	79.04	94.80	86.21
		Qwen2-VL	89.17	87.07	92.00	89.47
	adversarial	LLaVA1.5	75.17	68.11	94.67	79.22
		Qwen2-VL	82.13	76.78	92.13	83.76
GQA	random	LLaVA1.5	90.03	87.27	93.73	90.39
		Qwen2-VL	88.60	94.74	81.73	87.76
	popular	LLaVA1.5	80.33	74.00	93.53	82.63
		Qwen2-VL	85.87	88.93	81.93	85.29
	adversarial	LLaVA1.5	75.03	68.33	93.33	78.90
		Qwen2-VL	81.87	82.12	81.47	81.79

Table 1. Results on POPE for MSCOCO/A-OKVQA/GQA with LLaVA1.5 and Qwen2-VL.
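As a sanity check when reading the table, the F1 column should equal the harmonic mean of the Precision and Recall columns. The sketch below recomputes it for two rows. Because the inputs are already rounded to two decimals, agreement is only expected to within about 0.01.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the POPE table.
rows = {
    "MSCOCO random, LLaVA1.5": (93.94, 89.87, 91.86),
    "GQA adversarial, Qwen2-VL": (82.12, 81.47, 81.79),
}

for name, (precision, recall, reported) in rows.items():
    computed = f1_score(precision, recall)
    print(f"{name}: computed {computed:.2f}, reported {reported:.2f}")
```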

Dataset	Model	Attribute↑	Spatial↑	Overall↑
V*	DyFo-L	65.22	57.89	62.30
	DyFo-Q	80.87	78.95	80.10

Table 2. Results on V*. DyFo-L and DyFo-Q denote our method with LLaVA1.5 and Qwen2-VL, respectively.
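The Overall column looks consistent with a sample-weighted average of the two sub-task accuracies. The sketch below checks this under the assumption that the V* benchmark contains 115 attribute samples and 76 spatial samples. These counts are not stated in this README, so treat them as an assumption to verify against the benchmark itself.

```python
N_ATTRIBUTE = 115   # assumed number of attribute-recognition samples
N_SPATIAL = 76      # assumed number of spatial-relationship samples


def overall_accuracy(attribute, spatial, n_attr=N_ATTRIBUTE, n_spatial=N_SPATIAL):
    """Sample-weighted average of the two sub-task accuracies (in percent)."""
    return (attribute * n_attr + spatial * n_spatial) / (n_attr + n_spatial)


# (attribute, spatial, reported overall) copied from the V* table.
results = {
    "DyFo-L": (65.22, 57.89, 62.30),
    "DyFo-Q": (80.87, 78.95, 80.10),
}

for name, (attribute, spatial, reported) in results.items():
    computed = overall_accuracy(attribute, spatial)
    print(f"{name}: computed {computed:.2f}, reported {reported:.2f}")
```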

Please refer to our paper for detailed experimental results.

📑 Citation

If you find our project useful, please consider starring the repo and citing our paper:

@misc{li2025dyfotrainingfreedynamicfocus,
      title={DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding}, 
      author={Geng Li and Jinglin Xu and Yunzhen Zhao and Yuxin Peng},
      year={2025},
      eprint={2504.14920},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.14920}, 
}

V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
LLaVA 1.5: Improved Baselines with Visual Instruction Tuning
LangSam: Language Segment-Anything (Cool Expert!)
vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention
VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding