DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding

August 14, 2025

This is the official repo for Dynamic Focus (DyFo), a training-free visual search method that enhances fine-grained visual understanding in LMMs/MLLMs by simulating human dynamic visual focus.

🔥 Update

  • [2025-08-11]: 🚀 Updated to be compatible with the latest vLLM. Merged the DyFo and expert environments for easier setup.
  • [2025-05-15]: 🚀 Code released.
  • [2025-04-21]: ⭐️ DyFo was selected as a Poster Highlight at CVPR 2025 (top 13.5% of accepted papers). Check out this link for details.

🎯 Overview

  • We introduce DyFo (Dynamic Focus), a training-free visual search method that dynamically adjusts focus regions to enhance fine-grained visual understanding in large multimodal models (LMMs).
  • The focus adjustment is guided by a bidirectional interaction between LMMs and visual experts, optimized via a Monte Carlo Tree Search (MCTS) algorithm.
  • DyFo effectively filters out irrelevant content while avoiding the need for additional training or specialized localization modules, leading to improved fine-grained visual understanding and reduced hallucination in LMMs.
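
To make the search idea above more concrete, here is a minimal, self-contained sketch of a UCT-style Monte Carlo Tree Search over rectangular focus regions. It is an illustration under stated assumptions, not the DyFo implementation: the action set (`zoom_in`, `left`, `right`, `up`, `down`), the toy reward (IoU against a known target box, standing in for whatever score the LMM and visual expert would jointly produce), and every function name are hypothetical. For brevity, each new node is scored directly rather than through a random rollout.

```python
import math
import random

Box = tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]
ACTIONS = ("zoom_in", "left", "right", "up", "down")  # hypothetical focus moves


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def apply_action(box: Box, action: str) -> Box:
    """Move each side inward by 10% of the box size, or shift the box by 10%."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    if action == "zoom_in":
        return (x0 + 0.1 * w, y0 + 0.1 * h, x1 - 0.1 * w, y1 - 0.1 * h)
    dx = {"left": -0.1 * w, "right": 0.1 * w}.get(action, 0.0)
    dy = {"up": -0.1 * h, "down": 0.1 * h}.get(action, 0.0)
    dx = max(-x0, min(dx, 1.0 - x1))  # clamp so the box stays inside the image
    dy = max(-y0, min(dy, 1.0 - y1))
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)


class Node:
    def __init__(self, box: Box, parent: "Node | None" = None):
        self.box = box
        self.parent = parent
        self.children: dict[str, Node] = {}  # action -> child node
        self.visits = 0
        self.total = 0.0  # sum of rewards backed up through this node


def search(reward_fn, root_box: Box = (0.0, 0.0, 1.0, 1.0), iterations: int = 200,
           max_depth: int = 8, c: float = 1.0, seed: int = 0):
    """Return the best-scoring focus box seen during the search and its reward."""
    rng = random.Random(seed)
    root = Node(root_box)
    best_box, best_reward = root_box, reward_fn(root_box)
    for _ in range(iterations):
        node, depth = root, 0
        # Selection: follow the UCB1-maximizing child while fully expanded.
        while len(node.children) == len(ACTIONS) and depth < max_depth:
            parent = node
            node = max(
                parent.children.values(),
                key=lambda ch: ch.total / ch.visits
                + c * math.sqrt(math.log(parent.visits) / ch.visits),
            )
            depth += 1
        # Expansion: try one action that has not been tried from this node.
        if depth < max_depth:
            action = rng.choice([a for a in ACTIONS if a not in node.children])
            child = Node(apply_action(node.box, action), parent=node)
            node.children[action] = child
            node = child
        # Evaluation: score the focus region itself (no rollout in this sketch).
        reward = reward_fn(node.box)
        if reward > best_reward:
            best_box, best_reward = node.box, reward
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.total += reward
            node = node.parent
    return best_box, best_reward


if __name__ == "__main__":
    target = (0.6, 0.6, 0.8, 0.8)  # toy "object of interest"
    box, score = search(lambda b: iou(b, target))
    print("best focus box:", box, "reward:", round(score, 3))
```

In a real pipeline the reward function would be replaced by a signal derived from the LMM and the visual expert; the tree statistics and the selection rule would stay the same.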

🕹️ Usage

1. Environment Setup

DyFo combines two components for collaborative inference: (1) a large multimodal model (LMM) such as Qwen2-VL or LLaVA-1.5, served via vLLM, and (2) a visual expert such as Lang_SAM (this link).

Note

If you encounter network issues accessing GitHub or HuggingFace during installation, you can try using these mirror sites:

  1. GitHub Mirror: this link
  2. HuggingFace Mirror: this link

1. Python environment setup:

conda create -n dyfo python=3.11
conda activate dyfo
pip install -r requirements.txt

2. Visual expert (Lang_SAM) installation:

  • Clone the GitHub repository:
git clone https://github.com/luca-medeiros/lang-segment-anything && cd lang-segment-anything
  • (Manual Action) Modify line 41 in lang_sam/models/gdino.py to support batch inference:
inputs = self.processor(images=images_pil, text=texts_prompt, return_tensors="pt", padding=True).to(self.model.device)
  • (Manual Action) Modify line 47 in lang_sam/models/gdino.py to work with the latest transformers (version 4.55):
threshold=box_threshold,

2. Data Preparation

Download the dataset from this link and unzip dataset.zip to get the following directory structure:

.
├── dyfo
│   ├── scripts
│   └── src
└── playground (dataset)
    └── data
        └── eval
            ├── pope
            └── vstar
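
Before starting an evaluation it can help to confirm that the unzipped data landed in the expected place. The sketch below checks for the directories shown in the tree above. The helper name `missing_dirs` is hypothetical, and it assumes the dataset directory is literally named `playground` (reading "(dataset)" in the tree as an annotation).

```python
from pathlib import Path

# Directories taken from the layout above, relative to the workspace root.
EXPECTED_DIRS = (
    "dyfo/scripts",
    "dyfo/src",
    "playground/data/eval/pope",
    "playground/data/eval/vstar",
)


def missing_dirs(root: "str | Path" = ".") -> list[str]:
    """Return the expected directories that do not exist under `root`."""
    base = Path(root)
    return [d for d in EXPECTED_DIRS if not (base / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(".")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Dataset layout looks complete.")
```

Run it from the directory that contains both `dyfo` and `playground`.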

3. Evaluation

1. Starting Servers

To start both LMM and Visual Expert servers:

# Start LMM server (recommend tmux)
conda activate dyfo
CUDA_VISIBLE_DEVICES=0 bash dyfo/scripts/lmm_server/<qwen/llava>_server.sh 
# Start Visual Expert server (recommend tmux)
conda activate dyfo 
CUDA_VISIBLE_DEVICES=1 bash dyfo/scripts/expert_server/start_server.sh 
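
Both servers need some time to load their models, so inference scripts launched too early may fail to connect. A small readiness check like the following sketch can be run before the next step. The helper `wait_for_port` is hypothetical and not part of this repo, and the README does not state which ports the servers listen on, so check the two server scripts for the actual values.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0,
                  interval: float = 1.0) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError if nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("127.0.0.1", 8000, timeout=600)` would wait up to ten minutes for a server on port 8000, where 8000 is only a placeholder.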

2. Collaborative Inference

For POPE evaluation:

  • Batch testing (all 9 sub-datasets, about 6-7 hours):
conda activate dyfo
CUDA_VISIBLE_DEVICES=0 bash dyfo/scripts/pope/<qwen/llava>_batch.sh
  • Single-dataset testing (about 40-50 minutes):
# take gqa_random for example 
# other datasets: <coco/aokvqa/gqa>/<coco/aokvqa/gqa>_pope_<random/popular/adversarial>
conda activate dyfo
CUDA_VISIBLE_DEVICES=0 bash dyfo/scripts/pope/stream_pope_<qwen/llava>.sh mcts False gqa/gqa_pope_random
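
The nine POPE sub-dataset names follow the pattern given in the comment above. As a quick illustration, the sketch below enumerates them and builds the matching argument list for the single-dataset script. It assumes, as in the `gqa/gqa_pope_random` example, that the directory name and the file prefix are always the same; the function names are hypothetical.

```python
from itertools import product

SOURCES = ("coco", "aokvqa", "gqa")
SPLITS = ("random", "popular", "adversarial")


def pope_subsets() -> list[str]:
    """All <source>/<source>_pope_<split> combinations (3 x 3 = 9)."""
    return [f"{src}/{src}_pope_{split}" for src, split in product(SOURCES, SPLITS)]


def pope_command(model: str, subset: str) -> list[str]:
    """Argument list mirroring the single-dataset command shown above."""
    return ["bash", f"dyfo/scripts/pope/stream_pope_{model}.sh", "mcts", "False", subset]


for name in pope_subsets():
    print(" ".join(pope_command("qwen", name)))
```

The printed lines can be pasted into a shell one at a time, or passed to `subprocess.run` with `CUDA_VISIBLE_DEVICES` set in the environment.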

For V* evaluation (about 30 minutes):

conda activate dyfo
CUDA_VISIBLE_DEVICES=0 bash dyfo/scripts/vstar/stream_vstar_<qwen/llava>.sh mcts False

πŸ… Experiments

The experimental results of new version are shown below:

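| Dataset | Type | Model | Accuracy↑ | Precision | Recall | F1 Score↑ |
|---|---|---|---|---|---|---|
| MSCOCO | random | LLaVA1.5 | 92.03 | 93.94 | 89.87 | 91.86 |
| MSCOCO | random | Qwen2-VL | 92.33 | 96.49 | 87.87 | 91.97 |
| MSCOCO | popular | LLaVA1.5 | 88.77 | 87.69 | 90.20 | 88.93 |
| MSCOCO | popular | Qwen2-VL | 89.20 | 90.50 | 87.60 | 89.02 |
| MSCOCO | adversarial | LLaVA1.5 | 83.33 | 79.66 | 89.53 | 84.31 |
| MSCOCO | adversarial | Qwen2-VL | 86.87 | 86.62 | 87.20 | 86.91 |
| A-OKVQA | random | LLaVA1.5 | 90.43 | 87.42 | 94.47 | 90.80 |
| A-OKVQA | random | Qwen2-VL | 92.33 | 92.05 | 92.67 | 92.36 |
| A-OKVQA | popular | LLaVA1.5 | 84.83 | 79.04 | 94.80 | 86.21 |
| A-OKVQA | popular | Qwen2-VL | 89.17 | 87.07 | 92.00 | 89.47 |
| A-OKVQA | adversarial | LLaVA1.5 | 75.17 | 68.11 | 94.67 | 79.22 |
| A-OKVQA | adversarial | Qwen2-VL | 82.13 | 76.78 | 92.13 | 83.76 |
| GQA | random | LLaVA1.5 | 90.03 | 87.27 | 93.73 | 90.39 |
| GQA | random | Qwen2-VL | 88.60 | 94.74 | 81.73 | 87.76 |
| GQA | popular | LLaVA1.5 | 80.33 | 74.00 | 93.53 | 82.63 |
| GQA | popular | Qwen2-VL | 85.87 | 88.93 | 81.93 | 85.29 |
| GQA | adversarial | LLaVA1.5 | 75.03 | 68.33 | 93.33 | 78.90 |
| GQA | adversarial | Qwen2-VL | 81.87 | 82.12 | 81.47 | 81.79 |

Table 1. Results on POPE for MSCOCO/A-OKVQA/GQA with LLaVA1.5 and Qwen2-VL.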
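
For readers who want to recompute the POPE columns from raw predictions, the sketch below shows the standard binary-classification formulas. It assumes that "yes" is treated as the positive class, which is the usual POPE convention but is not spelled out in this README; the function names are hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pope_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Accuracy, precision, recall and F1 in percent from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = 100 * (tp + tn) / total if total else 0.0
    precision = 100 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100 * tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


# Consistency check against the MSCOCO / random / LLaVA1.5 row of the table.
print(round(f1_score(93.94, 89.87), 2))
```

Because the table reports rounded precision and recall, an F1 recomputed from them can differ from the reported value in the last digit.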

| Dataset | Model | Attribute↑ | Spatial↑ | Overall↑ |
|---|---|---|---|---|
| V* | DyFo-L | 65.22 | 57.89 | 62.30 |
| V* | DyFo-Q | 80.87 | 78.95 | 80.10 |

Table 2. Results on V*. DyFo-L and DyFo-Q represent our method with LLaVA1.5 and Qwen2-VL, respectively.
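
The Overall column is consistent with a sample-weighted average of the two sub-task accuracies. The sketch below reproduces it under the assumption that the V* benchmark contains 115 attribute and 76 spatial questions; these counts are not stated in this README, so verify them against the benchmark before relying on them.

```python
def overall_accuracy(attr_acc: float, spatial_acc: float,
                     n_attr: int = 115, n_spatial: int = 76) -> float:
    """Sample-weighted mean of two sub-task accuracies (in percent)."""
    # n_attr and n_spatial are assumed sample counts, not values from this README.
    return (attr_acc * n_attr + spatial_acc * n_spatial) / (n_attr + n_spatial)


for name, attr, spatial in (("DyFo-L", 65.22, 57.89), ("DyFo-Q", 80.87, 78.95)):
    print(name, overall_accuracy(attr, spatial))
```

Since the inputs are already rounded to two decimals, the recomputed values match the table only up to rounding in the last digit.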

  • Please refer to our paper for detailed experimental results.

📑 Citation

If you find our project useful, we hope you can star our repo and cite our paper as follows:

@misc{li2025dyfotrainingfreedynamicfocus,
      title={DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding}, 
      author={Geng Li and Jinglin Xu and Yunzhen Zhao and Yuxin Peng},
      year={2025},
      eprint={2504.14920},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.14920}, 
}

Related Projects

  • V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
  • Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
  • LLaVA 1.5: Improved Baselines with Visual Instruction Tuning
  • LangSam: Language Segment-Anything (Cool Expert!)
  • vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention
  • VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding