
# Vision-DeepResearch & Vision-DeepResearch Benchmark (VDR-Bench)

The official repo for "Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models" and "Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models".

Project Page

🤗 Cold-start Dataset (demo)   |   🤗 RL Dataset (demo)   |   🤗 VDR-Bench (full)   |   🤗 VDR-Bench (testmini)  

🤗 Vision-DeepResearch-30B-A3B (SFT+RL, coming soon)   |   🤗 Vision-DeepResearch-8B (SFT-only)  

📑 Vision-DeepResearch Paper   |   📑 VDR-Bench Paper  

The datasets, code, and weights will be released; stay tuned!

## Timeline

## Demo (click to watch on YouTube)

## Comparison with Other Methods

## More Cases

## Performance

| Model | VDR | FVQA | MMSearch+ | MMSearch | LiveVQA | BC-VL | Avg. |
|---|---|---|---|---|---|---|---|
| **Direct Answer** | | | | | | | |
| GPT-5 | 9.8 | 57.3 | 19.1 | 33.3 | 57.5 | 47.2 | 37.4 |
| Gemini-2.5 Pro | 8.0 | 60.7 | 14.5 | 39.8 | 60.3 | 43.1 | 37.7 |
| Gemini-2.5 Flash | 6.2 | 47.7 | 8.1 | 30.4 | 51.0 | 37.1 | 30.1 |
| Claude-4-Sonnet | 2.0 | 35.3 | 4.0 | 18.7 | 38.5 | 29.3 | 21.3 |
| Claude-3.7-Sonnet | 4.6 | 36.7 | 4.0 | 21.1 | 38.0 | 32.3 | 22.8 |
| Qwen3-VL-8B-Instruct | 2.8 | 28.0 | 3.2 | 15.2 | 41.0 | 25.1 | 19.2 |
| Qwen3-VL-8B-Thinking | 5.6 | 24.0 | 2.7 | 15.8 | 43.3 | 25.1 | 19.4 |
| Qwen3-VL-30B-A3B-Instruct | 3.8 | 34.7 | 3.2 | 18.7 | 42.7 | 29.6 | 22.1 |
| Qwen3-VL-30B-A3B-Thinking | 4.4 | 32.7 | 4.5 | 19.3 | 49.0 | 34.6 | 24.1 |
| **RAG Workflow** | | | | | | | |
| Gemini-2.5 Flash | -- | -- | -- | 43.9 | 41.3 | 12.1 | -- |
| Claude-3.7-Sonnet | -- | -- | -- | 32.7 | 30.3 | 10.0 | -- |
| Qwen2.5-VL-72B | -- | -- | -- | 29.2 | 35.7 | 10.2 | -- |
| **Agent Workflow** | | | | | | | |
| GPT-5 | 20.4 | 69.0 | 17.2 | 63.7 | 73.3 | 46.1 | 48.3 |
| Gemini-2.5 Pro | 18.8 | 68.3 | 22.2 | 69.0 | 76.0 | 49.9 | 50.7 |
| Gemini-2.5 Flash | 16.3 | 68.0 | 19.9 | 64.0 | 73.0 | 44.6 | 47.6 |
| Claude-4-Sonnet | 13.6 | 69.0 | 23.1 | 67.2 | 69.7 | 48.6 | 48.5 |
| Claude-3.7-Sonnet | 27.2 | 67.3 | 17.2 | 63.7 | 72.0 | 50.4 | 49.6 |
| Qwen3-VL-8B-Thinking | 17.6 | 51.3 | 12.2 | 45.6 | 56.3 | 37.1 | 36.7 |
| Qwen3-VL-30B-A3B-Thinking | 23.2 | 63.0 | 13.6 | 53.2 | 62.0 | 44.1 | 43.2 |
| **Multimodal DeepResearch MLLM** | | | | | | | |
| MMSearch-R1-7B | -- | 58.4 | -- | 53.8 | 48.4 | -- | -- |
| WebWatcher-7B | -- | -- | -- | 49.1 | 51.2 | 20.3 | -- |
| WebWatcher-32B | -- | -- | -- | 55.3 | 58.7 | 26.7 | -- |
| **Ours** | | | | | | | |
| Qwen3-VL-8B-Instruct (Agentic) | 17.0 | 58.7 | 11.3 | 52.0 | 63.0 | 38.6 | 40.1 |
| Vision-DeepResearch-8B (Ours) | 29.2 (+12.2) | 64.7 (+6.0) | 20.4 (+9.1) | 69.6 (+17.6) | 76.7 (+13.7) | 42.6 (+4.0) | 50.5 (+10.4) |
| Qwen3-VL-30B-A3B-Instruct (Agentic) | 20.2 | 57.7 | 10.0 | 55.0 | 60.0 | 42.6 | 40.9 |
| Vision-DeepResearch-30B-A3B (Ours) | 37.8 (+17.6) | 74.2 (+16.5) | 28.5 (+18.5) | 69.6 (+14.6) | 77.6 (+17.6) | 53.7 (+11.1) | 56.9 (+16.0) |

## Teaser

### Vision-DeepResearch

### VDR-Bench

## Data Pipeline

### Vision-DeepResearch

### VDR-Bench

## Quickstart

### Environment Setup

```bash
# 1. Clone the repository
git clone https://github.com/Osilly/Vision-DeepResearch.git
cd Vision-DeepResearch

# 2. Install verl
cd rllm/verl
pip install -e .

# 3. Install Megatron-LM
cd ../../Megatron-LM
pip install -e .

# 4. Install mbridge
cd ../mbridge
pip install -e .

# 5. Install rllm
cd ../rllm
pip install -e .

# 6. Install additional dependencies
pip install requests==2.32.3
pip install oss2

# 7. Return to the project root directory
cd ..
```
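
A quick way to confirm the editable installs resolved is to import each package. The import names below are assumptions based on the standard distributions of these projects; adjust them if your checkout differs:

```bash
# Sanity check: confirm the editable installs are importable.
# Import names (verl, megatron.core, mbridge, rllm) are assumed.
python - <<'EOF'
import verl
import megatron.core
import mbridge
import rllm
print("All core packages imported successfully.")
EOF
```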

### Data Preparation

#### SFT Data

Download the Cold-start dataset (Demo 1K).

You need to convert the data in Parquet format into the JSONL training format supported by ms-swift. We provide a conversion script for this purpose: ms-swift/run/data_prepare/convert_parquet2jsonl.sh.

You must provide an --image_dir: images stored as bytes in the Parquet file will be decoded to .png/.jpg files and saved to that directory.
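
A typical invocation might look like the sketch below. The --input/--output flag names are illustrative; check the script for its exact arguments:

```bash
# Convert the Parquet cold-start data into ms-swift JSONL.
# --input/--output flag names are hypothetical; --image_dir receives the
# images decoded from the Parquet byte columns as .png/.jpg files.
bash ms-swift/run/data_prepare/convert_parquet2jsonl.sh \
  --input /path/to/cold_start_demo.parquet \
  --output /path/to/cold_start_sft.jsonl \
  --image_dir /path/to/sft_images
```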

#### RL Data

Download the RL dataset (Demo 1K).

First, you need to convert the data in Parquet format into the JSONL format. We provide a conversion script for this purpose: rllm/vision_deepresearch_async_workflow/data_prepare/convert_parquet2jsonl.sh.

Then, you need to run rllm/vision_deepresearch_async_workflow/data_prepare/register_rl_dataset.sh to register the RL dataset.
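
A sketch of the two steps (the --input/--output flag names are illustrative; check each script for its exact arguments):

```bash
# Step 1: convert the RL Parquet data to JSONL (flag names are hypothetical).
bash rllm/vision_deepresearch_async_workflow/data_prepare/convert_parquet2jsonl.sh \
  --input /path/to/rl_demo.parquet \
  --output /path/to/rl_data.jsonl

# Step 2: register the converted dataset with rllm.
bash rllm/vision_deepresearch_async_workflow/data_prepare/register_rl_dataset.sh
```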

### SFT Train

```bash
cd ms-swift
bash run/vision_deepresearch_SFT_30B_A3B_megatron_lr2e5_2ep.sh
```

### RL Train

First, deploy the Extract model (used to summarize web page contents) and the Judge model:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve \
  Qwen/Qwen3-VL-30B-A3B-Instruct \
  --host 0.0.0.0 \
  --port 8001 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.8 \
  --served-model-name "Qwen3-VL-30B-A3B-Instruct" \
  --max-model-len 160000 \
  --mm-processor-cache-gb 0 \
  --no-enable-prefix-caching
```

Then, set the vLLM service URLs for JUDGE_MODEL and EXTRACT_MODEL in rllm/.env, and fill in your SERP_API_KEY, JINA_API_KEY, and OSS configuration.
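
The entries might look like the sketch below; the exact key names are illustrative, so match them to the template shipped in rllm/.env:

```bash
# rllm/.env — illustrative values only; key names may differ in the shipped template.
JUDGE_MODEL=Qwen3-VL-30B-A3B-Instruct
JUDGE_MODEL_URL=http://localhost:8001/v1      # hypothetical key name
EXTRACT_MODEL=Qwen3-VL-30B-A3B-Instruct
EXTRACT_MODEL_URL=http://localhost:8001/v1    # hypothetical key name
SERP_API_KEY=<your SerpAPI key>
JINA_API_KEY=<your Jina key>
# OSS configuration (key names assumed)
OSS_ACCESS_KEY_ID=<access key id>
OSS_ACCESS_KEY_SECRET=<access key secret>
OSS_BUCKET=<bucket name>
OSS_ENDPOINT=<endpoint url>
```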

Then, run RL training:

```bash
cd rllm
bash vision_deepresearch_async_workflow/run/vision_deepresearch_30B_A3B_grpo_plus_bfloat16_sglang_megatron_128batch_128mini_8n.sh
```

### Eval

Run the commands below to start the OpenAI-compatible API services:

Vision-DeepResearch model:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve \
  Osilly/Vision-DeepResearch-8B \
  --host 0.0.0.0 \
  --port 8001 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.8 \
  --served-model-name "Vision-DeepResearch-8B" \
  --max-model-len 160000 \
  --mm-processor-cache-gb 0 \
  --no-enable-prefix-caching
```

Extract model (used to summarize web page contents) and Judge model:

```bash
# NOTE: if this service shares a machine with the Vision-DeepResearch service
# above, give it a different port (e.g., 8002) and a disjoint set of GPUs,
# then point rllm/.env at the chosen endpoint.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve \
  Qwen/Qwen3-VL-30B-A3B-Instruct \
  --host 0.0.0.0 \
  --port 8001 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.8 \
  --served-model-name "Qwen3-VL-30B-A3B-Instruct" \
  --max-model-len 160000 \
  --mm-processor-cache-gb 0 \
  --no-enable-prefix-caching
```
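
Once the services are up, a quick request against the OpenAI-compatible endpoint confirms that the Vision-DeepResearch service responds; adjust the host, port, and model name to your deployment:

```bash
# Sanity check against the vLLM OpenAI-compatible API.
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Vision-DeepResearch-8B",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'
```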

Likewise, set the vLLM service URLs for JUDGE_MODEL and EXTRACT_MODEL in rllm/.env, and fill in your SERP_API_KEY, JINA_API_KEY, and OSS configuration.

Modify the base-url and model (the Vision-DeepResearch vLLM service endpoint and served model name) in rllm/eval/run_eval.sh. For the data format of test.parquet, refer to rllm/eval/README.md.
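
Inside run_eval.sh the relevant settings might look like the sketch below; the variable names are illustrative, so match them to the shipped script:

```bash
# rllm/eval/run_eval.sh — illustrative variable names.
BASE_URL=http://localhost:8001/v1   # Vision-DeepResearch vLLM endpoint
MODEL=Vision-DeepResearch-8B        # served model name
TEST_DATA=/path/to/test.parquet     # format documented in rllm/eval/README.md
```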

Run rllm/eval/run_eval.sh to start inference.

```bash
bash rllm/eval/run_eval.sh
```

## Star History

Star History Chart

## Citation

```bibtex
@article{huang2026vision,
  title={Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models},
  author={Huang, Wenxuan and Zeng, Yu and Wang, Qiuchen and Fang, Zhen and Cao, Shaosheng and Chu, Zheng and Yin, Qingyu and Chen, Shuang and Yin, Zhenfei and Chen, Lin and others},
  journal={arXiv preprint arXiv:2601.22060},
  year={2026}
}

@article{zeng2026vision,
  title={Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models},
  author={Zeng, Yu and Huang, Wenxuan and Fang, Zhen and Chen, Shuang and Shen, Yufan and Cai, Yishuo and Wang, Xiaoman and Yin, Zhenfei and Chen, Lin and Chen, Zehui and others},
  journal={arXiv preprint arXiv:2602.02185},
  year={2026}
}
```