Vero: An Open RL Recipe for General Visual Reasoning

June 3, 2026

Vero is a fully open reinforcement learning recipe for training and evaluating multi-task visual reasoning with vision-language models.

The released project combines an RL training stack (vero-rl) and an evaluation harness (vero-eval).


Highlights

  • 600K curated RL samples from 59 datasets across 6 visual reasoning task categories: STEM; Chart & OCR; Spatial & Action; Knowledge & Recognition; Grounding, Counting & Search; and Captioning & Instruction Following
  • Single-stage RL recipe for visual reasoning with task-routed reward functions
  • VeroEvalSuite with 30 benchmarks spanning the 6 multimodal reasoning task categories
  • Support for many base models: Qwen3.5, Qwen2.5-VL, Qwen3-VL, MiMo-VL, Bee, Molmo2
  • Fully open codebase for training and evaluation

Installation

Clone Repository

git clone https://github.com/zlab-princeton/vero.git
cd vero

Environment Setup

bash scripts/setup_env.sh

This installs PyTorch, vLLM, Transformers, FlashAttention, and both project packages (vero-rl, vero-eval) in editable mode. See scripts/setup_env.sh for the full setup flow.


Data Setup

Dataset Composition

For Vero RL training, the model-run scripts use formatted local data under vero-rl/data by default. Prepare it once with:

python scripts/download_and_format_vero_600k.py

This script downloads or reuses cached data from zlab-princeton/Vero-600k, exports images into vero-rl/data/images/, and writes:

vero-rl/data/vero_600k_train.verl.jsonl
vero-rl/data/vero_600k_val.verl.jsonl

All bash launchers in vero-rl/examples/model_runs/ will pick up those files automatically once they exist.
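
To sanity-check the formatted output before launching a run, a short script can peek at the first few records. This is a minimal sketch: it assumes only that each line of the .verl.jsonl files is one JSON object. The record schema itself is described in docs/DATA.md and is not assumed here, and the helper name peek_jsonl is illustrative rather than part of the repository.

```python
import json
from pathlib import Path


def peek_jsonl(path, n=3):
    """Return up to the first n records of a JSON Lines file as Python objects."""
    records = []
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            if len(records) >= n:
                break
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Path taken from the README; run this from the repository root.
    train_path = Path("vero-rl/data/vero_600k_train.verl.jsonl")
    if train_path.exists():
        for record in peek_jsonl(train_path):
            # Print only top-level field names, since values can be large.
            print(sorted(record.keys()))
    else:
        print(f"{train_path} not found; run scripts/download_and_format_vero_600k.py first")
```

Printing only the keys keeps the output readable and avoids guessing at field names.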

Custom data must follow the format Vero expects for RL training. See docs/DATA.md for the dataset format, curation details, and reward routing metadata.


Vero Reward

We open source our runtime reward stack in vero-rl/vero_reward. Its main entrypoint, math_verify_reward_type_boxed.py, routes scoring by reward_type and combines strict <think>/<answer> format checks with task-specific accuracy. The package covers boxed/numeric/string-match style rewards, grounding rewards based on bbox matching in grounding_reward.py, clicking rewards based on point-in-box checks in click_reward.py, and instruction-following checks in instructions.py.
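
The routing idea described above can be sketched as follows. This is an illustration, not the code in vero_reward: the reward_type keys ("string_match", "grounding", "click"), the function names, the IoU threshold of 0.5, the answer-parsing rules, and the choice to return zero reward when the format check fails are all assumptions made for the example.

```python
import re

# Strict format: one <think>...</think> block followed by one <answer>...</answer> block.
_FORMAT_RE = re.compile(r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL)
_NUM_RE = re.compile(r"-?\d+(?:\.\d+)?")
_TAGS = ("<think>", "</think>", "<answer>", "</answer>")


def extract_answer(response):
    """Return the <answer> body if the response passes the format check, else None."""
    if any(response.count(tag) != 1 for tag in _TAGS):
        return None  # repeated or missing tags fail the strict check
    m = _FORMAT_RE.fullmatch(response)
    return m.group(2).strip() if m else None


def _numbers(text):
    return [float(x) for x in _NUM_RE.findall(text)]


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def string_match_reward(answer, truth):
    return 1.0 if answer.strip().lower() == str(truth).strip().lower() else 0.0


def grounding_reward(answer, truth_box, threshold=0.5):
    """1.0 if the predicted box overlaps the reference box by at least `threshold` IoU."""
    nums = _numbers(answer)
    if len(nums) != 4:
        return 0.0
    return 1.0 if iou(nums, truth_box) >= threshold else 0.0


def click_reward(answer, truth_box):
    """1.0 if the predicted (x, y) point falls inside the reference box."""
    nums = _numbers(answer)
    if len(nums) != 2:
        return 0.0
    x, y = nums
    x1, y1, x2, y2 = truth_box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0


# Hypothetical routing table: reward_type -> scoring function.
REWARD_FNS = {
    "string_match": string_match_reward,
    "grounding": grounding_reward,
    "click": click_reward,
}


def compute_reward(response, truth, reward_type):
    """Gate on format first, then score with the function selected by reward_type."""
    answer = extract_answer(response)
    if answer is None:
        return 0.0
    return REWARD_FNS[reward_type](answer, truth)
```

For example, compute_reward("<think>t</think><answer>(5, 5)</answer>", (0, 0, 10, 10), "click") scores the click as inside the box, while a response without the tags scores zero regardless of its content. An unknown reward_type raises KeyError in this sketch, which surfaces routing mistakes early.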

During Vero RL training, these rule-based rewards are combined with an LLM-judge path implemented in vero_vllm_judge.py. The shared model-run config gspo_llmjudge_shared.yaml enables the vero_vllm_judge reward manager, points the custom reward function at vero_reward/math_verify_reward_type_boxed.py, and configures judge parameters such as the local API endpoint, sampling settings, sleep mode, and the instruction-following blend weight.

The LLM judge itself uses the prompt in llm_judge_reference.txt, which asks the judge model to compare the rollout answer against a reference answer and return a structured 1-10 score. In the standard training scripts such as run_gspo_qwen3vl_instruct_mix_all_llmjudge.sh, the judge server is started automatically by sourcing llm_judge_server.sh, which launches a local vllm serve process, waits for readiness, and prepares the server for training-time reward calls.
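
To make the judge path concrete, the sketch below parses a 1-10 score from judge output, rescales it to [0, 1], and blends it with a rule-based reward. Everything specific here is an assumption for illustration: the "Score: N" output pattern, the linear rescaling, the blend formula, and the function names. The actual structured format is defined by llm_judge_reference.txt, and the actual blending by vero_vllm_judge.py and gspo_llmjudge_shared.yaml.

```python
import re

# Assumed output pattern for this sketch: a "Score: N" field somewhere in the judge text.
_SCORE_RE = re.compile(r"score\s*[:=]\s*(\d+)", re.IGNORECASE)


def parse_judge_score(judge_output):
    """Return the integer score if present and within 1..10, else None."""
    m = _SCORE_RE.search(judge_output)
    if m is None:
        return None
    score = int(m.group(1))
    return score if 1 <= score <= 10 else None


def normalize_score(score):
    """Map a 1..10 score linearly onto 0.0..1.0 (an assumed rescaling)."""
    return (score - 1) / 9.0


def blend(rule_reward, judge_reward, weight):
    """Convex combination of the two rewards; `weight` is the judge's share."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * judge_reward + (1.0 - weight) * rule_reward
```

Returning None for unparseable or out-of-range scores lets the caller decide how to handle a malformed judge reply instead of silently treating it as a low score.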


Model Checkpoints

Pretrained Hugging Face checkpoints are available via the following links:

| Model | Base Model | Parameters | HF Link |
| --- | --- | --- | --- |
| Vero-Qwen25-7B | Qwen2.5-VL-7B-Instruct | 7B | zlab-princeton/Vero-Qwen25-7B |
| Vero-Qwen3I-8B | Qwen3-VL-8B-Instruct | 8B | zlab-princeton/Vero-Qwen3I-8B |
| Vero-Qwen3T-8B | Qwen3-VL-8B-Thinking | 8B | zlab-princeton/Vero-Qwen3T-8B |
| Vero-MiMo-7B | MiMo-VL-7B-SFT | 7B | zlab-princeton/Vero-MiMo-7B |

See docs/MODELS.md for the documented model families, training settings, and inference format.


Supported Training Launch Scripts

| Script | Model Family | Base Model |
| --- | --- | --- |
| Train Vero-Qwen25-7B | Vero-Qwen25-7B | Qwen2.5-VL-7B-Instruct |
| Train Vero-Qwen3I-8B | Vero-Qwen3I-8B | Qwen3-VL-8B-Instruct |
| Train Vero-MiMo-7B | Vero-MiMo-7B | MiMo-VL-7B-SFT |

Quick Start

First set cache paths (the base model and reward judge download on the fly under HF_HOME) and prepare the repo-local training data:

cp scripts/set_paths.sh.example set_paths.sh   # edit ROOT_PATH (a roomy disk)
source set_paths.sh                            # sets HF_HOME, activates verovlm
python scripts/download_and_format_vero_600k.py

Then launch a training run. TRAIN_FILES, VAL_FILES, and IMAGE_ROOT are optional overrides if you want to point at different formatted data.

export ROOT_PATH="/path/to/data_root"  # for datasets and checkpoints
cd vero-rl
bash examples/model_runs/run_gspo_qwen3vl_instruct_mix_all_llmjudge.sh

The reward judge (Qwen/Qwen3.5-27B by default) downloads on first use; override it with export VLLM_JUDGE_MODEL_PATH=<model>.

Optional dataset overrides:

export TRAIN_FILES="/path/to/train.verl.jsonl"
export VAL_FILES="/path/to/val.verl.jsonl"
export IMAGE_ROOT="/path/to/data_root"

The training scripts auto-detect REPO_ROOT from their location, manage the LLM judge server automatically, and use Hydra-based configs from vero-rl/examples/model_runs/config/.
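
Before launching, it can help to confirm which data files a run will actually see. The sketch below mirrors the override behavior described above in Python; the real launchers are bash scripts, so this is only an approximation. The function names are hypothetical, and because the default for IMAGE_ROOT is not stated here, the sketch leaves it as None when unset.

```python
import os
from pathlib import Path


def resolve_data_paths(repo_root, env=None):
    """Apply TRAIN_FILES / VAL_FILES / IMAGE_ROOT overrides on top of the repo-local defaults."""
    env = os.environ if env is None else env
    data_dir = Path(repo_root) / "vero-rl" / "data"
    image_root = env.get("IMAGE_ROOT")
    return {
        "train": Path(env.get("TRAIN_FILES", data_dir / "vero_600k_train.verl.jsonl")),
        "val": Path(env.get("VAL_FILES", data_dir / "vero_600k_val.verl.jsonl")),
        # None means "not overridden"; the launcher's own default applies.
        "image_root": Path(image_root) if image_root else None,
    }


def missing_paths(paths):
    """Return the names of resolved paths that do not exist on disk."""
    return [name for name, p in paths.items() if p is not None and not p.exists()]


if __name__ == "__main__":
    resolved = resolve_data_paths(repo_root=".")
    for name, path in resolved.items():
        print(f"{name}: {path}")
    print("missing:", missing_paths(resolved))
```

Run from the repository root, this prints the resolved paths and flags any that are absent, which is cheaper than discovering a bad path after the judge server has started.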


Evaluation

Evaluation is independent of training — if you only want to run the benchmarks, you can skip the training setup entirely.

Vero is evaluated with vero-eval, an evaluation harness built on lmms-eval. It houses VeroEvalSuite, a 30-benchmark suite spanning:

  • Chart and OCR
  • STEM reasoning
  • Spatial reasoning and action
  • Knowledge and recognition
  • Grounding, counting, and visual search
  • Captioning and instruction following

Evaluation Benchmarks

| Task Category | Benchmarks |
| --- | --- |
| Chart & OCR | ChartQA-Pro, ChartQA, InfoVQA, CharXiv, ChartMuseum, EvoChart |
| STEM | MMMU-PRO Standard, MMMU-PRO Vision, MathVision, MathVista |
| Spatial & Action | Blink, ERQA, GameQA, EmbSpatial, CVBench |
| Knowledge & Recognition | RealWorldQA, SimpleVQA (English), FVQA, MM-Vet V2 |
| Grounding, Counting & Visual Search | CountBenchQA, CountQA, MMERealWorld, VStarBench, AerialVG, VisualProbe, ScreenSpot, ScreenSpotPro |
| Captioning & Instruction Following | MM-MTBench, MIABench, MMIFEval |
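
For scripting over the suite, the table can be written as a plain mapping. This sketch uses the display names from the table, not vero-eval task identifiers (those are documented in docs/EVALUATION.md), and the variable name is illustrative. It also confirms that the per-category counts add up to the 30 benchmarks stated above.

```python
# Display names copied from the benchmark table; these are NOT vero-eval task IDs.
VERO_EVAL_SUITE = {
    "Chart & OCR": [
        "ChartQA-Pro", "ChartQA", "InfoVQA", "CharXiv", "ChartMuseum", "EvoChart",
    ],
    "STEM": ["MMMU-PRO Standard", "MMMU-PRO Vision", "MathVision", "MathVista"],
    "Spatial & Action": ["Blink", "ERQA", "GameQA", "EmbSpatial", "CVBench"],
    "Knowledge & Recognition": [
        "RealWorldQA", "SimpleVQA (English)", "FVQA", "MM-Vet V2",
    ],
    "Grounding, Counting & Visual Search": [
        "CountBenchQA", "CountQA", "MMERealWorld", "VStarBench",
        "AerialVG", "VisualProbe", "ScreenSpot", "ScreenSpotPro",
    ],
    "Captioning & Instruction Following": ["MM-MTBench", "MIABench", "MMIFEval"],
}

# Per-category counts and the suite total.
counts = {category: len(names) for category, names in VERO_EVAL_SUITE.items()}
total = sum(counts.values())

if __name__ == "__main__":
    for category, n in counts.items():
        print(f"{category}: {n}")
    print(f"total: {total}")  # 6 + 4 + 5 + 4 + 8 + 3 = 30
```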

Quick Start

First set your cache paths and Hugging Face login (datasets and models download on the fly under HF_HOME), then verify the machine is ready:

cp scripts/set_paths.sh.example set_paths.sh   # edit ROOT_PATH (a roomy disk)
source set_paths.sh                            # sets HF_HOME, caches, JUDGE_MODEL_PATH
huggingface-cli login                          # gated datasets (e.g. MMMU_Pro)

cd vero-eval
bash examples/preflight.sh --download-judge    # check env/GPU/login + pre-fetch judge

Then run an evaluation:

cd vero-eval

# Single task (rule-based, no judge needed); --limit for a quick smoke test
bash examples/eval.sh \
    --model-path zlab-princeton/Vero-Qwen3I-8B \
    --tasks chartqa_reasoning \
    --limit 5

# A full domain. Reasoning variants need a judge (pass it with --judge-model);
# judge-based tasks need 2 GPUs — one for the model, one for the judge.
bash examples/eval_domain.sh \
    --model-path zlab-princeton/Vero-Qwen3I-8B \
    --domain chart_ocr \
    --variant reasoning \
    --judge-model Qwen/Qwen3-32B \
    --num-gpus 2

The judge is selected by the JUDGE_MODEL_PATH environment variable, which --judge-model sets. If it is unset, judge-based tasks fall back to OpenAI gpt-4o and require GPT_API_KEY. Judge tasks require 2 GPUs.
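
The selection rule can be sketched as a small function. This illustrates the described behavior and is not code from vero-eval: the function name, the returned dictionary shape, and the choice to raise when neither variable is set are assumptions. Only the environment variable names and the gpt-4o fallback come from the text above.

```python
import os


def select_judge(env=None):
    """Pick the judge backend from environment variables, following the rule described above."""
    env = os.environ if env is None else env
    judge_path = env.get("JUDGE_MODEL_PATH")
    if judge_path:
        # A local judge model was requested (for example via --judge-model).
        return {"backend": "local", "model": judge_path}
    if not env.get("GPT_API_KEY"):
        raise RuntimeError(
            "JUDGE_MODEL_PATH is unset and GPT_API_KEY is missing; "
            "judge-based tasks cannot run."
        )
    # Fallback: OpenAI gpt-4o, which needs GPT_API_KEY.
    return {"backend": "openai", "model": "gpt-4o"}
```

Calling select_judge({"JUDGE_MODEL_PATH": "Qwen/Qwen3-32B"}) picks the local judge. An empty environment raises immediately, so a misconfigured run fails before any benchmark starts.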

Setting up with an AI coding agent? docs/AGENTS_SETUP.md is a one-file runbook a Claude Code / Codex agent (or a human) can follow end to end.

See docs/EVALUATION.md for benchmark coverage, judge configuration, and evaluation workflows.


Repository Structure

Vero/
|-- docs/          Data, training, evaluation, and model documentation
|-- scripts/       Environment setup and data filtering scripts
|-- vero-eval/     Evaluation harness built around lmms-eval
`-- vero-rl/       RL training framework built around veRL

Documentation

  • docs/DATA.md: dataset format, curation details, and reward routing metadata
  • docs/MODELS.md: model families, training settings, and inference format
  • docs/EVALUATION.md: benchmark coverage, judge configuration, and evaluation workflows
  • docs/AGENTS_SETUP.md: one-file setup runbook for a coding agent or a human


Citation

If you use this repository, please cite:

@article{sarch2026vero,
    title   = {Vero: An Open RL Recipe for General Visual Reasoning},
    author  = {Sarch, Gabriel and Cai, Linrong and Wang, Qunzhong and Wu, Haoyang and Chen, Danqi and Liu, Zhuang},
    year    = {2026},
    journal = {arXiv preprint arXiv:2604.04917},
}

Acknowledgements

This project builds on several strong open-source foundations:

  • veRL for distributed RL training infrastructure
  • lmms-eval for multimodal evaluation

License

This project is licensed under the Apache License 2.0.