Vero: An Open RL Recipe for General Visual Reasoning
June 3, 2026
Vero is a fully open reinforcement learning recipe for training and evaluating multi-task visual reasoning with vision-language models.
The released project combines an RL training stack (vero-rl) and an evaluation harness (vero-eval).
Highlights
- 600K curated RL samples from 59 datasets across 6 visual reasoning task categories: STEM; Chart & OCR; Spatial & Action; Knowledge & Recognition; Grounding, Counting & Search; and Captioning & Instruction Following
- Single-stage RL recipe for visual reasoning with task-routed reward functions
- VeroEvalSuite with 30 benchmarks spanning the 6 multimodal reasoning task categories
- Support for many base models: Qwen3.5, Qwen2.5-VL, Qwen3-VL, MiMo-VL, Bee, Molmo2
- Fully open codebase for training and evaluation
Installation
Clone Repository
git clone https://github.com/zlab-princeton/vero.git
cd vero
Environment Setup
bash scripts/setup_env.sh
This installs PyTorch, vLLM, Transformers, FlashAttention, and both project packages (vero-rl, vero-eval) in editable mode. See scripts/setup_env.sh for the full setup flow.
Data Setup
For Vero RL training, the model-run scripts use formatted local data under vero-rl/data by default.
Prepare it once with:
python scripts/download_and_format_vero_600k.py
This script downloads or reuses cached data from zlab-princeton/Vero-600k, exports images into vero-rl/data/images/, and writes:
vero-rl/data/vero_600k_train.verl.jsonl
vero-rl/data/vero_600k_val.verl.jsonl
All bash launchers in vero-rl/examples/model_runs/ will pick up those files automatically once they exist.
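Before launching a run, it can help to confirm that the formatted files are in place. The snippet below is a minimal sketch rather than part of the released tooling: summarize_jsonl is a hypothetical helper, and it assumes each file is JSON Lines with one JSON object per line, as the .jsonl extension suggests. It makes no assumption about which fields a record contains; see docs/DATA.md for the actual schema.

```python
import json
from pathlib import Path


def summarize_jsonl(path, max_preview_keys=10):
    """Count records in a JSON Lines file and list the keys of the first record.

    Returns None if the file does not exist. Assumes one JSON object per line.
    """
    path = Path(path)
    if not path.exists():
        return None
    count = 0
    first_keys = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            if count == 0:
                first_keys = sorted(json.loads(line).keys())[:max_preview_keys]
            count += 1
    return {"records": count, "first_record_keys": first_keys}


if __name__ == "__main__":
    # Paths taken from the data setup step above; run from the repository root.
    data_dir = Path("vero-rl/data")
    for name in ("vero_600k_train.verl.jsonl", "vero_600k_val.verl.jsonl"):
        print(name, summarize_jsonl(data_dir / name))
```

A result of None for either file means the download-and-format step has not produced it yet.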
For custom data, Vero expects a specific data format for RL training.
For dataset format, curation details, and reward routing metadata, see docs/DATA.md.
Vero Reward
We open source our runtime reward stack in vero-rl/vero_reward. Its main entrypoint, math_verify_reward_type_boxed.py, routes scoring by reward_type and combines strict <think>/<answer> format checks with task-specific accuracy. The package covers boxed/numeric/string-match style rewards, grounding rewards based on bbox matching in grounding_reward.py, clicking rewards based on point-in-box checks in click_reward.py, and instruction-following checks in instructions.py.
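To illustrate the routing idea, here is a simplified sketch and not the code in vero_reward. Every function name, the comma-separated answer encoding, the reward_type keys, and the IoU threshold of 0.5 are assumptions made for the example. It shows a strict <think>/<answer> format check followed by a dispatch on reward_type to a string-match, point-in-box, or bbox-overlap scorer.

```python
import re

# Strict format: one <think> block followed by one <answer> block, nothing else.
_FORMAT_RE = re.compile(
    r"\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL
)
_TAGS = ("<think>", "</think>", "<answer>", "</answer>")


def extract_answer(response):
    """Return the <answer> payload if the response is well-formed, else None."""
    if any(response.count(tag) != 1 for tag in _TAGS):
        return None
    match = _FORMAT_RE.fullmatch(response)
    return match.group(1).strip() if match else None


def _parse_floats(text, expected):
    """Parse exactly `expected` comma-separated floats, else return None."""
    try:
        values = [float(v) for v in text.split(",")]
    except ValueError:
        return None
    return values if len(values) == expected else None


def string_match_reward(answer, target):
    """Case-insensitive exact match."""
    return 1.0 if answer.strip().lower() == str(target).strip().lower() else 0.0


def click_reward(answer, target_box):
    """1.0 if the predicted point 'x,y' lies inside (x1, y1, x2, y2)."""
    point = _parse_floats(answer, 2)
    if point is None:
        return 0.0
    x, y = point
    x1, y1, x2, y2 = target_box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(answer, target_box, threshold=0.5):
    """1.0 if the predicted box overlaps the target with IoU >= threshold (assumed)."""
    box = _parse_floats(answer, 4)
    if box is None:
        return 0.0
    return 1.0 if iou(box, target_box) >= threshold else 0.0


# Hypothetical reward_type keys; the real package defines its own.
REWARD_FNS = {
    "string_match": string_match_reward,
    "click": click_reward,
    "grounding": grounding_reward,
}


def score(response, target, reward_type):
    """Return the format and accuracy components for one rollout."""
    answer = extract_answer(response)
    if answer is None:
        return {"format": 0.0, "accuracy": 0.0}
    return {"format": 1.0, "accuracy": REWARD_FNS[reward_type](answer, target)}
```

The sketch returns the two components separately and leaves their combination to the caller, since the actual weighting lives in the training config.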
During Vero RL training, these rule-based rewards are combined with an LLM-judge path implemented in vero_vllm_judge.py. The shared model-run config gspo_llmjudge_shared.yaml enables the vero_vllm_judge reward manager, points the custom reward function at vero_reward/math_verify_reward_type_boxed.py, and configures judge parameters such as the local API endpoint, sampling settings, sleep mode, and the instruction-following blend weight.
The LLM judge itself uses the prompt in llm_judge_reference.txt, which asks the judge model to compare the rollout answer against a reference answer and return a structured 1-10 score. In the standard training scripts such as run_gspo_qwen3vl_instruct_mix_all_llmjudge.sh, the judge server is started automatically by sourcing llm_judge_server.sh, which launches a local vllm serve process, waits for readiness, and prepares the server for training-time reward calls.
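A judge score on a 1-10 scale has to be turned into a reward before it can be mixed with the rule-based signal. The sketch below shows one plausible way to do that; it is not taken from vero_vllm_judge.py. The "Score: N" output pattern, the linear mapping to [0, 1], and the blend function with its judge_weight parameter are all assumptions made for illustration.

```python
import re

# Hypothetical output pattern: the judge ends its reply with "Score: <1-10>".
_SCORE_RE = re.compile(r"score\s*[:=]\s*(\d+)", re.IGNORECASE)


def parse_judge_score(judge_output):
    """Map a 1-10 judge score to [0, 1]; return None if it cannot be parsed."""
    match = _SCORE_RE.search(judge_output)
    if match is None:
        return None
    raw = int(match.group(1))
    if not 1 <= raw <= 10:
        return None  # reject out-of-range scores instead of clamping
    return (raw - 1) / 9.0


def blend(rule_reward, judge_reward, judge_weight):
    """Linear blend of a rule-based reward and a judge reward.

    Falls back to the rule-based reward when the judge gave no usable score.
    """
    if judge_reward is None:
        return rule_reward
    return (1.0 - judge_weight) * rule_reward + judge_weight * judge_reward
```

Rejecting unparseable or out-of-range scores, rather than guessing a value, keeps a misbehaving judge from silently injecting noise into the reward.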
Model Checkpoints
Pretrained Hugging Face checkpoints are available via the following links:
| Model | Base Model | Parameters | HF Link |
|---|---|---|---|
| Vero-Qwen25-7B | Qwen2.5-VL-7B-Instruct | 7B | zlab-princeton/Vero-Qwen25-7B |
| Vero-Qwen3I-8B | Qwen3-VL-8B-Instruct | 8B | zlab-princeton/Vero-Qwen3I-8B |
| Vero-Qwen3T-8B | Qwen3-VL-8B-Thinking | 8B | zlab-princeton/Vero-Qwen3T-8B |
| Vero-MiMo-7B | MiMo-VL-7B-SFT | 7B | zlab-princeton/Vero-MiMo-7B |
See docs/MODELS.md for the documented model families, training settings, and inference format.
Supported Training Launch Scripts
| Script | Model Family | Base Model |
|---|---|---|
| Train Vero-Qwen25-7B | Vero-Qwen25-7B | Qwen2.5-VL-7B-Instruct |
| Train Vero-Qwen3I-8B | Vero-Qwen3I-8B | Qwen3-VL-8B-Instruct |
| Train Vero-MiMo-7B | Vero-MiMo-7B | MiMo-VL-7B-SFT |
Quick Start
First set cache paths (the base model and reward judge download on the fly under HF_HOME) and prepare the repo-local training data:
cp scripts/set_paths.sh.example set_paths.sh # edit ROOT_PATH (a roomy disk)
source set_paths.sh # sets HF_HOME, activates verovlm
python scripts/download_and_format_vero_600k.py
Then launch a training run. TRAIN_FILES, VAL_FILES, and IMAGE_ROOT are optional overrides if you want to point at different formatted data.
export ROOT_PATH="/path/to/data_root" # for datasets and checkpoints
cd vero-rl
bash examples/model_runs/run_gspo_qwen3vl_instruct_mix_all_llmjudge.sh
The reward judge (Qwen/Qwen3.5-27B by default) downloads on first use; override it with export VLLM_JUDGE_MODEL_PATH=<model>.
Optional dataset overrides:
export TRAIN_FILES="/path/to/train.verl.jsonl"
export VAL_FILES="/path/to/val.verl.jsonl"
export IMAGE_ROOT="/path/to/data_root"
The training scripts auto-detect REPO_ROOT from their location, manage the LLM judge server automatically, and use Hydra-based configs from vero-rl/examples/model_runs/config/.
Evaluation
Evaluation is independent of training — if you only want to run the benchmarks, you can skip the training setup entirely.
Vero is evaluated with vero-eval, an evaluation harness built on lmms-eval. It houses VeroEvalSuite, a 30-benchmark suite spanning:
- Chart and OCR
- STEM reasoning
- Spatial reasoning and action
- Knowledge and recognition
- Grounding, counting, and visual search
- Captioning and instruction following
Evaluation Benchmarks
| Task Category | Benchmarks |
|---|---|
| Chart & OCR | ChartQA-Pro, ChartQA, InfoVQA, CharXiv, ChartMuseum, EvoChart |
| STEM | MMMU-PRO Standard, MMMU-PRO Vision, MathVision, MathVista |
| Spatial & Action | Blink, ERQA, GameQA, EmbSpatial, CVBench |
| Knowledge & Recognition | RealWorldQA, SimpleVQA (English), FVQA, MM-Vet V2 |
| Grounding, Counting & Visual Search | CountBenchQA, CountQA, MMERealWorld, VStarBench, AerialVG, VisualProbe, ScreenSpot, ScreenSpotPro |
| Captioning & Instruction Following | MM-MTBench, MIABench, MMIFEval |
Quick Start
First set your cache paths and Hugging Face login (datasets and models download on the fly under HF_HOME), then verify the machine is ready:
cp scripts/set_paths.sh.example set_paths.sh # edit ROOT_PATH (a roomy disk)
source set_paths.sh # sets HF_HOME, caches, JUDGE_MODEL_PATH
huggingface-cli login # gated datasets (e.g. MMMU_Pro)
cd vero-eval
bash examples/preflight.sh --download-judge # check env/GPU/login + pre-fetch judge
Then run an evaluation:
cd vero-eval
# Single task (rule-based, no judge needed); --limit for a quick smoke test
bash examples/eval.sh \
--model-path zlab-princeton/Vero-Qwen3I-8B \
--tasks chartqa_reasoning \
--limit 5
# A full domain. Reasoning variants need a judge (pass it with --judge-model);
# judge-based tasks need 2 GPUs — one for the model, one for the judge.
bash examples/eval_domain.sh \
--model-path zlab-princeton/Vero-Qwen3I-8B \
--domain chart_ocr \
--variant reasoning \
--judge-model Qwen/Qwen3-32B \
--num-gpus 2
The judge is selected by the JUDGE_MODEL_PATH env var (which --judge-model sets). If left unset, judge-based tasks fall back to OpenAI gpt-4o and need GPT_API_KEY. Judge tasks require 2 GPUs. See docs/EVALUATION.md.
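The judge selection rule can be written out as a small function. This is a sketch of the documented behavior only, not the harness code: resolve_judge and the shape of its return value are hypothetical, while the variable names JUDGE_MODEL_PATH and GPT_API_KEY and the gpt-4o fallback come from the note above.

```python
import os


def resolve_judge(env=None):
    """Pick the judge backend from environment variables.

    JUDGE_MODEL_PATH set   -> local judge model.
    JUDGE_MODEL_PATH unset -> OpenAI gpt-4o, which requires GPT_API_KEY.
    """
    env = os.environ if env is None else env
    judge_path = env.get("JUDGE_MODEL_PATH", "").strip()
    if judge_path:
        return {"backend": "local", "model": judge_path}
    if not env.get("GPT_API_KEY"):
        raise RuntimeError(
            "No judge configured: set JUDGE_MODEL_PATH or provide GPT_API_KEY."
        )
    return {"backend": "openai", "model": "gpt-4o"}
```

For example, resolve_judge({"JUDGE_MODEL_PATH": "Qwen/Qwen3-32B"}) selects the local judge, while an environment with neither variable set raises an error before any evaluation starts.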
Setting up with an AI coding agent? docs/AGENTS_SETUP.md is a one-file runbook a Claude Code / Codex agent (or a human) can follow end to end.
See docs/EVALUATION.md for benchmark coverage, judge configuration, and evaluation workflows.
Repository Structure
Vero/
|-- docs/ Data, training, evaluation, and model documentation
|-- scripts/ Environment setup and data filtering scripts
|-- vero-eval/ Evaluation harness built around lmms-eval
`-- vero-rl/ RL training framework built around veRL
Documentation
- Agent Setup Guide — one-file, end-to-end setup + eval runbook for AI coding agents (Claude Code / Codex) or humans
- Training Guide
- Evaluation Guide
- Data Guide
- Model Guide
Citation
If you use this repository, please cite:
@article{sarch2026vero,
title = {Vero: An Open RL Recipe for General Visual Reasoning},
author = {Sarch, Gabriel and Cai, Linrong and Wang, Qunzhong and Wu, Haoyang and Chen, Danqi and Liu, Zhuang},
year = {2026},
journal = {arXiv preprint arXiv:2604.04917},
}
Acknowledgements
This project builds on several strong open-source foundations, including veRL (the basis of vero-rl) and lmms-eval (the basis of vero-eval).
License
This project is licensed under the Apache License 2.0.