
May 11, 2026

VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents

Training VLM agents with multi-turn reinforcement learning

🔥 NeurIPS 2025 🔥

Kangrui Wang*, Pingyue Zhang*, Zihan Wang*, Yaning Gao*, Linjie Li*, Qineng Wang, Hanyang Chen, Chi Wan, Yiping Lu, Zhengyuan Yang, Lijuan Wang, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, Yejin Choi, Manling Li

(* equal contribution)

Example environments: FrozenLake, Navigation, Sokoban, ManiSkill, SVG

We introduce VAGEN, a multi-turn reinforcement learning framework designed specifically for training vision-language model (VLM) agents. Built on this framework, we propose World Modeling RL, a novel reinforcement learning approach that significantly improves the multi-turn performance of VLM agents by explicitly supervising their world model reasoning process, as shown in Figure 1.

We frame multi-turn VLM agentic tasks as a Partially Observable Markov Decision Process (POMDP), shown in Figure 2.


Figure 1. Overview of the VAGEN framework.
Figure 2. POMDP formulation of multi-turn VLM agentic tasks.
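To make the POMDP framing concrete, here is a toy sketch (not VAGEN code): the environment keeps a hidden state, the agent only receives a partial observation each turn, and rewards accumulate over the turns of an episode. `ToyCorridorEnv`, `policy`, and `rollout` are hypothetical names used only for illustration; in VAGEN the policy would be the VLM and observations would include images.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCorridorEnv:
    """Hypothetical 1-D corridor; the hidden state is the agent's position."""
    length: int = 5
    goal: int = 4
    _pos: int = field(default=0, init=False)

    def reset(self) -> str:
        self._pos = 0
        return self._observe()

    def _observe(self) -> str:
        # Partial observability: only adjacent walls are visible,
        # never the absolute position.
        left = "wall" if self._pos == 0 else "open"
        right = "wall" if self._pos == self.length - 1 else "open"
        return f"left={left}, right={right}"

    def step(self, action: str) -> tuple[str, float, bool]:
        delta = {"left": -1, "right": 1}.get(action, 0)
        self._pos = min(max(self._pos + delta, 0), self.length - 1)
        done = self._pos == self.goal
        reward = 1.0 if done else -0.1  # small step penalty, bonus at the goal
        return self._observe(), reward, done


def policy(history: list[str]) -> str:
    # Stand-in for the VLM: a real agent would reason over the whole
    # observation/action history before choosing an action.
    return "right"


def rollout(env: ToyCorridorEnv, max_turns: int = 10) -> tuple[list[str], float]:
    history = [env.reset()]
    total = 0.0
    for _ in range(max_turns):
        action = policy(history)
        obs, reward, done = env.step(action)
        history += [action, obs]
        total += reward
        if done:
            break
    return history, total


history, total = rollout(ToyCorridorEnv())
print(len(history), round(total, 2))  # 9 0.7
```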

News

[2026/02] We have migrated the main branch to VAGEN-Lite, a lightweight and clean reimplementation built on VERL agent-loop for easy customization and stable performance. For the previous full-featured release, please visit the vagen-legacy branch.

[2025/12] Introducing VAGEN-Lite: a lightweight and clean reimplementation of VAGEN, built on the VERL agent-loop for easy customization and stable performance.

[2025/09] VAGEN is accepted to NeurIPS 2025.

[2025/04] We've introduced a new modular design for environments and services in VAGEN:

- Enhanced environment framework for easier creation of custom environments
- New service architecture for efficient distributed training
- Check out our new guides:
  - Creating Environments: new environment protocol.
  - Creating Services: we now support hosting environments in a separate process.

[2025/03] We release VAGEN, a multi-turn reinforcement learning framework for training VLM Agents!

Installation

conda create -n vagen python=3.12 -y
conda activate vagen

git clone https://github.com/mll-lab-nu/VAGEN.git
cd VAGEN
git submodule update --init --recursive

cd verl
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
pip install --no-deps -e .
cd ..
pip install -e .
pip install "trl==0.26.2"

Quick Start

Training

VAGEN currently supports PPO / GRPO with two multi-turn training paradigms:

Multi-turn Concatenated Training: All turns in a trajectory are concatenated into a single training instance.

# Qwen/Qwen2.5-VL-3B-Instruct
cd VAGEN
bash examples/train/sokoban/train_ppo_qwen25vl3b.sh

# Qwen/Qwen3-VL-4B-Instruct
# pip install transformers==4.57.1
# pip install "sglang[all]==0.5.3.post3"
cd VAGEN
bash examples/train/sokoban/train_grpo_qwen3vl4b.sh

# Enable reward-variance-based top-p filtering
cd VAGEN
bash examples/train/frozenlake/train_grpo_qwen25vl3b_filtertopp_vision.sh

Multi-turn Non-Concatenated Training: Each trajectory is split into multiple turn-level training instances.

cd VAGEN
bash examples/train/sokoban/train_ppo_no_concat_qwen25vl3b.sh
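The difference between the two paradigms can be illustrated with a simplified sketch over text turns. This is not VAGEN's implementation, which works on tokens and multimodal inputs; `Turn`, `concat_instance`, and `no_concat_instances` are hypothetical names. In the concatenated case, one instance holds the whole trajectory and, as is common in multi-turn RL, only the agent's responses are marked as trainable while observations serve as context. In the non-concatenated case, each turn becomes its own instance whose prompt is the system prompt plus the current step's observation, matching the No Concat Mode description under Useful Configs.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    observation: str  # environment feedback shown to the agent
    response: str     # agent output for this turn (reasoning + action)


def concat_instance(system: str, turns: list[Turn]) -> list[tuple[str, bool]]:
    """One training instance for the whole trajectory.

    Returns (segment, trainable) pairs: agent responses are trainable,
    the system prompt and observations are context only.
    """
    segments = [(system, False)]
    for t in turns:
        segments.append((t.observation, False))
        segments.append((t.response, True))
    return segments


def no_concat_instances(system: str, turns: list[Turn]) -> list[tuple[str, str]]:
    """One (prompt, target) instance per turn; the prompt contains only the
    system prompt and the current observation, not earlier turns."""
    return [(f"{system}\n{t.observation}", t.response) for t in turns]


traj = [Turn("obs0", "act0"), Turn("obs1", "act1"), Turn("obs2", "act2")]
print(len(concat_instance("sys", traj)))      # 7 segments in one instance
print(len(no_concat_instances("sys", traj)))  # 3 separate instances
```

The trade-off: concatenation lets the model condition on full history but produces long sequences, while the per-turn split keeps inputs short at the cost of dropping earlier context from each prompt.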

Evaluation

VAGEN supports evaluation using different backends (OpenAI, Claude, Gemini, sglang, vLLM). For details, see vagen/evaluate/README.md.

cd VAGEN
# FrozenLake evaluation with sglang
bash examples/evaluate/frozenlake/eval_qwen25_vl_3b.sh

cd VAGEN
# Sokoban evaluation
bash examples/evaluate/sokoban/run_eval.sh

Customizing Your Environment

To train on your own environment, follow the steps below.

1. Create Your Environment Class

Use GymImageEnv as the base class:
- vagen/envs/gym_image_env.py
Refer to Sokoban for a full implementation example:
- vagen/envs/sokoban/sokoban_env.py

2. Register the Environment

Add your environment entry to:

vagen/configs/env_registry.yaml

3. Create Configuration Files

Prepare training and validation configs:

train.yaml
val.yaml

You can follow the Sokoban examples as templates:

4. Create a Training Script

Write your training script based on:

examples/train/sokoban/train_ppo_qwen25vl3b.sh

More Customization

See the Documentation for more customization options:

Custom Filter: trajectory filtering (e.g., the Reward Variance (RV) filter from RAGEN)
Custom Metric: add W&B logging metrics
Configuration: training configuration reference

Useful Configs

Refer to vagen/configs/vagen_multiturn.yaml.

No Concat Mode

# Enable no-concat mode: each input is the system prompt + the current step's observation
trainer:
  concat_multi_turn: False
# Currently only supported with the no_concat_gae advantage estimator
algorithm:
  adv_estimator: no_concat_gae

Image Logging

# Warning:
# - If you set a training-data rollout dir AND enable image logging, training images will also be dumped to disk.
#   This can consume a large amount of storage very quickly. Monitor disk usage and consider cleanup/limits.
trainer:
  log_image:
    enable: false      # set to true to save rollout/validation images to disk
    max_pending: 2     # max concurrent async image dump tasks
    png_compress_level: 0  # PNG compression (0 = fastest, 9 = smallest)
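The `max_pending` option caps how many asynchronous image-dump tasks run at once. Below is a minimal sketch of one way to enforce such a cap with `asyncio.Semaphore`, assuming the images are already encoded as bytes. `dump_images` is a hypothetical helper, not VAGEN's logger, and PNG encoding (where `png_compress_level` would apply) is omitted.

```python
import asyncio
import tempfile
from pathlib import Path


async def dump_images(images: dict[str, bytes], out_dir: Path, max_pending: int = 2) -> int:
    """Write images to disk with at most `max_pending` writes in flight.

    Returns the peak number of concurrent writes observed.
    """
    sem = asyncio.Semaphore(max_pending)
    in_flight = 0
    peak = 0

    async def write_one(name: str, data: bytes) -> None:
        nonlocal in_flight, peak
        async with sem:  # blocks while max_pending writes are already running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                # Run blocking file I/O in a worker thread so the event loop stays free.
                await asyncio.to_thread((out_dir / name).write_bytes, data)
            finally:
                in_flight -= 1

    await asyncio.gather(*(write_one(n, d) for n, d in images.items()))
    return peak


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        placeholder = {f"step_{i}.png": b"placeholder" for i in range(8)}
        peak = asyncio.run(dump_images(placeholder, Path(tmp), max_pending=2))
        print(peak, len(list(Path(tmp).iterdir())))
```

A small cap keeps disk I/O from piling up behind training; as the warning above notes, the total volume written can still grow quickly.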

HuggingFace Hub Upload

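# Requires a Hugging Face token in the environment: export HF_TOKEN=xxx
huggingface_hub:
  hf_save_freq: null   # upload every N steps (must be a multiple of trainer.save_freq); null = disabled
  repo_id: null        # target Hub repository, in <namespace>/<name> form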
  private: false

Training Data Filtering


filter:
  name: reward_variance_top_p # refer to vagen/custom_filter
  filter_kwargs:
    top_p: 0.9
  enable: False # set to True to enable filtering; recommended for GRPO training
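A minimal sketch of a reward-variance top-p filter, under the assumption (suggested by the RAGEN RV filter) that rollouts are grouped by prompt, groups are ranked by reward variance, and the highest-variance groups are kept until they make up a `top_p` fraction of all groups. The actual selection rule of `reward_variance_top_p` may differ (for example, it may rank by standard deviation); check `vagen/custom_filter`. The intuition carries over to GRPO: a group whose rollouts all receive the same reward gets zero group-normalized advantage, so it contributes no learning signal.

```python
import math
from collections import defaultdict
from statistics import pvariance


def reward_variance_top_p(samples: list[tuple[str, float]], top_p: float = 0.9) -> list[tuple[str, float]]:
    """Keep rollouts from the top_p fraction of prompt groups ranked by reward variance.

    `samples` holds (group_id, reward) pairs; a group is all rollouts of one prompt.
    """
    groups: dict[str, list[float]] = defaultdict(list)
    for gid, reward in samples:
        groups[gid].append(reward)
    # Highest-variance groups first.
    ranked = sorted(groups, key=lambda g: pvariance(groups[g]), reverse=True)
    n_keep = max(1, math.ceil(top_p * len(ranked)))
    keep = set(ranked[:n_keep])
    return [s for s in samples if s[0] in keep]


batch = [("a", 1.0), ("a", 0.0), ("b", 1.0), ("b", 1.0), ("c", 0.0), ("c", 1.0), ("c", 1.0)]
# Variances: a = 0.25, c ~ 0.22, b = 0.0; top_p=0.5 keeps ceil(1.5) = 2 groups (a, c).
print(reward_variance_top_p(batch, top_p=0.5))
```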

Known Issues & Fixes

See docs/issues.md

Citation

If you find our framework and paper useful, we would appreciate it if you cite our work:

@inproceedings{wang2025vagen,
  title={VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents},
  author={Kangrui Wang and Pingyue Zhang and Zihan Wang and Yaning Gao and Linjie Li and Qineng Wang and Hanyang Chen and Yiping Lu and Zhengyuan Yang and Lijuan Wang and Ranjay Krishna and Jiajun Wu and Li Fei-Fei and Yejin Choi and Manling Li},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://arxiv.org/abs/2510.16907}
}