World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning
June 3, 2026
PF-OPSD: Privileged-Future On-Policy Self-Distillation for controlled concrete reasoning in MLLMs.
Overview
World models and multimodal large language models (MLLMs) offer complementary capabilities for predicting future outcomes from static visual observations. World models generate concrete visual rollouts; MLLMs reason abstractly over goals, rules, and question semantics. However, rolling out a world model blindly does not help: generated rollouts are stochastic and may be visually plausible yet task-incorrect. The model must therefore learn when to invoke simulation, whether to trust a rollout, and how to integrate it with abstract reasoning.
We call this problem controlled concrete reasoning and make three concrete contributions:
- VRQABench: a benchmark for controllable spatial lookahead from maze and Sokoban puzzle images (4,636 questions). 🤗 Dataset
- OpenWorldQA: a benchmark for open-domain physical future prediction from pre-event anchor frames in real-world videos (4,404 questions). 🤗 Dataset
- PF-OPSD: a training framework that uses ground-truth future videos as teacher-side privileged context, scores on-policy concrete-reasoning trajectories, and distills advantage-weighted targets back into a deployable student that has no access to true futures at test time.
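To make "advantage-weighted" concrete, here is a minimal sketch. It is not the PF-OPSD objective, which is defined in the paper and in train/configs/pfopsd.yaml. It assumes each sampled trajectory receives one scalar score, takes the advantage as that score minus the group mean, and converts advantages to normalized weights with a softmax. The function name `advantage_weights`, the temperature `beta`, and the softmax form are all illustrative assumptions.

```python
import math


def advantage_weights(scores, beta=1.0):
    """Map per-trajectory scores to normalized weights (illustrative only).

    Assumption: advantage = score - mean(score) over the sampled group,
    and weights = softmax(advantage / beta).
    """
    if not scores:
        return []
    mean = sum(scores) / len(scores)
    advantages = [s - mean for s in scores]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(advantages)
    exps = [math.exp((a - top) / beta) for a in advantages]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three sampled trajectories of one question.
weights = advantage_weights([0.2, 0.9, 0.5])
```

Under this sketch, higher-scoring trajectories contribute more to the distillation target, and a smaller `beta` concentrates the weight on the best trajectory.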
Installation
git clone https://github.com/yczhou001/PF-OPSD
cd PF-OPSD
pip install -r requirements.txt
Flash Attention 2 (recommended for training speed):
pip install flash-attn --no-build-isolation
Set required environment variables:
export OPENAI_API_KEY="your_api_key"
export OPENAI_API_BASE="https://api.openai.com/v1" # or your endpoint
Part 1: OpenWorldQA Dataset Construction
OpenWorldQA is a benchmark for predicting real-world physical futures from pre-event anchor frames. It is built with a five-stage agentic pipeline applied to videos from Charades, Something-Something V2, Oops, and CharadesEgo.
💡 The prebuilt benchmark is available on Hugging Face: YCZhou/openworld_qa. The steps below are only needed to reconstruct it from scratch.
Step 1a: Download raw videos
Download the following datasets and place them under openworld_qa/raw_data/ (or set RAW_DATA_ROOT):
| Dataset | Suggested path |
|---|---|
| Something-Something V2 | raw_data/sthv2/extracted/20bn-something-something-v2/ |
| Charades | raw_data/charades/Charades_v1_480/Charades_v1_480/ |
| Oops | raw_data/oops/oops_dataset/oops_video/train/ |
| CharadesEgo | raw_data/charades_ego/CharadesEgo_v1_480/ |
Step 1b: Sample and extract frames
# Sample 5000 videos and extract frames (runs ffmpeg in parallel)
python openworld_qa/sample_and_extract.py \
--total 5000 \
--output_dir openworld_qa/output/frames \
--num_workers 16
# Or extract frames from a specific video directory
python openworld_qa/extract_frames.py \
--video_dir raw_data/charades/Charades_v1_480/Charades_v1_480/ \
--output_dir openworld_qa/output/frames
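For readers who want to see what "runs ffmpeg in parallel" can look like, the following is a minimal sketch and not the code in sample_and_extract.py. The helper names, the one-frame-per-second sampling rate, and the `frame_%04d.jpg` naming pattern are assumptions made for illustration. Only standard ffmpeg options (`-y`, `-i`, `-vf fps=...`) are used.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_ffmpeg_cmd(video, out_root, fps=1):
    """Build an ffmpeg command that writes frames to <out_root>/<video stem>/."""
    video = Path(video)
    out_dir = Path(out_root) / video.stem
    return [
        "ffmpeg", "-y",            # overwrite existing frames without asking
        "-i", str(video),
        "-vf", f"fps={fps}",       # sample `fps` frames per second (assumed rate)
        str(out_dir / "frame_%04d.jpg"),
    ]


def extract_all(videos, out_root, num_workers=16, fps=1):
    """Run one ffmpeg process per video in a thread pool; return the exit codes."""
    def run(video):
        cmd = build_ffmpeg_cmd(video, out_root, fps)
        Path(cmd[-1]).parent.mkdir(parents=True, exist_ok=True)
        return subprocess.run(cmd, capture_output=True).returncode

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(run, videos))
```

Threads are sufficient here because the heavy work happens in the ffmpeg subprocesses, not in Python.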
Step 1c: Run the 5-agent pipeline
export OPENAI_API_KEY="your_key"
python openworld_qa/pipeline.py \
--frames_dir openworld_qa/output/frames/ \
--output_dir openworld_qa/output/reviewed/ \
--num_workers 3 \
--save_rejections
The pipeline is resumable: re-running it skips videos that have already been processed.
| Argument | Default | Description |
|---|---|---|
| --frames_dir | output/frames/ | Input: frame directories |
| --output_dir | output/reviewed/ | Output: accepted QA JSON files |
| --num_workers | 3 | Parallel workers |
| --max_videos | 0 (all) | Cap for testing |
| --save_rejections | off | Save rejected samples |
| --save_generated | off | Save intermediate outputs |
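The resumable behaviour noted above can be sketched as follows. This is a minimal illustration and not the logic in pipeline.py: it assumes one frame directory per video under `--frames_dir` and one `<video_id>.json` file per finished video under `--output_dir`. The names `pending_videos`, `run_resumable`, and the `process` callback are hypothetical.

```python
import json
from pathlib import Path


def pending_videos(frames_dir, output_dir):
    """List video IDs that have a frame directory but no result file yet."""
    frames_dir, output_dir = Path(frames_dir), Path(output_dir)
    done = set()
    if output_dir.exists():
        done = {p.stem for p in output_dir.glob("*.json")}
    return sorted(
        d.name for d in frames_dir.iterdir() if d.is_dir() and d.name not in done
    )


def run_resumable(frames_dir, output_dir, process):
    """Process only unfinished videos; `process` returns a JSON-serializable dict."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for video_id in pending_videos(frames_dir, output_dir):
        result = process(Path(frames_dir) / video_id)
        # Write to a temporary file first so an interrupted run never leaves
        # a half-written result that would be mistaken for a finished video.
        tmp = output_dir / f"{video_id}.json.tmp"
        tmp.write_text(json.dumps(result))
        tmp.replace(output_dir / f"{video_id}.json")
```

Using the output files themselves as the progress record means no separate checkpoint file has to be kept in sync.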
Pipeline architecture (5 agents):
All frames
│
▼ Agent 1: SceneAnalyst (all frames, multimodal)
│ structured scene report + anchor frame selection
│
▼ Agent 2: QuestionDesigner (text only)
│ 6 question skeletons across 12 physical categories
│
▼ Agent 3: DistractorForge (text only)
│ 6 complete QA drafts with physically plausible distractors
│
▼ Agent 4: SmallModelProbe (anchor frame, small model, ×2 shuffled)
│ "too_easy" → discard | "hard_enough" → keep
│
▼ Agent 5: Reviewer (anchor + post-anchor context frames)
│ 5-dimension review; score ≥ 7 → accept
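The two filtering gates at the end of the pipeline (Agents 4 and 5) can be sketched as a small decision function. This is an illustration under stated assumptions, not the repository's code: it assumes an item is "too_easy" when the small model answers correctly in every shuffled run, and it treats the reviewer output as a single numeric score. The function names are hypothetical; the threshold of 7 comes from the diagram above.

```python
def probe_verdict(probe_correct):
    """Classify an item from the small-model probe.

    probe_correct: list of bools, one per shuffled run (two runs in the diagram).
    Assumption: the item is "too_easy" only if every run was answered correctly.
    """
    if probe_correct and all(probe_correct):
        return "too_easy"
    return "hard_enough"


def keep_item(probe_correct, review_score, threshold=7):
    """Keep an item only if it is hard enough and the review score meets the bar."""
    if probe_verdict(probe_correct) == "too_easy":
        return False
    return review_score >= threshold


# A question the probe got right only once, with a review score of 8.
decision = keep_item([True, False], 8)
```

Running the probe before the reviewer means the more expensive review call is only spent on items that survive the difficulty filter.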
Step 1d: Evaluate a model
python openworld_qa/evaluate.py \
--split test \
--model gpt-5.4 \
--num_workers 8
Part 2: VRQABench Dataset Construction
VRQABench tests controllable spatial lookahead from maze, irregular-maze, and Sokoban puzzle images. Labels are programmatically verified (BFS / geometric solver); a VLM writes only the question text.
💡 The prebuilt benchmark is available on Hugging Face: YCZhou/vrqa_bench. The steps below are only needed to reconstruct it from scratch.
Step 2a: Get VR-Bench data
Download VR-Bench and place it under vrqa_bench/raw_data/.
Step 2b: Run the pipeline
export OPENAI_API_KEY="your_key"
# Evaluation split (hard difficulty)
python vrqa_bench/pipeline.py --split eval --num_workers 4
# Training split (hard + medium + easy, with quotas)
python vrqa_bench/pipeline.py --split train --num_workers 4
Pipeline (4 steps):
- Programmatic Solver: BFS on maze, Sokoban search, geometric path analysis → verified answer + options
- QuestionWriter (VLM): writes natural-language question text only (no answer generation)
- SmallModelProbe: filters trivially easy items
- Reviewer (VLM): checks question text validity and distractor plausibility
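To illustrate the kind of programmatic verification the first step refers to, here is a minimal breadth-first search over a grid maze. It is a generic sketch, not the VRQABench solver: the text-grid encoding (`#` for walls), the 4-connected moves, and the function name are assumptions.

```python
from collections import deque


def bfs_shortest_path(grid, start, goal):
    """Return a shortest path as a list of (row, col) cells, or None if unreachable.

    grid: list of equal-length strings where '#' marks a wall (assumed encoding).
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}      # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if inside and grid[nxt[0]][nxt[1]] != "#" and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None


maze = ["S.#",
        ".##",
        "..G"]
path = bfs_shortest_path(maze, (0, 0), (2, 2))
```

Because BFS explores cells in order of distance, the first time the goal is dequeued the recovered path is guaranteed to be a shortest one, which is what makes the resulting label verifiable.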
Step 2c: Evaluate a model
python vrqa_bench/scripts/evaluate.py --model gpt-5.4 --num_workers 8
Step 2d: Shuffle options & pack training split
python vrqa_bench/scripts/shuffle_options.py # randomise A/B/C/D order
python vrqa_bench/scripts/pack_train.py # package into VRBench-Spatial-v2-train.tar.gz
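Randomising the option order only works if the answer key moves with the correct option. The sketch below shows one way to do that. It is not the code in shuffle_options.py, and the item layout (a dict from letter to option text plus an answer letter) is an assumption.

```python
import random

LETTERS = "ABCD"


def shuffle_options(options, answer, rng):
    """Shuffle option texts across A/B/C/D and return (new_options, new_answer).

    options: dict mapping each letter in LETTERS to its option text (assumed layout).
    answer:  the letter of the correct option before shuffling.
    rng:     a random.Random instance, so a fixed seed gives reproducible output.
    """
    order = list(range(len(LETTERS)))
    rng.shuffle(order)
    # New position i receives the text that used to sit at position order[i].
    new_options = {LETTERS[i]: options[LETTERS[j]] for i, j in enumerate(order)}
    # Track the correct option by its old index, not by its text, so that
    # duplicate option texts cannot confuse the answer key.
    new_answer = LETTERS[order.index(LETTERS.index(answer))]
    return new_options, new_answer


item = {"A": "left", "B": "right", "C": "up", "D": "down"}
shuffled, key = shuffle_options(item, "B", random.Random(0))
```

Shuffling guards against a model exploiting a positional bias in where the correct option tends to appear.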
Part 3: PF-OPSD Training
World Model: Helios
PF-OPSD uses Helios as the generative video world model.
Helios repository: https://github.com/helios-world-model/helios
Yuan et al., 2026, "Helios: A Video World Model for Physical Future Simulation"
| Variant | Used for |
|---|---|
| helios-vrbench | VRQABench experiments |
| helios-general | OpenWorldQA experiments |
Set up Helios and export:
export HELIOS_BASE_URL="http://your-helios-endpoint"
export HELIOS_API_KEY="your-helios-key"
For offline testing without Helios, use --world_model_type stub.
Step 3a: Generate Stage-1 Privileged Trajectories
python -m trajectory_gen.pipeline \
--benchmark openworldqa \
--world_model_type helios \
--teacher_model gemini-3.1-pro \
--num_workers 4 \
--output_dir trajectory_gen/output/trajectories
The teacher VLM observes the ground-truth future frames (v*) and the correct answer (y*), then generates labelled training trajectories of the form d_sim → p_sim → z_ver → z_rel → y.
| Argument | Default | Description |
|---|---|---|
| --benchmark | all | One of: openworldqa, vrqabench, all |
| --teacher_model | gemini-3.1-pro | Teacher VLM |
| --world_model_type | stub | One of: stub, helios |
| --max_samples | 0 (all) | Cap for quick tests |
| --dry_run | off | Print plan, no API calls |
Step 3b: Stage 1: Protocol SFT
bash train/scripts/run_sft.sh
The paper's hyperparameters are set in train/configs/sft.yaml.
Step 3c: Stage 2: PF-OPSD On-Policy Self-Distillation
Configure train/configs/pfopsd.yaml:
evaluator_model: "Qwen3.6-27B"
evaluator_base_url: "" # or set $OPENAI_API_BASE
world_model_type: "helios"
helios_base_url: "" # or set $HELIOS_BASE_URL
helios_model: "helios-vrbench"
Then run:
bash train/scripts/run_pfopsd.sh 4 # requires Stage 1 checkpoint
The paper's hyperparameters are set in train/configs/pfopsd.yaml.
Environment Variables
| Variable | Description | Default |
|---|---|---|
| OPENAI_API_KEY | API key for all VLM calls | none (required) |
| OPENAI_API_BASE | OpenAI-compatible endpoint | https://api.openai.com/v1 |
| HELIOS_BASE_URL | Helios world model endpoint | none |
| HELIOS_API_KEY | Helios auth token | none |
| VRB_DATA_ROOT | VRQABench data root | <repo>/vrqa_bench/raw_data |
| RAW_DATA_ROOT | OpenWorldQA raw video root | openworld_qa/raw_data |
| CONDA_ENV | Conda env for training scripts | (current env) |
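A quick way to check the environment before launching a long run is to resolve these variables in one place. The sketch below is an assumption about how a script might read them, not the repository's loader. It applies the defaults from the table and fails early when the required key is missing. VRB_DATA_ROOT and CONDA_ENV are left out because their defaults depend on the checkout location and the active shell.

```python
import os


def load_settings(env=None):
    """Resolve the documented environment variables with their table defaults."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        # The table marks this variable as required, so fail before any API call.
        raise RuntimeError("OPENAI_API_KEY is required")
    return {
        "openai_api_key": api_key,
        "openai_api_base": env.get("OPENAI_API_BASE", "https://api.openai.com/v1"),
        "helios_base_url": env.get("HELIOS_BASE_URL"),   # None if unset
        "helios_api_key": env.get("HELIOS_API_KEY"),     # None if unset
        "raw_data_root": env.get("RAW_DATA_ROOT", "openworld_qa/raw_data"),
    }


# Passing a plain dict instead of os.environ makes the function easy to test.
settings = load_settings({"OPENAI_API_KEY": "your_api_key"})
```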
Citation
@misc{zhou2026worldmodelsmeetlanguage,
title={World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning},
author={Yucheng Zhou and Wei Tao and Yiwen Guo and Jianbing Shen},
year={2026},
eprint={2606.03603},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2606.03603},
}
License
This project is licensed under the MIT License.