World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

June 3, 2026


PF-OPSD: Privileged-Future On-Policy Self-Distillation for controlled concrete reasoning in MLLMs.

[Figure: Motivation, concrete vs. abstract reasoning]


Overview

World models and multimodal large language models (MLLMs) offer complementary capabilities for predicting future outcomes from static visual observations. World models generate concrete visual rollouts; MLLMs reason abstractly over goals, rules, and question semantics. However, rolling out a world model blindly does not help: generated rollouts are stochastic and may be visually plausible yet task-incorrect. The model must therefore learn when to invoke simulation, whether to trust a rollout, and how to integrate it with abstract reasoning.

[Figure: PF-OPSD full pipeline]

We call this problem controlled concrete reasoning and make three concrete contributions:

  1. VRQABench: a benchmark for controllable spatial lookahead from maze and Sokoban puzzle images (4,636 questions). 🤗 Dataset
  2. OpenWorldQA: a benchmark for open-domain physical future prediction from pre-event anchor frames in real-world videos (4,404 questions). 🤗 Dataset
  3. PF-OPSD: a training framework that uses ground-truth future videos as teacher-side privileged context, scores on-policy concrete-reasoning trajectories, and distills advantage-weighted targets back into a deployable student that has no access to true futures at test time.
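
To make "advantage-weighted targets" more concrete, here is a minimal sketch of one common way to turn per-trajectory scores into distillation weights: centre the scores on their mean and apply a softmax. This is an illustrative assumption, not the formula used in the paper; the function name `advantage_weights` and the `temperature` parameter are hypothetical.

```python
import math


def advantage_weights(scores, temperature=1.0):
    """Softmax over mean-centred scores: higher-scoring trajectories get more weight.

    Assumption: a generic advantage-weighting scheme, not the paper's exact rule.
    """
    baseline = sum(scores) / len(scores)
    advantages = [s - baseline for s in scores]
    peak = max(advantages)  # subtract the max for numerical stability
    exps = [math.exp((a - peak) / temperature) for a in advantages]
    total = sum(exps)
    return [e / total for e in exps]
```

With scores such as `[1.0, 0.0, 0.0]`, the first trajectory receives the largest weight and the other two share the remainder equally; a lower `temperature` concentrates more weight on the best trajectory.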

Installation

git clone https://github.com/yczhou001/PF-OPSD
cd PF-OPSD
pip install -r requirements.txt

Flash Attention 2 (recommended for training speed):

pip install flash-attn --no-build-isolation

Set required environment variables:

export OPENAI_API_KEY="your_api_key"
export OPENAI_API_BASE="https://api.openai.com/v1"  # or your endpoint
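
Before launching long jobs it can help to confirm these variables from Python. The sketch below is not part of the repository; `load_api_settings` is a hypothetical helper, and it assumes (following the Environment Variables table) that only OPENAI_API_KEY is strictly required while OPENAI_API_BASE falls back to the public endpoint.

```python
import os

# Names taken from the Environment Variables table of this README.
REQUIRED = ("OPENAI_API_KEY",)
DEFAULTS = {"OPENAI_API_BASE": "https://api.openai.com/v1"}


def load_api_settings(env=None):
    """Return the API settings, raising if a required variable is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        # Fall back to the documented default when the variable is unset or empty.
        settings[name] = env.get(name) or default
    return settings
```

Calling `load_api_settings()` with no argument reads the real process environment; passing a dictionary makes the check easy to exercise in isolation.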

Part 1: OpenWorldQA Dataset Construction

OpenWorldQA is a benchmark for predicting real-world physical futures from pre-event anchor frames. It is built with a five-stage agentic pipeline applied to videos from Charades, Something-Something V2, Oops, and CharadesEgo.

💡 The prebuilt benchmark is available on Hugging Face: YCZhou/openworld_qa. The steps below are only needed to reconstruct it from scratch.

Step 1a: Download raw videos

Download the following datasets and place them under openworld_qa/raw_data/ (or set RAW_DATA_ROOT):

| Dataset | Suggested path |
| --- | --- |
| Something-Something V2 | raw_data/sthv2/extracted/20bn-something-something-v2/ |
| Charades | raw_data/charades/Charades_v1_480/Charades_v1_480/ |
| Oops | raw_data/oops/oops_dataset/oops_video/train/ |
| CharadesEgo | raw_data/charades_ego/CharadesEgo_v1_480/ |
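
A quick layout check can catch a misplaced download before frame extraction starts. The following is a sketch under stated assumptions: `find_missing_datasets` is a hypothetical helper, and the sub-paths are the suggested paths above with the leading `raw_data/` taken to be the root that RAW_DATA_ROOT overrides.

```python
import os
from pathlib import Path

# Suggested sub-paths from the dataset table, relative to the raw-data root.
DATASET_PATHS = {
    "Something-Something V2": "sthv2/extracted/20bn-something-something-v2",
    "Charades": "charades/Charades_v1_480/Charades_v1_480",
    "Oops": "oops/oops_dataset/oops_video/train",
    "CharadesEgo": "charades_ego/CharadesEgo_v1_480",
}


def find_missing_datasets(raw_root=None):
    """Return the names of datasets whose directory is absent under raw_root."""
    # Assumption: RAW_DATA_ROOT replaces the default openworld_qa/raw_data root.
    root = Path(raw_root or os.environ.get("RAW_DATA_ROOT", "openworld_qa/raw_data"))
    return [name for name, sub in DATASET_PATHS.items() if not (root / sub).is_dir()]
```

An empty return value means all four suggested directories exist; otherwise the list names the datasets still to be downloaded or moved.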

Step 1b: Sample and extract frames

# Sample 5000 videos and extract frames (runs ffmpeg in parallel)
python openworld_qa/sample_and_extract.py \
    --total 5000 \
    --output_dir openworld_qa/output/frames \
    --num_workers 16

# Or extract frames from a specific video directory
python openworld_qa/extract_frames.py \
    --video_dir raw_data/charades/Charades_v1_480/Charades_v1_480/ \
    --output_dir openworld_qa/output/frames

Step 1c: Run the 5-agent pipeline

export OPENAI_API_KEY="your_key"

python openworld_qa/pipeline.py \
    --frames_dir openworld_qa/output/frames/ \
    --output_dir openworld_qa/output/reviewed/ \
    --num_workers 3 \
    --save_rejections

The pipeline is resumable: re-running skips already-processed videos.
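
Resumability of this kind is usually implemented by checking for an existing output before doing any work. The sketch below illustrates the idea only; the one-JSON-per-video naming scheme and the helper `pending_videos` are assumptions, and the real pipeline may track processed (including rejected) videos differently.

```python
from pathlib import Path


def pending_videos(frames_dir, output_dir):
    """List frame directories that have no output JSON in output_dir yet."""
    frames_dir, output_dir = Path(frames_dir), Path(output_dir)
    pending = []
    for video_dir in sorted(p for p in frames_dir.iterdir() if p.is_dir()):
        # Hypothetical naming: one <video_id>.json per processed video.
        if not (output_dir / f"{video_dir.name}.json").exists():
            pending.append(video_dir.name)
    return pending
```

Running the same loop twice is then idempotent: videos with an existing output file are filtered out on the second pass.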

| Argument | Default | Description |
| --- | --- | --- |
| --frames_dir | output/frames/ | Input: frame directories |
| --output_dir | output/reviewed/ | Output: accepted QA JSON files |
| --num_workers | 3 | Parallel workers |
| --max_videos | 0 (all) | Cap for testing |
| --save_rejections | off | Save rejected samples |
| --save_generated | off | Save intermediate outputs |

Pipeline architecture (5 agents):

All frames
    │
    ▼  Agent 1: SceneAnalyst  (all frames, multimodal)
       → structured scene report + anchor frame selection
    │
    ▼  Agent 2: QuestionDesigner  (text only)
       → 6 question skeletons across 12 physical categories
    │
    ▼  Agent 3: DistractorForge  (text only)
       → 6 complete QA drafts with physically plausible distractors
    │
    ▼  Agent 4: SmallModelProbe  (anchor frame, small model, ×2 shuffled)
       → "too_easy" → discard | "hard_enough" → keep
    │
    ▼  Agent 5: Reviewer  (anchor + post-anchor context frames)
       → 5-dimension review; score ≥ 7 → accept
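
The two filtering gates at the end of the pipeline can be sketched as follows. This is an illustration, not the repository's code: it assumes four options labelled A to D, that an item counts as "too_easy" only when the small model is correct in both shuffled rounds, and that `small_model` is any callable mapping a list of option strings to a letter. Only the score threshold of 7 comes from the pipeline description.

```python
import random

LETTERS = "ABCD"  # assumption: four answer options per question


def probe_verdict(options, correct_index, small_model, rounds=2, seed=0):
    """Label an item "too_easy" if the small model is right in every shuffled round."""
    rng = random.Random(seed)
    for _ in range(rounds):
        order = list(range(len(options)))
        rng.shuffle(order)
        shuffled = [options[i] for i in order]
        predicted = small_model(shuffled)  # expected to return a letter such as "B"
        correct_letter = LETTERS[order.index(correct_index)]
        if predicted != correct_letter:
            return "hard_enough"
    return "too_easy"


def review_accepts(score, threshold=7):
    """Reviewer gate: accept when the review score is at least the threshold."""
    return score >= threshold
```

Shuffling between rounds guards against a small model that succeeds only through a positional bias, for example by always answering "A".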

Step 1d: Evaluate a model

python openworld_qa/evaluate.py \
    --split test \
    --model gpt-5.4 \
    --num_workers 8

Part 2: VRQABench Dataset Construction

VRQABench tests controllable spatial lookahead from maze, irregular-maze, and Sokoban puzzle images. Labels are programmatically verified (BFS / geometric solver), and a VLM only writes question text.

💡 The prebuilt benchmark is available on Hugging Face: YCZhou/vrqa_bench. The steps below are only needed to reconstruct it from scratch.

Step 2a: Get VR-Bench data

Download VR-Bench and place under vrqa_bench/raw_data/.

Step 2b: Run the pipeline

export OPENAI_API_KEY="your_key"

# Evaluation split (hard difficulty)
python vrqa_bench/pipeline.py --split eval --num_workers 4

# Training split (hard + medium + easy, with quotas)
python vrqa_bench/pipeline.py --split train --num_workers 4

Pipeline (4 steps):

  1. Programmatic Solver: BFS on maze, Sokoban search, geometric path analysis → verified answer + options
  2. QuestionWriter (VLM): writes natural-language question text only (no answer generation)
  3. SmallModelProbe: filters trivially easy items
  4. Reviewer (VLM): checks question text validity and distractor plausibility
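
For readers unfamiliar with how a programmatic solver can verify a maze label, here is a minimal breadth-first search on a grid. It is a generic textbook sketch, not the VRQABench solver: the grid encoding (strings with `#` for walls), the 4-connected moves, and the function name `shortest_path_length` are all assumptions made for illustration.

```python
from collections import deque


def shortest_path_length(grid, start, goal):
    """Breadth-first search on a 4-connected grid; '#' marks a wall.

    Returns the number of moves on a shortest path, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None
```

Because BFS explores cells in order of distance, the first time the goal is dequeued its distance is guaranteed to be minimal, which is what makes the resulting label verifiable without a VLM.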

Step 2c: Evaluate a model

python vrqa_bench/scripts/evaluate.py --model gpt-5.4 --num_workers 8

Step 2d: Shuffle options & pack training split

python vrqa_bench/scripts/shuffle_options.py    # randomise A/B/C/D order
python vrqa_bench/scripts/pack_train.py         # package into VRBench-Spatial-v2-train.tar.gz
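
Randomising the A/B/C/D order requires remapping the answer letter along with the options. The sketch below shows that bookkeeping; it is not the contents of `shuffle_options.py`, and the function signature and the list-plus-letter item format are assumptions.

```python
import random

LETTERS = "ABCD"


def shuffle_options(options, answer, seed=None):
    """Shuffle answer options and return (new_options, new_answer_letter)."""
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    # new_options[j] is the option that used to sit at position order[j].
    new_options = [options[i] for i in order]
    new_answer = LETTERS[order.index(LETTERS.index(answer))]
    return new_options, new_answer
```

Passing a fixed `seed` makes the shuffle reproducible, which is useful when the packed training split needs to be regenerated identically.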

Part 3: PF-OPSD Training

World Model: Helios

PF-OPSD uses Helios as the generative video world model.

Helios repository: https://github.com/helios-world-model/helios

Yuan et al., 2026, "Helios: A Video World Model for Physical Future Simulation"

| Variant | Used for |
| --- | --- |
| helios-vrbench | VRQABench experiments |
| helios-general | OpenWorldQA experiments |

Set up Helios and export:

export HELIOS_BASE_URL="http://your-helios-endpoint"
export HELIOS_API_KEY="your-helios-key"

For offline testing without Helios, use --world_model_type stub.

Step 3a: Generate Stage-1 Privileged Trajectories

python -m trajectory_gen.pipeline \
    --benchmark openworldqa \
    --world_model_type helios \
    --teacher_model gemini-3.1-pro \
    --num_workers 4 \
    --output_dir trajectory_gen/output/trajectories

The teacher VLM observes the ground-truth future frames (v*) and the correct answer (y*), then generates labelled training trajectories of the form d_sim → p_sim → z_ver → z_rel → y.

| Argument | Default | Description |
| --- | --- | --- |
| --benchmark | all | One of openworldqa, vrqabench, all |
| --teacher_model | gemini-3.1-pro | Teacher VLM |
| --world_model_type | stub | One of stub, helios |
| --max_samples | 0 (all) | Cap for quick tests |
| --dry_run | off | Print plan, no API calls |

Step 3b: Stage 1: Protocol SFT

bash train/scripts/run_sft.sh

Paper hyperparameters: see train/configs/sft.yaml.

Step 3c: Stage 2: PF-OPSD On-Policy Self-Distillation

Configure train/configs/pfopsd.yaml:

evaluator_model:   "Qwen3.6-27B"
evaluator_base_url: ""          # or set $OPENAI_API_BASE
world_model_type:  "helios"
helios_base_url:   ""           # or set $HELIOS_BASE_URL
helios_model:      "helios-vrbench"
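
The empty strings in this config are meant to fall back to environment variables. A minimal sketch of that precedence rule is shown below; `resolve_setting` is a hypothetical helper and the training scripts may resolve these values differently.

```python
import os


def resolve_setting(config, key, env_name, environ=None):
    """Prefer a non-empty config value; otherwise fall back to the environment."""
    environ = os.environ if environ is None else environ
    value = config.get(key) or environ.get(env_name)
    if not value:
        raise ValueError(f"Set '{key}' in the config or export {env_name}")
    return value
```

For example, `resolve_setting(cfg, "helios_base_url", "HELIOS_BASE_URL")` returns the YAML value when it is non-empty and the exported HELIOS_BASE_URL otherwise, failing early if neither is set.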

Then run:

bash train/scripts/run_pfopsd.sh 4   # requires Stage 1 checkpoint

Paper hyperparameters: see train/configs/pfopsd.yaml.


Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| OPENAI_API_KEY | API key for all VLM calls | (required) |
| OPENAI_API_BASE | OpenAI-compatible endpoint | https://api.openai.com/v1 |
| HELIOS_BASE_URL | Helios world model endpoint | (none) |
| HELIOS_API_KEY | Helios auth token | (none) |
| VRB_DATA_ROOT | VRQABench data root | <repo>/vrqa_bench/raw_data |
| RAW_DATA_ROOT | OpenWorldQA raw video root | openworld_qa/raw_data |
| CONDA_ENV | Conda env for training scripts | (current env) |

Citation

@misc{zhou2026worldmodelsmeetlanguage,
      title={World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning}, 
      author={Yucheng Zhou and Wei Tao and Yiwen Guo and Jianbing Shen},
      year={2026},
      eprint={2606.03603},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.03603}, 
}

License

This project is licensed under the MIT License.