World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning

July 22, 2025 ยท View on GitHub

๐Ÿ“– arXiv | ๐Ÿค— Paper | ๐Ÿค— Dataset | GitHub | ๐Ÿ“ฃ Twitter/X

This repository contains the code and data for our paper: World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning.

Recent advances in large vision-language models (LVLMs) have shown promise for embodied task planning, yet they struggle with fundamental challenges like dependency constraints and efficiency. Existing approaches either solely optimize action selection or leverage world models during inference, overlooking the benefits of learning to model the world as a way to enhance planning capabilities. We propose Dual Preference Optimization (DยฒPO), a new learning framework that jointly optimizes state prediction and action selection through preference learning, enabling LVLMs to understand environment dynamics for better planning. To automatically collect trajectories and stepwise preference data without human annotation, we introduce a tree search mechanism for extensive exploration via trial-and-error. Extensive experiments on VoTa-Bench demonstrate that our D^2PO-based method significantly outperforms existing methods and GPT-4o when applied to Qwen2-VL (7B), LLaVA-1.6 (7B), and LLaMA-3.2 (11B), achieving superior task success rates with more efficient execution paths.


๐ŸŽ‰ News

[2025-05-16] Our paper is accepted by ACL 2025 (main)!

[2025-03-26] Our paper is accepted by ICLR 2025 Workshop on World Models!

๐Ÿค— D2PO Dataset

The D2PO dataset contains various data splits for alignment training, including supervised fine-tuning and direct preference optimization.

Split NameDescriptionSize
๐Ÿค— SFT_PolicySFT data for action selection4.5k
๐Ÿค— DPO_PolicyDPO data for action selection15k
๐Ÿค— DPO_WorldDPO data for state prediction8.7k

๐Ÿš€ Install

  1. Clone the whole repo.

    $ git clone {repo_url}
    
  2. Setup a virtual environment.

    $ conda create -n vota python=3.8
    $ conda activate vota
    
  3. Install PyTorch (2.0.0) first (see https://pytorch.org/get-started/locally/).

    # exemplary install command for PyTorch 2.0.0 with CUDA 11.7
    $ pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 --index-url https://download.pytorch.org/whl/cu117
    
  4. Install python packages in requirements.txt.

    $ pip install -r requirements.txt
    

๐Ÿ“Š Benchmarking on VoTA-Bench

๐Ÿ“ฆ Download ALFRED dataset.

$ cd alfred/data
$ sh download_data.sh json

๐Ÿ–ฅ๏ธ Running on Headless Server

If running the ALFRED experiments on a headless server, start the X display. Below script uses 1 for the X_DISPLAY id, but you can use different ids such as 0.

$ sudo python3 alfred/scripts/startx.py 1

Alternatively, you can use Xvfb:

$ Xvfb :1

๐Ÿค– Model Server

Both vllm and sglang are supported as model servers.

Example: Start a vllm server for Qwen2-VL-7B-Instruct

python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-VL-7B-Instruct --model Qwen/Qwen2-VL-7B-Instruct --port 30000

๐Ÿ“ Running Evaluation

$ python src/evaluate2.py --config-name=config_alfred

We use Hydra for configuration management. You can override settings in ./conf/config_alfred.yaml or via the command line.

Notes:

  • model_name and base_url must match your chosen model server.
  • api_key is required for OpenAI models like GPT-4o.
  • icl: (True/False) enable or disable example usage.
  • sft: (True/False) set to True for SFT-style prompts.
  • eval_set: choose 'valid_seen' or 'valid_unseen'.
  • eval_start_index & eval_end_index: control the evaluation data range.

๐ŸŒฒ Data Exploration

  1. First, set the api_key and base_url in ./src/task_planner.py (lines 17โ€“19). You can specify different models for different modules as needed.

  2. Run the scripts/run_{task_type}.sh script to generate data in parallel using multiple GPUs. This script launches multiple processes to execute src/evaluate3.py, which collects data through a tree search mechanism. You can control task parallelism and index assignment within the shell script using the following parameters:

  • BASE_START_INDEX=: starting index
  • NODE_INCREMENT=50: increment per node
  • INCREMENT=10: number of tasks per process
  • NUM_TASKS=5: number of parallel processes to launch
  1. Process the generated data as required.

๐Ÿ“ TODO

  • Open source evaluation data and scripts (See section: ๐Ÿ“Š Benchmarking on VoTA-Bench)
  • Release data collection scripts and training data

๐Ÿ‘‹ Citation

BibTeX:

@article{wang2025world,
  title={World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning},
  author={Siyin Wang and Zhaoye Fei and Qinyuan Cheng and Shiduo Zhang and Panpan Cai and Jinlan Fu and Xipeng Qiu},
  journal={arXiv preprint arXiv:2503.10480},
  year={2025}
}