
May 16, 2026


Towards Unified World Models for Visual Navigation via Memory-Augmented Planning and Foresight

Yifei Dong1,*, Fengyi Wu1,*, Guangyu Chen1,*, Lingdong Kong2, Xu Zhu1, Qiyu Hu1, Yuxuan Zhou1, Jingdong Sun3, Jun-Yan He1, Qi Dai4, Alexander G. Hauptmann5, Zhi-Qi Cheng1,†
1UW, 2NUS, 3Apple, 4Microsoft Research, 5CMU


UniWM introduces a unified, memory-augmented world model paradigm that integrates egocentric visual foresight and planning within a single multimodal autoregressive backbone. Unlike modular frameworks, UniWM explicitly grounds action decisions in visually imagined outcomes, ensuring tight alignment between visualization and planning. A hierarchical memory mechanism further integrates detailed short-term perceptual cues with longer-term trajectory context, enabling stable, coherent reasoning over extended horizons.

You are also welcome to explore our previous work: GOViG, which introduces a new task that leverages multimodal LLM reasoning to generate navigation instructions directly from egocentric visual observations of the initial and goal states, and HA-VLN, where we introduce HA-VLN 2.0, a unified benchmark coupling discrete (DE) and continuous (CE) navigation paradigms with explicit social-awareness constraints.

Quick Start

conda create -n uniwm python=3.10 -y
conda activate uniwm
bash install.sh

Implementation

Data

We host the UniWM dataset on Hugging Face: fly1113/UniWM_Dataset.

To download and extract all splits into data/ with a single command:

bash download_data.sh
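
If you prefer to fetch the files from Python, the following is a minimal sketch using `huggingface_hub.snapshot_download`. It assumes the `huggingface_hub` package is installed and that the repository is hosted as a dataset repo. `download_uniwm` is a hypothetical helper name, not part of this repository, and the sketch only downloads the raw files; it does not reproduce whatever extraction step `download_data.sh` performs.

```python
from pathlib import Path

REPO_ID = "fly1113/UniWM_Dataset"  # dataset repo named in this README


def download_uniwm(local_dir: str = "data") -> Path:
    """Download the raw dataset files into local_dir (no extraction)."""
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",  # assumption: hosted as a dataset repo
        local_dir=local_dir,
    )
    return Path(path)
```

Calling `download_uniwm("data")` returns the local folder containing the downloaded files; any archives still need to be unpacked afterwards.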

After extraction, the directory structure will look like:

data/
├── go_stanford/
│   ├── traj_0000/
│   │   ├── 0.jpg
│   │   ├── 1.jpg
│   │   ├── ...
│   │   ├── n.jpg
│   │   └── traj_data.pkl
│   ├── traj_0001/
│   └── ...
└── ...

Each traj_xxxx/ folder contains a sequence of egocentric frames (0.jpg, 1.jpg, ..., n.jpg) and a traj_data.pkl file storing the per-step metadata (e.g., actions, poses) for that trajectory. The other splits follow the same layout.
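
As an illustration of this layout, the sketch below reads one trajectory folder using only the Python standard library. `load_trajectory` is a hypothetical helper, not a function from this repository, and the schema of `traj_data.pkl` is not documented here, so the metadata is returned as-is for inspection.

```python
import pickle
from pathlib import Path


def load_trajectory(traj_dir):
    """Return (frame_paths, metadata) for one traj_xxxx/ folder."""
    traj_dir = Path(traj_dir)
    # Sort numerically so that 10.jpg comes after 9.jpg, not after 1.jpg.
    frames = sorted(
        (p for p in traj_dir.glob("*.jpg") if p.stem.isdigit()),
        key=lambda p: int(p.stem),
    )
    # Only unpickle files you trust: pickle can execute arbitrary code.
    with open(traj_dir / "traj_data.pkl", "rb") as f:
        metadata = pickle.load(f)
    return frames, metadata
```

For example, `frames, meta = load_trajectory("data/go_stanford/traj_0000")` gives the ordered frame paths; checking `type(meta)` (and its keys, if it is a dict) is a reasonable first step before relying on any particular field.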

Training

To train the model on multiple datasets, use the following torchrun command. This script supports multi-GPU distributed training (we provide an example in train.sh).

torchrun --nproc_per_node={GPU_NUM_PER_NODE} train.py \
    --model anole \
    --data go_stanford,scand,sacson,recon \
    --data_dir ./data \
    --decoder_type anole \
    --image_seq_length 784 \
    --input_format anole \
    --output /path/to/save/output \
    --note {experiment_note} \
    --report_to none \
    --do_train \
    --bfloat16
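
When launching several runs, it can help to assemble the command above programmatically. The sketch below is one way to do that; `build_train_command` and the example output path are hypothetical, and only flags that appear in the command above are used.

```python
import shlex


def build_train_command(gpus, datasets, output, note, data_dir="./data"):
    """Assemble the torchrun argument list shown above."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    if not datasets:
        raise ValueError("at least one dataset name is required")
    return [
        "torchrun", f"--nproc_per_node={gpus}", "train.py",
        "--model", "anole",
        "--data", ",".join(datasets),  # comma-separated, no spaces
        "--data_dir", data_dir,
        "--decoder_type", "anole",
        "--image_seq_length", "784",
        "--input_format", "anole",
        "--output", output,
        "--note", note,
        "--report_to", "none",
        "--do_train",
        "--bfloat16",
    ]


# Hypothetical run: 2 GPUs, two of the dataset splits listed above.
cmd = build_train_command(2, ["go_stanford", "recon"], "./runs/demo", "demo")
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues in paths and notes.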

Evaluation

To evaluate a trained model, use the command below. The script supports several evaluation modes, which can be selected by using the appropriate flag (we provide an example in eval.sh).

# Choose ONE evaluation flag for the last line of the command:
#   --do_single_step_eval, --do_task_level_eval, or --do_rollout_eval
# Optional: append --use_memory_bank_inference as an additional line.
torchrun --nproc_per_node={GPU_NUM_PER_NODE} train.py \
    --model anole \
    --model_ckpt /path/to/your/checkpoint \
    --data go_stanford,scand,sacson,recon \
    --data_dir ./data \
    --decoder_type anole \
    --image_seq_length 784 \
    --input_format anole \
    --output /path/to/save/eval_results \
    --note {experiment_note} \
    --report_to none \
    --do_single_step_eval

Evaluation Flags (choose one):

--do_single_step_eval: Evaluates the model's performance on a single step of prediction.

--do_task_level_eval: Evaluates the model on the full end-to-end task across an entire trajectory. You can optionally enable the memory bank mechanism by adding the --use_memory_bank_inference flag to the command. If this flag is omitted, the evaluation runs with the memory bank disabled.

--do_rollout_eval: Generates a full trajectory autoregressively (i.e., at each step the model takes its own previous predictions, together with the ground-truth actions, as input for the next step) and evaluates the result.
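
To make the difference between single-step and rollout evaluation concrete, here is a toy sketch on scalar "observations". It is not UniWM's evaluation code: `predict_next` is a hypothetical stand-in for the model (the true dynamics plus a small constant bias), and the absolute error is used only for illustration.

```python
def predict_next(obs, action):
    """Hypothetical stand-in for the model: true dynamics plus a small bias."""
    return obs + action + 0.1


def single_step_errors(observations, actions):
    """Predict each step from the ground-truth observation at that step."""
    return [
        abs(predict_next(observations[t], actions[t]) - observations[t + 1])
        for t in range(len(actions))
    ]


def rollout_errors(observations, actions):
    """Feed each prediction back in as the next input (ground-truth actions)."""
    errors, current = [], observations[0]
    for t, action in enumerate(actions):
        current = predict_next(current, action)
        errors.append(abs(current - observations[t + 1]))
    return errors


actions = [1.0, 1.0, 1.0]
observations = [0.0, 1.0, 2.0, 3.0]  # ground truth: obs[t+1] = obs[t] + action[t]
print(single_step_errors(observations, actions))
print(rollout_errors(observations, actions))
```

In this toy setting the single-step error stays near 0.1 at every step, while the rollout error grows with each step because earlier prediction errors are carried forward. This is why rollout results are typically a stricter test than single-step results.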

Contributing

We welcome contributions to this project! Please contact yfeidong@uw.edu or fyiwu@uw.edu.

Acknowledgement

We would like to thank ANOLE and MVOT for their publicly available codebases, which we referenced while implementing Anole training.

🌟 Citation

If you find this repository or our paper useful, please consider starring this repository and citing our paper:

@misc{dong2026unifiedworldmodelsvisual,
      title={Towards Unified World Models for Visual Navigation via Memory-Augmented Planning and Foresight}, 
      author={Yifei Dong and Fengyi Wu and Guangyu Chen and Lingdong Kong and Xu Zhu and Qiyu Hu and Yuxuan Zhou and Jingdong Sun and Jun-Yan He and Qi Dai and Alexander G. Hauptmann and Zhi-Qi Cheng},
      year={2026},
      eprint={2510.08713},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.08713}, 
}