README.md
February 21, 2026
WorldMem: Long-term Consistent World Simulation with Memory
Zeqi Xiao1
Yushi Lan1
Yifan Zhou1
Wenqi Ouyang1
Shuai Yang2
Yanhong Zeng3
Xingang Pan1
1S-Lab, Nanyang Technological University,
2Wangxuan Institute of Computer Technology, Peking University,
3Shanghai AI Laboratory
https://github.com/user-attachments/assets/fb8a32e2-9470-4819-a93d-c38caf76d72c
Installation
conda create python=3.10 -n worldmem
conda activate worldmem
pip install -r requirements.txt
conda install -c conda-forge ffmpeg=4.3.2
Quick start
python app.py
Run
To enable cloud logging with Weights & Biases (wandb), follow these steps:
- Sign up for a wandb account.
- Run the following command to log in:
wandb login
- Open configurations/training.yaml and set the entity and project fields to your wandb username.
Training
Download pretrained weights from Oasis.
Trained on 4 H100 GPUs, the model converges after approximately 500K steps. We observe that gradually increasing task difficulty improves performance, so we adopt a multi-stage training strategy:
sh train_stage_1.sh # Small range, no vertical turning
sh train_stage_2.sh # Large range, no vertical turning
sh train_stage_3.sh # Large range, with vertical turning
To resume training from a previous checkpoint, configure the resume and output_dir variables in the corresponding .sh script.
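The three stages above can also be launched back to back. The following is a minimal sketch, not part of the repository: the helper names `stage_commands` and `run_stages` are hypothetical, and it assumes each script has already been edited as needed, since it does not set `resume` or `output_dir` itself.

```python
import subprocess

# Stage scripts in the order listed above (small range first, vertical turning last).
STAGE_SCRIPTS = ["train_stage_1.sh", "train_stage_2.sh", "train_stage_3.sh"]


def stage_commands(scripts=STAGE_SCRIPTS):
    """Return one `sh <script>` command per stage, in curriculum order."""
    return [["sh", script] for script in scripts]


def run_stages(dry_run=True):
    """Run each stage in turn and stop at the first failure.

    With dry_run=True (the default) the commands are only printed.
    """
    for cmd in stage_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError when a stage exits non-zero,
            # so later stages never start after a failed run.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_stages(dry_run=True)
```

Calling `run_stages(dry_run=False)` from the repository root would execute the scripts for real; the dry run is the default so the order can be checked first.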
Inference
To run inference:
sh infer.sh
You can either load the diffusion model and VAE separately:
+diffusion_model_path=zeqixiao/worldmem_checkpoints/diffusion_only.ckpt \
+vae_path=zeqixiao/worldmem_checkpoints/vae_only.ckpt \
+customized_load=true \
+seperate_load=true \
Or load a combined checkpoint:
+load=your_model_path \
+customized_load=true \
+seperate_load=false \
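The two loading modes are mutually exclusive, which a small helper can make explicit. This is an illustrative sketch only: `load_overrides` is a hypothetical function, not something the repository provides, and it simply assembles the override strings shown above so they can be appended to the inference command.

```python
def load_overrides(diffusion_model_path=None, vae_path=None, combined_path=None):
    """Return the override strings for exactly one of the two loading modes."""
    wants_separate = diffusion_model_path is not None or vae_path is not None
    if wants_separate and combined_path is not None:
        raise ValueError("choose separate checkpoints or a combined one, not both")
    if wants_separate:
        if diffusion_model_path is None or vae_path is None:
            raise ValueError("separate loading needs both diffusion and VAE paths")
        return [
            f"+diffusion_model_path={diffusion_model_path}",
            f"+vae_path={vae_path}",
            "+customized_load=true",
            "+seperate_load=true",  # spelling follows the option name used above
        ]
    if combined_path is None:
        raise ValueError("no checkpoint path given")
    return [
        f"+load={combined_path}",
        "+customized_load=true",
        "+seperate_load=false",
    ]


if __name__ == "__main__":
    print(" ".join(load_overrides(combined_path="your_model_path")))
```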
Evaluation
To run evaluation:
sh evaluate.sh
This script reproduces the results in Table 1 (beyond context window) and reports PSNR and LPIPS. Evaluating 1 case on 1 A100 GPU takes approximately 6 minutes. You can adjust experiment.test.limit_batch to set the number of cases to evaluate.
Visual results will be saved by default to a timestamped directory (e.g., outputs/2025-11-30/00-02-42).
To calculate the FID score, run:
python calculate_fid.py --videos_dir <path_to_videos>
For example:
python calculate_fid.py --videos_dir outputs/2025-11-30/00-02-42/videos/test_vis
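Because the output directory is timestamped, it can be convenient to look up the most recent run rather than typing the path. The sketch below is an assumption-laden helper, not part of the repository: `latest_run_dir` and `fid_command` are hypothetical names, and it assumes every directory two levels under `outputs/` follows the `<date>/<time>` pattern shown above, with videos under `videos/test_vis`.

```python
from pathlib import Path


def latest_run_dir(outputs_root="outputs"):
    """Return the newest <outputs_root>/<YYYY-MM-DD>/<HH-MM-SS> directory, or None."""
    root = Path(outputs_root)
    if not root.is_dir():
        return None
    # Zero-padded date and time names sort chronologically as plain path parts.
    runs = sorted(p for p in root.glob("*/*") if p.is_dir())
    return runs[-1] if runs else None


def fid_command(run_dir):
    """Build the calculate_fid.py command for a given run directory."""
    videos_dir = Path(run_dir) / "videos" / "test_vis"
    return ["python", "calculate_fid.py", "--videos_dir", str(videos_dir)]


if __name__ == "__main__":
    run = latest_run_dir()
    if run is None:
        print("no runs found under outputs/")
    else:
        print(" ".join(fid_command(run)))
```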
Expected Results:
| Metric | Value |
|---|---|
| PSNR | 24.01 |
| LPIPS | 0.1667 |
| FID | 15.13 |
Note: FID is computed over 5000 frames.
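For readers unfamiliar with the PSNR metric, here is a minimal sketch of its standard definition, 10 * log10(MAX^2 / MSE). It is not the evaluation script's own implementation, which is not shown here. It assumes pixel values scaled to [0, 1] and frames flattened into plain sequences; the function name `psnr` and the `max_value` parameter are illustrative.

```python
import math


def psnr(reference, generated, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) == 0 or len(reference) != len(generated):
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    ref = [0.0, 0.5, 1.0, 0.25]
    gen = [0.1, 0.6, 0.9, 0.35]  # every pixel is off by 0.1
    print(f"{psnr(ref, gen):.2f} dB")
```

In the usage example every pixel differs by 0.1, so the MSE is 0.01 and the PSNR is about 20 dB; higher values mean the generated frame is closer to the reference.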
Dataset
Download the Minecraft dataset from Hugging Face.
Place the dataset in the following directory structure:
data/
└── minecraft/
    ├── training/
    ├── validation/
    └── test/
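A quick check that the dataset landed in the right place can save a failed training launch. The sketch below is hypothetical and not part of the repository: `missing_splits` is an illustrative name, and it assumes the three split folders sit directly under `data/minecraft`.

```python
from pathlib import Path

# Split folders named in the directory structure above.
EXPECTED_SPLITS = ("training", "validation", "test")


def missing_splits(data_root="data"):
    """Return the expected split directories that are absent under <data_root>/minecraft."""
    base = Path(data_root) / "minecraft"
    return [name for name in EXPECTED_SPLITS if not (base / name).is_dir()]


if __name__ == "__main__":
    missing = missing_splits()
    if missing:
        print("missing split directories:", ", ".join(missing))
    else:
        print("dataset layout looks complete")
```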
Data Generation
After setting up the environment as described in MineDojo's GitHub repository, you can generate data using the following command:
xvfb-run -a python data_generator.py -o data/test -z 4 --env_type plains
Parameters:
- -o: Output directory for generated data
- -z: Number of parallel workers
- --env_type: Environment type (e.g., plains)
TODO
- Release inference models and weights;
- Release training pipeline on Minecraft;
- Release training data on Minecraft;
- Release evaluation scripts and data generator.
Citation
If you find our work helpful, please cite:
@inproceedings{xiaoworldmem,
title={WorldMem: Long-term Consistent World Simulation with Memory},
author={Xiao, Zeqi and Lan, Yushi and Zhou, Yifan and Ouyang, Wenqi and Yang, Shuai and Zeng, Yanhong and Pan, Xingang},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems}
}
Acknowledgements
- Diffusion Forcing: Diffusion Forcing provides flexible training and inference strategies for our method.
- MineDojo: We collect our Minecraft dataset with MineDojo.
- Open-oasis: Our model architecture is based on Open-oasis. We also use its pretrained VAE and DiT weights.