World Model Adaptation
May 28, 2025
DMLab and Minecraft datasets
In our paper, we use the datasets from TECO for the world model adaptation experiments. Downloading the TECO datasets may take a couple of days, so if you only need a few samples, you can try the mini sets from Diffusion Forcing instead.
Note that the two datasets index actions differently, which matters when processing the data. Extract the actions as follows: `npz_data["action"][:-1]` for DMLab and `npz_data["action"][1:]` for Minecraft.
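The slicing can be sketched as below. This is a minimal illustration, assuming each `.npz` file stores an `action` array alongside the frames; the helper name `extract_actions` and the toy array are hypothetical and not part of the repository.

```python
import io

import numpy as np


def extract_actions(npz_data, dataset):
    """Slice the raw action array following the convention stated above."""
    actions = npz_data["action"]
    if dataset == "dmlab":
        return actions[:-1]  # DMLab: drop the last entry
    if dataset == "minecraft":
        return actions[1:]  # Minecraft: drop the first entry
    raise ValueError(f"unknown dataset: {dataset}")


# Toy stand-in for a downloaded file: five action indices saved as .npz.
buffer = io.BytesIO()
np.savez(buffer, action=np.array([0, 1, 2, 1, 0]))
buffer.seek(0)
npz_data = np.load(buffer)

dmlab_actions = extract_actions(npz_data, "dmlab")
minecraft_actions = extract_actions(npz_data, "minecraft")
```

Either way, an array of N entries yields N - 1 actions; only the entry that is dropped differs between the two datasets.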
For environments with discrete action spaces
- Split the collected samples into short video clips of 7 (`n_context_frames + 1`) frames, and organize them into folders according to the action index of the last frame transition. Take Minecraft with 3 action options as an example:

      data/
      |--minecraft/
         |--action_0/
         |  |--00000.mp4
         |  |...
         |--action_1/
         |  |--00000.mp4
         |  |...
         |--action_2/
            |--00000.mp4
            |...

- Go through each action folder to infer the latent actions using the pretrained latent action encoder. This can be done by setting `batch_size` in `lam/config/lam.yaml` to 1 and running `lam/test.sh`.
- Uncomment the `on_test_epoch_end` function in `lam/lam/model.py` to save the inferred latent actions as `latent_action_stats.pt`.
- In `MultiSourceSamplerDataset`, replace `VideoDataset` with `VideoDatasetDiscreteActionSpace`. Check the input parameters and rename all paths if necessary.
- (Optional) Reset the learning rate of the pretrained weights by uncommenting the provided code under `configure_optimizers` in `worldmodel/vwm/models/diffusion.py`.
- Use the averaged latent actions as the action embeddings for the discrete action codebook of `ActionBook` in `worldmodel/vwm/modules/encoders/modules.py`. An example is provided in `__init__`.
- Run `worldmodel/run_adaptation_discrete.sh`.
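The clip splitting and the codebook averaging in the steps above can be sketched as follows. This is a rough illustration rather than the repository's code: the helper names `split_into_clips` and `average_latent_actions`, the sliding-window stride of one frame, and the assumption that `actions[t]` describes the transition from frame `t` to frame `t + 1` are all hypothetical. Latent actions are represented as plain NumPy arrays here, whereas the pipeline saves them in `latent_action_stats.pt`.

```python
from collections import defaultdict

import numpy as np

N_CONTEXT_FRAMES = 6  # clips hold n_context_frames + 1 = 7 frames


def split_into_clips(frames, actions, clip_len=N_CONTEXT_FRAMES + 1):
    """Group sliding-window clips by the action of their last transition.

    Assumes actions[t] is the action taken between frames[t] and
    frames[t + 1], so len(actions) == len(frames) - 1.
    """
    clips_by_action = defaultdict(list)
    for start in range(len(frames) - clip_len + 1):
        end = start + clip_len  # exclusive end of the clip
        last_action = int(actions[end - 2])  # transition into the last frame
        clips_by_action[last_action].append(frames[start:end])
    return clips_by_action


def average_latent_actions(latents_by_action):
    """Average the inferred latent actions of each folder into one embedding."""
    return {
        action: np.mean(np.stack(latents), axis=0)
        for action, latents in sorted(latents_by_action.items())
    }


# Toy example: 9 "frames" (just their indices) and 8 transitions.
frames = np.arange(9)
actions = np.array([0, 1, 2, 0, 1, 2, 0, 1])
clips = split_into_clips(frames, actions)

# Toy latent actions: two 4-dimensional vectors per action folder.
latents = {0: [np.zeros(4), np.ones(4)], 1: [np.ones(4), np.ones(4)]}
codebook = average_latent_actions(latents)
```

Each group in `clips` would then be written to its `action_<index>/` folder as numbered MP4 files, and each entry of `codebook` would serve as the embedding for the matching action index.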
For environments with continuous action spaces
- Split the collected samples into short video clips of 7 (`n_context_frames + 1`) frames, and save the action values of the last frame transition under the same file name. Take nuScenes with a two-dimensional action displacement as an example:

      data/
      |--nuscenes/
         |--00000.mp4
         |--00000.txt
         |--00001.mp4
         |--00001.txt
         |...

  The TXT files store a list that contains the displacement [x,y] of each transition.
- Go through all video clips to infer their latent actions using the pretrained latent action encoder. This can be done by setting `batch_size` in `lam/config/lam.yaml` to 1 and running `lam/test.sh`.
- Uncomment the `on_test_epoch_end` function in `lam/lam/model.py` to save the inferred latent actions as `latent_action_stats.pt`.
- In `MultiSourceSamplerDataset`, replace `VideoDataset` with `VideoDatasetContinuousActionSpace`. Check the input parameters and rename all paths if necessary.
- (Optional) Reset the learning rate of the pretrained weights by uncommenting the provided code under `configure_optimizers` in `worldmodel/vwm/models/diffusion.py`.
- Convert the ground truth of all actions to `raw_action_inputs.pt`, ensuring that its order matches the order of `latent_action_stats.pt`.
- Use `worldmodel/fast_init_mlp.py` to optimize the initialization weights `mlp_init_weights.pth` for `ActionMLP` in `worldmodel/vwm/modules/encoders/modules.py`. An example is provided in `__init__`.
- Run `worldmodel/run_adaptation_continuous.sh`.
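Keeping `raw_action_inputs.pt` aligned with `latent_action_stats.pt` comes down to reading the TXT files in the same order in which the clips were encoded. The sketch below assumes that this order is the sorted file-name order and that each TXT file holds a JSON-style list such as `[0.5, -0.1]`; both are assumptions, as is the helper name `collect_raw_actions`. It returns a NumPy array, which would still need to be converted to a tensor and saved with `torch.save`.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def collect_raw_actions(clip_dir):
    """Stack the [x, y] displacement of every clip, ordered by file name."""
    txt_files = sorted(Path(clip_dir).glob("*.txt"))
    rows = [json.loads(path.read_text()) for path in txt_files]
    return np.asarray(rows, dtype=np.float32)


# Toy directory that mimics data/nuscenes/ with the action files of two clips.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "00001.txt").write_text("[1.5, 0.25]")
    (Path(tmp) / "00000.txt").write_text("[0.5, -0.5]")
    raw_actions = collect_raw_actions(tmp)
```

If the latent actions were inferred in a different order (for example, a shuffled dataloader), the same ordering would have to be applied here instead of `sorted`.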
To visualize the UMAP projection of latent actions in our paper, please refer to UMAP and set n_neighbors and min_dist to 15 and 0.5, respectively.
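With the `umap-learn` package, those settings can be passed as in the sketch below. The function name `project_latent_actions` is hypothetical, and the import is kept inside the function so that the settings can be inspected without the package installed. All other UMAP parameters are left at their defaults, which may differ from what was used for the figure in the paper.

```python
UMAP_SETTINGS = {"n_neighbors": 15, "min_dist": 0.5}


def project_latent_actions(latent_actions):
    """Project an (n_samples, dim) array of latent actions to 2D with UMAP."""
    import umap  # provided by the umap-learn package

    reducer = umap.UMAP(**UMAP_SETTINGS)
    return reducer.fit_transform(latent_actions)
```

The returned array has one 2D point per latent action and can be passed to any scatter-plot routine.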
<= Previous: [Action Transfer]
=> Next: [Visual Planning]