ACWM-Phys
May 19, 2026
ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models
Project Page · arXiv · Dataset · Checkpoints
Haotian Xue†, Yipu Chen*, Liqian Ma*, Zelin Zhao, Lama Moukheiber, Yuchen Zhu, Yongxin Chen
Georgia Institute of Technology (†project lead, *equal contribution)

Overview
ACWM-Phys is a benchmark for evaluating action-conditioned video world models under diverse physical dynamics. It spans 8 environments across 4 physics regimes:
| Category | Environments |
|---|---|
| Rigid-Body | Push Cube, Stack Cube |
| Deformable | Push Rope, Cloth Move |
| Particle | Push Sand, Pour Water |
| Kinematics | Robot Arm, Reacher |
Each environment provides 1,000 training trajectories plus controlled in-distribution (InD) and out-of-distribution (OoD) test splits. We also provide ACWM-DiT, a latent diffusion transformer baseline trained with flow matching.
Installation
We use uv for fast, reproducible environment management.
git clone https://github.com/xavihart/ACWM-Phys.git
cd ACWM-Phys
# Create and activate a virtual environment
uv venv --python 3.10
source .venv/bin/activate
# Install dependencies
uv pip install -r requirements.txt
# Flash Attention (recommended for speed)
uv pip install flash-attn --no-build-isolation
Dataset
Download the ACWM-Phys dataset from HuggingFace:
huggingface-cli download t1an/ACWM-Phys --repo-type dataset --local-dir ./data
Then set the data root:
export ACWM_DATA_ROOT=./data
Expected structure:
data/
├── rigid_dynamics/
│   ├── push_block/ {ind_train, ind_test, ood_test}/
│   └── stack_cube/
├── deformable/
│   ├── push_rope/
│   └── clothmove/
├── particle/
│   ├── push_sand/
│   └── pour_water/
└── kinematics/
    ├── robot_arm_64/
    └── reacher/
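
Before training or evaluating, it can help to confirm that the download matches the layout above. The sketch below is a hypothetical helper, not part of the repository. It assumes every environment directory contains the three split folders that the tree spells out only for `push_block`.

```python
import os
from pathlib import Path

# Layout copied from the tree above. The split names are shown only for
# push_block, so applying them to every environment is an assumption.
EXPECTED_ENVS = {
    "rigid_dynamics": ["push_block", "stack_cube"],
    "deformable": ["push_rope", "clothmove"],
    "particle": ["push_sand", "pour_water"],
    "kinematics": ["robot_arm_64", "reacher"],
}
SPLITS = ["ind_train", "ind_test", "ood_test"]


def find_missing_splits(data_root):
    """Return the expected split directories that are absent under data_root."""
    root = Path(data_root)
    missing = []
    for category, envs in EXPECTED_ENVS.items():
        for env in envs:
            for split in SPLITS:
                split_dir = root / category / env / split
                if not split_dir.is_dir():
                    missing.append(split_dir.relative_to(root).as_posix())
    return missing


if __name__ == "__main__":
    data_root = os.environ.get("ACWM_DATA_ROOT", "./data")
    missing = find_missing_splits(data_root)
    print(f"{len(missing)} missing split directories under {data_root}")
    for rel_path in missing:
        print("  ", rel_path)
```

An empty result means every expected split directory exists; it does not check the contents of those directories.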
Dataset Format
Each split directory (e.g. push_block/ind_train/) contains:
- **`episode_{i}.mp4`**: RGB video at 10 fps, 240×240 (240×400 for Push Sand)
- **`metadata.pt`**: serialized list of episode dicts (load with `torch.load`)
Each entry in metadata.pt has:
| Field | Type | Description |
|---|---|---|
| `video_path` | str | Filename relative to the split dir, e.g. `episode_0.mp4` |
| `actions` | FloatTensor [T, action_dim] | Per-step action sequence |
| `length` | int | Number of frames T |
| `seed` | int | Random seed used during simulation |
| `episode_idx` | int | Global episode index (some environments) |
Example:
import torch
metadata = torch.load("data/rigid_dynamics/push_block/ind_train/metadata.pt", weights_only=False)
entry = metadata[0]
# entry["video_path"] → "episode_0.mp4"
# entry["actions"] → Tensor of shape [T, 2]
# entry["length"] → 16
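
As a quick sanity check after loading, the fields in the table can be cross-checked against each other. The sketch below is a hypothetical helper, not repository code. It assumes `actions` exposes a `.shape` attribute (as a `torch.Tensor` does) and that its first dimension equals `length`, as the `[T, action_dim]` and `T` notation in the table suggests.

```python
# episode_idx is documented as present only in some environments, so it is
# not treated as required here.
REQUIRED_FIELDS = ("video_path", "actions", "length", "seed")


def check_entry(entry, expected_action_dim=None):
    """Return a list of human-readable problems found in one metadata entry."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    if problems:
        return problems

    shape = tuple(entry["actions"].shape)
    if len(shape) != 2:
        problems.append(f"actions should be 2-D [T, action_dim], got shape {shape}")
    else:
        # Assumption: one action row per frame, i.e. T == length.
        if shape[0] != entry["length"]:
            problems.append(f"actions has {shape[0]} steps but length is {entry['length']}")
        if expected_action_dim is not None and shape[1] != expected_action_dim:
            problems.append(f"action_dim is {shape[1]}, expected {expected_action_dim}")
    if not str(entry["video_path"]).endswith(".mp4"):
        problems.append(f"unexpected video file name: {entry['video_path']}")
    return problems
```

For the Push Cube entry loaded above, `check_entry(entry, expected_action_dim=2)` would return an empty list if the entry is consistent under these assumptions.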
Checkpoints
Download the pretrained DiT-S checkpoints (100k steps) and the Wan 2.1 VAE:
huggingface-cli download t1an/ACWM-Phys-checkpoints --local-dir ./checkpoints
Set the VAE path:
export WAN_VAE_PATH=./checkpoints/Wan2.1_VAE.pth
The env configs in configs/envs/ also reference WAN_VAE_PATH via the vae_config field.
Released Checkpoints
All checkpoints are DiT-S (~200M parameters), trained for 100k steps with flow matching.
| Environment | Category | Action Dim | Resolution | Checkpoint |
|---|---|---|---|---|
| Push Cube | Rigid-Body | 2 | 240×240 | link |
| Stack Cube | Rigid-Body | 7 | 240×240 | link |
| Push Rope | Deformable | 2 | 240×240 | link |
| Cloth Move | Deformable | 3 | 240×240 | link |
| Push Sand | Particle | 7 | 240×400 | link |
| Pour Water | Particle | 4 | 240×240 | link |
| Robot Arm | Kinematics | 7 | 240×240 | link |
| Reacher | Kinematics | 2 | 240×240 | link |
Evaluation
Evaluate a single environment:
python eval.py --env push_cube --steps 50 --split both --save_videos
Evaluate all 8 environments:
bash scripts/eval_all.sh --save_videos
Results are written to results/results.md. Videos are saved under results/{env}/steps_50/{split}/sample_{i}/video.mp4 as side-by-side GT (left) | Prediction (right).
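
To collect saved videos programmatically, the path pattern above can be wrapped in a small function. This is an illustrative sketch rather than repository code; the function name and the `results_root` parameter are hypothetical, and it assumes the `steps_50` component tracks the `--steps` value.

```python
from pathlib import Path


def sample_video_path(env, split, sample_idx, steps=50, results_root="results"):
    """Build the path of one saved GT|Prediction video, following the pattern above."""
    # Assumption: the "steps_50" component changes with --steps; the text
    # above only shows it for the default of 50.
    return Path(results_root) / env / f"steps_{steps}" / split / f"sample_{sample_idx}" / "video.mp4"


print(sample_video_path("push_cube", "ood_test", 3).as_posix())
# results/push_cube/steps_50/ood_test/sample_3/video.mp4
```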
Key arguments:
| Argument | Default | Description |
|---|---|---|
| `--env` | required | Environment name |
| `--steps` | 50 | Denoising steps |
| `--split` | both | `ind_test`, `ood_test`, or `both` |
| `--ckpt` | auto | Override checkpoint path |
| `--cfg` | auto | Override config path |
| `--save_videos` | off | Save GT \| Pred side-by-side videos |
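
The arguments in the table map naturally onto an `argparse` interface. The sketch below only mirrors the documented flags and defaults; the actual parser in `eval.py` may be organized differently, and representing `auto` as `None` is an assumption.

```python
import argparse


def build_parser():
    """Parser mirroring the documented eval.py flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="Evaluate ACWM-DiT on one environment")
    parser.add_argument("--env", required=True, help="Environment name, e.g. push_cube")
    parser.add_argument("--steps", type=int, default=50, help="Denoising steps")
    parser.add_argument("--split", default="both", choices=["ind_test", "ood_test", "both"])
    # Assumption: None stands for "auto", i.e. resolve the path from --env.
    parser.add_argument("--ckpt", default=None, help="Override checkpoint path")
    parser.add_argument("--cfg", default=None, help="Override config path")
    parser.add_argument("--save_videos", action="store_true", help="Save GT|Pred videos")
    return parser


args = build_parser().parse_args(["--env", "push_cube", "--split", "ood_test"])
# "both" expands to the two test splits; otherwise evaluate the one requested.
splits = ["ind_test", "ood_test"] if args.split == "both" else [args.split]
print(args.env, args.steps, splits, args.save_videos)
# push_cube 50 ['ood_test'] False
```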
Training
Train DiT-S on Push Cube (single GPU):
python train.py --config configs/envs/push_cube.yaml
Multi-GPU (4 GPUs):
torchrun --nproc_per_node=4 train.py --config configs/envs/push_cube.yaml
SLURM example:
sbatch scripts/train_slurm.sh push_cube
Training hyperparameters are in configs/envs/{env}.yaml. Model size (S/M/L) is set via model_type: dit_s in the config.
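
Since the models are trained with flow matching, it may help to see what one regression target looks like. The sketch below shows a common formulation (a linear path between noise and data with a constant-velocity target) on plain Python lists. Whether ACWM-DiT uses exactly this path, time convention, or loss weighting is an assumption, and the function names are hypothetical. In the real model the velocity predictor would also be conditioned on the first frame and the action sequence.

```python
import random


def flow_matching_pair(clean, noise, t):
    """Return (x_t, target_velocity) for one sample under a linear noise-to-data path.

    Convention assumed here: t=0 is pure noise, t=1 is clean data.
    """
    x_t = [(1.0 - t) * n + t * c for c, n in zip(clean, noise)]
    velocity = [c - n for c, n in zip(clean, noise)]
    return x_t, velocity


def flow_matching_loss(predict_velocity, clean, rng):
    """Mean squared error between predicted and target velocity at a random time t."""
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    t = rng.random()
    x_t, target = flow_matching_pair(clean, noise, t)
    pred = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(clean)


def zero_model(x_t, t):
    """Stand-in for the DiT: always predicts zero velocity."""
    return [0.0] * len(x_t)


rng = random.Random(0)
loss = flow_matching_loss(zero_model, clean=[0.2, -0.1, 0.4], rng=rng)
print(f"loss with an untrained stand-in: {loss:.3f}")
```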
Model Architecture
ACWM-DiT takes the first video frame + full action sequence and predicts the complete future trajectory:
- Causal VAE (Wan 2.1) — encodes video into 16-ch latent tokens at H/8×W/8, 4× temporal compression
- DiT with flow matching — denoises the full latent trajectory; supports AdaLN and cross-attention action conditioning
- Action conditioning — injected via AdaLN (default) or cross-attention (better for high-dim actions)
Three model sizes: DiT-S (~200M), DiT-M (~600M), DiT-L (~800M).
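
The compression factors listed above determine the size of the latent tensor the DiT operates on. The sketch below works this out for the two released resolutions. The function is hypothetical, the 17-frame clip length is only illustrative, and the `1 + (T - 1) // 4` temporal rule is an assumption based on how causal video VAEs commonly treat the first frame. How ACWM-DiT handles frame counts that do not fit this pattern is not stated here.

```python
def latent_shape(num_frames, height, width, channels=16, spatial_factor=8, temporal_factor=4):
    """Latent tensor shape (C, T', H', W') implied by the stated compression factors."""
    # Assumption: the first frame is encoded on its own and the remaining
    # frames are compressed in groups of `temporal_factor`.
    latent_frames = 1 + (num_frames - 1) // temporal_factor
    return (channels, latent_frames, height // spatial_factor, width // spatial_factor)


print(latent_shape(17, 240, 240))  # (16, 5, 30, 30)
print(latent_shape(17, 240, 400))  # (16, 5, 30, 50)
```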
Metrics
| Metric | Description |
|---|---|
| MSE | Mean squared error on pixel values in [0,1] |
| M-MSE | Motion-weighted MSE (floor 0.01; focuses on moving regions) |
| PSNR | Peak signal-to-noise ratio (dB) |
| SSIM | Structural similarity index |
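
To make the first three metrics concrete, here is a minimal sketch on flat lists of pixel values in [0, 1]. MSE and PSNR follow their standard definitions. The motion-weighted variant is only one plausible reading of the table (weights taken from the ground-truth frame difference and floored at 0.01); the benchmark's exact weighting may differ, and the function names are hypothetical. SSIM is omitted because it needs windowed image statistics.

```python
import math


def mse(pred, target):
    """Mean squared error over two equally sized flat sequences of pixels in [0, 1]."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def psnr(pred, target, max_value=1.0):
    """PSNR in dB; infinite when the prediction is exact."""
    err = mse(pred, target)
    return math.inf if err == 0 else 10.0 * math.log10(max_value ** 2 / err)


def motion_weighted_mse(pred, target, prev_target, floor=0.01):
    """Hypothetical M-MSE: squared error weighted by ground-truth frame difference.

    Assumption: weight = max(|target - prev_target|, floor), normalized to sum to 1.
    """
    weights = [max(abs(t - q), floor) for t, q in zip(target, prev_target)]
    total = sum(weights)
    return sum(w * (p - t) ** 2 for w, p, t in zip(weights, pred, target)) / total
```

Under this weighting, an error on a pixel that changed between consecutive ground-truth frames counts far more than the same error on a static pixel, which is how the metric would focus on moving regions.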
Citation
@article{xue2026acwm,
title={ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models},
author={Xue, Haotian and Chen, Yipu and Ma, Liqian and Zhao, Zelin and Moukheiber, Lama and Zhu, Yuchen and Chen, Yongxin},
journal={arXiv preprint arXiv:2605.08567},
year={2026}
}