ACWM-Phys

May 19, 2026

ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

Project Page · arXiv · Dataset · Checkpoints

Haotian Xue†, Yipu Chen*, Liqian Ma*, Zelin Zhao, Lama Moukheiber, Yuchen Zhu, Yongxin Chen

Georgia Institute of Technology (†project lead, *equal contribution)


Overview

ACWM-Phys is a benchmark for evaluating action-conditioned video world models under diverse physical dynamics. It spans 8 environments across 4 physics regimes:

Category	Environments
Rigid-Body	Push Cube, Stack Cube
Deformable	Push Rope, Cloth Move
Particle	Push Sand, Pour Water
Kinematics	Robot Arm, Reacher

Each environment provides 1,000 training trajectories plus controlled in-distribution (InD) and out-of-distribution (OoD) test splits. We also provide ACWM-DiT, a latent diffusion transformer baseline trained with flow matching.

Installation

We use uv for fast, reproducible environment management.

git clone https://github.com/xavihart/ACWM-Phys.git
cd ACWM-Phys

# Create and activate a virtual environment
uv venv --python 3.10
source .venv/bin/activate

# Install dependencies
uv pip install -r requirements.txt

# Flash Attention (recommended for speed)
uv pip install flash-attn --no-build-isolation

Dataset

Download the ACWM-Phys dataset from HuggingFace:

huggingface-cli download t1an/ACWM-Phys --repo-type dataset --local-dir ./data

Then set the data root:

export ACWM_DATA_ROOT=./data

Expected structure:

data/
├── rigid_dynamics/
│   ├── push_block/      {ind_train, ind_test, ood_test}/
│   └── stack_cube/
├── deformable/
│   ├── push_rope/
│   └── clothmove/
├── particle/
│   ├── push_sand/
│   └── pour_water/
└── kinematics/
    ├── robot_arm_64/
    └── reacher/
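
After downloading, a quick layout check can catch an incomplete download before training or evaluation. The sketch below is not part of the repository: `find_missing_splits` is a hypothetical helper, and it assumes every environment contains the same three split directories that the tree shows for `push_block`.

```python
import os
from pathlib import Path

# Layout copied from the tree above: category -> environment directories.
EXPECTED_LAYOUT = {
    "rigid_dynamics": ["push_block", "stack_cube"],
    "deformable": ["push_rope", "clothmove"],
    "particle": ["push_sand", "pour_water"],
    "kinematics": ["robot_arm_64", "reacher"],
}
# Assumption: every environment has the same three splits as push_block.
SPLITS = ["ind_train", "ind_test", "ood_test"]


def find_missing_splits(data_root):
    """Return the expected split directories that do not exist under data_root."""
    root = Path(data_root)
    missing = []
    for category, envs in EXPECTED_LAYOUT.items():
        for env in envs:
            for split in SPLITS:
                split_dir = root / category / env / split
                if not split_dir.is_dir():
                    missing.append(split_dir)
    return missing


if __name__ == "__main__":
    # Falls back to ./data when ACWM_DATA_ROOT is not set.
    data_root = os.environ.get("ACWM_DATA_ROOT", "./data")
    for path in find_missing_splits(data_root):
        print(f"missing: {path}")
```

An empty result means all 24 split directories (8 environments × 3 splits) were found.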

Dataset Format

Each split directory (e.g. push_block/ind_train/) contains:

`episode_{i}.mp4`: RGB video at 10 fps, 240×240 (240×400 for Push Sand)
`metadata.pt`: serialized list of episode dicts (load with `torch.load`)

Each entry in metadata.pt has:

Field	Type	Description
`video_path`	`str`	Filename relative to the split dir, e.g. `episode_0.mp4`
`actions`	`FloatTensor [T, action_dim]`	Per-step action sequence
`length`	`int`	Number of frames T
`seed`	`int`	Random seed used during simulation
`episode_idx`	`int`	Global episode index (some environments)

Example:

import torch

metadata = torch.load("data/rigid_dynamics/push_block/ind_train/metadata.pt", weights_only=False)
entry = metadata[0]
# entry["video_path"]  → "episode_0.mp4"
# entry["actions"]     → Tensor of shape [T, 2]
# entry["length"]      → 16
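
The field table also suggests a simple sanity check for each entry. The sketch below is a hypothetical helper, not part of the repository. It assumes, as the table states, that `actions` has shape `[T, action_dim]` with `T` equal to `length`, and it only reads `.shape`, so it works on the loaded tensors without importing torch itself.

```python
from pathlib import Path

# episode_idx is only present in some environments, so it is not required.
REQUIRED_FIELDS = ("video_path", "actions", "length", "seed")


def check_entry(entry, split_dir, action_dim):
    """Return a list of problems found in one metadata entry (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    if problems:
        return problems

    shape = tuple(entry["actions"].shape)
    if len(shape) != 2 or shape[1] != action_dim:
        problems.append(f"actions shape {shape} does not match [T, {action_dim}]")
    elif shape[0] != entry["length"]:
        # Assumption from the field table: one action row per frame.
        problems.append(f"actions has {shape[0]} steps but length is {entry['length']}")

    if not (Path(split_dir) / entry["video_path"]).is_file():
        problems.append(f"video not found: {entry['video_path']}")
    return problems
```

With the `metadata` list loaded above, `check_entry(metadata[0], "data/rigid_dynamics/push_block/ind_train", action_dim=2)` would return an empty list for a well-formed entry.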

Checkpoints

Download the pretrained DiT-S checkpoints (100k steps) and the Wan 2.1 VAE:

huggingface-cli download t1an/ACWM-Phys-checkpoints --local-dir ./checkpoints

Set the VAE path:

export WAN_VAE_PATH=./checkpoints/Wan2.1_VAE.pth

The env configs in configs/envs/ also reference WAN_VAE_PATH via the vae_config field.

Released Checkpoints

All checkpoints are DiT-S (~200M parameters), trained for 100k steps with flow matching.

Environment	Category	Action Dim	Resolution	Checkpoint
Push Cube	Rigid-Body	2	240×240	link
Stack Cube	Rigid-Body	7	240×240	link
Push Rope	Deformable	2	240×240	link
Cloth Move	Deformable	3	240×240	link
Push Sand	Particle	7	240×400	link
Pour Water	Particle	4	240×240	link
Robot Arm	Kinematics	7	240×240	link
Reacher	Kinematics	2	240×240	link

Evaluation

Evaluate a single environment:

python eval.py --env push_cube --steps 50 --split both --save_videos

Evaluate all 8 environments:

bash scripts/eval_all.sh --save_videos

Results are written to results/results.md. Videos are saved under results/{env}/steps_50/{split}/sample_{i}/video.mp4 as side-by-side GT (left) | Prediction (right).
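
For scripting over saved videos, the path pattern above can be wrapped in a small helper. This is a sketch with a hypothetical function name. It assumes that the `steps_50` component follows the `--steps` value, that the split component matches the split names (`ind_test`, `ood_test`), and that sample indices are plain integers.

```python
from pathlib import Path


def result_video_path(env, split, sample_idx, steps=50, results_root="results"):
    """Build results/{env}/steps_{steps}/{split}/sample_{i}/video.mp4."""
    return (
        Path(results_root)
        / env
        / f"steps_{steps}"
        / split
        / f"sample_{sample_idx}"
        / "video.mp4"
    )


def list_result_videos(env, split, steps=50, results_root="results"):
    """Return all saved videos for one environment and split, sorted by path."""
    split_dir = Path(results_root) / env / f"steps_{steps}" / split
    return sorted(split_dir.glob("sample_*/video.mp4"))
```

Note that `list_result_videos` sorts lexicographically, so `sample_10` comes before `sample_2`.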

Key arguments:

Argument	Default	Description
`--env`	required	Environment name
`--steps`	50	Denoising steps
`--split`	both	`ind_test`, `ood_test`, or `both`
`--ckpt`	auto	Override checkpoint path
`--cfg`	auto	Override config path
`--save_videos`	off	Save GT|Pred side-by-side videos

Training

Train DiT-S on Push Cube (single GPU):

python train.py --config configs/envs/push_cube.yaml

Multi-GPU (4 GPUs):

torchrun --nproc_per_node=4 train.py --config configs/envs/push_cube.yaml

SLURM example:

sbatch scripts/train_slurm.sh push_cube

Training hyperparameters are in configs/envs/{env}.yaml. Model size (S/M/L) is set via model_type: dit_s in the config.

Model Architecture

ACWM-DiT takes the first video frame + full action sequence and predicts the complete future trajectory:

Causal VAE (Wan 2.1) — encodes video into 16-ch latent tokens at H/8×W/8, 4× temporal compression
DiT with flow matching — denoises the full latent trajectory; supports AdaLN and cross-attention action conditioning
Action conditioning — injected via AdaLN (default) or cross-attention (better for high-dim actions)
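
To make the compression factors concrete, the sketch below computes the latent shape for a clip. The helper name is hypothetical. The 16 channels, 8× spatial and 4× temporal factors come from the list above. The frame-count formula is an assumption typical of causal video VAEs (the first frame is encoded on its own, and each later group of 4 frames becomes one latent frame), and the repository's VAE wrapper may handle clip lengths differently.

```python
def latent_shape(num_frames, height, width, channels=16, spatial=8, temporal=4):
    """Return (channels, latent_frames, latent_height, latent_width) for one clip."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if height % spatial or width % spatial:
        raise ValueError("height and width must be divisible by the spatial factor")
    # Assumption: causal encoding keeps the first frame and compresses the rest 4x.
    latent_frames = 1 + (num_frames - 1) // temporal
    return (channels, latent_frames, height // spatial, width // spatial)


# A 17-frame 240x240 clip under these assumptions.
print(latent_shape(17, 240, 240))
# The wider Push Sand resolution.
print(latent_shape(17, 240, 400))
```

Under these assumptions a 240×240 frame maps to a 30×30 latent grid and a 240×400 frame to 30×50.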

Three model sizes: DiT-S (~200M), DiT-M (~600M), DiT-L (~800M).

Metrics

Metric	Description
MSE	Mean squared error on pixel values in [0,1]
M-MSE	Motion-weighted MSE (floor 0.01; focuses on moving regions)
PSNR	Peak signal-to-noise ratio (dB)
SSIM	Structural similarity index

Citation

@article{xue2026acwm,
  title={ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models},
  author={Xue, Haotian and Chen, Yipu and Ma, Liqian and Zhao, Zelin and Moukheiber, Lama and Zhu, Yuchen and Chen, Yongxin},
  journal={arXiv preprint arXiv:2605.08567},
  year={2026}
}