ACWM-Phys

May 19, 2026

ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

Project Page · arXiv · Dataset · Checkpoints

Haotian Xue†, Yipu Chen*, Liqian Ma*, Zelin Zhao, Lama Moukheiber, Yuchen Zhu, Yongxin Chen

Georgia Institute of Technology (†project lead, *equal contribution)


Overview

ACWM-Phys is a benchmark for evaluating action-conditioned video world models under diverse physical dynamics. It spans 8 environments across 4 physics regimes:

| Category | Environments |
|---|---|
| Rigid-Body | Push Cube, Stack Cube |
| Deformable | Push Rope, Cloth Move |
| Particle | Push Sand, Pour Water |
| Kinematics | Robot Arm, Reacher |

Each environment provides 1,000 training trajectories + controlled in-distribution (InD) and out-of-distribution (OoD) test splits. We also provide ACWM-DiT, a latent diffusion transformer baseline trained with flow matching.


Installation

We use uv for fast, reproducible environment management.

git clone https://github.com/xavihart/ACWM-Phys.git
cd ACWM-Phys

# Create and activate a virtual environment
uv venv --python 3.10
source .venv/bin/activate

# Install dependencies
uv pip install -r requirements.txt

# Flash Attention (recommended for speed)
uv pip install flash-attn --no-build-isolation

Dataset

Download the ACWM-Phys dataset from HuggingFace:

huggingface-cli download t1an/ACWM-Phys --repo-type dataset --local-dir ./data

Then set the data root:

export ACWM_DATA_ROOT=./data

Expected structure:

data/
├── rigid_dynamics/
│   ├── push_block/      {ind_train, ind_test, ood_test}/
│   └── stack_cube/
├── deformable/
│   ├── push_rope/
│   └── clothmove/
├── particle/
│   ├── push_sand/
│   └── pour_water/
└── kinematics/
    ├── robot_arm_64/
    └── reacher/
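
The layout above can be turned into a small path helper. This is a minimal sketch, not part of the repository: `ENV_TO_CATEGORY`, `SPLITS`, and `split_dir` are hypothetical names, the `./data` fallback is an assumption, and the tree only shows the three split folders for `push_block`, so the sketch assumes the other environments follow the same pattern.

```python
import os
from pathlib import Path

# Category directory for each environment directory, copied from the tree above.
ENV_TO_CATEGORY = {
    "push_block": "rigid_dynamics",
    "stack_cube": "rigid_dynamics",
    "push_rope": "deformable",
    "clothmove": "deformable",
    "push_sand": "particle",
    "pour_water": "particle",
    "robot_arm_64": "kinematics",
    "reacher": "kinematics",
}
# Assumption: every environment has the same three splits as push_block.
SPLITS = ("ind_train", "ind_test", "ood_test")


def split_dir(env_dir, split, data_root=None):
    """Return <data_root>/<category>/<env_dir>/<split> (hypothetical helper)."""
    if env_dir not in ENV_TO_CATEGORY:
        raise KeyError(f"unknown environment directory: {env_dir}")
    if split not in SPLITS:
        raise ValueError(f"split must be one of {SPLITS}, got {split!r}")
    # Fall back to ./data when ACWM_DATA_ROOT is unset (assumption for this sketch).
    if data_root is None:
        data_root = os.environ.get("ACWM_DATA_ROOT", "./data")
    return Path(data_root) / ENV_TO_CATEGORY[env_dir] / env_dir / split


print(split_dir("push_block", "ind_train", data_root="data"))
```

Passing `data_root` explicitly keeps the helper usable in scripts that do not export `ACWM_DATA_ROOT`.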

Dataset Format

Each split directory (e.g. push_block/ind_train/) contains:

  • **`episode_{i}.mp4`**: RGB video at 10 fps, 240×240 (240×400 for Push Sand)
  • **`metadata.pt`**: serialized list of episode dicts (load with `torch.load`)

Each entry in metadata.pt has:

| Field | Type | Description |
|---|---|---|
| `video_path` | `str` | Filename relative to the split dir, e.g. `episode_0.mp4` |
| `actions` | `FloatTensor [T, action_dim]` | Per-step action sequence |
| `length` | `int` | Number of frames T |
| `seed` | `int` | Random seed used during simulation |
| `episode_idx` | `int` | Global episode index (some environments) |

Example:

import torch

metadata = torch.load("data/rigid_dynamics/push_block/ind_train/metadata.pt", weights_only=False)
entry = metadata[0]
# entry["video_path"]  → "episode_0.mp4"
# entry["actions"]     → Tensor of shape [T, 2]
# entry["length"]      → 16
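
Before pairing videos with actions, it can help to sanity-check each entry against the field descriptions above. The following is a minimal sketch under those descriptions: `check_entry` is a hypothetical helper, and the demo entry uses nested lists as a stand-in for the `FloatTensor` a real entry holds (`len()` returns the first dimension for both).

```python
from pathlib import Path


def check_entry(entry, split_dir):
    """Check one metadata entry and return its resolved video path."""
    actions = entry["actions"]
    # The field table states actions has shape [T, action_dim] and length is T.
    if len(actions) != entry["length"]:
        raise ValueError(
            f"{entry['video_path']}: {len(actions)} actions but length={entry['length']}"
        )
    # video_path is stored relative to the split directory.
    return Path(split_dir) / entry["video_path"]


# Stand-in entry; a real one comes from metadata.pt via torch.load.
demo = {
    "video_path": "episode_0.mp4",
    "actions": [[0.0, 0.0] for _ in range(16)],
    "length": 16,
    "seed": 0,
}
print(check_entry(demo, "data/rigid_dynamics/push_block/ind_train"))
```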

Checkpoints

Download the pretrained DiT-S checkpoints (100k steps) and the Wan 2.1 VAE:

huggingface-cli download t1an/ACWM-Phys-checkpoints --local-dir ./checkpoints

Set the VAE path:

export WAN_VAE_PATH=./checkpoints/Wan2.1_VAE.pth

The env configs in configs/envs/ also reference WAN_VAE_PATH via the vae_config field.

Released Checkpoints

All checkpoints are DiT-S (~200M parameters), trained for 100k steps with flow matching.

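| Environment | Category | Action Dim | Resolution | Checkpoint |
|---|---|---|---|---|
| Push Cube | Rigid-Body | 2 | 240×240 | link |
| Stack Cube | Rigid-Body | 7 | 240×240 | link |
| Push Rope | Deformable | 2 | 240×240 | link |
| Cloth Move | Deformable | 3 | 240×240 | link |
| Push Sand | Particle | 7 | 240×400 | link |
| Pour Water | Particle | 4 | 240×240 | link |
| Robot Arm | Kinematics | 7 | 240×240 | link |
| Reacher | Kinematics | 2 | 240×240 | link |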
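
The action dimensions and resolutions in the table above can be kept as a lookup for quick shape checks. This is an illustrative sketch: `ENV_SPECS` and `action_dim_matches` are hypothetical names, the keys are the display names from the table (not necessarily the identifiers accepted by `--env`), and the resolution tuple simply keeps the order in which the table writes it.

```python
# (action_dim, resolution) per environment, copied from the table above.
# The resolution tuple keeps the order written in the table.
ENV_SPECS = {
    "Push Cube": (2, (240, 240)),
    "Stack Cube": (7, (240, 240)),
    "Push Rope": (2, (240, 240)),
    "Cloth Move": (3, (240, 240)),
    "Push Sand": (7, (240, 400)),
    "Pour Water": (4, (240, 240)),
    "Robot Arm": (7, (240, 240)),
    "Reacher": (2, (240, 240)),
}


def action_dim_matches(env_name, actions):
    """Return True if every per-step action has the documented dimensionality."""
    expected_dim, _ = ENV_SPECS[env_name]
    return all(len(step) == expected_dim for step in actions)


# Stand-in action sequence of three 2-D steps.
print(action_dim_matches("Reacher", [[0.1, 0.2]] * 3))
```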

Evaluation

Evaluate a single environment:

python eval.py --env push_cube --steps 50 --split both --save_videos

Evaluate all 8 environments:

bash scripts/eval_all.sh --save_videos

Results are written to results/results.md. Videos are saved under results/{env}/steps_50/{split}/sample_{i}/video.mp4 as side-by-side GT (left) | Prediction (right).

Key arguments:

| Argument | Default | Description |
|---|---|---|
| `--env` | required | Environment name |
| `--steps` | 50 | Denoising steps |
| `--split` | both | `ind_test`, `ood_test`, or `both` |
| `--ckpt` | auto | Override checkpoint path |
| `--cfg` | auto | Override config path |
| `--save_videos` | off | Save GT\|Pred side-by-side videos |
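
For readers wrapping the evaluation in their own scripts, the arguments above map onto a standard `argparse` interface. The parser below is a hypothetical reconstruction from the table only; the real `eval.py` may define its flags differently, and representing the `auto` default as the literal string `"auto"` is an assumption.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags; eval.py itself may differ."""
    parser = argparse.ArgumentParser(description="Evaluate ACWM-DiT on one environment")
    parser.add_argument("--env", required=True, help="Environment name, e.g. push_cube")
    parser.add_argument("--steps", type=int, default=50, help="Denoising steps")
    parser.add_argument(
        "--split",
        choices=["ind_test", "ood_test", "both"],
        default="both",
        help="Which test split(s) to evaluate",
    )
    # Assumption: "auto" is modelled as a plain string sentinel.
    parser.add_argument("--ckpt", default="auto", help="Override checkpoint path")
    parser.add_argument("--cfg", default="auto", help="Override config path")
    parser.add_argument("--save_videos", action="store_true", help="Save GT|Pred videos")
    return parser


args = build_parser().parse_args(["--env", "push_cube", "--save_videos"])
print(args.env, args.steps, args.split, args.save_videos)
```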

Training

Train DiT-S on Push Cube (single GPU):

python train.py --config configs/envs/push_cube.yaml

Multi-GPU (4 GPUs):

torchrun --nproc_per_node=4 train.py --config configs/envs/push_cube.yaml

SLURM example:

sbatch scripts/train_slurm.sh push_cube

Training hyperparameters are in configs/envs/{env}.yaml. Model size (S/M/L) is set via model_type: dit_s in the config.


Model Architecture

ACWM-DiT takes the first video frame + full action sequence and predicts the complete future trajectory:

  1. Causal VAE (Wan 2.1) — encodes video into 16-ch latent tokens at H/8×W/8, 4× temporal compression
  2. DiT with flow matching — denoises the full latent trajectory; supports AdaLN and cross-attention action conditioning
  3. Action conditioning — injected via AdaLN (default) or cross-attention (better for high-dim actions)
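
The compression factors in step 1 determine the latent grid the DiT operates on. The sketch below works through the arithmetic; `latent_shape` is a hypothetical helper, and the temporal rule (first frame kept on its own, remaining frames grouped in blocks of four) is an assumption about the causal VAE convention rather than something stated here. How frame counts that do not fit this pattern are handled is implementation-specific.

```python
def latent_shape(num_frames, height, width, channels=16, spatial_factor=8, temporal_factor=4):
    """Return (channels, latent_frames, latent_h, latent_w) under the stated assumptions."""
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("height and width must be divisible by the spatial factor")
    # Assumption: causal convention, first frame kept on its own and the
    # remaining frames grouped in blocks of `temporal_factor` (remainder dropped).
    latent_frames = 1 + (num_frames - 1) // temporal_factor
    return (channels, latent_frames, height // spatial_factor, width // spatial_factor)


print(latent_shape(17, 240, 240))  # (16, 5, 30, 30)
print(latent_shape(17, 240, 400))  # (16, 5, 30, 50)
```

A 240×240 clip therefore yields a 30×30 spatial grid per latent frame, and the 240×400 Push Sand clips yield 30×50.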

Three model sizes: DiT-S (~200M), DiT-M (~600M), DiT-L (~800M).
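
To make the flow-matching sampling in step 2 concrete, here is a toy sketch of fixed-step Euler integration of a velocity field, which is one common way such samplers are run. It is not the repository's sampler: the linear noise-to-data path, the choice of noise at t = 0, and all names (`euler_sample`, `ideal_velocity`) are assumptions, and the "model" is replaced by the exact velocity for a single known pair. In the real system the velocity would come from the DiT, conditioned on the first frame and the actions.

```python
import random


def euler_sample(velocity_fn, x_start, num_steps=50):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x_start)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy setup: one known noise/data pair and the exact velocity between them.
rng = random.Random(0)
data = [0.2, 0.8, 0.5]                        # stand-in for a clean latent
noise = [rng.gauss(0.0, 1.0) for _ in data]   # sample at t = 0 (assumed convention)


def ideal_velocity(x, t):
    # For the linear path x_t = (1 - t) * noise + t * data,
    # the velocity is constant and equal to data - noise.
    return [d - n for d, n in zip(data, noise)]


sample = euler_sample(ideal_velocity, noise, num_steps=50)
max_error = max(abs(s - d) for s, d in zip(sample, data))
print(max_error)  # close to zero for this idealized velocity
```

With a learned, imperfect velocity field the number of steps trades accuracy for compute, which is the role of the `--steps` argument at evaluation time.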


Metrics

| Metric | Description |
|---|---|
| MSE | Mean squared error on pixel values in [0,1] |
| M-MSE | Motion-weighted MSE (floor 0.01; focuses on moving regions) |
| PSNR | Peak signal-to-noise ratio (dB) |
| SSIM | Structural similarity index |
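
The pixel-level metrics above can be sketched in a few lines of plain Python. MSE and PSNR follow their standard definitions for values in [0, 1]. The motion-weighted variant is only a guess at one plausible form: it weights each pixel's squared error by the ground-truth frame-to-frame change, floored at 0.01, and the benchmark's actual M-MSE definition may differ. All function names are hypothetical, and frames are represented as flat lists of floats for brevity.

```python
import math


def mse(pred, gt):
    """Mean squared error between two equal-length sequences of values in [0, 1]."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    return sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt)


def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB; infinite when the inputs are identical."""
    err = mse(pred, gt)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / err)


def motion_weighted_mse(pred_frames, gt_frames, floor=0.01):
    """Assumed M-MSE form: squared error weighted by GT temporal change, floored."""
    num, den = 0.0, 0.0
    for t in range(1, len(gt_frames)):
        for p, g, g_prev in zip(pred_frames[t], gt_frames[t], gt_frames[t - 1]):
            weight = max(abs(g - g_prev), floor)  # static pixels keep a small weight
            num += weight * (p - g) ** 2
            den += weight
    return num / den


gt_frames = [[0.0, 0.0], [1.0, 0.0]]      # pixel 0 moves, pixel 1 is static
pred_frames = [[0.0, 0.0], [1.0, 0.5]]    # the error sits on the static pixel
print(mse(pred_frames[1], gt_frames[1]))
print(motion_weighted_mse(pred_frames, gt_frames))
```

In this toy case the error falls on a static pixel, so the motion-weighted value comes out far smaller than the plain MSE.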

Citation

@article{xue2026acwm,
  title={ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models},
  author={Xue, Haotian and Chen, Yipu and Ma, Liqian and Zhao, Zelin and Moukheiber, Lama and Zhu, Yuchen and Chen, Yongxin},
  journal={arXiv preprint arXiv:2605.08567},
  year={2026}
}