README.md
April 24, 2026
ABot-PhysWorld is a physically consistent, action-controllable video world model for robotic manipulation, built on a 14-billion-parameter Diffusion Transformer. It integrates physics-aware training, memory-efficient preference optimization, and precise spatial action injection to generate realistic and physically plausible robot-object interactions, even in zero-shot settings.
News
- [2026-04] 1st Place on WorldArena Leaderboard! ABot-PhysWorld achieves the top rank on the WorldArena benchmark.
- [2026-04] 2nd Place on GigaBrain Challenge CVPR 2026 - World Model Track! ABot-PhysWorld secures the runner-up position in the CVPR 2026 GigaBrain Challenge World Model Track.
- [2026-04] A2V code released! Action-to-Video training and inference via VACE parallel context blocks. See `training/README_A2V.md` and `inference/README_A2V.md`.
- [2026-04] DPO training released! Direct Preference Optimization pipeline for physics-aware alignment with LoRA. See `training/README_DPO.md`.
- [2026-03] Training code released! Full-parameter SFT training scripts for fine-tuning on custom robot manipulation datasets. See `training/`.
- [2026-03] SFT training data released! The v1 SFT training dataset is available on ModelScope.
- [2026-03] Benchmark released! EZS-Bench evaluation toolkit and data are open-sourced. See `EZS-Bench/`.
- [2026-03] Inference code released! Generate robot manipulation videos with the pre-trained model. See `inference/`.
Competition Results
WorldArena Leaderboard - 1st Place
GigaBrain Challenge CVPR 2026 - World Model Track - 2nd Place
Table of Contents
- Key Contributions
- EZS-Bench
- Evaluation
- Qualitative Results
- Usage
- Training
- A2V (Action-to-Video)
- DPO Training
- Citing
Key Contributions
- Industrial-Grade Data Pipeline: Curated ~3M real-world manipulation clips from five datasets (AgiBot, RoboCoin, RoboMind, Galaxea, OXE) with motion, semantic, and action consistency filtering, plus hierarchical sampling for balanced generalization.
- Physics-Aware DPO Training: Introduces a decoupled VLM-based discriminator in which Qwen3-VL generates task-specific physics checklists and Gemini 3 Pro scores videos via Chain-of-Thought. This is combined with LoRA-augmented DPO on a 14B DiT to enforce physical plausibility.
- Parallel Context Blocks for Action Control: Enables precise action-conditioned generation by residually injecting spatial action maps into cloned DiT blocks, preserving physical priors while supporting cross-embodiment control.
- EZS-Bench, the First True Zero-Shot Benchmark: Fully training-independent evaluation covering unseen robot, scene, and task combinations, with dual-model scoring to eliminate self-evaluation bias.
EZS-Bench
Embodied-ZeroShot Benchmark for Physically Consistent Video Generation
EZS-Bench is a zero-shot evaluation benchmark designed to rigorously assess physically plausible video generation in robotic manipulation. It evaluates models on physical consistency, action controllability, and cross-embodiment generalization, with no overlap between training and test data.
Key Features
True Zero-Shot Evaluation
Unseen combinations of:
- Robot morphologies (e.g., single-arm, bimanual, custom kinematics)
- Scenes & backgrounds
- Manipulation tasks (pick-and-place, wiping, assembly, etc.)
Dual-Source Data Construction
- Synthetic branch: Text-to-image generation with controlled variation
- Real-world editing: VLM-driven scene augmentation preserving physical interactions
Physics-Aware Evaluation
- Dynamic physical checklists generated by VLMs (e.g., "Does the gripper penetrate the object?", "Is gravity respected?")
- 30-50% negative questions to prevent guessing
- Decoupled scorer architecture to eliminate self-evaluation bias
Comprehensive Metrics
Evaluates:
- Physical fidelity (penetration, contact, deformation)
- Temporal coherence
- Spatial alignment & trajectory consistency
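To illustrate why mixing in negative questions discourages guessing, here is a minimal sketch of checklist scoring. The `ChecklistItem` type, the `checklist_score` function, and the "fraction of answers matching the expected answer" rule are all assumptions made for illustration; the toolkit's actual Domain Score may be computed differently.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    question: str
    expected: bool  # answer a physically correct video should receive
    answer: bool    # answer returned by the judge model


def checklist_score(items):
    """Fraction of judge answers that match the expected answer (assumed rule)."""
    if not items:
        raise ValueError("checklist must not be empty")
    return sum(item.answer == item.expected for item in items) / len(items)


def negative_fraction(items):
    """Share of questions whose expected answer is 'no'."""
    return sum(not item.expected for item in items) / len(items)


# Hypothetical checklist: 2 of 5 questions are negative (expected answer "no").
questions = [
    ("Is gravity respected?", True),
    ("Does the gripper stay in contact while lifting?", True),
    ("Does the object keep its shape?", True),
    ("Does the gripper penetrate the object?", False),
    ("Does the object float without support?", False),
]

# A judge that answers "yes" to everything is penalized on the negative questions.
always_yes = [ChecklistItem(q, expected, True) for q, expected in questions]
print(negative_fraction(always_yes), checklist_score(always_yes))
```

With 40% negative questions, the always-yes judge matches only the three positive items, so blind agreement cannot reach a perfect score.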
Getting Started
Download evaluation data from ModelScope:
git lfs install
git clone https://www.modelscope.cn/datasets/amap_cvlab/EZS-Bench_data.git
Install and run the evaluation toolkit:
cd EZS-Bench
pip install -e .
# Full evaluation (Video Quality + Domain Score)
torchrun --standalone --nproc_per_node=4 evaluate_ezsbench.py \
--data_file /path/to/EZS-Bench_data/video_prompt_question_196_ezs0.jsonl \
--method_name "YourMethod" \
--method_dir /path/to/generated_videos \
--output_dir ./results
The VLM judge model (Qwen2.5-VL-72B-Instruct, ~150 GB) is automatically downloaded on first run.
See EZS-Bench/README.md for full documentation.
Evaluation
We evaluate ABot-PhysWorld on three key aspects:
- Physical Consistency (via PBench and EZS-Bench)
- Zero-Shot Generalization (via EZS-Bench)
- Action-Conditioned Controllability (via custom A2V benchmark)
Summary of Advancements
| Capability | Benchmark | Ours | Best Baseline | Gain (percentage points) |
|---|---|---|---|---|
| Physical Fidelity | PBench (Domain Score) | 0.9306 | 0.8644 (Wan2.5) | +6.62 |
| Zero-Shot Generalization | EZS-Bench (Domain Score) | 0.8366 | 0.7951 (WoW) | +4.15 |
| Action Control | Trajectory Consistency | 0.8522 | 0.8157 (Enerverse) | +3.65 |
ABot-PhysWorld outperforms the best listed baseline on all three benchmarks, covering physical fidelity, zero-shot generalization, and action control.
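The Gain column matches the absolute difference between the two scores, expressed in percentage points, rather than a relative improvement. The short check below recomputes both quantities from the scores in the table; the dictionary keys and function names are illustrative only.

```python
# (ours, best baseline) scores copied from the summary table
SCORES = {
    "PBench Domain Score": (0.9306, 0.8644),
    "EZS-Bench Domain Score": (0.8366, 0.7951),
    "Trajectory Consistency": (0.8522, 0.8157),
}


def gain_points(ours, baseline):
    """Absolute difference, in percentage points."""
    return (ours - baseline) * 100.0


def relative_gain_percent(ours, baseline):
    """Improvement relative to the baseline score, in percent."""
    return (ours - baseline) / baseline * 100.0


for name, (ours, baseline) in SCORES.items():
    print(
        f"{name}: +{gain_points(ours, baseline):.2f} points, "
        f"+{relative_gain_percent(ours, baseline):.2f}% relative"
    )
```

The relative improvements come out somewhat larger than the point differences, because every baseline score is below 1.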
Qualitative Results
Selected representative zero-shot generation results demonstrating ABot-PhysWorld's strong generalization and physical plausibility.
Zero-Shot Capabilities
Scene 1: Deformable Object - Dual-Arm Towel Folding
- Task: Fold a towel using dual robotic arms
- Challenge: Complex cloth dynamics and bimanual coordination
- Ours:
  - Physically realistic deformation
  - Smooth, collision-free arm motion
  - Natural folding sequence with consistent contact
Scene 2: Fine Manipulation - Diverse Object Handling
- Task: Stack cups, build blocks, place a knife
- Challenge: Varying shapes, weights, and friction
- Ours:
  - Accurate grasp pose prediction
  - Adaptive gripper control
  - Stable pick-and-place without slippage or penetration
Scene 3: Articulated Object - Opening a Cabinet Door
- Task: Open a hinged cabinet or door
- Challenge: Enforce rotational constraints and correct force direction
- Ours:
  - Proper handle grasping
  - Realistic hinge rotation
  - Motion follows the physical pivot axis
Scene 4: Fluid Interaction - Pouring Water
- Task: Pour water from a cup into a bowl using dual arms
- Challenge: Bimanual coordination, tilt control, liquid dynamics
- Ours:
  - Collision-free trajectory planning
  - Accurate pour timing and angle
  - Visual consistency in fluid transfer (simulated proxy)
Scene 5: Cleaning Task - Wiping a Stain
Note: The Gemini watermark (bottom-right) indicates the initial frame generated by Gemini (ensuring it is completely unseen); all other frames are generated by ABot-PhysWorld.
- Task: Wipe a stain off a table
- Challenge: Maintain contact, uniform pressure, full coverage
- Ours:
  - Continuous tool-surface contact
  - Systematic wiping motion
  - Gradual removal of the stain in video output
Scene 6: Multi-Scene Generalization - Fruit Sorting
Note: The Gemini watermark (bottom-right) indicates the initial frame generated by Gemini (ensuring it is completely unseen); all other frames are generated by ABot-PhysWorld.
- Task: Place fruit on a plate across diverse scenes
- Challenge: Background, lighting, and fruit variation
- Ours:
  - Robust object recognition under domain shifts
  - Consistent performance across unseen environments
  - Fast and stable manipulation regardless of setup
PBench Results Demonstration
We ran qualitative comparisons on the PAI-Bench (PBench) benchmark dataset. Below are generated results from several typical scenarios.
| Task | Baselines | Ours |
|---|---|---|
| Grasping | Frequent penetration and floating objects | Firm contact, no physical violations |
| Long-horizon Planning | Inconsistent state transitions | Coherent multi-step reasoning |
| Rigid-body Dynamics | Unphysical deformations | Preserved geometry and mass behavior |
| Contact Modeling | Non-contact attraction | Realistic interaction onset |
In these comparisons, our model generates physically valid trajectories even in complex, unseen scenarios, which supports its use as a simulator for embodied AI.
Usage
Quick Start: Video Generation Inference
Generate physically plausible robot manipulation videos using the ABot-PhysWorld fine-tuned model.
Environment Setup
# Create conda environment
conda create -n abot-physworld python=3.10
conda activate abot-physworld
# Install PyTorch with CUDA support
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
# Install dependencies
pip install -r requirements.txt
Hardware Requirements:
| Configuration | VRAM | Notes |
|---|---|---|
| Recommended | >= 60GB | Best performance, no tiling needed |
| Minimum | >= 24GB | Uses tiled VAE (enabled by default) |
Demo: Generate Video from Image + Text Prompt
cd inference
# Download demo data and run inference
python inference.py \
--jsonl_path assets/demo.jsonl \
--output_dir ./outputs/demo \
--save_first_frames
This generates videos for 2 Franka robot manipulation samples. The model checkpoint is auto-downloaded from ModelScope on first run.
Single Image Inference
python inference.py \
--input_image /path/to/image.jpg \
--prompt "robot arm picks up the red cube from the table" \
--output_dir ./outputs
Batch Inference from JSONL
Prepare a JSONL file (each line is a sample):
{"video": "path/to/image.jpg", "prompt": "robot grasps the object"}
{"video": "path/to/image2.jpg", "prompt": "robot places object on table"}
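For larger batches, the JSONL file can be generated rather than written by hand. The sketch below uses only the two keys shown above (`video` and `prompt`); the helper name `write_batch_jsonl` and the temporary output location are illustrative assumptions, not part of the repository.

```python
import json
import tempfile
from pathlib import Path


def write_batch_jsonl(samples, out_path):
    """Write (image_path, prompt) pairs as one JSON object per line."""
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as f:
        for image_path, prompt in samples:
            record = {"video": str(image_path), "prompt": prompt}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path


samples = [
    ("path/to/image.jpg", "robot grasps the object"),
    ("path/to/image2.jpg", "robot places object on table"),
]

# Write to a temporary directory and read the file back to confirm the format.
with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = write_batch_jsonl(samples, Path(tmp) / "data.jsonl")
    records = [
        json.loads(line)
        for line in jsonl_path.read_text(encoding="utf-8").splitlines()
    ]

print(len(records), "samples written")
```

In practice, write the file to a persistent location and pass that path to `--jsonl_path`.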
Then run:
python inference.py \
--jsonl_path data.jsonl \
--output_dir ./outputs \
--num_samples 100 # Process max 100 samples
Full Parameter Reference
python inference.py --help
Key parameters:
- `--checkpoint_path`: Local path to model weights (auto-downloads if not provided)
- `--cache_dir`: Directory to store downloaded weights (default: `./checkpoints`)
- `--height`, `--width`: Video resolution (default: 480 × 832)
- `--num_frames`: Number of output frames (default: 81 ≈ 5.4 s at 15 fps)
- `--num_inference_steps`: Denoising steps; higher values give better quality but run slower (default: 50)
- `--cfg_scale`: Classifier-free guidance scale (default: 5.0)
- `--seed`: Random seed for reproducibility
- `--gpu_id`: GPU device index
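When choosing a frame count, the clip length follows directly from the frame rate: the default of 81 frames at 15 fps is 81 / 15 = 5.4 seconds. A small helper makes the conversion explicit; the function name is illustrative and not part of `inference.py`.

```python
def clip_duration_seconds(num_frames: int, fps: float) -> float:
    """Return the playback length of a clip in seconds."""
    if num_frames <= 0 or fps <= 0:
        raise ValueError("num_frames and fps must be positive")
    return num_frames / fps


print(f"{clip_duration_seconds(81, 15):.1f} s")  # prints "5.4 s"
```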
Output
- Single image: `{output_dir}/{image_name}_generated.mp4`
- Batch mode: `{output_dir}/{unique_id}_generated.mp4` + `results.json` (with status for each sample)
Model Weights
Auto-Download: The fine-tuned checkpoint is automatically downloaded from ModelScope on first inference run.
Manual Download (Optional):
pip install modelscope
modelscope download --model amap_cvlab/Abot-PhysWorld --local_dir ./inference/checkpoints
Base Model: Wan2.1-I2V-14B-480P is also auto-downloaded by DiffSynth-Studio.
More Details
For detailed setup instructions, examples, and troubleshooting, see inference/README.md.
Training
We provide full-parameter SFT training scripts to fine-tune Wan2.1-I2V-14B-480P on your own robot manipulation datasets.
Training Data
The v1 SFT training dataset is available on ModelScope:
git lfs install
git clone https://www.modelscope.cn/datasets/amap_cvlab/ABot-PhysWorld_SFT_Training_Data_v1.git
Quick Start
cd training
# Prepare your dataset (JSONL format, see training/assets/demo_train.jsonl)
# Then launch 8-GPU training:
bash run_train.sh
Key Features
- Full-parameter SFT on the 14B DiT model (LoRA also supported)
- DeepSpeed ZeRO-2 distributed training via Accelerate
- Encoded feature caching: Save VAE/T5/CLIP encodings to disk, skip re-encoding in subsequent runs
- Resume from checkpoint: Continue training from any saved step
- Real-time text encoding: Re-train with new captions while reusing cached video features
Resume from Checkpoint
RESUME_CHECKPOINT=./outputs/sft_training/step-800.safetensors \
bash run_train_resume.sh
Training with Encoded Cache
# First run: train + save encoded features
ENCODED_CACHE_DIR=./encoded_cache bash run_train.sh
# Subsequent runs: reuse cached features (much faster)
ENCODED_CACHE_DIR=./encoded_cache bash run_train.sh
For detailed training instructions, data preparation, and parameter reference, see training/README.md.
A2V (Action-to-Video)
We release the A2V training and inference code for action-conditioned video generation via VACE parallel context blocks. Given an input image and an action trajectory (end-effector poses), the model generates a physically consistent video of the robot executing the specified actions.
Quick Start: A2V Training
cd training
# Train VACE module on top of SFT DiT
DIT_CHECKPOINT=/path/to/dit_checkpoint.safetensors \
DATASET_BASE_PATH=/path/to/dataset \
DATASET_METADATA_PATH=/path/to/metadata.jsonl \
bash run_train_a2v.sh
Quick Start: A2V Inference
cd inference
# Run A2V inference (checkpoints auto-downloaded from ModelScope)
python inference_a2v.py \
--jsonl_path ./assets/demo_a2v.jsonl \
--output_dir ./outputs/a2v_results
# With trajectory overlay visualization
python inference_a2v.py \
--jsonl_path data.jsonl \
--output_dir ./outputs \
--overlay_action_condition
For detailed documentation, see training/README_A2V.md and inference/README_A2V.md.
DPO Training
We release the DPO (Direct Preference Optimization) training pipeline for physics-aware alignment. Using winner/loser video pairs, the model learns to generate videos that better respect physical laws via LoRA fine-tuning.
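For intuition about how winner/loser pairs drive the update, the sketch below implements the standard DPO objective on scalar log-likelihoods. This is the generic formulation, shown as an assumption: the repository's loss for a diffusion model may use a different likelihood surrogate, and the function name, arguments, and `beta` default are illustrative.

```python
import math


def dpo_loss(policy_logp_win, policy_logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin)).

    The margin measures how much more the trained model prefers the winner
    over the loser than the frozen reference model does.
    """
    margin = (policy_logp_win - ref_logp_win) - (policy_logp_lose - ref_logp_lose)
    x = -beta * margin
    # Numerically stable softplus(x) = log(1 + exp(x)) = -log(sigmoid(-x))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Identical to the reference model: no preference learned yet, loss equals log(2).
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))

# Raising the winner's likelihood and lowering the loser's reduces the loss.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss falls as the model assigns relatively more likelihood to the physically plausible video than the reference does, which is the behavior the preference pairs are meant to encourage.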
Pipeline
- Preprocess: Encode video pairs into cached tensors
- Train: Run DPO LoRA training on cached data
cd training
# Step 1: Preprocess DPO data
DPO_JSONL=/path/to/dpo_pairs.jsonl \
CACHE_DIR=/path/to/dpo_cache \
bash run_preprocess_dpo.sh
# Step 2: Train DPO LoRA
DIT_CHECKPOINT=/path/to/dit_checkpoint.safetensors \
DPO_CACHE_DIR=/path/to/dpo_cache \
bash run_train_dpo.sh
For detailed documentation, see training/README_DPO.md.
Citing
If you find ABot-PhysWorld useful in your research or applications, please consider giving us a star and citing it with the following BibTeX entry:
@article{chen2026abotphysworld,
title={ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment},
author={Yuzhi Chen and Ronghan Chen and Dongjie Huo and Yandan Yang and Dekang Qi and Haoyun Liu and Tong Lin and Shuang Zeng and Junjin Xiao and Xinyuan Chang and Feng Xiong and Xing Wei and Zhiheng Ma and Mu Xu},
journal={arXiv preprint arXiv:2603.23376},
year={2026}
}
Acknowledgement
This project builds upon the following open-source projects. We thank these teams for their contributions:





























