May 27, 2026
Action Images: End-to-End Policy Learning via Multiview Video Generation
arXiv 2026
Haoyu Zhen*, Zixian Gao*, Qiao Sun, Yilin Zhao, Yuncong Yang, Yilun Du, Pengsheng Guo, Tsun-Hsuan Wang, Yi-Ling Qiao, Chuang Gan
We propose Action Images, an end-to-end framework for robotic policy learning that takes multi-view images and text instructions to jointly generate RGB videos and action trajectories, enabling direct policy learning through multiview video generation.
News
- [2026-05-26] We have released the training and inference code, along with the model checkpoint and RLBench dataset on Hugging Face!
- [2026-04-06] Action Images is on arXiv!
- [2026-04-06] Check out our project website for more demos and results.
Installation
Create a conda environment and install the required packages:
conda create -n actionimages python=3.11
conda activate actionimages
git clone https://github.com/UMass-Embodied-AGI/ActionImages.git
cd ActionImages
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
pip install -e .
Data Preparation
Action Images supports multi-view robotic datasets including RLBench, Bridge, and DROID.
RLBench
Download the processed RLBench data from anyeZHY/ActionImages-RLBench into ./data/rlbench, unzip every .tar.gz in that folder, then delete the archives. Example:
mkdir -p ./data/rlbench
hf download anyeZHY/ActionImages-RLBench --repo-type dataset --local-dir ./data/rlbench
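The download command above fetches the archives but does not unpack them. The remaining step (extract every .tar.gz, then delete it) could be sketched in Python as follows. The helper name `extract_and_remove` is hypothetical and not part of this repository, and the sketch assumes the archives sit directly in ./data/rlbench and should be unpacked into that same folder.

```python
import tarfile
from pathlib import Path


def extract_and_remove(data_dir: str) -> list[str]:
    """Extract every .tar.gz directly under data_dir, then delete the archive.

    Hypothetical helper (not part of the ActionImages repo). Assumes each
    archive should be unpacked into data_dir itself.
    """
    root = Path(data_dir)
    # Use the safer "data" extraction filter when this Python version has it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    processed = []
    for archive in sorted(root.glob("*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=root, **kwargs)
        archive.unlink()  # remove the archive only after extraction succeeded
        processed.append(archive.name)
    return processed
```

Calling `extract_and_remove("./data/rlbench")` returns the names of the archives it processed, which makes it easy to confirm that nothing was skipped.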
To preview raw RLBench samples, run python vis/vis_rlbench.py. To add a custom dataset, subclass BaseDataset in training/dataset/base.py. Before training, you can sanity-check the dataloader with:
python training/dataset/test_dataset.py --dataset rlbench --backend torch # or numpy
Bridge
TODO: Release Bridge preprocessing script to convert raw Bridge data into the layout expected by
BridgeMVDataset.
Training
Pre-training or Full Fine-tuning
The training code supports distributed training with multiple GPUs via DeepSpeed ZeRO. Wan backbone weights are downloaded automatically on first run.
To train Action Images, run:
bash scripts/train.sh <num_gpus>
To fine-tune from a released checkpoint, download anyeZHY/ActionImages and add --init_ckpt_path or --resume_ckpt_path in scripts/train.sh:
hf download anyeZHY/ActionImages --local-dir ./checkpoints/ActionImages
torchrun ... train.py --init_ckpt_path ./checkpoints/ActionImages/checkpoint.ckpt ...
Configuration
Key training arguments (see training/args.py for the full list):
| Argument | Default | Description |
|---|---|---|
| --dataset_name | rlbench | Dataset: rlbench, bridge, or droid |
| --num_frames | 41 | Number of video frames per sample |
| --height / --width | 512 | Output resolution |
Multi-dataset co-training is supported via --dataset_name with per-dataset sampling ratios (name@ratio), e.g. rlbench@0.5,bridge@0.3,droid@0.2. A single dataset name defaults to @1.0.
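To make the `name@ratio` syntax concrete, here is a minimal Python sketch of how such a string could be parsed and used to pick a dataset for each sample. Both `parse_dataset_spec` and `sample_dataset` are hypothetical names for illustration; the actual parsing in training/args.py may differ, and treating the ratios as relative sampling weights is an assumption.

```python
import random


def parse_dataset_spec(spec: str) -> dict[str, float]:
    """Parse 'name@ratio,name@ratio' into {name: ratio}.

    Hypothetical helper, not the repository's parser. An entry without
    '@ratio' is given a ratio of 1.0.
    """
    ratios = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, ratio = item.partition("@")
        ratios[name] = float(ratio) if sep else 1.0
    if not ratios:
        raise ValueError("empty dataset spec")
    return ratios


def sample_dataset(ratios: dict[str, float], rng: random.Random) -> str:
    """Pick one dataset name, using the ratios as relative weights (assumed)."""
    names = list(ratios)
    weights = [ratios[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


ratios = parse_dataset_spec("rlbench@0.5,bridge@0.3,droid@0.2")
picked = sample_dataset(ratios, random.Random(0))
```

Under this reading, roughly half of the samples in a long run would come from rlbench, with the rest split between bridge and droid.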
Edit scripts/train.sh to modify learning rate, batch size, checkpoint frequency, and W&B logging.
Inference
Run multi-view inference with two input images and a text prompt. Outputs are saved under results/ by default.
Image-to-video-action (i2va) — joint RGB video and action generation:
torchrun --nproc_per_node=8 inference.py \
--images asset/xarm-left.jpg asset/xarm-right.jpg \
--ckpt_path anyeZHY/ActionImages \
--prompt "place the black cup in the blue bowl" \
--task_type i2va \
--use_usp \
--num_inference_steps 50 \
--cfg_parallel \
--torch_compile \
--view1_action 350 130 350 120 350 80 1 \
--view2_action 325 190 375 180 325 100 1
--view1_action / --view2_action format (7 values per view, matching the RGB action image channels):
| Index | Name | Description |
|---|---|---|
| 0–1 | red x, red y | Gripper position (R channel) |
| 2–3 | green x, green y | Gripper orientation / normal direction (G channel) |
| 4–5 | blue x, blue y | Gripper up direction (B channel) |
| 6 | openness | 1 = open, 0 = grasp |
- Pixel coordinates use the top-left corner of the image as origin (0, 0), with x rightward and y downward.
- Provide 7 values (same action repeated for all frames) or 7 × num_frames values (per-frame trajectory).
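The two accepted lengths can be illustrated with a short Python sketch that turns a flat list of values into one 7-value action per frame. `expand_action` and `describe_action` are hypothetical helpers, not the repository's own argument handling, and the frame count of 41 in the usage lines is borrowed from the training default as an assumption.

```python
VALUES_PER_ACTION = 7  # red x/y, green x/y, blue x/y, openness


def expand_action(values, num_frames):
    """Return a list of num_frames actions, each a list of 7 values.

    Hypothetical helper. Accepts either 7 values (repeated for every frame)
    or 7 * num_frames values (one action per frame).
    """
    values = list(values)
    if len(values) == VALUES_PER_ACTION:
        return [values[:] for _ in range(num_frames)]
    if len(values) == VALUES_PER_ACTION * num_frames:
        return [
            values[i:i + VALUES_PER_ACTION]
            for i in range(0, len(values), VALUES_PER_ACTION)
        ]
    raise ValueError(
        f"expected {VALUES_PER_ACTION} or {VALUES_PER_ACTION * num_frames} "
        f"values, got {len(values)}"
    )


def describe_action(action):
    """Name the fields of one action, following the table above."""
    return {
        "position": (action[0], action[1]),  # R channel
        "normal": (action[2], action[3]),    # G channel
        "up": (action[4], action[5]),        # B channel
        "open": action[6] == 1,              # 1 = open, 0 = grasp
    }


# The --view1_action values from the example command, repeated over 41 frames.
frames = expand_action([350, 130, 350, 120, 350, 80, 1], num_frames=41)
first = describe_action(frames[0])
```

Any other length raises a `ValueError`, which catches a trajectory that is one value short before it reaches the model.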
Optional flags:
- --use_usp: Unified Sequence Parallel for multi-GPU inference
- --cfg_parallel: Split CFG branches across GPUs
- --dynamic_cache_schedule: Faster inference via cache scheduling
- --torch_compile: Enable torch.compile for speedup
- --task_type: i2v (video only) or i2va (video + action)
Note
Inference uses VGGT to estimate camera poses from the two input images. The model weights are downloaded automatically on first run.
TODO: Release Blender rendering script at inference/render_blender.py to visualize predicted actions / point clouds in a 3D scene.
Citation
If you find our work useful, please consider citing:
@article{zhen2026actionimages,
title={Action Images: End-to-End Policy Learning via Multiview Video Generation},
author={Haoyu Zhen and Zixian Gao and Qiao Sun and Yilin Zhao and Yuncong Yang and Yilun Du and Pengsheng Guo and Tsun-Hsuan Wang and Yi-Ling Qiao and Chuang Gan},
year={2026},
eprint={2604.06168},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2604.06168},
}
Acknowledgement
We would like to thank the following works for their code and models:
- Video generation: Wan, ReCamMaster, DiffSynth and VideoX-Fun
- Camera estimation: VGGT
- Datasets: RLBench, Bridge and DROID