June 9, 2026

Action Images: End-to-End Policy Learning via Multiview Video Generation

arXiv 2026

Haoyu Zhen^*, Zixian Gao^*, Qiao Sun, Yilin Zhao, Yuncong Yang, Yilun Du, Pengsheng Guo, Tsun-Hsuan Wang, Yi-Ling Qiao, Chuang Gan

We propose Action Images, an end-to-end framework for robotic policy learning that takes multi-view images and text instructions to jointly generate RGB videos and action trajectories, enabling direct policy learning through multiview video generation.

Table of Contents

News
Installation
Data Preparation
Training
- Pre-training or Full Fine-tuning
- Configuration
Inference
Citation
Acknowledgement

News

[2026-05-26] We have released the training and inference code, along with the model checkpoint and RLBench dataset on Hugging Face!
[2026-04-06] Action Images is on arXiv!
[2026-04-06] Check out our project website for more demos and results.

Installation

Create a conda environment and install the required packages:

conda create -n actionimages python=3.11
conda activate actionimages

git clone https://github.com/UMass-Embodied-AGI/ActionImages.git
cd ActionImages
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
pip install -e .

Download the base Wan backbone weights from Hugging Face into ./checkpoints/ (required for both training and inference):

hf download Wan-AI/Wan2.2-TI2V-5B --local-dir ./checkpoints/Wan-AI/Wan2.2-TI2V-5B --include "diffusion_pytorch_model*.safetensors" "models_t5_umt5-xxl-enc-bf16*.pth" "Wan*_VAE.pth" "google/*"

Data Preparation

Action Images supports multi-view robotic datasets including RLBench, Bridge, and DROID.

RLBench

Download the processed RLBench data from anyeZHY/ActionImages-RLBench into ./data/rlbench, extract every .tar.gz archive in that folder, then delete the archives. The download step looks like this:

mkdir -p ./data/rlbench
hf download anyeZHY/ActionImages-RLBench --repo-type dataset --local-dir ./data/rlbench
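The download command above does not cover the extract-and-delete step. A minimal Python sketch of that step follows, assuming the archives sit directly under ./data/rlbench and should be unpacked in place; the helper name `extract_and_remove_archives` is hypothetical and not part of this repository.

```python
import tarfile
from pathlib import Path


def extract_and_remove_archives(root: str) -> list[Path]:
    """Extract every *.tar.gz directly under `root` in place, then delete it.

    Assumption: archives are not nested in subfolders and unpack next to
    themselves. Adjust the glob pattern if the dataset layout differs.
    """
    processed = []
    # Use the safer "data" extraction filter where this Python version has it.
    extract_kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    for archive in sorted(Path(root).glob("*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=archive.parent, **extract_kwargs)
        # Reached only if extraction raised no error, so a failed archive is kept.
        archive.unlink()
        processed.append(archive)
    return processed
```

Calling `extract_and_remove_archives("./data/rlbench")` after the download returns the list of archives that were unpacked and removed.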

To preview raw RLBench samples, run python vis/vis_rlbench.py. To add a custom dataset, subclass BaseDataset in training/dataset/base.py. Before training, you can sanity-check the dataloader with:

python training/dataset/test_dataset.py --dataset rlbench --backend torch  # or numpy

Bridge

TODO: Release Bridge preprocessing script to convert raw Bridge data into the layout expected by BridgeMVDataset.

Training

Pre-training or Full Fine-tuning

The training code supports distributed training with multiple GPUs via DeepSpeed ZeRO. Wan backbone weights are downloaded automatically on first run.

To train Action Images, run:

bash scripts/train.sh <num_gpus>

To fine-tune from a released checkpoint, download anyeZHY/ActionImages and pass --init_ckpt_path or --resume_ckpt_path to the training command in scripts/train.sh:

hf download anyeZHY/ActionImages --local-dir ./checkpoints/ActionImages
torchrun ... train.py --init_ckpt_path ./checkpoints/ActionImages/checkpoint.ckpt ...

Configuration

Key training arguments (see training/args.py for the full list):

| Argument | Default | Description |
| --- | --- | --- |
| `--dataset_name` | `rlbench` | Dataset: `rlbench`, `bridge`, or `droid` |
| `--num_frames` | `41` | Number of video frames per sample |
| `--height` / `--width` | `512` | Output resolution |

Multi-dataset co-training is supported via --dataset_name with per-dataset sampling ratios (name@ratio), e.g. rlbench@0.5,bridge@0.3,droid@0.2. A single dataset name defaults to @1.0.
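To make the name@ratio syntax concrete, here is a small Python sketch of how such a string could be parsed and used to pick a dataset per sample. It is an illustration only: `parse_dataset_spec` and `sample_dataset` are hypothetical names, and treating the ratios as weights for a random draw is an assumption, not a description of the parser in training/args.py.

```python
import random


def parse_dataset_spec(spec: str) -> dict[str, float]:
    """Parse 'name@ratio,name@ratio' into {name: ratio}; a bare name gets 1.0."""
    ratios: dict[str, float] = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, ratio = item.partition("@")
        ratios[name.strip()] = float(ratio) if sep else 1.0
    if not ratios:
        raise ValueError("empty dataset spec")
    return ratios


def sample_dataset(ratios: dict[str, float], rng: random.Random) -> str:
    """Pick one dataset name with probability proportional to its ratio (assumed)."""
    names = list(ratios)
    return rng.choices(names, weights=[ratios[n] for n in names], k=1)[0]
```

For example, `parse_dataset_spec("rlbench@0.5,bridge@0.3,droid@0.2")` yields one entry per dataset, while `parse_dataset_spec("rlbench")` maps `rlbench` to 1.0, matching the default described above.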

Edit scripts/train.sh to modify learning rate, batch size, checkpoint frequency, and W&B logging.

Inference

Run multi-view inference with two input images and a text prompt. Outputs are saved under results/ by default.

Image-to-video-action (i2va) — joint RGB video and action generation:

torchrun --nproc_per_node=8 inference.py \
  --images asset/xarm-left.jpg asset/xarm-right.jpg \
  --ckpt_path anyeZHY/ActionImages \
  --prompt "place the black cup in the blue bowl" \
  --task_type i2va \
  --use_usp \
  --num_inference_steps 50 \
  --cfg_parallel \
  --torch_compile \
  --view1_action 350 130 350 120 350 80 1 \
  --view2_action 325 190 375 180 325 100 1

--view1_action / --view2_action format (7 values per view, matching the RGB action image channels):

| Index | Name | Description |
| --- | --- | --- |
| 0–1 | `red x`, `red y` | Gripper position (R channel) |
| 2–3 | `green x`, `green y` | Gripper orientation / normal direction (G channel) |
| 4–5 | `blue x`, `blue y` | Gripper up direction (B channel) |
| 6 | `openness` | `1` = open, `0` = grasp |

Pixel coordinates use the top-left corner of the image as origin (0, 0), with x rightward and y downward.
Provide 7 values (same action repeated for all frames) or 7 × num_frames values (per-frame trajectory).
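The two accepted lengths can be illustrated with a short Python sketch that turns a flat value list into one 7-field record per frame. This is not the code inference.py uses: `ActionPoint` and `expand_view_action` are hypothetical names, and the frame count passed in is whatever the run generates (41 below is only the training default of `--num_frames`, used here as an example).

```python
from typing import NamedTuple, Sequence

VALUES_PER_FRAME = 7


class ActionPoint(NamedTuple):
    red_x: float    # gripper position (R channel)
    red_y: float
    green_x: float  # gripper orientation / normal direction (G channel)
    green_y: float
    blue_x: float   # gripper up direction (B channel)
    blue_y: float
    openness: float  # 1 = open, 0 = grasp


def expand_view_action(values: Sequence[float], num_frames: int) -> list[ActionPoint]:
    """Return one ActionPoint per frame from 7 or 7 * num_frames flat values."""
    if len(values) == VALUES_PER_FRAME:
        # A single action is repeated for every frame.
        return [ActionPoint(*values)] * num_frames
    if len(values) == VALUES_PER_FRAME * num_frames:
        # A per-frame trajectory is split into consecutive groups of 7.
        return [
            ActionPoint(*values[i:i + VALUES_PER_FRAME])
            for i in range(0, len(values), VALUES_PER_FRAME)
        ]
    raise ValueError(
        f"expected {VALUES_PER_FRAME} or {VALUES_PER_FRAME * num_frames} values, "
        f"got {len(values)}"
    )
```

With the `--view1_action` values from the command above, `expand_view_action([350, 130, 350, 120, 350, 80, 1], 41)` gives 41 identical records whose `red_x` is 350 and whose `openness` is 1.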

Optional flags:

--use_usp: Unified Sequence Parallel for multi-GPU inference
--cfg_parallel: Split CFG branches across GPUs
--dynamic_cache_schedule: Faster inference via cache scheduling
--torch_compile: Enable torch.compile for speedup
--task_type: i2v (video only) or i2va (video + action)

Note: Inference uses VGGT to estimate camera poses from the two input images. The model weights are downloaded automatically on first run.

TODO: Release Blender rendering script at inference/render_blender.py to visualize predicted actions / point clouds in a 3D scene.

Citation

If you find our work useful, please consider citing:

@article{zhen2026actionimages,
  title={Action Images: End-to-End Policy Learning via Multiview Video Generation},
  author={Haoyu Zhen and Zixian Gao and Qiao Sun and Yilin Zhao and Yuncong Yang and Yilun Du and Pengsheng Guo and Tsun-Hsuan Wang and Yi-Ling Qiao and Chuang Gan},
  year={2026},
  eprint={2604.06168},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.06168},
}

Acknowledgement

We would like to thank the following works for their code and models:

Video generation: Wan, ReCamMaster, DiffSynth and VideoX-Fun
Camera estimation: VGGT
Datasets: RLBench, Bridge and DROID