
May 27, 2026

Action Images: End-to-End Policy Learning via Multiview Video Generation

arXiv 2026

Haoyu Zhen*, Zixian Gao*, Qiao Sun, Yilin Zhao, Yuncong Yang, Yilun Du, Pengsheng Guo, Tsun-Hsuan Wang, Yi-Ling Qiao, Chuang Gan

Paper PDF · Project Page · Model (Hugging Face) · Dataset (RLBench)

We propose Action Images, an end-to-end framework for robotic policy learning that takes multi-view images and a text instruction as input and jointly generates RGB videos and action trajectories, enabling direct policy learning through multi-view video generation.


Table of Contents
  1. News
  2. Installation
  3. Data Preparation
  4. Training
  5. Inference
  6. Citation
  7. Acknowledgement

News

Installation

Create a conda environment and install the required packages:

conda create -n actionimages python=3.11
conda activate actionimages

git clone https://github.com/UMass-Embodied-AGI/ActionImages.git
cd ActionImages
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
pip install -e .

Data Preparation

Action Images supports multi-view robotic datasets including RLBench, Bridge, and DROID.

RLBench

Download the processed RLBench data from anyeZHY/ActionImages-RLBench into ./data/rlbench, extract every .tar.gz archive in that folder, then delete the archives. To download:

mkdir -p ./data/rlbench
hf download anyeZHY/ActionImages-RLBench --repo-type dataset --local-dir ./data/rlbench
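
The download command does not cover the extract-and-delete step. Below is a minimal Python sketch of that step, assuming the archives are ordinary gzip-compressed tarballs that should be unpacked next to where they sit. The helper name extract_and_remove_archives is illustrative and not part of the repository.

```python
import tarfile
from pathlib import Path


def extract_and_remove_archives(root):
    """Extract every .tar.gz found under `root` in place, then delete the archive.

    Returns the names of the archives that were processed.
    """
    processed = []
    # sorted() materializes the listing before any file is deleted.
    for archive in sorted(Path(root).rglob("*.tar.gz")):
        # Use the safe "data" extraction filter when this Python provides it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=archive.parent, **kwargs)
        archive.unlink()  # remove the archive only after a successful extract
        processed.append(archive.name)
    return processed
```

Calling extract_and_remove_archives("./data/rlbench") after the download finishes would unpack the data and leave no archives behind. An archive that fails to extract raises an exception and is not deleted.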

To preview raw RLBench samples, run python vis/vis_rlbench.py. To add a custom dataset, subclass BaseDataset in training/dataset/base.py. Before training, you can sanity-check the dataloader with:

python training/dataset/test_dataset.py --dataset rlbench --backend torch  # or numpy

Bridge

TODO: Release Bridge preprocessing script to convert raw Bridge data into the layout expected by BridgeMVDataset.

Training

Pre-training or Full Fine-tuning

The training code supports distributed training with multiple GPUs via DeepSpeed ZeRO. Wan backbone weights are downloaded automatically on first run.

To train Action Images, run:

bash scripts/train.sh <num_gpus>

To fine-tune from a released checkpoint, download anyeZHY/ActionImages and add --init_ckpt_path or --resume_ckpt_path in scripts/train.sh:

hf download anyeZHY/ActionImages --local-dir ./checkpoints/ActionImages
torchrun ... train.py --init_ckpt_path ./checkpoints/ActionImages/checkpoint.ckpt ...

Configuration

Key training arguments (see training/args.py for the full list):

| Argument | Default | Description |
| --- | --- | --- |
| --dataset_name | rlbench | Dataset: rlbench, bridge, or droid |
| --num_frames | 41 | Number of video frames per sample |
| --height / --width | 512 | Output resolution |

Multi-dataset co-training is supported via --dataset_name with per-dataset sampling ratios (name@ratio), e.g. rlbench@0.5,bridge@0.3,droid@0.2. A single dataset name defaults to @1.0.
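
To illustrate the name@ratio syntax, here is a minimal sketch of how such a string could be parsed and used to pick a dataset for each sample. It is not the repository's parser: the function names are hypothetical, and normalizing the ratios into sampling weights is an assumption about how they are used.

```python
import random


def parse_dataset_spec(spec):
    """Parse 'name@ratio,name@ratio' into {name: ratio}; a bare name gets 1.0."""
    ratios = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, ratio = item.partition("@")
        ratios[name] = float(ratio) if sep else 1.0
    return ratios


def pick_dataset(ratios, rng):
    """Draw one dataset name with probability proportional to its ratio (assumed)."""
    names = list(ratios)
    weights = [ratios[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]


ratios = parse_dataset_spec("rlbench@0.5,bridge@0.3,droid@0.2")
print(ratios)  # {'rlbench': 0.5, 'bridge': 0.3, 'droid': 0.2}
print(parse_dataset_spec("rlbench"))  # {'rlbench': 1.0}
print(pick_dataset(ratios, random.Random(0)) in ratios)  # True
```

Because random.choices only needs relative weights, the ratios in this sketch do not have to sum to 1. Whether the actual training code requires that is not stated here.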

Edit scripts/train.sh to modify learning rate, batch size, checkpoint frequency, and W&B logging.

Inference

Run multi-view inference with two input images and a text prompt. Outputs are saved under results/ by default.

Image-to-video-action (i2va) — joint RGB video and action generation:

torchrun --nproc_per_node=8 inference.py \
  --images asset/xarm-left.jpg asset/xarm-right.jpg \
  --ckpt_path anyeZHY/ActionImages \
  --prompt "place the black cup in the blue bowl" \
  --task_type i2va \
  --use_usp \
  --num_inference_steps 50 \
  --cfg_parallel \
  --torch_compile \
  --view1_action 350 130 350 120 350 80 1 \
  --view2_action 325 190 375 180 325 100 1

--view1_action / --view2_action format (7 values per view, matching the RGB action image channels):

| Index | Name | Description |
| --- | --- | --- |
| 0–1 | red x, red y | Gripper position (R channel) |
| 2–3 | green x, green y | Gripper orientation / normal direction (G channel) |
| 4–5 | blue x, blue y | Gripper up direction (B channel) |
| 6 | openness | 1 = open, 0 = grasp |
  • Pixel coordinates use the top-left corner of the image as origin (0, 0), with x rightward and y downward.
  • Provide 7 values (same action repeated for all frames) or 7 × num_frames values (per-frame trajectory).
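
The two accepted lengths can be sketched as follows. This is an illustration rather than the code in inference.py: expand_view_action is a hypothetical helper, and it assumes a per-frame trajectory is laid out frame by frame (all 7 values of frame 0, then frame 1, and so on).

```python
VALUES_PER_ACTION = 7  # red x, red y, green x, green y, blue x, blue y, openness


def expand_view_action(values, num_frames):
    """Return one 7-tuple per frame from 7 values or 7 * num_frames values."""
    values = [float(v) for v in values]
    if len(values) == VALUES_PER_ACTION:
        # A single action is repeated for every frame.
        return [tuple(values)] * num_frames
    if len(values) == VALUES_PER_ACTION * num_frames:
        # Assumed frame-major layout: consecutive groups of 7 values.
        return [
            tuple(values[i:i + VALUES_PER_ACTION])
            for i in range(0, len(values), VALUES_PER_ACTION)
        ]
    raise ValueError(
        f"expected {VALUES_PER_ACTION} or {VALUES_PER_ACTION * num_frames} "
        f"values, got {len(values)}"
    )


# The --view1_action values from the example command, repeated over 41 frames.
frames = expand_view_action([350, 130, 350, 120, 350, 80, 1], num_frames=41)
print(len(frames))  # 41
print(frames[0])    # (350.0, 130.0, 350.0, 120.0, 350.0, 80.0, 1.0)
```

Any other length raises a ValueError in this sketch, which catches a trajectory whose frame count does not match the requested number of frames.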

Optional flags:

  • --use_usp: Unified Sequence Parallel for multi-GPU inference
  • --cfg_parallel: Split CFG branches across GPUs
  • --dynamic_cache_schedule: Faster inference via cache scheduling
  • --torch_compile: Enable torch.compile for speedup
  • --task_type: i2v (video only) or i2va (video + action)

Note

Inference uses VGGT to estimate camera poses from the two input images. The model weights are downloaded automatically on first run.

TODO: Release Blender rendering script at inference/render_blender.py to visualize predicted actions / point clouds in a 3D scene.

Citation

If you find our work useful, please consider citing:

@article{zhen2026actionimages,
  title={Action Images: End-to-End Policy Learning via Multiview Video Generation},
  author={Haoyu Zhen and Zixian Gao and Qiao Sun and Yilin Zhao and Yuncong Yang and Yilun Du and Pengsheng Guo and Tsun-Hsuan Wang and Yi-Ling Qiao and Chuang Gan},
  year={2026},
  eprint={2604.06168},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.06168},
}

Acknowledgement

We would like to thank the following works for their code and models: