
January 19, 2026

[RA-L 2026] FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation

Jun Guo*1,2, Xiaojian Ma*†1, Yikai Wang*3, Min Yang1,4, Huaping Liu†2, Qing Li†1
*Equal contribution. †Corresponding author.
1State Key Laboratory of General Artificial Intelligence (BIGAI),
2Department of Computer Science and Technology, Tsinghua University,
3School of Artificial Intelligence, Beijing Normal University,
4School of Computer Science and Technology, University of Science and Technology of China


This repository is the official implementation of the IEEE RA-L 2026 paper "FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation".

Overview

Installation

The code has been tested on Ubuntu 22.04, Python 3.12, PyTorch 2.5.1 with CUDA 12.4.

# Example installation with Anaconda. You can skip these steps and use your own environment.
conda create -n flowdm python=3.12
conda activate flowdm
conda install cuda -c nvidia/label/cuda-12.4

# Install PyTorch and xformers. You can use other versions, but the PyTorch and xformers versions must be compatible with each other.
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install -U xformers==0.0.29.post1 --index-url https://download.pytorch.org/whl/cu124

# Install other dependencies.
pip install -r requirements.txt
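
Once the dependencies are installed, it can help to confirm that the pinned versions actually ended up in the environment. The snippet below is a minimal sketch and is not part of this repository: `find_mismatches` is a hypothetical helper that compares installed package versions, read through the standard-library `importlib.metadata`, against the pins used in the commands above.

```python
from importlib import metadata

# Version pins copied from the install commands above.
EXPECTED = {
    "torch": "2.5.1",
    "torchvision": "0.20.1",
    "torchaudio": "2.5.1",
    "xformers": "0.0.29.post1",
}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that is not satisfied.

    `found` is None when the package is not installed. Local build tags such
    as "+cu124" are ignored, so "2.5.1+cu124" satisfies the pin "2.5.1".
    """
    mismatches = {}
    for name, want in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found is None or found.split("+")[0] != want:
            mismatches[name] = (want, found)
    return mismatches


if __name__ == "__main__":
    problems = find_mismatches(EXPECTED)
    for name, (want, found) in problems.items():
        print(f"{name}: expected {want}, found {found}")
    if not problems:
        print("All pinned packages match.")
```

If you deliberately install different versions, edit `EXPECTED` to match your own pins before running the check.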

Models

FlowDreamer is trained starting from Stable Diffusion 2.1 Base. Download this model and set --pretrained_path to the directory that contains it.

Notice: The original repository released by StabilityAI (stabilityai/stable-diffusion-2-1-base) was deprecated and deleted by the StabilityAI team in November 2025. As an alternative, you can download the model from the backup repository.
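
After downloading, a quick check that the directory is complete can save a failed launch. The sketch below assumes, without confirmation from this repository, that --pretrained_path expects the standard diffusers layout of a Stable Diffusion checkpoint; `missing_entries` is a hypothetical helper, not part of the codebase.

```python
from pathlib import Path

# Entries found in a diffusers-format Stable Diffusion checkpoint.
# Assumption: --pretrained_path expects this layout.
REQUIRED_ENTRIES = (
    "model_index.json",
    "scheduler",
    "text_encoder",
    "tokenizer",
    "unet",
    "vae",
)


def missing_entries(pretrained_path):
    """List the required files/sub-directories absent from `pretrained_path`."""
    root = Path(pretrained_path)
    return [name for name in REQUIRED_ENTRIES if not (root / name).exists()]
```

An empty list means every expected entry is present. It does not verify that the weight files themselves are intact.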

FlowDreamer needs a metric depth estimation model to perform autoregressive inference. We use the metric depth version of Depth Anything V2 and finetune it on our training set.

We also provide some datasets and checkpoints used in our experiments.

More resources will be released as soon as possible.

Data Preparation

The structure of our dataset is as follows:

dataset_root
├── test
│   └── 034000
│       ├── annotation.json
│       ├── depth.tiff
│       ├── flow.tiff
│       └── rgb.mp4
├── train
└── val
  • RGB frames are saved in .mp4 format.
  • Depth maps are saved in uint16 .tiff format.
  • 3D scene flows are saved in float16 .tiff format.
  • Robot actions, camera intrinsics and extrinsics are saved in .json format.
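
Before training, it may be worth checking that every episode directory contains the four files listed above. The following is a minimal sketch using only the standard library. `find_incomplete_episodes` is a hypothetical helper, and it assumes that the train and val splits use the same per-episode layout as the test split shown in the tree.

```python
from pathlib import Path

# File names taken from the directory tree above.
EPISODE_FILES = ("annotation.json", "depth.tiff", "flow.tiff", "rgb.mp4")


def find_incomplete_episodes(dataset_root, split):
    """Map each episode directory name in `split` to the files it lacks.

    Episodes that contain all four expected files are omitted from the result.
    """
    split_dir = Path(dataset_root) / split
    incomplete = {}
    for episode in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        missing = [f for f in EPISODE_FILES if not (episode / f).is_file()]
        if missing:
            incomplete[episode.name] = missing
    return incomplete
```

Calling `find_incomplete_episodes("/PATH/TO/YOUR/DATASET/", "train")` returns an empty dictionary when the split is complete. The check only looks at file names, not at the contents or data types of the files.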

The detailed dataset information used in our paper is listed in the following table:

| Dataset name | Height | Width | Action dim |
|---|---|---|---|
| RT-1 Simpler | 256 | 320 | 7 |
| Language Table | 288 | 512 | 2 |
| VP² RoboDesk | 320 | 320 | 5 |
| VP² Robosuite | 256 | 256 | 4 |
Usage

To train FlowDreamer, run:

torchrun --nproc_per_node=8 main.py --dataset_dir /PATH/TO/YOUR/DATASET/ \
  --pretrained_path /PATH/TO/YOUR/SD21/ \
  --depth_est_path /PATH/TO/YOUR/DEPTH_ANYTHING_V2/

To evaluate FlowDreamer, run:

python main.py --dataset_dir /PATH/TO/YOUR/DATASET/ \
  --pretrained_path /PATH/TO/YOUR/SD21/ \
  --depth_est_path /PATH/TO/YOUR/DEPTH_ANYTHING_V2/ \
  --evaluate --eval_length EVAL_LENGTH \
  --ckpt_path /PATH/TO/YOUR/TRAINED_CHECKPOINTS.ckpt
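
When launching many runs from a script, the two commands above can be assembled programmatically. This is a minimal sketch, not part of the repository: `build_command` is a hypothetical helper that uses only the flags shown above and returns an argument list suitable for `subprocess.run`.

```python
import shlex


def build_command(dataset_dir, pretrained_path, depth_est_path,
                  ckpt_path=None, eval_length=None, nproc_per_node=8):
    """Assemble the training command, or the evaluation command when
    `ckpt_path` and `eval_length` are both given."""
    common = [
        "main.py",
        "--dataset_dir", dataset_dir,
        "--pretrained_path", pretrained_path,
        "--depth_est_path", depth_est_path,
    ]
    if ckpt_path is None and eval_length is None:
        return ["torchrun", f"--nproc_per_node={nproc_per_node}", *common]
    if ckpt_path is None or eval_length is None:
        raise ValueError("evaluation needs both ckpt_path and eval_length")
    return ["python", *common,
            "--evaluate", "--eval_length", str(eval_length),
            "--ckpt_path", ckpt_path]


if __name__ == "__main__":
    cmd = build_command("/PATH/TO/YOUR/DATASET/", "/PATH/TO/YOUR/SD21/",
                        "/PATH/TO/YOUR/DEPTH_ANYTHING_V2/")
    # Print the command; pass `cmd` to subprocess.run(cmd, check=True) to launch it.
    print(shlex.join(cmd))
```

Returning a list rather than a single string avoids shell quoting problems when paths contain spaces.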

Acknowledgement

The training code is mainly based on huggingface/diffusers.

The depth estimator code is based on DepthAnything/Depth-Anything-V2, and we use the metric_depth version.

The FID calculation code is based on mseitzer/pytorch-fid, and the FVD calculation code is based on universome/stylegan-v.

Citation

If you find this project useful, please cite our paper as:

@article{guo2026flowdreamer,
  title={FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation},
  author={Guo, Jun and Ma, Xiaojian and Wang, Yikai and Yang, Min and Liu, Huaping and Li, Qing},
  journal={IEEE Robotics and Automation Letters},
  year={2026},
}