3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model

July 6, 2025

This repository contains the PyTorch implementation of 3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model.

[📖 arXiv] [🤖 model] [📑 dataset]

Overview

Manipulation has long been a challenging task for robots; a major obstacle is the lack of a large, uniform dataset for teaching robots manipulation skills. We observe that understanding how objects should move in 3D space is crucial for guiding manipulation actions, and that this insight applies to both humans and robots. We aim to develop a 3D flow world model that predicts the future movement of interacting objects in 3D space to guide action planning. We also introduce a flow-guided rendering mechanism that predicts the final state and uses GPT-4o to evaluate whether the predicted flow aligns with the task description, enabling closed-loop planning for robots. The predicted 3D optical flow serves as a set of constraints for an optimization policy that determines the robot's manipulation actions. Extensive experiments show strong generalization across diverse robotic tasks and effective cross-embodiment adaptation without hardware-specific training.
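The closed-loop idea described above can be illustrated with a minimal sketch. It is not the repository's implementation: the three callables (`predict_flow`, `render_final_state`, `matches_instruction`) are hypothetical stand-ins for the 3D flow world model, the flow-guided rendering step, and the GPT-4o check, and the retry budget `max_attempts` is an assumption made for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple

# One (dx, dy, dz) displacement per tracked object point (assumed representation).
Flow3D = Sequence[Tuple[float, float, float]]


def plan_with_verification(
    predict_flow: Callable[[str, int], Flow3D],
    render_final_state: Callable[[Flow3D], object],
    matches_instruction: Callable[[object, str], bool],
    instruction: str,
    max_attempts: int = 3,
) -> Optional[Flow3D]:
    """Return the first predicted flow whose rendered outcome passes the check.

    All three callables are hypothetical placeholders supplied by the caller.
    """
    for attempt in range(max_attempts):
        flow = predict_flow(instruction, attempt)          # stand-in for the world model
        final_state = render_final_state(flow)             # stand-in for flow-guided rendering
        if matches_instruction(final_state, instruction):  # stand-in for the GPT-4o check
            return flow
    # No candidate passed verification within the retry budget.
    return None
```

A flow accepted by this loop would then be handed to the optimization policy as constraints; returning `None` leaves it to the caller to decide how to handle a failed plan.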

TODO

Release moving object detection pipeline for BridgeV2
Release ManiFlow-110k
Release model weights of 3D Flow World Model
Release inference code of 3D Flow World Model
Release training code of 3D Flow World Model
Release real-world robot implementation code

Step0: Install environment requirements

Cotracker3, VideoDepthAnything, GroundingSam2

conda env create -f environment.yaml

Step1: Extract 2D optical flow for the manipulated object (moving object detection pipeline)

# We use BridgeV2 as an example to generate task-related 3D flow
# Source data structure
BridgeV2-Processed
├── depth
│   ├── 0_meter.npz
│   ├── 1_meter.npz
├── frames
│   ├── 0.jpg
│   ├── 1.jpg
├── instructions.txt

# Process
cd preprocess/BridgeV2
python moving_obj_det_pipeline_all.py
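Before running the pipeline, it can help to confirm that the data matches the layout shown above. The following is a minimal sketch, not part of the repository: `check_episode_layout` is a hypothetical helper, and it assumes that the tree describes a single directory and that each `frames/N.jpg` is paired with `depth/N_meter.npz`, as the example file names suggest.

```python
from pathlib import Path
from typing import List


def check_episode_layout(root: str) -> List[str]:
    """Return a list of problems found in one processed directory.

    Assumes (from the example tree) that frames/N.jpg pairs with depth/N_meter.npz.
    """
    root_path = Path(root)
    problems: List[str] = []

    if not (root_path / "instructions.txt").is_file():
        problems.append("missing instructions.txt")

    frames = sorted((root_path / "frames").glob("*.jpg"))
    if not frames:
        problems.append("no frames/*.jpg found")

    # Every frame should have a matching depth file.
    for frame in frames:
        depth = root_path / "depth" / f"{frame.stem}_meter.npz"
        if not depth.is_file():
            problems.append(f"missing depth/{depth.name} for frames/{frame.name}")

    return problems
```

An empty list means the directory matches the assumed layout; otherwise each entry names one missing file.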

Step2: Use VideoDepthAnything to estimate frame depth and project the 2D flow to 3D space
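Lifting a 2D track to 3D is commonly done by back-projecting each tracked pixel through a pinhole camera model using its estimated depth. The sketch below illustrates that standard technique only; the repository's actual projection code may differ. The function names and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions introduced here, and depth is assumed to be in meters.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def backproject(u: float, v: float, depth_m: float,
                fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Back-project pixel (u, v) with depth in meters into camera coordinates."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def lift_track_to_3d_flow(track_2d: Sequence[Tuple[float, float]],
                          depths_m: Sequence[float],
                          fx: float, fy: float, cx: float, cy: float) -> List[Point3D]:
    """Convert one point's 2D track into per-step 3D displacements.

    track_2d: (u, v) pixel position of the point in each frame.
    depths_m: depth sampled at that pixel in each frame (assumed metric).
    """
    points = [backproject(u, v, z, fx, fy, cx, cy)
              for (u, v), z in zip(track_2d, depths_m)]
    # 3D flow is the difference between consecutive 3D positions.
    return [(b[0] - a[0], b[1] - a[1], b[2] - a[2])
            for a, b in zip(points, points[1:])]
```

For example, with `fx = fy = 100` and `cx = cy = 50`, a point that moves from pixel `(50, 50)` at depth 1 m to pixel `(150, 50)` at depth 2 m has a 3D displacement of `(2.0, 0.0, 1.0)` meters.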

Step3: Prepare 3D optical flow for training

bash run_scripts/preprocess_bridge_dataset.sh

Step4: Training

run_scripts/train_flow_3d_bridge_wovae_slurm.sh

Step5: Visualize evaluation results

python scripts/flow_generation/viz_3d_flow_batch.py

Step6: Inference using released checkpoints

# Put the released checkpoint in results/release/checkpoints/epoch_400
bash run_scripts/inference.sh
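A quick pre-flight check can catch a misplaced checkpoint before inference starts. This is a small sketch under assumptions: `checkpoint_ready` is a hypothetical helper, and since it is not stated here whether `epoch_400` is a file or a directory, the check only tests that the path exists relative to the current working directory.

```python
from pathlib import Path


def checkpoint_ready(path: str = "results/release/checkpoints/epoch_400") -> bool:
    """Return True if the expected checkpoint path exists (file or directory)."""
    return Path(path).exists()


if __name__ == "__main__":
    if not checkpoint_ready():
        raise SystemExit("Checkpoint not found at results/release/checkpoints/epoch_400")
```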