3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model
July 6, 2025
This repository contains the PyTorch implementation of 3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model
[arXiv] [model] [dataset]
Overview

Manipulation has long been a challenging task for robots. A major obstacle is the lack of a large, uniform dataset for teaching robots manipulation skills. We observe that understanding how objects should move in 3D space is crucial for guiding manipulation actions, and that this insight applies to both humans and robots. We therefore develop a 3D flow world model that predicts the future movement of interacting objects in 3D space to guide action planning. We also introduce a flow-guided rendering mechanism that predicts the final state and uses GPT-4o to evaluate whether the predicted flow aligns with the task description, enabling closed-loop planning for robots. The predicted 3D optical flow then serves as a set of constraints for an optimization policy that determines the robot's manipulation actions. Extensive experiments show strong generalization across diverse robotic tasks and effective cross-embodiment adaptation without hardware-specific training.
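The closed-loop planning idea described above (predict a flow, render the final state, check it against the task description, retry if needed) can be sketched as a small control loop. This is only an illustration under stated assumptions: `plan_with_verification`, `predict_flow`, `render_final_state`, and `is_consistent` are hypothetical names that do not come from this repository, and the three callables stand in for the world model, the flow-guided renderer, and the GPT-4o check respectively.

```python
from typing import Callable, Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]
# flow[t][k] is the 3D position of tracked point k at future step t (assumed layout).
Flow3D = Sequence[Sequence[Point3D]]


def plan_with_verification(
    predict_flow: Callable[[str, int], Flow3D],
    render_final_state: Callable[[Flow3D], object],
    is_consistent: Callable[[object, str], bool],
    instruction: str,
    max_attempts: int = 3,
) -> Optional[Flow3D]:
    """Return the first predicted flow whose rendered end state passes the check."""
    for attempt in range(max_attempts):
        # Hypothetical stand-in for sampling from the 3D flow world model.
        flow = predict_flow(instruction, attempt)
        # Hypothetical stand-in for flow-guided rendering of the final state.
        final_state = render_final_state(flow)
        # Hypothetical stand-in for the vision-language check (e.g. a GPT-4o query).
        if is_consistent(final_state, instruction):
            return flow
    # No candidate passed the check; the caller decides how to recover.
    return None
```

Passing the three stages as callables keeps the loop independent of any particular model or API client, so each stage can be replaced by a stub when testing the retry logic.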
TODO
- Release moving object detection pipeline for BridgeV2
- Release ManiFlow-110k
- Release model weights of 3D Flow World Model
- Release inference code of 3D Flow World Model
- Release training code of 3D Flow World Model
- Release real-world robot implementation code
Step0: Install environment requirements
Required third-party components: Cotracker3, VideoDepthAnything, GroundingSam2
conda env create -f environment.yaml
Step1: Extract 2D optical flow for the manipulated object (moving object detection pipeline)
# We use BridgeV2 as an example to generate task-related 3D flow
# Source data structure
BridgeV2-Processed
├── depth
│   ├── 0_meter.npz
│   └── 1_meter.npz
├── frames
│   ├── 0.jpg
│   └── 1.jpg
└── instructions.txt
# Process
cd preprocess/BridgeV2
python moving_obj_det_pipeline_all.py
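Before running the pipeline, it can help to confirm that a folder matches the source data structure shown in this step. The sketch below is a hypothetical helper, not part of this repository. It assumes that the folder holds the frames of one episode, that every frame `N.jpg` has a matching `N_meter.npz` depth file, and that file names start with an integer index. It only indexes paths and does not open the `.npz` files, since their internal keys are not documented here.

```python
from pathlib import Path
from typing import Dict, List


def index_episode(root: str) -> Dict[str, object]:
    """Collect frame and depth paths in numeric order and read the instruction text."""
    base = Path(root)
    # Sort by integer index so that 10.jpg comes after 2.jpg, not before it.
    frames: List[Path] = sorted(
        (base / "frames").glob("*.jpg"),
        key=lambda p: int(p.stem),
    )
    depths: List[Path] = sorted(
        (base / "depth").glob("*_meter.npz"),
        key=lambda p: int(p.stem.split("_")[0]),
    )
    instruction = (base / "instructions.txt").read_text(encoding="utf-8").strip()
    # Assumption: one depth file per frame; fail early if the counts disagree.
    if len(frames) != len(depths):
        raise ValueError(
            f"{len(frames)} frames but {len(depths)} depth files in {base}"
        )
    return {"frames": frames, "depths": depths, "instruction": instruction}
```

For example, `index_episode("BridgeV2-Processed")` would return the ordered frame paths, the ordered depth paths, and the instruction string, or raise `ValueError` when the frame and depth counts differ.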
Step2: Use VideoDepthAnything to estimate frame depth and project the 2D flow into 3D space
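One common way to lift a 2D track into 3D is pinhole back-projection with per-pixel depth. The sketch below illustrates that idea only; it is not the projection code used by this repository. It assumes metric depth along the camera z-axis and known intrinsics `fx`, `fy`, `cx`, `cy`, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def backproject(u: float, v: float, depth_m: float,
                fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Map pixel (u, v) with depth in meters to camera-frame coordinates."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def lift_track_to_3d(track_2d: Sequence[Point2D], depths_m: Sequence[float],
                     fx: float, fy: float, cx: float, cy: float) -> List[Point3D]:
    """Back-project one tracked point over time, given its depth at each frame."""
    if len(track_2d) != len(depths_m):
        raise ValueError("track_2d and depths_m must have the same length")
    return [backproject(u, v, d, fx, fy, cx, cy)
            for (u, v), d in zip(track_2d, depths_m)]


def displacements(points_3d: Sequence[Point3D]) -> List[Point3D]:
    """3D displacement between each pair of consecutive positions."""
    return [(b[0] - a[0], b[1] - a[1], b[2] - a[2])
            for a, b in zip(points_3d, points_3d[1:])]
```

In this formulation a pixel at the principal point `(cx, cy)` maps to a point on the optical axis, and the 3D flow of a tracked point is the sequence of displacements between its back-projected positions.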
Step3: Prepare 3D optical flow for training
bash run_scripts/preprocess_bridge_dataset.sh
Step4: Training
run_scripts/train_flow_3d_bridge_wovae_slurm.sh
Step5: Visualize evaluation results
python scripts/flow_generation/viz_3d_flow_batch.py
Step6: Inference using released checkpoints
# Put the released checkpoint in results/release/checkpoints/epoch_400
bash run_scripts/inference.sh