Awesome World Models for Robotic Policy Learning

May 16, 2026

arXiv Hugging Face Website GitHub

Bohan Hou1,*,†, Gen Li1,*, Jindou Jia1,*, Tuo An1,*, Xinying Guo1,*, Sicong Leng1,
Haoran Geng2, Yanjie Ze3, Tatsuya Harada4, Philip Torr5, Oier Mees6, Marc Pollefeys7,
Zhuang Liu8, Jiajun Wu3, Pieter Abbeel2, Jitendra Malik2, Yilun Du9, Jianfei Yang1,†

1Nanyang Technological University, 2University of California, Berkeley, 3Stanford University,
4The University of Tokyo, 5University of Oxford, 6Microsoft, 7ETH Zurich,
8Princeton University, 9Harvard University
*Equal Contribution (alphabetical order), †Corresponding Author

This repository accompanies our survey World Model for Robot Learning: A Comprehensive Survey — a policy-centric survey of predictive world models for robot policy learning, planning, simulation, evaluation, data generation, and robotic video generation.

  • 📄 We maintain a curated list of papers, code, websites, models, benchmarks, and datasets on world models for robotic policy learning.
  • 🤖 The list is organized around world models as policies, simulators, video-generation backbones, benchmarks, and datasets.
  • 🤝 If you find missing papers, outdated links, or incorrect metadata, please feel free to open an issue or submit a pull request!


World Model as Policy

World models (video generation models, unified models) used as backbones or components for improving Vision-Language-Action (VLA) policies. Entries are organized by architectural paradigm, following the taxonomy in our survey.

IDM-style Policies

Inverse Dynamics Policies: first predict future visual trajectories, then use an inverse dynamics model to recover actions. Decoupled predict-then-act pipeline.
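The predict-then-act pipeline can be illustrated with a toy sketch. Everything here is an assumption for illustration: observations are 2D points rather than images, `predict_future` stands in for a video generation model by linear interpolation, and `inverse_dynamics` stands in for a learned IDM. None of these names come from the papers listed below.

```python
from typing import List, Tuple

State = Tuple[float, float]  # toy stand-in for an image observation


def predict_future(obs: State, goal: State, horizon: int) -> List[State]:
    """Hypothetical 'video model': interpolate from obs to goal in `horizon` frames."""
    return [
        (
            obs[0] + (goal[0] - obs[0]) * t / horizon,
            obs[1] + (goal[1] - obs[1]) * t / horizon,
        )
        for t in range(1, horizon + 1)
    ]


def inverse_dynamics(s: State, s_next: State) -> State:
    """Hypothetical IDM: in this toy world the action is the displacement."""
    return (s_next[0] - s[0], s_next[1] - s[1])


def idm_policy(obs: State, goal: State, horizon: int = 4) -> List[State]:
    """Stage 1: predict future frames. Stage 2: recover one action per frame pair."""
    frames = [obs] + predict_future(obs, goal, horizon)
    return [inverse_dynamics(a, b) for a, b in zip(frames, frames[1:])]


actions = idm_policy((0.0, 0.0), (2.0, 4.0), horizon=4)
```

The two stages share no parameters in this sketch, which mirrors the decoupling described above: the predictor could be swapped without touching the action decoder.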

Early Subgoal-Image Instantiations

Earlier instantiations leveraged image-editing diffusion models to predict subgoals for a goal-conditioned policy to follow.

  • [arXiv'23.10] SuSIE: Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models
    arXiv Website GitHub

  • [ICRA'25] GHIL-Glue: Hierarchical Control with Filtered Subgoal Images
    Paper arXiv GitHub Website

Video-IDM Policies

  • [NeurIPS'23] UniPi: Learning Universal Policies via Text-Guided Video Generation
    Paper arXiv Website

  • [ICLR'24] GR-1: Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation
    Paper arXiv GitHub Website

  • [NeurIPS'24] VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation
    Paper arXiv GitHub Website

  • [ICML'25] VPP: Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
    Paper arXiv GitHub Website

  • [CoRL'25] Gen2Act: Human Video Generation in Novel Scenarios Enables Generalizable Robot Manipulation
    arXiv Website

  • [ICLR'25] V2A: Grounding Video Models to Actions through Goal Conditioned Exploration
    arXiv GitHub

  • [arXiv'25.12] Video2Act: A Dual-System Video Diffusion Policy with Robotic Spatio-Motional Modeling
    arXiv GitHub Website

  • [arXiv'25.12] mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs
    arXiv Website GitHub Hugging Face

  • [arXiv'25.12] LVP: Large Video Planner Enables Generalizable Robot Control
    arXiv GitHub Website Hugging Face

  • [arXiv'25.12] Vidarc: Embodied Video Diffusion Model for Closed-loop Control
    arXiv GitHub Website

  • [arXiv'26.01] TC-IDM: Grounding Video Generation for Executable Zero-shot Robot Motion
    arXiv GitHub Website

  • [arXiv'26.02] Say, Dream, and Act: Learning Video World Models for Instruction-Driven Robot Manipulation
    arXiv

Structured 3D-aware IDM Extensions

A complementary line within the IDM family extracts 3D-aware motion structure (dense correspondences, hand trajectories, motion fields, 3D flow) from generated/demonstrated videos and uses it as a more action-relevant predictive prior.

  • [IROS'20] Hind4sight-Net: Hindsight for Foresight: Unsupervised Structured Dynamics Models from Physical Interaction
    Paper arXiv GitHub Website

  • [ICLR'24] AVDC: Learning to Act from Actionless Videos through Dense Correspondences
    Paper arXiv GitHub Website

  • [CVPR'25] VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation

  • [NeurIPS'25] Object-centric 3D Motion Field for Robot Learning from Human Videos

  • [arXiv'26] NovaFlow: Zero-shot Manipulation via Actionable Flow from Generated Videos
    arXiv GitHub Website

  • [arXiv'26.04] VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis
    arXiv

Single-backbone Policies

Unified Policies with Single World Model Backbone: a single shared backbone jointly models video and action through joint diffusion/prediction.

  • [RSS'25] UVA: Unified Video Action Model
    Paper arXiv GitHub Website

  • [RSS'25] UWM: Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
    Paper arXiv GitHub Website

  • [NeurIPS'25] VideoVLA: Video Generators Can Be Generalizable Robot Manipulators
    Paper arXiv Website GitHub Hugging Face

  • [ICLR'26] UD-VLA: Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process
    Paper arXiv GitHub Website Hugging Face

  • [arXiv'25.08] VideoPolicy: Video Generators are Robot Policies
    arXiv GitHub Website

  • [arXiv'26.01] Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
    arXiv GitHub Website Hugging Face

  • [arXiv'26.02] DreamZero (WAM): World Action Models are Zero-shot Policies
    arXiv GitHub Website

  • [arXiv'26.03] GigaWorld-Policy: An Efficient Action-Centered World-Action Model
    arXiv GitHub Website Hugging Face

  • [arXiv'26.04] MV-VDP: Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
    arXiv Website

  • [arXiv'26.04] Action Images: End-to-End Policy Learning via Multiview Video Generation
    arXiv GitHub

MoE/MoT-style Policies

Expert World-Model Backbones: video and action experts remain separated, interacting through shared attention / cross-attention / MoT fusion.

Expert-Coupled / MoT Designs

  • [ICLR'26] GE-Act: Genie Envisioner's parallel flow-matching action expert with cross-attention to a video-diffusion world model
    Paper arXiv GitHub Website

  • [arXiv'25.12] Motus: A Unified Latent Action World Model
    arXiv GitHub Website Hugging Face

  • [arXiv'26.01] LingBot-VA: Causal World Modeling for Robot Control (LingBot-VA)
    arXiv GitHub

  • [arXiv'26.02] BagelVLA: Enhancing Long-Horizon Manipulation via Interleaved Vision-Language-Action Generation
    arXiv Website

  • [arXiv'26.02] LDA-1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion
    arXiv GitHub Website

  • [arXiv'26.02] FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment
    arXiv GitHub Website Hugging Face

  • [arXiv'26.02] World Guidance (WoG): World Modeling in Condition Space for Action Generation
    arXiv GitHub Website

  • [arXiv'26.03] DiT4DiT: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control
    arXiv GitHub Website

  • [arXiv'26.03] Fast-WAM: Do World Action Models Need Test-Time Future Imagination?
    arXiv GitHub Website

  • [arXiv'26.04] STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation
    arXiv

  • [arXiv'26.04] MotuBrain: An Advanced World Action Model for Robot Control
    arXiv Website

  • [arXiv'26.04] WAV: World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems
    arXiv GitHub Website

  • [arXiv'26.05] CKT-WAM: Parameter-Efficient Context Knowledge Transfer Between World Action Models
    arXiv GitHub

Unified VLA Models

Unified Vision-Language-Action architectures that internalize world modeling as a training objective within a single multimodal backbone.

  • [ICLR'24] GR-1: Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation
    Paper arXiv GitHub Website

  • [arXiv'24.10] GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
    arXiv Website

  • [ICML'25] UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent
    arXiv GitHub

  • [NeurIPS'25] DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
    Paper arXiv GitHub Website

  • [arXiv'25.05] UniVLA (task-centric latent actions): Learning to Act Anywhere with Task-Centric Latent Actions
    arXiv GitHub

  • [ICLR'26] Unified VLA (UniVLA): Unified Vision-Language-Action Model
    Paper arXiv GitHub Website

  • [ICLR'26] Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
    Paper arXiv GitHub Website

  • [CVPR'26] CoWVLA: Chain of World: World Model Thinking in Latent Motion
    arXiv Website GitHub Hugging Face

  • [arXiv'25.09] F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
    arXiv GitHub Website Hugging Face

  • [arXiv'25.11] RynnVLA-002: A Unified Vision-Language-Action and World Model
    arXiv GitHub Hugging Face

  • [arXiv'25.07] TriVLA: A Triple-System-Based Unified Vision-Language-Action Model for General Robot Control
    arXiv Website

  • [arXiv'26.01] InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation
    arXiv GitHub Website

  • [arXiv'26.02] HALO: A Unified VLA Model for Embodied Multimodal Chain-of-Thought Reasoning
    arXiv

  • [arXiv'26.05] OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
    arXiv

Latent-space World Modeling

Policies with Latent-Space World Modeling: internalize future prediction in representation space without explicit video generation. JEPA-style approaches.

  • [CoRL'25] FLARE: Robot Learning with Implicit World Modeling
    arXiv Website

  • [arXiv'26.02] VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model
    arXiv GitHub Website Hugging Face

  • [arXiv'26.02] VISTA: Scaling World Model for Hierarchical Manipulation Policies
    arXiv GitHub Website

  • [arXiv'26.02] JEPA-VLA: Video Predictive Embedding is Needed for VLA Models
    arXiv

  • [arXiv'26.02] World Guidance (WoG): World Modeling in Condition Space for Action Generation
    arXiv GitHub Website

  • [arXiv'26.03] DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA
    arXiv Website GitHub Hugging Face

  • [arXiv'26.04] AIM: Intent-Aware Unified World Action Modeling with Spatial Value Maps
    arXiv GitHub

  • [arXiv'26.04] DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks
    arXiv GitHub


World Model as Simulator

Beyond predictive conditioning, world models can serve as interactive simulators: given observations, instructions, and candidate actions, they roll out future states, provide feedback signals, and support downstream decision-making through imagined interaction. This section covers two complementary uses: reinforcement learning in learned simulators, and evaluation/planning through imagined rollouts.

World Model for Reinforcement Learning

World models as learned environments for policy improvement through imagined rollouts, replacing costly physical interaction.
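A minimal sketch of policy improvement through imagined rollouts, under strong simplifying assumptions: a 1D state, a world model of the assumed form `s' = s + k * a` fitted by least squares from a handful of real transitions, and a linear policy `a = -gain * s` selected by its imagined return. All function names and the data are hypothetical, and this simple search is not the algorithm of any paper listed below.

```python
from typing import Iterable, List, Tuple

Transition = Tuple[float, float, float]  # (state, action, next_state)


def fit_world_model(transitions: List[Transition]) -> float:
    """Least-squares fit of k in the assumed dynamics s' = s + k * a."""
    num = sum(a * (s_next - s) for s, a, s_next in transitions)
    den = sum(a * a for _, a, _ in transitions)
    return num / den


def imagined_return(k: float, gain: float, s0: float, horizon: int) -> float:
    """Roll the policy a = -gain * s forward inside the learned model only."""
    s, total = s0, 0.0
    for _ in range(horizon):
        a = -gain * s
        s = s + k * a          # imagined transition, no real interaction
        total += -(s * s)      # reward: stay close to the origin
    return total


def improve_policy(k: float, candidate_gains: Iterable[float],
                   s0: float = 1.0, horizon: int = 10) -> float:
    """Pick the candidate policy with the highest imagined return."""
    return max(candidate_gains, key=lambda g: imagined_return(k, g, s0, horizon))


# A few "real" transitions (made-up data consistent with k = 0.5).
real = [(0.0, 1.0, 0.5), (1.0, -1.0, 0.5), (2.0, 2.0, 3.0)]
k = fit_world_model(real)
best_gain = improve_policy(k, [0.5, 1.0, 2.0, 3.0])
```

Only the three transitions in `real` touch the environment; every policy comparison afterwards happens inside the fitted model, which is the cost saving the paragraph above refers to.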

  • [CoRL'23] DayDreamer: World Models for Physical Robot Learning
    arXiv GitHub Website

  • [ICLR'24] UniSim: Learning Interactive Real-World Simulators
    arXiv Website

  • [CoRL'25] DiWA: Diffusion Policy Adaptation with World Models
    arXiv GitHub Website

  • [arXiv'25.09] World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training
    arXiv GitHub

  • [arXiv'25.09] World4RL: Diffusion World Models for Policy Refinement with Reinforcement Learning for Robotic Manipulation
    arXiv Code Website

  • [arXiv'25.10] VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators
    arXiv GitHub Website Hugging Face

  • [arXiv'25.11] ProphRL: Reinforcing Action Policies by Prophesying
    arXiv Website

  • [ICLR'26] WMPO: World Model-based Policy Optimization for Vision-Language-Action Models
    Paper arXiv GitHub Website

  • [CVPR'26] RehearseVLA: Simulated Post-Training for VLAs with Physically-Consistent World Model
    arXiv GitHub

  • [arXiv'26.02] World-Gymnast: Training Robots with Reinforcement Learning in a World Model
    arXiv GitHub Website

  • [arXiv'26.02] RISE: Self-Improving Robot Policy with Compositional World Model
    arXiv Website GitHub

  • [arXiv'26.02] VLAW: Iterative Co-Improvement of Vision-Language-Action Policy and World Model
    arXiv Website

  • [arXiv'26.02] GigaBrain-0.5M*: A VLA That Learns From World Model-Based Reinforcement Learning
    arXiv GitHub Website

  • [arXiv'26.02] WoVR: World Models as Reliable Simulators for Post-Training VLA Policies with RL
    arXiv GitHub Hugging Face

  • [arXiv'26.02] World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy
    arXiv GitHub Website

  • [arXiv'26.03] PlayWorld: Learning Robot World Models from Autonomous Play
    arXiv Website

  • [arXiv'26.03] VLA-MBPO: Towards Practical World Model-based Reinforcement Learning for Vision-Language-Action Models
    arXiv

  • [arXiv'26.04] ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
    arXiv Website

World Model for Evaluation

World models as evaluators: scoring candidate behaviors, ranking policies, supporting MPC planning, and enabling decision-time action selection through predictive rollout.
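Decision-time action selection through predictive rollout can be sketched as a tiny MPC loop. The assumptions are stated up front: a 1D state, a hand-written `world_model` standing in for a learned one, a discrete action set, and exhaustive enumeration of action sequences instead of sampling. The names are hypothetical and not taken from any listed work.

```python
from itertools import product
from typing import Sequence, Tuple

ACTIONS = (-1, 0, 1)  # assumed discrete action set


def world_model(state: int, action: int) -> int:
    """Stand-in for a learned dynamics model: s' = s + a."""
    return state + action


def rollout_cost(state: int, plan: Sequence[int], goal: int) -> int:
    """Score a candidate plan by the summed distance to goal along its rollout."""
    cost = 0
    for action in plan:
        state = world_model(state, action)
        cost += abs(state - goal)
    return cost


def plan_first_action(state: int, goal: int, horizon: int = 3) -> Tuple[int, Tuple[int, ...]]:
    """Enumerate all plans, keep the cheapest, and return its first action (MPC style)."""
    best_plan = min(
        product(ACTIONS, repeat=horizon),
        key=lambda plan: rollout_cost(state, plan, goal),
    )
    return best_plan[0], best_plan


first_action, best_plan = plan_first_action(state=0, goal=2, horizon=3)
```

In a receding-horizon setup only `first_action` would be executed before replanning from the next observation. The same `rollout_cost` function could also rank fixed policies by scoring the plans they produce.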

  • [ICLR'24] TD-MPC2: Scalable, Robust World Models for Continuous Control
    arXiv Website GitHub

  • [ICLR'26] WorldGym: World Model as An Environment for Policy Evaluation
    Paper arXiv GitHub Website

  • [ICLR'26] Horizon Imagination: Efficient On-Policy Rollout in Diffusion World Models
    Paper GitHub

  • [arXiv'25.05] WorldEval: World Model as Real-World Robot Policies Evaluator
    arXiv GitHub Website

  • [arXiv'25.11] Scalable Policy Evaluation with Video World Models
    arXiv Website

  • [arXiv'25.12] Evaluating Gemini Robotics Policies in a Veo World Simulator
    arXiv

  • [RA-L'26] GPC: Inference-Time Enhancement of Generative Robot Policies via Predictive World Modeling
    arXiv GitHub

  • [arXiv'26.03] DreamPlan: Efficient Reinforcement Fine-Tuning of Vision-Language Planners via Video World Models
    arXiv Website GitHub

  • [arXiv'26.03] LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
    arXiv Website GitHub

  • [arXiv'25.06] V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
    arXiv GitHub

  • [arXiv'26.03] V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning
    arXiv Website GitHub

  • [arXiv'26.04] dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
    arXiv Website

  • [arXiv'26.05] FFDC-WAM: When to Trust Imagination: Adaptive Action Execution for World Action Models
    arXiv


World Model for Video Generation

Video generation / video world models for robotics, including interactive simulators, imagination-based policy learning, and foundation video-world backbones that support robot learning.

  • [ICLR'24] Video Language Planning (VLP)
    arXiv GitHub Website

  • [CoRL'24] Dreamitate: Real-World Visuomotor Policy Learning via Video Generation
    Paper arXiv GitHub Website

  • [ICML'24] RoboDreamer: Learning Compositional World Models for Robot Imagination
    Paper arXiv GitHub Website

  • [ICLR'25] DreMa: Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination
    Paper arXiv GitHub Website

  • [ICLR'25] CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
    Paper GitHub

  • [CoRL'25] DreamGen: Unlocking Generalization in Robot Learning through Video World Models
    Paper arXiv Website

  • [ICCV'25] PhysWorld: Robot Learning from a Physical World Model
    arXiv Website GitHub

  • [ICCV'25] IRASim: A Fine-Grained World Model for Robot Manipulation
    Paper arXiv GitHub Website

  • [IROS'25] RoboEnvision: A Long-Horizon Video Generation Model for Multi-Task Robot Manipulation
    arXiv

  • [ICLR'26] RoboMaster: Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control
    Paper GitHub Website

  • [ICLR'26] Vid2World: Crafting Video Diffusion Models to Interactive World Models
    Paper GitHub Website

  • [ICLR'26] Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
    Paper arXiv GitHub Website

  • [ICLR'26] Ctrl-World: A Controllable Generative World Model for Robot Manipulation
    Paper arXiv GitHub Website

  • [AAAI'26] Mask2IV: Interaction-Centric Video Generation via Mask Trajectories
    arXiv GitHub Website

  • [arXiv'25.04] ManipDreamer: Boosting Robotic Manipulation World Model with Action Tree and Visual Guidance
    arXiv Website

  • [arXiv'25.04] TesserAct
    arXiv GitHub Website Hugging Face

  • [arXiv'25.05] EnerVerse-AC
    arXiv GitHub Website

  • [arXiv'25.09] WoW: Towards a World-omniscient World-model Through Embodied Interaction
    arXiv GitHub

  • [Tech Release'25.09] UnifoLM-WMA-0: A World-Model-Action Framework for General-Purpose Robot Learning
    GitHub Website

  • [Tech Report'25.10] Cosmos-Predict2.5: A Suite of Diffusion-based World Foundation Models
    GitHub Website

  • [arXiv'25.11] GigaWorld-0
    arXiv GitHub Website

  • [arXiv'26.01] RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation
    arXiv GitHub Website

  • [arXiv'26.02] DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
    arXiv GitHub Website

  • [arXiv'26.03] Interactive World Simulator for Robot Policy Training and Evaluation
    arXiv GitHub Website Hugging Face

  • [arXiv'26.03] ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment
    arXiv GitHub

  • [arXiv'26.03] EVA (model): Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards
    arXiv Website GitHub Hugging Face

Note: this EVA is the action-controllable video world model (Wang et al., 2026); not to be confused with EVA-Bench (Chi et al., ICML'25) listed under Benchmarks.

  • [arXiv'26.03] Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation
    arXiv GitHub

  • [arXiv'26.03] Persistent Robot World Models: Stabilizing Multi-Step Rollouts via Reinforcement Learning
    arXiv GitHub

  • [arXiv'26.04] Cortex 2.0: Grounding World Models in Real-World Industrial Deployment
    arXiv

  • [arXiv'26.04] X-WAM: Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
    arXiv Website GitHub

  • [arXiv'26.05] EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields
    arXiv


Benchmarks for Evaluation World-Model

Benchmarks / evaluation suites for embodied world models, video world models, and world simulators.
Cross-listing is intentional: if a work releases both a benchmark and a dataset, it can appear here and in Datasets.

  • [ICML'25] EVA-Bench: benchmark introduced in Empowering World Models with Reflection for Embodied Video Prediction (EVA)
    Paper arXiv GitHub Website

  • [ICML'25] WorldSimBench: Towards Video Generation Models as World Simulators
    Paper arXiv Website

  • [BMVC'25] EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models
    Paper arXiv GitHub

  • [CoRL'25] DreamGen Bench: benchmark introduced in DreamGen: Unlocking Generalization in Robot Learning through Video World Models
    Paper arXiv GitHub Website

  • [ICLR'26] World-in-World (WoW!): World Models in a Closed-Loop World
    Paper arXiv GitHub Website

  • [arXiv'26.01] WoW-World-Eval: Wow, wo, val! A Comprehensive Embodied World Model Evaluation Turing Test
    arXiv

  • [arXiv'26.01] RBench: Rethinking Video Generation Model for the Embodied World
    arXiv GitHub Website

  • [arXiv'26.02] WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models
    arXiv GitHub Website

  • [ACL Findings'25] WM-ABench: Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation
    Paper arXiv Website Hugging Face

  • [arXiv'26.04] RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
    arXiv GitHub

  • [arXiv'26.01] DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving
    arXiv Website GitHub


Datasets

Training datasets, preference datasets, and instruction-tuning datasets.
Cross-listing is intentional: if a work releases both datasets and benchmarks, it may appear here and in Benchmarks for Evaluation World-Model.

General-Purpose Trajectory Corpora & Cross-Embodiment

  • [CoRL'23] BridgeData V2: A Dataset for Robot Learning at Scale
    arXiv GitHub Website

  • [ICRA'24] Open X-Embodiment (OXE): Robotic Learning Datasets and RT-X Models
    Paper arXiv GitHub Website

  • [RSS'24] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
    Paper arXiv GitHub Website

  • [IROS'25] AgiBot-World (Alpha/Beta): AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems
    arXiv GitHub Website

  • [arXiv'25.09] Galaxea Open-World Dataset and G0 Dual-System VLA Model
    arXiv GitHub Website

  • [arXiv'25.10] Humanoid Everyday: A Comprehensive Robotic Dataset for Open-World Humanoid Manipulation
    arXiv Website Hugging Face

  • [arXiv'25.12] RoboMIND 2.0: A Multimodal, Bimanual Mobile Manipulation Dataset for Generalizable Embodied Intelligence
    arXiv GitHub ModelScope

  • [arXiv'24.05] BRMData: Empowering Embodied Manipulation: A Bimanual-Mobile Robot Manipulation Dataset for Household Tasks
    arXiv GitHub

  • [RSS Workshop'23] RH20T: A Robotic Dataset for Learning Diverse Skills in One-Shot
    Website

  • [IROS'25] RH20T-P: A Primitive-Level Robotic Manipulation Dataset Towards Composable Generalization Agents in Real-World Scenarios
    arXiv Website

UMI / Hand-Held Interface Family

  • [RSS'24] UMI: Universal Manipulation Interface: In-the-Wild Robot Teaching without In-the-Wild Robots
    arXiv GitHub Website

  • [arXiv'25.09] MV-UMI: A Scalable Multi-View Interface for Cross-Embodiment Learning
    arXiv Website

  • [arXiv'25.10] ActiveUMI: Robotic Manipulation with Active Perception from Robot-Free Human Demonstrations
    arXiv Website

  • [arXiv'25.10] FastUMI-100K: Advancing Data-Driven Robotic Manipulation with a Large-scale UMI-style Dataset
    arXiv GitHub

  • [arXiv'25.11] TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System
    arXiv Website GitHub

Human-Video / Egocentric Priors

  • [ICRA'25] EgoMimic: Scaling Imitation Learning via Egocentric Video
    Paper arXiv Website

  • [RSS'25] DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies
    Website

  • [arXiv'25.07] Being-h0 (UniHand): Vision-Language-Action Pretraining from Large-scale Human Videos
    arXiv Website GitHub Hugging Face

  • [arXiv'26.01] Being-H0.5 (UniHand 2.0): Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization
    arXiv Website GitHub Hugging Face

  • [arXiv'25.11] PHSD / In-N-On: Scaling Egocentric Manipulation with In-the-Wild and On-Task Data
    arXiv Website

Tactile / Force / Contact-Rich Datasets

  • [arXiv'25.06] FreeTacMan: Robot-Free Visuo-Tactile Data Collection System for Contact-Rich Manipulation
    arXiv Website GitHub

  • [arXiv'25.10] Humanoid Visual-Tactile-Action: A Humanoid Visual-Tactile-Action Dataset for Contact-Rich Manipulation
    arXiv

  • [ICLR'25] VTDexManip: A Dataset and Benchmark for Visual-Tactile Pretraining and Dexterous Manipulation with Reinforcement Learning
    Paper

  • [arXiv'25.12] Hoi!: A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation
    arXiv Website

Synthetic / Recipe-Driven Datasets & Preference / Instruction-Tuning Sets

  • [arXiv'25.06] RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
    arXiv Website GitHub

  • [ICML'25] EVA-Instruct: instruction-tuning dataset released with Empowering World Models with Reflection for Embodied Video Prediction (EVA)
    Paper arXiv GitHub Website

  • [ICML'25] HF-Embodied: human-preference dataset introduced in WorldSimBench
    Paper arXiv Website

  • [arXiv'26.01] Action100M: A Large-scale Video Action Dataset
    arXiv GitHub

  • [arXiv'26.01] RoVid-X: training dataset released with Rethinking Video Generation Model for the Embodied World
    arXiv GitHub Website

Citations

If you find this repository useful, please consider citing the original papers listed above and/or citing this collection:

@misc{hou2026worldmodelrobotlearning,
  title         = {World Model for Robot Learning: A Comprehensive Survey},
  author        = {Bohan Hou and Gen Li and Jindou Jia and Tuo An and Xinying Guo and Sicong Leng and Haoran Geng and Yanjie Ze and Tatsuya Harada and Philip Torr and Oier Mees and Marc Pollefeys and Zhuang Liu and Jiajun Wu and Pieter Abbeel and Jitendra Malik and Yilun Du and Jianfei Yang},
  year          = {2026},
  eprint        = {2605.00080},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2605.00080}
}