🦾 Awesome VLA Study

March 21, 2026

Getting started with vision-language-action (VLA) models? This guide takes you from the foundations to the frontier: diffusion and flow matching, state-of-the-art robot foundation model (RFM) architectures, data scaling, RL fine-tuning, and world models. Papers are listed in reading order.

📋 Prerequisites

  • Basic probability & optimization (enough to follow ELBO, score matching derivations)
  • Deep learning fundamentals (Transformers, attention, tokenization)
  • Paper presentation: 1–2 participants per week, 30 min/paper, covering architecture, training, and key results
  • Discussion: Compare design choices across the week's papers, discuss limitations and open questions (15–20 min)

Phase | Weeks | Topic | Readings
Phase 1 | W1–3 | Generative Model Foundations | MIT 6.S184 course
Phase 2 | W4–5 | Early Foundation RFMs & Robot Policy | RT-1, RT-2, Octo, OpenVLA, BeT, Diffusion Policy, ACT
Phase 3 | W6–7 | Current RFM Architectures | CogACT, GR00T N1, X-VLA, π0, InternVLA-M1
Phase 4 | W8–9 | Data Scaling | OXE, AgiBot World, UMI, VITRA, Human to Robot Transfer
Phase 5 | W10–11 | Efficient Inference & Dual-System | RTC, SmolVLA, Helix, Fast-in-Slow
Phase 6 | W12–14 | RL Fine-tuning, Reasoning & World Model | HIL-SERL, SimpleVLA-RL, π*0.6, CoT-VLA, ThinkAct, Fast-ThinkAct, UniVLA, Cosmos Policy, DreamZero

Phase 1: Generative Model Foundations (Weeks 1–3)

📚 Core Material: MIT 6.S184, Introduction to Flow Matching and Diffusion Models (Holderrieth & Erives, MIT CSAIL, 2025) | Course notes paper

Week 1: ODE/SDE Foundations & Diffusion Models

Material | Topic
Lectures 1–2 | ODE/SDE basics, forward/reverse processes, conditional/marginal probability paths
Lab 1 | Hands-on SDE simulation
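
As a warm-up for the SDE simulation topic, here is a minimal sketch of the Euler-Maruyama scheme. It is not taken from the course lab: the Ornstein-Uhlenbeck drift, the function name `simulate_ou`, and all parameter values are illustrative assumptions.

```python
import math
import random


def simulate_ou(x0, theta, sigma, dt, n_steps, rng):
    """Euler-Maruyama simulation of dX = -theta * X dt + sigma dW.

    Hypothetical helper for illustration; returns the path as a list
    of n_steps + 1 values, starting at x0.
    """
    xs = [x0]
    x = x0
    for _ in range(n_steps):
        # Brownian increment over a step of length dt: sqrt(dt) * N(0, 1)
        dw = math.sqrt(dt) * rng.gauss(0.0, 1.0)
        x = x + (-theta * x) * dt + sigma * dw
        xs.append(x)
    return xs


rng = random.Random(0)  # fixed seed so the run is repeatable
path = simulate_ou(x0=2.0, theta=1.0, sigma=0.5, dt=0.01, n_steps=500, rng=rng)
print(len(path), path[0])
```

Setting `sigma=0.0` reduces the scheme to plain Euler integration of the ODE, which is a quick sanity check before adding noise.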

Week 2: Flow Matching, Score Matching & Training

Material | Topic
Lectures 3–4 | Flow Matching, Score Matching, guidance, classifier-free guidance
Labs 2–3 | Building a toy diffusion model from scratch
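
To make the training objective concrete, the sketch below builds one conditional flow matching training pair and a classifier-free guidance combination. It assumes the linear path x_t = t·z + (1 − t)·ε, which is one common choice, and the guidance convention (1 − w)·u_uncond + w·u_cond. The function names are hypothetical, and no neural network is involved.

```python
import random


def cfm_training_pair(z, eps, t):
    """Return (x_t, target velocity) for the assumed linear path.

    x_t = t * z + (1 - t) * eps, so d/dt x_t = z - eps.
    z is a data sample, eps a noise sample, t a time in [0, 1].
    """
    x_t = [t * zi + (1.0 - t) * ei for zi, ei in zip(z, eps)]
    target = [zi - ei for zi, ei in zip(z, eps)]
    return x_t, target


def cfm_loss(predicted_velocity, target):
    """Mean squared error between a predicted velocity and the target."""
    return sum((p - u) ** 2 for p, u in zip(predicted_velocity, target)) / len(target)


def cfg_velocity(v_uncond, v_cond, w):
    """Classifier-free guidance: w = 1 recovers the conditional velocity."""
    return [(1.0 - w) * u + w * c for u, c in zip(v_uncond, v_cond)]


rng = random.Random(0)
z = [1.0, -2.0]                                # toy "data" point
eps = [rng.gauss(0.0, 1.0) for _ in z]         # Gaussian noise sample
x_t, target = cfm_training_pair(z, eps, t=rng.random())
print(cfm_loss([0.0, 0.0], target))            # loss of a model that predicts zero
```

In a real training loop the zero prediction would be replaced by a network evaluated at (x_t, t), and the loss averaged over a batch.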

Week 3: Generative Robotics & Review

Material | Topic
Lecture 5 | Guest lecture by Benjamin Burchfiel (Toyota Research): diffusion models for robotics
Lecture 6 | Generative protein design (optional)

Phase 2: Early Foundation Robot Models & Robot Policy (Weeks 4–5)

Week 4: Early Foundation Robot Models (RT-1, RT-2, Octo, OpenVLA)

# | Paper | Link | Key Topic
1 | RT-1: Robotics Transformer, Brohan et al. (2022) | 2212.06817 | First large-scale Robotics Transformer (no VLM)
2 | RT-2: Vision-Language-Action Models, Brohan et al. (2023) | 2307.15818 | VLM backbone → VLA paradigm
3 | Octo, Ghosh et al. (2024) | 2405.12213 | Open-source generalist policy, modular design, pretrained on OXE (no VLM)
4 | OpenVLA, Kim et al. (2024) | 2406.09246 | First open-source VLM-based VLA

📎 Supplementary video: Stanford CS25 V3, Low-level Embodied Intelligence

Key points: The step from RT-1 (35M, no VLM) to RT-2 (55B VLM, actions as text tokens) establishes the VLA concept. Octo (27M–93M, diffusion head, no VLM) and OpenVLA (7B, VLM + 256-bin discretization) are the first open-source generalist robot policies, enabling community iteration.
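
The 256-bin discretization mentioned above can be illustrated with a small sketch. This is a generic uniform binning, not OpenVLA's exact recipe: the bounds `low` and `high` are placeholders (how they are derived from data is left open), and both function names are hypothetical.

```python
def discretize(value, low, high, n_bins=256):
    """Map a continuous value to an integer bin index in [0, n_bins - 1].

    Values outside [low, high] are clipped to the nearest bound.
    """
    clipped = min(max(value, low), high)
    frac = (clipped - low) / (high - low)
    # frac == 1.0 would give n_bins, so cap at the last bin
    return min(int(frac * n_bins), n_bins - 1)


def undiscretize(index, low, high, n_bins=256):
    """Map a bin index back to the center of its bin."""
    width = (high - low) / n_bins
    return low + (index + 0.5) * width


# Toy 7-DoF action with every dimension assumed to lie in [-1, 1]
action = [0.30, -0.75, 0.0, 1.0, -1.0, 0.12, 0.9]
tokens = [discretize(a, -1.0, 1.0) for a in action]
recovered = [undiscretize(k, -1.0, 1.0) for k in tokens]
print(tokens)
print(max(abs(a - r) for a, r in zip(action, recovered)))  # at most half a bin width
```

Each action dimension becomes one discrete token, which is what lets a language-model backbone emit actions with its usual next-token head; the price is a quantization error of up to half a bin width per dimension.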

Week 5: Core Robot Policies (Diffusion Policy, ACT, BeT)

# | Paper | Link | Key Topic
5 | Behavior Transformers (BeT), Shafiullah et al. (2022) | 2206.11251 | Multimodal action discretization, k-means + offset
6 | Diffusion Policy, Chi et al. (2023) | 2303.04137 | Diffusion for robot control, action sequence prediction
7 | ACT/ALOHA, Zhao et al. (2023) | 2304.13705 | Action Chunking Transformer, CVAE, bimanual

Key points: Three approaches to the multimodal action problem, where the same observation admits several distinct valid actions: k-means discretization plus offsets (BeT), diffusion over action sequences (Diffusion Policy), and a CVAE with action chunking (ACT). Action chunking (predicting K future actions at once) is foundational for later VLA work.
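
Action chunking can be sketched as a control loop that queries the policy once per chunk and executes the returned actions open-loop. This is a minimal illustration only: `toy_policy` is a hypothetical stand-in for a learned model, and refinements such as ACT's temporal ensembling are not shown.

```python
def run_with_chunking(policy, get_obs, n_steps, chunk_size):
    """Execute n_steps, querying the policy only when the action queue is empty.

    Returns the executed actions and the number of policy queries.
    """
    executed = []
    queue = []
    queries = 0
    for t in range(n_steps):
        if not queue:
            # One forward pass yields chunk_size future actions
            queue = list(policy(get_obs(t), chunk_size))
            queries += 1
        executed.append(queue.pop(0))
    return executed, queries


def toy_policy(obs, k):
    # Hypothetical stand-in: "actions" are just obs, obs + 1, ..., obs + k - 1
    return [obs + i for i in range(k)]


actions, n_queries = run_with_chunking(toy_policy, lambda t: t, n_steps=10, chunk_size=4)
print(actions)    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(n_queries)  # 3
```

With a chunk size of 4, ten control steps need three policy queries instead of ten, which is why chunking matters once the policy is a large, slow model.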


Phase 3: Current RFM Architectures (Weeks 6–7)

Week 6: VLM + Action Head (CogACT, GR00T N1, X-VLA)

# | Paper | Link | Key Topic
8 | CogACT, Li et al. (2024) | 2411.19650 | VLM + DiT action head, action token learning
9 | GR00T N1, Bjorck et al. (2025) | 2503.14734 | 2B diffusion transformer, whole-body humanoid control
10 | X-VLA, Zheng et al. (2025) | 2510.10274 | Soft prompts for cross-embodiment, Florence-Large + flow matching

Key points: All three use only the VLM's last hidden state to drive a separate action head.

Week 7: VLM + Action Expert (π0, InternVLA-M1)

# | Paper | Link | Key Topic
11 | π0, Black et al. (2024) | 2410.24164 | Flow matching + action expert accessing VLM intermediate features
12 | InternVLA-M1, Chen et al. (2025) | 2510.13778 | Spatial grounding → action generation, AR-based

📎 Background: Transfusion, Zhou et al. (2024) | 2408.11039. AR + diffusion in one transformer; π0's architectural basis

Key points: Unlike Week 6's action heads that only see the VLM's last hidden state, these action experts access VLM internal hidden states.


Phase 4: Data Scaling (Weeks 8–9)

Week 8: Large-Scale Robot Datasets (OXE, AgiBot World)

# | Paper | Link | Key Topic
13 | Open X-Embodiment (OXE), Open X-Embodiment Collaboration (2023) | 2310.08864 | 1M+ trajectories, 22 embodiments, standardized data format
14 | AgiBot World, Bu et al. (2025) | 2503.06669 | 1M+ trajectories, 217 tasks, 5 deployment scenarios

📎 Data formats. Recording-oriented: rosbag (ROS 1), mcap (vendor-neutral, ROS 2 default). Training-oriented: RLDS (TensorFlow/OXE standard), LeRobotDataset (HuggingFace, Parquet + video).
📎 From the Evolution of Rosbag to the Future of AI Tooling: written by the original rosbag author; covers the evolution rosbag V1 → V2 → rosbag2 (sqlite3) → MCAP

Key points: Large-scale multi-embodiment datasets enable generalist robot policy pretraining. OXE standardized the data format across 22 robot embodiments via RLDS; AgiBot World provides high-quality data at scale.

Week 9: Data Collection Methods (UMI, VITRA, Human to Robot Transfer)

# | Paper | Link | Key Topic
15 | UMI, Chi et al. (2024) | 2402.10329 | Robot-free SE(3) data collection via handheld gripper
16 | VITRA, Li et al. (2025) | 2510.21571 | Human video → VLA training data (1M episodes from egocentric human videos)
17 | Human to Robot Transfer, Kareer et al. (2025) | 2512.22414 | Human video → robot transfer emerges with VLA scaling

Key points: Three data sources go beyond robot teleoperation: UMI (embodiment-agnostic physical demos, <$200 hardware), egocentric video, and exocentric video.


Phase 5: Efficient Inference & Dual-System (Weeks 10–11)

Week 10: Fast-Acting VLA (SmolVLA & RTC)

# | Paper | Link | Key Topic
18 | SmolVLA, Shukor et al. (2025) | 2506.01844 | 450M params (~1/7 of π0), model compression + async inference
19 | RTC, Black et al. (2025) | 2506.07339 | Async inference via freezing + inpainting, no retraining needed

Key points: The two approaches are complementary. SmolVLA compresses the model itself, while RTC optimizes the inference pipeline, so they can be combined.

Week 11: Dual-System VLA (Helix & Fast-in-Slow)

# | Paper | Link | Key Topic
20 | Helix, Figure AI (2025) | figure.ai/news/helix | S2: 7B VLM @ 7–9 Hz, S1: 80M @ 200 Hz, humanoid
21 | Fast-in-Slow, Chen et al. (2025) | 2506.01953 | Integrated dual-system, end-to-end trainable

Key points: Dual-System separates slow reasoning (VLM) from fast execution (lightweight policy) at different frequencies. Helix (separately trained) vs Fast-in-Slow (end-to-end trainable).
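
The two-rate structure can be sketched as a loop in which a slow module refreshes a latent at a low rate while a fast module acts on every tick. This is a simplified synchronous simulation, not how any specific system is implemented (real systems typically run the two modules asynchronously); `slow_fn`, `fast_fn`, and the rates below are illustrative assumptions.

```python
def run_dual_system(slow_hz, fast_hz, duration_s, slow_fn, fast_fn):
    """Tick a fast controller every step; refresh the slow latent at a lower rate.

    Returns the list of fast-module outputs and the number of slow-module calls.
    """
    steps_per_slow = fast_hz // slow_hz  # fast ticks per slow update
    latent = None
    actions = []
    slow_calls = 0
    for step in range(int(duration_s * fast_hz)):
        if step % steps_per_slow == 0:
            latent = slow_fn(step)       # e.g. a VLM producing a plan latent
            slow_calls += 1
        actions.append(fast_fn(latent, step))  # reuses the latest latent
    return actions, slow_calls


# Hypothetical stand-ins: the latent records when it was computed,
# and each action records which latent it was conditioned on.
actions, slow_calls = run_dual_system(
    slow_hz=8, fast_hz=200, duration_s=1,
    slow_fn=lambda step: step,
    fast_fn=lambda latent, step: (latent, step),
)
print(len(actions), slow_calls)  # 200 8
```

Between slow updates the fast module keeps acting on a latent that is up to 24 ticks old, which is the staleness a dual-system design has to tolerate.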


Phase 6: RL Fine-tuning, Reasoning & World Model (Weeks 12–14)

Week 12: RL Fine-tuning & Human-in-the-Loop (HIL-SERL, SimpleVLA-RL, π*0.6)

# | Paper | Link | Key Topic
22 | HIL-SERL, Luo et al. (2024) | 2410.21845 | Human-in-the-loop RL, sample-efficient real-world training
23 | SimpleVLA-RL, Li et al. (2025) | 2509.09674 | RL fine-tuning for autoregressive VLA, outcome-based rewards
24 | π*0.6 / Recap, Physical Intelligence (2025) | 2511.14759 | RL for flow-based VLA, advantage-conditioned, learns from suboptimal data

Key points: The three RL approaches are HIL-SERL (human-in-the-loop, sample-efficient), SimpleVLA-RL (outcome rewards), and π*0.6 (advantage-conditioned, learns from suboptimal data).
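
The outcome-reward idea can be sketched as follows: each rollout receives a binary reward for task success, and rewards are normalized within a group of rollouts of the same task. The group normalization (in the style of GRPO) is an assumption made for illustration, not a description of any paper's exact recipe, and `group_advantages` is a hypothetical helper.

```python
import statistics


def group_advantages(successes, eps=1e-6):
    """Turn per-rollout success flags into group-normalized advantages.

    successes: list of booleans, one per rollout of the same task.
    Reward is 1.0 for success and 0.0 otherwise (outcome-only signal).
    """
    rewards = [1.0 if s else 0.0 for s in successes]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts of one task: two succeed, two fail
print(group_advantages([True, False, True, False]))
# A group where every rollout succeeds carries no learning signal
print(group_advantages([True, True, True, True]))
```

Successful rollouts get positive advantages and failed ones negative; when all rollouts in a group share the same outcome, every advantage is zero, so such groups contribute nothing to the update.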

Week 13: Reasoning VLA (CoT-VLA, ThinkAct, Fast-ThinkAct)

# | Paper | Link | Key Topic
25 | CoT-VLA, Zhao et al. (2025) | 2503.22020 | Visual chain-of-thought reasoning (future image prediction) before action
26 | ThinkAct, Huang et al. (2025) | 2507.16815 | Decouple reasoning from execution; RL grounds plan quality in task success, not language supervision
27 | Fast-ThinkAct, Huang et al. (2026) | 2601.09708 | Text-level CoT is dispensable: latent distillation preserves planning capacity at ~10× speed

📎 Fast-ThinkAct's reasoning compression is orthogonal to Week 10's efficiency techniques (SmolVLA's model compression, RTC's async inference), so they can stack.

Key points: The reasoning representation differs across the three: image tokens (CoT-VLA) vs. visual latent (ThinkAct) vs. compressed latent tokens (Fast-ThinkAct). ThinkAct grounds reasoning in task-outcome RL instead of language supervision. Fast-ThinkAct shows that planning structure, not verbosity, carries the signal (~10× faster, performance preserved).

Week 14: World Model (UniVLA, Cosmos Policy, DreamZero)

# | Paper | Link | Key Topic
28 | UniVLA, Wang et al. (2025) | 2506.19850 | Unified AR VLA with world modeling as training objective
29 | Cosmos Policy, Kim et al. (2026) | 2601.16163 | Pretrained video foundation model as robot policy backbone
30 | DreamZero, Ye et al. (2026) | dreamzero0.github.io | World Action Model, joint world+action generation in latent space

Key points: There are three ways to leverage world knowledge: as a training regularizer (UniVLA, no world prediction at inference), as a pretrained video foundation model used as the policy backbone (Cosmos Policy), and as joint world+action generation in latent space (DreamZero).


Contributing

Suggestions for papers, resources, or structural improvements are welcome; please open an issue or PR.

See Also

  • 🔥 vla0-trl: A complete VLA in ~1,200 lines of Python. Fine-tunes Qwen2.5-VL with TRL's SFTTrainer to predict actions as text, scoring ~90% on LIBERO. Read the entire codebase in an afternoon.
  • 🔥 vla-eval: One framework to evaluate any VLA model on any robot simulation benchmark.
  • Awesome-RL-VLA: RL for VLA models
  • Awesome-VLA-Robotics: Large-scale VLA paper collection
  • awesome-physical-ai: A curated list of academic papers and resources on Physical AI

Courses covering the prerequisites for this study guide. Only courses with recent (2023+) video lectures freely available on YouTube are listed. Pick what you need.

Area | Course | Instructor | Link | Notes
DL Fundamentals | MIT 6.S191: Intro to Deep Learning | Alexander Amini | introtodeeplearning.com · YouTube '25 | 1-week bootcamp (10 lectures): CNN, Transformer, generative models, RL
DL Fundamentals | Andrej Karpathy: Neural Networks: Zero to Hero | Andrej Karpathy | karpathy.ai/zero-to-hero.html · YouTube | Backprop → GPT, build everything from scratch in code
Vision | Stanford CS231n: DL for Computer Vision | Fei-Fei Li et al. | cs231n.stanford.edu · YouTube '25 | The canonical CV course: backprop to detection/segmentation/video
NLP / Transformers | Stanford CS224n: NLP with Deep Learning | Christopher Manning | web.stanford.edu/class/cs224n · YouTube '24 | Word vectors → Transformers → LLMs
RL | UC Berkeley CS285: Deep RL | Sergey Levine | rail.eecs.berkeley.edu/deeprlcourse · YouTube '23 | Policy gradients, Q-learning, model-based & offline RL; taught by a leading robotics RL researcher