Awesome VLA Study
March 21, 2026
Getting started with VLA (Vision-Language-Action) models? This guide takes you from the foundations to the frontier: diffusion and flow matching, state-of-the-art robot foundation model (RFM) architectures, data scaling, RL fine-tuning, and world models. Papers are listed in reading order.
Prerequisites
- Basic probability & optimization (enough to follow ELBO, score matching derivations)
- Deep learning fundamentals (Transformers, attention, tokenization)
- Starting from scratch? MIT 6.S191 (Intro to Deep Learning) covers CNNs, Transformers, and generative models in a 1-week intensive bootcamp. More courses are listed under Recommended Courses.
Weekly Format (Recommended)
- Paper presentation: 1-2 participants per week, 30 min/paper, covering architecture, training, and key results
- Discussion: Compare design choices across the week's papers, discuss limitations and open questions (15-20 min)
| Phase | Weeks | Topic | Readings |
|---|---|---|---|
| Phase 1 | W1-3 | Generative Model Foundations | MIT 6.S184 course |
| Phase 2 | W4-5 | Early Foundation RFMs & Robot Policy | RT-1, RT-2, Octo, OpenVLA, BeT, Diffusion Policy, ACT |
| Phase 3 | W6-7 | Current RFM Architectures | CogACT, GR00T N1, X-VLA, π0, InternVLA-M1 |
| Phase 4 | W8-9 | Data Scaling | OXE, AgiBot World, UMI, VITRA, Human to Robot Transfer |
| Phase 5 | W10-11 | Efficient Inference & Dual-System | RTC, SmolVLA, Helix, Fast-in-Slow |
| Phase 6 | W12-14 | RL Fine-tuning, Reasoning & World Model | HIL-SERL, SimpleVLA-RL, π*0.6, CoT-VLA, ThinkAct, Fast-ThinkAct, UniVLA, Cosmos Policy, DreamZero |
Phase 1: Generative Model Foundations (Weeks 1-3)
Core Material: MIT 6.S184, Introduction to Flow Matching and Diffusion Models (Holderrieth & Erives, MIT CSAIL, 2025) | Course notes paper
Week 1: ODE/SDE Foundations & Diffusion Models
| Material | Topic |
|---|---|
| Lectures 1-2 | ODE/SDE basics, forward/reverse processes, conditional/marginal probability paths |
| Lab 1 | Hands-on SDE simulation |
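For readers who want a preview of what "SDE simulation" means in practice, here is a minimal Euler-Maruyama sketch. It is not taken from the course lab: the Ornstein-Uhlenbeck drift, the parameter values, and all function names are assumptions chosen for illustration. Setting `sigma=0.0` reduces the same loop to plain Euler integration of an ODE.

```python
import math
import random


def euler_maruyama(x0, drift, sigma, n_steps, t_end=1.0, seed=0):
    """Simulate dX_t = drift(X_t, t) dt + sigma dW_t on [0, t_end]."""
    rng = random.Random(seed)
    h = t_end / n_steps
    x, t = x0, 0.0
    path = [x]
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)  # standard normal draw for the Brownian increment
        x = x + h * drift(x, t) + sigma * math.sqrt(h) * noise
        t += h
        path.append(x)
    return path


def ou_drift(x, t, theta=1.0):
    """Ornstein-Uhlenbeck drift (illustrative choice): pulls the state toward zero."""
    return -theta * x


# Noisy trajectory (SDE) and its noise-free counterpart (ODE).
sde_path = euler_maruyama(x0=2.0, drift=ou_drift, sigma=0.5, n_steps=100)
ode_path = euler_maruyama(x0=2.0, drift=ou_drift, sigma=0.0, n_steps=100)
```

The only difference between the ODE and SDE update is the `sigma * sqrt(h) * noise` term, which is why the two are usually taught together.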
Week 2: Flow Matching, Score Matching & Training
| Material | Topic |
|---|---|
| Lectures 3-4 | Flow Matching, Score Matching, guidance, classifier-free guidance |
| Labs 2-3 | Building a toy diffusion model from scratch |
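The straight-line conditional path often used in flow matching can be sketched in a few lines. This is a one-dimensional toy under stated assumptions: a noise sample `x0`, a single data point `z`, the convention that t = 0 is noise and t = 1 is data, and hypothetical function names. It is not code from the course labs. In training, a network would be regressed onto `cfm_target`; here the known conditional velocity is integrated directly.

```python
def sample_path(x0, z, t):
    """Point on the conditional path x_t = (1 - t) * x0 + t * z."""
    return (1.0 - t) * x0 + t * z


def conditional_velocity(x, z, t):
    """Velocity u_t(x | z) = (z - x) / (1 - t) that transports x toward z by t = 1."""
    return (z - x) / (1.0 - t)


def cfm_target(x0, z):
    """Regression target along the straight path: the constant velocity z - x0."""
    return z - x0


def euler_flow(x0, z, n_steps):
    """Integrate dx/dt = u_t(x | z) from t = 0 to t = 1 with Euler steps."""
    h = 1.0 / n_steps
    x = x0
    for k in range(n_steps):
        t = k * h  # t stays strictly below 1, so the division is safe
        x = x + h * conditional_velocity(x, z, t)
    return x


x_end = euler_flow(x0=-1.5, z=2.0, n_steps=50)
```

Because the path is a straight line, the velocity is constant along each trajectory, so Euler integration lands on `z` up to floating-point error. Curved paths would need more steps or a better solver.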
Week 3: Generative Robotics & Review
| Material | Topic |
|---|---|
| Lecture 5 | Guest lecture by Benjamin Burchfiel (Toyota Research): diffusion models for robotics |
| Lecture 6 | Generative protein design (optional) |
Phase 2: Early Foundation Robot Models & Robot Policy (Weeks 4-5)
Week 4: Early Foundation Robot Models (RT-1, RT-2, Octo, OpenVLA)
| # | Paper | Link | Key Topic |
|---|---|---|---|
| 1 | RT-1: Robotics Transformer, Brohan et al. (2022) | 2212.06817 | First large-scale Robotics Transformer (no VLM) |
| 2 | RT-2: Vision-Language-Action Models, Brohan et al. (2023) | 2307.15818 | VLM backbone → VLA paradigm |
| 3 | Octo, Ghosh et al. (2024) | 2405.12213 | Open-source generalist policy, modular design, pretrained on OXE (no VLM) |
| 4 | OpenVLA, Kim et al. (2024) | 2406.09246 | First open-source VLM-based VLA |
Supplementary video: Stanford CS25 V3, Low-level Embodied Intelligence
Key points: The step from RT-1 (35M, no VLM) to RT-2 (55B VLM, actions as text tokens) establishes the VLA concept. Octo (27M-93M, diffusion head, no VLM) and OpenVLA (7B, VLM + 256-bin discretization) are the first open-source generalist robot policies, which opened the way for community iteration.
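To make "256-bin discretization" and "actions as text tokens" concrete, here is a minimal sketch of uniform binning with a text round trip. Several things here are assumptions rather than details from the papers: the fixed action range of [-1, 1], the space-separated integer format, and every function name. Real models map bins onto entries of the VLM vocabulary and pick the range from dataset statistics.

```python
N_BINS = 256


def action_to_bins(action, low=-1.0, high=1.0, n_bins=N_BINS):
    """Map each continuous action dimension to an integer bin index."""
    bins = []
    for a in action:
        a = min(max(a, low), high)  # clip to the assumed range
        frac = (a - low) / (high - low)
        bins.append(min(int(frac * n_bins), n_bins - 1))  # upper edge goes to last bin
    return bins


def bins_to_action(bins, low=-1.0, high=1.0, n_bins=N_BINS):
    """Decode bin indices back to bin-center values."""
    width = (high - low) / n_bins
    return [low + (b + 0.5) * width for b in bins]


def bins_to_text(bins):
    """Render bin indices as a string a language model could emit (hypothetical format)."""
    return " ".join(str(b) for b in bins)


def text_to_bins(text):
    """Parse the hypothetical text format back into bin indices."""
    return [int(tok) for tok in text.split()]


action = [-1.0, 0.0, 0.5, 1.0]
text = bins_to_text(action_to_bins(action))
decoded = bins_to_action(text_to_bins(text))
```

The round trip loses at most half a bin width per dimension, which is the precision cost of treating control as next-token prediction.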
Week 5: Core Robot Policies (Diffusion Policy, ACT, BeT)
| # | Paper | Link | Key Topic |
|---|---|---|---|
| 5 | Behavior Transformers (BeT), Shafiullah et al. (2022) | 2206.11251 | Multimodal action discretization, k-means + offset |
| 6 | Diffusion Policy, Chi et al. (2023) | 2303.04137 | Diffusion for robot control, action sequence prediction |
| 7 | ACT/ALOHA, Zhao et al. (2023) | 2304.13705 | Action Chunking Transformer, CVAE, bimanual |
Key points: Three approaches to the multimodal action problem, where the same observation admits several valid actions. Action chunking (predicting K future actions at once) is foundational for later VLA work.
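Action chunking can be illustrated with a toy control loop. This is a sketch under stated assumptions: `dummy_chunk_policy` is a hypothetical stand-in for a learned model, the state and actions are scalars, and the chunk is executed fully open-loop. It leaves out refinements such as ACT's temporal ensembling.

```python
def dummy_chunk_policy(observation, chunk_size):
    """Hypothetical policy: returns chunk_size future actions for one observation."""
    return [observation + 0.1 * (i + 1) for i in range(chunk_size)]


def run_with_chunking(policy, n_steps, chunk_size):
    """Query the policy once per chunk and execute the chunk open-loop."""
    observation = 0.0
    executed, policy_calls = [], 0
    queue = []
    for _ in range(n_steps):
        if not queue:  # chunk exhausted: ask the policy for the next K actions
            queue = policy(observation, chunk_size)
            policy_calls += 1
        action = queue.pop(0)
        executed.append(action)
        observation = action  # toy dynamics: the state simply becomes the action
    return executed, policy_calls


executed, calls_k4 = run_with_chunking(dummy_chunk_policy, n_steps=12, chunk_size=4)
_, calls_k1 = run_with_chunking(dummy_chunk_policy, n_steps=12, chunk_size=1)
```

With K = 4 the policy is queried 3 times over 12 steps instead of 12 times. That saving is why chunking pairs well with large, slow models, at the price of reacting to new observations only once per chunk.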
Phase 3: Current RFM Architectures (Weeks 6-7)
Week 6: VLM + Action Head (CogACT, GR00T N1, X-VLA)
| # | Paper | Link | Key Topic |
|---|---|---|---|
| 8 | CogACT, Li et al. (2024) | 2411.19650 | VLM + DiT action head, action token learning |
| 9 | GR00T N1, Bjorck et al. (2025) | 2503.14734 | 2B diffusion transformer, whole-body humanoid control |
| 10 | X-VLA, Zheng et al. (2025) | 2510.10274 | Soft prompts for cross-embodiment, Florence-Large + flow matching |
Key points: All three use only the VLM's last hidden state to drive a separate action head.
Week 7: VLM + Action Expert (π0, InternVLA-M1)
| # | Paper | Link | Key Topic |
|---|---|---|---|
| 11 | π0, Black et al. (2024) | 2410.24164 | Flow matching + action expert accessing VLM intermediate features |
| 12 | InternVLA-M1, Chen et al. (2025) | 2510.13778 | Spatial grounding → action generation, AR-based |
Background: Transfusion, Zhou et al. (2024) | 2408.11039: AR + diffusion in one transformer; π0's architectural basis
Key points: Unlike Week 6's action heads that only see the VLM's last hidden state, these action experts access VLM internal hidden states.
Phase 4: Data Scaling (Weeks 8-9)
Week 8: Large-Scale Robot Datasets (OXE, AgiBot World)
| # | Paper | Link | Key Topic |
|---|---|---|---|
| 13 | Open X-Embodiment (OXE), Open X-Embodiment Collaboration (2023) | 2310.08864 | 1M+ trajectories, 22 embodiments, standardized data format |
| 14 | AgiBot World, Bu et al. (2025) | 2503.06669 | 1M+ trajectories, 217 tasks, 5 deployment scenarios |
Data formats. Recording-oriented: rosbag (ROS 1), mcap (vendor-neutral, ROS 2 default). Training-oriented: RLDS (TensorFlow/OXE standard), LeRobotDataset (Hugging Face, Parquet + video).
From the Evolution of Rosbag to the Future of AI Tooling: by the original rosbag author; covers the evolution rosbag V1 → V2 → rosbag2 (sqlite3) → MCAP
Key points: Large-scale multi-embodiment datasets enable generalist robot policy pretraining. OXE standardized the data format across 22 robot embodiments via RLDS; AgiBot World provides high-quality data at scale.
Week 9: Data Collection Methods (UMI, VITRA, Human to Robot Transfer)
| # | Paper | Link | Key Topic |
|---|---|---|---|
| 15 | UMI, Chi et al. (2024) | 2402.10329 | Robot-free SE(3) data collection via handheld gripper |
| 16 | VITRA, Li et al. (2025) | 2510.21571 | Human video → VLA training data (1M episodes from egocentric human videos) |
| 17 | Human to Robot Transfer, Kareer et al. (2025) | 2512.22414 | Human video → robot transfer emerges with VLA scaling |
Key points: Three data sources beyond robot teleoperation: UMI (embodiment-agnostic physical demos, <$200 hardware), egocentric video, and exocentric video.
Phase 5: Efficient Inference & Dual-System (Weeks 10-11)
Week 10: Fast-Acting VLA (SmolVLA & RTC)
| # | Paper | Link | Key Topic |
|---|---|---|---|
| 18 | SmolVLA, Shukor et al. (2025) | 2506.01844 | 450M params (~1/7 of π0), model compression + async inference |
| 19 | RTC, Black et al. (2025) | 2506.07339 | Async inference: freezing + inpainting, no retraining needed |
Key points: Two complementary approaches: SmolVLA compresses the model itself, while RTC optimizes the inference pipeline. The two can be combined.
Week 11: Dual-System VLA (Helix & Fast-in-Slow)
| # | Paper | Link | Key Topic |
|---|---|---|---|
| 20 | Helix, Figure AI (2025) | figure.ai/news/helix | S2: 7B VLM at 7-9 Hz, S1: 80M at 200 Hz, humanoid |
| 21 | Fast-in-Slow, Chen et al. (2025) | 2506.01953 | Integrated dual-system, end-to-end trainable |
Key points: Dual-system designs separate slow reasoning (VLM) from fast execution (lightweight policy), running the two at different frequencies. Helix trains the two systems separately; Fast-in-Slow is end-to-end trainable.
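The two-rate idea can be sketched as a single loop in which a slow module refreshes a latent every few ticks and a fast module acts on every tick. Everything concrete here is an assumption for illustration: the 8 Hz slow rate (picked from inside the 7-9 Hz range above), the synchronous loop (real systems run the two modules asynchronously), and the hypothetical `slow_system` and `fast_system` stand-ins.

```python
def slow_system(tick):
    """Hypothetical slow module (VLM): returns a latent tagged with when it was computed."""
    return {"computed_at": tick}


def fast_system(latent, tick):
    """Hypothetical fast module: acts on the most recent latent at every control tick."""
    return (latent["computed_at"], tick)


def run_dual_system(duration_s=1.0, fast_hz=200, slow_hz=8):
    """Simulate a fast control loop that reuses a latent refreshed at a lower rate."""
    ticks = int(duration_s * fast_hz)
    period = fast_hz // slow_hz  # fast ticks per slow update (25 here)
    latent = None
    slow_updates, actions = 0, []
    for tick in range(ticks):
        if tick % period == 0:  # time for the slow module to refresh the latent
            latent = slow_system(tick)
            slow_updates += 1
        actions.append(fast_system(latent, tick))
    return slow_updates, actions


slow_updates, actions = run_dual_system()
```

In this toy run the fast module produces 200 actions from only 8 slow updates, so each latent is reused for 25 consecutive control steps.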
Phase 6: RL Fine-tuning, Reasoning & World Model (Weeks 12-14)
Week 12: RL Fine-tuning & Human-in-the-Loop (HIL-SERL, SimpleVLA-RL, π*0.6)
| # | Paper | Link | Key Topic |
|---|---|---|---|
| 22 | HIL-SERL, Luo et al. (2024) | 2410.21845 | Human-in-the-loop RL, sample-efficient real-world training |
| 23 | SimpleVLA-RL, Li et al. (2025) | 2509.09674 | RL fine-tuning for autoregressive VLA, outcome-based rewards |
| 24 | π*0.6 / Recap, Physical Intelligence (2025) | 2511.14759 | RL for flow-based VLA, advantage-conditioned, learns from suboptimal data |
Key points: Three RL approaches: HIL-SERL (human-in-the-loop, sample-efficient), SimpleVLA-RL (outcome rewards), π*0.6 (advantage-conditioned, learns from suboptimal data).
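As a rough illustration of how a sparse outcome reward (success = 1, failure = 0) becomes a learning signal, the sketch below normalizes rewards within a group of rollouts of the same task, in the style of GRPO, and then binarizes the result into an "improvement" label of the kind an advantage-conditioned policy could be conditioned on. Treat this as an assumption-laden toy: it is not the exact recipe of any paper listed above, and the function names, the `eps` constant, and the zero threshold are all hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize outcome rewards within one group of rollouts of the same task."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; zero if all outcomes agree
    return [(r - mean) / (std + eps) for r in rewards]


def improvement_labels(advantages, threshold=0.0):
    """Binarize advantages into a conditioning label (hypothetical threshold)."""
    return [a > threshold for a in advantages]


# Eight rollouts of one task: 1 = success, 0 = failure.
rewards = [1, 0, 0, 1, 1, 0, 0, 0]
advantages = group_relative_advantages(rewards)
labels = improvement_labels(advantages)
```

Successful rollouts receive positive advantages and failures negative ones. A group whose rollouts all succeed or all fail yields zero advantages and therefore no gradient signal, which is a known limitation of purely group-relative schemes.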
Week 13: Reasoning VLA (CoT-VLA, ThinkAct, Fast-ThinkAct)
| # | Paper | Link | Key Topic |
|---|---|---|---|
| 25 | CoT-VLA, Zhao et al. (2025) | 2503.22020 | Visual chain-of-thought reasoning (future image prediction) before action |
| 26 | ThinkAct, Huang et al. (2025) | 2507.16815 | Decouple reasoning from execution; RL grounds plan quality in task success, not language supervision |
| 27 | Fast-ThinkAct, Huang et al. (2026) | 2601.09708 | Text-level CoT is dispensable: latent distillation preserves planning capacity at ~10× speed |
Fast-ThinkAct's reasoning compression is orthogonal to Week 10's efficiency methods (SmolVLA, RTC); the two can stack.
Key points: Reasoning representation: image tokens (CoT-VLA) vs. visual latent (ThinkAct) vs. compressed latent tokens (Fast-ThinkAct). ThinkAct grounds reasoning in task-outcome RL instead of language supervision. Fast-ThinkAct shows that planning structure, not verbosity, carries the signal (~10× faster, performance preserved).
Week 14: World Model (UniVLA, Cosmos Policy, DreamZero)
| # | Paper | Link | Key Topic |
|---|---|---|---|
| 28 | UniVLA, Wang et al. (2025) | 2506.19850 | Unified AR VLA with world modeling as training objective |
| 29 | Cosmos Policy, Kim et al. (2026) | 2601.16163 | Pretrained video foundation model as robot policy backbone |
| 30 | DreamZero, Ye et al. (2026) | dreamzero0.github.io | World Action Model, joint world+action generation in latent space |
Key points: Three ways to leverage world knowledge: as a training regularizer (UniVLA, no world prediction at inference), as a pretrained video FM used as the policy backbone (Cosmos Policy), or as joint world+action generation in latent space (DreamZero).
Contributing
Suggestions for papers, resources, or structural improvements are welcome; please open an issue or PR.
See Also
- vla0-trl: A complete VLA in ~1,200 lines of Python. Fine-tunes Qwen2.5-VL with TRL's SFTTrainer to predict actions as text, scoring ~90% on LIBERO. Read the entire codebase in an afternoon.
- vla-eval: One framework to evaluate any VLA model on any robot simulation benchmark.
- Awesome-RL-VLA: RL for VLA models
- Awesome-VLA-Robotics: Large-scale VLA paper collection
- awesome-physical-ai: A curated list of academic papers and resources on Physical AI
Recommended Courses
Courses covering the prerequisites for this study guide, limited to those with recent (2023+) video lectures freely available on YouTube. Pick what you need.
| Area | Course | Instructor | Link | Notes |
|---|---|---|---|---|
| DL Fundamentals | MIT 6.S191: Intro to Deep Learning | Alexander Amini | introtodeeplearning.com · YouTube '25 | 1-week bootcamp (10 lectures): CNN, Transformer, generative models, RL |
| DL Fundamentals | Andrej Karpathy: Neural Networks: Zero to Hero | Andrej Karpathy | karpathy.ai/zero-to-hero.html · YouTube | Backprop → GPT, build everything from scratch in code |
| Vision | Stanford CS231n: DL for Computer Vision | Fei-Fei Li et al. | cs231n.stanford.edu · YouTube '25 | The canonical CV course: backprop to detection/segmentation/video |
| NLP / Transformers | Stanford CS224n: NLP with Deep Learning | Christopher Manning | web.stanford.edu/class/cs224n · YouTube '24 | Word vectors → Transformers → LLMs |
| RL | UC Berkeley CS285: Deep RL | Sergey Levine | rail.eecs.berkeley.edu/deeprlcourse · YouTube '23 | Policy gradients, Q-learning, model-based & offline RL; taught by a leading robotics RL researcher |