🦾 Awesome VLA Study

March 21, 2026 · View on GitHub

Getting started with VLA? This guide takes you from the foundations to the frontier — diffusion and flow matching, state-of-the-art robot foundation model architectures, data scaling, RL fine-tuning, and world models. Papers in reading order.

📋 Prerequisites

Basic probability & optimization (enough to follow ELBO, score matching derivations)
Deep learning fundamentals (Transformers, attention, tokenization)
- 💡 Starting from scratch? MIT 6.S191 — Intro to Deep Learning covers CNNs, Transformers, and generative models in a 1-week intensive bootcamp. More courses below.

💬 Weekly Format (Recommended)

Paper presentation: 1–2 participants per week, 30 min/paper — architecture, training, key results
Discussion: Compare design choices across the week's papers, discuss limitations and open questions (15–20 min)

Phase	Weeks	Topic	Readings
Phase 1	W1–3	Generative Model Foundations	MIT 6.S184 course
Phase 2	W4–5	Early Foundation RFMs & Robot Policy	RT-1, RT-2, Octo, OpenVLA, BeT, Diffusion Policy, ACT
Phase 3	W6–7	Current RFM Architectures	CogACT, GR00T N1, X-VLA, π0, InternVLA-M1
Phase 4	W8–9	Data Scaling	OXE, AgiBot World, UMI, VITRA, Human to Robot Transfer
Phase 5	W10–11	Efficient Inference & Dual-System	RTC, SmolVLA, Helix, Fast-in-Slow
Phase 6	W12–14	RL Fine-tuning, Reasoning & World Model	HIL-SERL, SimpleVLA-RL, π*0.6, CoT-VLA, ThinkAct, Fast-ThinkAct, UniVLA, Cosmos Policy, DreamZero

Phase 1: Generative Model Foundations (Weeks 1–3)

📚 Core Material: MIT 6.S184 — Introduction to Flow Matching and Diffusion Models (Holderrieth & Erives, MIT CSAIL, 2025) | Course notes paper

Week 1: ODE/SDE Foundations & Diffusion Models

Material	Topic
Lectures 1–2	ODE/SDE basics, forward/reverse processes, conditional/marginal probability paths
Lab 1	Hands-on SDE simulation

Week 2: Flow Matching, Score Matching & Training

Material	Topic
Lectures 3–4	Flow Matching, Score Matching, guidance, classifier-free guidance
Labs 2–3	Building a toy diffusion model from scratch

Week 3: Generative Robotics & Review

Material	Topic
Lecture 5	Guest lecture by Benjamin Burchfiel (Toyota Research): diffusion models for robotics
Lecture 6	Generative protein design (optional)

Phase 2: Early Foundation Robot Models & Robot Policy (Weeks 4–5)

Week 4: Early Foundation Robot Models — RT-1, RT-2, Octo, OpenVLA

#	Paper	Link	Key Topic
1	RT-1: Robotics Transformer — Brohan et al. (2022)	2212.06817	First large-scale Robotics Transformer (no VLM)
2	RT-2: Vision-Language-Action Models — Brohan et al. (2023)	2307.15818	VLM backbone → VLA paradigm
3	Octo — Ghosh et al. (2024)	2405.12213	Open-source generalist policy, modular design, pretrained on OXE (no VLM)
4	OpenVLA — Kim et al. (2024)	2406.09246	First open-source VLM-based VLA

📎 Supplementary video: Stanford CS25 V3 — Low-level Embodied Intelligence

Key points: RT-1 (35M, no VLM) → RT-2 (55B VLM, action as text tokens) establishes the VLA concept. Octo (27M–93M, diffusion head, no VLM) and OpenVLA (7B, VLM + 256-bin discretization) are the first open-source generalist robot policies enabling community iteration.

Week 5: Core Robot Policies — Diffusion Policy, ACT, BeT

#	Paper	Link	Key Topic
5	Behavior Transformers (BeT) — Shafiullah et al. (2022)	2206.11251	Multimodal action discretization, k-means + offset
6	Diffusion Policy — Chi et al. (2023)	2303.04137	Diffusion for robot control, action sequence prediction
7	ACT/ALOHA — Zhao et al. (2023)	2304.13705	Action Chunking Transformer, CVAE, bimanual

Key points: Three approaches to the multimodal action problem. Action chunking (predicting K future actions at once) is foundational for later VLA work.

Phase 3: Current RFM Architectures (Weeks 6–7)

Week 6: VLM + Action Head — CogACT, GR00T N1, X-VLA

#	Paper	Link	Key Topic
8	CogACT — Li et al. (2024)	2411.19650	VLM + DiT action head, action token learning
9	GR00T N1 — Bjorck et al. (2025)	2503.14734	2B diffusion transformer, whole-body humanoid control
10	X-VLA — Zheng et al. (2025)	2510.10274	Soft prompts for cross-embodiment, Florence-Large + flow matching

Key points: All three use only the VLM's last hidden state to drive a separate action head.

Week 7: VLM + Action Expert — π0, InternVLA-M1

#	Paper	Link	Key Topic
11	π0 — Black et al. (2024)	2410.24164	Flow matching + action expert accessing VLM intermediate features
12	InternVLA-M1 — Chen et al. (2025)	2510.13778	Spatial grounding → action generation, AR-based

📎 Background: Transfusion — Zhou et al. (2024) | 2408.11039 — AR + diffusion in one transformer; π0's architectural basis

Key points: Unlike Week 6's action heads that only see the VLM's last hidden state, these action experts access VLM internal hidden states.

Phase 4: Data Scaling (Weeks 8–9)

Week 8: Large-Scale Robot Datasets — OXE, AgiBot World

#	Paper	Link	Key Topic
13	Open X-Embodiment (OXE) — Open X-Embodiment Collaboration (2023)	2310.08864	1M+ trajectories, 22 embodiments, standardized data format
14	AgiBot World — Bu et al. (2025)	2503.06669	1M+ trajectories, 217 tasks, 5 deployment scenarios

📎 Data formats — Recording-oriented: rosbag (ROS 1), mcap (vendor-neutral, ROS 2 default). Training-oriented: RLDS (TensorFlow/OXE standard), LeRobotDataset (HuggingFace, Parquet + video).
📎 From the Evolution of Rosbag to the Future of AI Tooling — by the original rosbag author; covers rosbag V1→V2 → rosbag2 (sqlite3) → MCAP evolution

Key points: Large-scale multi-embodiment datasets that enable generalist robot policy pretraining. OXE standardized the data format across 22 robot embodiments via RLDS; AgiBot World provides high-quality data at scale.

Week 9: Data Collection Methods — UMI, VITRA, Human to Robot Transfer

#	Paper	Link	Key Topic
15	UMI — Chi et al. (2024)	2402.10329	Robot-free SE(3) data collection via handheld gripper
16	VITRA — Li et al. (2025)	2510.21571	Human video → VLA training data (1M episodes from egocentric human videos)
17	Human to Robot Transfer — Kareer et al. (2025)	2512.22414	Human video → robot transfer emerges with VLA scaling

Key points: Three data sources beyond robot teleoperation — UMI (embodiment-agnostic physical demos, <$200 hardware), egocentric video, and exocentric video.

Phase 5: Efficient Inference & Dual-System (Weeks 10–11)

Week 10: Fast-Acting VLA — SmolVLA & RTC

#	Paper	Link	Key Topic
18	SmolVLA — Shukor et al. (2025)	2506.01844	450M params (~1/7 of π0), model compression + async inference
19	RTC — Black et al. (2025)	2506.07339	Async inference — freezing + inpainting, no retraining needed

Key points: Two complementary approaches — SmolVLA compresses the model itself, RTC optimizes the inference pipeline. Can be combined.

Week 11: Dual-System VLA — Helix & Fast-in-Slow

#	Paper	Link	Key Topic
20	Helix — Figure AI (2025)	figure.ai/news/helix	S2: 7B VLM @7-9Hz, S1: 80M @200Hz, humanoid
21	Fast-in-Slow — Chen et al. (2025)	2506.01953	Integrated dual-system, end-to-end trainable

Key points: Dual-System separates slow reasoning (VLM) from fast execution (lightweight policy) at different frequencies. Helix (separately trained) vs Fast-in-Slow (end-to-end trainable).

Phase 6: RL Fine-tuning, Reasoning & World Model (Weeks 12–14)

Week 12: RL Fine-tuning & Human-in-the-Loop — HIL-SERL, SimpleVLA-RL, π*0.6

#	Paper	Link	Key Topic
22	HIL-SERL — Luo et al. (2024)	2410.21845	Human-in-the-loop RL, sample-efficient real-world training
23	SimpleVLA-RL — Li et al. (2025)	2509.09674	RL fine-tuning for autoregressive VLA, outcome-based rewards
24	*π0.6 / Recap** — Physical Intelligence (2025)	2511.14759	RL for flow-based VLA, advantage-conditioned, learns from suboptimal data

Key points: Three RL approaches — HIL-SERL (human-in-the-loop, sample-efficient), SimpleVLA-RL (outcome rewards), π*0.6 (advantage-conditioned, learns from suboptimal data).

Week 13: Reasoning VLA — CoT-VLA, ThinkAct, Fast-ThinkAct

#	Paper	Link	Key Topic
25	CoT-VLA — Zhao et al. (2025)	2503.22020	Visual chain-of-thought reasoning (future image prediction) before action
26	ThinkAct — Huang et al. (2025)	2507.16815	Decouple reasoning from execution; RL grounds plan quality in task success, not language supervision
27	Fast-ThinkAct — Huang et al. (2026)	2601.09708	Text-level CoT dispensable — latent distillation preserves planning capacity at ~10× speed

📎 Fast-ThinkAct's reasoning compression is orthogonal to Week 10's model compression (SmolVLA, RTC) — the two can stack.

Key points: Reasoning representation — image tokens (CoT-VLA) vs. visual latent (ThinkAct) vs. compressed latent tokens (Fast-ThinkAct). ThinkAct grounds reasoning in task-outcome RL instead of language supervision. Fast-ThinkAct shows planning structure, not verbosity, carries the signal (~10× faster, performance preserved).

Week 14: World Model — UniVLA, Cosmos Policy, DreamZero

#	Paper	Link	Key Topic
28	UniVLA — Wang et al. (2025)	2506.19850	Unified AR VLA with world modeling as training objective
29	Cosmos Policy — Kim et al. (2026)	2601.16163	Pretrained video foundation model as robot policy backbone
30	DreamZero — Ye et al. (2026)	dreamzero0.github.io	World Action Model, joint world+action generation in latent space

Key points: Three ways to leverage world knowledge — training regularizer (UniVLA, no world prediction at inference), pretrained video FM as policy backbone (Cosmos Policy), joint world+action generation in latent space (DreamZero).

Contributing

Suggestions for papers, resources, or structural improvements are welcome — please open an issue or PR.

📚 Recommended Courses

Courses covering the prerequisites for this study guide — only those with recent (2023+) video lectures freely available on YouTube. Pick what you need.

Area	Course	Instructor	Link	Notes
DL Fundamentals	MIT 6.S191: Intro to Deep Learning	Alexander Amini	introtodeeplearning.com · YouTube '25	1-week bootcamp (10 lectures) — CNN, Transformer, generative models, RL
	Andrej Karpathy: Neural Networks: Zero to Hero	Andrej Karpathy	karpathy.ai/zero-to-hero.html · YouTube	Backprop → GPT, build everything from scratch in code
Vision	Stanford CS231n: DL for Computer Vision	Fei-Fei Li et al.	cs231n.stanford.edu · YouTube '25	The canonical CV course — backprop to detection/segmentation/video
NLP / Transformers	Stanford CS224n: NLP with Deep Learning	Christopher Manning	web.stanford.edu/class/cs224n · YouTube '24	Word vectors → Transformers → LLMs
RL	UC Berkeley CS285: Deep RL	Sergey Levine	rail.eecs.berkeley.edu/deeprlcourse · YouTube '23	Policy gradients, Q-learning, model-based & offline RL — by a leading robotics RL researcher