$\mathcal{F}_1$: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

January 2, 2026

We introduce $\mathcal{F}_1$, a novel paradigm that integrates visual foresight generation into the decision-making pipeline. Our model employs a Mixture-of-Transformer architecture with dedicated modules for perception, foresight generation, and control, thereby bridging understanding, generation, and actions through predictive inverse dynamics modeling.

🚀 Key Innovations

🧠 Predictive Inverse Dynamics: Visual foresight generation for planning-based control
🏗️ Mixture-of-Transformer: Three specialized experts (Understanding, Generation, Action)
📈 Three-Stage Training: Progressive alignment, pretraining, and adaptation
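
To make the predictive inverse dynamics idea concrete, the toy sketch below chains three stand-in functions in the order the experts are listed: understanding, foresight generation, then action. It is not the $\mathcal{F}_1$ implementation. The 2-D point used as an "observation", the `move to x,y` instruction format, and every function name here are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

Vec = Tuple[float, float]  # toy "observation": a 2-D position instead of an image


@dataclass
class Understanding:
    """Hypothetical output of the understanding step: the parsed goal."""
    goal: Vec


def understand(observation: Vec, instruction: str) -> Understanding:
    # Stand-in for the understanding expert: parse "move to x,y".
    x, y = instruction.removeprefix("move to ").split(",")
    return Understanding(goal=(float(x), float(y)))


def generate_foresight(observation: Vec, ctx: Understanding, step: float = 0.5) -> Vec:
    # Stand-in for the generation expert: predict the next observation
    # by moving a fraction `step` of the way toward the goal.
    return (
        observation[0] + step * (ctx.goal[0] - observation[0]),
        observation[1] + step * (ctx.goal[1] - observation[1]),
    )


def inverse_dynamics(observation: Vec, foresight: Vec) -> Vec:
    # Stand-in for the action expert: the action is whatever displacement
    # turns the current observation into the predicted one.
    return (foresight[0] - observation[0], foresight[1] - observation[1])


def act(observation: Vec, instruction: str) -> Vec:
    ctx = understand(observation, instruction)
    foresight = generate_foresight(observation, ctx)
    return inverse_dynamics(observation, foresight)


if __name__ == "__main__":
    print(act((0.0, 0.0), "move to 2,4"))  # (1.0, 2.0)
```

The point of the sketch is the data flow: the action is derived from the pair (current observation, predicted future observation) rather than from the observation alone.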

🤖 Real-World Robot Experiments

Multi-task Manipulation

9 diverse manipulation tasks, including pick-and-place, handover, and complex object manipulation

Rapid Adaptation

Sweep and sort tasks demonstrating rapid embodiment adaptation capabilities

Long-horizon Planning

10-step sequential task over 2 minutes, showcasing long-term planning and execution

Dynamic Environment

Moving conveyor belt manipulation, demonstrating dynamic scene handling capabilities

Performance Summary

Task	Platform	$\mathcal{F}_1$	$\pi_0$	Improvement (percentage points)
Multi-task	Genie-1	82.2%	65.2%	+17.0%
Adaptation	Franka	66.7%	53.3%	+13.4%
Long-horizon	ARX LIFT II	40.0%	0.0%	+40.0%
Dynamic Env	ARX LIFT II	66.7%	33.3%	+33.4%
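
The Improvement column is consistent with an absolute difference between the $\mathcal{F}_1$ and $\pi_0$ columns, not a relative gain. The short sketch below recomputes it from the table values; the dictionary layout and function name are illustrative assumptions.

```python
# Values copied from the table above: (F1 %, pi0 %).
RESULTS = {
    "Multi-task": (82.2, 65.2),
    "Adaptation": (66.7, 53.3),
    "Long-horizon": (40.0, 0.0),
    "Dynamic Env": (66.7, 33.3),
}


def absolute_improvement(results: dict) -> dict:
    """Difference between the two columns in percentage points, rounded to one decimal."""
    return {task: round(f1 - pi0, 1) for task, (f1, pi0) in results.items()}


if __name__ == "__main__":
    for task, gain in absolute_improvement(RESULTS).items():
        print(f"{task}: +{gain:.1f}")
```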

🚀 Quick Start

Prerequisites

Python ≥ 3.10
torch ≥ 2.6.0
CUDA ≥ 12.4
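
A quick way to check these prerequisites is a small script like the one below. It is a sketch, not part of the repository: the function names are made up here, and it reports `None` for torch and CUDA when PyTorch is not installed yet.

```python
import importlib.util
import re
import sys


def version_tuple(version: str) -> tuple:
    """Turn a string such as '2.6.0+cu124' into (2, 6, 0), dropping any '+...' suffix."""
    return tuple(int(p) for p in re.findall(r"\d+", version.split("+")[0])[:3])


def check_prerequisites() -> dict:
    """Return True/False per requirement, or None when it cannot be checked."""
    report = {"python": sys.version_info[:2] >= (3, 10), "torch": None, "cuda": None}
    if importlib.util.find_spec("torch") is None:
        return report  # PyTorch is not installed yet
    import torch

    report["torch"] = version_tuple(torch.__version__) >= (2, 6, 0)
    cuda = torch.version.cuda  # None on CPU-only builds
    report["cuda"] = cuda is not None and version_tuple(cuda) >= (12, 4)
    return report


if __name__ == "__main__":
    print(check_prerequisites())
```

Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which may differ from the toolkit or driver installed on the machine.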

Installation

# Clone the repository
git clone https://github.com/InternRobotics/F1-VLA.git
# VLA_HOME is the directory that contains the F1-VLA clone
export VLA_HOME=$(pwd)
cd F1-VLA/f1_vla

# Create the environment
conda create -n f1_vla python=3.10
conda activate f1_vla

# Install PyTorch (CUDA 12.4 wheels) and TorchCodec
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 torchcodec==0.2.1 --index-url https://download.pytorch.org/whl/cu124

# Install f1_vla in editable mode
pip install -e .

# Pin NumPy to 1.26.4
pip install numpy==1.26.4

For optimal performance and compatibility, we highly recommend using FFmpeg alongside TorchCodec.

FFmpeg is an industry-standard multimedia framework that provides robust, all-purpose video and audio processing.
TorchCodec is a library specifically designed for deep learning workflows in PyTorch, offering highly optimized video I/O.

Used together, these two tools substantially speed up loading of the video dataset.
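
Before training, it can help to confirm that both tools are actually visible from the active environment. The check below is a minimal sketch using only the standard library; the function name is an assumption, and it only tests for presence, not for version compatibility between FFmpeg and TorchCodec.

```python
import importlib.util
import shutil


def video_stack_status() -> dict:
    """Report whether the ffmpeg binary and the torchcodec package can be found."""
    return {
        # shutil.which returns None when no `ffmpeg` executable is on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # find_spec locates the package without importing it.
        "torchcodec": importlib.util.find_spec("torchcodec") is not None,
    }


if __name__ == "__main__":
    for name, found in video_stack_status().items():
        print(f"{name}: {'found' if found else 'missing'}")
```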

Download Datasets and Pretrained Models

Name	Link
LIBERO_SPATIAL_NO_NOOPS_PATH	IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot
STAGE2_CKPT_PATH	F1_pretrain
LEROBOT_PI0_PATH	lerobot/pi0_base
PALIGEMMA_PATH	google/paligemma-3b-pt-224
VAE_PATH	vae_ch160v4096z32.pth
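
Three of the entries above look like Hugging Face Hub repository ids, so one possible way to fetch them is `huggingface_hub.snapshot_download`. The sketch below assumes exactly that: the Hub hosting, the repo types, and the local folder layout are assumptions, not something this README states. `F1_pretrain` and `vae_ch160v4096z32.pth` are left out because the table does not give a full repository id for them, and `google/paligemma-3b-pt-224` may require accepting its license and logging in first.

```python
from pathlib import Path

# name -> (repo_id, repo_type); the repo_type values are assumptions.
HUB_ASSETS = {
    "LIBERO_SPATIAL_NO_NOOPS_PATH": (
        "IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot",
        "dataset",
    ),
    "LEROBOT_PI0_PATH": ("lerobot/pi0_base", "model"),
    "PALIGEMMA_PATH": ("google/paligemma-3b-pt-224", "model"),
}


def plan_downloads(root: str) -> dict:
    """Map each name to a local target folder under `root` (no network access)."""
    return {
        name: Path(root) / repo_id.split("/")[-1]
        for name, (repo_id, _) in HUB_ASSETS.items()
    }


def download_all(root: str) -> dict:
    """Download every asset; requires `pip install huggingface_hub`."""
    from huggingface_hub import snapshot_download  # imported lazily on purpose

    targets = plan_downloads(root)
    for name, (repo_id, repo_type) in HUB_ASSETS.items():
        snapshot_download(
            repo_id=repo_id, repo_type=repo_type, local_dir=str(targets[name])
        )
    return targets


if __name__ == "__main__":
    for name, path in plan_downloads("./assets").items():
        print(f"{name} -> {path}")
```

The resulting local paths are the values you would then enter for the corresponding names when editing the config file.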

Basic Usage

f1_vla
├── config
│   ├── debug_test.yaml
│   └── f1_config.json
├── requirements.txt
├── setup.py
├── src
│   ├── configs
│   ├── models
│   ├── policies
│   ├── processors
│   └── utils
└── train_hf.py

Finetune

# 1. Edit the config file (paths below are relative to the repository root)
cd "$VLA_HOME/F1-VLA"
vim f1_vla/config/debug_test.yaml

# 2. Run the training script
python f1_vla/train_hf.py --config-file f1_vla/config/debug_test.yaml
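
If you prefer to launch finetuning from Python, for example from a job script, a thin wrapper like the following may be enough. It is a sketch: the function names are hypothetical, and the only thing taken from this README is the `--config-file` flag of `train_hf.py`. Checking that both files exist up front avoids a confusing failure when the working directory is not what you expected.

```python
import subprocess
import sys
from pathlib import Path


def build_finetune_command(script: str, config: str) -> list:
    """Return the argv list for the training run, failing early on missing files."""
    script_path, config_path = Path(script), Path(config)
    for path in (script_path, config_path):
        if not path.is_file():
            raise FileNotFoundError(f"missing file: {path}")
    # sys.executable keeps the run inside the currently active environment.
    return [sys.executable, str(script_path), "--config-file", str(config_path)]


def run_finetune(script: str, config: str) -> int:
    """Run training as a subprocess and return its exit code."""
    return subprocess.run(build_finetune_command(script, config), check=False).returncode


if __name__ == "__main__":
    # Example paths; adjust them to where the repository was cloned.
    print(build_finetune_command("f1_vla/train_hf.py", "f1_vla/config/debug_test.yaml"))
```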

📚 Citation

If you use this work in your research, please cite our paper:

@article{lv2025f1,
  title={F1: A vision-language-action model bridging understanding and generation to actions},
  author={Lv, Qi and Kong, Weijie and Li, Hao and Zeng, Jia and Qiu, Zherui and Qu, Delin and Song, Haoming and Chen, Qizhi and Deng, Xiang and Pang, Jiangmiao},
  journal={arXiv preprint arXiv:2509.06951},
  year={2025}
}

📄 License

This project is licensed under the MIT License.