F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

January 2, 2026

We introduce $\mathcal{F}_1$, a novel paradigm that integrates visual foresight generation into the decision-making pipeline. Our model employs a Mixture-of-Transformer architecture with dedicated modules for perception, foresight generation, and control, thereby bridging understanding, generation, and actions through predictive inverse dynamics modeling.

Best viewed with sound on

🚀 Key Innovations

  • 🧠 Predictive Inverse Dynamics: Visual foresight generation for planning-based control
  • 🏗️ Mixture-of-Transformer: Three specialized experts (Understanding, Generation, Action)
  • 📈 Three-Stage Training: Progressive alignment, pretraining, and adaptation
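
As a rough illustration of the predictive inverse dynamics idea, the toy sketch below works on 2-D points rather than images. It is not the F1 implementation: the function names `predict_foresight` and `inverse_dynamics`, the fixed step size, and the additive dynamics (next state = state + action) are all assumptions made for illustration. A foresight step first proposes the next state, and an inverse dynamics step then recovers the action that reaches it.

```python
from typing import Tuple

State = Tuple[float, float]
Action = Tuple[float, float]


def predict_foresight(state: State, goal: State, step: float = 0.25) -> State:
    """Hypothetical foresight model: predict a next state part of the way to the goal."""
    return (
        state[0] + step * (goal[0] - state[0]),
        state[1] + step * (goal[1] - state[1]),
    )


def inverse_dynamics(state: State, next_state: State) -> Action:
    """Assumed additive dynamics (next = state + action): the action is the difference."""
    return (next_state[0] - state[0], next_state[1] - state[1])


def step_env(state: State, action: Action) -> State:
    """Toy environment that applies the assumed additive dynamics."""
    return (state[0] + action[0], state[1] + action[1])


def rollout(state: State, goal: State, n_steps: int) -> State:
    """Alternate foresight prediction and inverse dynamics for n_steps."""
    for _ in range(n_steps):
        foresight = predict_foresight(state, goal)
        action = inverse_dynamics(state, foresight)
        state = step_env(state, action)
    return state
```

In this toy setting the controller never plans an action directly: it only predicts where it wants to be and lets the inverse dynamics step work out how to get there.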

🤖 Real-World Robot Experiments

Multi-task Manipulation

9 diverse manipulation tasks including pick-and-place, handover, and complex object manipulation

Rapid Adaptation

Sweep and sort tasks demonstrating rapid embodiment adaptation capabilities

Long-horizon Planning

10-step sequential task over 2 minutes, showcasing long-term planning and execution

Dynamic Environment

Moving conveyor belt manipulation, demonstrating dynamic scene handling capabilities

Performance Summary

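| Task | Platform | $\mathcal{F}_1$ | $\pi_0$ | Improvement |
| --- | --- | --- | --- | --- |
| Multi-task | Genie-1 | 82.2% | 65.2% | +17.0% |
| Adaptation | Franka | 66.7% | 53.3% | +13.4% |
| Long-horizon | ARX LIFT II | 40.0% | 0.0% | +40.0% |
| Dynamic Env | ARX LIFT II | 66.7% | 33.3% | +33.4% |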
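
The Improvement column is consistent with an absolute difference in success rate (percentage points) rather than a relative gain. A relative gain would be undefined for the long-horizon task, where the baseline scores 0.0%. The short check below recomputes the column from the reported numbers. The dictionary layout and the helper name `improvement_points` are illustrative only.

```python
# Success rates (%) copied from the performance summary: (F1, pi_0)
results = {
    "Multi-task": (82.2, 65.2),
    "Adaptation": (66.7, 53.3),
    "Long-horizon": (40.0, 0.0),
    "Dynamic Env": (66.7, 33.3),
}


def improvement_points(f1: float, pi0: float) -> float:
    """Absolute improvement in percentage points, rounded to one decimal."""
    return round(f1 - pi0, 1)


improvements = {task: improvement_points(*rates) for task, rates in results.items()}

for task, delta in improvements.items():
    print(f"{task}: +{delta:.1f} points")
```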

🚀 Quick Start

Prerequisites

  • Python ≥ 3.10
  • torch ≥ 2.6.0
  • CUDA ≥ 12.4
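
Before installing, it can help to check an existing environment against these minimums. The sketch below is one possible check, not part of the repository: `version_tuple` and `check_prerequisites` are hypothetical helpers, and the CUDA entry reflects the CUDA version PyTorch was built against (`torch.version.cuda`), which is only a proxy for the toolkit or driver actually installed on the machine.

```python
import re
import sys
from importlib import metadata


def version_tuple(version: str, parts: int = 3) -> tuple:
    """Parse leading numeric components, e.g. '2.6.0+cu124' -> (2, 6, 0)."""
    numbers = re.findall(r"\d+", version.split("+")[0])
    return tuple(int(n) for n in numbers[:parts])


def check_prerequisites() -> dict:
    """Return a pass/fail flag for each stated minimum version."""
    report = {"python": sys.version_info[:2] >= (3, 10), "torch": False, "cuda": False}
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        return report  # PyTorch is not installed, so torch and CUDA both fail
    report["torch"] = version_tuple(torch_version) >= (2, 6, 0)
    try:
        import torch  # imported lazily so the check still runs without PyTorch
    except ImportError:
        return report
    cuda = torch.version.cuda  # None for CPU-only builds
    report["cuda"] = cuda is not None and version_tuple(cuda, 2) >= (12, 4)
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```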

Installation

# Clone repository
git clone https://github.com/InternRobotics/F1-VLA.git
export VLA_HOME=$(pwd)
cd F1-VLA/f1_vla

# Create environment
conda create -n f1_vla python=3.10
conda activate f1_vla

# Install dependencies
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 torchcodec==0.2.1 --index-url https://download.pytorch.org/whl/cu124

# Install f1_vla in editable mode
pip install -e .

# Pin NumPy to 1.26.4
pip install numpy==1.26.4

For optimal performance and compatibility, we highly recommend using FFmpeg alongside TorchCodec.

  • FFmpeg is an industry-standard multimedia framework that provides robust, all-purpose video and audio processing.
  • TorchCodec is a library specifically designed for deep learning workflows in PyTorch, offering highly optimized video I/O.

Used together, these two tools substantially reduce the time needed to load the video dataset.
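
A quick way to confirm that both tools are visible from the active environment is sketched below. The helper name `video_backend_status` is hypothetical. It only checks that an `ffmpeg` executable is on `PATH` and that a `torchcodec` package can be located. It does not verify that the two versions are compatible with each other.

```python
import shutil
from importlib import util


def video_backend_status() -> dict:
    """Report whether FFmpeg and TorchCodec can be found in this environment."""
    return {
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        "torchcodec_installed": util.find_spec("torchcodec") is not None,
    }


if __name__ == "__main__":
    for name, found in video_backend_status().items():
        print(f"{name}: {found}")
```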

Download Pretrained Datasets and Models

| Name | Link |
| --- | --- |
| LIBERO_SPATIAL_NO_NOOPS_PATH | IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot |
| STAGE2_CKPT_PATH | F1_pretrain |
| LEROBOT_PI0_PATH | lerobot/pi0_base |
| PALIGEMMA_PATH | google/paligemma-3b-pt-224 |
| VAE_PATH | vae_ch160v4096z32.pth |
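
The entries that are given as full `owner/name` identifiers can be fetched programmatically. The sketch below assumes they are hosted on the Hugging Face Hub, that the LIBERO entry is a dataset repository, and that the others are model repositories. None of this is stated in the table. `F1_pretrain` and `vae_ch160v4096z32.pth` are left out because the table gives only their names, not a full location. The `downloads` directory and the helper names are illustrative, and `google/paligemma-3b-pt-224` may require accepting its license and logging in first.

```python
from pathlib import Path

# Only entries with a full repository identifier in the table are listed here.
# The repo_type values are assumptions.
DOWNLOADS = [
    {
        "name": "LIBERO_SPATIAL_NO_NOOPS_PATH",
        "repo_id": "IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot",
        "repo_type": "dataset",
    },
    {"name": "LEROBOT_PI0_PATH", "repo_id": "lerobot/pi0_base", "repo_type": "model"},
    {"name": "PALIGEMMA_PATH", "repo_id": "google/paligemma-3b-pt-224", "repo_type": "model"},
]


def local_dir_for(entry: dict, root: str) -> Path:
    """Place each repository in a folder named after the last part of its ID."""
    return Path(root) / entry["repo_id"].split("/")[-1]


def download_all(root: str = "downloads") -> dict:
    """Download every listed repository and return a name -> local path mapping."""
    from huggingface_hub import snapshot_download  # requires `pip install huggingface_hub`

    paths = {}
    for entry in DOWNLOADS:
        target = local_dir_for(entry, root)
        snapshot_download(
            repo_id=entry["repo_id"],
            repo_type=entry["repo_type"],
            local_dir=str(target),
        )
        paths[entry["name"]] = target
    return paths
```

Calling `download_all()` returns a mapping from the names in the table to local directories, which can then be copied into the training config.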

Basic Usage

f1_vla
├── config
│   ├── debug_test.yaml
│   └── f1_config.json
├── requirements.txt
├── setup.py
├── src
│   ├── configs
│   ├── models
│   ├── policies
│   ├── processors
│   └── utils
└── train_hf.py

Finetune

# 1. edit config file
vim f1_vla/config/debug_test.yaml

# 2. run the program
cd "$VLA_HOME"
python train_hf.py --config-file f1_vla/config/debug_test.yaml
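
Editing the config by hand is straightforward, but the path substitution can also be scripted. The sketch below assumes the config file contains the names from the download table (for example `VAE_PATH`) as literal placeholders. That is a guess about `debug_test.yaml`, not something documented here. The helper `fill_placeholders`, the YAML keys, and the example paths are all hypothetical.

```python
import re
from typing import Mapping


def fill_placeholders(config_text: str, paths: Mapping[str, str]) -> str:
    """Replace whole-word placeholder names with local paths; other text is untouched."""
    if not paths:
        return config_text
    pattern = re.compile(r"\b(" + "|".join(re.escape(name) for name in paths) + r")\b")
    # A function replacement avoids backslash escapes being interpreted in paths.
    return pattern.sub(lambda match: paths[match.group(1)], config_text)


# Hypothetical config fragment and local paths, for illustration only.
example = "dataset_root: LIBERO_SPATIAL_NO_NOOPS_PATH\nvae: VAE_PATH\nckpt: STAGE2_CKPT_PATH\n"
filled = fill_placeholders(
    example,
    {
        "LIBERO_SPATIAL_NO_NOOPS_PATH": "/data/libero_spatial",
        "VAE_PATH": "/ckpt/vae_ch160v4096z32.pth",
    },
)
print(filled)
```

Names that are not in the mapping, such as `STAGE2_CKPT_PATH` above, are left as they are so that missing paths remain easy to spot.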

📚 Citation

If you use this work in your research, please cite our paper:

@article{lv2025f1,
  title={F1: A vision-language-action model bridging understanding and generation to actions},
  author={Lv, Qi and Kong, Weijie and Li, Hao and Zeng, Jia and Qiu, Zherui and Qu, Delin and Song, Haoming and Chen, Qizhi and Deng, Xiang and Pang, Jiangmiao},
  journal={arXiv preprint arXiv:2509.06951},
  year={2025}
}

📄 License

This project is licensed under the MIT License.

🙏 Acknowledgments