F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
January 2, 2026
We introduce F1, a novel paradigm that integrates visual foresight generation into the decision-making pipeline. The model employs a Mixture-of-Transformer architecture with dedicated modules for perception, foresight generation, and control, bridging understanding, generation, and action through predictive inverse dynamics modeling.
Key Innovations
- Predictive Inverse Dynamics: Visual foresight generation for planning-based control
- Mixture-of-Transformer: Three specialized experts (Understanding, Generation, Action)
- Three-Stage Training: Progressive alignment, pretraining, and adaptation
Real-World Robot Experiments
Multi-task Manipulation
9 diverse manipulation tasks including pick-and-place, handover, and complex object manipulation
Rapid Adaptation
Sweep and sort tasks demonstrating rapid embodiment adaptation capabilities
Long-horizon Planning
10-step sequential task over 2 minutes, showcasing long-term planning and execution
Dynamic Environment
Moving conveyor belt manipulation, demonstrating dynamic scene handling capabilities
Performance Summary
| Task | Platform | F1 | Baseline | Improvement |
|---|---|---|---|---|
| Multi-task | Genie-1 | 82.2% | 65.2% | +17.0% |
| Adaptation | Franka | 66.7% | 53.3% | +13.4% |
| Long-horizon | ARX LIFT II | 40.0% | 0.0% | +40.0% |
| Dynamic Env | ARX LIFT II | 66.7% | 33.3% | +33.4% |
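Each Improvement value is the difference, in percentage points, between the two success rates in its row. The following minimal Python sketch reproduces that arithmetic from the numbers in the table. The names `results` and `improvement_pp` are illustrative, and treating the higher rate as the model and the lower rate as a baseline is an assumption.

```python
# Success rates (%) copied from the table, as (higher rate, lower rate).
# Treating them as "model" and "baseline" is an assumption for illustration.
results = {
    "Multi-task": (82.2, 65.2),
    "Adaptation": (66.7, 53.3),
    "Long-horizon": (40.0, 0.0),
    "Dynamic Env": (66.7, 33.3),
}


def improvement_pp(model_rate, baseline_rate):
    """Absolute gain in percentage points, rounded to one decimal place."""
    return round(model_rate - baseline_rate, 1)


improvements = {task: improvement_pp(m, b) for task, (m, b) in results.items()}

for task, gain in improvements.items():
    print(f"{task}: +{gain:.1f} pp")
```

Because these are absolute differences rather than relative gains, the Long-horizon row is well defined even though its lower rate is 0.0%.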
Quick Start
Prerequisites
- Python ≥ 3.10
- torch ≥ 2.6.0
- CUDA ≥ 12.4
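One way to confirm an environment meets these minimums is a small Python check. This is a sketch, not part of the repository: `parse_version`, `meets_minimum`, and `check_environment` are hypothetical helper names, and the check only reads `sys.version_info`, `torch.__version__`, and `torch.version.cuda`.

```python
import re
import sys


def parse_version(text):
    """Turn a version string such as '2.6.0+cu124' into a tuple like (2, 6, 0)."""
    numbers = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def meets_minimum(found, minimum):
    """Return True if version string `found` is at least `minimum`."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '12.4' and '12.4.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment():
    """Report whether Python, torch, and CUDA meet the listed minimums."""
    report = {"python": sys.version_info[:2] >= (3, 10)}
    try:
        import torch  # imported lazily so the helpers work without torch
    except ImportError:
        report["torch"] = False
        report["cuda"] = False
        return report
    report["torch"] = meets_minimum(torch.__version__, "2.6.0")
    cuda = torch.version.cuda  # None for CPU-only builds
    report["cuda"] = cuda is not None and meets_minimum(cuda, "12.4")
    return report
```

Calling `check_environment()` after installation returns a dictionary with one boolean per prerequisite.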
Installation
# Clone repository
git clone https://github.com/InternRobotics/F1-VLA.git
export VLA_HOME=$(pwd)
cd F1-VLA/f1_vla
# Create environment
conda create -n f1_vla python=3.10
conda activate f1_vla
# Install dependencies
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 torchcodec==0.2.1 --index-url https://download.pytorch.org/whl/cu124
# Install f1_vla
pip install -e .
pip install numpy==1.26.4
For optimal performance and compatibility, we highly recommend using FFmpeg alongside TorchCodec.
- FFmpeg is an industry-standard multimedia framework that provides robust, all-purpose video and audio processing.
- TorchCodec is a library specifically designed for deep learning workflows in PyTorch, offering highly optimized video I/O.
Using the two together substantially reduces the time needed to load the video dataset.
Download Datasets and Pretrained Models
| Name | Link |
|---|---|
| LIBERO_SPATIAL_NO_NOOPS_PATH | IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot |
| STAGE2_CKPT_PATH | F1_pretrain |
| LEROBOT_PI0_PATH | lerobot/pi0_base |
| PALIGEMMA_PATH | google/paligemma-3b-pt-224 |
| VAE_PATH | vae_ch160v4096z32.pth |
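As a hedged sketch, the entries that look like Hugging Face Hub repository IDs could be fetched with `snapshot_download` from the `huggingface_hub` package. That these entries are Hub repositories, and that the first one is a dataset repository, are assumptions. The `REPOS` mapping, the `local_dir_for` layout, and the `downloads` directory are illustrative. `F1_pretrain` and `vae_ch160v4096z32.pth` are left out because the table does not give a full location for them.

```python
from pathlib import Path

# Assumption: these three table entries are Hugging Face Hub repo IDs.
# F1_pretrain and vae_ch160v4096z32.pth are omitted: no full location is given.
REPOS = {
    "LIBERO_SPATIAL_NO_NOOPS_PATH": (
        "IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot",
        "dataset",
    ),
    "LEROBOT_PI0_PATH": ("lerobot/pi0_base", "model"),
    "PALIGEMMA_PATH": ("google/paligemma-3b-pt-224", "model"),
}


def local_dir_for(root, repo_id):
    """Hypothetical on-disk layout: <root>/<owner>/<name>."""
    return Path(root).joinpath(*repo_id.split("/"))


def download_all(root="downloads"):
    """Download every repo in REPOS and return {table name: local path}."""
    # Imported here so the helpers above work without the package installed.
    from huggingface_hub import snapshot_download

    paths = {}
    for name, (repo_id, repo_type) in REPOS.items():
        target = local_dir_for(root, repo_id)
        snapshot_download(repo_id=repo_id, repo_type=repo_type, local_dir=target)
        paths[name] = target
    return paths
```

Calling `download_all()` returns the local path for each name, which can then be copied into the training config. Some repositories may require accepting a license and authenticating with a Hub token before the download succeeds.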
Basic Usage
f1_vla
├── config
│   ├── debug_test.yaml
│   └── f1_config.json
├── requirements.txt
├── setup.py
├── src
│   ├── configs
│   ├── models
│   ├── policies
│   ├── processors
│   └── utils
└── train_hf.py
Finetune
# 1. edit config file
vim f1_vla/config/debug_test.yaml
# 2. run the program
cd "$VLA_HOME"
python train_hf.py --config-file f1_vla/config/debug_test.yaml
Citation
If you use this work in your research, please cite our paper:
@article{lv2025f1,
title={F1: A vision-language-action model bridging understanding and generation to actions},
author={Lv, Qi and Kong, Weijie and Li, Hao and Zeng, Jia and Qiu, Zherui and Qu, Delin and Song, Haoming and Chen, Qizhi and Deng, Xiang and Pang, Jiangmiao},
journal={arXiv preprint arXiv:2509.06951},
year={2025}
}
License
This project is licensed under the MIT License.