Improving LLM Video Understanding with 16 Frames Per Second
July 3, 2025
Welcome to the repo of F-16!
F-16 is a video large language model (LLM) that perceives high-frame-rate video, developed by the Department of Electronic Engineering at Tsinghua University together with ByteDance.
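The core idea is perceiving video at a high frame rate (16 frames per second). As a minimal illustrative sketch, not the repo's actual API, here is one way to map a 16-FPS sampling timeline onto the native frame indices of a decoded video (function and parameter names are assumptions):

```python
def sample_indices(duration_s: float, native_fps: float, target_fps: float = 16.0):
    """Return native frame indices approximating a target_fps timeline.

    Hypothetical helper: the real F-16 preprocessing may differ.
    """
    n = int(duration_s * target_fps)          # frames to sample in total
    last = int(duration_s * native_fps) - 1   # last valid native index
    return [min(round(t / target_fps * native_fps), last) for t in range(n)]

# A 2-second clip at 30 FPS yields 32 sampled indices at 16 FPS.
print(len(sample_indices(2.0, 30.0)))
```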
News
- 2025-07-03: We release the final checkpoint of F-16.
- 2025-06-18: We release the code of F-16.
Future Plans
- Release the code.
- Release the final F-16 checkpoint.
How to Use
How to train a model
- Prepare the dataset following `scripts/example_sft.json`.
- Download the LLaVA-OneVision model from Hugging Face.
- Modify the parameters in `scripts/train_sft.sh`.
- Run `bash scripts/train_sft.sh`.
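Before training, each dataset entry must match the format of `scripts/example_sft.json`. The sketch below shows a plausible LLaVA-style SFT entry and a basic sanity check; the field names (`video`, `conversations`, `from`, `value`) are assumptions based on common LLaVA conventions, not taken from this README:

```python
import json

# Hypothetical SFT entry; the authoritative schema is scripts/example_sft.json.
example = {
    "video": "videos/clip_0001.mp4",   # assumed field name
    "conversations": [                 # assumed LLaVA-style dialogue turns
        {"from": "human", "value": "<video>\nWhat happens in this clip?"},
        {"from": "gpt", "value": "A person pours coffee into a mug."},
    ],
}

def validate(entry):
    """Basic sanity checks on one SFT entry (hypothetical schema)."""
    assert isinstance(entry["video"], str)
    assert len(entry["conversations"]) >= 2
    roles = [turn["from"] for turn in entry["conversations"]]
    assert roles[0] == "human" and roles[-1] == "gpt"
    return True

# Write a one-entry dataset file in the assumed format.
with open("example_sft_sketch.json", "w") as f:
    json.dump([example], f, indent=2)

print(validate(example))
```

Check your entries against the repo's own `scripts/example_sft.json` rather than this sketch before launching `scripts/train_sft.sh`.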
How to evaluate a checkpoint
- Prepare the dataset following `scripts/example_sft.json`.
- Modify the parameters in `scripts/eval.sh`.
- Run `bash scripts/eval.sh`.
Team
Team Tsinghua: Yixuan Li, Changli Tang, Jimin Zhuang, Yudong Yang, Guangzhi Sun, Chao Zhang
Team ByteDance: Wei Li, Zejun Ma
Citation
If you find F-16 useful, please cite the paper:
@inproceedings{li2025improving,
title={Improving LLM Video Understanding with 16 Frames Per Second},
author={Li, Yixuan and Tang, Changli and Zhuang, Jimin and Yang, Yudong and Sun, Guangzhi and Li, Wei and Ma, Zejun and Zhang, Chao},
booktitle={Proc. ICML},
year={2025},
address={Vancouver}
}