README.md
December 2, 2025 · View on GitHub
LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant
School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
*Corresponding author
🔥LION-FS is accepted to CVPR 2025!🔥
⭐ Give us a star if you like it! ⭐
✨If you find this work useful for your research, please kindly cite our paper.✨
:fire: Updates
- [12/2025] Code released. Enjoy it!
- [03/2025] Arxiv paper released.
- [02/2025] LION-FS has been accepted by CVPR 2025!
Introduction
This is the github repository of LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant. In this work, we propose “Fast & Slow Video-Language Thinker” as onLIne videO assistaNt, LION-FS, achieving real-time, proactive, temporally accurate, and contextually precise responses.
The whole framework of LION-FS:
Installation
Ensure you have Miniconda and Python version >= 3.10 installed, then run:
conda install -y pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install transformers accelerate deepspeed peft editdistance Levenshtein tensorboard gradio moviepy submitit
pip install flash-attn --no-build-isolation
Install newest ffmpeg following:
wget https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz
tar xvf ffmpeg-release-amd64-static.tar.xz
rm ffmpeg-release-amd64-static.tar.xz
mv ffmpeg-7.0.2-amd64-static ffmpeg
How to run
sh scripts/ego_exo4d/train.sh
Experiments
Performance. LION-FS consistently outperforms other methods, particularly in Fluency and LM-Correctness metrics, showcasing its advanced capabilities in language modeling and temporal alignment.
Ablation Study on Token Aggregation Router. Experiments show that SigLIP outperforms EgoVLPv2 in LL-PPL and TimeDiff, while EgoVLPv2 excels in Fluency and LM-Correctness. These findings highlight the limitations of using on a single visual encoder, as it cannot provide comprehensive visual information. Features from two different encoders can complement each other, offering a more complete representation.
Ablation Study on Token Dropping Router. The results demonstrate that even with fewer visual tokens, LION-FS achieves satisfactory efficacy, highlighting the importance of the token dropping router’s visual selection capability and revealing the significant redundancy present in video data.
Acknowledgement
- We built our code based on: VideoLLM-online.
Citation
If you find this work useful for your research, please kindly cite our paper:
@inproceedings{li2025lion,
title={Lion-fs: Fast \& slow video-language thinker as online video assistant},
author={Li, Wei and Hu, Bing and Shao, Rui and Shen, Leyang and Nie, Liqiang},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={3240--3251},
year={2025}
}