

LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant

Wei Li, Bing Hu, Rui Shao*, Leyang Shen, Liqiang Nie

School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
*Corresponding author

🔥LION-FS is accepted to CVPR 2025!🔥
⭐ Give us a star if you like it! ⭐
✨If you find this work useful for your research, please kindly cite our paper.✨

🔥 Updates

[12/2025] Code released. Enjoy it!
[03/2025] arXiv paper released.
[02/2025] LION-FS has been accepted to CVPR 2025!

This is the GitHub repository for LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant. In this work, we propose a “Fast & Slow Video-Language Thinker” as an onLIne videO assistaNt, LION-FS, which delivers real-time, proactive, temporally accurate, and contextually precise responses.

The whole framework of LION-FS:

Installation

Ensure you have Miniconda and Python version >= 3.10 installed, then run:

conda install -y pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install transformers accelerate deepspeed peft editdistance Levenshtein tensorboard gradio moviepy submitit
pip install flash-attn --no-build-isolation
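
After installing, a quick way to confirm the environment is complete is to check that each package can be located by Python. The sketch below is only an illustration, not part of the repository: the `missing_packages` helper is hypothetical, and the import names in `REQUIRED` (for example `flash_attn` for the `flash-attn` package) are assumptions derived from the install commands above.

```python
from importlib.util import find_spec

# Import names assumed from the pip/conda package names listed above.
REQUIRED = [
    "torch", "torchvision", "torchaudio",
    "transformers", "accelerate", "deepspeed", "peft",
    "editdistance", "Levenshtein", "tensorboard",
    "gradio", "moviepy", "submitit", "flash_attn",
]


def missing_packages(names=REQUIRED):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

This only checks that the modules can be found; it does not import them, so it will not detect CUDA or build problems inside an installed package.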

Install the latest ffmpeg static build:

wget https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz
tar xvf ffmpeg-release-amd64-static.tar.xz
rm ffmpeg-release-amd64-static.tar.xz
mv ffmpeg-*-amd64-static ffmpeg
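
To confirm the static build is usable, you can ask the binary for its version. This is a minimal sketch under the assumption that the extracted directory was renamed to `ffmpeg` as above, so the executable sits at `ffmpeg/ffmpeg`; the `ffmpeg_version` helper is hypothetical and not part of the repository.

```python
import subprocess


def ffmpeg_version(binary="ffmpeg/ffmpeg"):
    """Return the first line of `<binary> -version`, or None if it cannot be run."""
    try:
        result = subprocess.run(
            [binary, "-version"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Binary missing, not executable, or it exited with an error.
        return None
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    print(ffmpeg_version() or "ffmpeg not found at ffmpeg/ffmpeg")
```

If the training scripts expect `ffmpeg` on your `PATH` rather than at a relative path, pass the appropriate location (or simply `"ffmpeg"`) as the argument.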

How to run

sh scripts/ego_exo4d/train.sh

Experiments

Performance. LION-FS consistently outperforms other methods, particularly in Fluency and LM-Correctness metrics, showcasing its advanced capabilities in language modeling and temporal alignment.

Ablation Study on Token Aggregation Router. Experiments show that SigLIP outperforms EgoVLPv2 in LL-PPL and TimeDiff, while EgoVLPv2 excels in Fluency and LM-Correctness. These findings highlight the limitation of relying on a single visual encoder, which cannot provide comprehensive visual information. Features from two different encoders can complement each other, offering a more complete representation.
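
To make the idea of complementary features concrete, here is a generic sketch of gated fusion: two feature vectors for the same token are blended with softmax weights. This is not the Token Aggregation Router from the paper. The function names, the assumption that both features have already been projected to the same dimension, and the way the gate logits are supplied are all hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_token(feat_a, feat_b, gate_logits):
    """Blend two same-length feature vectors using softmax gate weights.

    feat_a, feat_b: features of one token from two encoders (assumed to be
    projected to a shared dimension beforehand).
    gate_logits: two scores, one per encoder; higher means more weight.
    """
    if len(feat_a) != len(feat_b):
        raise ValueError("feature vectors must have the same length")
    if len(gate_logits) != 2:
        raise ValueError("expected exactly two gate logits")
    weight_a, weight_b = softmax(gate_logits)
    return [weight_a * a + weight_b * b for a, b in zip(feat_a, feat_b)]
```

With equal gate logits the result is the plain average of the two features; in a learned setting the logits would come from a small network conditioned on the token, so the blend can differ per token.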

Ablation Study on Token Dropping Router. The results demonstrate that even with fewer visual tokens, LION-FS achieves satisfactory efficacy, highlighting the importance of the token dropping router’s visual selection capability and revealing the significant redundancy present in video data.
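
As a rough illustration of how visual tokens can be pruned, the sketch below keeps only the highest-scoring fraction of tokens while preserving their original order. It is a generic top-k selection, not the Token Dropping Router described in the paper. The `drop_tokens` name, the externally supplied relevance scores, and the `keep_ratio` parameter are hypothetical.

```python
import math


def drop_tokens(tokens, scores, keep_ratio):
    """Keep the top `keep_ratio` fraction of tokens by score, in original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Always keep at least one token when any are present.
    keep_count = max(1, math.ceil(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_indices = sorted(ranked[:keep_count])
    return [tokens[i] for i in kept_indices]
```

For example, `drop_tokens(["a", "b", "c", "d"], [0.1, 0.9, 0.5, 0.7], 0.5)` keeps the two highest-scoring tokens, `"b"` and `"d"`, in their original order.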

Acknowledgement

Our code is built on VideoLLM-online.

Citation

If you find this work useful for your research, please kindly cite our paper:

@inproceedings{li2025lion,
  title={{LION-FS}: Fast \& Slow Video-Language Thinker as Online Video Assistant},
  author={Li, Wei and Hu, Bing and Shao, Rui and Shen, Leyang and Nie, Liqiang},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={3240--3251},
  year={2025}
}