VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges
February 27, 2025
Recent advancements in large-scale video-language models demonstrate remarkable capabilities in real-time planning and interaction with real-world environments, yet their training is constrained by high computational costs and limited annotated datasets. Traditional methods, such as video compression and sliding-window techniques, often compromise critical visual information or disrupt semantic flow. In addition, current predesigned QA benchmarks fail to adequately assess long video understanding because of inherent biases from static image features and the base LLM. To address these issues, we introduce VideoLLaMB, a framework that uses Memory Bridge Layers with recurrent memory tokens to encode entire video content without discarding vital information. We also propose the SceneTilling algorithm, which splits a video into semantic units to preserve semantic flow. Finally, we present the "Needle in a Video Haystack" benchmark to comprehensively evaluate long video understanding with needles of different modalities.
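To make the segmentation idea concrete, here is a minimal sketch of a TextTiling-style boundary detector over per-frame feature vectors. It is not the repository's SceneTilling implementation. The function names (`cosine`, `depth_scores`, `split_scenes`), the depth-score definition, and the `mean + alpha * std` threshold are all assumptions chosen for illustration, and the toy two-dimensional features stand in for real visual-encoder outputs.

```python
import math
from statistics import mean, pstdev


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def depth_scores(sims):
    """TextTiling-style depth of each similarity valley.

    For gap i, climb to the nearest peak on the left and on the right and
    sum how far sims[i] sits below those two peaks.
    """
    depths = []
    for i, s in enumerate(sims):
        left = s
        for j in range(i - 1, -1, -1):
            if sims[j] < left:
                break
            left = sims[j]
        right = s
        for j in range(i + 1, len(sims)):
            if sims[j] < right:
                break
            right = sims[j]
        depths.append((left - s) + (right - s))
    return depths


def split_scenes(frames, alpha=0.5):
    """Group frame indices into contiguous segments.

    A boundary is placed between frame i and frame i + 1 when the depth of
    that gap exceeds mean + alpha * std of all depths (an assumed rule).
    """
    if len(frames) < 2:
        return [list(range(len(frames)))]
    sims = [cosine(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
    depths = depth_scores(sims)
    threshold = mean(depths) + alpha * pstdev(depths)
    segments, current = [], [0]
    for i, d in enumerate(depths):
        if d > threshold:
            segments.append(current)
            current = []
        current.append(i + 1)
    segments.append(current)
    return segments


# Toy features: three frames of one "scene", then three of another.
frames = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0], [0.1, 0.9], [0.1, 1.0]]
segments = split_scenes(frames)  # [[0, 1, 2], [3, 4, 5]]
```

In a recurrent-memory setup, each resulting segment would then be encoded in turn while a small set of memory tokens is carried from one segment to the next. That step depends on the model and is not sketched here.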
Table of Contents
- Install
- Quick Start with CLI
- Streaming Video Caption with CLI
- Gradio Demo
- Train
- Evaluate
- Model Zoo
- Acknowledgement
- Citation
Install
- Clone this repository and navigate to the VideoLLaMB folder
git clone https://github.com/bigai-nlco/VideoLLaMB.git
cd VideoLLaMB
- Install Package
conda create -n videollamb python=3.10 -y
conda activate videollamb
pip install --upgrade pip
pip install -e .
conda install ffmpeg
- Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
pip install flash-attn --no-build-isolation --no-cache-dir
Quick Start with CLI
Download the checkpoint, place it in the checkpoints directory, then run the following command:
python -m llava.serve.cli --model-path checkpoints/videollamb-llava-1.5-7b --video-file XXX.mp4
Streaming Video Caption with CLI
Download the checkpoint, place it in the checkpoints directory, then run the following command:
python -m llava.serve.cli_streaming --model_path checkpoints/videollamb-llava-1.5-7b
https://github.com/user-attachments/assets/96c32452-f910-4c6c-9feb-0e98134d45a1
Gradio Demo
Download the checkpoint, place it in the checkpoints directory, then run the following command:
python -m llava.serve.gradio_demo
https://github.com/user-attachments/assets/2ea521e5-4bf2-415c-b20d-f5663c93af57
Train
- Prepare data
We combine the video instruction data from PLLaVA with the image instruction data from LLaVA for training. See DATA for details.
- Prepare model weights for initialization
Our model is initialized from LLaVA. Download llava-v1.5-7b and place it in checkpoints/llava-v1.5-7b. The visual encoders come from LanguageBind: download LanguageBind_Image and LanguageBind_Video_merge, and place them in checkpoints/LanguageBind_Image and checkpoints/LanguageBind_Video_merge, respectively.
- Start Training
Training LLaVA-1.5-7B takes 23 hours on 4 A800-80G GPUs.
bash scripts/finetune_video_image.slurm # bash
sbatch scripts/finetune_video_image.slurm # slurm cluster
We also provide a script to backpropagate the LLM loss to the bridge for each recurrent iteration.
bash scripts/finetune_video_image_loss.slurm # bash
sbatch scripts/finetune_video_image_loss.slurm # slurm cluster
Evaluate
- Prepare data
We provide evaluation pipelines for EgoSchema, NExTQA, EgoPlan, and MVBench. See DATA for details.
- Start Evaluating
a. Traditional Benchmark
bash scripts/eval/egoschema.sh # egoschema
bash scripts/eval/nextqa.sh # nextqa
bash scripts/eval/egoplan.sh # egoplan
bash scripts/eval/mvbench.sh # mvbench
b. MM-NIAVH
See our benchmark, Needle In A Video Haystack (NIAVH).
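To run the four traditional benchmark scripts listed above in sequence, a small driver such as the following may be convenient. It is a sketch, not a repository script. `run_benchmarks` and its `dry_run` flag are hypothetical, and it assumes that it is executed from the repository root with the evaluation data already prepared.

```python
import subprocess

# Script paths copied from the evaluation commands above.
EVAL_SCRIPTS = {
    "egoschema": "scripts/eval/egoschema.sh",
    "nextqa": "scripts/eval/nextqa.sh",
    "egoplan": "scripts/eval/egoplan.sh",
    "mvbench": "scripts/eval/mvbench.sh",
}


def run_benchmarks(names=None, dry_run=False):
    """Run the selected benchmark scripts one after another.

    Returns a dict mapping each benchmark name to its exit code
    (None in dry-run mode, where nothing is executed).
    """
    names = list(EVAL_SCRIPTS) if names is None else names
    results = {}
    for name in names:
        cmd = ["bash", EVAL_SCRIPTS[name]]  # raises KeyError for unknown names
        if dry_run:
            print("would run:", " ".join(cmd))
            results[name] = None
        else:
            results[name] = subprocess.run(cmd).returncode
    return results


if __name__ == "__main__":
    # Print the planned commands without executing them.
    run_benchmarks(dry_run=True)
```

A non-zero exit code in the returned dict marks a benchmark whose script failed, so a long unattended run can be checked at a glance.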
Model Zoo
| Model | Base Model | Training Data | Download Link |
|---|---|---|---|
| VideoLLaMB-7B | llava-v1.5-7b | magic_json, LLaVA | 🤗videollamb-llava-1.5-7b |
| VideoLLaMB-7B-Mem (MM-NIAVH) | llava-v1.5-7b | magic_json, LLaVA | 🤗videollamb-mem-llava-1.5-7b |
Acknowledgement
Model:
Data:
Demo:
Citation
@misc{mm-niavh,
title={MLLM Pressure Test: Needle In A Video Haystack},
author={Wang, Yuxuan and Xie, Cihang and Liu, Yang and Zheng, Zilong},
publisher={github},
url={https://github.com/bigai-nlco/NeedleInAVideoHaystack},
year={2024}
}
@article{videollamb,
title={VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges},
author={Wang, Yuxuan and Xie, Cihang and Liu, Yang and Zheng, Zilong},
journal={arxiv},
year={2024}
}