VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges

February 27, 2025

Recent advancements in large-scale video-language models demonstrate remarkable capabilities in real-time planning and interaction with real-world environments, yet their training is constrained by high computational costs and limited annotated datasets. Traditional methods, such as video compression and sliding-window techniques, often compromise critical visual information or disrupt semantic flow. In addition, current predesigned QA benchmarks fail to adequately assess long video understanding because of inherent biases from static image features and the base LLM. To address these issues, we introduce VideoLLaMB, a framework that uses Memory Bridge Layers with recurrent memory tokens to encode entire video content without discarding vital information. We also propose the SceneTilling algorithm, which splits a video into semantic units to preserve the semantic flow. Finally, we present the "Needle in a Video Haystack" benchmark to comprehensively evaluate long video understanding with needles of different modalities.
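To make the idea of splitting a video into semantic units more concrete, here is a minimal sketch of a TextTiling-style segmentation over per-frame feature vectors. It is not the repository's SceneTilling implementation: the function names, the depth-score rule, and the `alpha` threshold parameter are all assumptions chosen for illustration.

```python
import math
from statistics import mean, pstdev


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_into_segments(frame_features, alpha=0.5):
    """Split frame indices into contiguous segments at low-similarity gaps.

    Assumption (hypothetical rule, not taken from the repository): each gap
    between adjacent frames gets a depth score measuring how far its
    similarity dips below the highest similarity on either side. Gaps whose
    depth exceeds mean + alpha * std become segment boundaries.
    """
    n = len(frame_features)
    if n < 2:
        return [list(range(n))] if n else []

    # Similarity between each pair of adjacent frames.
    sims = [cosine(frame_features[i], frame_features[i + 1]) for i in range(n - 1)]

    # Depth of each gap relative to the peaks on its left and right.
    depths = []
    for i, s in enumerate(sims):
        left_peak = max(sims[: i + 1])
        right_peak = max(sims[i:])
        depths.append((left_peak - s) + (right_peak - s))

    threshold = mean(depths) + alpha * pstdev(depths)

    # Cut after frame i whenever the gap (i, i + 1) is deep enough.
    segments, current = [], [0]
    for i, depth in enumerate(depths):
        if depth > threshold:
            segments.append(current)
            current = []
        current.append(i + 1)
    segments.append(current)
    return segments


if __name__ == "__main__":
    # Toy features: three frames of one "scene", then three of another.
    frames = [[1, 0], [1, 0.1], [0.9, 0], [0, 1], [0.1, 1], [0, 0.9]]
    print(split_into_segments(frames))
```

In a recurrent setup, each resulting segment would be encoded in turn while a fixed set of memory tokens carries information forward from earlier segments; that part is omitted from this sketch.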

Table of Contents

Install
Quick Start with CLI
Streaming Video Caption with CLI
Gradio Demo
Train
Evaluate
Model Zoo
Acknowledgement
Citation

Install

Clone this repository and navigate to the VideoLLaMB folder

git clone https://github.com/bigai-nlco/VideoLLaMB.git
cd VideoLLaMB

Install Package

conda create -n videollamb python=3.10 -y
conda activate videollamb
pip install --upgrade pip
pip install -e .
conda install ffmpeg

Install additional packages for training

pip install -e ".[train]"
pip install flash-attn --no-build-isolation
pip install flash-attn --no-build-isolation --no-cache-dir

Quick Start with CLI

Download the checkpoint, place it in the checkpoints directory, then run the following command:

python -m llava.serve.cli --model-path checkpoints/videollamb-llava-1.5-7b --video-file XXX.mp4

Streaming Video Caption with CLI

Download the checkpoint, place it in the checkpoints directory, then run the following command:

python -m llava.serve.cli_streaming --model_path checkpoints/videollamb-llava-1.5-7b

https://github.com/user-attachments/assets/96c32452-f910-4c6c-9feb-0e98134d45a1

Gradio Demo

Download the checkpoint, place it in the checkpoints directory, then run the following command:

python -m llava.serve.gradio_demo

https://github.com/user-attachments/assets/2ea521e5-4bf2-415c-b20d-f5663c93af57

Train

Prepare data

We combine the video instruction data from PLLaVA with the image instruction data from LLaVA for training. Please check DATA for details.

Prepare model weights for initialization

Our model is initialized from LLaVA. Download llava-v1.5-7b and put it in checkpoints/llava-v1.5-7b. The visual encoders come from LanguageBind. Download LanguageBind_Image and LanguageBind_Video_merge and put them in checkpoints/LanguageBind_Image and checkpoints/LanguageBind_Video_merge.
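Before launching training, it can help to confirm that the three directories named above are in place. The helper below is a small sketch, not part of the repository: `missing_checkpoints` is a hypothetical name, and it only checks that the directories exist, not that the weights inside them are complete.

```python
from pathlib import Path

# Directory names taken from the instructions above.
REQUIRED_CHECKPOINTS = (
    "llava-v1.5-7b",
    "LanguageBind_Image",
    "LanguageBind_Video_merge",
)


def missing_checkpoints(root="checkpoints"):
    """Return the required checkpoint directories that are absent under root."""
    root_path = Path(root)
    return [name for name in REQUIRED_CHECKPOINTS if not (root_path / name).is_dir()]


if __name__ == "__main__":
    # Run from the repository root so that "checkpoints" resolves correctly.
    missing = missing_checkpoints()
    if missing:
        print("Missing under checkpoints/: " + ", ".join(missing))
    else:
        print("All initialization weights are in place.")
```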

Start Training

Training LLaVA-1.5-7B takes 23 hours on 4 A800 80G GPUs.

bash scripts/finetune_video_image.slurm # bash
sbatch scripts/finetune_video_image.slurm # slurm cluster

We also provide a script to backpropagate the LLM loss to the bridge for each recurrent iteration.

bash scripts/finetune_video_image_loss.slurm # bash
sbatch scripts/finetune_video_image_loss.slurm # slurm cluster

Evaluate

Prepare data

We provide evaluation pipelines for EgoSchema, NExTQA, EgoPlan, and MVBench. Please check DATA for details.

Start Evaluating

a. Traditional Benchmark

bash scripts/eval/egoschema.sh # egoschema
bash scripts/eval/nextqa.sh # nextqa
bash scripts/eval/egoplan.sh # egoplan
bash scripts/eval/mvbench.sh # mvbench
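To run several of these benchmarks in one go, a thin wrapper can call the scripts in sequence. This is a convenience sketch and not something shipped with the repository: `eval_commands` and `run_all` are hypothetical helpers, and they assume the working directory is the repository root.

```python
import subprocess

# Script paths as listed above, keyed by benchmark name.
EVAL_SCRIPTS = {
    "egoschema": "scripts/eval/egoschema.sh",
    "nextqa": "scripts/eval/nextqa.sh",
    "egoplan": "scripts/eval/egoplan.sh",
    "mvbench": "scripts/eval/mvbench.sh",
}


def eval_commands(benchmarks=None):
    """Build the bash command for each requested benchmark (all by default)."""
    names = list(EVAL_SCRIPTS) if benchmarks is None else list(benchmarks)
    unknown = [name for name in names if name not in EVAL_SCRIPTS]
    if unknown:
        raise ValueError("Unknown benchmark(s): " + ", ".join(unknown))
    return [["bash", EVAL_SCRIPTS[name]] for name in names]


def run_all(benchmarks=None):
    """Run the evaluation scripts one after another, stopping on the first failure."""
    for command in eval_commands(benchmarks):
        subprocess.run(command, check=True)
```

For example, `run_all(["egoschema", "nextqa"])` would run only those two scripts, in that order.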

b. MM-NIAVH

See our benchmark, Needle In A Video Haystack (NIAVH).

Model Zoo

Model	Base Model	Training Data	Download Link
VideoLLaMB-7B	llava-v1.5-7b	magic_json, LLaVA	🤗videollamb-llava-1.5-7b
VideoLLaMB-7B-Mem (MM-NIAVH)	llava-v1.5-7b	magic_json, LLaVA	🤗videollamb-mem-llava-1.5-7b

Acknowledgement

Model:

Data:

PLLaVA

Demo:

videollm-online

Citation

@misc{mm-niavh,
    title={MLLM Pressure Test: Needle In A Video Haystack},
    author={Wang, Yuxuan and Xie, Cihang and Liu, Yang and Zheng, Zilong},
    publisher={github},
    url={https://github.com/bigai-nlco/NeedleInAVideoHaystack},
    year={2024}
}

@article{videollamb,
    title={VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges},
    author={Wang, Yuxuan and Xie, Cihang and Liu, Yang and Zheng, Zilong},
    journal={arxiv},
    year={2024}
}