November 28, 2024
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding (ECCV 2024)
Introduction
This is the official code repository of VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding. VideoAgent is a multimodal agent that can understand an input video and answer the questions you ask about it.
Given a video and a question, VideoAgent works in two phases: a memory construction phase and an inference phase. During the memory construction phase, structured information is extracted from the video and stored in the memory. During the inference phase, an LLM is prompted to use a set of tools that interact with the memory to answer the question.
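The two phases can be pictured with a minimal Python sketch. It is an illustration only, not the code in this repository: the class names, the two tool functions, and the fixed plan that stands in for the LLM's tool choices are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class SegmentRecord:
    """One hypothetical memory entry describing a short video segment."""
    segment_id: int
    caption: str


@dataclass
class Memory:
    """Toy structured memory; the real memory layout may differ."""
    segments: List[SegmentRecord] = field(default_factory=list)


def build_memory(captions: List[str]) -> Memory:
    """Phase 1 (memory construction): store per-segment information."""
    return Memory([SegmentRecord(i, c) for i, c in enumerate(captions)])


def caption_retrieval(memory: Memory, start: int, end: int) -> List[str]:
    """Hypothetical tool: captions of the segments with id in [start, end]."""
    return [s.caption for s in memory.segments if start <= s.segment_id <= end]


def segment_localization(memory: Memory, keyword: str) -> List[int]:
    """Hypothetical tool: ids of segments whose caption mentions the keyword."""
    key = keyword.lower()
    return [s.segment_id for s in memory.segments if key in s.caption.lower()]


# Registry of tools that the inference phase is allowed to call.
TOOLS: Dict[str, Callable[..., Any]] = {
    "caption_retrieval": caption_retrieval,
    "segment_localization": segment_localization,
}


def run_inference(memory: Memory, plan: List[Tuple[str, Dict[str, Any]]]) -> List[Any]:
    """Phase 2 (inference): run tool calls against the memory.

    In a real agent an LLM would choose each call from the question and the
    previous observations; here `plan` is a fixed list standing in for it.
    """
    return [TOOLS[name](memory, **kwargs) for name, kwargs in plan]


demo_memory = build_memory([
    "a person opens the fridge",
    "a person pours milk",
    "a cat jumps on the table",
])
demo_plan = [
    ("segment_localization", {"keyword": "milk"}),
    ("caption_retrieval", {"start": 1, "end": 2}),
]
print(run_inference(demo_memory, demo_plan))
```

The point of the split is that the video is processed once into the memory, while each question only triggers cheap lookups against it.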
Prerequisites
This project has been tested on Ubuntu 20.04 with an NVIDIA RTX 4090 (24 GB).
Installation Guide
Use the following command to create the conda environment named videoagent:
conda env create -f environment.yaml
Create the Video-LLaVA environment by running the following commands:
git clone https://github.com/PKU-YuanGroup/Video-LLaVA
cd Video-LLaVA
conda create -n videollava python=3.10 -y
conda activate videollava
pip install --upgrade pip # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
pip install decord opencv-python git+https://github.com/facebookresearch/pytorchvideo.git@28fe037d212663c6a24f373b94cc5d478c8c1a1d
Note: Only the conda environment named videollava is required for this project; the Video-LLaVA repository itself is not. You can clone the Video-LLaVA repository anywhere you like, as long as you build the videollava conda environment from it.
Download cache_dir.zip and tool_models.zip from here and unzip them into the VideoAgent directory. This creates two folders under VideoAgent: cache_dir (the model weights of Video-LLaVA) and tool_models (the model weights of all other models).
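Before moving on, it can help to confirm that both folders ended up in the right place. The following is a small optional check, not part of the repository; the function name is hypothetical and it only looks for the two folder names mentioned above.

```python
from pathlib import Path
from typing import List

# Folder names taken from the installation step above.
REQUIRED_DIRS = ("cache_dir", "tool_models")


def missing_model_dirs(repo_root: str) -> List[str]:
    """Return the required folders that are not present under repo_root."""
    root = Path(repo_root)
    return [name for name in REQUIRED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Run from the VideoAgent directory.
    missing = missing_model_dirs(".")
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("Both model folders found.")
```

This only checks that the folders exist; it does not verify the contents of the downloaded weights.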
Usage
Make sure you are in the VideoAgent directory.
Enter your OpenAI API key in config/default.yaml.
First, open a terminal and run:
conda activate videollava
python video-llava.py
This will start a Video-LLaVA server process that handles the Visual Question Answering requests raised by VideoAgent.
Once you see "ready for connection!" in the first terminal, open another terminal and run:
conda activate videoagent
python demo.py
This will launch a Gradio demo.
The results include:
- the answer to the question
- a replay of the input video with object re-ID
- the inference log (chain-of-thought) of VideoAgent
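If you would rather not manage two terminals by hand, the startup sequence above can be scripted. The sketch below is an assumption-laden example, not something shipped with this repository: `launch_demo` is a hypothetical helper, it assumes `conda run --no-capture-output` is available in your conda version, and it assumes the readiness message is printed to the server's standard output or standard error.

```python
import subprocess
import threading
from typing import IO, Iterable

# Readiness message quoted in the usage steps above.
READY_MARKER = "ready for connection!"


def wait_for_marker(lines: Iterable[str], marker: str = READY_MARKER) -> bool:
    """Consume lines until one contains marker; False if the stream ends first."""
    for line in lines:
        if marker in line:
            return True
    return False


def _drain(stream: IO[str]) -> None:
    """Keep reading so the server never blocks on a full output pipe."""
    for _ in stream:
        pass


def launch_demo() -> None:
    """Hypothetical launcher: start the server, wait for readiness, run the demo."""
    server = subprocess.Popen(
        ["conda", "run", "--no-capture-output", "-n", "videollava",
         "python", "-u", "video-llava.py"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so the marker is seen either way
        text=True,
    )
    try:
        if not wait_for_marker(server.stdout):
            raise RuntimeError("Video-LLaVA server exited before it was ready")
        threading.Thread(target=_drain, args=(server.stdout,), daemon=True).start()
        subprocess.run(
            ["conda", "run", "--no-capture-output", "-n", "videoagent",
             "python", "demo.py"],
            check=True,
        )
    finally:
        server.terminate()  # stop the server when the demo exits
```

Call `launch_demo()` from the VideoAgent directory. The manual two-terminal procedure remains the reference way to run the demo.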
For batch inference, run:
conda activate videoagent
python main.py
Citation
If you find our paper and code useful in your research, please consider giving a star ⭐ and a citation.
@inproceedings{fan2025videoagent,
title={Videoagent: A memory-augmented multimodal agent for video understanding},
author={Fan, Yue and Ma, Xiaojian and Wu, Rujie and Du, Yuntao and Li, Jiaqi and Gao, Zhi and Li, Qing},
booktitle={European Conference on Computer Vision},
pages={75--92},
year={2025},
organization={Springer}
}