Project-Ava
November 5, 2025 Β· View on GitHub
An implementation of Paper "AVA: Towards Agentic Video Analytics with Vision Language Models"
π° Paper: Arxiv
π° News
- [2025/08/01] Gemini-1.5-Pro has been officially deprecated. We recommend everyone to try the more powerful Gemini-2.5-Pro. You can apply for the API here.
- [2025/06/27] We have released the AVA-100 dataset on Hugging Face.
π§ Key Features
- Task Definition: Define five levels of intelligence (L1βL5) for current and future video analysis systems. AVA is the first L4 video analytics system powered by VLMs, enabling open-ended comprehension, reasoning, and analyticsβmarking a significant advancement.
- Near-real-time index construction: AVA employs Event-Knowledge Graphs (EKGs) to construct video index, supporting near real-time indexing even on common edge devices (2 RTX 4090).
- Agentic retrieval and generation: AVA uses LLM as an agent to proactively explore and retrieve additional event information related to prior results, enabling multi-path reasoning and response generation based on aggregated data.
- Proposed benchmark: Proposes AVA-100, an ultra-long video benchmark designed to evaluate video analysis capabilities, comprising 8 videos (each over 10 hours) and 120 manually annotated questions across four scenarios: human daily activities, city walking, wildlife surveillance, and traffic monitoring.
- AVA achieves $62.3% on LVBench, \62.3% on VideoMME-Long, and \75.8%$ on the proposed AVA-100 benchmark, outperforming mainstream VLMs and Video-RAG methods under the same settings.

πΉ AVA-100
AVA-100 is proposed by us, which is an ultra-long video benchmark specially designed to evaluate video analysis capabilities Avas-100 consists of 8 videos, each exceeding 10 hours in length, and includes a total of 120 manually annotated questions. The benchmark covers four typical video analytics scenarios: human daily activities, city walking, wildlife surveillance, and traffic monitoring, each scenario contains two videos. All questions are carefully designed by human annotators, who also provide reference answers as the ground truth. In addition, GPT-4o is utilized to generate plausible distractor options.
- Human daily activities: Selected and stitched from egocentric footage in the Ego4D.
- City walking: Selected from publicly available YouTube videos, capturing urban exploration.
- Wildlife surveillance: Selected from publicly available YouTube videos, capturing animal monitoring.
- Traffic monitoring: Selected and stitched from monitoring videos in the Bellevue Traffic Video Dataset.

π¦ Installation
git clone https://github.com/I-ESC/Project-AVA.git
cd Project-AVA
conda create -n ava python=3.9
conda activate ava
pip install -r requirements.txt
Dataset Preparation
In our project, the dataset is organized as follows:
datas/
βββ [dataset_name]/
βββ videos/ # folder stores raw videos
βββ *.json. # raw videos information and question-answers
βββ download.sh # quick download script
We recommend using the provided script to download the dataset:
# supported dataset: LVBench, VideoMME, AVA100
cd datas/[dataset-name]
./download.sh
Video Preprocessing
Since the videos processed in this project are relatively long, we choose to preprocess them into frames and store them on disk, instead of dynamically loading them into memory for each operation. This method improves processing speed and is more friendly to hardware resources. Additionally, to speed up the processing, we employ parallel processing techniques:
python preprocess_videos.py --dataset [lvbench/videomme/ava100] --num_threads 10 # Set num_threads based on your hardware capabilities.
All intermediate results in our project are organized under the AVA_cache folder. After processing the video frames, you should see the following files:
AVA_cache/
βββ LVBench/
βββ 1 # video id
βββ frames # folder stores raw frames
βββ config.json # video's raw info, including resolution, duration, and so on.
βββ ...
βββ 2
βββ ...
βββ VideoMME/
βββ 601
βββ 602
βββ ...
βββ AVA100/
βββ 1
βββ 2
βββ ...
Graph Construction
Construct event knowledge graph for the video:
# View llms.init_model.py for supported models
# View dataset.init_dataset.py for supported dataset
# View datas/[dataset_name]/[dataset_name.json] for video_id, LVBench: 1-103, VideoMME: 601-900, AVA-100: 1-8
python graph_construction.py --model [name_of_model] --dataset [name_of_dataset] --video_id [id_of_video] --gpus [num_of_gpus]
Constructing the graph is a time-consuming process. For example, building a graph for a 10-hour video can take 3β5 hours (the actual time depends on the hardware used). Therefore, we recommend directly downloading our pre-built graphs from here and merging them into the AVA_cache folder. After this, the folder structure should be as follows:
AVA_cache/
βββ LVBench/
βββ 1
βββ frames
βββ config.json
βββ kg
βββ vdb_events.json
βββ vdb_entities.json
βββ vdb_relations.json
βββ graph_event_knowledge_graph.graphml
βββ ...
βββ 2
βββ ...
Generate Summary_and_Answer Result
You can generate answers to questions about a specific video in the following way:
# View datas/[dataset_name]/[dataset_name.json] for question_id
python query_SA.py --model [name_of_model] --dataset [name_of_dataset] --video_id [id_of_video] --question_id [id_of_question]--gpus [num_of_gpus]
# example
python query_SA.py --model qwenlm --dataset lvbench --video_id 1 --question_id 0 --gpus 1
You can also run the results for the entire dataset using the following way:
# Setting video_id to -1 indicates that the entire dataset will be processed.
python query_SA.py --model [name_of_model] --dataset [name_of_dataset] --video_id -1 --gpus [num_of_gpus]
# example
python query_SA.py --model qwenlm --dataset lvbench --video_id -1 --gpus 1
Generate Check_raw_frame_and_Answer Result
Here, we would like to clarify that the results in the paper were produced using Gemini-1.5-Pro. However, over time, Gemini-1.5-Pro has been deprecated by Google. Therefore, we recommend using QwenVL2.5-7B. (results using this model are also presented in the paper).
You can generate answers to questions about a specific video in the following way ( NoteοΌBefore generating the CA Result, the corresponding SA Result must have already been produced ):
# View datas/[dataset_name]/[dataset_name.json] for question_id
python query_SA.py --model [name_of_model] --dataset [name_of_dataset] --video_id [id_of_video] --question_id [id_of_question]--gpus [num_of_gpus]
# example
python query_CA.py --model qwenvl --dataset lvbench --video_id 1 --question_id 0 --gpus 1
You can also run the results for the entire dataset using the following way:
# Setting video_id to -1 indicates that the entire dataset will be processed.
python query_CA.py --model [name_of_model] --dataset [name_of_dataset] --video_id -1 --gpus [num_of_gpus]
# example
python query_CA.py --model qwenvl --dataset lvbench --video_id -1 --gpus 1
π Citation
If you use this repo, please cite our paper:
@inproceedings{ava,
title={AVA: Towards Agentic Video Analytics with Vision Language Models},
author={Yan, Yuxuan and Jiang, Shiqi and Cao, Ting and Yang, Yifan and Yang, Qianqian and Shu, Yuanchao and Yang, Yuqing and Qiu, Lili},
booktitle={USENIX Symposium on Networked Systems Design and Implementation (NSDI)},
year={2026}
}