DIVE: Deep-search Iterative Video Exploration
June 30, 2025 · View on GitHub
A winning solution for the CVRR (Complex Video Reasoning & Robustness Evaluation) Challenge at CVPR 2025🏆
Overview
DIVE(Deep-search Iterative Video Exploration) is an advanced Video Question Answering (VQA) method that won 1st place at the CVRR-ES Challenge (CVPR 2025). Our method uses an iterative approach and question breakdown to handle complex queries about videos, delivering accurate and context-aware answers. Features include object-centric video summarization, agent-based analysis with multiple tools, and strong results on diverse, real-world video datasets.
📊 Performance
Our method, DIVE (Deep-search Iterative Video Exploration), achieved 1st place in the CVRR-ES Challenge at CVPR 2025.
-
🥇 Test Accuracy: 81.44% (see Official leaderboard)
-
✅ Validation Accuracy: 91.54% (see EvalAI leaderboard)
-
📈 Outperformed all other participants and strong baselines, including GPT-4V (70.78%) and GPT-4o (75.03%).
DIVE's strong performance comes from three core components:
-
Breakdown Question module: Decomposes each complex question into semantically meaningful sub-questions, enabling targeted and stepwise analysis.
-
Intent Estimation: Identifies the underlying purpose behind the question by jointly analyzing the video and language context.
-
Video Summarization: Uses object-centric summaries based on visual detection to provide interpretable, context-rich input for downstream reasoning.
These modules work together in an iterative loop, allowing DIVE to deliver accurate and contextually grounded answers even for highly challenging queries.
🚀How To Use
📂Dataset Preparation
Step1. Downloading and Setting Up CVRR-ES Dataset
Set up the CVRR-ES dataset by following the below steps.
- CVRR-ES dataset can be downloaded using this link (zipped). CVRR-ES benchmark consists of 2400 open-ended question-answer (QA) pairs spanning over 214 unique videos and covers 11 diverse evaluation dimensions. After unzipping, the CVRR-ES dataset structure looks like the following:
CVRR-ES/
|–– interpretation_of_visual_context/
| |–– annotations_interpretation_of_visual_context.json
| |–– captions_interpretation_of_visual_context.json
| |–– 163.mp4
| |–– ... # remaining videos
|–– partial_actions/
| |–– annotations_partial_actions.json
| |–– captions_partial_actions.json
| |–– 121.mp4
| |–– ... # remaining videos
|–– unusual_and_physically_anomalous_activities/
| |–– annotations_interpretation_of_visual_counusual_and_physically_anomalous_activities.json
| |–– captions_unusual_and_physically_anomalous_activities.json
| |–– 101.mp4
| |–– ... # remaining videos
... # remaining video-evaluation dimension folders
Here, each folder corresponds to a single video evaluation dimension and contains annotations (QA pairs and captions) alongside videos.
Now note that videos utilized from Something-Something V2 Dataset (SSv2) have been not included in the zipped folder due to copyright policies. In order to complete the dataset, first:
-
Download SSv2 dataset from official website (it is publicly available). You will be prompted to register yourself by creating an account.
-
Identify the videos for CVRR-ES dataset by retrieving the videos with ids given in this text file.
-
Rename the videos following the mapping in the text file and add them to their respective evaluation dimension folder in the unzipped CVRR-ES folder.
Step2. Extracting Frames from Videos
To extract frames from the videos, you can use the extract_video_frames.py script provided in this repository. This script will extract frames from each video and save them in a specified directory.
python3 extract_video_frames.py
🧠Google Cloud Setup
Step1. You need to set up authentication credentials to use VertexAI.
-
Method 1: Using a Service Account
Create a service account from the GCP Console and set the environment variable to the path of the credentials JSON file.
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json -
Method 2: Set up ADC with User Account
gcloud auth application-default loginWith this method, you may receive quota warnings. To suppress these warnings, explicitly set the quota project using the following command:
gcloud auth application-default set-quota-project GCP_PROJECT_ID
Step2. Set up Google Cloud Storage Bucket
You need to create a Google Cloud Storage bucket to store the Video data of the CVRR-ES dataset. You can create a bucket using the GCP Console.
After creating the bucket, you need to upload the CVRR-ES dataset videos to the bucket. The structure of the bucket should look like this:
YOUR_BUCKET_NAME/
|–– interpretation_of_visual_context/
| |–– XXXX.mp4
|–– partial_actions/
| |–– XXXX.mp4
|–– unusual_and_physically_anomalous_activities/
| |–– XXXX.mp4
... # remaining video-evaluation dimension folders
🧠OpenAI Setup
You need to set up your OpenAI API key to use the OpenAI API for generating responses.
You can set the API key as an environment variable:
export OPENAI_API_KEY=your_openai_api_key
📋Video Summary Generation
DIVE uses object-centric video summaries based on visual detection to provide interpretable, context-rich input for downstream reasoning. The video summary generation process consists of two main steps: object detection and video summarization.
Object Detection
The system uses Grounding DINO for zero-shot object detection across video frames:
- Model: IDEA-Research/grounding-dino-base
- Process: Analyzes sampled frames to identify relevant objects using GPT-4.1 vision analysis
- Output: CSV files containing frame-by-frame object detection results with bounding boxes and confidence scores
Video Summarization
The summarization process integrates object detection results with visual analysis:
- Input: Object detection CSV data + sampled video frames
- Model: OpenAI GPT-4.1
- Features:
- Chronological event tracking with explicit frame numbers
- Object behavior analysis and temporal patterns
- Scene transition detection
- Integration of detection data with visual cues
Generating Video Summaries
To generate video summaries for your dataset, use the generate_video_summary.py script:
python generate_video_summary.py \
--cvrr_dataset_path /path/to/CVRR-ES \
--image_base_path /path/to/extracted/frames \
--output_dir /path/to/output
Output Files: For each video, the script generates:
{video_id}.csv: Object detection results in CSV format{video_id}.json: Raw object detection data in JSON format{video_id}.txt: Generated video summary text{video_id}_objects.txt: List of detected object classes
Configuration:
The system uses the following default settings (configurable in Config class):
- Frame sampling: 32 frames per video using divisor-based sampling
- Detection thresholds: Box threshold 0.3, Text threshold 0.25
- CSV limit: Maximum 4000 lines for efficient processing
- GPU support: Automatic CUDA detection for faster processing
Pre-generated Summary Data
Alternatively, you can download pre-generated video summary data from this link to skip the generation step.
🐋Docker Setup
Before running the code, you need to set up the environment variables for Google Cloud and OpenAI. You can do this by creating a .env file in the docker directory.
GOOGLE_CLOUD_PROJECT=gcp_project_id
GOOGLE_CLOUD_LOCATION=region
GOOGLE_CLOUD_BUCKET_NAME=bucket_name
GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/credentials.json
OPENAI_API_KEY=your_openai_api_key
To run the code in a Docker container, you can use the provided Dockerfile. Follow these steps:
cd docker
docker compose build
🚀Execute the DIVE
To execute the DIVE, you can use the main.py script. This script will process the CVRR-ES dataset and generate the results.
You can run the DIVE with the following command:
python3 main.py \
--cvrr_dataset_path CVRR_DATASET_PATH \
--image_base_path IMAGE_BASE_PATH \
--video_summary_path VIDEO_SUMMARY_PATH \
--output_dir OUTPUT_DIR \
--num_frames NUM_FRAMES \
--processes NUM_PROCESSES \
--resume
To evaluate the results, you can use the official evaluation script provided in the CVRR-ES dataset. The evaluation script will compare the generated responses with the ground truth annotations and calculate the scores for each evaluation dimension.
please refer to the CVRR-Evaluation-Suite for more details.
✅Citation
If you find our work and this repository useful, please consider giving our repo a star and citing our paper as follows:
@article{dive2025cvpr,
title={DIVE: Deep-search Iterative Video Exploration A Technical Report for the CVRR Challenge at CVPR 2025},
author={Umihiro Kamoto, Tatsuya Ishibashi, Noriyuki Kugo},
journal={arXiv:2506.21891},
year={2025}
}
💬Contact
If you have any questions, please create an issue on this repository or contact at kamoto.umihiro@jp.panasonic.com.