RVTBench: A Multi-Task Video Reasoning Dataset across Semantic, Spatial, and Temporal Categories with Four Difficulty Levels
May 28, 2026
This repository contains the code and benchmark for the paper:
A multi-task video reasoning dataset across semantic, spatial, and temporal categories with four difficulty levels
Yiqing Shen, Chenjia Li, Chenxiao Fan, Mathias Unberath
Johns Hopkins University
Overview
This repository provides (1) the automated benchmark construction pipeline for building RVTBench using digital twin (DT) representations, and (2) the pre-constructed digital twin JSON files for the RVTBench dataset. RVTBench is a reasoning visual task (RVT) benchmark containing 3,896 queries totaling over 1.2 million tokens across:
- 4 task types: Segmentation, Grounding, VQA, and Summary
- 3 reasoning categories: Semantic, Spatial, and Temporal
- 4 difficulty levels: L1 (simple attribute identification) through L4 (complex multi-step reasoning chains)
derived from 200 video sequences sourced from DAVIS 2017 [1] and SA-V [2].
Dataset
The complete RVTBench dataset is publicly available on Zenodo:
Shen, Y., Li, C., Fan, C. & Unberath, M. A multi-task video reasoning dataset across semantic, spatial, and temporal categories with four difficulty levels. Zenodo https://doi.org/10.5281/zenodo.19697191 (2026). [3]
Repository Structure
rvt/
├── dataset_construction/ # Automated benchmark construction pipeline
│ ├── vlm_inf.py # VLM (LLaVA-v1.6-34B) inference for object labeling and description
│ ├── depth.py # Depth estimation using Depth Anything V2
│ ├── depth_description.py # Spatial relationship summarization from depth values via GPT-4o
│ ├── depth_prompt.txt # Prompt template for depth-based spatial description
│ ├── features.py # Geometric feature extraction (center, velocity, bounding box)
│ ├── area_center.py # Mask area and center computation
│ ├── mask2bbox.py # Mask-to-bounding-box visualization
│ ├── merge.py # Ground truth and SAM2 mask merging via IoU matching
│ ├── dt_json.py # Digital twin JSON structure assembly
│ └── chatgpt_interface.py # OpenAI API interface for query generation
├── rvtbench/ # Pre-constructed digital twin JSON files (974 files)
│ ├── <video_name>_<idx>.json # Per-video DT representations with annotations
│ └── ...
└── README.md
Dataset Construction Pipeline
The benchmark construction pipeline leverages digital twin (DT) representations as structured intermediaries between raw video perception and the generation of implicit text queries. The pipeline consists of the following stages:
Stage 1: Video Segmentation and Mask Processing
- SAM2 Segmentation: Videos are segmented using SAM 2 [4] to produce per-object mask sequences.
- Ground Truth Merging (merge.py): For videos with existing ground truth annotations (e.g., from DAVIS), masks are merged with SAM2 outputs via IoU-based matching (threshold = 0.6).
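The IoU-based matching step can be illustrated with a minimal sketch. It assumes masks are boolean NumPy arrays keyed by object id and uses a greedy one-to-one assignment; the function names `mask_iou` and `match_masks` are hypothetical, and the actual strategy in merge.py may differ. Only the 0.6 threshold comes from the description above.

```python
import numpy as np


def mask_iou(a, b):
    """Intersection over union of two binary masks of equal shape."""
    a = np.asarray(a).astype(bool)
    b = np.asarray(b).astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # two empty masks: treat as no overlap
    return float(np.logical_and(a, b).sum() / union)


def match_masks(gt_masks, sam_masks, threshold=0.6):
    """Greedily pair each ground truth mask with its best unused SAM 2 mask.

    Both arguments are dicts mapping an object id to a binary mask.
    Returns {gt_id: (sam_id, iou)} for pairs with IoU >= threshold.
    This greedy scheme is an assumption, not necessarily what merge.py does.
    """
    matches = {}
    used = set()
    for gt_id, gt in gt_masks.items():
        best_id, best_iou = None, threshold
        for sam_id, sam in sam_masks.items():
            if sam_id in used:
                continue
            iou = mask_iou(gt, sam)
            if iou >= best_iou:
                best_id, best_iou = sam_id, iou
        if best_id is not None:
            matches[gt_id] = (best_id, best_iou)
            used.add(best_id)
    return matches
```

Ground truth masks left out of the returned dictionary had no SAM 2 candidate at or above the threshold and would need to be handled separately.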
Stage 2: Digital Twin Feature Extraction
- Mask-to-BBox Visualization (mask2bbox.py): Binary masks are converted to bounding box overlays on original video frames for VLM input.
- VLM Inference (vlm_inf.py): LLaVA-v1.6-34B [5] generates per-object labels and detailed visual descriptions (color, texture, shape, clothing, etc.) from bounding box–annotated frames.
- Depth Estimation (depth.py): Depth Anything V2 [6] estimates per-frame depth maps, and per-object depth values are computed via mask-weighted centroid or trimmed-mean aggregation.
- Geometric Features (features.py, area_center.py): Per-object center coordinates, velocity (speed and angle), bounding box, and area are extracted from binary masks using OpenCV moments.
- Spatial Summarization (depth_description.py): GPT-4o summarizes relative spatial positions of objects from depth sequences using a structured prompt (depth_prompt.txt).
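The geometric and depth features in this stage can be sketched as follows. This is an illustration under stated assumptions, not the code in features.py or depth.py: it uses plain NumPy in place of OpenCV moments (for a binary mask, the moment-based centroid equals the mean pixel coordinate), measures velocity in pixels per frame in image coordinates, and uses a trim fraction chosen only for illustration. All function names and the returned dictionary keys are hypothetical.

```python
import math

import numpy as np


def mask_features(mask):
    """Area, centroid (x, y), and bounding box of a binary mask."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None  # object not visible in this frame
    return {
        "area": int(xs.size),
        "center": (float(xs.mean()), float(ys.mean())),
        # (x_min, y_min, x_max, y_max), inclusive pixel indices
        "bbox": (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())),
    }


def velocity(center_prev, center_curr):
    """Speed (pixels per frame) and angle (degrees) between two centroids.

    The angle is measured in image coordinates, where y grows downward.
    """
    dx = center_curr[0] - center_prev[0]
    dy = center_curr[1] - center_prev[1]
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx))


def trimmed_mean_depth(depth_map, mask, trim=0.1):
    """Mean depth inside the mask after dropping the lowest and highest
    `trim` fraction of values (the fraction is an assumed default)."""
    values = np.sort(np.asarray(depth_map)[np.asarray(mask).astype(bool)])
    if values.size == 0:
        return None
    k = int(values.size * trim)
    kept = values[k:values.size - k] if values.size > 2 * k else values
    return float(kept.mean())
```

Trimming makes the per-object depth less sensitive to a few outlier pixels near the mask boundary, where depth estimates often mix foreground and background.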
Stage 3: Digital Twin Assembly and Query Generation
- DT JSON Assembly (dt_json.py): All extracted features are assembled into a hierarchical three-level JSON schema (video → frame → instance) following Equations (3) and (4) of the paper.
- Query Generation (chatgpt_interface.py): GPT-4o generates implicit text queries at four difficulty levels from the structured DT representations, producing queries for segmentation, grounding, VQA, and summary tasks across semantic, spatial, and temporal reasoning categories.
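To make the three-level hierarchy concrete, here is a minimal sketch of how per-frame, per-instance features might be nested into a video → frame → instance structure. The `video` key matches the usage example in this README; every other key (`frames`, `frame_index`, `instances`, `instance_id`) and the function name `build_digital_twin` are hypothetical placeholders. The actual schema is defined by dt_json.py and Equations (3) and (4) of the paper.

```python
import json


def build_digital_twin(video_path, frames):
    """Nest per-instance features under frames, and frames under the video.

    `frames` is a list with one dict per frame, mapping an instance id to a
    dict of that instance's features (label, center, depth, ...).
    Key names here are illustrative assumptions, not the real schema.
    """
    return {
        "video": video_path,
        "frames": [
            {
                "frame_index": index,
                "instances": [
                    {"instance_id": instance_id, **features}
                    for instance_id, features in sorted(frame.items())
                ],
            }
            for index, frame in enumerate(frames)
        ],
    }


def to_json(digital_twin):
    """Serialize the nested structure, e.g. for inclusion in an LLM prompt."""
    return json.dumps(digital_twin, indent=2)
```

Keeping the representation as plain nested dictionaries means it serializes directly to JSON text that a language model can read.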
Pre-constructed Digital Twin Files
The rvtbench/ directory contains 974 pre-constructed digital twin JSON files. Each file (<video_name>_<idx>.json) contains:
- Video path: Reference to the source video
- Annotations: A dictionary of query entries, each containing:
  - category: Reasoning categories (semantic, spatial, temporal)
  - difficulty: Difficulty level (level 1–4)
  - task: Task type (Segmentation, Grounding, VQA, Summary)
  - query: The implicit text query string
  - reference: Reasoning chain reference
  - gt: Ground truth (mask path, bounding box path, or text answer)
Requirements
The dataset construction pipeline depends on the following:
- Python 3.8+
- PyTorch
- Transformers (for LLaVA-v1.6-34B)
- Depth Anything V2
- SAM 2
- OpenCV (cv2)
- NumPy
- Pillow
- tqdm
- OpenAI Python SDK (for GPT-4o query generation)
Usage
Note: The dataset construction pipeline requires GPU resources and pre-trained model weights. The pre-constructed digital twin files in rvtbench/ can be used directly without running the pipeline.
To use the pre-constructed benchmark, download the full RVTBench dataset (including videos, masks, and annotations) from Zenodo [3] and load the digital twin files:
import json

# Load a digital twin file
with open("rvtbench/bear_0.json", "r") as f:
    data = json.load(f)

# Access video path
video_path = data["video"]

# Iterate over annotations
for query_id, annotation in data["annotations"].items():
    print(f"Task: {annotation['task']}")
    print(f"Difficulty: {annotation['difficulty']}")
    print(f"Category: {annotation['category']}")
    print(f"Query: {annotation['query']}")
    print(f"GT: {annotation['gt']}")
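When evaluating on a subset of the benchmark, it can help to tally or filter queries by task, category, and difficulty. The sketch below assumes only the keys used in the loading example (`annotations`, `task`, `category`, `difficulty`); the helper names are hypothetical, and the exact stored values (for instance, how difficulty levels are encoded) should be checked against the files themselves.

```python
from collections import Counter


def summarize_annotations(data):
    """Count queries per (task, category, difficulty) in one loaded file."""
    counts = Counter()
    for annotation in data["annotations"].values():
        key = (annotation["task"], annotation["category"], annotation["difficulty"])
        counts[key] += 1
    return counts


def filter_annotations(data, **criteria):
    """Return {query_id: annotation} for entries matching all given fields,
    e.g. filter_annotations(data, task="VQA")."""
    return {
        query_id: annotation
        for query_id, annotation in data["annotations"].items()
        if all(annotation.get(field) == value for field, value in criteria.items())
    }
```

Because `summarize_annotations` returns a `Counter`, the results from several files can be combined by summing them, starting from an empty `Counter()`.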
Citation
If you use RVTBench or this code in your research, please cite:
@article{shen2026rvtbench,
title={A multi-task video reasoning dataset across semantic, spatial, and temporal categories with four difficulty levels},
author={Shen, Yiqing and Li, Chenjia and Fan, Chenxiao and Unberath, Mathias},
year={2026},
}
Data Citations
[1] Pont-Tuset, J. et al. The 2017 DAVIS challenge on video object segmentation. arXiv:1704.00675 (2017).
[2] Ravi, N., Gabeur, V., Hu, Y.-T. et al. SAM 2: Segment anything in images and videos. arXiv:2408.00714 (2024).
[3] Shen, Y., Li, C., Fan, C. & Unberath, M. A multi-task video reasoning dataset across semantic, spatial, and temporal categories with four difficulty levels. Zenodo https://doi.org/10.5281/zenodo.19697191 (2026).
[4] Ravi, N., Gabeur, V., Hu, Y.-T. et al. SAM 2: Segment anything in images and videos. arXiv:2408.00714 (2024).
[5] Liu, H., Li, C., Li, Y. & Lee, Y. J. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 26296–26306 (2024).
[6] Yang, L., Kang, B., Huang, Z. et al. Depth Anything V2. arXiv:2406.09414 (2024).
License
This repository is licensed under CC0 1.0 Universal (Public Domain Dedication). You can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission. See the LICENSE file for the full legal text.
Acknowledgments
This work was supported in part by the JHU Amazon Initiative for Artificial Intelligence (AI2AI) fellowship program.