🎬 HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding

September 14, 2025

📖 Introduction

HLV-1K is a benchmark for evaluating how well multimodal large language models (MLLMs) understand hour-long videos when asked time-specific queries. Unlike video understanding benchmarks that focus on short clips, HLV-1K targets long-term video comprehension by providing:

  • 🕐 Hour-long Videos: 1,009 videos with an average duration of 1 hour
  • 📊 Diverse Reasoning Tasks: 14,847 QA and MCQA pairs across multiple reasoning levels
  • ⏰ Time-specific Queries: Questions that require understanding of specific temporal segments
  • 🎯 Multi-level Evaluation: Frame-level, within-event, cross-event, and long-term reasoning

As video content becomes increasingly prevalent and lengthy, HLV-1K provides a robust evaluation framework for assessing models' ability to comprehend and reason about extended video sequences with precise temporal understanding.

Leaderboard

Accuracy scores on HLV-1K are reported at four levels: frame, within-event, cross-event, and long-term.

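| # | Model | LLM Params | Frames | Date | Frame-level | Within-event-level | Cross-event-level | Long-term-level | Overall |
|---|---|---|---|---|---|---|---|---|---|
| 3 | LLaVA-Video | 72B | 120 | 2025-01-03 | 84.41 | 78.43 | 80.10 | 75.65 | 78.93 |
| 2 | LLaVA-OneVision | 72B | 120 | 2025-01-03 | 80.33 | 75.06 | 77.25 | 68.74 | 74.01 |
| 1 | Qwen2-VL | 72B | 120 | 2025-01-03 | 61.44 | 66.83 | 66.96 | 67.17 | 65.78 |
| 4 | Kangaroo | 8B | 120 | 2025-01-03 | 75.23 | 63.57 | 65.04 | 54.60 | 62.71 |
| 6 | Gemini 1.5 Pro | - | 120 | 2025-01-03 | 60.39 | 64.46 | 63.08 | 62.37 | 62.41 |
| 2 | LongVA | 7B | 120 | 2025-01-03 | 67.89 | 59.12 | 61.37 | 59.67 | 61.74 |
| 1 | InternVL2.5 | 8B | 120 | 2025-01-03 | 60.72 | 65.02 | 62.73 | 59.34 | 61.24 |
| 5 | GPT-4o | - | 120 | 2025-01-03 | 53.88 | 59.08 | 56.64 | 54.37 | 55.48 |
| 4 | Claude 3.5 Sonnet | - | 20 | 2025-01-03 | 26.21 | 23.98 | 27.73 | 28.89 | 27.24 |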
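
As a sanity check on how the Overall column relates to the per-level columns, the sketch below recomputes it as a question-count-weighted mean, using the per-level question counts from the Dataset Statistics table. The weighting is an assumption inferred from the published numbers, not a documented formula, and `weighted_overall` is a hypothetical helper.

```python
# Per-level question counts, taken from the Dataset Statistics table.
LEVEL_COUNTS = {
    "frame": 3335,
    "within_event": 2490,
    "cross_event": 2809,
    "long_term": 6213,
}


def weighted_overall(level_acc: dict[str, float]) -> float:
    """Question-count-weighted mean of per-level accuracies (assumed formula)."""
    total = sum(LEVEL_COUNTS.values())
    return sum(level_acc[k] * LEVEL_COUNTS[k] for k in LEVEL_COUNTS) / total


# Per-level accuracies from the LLaVA-Video leaderboard row.
llava_video = {
    "frame": 84.41,
    "within_event": 78.43,
    "cross_event": 80.10,
    "long_term": 75.65,
}
print(round(weighted_overall(llava_video), 2))
```

For the LLaVA-Video row this gives 78.93, which matches the reported Overall value; the other rows checked agree to within about 0.01.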

📊 Benchmark Details

🎯 Key Features

  • 📹 Video Scale: 1,009 hour-long videos (average duration: ~1 hour)
  • ❓ Question Diversity: 14,847 QA and MCQA pairs with time-specific queries
  • 🔍 Multi-level Reasoning: Four distinct reasoning levels for comprehensive evaluation
  • ⏱️ Temporal Precision: Questions anchored to specific time segments within videos

📈 Dataset Statistics

| Metric | Count | Percentage |
|---|---|---|
| Total Videos | 1,009 | 100% |
| Total QA Pairs | 14,847 | 100% |
| QA Type | | |
| - Multiple Choice (MCQA) | 10,533 | 70.9% |
| - Open-ended (QA) | 4,314 | 29.1% |
| Reasoning Level | | |
| - Long-term | 6,213 | 41.8% |
| - Frame-level | 3,335 | 22.5% |
| - Cross-event | 2,809 | 18.9% |
| - Within-event | 2,490 | 16.8% |

🎭 Task Distribution

| Task Type | Count | Percentage |
|---|---|---|
| Object Understanding | 2,396 | 16.1% |
| Character Understanding | 2,191 | 14.8% |
| Speed Analysis | 1,701 | 11.5% |
| Camera Direction | 1,275 | 8.6% |
| Spatial Relationship | 1,255 | 8.5% |
| Attribute Change | 1,159 | 7.8% |
| Descriptive Scene | 964 | 6.5% |
| Action Understanding | 826 | 5.6% |
| Time Order | 730 | 4.9% |
| Plot Understanding | 649 | 4.4% |
| Temporal Relationship | 641 | 4.3% |
| Object Direction | 429 | 2.9% |
| Causal Reasoning | 322 | 2.2% |
| Scene Understanding | 212 | 1.4% |
| Counting | 97 | 0.7% |
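
Distributions like the ones above could be recomputed from the released files with a short tally. The sketch below assumes each file in the dataset directory is a JSON list of QA records carrying the `qa_type`, `qa_level`, and `qa_task` fields shown under Data Format; `tally` and `print_distribution` are hypothetical helpers, not part of the repository.

```python
import json
from collections import Counter
from pathlib import Path


def tally(dataset_dir: str, field: str) -> Counter:
    """Count QA pairs by the value of `field` across all JSON files."""
    counts = Counter()
    for path in sorted(Path(dataset_dir).glob("*.json")):
        with open(path, "r", encoding="utf-8") as f:
            for qa in json.load(f):
                # Records missing the field are grouped under "unknown".
                counts[qa.get(field, "unknown")] += 1
    return counts


def print_distribution(counts: Counter) -> None:
    """Print value, count, and percentage, most frequent first."""
    total = sum(counts.values())
    for value, n in counts.most_common():
        print(f"{value:<24}{n:>8,}{100 * n / total:>8.1f}%")
```

Calling `print_distribution(tally("dataset", "qa_task"))` from the repository root would print one row per task. The raw field values (such as "speed") may not match the display names used in the table.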

Data Examples

[Figure: Benchmark construction and examples.]

Benchmark Statistics

[Figure: HLV-1K: (a) video category distribution, (b) video duration distribution, and (c) duration distribution of time-specific queries.]

[Figure: HLV-1K: distribution of benchmark annotations.]

🔧 Dataset Construction

📝 Annotation Pipeline

HLV-1K uses a GPT-4o-based annotation pipeline to generate questions in four steps:

  1. Frame Description Extraction: Detailed descriptions of video frames at specific timestamps
  2. Event Summarization: Coherent event descriptions spanning ~60 seconds with precise temporal boundaries
  3. Question Generation: Time-specific questions across four reasoning levels
  4. Quality Assurance: Multi-round validation to ensure question accuracy and temporal precision
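
To make step 2 more concrete, the sketch below splits a video into consecutive event windows of about 60 seconds, each with explicit start and end times. Fixed-length windows starting at 0 are an assumption for illustration only; the actual pipeline may place event boundaries differently, and `event_windows` is a hypothetical helper.

```python
def event_windows(duration_s: float, window_s: float = 60.0) -> list[tuple[float, float]]:
    """Split [0, duration_s] into consecutive windows of at most window_s seconds."""
    if duration_s <= 0 or window_s <= 0:
        raise ValueError("duration_s and window_s must be positive")
    windows = []
    start = 0.0
    while start < duration_s:
        # The last window is shortened so it never runs past the video end.
        end = min(start + window_s, duration_s)
        windows.append((start, end))
        start = end
    return windows
```

Under these assumptions a one-hour video (3600 seconds) yields 60 windows, and each window's (start, end) pair is the kind of temporal boundary that a within-event question can reference.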

🎯 Reasoning Levels

| Level | Description | Example |
|---|---|---|
| Frame-level | Questions about specific frames | "What object is visible at 1290.0 seconds?" |
| Within-event | Questions within single events | "Are the individuals working at a fast pace between 1290.0 and 1350.0 seconds?" |
| Cross-event | Questions spanning multiple events | "What activity follows the circuit board assembly?" |
| Long-term | Questions requiring full video understanding | "What is the overall project being completed in this video?" |
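
Time-specific questions embed their timestamps in the question text, so a model harness may want to recover them, for example to decide which frames to sample. The sketch below is one possible way to do that with regular expressions; the patterns are assumptions based only on the example questions in this document, and `parse_time_reference` is a hypothetical helper.

```python
import re

# A decimal number such as "1290.0" (the decimal part is required here).
_TIME = r"(\d+\.\d+)"
# Ranges written as "1290.0-1350.0 seconds" or "1290.0 and 1350.0 seconds".
_RANGE = re.compile(_TIME + r"\s*(?:-|and|to)\s*" + _TIME + r"\s*seconds")
# Single timestamps written as "at 1290.0 seconds".
_POINT = re.compile(r"at\s+" + _TIME + r"\s*seconds")


def parse_time_reference(question: str):
    """Return (start, end) for a range, (t, t) for a single timestamp, else None."""
    m = _RANGE.search(question)
    if m:
        return float(m.group(1)), float(m.group(2))
    m = _POINT.search(question)
    if m:
        t = float(m.group(1))
        return t, t
    return None
```

Questions without an explicit timestamp, such as the cross-event and long-term examples, return None, which a caller can treat as "use the whole video".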

📊 Evaluation Metrics

  • Accuracy: Overall correctness across all question types
  • Level-wise Performance: Accuracy breakdown by reasoning level
  • Task-specific Metrics: Performance on different cognitive tasks
  • Temporal Understanding: Accuracy on time-specific queries
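
The overall and level-wise accuracy metrics can be sketched as follows. This is a minimal illustration that assumes each record carries the `qa_level` and `answer` fields from the dataset plus a hypothetical `prediction` field, and that correctness is a case-insensitive exact match. It does not attempt to reproduce the scoring done by gpt_evaluation.py.

```python
from collections import defaultdict


def accuracy_by_level(records):
    """Compute overall and per-level accuracy, both as percentages.

    records: iterable of dicts with 'qa_level', 'answer', and 'prediction' keys.
    A prediction counts as correct on a case-insensitive exact match.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        level = r["qa_level"]
        total[level] += 1
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[level] += 1
    n = sum(total.values())
    if n == 0:
        return 0.0, {}
    per_level = {lvl: 100.0 * correct[lvl] / total[lvl] for lvl in total}
    return 100.0 * sum(correct.values()) / n, per_level
```

Grouping by `qa_task` instead of `qa_level` would give the task-specific breakdown in the same way.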

🔍 Benchmark Comparison

[Figure: HLV-1K benchmark comparison.]

Experiment Results

Different Question Types

[Figure: Evaluation results of four representative MLLMs.]

Comprehensive-Long-Video-Understanding-Survey

🚀 Getting Started

📥 Dataset Download

The HLV-1K dataset is available for research purposes. Please follow these steps:

  1. Clone the repository:

    git clone https://github.com/Vincent-ZHQ/HLV_1K.git
    cd HLV_1K
    
  2. Dataset structure:

    HLV_1K/
    ├── dataset/           # 1,009 JSON files with QA pairs
    ├── static/            # Web interface assets
    ├── gpt_evaluation.py  # Evaluation script
    └── index.html         # Web interface
    

🔧 Usage

  1. Load dataset:

    import json
    
    # Load a single video's QA pairs
    with open('dataset/video_id.json', 'r') as f:
        qa_pairs = json.load(f)
    
    for qa in qa_pairs:
        print(f"Question: {qa['question']}")
        print(f"Answer: {qa['answer']}")
        print(f"Level: {qa['qa_level']}")
        print(f"Task: {qa['qa_task']}")
    
  2. Evaluation:

    python gpt_evaluation.py --model_name your_model --results_file your_results.json
    

📋 Data Format

Each JSON file contains a list of QA pairs. Each entry has the following structure (the "options" field is present only for MCQA entries):

{
  "qa_idx": 1,
  "qa_type": "mcqa",
  "qa_level": "within_event",
  "qa_task": "speed",
  "question": "Are the individuals working at a fast pace between 1290.0 and 1350.0 seconds?",
  "answer": "No",
  "options": ["A. Yes", "B. No"]
}
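
As an illustration of how a record in this format might be turned into model input, the sketch below renders a QA entry as a text prompt and appends the options for MCQA entries. The prompt wording and the `build_prompt` helper are assumptions for illustration, not the prompt used by the authors.

```python
def build_prompt(qa: dict) -> str:
    """Render one QA record as a text prompt; MCQA records list their options."""
    lines = [qa["question"]]
    if qa.get("qa_type") == "mcqa":
        lines.extend(qa.get("options", []))
        lines.append("Answer with one of the options above.")
    return "\n".join(lines)


# The example record from the Data Format section.
example = {
    "qa_idx": 1,
    "qa_type": "mcqa",
    "qa_level": "within_event",
    "qa_task": "speed",
    "question": "Are the individuals working at a fast pace between 1290.0 and 1350.0 seconds?",
    "answer": "No",
    "options": ["A. Yes", "B. No"],
}
print(build_prompt(example))
```

Entries without an "options" field fall through to a prompt containing only the question.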

🤝 Contributing

We welcome contributions to improve HLV-1K! Please feel free to:

  • Report issues or bugs
  • Suggest new features or improvements
  • Submit pull requests

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📚 Citation

If you find our work helpful, please consider citing:

@article{zou2025hlv,
  title={{HLV-1K}: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding},
  author={Zou, Heqing and Luo, Tianze and Xie, Guiyang and Zhang, Victor Xiao Jie and Lv, Fengmao and Wang, Guangcong and Chen, Junyang and Wang, Zhuochen and Zhang, Hansheng and Zhang, Huaijian},
  journal={arXiv preprint arXiv:2501.01645},
  year={2025}
}

🙏 Acknowledgments

We thank all contributors and the research community for their valuable feedback and support in developing HLV-1K.