SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models

November 16, 2025

[🔬 Project Page] | [📖 arXiv Paper] | [📊 Dataset]


🔥 News

  • 2025.11.15 🔥 SciVideoBench is now integrated into lmms-eval!
  • 2025.10.25 🚀 GPT-5 achieves 55.7% overall accuracy on SciVideoBench, setting a new record in Quantitative Reasoning (55.1%)!
  • 2025.10.20 🏆 Honored to share that SciVideoBench received the Best Paper Award (Benchmark Track) at the ICCV 2025 KnowledgeMR Workshop!
  • 2025.10.17 🔥 We evaluate Qwen3-VL on SciVideoBench. Check our leaderboard!
  • 2025.10.10 🎉 SciVideoBench is introduced as the first benchmark for scientific video reasoning.

👀 Overview

Scientific experiments pose distinct challenges for video-language models (VLMs): precise perception of visual details, integration of multimodal signals (video, audio, transcripts), and complex reasoning across temporal scales. To evaluate these capabilities, we introduce SciVideoBench, the first comprehensive benchmark dedicated to scientific video reasoning.

SciVideoBench evaluates models across Physics, Chemistry, Biology, and Medicine, covering both perceptual understanding and high-level reasoning tasks. It provides a rigorous benchmark for evaluating long-form video reasoning in domains where accuracy and explainability matter most.

SciVideoBench Overview

Figure 1: The overall design of SciVideoBench, showing multi-stage data construction, annotation protocol, and evaluation pipeline.

🎥 Dataset Examples

SciVideoBench Dataset Examples

Figure 2: Examples of SciVideoBench videos and their associated QA pairs across Physics, Chemistry, Biology, and Medicine.

📌 Key Features

  • Domain Coverage: 4 scientific disciplines (Physics, Chemistry, Biology, Medicine) with diverse experimental settings.
  • Scale: 1,000 high-quality, human-verified multiple-choice questions.
  • Reasoning Dimensions:
    • Conceptual Reasoning – understanding principles and experimental setups.
    • Quantitative Reasoning – extracting and reasoning with measurements, numbers, and calculations.
    • Hypothetical Reasoning – counterfactual and “what-if” scientific scenarios.
  • Rich Metadata: Each QA pair is annotated with discipline, subject, timestamp breakdowns, and rationale.
  • Evaluation Protocols: Compatible with lmms-eval for standardized model comparison.

📂 Dataset

  • Total Videos: 240+ scientific experiments
  • Disciplines: Physics, Chemistry, Biology, Medicine
  • QA Pairs: 1,000 carefully curated multiple-choice questions
  • Formats: JSON/JSONL with video-level metadata and QA annotations
  • Annotations: Timestamp breakdowns, rationales, difficulty levels

Example QA format:

{
  "video_id": "58827",
  "question_id": "1",
  "question": "What is the purpose of transferring the sample between chambers as shown between 02:22 and 02:33?",
  "options": {
    "A": "Align the sample with the deposition target",
    "B": "Cool the sample before deposition",
    "C": "Measure sample thickness prior to coating",
    "D": "Preserve main chamber vacuum integrity"
  },
  "answer": "D",
  "question_type": "Conceptual Reasoning",
  "discipline": "Chemistry",
  "subject": "Nanomaterials",
  "timestamp_breakdown": ["..."]
}
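
A record in this format can be turned into a multiple-choice prompt with a few lines of Python. The sketch below is illustrative only: the `format_prompt` helper and the prompt wording are assumptions, not the prompt used in the paper or in lmms-eval, and `timestamp_breakdown` is left elided as in the example.

```python
import json

# One QA record in the format shown above (timestamp_breakdown left elided).
RECORD_JSON = """
{
  "video_id": "58827",
  "question_id": "1",
  "question": "What is the purpose of transferring the sample between chambers as shown between 02:22 and 02:33?",
  "options": {
    "A": "Align the sample with the deposition target",
    "B": "Cool the sample before deposition",
    "C": "Measure sample thickness prior to coating",
    "D": "Preserve main chamber vacuum integrity"
  },
  "answer": "D",
  "question_type": "Conceptual Reasoning",
  "discipline": "Chemistry",
  "subject": "Nanomaterials",
  "timestamp_breakdown": ["..."]
}
"""


def format_prompt(record: dict) -> str:
    """Hypothetical helper: render a record as a multiple-choice prompt."""
    lines = [record["question"]]
    # Sort option letters so the order is stable regardless of JSON key order.
    for letter in sorted(record["options"]):
        lines.append(f"{letter}. {record['options'][letter]}")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)


record = json.loads(RECORD_JSON)
prompt = format_prompt(record)
print(prompt)
```

Keeping the gold `answer` out of the prompt and comparing it only after generation is what makes the record usable for evaluation.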

πŸ† Leaderboard

Selected evaluation results for proprietary and open-source models on SciVideoBench (accuracy in %, higher is better).

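| Models               | Overall | Conceptual | Hypothetical | Quantitative | Biology | Chemistry | Medicine | Physics |
|----------------------|---------|------------|--------------|--------------|---------|-----------|----------|---------|
| Random Guess         | 10.00   | 10.00      | 10.00        | 10.00        | 10.00   | 10.00     | 10.00    | 10.00   |
| Human Evaluation     | 17.40   | 18.11      | 18.70        | 14.29        | 15.88   | 16.06     | 21.19    | 18.88   |
| Gemini-2.5-Pro       | 64.30   | 69.73      | 67.79        | 50.61        | 64.79   | 61.82     | 74.77    | 61.44   |
| GPT-5                | 55.70   | 61.35      | 50.65        | 55.10        | 57.12   | 52.73     | 58.88    | 54.23   |
| Gemini-2.5-Flash     | 46.40   | 50.81      | 44.16        | 43.27        | 44.01   | 49.70     | 55.14    | 44.83   |
| InternVL-3-78B       | 38.50   | 56.76      | 39.22        | 9.80         | 37.65   | 37.58     | 46.73    | 37.30   |
| InternVL-3-14B       | 35.70   | 53.51      | 35.32        | 9.39         | 35.94   | 33.94     | 38.32    | 35.42   |
| Gemini-1.5-Pro       | 27.50   | 27.84      | 28.31        | 25.71        | 27.38   | 26.06     | 27.10    | 28.53   |
| InternVL2-Llama3-76B | 26.30   | 29.70      | 29.91        | 25.08        | 24.94   | 27.01     | 29.91    | 28.38   |
| Gemini-2.0-Flash     | 25.70   | 28.38      | 24.94        | 22.86        | 24.69   | 26.06     | 22.43    | 27.90   |
| GPT-4o               | 24.90   | 30.27      | 28.05        | 11.84        | 21.52   | 29.70     | 31.78    | 24.45   |
| LLaVA-NeXT-Video-32B | 21.10   | 22.42      | 23.36        | 21.32        | 19.80   | 22.86     | 23.36    | 26.22   |
| Qwen2.5-VL-72B       | 20.30   | 16.97      | 24.30        | 19.44        | 21.27   | 20.78     | 24.30    | 21.62   |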

🧪 Evaluation (via lmms-eval)

SciVideoBench has now been integrated into lmms-eval!

1) Install (see the lmms-eval repository for full instructions)

git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
pip install -e .

2) Repo Layout

After cloning lmms-eval, place the scivideobench/ folder under lmms_eval/tasks/:

lmms-eval/
  lmms_eval/
    tasks/
      ├── activitynetqa/
      ├── ai2d/
      ├── aime/
      ├── air_bench/
      ├── ...
      ├── scivideobench/              # ✅ our benchmark lives here
      │   ├── scivideobench.yaml      # task definition(s) for evaluation
      │   ├── utils.py                # dataset loader, metrics, post-processing
      │   └── (optional) extra yaml   # if you split configs (chat, cot, etc.)
  ...

  • scivideobench.yaml → Defines how lmms-eval loads SciVideoBench (dataset path, media fields, eval settings).
  • utils.py → Custom dataloader + evaluation metrics (accuracy, discipline/reasoning type breakdown).
  • You can create multiple YAMLs (e.g., scivideobench_chat.yaml, scivideobench_cot.yaml) if you want variants, similar to how air_bench has multiple YAMLs.
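
The accuracy breakdown by discipline and reasoning type mentioned for utils.py can be sketched as follows. This is a minimal illustration, not the code shipped in utils.py: the `accuracy_breakdown` function and the `prediction` field are hypothetical names, and the toy results are made up purely to exercise the function.

```python
from collections import defaultdict


def accuracy_breakdown(results, key):
    """Return {group: accuracy in %} for the given metadata key.

    `results` is a list of dicts holding the metadata field named by `key`
    plus a gold `answer` and a model `prediction` (hypothetical field name).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        group = r[key]
        total[group] += 1
        correct[group] += int(r["prediction"] == r["answer"])
    return {g: 100.0 * correct[g] / total[g] for g in total}


# Toy results, invented for illustration only.
results = [
    {"discipline": "Chemistry", "question_type": "Conceptual Reasoning", "answer": "D", "prediction": "D"},
    {"discipline": "Chemistry", "question_type": "Quantitative Reasoning", "answer": "B", "prediction": "A"},
    {"discipline": "Physics", "question_type": "Conceptual Reasoning", "answer": "C", "prediction": "C"},
    {"discipline": "Physics", "question_type": "Hypothetical Reasoning", "answer": "A", "prediction": "A"},
]

by_discipline = accuracy_breakdown(results, "discipline")
by_type = accuracy_breakdown(results, "question_type")
overall = 100.0 * sum(r["prediction"] == r["answer"] for r in results) / len(results)
print(overall, by_discipline, by_type)
```

Grouping on the `discipline` and `question_type` fields of each QA record yields the same kind of per-column numbers reported in the leaderboard.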

3) Quick Start

Local Hugging Face models (Qwen2.5-VL, InternVL-3, etc.)

accelerate launch --num_processes 8 --main_process_port 12380 -m lmms_eval \
    --model internvl3 \
    --config lmms-eval/lmms_eval/tasks/scivideobench/scivideobench.yaml \
    --model_args pretrained=OpenGVLab/InternVL3-2B,modality=video,num_frame=32 \
    --gen_kwargs=max_new_tokens=1024 \
    --tasks scivideobench \
    --batch_size 1 \
    --log_samples
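
With max_new_tokens=1024, a model may produce free-form reasoning before its final choice, so the option letter has to be recovered from the generated text before scoring. The sketch below shows one simple way to do that. It is an assumption for illustration, not the post-processing implemented in lmms-eval or utils.py; `extract_choice` is a hypothetical helper, and its default `ABCD` letter set only mirrors the example record (pass a longer string if questions have more options).

```python
import re
from typing import Optional


def extract_choice(text: str, letters: str = "ABCD") -> Optional[str]:
    """Hypothetical post-processing: pull the final option letter from a generation."""
    # Prefer explicit statements such as "answer is D" or "Answer: (B)".
    # Only the word "answer" is matched case-insensitively; the option
    # letter itself must be uppercase so the article "a" is not picked up.
    explicit = re.findall(
        rf"(?i:answer(?:\s+is)?)\s*[:\-]?\s*\(?([{letters}])\b", text
    )
    if explicit:
        return explicit[-1]
    # Fallback: the last standalone uppercase option letter in the text.
    # This heuristic can misfire on sentences that start with "A ...".
    standalone = re.findall(rf"\b([{letters}])\b", text)
    return standalone[-1] if standalone else None


print(extract_choice("The load lock isolates the chamber, so the answer is D."))
print(extract_choice("Answer: (B)"))
print(extract_choice("I am not sure."))
```

Returning `None` for unparseable generations lets the scorer count them as incorrect instead of silently guessing.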

📂 License

License & Access:

SciVideoBench may be used for academic research only. Commercial use in any form is strictly prohibited.
The copyright of all videos belongs to the original video owners and JoVE.
If any content in SciVideoBench infringes your rights, please email us and we will promptly remove it.
Without prior approval, you may not distribute, publish, copy, disseminate, or modify SciVideoBench.
You must strictly comply with the above restrictions.

➡️ Access requirement: Please complete and sign our Dataset Access Agreement before using SciVideoBench:
Google Form: SciVideoBench Dataset Access Agreement

SciVideoBench is available on Hugging Face.

For any questions, contact andongdeng69@gmail.com.

✨ Citation

If you use SciVideoBench, please cite our paper:

    @article{deng2025scivideobench,
        title={SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models},
        author={Andong Deng and Taojiannan Yang and Shoubin Yu and Lincoln Spencer and Mohit Bansal and Chen Chen and Serena Yeung-Levy and Xiaohan Wang},
        journal={arXiv preprint arXiv:2510.08559},
        year={2025}
    }