⏳ TIME

October 5, 2025

[NeurIPS'25 Spotlight] TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

Peking University, Huawei Noah's Ark Lab

🎉🎉 Congratulations! This paper has been accepted as a NeurIPS 2025 Spotlight 🌟🔥 in the Datasets and Benchmarks (D&B) track.

🌟 If you find this work helpful, please consider giving us a ⭐ on GitHub!

📋 Project Information

Authors: Shaohang Wei, Wei Li, Feifan Song, Wen Luo, Tianyi Zhuang, Haochen Tan, Zhijiang Guo, Houfeng Wang
Affiliations: Peking University, Huawei Noah's Ark Lab
Contact: shaohang@stu.pku.edu.cn

📖 Abstract

Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing work neglects three real-world challenges for temporal reasoning:

  • Intensive temporal information
  • Fast-changing event dynamics
  • Complex temporal dependencies in social interactions

To bridge this gap, we propose TIME, a multi-level benchmark designed for temporal reasoning in real-world scenarios.

TIME consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. This benchmark encompasses 3 sub-datasets reflecting different real-world challenges: TIME-Wiki, TIME-News, and TIME-Dial.

We conduct extensive experiments on both reasoning and non-reasoning models, analyze temporal reasoning performance in depth across diverse real-world scenarios and tasks, and summarize the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TIME-Lite, a human-annotated subset, to foster future research and standardized evaluation in temporal reasoning.

[Figure: TIME Dataset Overview]

🚀 Get Started

📥 Step 1: Install Dependencies

# Install git-lfs
pip install git-lfs

📊 Step 2: Download Dataset

We provide two datasets. Choose according to your needs:

⚠️ Option 1: Complete TIME Dataset (Large dataset; may be too large for quick evaluation)

# From the working directory (where scripts/ lives), make the TIME download script executable
chmod +x scripts/download_data_time.sh

# Download the data
./scripts/download_data_time.sh

✅ Option 2: TIME-Lite Dataset (Recommended; high-quality subset)

# From the working directory (where scripts/ lives), make the TIME-Lite download script executable
chmod +x scripts/download_data_time_lite.sh

# Download the data
./scripts/download_data_time_lite.sh
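
After either download finishes, it can help to confirm that real data files arrived rather than Git LFS pointer stubs. The sketch below is not part of the repository: the function names are hypothetical, and it assumes only that the download script writes files into some local directory that you pass in. It counts files by extension, totals their size, and flags any file that still begins with the standard Git LFS pointer header.

```python
from collections import Counter
from pathlib import Path

# First bytes of a Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if the file looks like an unfetched Git LFS pointer."""
    with path.open("rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def summarize_download(root):
    """Count files by extension, total their size, and list LFS pointers.

    `root` is whatever directory the download script populated; this
    sketch makes no assumption about its name or layout.
    """
    counts = Counter()
    total_bytes = 0
    pointers = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        counts[path.suffix or "(no extension)"] += 1
        total_bytes += path.stat().st_size
        if is_lfs_pointer(path):
            pointers.append(path)
    return counts, total_bytes, pointers
```

For example, `summarize_download("data")` would inspect a directory named `data`; that name is only a placeholder for wherever the script writes its output. A non-empty `pointers` list suggests that git-lfs was not active when the files were fetched.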

🔧 Step 3: Install Evaluation Dependencies

pip install -r evaluation/requirements.txt

▶️ Step 4: Run Evaluation

Option A: Evaluate TIME dataset

./scripts/eval_time.sh

Option B: Evaluate TIME-Lite dataset (Recommended)

./scripts/eval_timelite.sh
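
The evaluation scripts handle scoring end to end. For readers who want a feel for how a free-form answer can be compared against a reference, here is a minimal sketch of two widely used QA metrics, exact match and token-level F1. Treat it as an illustration under assumptions: the function names are hypothetical, and the repository's scripts may normalize and score answers differently.

```python
import re
import string
from collections import Counter

# Map every ASCII punctuation character to a space so that a date such as
# "2020-03-01" splits into the tokens "2020", "03", "01".
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = text.lower().translate(_PUNCT_TO_SPACE)
    return re.sub(r"\s+", " ", text).strip()


def exact_match(prediction: str, reference: str) -> bool:
    """True when the two answers are identical after normalization."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("March 2020.", "march 2020"))
    print(f"{token_f1('in March 2020', 'March 2020'):.2f}")
```

Exact match is strict and suits short, canonical answers such as a single date, while token-level F1 gives partial credit when a prediction contains the reference plus extra words.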

🧠 Construction Pipeline

[Figure: TIME Construction Pipeline]

📊 Data Quantity

📈 Dataset Statistics:

  • TIME: 38,522 QA pairs (Complete benchmark)
  • TIME-Lite: 943 QA pairs (High-quality subset)

Here is a detailed breakdown of the dataset statistics:

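| Dataset | All Tasks | Ext. | Loc. | Comp. | D.C. | O.C. | E.R. | O.R. | R.R. | C.T. | T.L. | C.F. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TIME | 38522 | 1480 | 3546 | 3376 | 3401 | 3549 | 3537 | 3538 | 3537 | 3513 | 5508 | 3537 |
| TIME-Wiki | 13848 | 1261 | 1299 | 1126 | 1151 | 1299 | 1287 | 1288 | 1287 | 1263 | 1300 | 1287 |
| TIME-News | 19958 | 0 | 1800 | 1800 | 1800 | 1800 | 1800 | 1800 | 1800 | 1800 | 3758 | 1800 |
| TIME-Dial | 4716 | 219 | 447 | 450 | 450 | 450 | 450 | 450 | 450 | 450 | 450 | 450 |
| TIME-Lite | 943 | 60 | 90 | 78 | 86 | 90 | 90 | 90 | 90 | 90 | 89 | 90 |
| TIME-Lite-Wiki | 322 | 30 | 30 | 24 | 28 | 30 | 30 | 30 | 30 | 30 | 30 | 30 |
| TIME-Lite-News | 299 | 0 | 30 | 30 | 30 | 30 | 30 | 30 | 30 | 30 | 29 | 30 |
| TIME-Lite-Dial | 322 | 30 | 30 | 24 | 28 | 30 | 30 | 30 | 30 | 30 | 30 | 30 |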

Task abbreviations: Ext. (Extract), Loc. (Localization), Comp. (Computation), D.C. (Duration Compare), O.C. (Order Compare); E.R. (Explicit Reasoning), O.R. (Order Reasoning), R.R. (Relative Reasoning); C.T. (Co-temporality), T.L. (Timeline), C.F. (Counterfactual).
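
As a quick sanity check on the statistics above, the eleven per-task counts in each row should add up to that row's "All Tasks" value, and the three sub-datasets should add up, task by task, to their aggregate row. The following Python sketch transcribes the counts by hand and verifies both properties; the variable and function names are illustrative and not part of the repository.

```python
# Task columns, in the order used by the statistics table.
TASKS = ["Ext.", "Loc.", "Comp.", "D.C.", "O.C.", "E.R.",
         "O.R.", "R.R.", "C.T.", "T.L.", "C.F."]

# Each entry: (reported "All Tasks" total, per-task counts in TASKS order).
STATS = {
    "TIME": (38522, [1480, 3546, 3376, 3401, 3549, 3537, 3538, 3537, 3513, 5508, 3537]),
    "TIME-Wiki": (13848, [1261, 1299, 1126, 1151, 1299, 1287, 1288, 1287, 1263, 1300, 1287]),
    "TIME-News": (19958, [0, 1800, 1800, 1800, 1800, 1800, 1800, 1800, 1800, 3758, 1800]),
    "TIME-Dial": (4716, [219, 447, 450, 450, 450, 450, 450, 450, 450, 450, 450]),
    "TIME-Lite": (943, [60, 90, 78, 86, 90, 90, 90, 90, 90, 89, 90]),
    "TIME-Lite-Wiki": (322, [30, 30, 24, 28, 30, 30, 30, 30, 30, 30, 30]),
    "TIME-Lite-News": (299, [0, 30, 30, 30, 30, 30, 30, 30, 30, 29, 30]),
    "TIME-Lite-Dial": (322, [30, 30, 24, 28, 30, 30, 30, 30, 30, 30, 30]),
}


def row_is_consistent(name):
    """Check that a row has one count per task and sums to its total."""
    total, counts = STATS[name]
    return len(counts) == len(TASKS) and sum(counts) == total


def parts_match_aggregate(aggregate, parts):
    """Check that sub-dataset rows sum, column by column, to the aggregate."""
    agg_total, agg_counts = STATS[aggregate]
    summed = [sum(column) for column in zip(*(STATS[p][1] for p in parts))]
    part_total = sum(STATS[p][0] for p in parts)
    return summed == agg_counts and part_total == agg_total


for name in STATS:
    print(f"{name:15s} row sums to its total: {row_is_consistent(name)}")
print("TIME = Wiki + News + Dial:",
      parts_match_aggregate("TIME", ["TIME-Wiki", "TIME-News", "TIME-Dial"]))
print("TIME-Lite = Lite-Wiki + Lite-News + Lite-Dial:",
      parts_match_aggregate(
          "TIME-Lite", ["TIME-Lite-Wiki", "TIME-Lite-News", "TIME-Lite-Dial"]))
```

A check like this is also a convenient guard when re-deriving subsets of the benchmark, since a mismatch points directly at the row or column that was transcribed or filtered incorrectly.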

💪🏻 Evaluation Results

📊 TIME-Lite Results Radar Charts

Here are the detailed evaluation results for the TIME-Lite dataset on different sub-datasets:

💬 Citation

If you find our work interesting and meaningful, please consider starring this repo, upvoting our Hugging Face repo TIME, and citing our paper as follows.

@article{wei2025time,
  title={TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios},
  author={Wei, Shaohang and Li, Wei and Song, Feifan and Luo, Wen and Zhuang, Tianyi and Tan, Haochen and Guo, Zhijiang and Wang, Houfeng},
  journal={arXiv preprint arXiv:2505.12891},
  year={2025}
}