TimeMachine-bench: A Benchmark for Evaluating Model Capabilities in Repository-Level Migration Tasks (EACL 2026)
April 1, 2026
TimeMachine-bench is a benchmark designed to evaluate model capabilities in repository-level migration tasks. The benchmark consists of real-world GitHub repositories whose tests begin to fail in response to dependency updates.
The name TimeMachine-bench is derived from the idea of time travel for dependency solvers, where the solvers are provided with a date-filtered index to resolve dependencies as if they were operating in the past. Our framework enables strict reproduction of environments at any specific point in time across the entire ecosystem, allowing for scalable evaluation without the need for predefined sets of target libraries.
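The date-filtering idea can be illustrated with a minimal sketch. This is not the pypi-timemachine implementation: the release list, the `version_key` ordering, and the function names are hypothetical and exist only to show how hiding releases uploaded after a cutoff date changes which version a solver would see as the latest.

```python
from datetime import datetime, timezone

# Hypothetical release metadata as (version, upload time) pairs.
# The values are made up for illustration only.
RELEASES = [
    ("1.0.0", datetime(2020, 1, 10, tzinfo=timezone.utc)),
    ("1.1.0", datetime(2021, 6, 1, tzinfo=timezone.utc)),
    ("2.0.0", datetime(2023, 3, 15, tzinfo=timezone.utc)),
]


def version_key(version):
    # Simplified ordering that only handles plain dotted-integer versions.
    return tuple(int(part) for part in version.split("."))


def visible_versions(releases, cutoff):
    """Return the versions uploaded at or before cutoff, oldest to newest."""
    kept = [version for version, uploaded in releases if uploaded <= cutoff]
    return sorted(kept, key=version_key)


def latest_as_of(releases, cutoff):
    """Return the newest version visible at cutoff, or None if there is none."""
    visible = visible_versions(releases, cutoff)
    return visible[-1] if visible else None


cutoff = datetime(2022, 1, 1, tzinfo=timezone.utc)
print(latest_as_of(RELEASES, cutoff))  # 1.1.0
```

With the cutoff set to the start of 2022, the 2.0.0 release is invisible, so a solver working against this filtered view would pick 1.1.0.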
Repository Structure
| Directory | Description |
|---|---|
| services/ | Backend service for date-filtered PyPI servers (pypi-timemachine). |
| benchmark/ | Scripts for the automated data construction pipeline. |
| agents/ | Implementation of baseline agents and evaluation metrics. |
Dataset
The datasets are provided in the benchmark/data/v1 directory in JSONL format.
| File | # Repos | Description |
|---|---|---|
| timemachine-bench-full.jsonl | 1,145 | The full dataset generated by our automated pipeline. |
| timemachine-bench-verified.jsonl | 100 | The human-verified subset with guaranteed solvability and difficulty labels. |
| timemachine-bench-random.jsonl | 100 | Randomly sampled subset of the full dataset. |
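As a rough sketch, a JSONL file can be read with the Python standard library alone. The helper name `load_jsonl` is hypothetical, and the record schema is not described here, so the example inspects the keys instead of assuming any field names.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Parse a JSONL file into a list, one JSON value per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Path taken from the table above, relative to the repository root.
dataset = Path("benchmark/data/v1/timemachine-bench-verified.jsonl")
if dataset.exists():
    records = load_jsonl(dataset)
    print(len(records))
    # Inspect the actual schema rather than assuming field names.
    if records and isinstance(records[0], dict):
        print(sorted(records[0].keys()))
```

The existence check keeps the snippet harmless when it is run outside a clone of the repository.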
Getting Started
Prerequisites
- uv: For managing dependencies for the pipeline scripts.
- docker: To provide isolated environments for secure execution of test suites.
- jq: For parsing and processing JSON data within shell scripts.
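Before starting, it may help to confirm that the three tools are on the `PATH`. The following is a small sketch using `shutil.which`; the helper name `find_missing` is hypothetical, and the check only confirms that an executable is found, not that the Docker daemon is running or that the installed versions are suitable.

```python
import shutil

# Tools listed in the prerequisites above.
REQUIRED_TOOLS = ("uv", "docker", "jq")


def find_missing(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing()
if missing:
    print("Missing tools: " + ", ".join(missing))
else:
    print("All prerequisite tools were found on PATH.")
```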
Quick Start
1. Clone the repository
Use the --recursive flag to clone the main repository along with the pypi-timemachine submodule.
git clone --recursive https://github.com/tohoku-nlp/timemachine-bench.git
cd timemachine-bench
2. Start the pypi-timemachine server
docker compose up -d
3. Now, let's run the experiments!
- Dataset construction (Optional): See benchmark/README.md for instructions on the automated construction pipeline.
- Running baseline agents: See agents/README.md for instructions on running baseline agents on the dataset.
Citation
If you find TimeMachine-bench useful in your research, please consider citing the following paper:
@inproceedings{fujii-etal-2026-timemachine-bench,
title = {{TimeMachine-bench: A Benchmark for Evaluating Model Capabilities in Repository-Level Migration Tasks}},
author = {Fujii, Ryo and Morishita, Makoto and Yano, Kazuki and Suzuki, Jun},
year = {2026},
booktitle = {Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)},
pages = {8233--8264}
}