
March 18, 2026

⏳ TimeWarp: Evaluating Web Agents by Revisiting the Past


tldr. TimeWarp is a benchmark for evaluating the robustness of agents to temporal changes in web UI. TimeWarp consists of three web environments: Wiki, News, and Shop, each with six UI versions across different eras of the internet. The benchmark also includes TimeTraj, a method for scalably collecting trajectories via human-refined plans, and TimeWarp-BC, a variant of Behavior Cloning (BC) to train agents better via knowledge distillation on complex tasks that require memory and planning.



📦 Installation

⚠️ Ensure conda is installed on your system. If it is not, follow the official conda installation instructions first. ⚠️

Run setup.sh, which creates a conda environment named timewarp and installs the required dependencies:

bash setup.sh

⚠️ Setting up env/webshop may fail, for example when Google Drive rate limits are exceeded. In that case, download the files manually. You can also consult the original WebShop repo. ⚠️

🌐 Running Environments

Single Environment

Run one of the following commands to start an environment. Pass a version number (-1 to -6) to start a single version, or -all to start every version:

bash env/wiki/start_wiki.sh [-1|-2|-3|-4|-5|-6|-all] # Wiki
bash env/news/start_news.sh [-1|-2|-3|-4|-5|-6|-all] # News
bash env/webshop/start_webshop.sh [-1|-2|-3|-4|-5|-6|-all] # Shop

Example Usage:

bash env/webshop/start_webshop.sh -1

Multiple Environments

Helper scripts for running multiple environments, along with additional instructions, are provided in scripts/environment. Sample usage, run from inside that directory, is given below:

# Start all environments with theme version 1 (default)
./run_all_env.sh

# Start with a specific version
./run_all_env.sh 3

# Start and block the terminal (useful for foreground monitoring)
./run_all_env.sh 1 --wait

# Stop all tunnels and servers (default)
./stop_all_ports.sh

Ports are assigned automatically starting from 5000. On startup, the following environment variables are exported:

| Variable | Default | Description |
| --- | --- | --- |
| TW_WIKI | http://localhost:<port> | Wiki environment URL |
| TW_NEWS | http://localhost:<port> | News environment URL |
| TW_WEBSHOP | http://localhost:<port>/abc | Webshop environment URL |
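Before launching an agent, it can help to confirm that all three variables are actually set. The snippet below is a minimal sketch of such a check; the helper name resolve_env_urls is hypothetical and not part of the TimeWarp codebase, and only the variable names are taken from the table above.

```python
import os

# Variable names come from the table above; the helper itself is illustrative.
TIMEWARP_VARS = ("TW_WIKI", "TW_NEWS", "TW_WEBSHOP")


def resolve_env_urls(environ=None):
    """Return {variable: url} for the TimeWarp environments, or raise if any is unset."""
    environ = os.environ if environ is None else environ
    urls = {name: environ.get(name, "").strip() for name in TIMEWARP_VARS}
    missing = [name for name, url in urls.items() if not url]
    if missing:
        # No port is guessed here, because ports are assigned automatically at startup.
        raise RuntimeError("Unset TimeWarp variables: " + ", ".join(missing))
    return urls
```

Calling resolve_env_urls() with no arguments reads the current process environment; passing a dict makes the check easy to exercise without touching real variables.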

🎨 Create your Own Theme!

Each environment loads its UI from a theme folder. To add a new theme, create a folder under the appropriate path:

| Environment | Theme directory |
| --- | --- |
| Wiki | env/wiki/themes/<your-theme>/ |
| News | env/news/themes/<your-theme>/ |
| Shop | env/webshop/web_agent_site/themes/<your-theme>/ |

Wiki and News themes are flat directories. Drop in the HTML templates, a stylesheet, and a script:

<your-theme>/
├── base.html
├── index.html
├── article.html
├── 404.html
├── style.css
└── script.js

News themes also require browse.html and search.html. Wiki themes may use templates/ and static/ subdirectories instead of the flat layout; the Wiki app detects either structure automatically. News themes must use the flat layout.

Shop themes use a two-subfolder layout:

<your-theme>/
├── templates/   # search_page.html, results_page.html, item_page.html,
│                # description_page.html, features_page.html, attributes_page.html,
│                # review_page.html, done_page.html
└── static/      # style.css (and any images)
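A quick way to catch a missing template before registering a theme is to compare the folder against the layouts listed above. The following is a minimal sketch under that assumption: missing_theme_files is a hypothetical helper, not something shipped with TimeWarp, and for Wiki it checks only the flat layout, not the alternative templates/ and static/ structure.

```python
from pathlib import Path

# File lists copied from the theme layouts described above.
FLAT_FILES = ["base.html", "index.html", "article.html", "404.html", "style.css", "script.js"]
NEWS_EXTRA = ["browse.html", "search.html"]
SHOP_TEMPLATES = [
    "search_page.html", "results_page.html", "item_page.html",
    "description_page.html", "features_page.html", "attributes_page.html",
    "review_page.html", "done_page.html",
]


def missing_theme_files(theme_dir, env):
    """Return the relative paths a theme folder lacks for env in {'wiki', 'news', 'shop'}."""
    root = Path(theme_dir)
    if env == "wiki":
        expected = list(FLAT_FILES)
    elif env == "news":
        expected = FLAT_FILES + NEWS_EXTRA
    elif env == "shop":
        expected = [f"templates/{name}" for name in SHOP_TEMPLATES] + ["static/style.css"]
    else:
        raise ValueError(f"unknown environment: {env!r}")
    # Keep the expected order so the report reads like the layout listing.
    return [rel for rel in expected if not (root / rel).is_file()]
```

An empty list means every expected file is present; for example, missing_theme_files("env/news/themes/<your-theme>", "news") lists whichever News templates are still absent.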

Once the folder is ready, register it by adding an entry to num_to_theme (and optionally name_aliases) inside _parse_args in the corresponding app file:

| Environment | App file |
| --- | --- |
| Wiki | env/wiki/wiki_app.py |
| News | env/news/news_app.py |
| Shop | env/webshop/web_agent_site/app.py |

Then launch the environment with your theme name or its assigned number:

bash env/wiki/start_wiki.sh -<number>
# or
python env/wiki/wiki_app.py --<your-theme-name>

📝 Running Tasks on Environment

You can use TimeWarp directly with BrowserGym:

import gymnasium as gym
import browsergym.timewarp

env = gym.make("browsergym/timewarp.1")
obs, info = env.reset()
# Run your agent
env.close()

Make sure the TimeWarp environments are running (see Running Environments) and the following environment variables are set:

export TW_WIKI="http://localhost:5000"
export TW_WEBSHOP="http://localhost:5001"
export TW_NEWS="http://localhost:5002"
export OPENAI_API_KEY="your-key"  # For judge evaluation
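With the environments running and the variables set, the "Run your agent" step usually becomes a loop over the standard Gymnasium reset/step interface. The sketch below assumes that interface and a policy callable that maps an observation to a BrowserGym action string; run_episode, policy, and max_steps are illustrative names, not part of TimeWarp or BrowserGym.

```python
def run_episode(env, policy, max_steps=30):
    """Roll out one episode with the Gymnasium step API and return (total_reward, steps)."""
    obs, info = env.reset()
    total_reward = 0.0
    for step in range(1, max_steps + 1):
        # The policy is assumed to return an action string the environment accepts.
        action = policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            return total_reward, step
    # The step budget ran out before the environment ended the episode.
    return total_reward, max_steps
```

In practice env would be the object returned by gym.make("browsergym/timewarp.1") and policy would wrap your model; the step cap only guards against episodes that never terminate.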

🤖 Running your Web Agent

To benchmark a model on TimeWarp you need three things running: a model, the environments, and a benchmark script.

1. Host a model. Use an API key (e.g. OPENAI_API_KEY) or serve a local model with vLLM. The startVLMmodel.sh script handles both LLMs and VLMs:

bash scripts/startVLMmodel.sh --port <port> --model <name_or_path>

2. Start the environments. Run all three environments at once with a single version flag:

bash scripts/environment/run_all_env.sh <version_number>   # e.g. 3

Stop everything when done:

bash scripts/environment/stop_all_ports.sh

3. Run a benchmark. The recommended way is AgentLab. After installing it, run a single benchmark script:

python scripts/singleBenchmark/benchmarkGeneralWiki.py \
  --port 9000 \
  --version v1 \
  --model <model_name_or_path>

To sweep across multiple models and environment versions automatically, use the multi-benchmark entry point:

bash scripts/multiBenchmark/_run_multi.sh \
  --models  "path/to/model1,path/to/model2" \
  --scripts "singleBenchmark/benchmarkGeneralWiki.py,..." \
  --versions "1,2,3"
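To make the size of a sweep concrete, the sketch below enumerates the model, script, and version grid as individual benchmark commands. It is only an illustration of what such a sweep covers, not how _run_multi.sh is implemented: benchmark_commands is a hypothetical helper, it reuses the --port, --version, and --model flags from the single-benchmark example, and it assumes that version 1 corresponds to --version v1 and that script paths are relative to scripts/.

```python
import itertools


def benchmark_commands(models, scripts, versions, port=9000):
    """Build one command (as an argument list) per model/script/version combination."""
    commands = []
    for model, script, version in itertools.product(models, scripts, versions):
        commands.append([
            "python", f"scripts/{script}",
            "--port", str(port),
            "--version", f"v{version}",  # assumed mapping: 1 -> v1
            "--model", model,
        ])
    return commands
```

Two models, one script, and three versions therefore yield six runs; each returned list could be passed to subprocess.run if you wanted to drive the runs from Python rather than from the shell script.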

See scripts/README.md for the full setup and AgentLab configuration details.


🏋️ Training your Web Agent

TimeWarp agents are fine-tuned on teacher trajectories using LlamaFactory. Multi-GPU training with DeepSpeed ZeRO-3 is recommended.

1. Set up LlamaFactory.

git clone --depth 1 https://github.com/hiyouga/LlamaFactory.git
cd LlamaFactory && pip install -e .

2. Get training data. Generate teacher trajectories or download our GPT-5 traces directly:

git clone https://huggingface.co/datasets/sparklabutah/TimeWarp-GPT5-Traces

Convert them to ShareGPT format using convert2sgptArgs.py, then place the output JSON in LlamaFactory/data/ and register it in dataset_info.json.
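The registration step amounts to adding one entry to dataset_info.json. The sketch below shows one way to do that programmatically, using LlamaFactory's ShareGPT entry fields (file_name, formatting, columns). The helper name, the dataset name, and the assumption that the converted file stores each dialogue under a "conversations" key are illustrative; check the output of convert2sgptArgs.py before relying on them.

```python
import json
from pathlib import Path


def register_sharegpt_dataset(info_path, name, file_name):
    """Add a ShareGPT-style entry to LlamaFactory's dataset_info.json and return it."""
    path = Path(info_path)
    info = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    info[name] = {
        "file_name": file_name,
        "formatting": "sharegpt",
        # Assumes the converted file stores each dialogue under "conversations".
        "columns": {"messages": "conversations"},
    }
    path.write_text(json.dumps(info, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")
    return info[name]
```

For example, register_sharegpt_dataset("LlamaFactory/data/dataset_info.json", "timewarp_gpt5", "timewarp_gpt5.json") would register a file of that (hypothetical) name; the same dataset name is then what the training .yaml refers to.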

3. Train.

llamafactory-cli train examples/train_full/your_training_config.yaml

Example .yaml configs for both full fine-tuning and LoRA are provided in llamafactory/train_full and llamafactory/train_lora. See llamafactory/README.md for the complete walkthrough.


Citation

Don't forget to cite all the repos that have helped us!

BrowserGym and AgentLab

@article{chezelles2025browsergym,
    title={The BrowserGym Ecosystem for Web Agent Research},
    author={Thibault Le Sellier de Chezelles and Maxime Gasse and Alexandre Lacoste and Massimo Caccia and Alexandre Drouin and L{\'e}o Boisvert and Megh Thakkar and Tom Marty and Rim Assouel and Sahar Omidi Shayegan and Lawrence Keunho Jang and Xing Han L{\`u} and Ori Yoran and Dehan Kong and Frank F. Xu and Siva Reddy and Graham Neubig and Quentin Cappart and Russ Salakhutdinov and Nicolas Chapados},
    journal={Transactions on Machine Learning Research},
    issn={2835-8856},
    year={2025},
    url={https://openreview.net/forum?id=5298fKGmv3},
    note={Expert Certification}
}

WebShop

@inproceedings{yao2022webshop,
  bibtex_show = {true},
  title = {WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents},
  author = {Yao, Shunyu and Chen, Howard and Yang, John and Narasimhan, Karthik},
  booktitle = {ArXiv},
  year = {preprint},
  html = {https://arxiv.org/abs/2207.01206},
  tag = {NLP}
}

If you enjoyed using this repo, also consider citing us! 😊

TimeWarp

@misc{timewarp2026,
    title={TimeWarp: Evaluating Web Agents by Revisiting the Past},
    author={Md Farhan Ishmam and Kenneth Marino},
    year={2026},
    eprint={2603.04949},
    archivePrefix={arXiv},
    primaryClass={cs.AI},
    url={https://arxiv.org/abs/2603.04949}
}