README.md
June 26, 2026
Note: Paper link arxiv:2605.17373. The previous version (arxiv:2510.10472) lives on the
legacy branch.
A benchmark for automatic ML research agents on fundamental machine learning problems. Agents are given a baseline codebase, evaluation harness, and task description, and are asked to iteratively improve the baseline.
Contents
- Quick Start
- Setup
- Run an agent (example)
- Run with other models
- Run on other tasks
- FML-bench-Lite
- Remote GPU execution (Modal)
- Score a run
- Available agents
- Repository layout
- Citation
- Acknowledgements
Quick Start
Install everything (task repos, datasets, conda envs):
python setup.py
Run an agent (e.g. AI Scientist v2 on Causality_causalml with GPT-5.4):
conda activate fmlbench
export OPENAI_API_KEY="your_openai_api_key"
python run_agent_benchmark.py \
--agent-config configs/agents/ai_scientist_v2.yaml \
--task-config configs/tasks/causality_causalml.yaml \
--model gpt-5.4 --provider OpenAI \
--output-dir results \
agent.ai_scientist_v2.max_steps=100
See Setup and Run an agent (example) for full instructions (per-task setup, GPU selection, other providers).
Setup
Make sure you have Anaconda/Miniconda installed before running setup.
Everything — task repositories, datasets, and the conda environments required to run agents — is bootstrapped by a single command:
python setup.py
setup.py is idempotent: re-running it skips repos, datasets, and envs that
are already present.
If you only want to set up a single task (much faster — only the conda
environments that task needs are created), pass --task:
python setup.py --task Causality_causalml
Other options:
python setup.py --list # list all available tasks
python setup.py --skip-data # clone repos and create envs, but skip dataset downloads
python setup.py --skip-envs # set up workspaces only, skip conda env creation
After setup completes, the harness env fmlbench is ready, and each task has
its own conda env (e.g. causalml, domainbed, …) used to execute the
baseline code.
Run an agent (example)
This example runs AI Scientist v2 on Causality_causalml using GPT-5.4.
# 1. set up just this task
python setup.py --task Causality_causalml
# 2. activate the harness env
conda activate fmlbench
# 3. provide your API key
export OPENAI_API_KEY="your_openai_api_key"
# 4. pick a GPU
export CUDA_VISIBLE_DEVICES=0
# 5. run the agent
python run_agent_benchmark.py \
--agent-config configs/agents/ai_scientist_v2.yaml \
--task-config configs/tasks/causality_causalml.yaml \
--model gpt-5.4 \
--provider OpenAI \
--output-dir results \
agent.ai_scientist_v2.max_steps=100
Results, per-step token usage, and a summary.json file are written under
the chosen --output-dir (default: benchmark_results/). The per-agent step
budget is controlled by the agent.<type>.max_steps=N override, which the
example above sets to 100.
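To check what a run produced, a short script along these lines can locate and load every summary.json under the output directory. This is a minimal sketch: the schema of summary.json is not documented here, so it only reports top-level keys, and the helper name find_summaries is hypothetical rather than part of the repository.

```python
import json
from pathlib import Path


def find_summaries(output_dir):
    """Return {path: parsed JSON} for every summary.json under output_dir."""
    summaries = {}
    for path in sorted(Path(output_dir).rglob("summary.json")):
        with path.open(encoding="utf-8") as fh:
            summaries[path] = json.load(fh)
    return summaries


if __name__ == "__main__":
    # "results" matches the --output-dir used in the example above.
    for path, data in find_summaries("results").items():
        # The file's schema is not documented here, so only top-level keys are shown.
        keys = sorted(data) if isinstance(data, dict) else type(data).__name__
        print(path, keys)
```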
Run with other models
The model and provider are command-line flags. Anything supported by the provider works (e.g. OpenAI GPT family, Google Gemini, Anthropic Claude, OpenRouter passthrough). Examples:
# Gemini 2.5 Pro via Google
python run_agent_benchmark.py \
--agent-config configs/agents/ai_scientist_v2.yaml \
--task-config configs/tasks/causality_causalml.yaml \
--model gemini-2.5-pro --provider Google \
--output-dir results \
agent.ai_scientist_v2.max_steps=100
# Claude via OpenRouter
python run_agent_benchmark.py \
--agent-config configs/agents/ai_scientist_v2.yaml \
--task-config configs/tasks/causality_causalml.yaml \
--model anthropic/claude-3.5-sonnet --provider OpenRouter \
--output-dir results \
agent.ai_scientist_v2.max_steps=100
Set the corresponding API key in your shell:
| Provider | Environment variable |
|---|---|
| OpenAI | OPENAI_API_KEY |
| Google | GOOGLE_API_KEY |
| Anthropic | ANTHROPIC_API_KEY |
| OpenRouter | OPENROUTER_API_KEY |
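Before launching a long run, it can help to confirm that the key for the chosen provider is actually set. The sketch below uses only the provider names and variable names listed above; the helper missing_api_key is hypothetical and is not something the benchmark ships.

```python
import os

# Provider names and environment variables as listed in the table above.
PROVIDER_ENV_VARS = {
    "OpenAI": "OPENAI_API_KEY",
    "Google": "GOOGLE_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "OpenRouter": "OPENROUTER_API_KEY",
}


def missing_api_key(provider, environ=os.environ):
    """Return the name of the unset variable for provider, or None if it is set."""
    var = PROVIDER_ENV_VARS[provider]
    return None if environ.get(var) else var


if __name__ == "__main__":
    for name in PROVIDER_ENV_VARS:
        var = missing_api_key(name)
        print(f"{name}: {'ok' if var is None else 'please export ' + var}")
```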
Per-agent hyperparameters can be overridden inline via positional key=value
arguments, e.g.:
python run_agent_benchmark.py \
--agent-config configs/agents/ai_scientist_v2.yaml \
--task-config configs/tasks/causality_causalml.yaml \
--model gpt-5.4 --provider OpenAI \
--output-dir results \
agent.ai_scientist_v2.max_steps=100 \
agent.ai_scientist_v2.num_ideas=5 \
agent.ai_scientist_v2.max_debug_depth=2
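The dotted key=value form maps naturally onto a nested configuration. How run_agent_benchmark.py parses these arguments is not shown in this README, so the following is only an illustrative sketch of the idea. parse_overrides is a hypothetical helper, and it keeps values as strings, whereas the real CLI may cast them.

```python
def parse_overrides(args):
    """Turn ['a.b.c=1', ...] into {'a': {'b': {'c': '1'}}}. Values stay strings."""
    config = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {arg!r}")
        node = config
        *parents, leaf = key.split(".")
        for part in parents:
            # Walk down, creating intermediate dicts as needed.
            node = node.setdefault(part, {})
        node[leaf] = value
    return config


if __name__ == "__main__":
    print(parse_overrides([
        "agent.ai_scientist_v2.max_steps=100",
        "agent.ai_scientist_v2.num_ideas=5",
    ]))
    # {'agent': {'ai_scientist_v2': {'max_steps': '100', 'num_ideas': '5'}}}
```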
Run on other tasks
Each task has a YAML in configs/tasks/. Set up the workspace for the task,
then point --task-config at the corresponding file. For example, to run on
DomainBed (Generalization):
python setup.py --task Generalization_domainbed
python run_agent_benchmark.py \
--agent-config configs/agents/ai_scientist_v2.yaml \
--task-config configs/tasks/generalization.yaml \
--model gpt-5.4 --provider OpenAI \
--output-dir results \
agent.ai_scientist_v2.max_steps=100
Available task configs (one per task):
| Task | Config file |
|---|---|
| Causality (CausalML) | configs/tasks/causality_causalml.yaml |
| Causality (gCastle) | configs/tasks/causality_gcastle.yaml |
| Continual Learning (continual-learning) | configs/tasks/continual_learning.yaml |
| Continual Learning (PyCIL) | configs/tasks/continual_learning_pycil.yaml |
| Data Efficiency (easy-few-shot-learning) | configs/tasks/data_efficiency.yaml |
| Data Efficiency (USB) | configs/tasks/data_efficiency_usb.yaml |
| Fairness (AIF360) | configs/tasks/fairness_and_bias_aif360.yaml |
| Fairness (Fairlearn) | configs/tasks/fairness_fairlearn.yaml |
| Federated Learning (PFLlib) | configs/tasks/federated_learning_pfllib.yaml |
| Generalization (DomainBed, ColoredMNIST) | configs/tasks/generalization.yaml |
| Generalization (DomainBed, OfficeHome) | configs/tasks/generalization_officehome.yaml |
| Privacy (Opacus) | configs/tasks/privacy_opacus.yaml |
| Privacy (PrivacyMeter) | configs/tasks/privacy_privacymeter.yaml |
| Representation Learning (Lightly) | configs/tasks/representation_learning.yaml |
| Representation Learning (solo-learn) | configs/tasks/representation_learning_solo_learn.yaml |
| Robustness (ART) | configs/tasks/robustness_and_reliability_art.yaml |
| Robustness (OpenOOD) | configs/tasks/robustness_openood.yaml |
| Unlearning (open-unlearning) | configs/tasks/unlearning_open_unlearning.yaml |
python setup.py --list prints the task names accepted by --task.
FML-bench-Lite
FML-bench-Lite is a subset of the full benchmark, offered as a cheaper proxy for the full 18-task suite.
| Task | Config file |
|---|---|
| Continual Learning (PyCIL) | configs/tasks/continual_learning_pycil.yaml |
| Data Efficiency (USB) | configs/tasks/data_efficiency_usb.yaml |
| Generalization (DomainBed, ColoredMNIST) | configs/tasks/generalization.yaml |
| Generalization (DomainBed, OfficeHome) | configs/tasks/generalization_officehome.yaml |
| Robustness (OpenOOD) | configs/tasks/robustness_openood.yaml |
| Privacy (Opacus) | configs/tasks/privacy_opacus.yaml |
| Privacy (PrivacyMeter) | configs/tasks/privacy_privacymeter.yaml |
| Robustness (ART) | configs/tasks/robustness_and_reliability_art.yaml |
Run an agent on these task configs and score it exactly as you would the full suite. The Continual Learning (continual-learning) and Unlearning (open-unlearning) tasks are not suitable for the subset because of their high variance and large metric scale. On this subset the overall ranking of agents closely tracks the ranking on the full 18-task benchmark, which makes Lite useful when a full sweep is out of reach.
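A Lite sweep amounts to repeating the same command once per task config. The sketch below builds those commands from the config paths in the table above; lite_commands is a hypothetical helper, and the flags are the ones used in the earlier examples.

```python
import shlex

# Config paths copied from the FML-bench-Lite table above.
LITE_TASK_CONFIGS = [
    "configs/tasks/continual_learning_pycil.yaml",
    "configs/tasks/data_efficiency_usb.yaml",
    "configs/tasks/generalization.yaml",
    "configs/tasks/generalization_officehome.yaml",
    "configs/tasks/robustness_openood.yaml",
    "configs/tasks/privacy_opacus.yaml",
    "configs/tasks/privacy_privacymeter.yaml",
    "configs/tasks/robustness_and_reliability_art.yaml",
]


def lite_commands(agent_config, model, provider, output_dir="results", overrides=()):
    """Build one run_agent_benchmark.py argument list per Lite task."""
    return [
        ["python", "run_agent_benchmark.py",
         "--agent-config", agent_config,
         "--task-config", task_config,
         "--model", model, "--provider", provider,
         "--output-dir", output_dir, *overrides]
        for task_config in LITE_TASK_CONFIGS
    ]


if __name__ == "__main__":
    commands = lite_commands(
        "configs/agents/ai_scientist_v2.yaml", "gpt-5.4", "OpenAI",
        overrides=["agent.ai_scientist_v2.max_steps=100"],
    )
    for cmd in commands:
        # Printed for review; pass cmd to subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```

Each task still needs its workspace set up with python setup.py --task before its command will run.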
Remote GPU execution (Modal)
By default evaluations run as local subprocesses on this machine. FML-bench also supports an opt-in Modal backend that offloads only the experiment-execution step (each task's validation/test command) to an ephemeral remote GPU sandbox, while the entire agent loop — search, code edits, metric parsing — stays local. This lets you run on remote GPUs and fan many tasks out in parallel, with no change to agent or task behavior.
That backend lives on its own branch and does not affect the local default on
this branch. For setup and usage, see the
Modal branch.
Score a run
After an agent has run on all 18 tasks under a single --output-dir, score it
with compute_agent_metrics.py. Point the script at that agent's result
directory — <output-dir>/<agent_name>, which holds one subdirectory per task:
conda activate fmlbench
python compute_agent_metrics.py results/ai_scientist_v2
It prints and writes three tables (as CSVs under metric_reports/<agent_name>/
by default; override the location with --output-dir):
- Raw Performance: the canonical test metric for each task.
- Normalized Improvement: each task's improvement over its baseline, normalized to [0, 1] and averaged across tasks.
- Process-Level metrics: 12 metrics spanning Exploration, Generalization, Reliability, Efficiency, and Cost.
The 4 Exploration metrics embed each step's code snapshot with GraphCodeBERT and
additionally require torch, transformers, and scikit-learn (embeddings are
cached under the output dir). They read each task's baseline code from
workspace/, so that workspace must be at its reset (baseline) state when
scoring. The script only reads from the result directory and refuses an
--output-dir that is, or is nested inside, the result directory.
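The "is, or is nested inside" rule can be pictured as a simple path comparison. The sketch below shows one way to express such a check with pathlib. It is not taken from compute_agent_metrics.py, and is_inside is a hypothetical name.

```python
from pathlib import Path


def is_inside(output_dir, result_dir):
    """True if output_dir is result_dir itself or lies anywhere beneath it."""
    out = Path(output_dir).resolve()
    res = Path(result_dir).resolve()
    return out == res or res in out.parents


if __name__ == "__main__":
    # A report directory nested in the results would be refused.
    print(is_inside("results/ai_scientist_v2/reports", "results/ai_scientist_v2"))  # True
    # The default location sits outside the result directory.
    print(is_inside("metric_reports/ai_scientist_v2", "results/ai_scientist_v2"))   # False
```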
Available agents
Seven agents are registered in this benchmark. Each has a config in
configs/agents/:
| Agent | Config file |
|---|---|
| The AI Scientist v1 | configs/agents/theaiscientist.yaml |
| The AI Scientist v2 | configs/agents/ai_scientist_v2.yaml |
| AIDE | configs/agents/aide.yaml |
| AIRA | configs/agents/aira_mcts.yaml |
| Autoresearch | configs/agents/autoresearch.yaml |
| OpenEvolve | configs/agents/openevolve.yaml |
| AdaptiveSearch (ours) | configs/agents/adaptivesearch.yaml |
Swap --agent-config to switch agents — everything else (task, model,
provider) stays the same.
Repository layout
setup.py # one-shot environment + workspace setup
run_agent_benchmark.py # entry point for running an agent on a task
compute_agent_metrics.py # score one agent's results (3 tables + CSVs)
agents/ # agent implementations
benchmark/ # benchmark runner / executor
configs/agents/ # agent YAMLs
configs/tasks/ # task YAMLs
ml_tasks/ # task definitions: train.py, prompts, configs
workspace/ # populated by setup.py with task codebases
Citation
If you find FML-bench useful in your research, please cite our paper:
@article{zou2026fml,
title={FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics},
author={Zou, Qiran and Lam, Hou Hei and Zhao, Wenhao and Chen, Tingting and Tang, Yiming and Yu, Samson and Zhu, Yingtao and Anumasa, Srinivas and Zhang, Zufeng and Zhang, Tianyi and others},
journal={arXiv preprint arXiv:2605.17373},
year={2026}
}
Acknowledgements
We thank the maintainers of the following upstream projects:
- ML Repos: DomainBed, easy-few-shot-learning, USB, Lightly, solo-learn, continual-learning, PyCIL, CausalML, gCastle, Adversarial Robustness Toolbox, OpenOOD, Opacus, PrivacyMeter, AIF360, Fairlearn, PFLlib, open-unlearning
- ML Research Agents: The AI Scientist v1, The AI Scientist v2, AIDE, AIRA, Autoresearch, OpenEvolve