AiScientist: A File-as-Bus Research Lab
Long-horizon ML research needs File-as-Bus coordination, not just message handoffs.
Talk is cheap, show me your files.
AiScientist is built for long-horizon ML research engineering, where agents must maintain coherent progress across heterogeneous stages while preserving evolving project state over time.
Across a 24-hour autonomous run, AiScientist repeatedly implements, tests, keeps, and discards candidate ideas while pushing the running best upward. This trajectory shows long-horizon improvement through 78 experiment cycles, with diverse solution strategies explored along the way rather than a single lucky guess.
📰 News
- 2026-04-17: Released the paper on arXiv and added benchmark integrations for `PaperBench` under `benchmark/frontier-evals` and `MLE-Bench` under `benchmark/MLE-bench`.
- 2026-04-13: Initial public release of AiScientist, including the `File-as-Bus` runtime model, hierarchical research-team orchestration, and long-horizon `paper`/`mle` workflows.
🔬 What AiScientist Is
AiScientist is an artifact-mediated virtual research lab for long-horizon ML research engineering. It treats long-horizon performance as a joint systems problem: agents must not only orchestrate the right expertise at the right stage, but also preserve evolving project state with enough fidelity for later decisions to stay coherent.
- `paper`: given a paper markdown or bundle plus a GPU and time budget, AiScientist autonomously drives the full reproduction loop from reading and planning to implementation, experimentation, debugging, and final self-check.
- `mle`: given an ML task plus a GPU and time budget, AiScientist autonomously conducts research for stronger solutions through repeated implementation-and-experiment cycles that improve the target metric over time.
File-as-Bus is the core coordination protocol. Instead of compressing progress into lossy conversational handoffs, AiScientist turns workspace files into the system of record for plans, code, experiments, logs, and validation artifacts.
A short look at AiScientist in motion:

https://github.com/user-attachments/assets/4356691b-eeb5-4766-a50b-29ddbc48ef9b
✨ Why It Feels Different
| Feature | What it means |
|---|---|
| Hierarchical Research Team | A hierarchical research team pairs a top-level Orchestrator with specialists and focused subagents to sustain coherent progress over multi-day workloads. |
| File-as-Bus Coordination | Agents coordinate through evolving workspace files instead of relying only on lossy message handoffs between prompts. |
| Workspace as System of Record | A permission-scoped workspace and compact workspace map keep plans, code, experiments, and validation as the durable source of truth for both agents and operators. |
| Thin Control over Thick State | The Orchestrator keeps control thin through stage-level directives, concise summaries, and a workspace map, while specialists progressively disclose thick state by reading task-relevant artifacts on demand. |
⚙️ How It Works
- Stage the workspace. AiScientist stages the inputs into a permission-scoped workspace and builds a compact `workspace map` that acts as the lightweight entry point into the run state.
- Launch the sandbox. A Docker sandbox mounts the workspace into canonical paths under `/home`, giving agents an isolated execution environment with shared persistent state (sketched below).
- Keep control thin. The `Orchestrator` makes stage-level decisions and delegates heavy work to specialists through the `Agent-as-Tool` pattern.
- Keep state thick. Specialists and focused subagents coordinate through `File-as-Bus` artifacts: they read task-relevant files on demand and write back plans, code, experiments, logs, and validation results.
- Leave an inspectable run behind. The run finishes with a workspace, logs, artifacts, and export bundle that can be resumed, validated, diffed, or audited without reconstructing state from memory.
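To make the sandbox step concrete, here is a minimal sketch of such a mount using the Docker SDK for Python (docker-py). The image tag, container command, and paths are illustrative assumptions, not AiScientist's actual runtime code.

```python
# Minimal sketch of the sandbox step with docker-py. The image tag,
# command, and mount paths are illustrative assumptions only.
import docker

client = docker.from_env()

host_workspace = "/abs/path/to/jobs/my_job/workspace"  # hypothetical job path

container = client.containers.run(
    image="aisci-paper:latest",
    command="sleep infinity",  # keep the sandbox alive for the whole run
    volumes={host_workspace: {"bind": "/home/workspace", "mode": "rw"}},
    detach=True,
)
# Agents inside the container see the shared, persistent workspace under
# /home/workspace, while the host keeps the same files as the durable record.
```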
This is the core shift from message handoffs to File-as-Bus coordination: control stays lightweight, while project state remains durable, readable, and reusable on disk.
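In the same spirit, the File-as-Bus pattern itself can be sketched in a few lines: one agent publishes a directive as a durable file, another reads it on demand and appends its results as a new artifact. The file contents and helper functions below are hypothetical illustrations of the pattern, not AiScientist's internal API.

```python
# Illustrative sketch of File-as-Bus coordination: agents exchange durable
# artifacts through the workspace instead of chat messages. File contents
# and helpers are hypothetical, not AiScientist's internal API.
from pathlib import Path

WORKSPACE = Path("workspace")

def orchestrator_publish_plan() -> None:
    # Thin control: the Orchestrator writes a short stage-level directive.
    plan = WORKSPACE / "agent" / "plan.md"
    plan.parent.mkdir(parents=True, exist_ok=True)
    plan.write_text("## Stage 3\n- implement dataloader\n- run smoke test\n")

def specialist_run_stage() -> None:
    # Thick state: the specialist reads the directive on demand, does the
    # work, and appends its results to a durable experiment log.
    plan = (WORKSPACE / "agent" / "plan.md").read_text()
    result = f"Completed {plan.count('-')} tasks; smoke test passed.\n"
    with (WORKSPACE / "agent" / "exp_log.md").open("a") as log:
        log.write(result)

orchestrator_publish_plan()
specialist_run_stage()
# Any later agent (or human operator) can reconstruct what happened by
# reading plan.md and exp_log.md -- no conversational handoff required.
```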
🧭 Two Tracks
AiScientist uses one control plane for two long-horizon workloads: paper reproduction and Kaggle-style MLE competitions.
| Track | Primary entrypoints | What the loop optimizes for | Validation endpoint |
|---|---|---|---|
| `paper` | `--paper-md`, `--zip` | turn paper context into a runnable reproduction through reading, planning, implementation, experimentation, debugging, and final self-check | final self-check plus `validation_report.json` |
| `mle` | exactly one of `--zip`, `--name`, `--workspace-zip`, `--competition-bundle-zip`, or `--data-dir` | search for stronger solutions through repeated implementation-and-experiment cycles that improve the target metric over time | submission-format or grading validation |
Both tracks share the same workspace model: durable files on disk become the common state that agents, operators, and validation flows can all inspect later.
Paper Track
`paper` is the paper-grounded long-horizon ML research track. Starting from `--paper-md` or a bundled `--zip`, AiScientist carries work across paper understanding, task planning, implementation, experimentation, debugging, and final self-check under a fixed compute and time budget.
MLE Track
`mle` is the competition-style long-horizon ML engineering track. Starting from the most self-contained `--zip` path or a prepared-cache `--name`, AiScientist iterates through implementation-and-experiment cycles to explore stronger solutions and continuously improve the target metric over time.
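The shape of those cycles is easy to picture. Below is a deliberately simplified sketch of a keep-or-discard loop that tracks a running best score and records every candidate in a registry file; `train_and_score()` and the candidate names are hypothetical stand-ins, not AiScientist's actual search logic.

```python
# Simplified sketch of implementation-and-experiment cycles: propose a
# candidate, score it, keep it only if it beats the running best, and
# record every attempt in a durable registry. train_and_score() and the
# candidate names are hypothetical stand-ins.
import json
import random
from pathlib import Path

registry = Path("workspace/submission/submission_registry.jsonl")
registry.parent.mkdir(parents=True, exist_ok=True)

def train_and_score(idea: str) -> float:
    return random.random()  # stand-in for a real training + evaluation run

best_score, best_idea = float("-inf"), None
for cycle, idea in enumerate(["baseline", "feature-eng", "ensemble"]):
    score = train_and_score(idea)
    kept = score > best_score
    if kept:
        best_score, best_idea = score, idea
    with registry.open("a") as f:
        f.write(json.dumps({"cycle": cycle, "idea": idea,
                            "score": score, "kept": kept}) + "\n")

print(f"champion: {best_idea} ({best_score:.3f})")
```

Discarded candidates are not lost: they stay in the registry, so the run's whole search history remains inspectable after the fact.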
📊 Benchmark Results
The `benchmark/` directory exists to support rigorous, reproducible, and inspectable experiments rather than one-off demos. We keep benchmark integrations in-tree so other researchers can:
- rerun the same systems under matched budgets and controlled settings
- inspect logs, artifacts, and workspaces instead of relying on anecdotal summaries
- compare orchestration designs on standardized long-horizon workloads
- extend the benchmark setup for follow-up research
Two benchmark integrations are currently included: PaperBench and MLE-Bench.
PaperBench Results
On full PaperBench, AiScientist consistently outperforms the strongest baseline
within each model family under our controlled evaluation setup.
Notable observations:
- On average, AiScientist reaches `30.52` on Gemini-3-Flash and `33.73` on GLM-5, improving over the strongest baseline by `+9.92` and `+11.15`, respectively.
- AiScientist beats the best baseline on every task in both the Gemini-3-Flash and GLM-5 controlled comparisons.
- The gains are especially large on harder papers such as `pinn`, `bbox`, `bridging-data-gaps`, `sapg`, and `test-time-model-adaptation`.
- The improvement does not come from simply spending more than every baseline: on both model families, AiScientist substantially outperforms `IterAgent` while using much lower average cost per task.
For the full task-by-task breakdown, see the figure below.
MLE-Bench Lite Results
On MLE-Bench Lite, AiScientist also improves the end-to-end competition-style
workflow under matched model comparisons.
In our controlled evaluation:
- AiScientist reaches `81.82` Any Medal on both Gemini-3-Flash and GLM-5.
- On Gemini-3-Flash, it improves over the strongest baseline (`77.27` Any Medal).
- On GLM-5, it improves over the strongest baseline (`63.64` Any Medal) while also achieving the best `Above Median`, `Silver`, and `Gold` rates in the matched comparison.
The matched-comparison results table is shown below.
Taken together, the PaperBench and MLE-Bench results support the same point: AiScientist is not optimized for a single short interaction, but for durable, artifact-mediated progress over long-horizon research workloads.
💾 What Lands On Disk
Each run leaves a concrete, inspectable tree under jobs/<job_id>/. The full job directory is the durable run record, but workspace/ is the agent-visible File-as-Bus: it is where plans, code, experiments, and submissions persist as the primary system of record for ongoing coordination.
```text
jobs/<job_id>/
├── input/
├── workspace/                        # primary File-as-Bus / system of record
│   ├── paper/ or data/
│   ├── code/                         # mle
│   ├── submission/
│   │   ├── submission.csv
│   │   ├── submission_registry.jsonl
│   │   └── candidates/               # mle
│   └── agent/
│       ├── paper_analysis/ or analysis/
│       ├── prioritized_tasks.md
│       ├── plan.md
│       ├── impl_log.md
│       ├── exp_log.md
│       └── final_self_check.{md,json}  # paper
├── logs/                             # operator / trace layer
├── artifacts/                        # validation / champion reports
├── export/                           # packaged outputs
└── state/                            # host-side runtime metadata
```
The files inside `workspace/` are the bus:

- analysis becomes `workspace/agent/paper_analysis/*.md` for `paper` and `workspace/agent/analysis/summary.md` for `mle`
- planning becomes `workspace/agent/prioritized_tasks.md` and, when needed, `workspace/agent/plan.md`
- implementation and experiments become `workspace/agent/impl_log.md` and `workspace/agent/exp_log.md`
- MLE candidate search becomes `workspace/submission/submission.csv`, `workspace/submission/submission_registry.jsonl`, and `workspace/submission/candidates/`
- paper reproducibility becomes `workspace/agent/final_self_check.md`, `workspace/agent/final_self_check.json`, and `workspace/submission/reproduce.sh`
Outside the bus, the host still preserves logs/, artifacts/, and state/ so the run can be inspected, resumed, validated, exported, and audited later.
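As a small example of how inspectable this record is, the sketch below replays the kept/discarded history from `submission_registry.jsonl` to recover the champion candidate. The job id and the registry field names (`cycle`, `idea`, `score`, `kept`) are assumptions for illustration, not a documented schema; consult an actual run's registry for the real fields.

```python
# Sketch: replay an MLE run's candidate history from the registry on disk.
# The job id and field names are illustrative assumptions, not a documented
# schema -- check a real registry file for the actual fields.
import json
from pathlib import Path

registry = Path("jobs/my_job/workspace/submission/submission_registry.jsonl")

champion = None
for line in registry.read_text().splitlines():
    entry = json.loads(line)
    status = "kept" if entry.get("kept") else "discarded"
    print(f"cycle {entry['cycle']}: {entry['idea']} -> "
          f"{entry['score']:.3f} ({status})")
    if entry.get("kept"):
        champion = entry  # the last kept entry is the running best

if champion is not None:
    print(f"champion: {champion['idea']} ({champion['score']:.3f})")
```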
🚀 Quick Start
The main README keeps only the shortest runnable happy path. For the full setup, GPU and Docker prerequisites, profile caveats, example scripts, and validation/resume flows, use the Operator Guide.
1. Configure the host
```bash
git clone https://github.com/AweAI-Team/AiScientist.git
cd AiScientist
cp .env.example .env
# Fill either OpenAI or Azure OpenAI credentials.
uv sync --dev
```
Host-side requirements:
- Python 3.12+
- Docker with a reachable daemon
- `uv`
- API credentials for at least one configured LLM backend
- Optional NVIDIA GPUs if you want GPU-bound runs, with NVIDIA Container Toolkit configured for Docker
2. Build the default runtime images
If you are not supplying your own runtime images, build the defaults:

```bash
bash docker/build_paper_image.sh
bash docker/build_mle_image.sh
```

These scripts produce the intended local tags `aisci-paper:latest` and `aisci-mle:test`.
3. Run the built-in health checks
```bash
AISCI_PAPER_DOCTOR_PROFILE=gpt-5.4 uv run aisci paper doctor
uv run aisci mle doctor
```
If you use the shipped Azure-backed `glm-5` paper profile, you can drop the `AISCI_PAPER_DOCTOR_PROFILE` override.
4. Launch one paper run
```bash
uv run aisci --env-file .env paper run \
  --paper-md /abs/path/to/paper.md \
  --image aisci-paper:latest \
  --llm-profile gpt-5.4 \
  --gpu-ids 0 \
  --time-limit 24h \
  --wait \
  --tui
```
5. Launch one MLE run
```bash
uv run aisci --env-file .env mle run \
  --zip /abs/path/to/competition.zip \
  --name <competition-slug> \
  --image aisci-mle:test \
  --llm-profile gpt-5.4 \
  --gpu-ids 0 \
  --time-limit 12h \
  --wait \
  --tui
```
🔍 Inspect, Resume, and Validate
Highest-signal inspection commands:
```bash
uv run aisci jobs list
uv run aisci jobs show <job_id>
uv run aisci logs tail <job_id> --kind conversation
uv run aisci artifacts ls <job_id>
uv run aisci export <job_id>
```
For validation, resume, lifecycle helpers, and detailed troubleshooting, see the Operator Guide.
🗺️ Repo Map
```text
config/                     shared LLM, image, and paper-subagent registries
docker/                     default paper and MLE runtime image recipes
scripts/                    example launch scripts
src/aisci_app/              CLI, job service, presentation, TUI
src/aisci_core/             job models, paths, store, exporter, runner
src/aisci_runtime_docker/   Docker session manager and image profile resolver
src/aisci_domain_paper/     paper-grounded long-horizon ML research engineering
src/aisci_domain_mle/       competition-style long-horizon ML engineering
tests/                      host-side regression tests
```
AiScientist is opinionated enough to run real work, but still transparent enough that you can inspect every file the lab leaves behind.
❤️ Acknowledgments
AiScientist builds on prior work in research automation, evaluation, and ML task environments, especially PaperBench and MLE-Bench. We are grateful to the authors and maintainers of these projects for making this line of work more concrete, reproducible, and comparable.
📄 License
Released under the MIT License. See LICENSE.
📬 Contact
For questions, collaboration, or bug reports, please open an issue or email 📧 gx.chen.chn@gmail.com.
If AiScientist is useful in your research or engineering workflow, consider starring 🌟 the repo and citing the project.
```bibtex
@article{chen2026toward,
  title={Toward Autonomous Long-Horizon Engineering for ML Research},
  author={Chen, Guoxin and Chen, Jie and Chen, Lei and Zhao, Jiale and Meng, Fanzhe and Zhao, Wayne Xin and Song, Ruihua and Chen, Cheng and Wen, Ji-Rong and Jia, Kai},
  journal={arXiv preprint arXiv:2604.13018},
  year={2026}
}
```