autoresearch Tutorial

May 11, 2026

The overnight ML research agent that runs ~100 GPU experiments while you sleep.

autoresearch (https://github.com/karpathy/autoresearch) is a minimal, self-directing AI research agent built by Andrej Karpathy. It autonomously edits a PyTorch training script, commits the change, runs a fixed 5-minute training budget, measures validation bits-per-byte, and decides whether to keep or discard the experiment — all without human intervention. One sleeping cycle yields roughly 100 experiments.
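The keep-or-discard cycle described above can be sketched in a few lines of Python. Note that `run_experiment` here is a simulated stand-in, not the real train.py launcher, and the function names are illustrative, not taken from the repository:

```python
import random

def run_experiment(seconds_budget: float) -> float:
    """Stand-in for one budgeted training run.

    In the real system this would launch train.py for a fixed
    wall-clock budget and return the measured validation
    bits-per-byte; here we simulate a score for illustration.
    """
    return random.uniform(0.8, 1.2)

def research_loop(n_experiments: int, budget_s: float = 300.0) -> float:
    """Keep-or-discard loop: each candidate's val_bpb is compared
    against the best seen so far; improvements are kept (a git
    commit in the real loop), regressions discarded (a git reset)."""
    best_bpb = run_experiment(budget_s)        # baseline run
    for _ in range(n_experiments):
        candidate = run_experiment(budget_s)
        if candidate < best_bpb:               # lower bits-per-byte is better
            best_bpb = candidate               # keep the change
        # else: discard the change
    return best_bpb
```

The important property is that the comparison is always against the current best under an identical time budget, so the loop can run unattended all night.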

| Property | Value |
|---|---|
| Stars | 70,978 |
| Language | Python |
| License | MIT |
| Primary metric | val_bpb (bits-per-byte) |
| GPU requirement | Single CUDA GPU (recommended: H100/A100) |
| Time per experiment | ~5 minutes (fixed wall-clock budget) |
| Experiments per night | ~100 |
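Bits-per-byte normalizes the model's loss by raw bytes of text rather than by tokens, so a tokenizer change cannot game the metric. A minimal sketch of the conversion (the function name and interface are assumptions for illustration, not code from train.py):

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats, over the
    evaluation set) into bits per byte of the underlying raw text.

    Dividing by ln(2) converts nats to bits; dividing by byte count
    (not token count) makes scores comparable across tokenizers.
    """
    return total_nll_nats / math.log(2) / total_bytes
```

For example, a summed loss of 1000 nats over 1500 bytes of validation text gives roughly 0.962 bpb.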

What You Will Learn

This tutorial takes you from zero to running your own autonomous ML research loop. By the end you will understand:

  • The three-file design philosophy that makes autoresearch auditable and reproducible
  • How prepare.py downloads the climbmix-400b dataset and trains a BPE tokenizer
  • The modern GPT architecture in train.py — GQA, RoPE, QK-norm, Flash Attention 3, sliding window, Value Residual
  • MuonAdamW: the hybrid optimizer combining Polar Express orthogonalization with AdamW
  • Why a fixed wall-clock time budget (not step count) is the correct unit of comparison
  • How program.md encodes the agent's entire research protocol as a readable text file
  • How to read results.tsv and analysis.ipynb to extract signal from 100 nightly experiments
  • Scaling and customizing the system for smaller GPUs, multiple GPUs, or alternative hardware
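The fixed wall-clock budget in the list above is worth previewing, because it is what makes experiments comparable: a slower model variant automatically gets fewer steps within the same five minutes. A minimal sketch of the idea (a hypothetical helper, not the actual train.py loop):

```python
import time

def train_with_time_budget(step_fn, budget_s: float = 300.0) -> int:
    """Run training steps until the wall-clock budget expires.

    Budgeting by elapsed time rather than step count charges every
    experiment the same compute, regardless of per-step cost.
    """
    deadline = time.monotonic() + budget_s
    steps = 0
    while time.monotonic() < deadline:
        step_fn()          # one optimizer step in the real loop
        steps += 1
    return steps
```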

Repository Structure

```
autoresearch/
├── prepare.py        # FIXED — data + tokenizer + eval harness
├── train.py          # MUTABLE — GPT model + MuonAdamW + training loop
├── program.md        # INSTRUCTIONS — agent protocol (the "research org code")
├── analysis.ipynb    # Jupyter notebook for exploring results.tsv
├── results.tsv       # Untracked experiment log (git-ignored)
└── pyproject.toml    # uv project manifest
```

Prerequisites

| Requirement | Minimum | Recommended |
|---|---|---|
| GPU | Any CUDA GPU with 16 GB VRAM | H100 SXM 80 GB |
| Python | 3.10 | 3.12 |
| PyTorch | 2.9.1 | 2.9.1 (CUDA 12.8) |
| Package manager | pip | uv |
| Disk space | 50 GB | 200 GB |
| Time to first experiment | ~30 min | ~15 min |

Tutorial Chapters

| # | Chapter | What you learn |
|---|---|---|
| 1 | Getting Started | Problem statement, 3-file design, installation with uv |
| 2 | Data Preparation and Training Environment | prepare.py, climbmix dataset, BPE tokenizer, best-fit dataloader |
| 3 | GPT Architecture | GPTConfig, GQA, RoPE, QK-norm, sliding window, Value Residual |
| 4 | The MuonAdamW Optimizer | Polar Express, NorMuon, Muon vs AdamW dispatch, LR schedule |
| 5 | The Training Loop and Fixed Time Budget | Gradient accumulation, GC freeze, MFU tracking, evaluate_bpb |
| 6 | The Agent Protocol | program.md, experiment loop, git as ledger, autonomy mandate |
| 7 | Analyzing Results with analysis.ipynb | results.tsv schema, progress.png, best-hit analysis |
| 8 | Customization and Scaling | Smaller GPUs, multi-GPU, multi-agent, notable forks |

Quick-Start (3 commands)

```
# 1. Clone and install
git clone https://github.com/karpathy/autoresearch
cd autoresearch
uv sync

# 2. Prepare data (downloads climbmix, trains BPE tokenizer)
uv run prepare.py

# 3. Hand control to the agent
# (Open Claude / GPT-4o with program.md as system prompt, then say "go")
```

The agent takes over from step 3. Go to sleep. Check results.tsv in the morning.
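The morning check can be as simple as finding the row with the lowest val_bpb. The column names below are assumptions for illustration; the actual results.tsv schema is whatever program.md specifies:

```python
import csv
import io

# Hypothetical results.tsv contents (the real schema may differ).
SAMPLE = """timestamp\tcommit\tdescription\tval_bpb
2026-05-11T01:02\tabc123\tbaseline\t0.9821
2026-05-11T01:09\tdef456\tlonger warmup\t0.9794
2026-05-11T01:16\t789abc\twider MLP\t0.9840
"""

def best_experiment(tsv_text: str) -> dict:
    """Return the row with the lowest val_bpb (lower is better)."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    return min(rows, key=lambda r: float(r["val_bpb"]))

print(best_experiment(SAMPLE)["description"])  # -> longer warmup
```

For a night of ~100 rows, analysis.ipynb does the same thing with more context (plots, per-experiment deltas), but the core question is identical: which commit holds the best score?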

Design Philosophy

autoresearch embodies three principles that distinguish it from heavier MLOps frameworks:

Simplicity over completeness. Three files. No YAML config trees, no orchestration layers, no databases. Every decision is visible in plain Python or plain Markdown.

Git as the experiment ledger. Every attempted change is a commit. Every rejected change is a git reset. The full history of what the agent tried — including failures — lives in the repository with zero extra tooling.

Comparable experiments by construction. A fixed 5-minute wall-clock budget means every experiment is measured under identical conditions. No cherry-picking long runs. No step-count games.
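The ledger principle can be demonstrated in a throwaway repository. The file names and commit messages here are illustrative, not the agent's actual conventions:

```shell
set -e
repo=$(mktemp -d)            # throwaway repo for the demo
cd "$repo"
git init -q
git config user.email "agent@example.com"
git config user.name "agent"

git commit -q --allow-empty -m "baseline"

echo "lr=0.02" > change.txt            # a candidate experiment
git add change.txt
git commit -q -m "experiment: raise learning rate"

# ...run the 5-minute budgeted experiment here; if val_bpb regressed:
git reset -q --hard HEAD~1             # discard the attempt
git log --oneline                      # only "baseline" remains
```

Kept experiments stay in history as ordinary commits; the repository itself is the full audit trail, with no extra tooling.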


This tutorial was written for autoresearch as of April 2026 (70,978 stars, MIT license). The repository moves fast; always check the upstream source for the latest train.py and program.md.