autoresearch-mlx

March 10, 2026 · View on GitHub

Apple Silicon (MLX) port of Karpathy's autoresearch.

Full credit to @karpathy for the core idea: fixed-time autonomous research loops controlled through program.md. This port keeps the same basic rules: one mutable train.py, one metric (val_bpb), a fixed 5-minute training budget, and keep-or-revert via git. It runs natively on Apple Silicon through MLX, so there is no PyTorch or CUDA dependency.

Quick start

Requirements: Apple Silicon Mac, Python 3.10+, uv.

# install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# install dependencies
uv sync

# one-time data + tokenizer prep
uv run prepare.py

# run one 5-minute training experiment
uv run train.py

Then point Claude Code or another coding agent at program.md and let it run the loop.

What matters

  • prepare.py - data prep, tokenizer, dataloader, and evaluation. Treat as fixed.
  • train.py - model, optimizer, and training loop. This is the file the agent edits.
  • program.md - the autonomous experiment protocol.
  • results.tsv - logged experiment history.

The loop is the same as upstream: edit train.py, run a fixed-budget experiment, read val_bpb, keep the change if it wins, revert if it loses, and repeat.
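The keep-or-revert rule can be sketched in a few lines. This is a minimal illustration, not the actual protocol (which lives in program.md); the helper name `keep_or_revert` and the inline git commands are hypothetical, and the real loop shells out to `uv run train.py` and parses its output.

```python
def keep_or_revert(new_bpb: float, best_bpb: float) -> tuple[bool, float]:
    """Lower val_bpb wins. Returns (kept, new best val_bpb)."""
    if new_bpb < best_bpb:
        # real loop: git commit -am "val_bpb {new_bpb}"  (keep the edit)
        return True, new_bpb
    # real loop: git checkout -- train.py  (revert the edit)
    return False, best_bpb

# Replaying the public baseline walk from results.tsv:
best = float("inf")
for bpb in [2.667000, 2.588904, 2.533728, 1.807902]:
    kept, best = keep_or_revert(bpb, best)
assert best == 1.807902
```

Every candidate in the public walk happened to win, so each step lowers the running best; a losing edit would leave `best` unchanged and be reverted.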

Public baseline results

The public results.tsv captures the initial hardware-local walk from the default baseline down to 1.807902:

| Commit  | val_bpb  | Status | Description                      |
|---------|----------|--------|----------------------------------|
| 383abb4 | 2.667000 | keep   | baseline (AdamW, default config) |
| 909dd59 | 2.588904 | keep   | halve total batch size to 2^16   |
| 4161af3 | 2.533728 | keep   | increase matrix LR to 0.04       |
| 5efc7aa | 1.807902 | keep   | reduce depth from 8 to 4         |
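Since results.tsv is the logged history, finding the current best is a one-liner over the file. A small sketch, assuming the columns match the table above (commit, val_bpb, status, description); check the actual results.tsv header before relying on this layout.

```python
import csv
import io

# Two sample rows in the assumed results.tsv column order.
SAMPLE = (
    "383abb4\t2.667000\tkeep\tbaseline (AdamW, default config)\n"
    "5efc7aa\t1.807902\tkeep\treduce depth from 8 to 4\n"
)

rows = list(csv.reader(io.StringIO(SAMPLE), delimiter="\t"))
# Pick the row with the lowest val_bpb (lower is better).
best_commit, best_bpb = min(
    ((r[0], float(r[1])) for r in rows), key=lambda t: t[1]
)
assert (best_commit, best_bpb) == ("5efc7aa", 1.807902)
```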

That result already shows the core Apple Silicon pattern: with a fixed 5-minute wall clock, smaller faster-training models can beat larger ones simply by fitting more optimizer steps into the budget.
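The step-count arithmetic behind this is simple. The per-step times below are made-up illustrations (this repo does not publish them), but they show why halving depth can win under a fixed wall clock:

```python
BUDGET_S = 5 * 60  # the fixed 5-minute training budget

def steps_in_budget(sec_per_step: float) -> int:
    """How many optimizer steps fit in the fixed wall clock."""
    return int(BUDGET_S // sec_per_step)

deep_steps = steps_in_budget(0.60)     # hypothetical 8-layer model
shallow_steps = steps_in_budget(0.30)  # hypothetical 4-layer model
# Half the per-step cost means twice the optimizer steps in the same budget.
assert shallow_steps == 2 * deep_steps
```

Whether the extra steps outweigh the smaller model's lower per-step capacity is exactly what each 5-minute experiment measures.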

Longer Apple Silicon runs

Longer overnight runs on the working MLX port pushed much further. The long Mac Mini run is included here because it converged on a meaningfully different winning recipe than the Max-class machines.

| Machine             | Current best | Starting point | Repeated wins |
|---------------------|--------------|----------------|---------------|
| M4 Max #1           | 1.294526     | 1.596971       | AdamW-only, low matrix LR, 3x MLP, no logit cap, moderate weight decay |
| M4 Max #2           | 1.330509     | 1.807902       | leaner batch, long anneal, SiLU, lower regularization, no logit cap |
| Mac Mini (long run) | 1.353329     | 1.922472       | Muon, sharper attention, smaller MLP, lower scalar LR |

The Mac Mini result matters because it did not simply rediscover the same recipe. On smaller Apple Silicon hardware, the strongest changes leaned toward more aggressive step-efficiency wins. Later transfer tests showed that some of those Mac Mini findings did not carry cleanly over to the Max baseline, which is exactly the kind of hardware-specific behavior this loop is useful for uncovering.

Differences from upstream

  • MLX instead of PyTorch/CUDA. Native Apple Silicon training with unified memory.
  • AdamW-only public path. This public train.py keeps the default path simple. The long Mac Mini run above explored a Muon variant in the working port, but that branch is not exposed as a public default here.
  • Smaller eval token budget. Reduced for faster iteration on Apple Silicon while keeping the same evaluate_bpb interface in prepare.py.
  • Roughly 6-7 minutes per experiment. Expect 5 minutes of training plus compile and eval overhead.
  • MFU reporting is placeholder. There is no Apple Silicon equivalent to the H100 FLOPs reference used upstream.

Acknowledgments

Thanks to @karpathy for the original autoresearch idea and protocol, on which this port is entirely based.

License

MIT. See LICENSE.