autoresearch-mlx

March 10, 2026 · View on GitHub

Apple Silicon (MLX) port of Karpathy's autoresearch.

Full credit to @karpathy for the core idea: fixed-time autonomous research loops controlled through program.md. This port keeps the same basic rules: one mutable train.py, one metric (val_bpb), a fixed 5-minute training budget, and keep-or-revert via git. It runs natively on Apple Silicon through MLX, so there is no PyTorch or CUDA dependency.

Quick start

Requirements: Apple Silicon Mac, Python 3.10+, uv.

# install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# install dependencies
uv sync

# one-time data + tokenizer prep
uv run prepare.py

# run one 5-minute training experiment
uv run train.py

Then point Claude Code or another coding agent at program.md and let it run the loop.

What matters

  • prepare.py - data prep, tokenizer, dataloader, and evaluation. Treat as fixed.
  • train.py - model, optimizer, and training loop. This is the file the agent edits.
  • program.md - the autonomous experiment protocol.
  • results.tsv - logged experiment history.

The loop is the same as upstream: edit train.py, run a fixed-budget experiment, read val_bpb, keep the change if it wins, revert if it loses, and repeat.
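The keep-or-revert rule can be sketched in a few lines. This is a minimal illustration, not the actual protocol (which lives in program.md); the helper name `keep_or_revert` and the inline git commands are hypothetical, and the real loop shells out to `uv run train.py` and parses its output.

```python
def keep_or_revert(new_bpb: float, best_bpb: float) -> tuple[bool, float]:
    """Lower val_bpb wins. Returns (kept, new best val_bpb)."""
    if new_bpb < best_bpb:
        # real loop: git commit -am "val_bpb {new_bpb}"  (keep the edit)
        return True, new_bpb
    # real loop: git checkout -- train.py  (revert the edit)
    return False, best_bpb

# Replaying the public baseline walk from results.tsv:
best = float("inf")
for bpb in [2.667000, 2.588904, 2.533728, 1.807902]:
    kept, best = keep_or_revert(bpb, best)
assert best == 1.807902
```

Every candidate in the public walk happened to win, so each step lowers the running best; a losing edit would leave `best` unchanged and be reverted.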

Public baseline results

The public results.tsv captures the initial hardware-local walk from the default baseline down to 1.807902:

| Commit  | val_bpb  | Status | Description                      |
|---------|----------|--------|----------------------------------|
| 383abb4 | 2.667000 | keep   | baseline (AdamW, default config) |
| 909dd59 | 2.588904 | keep   | halve total batch size to 2^16   |
| 4161af3 | 2.533728 | keep   | increase matrix LR to 0.04       |
| 5efc7aa | 1.807902 | keep   | reduce depth from 8 to 4         |
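Since results.tsv is the logged history, finding the current best is a one-liner over the file. A small sketch, assuming the columns match the table above (commit, val_bpb, status, description); check the actual results.tsv header before relying on this layout.

```python
import csv
import io

# Two sample rows in the assumed results.tsv column order.
SAMPLE = (
    "383abb4\t2.667000\tkeep\tbaseline (AdamW, default config)\n"
    "5efc7aa\t1.807902\tkeep\treduce depth from 8 to 4\n"
)

rows = list(csv.reader(io.StringIO(SAMPLE), delimiter="\t"))
# Pick the row with the lowest val_bpb (lower is better).
best_commit, best_bpb = min(
    ((r[0], float(r[1])) for r in rows), key=lambda t: t[1]
)
assert (best_commit, best_bpb) == ("5efc7aa", 1.807902)
```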

That result already shows the core Apple Silicon pattern: with a fixed 5-minute wall clock, smaller faster-training models can beat larger ones simply by fitting more optimizer steps into the budget.
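The step-count arithmetic behind this is simple. The per-step times below are made-up illustrations (this repo does not publish them), but they show why halving depth can win under a fixed wall clock:

```python
BUDGET_S = 5 * 60  # the fixed 5-minute training budget

def steps_in_budget(sec_per_step: float) -> int:
    """How many optimizer steps fit in the fixed wall clock."""
    return int(BUDGET_S // sec_per_step)

deep_steps = steps_in_budget(0.60)     # hypothetical 8-layer model
shallow_steps = steps_in_budget(0.30)  # hypothetical 4-layer model
# Half the per-step cost means twice the optimizer steps in the same budget.
assert shallow_steps == 2 * deep_steps
```

Whether the extra steps outweigh the smaller model's lower per-step capacity is exactly what each 5-minute experiment measures.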

Longer Apple Silicon runs

Longer overnight runs on the working MLX port pushed much further. The long Mac Mini run is included here because it converged on a meaningfully different winning recipe than the Max-class machines.

| Machine             | Current best | Starting point | Repeated wins |
|---------------------|--------------|----------------|---------------|
| M4 Max #1           | 1.294526     | 1.596971       | AdamW-only, low matrix LR, 3x MLP, no logit cap, moderate weight decay |
| M4 Max #2           | 1.330509     | 1.807902       | leaner batch, long anneal, SiLU, lower regularization, no logit cap |
| Mac Mini (long run) | 1.353329     | 1.922472       | Muon, sharper attention, smaller MLP, lower scalar LR |

The Mac Mini result matters because it did not simply rediscover the same recipe. On smaller Apple Silicon hardware, the strongest changes leaned toward more aggressive step-efficiency wins. Later transfer tests showed that some of those Mac Mini findings did not carry cleanly over to the Max baseline, which is exactly the kind of hardware-specific behavior this loop is useful for uncovering.

Differences from upstream

  • MLX instead of PyTorch/CUDA. Native Apple Silicon training with unified memory.
  • AdamW-only public path. This public train.py keeps the default path simple. The long Mac Mini run above explored a Muon variant in the working port, but that branch is not exposed as a public default here.
  • Smaller eval token budget. Reduced for faster iteration on Apple Silicon while keeping the same evaluate_bpb interface in prepare.py.
  • Roughly 6-7 minutes per experiment. Expect 5 minutes of training plus compile and eval overhead.
  • MFU reporting is placeholder. There is no Apple Silicon equivalent to the H100 FLOPs reference used upstream.

Acknowledgments

Thanks to @karpathy for the original autoresearch idea and protocol, on which this port is entirely based.

License

MIT. See LICENSE.