autoresearch-ANE

March 8, 2026 · View on GitHub

Apple Silicon LLM training — three accelerators, one chip.

Autonomous AI research on M4 Max using all three compute paths Apple Silicon offers: the Apple Neural Engine (ANE) via native Obj-C, the GPU via MLX, and the GPU via PyTorch/MPS. Forked from Karpathy's autoresearch.

Same protocol: an AI agent modifies training code, runs 5-minute experiments, evaluates val_bpb, keeps or discards, and loops overnight. But instead of one H100, we're running on a laptop chip — and discovering what works (and what doesn't) on Apple Silicon.

Results so far

ANE (native Obj-C, Apple Neural Engine):

  • 67.6M param GPT, 6 layers, SEQ=512, ~99ms/step
  • Best loss: 5.81 (LR=2e-4, 10K steps)
  • ANE is invisible to Activity Monitor — runs alongside GPU with zero interference
  • Key challenge: activation instability on long runs (cosine schedule must match run length)

MPS (PyTorch, Metal GPU):

  • 11.5M param GPT, val_bpb=1.308 after 79 autonomous experiments
  • bf16 confirmed 2.6x slower on Apple Silicon — fp32 is faster
  • H100 findings (embedding WD, init scaling) do not transfer to MPS

MLX (Apple's native ML framework) — mlx/:

  • ~50M param GPT, val_bpb=1.665 baseline (agent optimizing now)
  • Native bf16, unified memory, no translation layer
  • Replaced MPS — ported from trevin-creator/autoresearch-mlx

Quick start

ANE (native, macOS Apple Silicon only)

cd native && make all
make test-ane              # verify ANE hardware access
make bench-sram            # probe SRAM performance cliffs
./build/train_overnight_nl6_s512 --steps 10000 --scratch --lr 2e-4 \
  --data data/train.bin --val data/val.bin
cd mlx && uv sync
uv run prepare.py --num-shards 8
uv run train.py

MPS (retired, kept for reference)

cp pyproject_mac.toml pyproject.toml && uv sync
uv run prepare.py --num-shards 8
uv run train_mac.py

Autonomous agent mode

claude --dangerously-skip-permissions -p "Read program.md and start autoresearch."

Architecture

native/             — ANE hardware-level training (Obj-C, private APIs)
  runtime/          — ANE interface (_ANEInMemoryModel, IOSurface)
  mil/              — MIL code generation, dynamic weight pipeline
  training/         — training loop, CPU fallback ops (RMSNorm, Adam)
  bridge/           — C API for Python ctypes
  probes/           — hardware exploration (SRAM limits, weight patching)

mlx/                — MLX GPU training (Apple's native ML framework)
  train.py          — model + optimizer + loop (agent modifies this)
  prepare.py        — data prep, tokenizer, evaluation (read-only)
  program.md        — agent instructions

train.py            — NVIDIA GPU training (upstream, CUDA)
train_mac.py        — Apple Silicon training (MPS backend, retired)
prepare.py          — data prep, tokenizer, evaluation (read-only)
program.md          — agent instructions
viz/                — result visualizations

Key concept: dynamic weight pipeline (ANE)

Weights are packed into the IOSurface input alongside activations. Kernels compile once at startup; weight updates are just memcpy — no recompilation needed. This is the core innovation over maderix/ANE which rebaked weights into compiled kernels.

Key findings

  • ANE: 6x bigger model, 8x faster than MPS on the same chip
  • Both accelerators run simultaneously with zero interference
  • ANE timing breakdown: 33% ANE compute, 30% IO, 37% CPU (classifier is 22% bottleneck)
  • Depth U-curve at SEQ=512: NL=4(6.74) → NL=6(6.34) → NL=8(6.94) → NL=12(7.14)
  • SRAM wall at SEQ=1024 — ANE runs out of on-chip memory
  • Cosine schedule length must match actual run length or activations explode

Credits

License

MIT