autoresearch - 8-Agent Swarm Results

March 15, 2026 ยท View on GitHub

progress

This is a fork of @karpathy's autoresearch, where we ran an 8-agent swarm across 8 H100 GPUs to autonomously optimize a small LLM training setup. The swarm consisted of 4 Claude Code agents and 4 Codex agents (later replaced with 4 more Claude agents), coordinated via ClawTeam.

Results

MetricBaselineFinalChange
val_bpb1.0440.977-6.4%
Total experiments-2430+-
GPU-hours-~2408 GPUs x ~30h
Agent-hours-~2408 agents x ~30h

Key Optimizations Discovered

The agents independently discovered and validated the following improvements through systematic experimentation:

Architecture (biggest wins):

  • DEPTH=12 (from 8) - deeper model with 768-dim embeddings
  • Norm-before-RoPE - apply QK normalization before rotary embeddings
  • q *= 1.25 - scale queries before attention for sharper attention patterns
  • short_window=512 - fixed sliding window size
  • RoPE base=60000 - higher frequency base for positional encoding

Optimization (second biggest wins):

  • TOTAL_BATCH_SIZE=$2^{1}$7 (from 2192^{19}) - 4x smaller batch = 4x more gradient steps in 5 minutes
  • MATRIX_LR=0.028 - Muon learning rate tuned for depth-12
  • UNEMBEDDING_LR=0.009 - higher LR for output projection
  • SCALAR_LR=0.25 - lower scalar LR
  • WEIGHT_DECAY=0.17 with linear decay schedule
  • MUON_BETA2=0.83, momentum warmup 0.90 -> 0.95

Regularization:

  • EMBEDDING_WEIGHT_DECAY=0.008 with linear decay to 0
  • Dynamic unembedding weight decay, also decaying to 0

Schedule:

  • WARMDOWN_RATIO=0.88 - longer learning rate cooldown
  • FINAL_LR_FRAC=0.03 - small residual learning rate

Optimization Timeline

The experiment progressed in four distinct phases:

  1. Architecture discovery (experiments 1-12): Agents quickly found that DEPTH=12 with batch=$2^{1}$7 was far better than the default, dropping val_bpb from 1.044 to 0.991.

  2. Deep tuning (experiments 12-170): Claude agents ran 200+ experiments each, discovering norm-before-RoPE, weight decay decay schedules, embedding regularization, and RoPE base tuning. val_bpb reached 0.982.

  3. Cross-pollination (experiments 170-280): Codex agents received the best configs and tried combining findings from different Claude agents. Discovered that norm-before-rotary + q-scaling + momentum tuning stack well. val_bpb reached 0.978.

  4. Fine-tuning (experiments 280-430+): Fresh Claude agents started from the global best commit and ran systematic hyperparameter sweeps, finding the final improvements in UNEMBEDDING_LR and MATRIX_LR+MUON_BETA2. val_bpb reached 0.977.

Agent Contributions

AgentGPUExperimentsBest val_bpbKey Discoveries
gpu0-claude04080.982depth=12, norm-before-RoPE, q_scale, Muon momentum
gpu1-claude14240.987GELU-squared, MLP 5x, embedding WD differentiation
gpu2-claude24340.981LR/WD systematic tuning, init scale, MLP Muon LR
gpu3-claude34320.981Embedding WD decay, RoPE 60k, short_window=512
gpu4-claude241750.977UNEMBEDDING_LR=0.010, SCALAR_LR=0.22, EMBEDDING_LR=0.65
gpu5-claude251800.977WARMDOWN=0.88, MATRIX_LR=0.028 + MUON_BETA2=0.83
gpu6-claude261730.979Embedding LR, warmdown ratio, short_window tuning
gpu7-claude271740.977scalar_lr=0.35, Muon beta2, wte init tuning
Codex R1-R74-7~600.980Attention patterns, ablations, architecture validation

What Didn't Work

The agents also tried many things that failed, providing useful negative results:

  • MoE (Mixture of Experts) - significant regression
  • GQA / MQA - regression, full MHA is better at this scale
  • ALiBi / NoPE positional encodings - RoPE is clearly better
  • Label smoothing - catastrophic regression
  • Gradient clipping - regression
  • GELU / SiLU activations - ReLU-squared remains best
  • Deeper models (14-24 layers) - too slow, fewer steps in budget
  • Wider models (ASPECT_RATIO > 64) - worse at depth=12

How to Reproduce

# Setup
uv sync
uv run prepare.py

# Run the optimized model (5 minutes)
uv run train.py
# Expected: val_bpb โ‰ˆ 0.977

Multi-Agent Setup

The swarm was coordinated using ClawTeam with git worktrees for isolation:

# Install ClawTeam
pip install -e ~/ClawTeam/

# Create team
clawteam team spawn-team autoresearch -d "8-agent LLM optimization"

# Spawn agents (each gets its own git worktree)
CUDA_VISIBLE_DEVICES=0 clawteam spawn tmux claude \
  --team autoresearch --agent-name gpu0 --workspace --skip-permissions \
  --task "Read program.md and start experimenting"

See program.md for the full experiment protocol.

Original README

See README-old.md for the original autoresearch documentation by @karpathy.

License

MIT