Awesome AutoResearch

June 24, 2026

A curated list of AutoResearch use cases with optimization traces and open source implementations. Every entry includes a link to the actual optimization trajectory so you can see what the agent tried, not just the final result.

What is AutoResearch?

AutoResearch is, at its core, a prompt. Karpathy released it as a single markdown file, program.md, that instructs a coding agent (Claude Code, Codex, or similar) to follow an optimization workflow. The agent edits one file (train.py, which trains a language model), runs it for a fixed 5 minutes on a GPU, checks whether the metric improved, and either commits the change or reverts it. Then it loops forever.

The specific program.md that ships with AutoResearch is written for one task: training a GPT model. But the structure (iteratively optimizing a file against an evaluation metric, with a keep/discard decision after every run) turns out to be portable. In the weeks since release, the community has adapted it to GPU kernel optimization, template engine optimization, tabular ML engineering, and more. The program.md for each of these looks different, but the loop is the same.
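As a rough illustration of that shared loop, here is a minimal Python sketch. It is not taken from the AutoResearch repository: the function name `keep_or_revert_loop`, the toy objective, and the random `propose` step are all hypothetical stand-ins. In the real workflow a coding agent edits a file, a fixed-budget run produces the metric, and git handles the commit or revert. The sketch also assumes a lower-is-better metric.

```python
import random
from typing import Callable, List, Tuple


def keep_or_revert_loop(
    initial: float,
    propose: Callable[[float, random.Random], float],
    evaluate: Callable[[float], float],
    iterations: int,
    seed: int = 0,
) -> Tuple[float, float, List[Tuple[int, float, bool]]]:
    """Greedy keep/discard loop (assumes a lower score is better)."""
    rng = random.Random(seed)
    best = initial
    best_score = evaluate(best)
    trace: List[Tuple[int, float, bool]] = []
    for step in range(iterations):
        candidate = propose(best, rng)  # stand-in for the agent editing the file
        score = evaluate(candidate)     # stand-in for the fixed-budget run
        kept = score < best_score
        if kept:
            # "commit": the candidate becomes the new baseline
            best, best_score = candidate, score
        # otherwise "revert": the candidate is simply dropped
        trace.append((step, score, kept))
    return best, best_score, trace


# Toy problem (hypothetical): minimise (x - 3)^2 starting from x = 0.
def toy_loss(x: float) -> float:
    return (x - 3.0) ** 2


def toy_propose(x: float, rng: random.Random) -> float:
    return x + rng.uniform(-1.0, 1.0)


best, best_score, trace = keep_or_revert_loop(0.0, toy_propose, toy_loss, iterations=200)
print(f"best x = {best:.3f}, loss = {best_score:.5f}, kept {sum(k for _, _, k in trace)} of {len(trace)}")
```

The `trace` list records every attempt, including the discarded ones, which is the kind of per-experiment record the entries below link to.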

Use Cases

| Use Case | Description | Author | Links | Traces |
| --- | --- | --- | --- | --- |
| LLM training optimization | The original: optimize nanoGPT training code. 20 improvements found overnight on hand-tuned code | Andrej Karpathy | GitHub · Tweet | progress chart |
| Speed up Shopify's template engine | 53% faster parse+render, 61% fewer allocations from 93 automated commits on Shopify's Liquid engine | Tobi Lutke (Shopify CEO) | GitHub · Tweet | PR |
| GPU kernel optimization | Autoresearch applied to CUDA kernel optimization (18 → 187 TFLOPS) | RightNow AI | GitHub · Tweet | progress chart |
| Voice agent prompt engineering | Optimize voice agent prompts with automated evaluation (score 0.728 → 0.969) | Archie Sengupta | GitHub · Tweet | progress chart |
| Predict baseball pitch speed | Build predictive model for pitch velocity from biomechanics data (R² 0.44 → 0.78) | Kyle Boddy (Driveline Baseball) | Tweet | progress chart |
| XGBoost for tennis match prediction | Predict ATP/WTA match outcomes; encountered and documented reward hacking | Nick Oak | Blog · GitHub | blog |
| RL post-training optimization | Autoresearch for RL hyperparameters on Qwen 0.5B + GSM8K (eval 0.475 → 0.550 in fewer steps) | Vivek Kashyap | GitHub · Tweet | progress chart |
| Ancient scroll ink detection | Vesuvius Challenge autoresearch agent swarm for ink detection models. 4 agents 24/7, cross-scroll generalization nearly doubled | Vesuvius Challenge | Blog | blog |
| Earth system model optimization | Hybrid: LLM proposes formula structures, TPE optimizes parameters. Fire correlation 0.09 → 0.65 | Dev Paragiri (UMD CS) | Tweet · Blog | blog |
| Bitcoin price formula discovery | Autonomous search for best time-based formula predicting Bitcoin price. 328 experiments, 50.5% RMSE improvement over power law. Walk-forward OOS evaluation with bootstrap significance testing | Carlos Baquero | GitHub | progress chart |
| Protein folding architecture search | Codex /goal iterating SimplexFold (AlphaFold2-style model) on NanoFold benchmark for 150+ hours, searching topologically inspired architectures. 127 scored runs, best val C-alpha lDDT 0.4311 | Chris Hayduk (OpenAI) | GitHub · Tweet | progress chart |
| Flappy Bird game AI | Evolved a Flappy Bird agent from scratch over 100 iterations: mean score 2.76 → 20.9 (6.6×) for ~$12, no human guidance beyond initial setup | Weco AI | Blog | progress chart · search tree |

Benchmarks & Evaluation

| Benchmark | Description | Maintainer | Links | Traces |
| --- | --- | --- | --- | --- |
| ResearchClawBench | End-to-end scientific research benchmark for AI agents: agents read raw data and papers, write code, generate figures, and produce publication-style reports scored against expert checklists | InternScience | GitHub · Dataset · Leaderboard | run traces |
| FML-bench | Controlled study of AI research agent strategies across 18 ML research tasks (10 domains), separating search strategy from execution infrastructure. Finds a simple greedy hill-climber nearly matches the best tree-search agent; performance tracks early convergence and focused exploration, not diversity or compute | NUS | Paper · GitHub | search dynamics |

Implementations & Forks

| Project | Description | Links |
| --- | --- | --- |
| autoresearch | The original: single GPU, 630 lines of Python | GitHub |
| pi-autoresearch | Generalized as a Pi extension. Works for any optimization target: test speed, bundle size, build times, Lighthouse scores | GitHub |
| autoresearch-mlx | Apple Silicon (MLX) port. No PyTorch required, uses unified memory | GitHub |
| autoresearch-win-rtx | Windows + consumer RTX GPU port (RTX 2060 through 4090) | GitHub |
| autoresearch-at-home | Distributed autoresearch, SETI@home style. Multi-agent swarm coordination | GitHub |
| autoresearch (Claude Skill) | Generalized as a Claude Code skill for any domain | GitHub |
| agent-digivolve-harness | A control layer for long-running CLI agent work. Generalizes the autoresearch keep/revert loop with persistent run state, explicit eval packages, baseline and holdout cases, and one bounded mutation per iteration | GitHub |
| auto-agent | Autoresearch, but for AI agents. Given a golden dataset, it autonomously improves a target agent through an iterative hypothesis-driven loop: analyze failures, spawn a coding agent to implement fixes, evaluate, and accept or roll back | GitHub |
| CORAL | Multi-agent autoresearch with shared evolution. Spawns parallel Claude Code agents in isolated git worktrees that share a hub of attempts, notes, and reusable skills. Each agent reads a task guide, commits changes, triggers eval-on-commit, and loops, with cross-agent knowledge sharing as the core mechanism for open-ended discovery | GitHub · Paper |
| evo | A Claude Code plugin for autoresearch | GitHub |

Contributing

Want to add a use case? Open a PR or file an issue.

To make our work easier, please make submissions as verifiable as possible:

  • Minimum: a progress chart showing each experiment's score and breakthrough annotations (e.g. Karpathy's progress chart)
  • Ideal: a public repo with per-solution code and scores (the full exploration trace), or a Weco Observe dashboard link
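
If you have per-experiment scores but no chart yet, the data behind a progress chart can be derived in a few lines. The sketch below is only a suggestion, not a required submission format: the helper name `running_best` and the example scores are hypothetical, and it assumes a higher-is-better metric unless told otherwise.

```python
from typing import List, Optional, Tuple


def running_best(
    scores: List[float], higher_is_better: bool = True
) -> List[Tuple[int, float, float, bool]]:
    """Return (experiment index, score, best so far, is_breakthrough) rows."""
    rows: List[Tuple[int, float, float, bool]] = []
    best: Optional[float] = None
    for i, score in enumerate(scores):
        if best is None:
            improved = True  # the first experiment sets the baseline
        elif higher_is_better:
            improved = score > best
        else:
            improved = score < best
        if improved:
            best = score
        rows.append((i, score, best, improved))
    return rows


# Hypothetical scores, for illustration only.
example_scores = [0.50, 0.48, 0.61, 0.59, 0.70]
rows = running_best(example_scores)
for index, score, best, improved in rows:
    marker = "  <- breakthrough" if improved else ""
    print(f"exp {index}: score={score:.2f} best={best:.2f}{marker}")
```

Plotting `score` as points and `best` as a step line, with the flagged rows annotated, gives the kind of chart described in the minimum requirement.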

License

CC0 1.0