NanoGPT Slowrun

May 28, 2026

NanoGPT Slowrun is a new benchmark for language modeling algorithms in the infinite-compute, fixed-data regime: 100M tokens from FineWeb, no compute or time limit, lowest validation loss wins.[1] We call it a Slowrun because the goal is to spend as much time with the data as needed to maximize learning from it. We deliberately chose this setting in contrast to speedruns like modded-nanogpt, which assume infinite data and optimize for wall-clock time on fixed hardware. Loved by @karpathy himself!

When speed is not the binding constraint, the space of promising algorithms changes dramatically: large models trained with heavy regularization, expensive optimizers, and evolutionary search are all fair game. We want leaps like GPT-3, where previously unimaginable compute led to better generalization. That doesn't happen if wall-clock time is your constraint.

The baseline trains in ~47 minutes on 8xH100 (~$12) and achieves 3.402 val loss. There are four tracks:

  1. a limited compute track capped at a single 8xH100 node for 1 hour (this is 100x the compute used by the Nanochat 1-epoch baseline),
  2. a tiny compute track capped at a single 8xH100 node for 15 minutes,
  3. a two hour track capped at a single 8xH100 node for two hours,
  4. and an unlimited compute track with minimal restrictions on hardware or time.

For now the limited track lives in the root directory, the tiny track lives at tiny/, the two hour track lives at two_hour/, and the unlimited track lives at unlimited/. The approaches pursued vary significantly based on the compute budget. Submit an entry by opening a PR.
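To put the track caps in perspective, here is a rough arithmetic sketch in Python. It uses only figures quoted in this README (the ~47 minute, ~$12 baseline, the 30 second 1-epoch run from footnote 1, and the footnote's 64 H100 for 7 days bound). It assumes the 30 second run was measured on a single 8xH100 node, which the README does not state, and the function names are illustrative.

```python
# Back-of-the-envelope compute accounting for the tracks.
# All constants are quoted from this README unless marked ASSUMPTION.
GPUS_PER_NODE = 8
BASELINE_MINUTES = 47.0    # "~47 minutes on 8xH100"
BASELINE_COST_USD = 12.0   # "~$12"
ONE_EPOCH_SECONDS = 30.0   # footnote 1
ONE_EPOCH_GPUS = 8         # ASSUMPTION: the 30 s run used one 8xH100 node

TRACK_CAP_SECONDS = {"tiny": 15 * 60, "limited": 60 * 60, "two_hour": 120 * 60}


def gpu_hours(gpus, seconds):
    """GPU-hours consumed by `gpus` devices running for `seconds`."""
    return gpus * seconds / 3600.0


def implied_usd_per_gpu_hour():
    """Hourly GPU price implied by the baseline's quoted time and cost."""
    return BASELINE_COST_USD / gpu_hours(GPUS_PER_NODE, BASELINE_MINUTES * 60)


def multiple_of_one_epoch(gpus, seconds):
    """How many times the 1-epoch run's GPU-seconds a budget represents."""
    return (gpus * seconds) / (ONE_EPOCH_GPUS * ONE_EPOCH_SECONDS)


if __name__ == "__main__":
    rate = implied_usd_per_gpu_hour()
    for name, cap in TRACK_CAP_SECONDS.items():
        hours = gpu_hours(GPUS_PER_NODE, cap)
        ratio = multiple_of_one_epoch(GPUS_PER_NODE, cap)
        print(f"{name}: {hours:.0f} GPU-hours, ~${rate * hours:.0f}, {ratio:.0f}x one epoch")
```

Under that assumption the 1 hour cap works out to 120x the 1-epoch run, and 64 H100s for 7 days to about 161,000x, in line with the rounded 100x and 100,000x figures quoted in this README.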

Running the current record

You can reproduce the limited-compute record by running the following commands:

# Set HF_TOKEN and WANDB_API_KEY in your environment first
git clone https://github.com/qlabs-eng/slowrun.git && cd slowrun
pip install -r requirements.txt
python prepare_data.py
torchrun --standalone --nproc_per_node=8 train.py

World Record History

We accept PRs that achieve a new World Record validation loss within the track's time limit, and add an entry below for each improvement.

Limited Compute Track (1 hour)

The limited-compute track caps runs at a single 8xH100 node for at most 1 hour.

| # | Val Loss | Description | Date | Time | Script | Contributors |
|---|---|---|---|---|---|---|
| 1 | 3.402 | Baseline: 2.7B transformer, Muon, dropout 0.1, weight decay 1.6 | 02/26/26 | ~47 mins | Script | @akshayvegesna |
| 2 | 3.376 | Add shuffling every epoch | 02/27/26 | ~47 mins | Script | @kvegesna |
| 3 | 3.349 | Change value embed tables to projections from x0 | 03/01/26 | ~47 mins | Script | @ms337 |
| 4 | 3.335 | Use swiglu activation | 03/01/26 | 52.1 mins | Script | @akshayvegesna |
| 5 | 3.314 | Add U-Net architecture | 03/03/26 | 52.3 mins | Script | @em-see-squared |
| 6 | 3.295 | Add gating per attention head | 03/03/26 | 53.3 mins | Script | @akshayvegesna |
| 7 | 3.285 | Repeat layers 15-20 for last 3 epochs, reduce warmdown | 03/11/26 | 53.3 mins (training time only) | Script | @shmublu |
| 8 | 3.278 | Run layers 15-20 3 times before layers 21-29 for the last 3 epochs | 03/11/26 | 55.7 mins | Script | @akshayvegesna |
| 9 | 3.276 | Add exclusive self attention (XSA) | 03/12/26 | 57.7 mins | Script | @not-nonymous |
| 10 | 3.270 | LR tuning, warmdown tuning | 03/16/26 | 55.5 mins | Script | @zhiweixx |
| 11 | 3.252 | EMA of weights, hyperparameter tuning | 03/18/26 | 59.2 mins | Script | @ChinmayK0607, @ms337 |
| 12 | 3.248 | Use weighted average of last 3 epoch checkpoints | 03/23/26 | 58.2 mins | Script | @not-nonymous |
| 13 | 3.236 | Add Stochastic Weight Averaging (SWA) | 04/01/26 | 58.9 mins | Script | @shmublu |
| 14 | 3.230 | Switch c_proj init from zero to normal | 04/02/26 | 58.6 mins | Script | @ms337 |
| 15 | 3.227 | Add stochastic depth training | 04/06/26 | 58.5 mins | Script | @ChinmayK0607 |
| 16 | 3.222 | Add multi-token prediction loss | 04/09/26 | 57.1 mins | Script | @clarkkev |
| 17 | 3.214 | Add Interleaved Head Attention (IHA) | 04/13/26 | 58.9 mins | Script | @ms337 |
| 18 | 3.211 | Add MuonEq-R | 04/17/26 | 59.4 mins | Script | @clarkkev |
| 19 | 3.204 | Add document-level shuffling | 04/24/26 | 59.0 mins | Script | @samacqua |
| 20 | 3.195 | Add weight decay schedule, adjust learning rate schedule | 04/26/26 | 59.0 mins | Script | @shmublu |
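Several entries in this track evaluate averaged weights rather than the final training iterate (EMA of weights, a weighted average of the last 3 epoch checkpoints, SWA). A minimal sketch of the two underlying operations on plain Python lists follows. The function names, the decay value, and the averaging weights are illustrative assumptions, not the settings used by the record scripts.

```python
def ema_update(ema, params, decay=0.999):
    """One exponential-moving-average step: decay * ema + (1 - decay) * params."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]


def average_checkpoints(checkpoints, weights=None):
    """Weighted average of equally shaped parameter vectors.

    With weights=None every checkpoint counts equally (a uniform average).
    """
    if weights is None:
        weights = [1.0] * len(checkpoints)
    total = sum(weights)
    size = len(checkpoints[0])
    return [
        sum(w * ckpt[i] for w, ckpt in zip(weights, checkpoints)) / total
        for i in range(size)
    ]


if __name__ == "__main__":
    # Toy two-parameter "model" tracked over three training steps.
    ema = [0.0, 0.0]
    for step_params in ([1.0, 2.0], [1.2, 1.8], [0.9, 2.1]):
        ema = ema_update(ema, step_params, decay=0.9)
    print("EMA weights:", ema)

    # Later checkpoints weighted more heavily (weights are hypothetical).
    last_three = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1]]
    print("Averaged:", average_checkpoints(last_three, weights=[1.0, 2.0, 3.0]))
```

In a real training loop the same arithmetic would be applied to every tensor in the model's state, and the averaged copy would be the one used for validation.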

Tiny Track (15 minutes)

The tiny track caps runs at a single 8xH100 node for at most 15 minutes.

| # | Val Loss | Description | Date | Time | Script | Contributors |
|---|---|---|---|---|---|---|
| 1 | 3.428 | Baseline: 300M transformer, weight decay 0.8, dropout 0.1 | 03/02/26 | 13.7 mins | Script | @akshayvegesna |
| 2 | 3.410 | Add swiglu activation | 03/02/26 | 14.4 mins | Script | @ChinmayK0607 |
| 3 | 3.395 | Add U-Net architecture | 03/03/26 | 14.5 mins | Script | @em-see-squared, @akshayvegesna |
| 4 | 3.385 | Add gating per attention head | 03/04/26 | 14.6 mins | Script | @ChinmayK0607 |
| 5 | 3.383 | Update warmdown ratio | 03/06/26 | 14.6 mins | Script | @not-nonymous |
| 6 | 3.374 | Half truncated RoPE, partial key offset, residual lambdas to 1.1 | 03/06/26 | 14.8 mins | Script | @ChinmayK0607 |
| 7 | 3.365 | Add weight decay schedule | 03/15/26 | 14.8 mins | Script | @shmublu |
| 8 | 3.353 | Add EMA parameter averaging | 03/18/26 | 14.9 mins | Script | @clarkkev |
| 9 | 3.345 | Add Stochastic Weight Averaging (SWA) | 04/01/26 | 14.6 mins | Script | @shmublu |
| 10 | 3.332 | Add document-level shuffling | 04/24/26 | 14.7 mins | Script | @samacqua |

Two Hour Track (2 hours)

The two hour track caps runs at a single 8xH100 node for at most two hours.

| # | Val Loss | Description | Date | Time | Script | Contributors |
|---|---|---|---|---|---|---|
| 1 | 3.203 | Baseline, extending the 1 hour multi token prediction result | 04/12/26 | 110.6 mins | Script | @ChinmayK0607 |
| 2 | 3.197 | Add Interleaved Head Attention (IHA) | 04/14/26 | 115.0 mins | Script | @ChinmayK0607, @ms337 |
| 3 | 3.188 | Add MuonEq-R | 04/20/26 | 116.3 mins | Script | @ChinmayK0607 |
| 4 | 3.150 | Add weight decay schedule, adjust learning rate schedule | 04/29/26 | 117.5 mins | Script | @shmublu |
| 5 | 3.144 | Add back context window scheduling | 05/04/26 | 117.9 mins | Script | @ChinmayK0607 |

Unlimited Compute Track

| # | Val Loss | Description | Date | Time | Script | Contributors |
|---|---|---|---|---|---|---|
| 1 | 3.402 | Baseline: 2.7B transformer, Muon, dropout 0.1, weight decay 1.6 | 02/26/26 | ~47 mins | Script | @akshayvegesna |
| 2 | 3.264 | Baseline: 8 × 2.7B transformer, Muon, dropout 0.1, weight decay 1.6, logit averaging | 02/27/26 | 6h 44m | Script | @akshayvegesna |
| 3 | 3.218 | Use value projections and swiglu activation | 03/02/26 | 6h 54m | Script | @akshayvegesna |
| 4 | 3.185 | Add U-Net and Attention Gating | 03/04/26 | 7h 8m | Script | @akshayvegesna, @em-see-squared |
| 5 | 3.166 | Train each model for 1.5x longer | 03/05/26 | 10h 35m | Script | @akshayvegesna |
| 6 | 3.126 | Train each model in ensemble to distill previous model + usual CE loss | 03/07/26 | 16h 1m | Script | @not-nonymous |
| 7 | 3.089 | Ensemble of 10 models, looping of layers 15-20, tuned epoch counts, loss weight | 03/13/26 | 19h 18m (2 nodes, 8xH100) | Script | @akshayvegesna |
| 8 | 3.081 | Ensemble of 12 models, distill alpha 0.5 | 03/18/26 | 42h 35m (1 node, 8xH100) | Script | @not-nonymous |
| 9 | 3.045 | More looping, hyperparam tuning, model size increase | 03/19/26 | ~44h (2 nodes, 8xH100) | Script | @akshayvegesna |
| 10 | 3.024 | Use probability averaging over logit averaging, train 20 models | 03/31/26 | 210 hours (7xH100 node) | Script | @L-z-Chen |
| 11 | 3.001 | Add MTP, IHA, MuonEq-R, adjust initialization, ensemble more models | 04/23/26 | 68 hours (4 nodes, 8xH100) | Script | @akshayvegesna |
| 12 | 2.987 | Add snapshot ensembles + gradient based model selection | 05/28/26 | 80 hours (5 nodes, 8xH100) | Script | @not-nonymous |
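The unlimited track's entries combine several models at evaluation time, first by logit averaging and later by probability averaging. The sketch below illustrates the difference between the two on toy next-token logits. It is a generic illustration with hypothetical function names, not the code used by the record scripts.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_average(all_logits):
    """Average the models' logits, then apply softmax once.

    Equivalent to a renormalized geometric mean of the models' probabilities.
    """
    n = len(all_logits)
    mean_logits = [sum(column) / n for column in zip(*all_logits)]
    return softmax(mean_logits)


def probability_average(all_logits):
    """Apply softmax per model, then take the arithmetic mean of probabilities."""
    n = len(all_logits)
    per_model = [softmax(logits) for logits in all_logits]
    return [sum(column) / n for column in zip(*per_model)]


if __name__ == "__main__":
    # Two toy models over a 2-token vocabulary: one confident, one undecided.
    ensemble_logits = [[4.0, 0.0], [0.0, 0.0]]
    print("logit averaging:      ", logit_average(ensemble_logits))
    print("probability averaging:", probability_average(ensemble_logits))
```

When the models disagree, the two rules give different distributions, so the choice affects the ensemble's validation loss. Which one works better for a given ensemble is an empirical question.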

Why limited data, unlimited compute?

The bitter lesson tells us that we should strongly prefer algorithms that scale with compute alone. We can't improve models at the rate compute scales as long as performance is bottlenecked by data.

This repo builds on Nanochat, which took many ideas from the modded-nanogpt speedrun contest. To be fair, the speedrun contest did provide real data efficiency gains: using less data is one way to train faster. But because it sets speed as the binding constraint, it filters out an entire class of algorithms that yield learning gains.

Initial Baseline Approach (02/26/26)

Following Kim et al. (2025),[2] we developed the initial baseline in three steps:

  1. Optimizer selection. We tested popular optimizers in the data-limited regime, training for multiple epochs on the 100M tokens. Muon outperforms AdamW, SOAP, and MAGMA.

  2. Scaling up. We increased model size but found diminishing returns due to the limited data. Without appropriate regularization, a 1.4B parameter model outperforms a 2.7B parameter model.

  3. Regularization. When we scale up parameter count while also applying heavy weight decay, we recover monotonic improvements with scale. We further find that dropout improves performance on top of weight decay. Our final model[3] is a 2.7B parameter transformer, with 1.2B parameters in the transformer trunk and heavy embedding defaults from Nanochat. It is trained with dropout 0.1 and weight decay 1.6. This weight decay is very large by traditional standards, but consistent with Kim et al. (2025), who find that the optimal weight decay is up to 30× larger than standard practice in the data-constrained regime.
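To give a feel for how strong a weight decay of 1.6 is, here is a minimal sketch of a decoupled weight decay step on plain Python floats. It assumes the common formulation in which the decay is scaled by the learning rate. The README does not say how the record scripts apply it, and the learning rate and the comparison value of 0.1 are hypothetical choices for illustration. The `updates` argument stands in for whatever direction the optimizer (Muon in the baseline) produces.

```python
def decoupled_weight_decay_step(weights, updates, lr, weight_decay):
    """One step of w <- (1 - lr * weight_decay) * w - lr * update.

    ASSUMPTION: decay is decoupled from the gradient and scaled by lr.
    """
    shrink = 1.0 - lr * weight_decay
    return [shrink * w - lr * u for w, u in zip(weights, updates)]


def shrink_after(steps, lr, weight_decay):
    """Fraction of a weight's magnitude left after `steps` decay-only steps."""
    return (1.0 - lr * weight_decay) ** steps


if __name__ == "__main__":
    lr = 0.01  # hypothetical learning rate, for illustration only
    for wd in (0.1, 1.6):
        remaining = shrink_after(100, lr, wd)
        print(f"weight decay {wd}: {remaining:.3f} of the weight left after 100 steps")
```

With no gradient signal pushing back, the heavier setting pulls weights toward zero far faster, which is the sense in which it acts as a strong regularizer.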

Given the strong performance by large models that are well regularized, we speculate that larger models have a strong simplicity bias, amplified by regularization.

Figure: Overparametrization. Taken from Andrew Gordon Wilson, "Deep Learning is Not So Mysterious or Different."

Why 100M tokens?

We choose 100M tokens because it is small enough to affordably try radically different learning algorithms, while large enough that the winning techniques may work at a larger scale, though the scaling behavior is an open empirical question.

Footnotes

  1. For practical purposes, we begin by setting an upper bound on compute of 64 H100s for 7 days. For reference, nanogpt can be trained for 1 epoch in 30s, so using this amount of compute would be 100,000x the compute used by that baseline.

  2. Konwoo Kim, Suhas Kotha, Percy Liang, and Tatsunori Hashimoto. "Pre-training under infinite compute." arXiv:2509.14786, 2025.

  3. These numbers from 02/26/26 are no longer accurate as of the latest world records. As of 04/08/26, the world record on the 1 hour track uses a 1.4B parameter model.