NanoGPT Slowrun

May 28, 2026


NanoGPT Slowrun is a new benchmark for language modeling algorithms in the infinite compute, fixed data regime: 100M tokens from FineWeb, no compute/time limit, lowest validation loss wins.¹ We call it a Slowrun since the goal is to spend as much time with the data as we need to maximize learning on it. We deliberately choose this setting in contrast to speedruns like modded-nanogpt, which assume infinite data and optimize for wall-clock time on fixed hardware. Loved by @karpathy himself!

When speed is not the binding constraint, the space of promising algorithms changes dramatically. For example, large models trained with heavy regularization, expensive optimizers, and evolutionary search are all fair game. We want leaps like GPT-3, where previously unimaginable compute led to better generalization. That doesn't happen if wall-clock time is your constraint.

The baseline trains in ~47 minutes on 8xH100 (~$12) and achieves 3.402 val loss. There are four tracks:

a limited compute track capped at a single 8xH100 node for 1 hour (this is 100x the compute used by the Nanochat 1-epoch baseline),
a tiny compute track capped at a single 8xH100 node for 15 minutes,
a two hour track capped at a single 8xH100 node for two hours,
and an unlimited compute track with minimal restrictions on hardware or time.

For now the limited track lives in the root directory, the tiny track lives at tiny/, the two hour track lives at two_hour/, and the unlimited track lives at unlimited/. The approaches pursued vary significantly based on the compute budget. Submit an entry by opening a PR.
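The compute multipliers quoted for the tracks can be sanity-checked with a little arithmetic. The sketch below assumes the 1-epoch baseline from the first footnote (about 30 s) runs on a single 8xH100 node; that hardware assumption and all names in the code are ours, not taken from the repo.

```python
# Assumption: the ~30 s, 1-epoch baseline from the footnote runs on one 8xH100 node.
GPUS_PER_NODE = 8
BASELINE_EPOCH_SECONDS = 30


def gpu_seconds(num_gpus: int, seconds: float) -> float:
    """Total GPU-seconds consumed by num_gpus running for the given time."""
    return num_gpus * seconds


baseline = gpu_seconds(GPUS_PER_NODE, BASELINE_EPOCH_SECONDS)

# Time caps for the three capped tracks, in minutes on a single 8xH100 node.
TRACK_CAPS_MINUTES = {"tiny": 15, "limited": 60, "two_hour": 120}

ratios = {
    name: gpu_seconds(GPUS_PER_NODE, minutes * 60) / baseline
    for name, minutes in TRACK_CAPS_MINUTES.items()
}

# Upper bound from the footnote: 64 H100s for 7 days.
upper_bound_ratio = gpu_seconds(64, 7 * 24 * 3600) / baseline

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.0f}x the 1-epoch baseline")
print(f"upper bound: {upper_bound_ratio:,.0f}x the 1-epoch baseline")
```

Under that assumption the limited track comes out at about 120x and the upper bound at about 160,000x, which matches the rounded "100x" and "100,000x" figures in order of magnitude.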

Running the current record

You can reproduce the limited-compute record by running the following commands:

# Set HF_TOKEN and WANDB_API_KEY in your environment first
git clone https://github.com/qlabs-eng/slowrun.git && cd slowrun
pip install -r requirements.txt
python prepare_data.py
torchrun --standalone --nproc_per_node=8 train.py

World Record History

We accept PRs that achieve a new World Record validation loss within the track's time limit, and add an entry below for each improvement.

Limited Compute Track (1 hour)

The limited-compute track caps runs at a single 8xH100 node for at most 1 hour.

#	Val Loss	Description	Date	Time	Script	Contributors
1	3.402	Baseline: 2.7B transformer, Muon, dropout 0.1, weight decay 1.6	02/26/26	~47 mins	Script	@akshayvegesna
2	3.376	Add shuffling every epoch	02/27/26	~47 mins	Script	@kvegesna
3	3.349	Change value embed tables to projections from x0	03/01/26	~47 mins	Script	@ms337
4	3.335	Use swiglu activation	03/01/26	52.1 mins	Script	@akshayvegesna
5	3.314	Add U-Net architecture	03/03/26	52.3 mins	Script	@em-see-squared
6	3.295	Add gating per attention head	03/03/26	53.3 mins	Script	@akshayvegesna
7	3.285	Repeat layers 15-20 for last 3 epochs, reduce warmdown	03/11/26	53.3 mins (training time only)	Script	@shmublu
8	3.278	Run layers 15-20 3 times before layers 21-29 for the last 3 epochs	03/11/26	55.7 mins	Script	@akshayvegesna
9	3.276	Add exclusive self attention (XSA)	03/12/26	57.7 mins	Script	@not-nonymous
10	3.270	LR tuning, warmdown tuning	03/16/26	55.5 mins	Script	@zhiweixx
11	3.252	EMA of weights, hyperparameter tuning	03/18/26	59.2 mins	Script	@ChinmayK0607, @ms337
12	3.248	Use weighted average of last 3 epoch checkpoints	03/23/26	58.2 mins	Script	@not-nonymous
13	3.236	Add Stochastic Weight Averaging (SWA)	04/01/26	58.9 mins	Script	@shmublu
14	3.230	Switch c_proj init from zero to normal	04/02/26	58.6 mins	Script	@ms337
15	3.227	Add stochastic depth training	04/06/26	58.5 mins	Script	@ChinmayK0607
16	3.222	Add multi-token prediction loss	04/09/26	57.1 mins	Script	@clarkkev
17	3.214	Add Interleaved Head Attention (IHA)	04/13/26	58.9 mins	Script	@ms337
18	3.211	Add MuonEq-R	04/17/26	59.4 mins	Script	@clarkkev
19	3.204	Add document-level shuffling	04/24/26	59.0 mins	Script	@samacqua
20	3.195	Add weight decay schedule, adjust learning rate schedule	04/26/26	59.0 mins	Script	@shmublu
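Several entries in this table average weights rather than using the final iterate directly (EMA of weights, a weighted average of epoch checkpoints, SWA). The sketch below illustrates the first two on plain Python lists. It is not the repo's implementation: the decay value, the checkpoint coefficients, and the function names are hypothetical.

```python
def ema_update(ema, weights, decay):
    """Return decay * ema + (1 - decay) * weights, elementwise."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


def weighted_average(checkpoints, coeffs):
    """Combine checkpoints elementwise; coeffs are expected to sum to 1."""
    return [
        sum(c * value for c, value in zip(coeffs, values))
        for values in zip(*checkpoints)
    ]


# Toy "training trajectory": three snapshots of a two-parameter model.
trajectory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# EMA: start from the first snapshot and fold in each later one.
ema = list(trajectory[0])
for weights in trajectory[1:]:
    ema = ema_update(ema, weights, decay=0.5)  # hypothetical decay

# Weighted average of the last three checkpoints, favoring the most recent.
averaged = weighted_average(trajectory, coeffs=[0.25, 0.25, 0.5])  # hypothetical

print(ema, averaged)
```

In a real training loop the same update would run over every parameter tensor, and the averaged weights would be the ones evaluated.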

Tiny Track (15 minutes)

The tiny track caps runs at a single 8xH100 node for at most 15 minutes.

#	Val Loss	Description	Date	Time	Script	Contributors
1	3.428	Baseline: 300M transformer, weight decay 0.8, dropout 0.1	03/02/26	13.7 mins	Script	@akshayvegesna
2	3.410	Add swiglu activation	03/02/26	14.4 mins	Script	@ChinmayK0607
3	3.395	Add U-Net architecture	03/03/26	14.5 mins	Script	@em-see-squared, @akshayvegesna
4	3.385	Add gating per attention head	03/04/26	14.6 mins	Script	@ChinmayK0607
5	3.383	Update warmdown ratio	03/06/26	14.6 mins	Script	@not-nonymous
6	3.374	Half truncated RoPE, partial key offset, residual lambdas to 1.1	03/06/26	14.8 mins	Script	@ChinmayK0607
7	3.365	Add weight decay schedule	03/15/26	14.8 mins	Script	@shmublu
8	3.353	Add EMA parameter averaging	03/18/26	14.9 mins	Script	@clarkkev
9	3.345	Add Stochastic Weight Averaging (SWA)	04/01/26	14.6 mins	Script	@shmublu
10	3.332	Add document-level shuffling	04/24/26	14.7 mins	Script	@samacqua
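The "document-level shuffling" entry is not described further in the table. One plausible reading, sketched here as an assumption rather than as the repo's method, is to reorder whole documents at the start of each epoch while keeping each document's tokens contiguous. The function name and seeding scheme are hypothetical.

```python
import random


def epoch_token_stream(documents, epoch, seed=0):
    """Yield the tokens for one epoch, visiting documents in a fresh random order.

    Each document stays contiguous; only the order of documents changes.
    Seeding with (seed + epoch) makes every epoch's order reproducible.
    """
    order = list(range(len(documents)))
    random.Random(seed + epoch).shuffle(order)
    for idx in order:
        yield from documents[idx]


# Toy corpus: each inner list is one tokenized document.
documents = [[1, 2], [3], [4, 5, 6]]
for epoch in range(3):
    print(epoch, list(epoch_token_stream(documents, epoch)))
```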

Two Hour Track (2 hours)

The two hour track caps runs at a single 8xH100 node for at most two hours.

#	Val Loss	Description	Date	Time	Script	Contributors
1	3.203	Baseline, extending the 1 hour multi-token prediction result	04/12/26	110.6 mins	Script	@ChinmayK0607
2	3.197	Add Interleaved Head Attention (IHA)	04/14/26	115.0 mins	Script	@ChinmayK0607, @ms337
3	3.188	Add MuonEq-R	04/20/26	116.3 mins	Script	@ChinmayK0607
4	3.150	Add weight decay schedule, adjust learning rate schedule	04/29/26	117.5 mins	Script	@shmublu
5	3.144	Add back context window scheduling	05/04/26	117.9 mins	Script	@ChinmayK0607

Unlimited Compute Track

#	Val Loss	Description	Date	Time	Script	Contributors
1	3.402	Baseline: 2.7B transformer, Muon, dropout 0.1, weight decay 1.6	02/26/26	~47 mins	Script	@akshayvegesna
2	3.264	Baseline: 8 × 2.7B transformer, Muon, dropout 0.1, weight decay 1.6, logit averaging	02/27/26	6h 44m	Script	@akshayvegesna
3	3.218	Use value projections and swiglu activation	03/02/26	6h 54m	Script	@akshayvegesna
4	3.185	Add U-Net and Attention Gating	03/04/26	7h 8m	Script	@akshayvegesna, @em-see-squared
5	3.166	Train each model for 1.5x longer	03/05/26	10h 35m	Script	@akshayvegesna
6	3.126	Train each model in ensemble to distill previous model + usual CE loss	03/07/26	16h 1m	Script	@not-nonymous
7	3.089	Ensemble of 10 models, looping of layers 15-20, tuned epoch counts, loss weight	03/13/26	19h 18m (2 nodes, 8xH100)	Script	@akshayvegesna
8	3.081	Ensemble of 12 models, distill alpha 0.5	03/18/26	42h 35m (1 node, 8xH100)	Script	@not-nonymous
9	3.045	More looping, hyperparam tuning, model size increase	03/19/26	~44h (2 nodes, 8xH100)	Script	@akshayvegesna
10	3.024	Use probability averaging over logit averaging, train 20 models	03/31/26	210 hours (7xH100 node)	Script	@L-z-Chen
11	3.001	Add MTP, IHA, MuonEq-R, adjust initialization, ensemble more models	04/23/26	68 hours (4 nodes, 8xH100)	Script	@akshayvegesna
12	2.987	Add snapshot ensembles + gradient based model selection	05/28/26	80 hours (5 nodes, 8xH100)	Script	@not-nonymous
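Entries 2 and 10 above refer to two ways of combining an ensemble's predictions: logit averaging and probability averaging. The sketch below shows how the two differ for a single next-token distribution. It is a minimal illustration on toy numbers, and the function names are ours rather than the repo's.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_logits(models_logits):
    """Average the logits across models, then apply softmax once."""
    n = len(models_logits)
    mean_logits = [sum(column) / n for column in zip(*models_logits)]
    return softmax(mean_logits)


def average_probs(models_logits):
    """Apply softmax per model, then average the resulting probabilities."""
    n = len(models_logits)
    per_model = [softmax(logits) for logits in models_logits]
    return [sum(column) / n for column in zip(*per_model)]


# Two toy models over a two-token vocabulary: one confident, one undecided.
models_logits = [[4.0, 0.0], [0.0, 0.0]]
target = 0

for name, combine in [("logit avg", average_logits), ("prob avg", average_probs)]:
    probs = combine(models_logits)
    print(f"{name}: p(target)={probs[target]:.4f}, loss={-math.log(probs[target]):.4f}")
```

Logit averaging corresponds to a normalized geometric mean of the members' distributions, while probability averaging is an arithmetic mean, so the two generally assign different losses to the same token. Which one yields lower validation loss is an empirical question.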

Why limited data, unlimited compute?

The bitter lesson tells us that we should strongly prefer algorithms that scale with compute alone. We can't improve models at the rate compute scales as long as performance is bottlenecked by data.

This repo builds on Nanochat, which took many ideas from the modded-nanogpt speedrun contest. To be fair, the speedrun contest did provide real data efficiency gains: using less data is one way to train faster. But because it sets speed as the binding constraint, it filters out an entire class of algorithms that yield learning gains.

Initial Baseline Approach (02/26/26)

Following Kim et al. (2025),² we developed the initial baseline in three steps:

Optimizer selection. We tested popular optimizers in the data-limited regime, training for multiple epochs on the 100M tokens. Muon outperforms AdamW, SOAP, and MAGMA.
Scaling up. We increased model size but found diminishing returns due to the limited data. Without appropriate regularization, a 1.4B parameter model outperforms a 2.7B parameter model.
Regularization. When we scale up the parameter count while also applying heavy weight decay, we recover monotonic improvements with scale. We further find that dropout improves performance on top of weight decay. Our final model³ is a 2.7B parameter transformer, with 1.2B parameters in the transformer trunk and heavy embedding defaults from Nanochat. It is trained with dropout 0.1 and weight decay 1.6. This weight decay is very large by traditional standards, but consistent with Kim et al. (2025), who find optimal weight decay is up to 30× larger than standard practice in the data-constrained regime.
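To get a feel for why a weight decay of 1.6 counts as heavy, here is a minimal sketch assuming decoupled (AdamW-style) weight decay and a plain gradient step standing in for Muon. The learning rate of 0.01 and the comparison value of 0.1 are hypothetical numbers chosen for illustration, not settings from the repo.

```python
import math


def decoupled_wd_step(weights, grads, lr, weight_decay):
    """One update with decoupled weight decay.

    Each weight is first shrunk by the factor (1 - lr * weight_decay);
    the gradient step is then applied separately.
    """
    shrink = 1.0 - lr * weight_decay
    return [shrink * w - lr * g for w, g in zip(weights, grads)]


def steps_to_halve(lr, weight_decay):
    """Steps for a weight to halve when the gradient is zero.

    With no gradient the weight decays geometrically, so solve shrink**n = 0.5.
    """
    return math.log(0.5) / math.log(1.0 - lr * weight_decay)


LR = 0.01  # hypothetical learning rate, for illustration only

heavy = steps_to_halve(LR, weight_decay=1.6)  # value used by the baseline
light = steps_to_halve(LR, weight_decay=0.1)  # assumed conventional value

print(f"weight decay 1.6: weights halve in about {heavy:.0f} steps")
print(f"weight decay 0.1: weights halve in about {light:.0f} steps")
```

Under these assumptions the heavier setting pulls an unsupported weight toward zero roughly 16 times faster, so only weights that the gradient keeps reinforcing stay large.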

Given the strong performance by large models that are well regularized, we speculate that larger models have a strong simplicity bias, amplified by regularization.

Figure: Overparametrization. Taken from Andrew Gordon Wilson, "Deep Learning is Not So Mysterious or Different."

Why 100M tokens?

We choose 100M tokens because it is small enough to affordably try radically different learning algorithms, while large enough that the winning techniques may work at a larger scale, though the scaling behavior is an open empirical question.

¹ For practical purposes, we begin by providing an upper bound on time of 64 H100s for 7 days. For reference, nanogpt can be trained for 1 epoch in 30s, so using this amount of compute would be 100,000x the compute used by that baseline.
² Konwoo Kim, Suhas Kotha, Percy Liang, and Tatsunori Hashimoto. "Pre-training under infinite compute." arXiv:2509.14786, 2025.
³ These numbers from 02/26/26 are no longer accurate as of the latest world records. As of 04/08/26, the world record on the 1 hour track uses a 1.4B parameter model.