Modded-NanoGPT

June 19, 2026

This repository hosts the NanoGPT speedrun, in which we (collaboratively|competitively) search for the fastest algorithm to use 8 NVIDIA H100 GPUs to train a language model that attains 3.28 cross-entropy loss on the FineWeb validation set.

(Note: Besides the main track, there is also an optimization track where we try to minimize steps subject to fixed arch/data/bsz and with unlimited wallclock budget.)

The target (3.28 validation loss on FineWeb) follows Andrej Karpathy's GPT-2 replication in llm.c, which attains that loss after running for 45 minutes. The speedrun code also descends from llm.c's PyTorch trainer, which itself descends from NanoGPT, hence the name of the repo. Thanks to the efforts of many contributors, this repo now contains a training algorithm which attains the target performance in:

  • Under 90 seconds on 8xH100 (the llm.c GPT-2 replication needed 45 minutes)
  • Under 400M tokens (the llm.c GPT-2 replication needed 10B)

This improvement in training speed has been brought about by the following techniques:

  • Modernized architecture: Rotary embeddings, QK-Norm, and ReLU²
  • The Muon optimizer [writeup] [repo]
  • Use FP8 for head, and asymmetric rescale and softcap logits
  • Use FP8 on MLP up projection forward pass
  • Initialization of projections to zero (muP-like)
  • Skip connections from embedding to every block as well as from block 3 to 6
  • Extra embeddings which are mixed into the values in attention layers (inspired by Zhou et al. 2024)
  • Flash Attention 3 with long-short sliding window attention pattern (inspired by Gemma 2) and window size warmup with YaRN
  • Align training batch starts with EoS and set a max document length
  • Accumulate gradients for 2 steps for embedding and lm_head before updating parameters
  • Single activation input for last 3 attention layers
  • Polar Express implementation in Muon
  • Smear module to enable 1 token look back
  • Sparse attention gate
  • NorMuon
  • Cautious Weight Decay w/ schedule tied to LR
  • Exponential decay of residual stream
  • Batch size schedule
  • Max seq length schedule
  • Partial Key Offset
  • Multi token prediction
  • Untie embed and lm_head at 2/3 of training
  • Additional gating on value embeddings and skip connection
  • Paired head attention
  • Bigram hash embedding on 1/4 of model_dim w/ sign trick
  • MUDD skip connections to residual stream and attention values
  • Learnable XSA

As well as many systems optimizations.

Contributors list (growing with each new record): @bozavlado, @brendanh0gan, @fernbear.bsky.social, @Grad62304977, @jxbz, @kellerjordan0, @KoszarskyB, @leloykun, @YouJiacheng, @jadenj3o, @KonstantinWilleke, @alexrgilbert, @adricarda, @tuttyfrutyee, @vdlad, @ryanyang0, @vagrawal, @classiclarryd, @byronxu99, @varunneal, @EmelyanenkoK, @bernard24/https://www.hiverge.ai/, @Gusarich, @li_zichong, @akash5474, @snimu, @roeeshenberg, @ChrisJMcCormick, @dominikkallusky, @acutkosky, @manikbhandari, @andrewbriand, @jrauvola, @soren_dunn_, @photon_mz, @srashedll, @dhrvji, @EmmettBicker, @dualverse-ai, @sisovicm, @moof2x, @samacqua, @Lisennlp, @_djdumpling, @TrianX


Running the current record

To run the current record, run the following commands.

git clone https://github.com/KellerJordan/modded-nanogpt.git && cd modded-nanogpt
pip install -r requirements.txt
# downloads only the first 900M training tokens to save time
python data/cached_fineweb10B.py 9
./run.sh

If ./run.sh fails with "torchrun: command not found", add torchrun to your PATH.

Note: torch.compile will add around 7 minutes of latency the first time you run the code.

Official records are timed on 8 NVIDIA H100 GPUs from https://app.primeintellect.ai/. PrimeIntellect has generously sponsored recent validation runs.

For cases where CUDA or NCCL versions aren't compatible with your current system setup, Docker can be a helpful alternative. This approach standardizes versions for CUDA, NCCL, CUDNN, and Python, reducing dependency issues and simplifying setup. Note: an NVIDIA driver must already be installed on the system (useful if only the NVIDIA driver and Docker are available).

git clone https://github.com/KellerJordan/modded-nanogpt.git && cd modded-nanogpt
sudo docker build -t modded-nanogpt .
sudo docker run -it --rm --gpus all -v $(pwd):/modded-nanogpt modded-nanogpt python data/cached_fineweb10B.py 8
sudo docker run -it --rm --gpus all -v $(pwd):/modded-nanogpt modded-nanogpt sh run.sh

To get an interactive shell inside the Docker container, use:

sudo docker run -it --rm --gpus all -v $(pwd):/modded-nanogpt modded-nanogpt bash

World record history

The table below gives the historical progression of world speed records for this competitive task:

Train a neural network to ≤3.28 validation loss on FineWeb using 8x NVIDIA H100s.

Note: The 3.28 target was selected to match Andrej Karpathy's GPT-2 (small) reproduction.

| # | Record time | Description | Date | Log | Contributors |
|---|---|---|---|---|---|
| 1 | 45 minutes | llm.c baseline | 05/28/24 | log | @karpathy, llm.c contributors |
| 2 | 31.4 minutes | Tuned learning rate & rotary embeddings | 06/06/24 | log | @kellerjordan0 |
| 3 | 24.9 minutes | Introduced the Muon optimizer | 10/04/24 | none | @kellerjordan0, @jxbz |
| 4 | 22.3 minutes | Muon improvements | 10/11/24 | log | @kellerjordan0, @bozavlado |
| 5 | 15.2 minutes | Pad embeddings, ReLU², zero-init projections, QK-norm | 10/14/24 | log | @Grad62304977, @kellerjordan0 |
| 6 | 13.1 minutes | Distributed the overhead of Muon | 10/18/24 | log | @kellerjordan0 |
| 7 | 12.0 minutes | Upgraded PyTorch 2.5.0 | 10/18/24 | log | @kellerjordan0 |
| 8 | 10.8 minutes | Untied embedding and head | 11/03/24 | log | @Grad62304977, @kellerjordan0 |
| 9 | 8.2 minutes | Value and embedding skip connections, momentum warmup, logit softcap | 11/06/24 | log | @Grad62304977, @kellerjordan0 |
| 10 | 7.8 minutes | Bfloat16 activations | 11/08/24 | log | @kellerjordan0 |
| 11 | 7.2 minutes | U-net pattern skip connections & double lr | 11/10/24 | log | @brendanh0gan |
| 12 | 5.03 minutes | 1024-ctx dense causal attention → 64K-ctx FlexAttention | 11/19/24 | log | @KoszarskyB |
| 13 | 4.66 minutes | Attention window warmup | 11/24/24 | log | @fernbear.bsky.social |
| 14 | 4.41 minutes | Value Embeddings | 12/04/24 | log | @KoszarskyB |
| 15 | 3.95 minutes | U-net pattern value embeddings, assorted code optimizations | 12/08/24 | log | @leloykun, @YouJiacheng |
| 16 | 3.80 minutes | Split value embeddings, block sliding window, separate block mask | 12/10/24 | log | @YouJiacheng |
| 17 | 3.57 minutes | Sparsify value embeddings, improve rotary embeddings, drop an attn layer | 12/17/24 | log | @YouJiacheng |
| 18 | 3.4 minutes | Lower logit softcap from 30 to 15 | 01/04/25 | log | @KoszarskyB |
| 19 | 3.142 minutes | FP8 head, offset logits, lr decay to 0.1 instead of 0.0 | 01/13/25 | log | @YouJiacheng |
| 20 | 2.992 minutes | Merged QKV weights, long-short attention, attention scale, lower Adam epsilon, batched Muon | 01/16/25 | log | @leloykun, @fernbear.bsky.social, @YouJiacheng, @brendanh0gan, @scottjmaddox, @Grad62304977 |
| 21 | 2.933 minutes | Reduced batch size | 01/26/25 | log | @leloykun |
| 21 | 2.997 minutes | 21st record with new timing | 02/01/25 | log | not a new record, just re-timing #21 with the updated rules |
| 21 | 3.014 minutes | 21st record with latest torch | 05/24/25 | log | not a new record, just re-timing #21 with latest torch |
| 22 | 2.990 minutes | Faster gradient all-reduce | 05/24/25 | log | @KonstantinWilleke, @alexrgilbert, @adricarda, @tuttyfrutyee, @vdlad; The Enigma project |
| 23 | 2.979 minutes | Overlap computation and gradient communication | 05/25/25 | log | @ryanyang0 |
| 24 | 2.966 minutes | Replace gradient all_reduce with reduce_scatter | 05/30/25 | log | @vagrawal |
| 25 | 2.896 minutes | Upgrade PyTorch to 2.9.0.dev20250713+cu126 | 07/13/25 | log | @kellerjordan0 |
| 26 | 2.863 minutes | Align training batch starts with EoS, increase cooldown frac to .45 | 07/13/25 | log | @classiclarryd |
| 27 | 2.817 minutes | Transpose one of the MLP matrices + add Triton kernel for symmetric matmul | 07/18/25 | log, PR | @byronxu99 |
| 28 | 2.812 minutes | Sparse attention gate | 08/23/25 | log, PR | @classiclarryd |
| 29 | 2.731 minutes | Flash Attention 3, 2048 max_doc_len, update ws schedule | 09/03/25 | log, PR | @varunneal |
| 30 | 2.717 minutes | Drop first MLP layer | 09/05/25 | log, PR | @EmelyanenkoK |
| 31 | 2.656 minutes | Dynamically incorporate YaRN during training and validation | 09/10/25 | log, PR | @classiclarryd |
| 32 | 2.625 minutes | Optimize distributed training, improve skip connection gating, and enhance bfloat16 usage | 09/11/25 | log, PR | @bernard24 & AI system hiverge.ai |
| 33 | 2.565 minutes | Asynchronously fetch and index data batches, extend final layer attention window for validation | 09/15/25 | log, PR | @classiclarryd |
| 34 | 2.547 minutes | Smear token embeddings 1 position forward | 09/18/25 | log, PR | @classiclarryd |
| 35 | 2.527 minutes | Drop first attn layer, extend all long windows for validation, update schedule | 09/21/25 | log, PR | @classiclarryd |
| 36 | 2.495 minutes | MuonCustomSizing, perform mlp and attn reduce scatter in shared call | 09/23/25 | log, PR | @classiclarryd |
| 37 | 2.483 minutes | Compute cross entropy in BF16 during training | 09/27/25 | log, PR | @Gusarich |
| 38 | 2.476 minutes | Polar Express, replacement for Newton-Schulz | 09/29/25 | log, PR | @varunneal |
| 39 | 2.447 minutes | Only update Adam params every other step, reduce batch size | 09/30/25 | log, PR | @classiclarryd |
| 40 | 2.358 minutes | Backout, misc hyperparameter tuning, optimize lambda padding | 10/04/25 | log, PR | @classiclarryd |
| 41 | 2.345 minutes | NorMuon | 10/24/25 | log, PR | @li_zichong |
| 42 | 2.313 minutes | Update NorMuon LR, Step Logic | 10/27/25 | log, PR | @varunneal |
| 43 | 2.284 minutes | Cautious Weight Decay w/ schedule | 11/10/25 | log, PR | @varunneal |
| 44 | 2.269 minutes | Backward hooks on Adam, Profiling 101 | 11/16/25 | log, PR | @akash5474 |
| 45 | 2.248 minutes | Refine skip arch, update exponential decay init | 11/18/25 | log, PR | @classiclarryd |
| 46 | 2.203 minutes | Batch size schedule | 11/29/25 | log, PR | @varunneal |
| 47 | 2.193 minutes | Multiply attn lambda with weight instead of data, fix warmup | 12/10/25 | log, PR | @roeeshenberg |
| 48 | 2.170 minutes | Speed up Muon, additional pre-multiply lambda, reshape matrices, update lr, update NorMuon axis | 12/11/25 | log, PR | @ChrisJMcCormick |
| 49 | 2.146 minutes | Partial Key Offset | 12/14/25 | log, PR | @classiclarryd |
| 50 | 2.128 minutes | Extend Cautious Weight Decay to Adam parameters | 12/18/25 | log, PR | @roeeshenberg |
| 51 | 2.075 minutes | Retie Embed to lm_head, retune fp8 scales | 12/19/25 | log, PR | @varunneal |
| 52 | 2.037 minutes | Smooth scalars via beta increase, decrease smear gate lr, freeze scalars during transitions, adam all reduce | 12/21/25 | log, PR | @ChrisJMcCormick |
| 53 | 1.988 minutes | Multi-token prediction, untie embed/lm_head at 2/3 training, lr update, tweak CWD | 12/22/25 | log, PR | @varunneal, feat. @classiclarryd |
| 54 | 1.940 minutes | Asymmetric Logit Rescale | 12/26/25 | log, PR | @classiclarryd |
| 55 | 1.918 minutes | Gates on value embeds and skip connection | 12/29/25 | log, PR | @classiclarryd |
| 56 | 1.894 minutes | Optimize and compile Adam, increase Adam buffer precision, move gates from Muon to Adam parameter banks | 12/31/25 | log, PR | @ChrisJMcCormick |
| 57 | 1.878 minutes | Bfloat16 attn/mlp weights, mixed precision Muon, interweave Adam/Muon, finer-grain Adam beta | 01/04/26 | log, PR | @classiclarryd, feat. @YouJiacheng, @ChrisJMcCormick |
| 58 | 1.820 minutes | Paired Head Attention | 01/07/26 | log, PR | @classiclarryd |
| 59 | 1.781 minutes | Fused triton kernel for linear relu square MLP step | 01/10/26 | log, PR | @andrewbriand8, @Joshrav21 |
| 60 | 1.765 minutes | Fused triton kernel for softcapped multi-token prediction cross entropy step | 01/16/26 | log, PR | @soren_dunn_ & AI System Locus |
| 61 | 1.748 minutes | Unified Optimizers and Transposed LM Head | 01/18/26 | log, PR | @ChrisJMcCormick |
| 62 | 1.655 minutes | Bigram Hash Embedding | 01/19/26 | log, PR | @classiclarryd |
| 63 | 1.650 minutes | Untie Value Embeds | 01/26/26 | log, PR | @photon_mz |
| 64 | 1.630 minutes | Tuned nonzero Attn V and O init | 01/30/26 | log, PR | @srashedll |
| 65 | 1.613 minutes | Group Value Embeds into single parameter | 01/30/26 | log, PR | @varunneal |
| 66 | 1.595 minutes | Torch 2.10 | 01/31/26 | - | - |
| 67 | 1.540 minutes | Tune fused softcap kernels and fuse fp8 quantization in LM head | 01/31/26 | log, PR | @andrewbriand8 |
| 68 | 1.535 minutes | Move bigram hash to GPU | 01/31/26 | log, PR | @dhrvji |
| 69 | 1.528 minutes | Kernel Optimizations | 02/02/26 | log, PR | @EmmettBicker & AI System Aster |
| 70 | 1.521 minutes | Tune value embed layout and ve_gates | 02/03/26 | log, PR | @photon_mz |
| 71 | 1.516 minutes | Sparse bigram gradient comms and optimized loading on CPU | 02/06/26 | log, PR | @roeeshenberg |
| 72 | 1.496 minutes | Increase minimum lr and add max_seq_len schedule | 02/10/26 | log, PR | @dualverse-ai & AI System Station |
| 73 | 1.485 minutes | Partitioned Hyperconnections | 02/12/26 | log, PR | @sisovicm |
| 74 | 1.468 minutes | Flattened GPT forward, removed post attention lambdas, added transpose kernels | 02/16/26 | log, PR | @ChrisJMcCormick |
| 75 | 1.453 minutes | Cross Entropy Kernel Optimizations | 02/23/26 | log, PR | @moof2x |
| 76 | 1.446 minutes | Reuse and tune backward transpose kernel | 02/28/26 | log, PR | @samacqua |
| 77 | 1.435 minutes | Replace partitioned hyperconnections with single saved activation | 03/06/26 | log, PR | @classiclarryd |
| 78 | 1.426 minutes | Tighten bounds on fa3 max_num_docs to match fineweb distribution | 03/22/26 | log, PR | @ChrisJMcCormick |
| 79 | 1.411 minutes | Fuse Cross Entropy Fwd/Bwk Kernel, to avoid recalc on softcap sigmoid | 04/04/26 | log, PR | @andrewbriand8 |
| 80 | 1.406 minutes | In Muon orthogonize Q and K matrices in pairs of heads, instead of across the full 6 head matrix | 04/08/26 | log, PR | @samacqua |
| 81 | 1.363 minutes | MUDD Skip Connections | 04/22/26 | log, PR | @Lisennlp |
| 82 | 1.353 minutes | Learnable XSA | 04/29/26 | log, PR | @_djdumpling |
| 83 | 1.328 minutes | Sign Trick on Bigram Embed | 05/20/26 | log, PR | @TrianX |
| 84 | 1.320 minutes | FP8 on MLP up-projection forward pass | 05/21/26 | log, PR | @sisovicm |

Rules

New records must:

  1. Not modify the train or validation data pipelines. (You can change the batch size, sequence length, attention structure etc.; just don't change the underlying streams of tokens.)
  2. Attain ≤3.28 mean val loss. (Due to inter-run variance, submissions must provide enough run logs to attain a statistical significance level of p<0.01 that their mean val loss is ≤3.28. Example code to compute p-value can be found here. For submissions which improve speed by optimizing the systems performance, without touching the ML, this requirement is waived.)
  3. Not use any extra torch._inductor.config or torch.compile flags. (These can save a few seconds, but they can also make compilation take >30min. This rule was introduced after the 21st record.)
  4. Run faster than the prior record when baselined on the same hardware.
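As a rough illustration of the significance requirement in rule 2, the sketch below runs a one-sided, one-sample t-test on per-run final validation losses. It is an assumption about how such a check could be done, not necessarily what the linked example code does. The function names and the sample losses are hypothetical, and the Student-t CDF is integrated numerically so that only the standard library is needed.

```python
import math
import statistics


def student_t_pdf(t, df):
    # Density of Student's t distribution; lgamma avoids overflow for large df.
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_coef) / math.sqrt(df * math.pi) * (1 + t * t / df) ** (-(df + 1) / 2)


def student_t_cdf(t, df, n=2000):
    # CDF via Simpson's rule on [0, |t|], using symmetry around zero (n must be even).
    if t == 0:
        return 0.5
    upper = abs(t)
    h = upper / n
    total = student_t_pdf(0.0, df) + student_t_pdf(upper, df)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * student_t_pdf(i * h, df)
    area = total * h / 3
    value = 0.5 + area if t > 0 else 0.5 - area
    return min(1.0, max(0.0, value))


def p_value_mean_below(losses, target=3.28):
    # One-sided test of H0: mean loss >= target against H1: mean loss < target.
    # Needs at least two runs whose losses are not all identical.
    n = len(losses)
    mean = statistics.fmean(losses)
    sem = statistics.stdev(losses) / math.sqrt(n)
    t_stat = (mean - target) / sem
    return student_t_cdf(t_stat, df=n - 1)


# Hypothetical final val losses from six runs of one submission.
example_losses = [3.2781, 3.2790, 3.2775, 3.2786, 3.2779, 3.2784]
print(f"p = {p_value_mean_below(example_losses):.5f}")
```

A submission would pass this particular check when the printed p-value is below 0.01. More runs or a lower mean loss both push the p-value down.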

Discretionary reasons why a PR may not be accepted:

  1. Disproportionately degrades the readability of the codebase. A 200 line kernel to drop 300ms is considered worthwhile. 500 lines that convolute the optimizer layout for a 50ms gain will likely be rejected.
  2. The current record is intentionally kept roughly 0.001-0.002 loss below 3.28 to make validation simpler. If a PR substantially consumes this buffer, it should do so in a way that outperforms a simple step count decrease, when measured at equivalent loss.

Note: torch._inductor.config.coordinate_descent_tuning is allowed for GPT-2 Medium track (a.k.a. 2.92 track).

Other than that, anything and everything is fair game!

further clarifications


Comment on the target metric

The target metric is cross-entropy loss on the FineWeb val set. To speak mathematically, the goal of the speedrun is to obtain a probability model of language which assigns a probability of at least math.exp(-3.28 * 10485760) to the first 10,485,760 tokens of the FineWeb valset. Hence, e.g., we allow evaluation at any sequence length, so long as we still have a valid probability model of language.
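To make the link between mean loss and sequence probability concrete, here is a small sketch. The helper names are hypothetical, and it assumes losses are in nats (natural log), as math.exp implies. The probability itself underflows to 0.0 in floating point, so the comparison is done in log space.

```python
import math

VAL_TOKENS = 10_485_760  # number of validation tokens named in the text
TARGET_LOSS = 3.28       # target mean cross-entropy, in nats per token


def mean_cross_entropy(token_log_probs):
    # Mean negative log-likelihood over the tokens of a sequence.
    return -sum(token_log_probs) / len(token_log_probs)


def total_log_prob(mean_loss, num_tokens):
    # log P(sequence) = -(mean loss) * (number of tokens)
    return -mean_loss * num_tokens


def meets_target(token_log_probs, target=TARGET_LOSS):
    # Equivalent to mean_cross_entropy(...) <= target, expressed on the
    # total log-probability the model assigns to the sequence.
    return sum(token_log_probs) >= total_log_prob(target, len(token_log_probs))


threshold = total_log_prob(TARGET_LOSS, VAL_TOKENS)
print("required log-probability:", threshold)
print("as a float probability:", math.exp(threshold))  # underflows to 0.0
```

Because the criterion depends only on the total log-probability of the token stream, any evaluation sequence length yields a comparable number as long as the model still defines a valid distribution over that stream.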


Timing change after record 21

After the 21st record, we made two changes to the timing. First, there used to be an initial "grace period" of 10 untimed steps to allow kernel warmup. We replaced this with an explicit kernel-warmup section which is untimed and uses dummy data. This results in an extra runtime of 850ms from the 10 extra timed steps. Second, we banned the use of torch._inductor.config.coordinate_descent_tuning. This saves ~25min of untimed pre-run compilation, but results in an extra runtime of ~3s.


Notable attempts & forks

Notable runs:

  • @alexjc's 01/20/2025 2.77-minute TokenMonster-based record. This record is technically outside the rules of the speedrun, since we specified that the train/val tokens must be kept fixed. However, it's very interesting, and worth including. The run is not more data-efficient; rather, the speedup comes from the improved tokenizer allowing the vocabulary size to be reduced (nearly halved!) while preserving the same bytes-per-token, which saves lots of parameters and FLOPs in the head and embeddings.
  • @samacqua's 1/23/2026 test time training run. Sam found that prediction accuracy on the later portions of a given document could be improved by performing a training update on Adam parameters based on the early portion of the document. This 'parameter nudging' is repeated independently for each document. Interestingly, these gradient updates prove effective while only using ~500 tokens, substantially less than the over 200k tokens typically used on a normal training step. While technically a valid probability model, we are not allowing untimed backward passes.

Notable forks:


Speedrun track 2: GPT-2 Medium

The target loss for this track is lowered from 3.28 to 2.92, as per Andrej Karpathy's 350M-parameter llm.c baseline. This baseline generates a model with performance similar to the original GPT-2 Medium, whereas the first track's baseline generates a model on par with GPT-2 Small. All other rules remain the same.

Note: torch._inductor.config.coordinate_descent_tuning is turned on after record 6 (*).

| # | Record time | Description | Date | Log | Contributors |
|---|---|---|---|---|---|
| 1 | 5.8 hours | llm.c baseline (350M parameters) | 05/28/24 | log | @karpathy, llm.c contributors |
| 2 | 29.3 minutes | Initial record based on scaling up the GPT-2 small track speedrun | 01/18/25 | log | @kellerjordan0 |
| 3 | 28.1 minutes | Added standard weight decay | 02/08/25 | log | @kellerjordan0 |
| 4 | 27.7 minutes | Tuned Muon Newton-Schulz coefficients | 02/14/25 | log | @leloykun |
| 5 | 27.2 minutes | Increased learning rate cooldown phase duration | 03/06/25 | log | @YouJiacheng |
| 6 | 25.95 minutes* | 2x MLP wd, qkv norm, all_reduce/opt.step() overlap, optimized skip pattern | 03/25/25 | log | @YouJiacheng |
| 7 | 25.29 minutes | Remove FP8 head; ISRU logits softcap; New sharded mixed precision Muon; merge weights | 04/16/25 | log | @YouJiacheng |
| 8 | 24.50 minutes | Cubic sliding window size schedule, 2× max window size (24.84 minutes) 24.5min repro | 04/22/25 | log | @jadenj3o |
| 9 | 24.12 minutes | Add two value embeddings | 08/28/25 | log, PR | @snimu |
| 10 | 24.07 minutes | Second input embedding | 09/11/25 | log, PR | @snimu |
| 11 | 23.45 minutes | Upgrade from torch 2.7 to torch==2.10.0.dev20251210+cu126 | - | - | - |
| 12 | 23.28 minutes | Snoo Optimizer (Outer optimizer around Adam and Muon) | 09/16/25 | log, PR | @dominikkallusky |
| 13 | 23.14 minutes | EMA Wrapper on Muon | 09/17/25 | log, PR | @acutkosky |
| 14 | 23.08 minutes | Combine both records 12 & 13 | 09/30/25 | log, PR | @acutkosky |
| 15 | 23.03 minutes | Backout (Skip from 2/3 point to pre-lm_head) | 10/04/25 | log, PR | @snimu |
| 16 | 22.99 minutes | Smear-MTP | 11/02/25 | log, PR | @snimu |
| 17 | 22.98 minutes | Remove Redundant Mask Op | 11/12/25 | log, PR | @manikbhandari |
| 18 | 17.35 minutes | Bulk transfer short track features | 12/31/25 | log, PR | - |

Q: What is the point of NanoGPT speedrunning?

A: The officially stated goal of NanoGPT speedrunning is as follows: gotta go fast. But for something a little more verbose involving an argument for good benchmarking, here's some kind of manifesto, adorned with a blessing from the master. https://x.com/karpathy/status/1846790537262571739

Q: What makes "NanoGPT speedrunning" not just another idiosyncratic benchmark?

A: Because it is a competitive benchmark. In particular, if you attain a new speed record (using whatever method you want), there is an open invitation for you to post that record (on arXiv or X) and thereby vacuum up all the clout for yourself. I will even help you do it by reposting you as much as I can.

"Artificial intelligence advances by inventing games and gloating to goad others to play" - Professor Ben Recht

Q: NanoGPT speedrunning is cool and all, but meh it probably won't scale and is just overfitting to val loss

A: This is hard to refute, since "at scale" is an infinite category (what if the methods stop working only for >100T models?), making it impossible to fully prove. Also, I would agree that some of the methods used in the speedrun are unlikely to scale, particularly those which impose additional structure on the network, such as logit softcapping. But if the reader cares about 1.5B models, they might be convinced by this result:

Straightforwardly scaling up the speedrun (10/18/24 version) to 1.5B parameters yields a model with GPT-2 (1.5B)-level HellaSwag performance 2.5x more cheaply than @karpathy's baseline ($233 instead of $576):

[reproducible log]


Muon optimizer

Muon is defined as follows:

Here NewtonSchulz5 is the following Newton-Schulz iteration [2, 3], which approximately replaces G with U @ V.T, where U, S, V = G.svd().

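import torch

@torch.compile
def zeroth_power_via_newtonschulz5(G, steps=5, eps=1e-7):
    # Approximately orthogonalize the 2D matrix G (approximates U @ V.T from its SVD).
    assert len(G.shape) == 2
    # Coefficients of the quintic applied to each singular value
    a, b, c = (3.4445, -4.7750, 2.0315)
    # Dividing by the Frobenius norm bounds every singular value by 1
    X = G.bfloat16() / (G.norm() + eps)
    # Iterate on the wide orientation so X @ X.T is the smaller Gram matrix
    if G.size(0) > G.size(1):
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X  # singular value s maps to a*s + b*s^3 + c*s^5
    if G.size(0) > G.size(1):
        X = X.T
    return X.to(G.dtype)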
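To show where the orthogonalization sits inside an optimizer step, here is a dependency-free sketch of one Muon-style update on a single weight matrix stored as nested lists. It assumes a common formulation (momentum buffer, Nesterov-style lookahead, orthogonalize, then step). The helper names, the learning rate and the momentum value are illustrative assumptions, and it runs in full float precision rather than bfloat16, so it is not the repository's implementation.

```python
import math


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def newtonschulz5(G, steps=5, eps=1e-7):
    # Pure-Python mirror of the quintic Newton-Schulz iteration.
    a, b, c = 3.4445, -4.7750, 2.0315
    fro = math.sqrt(sum(g * g for row in G for g in row))
    X = [[g / (fro + eps) for g in row] for row in G]
    tall = len(G) > len(G[0])
    if tall:
        X = transpose(X)
    for _ in range(steps):
        A = matmul(X, transpose(X))
        AA = matmul(A, A)
        B = [[b * x + c * y for x, y in zip(ra, raa)] for ra, raa in zip(A, AA)]
        BX = matmul(B, X)
        X = [[a * x + y for x, y in zip(rx, rbx)] for rx, rbx in zip(X, BX)]
    return transpose(X) if tall else X


def muon_step(W, grad, buf, lr=0.02, momentum=0.95):
    # lr and momentum are illustrative values (assumptions), not tuned settings.
    # 1) Accumulate momentum in place: buf <- momentum * buf + grad
    for i, row in enumerate(grad):
        for j, g in enumerate(row):
            buf[i][j] = momentum * buf[i][j] + g
    # 2) Nesterov-style lookahead: grad + momentum * buf
    look = [[g + momentum * m for g, m in zip(rg, rb)] for rg, rb in zip(grad, buf)]
    # 3) Orthogonalize after momentum, then take the step
    update = newtonschulz5(look)
    for i, row in enumerate(update):
        for j, u in enumerate(row):
            W[i][j] -= lr * u
    return W
```

Calling muon_step(W, grad, buf) with W and buf as zero matrices of the same shape as grad performs a single update in place. The update has roughly unit singular values regardless of the gradient's scale, which is the effect the orthogonalization is meant to provide.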

For this training scenario, Muon has the following favorable properties:

  • Lower memory usage than Adam
  • ~1.5x better sample-efficiency
  • <2% wallclock overhead

Provenance

Many of the choices made to generate this optimizer were obtained experimentally by our pursuit of CIFAR-10 speedrunning. In particular, we experimentally obtained the following practices:

  • Using Nesterov momentum inside the update, with orthogonalization applied after momentum.
  • Using a specifically quintic Newton-Schulz iteration as the method of orthogonalization.
  • Using non-convergent coefficients for the quintic polynomial in order to maximize slope at zero, and thereby minimize the number of necessary Newton-Schulz iterations. It turns out that the variance doesn't actually matter that much, so we end up with a quintic that rapidly converges to the range (0.68, 1.13) upon repeated application, rather than converging more slowly to 1.
  • Running the Newton-Schulz iteration in bfloat16 (whereas Shampoo implementations often depend on inverse-pth-roots run in fp32 or fp64).
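The claim about the quintic's range can be checked on scalars, because each Newton-Schulz step maps a singular value s to a*s + b*s^3 + c*s^5. The sketch below uses the coefficients from the code above; the function names are hypothetical. It illustrates the polynomial only and does not reproduce how the coefficients were originally found.

```python
A, B, C = 3.4445, -4.7750, 2.0315  # coefficients from the iteration above


def quintic(s):
    # Action of one Newton-Schulz step on a single singular value.
    return A * s + B * s**3 + C * s**5


def iterate(s, steps):
    for _ in range(steps):
        s = quintic(s)
    return s


# Near zero the map is roughly multiplication by A, the slope at zero,
# so small singular values grow by about 3.44x per step.
print(quintic(1e-3) / 1e-3)

# Values in the band are mapped back into (roughly) the same band,
# oscillating rather than converging to exactly 1.
for s0 in (1.0, 0.3, 0.01):
    print(s0, [round(iterate(s0, k), 3) for k in range(1, 6)])
```

The first step from s = 1 gives approximately 0.701, and later iterates bounce around inside the band instead of settling at 1, which is the trade the text describes: a steep slope at zero in exchange for a non-convergent tail.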

Our use of a Newton-Schulz iteration for orthogonalization traces to Bernstein & Newhouse (2024), who suggested it as a way to compute Shampoo [5, 6] preconditioners, and theoretically explored Shampoo without preconditioner accumulation. In particular, Jeremy Bernstein @jxbz sent us the draft, which caused us to experiment with various Newton-Schulz iterations as the orthogonalization method for this optimizer. If we had used SVD instead of a Newton-Schulz iteration, this optimizer would have been too slow to be useful. Bernstein & Newhouse also pointed out that Shampoo without preconditioner accumulation is equivalent to steepest descent in the spectral norm, and therefore Shampoo can be thought of as a way to smooth out spectral steepest descent. The proposed optimizer can be thought of as a second way of smoothing spectral steepest descent, with a different set of memory and runtime tradeoffs compared to Shampoo.


Running on fewer GPUs

  • To run experiments on fewer GPUs, simply modify run.sh to have a different --nproc_per_node. This should not change the behavior of the training.
  • If you're running out of memory, you may need to reduce the sequence length for FlexAttention (which does change the training; see here for a guide).
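One way a training script can keep its behavior unchanged on fewer GPUs is to compensate with gradient accumulation, so that the effective batch per optimizer step stays the same. The sketch below is an assumption about that mechanism, with a hypothetical function name; check the training script itself for what it actually does.

```python
def grad_accum_steps(world_size, reference_gpus=8):
    # Number of micro-batches each GPU accumulates before an optimizer step,
    # so that world_size * steps matches the reference 8-GPU setup.
    # (Assumed scheme for illustration; not taken from the repository.)
    if world_size <= 0 or reference_gpus % world_size != 0:
        raise ValueError(f"world_size must be a positive divisor of {reference_gpus}")
    return reference_gpus // world_size


for gpus in (8, 4, 2, 1):
    print(gpus, "GPU(s) ->", grad_accum_steps(gpus), "accumulation step(s)")
```

Under this scheme only divisors of 8 are valid values for --nproc_per_node, and wallclock time grows roughly in proportion to the number of accumulation steps.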

References

  1. Guilherme Penedo et al. "The fineweb datasets: Decanting the web for the finest text data at scale." arXiv preprint arXiv:2406.17557 (2024).
  2. Nicholas J. Higham. Functions of Matrices. Society for Industrial and Applied Mathematics (2008). Equation 5.22.
  3. Günther Schulz. Iterative Berechnung der reziproken Matrix. Z. Angew. Math. Mech., 13:57–59 (1933).
  4. Jeremy Bernstein and Laker Newhouse. "Old Optimizer, New Norm: An Anthology." arXiv preprint arXiv:2409.20325 (2024).
  5. Vineet Gupta, Tomer Koren, and Yoram Singer. "Shampoo: Preconditioned stochastic tensor optimization." International Conference on Machine Learning. PMLR, 2018.
  6. Rohan Anil et al. "Scalable second order optimization for deep learning." arXiv preprint arXiv:2002.09018 (2020).
  7. Alexander Hägele et al. "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations." arXiv preprint arXiv:2405.18392 (2024).
  8. Zhanchao Zhou et al. "Value Residual Learning For Alleviating Attention Concentration In Transformers." arXiv preprint arXiv:2410.17897 (2024).
  9. Team, Gemma, et al. "Gemma 2: Improving open language models at a practical size." arXiv preprint arXiv:2408.00118 (2024).
  10. Alec Radford et al. "Language models are unsupervised multitask learners." OpenAI blog 1.8 (2019).

Citation

@misc{modded_nanogpt_2024,
  author       = {Keller Jordan and Jeremy Bernstein and Brendan Rappazzo and
                  @fernbear.bsky.social and Boza Vlado and You Jiacheng and
                  Franz Cesista and Braden Koszarsky and @Grad62304977},
  title        = {modded-nanogpt: Speedrunning the NanoGPT baseline},
  year         = {2024},
  url          = {https://github.com/KellerJordan/modded-nanogpt}
}