Modern LLM Notebook

June 3, 2026

Build modern LLMs from scratch through 23 runnable Jupyter Notebooks.

Case-based LLM learning: course map -> concrete notebook -> runnable experiment.

Start from the full bilingual course map, then choose a focused path through foundations, training, inference, frontiers, and production topics.

Each case keeps the learning loop visible: intuition, hand calculation, implementation, experiment, outline navigation, and one-click Colab access.

Overview

Modern LLM Notebook is a hands-on course for building modern LLM systems from the ground up in PyTorch. Instead of treating the model as a black box, you implement the core pieces yourself: tokenizers, embeddings, attention, Transformer blocks, training loops, MoE, LoRA, RLHF, decoding, KV Cache, long context, VLMs, evaluation, distillation, and on-policy distillation.

The repository ships with a full English notebook mirror under notebooks-en/. The web viewer supports language switching from the home page and the notebook sidebar (or via ?lang=en in the URL), so both the curriculum and the browsing experience stay bilingual end to end.

The project is designed as an educational reference implementation. It is not a model zoo, not a production serving framework, and not a wrapper around hosted APIs. Its purpose is to make the internal machinery of LLMs legible to engineers who want to reason from first principles.

Each notebook follows the same learning contract:

intuition -> hand calculation -> implementation -> experiment

That contract matters. A reader should not only know that BPE merges frequent pairs, or that KV Cache speeds up generation. They should be able to trace the numbers, write the minimal code, and explain why the behavior appears.
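
As a concrete instance of "trace the numbers", here is a minimal sketch of a single BPE merge step on a toy corpus. The corpus, the function names, and the word-as-tuple representation are illustrative assumptions, not the notebook's `BPETokenizer`.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over a {symbol-tuple: frequency} corpus."""
    counts = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += freq
    return counts.most_common(1)[0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus (hypothetical): each word is split into characters, with a count.
corpus = {("l", "o", "w"): 5, ("l", "o", "t"): 2, ("n", "e", "w"): 3}
pair, count = most_frequent_pair(corpus)   # ("l", "o") appears 5 + 2 = 7 times
corpus_after = merge_pair(corpus, pair)    # "l" and "o" become the symbol "lo"
```

Repeating these two calls until a target vocabulary size is reached yields the ordered list of merge rules.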

What You Will Build

By the end, you will have implemented a compact version of the systems that power modern LLMs:

| Stage | You build | Why it matters |
| --- | --- | --- |
| Text to tokens | Character, word, and BPE tokenizers | See exactly how raw text becomes model input |
| Tokens to vectors | Token embeddings and position encodings | Understand what the model can compute over |
| Transformer core | Self-Attention, Multi-Head Attention, Transformer blocks, Mini-GPT | Reconstruct the core forward pass |
| Training system | Cross-Entropy, batching, gradient flow, scaling-law intuition | Connect loss curves to real model behavior |
| Adaptation | LoRA, continued pretraining, reward modeling, PPO/DPO style objectives | Learn how base models become useful assistants |
| Inference system | Sampling, beam search, KV Cache, speculative decoding | Understand why serving is a systems problem |
| Frontiers | Long context, CoT experiments, VLM patch embeddings and cross-attention | Turn newer papers into small runnable examples |
| Production loop | Evaluation, win-rate matrices, distillation, OPD | Measure, compress, and improve model behavior |

raw text -> tokens -> embeddings -> attention -> Transformer -> Mini-GPT
         -> training -> alignment -> inference -> evaluation -> distillation

Why This Project

LLM education often falls into two extremes.

Some resources are mathematically precise but hard to approach: they introduce formulas before the reader understands the problem being solved. Others are easy to run but heavily abstracted: the important ideas disappear behind a library call.

Modern LLM Notebook takes the middle path. It treats modern LLMs as systems that can be decomposed, tested, and rebuilt piece by piece. The goal is not to replace papers or production libraries. The goal is to give you the mental model needed to read those papers and use those libraries with judgment.

Use this project if you want to:

  • Understand the data flow from raw text to logits.
  • Build a small GPT-style model without treating the architecture as a black box.
  • See how training objectives, data quality, and scaling laws connect.
  • Learn why inference systems need KV Cache, batching, memory planning, and speculative decoding.
  • Connect recent research topics such as MoE, long context, CoT, VLMs, RLHF, DPO, and distillation back to small runnable examples.
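
To make the KV Cache point above tangible, the following pure-Python toy compares single-head attention during generation with and without a cache. It is a sketch under assumptions: the weights and inputs are made up, all names are hypothetical, and only key/value projections are counted as "work".

```python
import math

def project(x, w):
    """Multiply vector x by matrix w, given as a list of rows (d_in x d_out)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*w)]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over all cached positions."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

def generate_without_cache(tokens, wq, wk, wv):
    outputs, projections = [], 0
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]
        keys = [project(x, wk) for x in prefix]     # recomputed at every step
        values = [project(x, wv) for x in prefix]
        projections += 2 * t
        outputs.append(attend(project(prefix[-1], wq), keys, values))
    return outputs, projections

def generate_with_cache(tokens, wq, wk, wv):
    outputs, projections, k_cache, v_cache = [], 0, [], []
    for x in tokens:
        k_cache.append(project(x, wk))              # only the new token
        v_cache.append(project(x, wv))
        projections += 2
        outputs.append(attend(project(x, wq), k_cache, v_cache))
    return outputs, projections

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
wq = [[1.0, 0.0], [0.0, 1.0]]
wk = [[0.5, 0.1], [0.2, 0.3]]
wv = [[1.0, 2.0], [3.0, 4.0]]
slow_out, slow_cost = generate_without_cache(tokens, wq, wk, wv)
fast_out, fast_cost = generate_with_cache(tokens, wq, wk, wv)
```

Both paths produce the same outputs, but the number of key/value projections grows quadratically with sequence length without the cache and linearly with it. The price is the memory held by the cache, which is why memory planning appears in the same bullet.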

What Is Included

| Area | Topics | Reference implementations |
| --- | --- | --- |
| Foundations | Tokenization, BPE, embeddings, position encoding | CharTokenizer, WordTokenizer, BPETokenizer, TokenEmbedding |
| Transformer core | Self-Attention, Multi-Head Attention, Transformer block | MultiHeadAttention, TransformerBlock, MiniGPT |
| GPT-2 to modern models | RMSNorm, SwiGLU, RoPE, GQA, QK-Norm, MLA, MoE | RMSNorm, SwiGLU, RoPE, GroupedQueryAttention, MultiHeadLatentAttention, MoELayer |
| Training | Loss, optimization, scaling laws, data engineering, MTP, FIM | Training loop, gradient accumulation, MinHash deduplication, Multi-Token Prediction, Fill-in-the-Middle |
| Adaptation and alignment | LoRA, reward modeling, PPO, DPO | LoraLinear, reward model loss, PPO clip, DPO loss |
| Inference | Sampling, beam search, KV Cache, speculative decoding | Top-k, Top-p, beam search, AttentionWithKVCache |
| Frontiers | Long context, reasoning traces, VLM, Sliding Window Attention | RoPE extrapolation, Self-Consistency, Cross-Attention, Sliding Window mask |
| Production concepts | Evaluation, distillation, on-policy distillation | Win-rate matrices, soft labels, KL estimators |
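
To give a feel for the scale of these reference implementations, here is a pure-Python sketch of the RMSNorm computation named in the table. It is not the repository's `RMSNorm` module (which would be a PyTorch layer); the function name, the list-based input, and the `eps` default are assumptions for illustration.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a per-dimension gain."""
    if gain is None:
        gain = [1.0] * len(x)          # an untrained gain acts as the identity
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]

y = rms_norm([3.0, 4.0])               # mean square is (9 + 16) / 2 = 12.5
mean_square_after = sum(v * v for v in y) / len(y)
```

Unlike LayerNorm, no mean is subtracted: the vector is only rescaled so that its mean square is close to 1.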

What This Project Is Not

This repository intentionally avoids several things so the learning path stays clear:

  • It is not a production LLM framework.
  • It is not optimized for maximum throughput or distributed training.
  • It does not provide pretrained model weights.
  • It does not use transformers as a shortcut for core implementations.
  • It does not assume the reader already knows the terminology.

Some dependencies such as transformers and datasets may appear in the environment for comparison or utility work, but the teaching path keeps the core algorithms explicit.

Quick Start

Python notebooks

git clone https://github.com/walkinglabs/modern-llm-notebook.git
cd modern-llm-notebook
pip install -r requirements.txt
jupyter notebook notebooks-en/part1-foundation/01-tokenizer-basics.ipynb

Language note:

  • Chinese notebooks live in notebooks/
  • English notebooks live in notebooks-en/ (complete 23/23 translation coverage)

Recommended environment:

  • Python 3.9+
  • PyTorch 2.0+
  • NumPy, Matplotlib, Jupyter
  • 16 GB RAM

Most notebooks run on CPU. Larger training experiments are easier with a GPU.
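
Before opening the first notebook, a quick check of the environment can save a failed import later. The sketch below uses only the standard library; the import names in `PACKAGES` are assumptions about how the listed dependencies are exposed, and it checks only that packages are present, not that PyTorch is at version 2.0 or later.

```python
import sys
from importlib import util

REQUIRED_PYTHON = (3, 9)
PACKAGES = ["torch", "numpy", "matplotlib", "jupyter"]  # assumed import names

def python_ok(version_info=sys.version_info, minimum=REQUIRED_PYTHON):
    """True when the interpreter is at least the recommended major.minor version."""
    return tuple(version_info[:2]) >= minimum

def missing_packages(names=PACKAGES):
    """Return the names that cannot be located on the current import path."""
    return [name for name in names if util.find_spec(name) is None]

if __name__ == "__main__":
    print("Python OK:", python_ok())
    print("Missing packages:", missing_packages() or "none")
```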

Web viewer

The repository also includes a React / Vite reader for a course-like browsing experience. The reader imports the .ipynb files directly and renders them in the browser, so no separate copy of the notebook content is generated for the web.

npm install
npm run dev

Build and preview the static site:

npm run build
npm run preview

Executing notebooks in restricted environments

Some sandboxed environments disallow opening local sockets, which breaks the standard Jupyter kernel protocol (and tools like nbclient / nbconvert --execute). For those cases we ship a no-kernel executor that runs code cells via plain Python and writes outputs back into the English notebooks:

python scripts/execute_notebooks_en_no_kernel.py
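
The idea behind a no-kernel executor can be sketched in a few lines. This is an illustration of the approach, not the contents of `scripts/execute_notebooks_en_no_kernel.py`: the function name is hypothetical, and the sketch captures only printed output, with no IPython magics, rich display data, or last-expression values.

```python
import contextlib
import io

def run_code_cells(notebook):
    """Run code cells in order in one shared namespace and store stdout as stream outputs."""
    namespace = {"__name__": "__main__"}
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):       # nbformat allows a list of lines
            source = "".join(source)
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(compile(source, "<cell>", "exec"), namespace)
        text = buffer.getvalue()
        cell["outputs"] = (
            [{"output_type": "stream", "name": "stdout", "text": text}] if text else []
        )
    return notebook

# A tiny in-memory notebook (hypothetical) standing in for a parsed .ipynb file.
demo = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["x = 2 + 3\n", "print(x)"], "outputs": []},
        {"cell_type": "code", "source": "print(x * 2)", "outputs": []},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}
result = run_code_cells(demo)
```

Because no kernel process is started, no sockets are opened. A real notebook would be read with `json.load` and written back with `json.dump`.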

Project Status

| Area | Status |
| --- | --- |
| Chinese notebooks | Complete 23/23 |
| English notebooks | Complete 23/23 with executed outputs |
| Web reader | React / Vite app with language switching |
| Static site | Published through GitHub Pages |
| Quality checks | English coverage, syntax, output-language checks, and web build |
| Next focus | CS336/CME295-inspired depth, smoother writing, reproducible pretraining, and stronger eval benchmarks |

Near-Term Roadmap

  1. Incorporate more material inspired by CS336 and CME295, especially around data, training, systems, and evaluation.
  2. Polish the flow of the existing notebooks so the explanations read more naturally from intuition to code.
  3. Add a reproducible 0-to-1 pretraining workflow inspired by SmolLM, from data preparation to a small trained model.
  4. Make the eval benchmark chapter more detailed, including benchmark design, metrics, judge prompts, result aggregation, and failure analysis.

Curriculum

The curriculum is organized as five parts and 23 self-contained notebooks.

Modern LLM Notebook

├── Part 1: Foundation
│   ├── Tokenizer basics
│   ├── BPE tokenizer
│   ├── Embedding and position encoding
│   ├── Attention and Transformer block
│   ├── Mini-GPT
│   └── BERT encoder

├── Part 2: Training
│   ├── From GPT-2 to modern models
│   ├── Mixture of Experts
│   ├── Training and loss
│   ├── Scaling laws
│   ├── Data engineering
│   ├── LoRA
│   ├── Mid-training and continued pretraining
│   └── RLHF alignment

├── Part 3: Inference
│   ├── Generation
│   ├── Inference acceleration
│   └── Speculative decoding

├── Part 4: Frontiers
│   ├── Long context
│   ├── CoT and thinking
│   └── Vision-language models

└── Part 5: Production
    ├── Evaluation
    ├── Distillation
    └── On-policy distillation

Each notebook is designed to be runnable on its own. You can follow the full sequence or jump to a topic without depending on hidden runtime state from earlier notebooks.

Notebook Index

Part 1: Foundation

| # | Notebook | Primary question | Implementation focus |
| --- | --- | --- | --- |
| 01 | Tokenizer Basics | Why do models need tokenizers? | Character and word tokenizers |
| 02 | BPE Tokenizer | How does BPE learn a vocabulary? | Merge rules, encode, decode |
| 03 | Embedding | How do IDs become vectors? | Token embedding, distributed representation |
| 04 | Position Encoding | How does the model know word order? | Sinusoidal encoding, input assembly |
| 05 | Attention & Transformer Block | How does attention move information? | MHA, residuals, normalization |
| 06 | Mini-GPT | How does a GPT-style model fit together? | Decoder-only model, LM head |
| 08 | BERT Encoder | Why can encoder-only models read bidirectionally? | MiniBERT, MLM head |
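
As a taste of the Part 1 material, the sinusoidal encoding from notebook 04 can be written without any tensor library. The sketch follows the formulation in Attention Is All You Need (sine on even dimensions, cosine on odd ones); the function name and the nested-list output are illustrative assumptions.

```python
import math

def sinusoidal_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal position encodings."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even so sine and cosine dimensions pair up")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, d_model, 2):             # i is the even dimension index
            angle = pos / (10000 ** (i / d_model))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table

pe = sinusoidal_encoding(num_positions=3, d_model=4)
```

Position 0 always encodes to alternating 0 and 1, and later dimensions oscillate more slowly, which gives every position a distinct pattern.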

Part 2: Training

| # | Notebook | Primary question | Implementation focus |
| --- | --- | --- | --- |
| 09 | From GPT-2 to Modern Models | What changed architecturally after GPT-2? | RMSNorm, SwiGLU, RoPE, GQA, QK-Norm, MLA |
| 10 | Mixture of Experts | How does sparse expert routing work? | Router gate, top-k experts, aux-free load balancing |
| 11 | Training & Loss | How does a language model learn from prediction errors? | Training loop, loss, gradients, Multi-Token Prediction |
| 12 | Scaling Laws | How do model size, data, and compute trade off? | FLOPs estimates, Chinchilla intuition |
| 13 | Data Engineering | Why does data quality dominate model behavior? | Cleaning, filtering, MinHash, FIM |
| 14 | LoRA | Why does low-rank adaptation work? | LoraLinear, merge for inference |
| 15 | Mid-Training & CPT | How does continued pretraining adapt a model? | Data mixing, loss observation |
| 16 | RLHF Alignment | How do preference signals become objectives? | Reward model, PPO, DPO |
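
For the question asked by notebook 16, the DPO objective is a compact example of a preference signal turned into a loss. The sketch below handles a single preference pair and assumes the sequence log-probabilities have already been summed over tokens; the function and argument names are illustrative, not the notebook's.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(beta * margin)) for one (chosen, rejected) pair of log-probabilities."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); branch on the sign for numerical stability.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # policy matches the reference
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)  # policy now favors the chosen answer
worse = dpo_loss(-6.0, -4.0, -5.0, -5.0)     # policy favors the rejected answer
```

When the policy equals the reference the loss is log 2. It falls as the policy raises the chosen answer relative to the rejected one, measured against the reference model.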

Part 3: Inference

| # | Notebook | Primary question | Implementation focus |
| --- | --- | --- | --- |
| 17 | Generation | How do decoding strategies change model behavior? | Greedy, top-k, top-p, beam search |
| 18 | Inference Acceleration | Why is generation memory-bound? | KV Cache, FlashAttention, PagedAttention |
| 19 | Speculative Decoding | How can a small model accelerate a large one? | Draft-then-verify acceptance |
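
Among the decoding strategies in notebook 17, top-p (nucleus) filtering is easy to trace by hand. The following is a minimal sketch over a plain probability list; the function name and the example distribution are assumptions made for illustration.

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of most likely tokens whose total mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

probs = [0.5, 0.3, 0.15, 0.05]
nucleus = top_p_filter(probs, p=0.75)   # keeps tokens 0 and 1 (mass 0.8)
greedy_choice = max(range(len(probs)), key=lambda i: probs[i])
```

Sampling then draws from `nucleus` instead of `probs`, which removes the low-probability tail while leaving some randomness. Greedy decoding is the limiting case that always takes `greedy_choice`.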

Part 4: Frontiers

| # | Notebook | Primary question | Implementation focus |
| --- | --- | --- | --- |
| 20 | Long Context | How do models extend beyond their training context length? | RoPE extrapolation, YaRN, Sliding Window Attention |
| 21 | CoT & Thinking | Why can reasoning traces improve answers? | Self-Consistency, reward design |
| 22 | Vision-Language Models | How does visual information enter a language model? | Patch embedding, cross-attention |
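
The Sliding Window Attention item in notebook 20 comes down to a mask. Here is a small sketch, assuming a causal window that includes the current token; the function name and the boolean nested-list layout are hypothetical.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = sliding_window_mask(seq_len=4, window=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most `window` allowed positions, so attention cost per token stays constant as the sequence grows instead of scaling with its length.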

Part 5: Production

| # | Notebook | Primary question | Implementation focus |
| --- | --- | --- | --- |
| 23 | Evaluation | How do we tell whether a model is better? | Win-rate matrices, RAGAS, judge metrics |
| 24 | Distillation | How does a small model learn from a large one? | Soft labels, temperature, logit distillation |
| 25 | On-Policy Distillation | How can distillation reduce exposure bias? | OPSD, KL estimator taxonomy |
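
The soft-label idea in notebook 24 can be sketched for a single token position. The code below computes a temperature-softened KL divergence from teacher to student and scales it by the squared temperature, a common convention in logit distillation. The names and example logits are assumptions, not the notebook's code.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)   # soft labels from the teacher
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

teacher = [4.0, 1.0, 0.5]
matched = distillation_loss(teacher, teacher)
mismatched = distillation_loss([0.5, 1.0, 4.0], teacher)
sharp_peak = max(softmax(teacher, temperature=1.0))
soft_peak = max(softmax(teacher, temperature=4.0))
```

A higher temperature flattens the teacher distribution, so the student also learns how the teacher ranks the less likely tokens rather than only its top choice.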

Quality Bar

The repository follows a small set of standards to keep the notebooks useful as learning material:

  • Concepts are introduced by motivation before notation.
  • New terminology is defined before it is used heavily.
  • Core algorithms include at least one concrete hand calculation or toy example.
  • Code cells are kept small and observable.
  • Randomized experiments use fixed seeds where appropriate.
  • Each notebook is self-contained and does not rely on variables from previous notebooks.
  • Markdown explanations are written for patient beginners, while the code remains close to the real algorithmic structure.

Papers and Systems

The course connects implementation details to influential papers and production systems:

| Paper or system | Concepts covered |
| --- | --- |
| Attention Is All You Need | Multi-Head Attention, position encoding |
| BERT | Encoder-only models, masked language modeling |
| LLaMA | RMSNorm, SwiGLU, RoPE, Pre-Norm |
| DeepSeek-V2 / DeepSeek-V3 | MLA, Multi-Token Prediction, aux-free MoE load balancing |
| Mixtral / Qwen3 | Sliding Window Attention, MoE with shared experts |
| Scaling Laws / Chinchilla | Parameter, data, and compute trade-offs |
| LoRA | Low-rank adaptation |
| RLHF / PPO / DPO | Preference alignment |
| Code Llama / DeepSeek-Coder | Fill-in-the-Middle (FIM) |
| FlashAttention / vLLM | Inference acceleration and memory management |
| Speculative Decoding | Draft-then-verify generation |
| RoPE / YaRN | Long-context extrapolation |
| Chain-of-Thought | Reasoning traces and Self-Consistency |
| Flamingo / LLaVA | Vision-language models |
| Knowledge Distillation / OPD | Compression and distillation |

Repository Structure

modern-llm-notebook/
├── notebooks/           # Chinese source notebooks
│   ├── part1-foundation/
│   ├── part2-training/
│   ├── part3-inference/
│   ├── part4-frontiers/
│   └── part5-production/
├── notebooks-en/        # English mirror notebooks
│   ├── part1-foundation/
│   ├── part2-training/
│   ├── part3-inference/
│   ├── part4-frontiers/
│   └── part5-production/
├── external/            # Upstream references (e.g. karpathy nanoGPT/minGPT)
├── karpathy_models.py   # Thin import wrapper used by a few notebooks
├── web/                 # React / Vite web viewer
├── docs/                # Static site build output
├── scripts/             # Notebook conversion scripts
├── requirements.txt
├── package.json
├── README.md
└── README-CN.md

Contributing

Contributions are welcome when they improve clarity, correctness, or coverage.

Good contributions include:

  • Fixing incorrect explanations, broken cells, or outdated APIs.
  • Improving hand-calculation sections and visualizations.
  • Adding focused exercises with assertions.
  • Translating or improving bilingual documentation.
  • Proposing new notebooks for important model architectures or training methods.

Please read CONTRIBUTING.md before opening a pull request.

Citation

If Modern LLM Notebook helps your research or work, please cite:

@misc{modern-llm-notebook,
  title   = {Modern LLM Notebook: Build Modern LLMs from Scratch},
  author  = {WalkingLabs},
  year    = {2025},
  url     = {https://github.com/walkinglabs/modern-llm-notebook},
  note    = {GitHub repository, accessed 2026}
}

License

This project is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


Built for engineers who want to understand LLM systems from the inside.
Maintained by walkinglabs.