Modern LLM Notebook
June 3, 2026 · View on GitHub
Build modern LLMs from scratch through 23 runnable Jupyter Notebooks.
English · 中文文档 · Read Online · Start in Colab
Overview · What You Will Build · Why · What Is Included · Quick Start · Status · Curriculum · Quality Bar · Contributing
Case-based LLM learning: course map -> concrete notebook -> runnable experiment.
Start from the full bilingual course map, then choose a focused path through foundations, training, inference, frontiers, and production topics.
Each case keeps the learning loop visible: intuition, hand calculation, implementation, experiment, outline navigation, and one-click Colab access.
Overview
Modern LLM Notebook is a hands-on course for building modern LLM systems from the ground up in PyTorch. Instead of treating the model as a black box, you implement the core pieces yourself: tokenizers, embeddings, attention, Transformer blocks, training loops, MoE, LoRA, RLHF, decoding, KV Cache, long context, VLMs, evaluation, distillation, and on-policy distillation.
The repository ships with a full English notebook mirror under notebooks-en/. The web viewer
supports language switching from the home page and the notebook sidebar (or via ?lang=en in the
URL), so both the curriculum and the browsing experience stay bilingual end to end.
The project is designed as an educational reference implementation. It is not a model zoo, not a production serving framework, and not a wrapper around hosted APIs. Its purpose is to make the internal machinery of LLMs legible to engineers who want to reason from first principles.
Each notebook follows the same learning contract:
intuition -> hand calculation -> implementation -> experiment
That contract matters. A reader should not only know that BPE merges frequent pairs, or that KV Cache speeds up generation. They should be able to trace the numbers, write the minimal code, and explain why the behavior appears.
What You Will Build
By the end, you will have implemented a compact version of the systems that power modern LLMs:
| Stage | You build | Why it matters |
|---|---|---|
| Text to tokens | Character, word, and BPE tokenizers | See exactly how raw text becomes model input |
| Tokens to vectors | Token embeddings and position encodings | Understand what the model can compute over |
| Transformer core | Self-Attention, Multi-Head Attention, Transformer blocks, Mini-GPT | Reconstruct the core forward pass |
| Training system | Cross-Entropy, batching, gradient flow, scaling-law intuition | Connect loss curves to real model behavior |
| Adaptation | LoRA, continued pretraining, reward modeling, PPO/DPO style objectives | Learn how base models become useful assistants |
| Inference system | Sampling, beam search, KV Cache, speculative decoding | Understand why serving is a systems problem |
| Frontiers | Long context, CoT experiments, VLM patch embeddings and cross-attention | Turn newer papers into small runnable examples |
| Production loop | Evaluation, win-rate matrices, distillation, OPD | Measure, compress, and improve model behavior |
raw text -> tokens -> embeddings -> attention -> Transformer -> Mini-GPT
-> training -> alignment -> inference -> evaluation -> distillation
Why This Project
LLM education often falls into two extremes.
Some resources are mathematically precise but difficult to enter: they introduce formulas before the reader understands the problem being solved. Other resources are easy to run but heavily abstracted: the important ideas disappear behind a library call.
Modern LLM Notebook takes the middle path. It treats modern LLMs as systems that can be decomposed, tested, and rebuilt piece by piece. The goal is not to replace papers or production libraries. The goal is to give you the mental model needed to read those papers and use those libraries with judgment.
Use this project if you want to:
- Understand the data flow from raw text to logits.
- Build a small GPT-style model without treating the architecture as a black box.
- See how training objectives, data quality, and scaling laws connect.
- Learn why inference systems need KV Cache, batching, memory planning, and speculative decoding.
- Connect recent research topics such as MoE, long context, CoT, VLMs, RLHF, DPO, and distillation back to small runnable examples.
What Is Included
| Area | Topics | Reference implementations |
|---|---|---|
| Foundations | Tokenization, BPE, embeddings, position encoding | CharTokenizer, WordTokenizer, BPETokenizer, TokenEmbedding |
| Transformer core | Self-Attention, Multi-Head Attention, Transformer block | MultiHeadAttention, TransformerBlock, MiniGPT |
| GPT-2 to modern models | RMSNorm, SwiGLU, RoPE, GQA, QK-Norm, MLA, MoE | RMSNorm, SwiGLU, RoPE, GroupedQueryAttention, MultiHeadLatentAttention, MoELayer |
| Training | Loss, optimization, scaling laws, data engineering, MTP, FIM | Training loop, gradient accumulation, MinHash deduplication, Multi-Token Prediction, Fill-in-the-Middle |
| Adaptation and alignment | LoRA, reward modeling, PPO, DPO | LoraLinear, reward model loss, PPO clip, DPO loss |
| Inference | Sampling, beam search, KV Cache, speculative decoding | Top-k, Top-p, beam search, AttentionWithKVCache |
| Frontiers | Long context, reasoning traces, VLM, Sliding Window Attention | RoPE extrapolation, Self-Consistency, Cross-Attention, Sliding Window mask |
| Production concepts | Evaluation, distillation, on-policy distillation | Win-rate matrices, soft labels, KL estimators |
What This Project Is Not
This repository intentionally avoids several things so the learning path stays clear:
- It is not a production LLM framework.
- It is not optimized for maximum throughput or distributed training.
- It does not provide pretrained model weights.
- It does not use
transformersas a shortcut for core implementations. - It does not assume the reader already knows the terminology.
Some dependencies such as transformers and datasets may appear in the environment for
comparison or utility work, but the teaching path keeps the core algorithms explicit.
Quick Start
Python notebooks
git clone https://github.com/walkinglabs/modern-llm-notebook.git
cd modern-llm-notebook
pip install -r requirements.txt
jupyter notebook notebooks-en/part1-foundation/01-tokenizer-basics.ipynb
Language note:
- Chinese notebooks live in
notebooks/ - English notebooks live in
notebooks-en/(complete 23/23 translation coverage)
Recommended environment:
- Python 3.9+
- PyTorch 2.0+
- NumPy, Matplotlib, Jupyter
- 16GB RAM
Most notebooks run on CPU. Larger training experiments are easier with a GPU.
Web viewer
The repository also includes a React / Vite reader for a course-like browsing experience.
The reader imports the .ipynb files directly and renders them in the browser, without a generated
web content copy.
npm install
npm run dev
Build and preview the static site:
npm run build
npm run preview
Executing notebooks in restricted environments
Some sandboxed environments disallow opening local sockets, which breaks the standard Jupyter
kernel protocol (and tools like nbclient / nbconvert --execute). For those cases we ship a
no-kernel executor that runs code cells via plain Python and writes outputs back into the English
notebooks:
python scripts/execute_notebooks_en_no_kernel.py
Project Status
| Area | Status |
|---|---|
| Chinese notebooks | Complete 23/23 |
| English notebooks | Complete 23/23 with executed outputs |
| Web reader | React / Vite app with language switching |
| Static site | Published through GitHub Pages |
| Quality checks | English coverage, syntax, output-language checks, and web build |
| Next focus | CS336/CME295-inspired depth, smoother writing, reproducible pretraining, and stronger eval benchmarks |
Near-Term Roadmap
- Incorporate more material inspired by CS336 and CME295, especially around data, training, systems, and evaluation.
- Polish the flow of the existing notebooks so the explanations read more naturally from intuition to code.
- Add a reproducible 0-to-1 pretraining workflow inspired by SmolLM, from data preparation to a small trained model.
- Make the eval benchmark chapter more detailed, including benchmark design, metrics, judge prompts, result aggregation, and failure analysis.
Curriculum
The curriculum is organized as five parts and 23 self-contained notebooks.
Modern LLM Notebook
│
├── Part 1: Foundation
│ ├── Tokenizer basics
│ ├── BPE tokenizer
│ ├── Embedding and position encoding
│ ├── Attention and Transformer block
│ ├── Mini-GPT
│ └── BERT encoder
│
├── Part 2: Training
│ ├── From GPT-2 to modern models
│ ├── Mixture of Experts
│ ├── Training and loss
│ ├── Scaling laws
│ ├── Data engineering
│ ├── LoRA
│ ├── Mid-training and continued pretraining
│ └── RLHF alignment
│
├── Part 3: Inference
│ ├── Generation
│ ├── Inference acceleration
│ └── Speculative decoding
│
├── Part 4: Frontiers
│ ├── Long context
│ ├── CoT and thinking
│ └── Vision-language models
│
└── Part 5: Production
├── Evaluation
├── Distillation
└── On-policy distillation
Each notebook is designed to be runnable on its own. You can follow the full sequence or jump to a topic without depending on hidden runtime state from earlier notebooks.
Notebook Index
Part 1: Foundation
| # | Notebook | Primary question | Implementation focus |
|---|---|---|---|
| 01 | Tokenizer Basics | Why do models need tokenizers? | Character and word tokenizers |
| 02 | BPE Tokenizer | How does BPE learn a vocabulary? | Merge rules, encode, decode |
| 03 | Embedding | How do IDs become vectors? | Token embedding, distributed representation |
| 04 | Position Encoding | How does the model know word order? | Sinusoidal encoding, input assembly |
| 05 | Attention & Transformer Block | How does attention move information? | MHA, residuals, normalization |
| 06 | Mini-GPT | How does a GPT-style model fit together? | Decoder-only model, LM head |
| 08 | BERT Encoder | Why can encoder-only models read bidirectionally? | MiniBERT, MLM head |
Part 2: Training
| # | Notebook | Primary question | Implementation focus |
|---|---|---|---|
| 09 | From GPT-2 to Modern Models | What changed architecturally after GPT-2? | RMSNorm, SwiGLU, RoPE, GQA, QK-Norm, MLA |
| 10 | Mixture of Experts | How does sparse expert routing work? | Router gate, top-k experts, aux-free load balancing |
| 11 | Training & Loss | How does a language model learn from prediction errors? | Training loop, loss, gradients, Multi-Token Prediction |
| 12 | Scaling Laws | How do model size, data, and compute trade off? | FLOPs estimates, Chinchilla intuition |
| 13 | Data Engineering | Why does data quality dominate model behavior? | Cleaning, filtering, MinHash, FIM |
| 14 | LoRA | Why does low-rank adaptation work? | LoraLinear, merge for inference |
| 15 | Mid-Training & CPT | How does continued pretraining adapt a model? | Data mixing, loss observation |
| 16 | RLHF Alignment | How do preference signals become objectives? | Reward model, PPO, DPO |
Part 3: Inference
| # | Notebook | Primary question | Implementation focus |
|---|---|---|---|
| 17 | Generation | How do decoding strategies change model behavior? | Greedy, top-k, top-p, beam search |
| 18 | Inference Acceleration | Why is generation memory-bound? | KV Cache, FlashAttention, PagedAttention |
| 19 | Speculative Decoding | How can a small model accelerate a large one? | Draft-then-verify acceptance |
Part 4: Frontiers
| # | Notebook | Primary question | Implementation focus |
|---|---|---|---|
| 20 | Long Context | How do models extend beyond their training context length? | RoPE extrapolation, YaRN, Sliding Window Attention |
| 21 | CoT & Thinking | Why can reasoning traces improve answers? | Self-Consistency, reward design |
| 22 | Vision-Language Models | How does visual information enter a language model? | Patch embedding, cross-attention |
Part 5: Production
| # | Notebook | Primary question | Implementation focus |
|---|---|---|---|
| 23 | Evaluation | How do we tell whether a model is better? | Win-rate matrices, RAGAS, judge metrics |
| 24 | Distillation | How does a small model learn from a large one? | Soft labels, temperature, logit distillation |
| 25 | On-Policy Distillation | How can distillation reduce exposure bias? | OPSD, KL estimator taxonomy |
Quality Bar
The repository follows a small set of standards to keep the notebooks useful as learning material:
- Concepts are introduced by motivation before notation.
- New terminology is defined before it is used heavily.
- Core algorithms include at least one concrete hand calculation or toy example.
- Code cells are kept small and observable.
- Randomized experiments use fixed seeds where appropriate.
- Each notebook is self-contained and does not rely on variables from previous notebooks.
- Markdown explanations are written for patient beginners, while the code remains close to the real algorithmic structure.
Papers and Systems
The course connects implementation details to influential papers and production systems:
| Paper or system | Concepts covered |
|---|---|
| Attention Is All You Need | Multi-Head Attention, position encoding |
| BERT | Encoder-only models, masked language modeling |
| LLaMA | RMSNorm, SwiGLU, RoPE, Pre-Norm |
| DeepSeek-V2 / DeepSeek-V3 | MLA, Multi-Token Prediction, aux-free MoE load balancing |
| Mixtral / Qwen3 | Sliding Window Attention, MoE with shared experts |
| Scaling Laws / Chinchilla | Parameter, data, and compute trade-offs |
| LoRA | Low-rank adaptation |
| RLHF / PPO / DPO | Preference alignment |
| Code Llama / DeepSeek-Coder | Fill-in-the-Middle (FIM) |
| FlashAttention / vLLM | Inference acceleration and memory management |
| Speculative Decoding | Draft-then-verify generation |
| RoPE / YaRN | Long-context extrapolation |
| Chain-of-Thought | Reasoning traces and Self-Consistency |
| Flamingo / LLaVA | Vision-language models |
| Knowledge Distillation / OPD | Compression and distillation |
Repository Structure
modern-llm-notebook/
├── notebooks/ # Chinese source notebooks
│ ├── part1-foundation/
│ ├── part2-training/
│ ├── part3-inference/
│ ├── part4-frontiers/
│ └── part5-production/
├── notebooks-en/ # English mirror notebooks
│ ├── part1-foundation/
│ ├── part2-training/
│ ├── part3-inference/
│ ├── part4-frontiers/
│ └── part5-production/
├── external/ # Upstream references (e.g. karpathy nanoGPT/minGPT)
├── karpathy_models.py # Thin import wrapper used by a few notebooks
├── web/ # React / Vite web viewer
├── docs/ # Static site build output
├── scripts/ # Notebook conversion scripts
├── requirements.txt
├── package.json
├── README.md
└── README-CN.md
Contributing
Contributions are welcome when they improve clarity, correctness, or coverage.
Good contributions include:
- Fixing incorrect explanations, broken cells, or outdated APIs.
- Improving hand-calculation sections and visualizations.
- Adding focused exercises with assertions.
- Translating or improving bilingual documentation.
- Proposing new notebooks for important model architectures or training methods.
Please read CONTRIBUTING.md before opening a pull request.
Star History
Citation
If Modern LLM Notebook helps your research or work, please cite:
@misc{modern-llm-notebook,
title = {Modern LLM Notebook: Build Modern LLMs from Scratch},
author = {WalkingLabs},
year = {2025},
url = {https://github.com/walkinglabs/modern-llm-notebook},
note = {GitHub repository, accessed 2026}
}
License
This project is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Built for engineers who want to understand LLM systems from the inside.
Maintained by walkinglabs.