Modern LLM Notebook

June 30, 2026

Build modern LLMs from scratch through 26 runnable Jupyter Notebooks.

Overview · What You Will Build · Why · What Is Included · Quick Start · Status · Curriculum · Quality Bar · Contributing

Case-based LLM learning: course map -> concrete notebook -> runnable experiment.

Figure: Modern LLM Notebook English home page

Start from the full bilingual course map, then choose a focused path through foundations, training, inference, frontiers, and production topics.

Figure: Modern LLM Notebook English notebook reader

Each case keeps the learning loop visible: intuition, hand calculation, implementation, experiment, outline navigation, and one-click Colab access.

Overview

Modern LLM Notebook is a hands-on course for building modern LLM systems from the ground up in PyTorch. Instead of treating the model as a black box, you implement the core pieces yourself: tokenizers, embeddings, attention, Transformer blocks, training loops, MoE, LoRA, RLHF, decoding, KV Cache, long context, VLMs, evaluation, distillation, and on-policy distillation.

The repository ships with a full English notebook mirror under notebooks-en/. The web viewer supports language switching from the home page and the notebook sidebar (or via ?lang=en in the URL), so both the curriculum and the browsing experience stay bilingual end to end.

The project is designed as an educational reference implementation. It is not a model zoo, not a production serving framework, and not a wrapper around hosted APIs. Its purpose is to make the internal machinery of LLMs legible to engineers who want to reason from first principles.

Each notebook follows the same learning contract:

intuition -> hand calculation -> implementation -> experiment

That contract matters. A reader should not only know that BPE merges frequent pairs, or that KV Cache speeds up generation. They should be able to trace the numbers, write the minimal code, and explain why the behavior appears.
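
As an illustration of what "trace the numbers" can look like for BPE, here is a minimal sketch of a single merge step in plain Python. The toy corpus and the helper names `count_pairs` and `merge_pair` are assumptions made for this example, not the notebook's actual `BPETokenizer` code.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: each word is a tuple of characters with a frequency.
corpus = {
    tuple("hug"): 10,
    tuple("pug"): 5,
    tuple("pun"): 12,
    tuple("bun"): 4,
    tuple("hugs"): 5,
}

pairs = count_pairs(corpus)
best = max(pairs, key=pairs.get)
corpus_after = merge_pair(corpus, best)
print(best, pairs[best])  # ('u', 'g') 20
```

The pair `('u', 'g')` wins because it appears in `hug` (10), `pug` (5), and `hugs` (5). Repeating this step and recording each winning pair in order gives the merge rules that a BPE encoder later replays.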

What You Will Build

By the end, you will have implemented a compact version of the systems that power modern LLMs:

Stage	You build	Why it matters
Text to tokens	Character, word, and BPE tokenizers	See exactly how raw text becomes model input
Tokens to vectors	Token embeddings and position encodings	Understand what the model can compute over
Transformer core	Self-Attention, Multi-Head Attention, Transformer blocks, Mini-GPT	Reconstruct the core forward pass
Training system	Cross-Entropy, batching, gradient flow, scaling-law intuition	Connect loss curves to real model behavior
Adaptation	LoRA, continued pretraining, reward modeling, PPO/DPO-style objectives	Learn how base models become useful assistants
Inference system	Sampling, beam search, KV Cache, speculative decoding	Understand why serving is a systems problem
Frontiers	Long context, CoT experiments, VLM patch embeddings and cross-attention	Turn newer papers into small runnable examples
Production loop	Evaluation, win-rate matrices, distillation, OPD	Measure, compress, and improve model behavior

raw text -> tokens -> embeddings -> attention -> Transformer -> Mini-GPT
         -> training -> alignment -> inference -> evaluation -> distillation
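
The attention stage of this pipeline can be sketched without any framework. The example below is a simplified, single-head version written in plain Python so every number can be checked by hand. It assumes identity projections (Q = K = V = the input embeddings) and a made-up 3-token input, whereas the course implements the real modules in PyTorch.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of T vectors. Position t only sees positions 0..t.
    """
    dim = len(queries[0])
    outputs = []
    for t, query in enumerate(queries):
        # Scores are computed only for the visible prefix, which is the mask.
        scores = [
            sum(a * b for a, b in zip(query, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ])
    return outputs

# Hypothetical 3-token input with 2-dimensional embeddings.
# Simplification: identity projections, so Q = K = V = embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = causal_self_attention(embeddings, embeddings, embeddings)
for t, vector in enumerate(attended):
    print(t, [round(x, 3) for x in vector])
```

The first token can only attend to itself, so its output equals its own value vector. Later positions produce a weighted average of the value vectors they are allowed to see.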

Why This Project

LLM education often falls into two extremes.

Some resources are mathematically precise but difficult to enter: they introduce formulas before the reader understands the problem being solved. Other resources are easy to run but heavily abstracted: the important ideas disappear behind a library call.

Modern LLM Notebook takes the middle path. It treats modern LLMs as systems that can be decomposed, tested, and rebuilt piece by piece. The goal is not to replace papers or production libraries. The goal is to give you the mental model needed to read those papers and use those libraries with judgment.

Use this project if you want to:

Understand the data flow from raw text to logits.
Build a small GPT-style model without treating the architecture as a black box.
See how training objectives, data quality, and scaling laws connect.
Learn why inference systems need KV Cache, batching, memory planning, and speculative decoding.
Connect recent research topics such as MoE, long context, CoT, VLMs, RLHF, DPO, and distillation back to small runnable examples.

What Is Included

Area	Topics	Reference implementations
Foundations	Tokenization, BPE, embeddings, position encoding	`CharTokenizer`, `WordTokenizer`, `BPETokenizer`, `TokenEmbedding`
Transformer core	Self-Attention, Multi-Head Attention, Transformer block	`MultiHeadAttention`, `TransformerBlock`, `MiniGPT`
GPT-2 to modern models	RMSNorm, SwiGLU, RoPE, GQA, QK-Norm, MLA, MoE	`RMSNorm`, `SwiGLU`, `RoPE`, `GroupedQueryAttention`, `MultiHeadLatentAttention`, `MoELayer`
Training	Loss, optimization, scaling laws, data engineering, MTP, FIM	Training loop, gradient accumulation, MinHash deduplication, Multi-Token Prediction, Fill-in-the-Middle
Adaptation and alignment	LoRA, reward modeling, PPO, DPO	`LoraLinear`, reward model loss, PPO clip, DPO loss
Inference	Sampling, beam search, KV Cache, speculative decoding	Top-k, Top-p, beam search, `AttentionWithKVCache`
Frontiers	Long context, reasoning traces, VLM, Sliding Window Attention	RoPE extrapolation, Self-Consistency, Cross-Attention, Sliding Window mask
Production concepts	Evaluation, distillation, on-policy distillation	Win-rate matrices, soft labels, KL estimators
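
To give a feel for the scale of these reference implementations, here is a small sketch of the Top-k and Top-p filters from the Inference row. It works on a plain list of probabilities. The function names and the example distribution are assumptions for illustration, not the repository's code.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize the rest to zero."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

# Hypothetical next-token distribution over a 4-token vocabulary.
probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, 2))    # only the two most likely tokens survive
print(top_p_filter(probs, 0.9))  # three tokens are needed to reach 90% mass
```

Top-k fixes the number of candidates, while Top-p lets that number grow or shrink with how peaked the distribution is. Sampling then draws from the filtered, renormalized list.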

What This Project Is Not

This repository intentionally avoids several things so the learning path stays clear:

It is not a production LLM framework.
It is not optimized for maximum throughput or distributed training.
It does not provide pretrained model weights.
It does not use the Hugging Face `transformers` library as a shortcut for core implementations.
It does not assume the reader already knows the terminology.

Some dependencies such as `transformers` and `datasets` may appear in the environment for comparison or utility work, but the teaching path keeps the core algorithms explicit.

Quick Start

Python notebooks

git clone https://github.com/walkinglabs/modern-llm-notebook.git
cd modern-llm-notebook

# Create an isolated Python environment instead of installing into the system Python.
python3 -m venv .venv
source .venv/bin/activate

python -m pip install --upgrade pip
python -m pip install -r requirements.txt
python -m ipykernel install --user \
  --name modern-llm-notebook \
  --display-name "Python (modern-llm-notebook)"

jupyter notebook notebooks-en/part1-foundation/01-tokenizer-basics.ipynb

If `jupyter: command not found` appears, the virtual environment is probably not active. Run:

source .venv/bin/activate

Or call Jupyter directly from the environment:

.venv/bin/jupyter notebook notebooks-en/part1-foundation/01-tokenizer-basics.ipynb

Language note:

Chinese notebooks live in notebooks/
English notebooks live in notebooks-en/ (complete 26/26 translation coverage)

Recommended environment:

Python 3.9+
PyTorch 2.0+
NumPy, Matplotlib, Jupyter
16 GB RAM

Most notebooks run on CPU. Larger training experiments are easier with a GPU.
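
One way to sanity-check the environment before opening a notebook is a short script like the one below. It is a sketch, not a script shipped with the repository. The helper name `check_environment` is hypothetical, and it only tests whether packages are importable, not whether PyTorch is at least 2.0.

```python
import importlib.util
import sys

def check_environment(min_python=(3, 9), packages=("torch", "numpy", "matplotlib")):
    """Report whether the interpreter is new enough and each package is importable."""
    report = {"python_ok": sys.version_info >= min_python}
    for name in packages:
        # find_spec returns None when a top-level package cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report

for item, ok in check_environment().items():
    print(f"{item}: {'ok' if ok else 'missing'}")
```

If a package is reported as missing while the virtual environment is active, rerunning `python -m pip install -r requirements.txt` is the first thing to try.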

Web viewer

The repository also includes a React / Vite reader for a course-like browsing experience. The reader imports the .ipynb files directly and renders them in the browser, so no separately generated copy of the notebook content is needed.

cd web
npm install
npm run dev

Build and preview the static site:

cd web
npm run build
npm run preview

Executing notebooks in restricted environments

Some sandboxed environments disallow opening local sockets, which breaks the standard Jupyter kernel protocol (and tools like nbclient / nbconvert --execute). For those cases we ship a no-kernel executor that runs code cells via plain Python and writes outputs back into the English notebooks:

python scripts/execute_notebooks_en_no_kernel.py

Project Status

Area	Status
Chinese notebooks	Complete 32/32 (added 29-MLA, 30-inference-systems, 31-linear-attention, 32-sparse-attention)
English notebooks	Complete 26/26 with executed outputs; renumber pending
Web reader	React / Vite app with language switching
Static site	Published through GitHub Pages
Quality checks	English coverage, syntax, output-language checks, and web build
Next focus	CS336/CME295-inspired depth, smoother writing, reproducible pretraining, and stronger eval benchmarks

Near-Term Roadmap

Incorporate more material inspired by CS336 and CME295, especially around data, training, systems, and evaluation.
Polish the flow of the existing notebooks so the explanations read more naturally from intuition to code.
Add a reproducible 0-to-1 pretraining workflow inspired by SmolLM, from data preparation to a small trained model.
Make the eval benchmark chapter more detailed, including benchmark design, metrics, judge prompts, result aggregation, and failure analysis.

Curriculum

The curriculum is organized as five parts and 26 self-contained notebooks.

Modern LLM Notebook
│
├── Part 1: Foundation
│   ├── Tokenizer basics
│   ├── BPE tokenizer
│   ├── Embedding and position encoding
│   ├── Attention and Transformer block
│   ├── Mini-GPT
│   └── BERT encoder
│
├── Part 2: Training
│   ├── From GPT-2 to modern models
│   ├── Model config
│   ├── Mixture of Experts
│   ├── Training and loss
│   ├── Scaling laws
│   ├── Data engineering
│   ├── LoRA
│   ├── Mid-training and continued pretraining
│   └── RLHF alignment
│
├── Part 3: Inference
│   ├── Generation
│   ├── Inference acceleration
│   └── Speculative decoding
│
├── Part 4: Frontiers
│   ├── Long context
│   ├── CoT and thinking
│   └── Vision-language models
│
└── Part 5: Production
    ├── Evaluation
    ├── Distillation
    ├── On-policy distillation
    └── vLLM & SGLang deployment

Each notebook is designed to be runnable on its own. You can follow the full sequence or jump to a topic without depending on hidden runtime state from earlier notebooks.

Notebook Index

Part 1: Foundation

#	Notebook	Primary question	Implementation focus
01	Tokenizer Basics	Why do models need tokenizers?	Character and word tokenizers
02	BPE Tokenizer	How does BPE learn a vocabulary?	Merge rules, encode, decode
03	Embedding	How do IDs become vectors?	Token embedding, distributed representation
04	Position Encoding	How does the model know word order?	Sinusoidal encoding, input assembly
05	Attention & Transformer Block	How does attention move information?	MHA, residuals, normalization
06	Mini-GPT	How does a GPT-style model fit together?	Decoder-only model, LM head
07	BERT Encoder	Why can encoder-only models read bidirectionally?	MiniBERT, MLM head

Part 2: Training

#	Notebook	Primary question	Implementation focus
08	From GPT-2 to Modern Models	What changed architecturally after GPT-2?	RMSNorm, SwiGLU, RoPE, GQA, QK-Norm, MLA
09	Model Config	What does each field in a real config.json mean?	vocab_size, hidden_size, layers, heads
10	Mixture of Experts	How does sparse expert routing work?	Router gate, top-k experts, aux-free load balancing
11	Training & Loss	How does a language model learn from prediction errors?	Training loop, loss, gradients, Multi-Token Prediction
12	Scaling Laws	How do model size, data, and compute trade off?	FLOPs estimates, Chinchilla intuition
13	Distributed Training	How do we shard memory and compute across GPUs?	DDP, ZeRO Stage 1/2/3, FSDP, DeepSpeed, Accelerate
14	Data Engineering	Why does data quality dominate model behavior?	Cleaning, filtering, MinHash, FIM
15	LoRA	Why does low-rank adaptation work?	`LoraLinear`, merge for inference
16	Mid-Training & CPT	How does continued pretraining adapt a model?	Data mixing, loss observation
17	RLHF Alignment	How do preference signals become objectives?	Reward model, PPO, DPO
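
For the Scaling Laws row, the kind of back-of-the-envelope estimate involved can be sketched in a few lines. The example assumes two widely used rules of thumb: training compute of roughly 6 FLOPs per parameter per token, and a Chinchilla-style budget of roughly 20 tokens per parameter. The function names and the 125M-parameter model are illustrative choices, not values taken from the notebook.

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Rule-of-thumb token budget: roughly 20 training tokens per parameter."""
    return n_params * tokens_per_param

def training_flops(n_params, n_tokens):
    """Common approximation: about 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

# Hypothetical small model with 125 million parameters.
n_params = 125_000_000
n_tokens = chinchilla_tokens(n_params)
flops = training_flops(n_params, n_tokens)
print(f"{n_tokens:,} tokens, {flops:.3e} FLOPs")
```

Both numbers are approximations. They ignore attention cost at long sequence lengths and are meant for comparing orders of magnitude, not for precise budgeting.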

Part 3: Inference

#	Notebook	Primary question	Implementation focus
17	Generation	How do decoding strategies change model behavior?	Greedy, top-k, top-p, beam search
18	Inference Acceleration	Why is generation memory-bound?	KV Cache, FlashAttention, PagedAttention
19	Speculative Decoding	How can a small model accelerate a large one?	Draft-then-verify acceptance
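
The idea behind KV Cache in notebook 18 can be shown with a toy decoder loop. In the sketch below, `project_kv` is a hypothetical stand-in for the learned key and value projections, and the query is simply the raw token vector. The point is only to compare how many projections each strategy performs while producing identical outputs.

```python
import math

def project_kv(x):
    """Hypothetical stand-in for the learned key/value projections."""
    return [2.0 * a for a in x], [a + 1.0 for a in x]

def attend(query, keys, values):
    """Scaled dot-product attention for one query over all visible positions."""
    scale = math.sqrt(len(query))
    scores = [sum(a * b for a, b in zip(query, k)) / scale for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [
        sum(e / total * v[j] for e, v in zip(exps, values))
        for j in range(len(values[0]))
    ]

def decode_without_cache(tokens):
    """Recompute keys and values for the whole prefix at every step."""
    outputs, projections = [], 0
    for t in range(len(tokens)):
        keys, values = [], []
        for x in tokens[: t + 1]:
            k, v = project_kv(x)
            projections += 1
            keys.append(k)
            values.append(v)
        outputs.append(attend(tokens[t], keys, values))
    return outputs, projections

def decode_with_cache(tokens):
    """Project each token once and append the result to a growing cache."""
    outputs, projections = [], 0
    cached_keys, cached_values = [], []
    for x in tokens:
        k, v = project_kv(x)
        projections += 1
        cached_keys.append(k)
        cached_values.append(v)
        outputs.append(attend(x, cached_keys, cached_values))
    return outputs, projections

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
slow_out, slow_count = decode_without_cache(tokens)
fast_out, fast_count = decode_with_cache(tokens)
print(slow_out == fast_out, slow_count, fast_count)  # True 10 4
```

For 4 tokens the uncached loop performs 1 + 2 + 3 + 4 = 10 projections and the cached loop performs 4. The saving grows quadratically with sequence length, at the price of keeping every key and value in memory.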

Part 4: Frontiers

#	Notebook	Primary question	Implementation focus
20	Long Context	How do models extend beyond their training context length?	RoPE extrapolation, YaRN, Sliding Window Attention
21	CoT & Thinking	Why can reasoning traces improve answers?	Self-Consistency, reward design
22	Vision-Language Models	How does visual information enter a language model?	Patch embedding, cross-attention
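
The Sliding Window Attention mask mentioned for notebook 20 is small enough to build by hand. The sketch below uses one common convention, which is an assumption here: a query at position i may attend to itself and the previous `window - 1` positions.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j.

    Two conditions must hold: causality (j <= i) and locality (i - j < window).
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = sliding_window_mask(seq_len=5, window=3)
for row in mask:
    print(" ".join("1" if allowed else "0" for allowed in row))
# 1 0 0 0 0
# 1 1 0 0 0
# 1 1 1 0 0
# 0 1 1 1 0
# 0 0 1 1 1
```

Each row has at most `window` allowed positions, so the attention cost per token stays constant as the sequence grows instead of scaling with the full context length.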

Part 5: Production

#	Notebook	Primary question	Implementation focus
23	Evaluation	How do we tell whether a model is better?	Win-rate matrices, RAGAS, judge metrics
24	Distillation	How does a small model learn from a large one?	Soft labels, temperature, logit distillation
25	On-Policy Distillation	How can distillation reduce exposure bias?	OPSD, KL estimator taxonomy
26	LLM Deployment	How do you turn a trained model into a callable service?	vLLM, SGLang, custom architecture registration
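
The soft-label idea in notebook 24 can be sketched as follows. This is a minimal logit-distillation loss under stated assumptions: the student is trained to match temperature-softened teacher probabilities, and the loss is scaled by the squared temperature, as is commonly done. The logits and function names are made up for the example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by a temperature; higher values flatten the result."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL between softened teacher and student distributions, scaled by T^2."""
    teacher = softmax_with_temperature(teacher_logits, temperature)
    student = softmax_with_temperature(student_logits, temperature)
    return temperature ** 2 * kl_divergence(teacher, student)

# Hypothetical logits over a 3-token vocabulary.
teacher_logits = [2.0, 1.0, 0.1]
student_logits = [1.0, 1.0, 1.0]
print(round(distillation_loss(teacher_logits, student_logits), 4))
print([round(p, 3) for p in softmax_with_temperature(teacher_logits, 1.0)])
print([round(p, 3) for p in softmax_with_temperature(teacher_logits, 4.0)])
```

Raising the temperature spreads probability onto the less likely tokens, which is what lets the student learn from the teacher's relative preferences and not only from its top choice.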

Quality Bar

The repository follows a small set of standards to keep the notebooks useful as learning material:

Concepts are introduced by motivation before notation.
New terminology is defined before it is used heavily.
Core algorithms include at least one concrete hand calculation or toy example.
Code cells are kept small and observable.
Randomized experiments use fixed seeds where appropriate.
Each notebook is self-contained and does not rely on variables from previous notebooks.
Markdown explanations are written for patient beginners, while the code remains close to the real algorithmic structure.

Papers and Systems

The course connects implementation details to influential papers and production systems:

Paper or system	Concepts covered
Attention Is All You Need	Multi-Head Attention, position encoding
BERT	Encoder-only models, masked language modeling
LLaMA	RMSNorm, SwiGLU, RoPE, Pre-Norm
DeepSeek-V2 / DeepSeek-V3	MLA, Multi-Token Prediction, aux-free MoE load balancing
Mixtral / Qwen3	Sliding Window Attention, MoE with shared experts
Scaling Laws / Chinchilla	Parameter, data, and compute trade-offs
LoRA	Low-rank adaptation
RLHF / PPO / DPO	Preference alignment
Code Llama / DeepSeek-Coder	Fill-in-the-Middle (FIM)
FlashAttention / vLLM	Inference acceleration and memory management
Speculative Decoding	Draft-then-verify generation
RoPE / YaRN	Long-context extrapolation
Chain-of-Thought	Reasoning traces and Self-Consistency
Flamingo / LLaVA	Vision-language models
Knowledge Distillation / OPD	Compression and distillation

Repository Structure

modern-llm-notebook/
├── notebooks/           # Chinese source notebooks
│   ├── part1-foundation/
│   ├── part2-training/
│   ├── part3-inference/
│   ├── part4-frontiers/
│   └── part5-production/
├── notebooks-en/        # English mirror notebooks
│   ├── part1-foundation/
│   ├── part2-training/
│   ├── part3-inference/
│   ├── part4-frontiers/
│   └── part5-production/
├── external/            # Upstream references (e.g. karpathy nanoGPT/minGPT)
├── karpathy_models.py   # Thin import wrapper used by a few notebooks
├── web/                 # React / Vite web viewer
├── docs/                # Static site build output
├── scripts/             # Notebook conversion and execution scripts
├── requirements.txt
├── package.json
├── README.md
└── README-CN.md

Contributing

Contributions are welcome when they improve clarity, correctness, or coverage.

Good contributions include:

Fixing incorrect explanations, broken cells, or outdated APIs.
Improving hand-calculation sections and visualizations.
Adding focused exercises with assertions.
Translating or improving bilingual documentation.
Proposing new notebooks for important model architectures or training methods.

Please read CONTRIBUTING.md before opening a pull request.

Citation

If Modern LLM Notebook helps your research or work, please cite:

@misc{modern-llm-notebook,
  title   = {Modern LLM Notebook: Build Modern LLMs from Scratch},
  author  = {WalkingLabs},
  year    = {2025},
  url     = {https://github.com/walkinglabs/modern-llm-notebook},
  note    = {GitHub repository, accessed 2026}
}

License

This project is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Built for engineers who want to understand LLM systems from the inside.

Maintained by walkinglabs.