README_EN.md
December 23, 2025
Description
100+ diagrams covering LLMs, VLMs, RL / RLHF / GRPO / DPO / SFT / distillation, RAG and performance tuning.
Inspired by the book Large Model Algorithms: Reinforcement Learning, Fine-Tuning, and Alignment, and continually expanded.
Click Star ⭐ to follow for updates.
Click any image for a high-res view, or open the .svg files for infinite zoom.
Table of Contents
- Overall Architecture of Large Model Algorithms (Focusing on LLMs and VLMs)
- 【LLM basics】LLM overview
- 【LLM basics】LLM structure
- 【LLM basics】LLM generation and decoding
- 【LLM basics】LLM Input
- 【LLM basics】LLM output
- 【LLM basics】MLLM and VLM
- 【LLM basics】LLM training process
- 【SFT】Categories of fine-tuning techniques
- 【SFT】LoRA(1 of 2)
- 【SFT】LoRA(2 of 2)
- 【SFT】Prefix-Tuning
- 【SFT】Token ID and Token
- 【SFT】Loss of SFT(cross-entropy)
- 【SFT】Packing of multiple pieces of sample
- 【DPO】RLHF vs DPO
- 【DPO】DPO(Direct Preference Optimization)
- 【DPO】Overview of DPO training
- 【DPO】Impact of the β parameter on DPO
- 【DPO】Effect of implicit reward differences on the magnitude of parameter updates
- 【Optimization without training】Comparison of CoT and traditional Q&A
- 【Optimization without training】CoT, Self-consistency CoT, ToT, GoT [87]
- 【Optimization without training】Exhaustive Search
- 【Optimization without training】Greedy Search
- 【Optimization without training】Beam Search
- 【Optimization without training】Multinomial Sampling
- 【Optimization without training】Top-K Sampling
- 【Optimization without training】Top-P Sampling
- 【Optimization without training】RAG(Retrieval-Augmented Generation)
- 【Optimization without training】Function Calling
- 【RL basics】History of RL
- 【RL basics】Three major machine learning paradigms
- 【RL basics】Basic architecture of RL
- 【RL basics】Fundamental Concepts of RL
- 【RL basics】Markov Chain vs MDP
- 【RL basics】Using dynamic ε values under the ε-greedy strategy
- 【RL basics】Comparison of RL training paradigms
- 【RL basics】Classification of RL
- 【RL basics】Return(cumulative reward)
- 【RL basics】Backwards iteration and computation of return G
- 【RL basics】Reward, Return, and Value
- 【RL basics】Qπ and Vπ
- 【RL basics】Estimate the value through Monte Carlo(MC)
- 【RL basics】TD target and TD error
- 【RL basics】TD(0), n-step TD, and MC
- 【RL basics】Characteristics of MC and TD methods
- 【RL basics】MC, TD, DP, and exhaustive search [32]
- 【RL basics】DQN model with two input-output structures
- 【RL basics】How to use DQN
- 【RL basics】DQN's overestimation problem
- 【RL basics】Value-Based vs Policy-Based
- 【RL basics】Policy gradient
- 【RL basics】Multi-agent reinforcement learning(MARL)
- 【RL basics】Multi-agent DDPG [41]
- 【RL basics】Imitation learning(IL)
- 【RL basics】Behavior cloning(BC)
- 【RL basics】Inverse RL(IRL) and RL
- 【RL basics】Model-Based and Model-Free
- 【RL basics】Feudal RL
- 【RL basics】Distributional RL
- 【Policy Optimization & Variants】Actor-Critic
- 【Policy Optimization & Variants】Comparison of baseline and advantage
- 【Policy Optimization & Variants】GAE(Generalized Advantage Estimation)
- 【Policy Optimization & Variants】TRPO and its trust region
- 【Policy Optimization & Variants】Importance sampling
- 【Policy Optimization & Variants】PPO-Clip
- 【Policy Optimization & Variants】Policy model update process in PPO training
- 【Policy Optimization & Variants】PPO Pseudocode
- 【Policy Optimization & Variants】GRPO & PPO [72]
- 【Policy Optimization & Variants】Deterministic policy vs. Stochastic policy
- 【Policy Optimization & Variants】DPG
- 【Policy Optimization & Variants】DDPG (Deep Deterministic Policy Gradient)
- 【RLHF and RLAIF】RL modeling of language models
- 【RLHF and RLAIF】Two-stage training process of RLHF
- 【RLHF and RLAIF】Structure of the reward model
- 【RLHF and RLAIF】Input and output of the reward model
- 【RLHF and RLAIF】Reward deviation and loss
- 【RLHF and RLAIF】Training of the reward model
- 【RLHF and RLAIF】Relationship between the four models in PPO
- 【RLHF and RLAIF】The structure and init of the four models in PPO
- 【RLHF and RLAIF】A value model with a dual-head structure
- 【RLHF and RLAIF】Four models can share one base in RLHF
- 【RLHF and RLAIF】Inputs and Outputs of Each Model in PPO
- 【RLHF and RLAIF】The Process of Calculating KL in PPO
- 【RLHF and RLAIF】RLHF Training Based on PPO
- 【RLHF and RLAIF】Rejection Sampling Fine-tuning
- 【RLHF and RLAIF】RLAIF vs RLHF
- 【RLHF and RLAIF】CAI(Constitutional AI)
- 【RLHF and RLAIF】OpenAI RBR(Rule-Based Reward)
- 【Reasoning capacity optimization】Knowledge Distillation Based on CoT
- 【Reasoning capacity optimization】Distillation Based on DeepSeek
- 【Reasoning capacity optimization】ORM(Outcome Reward Model) & PRM (Process Reward Model)
- 【Reasoning capacity optimization】Four Key Steps of Each MCTS
- 【Reasoning capacity optimization】MCTS
- 【Reasoning capacity optimization】Search Tree Example in a Linguistic Context
- 【Reasoning capacity optimization】BoN(Best-of-N) Sampling
- 【Reasoning capacity optimization】Majority Vote
- 【Reasoning capacity optimization】Performance Growth of AlphaGo Zero [179]
- 【LLM basics extended】Performance Optimization Map for Large Models
- 【LLM basics extended】ALiBi positional encoding
- 【LLM basics extended】Traditional knowledge distillation
- 【LLM basics extended】Numerical representation, quantization
- 【LLM basics extended】Forward and backward
- 【LLM basics extended】Gradient Accumulation
- 【LLM basics extended】Gradient Checkpoint(gradient recomputation)
- 【LLM basics extended】Full recomputation
- 【LLM basics extended】LLM Benchmark
- 【LLM basics extended】MHA, GQA, MQA, MLA
- 【LLM basics extended】RNN(Recurrent Neural Network)
- 【LLM basics extended】Pre-norm vs Post-norm
- 【LLM basics extended】BatchNorm & LayerNorm
- 【LLM basics extended】RMSNorm
- 【LLM basics extended】Prune
- 【LLM basics extended】Role of the temperature coefficient
- 【LLM basics extended】SwiGLU
- 【LLM basics extended】AUC, PR, F1, Precision, Recall
- 【LLM basics extended】RoPE positional encoding
- 【LLM basics extended】The effect of RoPE on each sequence position and each dim
- For Reference Section
- BibTeX Citation Format
Overall Architecture of Large Model Algorithms (Focusing on LLMs and VLMs)
【LLM basics】LLM overview
- This is the culmination of dozens of hours of dedicated effort; clicking the Star ⭐ at the top right of this repository is my greatest encouragement!
- LLMs mainly come in two forms: Decoder-Only or MoE (Mixture of Experts). The overall architectures are similar; the main difference is that MoE introduces multiple expert networks into the FFN (Feed-Forward Network) component.
【LLM basics】LLM structure
- LLMs mainly come in two forms: Decoder-Only or MoE (Mixture of Experts). The overall architectures are similar; the key difference is that MoE introduces multiple expert networks into the FFN (Feed-Forward Network) component.
- A typical LLM architecture can be divided into three parts: the input layer, the multi-layer stacked Decoder structure, and the output layer (including the language model head and the decoding module).
【LLM basics】LLM generation and decoding
- Decoding strategies are the core factors that determine the fluency, diversity, and overall performance of the final output text. Common decoding algorithms include: Greedy Search, Beam Search and its variants, Multinomial Sampling, Top-K Sampling, Top-P Sampling, Contrastive Search, Speculative Decoding, Lookahead Decoding, DoLa Decoding, and others.
- The output layer of an LLM is responsible for applying a decoding algorithm to the probability distribution to determine the final predicted next token(s).
- Based on the probability distribution, the decoding strategy (e.g., random sampling or selecting the highest probability) is applied to choose the next token. For example, under Greedy Search, the token "I'm" with the highest probability would be selected.
- Each token generation requires passing through all layers of the Transformer structure again.
- This diagram shows one-token-at-a-time prediction. There are also multi-token prediction schemes; see Chapter 4 of Large Model Algorithms: Reinforcement Learning, Fine-Tuning, and Alignment for details.
【LLM basics】LLM Input
- The input layer of an LLM converts input text into a multi-dimensional numerical tensor for processing by the model's main structure.

【LLM basics】LLM output
The output layer of an LLM predicts the next token (text) based on the hidden states (a multi-dimensional tensor). The process is as follows:
- (1) Input hidden states: The hidden states from the final Decoder layer serve as input to the LLM's output layer. For example, a 3×896 tensor containing all semantic information of the prefix sequence.
- (2) Language Model Head (LM Head): Typically a fully connected layer that converts hidden states to logits (calculating only the last position's logits during inference). For example, producing a 3×151936 matrix of scores for each vocabulary token.
- (3) Extract last position logits: Next-token prediction depends only on the logits at the last position, so we extract the final row from the logits matrix, yielding a 151936-dimensional vector [2.0, 3.1, -1.7, …, -1.7].
- (4) Convert to probability distribution (Softmax): Apply Softmax to the logits to obtain probabilities for each vocabulary token. For example, a 151936-dimensional vector [0.01, 0.03, 0.001, …, 0.001], summing to 1. A higher probability indicates a higher chance of being chosen as the next token (e.g., "I'm" has p=0.34).
- (5) Decoding: Apply the decoding strategy (e.g., random sampling or choosing the maximum) to the probability distribution to determine the next token. Under Greedy Search, select the token with the highest probability, such as "I'm".
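
Steps (3) to (5) above can be sketched in a few lines of plain Python. This is a toy illustration only: the 4-token vocabulary, the 3×4 logits matrix, and the helper names `softmax` and `greedy_next_token` are assumptions made for the example, not part of any real model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_next_token(logits_matrix, vocab):
    """Take the last row of logits, apply softmax, and pick the argmax."""
    last_logits = logits_matrix[-1]                         # step (3)
    probs = softmax(last_logits)                            # step (4)
    best = max(range(len(probs)), key=probs.__getitem__)    # step (5), greedy
    return vocab[best], probs[best]

# Hypothetical toy setup: 3 positions, vocabulary of 4 tokens.
vocab = ["I'm", "fine", "the", "."]
logits = [[0.1, 0.2, 0.3, 0.4],
          [1.0, 0.0, -1.0, 0.5],
          [2.0, 1.0, -1.7, 0.3]]
token, p = greedy_next_token(logits, vocab)
print(token, round(p, 2))
```

Replacing the argmax with a random draw from `probs` would turn this into sampling-based decoding.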
【LLM basics】MLLM and VLM
Depending on their focus, multimodal models are often referred to by various names:
- VLM (Vision-Language Model)
- MLLM (Multimodal Large Language Model)
- VLA (Vision-Language-Action Model)
【LLM basics】LLM training process
- Training large models involves two main stages: Pre-Training and Post-Training. Each stage uses different data, paradigms (algorithms), objectives, and hyperparameters.
- Pre-Training includes early training (short-context on massive data), mid-training (long-text/long-context), and Annealing. This stage is self-supervised, uses the most data, and is the most compute-intensive.
- Post-Training encompasses various fine-tuning paradigms, including but not limited to SFT (Supervised Fine-Tuning), Distillation, RSFT (Rejection Sampling Fine-Tuning), RLHF (Reinforcement Learning from Human Feedback), DPO (Direct Preference Optimization), and other RL methods like GRPO and PPO. Some steps, like RSFT, can iterate multiple times.
【SFT】Categories of fine-tuning techniques
- There are many fine-tuning techniques for SFT, as shown in the diagram: the first two methods only require fine-tuning the pretrained model body (low development cost), while Parallel Low-Rank Fine-Tuning and Adapter Tuning introduce additional modules and are more complex. All these modify model parameters; prompt-based tuning instead fine-tunes the input.
【SFT】LoRA(1 of 2)
- LoRA (Low-Rank Adaptation) was introduced by Microsoft Research in 2021. Its efficient fine-tuning and strong performance have made it widely adopted. The core idea is that the parameter difference ΔW before and after fine-tuning is low-rank.
- A low-rank matrix contains redundancy; decomposing it into smaller matrices preserves most useful information. For example, a 1024×1024 matrix can be approximated by a 1024×2 and a 2×1024 matrix product, reducing parameters to ~0.4%.
【SFT】LoRA(2 of 2)
- Initialization of A and B:
- A is randomly initialized (e.g., Kaiming initialization);
- B is zero-initialized or uses very small random values.
- The goal is to ensure the inserted LoRA module does not overly perturb model outputs at the start of training.
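
A minimal pure-Python sketch of the two LoRA points made here: the parameter saving of a rank-2 factorization, and why zero-initializing B leaves the output unchanged at the start of training. The helper names (`lora_param_ratio`, `init_lora`, `lora_forward`), the small Gaussian used as a stand-in for Kaiming initialization, and the `scale` argument are assumptions for illustration.

```python
import random

def matmul(X, Y):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_param_ratio(d_out, d_in, r):
    """Parameters of B (d_out x r) plus A (r x d_in), relative to a full d_out x d_in update."""
    return (d_out * r + r * d_in) / (d_out * d_in)

def init_lora(d_out, d_in, r, seed=0):
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(r)]  # random init (stand-in)
    B = [[0.0] * r for _ in range(d_out)]                                # zero init
    return A, B

def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x), with x given as a flat list."""
    col = [[v] for v in x]
    base = matmul(W, col)
    delta = matmul(B, matmul(A, col))
    return [b[0] + scale * d[0] for b, d in zip(base, delta)]

print(f"{lora_param_ratio(1024, 1024, 2):.4%}")   # share of a full 1024x1024 update

W = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
A, B = init_lora(d_out=3, d_in=4, r=2)
print(lora_forward(W, A, B, [1.0, 2.0, 3.0, 4.0]))  # equals W x because B is all zeros
```

Because B starts at zero, the product B·A is zero regardless of A, so training begins from the pretrained model's exact behavior.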
【SFT】Prefix-Tuning
- Prefix-Tuning, proposed by Stanford researchers, offers lightweight fine-tuning by inserting a trainable sequence of vectors (the "prefix") at the start of input. These vectors act as virtual tokens in subsequent Transformer attention.
【SFT】Token ID and Token
- For the example data (preprocessed in ChatML format), tokenization produces 33 tokens at 33 positions.
- Each Token ID maps one-to-one with a token.

【SFT】Loss of SFT(cross-entropy)
【SFT】Packing of multiple pieces of sample
- Training uses fixed-length input. Short sequences are padded, wasting compute.
- Packing concatenates multiple samples into one fixed-length sequence, resetting position IDs and attention masks to keep samples independent.
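
The packing idea can be sketched as follows. This is a simplified illustration under stated assumptions: the function names, the `pad_id=0` choice, and the use of segment ids to build the mask are hypothetical, and a real packer would open a new sequence instead of stopping when a sample no longer fits.

```python
def pack_samples(samples, max_len, pad_id=0):
    """Concatenate token-id lists into one sequence of length max_len.

    Returns (input_ids, position_ids, segment_ids); segment 0 marks padding.
    """
    input_ids, position_ids, segment_ids = [], [], []
    for seg, sample in enumerate(samples, start=1):
        if len(input_ids) + len(sample) > max_len:
            break  # sketch only: stop when the next sample does not fit
        input_ids += sample
        position_ids += list(range(len(sample)))  # positions restart at 0 per sample
        segment_ids += [seg] * len(sample)
    pad = max_len - len(input_ids)
    return (input_ids + [pad_id] * pad,
            position_ids + [0] * pad,
            segment_ids + [0] * pad)

def packed_attention_mask(segment_ids):
    """mask[i][j] is True when token i may attend to token j:
    same sample, causal (j <= i), and not padding."""
    n = len(segment_ids)
    return [[segment_ids[i] != 0 and segment_ids[i] == segment_ids[j] and j <= i
             for j in range(n)] for i in range(n)]

ids, pos, seg = pack_samples([[11, 12, 13], [21, 22]], max_len=6)
mask = packed_attention_mask(seg)
print(ids, pos, seg)
```

The block-diagonal causal mask is what keeps the second sample from attending to tokens of the first.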
【DPO】RLHF vs DPO
Unlike RLHF, DPO simplifies alignment via supervised learning:
- Streamlined: DPO directly optimizes the policy model without training a reward model, and uses only the provided preference data, so no sampling is required.
- Stability: As a supervised method, DPO avoids RL's instability.
- Low overhead: Only one model needs to be loaded (the policy model), because reference model outputs can be precomputed.
【DPO】DPO(Direct Preference Optimization)
- DPO, introduced by Stanford et al. in 2023, is a preference-optimization algorithm for LLM/VLM alignment.
- It greatly simplifies PPO-based RLHF by skipping reward model training and directly optimizing the policy model, hence "Direct".
- Two models are used:
- Policy model: initialized as a copy of the SFT model.
- Reference model: also copied from SFT (or a stronger model), with attention to KL-distance and data distribution.
【DPO】Overview of DPO training
- You may load two models (policy and reference) or just one (policy). This overview illustrates loading both. Blue blocks denote the "good response" and its intermediate results; pink blocks, the "bad response" and its results.
【DPO】Impact of the β parameter on DPO
- In DPO, β plays a role similar to its use in RLHF.
【DPO】Effect of implicit reward differences on the magnitude of parameter updates
- DPO's gradient update increases the probability of good responses and reduces that of bad ones. The gradient includes a dynamic coefficient reflecting the implicit reward difference, i.e., how much the implicit "reward model" deviates in judging preferences.
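
To make the role of β and of the dynamic coefficient concrete, here is a small sketch of the per-pair DPO loss computed from sequence log-probabilities. The function name, argument names, and example numbers are assumptions for illustration; the formulas follow the standard DPO objective, where the implicit reward is β times the log-ratio between policy and reference.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss_and_weight(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Return (loss, weight) for one preference pair.

    logp_w / logp_l: policy log-probs of the good (chosen) and bad (rejected) response.
    ref_logp_*:      the same quantities under the frozen reference model.
    """
    reward_w = beta * (logp_w - ref_logp_w)   # implicit reward of the good response
    reward_l = beta * (logp_l - ref_logp_l)   # implicit reward of the bad response
    margin = reward_w - reward_l
    loss = -math.log(sigmoid(margin))
    weight = sigmoid(-margin)                 # dynamic coefficient on the gradient
    return loss, weight

# The weight is large when the implicit reward ranks the pair the wrong way round.
print(dpo_loss_and_weight(-8.0, -4.0, -6.0, -6.0, beta=0.5))  # policy prefers the bad response
print(dpo_loss_and_weight(-4.0, -8.0, -6.0, -6.0, beta=0.5))  # policy prefers the good response
```

A larger β amplifies the same log-ratio difference, so deviations from the reference model are penalized more strongly.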
【Optimization without training】Comparison of CoT and traditional Q&A
- CoT (Chain of Thought), introduced by Jason Wei et al. at Google in 2022, is a major innovation that explicitly breaks down reasoning steps to improve performance on complex tasks.
【Optimization without training】CoT, Self-consistency CoT, ToT, GoT [87]
- After CoT's success, many variants emerged: ToT, GoT, Self-consistency CoT, Zero-shot-CoT, Auto-CoT, MoT, XoT, etc.
【Optimization without training】Exhaustive Search
- Token generation can be viewed as a V=10⁵-ary tree. Exhaustive Search finds the global optimum but is computationally prohibitive.

【Optimization without training】Greedy Search
- Greedy Search selects the current highest-probability token at each step, ignoring global optimality and diversity, so it can get stuck in local optima and produces little variety.
【Optimization without training】Beam Search
- Beam Search keeps multiple candidate sequences ("beams") at each step and prunes the rest. The final output is the highest-scoring beam. A larger beam count gets closer to the global optimum, at a higher cost.
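
The difference between greedy and beam decoding can be seen on a tiny hand-made next-token table. Everything in this sketch is hypothetical: the `TOY` table, the token names, and the `beam_search` helper exist only to illustrate the pruning step; scores are summed log-probabilities.

```python
import math

# Hypothetical next-token distributions, keyed by the last token of the prefix.
TOY = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.5, "y": 0.5},
    "B": {"x": 0.9, "y": 0.1},
}

def toy_next_probs(seq):
    return TOY[seq[-1]]

def beam_search(next_probs, num_beams, steps, start="<s>"):
    """Keep the num_beams best prefixes at each step; return the best (sequence, log-prob)."""
    beams = [((start,), 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, p in next_probs(seq).items():
                candidates.append((seq + (tok,), score + math.log(p)))
        # Prune: keep only the highest-scoring candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0]

greedy_seq, greedy_score = beam_search(toy_next_probs, num_beams=1, steps=2)
beam_seq, beam_score = beam_search(toy_next_probs, num_beams=2, steps=2)
print(greedy_seq, math.exp(greedy_score))
print(beam_seq, math.exp(beam_score))
```

With one beam the search commits to "A" early; with two beams it also keeps "B" alive and finds the more probable overall sequence.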
【Optimization without training】Multinomial Sampling
- Multinomial Sampling randomly draws tokens according to the model's predicted distribution rather than sampling uniformly. Variants include Top-K and Top-P sampling.
【Optimization without training】Top-K Sampling
- Top-K Sampling limits the candidate pool to the top K tokens by probability, then samples from them.
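
A minimal sketch of Top-K filtering followed by sampling, using only the standard library. The function names and the example distribution are assumptions for illustration.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize; all others get probability 0."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def sample_token(probs, rng=random):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

filtered = top_k_filter([0.5, 0.3, 0.15, 0.05], k=2)
print(filtered)                 # only the two most likely tokens remain
print(sample_token(filtered))   # index 0 or 1
```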

【Optimization without training】Top-P Sampling
- Top-P Sampling (Nucleus Sampling) dynamically selects the smallest set of tokens whose cumulative probability ≥ P, then samples from that set.
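
The nucleus selection rule can be sketched as below; the function name `top_p_filter` and the example distribution are assumptions for illustration. Unlike Top-K, the number of surviving tokens depends on how peaked the distribution is.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most-probable tokens whose cumulative probability >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

probs = [0.5, 0.3, 0.15, 0.05]
print(top_p_filter(probs, 0.7))   # two tokens survive
print(top_p_filter(probs, 0.9))   # three tokens survive
```

Sampling then proceeds from the renormalized distribution, for example with `random.choices`.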

【Optimization without training】RAG(Retrieval-Augmented Generation)
- RAG integrates external knowledge via retrieval to enhance generative models. Proposed by Meta AI in 2020, it boosts performance on knowledge-intensive tasks. The workflow has offline index building and online serving components.
【Optimization without training】Function Calling
- Function Calling (Tool Use) lets an LLM agent invoke external APIs, database queries, local functions, or plugins. The agent parses user requests, handles parameters, calls tools, then feeds results back into the model.
【RL basics】History of RL
- RL dates back to the 1950s, with key contributions by Richard S. Sutton and others.
- Since 2012, deep learning has spurred high-profile RL applications.
- In November 2022, ChatGPT (trained with RLHF) launched.
- In December 2024, OpenAI released the more deeply RL-optimized o1 model, driving industry interest in RL for large models.
【RL basics】Three major machine learning paradigms
- The three paradigms are: Unsupervised Learning, Supervised Learning, and Reinforcement Learning.
【RL basics】Basic architecture of RL
- RL involves two core roles: the Agent and the Environment.
- The Agent perceives the state and selects actions via its policy.
- The Environment updates state and returns rewards.

【RL basics】Fundamental Concepts of RL
- Example: an AGI travel company uses self-driving cars. The Agent learns optimal routes via iterative trips and accumulated passenger ratings (rewards), optimizing the travel experience.
【RL basics】Markov Chain vs MDP
- Markov Chains are extended to Markov Decision Processes by adding Actions (A) and Rewards (R).
【RL basics】Using dynamic ε values under the ε-greedy strategy
- Epsilon-Greedy uses ε to balance exploration and exploitation. Start with a high ε to explore, then decay ε to exploit learned knowledge, improving training efficiency and the final policy.
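
One possible way to implement a decaying ε is sketched below. The linear schedule, the default values, and the function names are assumptions chosen for illustration; exponential decay is an equally common choice.

```python
import random

def epsilon_at(step, eps_start=1.0, eps_end=0.05, decay_steps=1000):
    """Linearly decay epsilon from eps_start to eps_end over decay_steps steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit

for step in (0, 500, 1000, 5000):
    print(step, round(epsilon_at(step), 3))
print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))  # always the greedy action when epsilon is 0
```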
【RL basics】Comparison of RL training paradigms
- On-policy vs. Off-policy vs. Online RL vs. Offline RL.
- On-policy: behavior and target policies are the same (e.g., SARSA).
- Off-policy: behavior and target differ (e.g., Q-learning, DQN).
- Online RL: continuous environment interaction and data collection.
- Offline RL: training solely on a fixed dataset without environment interaction.

【RL basics】Classification of RL
- RL algorithms are categorized by various dimensions and often combined with SL, CL, IL, GANs, etc., spawning many hybrid methods.

【RL basics】Return(cumulative reward)
- Return (G) is the (discounted) sum of future rewards from a given time step along a trajectory; it measures the total reward actually accumulated from that step onward.
【RL basics】Backwards iteration and computation of return G
- Using discount factor γ, compute returns backwards from the end of an episode.
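
The backward recursion can be written as G_t = r_t + γ·G_{t+1}, starting from G = 0 after the final step. The sketch below uses the convention that `rewards[t]` is the reward received at step t; the function name and the example numbers are assumptions for illustration.

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute G_t for every step by iterating backwards over one episode."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g   # G_t = r_t + gamma * G_{t+1}
        returns[t] = g
    return returns

print(discounted_returns([1.0, 2.0, 3.0], gamma=0.5))
```

Iterating backwards costs one pass over the episode, whereas summing forward from every step would repeat work.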
【RL basics】Reward, Return, and Value
- Reward: immediate, local gain.
- Return: total future gain.
- Value: expected return over all trajectories, weighted by their probability (assuming γ=1 in this illustration).

【RL basics】Qπ and Vπ
- Action-Value Function Qπ(s,a): expected return after taking action a in state s under policy π.
- State-Value Function Vπ(s): expected return from state s under π.

【RL basics】Estimate the value through Monte Carlo(MC)
- Monte Carlo methods estimate value functions by sampling complete episodes, suitable when the environment model is unknown.
【RL basics】TD target and TD error
- TD Target uses the next reward and next state value; TD Error measures the difference between current value estimate and TD target, guiding updates.
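
A tabular TD(0) update makes the two quantities explicit. This is a minimal sketch: the dictionary-based value table, the state names, and the `td_update` signature are assumptions for illustration.

```python
def td_update(v, s, r, s_next, alpha=0.1, gamma=0.9, done=False):
    """One TD(0) update of the value table v (a dict: state -> value).

    Returns (td_target, td_error).
    """
    td_target = r + (0.0 if done else gamma * v[s_next])  # reward + discounted next value
    td_error = td_target - v[s]                           # how far the estimate is off
    v[s] += alpha * td_error                              # move the estimate toward the target
    return td_target, td_error

values = {"A": 0.0, "B": 1.0}
target, error = td_update(values, "A", r=1.0, s_next="B", alpha=0.5, gamma=0.5)
print(target, error, values["A"])
```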
【RL basics】TD(0), n-step TD, and MC
- When n=1, multi-step TD reduces to TD(0); as n→∞, it approaches Monte Carlo.
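
The relationship between the three methods shows up directly in the n-step target. In this sketch, `rewards[k]` is the reward at step k and `values` has one extra entry for the terminal state (set to 0); these conventions and the function name are assumptions for illustration.

```python
def n_step_target(rewards, values, t, n, gamma=0.9):
    """n-step TD target from step t: n discounted rewards plus a bootstrapped value."""
    T = len(rewards)
    end = min(t + n, T)  # never look past the end of the episode
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    return g + gamma ** (end - t) * values[end]

rewards = [1.0, 2.0, 3.0]
values = [0.5, 1.0, 2.0, 0.0]   # values[3] is the terminal state
print(n_step_target(rewards, values, t=0, n=1, gamma=0.5))   # TD(0) target
print(n_step_target(rewards, values, t=0, n=2, gamma=0.5))   # 2-step target
print(n_step_target(rewards, values, t=0, n=10, gamma=0.5))  # full return (Monte Carlo)
```

Once n reaches the remaining episode length, the bootstrapped term vanishes and the target equals the Monte Carlo return.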
【RL basics】Characteristics of MC and TD methods
- MC: low bias, high variance; TD: high bias, low variance.
【RL basics】MC, TD, DP, and exhaustive search [32]
- Four approaches to value estimation and policy optimization: Monte Carlo, Temporal Difference, Dynamic Programming, and Brute-Force Search.
【RL basics】DQN model with two input-output structures
DQN has two IO variants:
- Input: state and a candidate action → output: its Q-value (batch compute for multiple actions possible).
- Input: state → output: Q-values for all actions; choose the max.
【RL basics】How to use DQN
- After training, deploy DQN online to aid decisions. At inference, pick the action with the highest Q-value as determined by the model.
【RL basics】DQN's overestimation problem
- Two core issues of DQN:
- (1) Overestimation of Q-values due to max operations accumulating error unevenly across actions.
- (2) Bootstrapping ("dog chases its tail") where the target depends on the same network weights, causing training instability and convergence difficulties.
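
Issue (1) can be reproduced without any neural network: taking the max over noisy estimates is biased upward even when the noise itself has zero mean. The simulation below is an illustrative assumption (Gaussian noise, hypothetical function name), not a DQN implementation.

```python
import random

def mean_max_of_noisy_estimates(true_q, noise_std=1.0, trials=5000, seed=0):
    """Average of max_a (Q(a) + noise) over many trials with zero-mean Gaussian noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(q + rng.gauss(0.0, noise_std) for q in true_q)
    return total / trials

true_q = [0.0] * 5   # every action is truly worth 0, so the true max is 0
print(mean_max_of_noisy_estimates(true_q, noise_std=1.0))  # clearly above 0
print(mean_max_of_noisy_estimates(true_q, noise_std=0.0))  # no noise, no bias
```

Because the TD target reuses this max at every update, the upward bias can propagate through bootstrapping.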
【RL basics】Value-Based vs Policy-Based
- Value-Based: estimate value functions (V or Q) → derive policy.
- Policy-Based: directly parameterize and optimize policy π.
- Actor-Critic: combines both approaches, learning policy (Actor) and value (Critic) together.

【RL basics】Policy gradient
- Policy Gradients underpin many RL algorithms (PPO, GRPO, DPG, Actor-Critic variants). Sutton et al. formalized the Policy Gradient Theorem. Unlike value-based methods, policy-based methods optimize π directly via gradient ascent.
【RL basics】Multi-agent reinforcement learning(MARL)
- MARL studies multiple agents learning to cooperate or compete in a shared environment (e.g., AlphaGo, AlphaStar, OpenAI Five).
【RL basics】Multi-agent DDPG [41]
- MADDPG (Multi-Agent DDPG), introduced by OpenAI in 2017, uses N actor networks (π₁, …, πₙ) and N critic networks (Q₁, …, Qₙ). Each critic takes all agents' actions and observations and outputs the Q-value for its own agent.
【RL basics】Imitation learning(IL)
- Imitation Learning learns policies by observing and mimicking experts, without explicit reward functions. Main approaches:
- Behavioral Cloning (BC): supervised learning on state-action pairs.
- Inverse Reinforcement Learning (IRL): infer reward function from expert behavior, then learn policy.
- Generative Adversarial Imitation Learning (GAIL): adversarially train policy against expert.
【RL basics】Behavior cloning(BC)
- Behavioral Cloning treats imitation as regression/classification: input state → predict expert action. It minimizes the difference between predicted and expert actions.
【RL basics】Inverse RL(IRL) and RL
- IRL infers the underlying reward function from expert behavior, then learns the optimal policy.
- Andrew Y. Ng and Stuart Russell formalized IRL in their 2000 paper Algorithms for Inverse Reinforcement Learning.

【RL basics】Model-Based and Model-Free
- Model-Based: uses an environment model for planning.
- Model-Free: learns value or policy directly from interaction.
【RL basics】Feudal RL
- Hierarchical RL decomposes tasks into sub-tasks or sub-policies. Feudal RL and MAXQ are classic examples. Feudal RL was proposed by Peter Dayan and Geoffrey E. Hinton; Hinton won the 2024 Nobel Prize in Physics.
【RL basics】Distributional RL
- Distributional RL models the distribution of returns rather than just the expectation, capturing richer uncertainty information for policy optimization.
【Policy Optimization & Variants】Actor-Critic
- The Actor-Critic architecture combines a policy model (Actor) and a value model (Critic). Algorithms like PPO, DPG, DDPG, TD3 are based on this.
【Policy Optimization & Variants】Comparison of baseline and advantage
- A2C introduces a baseline (state value V(s)) and constructs the Advantage Function A(s,a) = Q(s,a) - V(s), reducing variance.
【Policy Optimization & Variants】GAE(Generalized Advantage Estimation)
- GAE (Generalized Advantage Estimation) was proposed by John Schulman et al. and is a key component of algorithms like PPO.
- It leverages the TD(λ) idea to balance bias and variance by tuning the λ parameter.
- Computation is typically performed recursively over time steps.
- The core implementation is shown below:

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lambda_=0.95):
    """
    Compute Generalized Advantage Estimation (GAE).

    Parameters:
        rewards (list or np.ndarray): The rewards collected at each time step, shape (T,)
        values (list or np.ndarray): The value estimates for each state V, shape (T+1,)
        gamma (float): Discount factor γ
        lambda_ (float): Decay parameter λ for GAE
    Returns:
        np.ndarray: Advantage estimates A, shape (T,). For example, for T=5, A = [A0, A1, A2, A3, A4]
    """
    T = len(rewards)          # e.g. T = 5; the last step is t = T-1
    advantages = np.zeros(T)  # e.g. A = [A0, A1, A2, A3, A4]
    gae = 0.0
    # Iterate backwards from t = T-1 to t = 0
    for t in reversed(range(T)):
        # δ_t = r_t + γ * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = δ_t + γ * λ * A_{t+1}
        gae = delta + gamma * lambda_ * gae
        advantages[t] = gae
    return advantages
```
【Policy Optimization & Variants】TRPO and its trust region
- TRPO (Trust Region Policy Optimization) is the predecessor to PPO.
- It improves policy gradient methods by introducing a trust region constraint and importance sampling.
- The core idea is to maximize the objective J(θ) while limiting the divergence between the new and old policies.

【Policy Optimization & Variants】Importance sampling
- Importance Sampling corrects distribution mismatch between old and new policies, enabling reuse of old data to optimize the new policy.
- It samples from an auxiliary distribution and applies importance weights to improve estimation efficiency.
- It requires that if p(x)>0 then p′(x)>0, ensuring no probability mass is lost during reweighting.
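
A small discrete example of the reweighting: samples are drawn from the auxiliary distribution p′, and each is weighted by p(x)/p′(x) to estimate an expectation under p. The two distributions, the function `f`, and the helper name are assumptions made for this sketch.

```python
import random

def importance_sampling_estimate(f, p, p_prime, n_samples=20000, seed=0):
    """Estimate E_p[f(x)] using samples drawn from p_prime (dicts: outcome -> probability)."""
    rng = random.Random(seed)
    outcomes = list(p_prime)
    weights = [p_prime[x] for x in outcomes]
    total = 0.0
    for x in rng.choices(outcomes, weights=weights, k=n_samples):
        total += (p[x] / p_prime[x]) * f(x)   # importance weight times the function value
    return total / n_samples

p = {0: 0.2, 1: 0.8}          # target distribution
p_prime = {0: 0.5, 1: 0.5}    # sampling distribution; covers every outcome where p > 0
estimate = importance_sampling_estimate(lambda x: 10.0 * x, p, p_prime)
exact = sum(p[x] * 10.0 * x for x in p)
print(estimate, exact)
```

If p′ assigned zero probability to an outcome with p(x)>0, that outcome would never be sampled and its contribution would be silently lost, which is why the support condition matters.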
【Policy Optimization & Variants】PPO-Clip
- PPO-Clip refers to PPO with a clipped surrogate objective.
- The goal is to maximize expected future return by optimizing J(θ) with a clipping mechanism.
- The clipping limits the probability ratio r(θ) to [1-ε, 1+ε], preventing overly large policy updates.
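
The clipped surrogate objective can be sketched on plain lists of per-token log-probabilities. The function name, the default ε of 0.2, and the example values are assumptions for illustration.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) with r = exp(logp_new - logp_old)."""
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)

# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * A.
print(ppo_clip_objective([math.log(1.5)], [0.0], [1.0]))
# Ratio 0.5 with a negative advantage: the objective is held at (1 - eps) * A.
print(ppo_clip_objective([math.log(0.5)], [0.0], [-1.0]))
```

In training, the loss is the negative of this objective, so maximizing it pushes the policy only as far as the clip range allows.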
【Policy Optimization & Variants】Policy model update process in PPO training
- PPO training alternates between two phases:
- Sample Collection: generate trajectories with the old policy and store them in a replay buffer.
- Multiple PPO Training Rounds: for each of several PPO epochs, shuffle and split the buffer into mini-batches, then update parameters on each mini-batch (computing clipped policy loss, value loss, and entropy bonus).
- These phases repeat iteratively until training completes.
【Policy Optimization & Variants】PPO Pseudocode

```python
# Abbreviations: R = rewards, V = values, Adv = advantages, J = objective, P = probability
for iteration in range(num_iterations):  # Perform num_iterations training iterations
    # [1/2] Collect samples (prompt, response_old, logP_old, Adv, V_target)
    prompt_batch, response_old_batch = [], []
    logP_old_batch, Adv_batch, V_target_batch = [], [], []
    for _ in range(num_examples):
        logP_old, response_old = actor_model(prompt)
        V_old = critic_model(prompt, response_old)
        R = reward_model(prompt, response_old)[-1]
        logP_ref = ref_model(prompt, response_old)
        # KL penalty. Note: R here is only the reward for the final token
        KL = logP_old - logP_ref
        R_with_KL = R - scale_factor * KL
        # Compute advantage Adv via GAE
        Adv = GAE_Advantage(R_with_KL, V_old, gamma, λ)
        V_target = Adv + V_old
        prompt_batch += prompt
        response_old_batch += response_old
        logP_old_batch += logP_old
        Adv_batch += Adv
        V_target_batch += V_target
    # [2/2] PPO training loop: multiple parameter updates
    for _ in range(ppo_epochs):
        mini_batches = shuffle_split(
            (prompt_batch, response_old_batch, logP_old_batch, Adv_batch, V_target_batch),
            mini_batch_size
        )
        for prompt, response_old, logP_old, Adv, V_target in mini_batches:
            logits, logP_new = actor_model(prompt, response_old)
            V_new = critic_model(prompt, response_old)
            # Probability ratio: ratio(θ) = π_θ(a|s) / π_{θ_old}(a|s)
            ratios = exp(logP_new - logP_old)
            # Compute clipped policy loss
            L_clip = -mean(
                min(ratios * Adv,
                    clip(ratios, 1 - ε, 1 + ε) * Adv)
            )
            S_entropy = mean(compute_entropy(logits))  # Compute policy entropy
            Loss_V = mean((V_new - V_target) ** 2)     # Compute value function loss
            # Total loss
            Loss = L_clip + C1 * Loss_V - C2 * S_entropy
            backward_update(Loss, L_clip, Loss_V)      # Backpropagate and update parameters
```
【Policy Optimization & Variants】GRPO & PPO [72]
- GRPO (Group Relative Policy Optimization) is a policy-based RL algorithm by DeepSeek.
- It removes the separate value network and uses group-relative advantage estimation as a baseline, reducing resource usage while maintaining stability.
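
The group-relative baseline can be sketched as normalizing each response's reward by the mean and standard deviation of its own group (several responses sampled for the same prompt). The function name, the use of the population standard deviation, and the small `eps` for numerical safety are assumptions for illustration.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each response relative to the other responses in its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Rewards of four sampled responses to the same prompt.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# If every response gets the same reward, no response is preferred.
print(group_relative_advantages([2.0, 2.0, 2.0]))
```

Since the baseline comes from the group itself, no separate value network has to be trained or kept in memory.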

【Policy Optimization & Variants】Deterministic policy vs. Stochastic policy
- Reinforcement learning policies can be deterministic or stochastic.
- Deterministic policies output a single action per state.
- Stochastic policies output a probability distribution over actions.
【Policy Optimization & Variants】DPG
- DPG (Deterministic Policy Gradient) was formalized by David Silver et al. in 2014 at DeepMind.
- It uses an actor-critic architecture for continuous action spaces.
【Policy Optimization & Variants】DDPG (Deep Deterministic Policy Gradient)
- DDPG extends DPG by incorporating deep Q-network concepts, requiring four networks (2 actors, 2 critics).
- TD3 (Twin Delayed DDPG) further improves DDPG with six networks (2 actors, 4 critics) and delayed updates for stability.

【RLHF and RLAIF】RL modeling of language models
- To apply RL to LMs, define:
【RLHF and RLAIF】Two-stage training process of RLHF
- RLHF involves two stages:
- Phase 1 (SFT & Reward Model): generate candidate responses and collect human preference labels to train the reward model.
- Phase 2 (RL with PPO): optimize the policy model using PPO, guided by the reward model and constrained by the reference model's KL penalty.
- Phase 1 uses preference pairs (Prompt + Responses); Phase 2 uses only prompts, relying on RL interactions.
【RLHF and RLAIF】Structure of the reward model
- The Reward Model shares the decoder layers of the SFT model and replaces its LM head with a reward head that outputs a single scalar score.
【RLHF and RLAIF】Input and output of the reward model
- Input: Prompt + Response sequence.
- Output: reward scores r0, r1, … for each token of the response.
【RLHF and RLAIF】Reward deviation and loss
- The Reward Model's objective is a contrastive negative log-likelihood loss, fitting human preference scores similarly to DPO's reward modeling.
【RLHF and RLAIF】Training of the reward model
- (1) Compute reward scores for preferred (yw) and less preferred (yl) responses.
- (2) Calculate loss based on score difference.
- (3) Backpropagate to update parameters.
- (4) Iterate over preference samples until alignment with human judgments.
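
Step (2) above is commonly implemented as a pairwise loss of the form -log σ(r_w - r_l), where r_w and r_l are the scores of the preferred and less preferred responses. The sketch below assumes this form; the function name is hypothetical and the expression is written in a numerically stable way.

```python
import math

def pairwise_reward_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed stably as softplus(-(difference))."""
    diff = r_chosen - r_rejected
    # log(1 + exp(-diff)) == max(-diff, 0) + log(1 + exp(-|diff|))
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

print(pairwise_reward_loss(3.0, 1.0))  # correct ranking: small loss
print(pairwise_reward_loss(1.0, 3.0))  # wrong ranking: large loss
print(pairwise_reward_loss(0.0, 0.0))  # no preference expressed: log(2)
```

Only the score difference matters, so the loss pushes the preferred response's score above the other's without fixing an absolute scale.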
【RLHF and RLAIF】Relationship between the four models in PPO
- Four models collaborate:
- Policy Model (Actor): generates responses, balances reward guidance and KL constraint.
- Reference Model: provides KL baseline to prevent divergence.
- Reward Model: scores responses to simulate human feedback.
- Value Model (Critic): estimates long-term returns Vt.
【RLHF and RLAIF】The structure and init of the four models in PPO
- All four share the same N-layer decoder but differ in heads and LoRA modules:
- Policy Model: SFT model + LM head, parameters updated during PPO.
- Reference Model: SFT model + LM head, frozen.
- Reward Model: SFT decoder + Reward head (randomly initialized before reward-model training), frozen during PPO.
- Value Model: SFT decoder + Value head, updated during PPO.
【RLHF and RLAIF】A value model with a dual-head structure
- TRL's AutoModelForCausalLMWithValueHead uses two heads:
- LM Head outputs logits ([seq_len, vocab_size]).
- Value Head outputs values ([seq_len, 1]).
- Shared decoder pass reduces compute and memory.

【RLHF and RLAIF】Four models can share one base in RLHF
- By sharing a frozen decoder and using separate LoRA modules for each head, RLHF minimizes GPU memory. LoRA modules and heads can be dynamically loaded or swapped.
【RLHF and RLAIF】Inputs and Outputs of Each Model in PPO
Inputs and outputs of the models in PPO training
| Model | Input | Output |
|---|---|---|
| Policy Model | Prompt (or context) | Response and its log-probs (LogProbs) |
| Reference Model | Prompt (or context) + Response | Response log-probs (LogProbs) |
| Reward Model | Prompt (or context) + Response | Per-token reward scores |
| Value Model | Prompt (or context) + Response | Per-token value estimates |
【RLHF and RLAIF】The Process of Calculating KL in PPO
- In frameworks like TRL, KL is penalized during advantage computation: policy and reference logits â log-probs â gather token log-probs â compute per-token KL differences.
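
The pipeline described here (logits → log-probs → gather → difference) can be sketched in plain Python for a single response. This is an illustrative approximation, not TRL's actual code: the function names and the toy logits are assumptions, and the per-token quantity is the simple log-ratio estimate of the KL term.

```python
import math

def log_softmax(logits):
    """Log-probabilities from logits, using the log-sum-exp trick for stability."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def per_token_kl(policy_logits, ref_logits, token_ids):
    """For each generated token: log pi_policy(token) - log pi_ref(token)."""
    kls = []
    for pol, ref, tok in zip(policy_logits, ref_logits, token_ids):
        logp_policy = log_softmax(pol)[tok]   # gather the sampled token's log-prob
        logp_ref = log_softmax(ref)[tok]
        kls.append(logp_policy - logp_ref)
    return kls

# Toy example: 2 generated tokens, vocabulary of size 2.
policy_logits = [[2.0, 0.0], [0.0, 0.0]]
ref_logits = [[0.0, 0.0], [0.0, 0.0]]
print(per_token_kl(policy_logits, ref_logits, token_ids=[0, 1]))
```

The resulting per-token values are what gets scaled and subtracted from the reward before advantages are computed.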
【RLHF and RLAIF】RLHF Training Based on PPO
- The PPO-based RLHF workflow has two halves: experience collection and multi-epoch PPO training. The two alternate through a replay buffer, with the old policy version producing the samples and the new version being updated on them.
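
The role of the old and new policy versions in the training half can be illustrated with the standard PPO clipped objective for a single token. This is a hedged sketch: the function name and the default `eps=0.2` are assumptions, not values taken from the diagram.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate objective for one token (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)          # pi_new / pi_old
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# The new policy is 1.5x more likely to emit the token than the old one.
gain_when_good = ppo_clip_objective(math.log(1.5), 0.0, advantage=1.0)
loss_when_bad = ppo_clip_objective(math.log(1.5), 0.0, advantage=-1.0)
```

With a positive advantage the gain is capped at `1 + eps`, so the update stops rewarding ever larger steps away from the policy that collected the experience.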
【RLHF and RLAIF】Rejection Sampling Fine-tuning
- Rejection Sampling Fine-Tuning filters out low-quality generated samples and keeps only the high-quality ones for further fine-tuning. It has been used by Anthropic, Meta, and others, and can be repeated for multiple rounds.
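
A minimal sketch of one filtering round follows. The `generate` and `score` callables, the threshold, and the toy stand-ins are all hypothetical; in practice they would be the policy model and a reward model.

```python
def rejection_sample(prompt, generate, score, n=4, threshold=0.5):
    """Generate n candidates and keep those scoring at or above threshold."""
    candidates = [generate(prompt, i) for i in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    return [c for s, c in scored if s >= threshold]

def toy_generate(prompt, seed):
    # Stand-in for sampling from the policy model.
    return f"{prompt} :: draft {seed}"

def toy_score(prompt, response):
    # Stand-in for a reward model: pretend even-numbered drafts are good.
    return 1.0 if int(response[-1]) % 2 == 0 else 0.0

kept = rejection_sample("Q", toy_generate, toy_score, n=4, threshold=0.5)
# The surviving samples become the next round's fine-tuning data.
sft_data = [{"prompt": "Q", "response": r} for r in kept]
```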
【RLHF and RLAIF】RLAIF vs RLHF
- RLAIF (Reinforcement Learning from AI Feedback) mirrors RLHF but uses an AI model instead of humans for preference labeling.
【RLHF and RLAIF】CAI(Constitutional AI)
- Constitutional AI (CAI), proposed by Anthropic in 2022, relies on a set of constitutional principles:
  - Self-critique and revision guided by a randomly chosen principle.
  - Train SL-CAI on the revised data plus human-labeled data.
  - Generate candidate responses and have an AI label preferences using a randomly chosen principle.
  - Train a reward model on the combined labels.
  - RL-CAI: optimize the target model with PPO and the reward model.
【RLHF and RLAIF】OpenAI RBR(Rule-Based Reward)
- RBR (Rule-Based Rewards), proposed by OpenAI in 2024, trains a rule-based reward model on human data and integrates AI feedback into RLHF. It is central to GPT-4's safety system.
【Reasoning capacity optimization】Knowledge Distillation Based on CoT
- Knowledge Distillation (KD), introduced by Hinton et al. (2015), compresses a model by transferring the teacher's outputs (soft labels) to a student. For reasoning tasks, the CoT chains and answers of a strong model are distilled into a smaller one.
【Reasoning capacity optimization】Distillation Based on DeepSeek
- To reduce model size and deployment overhead, distill capabilities from a strong model (e.g., DeepSeek) into a smaller model.
【Reasoning capacity optimization】ORM (Outcome Reward Model) & PRM (Process Reward Model)
- Outcome Reward Model (ORM) scores only the final result.
- Process Reward Model (PRM) scores each intermediate reasoning step for finer-grained feedback.
【Reasoning capacity optimization】Four Key Steps of Each MCTS
- Each MCTS iteration has four key steps: Selection, Expansion, Simulation, Backpropagation.
【Reasoning capacity optimization】MCTS
- MCTS repeats these steps to expand and refine the search tree, gradually favoring better paths as the simulation count increases.
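
The four steps can be sketched on a toy problem. Everything specific here is an assumption made for illustration: the task (choose 4 bits, more 1s is better), the UCT selection rule with `c=1.4`, and the random rollout policy.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state          # tuple of 0/1 moves chosen so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0            # sum of rollout rewards

DEPTH = 4                           # toy problem: choose 4 bits

def reward(state):
    return sum(state) / DEPTH       # made-up objective: more 1s is better

def uct(child, c=1.4):
    if child.visits == 0:
        return float("inf")         # always try unvisited children first
    exploit = child.value / child.visits
    explore = c * math.sqrt(math.log(child.parent.visits) / child.visits)
    return exploit + explore

def mcts(iterations, rng):
    root = Node(())
    for _ in range(iterations):
        node = root
        # 1) Selection: descend by UCT until reaching a leaf.
        while node.children:
            node = max(node.children, key=uct)
        # 2) Expansion: add children unless the state is terminal.
        if len(node.state) < DEPTH:
            node.children = [Node(node.state + (b,), node) for b in (0, 1)]
            node = rng.choice(node.children)
        # 3) Simulation: finish the sequence with random moves.
        state = node.state
        while len(state) < DEPTH:
            state += (rng.randint(0, 1),)
        r = reward(state)
        # 4) Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    return root

root = mcts(200, random.Random(0))
```

Every iteration passes through the root and exactly one of its children, so the visit counts show directly how the search budget was split between the first moves.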
【Reasoning capacity optimization】Search Tree Example in a Linguistic Context
- In language tasks, each tree node represents a sentence-level or paragraph-level reasoning step.
【Reasoning capacity optimization】BoN(Best-of-N) Sampling
- Best-of-N sampling: generate N candidates and select the highest-scoring one.
【Reasoning capacity optimization】Majority Vote
- Run multiple reasoning paths, then choose the most frequent final answer.
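
Best-of-N sampling and majority voting differ only in how the winner is chosen, which a short sketch makes concrete. The candidate answers and the scoring table are made up; a real scorer would be a reward model or verifier.

```python
from collections import Counter

def best_of_n(candidates, score):
    """Best-of-N: return the candidate with the highest score."""
    return max(candidates, key=score)

def majority_vote(final_answers):
    """Majority vote: return the most frequent final answer."""
    return Counter(final_answers).most_common(1)[0][0]

answers = ["42", "41", "42", "40"]
toy_scores = {"42": 0.9, "41": 0.4, "40": 0.1}   # hypothetical reward scores

bon_choice = best_of_n(answers, toy_scores.get)
vote_choice = majority_vote(answers)
```

Best-of-N needs a scorer but works for open-ended outputs; majority voting needs no scorer but only applies when final answers can be compared for equality.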
【Reasoning capacity optimization】Performance Growth of AlphaGo Zero [179]
- AlphaGo Zero reached an Elo of 5185 with MCTS; without the MCTS look-ahead search, its Elo was only 3055, which highlights the impact of search.
【LLM basics extended】Performance Optimization Map for Large Models
- The optimization map covers five levels (service, model, framework, compiler, hardware/communication) for both training and inference.
【LLM basics extended】ALiBi positional encoding
- RoPE is now the mainstream choice; ALiBi is gradually being phased out.
【LLM basics extended】Traditional knowledge distillation
- Knowledge Distillation transfers the teacher's soft labels to a student for model compression and faster inference. It was introduced by Hinton et al. in "Distilling the Knowledge in a Neural Network."
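
A soft-label distillation loss can be sketched as a temperature-softened KL divergence between teacher and student distributions. The temperature `T=2.0`, the `T*T` scaling, and the toy logits are illustrative assumptions rather than settings from the diagram.

```python
import math

def softmax(logits, T=1.0):
    """Softmax over logits divided by temperature T."""
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, T)   # teacher soft labels
    q = softmax(student_logits, T)   # student predictions
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

matched = kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mismatched = kd_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge.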
【LLM basics extended】Numerical representation, quantization
| Type | Total Bits | Sign Bits | Exponent Bits | Mantissa/Integer Bits |
|---|---|---|---|---|
| FP64 | 64 | 1 | 11 | 52 (mantissa) |
| FP32 | 32 | 1 | 8 | 23 (mantissa) |
| TF32 | 19 | 1 | 8 | 10 (mantissa) |
| BF16 | 16 | 1 | 8 | 7 (mantissa) |
| FP16 | 16 | 1 | 5 | 10 (mantissa) |
| INT64 | 64 | 1 | N/A | 63 (integer) |
| INT32 | 32 | 1 | N/A | 31 (integer) |
| INT8 | 8 | 1 | N/A | 7 (integer) |
| UINT8 | 8 | 0 | N/A | 8 (integer) |
| INT4 | 4 | 1 | N/A | 3 (integer) |
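
The FP32 and BF16 rows can be checked with a few lines of bit manipulation. This sketch assumes BF16 conversion by plain truncation of the low 16 bits; real hardware and libraries usually round to nearest instead.

```python
import struct

def fp32_fields(x):
    """Split an FP32 value into (sign, exponent, mantissa) bit fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def to_bf16_truncate(x):
    """Keep the top 16 bits of FP32: 1 sign + 8 exponent + 7 mantissa."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

sign, exponent, mantissa = fp32_fields(-2.5)   # -2.5 = -1.25 * 2^1
approx_pi = to_bf16_truncate(3.14159)          # only 7 mantissa bits survive
```

Because BF16 keeps the full 8-bit exponent of FP32, the value's range is preserved and only precision is lost, which is why `3.14159` becomes `3.140625`.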
【LLM basics extended】Forward and backward
- Forward Propagation: inputs pass through the layers (1 to 4) in order, caching activations.
- Backward Propagation: gradients are computed from the loss back through the layers, using the cached activations.
【LLM basics extended】Gradient Accumulation
- Standard: each batch runs forward and backward and then updates the parameters immediately, so updates are frequent and memory usage is low.
- Accumulation: gradients are accumulated over several batches before a single update, simulating a larger batch size.
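
A small numeric check shows why accumulation simulates a larger batch. The one-parameter model `y = w * x`, the data, and the micro-batch split are made up for illustration.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5

# One large batch of 4 samples.
full_grad = grad_mse(w, data)

# Two micro-batches of 2 samples; gradients are accumulated before any update.
micro_batches = [data[:2], data[2:]]
accum_grad = 0.0
for mb in micro_batches:
    # Divide by the number of micro-batches so the sum equals the mean.
    accum_grad += grad_mse(w, mb) / len(micro_batches)
```

With equally sized micro-batches the accumulated gradient matches the large-batch gradient, while only one micro-batch of activations has to be held in memory at a time.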

【LLM basics extended】Gradient Checkpoint(gradient recomputation)
- Standard: store all activations for the backward pass, at a high memory cost.
- Checkpointing: save only key activations and recompute the others during the backward pass, trading extra compute for lower memory.
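
The trade-off can be demonstrated on a toy chain of scalar layers. The layer form `tanh(w * a)`, the weights, and the checkpoint interval `every=2` are assumptions for this sketch; frameworks apply the same idea to whole tensors.

```python
import math

def layer(w, a):
    return math.tanh(w * a)

def layer_grad(w, a_in):
    """Derivative of layer(w, a) with respect to a, evaluated at a_in."""
    out = math.tanh(w * a_in)
    return w * (1.0 - out * out)

def grad_standard(weights, x):
    acts = [x]
    for w in weights:                  # forward: store every activation
        acts.append(layer(w, acts[-1]))
    g = 1.0
    for i in reversed(range(len(weights))):
        g *= layer_grad(weights[i], acts[i])
    return g, len(acts)                # gradient, number of stored activations

def grad_checkpointed(weights, x, every=2):
    ckpt = {0: x}
    a = x
    for i, w in enumerate(weights):    # forward: keep only every k-th activation
        a = layer(w, a)
        if (i + 1) % every == 0 and i + 1 < len(weights):
            ckpt[i + 1] = a
    g = 1.0
    end = len(weights)
    for start in sorted(ckpt, reverse=True):
        # Recompute this segment's activations from its checkpoint.
        seg = [ckpt[start]]
        for i in range(start, end - 1):
            seg.append(layer(weights[i], seg[-1]))
        for i in reversed(range(start, end)):
            g *= layer_grad(weights[i], seg[i - start])
        end = start
    return g, len(ckpt)

weights = [0.5, -1.2, 0.8, 1.5, -0.7]
g_std, stored_std = grad_standard(weights, 0.3)
g_ckpt, stored_ckpt = grad_checkpointed(weights, 0.3)
```

Both paths produce the same gradient, but the checkpointed version keeps 3 activations instead of 6 and pays for it with extra forward computation inside each segment.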
【LLM basics extended】Full recomputation
- Full Recomputation: store no intermediate activations and recompute the forward pass during backward propagation. This minimizes memory usage at the cost of more compute time.
【LLM basics extended】LLM Benchmark
- LLM benchmarks (e.g., MMLU, C-Eval) follow similar evaluation protocols, as illustrated.
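【LLM basics extended】MHA, GQA, MQA, MLA
- MHA: Multi-Head Attention (every query head has its own key/value head).
- GQA: Grouped-Query Attention (each group of query heads shares one key/value head).
- MQA: Multi-Query Attention (all query heads share a single key/value head).
- MLA: Multi-head Latent Attention (keys and values are compressed into a low-rank latent vector).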
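
Reading these acronyms as attention variants (Multi-Head, Grouped-Query, and Multi-Query Attention), the difference between the first three comes down to how query heads map onto key/value heads. The helper below is a hypothetical sketch of that mapping; MLA is left out because it compresses keys and values rather than sharing heads.

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Index of the key/value head that a given query head attends with."""
    assert num_q_heads % num_kv_heads == 0, "query heads must split evenly"
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

# 8 query heads in each case; only the number of key/value heads changes.
mha = [kv_head_for_query_head(h, 8, 8) for h in range(8)]   # one KV head each
gqa = [kv_head_for_query_head(h, 8, 2) for h in range(8)]   # groups of 4
mqa = [kv_head_for_query_head(h, 8, 1) for h in range(8)]   # one shared KV head
```

Fewer key/value heads mean a smaller KV cache at inference time, which is the main motivation for GQA and MQA.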

【LLM basics extended】RNN(Recurrent Neural Network)
- An RNN processes sequential data through recurrent connections that carry a hidden state from step to step.
- Pros: simple; handles short-term dependencies.
- Cons: suffers from vanishing/exploding gradients; poor at long-term dependencies.
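
A scalar recurrence is enough to show how the hidden state carries information forward and how that information fades. The update rule `h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)` and the weights are illustrative assumptions.

```python
import math

def rnn_forward(xs, w_x, w_h, b, h0=0.0):
    """Scalar RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    h = h0
    hidden_states = []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h + b)
        hidden_states.append(h)
    return hidden_states

# A single input pulse at step 0 followed by silence.
states = rnn_forward([1.0, 0.0, 0.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)
```

With a recurrent weight below 1, the trace of the first input shrinks at every step, a small-scale picture of why plain RNNs struggle with long-term dependencies.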

【LLM basics extended】Pre-norm vs Post-norm
- Pre-norm: apply LayerNorm before the sublayer, then add the residual; this improves gradient flow in deep networks.
- Post-norm: the original Transformer design, which applies LayerNorm after the residual addition; it may cause gradient decay in deep models.
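
Writing $F$ for the sublayer (attention or feed-forward) and $\mathrm{LN}$ for LayerNorm, a notation assumed here rather than taken from the diagram, the two arrangements are commonly written as:

$$
\text{Pre-norm: } x_{l+1} = x_l + F(\mathrm{LN}(x_l)) \qquad\qquad \text{Post-norm: } x_{l+1} = \mathrm{LN}(x_l + F(x_l))
$$

In the pre-norm form the residual path from $x_l$ to $x_{l+1}$ is an identity, which is the usual explanation for its better gradient flow.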
【LLM basics extended】BatchNorm & LayerNorm
- BatchNorm: normalize each channel across the batch, then scale and shift.
- LayerNorm: normalize across all features of each sample, then scale and shift.
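
The only difference between the two is the axis being normalized, as this sketch on a tiny 2x3 batch shows. The learnable scale and shift are omitted (equivalent to scale 1 and shift 0), and the data is made up.

```python
import math

def normalize(values, eps=1e-5):
    """Zero-mean, unit-variance normalization of a list of numbers."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm(batch):
    """Normalize across the features of each sample (each row)."""
    return [normalize(row) for row in batch]

def batch_norm(batch):
    """Normalize each feature (each column) across the batch."""
    cols = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*cols)]

batch = [[1.0, 2.0, 3.0],     # sample 0
         [4.0, 6.0, 8.0]]     # sample 1
ln = layer_norm(batch)
bn = batch_norm(batch)
```

LayerNorm output does not depend on the other samples in the batch, which is one reason it suits variable-length sequence models.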

【LLM basics extended】RMSNorm
- RMSNorm normalizes by the root-mean-square of the input features (no mean subtraction), then applies a learned scale; the original formulation has no shift term. It is cheaper to compute than LayerNorm and is reported to perform comparably or better.
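
A minimal sketch of the computation, assuming a per-feature learned `gain` and a small `eps` for numerical stability (both names are illustrative):

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Divide by the root-mean-square of x, then apply a per-feature gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

out = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
mean_square = sum(v * v for v in out) / len(out)
```

The output has a mean square of about 1, but its mean is not forced to zero, since no mean is subtracted.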
【LLM basics extended】Prune
- Model pruning removes redundant weights or neuron channels to compress a network, then fine-tunes it to retain accuracy. Steps: importance scoring → prune → fine-tune → validate.
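
The scoring and pruning steps can be sketched with weight magnitude as the importance score. Magnitude is only one possible criterion, chosen here as an assumption; the fine-tune and validate steps are not shown.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of the weights."""
    k = int(len(weights) * sparsity)            # number of weights to remove
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])                       # indices with lowest importance
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned = magnitude_prune(weights, sparsity=0.5)
```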
【LLM basics extended】Role of the temperature coefficient
- Temperature T scales the logits to control sampling diversity:
  - T<1 sharpens the distribution, making output more deterministic.
  - T>1 flattens the distribution, making output more random.
- Adjusting T balances accuracy against creativity.
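
The effect of T can be seen by applying softmax to the same logits at different temperatures. The logits below are made up for illustration.

```python
import math

def softmax_with_temperature(logits, T):
    """Softmax over logits divided by temperature T (T > 0)."""
    scaled = [x / T for x in logits]
    m = max(scaled)                               # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)     # T < 1
plain = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)      # T > 1
```

The probability of the top token falls as T rises, so sampling spreads over more candidates.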

【LLM basics extended】SwiGLU
- SwiGLU is a GLU variant with two linear projections of the input: one is used directly, the other passes through a SiLU activation to form a gate, and the two are multiplied elementwise. The smooth gating improves gradient stability and convergence.
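
A minimal sketch of the gating, using plain lists for vectors and matrices. The names `W_gate` and `W_up` and the toy weights are assumptions, and the output projection that follows in a full feed-forward block is omitted.

```python
import math

def silu(z):
    """SiLU (Swish): z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def swiglu(x, W_gate, W_up):
    """Elementwise product of a SiLU-gated projection and a direct projection."""
    gate = [silu(z) for z in matvec(W_gate, x)]
    up = matvec(W_up, x)
    return [g * u for g, u in zip(gate, up)]

x = [1.0, 2.0]
W_gate = [[1.0, 0.0], [0.0, 0.0]]    # toy weights
W_up = [[1.0, 1.0], [1.0, 1.0]]
out = swiglu(x, W_gate, W_up)
```

The second output is exactly zero because its gate pre-activation is zero, showing how the gate can switch a channel off.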
【LLM basics extended】AUC, PR, F1, Precision, Recall
- AUC: Area Under the ROC Curve.
- PR Curve: precision vs. recall across thresholds.
- Precision = TP/(TP+FP).
- Recall = TP/(TP+FN).
- F1 Score = 2 * (Precision * Recall) / (Precision + Recall).
- Accuracy = (TP+TN)/(TP+TN+FP+FN).
- BLEU: n-gram overlap for MT.
- ROUGE: n-gram/LCS overlap for summarization.
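
The count-based formulas above translate directly into code. The confusion-matrix counts are made up, and the sketch assumes no denominator is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

precision, recall, f1, accuracy = classification_metrics(tp=8, fp=2, fn=4, tn=6)
```

Here precision is 0.8 and recall is about 0.667, and F1 (8/11) sits between them but closer to the lower value, as expected of a harmonic mean.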

【LLM basics extended】RoPE positional encoding
- Rotary Position Embedding (RoPE) applies sinusoidal rotations to the query/key vectors, so that relative positions show up as phase differences in their dot product. It adds no extra parameters and works efficiently on long sequences.
【LLM basics extended】The effect of RoPE on each sequence position and each dim
- For details on the principles of RoPE, the base and θ values, and how they work, see: RoPE-theta-base.xlsx
- RoPE groups embedding dimensions in pairs and applies a 2D rotation to each pair, with an angle determined by the position and the pair's frequency.
- High-frequency dimensions capture short-range relative positions.
- Low-frequency dimensions capture long-range relative positions.
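
The pairwise rotation can be sketched as follows. Two choices are assumptions of this sketch: adjacent dimensions form a pair (some implementations pair dimension i with i + d/2 instead), and the base is 10000.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions by pos * theta_k."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)       # pair k = i/2 has frequency base^(-2k/d)
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.1]
near_start = dot(rope(q, 5), rope(k, 3))     # positions 5 and 3, offset 2
far_along = dot(rope(q, 12), rope(k, 10))    # positions 12 and 10, offset 2
```

The two dot products agree because the attention score depends only on the offset between positions, not on the absolute positions themselves.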
Contributing
- Contributions are welcome! Example diagram template: images-template.pptx
- How to contribute:
  (1) Fork: click the "Fork" button to create a copy of the repo under your GitHub account →
  (2) Clone: clone the forked repo to your local environment →
  (3) Create a new local branch →
  (4) Make changes and commit →
  (5) Push changes to your remote repo →
  (6) Submit a PR: on GitHub, go to your forked repo and click "Compare & pull request" to submit a PR. The maintainer will review and merge it into the main repository.
- Suggested color scheme for diagram design:
  Light Blue (#71CCF5); Light Yellow (#FFE699); Blue-Purple (#C0BFDE); Pink (#F0ADB7)
Terms of Use
All images in this repository are licensed under LICENSE. You are free to use, modify, and remix the materials under the following terms:
- Sharing: You may copy and redistribute the material in any format.
- Adapting: You may remix, transform, and build upon the material.
You must also comply with the following terms:
- For Web Use: If using the materials in blog posts or online content, please retain the original author information embedded in the images.
- For Papers, Books, and Publications: If using the materials in formal publications, please cite the source using the format below. In such cases, the embedded author info may be removed from the image.
- Non-commercial Use Only: These materials may not be used for any direct commercial purposes.
Citation
If you use any content from this project (including diagrams or concepts from the book), please cite it as follows:
For Reference Section
Yu, Changye. Large Model Algorithms: Reinforcement Learning, Fine-Tuning, and Alignment.
Beijing: Publishing House of Electronics Industry, 2025. https://github.com/changyeyu/LLM-RL-Visualized
BibTeX Citation Format
@book{yu2025largemodel_en,
title = {Large Model Algorithms: Reinforcement Learning, Fine-Tuning, and Alignment},
author = {Yu, Changye},
publisher = {Publishing House of Electronics Industry},
year = {2025},
address = {Beijing},
isbn = {9787121500725},
url = {https://github.com/changyeyu/LLM-RL-Visualized},
language = {en}
}