LLM Quest

April 11, 2026

LLM Architectures, techniques, and research papers for practice, experimentation, and learning — from scratch.

Content
Acknowledgements

Latest 3 updates

Multimodal Qwen3.5 from scratch
Google Reinforced Attention Learning (RAL)
LoRA-XS and Meta TinyLoRA

Content

Note

More details are available in each subfolder's README.md

Architectures

Architecture	Key Components
GPT-2*	• MHA • LayerNorm • FFN • GELU • KV cache
GPT to Llama 3.2*	• GQA • RoPE + YaRN • RMSNorm • SwiGLU
Llama 3.2 to DeepSeek V3/R1	• MLA • MTP modules • DeepSeek MoE
Llama 3.2 to Gemma 3 (text-only)	• GeGLU • Local/Global attention • SWA • QK norm • Pre+Post RMSNorm • Logit softcapping (Gemma 2)
Qwen3 (dense and MoE)	— (nothing new)
Qwen3-Next (hybrid attention)	• Gated DeltaNet • Gated Attention • Zero-Centered RMSNorm • Weighted shared expert • Partial RoPE
Xiaomi MiMo-V2-Flash (hybrid attention)	• SWA+GA • Attention Sink • DeepSeek MoE without shared experts • SWA MTP modules
Qwen3.5 (multimodal)	• Early fusion • Qwen3-Next + Qwen3-ViT • 3D patch embeddings with temporal downsampling • ViT-LLM adapter with spatial downsampling • Multimodal RoPE: MRoPE-I variant (interleaved)
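
As a quick illustration of two components named in the table above, here is a minimal, framework-free sketch of RMSNorm and of the zero-centered variant. The function names and the `eps` default are assumptions made for this example, and the "weight stored as an offset from 1" reading of zero-centered RMSNorm reflects my understanding of the Qwen3-Next and Gemma convention, not necessarily the exact code in the subfolders.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a learned per-feature scale."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def zero_centered_rms_norm(x, weight, eps=1e-6):
    """Same normalization, but the stored weight is an offset from 1.

    Assumption: a weight initialized to zero then gives an identity scale.
    """
    return rms_norm(x, [1.0 + w for w in weight], eps)


if __name__ == "__main__":
    x = [3.0, 4.0]
    print(rms_norm(x, [1.0, 1.0]))
    print(zero_centered_rms_norm(x, [0.0, 0.0]))
```

Unlike LayerNorm, there is no mean subtraction and no bias term, which is why the output keeps the sign pattern and direction of the input vector.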

Mixture of Experts (MoE)

Variant	Notes
Sparse MoE	Classic auxiliary loss + router z-loss
DeepSeek MoE	Fine-grained + shared expert isolation + auxiliary loss-free load balancing
Nvidia LatentMoE	Latent/low-rank compression + expert rebalancing
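
The two losses mentioned for the sparse MoE variant can be sketched as follows. This assumes the Switch Transformer form of the auxiliary loss with top-1 routing and the ST-MoE form of the router z-loss; the function names are hypothetical, and the loss coefficients (and top-k choice) used in the repo are not shown here.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logsumexp(logits):
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))


def load_balancing_loss(router_logits):
    """Auxiliary loss: n_experts * sum_i f_i * P_i (top-1 routing assumed)."""
    n_tokens = len(router_logits)
    n_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]
    # f_i: fraction of tokens whose highest-probability expert is i
    counts = [0] * n_experts
    for row in probs:
        counts[row.index(max(row))] += 1
    frac_tokens = [c / n_tokens for c in counts]
    # P_i: mean router probability assigned to expert i
    mean_probs = [sum(row[i] for row in probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(f * p for f, p in zip(frac_tokens, mean_probs))


def router_z_loss(router_logits):
    """Mean squared log-sum-exp of the router logits; penalizes large logits."""
    return sum(logsumexp(row) ** 2 for row in router_logits) / len(router_logits)


if __name__ == "__main__":
    balanced = [[1.0, 0.0], [0.0, 1.0]]
    collapsed = [[1.0, 0.0], [1.0, 0.0]]
    print(load_balancing_loss(balanced), load_balancing_loss(collapsed))
    print(router_z_loss(balanced))
```

The auxiliary loss reaches its minimum of 1 when tokens and probability mass are spread evenly across experts, and grows as routing collapses onto a few experts.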

Alignment & Reasoning

Method	Notes
DPO*	With cDPO for noisy labels, step by step
RLHF with GRPO	Including variants: Dr. GRPO, DAPO, GSPO, SAPO
RLVR with GRPO	Reasoning with GPT2-m
Reinforcement Pretraining (RPT)	With Qwen3-0.6B
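
For the DPO row above, a minimal sketch of the per-pair loss with the conservative (cDPO) label-smoothing term may help. It assumes the inputs are already summed sequence log-probabilities under the policy and the frozen reference model; the function and parameter names (`dpo_loss`, `label_smoothing`, the `beta` default) are illustrative choices, not the repo's API.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp,
             beta=0.1, label_smoothing=0.0):
    """DPO loss for one preference pair.

    label_smoothing = 0.0 gives plain DPO; a value in (0, 0.5) gives cDPO,
    which assumes that fraction of preference labels may be flipped.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_logratio - rejected_logratio)
    return (-(1.0 - label_smoothing) * log_sigmoid(z)
            - label_smoothing * log_sigmoid(-z))


if __name__ == "__main__":
    # Policy has moved toward the chosen answer relative to the reference.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0, label_smoothing=0.1))
```

When the policy equals the reference, the margin is zero and the loss is log 2 regardless of the smoothing value.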

Multimodal

Part	Key Components
Part 1: GPT to ViT	• Image patches + learnable CLS token + positional encoding • Full Attention • Classification head
Part 2: VLM	• ViT-LLM adapter (multimodal alignment/fine-tuning) • Early fusion (image + text embeddings)
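
The "image patches" step from Part 1 can be illustrated with a small sketch. It assumes a single-channel image stored as a list of rows and non-overlapping square patches; `patchify` is a hypothetical helper name, and a real ViT would follow this with a linear projection of each patch, a prepended CLS token, and positional encodings.

```python
def patchify(image, patch_size):
    """Split an H x W image into flattened, non-overlapping patches (row-major order)."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


if __name__ == "__main__":
    # 4 x 4 image with pixel values 0..15
    image = [[r * 4 + c for c in range(4)] for r in range(4)]
    for patch in patchify(image, 2):
        print(patch)
```

With a 4 x 4 image and 2 x 2 patches this yields 4 patch vectors of length 4, so the transformer would see a sequence of 5 tokens once the CLS token is added.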

Fine-tuning (SFT)

Type	Method
Classifier	Hidden state retrieval for the last real token
Instruction*	—
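
"Hidden state retrieval for the last real token" can be sketched like this. The example assumes a padding mask with 1 for real tokens and 0 for padding (separate from the causal mask), and represents hidden states as plain nested lists; the helper names are made up for the illustration.

```python
def last_real_token_index(padding_mask):
    """Return the position of the last non-padding token (works for left or right padding)."""
    for i in range(len(padding_mask) - 1, -1, -1):
        if padding_mask[i] == 1:
            return i
    raise ValueError("sequence contains no real tokens")


def gather_last_hidden(hidden_states, padding_masks):
    """Pick, for each sequence in the batch, the hidden vector of its last real token."""
    return [
        seq[last_real_token_index(mask)]
        for seq, mask in zip(hidden_states, padding_masks)
    ]


if __name__ == "__main__":
    # Batch of 2 sequences, 4 positions each, hidden size 2
    hidden = [
        [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.0, 0.0]],
        [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6], [1.7, 1.8]],
    ]
    masks = [[1, 1, 1, 0], [1, 1, 1, 1]]
    print(gather_last_hidden(hidden, masks))
```

Taking position -1 unconditionally would feed the classification head a padding position for the first sequence, which is the mistake this lookup avoids.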

Other Model-Agnostic Techniques and Papers (`common` folder)

Topic	Notes
Hyper-connections	• mHC-lite: optimization of mHC to get faster and exact doubly stochasticity (with Birkhoff-von Neumann theorem) • DeepSeek Manifold-Constrained Hyper-connections (mHC) with Sinkhorn-Knopp and Qwen3 convergence test • Classic (unconstrained) Hyper-connections (alternative to residual connections) and test on "hyper-connected" Qwen3 for convergence
QK-Clip	Query-Key clipping (naive & per head + GQA compatible) from Moonshot.ai's MuonClip, plus an experimental "Magnitude" variant.
Speculative Decoding	Google's original version
Reinforced Attention Learning (RAL)	Google's auxiliary policy-gradient loss, regularizing attention weights via Advantage-Weighted Jensen-Shannon divergence
Dynamic Tanh	Normalization-free alternative to RMSNorm/LayerNorm (Zhu et al., 2025)
RoPE + YaRN	NTK-aware + by-part/wavelength scaling
LoRAs	Classic LoRA, LoRA-XS and Meta TinyLoRA
Number Token Loss	Regression-like loss on number tokens — Wasserstein Distance variant (Zausinger et al., 2025)
generate.py	Common sampling functions: temperature, top-k, top-p, min-p
experimental	—
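
The sampling functions listed for `generate.py` in the table above can be sketched in a few lines. This is a stdlib-only illustration under the usual definitions of each filter; the function names and signatures are assumptions and do not mirror the actual `generate.py`.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature; lower temperature sharpens the distribution."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalize(probs, keep):
    """Zero out every index not in `keep`, then rescale to sum to 1."""
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def top_k_filter(probs, k):
    """Keep the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, set(order[:k]))


def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return _renormalize(probs, keep)


def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs)
    return _renormalize(probs, {i for i, p in enumerate(probs) if p >= threshold})


if __name__ == "__main__":
    probs = [0.5, 0.3, 0.15, 0.05]
    print(top_k_filter(probs, 2))
    print(top_p_filter(probs, 0.7))
    print(min_p_filter(probs, 0.2))
```

Each filter returns a renormalized distribution, so they can be chained before drawing the next token, for example with `random.choices`.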

* : Already covered by @rasbt; my code is similar.

¹ : The original GPT-2 implementation only included causal masks, not attention masks. (In OpenAI's code, the causal mask is called "attention mask", which can be confusing.)

Acknowledgements

Most of all, thanks to the open-source community, without which none of this would have been possible.
Whether they come from academia, top AI labs, or independent research, I am grateful to everyone who shares their knowledge and work.

Research papers used in the repo are always cited and linked in the relevant readmes or code comments.

Special mention to @rasbt for the LLMs-from-scratch book/repo, which prompted me to start this repo and became the base for verbose re-implementations of various research papers.

Install

uv venv && uv pip install -e .

uv sync