LLM Quest

April 11, 2026


LLM Architectures, techniques, and research papers for practice, experimentation, and learning — from scratch.

 

Latest 3 updates

  • Multimodal Qwen3.5 from scratch
  • Google Reinforced Attention Learning (RAL)
  • LoRA-XS and Meta TinyLoRA

 

Content

Note

More details are available in each subfolder's README.md

 

Architectures

Architecture | Key Components
GPT-2* | • MHA
• LayerNorm
• FFN
• GeLU
• KVCache
GPT to Llama 3.2* | • GQA
• RoPE + YaRN
• RMS Norm
• SwiGLU
Llama 3.2 to DeepSeek V3/R1 | • MLA
• MTP modules
• DeepSeek MoE
Llama 3.2 to Gemma 3 (text-only) | • GeGLU
• Local/Global attention
• SWA
• QK norm
• Pre+Post RMSNorm
• Logit softcapping (Gemma 2)
Qwen3 (dense and MoE) | (nothing new)
Qwen3-Next (hybrid attention) | • Gated DeltaNet
• Gated Attention
• Zero-Centered RMSNorm
• Weighted shared expert
• Partial RoPE
Xiaomi MiMo-V2-Flash (hybrid attention) | • SWA+GA
• Attention Sink
• DeepSeek MoE without shared experts
• SWA MTP modules
Qwen3.5 (multimodal) | • Early fusion
• Qwen3-Next + Qwen3-ViT
• 3D patch embeddings with temporal downsampling
• ViT-LLM adapter with spatial downsampling
• Multimodal RoPE: MRoPE-I variant (interleaved)
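
As a rough illustration of two of the smaller components listed above, RMS Norm and logit softcapping, here is a minimal pure-Python sketch. The function names `rms_norm` and `softcap` and the example values are assumptions made for illustration; they are not taken from the repo, which may implement these differently.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a per-feature weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def softcap(logits, cap):
    """Squash logits smoothly into (-cap, cap) with cap * tanh(logit / cap)."""
    return [cap * math.tanh(v / cap) for v in logits]


# Hypothetical inputs, chosen only to make the behaviour visible.
hidden = [3.0, -4.0]
normed = rms_norm(hidden, weight=[1.0, 1.0])
capped = softcap([-100.0, 0.0, 100.0], cap=30.0)
```

Unlike LayerNorm, this RMS Norm sketch does not subtract the mean or add a bias, and softcapping keeps the ordering of the logits while bounding their magnitude.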

 

Mixture of Experts (MoE)

Variant | Notes
Sparse MoE | Classic auxiliary loss + router z-loss
DeepSeek MoE | Fine-grained experts + shared expert isolation + auxiliary-loss-free load balancing
Nvidia LatentMoE | Latent/low-rank compression + expert rebalancing
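
To make the Sparse MoE row more concrete, the sketch below shows top-k routing for a single token and a router z-loss over a small batch, in pure Python. The names `route_top_k` and `router_z_loss` are hypothetical, and renormalizing the selected gate weights is an assumption; some implementations skip that step.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return (expert_index, gate_weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # assumption: renormalize over the selected experts
    return [(i, probs[i] / norm) for i in top]


def router_z_loss(batch_logits):
    """Mean squared log-sum-exp of the router logits; discourages large logit magnitudes."""
    total = 0.0
    for logits in batch_logits:
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        total += lse ** 2
    return total / len(batch_logits)


# Hypothetical router outputs for 4 experts.
token_logits = [2.0, 0.5, 1.0, -1.0]
selected = route_top_k(token_logits, k=2)
z_loss = router_z_loss([token_logits, [0.0, 0.0, 0.0, 0.0]])
```

The classic auxiliary load-balancing loss is not shown here; it additionally needs per-expert token counts across a batch.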

 

Alignment & Reasoning

Method | Notes
DPO* | With cDPO for noisy labels, step by step
RLHF with GRPO | Including variants: Dr. GRPO, DAPO, GSPO, SAPO
RLVR with GRPO | Reasoning with GPT2-m
Reinforcement Pretraining (RPT) | With Qwen3-0.6B
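
For readers new to the DPO row, this is a minimal sketch of the per-pair DPO loss with the cDPO label-smoothing term, written in pure Python on scalar sequence log-probabilities. The function name, argument names, and example numbers are illustrative assumptions, not the repo's API.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
             beta=0.1, label_smoothing=0.0):
    """DPO loss for one preference pair; label_smoothing > 0 gives the cDPO variant."""
    # Implicit reward margin: how much more the policy prefers the chosen answer
    # than the reference model does.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return (-(1.0 - label_smoothing) * log_sigmoid(margin)
            - label_smoothing * log_sigmoid(-margin))


# Hypothetical summed log-probabilities for one (chosen, rejected) pair.
plain = dpo_loss(-10.0, -14.0, -12.0, -12.0)
smoothed = dpo_loss(-10.0, -14.0, -12.0, -12.0, label_smoothing=0.1)
at_init = dpo_loss(-12.0, -12.0, -12.0, -12.0)
```

With `label_smoothing=0.0` this reduces to plain DPO. A small positive value assumes that a fraction of the preference labels may be flipped, which stops the loss from pushing the margin toward infinity.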

 

Multimodal

Part | Key Components
Part 1: GPT to ViT | • Image patches + learnable CLS token + positional encoding
• Full Attention
• Classification head
Part 2: VLM | • ViT-LLM adapter (multimodal alignment/fine-tuning)
• Early fusion (image + text embeddings)
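
The "image patches + CLS token" step of Part 1 can be sketched in pure Python as follows. It works on a single-channel image stored as a list of rows; `patchify` and the zero-filled `cls_token` placeholder are assumptions for illustration only. In a real ViT the patches are linearly projected, the CLS token is a learned vector, and positional encodings are added afterwards.

```python
def patchify(image, patch_size):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order (left to right, top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches


# Hypothetical 4x4 image whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, patch_size=2)

# Placeholder standing in for the learnable CLS token, prepended to the sequence.
cls_token = [0.0] * len(patches[0])
sequence = [cls_token] + patches
```

For an H x W image and patch size P, the resulting sequence length is (H / P) * (W / P) + 1, the extra position being the CLS token.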

 

Fine-tuning (SFT)

Type | Method
Classifier | Hidden state retrieval for the last real token
Instruction* |
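
For the classifier row, "hidden state retrieval for the last real token" means picking, for each padded sequence, the hidden state at the last non-padding position instead of the last position in the tensor. A minimal pure-Python sketch is shown below; the function names and the 1/0 mask convention (1 for real tokens, 0 for padding) are assumptions and not necessarily what the repo uses.

```python
def last_real_token_index(attention_mask):
    """Index of the last position whose mask value is 1 (left or right padding)."""
    for idx in range(len(attention_mask) - 1, -1, -1):
        if attention_mask[idx] == 1:
            return idx
    raise ValueError("sequence contains only padding")


def gather_last_hidden(hidden_states, attention_masks):
    """Pick one hidden vector per sequence: the one at its last real token."""
    return [seq[last_real_token_index(mask)]
            for seq, mask in zip(hidden_states, attention_masks)]


# Hypothetical batch: 2 sequences, 4 positions, hidden size 2.
hidden_states = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [9.0, 9.0]],  # last position is padding
    [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6], [1.7, 1.8]],  # no padding
]
attention_masks = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]
pooled = gather_last_hidden(hidden_states, attention_masks)
```

The pooled vectors would then be fed to the classification head. Reading position -1 directly would pick up the padding vector for the first sequence.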

 

Other Model-Agnostic Techniques and Papers (common folder)

Technique | Notes
Hyper-connections | • mHC-lite: optimization of mHC for faster, exact double stochasticity (using the Birkhoff-von Neumann theorem)
• DeepSeek Manifold-Constrained Hyper-connections (mHC) with Sinkhorn-Knopp and a Qwen3 convergence test
• Classic (unconstrained) Hyper-connections (an alternative to residual connections) with a convergence test on a "hyper-connected" Qwen3
QK-Clip | Query-Key clipping (naive and per-head, GQA-compatible) from Moonshot.ai's MuonClip, plus an experimental "Magnitude" variant
Speculative Decoding | Google's original version
Reinforced Attention Learning (RAL) | Google's auxiliary policy-gradient loss, regularizing attention weights via Advantage-Weighted Jensen-Shannon divergence
Dynamic Tanh | Normalization-free alternative to RMSNorm/LayerNorm (Zhu et al., 2025)
RoPE + YaRN | NTK-aware + by-parts/wavelength scaling
LoRAs | Classic LoRA, LoRA-XS and Meta TinyLoRA
Number Token Loss | Regression-like loss on number tokens, Wasserstein Distance variant (Zausinger et al., 2025)
generate.py | Common sampling functions: temperature, top-k, top-p, min-p
experimental |
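
To illustrate the sampling strategies named in the generate.py row, here is a small pure-Python sketch of temperature scaling and top-k, top-p, and min-p filtering on a single probability vector. All function names here are hypothetical and do not mirror generate.py; the repo's versions presumably operate on batched tensors.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Zero out every index not in `keep` and rescale the rest to sum to 1."""
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, set(order[:k]))


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs)
    return renormalize(probs, {i for i, q in enumerate(probs) if q >= threshold})


def sample(probs):
    """Draw one token index according to probs."""
    return random.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical logits for a 4-token vocabulary.
logits = [2.0, 1.0, 0.0, -1.0]
probs = softmax_with_temperature(logits, temperature=1.0)
next_token = sample(min_p_filter(probs, min_p=0.1))
```

Top-k keeps a fixed number of candidates, top-p keeps a fixed probability mass, and min-p scales its cutoff with the model's confidence in the top token.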

 

* : Already covered by @rasbt; my code is similar.

1 : The original GPT-2 implementation only included causal masks, not attention masks. (In OpenAI's code, causal masks are called "attention mask", which can be confusing.)

 

Acknowledgements

Thanks most of all to the open-source community, without whom none of this would have been possible.
Whether from academia, top AI labs, or independent researchers, I am grateful for the knowledge and research they share.

Research papers used in the repo are always cited and linked in the relevant readmes or code comments.

Special mention to @rasbt for the LLMs-from-scratch book/repo, which prompted me to start this repo and became the base for verbose re-implementations of various research papers.

 

Install

uv venv && uv pip install -e .

or

uv sync