POET & POET-X for LLM Pretraining
June 7, 2026
Reparameterized LLM Training via Orthogonal Equivalence Transformation
POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation
Overview
This repository contains the official implementation of POET and POET-X — a family of reparameterized LLM training algorithms that optimize weight matrices through Orthogonal Equivalence Transformation (OET), achieving superior generalization with provably bounded weight spectra.
POET's three learning phases: conical shell searching → stable learning → final adjusting.
Because POET-X is an efficient implementation of POET that introduces no approximation, this repository provides only the POET-X implementation. The original POET implementation is obsolete.
Installation
git clone https://github.com/Sphere-AI-Lab/poet.git
cd poet
pip install -e .
Requirements:
- Python ≥ 3.10
- PyTorch ≥ 2.7
- CUDA ≥ 12.6
- Triton ≥ 3.4.0
Quick Start
Get started with POET in just a few lines of code:
from poet_torch import POETConfig, POETModel, get_poet_optimizer
# 1. Create config
config = POETConfig(
    block_size=256,      # POET block size
    merge_interval=200,  # Steps between merge-then-reinitialize
)
# 2. Wrap your model
model = POETModel(your_model, config)
# 3. Create optimizer
optimizer = get_poet_optimizer(model, config)
# 4. Training loop
for step, batch in enumerate(dataloader):
    loss = model(**batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    model.merge_if_needed(step)  # Automatic merge
Key Components
| Component | Description |
|---|---|
| `POETConfig` | Configuration for block size, merge interval, and variant selection |
| `POETModel` | Wraps your model to apply Orthogonal Equivalence Transformation |
| `get_poet_optimizer()` | Creates an optimizer tailored for POET's orthogonal parameters |
| `merge_if_needed()` | Periodically absorbs orthogonal matrices into base weights |
📁 More Examples: Explore comprehensive training scripts and additional APIs in the examples/ directory.
POET
Method
POET reparameterizes each weight matrix as:
$$W_{RP} = R W_0 P$$
where $W_0$ is a fixed, randomly initialized matrix, and $R$, $P$ are learnable orthogonal matrices. Training updates only $R$ and $P$, leaving $W_0$ unchanged.
Why orthogonal transformations? They preserve singular values exactly — giving POET direct, provable control over the weight spectrum throughout training.
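The spectrum-preservation claim is easy to check numerically. The following is a minimal NumPy sketch, not code from this repository: the names `W0`, `R`, `P`, and the helper `random_orthogonal` are illustrative assumptions, and the orthogonal factors are sampled at random rather than learned.

```python
import numpy as np


def random_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix via QR of a Gaussian matrix (illustrative helper)."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Flip column signs using the diagonal of r; q stays orthogonal.
    return q * np.sign(np.diag(r))


rng = np.random.default_rng(0)
m, n = 6, 4
W0 = rng.standard_normal((m, n))   # fixed random base weight
R = random_orthogonal(m, rng)      # left orthogonal factor
P = random_orthogonal(n, rng)      # right orthogonal factor

W = R @ W0 @ P                     # transformed weight

# Singular values before and after the orthogonal transformation.
sv_base = np.linalg.svd(W0, compute_uv=False)
sv_transformed = np.linalg.svd(W, compute_uv=False)
max_gap = np.max(np.abs(sv_base - sv_transformed))
print(f"largest singular-value change: {max_gap:.2e}")
```

The reported gap should sit at floating-point round-off level, since multiplying by orthogonal matrices on either side leaves the singular values unchanged.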
Dynamics of singular values: POET (right) avoids the large singular value growth seen in standard AdamW training (left).
Spectral Diversity
POET maintains consistently higher SVD entropy (singular value diversity) throughout training compared to AdamW and Muon.
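As a rough illustration of what such a metric measures, here is a sketch of one common way to compute SVD entropy. The exact definition used for the figure is not stated here, so the normalization below (squared singular values, divided by the log of their count) is an assumption, and `svd_entropy` is a hypothetical helper name.

```python
import numpy as np


def svd_entropy(W, eps=1e-12):
    """Normalized entropy of the squared singular value distribution (assumed definition)."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s**2 / np.sum(s**2)                  # treat the spectrum as a probability vector
    entropy = -np.sum(p * np.log(p + eps))   # eps guards against log(0)
    return entropy / np.log(len(s))          # scale so a flat spectrum scores 1


flat = np.eye(8)                  # all singular values equal
u = np.arange(1.0, 9.0)
spiky = np.outer(u, u)            # rank one: a single dominant singular value

print(svd_entropy(flat), svd_entropy(spiky))
```

Under this definition a perfectly flat spectrum scores close to 1 and a rank-one matrix scores close to 0, so higher values indicate more evenly spread singular values.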
Efficient Approximation: Stochastic Primitive Optimization (SPO)
Large orthogonal matrices are expensive to optimize naively. POET introduces two efficient variants:
- POET-FS (Fully Stochastic SPO): Randomly samples a small submatrix at each step. Highly parameter-efficient; decouples parameter count from matrix size.
- POET-BS (Block-Stochastic SPO): Block-diagonal structure with random permutations; transforms all dimensions simultaneously. More expressive per parameter.
Weight update coverage: POET-BS achieves more even updates across all weight elements compared to POET-FS.
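The difference between the two variants can be sketched with small dense matrices. This is an illustrative NumPy construction under stated assumptions, not the repository's implementation: the blocks are sampled at random rather than learned, and the helper names `fully_stochastic` and `block_stochastic` are hypothetical.

```python
import numpy as np


def random_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix via QR of a Gaussian matrix (illustrative helper)."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))


def fully_stochastic(m, b, rng):
    """Identity everywhere except a randomly chosen b x b index set (POET-FS style)."""
    idx = rng.choice(m, size=b, replace=False)
    R = np.eye(m)
    R[np.ix_(idx, idx)] = random_orthogonal(b, rng)
    return R


def block_stochastic(m, b, rng):
    """Randomly permuted block-diagonal orthogonal matrix (POET-BS style)."""
    assert m % b == 0, "this sketch assumes b divides m"
    D = np.zeros((m, m))
    for start in range(0, m, b):
        D[start:start + b, start:start + b] = random_orthogonal(b, rng)
    perm = rng.permutation(m)
    Pi = np.eye(m)[perm]           # dense permutation matrix
    return Pi.T @ D @ Pi


rng = np.random.default_rng(0)
m, b = 12, 4
R_fs = fully_stochastic(m, b, rng)
R_bs = block_stochastic(m, b, rng)

# Count rows that differ from the identity, i.e. dimensions transformed in this step.
touched_fs = int(np.sum(np.any(~np.isclose(R_fs, np.eye(m)), axis=1)))
touched_bs = int(np.sum(np.any(~np.isclose(R_bs, np.eye(m)), axis=1)))
print(touched_fs, touched_bs)
```

Both matrices are orthogonal by construction. The fully stochastic one touches at most `b` dimensions per step, while the block-stochastic one transforms all of them, which matches the coverage behavior described above.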
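Orthogonal matrices are parameterized via Cayley-Neumann Parameterization (CNP), which approximates the matrix inverse in the Cayley transform with a truncated Neumann series for numerical stability:
$$R = (I + Q)(I - Q)^{-1} \approx (I + Q)\Big(I + \sum_{i=1}^{k} Q^i\Big)$$
where $Q$ is a skew-symmetric matrix and $k$ is the number of Neumann terms kept.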
A merge-then-reinitialize trick periodically absorbs the learned orthogonal matrices into the base weight matrix, preventing error accumulation and keeping the Neumann series convergent.
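A small NumPy sketch can make the parameterization and the merge step concrete. It is written under assumptions: the Cayley transform is taken as (I + Q)(I - Q)^{-1} with Q skew-symmetric, the inverse is replaced by the first k Neumann terms, and the function and variable names are illustrative rather than the library's API.

```python
import numpy as np


def cayley_neumann(A, k=4):
    """Approximate Cayley transform (I + Q)(I + Q + ... + Q^k) with Q skew-symmetric."""
    Q = 0.5 * (A - A.T)                  # skew-symmetric part of the raw parameters
    I = np.eye(A.shape[0])
    series, term = I.copy(), I.copy()
    for _ in range(k):
        term = term @ Q                  # next power of Q
        series = series + term
    return (I + Q) @ series


def cayley_exact(A):
    """Reference Cayley transform using an explicit matrix inverse."""
    Q = 0.5 * (A - A.T)
    I = np.eye(A.shape[0])
    return (I + Q) @ np.linalg.inv(I - Q)


rng = np.random.default_rng(0)
m, n = 8, 6
W0 = rng.standard_normal((m, n))
A_R = 0.01 * rng.standard_normal((m, m))   # small parameters keep the series accurate
A_P = 0.01 * rng.standard_normal((n, n))

R, P = cayley_neumann(A_R), cayley_neumann(A_P)
approx_error = np.max(np.abs(R - cayley_exact(A_R)))
W_effective = R @ W0 @ P

# Merge-then-reinitialize: fold R and P into the base weight, then reset the parameters.
W0_merged = R @ W0 @ P
A_R, A_P = np.zeros((m, m)), np.zeros((n, n))
W_after_merge = cayley_neumann(A_R) @ W0_merged @ cayley_neumann(A_P)
print(f"truncation error: {approx_error:.2e}")
```

Resetting the parameters to zero makes both orthogonal factors the identity again, so the effective weight is unchanged by the merge while the skew-symmetric parameters return to the small-norm regime where the truncated series is accurate.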
Results
POET-BS with block size b=256 reaches lower validation perplexity (PPL, lower is better) than AdamW at every LLaMA model size on C4 while training only about 26% as many parameters.
| Method | Params | 60M PPL | 130M PPL | 350M PPL | 1.3B PPL |
|---|---|---|---|---|---|
| AdamW | Full | 26.68 | 20.82 | 16.78 | 14.73 |
| GaLore | Full | 29.81 | 22.35 | 17.99 | 18.33 |
| LoRA (r=64) | ~5% | 39.70 | 32.07 | 25.19 | 20.55 |
| POET-BS (b=128) | ~13% | 26.90 | 21.86 | 18.05 | 16.24 |
| POET-BS (b=256) | ~26% | 25.29 | 19.88 | 16.27 | 14.56 |
Quantitative comparison of validation perplexity
POET-FS (b=1/2) still outperforms AdamW even when AdamW is trained with ~3× more tokens.
POET-X
Overview
POET-X is a scalable, memory-efficient variant of POET that makes orthogonal equivalence training practical at the billion-parameter scale.
The original POET must store the full transformed weight for backpropagation, making it more memory-intensive than AdamW. POET-X resolves this through a suite of engineering innovations.
Key Results
Latency breakdown: POET-X reduces forward+backward latency from 10.59 ms (POET) to 1.38 ms (POET-X_fast), approaching standard linear layers.
Memory breakdown for Llama-8B training on a single GPU. POET-X_mem achieves PEFT-level memory; POET runs out of memory (OOM).
Pretraining Results
Llama-3B pretraining on 60B C4 tokens: POET-X achieves better PPL than AdamW and all memory-efficient baselines.
POET-XQ (quantized): Best PPL of 14.78 with minimal memory footprint, outperforming GaLore and APOLLO.
Training dynamics with different block sizes:
Validation PPL curves at block size b=256 (left) and b=1024 (right).
Memory Efficiency
Peak GPU memory across model sizes (3B–13B) and sequence lengths: POET-X_mem uses less peak memory than all baselines, including LoRA.
Throughput & Distributed Scaling
POET-X closely follows ideal linear scaling on 64× H100s, while AdamW (FSDP) plateaus due to communication overhead.
Method: Key Optimizations
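The core insight is an input-centric formulation that avoids materializing the full transformed weight: the orthogonal factors and the base weight are applied to the input one after another instead of being multiplied together first.
This replaces the matrix-matrix products needed to form the transformed weight with a sequence of matrix-vector products.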
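The equivalence behind this formulation can be sketched in a few lines of NumPy. The sketch assumes the convention y = R W0 P x for a single input vector x (the actual kernels may use a different layout), and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, batch = 6, 4, 3
W0 = rng.standard_normal((m, n))                   # fixed base weight
R, _ = np.linalg.qr(rng.standard_normal((m, m)))   # stand-in orthogonal factor
P, _ = np.linalg.qr(rng.standard_normal((n, n)))   # stand-in orthogonal factor
X = rng.standard_normal((batch, n))                # one input row per example

# Weight-centric: build the full transformed weight, then apply it.
W = R @ W0 @ P
Y_weight_centric = X @ W.T

# Input-centric: push the inputs through P, W0, and R in turn.
# No m x n transformed weight is ever formed.
Y_input_centric = ((X @ P.T) @ W0.T) @ R.T

print(np.max(np.abs(Y_weight_centric - Y_input_centric)))
```

Both paths compute the same outputs up to round-off. The input-centric path only ever multiplies activations by one factor at a time, which is what makes it possible to skip storing the transformed weight for backpropagation.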
Four engineering innovations:
- Permutation Acceleration: Custom CUDA kernels for index-mapped permutations (up to 20× speedup).
- Permutation Reduction: Pre-computes permuted weights once per inner loop, eliminating redundant ops.
- Batch-Parallel Strategy: Treats each block of the block-diagonal orthogonal matrices as an independent batch element; avoids large sparse matrix construction.
- Fused Cayley-Neumann Kernels: A Triton kernel loads its input matrices into shared memory once for all terms of the Neumann series; the backward pass is also fused.
Fused Cayley-Neumann parameterization: batch-wise implementation via Triton kernel fusion.
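Two of the ideas above, index-mapped permutations and the batch-parallel treatment of blocks, can be illustrated without any custom kernels. The following NumPy sketch is only a reference for the arithmetic and not the repository's CUDA or Triton code. Shapes and names are assumptions, and random blocks stand in for the orthogonal ones.

```python
import numpy as np

rng = np.random.default_rng(0)
d, b, batch = 12, 4, 5
num_blocks = d // b
blocks = rng.standard_normal((num_blocks, b, b))   # stand-ins for the orthogonal blocks
perm = rng.permutation(d)
X = rng.standard_normal((batch, d))

# Reference path: dense permutation matrix and dense block-diagonal matrix.
Pi = np.eye(d)[perm]
D = np.zeros((d, d))
for i in range(num_blocks):
    D[i * b:(i + 1) * b, i * b:(i + 1) * b] = blocks[i]
Y_dense = X @ Pi.T @ D.T

# Index-mapped permutation: gather columns instead of multiplying by Pi.
X_perm = X[:, perm]

# Batch-parallel blocks: view the features as (num_blocks, b) and run one
# small matmul per block, so the d x d block-diagonal matrix is never built.
X_blocks = X_perm.reshape(batch, num_blocks, b)
Y_batched = np.einsum("kij,nkj->nki", blocks, X_blocks).reshape(batch, d)

print(np.max(np.abs(Y_dense - Y_batched)))
```

The batched path stores `num_blocks * b * b` numbers for the transform instead of `d * d`, and the permutation costs a gather rather than a matrix product.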
POET-X Variants
| Variant | Memory | Speed | Notes |
|---|---|---|---|
| POET-X_fast | Medium | Fast | Standard autograd, saves activation |
| POET-X_mem | Lowest | Moderate | Gradient checkpointing, recomputes on-the-fly |
| POET-XQ | Lowest | High throughput | INT8 quantized base weights, dequantized on-the-fly |
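To give a feel for what storing INT8 base weights and dequantizing on the fly involves, here is a generic sketch of symmetric per-row absmax quantization. The quantization scheme POET-XQ actually uses is not described in this README, so the scheme, the function names, and the shapes below are all assumptions made for illustration.

```python
import numpy as np


def quantize_int8(W):
    """Symmetric per-row absmax quantization to INT8 (assumed scheme, for illustration)."""
    scale = np.max(np.abs(W), axis=1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-12)               # avoid division by zero on all-zero rows
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Recover an approximate floating-point weight right before it is used."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32))   # stand-in for a frozen base weight
q, scale = quantize_int8(W)
W_hat = dequantize(q, scale)

max_error = np.max(np.abs(W - W_hat))
print(q.nbytes, W.astype(np.float32).nbytes, max_error)
```

In this scheme the stored weight takes one byte per element plus one scale per row, and the round-off error of each element is bounded by half of its row's scale. Because the base weight is never updated between merges, quantizing it is a natural place to save memory.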
Citation
@InProceedings{qiu2025poet,
title={Reparameterized LLM Training via Orthogonal Equivalence Transformation},
author={Qiu, Zeju and Buchholz, Simon and Xiao, Tim Z. and Dax, Maximilian and Sch{\"o}lkopf, Bernhard and Liu, Weiyang},
booktitle={NeurIPS},
year={2025}
}
@InProceedings{qiu2026poetx,
title={POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation},
author={Qiu, Zeju and Liu, Lixin and Weller, Adrian and Shi, Han and Liu, Weiyang},
booktitle={ICML},
year={2026},
}