POET & POET-X for LLM Pretraining

June 7, 2026

Reparameterized LLM Training via Orthogonal Equivalence Transformation

POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation

Overview
Installation
Quick Start
POET
POET-X
Citation
Related Work

Overview

This repository contains the official implementation of POET and POET-X — a family of reparameterized LLM training algorithms that optimize weight matrices through Orthogonal Equivalence Transformation (OET), achieving superior generalization with provably bounded weight spectra.

Figure: The three learning phases of POET: conical shell searching → stable learning → final adjusting.

Because POET-X is an efficient implementation of POET that introduces no approximation, this repository provides only the POET-X implementation. The original POET implementation is obsolete.

Installation

```bash
git clone https://github.com/Sphere-AI-Lab/poet.git
cd poet
pip install -e .
```

Requirements:

Python ≥ 3.10
PyTorch ≥ 2.7
CUDA ≥ 12.6
Triton ≥ 3.4.0

Quick Start

Get started with POET in just a few lines of code:

```python
from poet_torch import POETConfig, POETModel, get_poet_optimizer

# 1. Create config
config = POETConfig(
    block_size=256,       # POET block size
    merge_interval=200,   # Steps between merge-then-reinitialize
)

# 2. Wrap your model
model = POETModel(your_model, config)

# 3. Create optimizer
optimizer = get_poet_optimizer(model, config)

# 4. Training loop
for step, batch in enumerate(dataloader):
    loss = model(**batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    model.merge_if_needed(step)  # Automatic merge
```

Key Components

| Component | Description |
|---|---|
| `POETConfig` | Configuration for block size, merge interval, and variant selection |
| `POETModel` | Wraps your model to apply Orthogonal Equivalence Transformation |
| `get_poet_optimizer()` | Creates an optimizer tailored for POET's orthogonal parameters |
| `merge_if_needed()` | Periodically absorbs orthogonal matrices into base weights |
📁 More Examples: Explore comprehensive training scripts and additional APIs in the examples/ directory.

POET

Method

POET reparameterizes each weight matrix as:

$W_{RP} = R \, W_0 \, P$

where $W_0 \in \mathbb{R}^{m \times n}$ is a fixed, randomly initialized matrix, and $R \in \mathbb{R}^{m \times m}$ and $P \in \mathbb{R}^{n \times n}$ are learnable orthogonal matrices. Training updates only $R$ and $P$, leaving $W_0$ unchanged.

Why orthogonal transformations? They preserve singular values exactly — giving POET direct, provable control over the weight spectrum throughout training.
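
The invariance can be checked numerically. The following is a minimal NumPy sketch, not code from this repository: `random_orthogonal` is an illustrative helper, and random orthogonal matrices stand in for the learned $R$ and $P$.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Return a d x d orthogonal matrix (the Q factor of a Gaussian matrix)."""
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

rng = np.random.default_rng(0)
m, n = 6, 4
W0 = rng.standard_normal((m, n))   # fixed random base weight
R = random_orthogonal(m, rng)      # stands in for the learned left factor
P = random_orthogonal(n, rng)      # stands in for the learned right factor
W_RP = R @ W0 @ P

# np.linalg.svd returns singular values sorted in descending order,
# so the two spectra can be compared element-wise.
sv_before = np.linalg.svd(W0, compute_uv=False)
sv_after = np.linalg.svd(W_RP, compute_uv=False)
max_sv_gap = float(np.max(np.abs(sv_before - sv_after)))
print(f"largest singular value change: {max_sv_gap:.2e}")
```

The gap is at the level of floating-point round-off, so the spectrum of $W_{RP}$ is fixed by the initialization of $W_0$.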

Figure: Dynamics of singular values. POET (right) avoids the large singular value growth seen in standard AdamW training (left).

Spectral Diversity

Figure: SVD entropy comparison. POET maintains consistently higher SVD entropy (singular value diversity) throughout training compared to AdamW and Muon.
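
This README does not define SVD entropy. As a rough illustration, the sketch below assumes one common definition: the Shannon entropy of the singular values after normalizing them to sum to one. The function name `svd_entropy` and this normalization are assumptions; the paper's definition may differ (for example, it may use squared singular values).

```python
import numpy as np

def svd_entropy(W, eps=1e-12):
    """Shannon entropy of the normalized singular values (assumed definition)."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / (s.sum() + eps)            # normalize to a probability vector
    return float(-np.sum(p * np.log(p + eps)))

flat = np.eye(4)                               # all singular values equal
peaked = np.diag([1.0, 1e-3, 1e-3, 1e-3])      # one dominant singular value

entropy_flat = svd_entropy(flat)
entropy_peaked = svd_entropy(peaked)
print(entropy_flat, entropy_peaked)
```

Under this definition, a flat spectrum attains the maximum entropy $\log d$ for $d$ singular values, while a spectrum dominated by one singular value scores close to zero.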

Efficient Approximation: Stochastic Primitive Optimization (SPO)

Large orthogonal matrices $R \in \mathbb{R}^{m \times m}$ are expensive to optimize naively. POET introduces two efficient variants:

POET-FS (Fully Stochastic SPO): Randomly samples a small $b \times b$ submatrix at each step. Highly parameter-efficient; decouples parameter count from matrix size.
POET-BS (Block-Stochastic SPO): Block-diagonal structure with random permutations; transforms all dimensions simultaneously. More expressive per parameter.
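
The structural difference between the two variants can be sketched in NumPy. This is an illustration under stated assumptions, not the repository's implementation: the helper names are hypothetical, random orthogonal blocks stand in for learned ones, and POET-BS is assumed to have the form $\Pi^\top \mathrm{diag}(G_1, \dots, G_{m/b})\, \Pi$ with a random permutation $\Pi$.

```python
import numpy as np

def random_orthogonal(d, rng):
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def fully_stochastic(m, b, rng):
    """Identity everywhere except a b x b orthogonal block on a random index set."""
    idx = rng.choice(m, size=b, replace=False)
    R = np.eye(m)
    R[np.ix_(idx, idx)] = random_orthogonal(b, rng)
    return R

def block_stochastic(m, b, rng):
    """Block-diagonal orthogonal matrix conjugated by a random permutation."""
    assert m % b == 0, "block size must divide the dimension"
    D = np.zeros((m, m))
    for start in range(0, m, b):
        D[start:start + b, start:start + b] = random_orthogonal(b, rng)
    Pi = np.eye(m)[rng.permutation(m)]   # permutation matrix
    return Pi.T @ D @ Pi

rng = np.random.default_rng(0)
m, b = 8, 2
R_fs = fully_stochastic(m, b, rng)
R_bs = block_stochastic(m, b, rng)

# Count the rows (dimensions) that each variant actually transforms.
rows_fs = int(np.any(~np.isclose(R_fs, np.eye(m)), axis=1).sum())
rows_bs = int(np.any(~np.isclose(R_bs, np.eye(m)), axis=1).sum())
print(rows_fs, rows_bs)
```

In this sketch the fully stochastic matrix touches only `b` of the `m` dimensions per sample, whereas the block-stochastic matrix touches all of them, which matches the "transforms all dimensions simultaneously" description.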

Figure: Weight update coverage. POET-BS achieves more even updates across all weight elements compared to POET-FS.

Orthogonal matrices are parameterized via the Cayley-Neumann Parameterization (CNP): a skew-symmetric matrix $Q$ is mapped to an orthogonal matrix by the Cayley transform, and the matrix inverse is approximated by a truncated Neumann series for numerical stability:

$R = (I + Q)(I - Q)^{-1} \approx (I + Q)\left(I + \sum_{i=1}^{k} Q^i\right)$
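
A minimal NumPy sketch of this approximation follows. It assumes $Q$ is skew-symmetric with spectral norm below 1 so that the Neumann series converges; `cayley_neumann` is an illustrative name, not the repository's fused kernel.

```python
import numpy as np

def cayley_neumann(Q, k):
    """Approximate (I + Q)(I - Q)^{-1} by (I + Q)(I + Q + ... + Q^k)."""
    I = np.eye(Q.shape[0])
    series = I.copy()
    power = I.copy()
    for _ in range(k):
        power = power @ Q          # Q^i for i = 1..k
        series = series + power
    return (I + Q) @ series

rng = np.random.default_rng(0)
d = 8
A = rng.standard_normal((d, d))
S = A - A.T                                # skew-symmetric
Q = 0.3 * S / np.linalg.norm(S, 2)         # rescale to spectral norm 0.3

I = np.eye(d)
exact = (I + Q) @ np.linalg.inv(I - Q)     # exact Cayley transform
errors = [float(np.linalg.norm(cayley_neumann(Q, k) - exact)) for k in (1, 2, 4, 8)]

R_k = cayley_neumann(Q, 8)
orth_error = float(np.linalg.norm(R_k.T @ R_k - I))
print(errors, orth_error)
```

The approximation error shrinks geometrically with the number of terms $k$ as long as $\|Q\| < 1$, so keeping $Q$ small is what keeps the truncated series close to orthogonal.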

A merge-then-reinitialize trick periodically absorbs $R$ and $P$ into $W_0$ (setting $W_0 \leftarrow R W_0 P$ and resetting $R$ and $P$ to the identity), which prevents error accumulation and keeps the Neumann series convergent.
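
The merge step can be sketched as follows. This is an illustrative NumPy version under stated assumptions: it uses the exact Cayley transform instead of the truncated series for brevity, and the variable names (`Q_R`, `Q_P`) are hypothetical rather than taken from the repository.

```python
import numpy as np

def cayley(Q):
    """Exact Cayley transform of a skew-symmetric matrix."""
    I = np.eye(Q.shape[0])
    return (I + Q) @ np.linalg.inv(I - Q)

def skew(A):
    return A - A.T

rng = np.random.default_rng(0)
m, n = 5, 3
W0 = rng.standard_normal((m, n))
W0_initial = W0.copy()
Q_R = 0.1 * skew(rng.standard_normal((m, m)))   # pretend these were learned
Q_P = 0.1 * skew(rng.standard_normal((n, n)))

effective_before = cayley(Q_R) @ W0 @ cayley(Q_P)

# Merge: absorb R and P into the base weight, then reset Q so R = P = I.
W0 = effective_before.copy()
Q_R = np.zeros_like(Q_R)
Q_P = np.zeros_like(Q_P)

effective_after = cayley(Q_R) @ W0 @ cayley(Q_P)
print(np.max(np.abs(effective_before - effective_after)))
```

The effective weight is unchanged by the merge, but $Q$ restarts from zero, which is where the Neumann approximation is most accurate.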

Results

Figure: Validation perplexity vs. parameters. POET outperforms AdamW with significantly fewer trainable parameters across all LLaMA model sizes on C4.

| Method | Params | 60M PPL | 130M PPL | 350M PPL | 1.3B PPL |
|---|---|---|---|---|---|
| AdamW | Full | 26.68 | 20.82 | 16.78 | 14.73 |
| GaLore | Full | 29.81 | 22.35 | 17.99 | 18.33 |
| LoRA (r=64) | ~5% | 39.70 | 32.07 | 25.19 | 20.55 |
| POET-BS (b=128) | ~13% | 26.90 | 21.86 | 18.05 | 16.24 |
| POET-BS (b=256) | ~26% | 25.29 | 19.88 | 16.27 | 14.56 |

Quantitative comparison of validation perplexity

Figure: Training speedup. POET-FS (b=1/2) still outperforms AdamW even when AdamW is trained with ~3× more tokens.

POET-X

Overview

POET-X is a scalable, memory-efficient variant of POET that makes orthogonal equivalence training practical at the billion-parameter scale.

The original POET must store the full transformed weight $RW_0P$ for backpropagation, making it more memory-intensive than AdamW. POET-X resolves this through a suite of engineering innovations.

Key Results

Latency breakdown: POET-X reduces forward+backward latency from 10.59 ms (POET) to 1.38 ms (`POET-X_fast`), approaching that of standard linear layers.

Memory breakdown for Llama-8B training on a single GPU: `POET-X_mem` achieves PEFT-level memory, while POET runs out of memory (OOM).

Pretraining Results

Figure: Llama-3B pretraining on 60B C4 tokens. POET-X achieves better PPL than AdamW and all memory-efficient baselines.

Figure: POET-XQ (quantized) reaches the best PPL of 14.78 with minimal memory footprint, outperforming GaLore and APOLLO.

Training dynamics with different block sizes:

Figure: Validation PPL curves at block size b=256 (left) and b=1024 (right).

Method: Key Optimizations

POET-X reduces the complexity from $O(nm^2)$ to that of a sequence of matrix-vector products.
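
One way to read the complexity claim above, offered as an assumption rather than a description of the actual kernels: forming $R W_0 P$ requires matrix-matrix products, whereas applying $P$, $W_0$, and $R$ to the input in turn requires only matrix-vector products and never materializes the transformed weight. A minimal NumPy sketch for a single input vector:

```python
import numpy as np

def random_orthogonal(d, rng):
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

rng = np.random.default_rng(0)
m, n = 6, 4
W0 = rng.standard_normal((m, n))
R = random_orthogonal(m, rng)
P = random_orthogonal(n, rng)
x = rng.standard_normal(n)

# Weight-centric: materialize the m x n transformed weight first
# (two matrix-matrix products), then apply it to the input.
y_materialized = (R @ W0 @ P) @ x

# Input-centric: three matrix-vector products, no transformed weight stored.
y_sequential = R @ (W0 @ (P @ x))

print(np.max(np.abs(y_materialized - y_sequential)))
```

Both orders give the same output by associativity; only the cost and the intermediate tensors that must be kept differ.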

Four engineering innovations:

Permutation Acceleration — Custom CUDA kernels for index-mapped permutations (up to 20× speedup).
Permutation Reduction — Pre-computes permuted weights once per inner loop, eliminating redundant ops.
Batch-Parallel Strategy — Treats each block of block-diagonal $G_P$ , $G_R$ as an independent batch element; avoids large sparse matrix construction.
Fused Cayley-Neumann Kernels — Triton kernel loads $Q$ and $Q^2$ into shared memory once for all terms; backward pass also fused.
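
The permutation and batch-parallel ideas can be illustrated on the CPU. The sketch below is a NumPy analogy, not the CUDA/Triton implementation: it assumes the block-diagonal factor is stored as a stack of `b x b` blocks and that a permutation is stored as an index array. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, b = 8, 2
num_blocks = m // b
blocks = rng.standard_normal((num_blocks, b, b))   # one b x b block per batch element
x = rng.standard_normal(m)
perm = rng.permutation(m)

# Dense reference: build the sparse m x m block-diagonal matrix and an
# explicit permutation matrix, then multiply.
D = np.zeros((m, m))
for i in range(num_blocks):
    D[i * b:(i + 1) * b, i * b:(i + 1) * b] = blocks[i]
Pi = np.eye(m)[perm]                # (Pi @ x)[i] == x[perm[i]]
y_dense = D @ (Pi @ x)

# Index-mapped permutation: a gather replaces the permutation matmul.
x_perm = x[perm]

# Batch-parallel: treat each block as an independent batch element.
y_batched = np.einsum("kij,kj->ki", blocks, x_perm.reshape(num_blocks, b)).reshape(m)

print(np.max(np.abs(y_dense - y_batched)))
```

The batched form never constructs the large, mostly zero `m x m` matrix and does work proportional to `m * b` instead of `m * m` per input vector.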

Figure: Fused Cayley-Neumann parameterization, implemented batch-wise via Triton kernel fusion.

POET-X Variants

| Variant | Memory | Speed | Notes |
|---|---|---|---|
| `POET-X_fast` | Medium | Fast | Standard autograd, saves activation $b$ |
| `POET-X_mem` | Lowest | Moderate | Gradient checkpointing, recomputes $b$ on-the-fly |
| `POET-XQ` | Lowest | High throughput | INT8 quantized base weights, dequantized on-the-fly |
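
To make "INT8 quantized base weights, dequantized on-the-fly" concrete, here is a hedged NumPy sketch. It assumes symmetric per-row absmax quantization, which is one common INT8 scheme; the README does not state which scheme POET-XQ uses, and the function names are hypothetical.

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-row absmax quantization (assumed scheme)."""
    scale = np.max(np.abs(W), axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)        # avoid division by zero
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover a floating-point approximation of the weight when it is needed."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W0 = rng.standard_normal((4, 16))
q, scale = quantize_int8(W0)       # stored form: int8 values plus one scale per row
W0_hat = dequantize(q, scale)      # reconstructed just before use
max_abs_error = float(np.max(np.abs(W0 - W0_hat)))
print(q.dtype, max_abs_error)
```

Because $W_0$ is frozen between merges, only the stored base weight is quantized in this sketch; the learnable orthogonal factors stay in floating point.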

Citation

@InProceedings{qiu2025poet,
  title={Reparameterized LLM Training via Orthogonal Equivalence Transformation},
  author={Qiu, Zeju and Buchholz, Simon and Xiao, Tim Z. and Dax, Maximilian and Sch{\"o}lkopf, Bernhard and Liu, Weiyang},
  booktitle={NeurIPS},
  year={2025}
}

@InProceedings{qiu2026poetx,
  title={POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation}, 
  author={Qiu, Zeju and Liu, Lixin and Weller, Adrian and Shi, Han and Liu, Weiyang},
  booktitle={ICML},
  year={2026},
}

Related Work

OFT — Orthogonal Finetuning for diffusion models
GaLore — Gradient low-rank projection
Muon — Gradient orthogonalization optimizer
