kbeta

November 7, 2025 · View on GitHub

kbeta – Kourkoutas‑β Optimiser 🌞🦎🚀📈

Reference implementation of Kourkoutas‑β: A Sunspike‑Driven Adam Optimizer with Desert Flair Published as arXiv:2508.12996.

This repository provides the optimiser implementation together with example workloads for reproducibility.

Table of Contents

Key ideas
What's new
Project layout
Quick start
Installation
Using Kourkoutas‑β in your own model
Example workloads
Dataset and creation (verifiable)
Model and training protocol
Optimizers and settings
Companion repositories
Tests & linting
Further reading
Citation
License
Community-Reported Settings (SDXL / T2I)
Contributing & roadmap

Key ideas

Layer‑wise dynamic β₂ driven by a bounded sun‑spike signal (gradient norm vs. EMA).
Two β₂ parameters: β₂_min for agility under spikes, β₂_max for stability when calm.
Optional features: soft‑max AMSGrad, trust‑region clipping, adaptive tiny term.
Drop‑in compatibility: recovers exact Adam when dynamic β₂ and extras are disabled.
100 % Apple MLX compatible – no PyTorch required.

See the paper for derivations, experiments, and theoretical analysis.

What’s new

Community wrapper: @Koratahiu published an adapter that adds Kourkoutas‑β to any optimizer stack
→ https://github.com/Koratahiu/Advanced_Optimizers

They used with it SDXL/T2I and became their default vs fixed β₂; see Community-Reported Settings (SDXL / T2I) for recommended Kourkoutas‑β settings in this case.

Now recommended in the community-maintained Awesome PINNs list (Nov 2025).

Upcoming (nearing completion)

Availability diagnostics (A* and η) — coming to a companion repo:
We expose Actionable Availability $A^\*_{\text{act}}$ and Durable Loss‑Conversion Efficiency $\eta$ , which let you read training as “work extracted from the landscape”.

Think of it as a heat‑engine view: Kourkoutas‑β harvests available work and reduces dissipation. A teaser figure is in the MLX PINN/Transformer repos; a full diagnostic pack will follow.
NVIDIA GPU support is targeted via a PyTorch port (CUDA) or MLX-CUDA (both WIP).

Kourkoutas‑β is an Adam‑style optimiser whose second‑moment decay β₂ is no longer a hard‑wired constant. Instead, every update computes a sun‑spike score—a single, cheap scalar that compares the current gradient magnitude to its exponentially‑weighted history. We then map that score to β₂ on the fly:

Sun‑spike	Lizard metaphor	Adaptive behaviour
High	The desert sun is scorching — the lizard is “fully warmed up” and sprints.	Lower β₂ toward β₂,min → second‑moment memory shortens, allowing rapid, large parameter moves.
Low	It’s cool; the lizard feels sluggish and takes cautious steps.	Raise β₂ toward β₂,max → longer memory, filtering noise and producing steadier updates.

Because the sun‑spike diagnostic exists only in Kourkoutas‑β, the method can be viewed as Adam with a temperature‑controlled β₂ schedule: warm gradients trigger exploration; cooler gradients favour exploitation and stability.

Project layout

kbeta
├── src/kbeta/                   # pip package
│   ├── __init__.py              # exports KourkoutasBeta / KourkoutasSoftmaxFlex
│   └── optim/
│       └── kbeta_softmax.py     # implementation
│
├── examples/
│   └── transformer_char_lm/     # Testbed D: character‑level LM on small‑enwik8
│
├── tests/                       # pytest suite (smoke + ablation tests)
├── assets/                      # logo and figure
├── pyproject.toml
├── MANIFEST.in
├── README.md
└── LICENSE

Quick start

# 1. clone your fork
git clone git@github.com:<YOUR-USERNAME>/kbeta.git
cd kbeta

# 2. create a fresh virtualenv
python -m venv .venv && source .venv/bin/activate

# 3. editable install + dev extras
pip install -e ".[dev]"

# 4. run the smoke + ablation tests
pytest -q

Installation

Option 1: PyPI wheels (end-users)

If you only want the optimiser in your own MLX projects, install from PyPI:

pip install kbeta

This gives you just the kbeta package with the latest MLX.

For development tools and examples:

pip install "kbeta[dev]"

For exact reproducibility of the paper results (MLX 0.26.3, Adam-95/999 baselines):

pip install "kbeta[repro]"

Option 2: Cloning the repo (researchers / contributors)

If you want to run the example workloads or contribute to development, clone the repo:

git clone https://github.com/sck-at-ucy/kbeta.git
cd kbeta
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

This installs the package in editable mode and makes all example scripts available.

Minimal example

import time
import mlx.core as mx
import mlx.nn as nn
from kbeta import KourkoutasBeta

num_features, num_examples, num_iters, lr = 100, 1000, 1000, 0.01

# True parameters and data
w_star = mx.random.normal((num_features,))
X = mx.random.normal((num_examples, num_features))
y = X @ w_star + 1e-2 * mx.random.normal((num_examples,))

# Simple model with one parameter
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = mx.zeros((num_features,))

    def __call__(self, x):
        return x @ self.w

model = Model()

def loss_fn(m):
    return 0.5 * mx.mean(mx.square(m(X) - y))

opt = KourkoutasBeta(learning_rate=lr)
opt.init(model.parameters())

grad_fn = nn.value_and_grad(model,loss_fn)

tic = time.time()
for _ in range(num_iters):
    loss, grads = grad_fn(model)
    opt.update(model, grads)
    mx.eval(model.parameters())
toc = time.time()

error_norm = float(mx.linalg.norm(model.w - w_star))
print(f"Loss={loss.item():.5f}, L2|w-w*|={error_norm:.5f} "
      f"Throughput={num_iters/(toc-tic):.1f} it/s")

Example workloads

Important: 👉 👉 The 2‑D Transformer (Heat2D, Testbed A) and 3‑D PINN (Heat3D, Testbed B) of the paper are released as separate repositories:

This repo includes the Transformer – Testbed D (Char-level LM on small-enwik8)

Folder	Paper section	What it shows	How to run
`examples/transformer_char_lm`	§ 6.4 (Testbed D)	Character‑level LM on small‑enwik8	`python examples/transformer_char_lm/testbed_d.py --text ./data/small_enwik8.txt --opt kbeta`

Running Transformer – Testbed D (Char-level LM on small-enwik8)

All commands assume running from the repo root (adjust accordingly) 👉 Make sure you have generated ./data/small-enwik8.txt and the ./logs_enwi directory as described below.

Run the Transformer training with the same options used in the paper (adapted to the repo paths):

  python -u examples/transformer_char_lm/testbed_d.py --text ./data/small-enwik8.txt     --steps 50001 --batch 4 --d_model 512 --n_layer 6 --n_head 8     --ctx 512 --lmin 16 --lmax 512 --warmup 250 --opt kbeta --adam_beta2 0.95     --layer_bucket per-array --barrier_every 100 --eval_every 500     --lr 1e-3     --seed 0 --fixed_eval_seed 1234 --deterministic --compile     --wd 0.0 --lr_schedule "1:1e-3,30000:5e-4,40000:1e-4,60000:1e-5"     2>&1 | tee "logs_enwik/kbeta_seed0.log"

This reproduces a run that mirros the testbed reported in the paper with full logging under logs_enwik/.

Dataset and creation (verifiable)

We use the first 30 MB of enwik8 (the classic Hutter Prize corpus). The slice is created deterministically:

curl -L -o enwik8.zip https://data.deepai.org/enwik8.zip
unzip enwik8.zip
head -c 30000000 enwik8 > small-enwik8.txt
mkdir -p data && mv small-enwik8.txt data/
mkdir ./logs_enwik

Checksums on our machine:

sha256sum enwik8
# 2b49720e...c024a8

sha256sum data/small-enwik8.txt
# e0152eee...298b7

Re-creating small-enwik8.txt reproduced the same SHA‑256 (bit‑for‑bit identity).

Model and training protocol

As in the provided script, we train:

Architecture: 6‑block Transformer (d_model=512, n_head=8, FFN width = 4d) GELU, LayerNorm, causal self‑attention; no dropout or weight decay.
Data schedule: variable sequence length with deterministic bucketing (L \in [16,512]), rounded to multiples of 32; batch = 4; context window = 512.
Steps: 50,001
Learning rate schedule:
- 1e‑3 for steps 1 ≤ s < 30k
- 5e‑4 for 30k ≤ s < 40k
- 1e‑4 for 40k ≤ s ≤ 50k
Evaluation: fixed held‑out batch (length = 256, B = 128) reporting cross‑entropy and BPC.
Runs: 10 matched seeds (0–9).

Optimizers and settings

Kourkoutas‑β (ours): β₁=0.9; dynamic β₂∈[0.88,0.999]; α=0.93 (EMA for sunspike); ε=1e‑8; warm‑up=250 steps; bias_correction="beta2max"; per‑array stable buckets; no AMSGrad/clip/adaptive‑tiny; diagnostics off.
Adam‑95: MLX Adam (β₁=0.9, β₂=0.95, ε=1e‑8), bias correction on.
Adam‑999: MLX Adam (β₁=0.9, β₂=0.999, ε=1e‑8), bias correction on.

Companion repositories

This repository hosts the core optimizer implementation and the char-level Transformer example (Testbed D).

Other workloads from the paper are available in dedicated repositories:

kbeta-transformer2d – 2-D Transformer surrogate for Heat2D (Testbed A).
kbeta-pinn3d – 3-D Physics-Informed Neural Network for Heat3D (Testbed B).

These companion repos share the same optimizer API and training protocol, so you can directly apply KourkoutasBeta with no code changes.

Tests & linting

pytest                 # unit & ablation tests
ruff check .           # style / imports / naming
pre-commit run --all   # run all hooks (if installed)

Continuous Integration (CI) runs these checks automatically.

Resource	Why it Matters for Kourkoutas-β & kbeta
MLX Beyond Language (repo) https://github.com/sck-at-ucy/MLX_BeyondLanguage	Companion project that demonstrates how to scale MLX Transformer workloads beyond language (e.g., physics, vision). Provides coding conventions, dataset helpers, and plotting utilities reused in `kbeta` examples.
MLX framework (Apple) https://github.com/ml-explore/mlx	The underlying tensor/NN library that powers `kbeta`. Understanding MLX’s compile/runtime model explains why adaptive optimisers like Kourkoutas-β can hit full Metal GPU speed without CUDA.
Article: Kourkoutas-β: An Adam-style Optimizer with Dynamic Memory for Bursty Gradients https://arxiv.org/abs/2508.12996	The permanent arXiv link to the paper introducing Kourkoutas-β, with derivation, convergence analysis, and ablations.
kbeta-transformer2d (Heat2D benchmark) https://github.com/sck-at-ucy/kbeta-transformer2d	2-D Transformer surrogate workload (Testbed A). Demonstrates how Kourkoutas-β performs on PDE-constrained sequence modeling. ✅ Public release now available.
kbeta-pinn3d (PINN benchmark) https://github.com/sck-at-ucy/kbeta-pinn3d	3-D Physics-Informed Neural Network workload (Testbed B) that logs β₂ “spike” diagnostics during training. Lets you compare Kourkoutas-β on PDE-constrained PINNs vs. data-driven Transformers. ✅ Public release now available.

Citation

If you use this code or method in your research, please link back to this repo and cite:

Paper (arXiv preprint)

@article{Kassinos2025Kourkoutas,
  title   = {Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair},
  author  = {Stavros Kassinos},
  journal = {arXiv preprint arXiv:2508.12996},
  year    = {2025},
  url     = {http://arxiv.org/abs/2508.12996}
}

** Software (Zenodo archive**)

@software{kassinos2025kourkoutasbeta,
  author       = {Stavros Kassinos},
  title        = {Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair},
  year         = 2025,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.16902741},
  url          = {https://doi.org/10.5281/zenodo.16902741}
}

License

This work is distributed under the MIT License—see LICENSE for details.

Community-Reported Settings (SDXL / T2I)

Reported by @Koratahiu (using an RTX 3090, batch size = 4):

β₁: 0.9
β₂ range: beta2_min = 0.90, beta2_max = 0.999
EMA for sunspike: α = 0.95
Fixed β₂ baselines for comparison: β₂ ∈ { 0.99, 0.999 }
Learning rate: Primarily Prodigy (Adam with adaptive LR, no manual tuning)

With standard Adam, a higher manual LR improved results slightly;
K-β, using the same LR, significantly outperformed and was more stable.

💡 Note: In fused-backpass pipelines, EMAs are sometimes lagged by one step
(since full gradients aren’t all present at once). K-β’s logic is fully
compatible with that pattern.

More details in @KorataHiu’s repo

Contributing & roadmap

We welcome issues & PRs!

Planned milestones:

v0.1.0 – optimiser + char‑LM demo (public).
v0.2.0 – PDE workloads migrated to their own repos.
v1.0.0 – journal publication, pip wheels for macOS/Apple Silicon & Linux.

Happy sprinting in the (numerical) desert 🌞🦎🚀📈