Benchmarking Optimizers for Tabular MLPs
April 17, 2026
:scroll: arXiv :books: Other tabular DL projects
This is the official repository of the paper "Benchmarking Optimizers for Tabular MLPs".
The README has instructions on using Muon and EMA in practice and on reproducing the paper results.
Benchmark Results
We benchmark 15 optimizers on 17 tabular datasets for training MLPs and MLP-based models. We find that Muon consistently outperforms AdamW. We also highlight AdamW with an Exponential Moving Average (EMA) of model weights as a simple way to improve AdamW on plain MLPs, though its effect is less consistent for advanced MLP-based models.
Read the paper for a deeper overview of the benchmark and results.
Figure 1 from the paper. The left column shows the average ranks of the optimizers (lower is better). The middle part shows the average improvement over the AdamW baseline. The rightmost part shows wins, ties, and losses against AdamW.
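As a rough illustration of the aggregates named in the caption, the sketch below computes average ranks and win/tie/loss counts from a per-dataset score table. The function names, the optimizer labels, and the toy scores are all made up for illustration, higher scores are assumed to be better, and the paper's exact procedure (for example, how ties are decided) may differ.

```python
def average_ranks(scores):
    """scores: {dataset: {optimizer: score}} -> {optimizer: mean rank}.

    Rank 1 is best; tied optimizers share the mean of the ranks they occupy.
    """
    ranks = {}
    for per_opt in scores.values():
        for name, value in per_opt.items():
            better = sum(v > value for v in per_opt.values())
            equal = sum(v == value for v in per_opt.values())  # includes itself
            ranks.setdefault(name, []).append(better + (equal + 1) / 2)
    return {name: sum(r) / len(r) for name, r in ranks.items()}


def win_tie_loss(scores, candidate, baseline):
    """Count datasets where the candidate beats, ties with, or loses to the baseline."""
    wins = ties = losses = 0
    for per_opt in scores.values():
        if per_opt[candidate] > per_opt[baseline]:
            wins += 1
        elif per_opt[candidate] == per_opt[baseline]:
            ties += 1
        else:
            losses += 1
    return wins, ties, losses


# Toy numbers, not results from the paper.
toy_scores = {
    "d1": {"baseline": 0.80, "opt_a": 0.85, "opt_b": 0.80},
    "d2": {"baseline": 0.70, "opt_a": 0.70, "opt_b": 0.65},
}
print(average_ranks(toy_scores))
print(win_tie_loss(toy_scores, "opt_a", "baseline"))
```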
Using Muon and EMA in Practice
This section provides a practical recipe for using Muon and EMA in tabular deep learning pipelines.
The examples below use:
- tabm for MLP-style tabular backbones
- rtdl_num_embeddings for numerical feature embeddings
- Muon for the optimizer
If you use uv, you can install everything with:
uv add tabm rtdl-num-embeddings git+https://github.com/KellerJordan/Muon
For EMA, you do not need any extra package: PyTorch already provides the necessary utilities via torch.optim.swa_utils.
Muon: Applying to Tabular MLP-based models
Muon should only be applied to the hidden weights of the backbone. We do not apply Muon to the output layer matrix or to the feature embedding matrices. The following snippet provides a simple parameter grouping for PyTorch tabular DL models that use MLP backbones (e.g. from tabm) and embeddings for numerical features from rtdl_num_embeddings.
import torch.nn as nn
from muon import SingleDeviceMuonWithAuxAdam
from rtdl_num_embeddings import (
    LinearEmbeddings,
    LinearReLUEmbeddings,
    PiecewiseLinearEmbeddings,
    PeriodicEmbeddings,
)


def make_optimizer(
    model: nn.Module,
    *,
    lr: float = 3e-4,  # the lr values are arbitrary; you should most probably tune the lr
    weight_decay: float = 0.1,
    betas: tuple[float, float] = (0.9, 0.999),
    eps: float = 1e-8,
    muon_lr: float = 0.002,  # same for the Muon lr
    # Assumed defaults: 0.95 is the default momentum of the Muon reference
    # implementation; None means "reuse weight_decay for the Muon group".
    muon_momentum: float = 0.95,
    muon_weight_decay: float | None = None,
):
    # Muon: matrix-like hidden weights of the backbone only.
    muon_params = set([
        module.weight
        for module in model.backbone.modules()
        if isinstance(module, nn.Linear) and 1 not in module.weight.shape[-2:]
    ])

    # Zero weight decay: biases, normalization layers, numerical embeddings.
    zero_wd_types = (
        nn.BatchNorm1d,
        nn.LayerNorm,
        nn.InstanceNorm1d,
        LinearEmbeddings,
        LinearReLUEmbeddings,
        PiecewiseLinearEmbeddings,
        PeriodicEmbeddings,
    )
    zero_wd_params = set([
        p
        for module in model.modules()
        for name, p in module.named_parameters()
        if name.endswith('bias') or isinstance(module, zero_wd_types)
    ])

    return SingleDeviceMuonWithAuxAdam([
        # Group 1: Muon on the hidden weight matrices of the backbone.
        dict(
            params=list(muon_params),
            lr=muon_lr,
            momentum=muon_momentum,
            weight_decay=(weight_decay if muon_weight_decay is None else muon_weight_decay),
            use_muon=True,
        ),
        # Group 2: Adam without weight decay.
        dict(
            params=[p for p in zero_wd_params if p not in muon_params],
            lr=lr,
            betas=betas,
            eps=eps,
            weight_decay=0.0,
            use_muon=False,
        ),
        # Group 3: Adam with weight decay for all remaining parameters.
        dict(
            params=[
                p
                for p in model.parameters()
                if p not in muon_params and p not in zero_wd_params
            ],
            lr=lr,
            betas=betas,
            eps=eps,
            weight_decay=weight_decay,
            use_muon=False,
        ),
    ])
Muon: Example Usage
Here is a sketch of how Muon could be used with the above make_optimizer function with a basic MLP with LinearReLU embeddings:
import torch.nn as nn
import tabm
from rtdl_num_embeddings import LinearReLUEmbeddings

n_num_features = 24
d_embedding = 16
d_out = 1


class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.num_embeddings = LinearReLUEmbeddings(n_num_features, d_embedding)
        # this is a regular MLP backbone, just from the tabm package
        self.backbone = tabm.MLPBackbone(
            d_in=n_num_features * d_embedding,
            n_blocks=3,
            d_block=512,
            dropout=0.1,
        )
        self.head = nn.Linear(512, d_out)

    def forward(self, x_num):
        x = self.num_embeddings(x_num).flatten(1)
        x = self.backbone(x)
        return self.head(x)


model = Model()
optimizer = make_optimizer(model, lr=3e-4, weight_decay=1e-4, muon_lr=0.02)
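Since the grouping relies on three hand-written filters, it can be worth checking that every parameter ends up in exactly one group. The helper below is a minimal sketch of such a check; `check_param_groups` is a hypothetical name, not part of this repository or of the Muon package, and it only assumes that each group is a dict with a "params" list, which is how PyTorch optimizers expose `param_groups`.

```python
def check_param_groups(model_params, param_groups):
    """Raise if a parameter is missing from, or duplicated across, the groups.

    Parameters are compared by identity, so this works for tensors as well as
    for any other Python objects. Returns the number of parameters per group.
    """
    owner = {}
    for index, group in enumerate(param_groups):
        for p in group["params"]:
            if id(p) in owner:
                raise ValueError(
                    f"a parameter appears in groups {owner[id(p)]} and {index}"
                )
            owner[id(p)] = index
    missing = sum(1 for p in model_params if id(p) not in owner)
    if missing:
        raise ValueError(f"{missing} parameter(s) are not assigned to any group")
    return [len(group["params"]) for group in param_groups]
```

With the model and optimizer defined above, it could be called as `check_param_groups(list(model.parameters()), optimizer.param_groups)`; the returned list then shows how many parameters went to Muon, to Adam without weight decay, and to Adam with weight decay.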
EMA: basic example
Exponential Moving Average (EMA) is easy to add and it can often provide improvements over plain AdamW for MLP models.
To use PyTorch's built-in EMA utilities, during the model setup you can do:
import torch
import tabm
from torch.optim.swa_utils import AveragedModel, get_ema_multi_avg_fn
model = Model() # can be the model from the above snippet
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
ema_model = AveragedModel(
    model,
    multi_avg_fn=get_ema_multi_avg_fn(0.99),
)
Then update the EMA model after each optimizer step:
for x, y in train_loader:
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    ema_model.update_parameters(model)
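To make the effect of the decay value concrete, here is a plain-Python sketch of the update that, to the best of our understanding, `update_parameters` performs with an EMA averaging function: the first call copies the current weights, and each later call moves the average a fraction `1 - decay` towards them. The function `ema_update` and the toy weight values are illustrative only.

```python
def ema_update(ema_weights, weights, decay=0.99):
    """One EMA step: ema <- decay * ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Toy one-parameter "model" whose weight is 1.0, 2.0, 3.0 after three steps.
weights_per_step = [[1.0], [2.0], [3.0]]

ema = list(weights_per_step[0])  # the first update just copies the weights
for weights in weights_per_step[1:]:
    ema = ema_update(ema, weights, decay=0.9)

print(ema)  # the average lags behind the latest weight of 3.0
```

A larger decay gives a smoother but slower-moving average. At evaluation time, predictions would typically be taken from `ema_model(x)` rather than `model(x)`.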
Reproducing Paper Results
This section gives an overview of the code, configs, results, and notebooks needed to:
- Reproduce experiments (continue reading this document for instructions).
- Reproduce figures and tables from the paper (see this section).
Repository layout
- bin/: scripts for tuning, evaluation, and model training
- lib/: shared utilities and notebook/result helpers
- lib/optim/: local optimizer wrappers/modifications used in the paper
- vendor/: vendored optimizer implementations kept close to upstream
- exp/: published experiment trees (config.toml, report.json)
- notebooks/: cleaned artifact notebook
- assets/: generated figure/table assets used by the paper and this README
- tools/prepare_tabred.py: helper for preparing the TabReD datasets
Experiment layout
Each experiment directory stores the following artifacts:
exp/<model>/<optimizer>/<...>/<dataset>/
    tuning/
        config.toml
        report.json
    evaluation/
        config.toml
        report.json
The files are:
- config.toml: experiment configuration
- report.json: machine-readable results
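Because every run stores its results in a report.json file, the published results can be gathered with a few lines of standard-library Python. The sketch below is one possible way to do that; `collect_reports` is a hypothetical helper rather than part of lib/, and it makes no assumption about the keys inside each report.

```python
import json
from pathlib import Path


def collect_reports(root):
    """Map each directory under `root` that has a report.json to its parsed content.

    Keys are directory paths relative to `root`, e.g. "churn/evaluation".
    """
    root = Path(root)
    reports = {}
    for report_path in sorted(root.rglob("report.json")):
        key = report_path.parent.relative_to(root).as_posix()
        with report_path.open() as f:
            reports[key] = json.load(f)
    return reports
```

For example, `collect_reports("exp/mlp/muon")` would return one entry per tuning or evaluation directory found under that subtree.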
Environment setup
We use uv for managing dependencies. Install uv and on first setup run:
uv sync --managed-python
Data setup
License terms remain those of the original datasets.
Step 1: default datasets
Download and unpack the non-TabReD datasets:
mkdir -p local data
wget https://huggingface.co/datasets/rototoHF/tabm-data/resolve/main/data.tar -O local/tabm-data.tar.gz
tar -xvf local/tabm-data.tar.gz -C data
Step 2: TabReD datasets
Download the TabReD benchmark to local/tabred, then prepare it:
uv run tools/prepare_tabred.py local/tabred data --force
Examples: How to Reproduce Experiments
To illustrate the experiment workflow, here is a concrete example of how to reproduce the MLP Muon results on the churn dataset. It assumes that you have uv installed and have downloaded the datasets into the data directory in the project root.
First, copy and reset the experiment. An experiment is either a tuning or an evaluation directory containing a config.toml file. So, to reproduce tuning followed by evaluation, copy the tuning directory (with its config.toml) of the run you want to reproduce. For example:
# can be run from shell with uv run python -c '... (the exact syntax depends on your shell)
import lib.experiment
dst = "local/exp/reproduce-mlp-muon/churn/tuning"
lib.experiment.copy("exp/mlp/muon/churn/tuning", dst)
lib.experiment.reset(dst)
Then you can use the bin/go.py utility to launch the experiments.
To launch both tuning and evaluation on a newly created experiment, run:
# important to explicitly set the GPU you are using
export CUDA_VISIBLE_DEVICES=0
uv run bin/go.py local/exp/reproduce-mlp-muon/churn/tuning --resume --n-seeds 10
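If several copied experiments need to be launched one after another, the same command can be assembled from Python. This is only a sketch: `go_command` and `gpu_env` are hypothetical helpers, and the only flags used are the ones shown in the command above.

```python
import os


def go_command(exp_dir, n_seeds=10):
    """Build the argument list for launching tuning + evaluation on one experiment."""
    return [
        "uv", "run", "bin/go.py", str(exp_dir),
        "--resume", "--n-seeds", str(n_seeds),
    ]


def gpu_env(gpu_index=0):
    """Copy the current environment and pin the run to a single GPU."""
    return {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_index)}


command = go_command("local/exp/reproduce-mlp-muon/churn/tuning")
print(" ".join(command))
```

The resulting list can then be passed to `subprocess.run(command, env=gpu_env(0), check=True)` from the project root, once per experiment directory.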
You can also launch tuning and evaluation separately with respective tuning/config.toml and evaluation/config.toml files with:
uv run bin/tune.py path/to/tuning # or...
# uv run bin/evaluate.py path/to/evaluation
The same goes for individual runs via uv run bin/ffn.py path/to/exp, but the relevant config has to be created first. For example:
# can be run in shell with uv run python -c
import lib.experiment
src = "exp/mlp/adamw-ema/tabred/delivery-eta/evaluation"
evaluation_config = lib.experiment.load_config(src)
config = {'seed': 0, **evaluation_config['base_config']}
lib.experiment.create('local/exp/adamw-ema-delivery-eta-single-seed', config=config, parents=True)
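One detail of the dict construction above: in a Python dict literal, later entries override earlier ones. So if the base config happened to carry its own seed entry (an assumption; it may not), that value would win over the leading 'seed': 0. The sketch below shows the difference on a made-up config; `single_seed_config` and the keys in `base_config` are purely illustrative.

```python
def single_seed_config(base_config, seed):
    """Return a copy of base_config in which the seed is forced to `seed`."""
    return {**base_config, "seed": seed}


# Made-up stand-in for evaluation_config['base_config'].
base_config = {"seed": 7, "lr": 0.1}

seed_first = {"seed": 0, **base_config}           # base_config's seed wins
seed_last = single_seed_config(base_config, 0)    # the explicit seed wins

print(seed_first["seed"], seed_last["seed"])
```

Putting the seed last makes the override explicit regardless of what the base config contains, and the original dict is left unchanged.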
Reproducing Paper Artifacts
To generate the artifacts and play around with their generation, use the following notebook. It runs top to bottom and generates all tables and figures used in the paper from the report.json files in this repo:
notebooks/make_artifacts.ipynb
The generated artifacts are stored in assets/.
