Normalization cheat sheet
May 31, 2026
A quick reference for the normalization layers in neural/neuralnetwork.pas. The
goal is to make it easy to pick the right layer: which axes it reduces over,
whether it has learnable parameters, what it computes, and when to reach for
it.
A few conventions used below:
- Sample means one item in the batch. neural-api processes one sample at a
  time, so "per-sample" statistics are computed from a single volume of shape
  SizeX * SizeY * Depth; there are no batch statistics anywhere in this list
  (this is the main difference from textbook BatchNorm).
- Depth is the channel axis; (X, Y) are the spatial axes. gamma = learnable
  scale, beta = learnable bias, alpha = learnable scalar.
Summary table
| Layer (constructor) | Reduces over | Learnable params | Formula (per element) | Pick it when |
|---|---|---|---|---|
| TNNetLayerNorm.Create() | whole sample (X, Y and Depth) | gamma + beta, per-element (one per X * Y * Depth) | y = gamma * (x - mean) / sqrt(var + eps) + beta | Transformers / RNNs; the general-purpose batch-independent norm. |
| TNNetRMSNorm.Create() | whole sample (X, Y and Depth) | gamma only, per-element | y = gamma * x / sqrt(mean(x^2) + eps) | Cheaper LayerNorm for transformers; skip mean-centering. |
| TNNetRMSNormGated.Create() | whole sample (X, Y and Depth) for the RMS; gate is per-channel | gate logit g[c] only, per-channel (init 0) | y = (x / sqrt(mean(x^2) + eps)) * sigmoid(g[c]) | RMSNorm with a learnable per-channel on/off gate instead of gamma. |
| TNNetSwitchableNorm.Create() | whole sample (X, Y and Depth) | two scalar mixing logits only (init 0) | y = a_ln*L + a_rms*R, (a_ln,a_rms)=softmax(w_ln,w_rms), L=(x-mean)/sqrt(var+eps), R=x/sqrt(mean(x^2)+eps) | Let the net learn the LayerNorm vs RMSNorm mix; no per-element affine. |
| TNNetZScore.Create() | whole sample (X, Y and Depth) | none | y = (x - mean) / sqrt(var + eps) | LayerNorm's normalization without the affine; fixed standardization. |
| TNNetGroupNorm.Create(Groups) | each contiguous channel group, over (X, Y + channels-in-group) | gamma + beta, per-element | y = gamma * (x - mean_g) / sqrt(var_g + eps) + beta | Vision / small batches where BatchNorm is unstable. |
| TNNetInstanceNorm.Create() | each single channel, over (X, Y) | gamma + beta, per-element | GroupNorm with Groups = Depth | Style transfer / generative vision; per-channel contrast. |
| TNNetMovingStdNormalization.Create() | whole sample (running mean/std) | 2 trainable scalars (shift, scale) | y = (x - shift) / std | Drop-in BatchNorm-ish standardization for the whole tensor. |
| TNNetChannelStdNormalization.Create() | per channel | per-channel scale (+ inherited zero-center) | per-channel zero-center then * scale[c] | Per-channel std normalization (the repo's "channel std norm"). |
| TNNetPixelNorm.Create() | per (X, Y) pixel, over Depth | none | y = x / sqrt(mean_depth(x^2) + eps) | StyleGAN-style per-pixel feature-vector norm; GAN generators. |
| TNNetL2Normalize.Create([axis][,eps]) | axis 0 (default): per (X, Y) over Depth; axis 1: whole sample; axis 2: per-channel over (X, Y) | none | y = x / sqrt(sum(x^2) + eps) | Unit-length feature vectors (embeddings, cosine similarity). |
| TNNetUnitNorm.Create() | whole sample (flattened) | none | y = x / sqrt(sum_all(x^2) + eps) | Full-volume unit-L2 (Keras "UnitNorm"); alias of the line above. |
| TNNetMinMaxNorm.Create([eps][,perChannel]) | full-volume (default): whole sample (X, Y and Depth); perChannel: per channel over (X, Y) | none | y = (x - min) / (max - min + eps) | Rescale to ~[0, 1], globally or independently per channel. |
| TNNetGRN.Create() | per channel L2 over (X, Y), then across channels | gamma + beta, per-channel (both init 0) | y = gamma[c] * (x * Nx[c]) + beta[c] + x | ConvNeXt-V2 blocks; channel-competition contrast norm. |
| TNNetDyT.Create() | nothing (no statistics) | gamma + beta per channel, single alpha | y = gamma[c] * tanh(alpha * x) + beta[c] | Normalization-FREE drop-in LayerNorm replacement. |
| TNNetLogitNormalize.Create([tau][,eps]) | per (X, Y) over Depth | none | y = x / (tau * sqrt(sum_depth(x^2)) + eps) | Pre-softmax logit regularizer for calibration / OOD. |
Per-layer notes
TNNetLayerNorm
Constructor: Create(). eps = 1e-5 (fixed). Normalizes each sample over all
its elements (SizeX*SizeY*Depth) to zero mean / unit variance, then applies a
learnable per-element gamma (init 1) and beta (init 0). gamma and beta have
one weight per element of the volume, not per channel. The general-purpose,
batch-independent norm — first choice for transformers and recurrent models.
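The forward pass described above can be sketched in a few lines of Python. This is only an illustration of the formula, not the Pascal implementation: `layer_norm` is a hypothetical helper, a flat list stands in for the SizeX*SizeY*Depth volume, and population variance (divide by n) is an assumption.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one flattened sample, then apply per-element gamma and beta."""
    n = len(x)
    mean = sum(x) / n
    # Population variance (divide by n); an assumption of this sketch.
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

# One sample with 2 * 2 * 2 = 8 elements; gamma = 1 and beta = 0 as at init.
sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
out = layer_norm(sample, [1.0] * 8, [0.0] * 8)
out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
```

With the initial gamma and beta, `out_mean` is close to 0 and `out_var` is close to 1 (slightly below, because of eps).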
TNNetRMSNorm
Constructor: Create(). eps = 1e-5 (fixed). Like LayerNorm but divides by the
root-mean-square of the elements without subtracting the mean, then applies a
learnable per-element gamma (init 1). No beta. Cheaper than LayerNorm and a
common choice in modern transformer stacks.
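As a rough Python illustration of the difference from LayerNorm (no mean subtraction, no beta), assuming a flat list stands in for the volume and `rms_norm` is a hypothetical helper name:

```python
import math

def rms_norm(x, gamma, eps=1e-5):
    """Divide by the root-mean-square of the sample, then scale per element."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * inv_rms for v, g in zip(x, gamma)]

sample = [3.0, -4.0, 0.0, 5.0]
out = rms_norm(sample, [1.0] * 4)  # gamma = 1 as at init
out_mean_sq = sum(v * v for v in out) / len(out)
```

The output has a mean square close to 1, but its mean is not forced to zero, which is the saving over LayerNorm.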
TNNetRMSNormGated
Constructor: Create(). eps = 1e-5 (fixed), and the RMS is taken over the
whole sample with no mean subtraction, exactly like TNNetRMSNorm. The
difference is the affine: instead of a per-element gamma, it applies a learnable
per-channel sigmoid gate y[x,y,c] = n[x,y,c] * sigmoid(g[c]), where
n = x / sqrt(mean(x^2) + eps) and there is one gate logit g[c] per Depth
channel (the TNNetGatedResidual storage pattern: FNeurons[0].Weights holds
the Depth logits). The logits are initialised to 0, so at init every gate is
sigmoid(0) = 0.5 and the layer simply halves the normalized activation; channels
then learn to open (→1) or close (→0) independently. The backward pass routes the
input error through both the per-channel scale sigmoid(g[c]) and the shared
invRMS Jacobian (the RMS term couples all elements of the sample), reusing
RMSNorm's exact normalization Jacobian. Pick it when you want RMSNorm but with a
cheap, learnable per-channel gating instead of a full per-element scale.
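A minimal forward-only Python sketch of the gating, under stated assumptions: the volume is laid out as `pixels[p][c]` (one row per (X, Y) position, one column per channel), and `rms_norm_gated` is a hypothetical helper, not the repo's code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rms_norm_gated(pixels, gate_logits, eps=1e-5):
    """pixels[p][c]: RMS over the whole sample, sigmoid gate per channel."""
    count = sum(len(p) for p in pixels)
    mean_sq = sum(v * v for p in pixels for v in p) / count
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)   # shared by every element
    gates = [sigmoid(g) for g in gate_logits]  # one gate per Depth channel
    return [[v * inv_rms * gates[c] for c, v in enumerate(p)] for p in pixels]

pixels = [[1.0, -2.0], [3.0, 4.0]]
at_init = rms_norm_gated(pixels, [0.0, 0.0])    # every gate is 0.5
opened = rms_norm_gated(pixels, [10.0, -10.0])  # channel 0 open, channel 1 closed
```

With zero logits the output is exactly half the normalized activation; large positive or negative logits push a channel toward fully open or fully closed.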
TNNetSwitchableNorm
Constructor: Create(). eps = 1e-5 (fixed). It computes both a
LayerNorm-style normalization L = (x - mean) / sqrt(var + eps) and an
RMSNorm-style normalization R = x / sqrt(mean(x^2) + eps) over the whole
sample (matching TNNetLayerNorm / TNNetRMSNorm), then returns a learnable
softmax-weighted convex combination
y[x,y,d] = a_ln * L[x,y,d] + a_rms * R[x,y,d], where
(a_ln, a_rms) = softmax(w_ln, w_rms) so a_ln + a_rms = 1 and both are
non-negative. The only learnable parameters are the two scalar mixing logits
w_ln, w_rms (stored in FNeurons[0].Weights, exactly two values) — there is
no per-element gamma/beta. Both logits are initialised to 0, so at
init the softmax is 0.5/0.5 and an untrained layer is an exact 50/50 blend of
LayerNorm and RMSNorm. The backward pass feeds a_ln*OutputError through the
LayerNorm input Jacobian and a_rms*OutputError through the RMSNorm input
Jacobian and sums them; the logit gradients are
dL/da_ln = sum_all(OutputError·L), dL/da_rms = sum_all(OutputError·R), pushed
through the 2-logit softmax Jacobian d a_i/d w_j = a_i*(delta_ij - a_j). Pick
it when you do not want to commit to LayerNorm or RMSNorm up front and would
rather let training interpolate between them.
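The mixing and the logit gradient can be illustrated with a short Python sketch. It assumes a flat list for the volume and population variance; `switchable_norm` and `logit_grads` are hypothetical helper names, and the input Jacobians of the two branches are left out.

```python
import math

def switchable_norm(x, w_ln, w_rms, eps=1e-5):
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n   # population variance (assumed)
    mean_sq = sum(v * v for v in x) / n
    ln = [(v - mean) / math.sqrt(var + eps) for v in x]
    rms = [v / math.sqrt(mean_sq + eps) for v in x]
    # Two-way softmax over the mixing logits (max subtracted for stability).
    m = max(w_ln, w_rms)
    e_ln, e_rms = math.exp(w_ln - m), math.exp(w_rms - m)
    a_ln, a_rms = e_ln / (e_ln + e_rms), e_rms / (e_ln + e_rms)
    y = [a_ln * l + a_rms * r for l, r in zip(ln, rms)]
    return y, (a_ln, a_rms), ln, rms

def logit_grads(out_err, ln, rms, a_ln, a_rms):
    """Push dL/da through the softmax Jacobian a_i * (delta_ij - a_j)."""
    d_a = (sum(e * l for e, l in zip(out_err, ln)),
           sum(e * r for e, r in zip(out_err, rms)))
    a = (a_ln, a_rms)
    return tuple(
        sum(d_a[i] * a[i] * ((1.0 if i == j else 0.0) - a[j]) for i in range(2))
        for j in range(2)
    )

x = [1.0, 2.0, 3.0, 6.0]
y, (a_ln, a_rms), ln, rms = switchable_norm(x, 0.0, 0.0)  # logits at init
grads = logit_grads([1.0, -1.0, 0.5, 2.0], ln, rms, a_ln, a_rms)
```

At init the weights are 0.5/0.5, and the two logit gradients always sum to zero because the softmax is invariant to shifting both logits by the same amount.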
TNNetZScore
Constructor: Create(). The unparameterised core of LayerNorm:
y = (x - mean) / sqrt(var + eps) over the whole sample, with no learnable
gamma/beta. Use it when you want fixed standardization without an affine.
TNNetGroupNorm
Constructor: Create(Groups: integer). eps = 1e-5 (fixed). Splits Depth into
Groups contiguous channel groups and normalizes each group independently over
(X, Y and the channels in the group), then applies a learnable per-element gamma (init 1)
and beta (init 0) over the full volume. Repo behavior note: if Depth is
not divisible by Groups, it silently falls back to a single group
(equivalent to a whole-sample LayerNorm, with no group split) rather than erroring.
Good for vision tasks and small-batch regimes where BatchNorm is noisy.
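The grouping and the fallback can be sketched in Python as follows. This is an illustration under assumptions, not the repo's code: the volume is laid out as `pixels[p][c]`, gamma and beta are left at their initial values (1 and 0) and therefore omitted, variance is the population variance, and `group_norm` is a hypothetical helper.

```python
import math

def group_norm(pixels, groups, eps=1e-5):
    """pixels[p][c]: normalize each contiguous channel group independently."""
    depth = len(pixels[0])
    if depth % groups != 0:
        groups = 1  # silent fallback to a single group, as noted above
    per_group = depth // groups
    out = [[0.0] * depth for _ in pixels]
    for g in range(groups):
        chans = range(g * per_group, (g + 1) * per_group)
        vals = [p[c] for p in pixels for c in chans]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        inv_std = 1.0 / math.sqrt(var + eps)
        for i, p in enumerate(pixels):
            for c in chans:
                out[i][c] = (p[c] - mean) * inv_std
    return out

pixels = [[1.0, 2.0, 10.0, 20.0], [3.0, 4.0, 30.0, 40.0]]
two = group_norm(pixels, 2)       # channels {0,1} and {2,3}
per = group_norm(pixels, 4)       # one channel per group (InstanceNorm-like)
fallback = group_norm(pixels, 3)  # 4 is not divisible by 3
whole = group_norm(pixels, 1)
```

Here `fallback` equals `whole`, and `per` shows the Groups = Depth case where every channel is normalized over its spatial positions only.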
TNNetInstanceNorm
Constructor: Create(). A TNNetGroupNorm with Groups = Depth — one channel
per group — resolved from the input depth at SetPrevLayer time. Each channel is
normalized independently over its spatial (X, Y) extent. Same learnable
per-element gamma/beta as GroupNorm. Typical in style transfer and generative
vision models.
TNNetMovingStdNormalization
Constructor: Create(). The repo's batch-norm-style "moving" standardization:
subtracts a learned shift and divides by a learned standard-deviation scalar over
the whole tensor. It carries 2 trainable scalars (shift and std). Repo
behavior note: the std update is deliberately damped (≈100x slower than the
zero-centering term, see GetMaxAbsoluteDelta returning * 0.01) to avoid
overflow spikes; the std divisor is only applied when std > 0 and std <> 1.
Use it as a possible drop-in replacement for batch normalization on a whole
tensor (also reachable via TNNet.AddMovingNorm).
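A forward-only Python sketch of the guard on the divisor, assuming a flat list for the tensor; `moving_std_forward` is a hypothetical helper, and the damped parameter updates are not modelled here.

```python
def moving_std_forward(x, shift, std):
    """Subtract the learned shift; divide by std only when std > 0 and std != 1."""
    centered = [v - shift for v in x]
    if std > 0 and std != 1:
        return [v / std for v in centered]
    return centered

scaled = moving_std_forward([2.0, 4.0], 1.0, 2.0)
skipped = moving_std_forward([2.0, 4.0], 1.0, 0.0)  # divisor not applied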
TNNetChannelStdNormalization
Constructor: Create(). This is the repo's per-channel std normalization
(descends from TNNetChannelZeroCenter): it zero-centers per channel and then
multiplies each channel by a trainable per-channel scale (one weight per Depth
channel, init 1). Repo behavior note: the std-deviation learning is again
heavily damped (-FLearningRate*0.01 / channelSize) and, on the backward pass,
the channel-error scaling is clamped with SetMin(1) because "the direction of
the error is more important than its magnitude." Reachable per channel via
TNNet.AddChannelMovingNorm.
TNNetPixelNorm
Constructor: Create(). eps = 1e-8 (fixed). StyleGAN-style per-pixel
feature-vector normalization: for each (X, Y) position the Depth-dimensional
vector is divided by its root-mean-square over the depth axis, giving each
pixel a unit-RMS feature vector. Parameter-free. Common in GAN generators.
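In Python the per-pixel reduction looks roughly like this, assuming the volume is laid out as `pixels[p][c]` and `pixel_norm` is a hypothetical helper name:

```python
import math

def pixel_norm(pixels, eps=1e-8):
    """pixels[p][c]: divide each pixel's depth vector by its own RMS."""
    out = []
    for p in pixels:
        mean_sq = sum(v * v for v in p) / len(p)
        inv_rms = 1.0 / math.sqrt(mean_sq + eps)
        out.append([v * inv_rms for v in p])
    return out

# Two pixels whose feature vectors differ only in scale.
pixels = [[3.0, 4.0], [0.3, 0.4]]
out = pixel_norm(pixels)
```

Both pixels map to (almost) the same unit-RMS vector, since only the direction of the feature vector survives.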
TNNetL2Normalize
Constructors: Create(), Create(eps), Create(axis), Create(axis, eps).
eps default 1e-8. Selectable reduction scope stored in FStruct[0]:
- axis = 0 (default, and what bare Create() / Create(eps) give): per spatial position (X, Y), normalize the depth vector to unit L2 norm. This preserves the historical behavior.
- axis = 1: reduce sum-of-squares over the entire flattened sample so the whole volume has unit L2 norm.
- axis = 2: per depth channel, reduce sum-of-squares over the spatial positions (X, Y) so each channel's feature map is independently scaled to unit L2 norm (n_d = sqrt(sum_{x,y} x[x,y,d]^2 + eps)). This is the transpose of axis = 0: per channel over (X, Y) rather than per (X, Y) over depth.
No learnable parameters; the exact (I - y y^T)/n Jacobian is applied on the
backward pass over the chosen scope. Use for unit-length embeddings / cosine
similarity. Note: this is a true L2 unit-norm, not a mean/variance
standardization.
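The three reduction scopes can be compared side by side in a short Python sketch (forward pass only). The `pixels[p][c]` layout and the `l2_normalize` helper are assumptions of the sketch, not the repo's API.

```python
import math

def l2_normalize(pixels, axis=0, eps=1e-8):
    """pixels[p][c]: unit L2 norm over the scope selected by axis."""
    depth = len(pixels[0])
    if axis == 0:  # per (X, Y) position, over Depth
        norms = [math.sqrt(sum(v * v for v in p) + eps) for p in pixels]
        return [[v / norms[i] for v in p] for i, p in enumerate(pixels)]
    if axis == 1:  # whole flattened sample
        norm = math.sqrt(sum(v * v for p in pixels for v in p) + eps)
        return [[v / norm for v in p] for p in pixels]
    if axis == 2:  # per channel, over (X, Y)
        norms = [math.sqrt(sum(p[c] ** 2 for p in pixels) + eps)
                 for c in range(depth)]
        return [[v / norms[c] for c, v in enumerate(p)] for p in pixels]
    raise ValueError("axis must be 0, 1 or 2")

pixels = [[3.0, 4.0], [6.0, 8.0]]
per_pixel = l2_normalize(pixels, axis=0)
whole = l2_normalize(pixels, axis=1)
per_channel = l2_normalize(pixels, axis=2)
```

With axis = 0 both pixels become roughly (0.6, 0.8); with axis = 1 the squared values of the whole volume sum to about 1; with axis = 2 each channel column does.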
TNNetUnitNorm
Constructor: Create(). A thin subclass of TNNetL2Normalize whose default
constructor selects the full-volume scope (axis = 1, eps 1e-8) — i.e. it
is the Keras "UnitNorm" name for full-volume L2 normalization. Serializes under
its own class name. Behaviorally identical to TNNetL2Normalize.Create(1).
TNNetMinMaxNorm
Constructors: Create(), Create(eps). eps default 1e-7. Rescales the whole
sample by its own global min/max — reduced over all positions (X, Y and
Depth) — to approximately [0, 1]: y = (x - m) / ((M - m) + eps). No learnable
parameters. Repo behavior note: the backward pass is a true subgradient that
routes a bulk 1/denom term to every element plus exact coupling corrections at
the argmin and argmax indices (held fixed); for a constant volume eps keeps it
finite and a single index absorbs both corrections.
A third constructor Create(eps, perChannel) selects the reduction scope. With
perChannel = false (the default for Create()/Create(eps)) the min/max are
reduced over the whole sample as above. With perChannel = true the min/max are
reduced over the spatial positions (X, Y) only, independently per depth
channel, so each channel d gets its own (min_d, max_d) and is rescaled to
approximately [0, 1] on its own: y[x,y,d] = (x - m_d) / ((M_d - m_d) + eps).
The backward pass is the same subgradient structure scoped to each channel. The
mode is stored in FStruct[0] (0 = full-volume, 1 = per-channel) and round-trips
through Save/Load. This mirrors the per-channel vs full-volume split (axis = 2 vs axis = 1)
offered by TNNetL2Normalize.
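A forward-only Python sketch of the two scopes, assuming the volume is laid out as `pixels[p][c]`; `min_max_norm` and its `per_channel` argument are hypothetical names that mirror the constructor described above, and the subgradient backward pass is not shown.

```python
def min_max_norm(pixels, eps=1e-7, per_channel=False):
    """pixels[p][c]: rescale to roughly [0, 1] globally or per channel."""
    depth = len(pixels[0])
    if per_channel:
        lo = [min(p[c] for p in pixels) for c in range(depth)]
        hi = [max(p[c] for p in pixels) for c in range(depth)]
    else:
        flat = [v for p in pixels for v in p]
        lo = [min(flat)] * depth
        hi = [max(flat)] * depth
    return [[(v - lo[c]) / ((hi[c] - lo[c]) + eps) for c, v in enumerate(p)]
            for p in pixels]

pixels = [[0.0, 10.0], [2.0, 30.0], [4.0, 50.0]]
full = min_max_norm(pixels)                    # global min 0, max 50
chan = min_max_norm(pixels, per_channel=True)  # (0, 4) and (10, 50)
constant = min_max_norm([[7.0, 7.0]])          # eps keeps 0 / 0 finite
```

In full-volume mode the small-valued channel stays squeezed near 0, while per-channel mode stretches each channel over its own range.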
TNNetGRN
Constructor: Create(). eps = 1e-6 (fixed). Global Response Normalization
(ConvNeXt-V2, Woo et al. 2023). For each channel it computes an L2 response over
(X, Y), divides by the mean response across channels, then applies a
learnable per-channel gamma[c] and beta[c] plus a residual add of the
input:
Y[x,y,c] = gamma[c] * (X[x,y,c] * Nx[c]) + beta[c] + X[x,y,c] where
Nx[c] = Gx[c] / mean_c(Gx). gamma and beta init to 0, so the layer is the
identity at start. Use inside ConvNeXt-V2-style blocks for channel competition.
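The formula can be traced with a small Python sketch. Assumptions: the volume is laid out as `pixels[p][c]`, `grn` is a hypothetical helper, and eps is added to the mean response in the denominator (as in the ConvNeXt-V2 reference formulation); the note above does not say where the repo applies it.

```python
import math

def grn(pixels, gamma, beta, eps=1e-6):
    """pixels[p][c]: Global Response Normalization with a residual add."""
    depth = len(pixels[0])
    # Gx[c]: L2 response of channel c over all (X, Y) positions.
    gx = [math.sqrt(sum(p[c] ** 2 for p in pixels)) for c in range(depth)]
    mean_gx = sum(gx) / depth
    nx = [g / (mean_gx + eps) for g in gx]  # eps placement is an assumption
    return [[gamma[c] * (v * nx[c]) + beta[c] + v for c, v in enumerate(p)]
            for p in pixels]

pixels = [[3.0, 1.0], [4.0, 1.0]]
identity = grn(pixels, [0.0, 0.0], [0.0, 0.0])  # gamma = beta = 0 at init
scaled = grn(pixels, [1.0, 1.0], [0.0, 0.0])
```

With the initial parameters the output equals the input; with a non-zero gamma, the channel with the larger response is amplified more, which is the channel competition.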
TNNetDyT
Constructor: Create(). Dynamic Tanh (Liu et al. 2025) — a normalization-free
drop-in LayerNorm replacement that uses no batch or per-sample statistics:
Y[x,y,c] = gamma[c] * tanh(alpha * X[x,y,c]) + beta[c]. Learnable params: a
single layer-wide scalar alpha (init 1.0) plus per-channel gamma (init 1)
and beta (init 0). Pick it to drop LayerNorm's per-token statistics while
keeping a squashing + affine response.
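Because no statistics are involved, the forward pass is a pure element-wise map. A Python sketch, assuming the `pixels[p][c]` layout and a hypothetical `dyt` helper:

```python
import math

def dyt(pixels, alpha, gamma, beta):
    """pixels[p][c]: gamma[c] * tanh(alpha * x) + beta[c], no reductions."""
    return [[gamma[c] * math.tanh(alpha * v) + beta[c] for c, v in enumerate(p)]
            for p in pixels]

pixels = [[0.0, 0.5], [-2.0, 100.0]]
# alpha = 1, gamma = 1, beta = 0 as at init: the layer is plain tanh.
out = dyt(pixels, 1.0, [1.0, 1.0], [0.0, 0.0])
```

Outliers such as 100.0 saturate at 1 instead of being rescaled by a per-sample statistic.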
TNNetLogitNormalize
Constructors: Create(), Create(tau), Create(tau, eps). tau default 1.0,
eps default 1e-8. A pre-softmax regularizer (Wei et al. 2022) that divides the
depth-axis logit vector at each (X, Y) by a tau-scaled L2 norm:
y_i = x_i / (tau * sqrt(sum_j x_j^2 + safety) + eps). No learnable parameters.
Repo behavior note: with tau = 1 and eps = 0 it reduces exactly to
TNNetL2Normalize (axis 0). Improves calibration and OOD detection by bounding
logit magnitudes during training.
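A Python sketch of the formula above, assuming the `pixels[p][c]` layout. `logit_normalize` is a hypothetical helper, and the default for `safety` is a placeholder chosen for the sketch: the note names the term but does not give its value.

```python
import math

def logit_normalize(pixels, tau=1.0, eps=1e-8, safety=1e-8):
    """pixels[p][c]: divide each depth-axis logit vector by tau * its L2 norm."""
    out = []
    for p in pixels:
        denom = tau * math.sqrt(sum(v * v for v in p) + safety) + eps
        out.append([v / denom for v in p])
    return out

logits = [[3.0, 4.0]]
unit = logit_normalize(logits, tau=1.0, eps=0.0)  # plain unit-L2 vector
sharp = logit_normalize(logits, tau=0.5)          # output norm is about 1 / tau
sharp_norm = math.sqrt(sum(v * v for v in sharp[0]))
```

The output norm is bounded by roughly 1 / tau regardless of how large the raw logits grow.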