Normalization cheat sheet

May 31, 2026

A quick reference for the normalization layers in neural/neuralnetwork.pas. The goal is to make it easy to pick the right layer: which axes it reduces over, whether it has learnable parameters, what it computes, and when to reach for it.

A few conventions used below:

  • Sample means one item in the batch. neural-api processes one sample at a time, so "per-sample" statistics are computed from a single volume of shape SizeX * SizeY * Depth — there are no batch statistics anywhere in this list (this is the main difference from textbook BatchNorm).
  • Depth is the channel axis; (X, Y) are the spatial axes.
  • gamma = learnable scale, beta = learnable bias, alpha = learnable scalar.

Summary table

| Layer (constructor) | Reduces over | Learnable params | Formula (per element) | Pick it when |
| --- | --- | --- | --- | --- |
| TNNetLayerNorm.Create() | whole sample (X, Y and Depth) | gamma + beta, per-element (one per X × Y × Depth) | y = gamma * (x - mean) / sqrt(var + eps) + beta | Transformers / RNNs; the general-purpose batch-independent norm. |
| TNNetRMSNorm.Create() | whole sample (X, Y and Depth) | gamma only, per-element | y = gamma * x / sqrt(mean(x^2) + eps) | Cheaper LayerNorm for transformers; skip mean-centering. |
| TNNetRMSNormGated.Create() | whole sample (X, Y and Depth) for the RMS; gate is per-channel | gate logit g[c] only, per-channel (init 0) | y = (x / sqrt(mean(x^2) + eps)) * sigmoid(g[c]) | RMSNorm with a learnable per-channel on/off gate instead of gamma. |
| TNNetSwitchableNorm.Create() | whole sample (X, Y and Depth) | two scalar mixing logits only (init 0) | y = a_ln*L + a_rms*R, (a_ln,a_rms)=softmax(w_ln,w_rms), L=(x-mean)/sqrt(var+eps), R=x/sqrt(mean(x^2)+eps) | Let the net learn the LayerNorm vs RMSNorm mix; no per-element affine. |
| TNNetZScore.Create() | whole sample (X, Y and Depth) | none | y = (x - mean) / sqrt(var + eps) | LayerNorm's normalization without the affine; fixed standardization. |
| TNNetGroupNorm.Create(Groups) | each contiguous channel group, over (X, Y + channels-in-group) | gamma + beta, per-element | y = gamma * (x - mean_g) / sqrt(var_g + eps) + beta | Vision / small batches where BatchNorm is unstable. |
| TNNetInstanceNorm.Create() | each single channel, over (X, Y) | gamma + beta, per-element | GroupNorm with Groups = Depth | Style transfer / generative vision; per-channel contrast. |
| TNNetMovingStdNormalization.Create() | whole sample (running mean/std) | 2 trainable scalars (shift, scale) | y = (x - shift) / std | Drop-in BatchNorm-ish standardization for the whole tensor. |
| TNNetChannelStdNormalization.Create() | per channel | per-channel scale (+ inherited zero-center) | per-channel zero-center then * scale[c] | Per-channel std normalization (the repo's "channel std norm"). |
| TNNetPixelNorm.Create() | per (X, Y) pixel, over Depth | none | y = x / sqrt(mean_depth(x^2) + eps) | StyleGAN-style per-pixel feature-vector norm; GAN generators. |
| TNNetL2Normalize.Create([axis][,eps]) | axis 0 (default): per (X, Y) over Depth; axis 1: whole sample; axis 2: per-channel over (X, Y) | none | y = x / sqrt(sum(x^2) + eps) | Unit-length feature vectors (embeddings, cosine similarity). |
| TNNetUnitNorm.Create() | whole sample (flattened) | none | y = x / sqrt(sum_all(x^2) + eps) | Full-volume unit-L2 (Keras "UnitNorm"); alias of the line above. |
| TNNetMinMaxNorm.Create([eps][,perChannel]) | full-volume (default): whole sample (X, Y and Depth); perChannel: per channel over (X, Y) | none | y = (x - min) / (max - min + eps) | Rescale to ~[0, 1], globally or independently per channel. |
| TNNetGRN.Create() | per channel L2 over (X, Y), then across channels | gamma + beta, per-channel (both init 0) | y = gamma[c] * (x * Nx[c]) + beta[c] + x | ConvNeXt-V2 blocks; channel-competition contrast norm. |
| TNNetDyT.Create() | nothing (no statistics) | gamma + beta per channel, single alpha | y = gamma[c] * tanh(alpha * x) + beta[c] | Normalization-FREE drop-in LayerNorm replacement. |
| TNNetLogitNormalize.Create([tau][,eps]) | per (X, Y) over Depth | none | y = x / (tau * sqrt(sum_depth(x^2)) + eps) | Pre-softmax logit regularizer for calibration / OOD. |

Per-layer notes

TNNetLayerNorm

Constructor: Create(). eps = 1e-5 (fixed). Normalizes each sample over all its elements (SizeX*SizeY*Depth) to zero mean / unit variance, then applies a learnable per-element gamma (init 1) and beta (init 0). gamma and beta have one weight per element of the volume, not per channel. The general-purpose, batch-independent norm — first choice for transformers and recurrent models.
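The forward formula can be checked with a few lines of plain Python. This is only a sketch of the math, not the Pascal implementation: the function name `layer_norm`, the flattened-list representation of the volume, and the use of the population variance (dividing by N) are assumptions made for illustration.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Whole-sample LayerNorm on a flattened volume (hypothetical helper).

    x, gamma and beta all have one entry per element of the volume.
    Assumes the population variance (divide by N).
    """
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]
# gamma = 1 and beta = 0 are the init values described above.
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
# A non-trivial per-element affine shifts and scales the normalized values.
y_affine = layer_norm(x, gamma=[2.0] * 4, beta=[1.0] * 4)
```

With the init values the output has approximately zero mean and unit variance; the affine then moves it wherever training needs.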

TNNetRMSNorm

Constructor: Create(). eps = 1e-5 (fixed). Like LayerNorm but divides by the root-mean-square of the elements without subtracting the mean, then applies a learnable per-element gamma (init 1). No beta. Cheaper than LayerNorm and a common choice in modern transformer stacks.
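A minimal Python sketch of the same formula, assuming a flattened volume; `rms_norm` is a hypothetical name and this is not the Pascal code itself. Note that, unlike LayerNorm, the output mean is generally not zero.

```python
import math

def rms_norm(x, gamma, eps=1e-5):
    """Whole-sample RMSNorm on a flattened volume (hypothetical helper)."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # No mean subtraction and no beta: only a per-element scale.
    return [g * v * inv_rms for v, g in zip(x, gamma)]

y = rms_norm([3.0, 4.0], gamma=[1.0, 1.0])
```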

TNNetRMSNormGated

Constructor: Create(). eps = 1e-5 (fixed), and the RMS is taken over the whole sample with no mean subtraction, exactly like TNNetRMSNorm. The difference is the affine: instead of a per-element gamma, it applies a learnable per-channel sigmoid gate y[x,y,c] = n[x,y,c] * sigmoid(g[c]), where n = x / sqrt(mean(x^2) + eps) and there is one gate logit g[c] per Depth channel (the TNNetGatedResidual storage pattern: FNeurons[0].Weights holds the Depth logits). The logits are initialised to 0, so at init every gate is sigmoid(0) = 0.5 and the layer simply halves the normalized activation; channels then learn to open (→1) or close (→0) independently. The backward pass routes the input error through both the per-channel scale sigmoid(g[c]) and the shared invRMS Jacobian (the RMS term couples all elements of the sample), reusing RMSNorm's exact normalization Jacobian. Pick it when you want RMSNorm but with a cheap, learnable per-channel gating instead of a full per-element scale.
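The forward pass described above can be sketched in Python as follows. The layout `pixels[p][c]` (a list of spatial positions, each holding one value per channel) and the name `rms_norm_gated` are assumptions for illustration; the backward pass is not shown.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rms_norm_gated(pixels, gate_logits, eps=1e-5):
    """RMS over the whole sample, then one sigmoid gate per channel.

    pixels[p][c]: value at spatial position p, channel c (assumed layout).
    gate_logits[c]: one learnable logit per channel.
    """
    flat = [v for px in pixels for v in px]
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in flat) / len(flat) + eps)
    gates = [sigmoid(g) for g in gate_logits]
    return [[v * inv_rms * gates[c] for c, v in enumerate(px)] for px in pixels]

pixels = [[1.0, -2.0], [3.0, 4.0]]
# Logits at 0 (the init value): every gate is 0.5.
y_init = rms_norm_gated(pixels, [0.0, 0.0])
# A large negative logit closes channel 1; a large positive one opens channel 0.
y_trained = rms_norm_gated(pixels, [20.0, -20.0])
```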

TNNetSwitchableNorm

Constructor: Create(). eps = 1e-5 (fixed). It computes both a LayerNorm-style normalization L = (x - mean) / sqrt(var + eps) and an RMSNorm-style normalization R = x / sqrt(mean(x^2) + eps) over the whole sample (matching TNNetLayerNorm / TNNetRMSNorm), then returns a learnable softmax-weighted convex combination y[x,y,d] = a_ln * L[x,y,d] + a_rms * R[x,y,d], where (a_ln, a_rms) = softmax(w_ln, w_rms) so a_ln + a_rms = 1 and both are non-negative. The only learnable parameters are the two scalar mixing logits w_ln, w_rms (stored in FNeurons[0].Weights, exactly two values) — there is no per-element gamma/beta. Both logits are initialised to 0, so at init the softmax is 0.5/0.5 and an untrained layer is an exact 50/50 blend of LayerNorm and RMSNorm. The backward pass feeds a_ln*OutputError through the LayerNorm input Jacobian and a_rms*OutputError through the RMSNorm input Jacobian and sums them; the logit gradients are dL/da_ln = sum_all(OutputError·L), dL/da_rms = sum_all(OutputError·R), pushed through the 2-logit softmax Jacobian d a_i/d w_j = a_i*(delta_ij - a_j). Pick it when you do not want to commit to LayerNorm or RMSNorm up front and would rather let training interpolate between them.
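As an illustration of the forward mix only (the backward pass is omitted), here is a Python sketch. The name `switchable_norm`, the flattened-volume input, and the population variance are assumptions, not the repo's API.

```python
import math

def switchable_norm(x, w_ln, w_rms, eps=1e-5):
    """Softmax-weighted blend of LayerNorm-style and RMSNorm-style outputs."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    mean_sq = sum(v * v for v in x) / n
    # Two-way softmax, shifted by the max logit for numerical stability.
    m = max(w_ln, w_rms)
    e_ln, e_rms = math.exp(w_ln - m), math.exp(w_rms - m)
    a_ln, a_rms = e_ln / (e_ln + e_rms), e_rms / (e_ln + e_rms)
    ln_out = [(v - mean) / math.sqrt(var + eps) for v in x]
    rms_out = [v / math.sqrt(mean_sq + eps) for v in x]
    y = [a_ln * l + a_rms * r for l, r in zip(ln_out, rms_out)]
    return y, (a_ln, a_rms)

x = [1.0, 2.0, 3.0, 4.0]
y_init, weights_init = switchable_norm(x, 0.0, 0.0)   # 50/50 blend at init
y_ln, weights_ln = switchable_norm(x, 50.0, 0.0)      # almost pure LayerNorm
```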

TNNetZScore

Constructor: Create(). The unparameterised core of LayerNorm: y = (x - mean) / sqrt(var + eps) over the whole sample, with no learnable gamma/beta. Use it when you want fixed standardization without an affine.

TNNetGroupNorm

Constructor: Create(Groups: integer). eps = 1e-5 (fixed). Splits Depth into Groups contiguous channel groups and normalizes each group independently over its spatial extent (X, Y) together with the channels in that group, then applies a learnable per-element gamma (init 1) and beta (init 0) over the full volume. Repo behavior note: if Depth is not divisible by Groups, it silently falls back to a single group (normalizing over the whole sample, as TNNetLayerNorm does) rather than raising an error. Good for vision tasks and small-batch regimes where BatchNorm is noisy.
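The grouping and the fallback can be sketched in Python. This is an illustration under assumptions: `group_norm` is a hypothetical name, the volume is laid out as `pixels[p][c]`, the variance is the population variance, and the affine is left out (gamma = 1, beta = 0 at init make it the identity).

```python
import math

def group_norm(pixels, groups, eps=1e-5):
    """Normalize each contiguous channel group over space and its channels.

    pixels[p][c]: value at spatial position p, channel c (assumed layout).
    """
    depth = len(pixels[0])
    if depth % groups != 0:
        groups = 1  # fallback described above: a single group, no error
    size = depth // groups
    out = [[0.0] * depth for _ in pixels]
    for g in range(groups):
        chans = range(g * size, (g + 1) * size)
        vals = [px[c] for px in pixels for c in chans]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        inv_std = 1.0 / math.sqrt(var + eps)
        for p, px in enumerate(pixels):
            for c in chans:
                out[p][c] = (px[c] - mean) * inv_std
    return out

pixels = [[1.0, 10.0, 100.0, 1000.0], [3.0, 30.0, 300.0, 3000.0]]
two_groups = group_norm(pixels, 2)
per_channel = group_norm(pixels, 4)   # groups == depth: one channel per group
fallback = group_norm(pixels, 3)      # 4 is not divisible by 3
```

Setting `groups` equal to the depth gives the one-channel-per-group case, where each channel is normalized over (X, Y) on its own.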

TNNetInstanceNorm

Constructor: Create(). A TNNetGroupNorm with Groups = Depth — one channel per group — resolved from the input depth at SetPrevLayer time. Each channel is normalized independently over its spatial (X, Y) extent. Same learnable per-element gamma/beta as GroupNorm. Typical in style transfer and generative vision models.

TNNetMovingStdNormalization

Constructor: Create(). The repo's batch-norm-style "moving" standardization: subtracts a learned shift and divides by a learned standard-deviation scalar over the whole tensor. It carries 2 trainable scalars (shift and std). Repo behavior note: the std update is deliberately damped (≈100x slower than the zero-centering term, see GetMaxAbsoluteDelta returning * 0.01) to avoid overflow spikes; the std divisor is only applied when std > 0 and std <> 1. Use it as a possible drop-in replacement for batch normalization on a whole tensor (also reachable via TNNet.AddMovingNorm).

TNNetChannelStdNormalization

Constructor: Create(). This is the repo's per-channel std normalization (descends from TNNetChannelZeroCenter): it zero-centers per channel and then multiplies each channel by a trainable per-channel scale (one weight per Depth channel, init 1). Repo behavior note: the std-deviation learning is again heavily damped (-FLearningRate*0.01 / channelSize) and, on the backward pass, the channel-error scaling is clamped with SetMin(1) because "the direction of the error is more important than its magnitude." Reachable per channel via TNNet.AddChannelMovingNorm.

TNNetPixelNorm

Constructor: Create(). eps = 1e-8 (fixed). StyleGAN-style per-pixel feature-vector normalization: for each (X, Y) position the Depth-dimensional vector is divided by its root-mean-square over the depth axis, giving each pixel a unit-RMS feature vector. Parameter-free. Common in GAN generators.
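A short Python sketch of the per-pixel computation, assuming the layout `pixels[p][c]` and a hypothetical function name `pixel_norm`:

```python
import math

def pixel_norm(pixels, eps=1e-8):
    """Divide each pixel's depth vector by its RMS over the depth axis."""
    out = []
    for px in pixels:
        inv_rms = 1.0 / math.sqrt(sum(v * v for v in px) / len(px) + eps)
        out.append([v * inv_rms for v in px])
    return out

y = pixel_norm([[3.0, 4.0], [0.5, 0.5]])
```

Each pixel is handled independently, so a bright pixel and a dim one both end up with a feature vector of roughly unit RMS.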

TNNetL2Normalize

Constructors: Create(), Create(eps), Create(axis), Create(axis, eps). eps default 1e-8. Selectable reduction scope stored in FStruct[0]:

  • axis = 0 (default, and what bare Create() / Create(eps) give): per spatial position (X, Y), normalize the depth vector to unit L2 norm. This preserves the historical behavior.
  • axis = 1: reduce sum-of-squares over the entire flattened sample so the whole volume has unit L2 norm.
  • axis = 2: per depth channel, reduce sum-of-squares over the spatial positions (X, Y) so each channel's feature map is independently scaled to unit L2 norm (n_d = sqrt(sum_{x,y} x[x,y,d]^2 + eps)). This is the transpose of axis = 0, which reduces per (X, Y) over depth.

No learnable parameters; the exact (I - y y^T)/n Jacobian is applied on the backward pass over the chosen scope. Use for unit-length embeddings / cosine similarity. Note: this is a true L2 unit-norm, not a mean/variance standardization.
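The three scopes differ only in which values share a norm. A Python sketch of the forward pass, assuming the layout `pixels[p][c]`; `l2_normalize` is a hypothetical name and the Jacobian is not shown:

```python
import math

def l2_normalize(pixels, axis=0, eps=1e-8):
    """Unit-L2 scaling over the scope selected by axis (0, 1 or 2)."""
    depth = len(pixels[0])
    if axis == 0:  # per spatial position, over the depth vector
        norms = [math.sqrt(sum(v * v for v in px) + eps) for px in pixels]
        return [[v / n for v in px] for px, n in zip(pixels, norms)]
    if axis == 1:  # whole flattened sample
        n = math.sqrt(sum(v * v for px in pixels for v in px) + eps)
        return [[v / n for v in px] for px in pixels]
    if axis == 2:  # per channel, over the spatial positions
        norms = [math.sqrt(sum(px[c] ** 2 for px in pixels) + eps)
                 for c in range(depth)]
        return [[px[c] / norms[c] for c in range(depth)] for px in pixels]
    raise ValueError("axis must be 0, 1 or 2")

pixels = [[3.0, 4.0], [6.0, 8.0]]
per_pixel = l2_normalize(pixels, axis=0)
whole = l2_normalize(pixels, axis=1)
per_channel = l2_normalize(pixels, axis=2)
```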

TNNetUnitNorm

Constructor: Create(). A thin subclass of TNNetL2Normalize whose default constructor selects the full-volume scope (axis = 1, eps 1e-8) — i.e. it is the Keras "UnitNorm" name for full-volume L2 normalization. Serializes under its own class name. Behaviorally identical to TNNetL2Normalize.Create(1).

TNNetMinMaxNorm

Constructors: Create(), Create(eps). eps default 1e-7. Rescales the whole sample by its own global min/max — reduced over all positions (X, Y and Depth) — to approximately [0, 1]: y = (x - m) / ((M - m) + eps). No learnable parameters. Repo behavior note: the backward pass is a true subgradient that routes a bulk 1/denom term to every element plus exact coupling corrections at the argmin and argmax indices (held fixed); for a constant volume eps keeps it finite and a single index absorbs both corrections.

A third constructor Create(eps, perChannel) selects the reduction scope. With perChannel = false (the default for Create()/Create(eps)) the min/max are reduced over the whole sample as above. With perChannel = true the min/max are reduced over the spatial positions (X, Y) only, independently per depth channel, so each channel d gets its own (min_d, max_d) and is rescaled to approximately [0, 1] on its own: y[x,y,d] = (x - m_d) / ((M_d - m_d) + eps). The backward pass is the same subgradient structure scoped to each channel. The mode is stored in FStruct[0] (0 = full-volume, 1 = per-channel) and round-trips through Save/Load. This mirrors the per-channel (axis = 2) vs full-volume (axis = 1) split offered by TNNetL2Normalize.
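The two scopes can be compared side by side in a small Python sketch. The layout `pixels[p][c]` and the name `min_max_norm` are assumptions for illustration, and only the forward pass is shown.

```python
def min_max_norm(pixels, eps=1e-7, per_channel=False):
    """Rescale to roughly [0, 1] using global or per-channel min/max."""
    depth = len(pixels[0])
    if per_channel:
        lo = [min(px[c] for px in pixels) for c in range(depth)]
        hi = [max(px[c] for px in pixels) for c in range(depth)]
    else:
        flat = [v for px in pixels for v in px]
        lo = [min(flat)] * depth
        hi = [max(flat)] * depth
    return [[(px[c] - lo[c]) / ((hi[c] - lo[c]) + eps) for c in range(depth)]
            for px in pixels]

pixels = [[0.0, 10.0], [2.0, 30.0]]
global_scaled = min_max_norm(pixels)
channel_scaled = min_max_norm(pixels, per_channel=True)
# A constant volume stays finite because eps keeps the denominator non-zero.
constant = min_max_norm([[5.0, 5.0]])
```

In the global mode channel 0 stays squeezed near 0 because its values are small relative to channel 1; in per-channel mode both channels span the full range.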

TNNetGRN

Constructor: Create(). eps = 1e-6 (fixed). Global Response Normalization (ConvNeXt-V2, Woo et al. 2023). For each channel it computes an L2 response over (X, Y), divides by the mean response across channels, then applies a learnable per-channel gamma[c] and beta[c] plus a residual add of the input: Y[x,y,c] = gamma[c] * (X[x,y,c] * Nx[c]) + beta[c] + X[x,y,c] where Nx[c] = Gx[c] / mean_c(Gx). gamma and beta init to 0, so the layer is the identity at start. Use inside ConvNeXt-V2-style blocks for channel competition.
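A Python sketch of the forward formula follows. The layout `pixels[p][c]` and the name `grn` are hypothetical, and placing eps in the denominator of Nx (as in the ConvNeXt-V2 reference code) is an assumption, since the text above does not say where eps enters.

```python
import math

def grn(pixels, gamma, beta, eps=1e-6):
    """Global Response Normalization forward pass (illustrative sketch)."""
    depth = len(pixels[0])
    # Gx[c]: L2 response of channel c over all spatial positions.
    gx = [math.sqrt(sum(px[c] ** 2 for px in pixels)) for c in range(depth)]
    mean_gx = sum(gx) / depth
    # Nx[c]: response relative to the mean response (eps placement assumed).
    nx = [g / (mean_gx + eps) for g in gx]
    return [[gamma[c] * (px[c] * nx[c]) + beta[c] + px[c] for c in range(depth)]
            for px in pixels]

pixels = [[3.0, 0.0], [4.0, 0.0]]
identity = grn(pixels, gamma=[0.0, 0.0], beta=[0.0, 0.0])  # init values
boosted = grn(pixels, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With gamma and beta at 0 only the residual term survives, which is why the layer starts as the identity.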

TNNetDyT

Constructor: Create(). Dynamic Tanh (Liu et al. 2025) — a normalization-free drop-in LayerNorm replacement that uses no batch or per-sample statistics: Y[x,y,c] = gamma[c] * tanh(alpha * X[x,y,c]) + beta[c]. Learnable params: a single layer-wide scalar alpha (init 1.0) plus per-channel gamma (init 1) and beta (init 0). Pick it to drop LayerNorm's per-token statistics while keeping a squashing + affine response.
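Because no statistics are involved, the forward pass is purely element-wise. A Python sketch, assuming the layout `pixels[p][c]` and a hypothetical function name `dyt`:

```python
import math

def dyt(pixels, gamma, beta, alpha=1.0):
    """Dynamic Tanh: per-channel affine around tanh(alpha * x)."""
    return [[gamma[c] * math.tanh(alpha * v) + beta[c] for c, v in enumerate(px)]
            for px in pixels]

# Init values: alpha = 1, gamma = 1, beta = 0.
y = dyt([[0.0, 100.0]], gamma=[1.0, 1.0], beta=[0.0, 0.0])
# A non-trivial affine on a single-channel pixel.
y_affine = dyt([[0.0]], gamma=[2.0], beta=[0.5])
```

Large activations saturate towards ±gamma[c] + beta[c], which is the squashing that stands in for normalization.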

TNNetLogitNormalize

Constructors: Create(), Create(tau), Create(tau, eps). tau default 1.0, eps default 1e-8. A pre-softmax regularizer (Wei et al. 2022) that divides the depth-axis logit vector at each (X, Y) by a tau-scaled L2 norm: y_i = x_i / (tau * sqrt(sum_j x_j^2 + safety) + eps). No learnable parameters. Repo behavior note: with tau = 1 and eps = 0 it reduces exactly to TNNetL2Normalize (axis 0). Improves calibration and OOD detection by bounding logit magnitudes during training.
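A Python sketch of the simplified formula from the summary table, y = x / (tau * sqrt(sum_depth(x^2)) + eps). It deliberately leaves out the extra safety term that appears inside the square root above, so it is an approximation of the layer rather than a copy of it; `logit_normalize` and the `pixels[p][c]` layout are assumptions.

```python
import math

def logit_normalize(pixels, tau=1.0, eps=1e-8):
    """Divide each pixel's logit vector by tau times its L2 norm."""
    out = []
    for px in pixels:
        denom = tau * math.sqrt(sum(v * v for v in px)) + eps
        out.append([v / denom for v in px])
    return out

unit = logit_normalize([[3.0, 4.0]], tau=1.0)
half = logit_normalize([[3.0, 4.0]], tau=2.0)
```

The output L2 norm is about 1/tau, so a larger tau yields smaller logits and a softer softmax.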