Regularization

March 25, 2026

Overview & Motivation

Regularization adds a penalty term $\Omega(\theta)$ to the loss function that discourages overly complex models:

$$\mathcal{L}_{\text{total}} = \mathcal{L}(\hat{y}, y) + \lambda \, \Omega(\theta)$$

where $\lambda > 0$ controls the strength of the penalty. Without regularization, a neural network with enough parameters can memorize the training data perfectly yet generalize poorly to new inputs, a phenomenon called overfitting. Regularization biases the optimizer toward simpler solutions that tend to generalize better.

Mathematical Theory

L2 Regularization (Ridge / Weight Decay)

$$\Omega_{L2}(\theta) = \frac{1}{2}\sum_{i=1}^{P} \theta_i^2 = \frac{1}{2}\|\theta\|_2^2$$

$$\frac{\partial \Omega_{L2}}{\partial \theta_i} = \theta_i$$

Effect on update rule:

$$\theta_{t+1} = \theta_t - \eta(\nabla_\theta \mathcal{L} + \lambda \theta_t) = (1 - \eta\lambda)\theta_t - \eta \nabla_\theta \mathcal{L}$$

The factor $(1 - \eta\lambda)$, which lies in $(0, 1)$ whenever $0 < \eta\lambda < 1$, shrinks every weight toward zero at each step, hence the name weight decay. Because the penalty grows quadratically, large weights are penalized most heavily, which keeps individual weights small and the learned function smooth.
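The two forms of the update rule above can be checked numerically. The following is a minimal sketch for plain SGD only; the function names and the data-loss gradient values are hypothetical, chosen for illustration.

```python
def sgd_step_gradient_form(theta, grad, lr, lam):
    """theta - lr * (grad + lam * theta), element-wise."""
    return [t - lr * (g + lam * t) for t, g in zip(theta, grad)]


def sgd_step_decay_form(theta, grad, lr, lam):
    """(1 - lr * lam) * theta - lr * grad, element-wise."""
    decay = 1.0 - lr * lam
    return [decay * t - lr * g for t, g in zip(theta, grad)]


theta = [0.8, -0.3, 0.5]
grad = [0.2, -0.1, 0.0]  # hypothetical data-loss gradient, illustration only
lr, lam = 0.1, 0.1

gradient_form = sgd_step_gradient_form(theta, grad, lr, lam)
decay_form = sgd_step_decay_form(theta, grad, lr, lam)

# The two results agree up to floating-point rounding.
for g_val, d_val in zip(gradient_form, decay_form):
    print(f"{g_val:.6f}  {d_val:.6f}")
```

The equivalence relies on the plain SGD update. With adaptive optimizers the L2 gradient term and a direct multiplicative decay generally produce different updates.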

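L1 Regularization (Lasso)

$$\Omega_{L1}(\theta) = \sum_{i=1}^{P} |\theta_i| = \|\theta\|_1$$

$$\frac{\partial \Omega_{L1}}{\partial \theta_i} = \operatorname{sign}(\theta_i) \quad (\theta_i \neq 0)$$

Effect: the L1 gradient has constant magnitude regardless of the weight's size, so small weights are pushed toward zero just as hard as large ones. Combined with a rule that holds a weight at zero once it arrives (the sub-gradient convention $\operatorname{sign}(0) = 0$, or a proximal step), this drives small weights exactly to zero and produces a sparse model. This is useful for feature selection, since irrelevant connections are pruned automatically.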
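One common way to obtain exact zeros under an L1 penalty is a proximal (soft-thresholding) step. The sketch below is an illustration of that technique, not necessarily how this library implements L1; the function names are hypothetical and the data gradient is set to zero to isolate the penalty.

```python
def soft_threshold(x, t):
    """Proximal operator of t * |x|: move x toward zero by t, clamping at zero."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def proximal_l1_step(theta, grad, lr, lam):
    """Gradient step on the data loss, then soft-threshold by lr * lam."""
    return [soft_threshold(t - lr * g, lr * lam) for t, g in zip(theta, grad)]


theta = [0.8, -0.3, 0.005]
grad = [0.0, 0.0, 0.0]  # assumption: zero data gradient, to isolate the penalty
new_theta = proximal_l1_step(theta, grad, lr=0.1, lam=0.1)

# Weights larger than lr * lam = 0.01 shrink by 0.01;
# the 0.005 weight is smaller than the threshold and becomes exactly 0.0.
print(new_theta)
```

An L2 step would only scale the 0.005 weight by $(1 - \eta\lambda)$, leaving it small but nonzero.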

Comparison

| Property | L1 | L2 |
|---|---|---|
| Penalty shape | Diamond (corners at axes) | Sphere |
| Sparsity | Promotes exact zeros | Shrinks toward zero but rarely reaches it |
| Gradient at $\theta_i = 0$ | Undefined (sub-gradient) | Zero |
| Best for | Feature selection, sparse models | General-purpose weight control |

Geometric Interpretation

The regularized loss can be viewed as constrained optimization:

$$\min_\theta \mathcal{L}(\theta) \quad \text{subject to} \quad \Omega(\theta) \le c$$

where $c$ is determined by $\lambda$ (a larger $\lambda$ corresponds to a smaller $c$). L1 constrains $\theta$ to a diamond, so the optimal point tends to lie at a corner, where some coordinates are exactly zero (sparse). L2 constrains $\theta$ to a sphere, which has no corners, so the optimal point shrinks all dimensions without zeroing any of them.
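As a worked illustration of how $c$ follows from $\lambda$, take the quadratic loss $\mathcal{L}(\theta) = \frac{1}{2}\|\theta - a\|_2^2$ with the L2 penalty. The target vector $a$ is an assumption introduced only for this example.

$$
\theta^\star = \arg\min_\theta \; \tfrac{1}{2}\|\theta - a\|_2^2 + \tfrac{\lambda}{2}\|\theta\|_2^2 = \frac{a}{1 + \lambda},
\qquad
c = \Omega_{L2}(\theta^\star) = \frac{\|a\|_2^2}{2(1 + \lambda)^2}
$$

Solving the constrained problem with this value of $c$ returns the same $\theta^\star$, and increasing $\lambda$ shrinks $c$, which tightens the constraint region.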

Complexity Analysis

| Operation | Time | Space |
|---|---|---|
| $\Omega_{L2}(\theta)$ | $O(P)$ | $O(1)$ |
| $\nabla \Omega_{L2}$ | $O(P)$ | $O(P)$ |
| $\Omega_{L1}(\theta)$ | $O(P)$ | $O(1)$ |
| $\nabla \Omega_{L1}$ | $O(P)$ | $O(P)$ |

Regularization adds negligible computational cost — one pass over the parameter vector per training iteration.
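The single pass can be sketched as one loop that accumulates the penalty and adds the scaled penalty gradient to an existing gradient buffer. This is an illustrative sketch with hypothetical function names, not this library's API. Writing into the existing buffer is a design choice that avoids allocating a separate $O(P)$ gradient vector.

```python
def add_l2_regularization(theta, grad, lam):
    """Add lam * theta to grad in place; return lam * 0.5 * sum(theta_i ** 2)."""
    sum_sq = 0.0
    for i, t in enumerate(theta):
        sum_sq += t * t
        grad[i] += lam * t
    return lam * 0.5 * sum_sq


def add_l1_regularization(theta, grad, lam):
    """Add lam * sign(theta) to grad in place; return lam * sum(|theta_i|)."""
    sum_abs = 0.0
    for i, t in enumerate(theta):
        sum_abs += abs(t)
        grad[i] += lam * ((t > 0) - (t < 0))  # sign(t), with sign(0) = 0
    return lam * sum_abs


theta = [0.8, -0.3, 0.5]
grad = [0.0, 0.0, 0.0]  # stand-in for an already computed data-loss gradient
reg_loss = add_l2_regularization(theta, grad, lam=0.1)
print(reg_loss, grad)
```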

Step-by-Step Walkthrough

Scenario: 3 parameters, $\theta = [0.8, -0.3, 0.5]$, $\lambda = 0.1$.

L2 Regularization:

| Step | Computation | Result |
|---|---|---|
| Penalty | $\frac{1}{2}(0.64 + 0.09 + 0.25) = 0.49$ | $\Omega_{L2} = 0.49$ |
| Gradient | $\nabla\Omega_{L2} = \theta = [0.8, -0.3, 0.5]$ | Scaled by $\lambda$ and added to $\nabla\mathcal{L}$ |
| Contribution to loss | $0.1 \times 0.49 = 0.049$ | $\lambda\,\Omega_{L2} = 0.049$ |

L1 Regularization:

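| Step | Computation | Result |
|---|---|---|
| Penalty | $0.8 + 0.3 + 0.5 = 1.6$ | $\Omega_{L1} = 1.6$ |
| Gradient | $\nabla\Omega_{L1} = \operatorname{sign}(\theta) = [1, -1, 1]$ | Scaled by $\lambda$ and added to $\nabla\mathcal{L}$ |
| Contribution to loss | $0.1 \times 1.6 = 0.16$ | $\lambda\,\Omega_{L1} = 0.16$ |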
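The numbers in the two walkthrough tables can be reproduced with a few lines of Python. This is only a check of the arithmetic; the variable names are illustrative.

```python
theta = [0.8, -0.3, 0.5]
lam = 0.1

# L2: penalty, gradient, and contribution to the total loss
l2_penalty = 0.5 * sum(t * t for t in theta)
l2_gradient = list(theta)
l2_contribution = lam * l2_penalty

# L1: penalty, (sub-)gradient, and contribution to the total loss
l1_penalty = sum(abs(t) for t in theta)
l1_gradient = [(t > 0) - (t < 0) for t in theta]  # sign of each weight
l1_contribution = lam * l1_penalty

print(f"L2: penalty={l2_penalty:.3f} contribution={l2_contribution:.3f}")
print(f"L1: penalty={l1_penalty:.3f} contribution={l1_contribution:.3f}")
```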

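Repeated L1 updates ($\eta = 0.1$, $\lambda = 0.1$): ignoring the data gradient, each update moves every weight by $\eta\lambda = 0.01$ toward zero. The $\theta_2 = -0.3$ component has the smallest magnitude, so it reaches zero first, after about 30 updates, and stays there when $\operatorname{sign}(0) = 0$ or a proximal step is used. The model effectively prunes that connection.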
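A small simulation makes the number of updates concrete. It is a sketch under two assumptions not stated in the scenario: the data gradient is zero, and the update is a soft-thresholding (proximal) step. `Fraction` is used so the step counts are exact rather than subject to floating-point rounding.

```python
from fractions import Fraction


def soft_threshold(x, t):
    """Move x toward zero by t, clamping at zero."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0


def steps_until_zero(x, lr, lam, max_steps=1000):
    """Count proximal L1 updates until x is exactly zero (None if never)."""
    step = lr * lam
    for n in range(1, max_steps + 1):
        x = soft_threshold(x, step)
        if x == 0:
            return n
    return None


lr = lam = Fraction(1, 10)
theta = [Fraction(8, 10), Fraction(-3, 10), Fraction(5, 10)]
steps = [steps_until_zero(t, lr, lam) for t in theta]
print(steps)  # [80, 30, 50]
```

With a nonzero data gradient the counts change, and weights that the data loss keeps pulling away from zero may never be pruned.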

Pitfalls & Edge Cases

  • $\lambda$ too large. The model underfits: weights are driven so close to zero that the network cannot represent the target function. Cross-validate $\lambda$.
  • $\lambda$ too small. Negligible effect; overfitting persists.
  • L1 non-differentiability. At $\theta_i = 0$, the L1 gradient is undefined. Use the sub-gradient convention $\operatorname{sign}(0) = 0$, or proximal operators for exact handling.
  • Regularizing biases. Conventionally, bias parameters are excluded from regularization because they contribute little to model complexity. This library regularizes all parameters in the flat vector, so be aware of this if bias control matters.
  • Fixed-point precision. The regularization term can be much smaller than the main loss when $\lambda$ is small. In low-precision fixed-point, it may round to zero. Scale $\lambda$ or use a wider accumulator.
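The fixed-point pitfall can be illustrated with integer arithmetic. The Q15 format, the helper names, and the values $\lambda = 0.001$ and $\Omega = 0.01$ are assumptions chosen for this sketch; the source does not state which fixed-point format the library uses.

```python
Q15_ONE = 1 << 15  # 32768, the Q15 representation of 1.0


def to_q15(x):
    """Quantize a real value in [-1, 1) to a Q15 integer, rounding to nearest."""
    return int(round(x * Q15_ONE))


def q15_mul(a, b):
    """Multiply two Q15 values and truncate the product back to Q15."""
    return (a * b) >> 15


lam_q15 = to_q15(0.001)     # 33
penalty_q15 = to_q15(0.01)  # 328

narrow = q15_mul(lam_q15, penalty_q15)  # product truncated to Q15
wide = lam_q15 * penalty_q15            # full product kept in Q30

print(narrow)             # 0: the regularization term has vanished
print(wide / (1 << 30))   # roughly 1e-05: preserved by the wider accumulator
```

The true product, $10^{-5}$, is below the Q15 resolution of $2^{-15} \approx 3.05 \times 10^{-5}$, which is why the narrow result is zero.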

Variants & Generalizations

| Variant | Key Difference |
|---|---|
| Elastic Net | $\Omega = \alpha \lVert\theta\rVert_1 + (1-\alpha)\lVert\theta\rVert_2^2$; combines L1 sparsity with L2 smoothness |
| Dropout | Randomly zeroes activations during training; implicit ensemble regularization |
| Early stopping | Halts training before overfitting; regularization without modifying the loss |
| Data augmentation | Expands the training set with transformed copies; reduces overfitting by increasing data diversity |
| Spectral normalization | Constrains the spectral norm of weight matrices; stabilizes GAN training |
| Weight clipping | Hard constraint: $\lvert\theta_i\rvert \le c$; used in Wasserstein GANs |
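The Elastic Net row can be made concrete with a short sketch that follows the formula as written in the table. The function names are hypothetical, and the sub-gradient uses the convention $\operatorname{sign}(0) = 0$.

```python
def elastic_net_penalty(theta, alpha):
    """alpha * ||theta||_1 + (1 - alpha) * ||theta||_2^2, with alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l1 = sum(abs(t) for t in theta)
    l2_sq = sum(t * t for t in theta)
    return alpha * l1 + (1.0 - alpha) * l2_sq


def elastic_net_subgradient(theta, alpha):
    """Element-wise alpha * sign(theta_i) + 2 * (1 - alpha) * theta_i."""
    return [alpha * ((t > 0) - (t < 0)) + 2.0 * (1.0 - alpha) * t for t in theta]


theta = [0.8, -0.3, 0.5]
penalty = elastic_net_penalty(theta, alpha=0.5)
subgrad = elastic_net_subgradient(theta, alpha=0.5)
print(penalty, subgrad)
```

Setting `alpha=1.0` recovers the pure L1 penalty, and `alpha=0.0` gives $\|\theta\|_2^2$. The formula in the table omits the factor $\frac{1}{2}$ used in $\Omega_{L2}$, so the L2 part here is twice $\Omega_{L2}$.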

Applications

  • Preventing overfitting: the primary use case for any neural network trained on limited data (common in embedded scenarios).
  • Feature selection: L1 regularization identifies and prunes irrelevant input connections.
  • Model compression: sparse models (via L1) require less storage and computation for deployment on MCUs.
  • Transfer learning: an L2 penalty on the distance from the pre-trained weights keeps fine-tuned weights close to their pre-trained values (a plain L2 penalty would instead pull them toward zero).

Connections to Other Algorithms

graph TD
    Reg["Regularization"]
    Loss["Loss Functions"]
    Opt["Optimizer"]
    Model["Model"]
    LR["Linear Regression"]

    Reg -->|"λ Ω(θ) added to ℒ"| Loss
    Loss --> Opt
    Opt --> Model
    Reg -.->|"L2 + MSE = Ridge regression"| LR

| Component | Relationship |
|---|---|
| Loss Functions | Regularization is a penalty added to the loss: $\mathcal{L}_{\text{total}} = \mathcal{L} + \lambda\Omega$ |
| Optimizer | Receives the combined gradient $\nabla\mathcal{L} + \lambda\nabla\Omega$ |
| Linear Regression | L2-regularized MSE with a linear model is Ridge regression; L1 is Lasso |
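For the Linear Regression row, the one-parameter case has closed-form solutions that show the difference between Ridge and Lasso. The sketch assumes a model $y \approx w x$ with no intercept and the loss $\frac{1}{2}\sum_i (w x_i - y_i)^2$ (a sum rather than a mean, which only rescales $\lambda$). The function names and data are hypothetical.

```python
def fit_ridge_1d(xs, ys, lam):
    """Minimize 0.5 * sum((w*x - y)^2) + lam * 0.5 * w^2 over w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def fit_lasso_1d(xs, ys, lam):
    """Minimize 0.5 * sum((w*x - y)^2) + lam * |w| over w (soft-thresholding)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam, 0.0)
    return (shrunk if sxy >= 0 else -shrunk) / sxx


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated by w = 2 with no noise

for lam in (0.0, 7.0, 28.0):
    print(lam, fit_ridge_1d(xs, ys, lam), fit_lasso_1d(xs, ys, lam))
```

On this data Ridge shrinks $w$ smoothly but never reaches zero for any finite $\lambda$, while Lasso returns exactly zero once $\lambda \ge \sum_i x_i y_i = 28$.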

References & Further Reading

  • Goodfellow, I., Bengio, Y., and Courville, A., Deep Learning, MIT Press, 2016 — Chapter 7 (regularization).
  • Tibshirani, R., "Regression shrinkage and selection via the lasso", JRSS-B, 58(1), 1996.
  • Krogh, A. and Hertz, J.A., "A simple weight decay can improve generalization", NeurIPS, 1991.