Regularization
March 25, 2026
Overview & Motivation
Regularization adds a penalty term to the loss function that discourages overly complex models:
$$\mathcal{L}_{\text{reg}}(\theta) = \mathcal{L}(\theta) + \lambda\,\Omega(\theta)$$
where $\lambda \ge 0$ controls the strength of the penalty. Without regularization, a neural network with enough parameters can memorize the training data perfectly yet generalize poorly to new inputs, a phenomenon called overfitting. Regularization biases the optimizer toward simpler solutions that tend to generalize better.
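The penalized loss can be sketched in a few lines. This is a minimal illustration, not this library's API: the function names and the example values of `theta`, `data_loss`, and `lam` are hypothetical, and the two penalties used are the L2 penalty (half the sum of squares) and the L1 penalty (sum of absolute values).

```python
def l2_penalty(theta):
    """Omega_L2 = 0.5 * sum(theta_i ** 2)."""
    return 0.5 * sum(t * t for t in theta)


def l1_penalty(theta):
    """Omega_L1 = sum(|theta_i|)."""
    return sum(abs(t) for t in theta)


def regularized_loss(data_loss, theta, lam, penalty=l2_penalty):
    """Return data_loss + lam * penalty(theta)."""
    return data_loss + lam * penalty(theta)


theta = [0.8, -0.3, 0.5]  # illustrative weights (hypothetical values)
loss_l2 = regularized_loss(1.0, theta, lam=0.1)
loss_l1 = regularized_loss(1.0, theta, lam=0.1, penalty=l1_penalty)
print(loss_l2, loss_l1)
```

Setting `lam=0` recovers the unregularized loss, which makes it easy to compare training runs with and without the penalty.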
Mathematical Theory
L2 Regularization (Ridge / Weight Decay)
$$\Omega_{L2}(\theta) = \frac{1}{2}\sum_{i=1}^{P} \theta_i^2 = \frac{1}{2}\|\theta\|_2^2$$
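Effect on update rule (with learning rate $\eta$):
$$\theta \leftarrow \theta - \eta\left(\nabla_\theta \mathcal{L} + \lambda\theta\right) = (1 - \eta\lambda)\,\theta - \eta\,\nabla_\theta \mathcal{L}$$
The factor $(1 - \eta\lambda)$ shrinks weights toward zero each step, hence the name weight decay. Large weights are penalized quadratically, keeping the model smooth.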
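The weight-decay reading of the L2 update can be checked numerically. The sketch below assumes plain gradient descent with a learning rate `lr` (a symbol introduced here for illustration) and compares the "gradient of the penalized loss" form against the "multiply by `1 - lr * lam`, then step" form. The function names and numbers are hypothetical.

```python
def sgd_step_l2(theta, grad, lr, lam):
    """One step on L(theta) + lam * 0.5 * ||theta||^2.

    The penalty gradient is lam * theta, so the step is
    theta - lr * (grad + lam * theta).
    """
    return [t - lr * (g + lam * t) for t, g in zip(theta, grad)]


def sgd_step_decay(theta, grad, lr, lam):
    """The same step written as a multiplicative decay plus a gradient step."""
    decay = 1.0 - lr * lam
    return [decay * t - lr * g for t, g in zip(theta, grad)]


theta = [0.8, -0.3, 0.5]    # illustrative weights
grad = [0.2, -0.1, 0.05]    # illustrative data-loss gradient
step_a = sgd_step_l2(theta, grad, lr=0.1, lam=0.1)
step_b = sgd_step_decay(theta, grad, lr=0.1, lam=0.1)
print(step_a)
print(step_b)
```

The two forms agree up to floating-point rounding for plain gradient descent. For optimizers that rescale the gradient (adaptive methods, for example), the two forms are generally not equivalent.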
L1 Regularization (Lasso)
$$\Omega_{L1}(\theta) = \sum_{i=1}^{P} |\theta_i| = \|\theta\|_1$$
Effect: L1 drives small weights exactly to zero, producing a sparse model. This is useful for feature selection: irrelevant connections are pruned automatically.
Comparison
| Property | L1 | L2 |
|---|---|---|
| Penalty shape | Diamond (corners at axes) | Sphere |
| Sparsity | Promotes exact zeros | Shrinks toward zero but rarely reaches it |
| Gradient at $\theta_i = 0$ | Undefined (sub-gradient) | Zero |
| Best for | Feature selection, sparse models | General-purpose weight control |
Geometric Interpretation
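The regularized loss can be viewed as constrained optimization:
$$\min_\theta \mathcal{L}(\theta) \quad \text{subject to} \quad \Omega(\theta) \le c$$
where $c$ is determined by $\lambda$ (a larger $\lambda$ corresponds to a smaller $c$). L1 constrains $\theta$ to a diamond, so the optimal point tends to lie at a corner (sparse). L2 constrains $\theta$ to a sphere, so the optimal point balances all dimensions.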
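The corner effect can be seen in a case where both penalized problems have closed-form solutions. The sketch below assumes a separable quadratic loss $\frac{1}{2}\sum_i (\theta_i - a_i)^2$, chosen only for illustration: with the L2 penalty the minimizer is $a_i / (1 + \lambda)$, and with the L1 penalty it is the soft-thresholded value of $a_i$. The target vector `a` and the value of `lam` are hypothetical.

```python
def l2_solution(a, lam):
    """Minimizer of 0.5 * (t - a_i)^2 + lam * 0.5 * t^2, per coordinate."""
    return [ai / (1.0 + lam) for ai in a]


def l1_solution(a, lam):
    """Minimizer of 0.5 * (t - a_i)^2 + lam * |t|, per coordinate (soft threshold)."""
    out = []
    for ai in a:
        if ai > lam:
            out.append(ai - lam)
        elif ai < -lam:
            out.append(ai + lam)
        else:
            out.append(0.0)  # |a_i| <= lam: the solution sits on the axis
    return out


a = [0.8, -0.3, 0.5]   # unregularized optimum (hypothetical)
lam = 0.4
sol_l2 = l2_solution(a, lam)
sol_l1 = l1_solution(a, lam)
print(sol_l2)
print(sol_l1)
```

With these values the L2 solution scales every coordinate by the same factor and keeps all of them nonzero, while the L1 solution sets the middle coordinate to exactly zero.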
Complexity Analysis
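| Operation | Time | Space |
|---|---|---|
| Penalty evaluation | $O(P)$ | $O(1)$ |
| Penalty gradient, accumulated into the loss gradient | $O(P)$ | $O(1)$ extra |
Regularization adds negligible computational cost: one pass over the $P$-element parameter vector per training iteration.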
Step-by-Step Walkthrough
Scenario: 3 parameters with magnitudes $|\theta_1| = 0.8$, $|\theta_2| = 0.3$, $|\theta_3| = 0.5$, and $\lambda = 0.1$.
L2 Regularization:
| Step | Computation | Result |
|---|---|---|
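| Penalty | $\Omega_{L2} = \frac{1}{2}(0.8^2 + 0.3^2 + 0.5^2) = \frac{1}{2}(0.98)$ | $0.49$ |
| Gradient | $\lambda\,\nabla_\theta\Omega_{L2} = \lambda\theta$ | Added to $\nabla_\theta\mathcal{L}$ |
| Contribution to loss | $\lambda\,\Omega_{L2} = 0.1 \times 0.49$ | $0.049$ |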
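The L2 numbers in this table can be reproduced directly. The snippet is a sketch that uses the weight magnitudes 0.8, 0.3, 0.5 and $\lambda = 0.1$ from the walkthrough. The weights are taken as positive, which is an assumption that does not affect the penalty value but does set the sign of each gradient term.

```python
theta = [0.8, 0.3, 0.5]  # magnitudes from the walkthrough; signs assumed positive
lam = 0.1

l2_pen = 0.5 * sum(t * t for t in theta)  # 0.5 * (0.64 + 0.09 + 0.25)
l2_grad_term = [lam * t for t in theta]   # lam * theta, added to the data-loss gradient
l2_contribution = lam * l2_pen            # amount added to the loss

print(round(l2_pen, 6), [round(g, 6) for g in l2_grad_term], round(l2_contribution, 6))
```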
L1 Regularization:
| Step | Computation | Result |
|---|---|---|
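| Penalty | $\Omega_{L1} = 0.8 + 0.3 + 0.5$ | $1.6$ |
| Gradient | $\lambda\,\operatorname{sign}(\theta)$, i.e. $\pm 0.1$ per component | Added to $\nabla_\theta\mathcal{L}$ |
| Contribution to loss | $\lambda\,\Omega_{L1} = 0.1 \times 1.6$ | $0.16$ |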
After several L1 updates (learning rate $\eta$, $\lambda = 0.1$): the $\theta_2$ component, already small at magnitude $0.3$, is driven to exactly zero. The model effectively prunes that connection.
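The shrink-to-zero behaviour can be sketched with a proximal (soft-thresholding) update, which clamps a weight at zero instead of letting it overshoot. This illustrates the technique and is not necessarily the update this library performs. The learning rate `lr = 0.5` is a hypothetical value, the weights are assumed positive, and the data-loss gradient is taken as zero to isolate the effect of the penalty.

```python
def soft_threshold(x, thresh):
    """Proximal operator of thresh * |x|: shrink by thresh, clamp at zero."""
    if x > thresh:
        return x - thresh
    if x < -thresh:
        return x + thresh
    return 0.0


theta = [0.8, 0.3, 0.5]  # magnitudes from the walkthrough; signs assumed positive
lam = 0.1
lr = 0.5                 # hypothetical learning rate, not taken from the source

l1_pen = sum(abs(t) for t in theta)
l1_contribution = lam * l1_pen

# Each step shrinks every weight by lr * lam = 0.05 (penalty part only).
for _ in range(8):
    theta = [soft_threshold(t, lr * lam) for t in theta]

print(round(l1_pen, 6), round(l1_contribution, 6))
print(theta)
```

After eight steps the smallest weight is exactly `0.0`, while the other two are still positive. A plain sub-gradient step without the clamp would instead oscillate around zero with amplitude on the order of `lr * lam`.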
Pitfalls & Edge Cases
- $\lambda$ too large. The model underfits: weights are driven so close to zero that the network cannot represent the function. Cross-validate $\lambda$.
- $\lambda$ too small. Negligible effect; overfitting persists.
- L1 non-differentiability. At $\theta_i = 0$, the L1 gradient is undefined. Use sub-gradient or proximal operators for exact handling.
- Regularizing biases. Conventionally, bias parameters are excluded from regularization because they contribute little to model complexity. This library regularizes all parameters in the flat vector, so be aware of this if bias control matters.
- Fixed-point precision. The regularization term can be much smaller than the main loss when $\lambda$ is small. In low-precision fixed-point, it may round to zero. Rescale the term or use a wider accumulator.
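The fixed-point pitfall can be made concrete with a small quantization sketch. The Q15 and Q31 formats, the helper `to_fixed`, and the values of `theta`, `lam`, and `lr` are all assumptions chosen for illustration. They are not taken from this library.

```python
def to_fixed(x, frac_bits):
    """Quantize a real value to a signed integer with frac_bits fractional bits."""
    return int(round(x * (1 << frac_bits)))


theta, lam, lr = 0.3, 0.001, 0.01  # hypothetical values
decay_term = lr * lam * theta      # about 3e-6: per-step weight-decay increment

q15 = to_fixed(decay_term, 15)     # Q15 resolution is 2**-15, about 3.05e-5
q31 = to_fixed(decay_term, 31)     # Q31 resolution is 2**-31, about 4.66e-10

print(q15, q31)  # Q15 rounds to 0; Q31 keeps a nonzero value
```

In this example the decay increment is below half a Q15 step, so a 16-bit update would never change the weight, whereas a 32-bit accumulator preserves it.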
Variants & Generalizations
| Variant | Key Difference |
|---|---|
| Elastic Net | $\Omega = \alpha \lVert\theta\rVert_1 + (1-\alpha)\lVert\theta\rVert_2^2$; combines L1 sparsity with L2 smoothness |
| Dropout | Randomly zeroes activations during training; implicit ensemble regularization |
| Early stopping | Halts training before overfitting; regularization without modifying the loss |
| Data augmentation | Expands the training set with transformed copies; reduces overfitting by increasing data diversity |
| Spectral normalization | Constrains the spectral norm of weight matrices; stabilizes GAN training |
| Weight clipping | Hard constraint: $|\theta_i| \le c$; used in Wasserstein GANs |
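Two of the variants in the table are simple enough to sketch directly. The snippet below assumes the Elastic Net form $\alpha\lVert\theta\rVert_1 + (1-\alpha)\lVert\theta\rVert_2^2$ and treats weight clipping as projecting each weight onto $[-c, c]$. The function names and example values are hypothetical.

```python
def elastic_net_penalty(theta, alpha):
    """alpha * ||theta||_1 + (1 - alpha) * ||theta||_2^2."""
    l1 = sum(abs(t) for t in theta)
    l2_sq = sum(t * t for t in theta)
    return alpha * l1 + (1.0 - alpha) * l2_sq


def clip_weights(theta, c):
    """Hard constraint: clamp every weight to the interval [-c, c]."""
    return [max(-c, min(c, t)) for t in theta]


theta = [0.8, -0.3, 0.5]  # illustrative weights
en_pure_l1 = elastic_net_penalty(theta, alpha=1.0)  # reduces to the L1 norm
en_pure_l2 = elastic_net_penalty(theta, alpha=0.0)  # reduces to the squared L2 norm
clipped = clip_weights(theta, c=0.4)
print(en_pure_l1, en_pure_l2, clipped)
```

Setting `alpha` between 0 and 1 trades off the two penalties. Other libraries may parameterize Elastic Net differently, for example with a factor of one half on the squared term.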
Applications
- Preventing overfitting — The primary use case for any neural network trained on limited data (common in embedded scenarios).
- Feature selection — L1 regularization identifies and prunes irrelevant input connections.
- Model compression — Sparse models (via L1) require less storage and computation for deployment on MCUs.
- Transfer learning — L2 regularization keeps fine-tuned weights close to pre-trained values.
Connections to Other Algorithms
graph TD
Reg["Regularization"]
Loss["Loss Functions"]
Opt["Optimizer"]
Model["Model"]
LR["Linear Regression"]
Reg -->|"λ Ω(θ) added to ℒ"| Loss
Loss --> Opt
Opt --> Model
Reg -.->|"L2 + MSE = Ridge regression"| LR
| Component | Relationship |
|---|---|
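| Loss Functions | Regularization is a penalty added to the loss: $\mathcal{L}_{\text{reg}} = \mathcal{L} + \lambda\,\Omega(\theta)$ |
| Optimizer | Receives the combined gradient $\nabla_\theta\mathcal{L} + \lambda\,\nabla_\theta\Omega(\theta)$ |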
| Linear Regression | L2-regularized MSE with a linear model is Ridge regression; L1 is Lasso |
References & Further Reading
- Goodfellow, I., Bengio, Y., and Courville, A., Deep Learning, MIT Press, 2016 — Chapter 7 (regularization).
- Tibshirani, R., "Regression shrinkage and selection via the lasso", JRSS-B, 58(1), 1996.
- Krogh, A. and Hertz, J.A., "A simple weight decay can improve generalization", NeurIPS, 1991.