Gradient Descent Optimizer
March 25, 2026 · View on GitHub
Overview & Motivation
An optimizer adjusts a model's parameter vector to minimize a loss function . The simplest and most fundamental strategy is gradient descent: repeatedly step in the direction of steepest descent:
where is the learning rate — the single most important hyper-parameter in neural network training.
This library provides a batch gradient descent optimizer with a fixed learning rate and maximum iteration count, suitable for small embedded models where training data fits in memory and simplicity is paramount.
Mathematical Theory
The Gradient
The gradient is obtained via back-propagation. Each component tells how much the loss changes when is perturbed by a small amount. The gradient points uphill; subtracting it moves downhill.
Convergence Conditions
For a convex loss with Lipschitz-continuous gradient (-smooth):
provided . This gives a convergence rate of .
For non-convex losses (typical in neural networks), gradient descent converges to a stationary point () which may be a local minimum or saddle point — not necessarily the global minimum.
Learning Rate Sensitivity
| Behavior | |
|---|---|
| Too small () | Slow convergence, many iterations wasted |
| Optimal () | Fastest reliable convergence |
| Too large () | Oscillation and divergence |
graph LR
subgraph "Learning Rate Effect"
A["η too small<br/>crawls to minimum"]
B["η just right<br/>smooth convergence"]
C["η too large<br/>oscillates / diverges"]
end
Complexity Analysis
| Operation | Time | Space |
|---|---|---|
| One gradient evaluation | per parameter | gradient vector |
| One parameter update | In-place | |
| Total training (T iterations) | working memory |
Memory overhead is minimal: only the current parameter vector, the gradient vector, and the optimizer result. No momentum buffers or second-moment estimates.
Step-by-Step Walkthrough
Setup: 2 parameters, , , loss \mathcal{L}(\theta) = \theta_1^{2} + \theta_2^{2}$$ (quadratic bowl).
Iteration 1:
| Step | Computation | Result |
|---|---|---|
| Gradient | ||
| Update | ||
| Loss | 6^{2}8^{2}$$ | $3.20$ |
Iteration 2:
| Step | Computation | Result |
|---|---|---|
| Gradient | ||
| Update | ||
| Loss | $2.048$ |
Pattern: Each iteration multiplies the loss by . After 20 iterations, .
Convergence trace:
graph LR
I0["θ₀ = (2, -1)<br/>ℒ = 5.0"] --> I1["θ₁ = (1.6, -0.8)<br/>ℒ = 3.2"]
I1 --> I2["θ₂ = (1.28, -0.64)<br/>ℒ = 2.05"]
I2 --> Idots["⋯"]
Idots --> I20["θ₂₀ ≈ (0.01, -0.005)<br/>ℒ ≈ 0.0003"]
Pitfalls & Edge Cases
- Learning rate too high. The loss oscillates or increases. Start with and decrease if unstable.
- Flat regions / saddle points. Gradient descent stalls when at non-optimal points. Momentum-based methods escape saddle points faster.
- Ill-conditioned loss landscape. When eigenvalues of the Hessian span many orders of magnitude, gradient descent oscillates along steep directions while barely progressing along flat ones. Adaptive methods (Adam) handle this better.
- Max iterations reached. The optimizer returns whatever parameters it has at
maxIterations— always checkresult.finalCostto assess convergence quality. - Fixed-point gradients. In Q15/Q31, multiply-accumulate in the gradient can overflow. Scale the learning rate to keep weight updates within representable range.
Variants & Generalizations
| Variant | Key Difference |
|---|---|
| SGD (Stochastic Gradient Descent) | Uses a random mini-batch per iteration; noise helps escape local minima |
| SGD with Momentum | Accumulates a velocity: ; smooths out oscillations |
| Nesterov Momentum | Evaluates gradient at the lookahead position ; faster convergence |
| AdaGrad | Per-parameter adaptive learning rate based on accumulated squared gradients |
| RMSProp | Exponentially decaying average of squared gradients; handles non-stationary objectives |
| Adam | Combines momentum and RMSProp with bias correction; most popular adaptive method |
Applications
- On-device neural network training — Gradient descent trains small models directly on the microcontroller.
- Online adaptation — Updating model parameters in real-time as new sensor data arrives.
- Calibration — Minimizing calibration error for sensor linearization.
- System identification — Fitting model parameters to measured input-output data.
Connections to Other Algorithms
graph TD
Opt["Optimizer<br/>(Gradient Descent)"]
Loss["Loss Functions"]
Model["Model"]
Reg["Regularization"]
LR["Linear Regression"]
Loss -->|"∇ℒ"| Opt
Model -->|"θ, ∇θ"| Opt
Reg -->|"penalty gradient"| Loss
Opt -.->|"analytical solution at η→∞, 1 step"| LR
| Component | Relationship |
|---|---|
| Loss Functions | Provides the Cost() and Gradient() the optimizer calls each iteration |
| Model | Passes initial parameters to the optimizer and receives optimized parameters back |
| Regularization | Adds a penalty gradient to , biasing the optimizer toward simpler models |
| Linear Regression | For MSE on a linear model, gradient descent converges to the same solution as the normal equation |
References & Further Reading
- Ruder, S., "An overview of gradient descent optimization algorithms", arXiv:1609.04747, 2016.
- Bottou, L., Curtis, F.E., and Nocedal, J., "Optimization methods for large-scale machine learning", SIAM Review, 60(2), 2018.
- Kingma, D.P. and Ba, J., "Adam: A method for stochastic optimization", ICLR, 2015.