Bayesian Optimization
April 4, 2026 · View on GitHub
Overview & Motivation
Bayesian Optimization (BO) is a gradient-free, sample-efficient strategy for optimizing black-box functions that are expensive to evaluate. Unlike gradient-based methods (gradient descent, Adam), BO does not require access to derivatives. It is ideal for calibrating numerical controllers, tuning hyperparameters, or any optimization task where each function evaluation is costly — for example, running a full closed-loop simulation.
BO maintains a probabilistic surrogate model (typically a Gaussian Process) of the objective function. At each iteration, it uses an acquisition function to select the next query point that balances exploration (uncertain regions) with exploitation (promising regions). After evaluating the true objective at that point, the surrogate is updated and the process repeats.
Mathematical Theory
Gaussian Process Surrogate
A Gaussian Process (GP) defines a distribution over functions:
With zero mean and squared-exponential (RBF) kernel:
where is the length-scale and is the signal variance.
Given observations , the GP posterior at a new point is Gaussian:
where:
- is the kernel matrix
- is the cross-kernel vector
- is the observation noise variance
Expected Improvement Acquisition Function
For minimization, the Expected Improvement (EI) at given best observed value is:
For a GP surrogate this evaluates analytically:
where is the standard normal CDF and is the PDF.
Fixed-Size GP Implementation
To avoid dynamic memory allocation, the kernel matrix is pre-allocated at compile-time with maximum capacity . When only observations are present, the inactive rows and columns are padded with an identity block, preserving positive-definiteness and allowing Gaussian elimination with the full system. The resulting alpha vector has zeros in the inactive entries.
Complexity Analysis
| Case | Time per iteration | Space | Notes |
|---|---|---|---|
| Best | Dominated by GP solve | ||
| Average | candidates, parameters | ||
| Worst | Same as average |
Where = maximum observations, = candidate count, = parameter count, = current number of observations.
All memory is pre-allocated at compile-time on the stack — no dynamic allocation required.
Step-by-Step Walkthrough
Example: minimize on .
- Initialization: draw 2 random points, e.g. , .
- Evaluate: , . Best so far .
- Build GP: compute $2 \times 2\mathbf{K}\ell=1\sigma_f=1\sigma_n=0.01$.
- Solve: .
- Maximize EI: sample 100 candidates uniformly, compute for each, select .
- Evaluate: . Update GP. Repeat.
- Convergence (after 20 iterations): best point converges near .
Pitfalls & Edge Cases
- GP kernel matrix conditioning: If two observations are nearly identical, can be ill-conditioned. The noise variance acts as a regularizer; set it for stability.
- Exploration vs exploitation: A small candidate sample count (below ~30) may miss the EI maximum in high dimensions. Increase the candidate count for parameters.
- Bounds scaling: EI maximization by uniform sampling from assumes a normalized search space. Provide appropriate bounds and the optimizer will rescale internally.
- Fixed capacity: The maximum observation budget is a compile-time parameter. Set it large enough for the intended number of BO iterations. Exceeding the capacity at runtime triggers a precondition failure.
Variants & Generalizations
- Upper Confidence Bound (UCB): alternative to EI: . More explicit trade-off via .
- GP hyperparameter optimization: Maximize the log marginal likelihood over , , to adapt the surrogate to the observed data automatically.
- Gradient-enhanced BO: When gradients are available, use GP regression on jointly for faster convergence.
- Batch BO: Select candidates per iteration (parallel evaluations) using the -EI criterion.
Applications
- Controller weight tuning (e.g., MPC state/control cost weights Q, R)
- Hyperparameter optimisation for machine learning models
- Experiment design in system identification
- Any engineering optimisation where each evaluation is costly (simulation, hardware in the loop)
Connections to Other Algorithms
- Gaussian Process (GP): The surrogate model is a GP; the Kalman filter is formally equivalent to GP regression with a Markovian kernel.
- Gaussian Elimination / Cholesky: Used to solve the GP kernel system .
- Gradient Descent: BO is the gradient-free counterpart — preferred when the objective is non-differentiable or noisy.
- Expectation-Maximisation (EM): Both maintain a probabilistic model refined by data; EM operates on latent-variable likelihood while BO operates on a surrogate of an arbitrary black-box.
Numerical Properties
- Real-time suitability: No — the GP linear solve at each iteration precludes hard real-time use. Suitable for offline calibration or non-real-time configuration stages.
- Precision: Floating-point arithmetic only. Gaussian process kernels require transcendental functions — exponential, complementary error function, square root — that are incompatible with fixed-point representations. The kernel matrix inversion loses precision for very flat kernels ( domain width). Length-scale should remain in the range relative to the normalised parameter domain.
- Noise regularisation: The diagonal noise term must be positive to keep the kernel matrix positive-definite; values below $10^{-3}$ risk ill-conditioning.
- Memory: All storage is pre-allocated at compile-time — the maximum observation budget and the candidate count are both compile-time parameters. No dynamic allocation is required.
References & Further Reading
- Brochu, Cora, de Freitas, "A Tutorial on Bayesian Optimization of Expensive Cost Functions", 2010
- Rasmussen & Williams, Gaussian Processes for Machine Learning, MIT Press, 2006 — Chapter 2 (GP regression), Chapter 5 (model selection)
- Snoek, Larochelle, Adams, "Practical Bayesian Optimization of Machine Learning Algorithms", NeurIPS 2012