Linear Regression

March 21, 2026 · View on GitHub

Overview & Motivation

Given a set of observations {(xi,yi)}\{(x_i, y_i)\}, we often want to find the best-fit linear relationship between features xx and target yy. Linear regression does this by finding the coefficients β\beta that minimize the sum of squared residuals — the gap between predicted and observed values.

The approach is fundamental because:

  • It has a closed-form solution (the normal equation), so no iterative optimization is needed.
  • It provides a baseline model against which more complex methods are measured.
  • The solution is the maximum likelihood estimator under Gaussian noise assumptions.

This library solves the normal equation directly using Gaussian elimination, making it suitable for small-to-moderate feature counts typical in embedded estimation tasks.

Mathematical Theory

The Model

y=β0+β1x1+β2x2++βpxp+εy = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon

In matrix form, augmenting XX with a column of ones for the intercept:

y=Xβ+ε\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}

where XRn×(p+1)\mathbf{X} \in \mathbb{R}^{n \times (p+1)}, βRp+1\boldsymbol{\beta} \in \mathbb{R}^{p+1}, yRn\mathbf{y} \in \mathbb{R}^n.

Normal Equation

Minimizing yXβ2\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 with respect to β\boldsymbol{\beta} yields:

β=(XTX)1XTy\boldsymbol{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}

This is solved in practice by forming XTX\mathbf{X}^T\mathbf{X} and XTy\mathbf{X}^T\mathbf{y}, then using Gaussian elimination to solve the (p+1)×(p+1)(p+1) \times (p+1) system.

Geometric Interpretation

The prediction y^=Xβ\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\beta} is the orthogonal projection of y\mathbf{y} onto the column space of X\mathbf{X}. The residual yy^\mathbf{y} - \hat{\mathbf{y}} is perpendicular to every column of X\mathbf{X}.

Complexity Analysis

PhaseTimeSpaceNotes
XTX\mathbf{X}^T\mathbf{X}O(np2)O(n p^2)O(p2)O(p^2)Dominated by matrix multiply
XTy\mathbf{X}^T\mathbf{y}O(np)O(np)O(p)O(p)Matrix-vector multiply
SolveO(p3)O(p^3)O(p2)O(p^2)Gaussian elimination
PredictO(p)O(p)O(1)O(1)Dot product

For embedded use with small pp (say 10\leq 10), the entire fit completes in microseconds.

Step-by-Step Walkthrough

Data: 3 samples, 1 feature.

xxyy
12
24
35

Step 1 — Design matrix (with bias column):

X=[111213]\mathbf{X} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}

Step 2 — Compute XTX\mathbf{X}^T\mathbf{X} and XTy\mathbf{X}^T\mathbf{y}:

XTX=[36614],XTy=[1123]\mathbf{X}^T\mathbf{X} = \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix}, \quad \mathbf{X}^T\mathbf{y} = \begin{bmatrix} 11 \\ 23 \end{bmatrix}

Step 3 — Solve [36614]β=[1123]\begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix} \boldsymbol{\beta} = \begin{bmatrix} 11 \\ 23 \end{bmatrix}:

Forward elimination → back-substitution: β1=1.5\beta_1 = 1.5, β0=2/30.667\beta_0 = 2/3 \approx 0.667

Result: y^=0.667+1.5x\hat{y} = 0.667 + 1.5x

Prediction at x=4x = 4: y^=0.667+6.0=6.667\hat{y} = 0.667 + 6.0 = 6.667

Pitfalls & Edge Cases

  • Multicollinearity. If features are nearly linearly dependent, XTX\mathbf{X}^T\mathbf{X} is ill-conditioned and the solution is unstable. Consider regularization (Ridge/Lasso) or removing redundant features.
  • Fewer samples than features (n<p+1n < p+1). The system is underdetermined and XTX\mathbf{X}^T\mathbf{X} is singular. At a minimum, np+1n \geq p+1.
  • Outliers have outsized influence because the squared loss amplifies large residuals. Robust alternatives (Huber loss, RANSAC) exist outside this library.
  • Extrapolation danger. The linear model has no mechanism to detect when it is being queried far from the training data range.
  • Fixed-point precision. For Q15/Q31 types, features should be scaled to [1,1)[-1, 1) to avoid overflow in XTX\mathbf{X}^T\mathbf{X}.

Variants & Generalizations

VariantKey Difference
Ridge regression (L2)Adds λβ2\lambda\|\boldsymbol{\beta}\|^2 to the cost; solves (XTX+λI)β=XTy(\mathbf{X}^T\mathbf{X} + \lambda I)\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}
Lasso regression (L1)Adds λβ1\lambda\|\boldsymbol{\beta}\|_1; promotes sparsity but requires iterative optimization
Polynomial regressionAdds powers of xx as new features; still linear in parameters
Weighted least squaresWeights each sample differently; solves (XTWX)β=XTWy(\mathbf{X}^T W \mathbf{X})\boldsymbol{\beta} = \mathbf{X}^T W \mathbf{y}
Recursive least squaresUpdates β\boldsymbol{\beta} incrementally as new data arrives; suited for online estimation

Applications

  • Sensor calibration — Fitting a linear transfer function between raw ADC counts and physical units.
  • Trend estimation — Extracting linear trends from noisy time series (temperature, voltage drift).
  • System identification — Estimating static gain or simple dynamic relationships.
  • Feature importance — The magnitude of βi\beta_i indicates the influence of feature ii (after feature scaling).
  • Predictive maintenance — Modeling degradation rate from operating-condition features.

Connections to Other Algorithms

graph LR
    LR["Linear Regression"]
    GE["Gaussian Elimination"]
    YW["Yule-Walker"]
    NN["Neural Network"]
    LR --> GE
    YW -.->|"similar normal-equation structure"| LR
    NN -.->|"single linear layer = regression"| LR
AlgorithmRelationship
Gaussian EliminationUsed to solve the normal equation system
Yule-WalkerStructurally similar — also solves a linear system derived from correlations
Neural NetworkA single-layer neural network with no activation and MSE loss reduces to linear regression

References & Further Reading

  • Hastie, T., Tibshirani, R. and Friedman, J., The Elements of Statistical Learning, 2nd ed., Springer, 2009 — Chapter 3.
  • Bishop, C.M., Pattern Recognition and Machine Learning, Springer, 2006 — Chapter 3.
  • Strang, G., Linear Algebra and Its Applications, 4th ed., Thomson, 2006 — Section 4.3.