Basis Functions and Generalisation
Neil Lawrence
2025-09-09
Dedan Kimathi University, Nyeri, Kenya
Nonlinear Regression with Linear Models
Nonlinear Regression
Problem with linear regression: \(\mathbf{ x}\) may not be linearly related to \(\mathbf{ y}\).
Potential solution: create a feature space by defining \(\phi(\mathbf{ x})\), where \(\phi(\cdot)\) is a nonlinear function of \(\mathbf{ x}\).
The model for the target is then a linear combination of these nonlinear functions \[f(\mathbf{ x}) = \sum_{j=1}^mw_j \phi_j(\mathbf{ x})\]
Basis Functions
Instead of working in the input space \(\mathbf{ x}\),
build models in a new space \(\boldsymbol{ \phi}(\mathbf{ x})\).
Quadratic Basis
Basis functions can be global. E.g. quadratic basis: \[
\boldsymbol{ \phi}^\top = [1, x, x^2]
\]
\[
\begin{align*}
\phi_1(x) & = 1, \\
\phi_2(x) & = x, \\
\phi_3(x) & = x^2.
\end{align*}
\]
Quadratic Basis
\[
\boldsymbol{ \phi}(x) = \begin{bmatrix} 1\\ x\\ x^2\end{bmatrix}.
\]
Design Matrix
\[
\boldsymbol{ \Phi}(\mathbf{ x}) =
\begin{bmatrix} 1 & x_1 &
x_1^2 \\
1 & x_2 & x_2^2\\
\vdots & \vdots & \vdots \\
1 & x_n& x_n^2
\end{bmatrix}
\]
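A minimal sketch (assuming a one-dimensional input supplied as a numpy array; the function name is illustrative) of how this design matrix might be built:

import numpy as np

def quadratic_design_matrix(x):
    "Stack the quadratic basis [1, x, x^2] as the columns of a design matrix."
    x = np.asarray(x, dtype=float).reshape(-1, 1)   # column vector, shape (n, 1)
    return np.hstack([np.ones_like(x), x, x**2])    # shape (n, 3)

Phi = quadratic_design_matrix([1.0, 2.0, 3.0])      # rows are [1, x_i, x_i^2]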
Functions Derived from Quadratic Basis
\[
f(x) = {\color{cyan}{w_0}} + {\color{green}{w_1 x}} + {\color{yellow}{w_2 x^2}}
\]
Quadratic Functions
\[
f(x) = {\color{cyan}{w_0}} + {\color{green}{w_1 x}} + {\color{yellow}{w_2 x^2}}
\]
Choice of Basis
The polynomial represents one choice of basis
\[
\phi_j(x_i) = x_i^j
\]
Polynomial Basis
\[
\phi_j(x) = x^j
\]
Functions Derived from Polynomial Basis
\[
f(x) = {\color{cyan}{w_0}} + {\color{green}{w_1 x}} + {\color{yellow}{w_2 x^2}} + {\color{magenta}{w_3 x^3}} + {\color{red}{w_4 x^4}}
\]
Different Basis
The polynomial basis is widely used in engineering and graphics.
Drawback in machine learning: the values grow rapidly when \(|\mathbf{ x}| > 1\).
Radial Basis Functions
Basis functions can be local e.g. radial (or Gaussian) basis \[
\phi_j(x) = \exp\left(-\frac{(x-\mu_j)^2}{\ell^2}\right)
\]
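A minimal sketch (assuming the centres \(\mu_j\) and length scale \(\ell\) are chosen by hand; names are illustrative) of evaluating these basis functions:

import numpy as np

def radial_design_matrix(x, mu, ell=1.0):
    "Evaluate Gaussian radial basis functions centred at mu for inputs x."
    x = np.asarray(x, dtype=float).reshape(-1, 1)    # inputs, shape (n, 1)
    mu = np.asarray(mu, dtype=float).reshape(1, -1)  # centres, shape (1, m)
    return np.exp(-(x - mu)**2 / ell**2)             # features, shape (n, m)

Phi = radial_design_matrix([0.0, 0.5, 1.0], mu=[-1.0, 0.0, 1.0], ell=0.7)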
Radial Basis Functions
Functions Derived from Radial Basis
\[
f(x) = \color{cyan}{w_1 e^{-2(x+1)^2}} + \color{green}{w_2e^{-2x^2}} + \color{yellow}{w_3 e^{-2(x-1)^2}}
\]
Rectified Linear Units
The ReLU is a basis function used in neural nets
\[
\phi_j(x) = xH(x+ v_j)
\]
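A minimal sketch (with the Heaviside step \(H(\cdot)\) implemented as a boolean comparison and the thresholds \(v_j\) chosen by hand):

import numpy as np

def relu_design_matrix(x, v):
    "Evaluate ReLU basis functions x * H(x + v_j) for each threshold v_j."
    x = np.asarray(x, dtype=float).reshape(-1, 1)    # inputs, shape (n, 1)
    v = np.asarray(v, dtype=float).reshape(1, -1)    # thresholds, shape (1, m)
    return x * (x + v > 0)                           # Heaviside step via comparison

Phi = relu_design_matrix([-1.0, 0.0, 1.0], v=[1.0, 0.33, -0.33, -1.0])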
Functions Derived from Relu Basis
\[
f(x) = \color{cyan}{w_0} + \color{green}{w_1 xH(x+1.0) } + \color{yellow}{w_2 xH(x+0.33) } + \color{magenta}{w_3 xH(x-0.33)} + \color{red}{w_4 xH(x-1.0)}
\]
Hyperbolic Tangent Basis
The hyperbolic tangent was formerly popular for neural nets.
\[
\phi_j(x) = \tanh(v_j x+ v_0)
\]
Functions Derived from Tanh Basis
\[
f(x) = {\color{cyan}{w_0}} + {\color{green}{w_1 \text{tanh}\left(x+1\right)}} + {\color{yellow}{w_2 \text{tanh}\left(x+0.33\right)}} + {\color{magenta}{w_3 \text{tanh}\left(x-0.33\right)}} + {\color{red}{w_4 \text{tanh}\left(x-1\right)}}
\]
Fourier Basis
In signal processing we often use the Fourier basis
\[
f(x) = w_0 + w_1 \sin(x) + w_2 \cos(x) + w_3 \sin(2x) + w_4 \cos(2x)
\]
Functions Derived from Fourier Basis
\[
f(x) = {\color{cyan}{w_0}} + {\color{green}{w_1 \sin(x)}} + {\color{yellow}{w_2 \cos(x)}} + {\color{magenta}{w_3 \sin(2x)}} + {\color{red}{w_4 \cos(2x)}}
\]
Fitting Basis Function Models
Olympic Marathon Data
Gold medal times for Olympic Marathon since 1896.
Marathons before 1924 didn’t have a standardized distance.
Present results using pace per km.
In 1904 the marathon was badly organised, leading to very slow times.
Image from Wikimedia Commons
Olympic Marathon Data
Olympic marathon pace times since 1896.
Notebook Example
In the notebook you are asked to scale the weights to fit functions to Olympic Marathon data.
Basis Function Models
The prediction function is now defined as \[
f(\mathbf{ x}_i) = \sum_{j=1}^mw_j \phi_{i, j},
\] where \(\phi_{i, j} = \phi_j(\mathbf{ x}_i)\).
Vector Notation
Write in vector notation, \[
f(\mathbf{ x}_i) = \mathbf{ w}^\top \boldsymbol{ \phi}_i
\]
Log Likelihood for Basis Function Model
The likelihood of a single data point is \[
p\left(y_i|x_i\right)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\left(y_i-\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\right)^{2}}{2\sigma^2}\right).
\]
Log Likelihood for Basis Function Model
Leading to a log likelihood for the data set of
\[\begin{aligned}
L(\mathbf{ w},\sigma^2)= & -\frac{n}{2}\log \sigma^2-\frac{n}{2}\log 2\pi \\ & -\frac{\sum_{i=1}^{n}\left(y_i-\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\right)^{2}}{2\sigma^2}.
\end{aligned}\]
Objective Function
And a corresponding objective function of the form \[
L(\mathbf{ w},\sigma^2)= \frac{n}{2}\log\sigma^2 + \frac{\sum_{i=1}^{n}\left(y_i-\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\right)^{2}}{2\sigma^2}.
\]
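As a check on the algebra, a small sketch (assuming the basis evaluations \(\boldsymbol{ \phi}_i^\top\) are stored as the rows of a numpy array Phi, with targets y, weights w and noise variance sigma2) of evaluating this objective:

import numpy as np

def objective(w, Phi, y, sigma2):
    "Objective from the negative log likelihood, dropping the constant term."
    n = y.shape[0]
    residuals = y - Phi @ w                  # y_i - w^T phi_i for every point
    return 0.5 * n * np.log(sigma2) + 0.5 * np.sum(residuals**2) / sigma2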
Expand the Brackets
\[\begin{aligned}
L(\mathbf{ w},\sigma^2) = &\frac{n}{2}\log \sigma^2 + \frac{1}{2\sigma^2}\sum_{i=1}^{n}y_i^{2} \\ & -\frac{1}{\sigma^2}\sum_{i=1}^{n}y_i\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\\ &+\frac{1}{2\sigma^2}\sum_{i=1}^{n}\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^{\top}\mathbf{ w}+\text{const}.
\end{aligned}\]
Expand the Brackets
\[\begin{aligned} L(\mathbf{ w}, \sigma^2) = & \frac{n}{2}\log \sigma^2 + \frac{1}{2\sigma^2}\sum_{i=1}^{n}y_i^{2} \\ & -\frac{1}{\sigma^2} \mathbf{ w}^\top\sum_{i=1}^{n}\boldsymbol{ \phi}_i y_i\\ & +\frac{1}{2\sigma^2}\mathbf{ w}^{\top}\left[\sum_{i=1}^{n}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^{\top}\right]\mathbf{ w}+\text{const}.\end{aligned}\]
Design Matrices
Design matrix notation \[
\boldsymbol{ \Phi}= \begin{bmatrix} \mathbf{1} & \mathbf{ x}& \mathbf{ x}^2\end{bmatrix}
\] so that \[
\boldsymbol{ \Phi}\in \Re^{n\times p}.
\]
Multivariate Derivatives Reminder
\[\frac{\text{d}\mathbf{a}^{\top}\mathbf{ w}}{\text{d}\mathbf{ w}}=\mathbf{a}\] \[\frac{\text{d}\mathbf{ w}^{\top}\mathbf{A}\mathbf{ w}}{\text{d}\mathbf{ w}}=\left(\mathbf{A}+\mathbf{A}^{\top}\right)\mathbf{ w}\]
or
\[\frac{\text{d}\mathbf{ w}^{\top}\mathbf{A}\mathbf{ w}}{\text{d}\mathbf{ w}}=2\mathbf{A}\mathbf{ w}\]
for symmetric \(\mathbf{A}\) .
Differentiate
Differentiate wrt \(\mathbf{ w}\)
\[\begin{aligned}
\frac{\text{d} L\left(\mathbf{ w},\sigma^2 \right)}{\text{d}\mathbf{ w}}= & -\frac{1}{\sigma^2} \sum_{i=1}^{n}\boldsymbol{ \phi}_iy_i \\ & +\frac{1}{\sigma^2} \left[\sum_{i=1}^{n}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^{\top}\right]\mathbf{ w}
\end{aligned}\]
Find Stationary Point
Set to zero leading to \[\left[\sum_{i=1}^{n}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^{\top}\right]\mathbf{ w}^{*}=\sum_{i=1}^{n}\boldsymbol{ \phi}_iy_i.\]
Matrix Notation
\[
\sum_{i=1}^{n}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^\top = \boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\] \[\sum _{i=1}^{n}\boldsymbol{ \phi}_iy_i = \boldsymbol{ \Phi}^\top \mathbf{ y}
\]
Update Equations
To find \(\mathbf{ w}^{*}\) solve \[
\left(\boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\right) \mathbf{ w}^{*} = \boldsymbol{ \Phi}^\top \mathbf{ y}
\]
The equation for \(\left.\sigma^2\right.^{*}\) may also be found \[
\left.\sigma^2\right.^{*}=\frac{\sum_{i=1}^{n}\left(y_i-\left.\mathbf{ w}^{*}\right.^{\top}\boldsymbol{ \phi}_i\right)^{2}}{n}\]
Avoid Direct Inverse
Rather than computing \(\left(\boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\right)^{-1}\) explicitly, solve for \(\mathbf{ w}\) in \[
\left(\boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\right)\mathbf{ w}= \boldsymbol{ \Phi}^\top \mathbf{ y},
\] which is a linear system of the standard form \[
\mathbf{A}\mathbf{x} = \mathbf{b}.
\]
Solving
See np.linalg.solve
In practice use \(\mathbf{Q}\mathbf{R}\) decomposition (see lab class notes).
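For example, a minimal sketch (assuming Phi is the n-by-m design matrix and y the target vector, both numpy arrays) of the direct route with np.linalg.solve; a QR-based variant follows the derivation below:

import numpy as np

def fit_basis_model(Phi, y):
    "Solve (Phi^T Phi) w = Phi^T y and estimate the noise variance."
    w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
    sigma2_star = np.mean((y - Phi @ w_star)**2)
    return w_star, sigma2_star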
Solution with QR Decomposition
\[
\boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\boldsymbol{\beta} =
\boldsymbol{ \Phi}^\top \mathbf{ y}
\] substitute \(\boldsymbol{ \Phi}= \mathbf{Q}\mathbf{R}\) \[
(\mathbf{Q}\mathbf{R})^\top
(\mathbf{Q}\mathbf{R})\boldsymbol{\beta} = (\mathbf{Q}\mathbf{R})^\top
\mathbf{ y}
\] \[
\mathbf{R}^\top (\mathbf{Q}^\top \mathbf{Q}) \mathbf{R}
\boldsymbol{\beta} = \mathbf{R}^\top \mathbf{Q}^\top \mathbf{ y}
\]
\[
\mathbf{R}^\top \mathbf{R} \boldsymbol{\beta} = \mathbf{R}^\top \mathbf{Q}^\top
\mathbf{ y}
\] \[
\mathbf{R} \boldsymbol{\beta} = \mathbf{Q}^\top \mathbf{ y}
\]
More numerically stable.
Avoids the intermediate computation of \(\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}\) .
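A sketch of the QR route under the same assumptions; since R is upper triangular, a dedicated triangular solver could replace np.linalg.solve here:

import numpy as np

def fit_basis_model_qr(Phi, y):
    "Solve R w = Q^T y using the (thin) QR decomposition of Phi."
    Q, R = np.linalg.qr(Phi)            # Phi = QR with Q^T Q = I, R upper triangular
    return np.linalg.solve(R, Q.T @ y)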
Non-linear but Linear in the Parameters
Model is non-linear, but linear in parameters \[
f(\mathbf{ x}) = \mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x})
\]
\(\mathbf{ x}\) is inside the non-linearity, but \(\mathbf{ w}\) is outside. \[
f(\mathbf{ x}) = \mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x};
\mathbf{ v}),
\] where \(\mathbf{ v}\) collects any parameters inside the basis functions (such as the centres \(\mu_j\) and length scale \(\ell\) of the radial basis, or the thresholds \(v_j\) of the ReLU basis).
Polynomial Fits to Olympic Marathon Data
Fit linear model with polynomial basis to marathon data.
Try different numbers of basis functions (different degrees of polynomial).
Check the quality of fit.
Linear Fit
\[f(x, \mathbf{ w}) = w_0 + w_1x\]
Fit of a 1-degree polynomial (a linear model) to the Olympic marathon data.
Cubic Fit
\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + w_3 x^3\]
Fit of a 3-degree polynomial (a cubic model) to the Olympic marathon data.
9th Degree Polynomial Fit
\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + \dots + w_9 x^9\]
Fit of a 9-degree polynomial to the Olympic marathon data.
16th Degree Polynomial Fit
\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + \dots + w_{16} x^{16}\]
Fit of a 16-degree polynomial to the Olympic marathon data.
28th Degree Polynomial Fit
\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + \dots + w_{28} x^{28}\]
Fit of a 28-degree polynomial to the Olympic marathon data.
Polynomial Fits to Olympic Data
Empirical Risk Minimisation
Expected Loss
\[
R(\mathbf{ w}) = \int L(y, x, \mathbf{ w}) \mathbb{P}(y, x) \text{d}y
\text{d}x.
\]
Loss Function
Here \(L(\cdot)\) is the loss function.
A different interpretation of the objective.
The cost you pay for mistakes.
Sample-Based Approximations
Sample based approximation: replace true expectation with sum over samples. \[
\int f(z) p(z) \text{d}z\approx \frac{1}{s}\sum_{i=1}^s f(z_i).
\]
Allows us to approximate true integral with a sum \[
R(\mathbf{ w}) \approx \frac{1}{n}\sum_{i=1}^{n} L(y_i, x_i, \mathbf{ w}).
\]
Empirical Risk Minimization
If the loss is the squared loss \[
L(y, x, \mathbf{ w}) = (y-\mathbf{ w}^\top\boldsymbol{\phi}(x))^2,
\]
then we recover the empirical risk \[
R(\mathbf{ w}) \approx \frac{1}{n} \sum_{i=1}^{n}
(y_i - \mathbf{ w}^\top \boldsymbol{\phi}(x_i))^2
\]
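A minimal sketch (same assumed Phi and y as before) of computing this empirical risk:

import numpy as np

def empirical_risk(w, Phi, y):
    "Average squared loss over the training set."
    return np.mean((y - Phi @ w)**2)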
Estimating Risk through Validation
Validation on the Olympic Marathon Data
Polynomial Fit: Training Error
Next we consider a quadratic fit, and compute the training error for both the linear and quadratic fits.
Polynomial Fits to Olympic Data
Hold Out Validation on Olympic Marathon Data
Overfitting
Increasing the number of basis functions gives a better ‘fit’ to the data.
How will the model perform on previously unseen data?
Let’s consider predicting the future.
Interpolation
Predicting the winning time for the 1946 Olympics is interpolation.
This is because we have times from 1936 and 1948.
If we want a model for interpolation how can we test it?
One trick is to sample the validation set from throughout the data set.
Future Prediction: Interpolation
Choice of Validation Set
The choice of validation set should reflect how you will use the model in practice.
For extrapolation into the future we tried validating with data from the future.
For interpolation we chose the validation set from throughout the data.
For different validation sets we could get different results.
Leave One Out Error
Take training set and remove one point.
Train on the remaining data.
Compute the error on the point you removed (which wasn’t in the training data).
Do this for each point in the training set in turn.
Average the resulting error.
This is the leave one out error; a code sketch follows below.
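A sketch of the procedure, assuming a hypothetical fit(Phi, y) routine that returns a weight vector (for example a thin wrapper around the solvers sketched earlier):

import numpy as np

def leave_one_out_error(Phi, y, fit):
    "Average squared error over n refits, each holding out a single point."
    n = y.shape[0]
    errors = np.zeros(n)
    for i in range(n):
        keep = np.arange(n) != i               # boolean mask excluding point i
        w = fit(Phi[keep], y[keep])            # train on the remaining data
        errors[i] = (y[i] - Phi[i] @ w)**2     # error on the held-out point
    return np.mean(errors)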
\(k\) -fold Cross Validation
Leave one out error can be very time consuming.
Need to train your algorithm \(n\) times.
An alternative: \(k\)-fold cross validation (sketched below).
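A sketch of \(k\)-fold cross validation under the same assumption of a fit(Phi, y) routine returning weights:

import numpy as np

def k_fold_error(Phi, y, fit, k=5, seed=0):
    "Average validation error over k folds of the data."
    n = y.shape[0]
    ind = np.random.default_rng(seed).permutation(n)   # shuffle before splitting
    errors = []
    for fold in np.array_split(ind, k):
        train = np.setdiff1d(ind, fold)                 # indices outside this fold
        w = fit(Phi[train], y[train])
        errors.append(np.mean((y[fold] - Phi[fold] @ w)**2))
    return np.mean(errors)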
The Bootstrap
\[
\mathbf{ y}, \mathbf{X}\sim \mathbb{P}(y, \mathbf{ x})
\]
Resample Dataset
import numpy as np

def bootstrap(X):
    "Return a bootstrap sample from a data set."
    n = X.shape[0]
    ind = np.random.choice(n, n, replace=True)  # Sample randomly with replacement.
    return X[ind, :]
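One way the bootstrap might be combined with the basis function model (a sketch: resampling row indices so each basis vector stays paired with its target, and again assuming a fit(Phi, y) routine returning weights):

import numpy as np

def bootstrap_fits(Phi, y, fit, num_samples=100):
    "Refit the model on row-paired bootstrap resamples of the data set."
    n = y.shape[0]
    weights = []
    for _ in range(num_samples):
        ind = np.random.choice(n, n, replace=True)   # resample rows with replacement
        weights.append(fit(Phi[ind], y[ind]))        # keep phi_i and y_i paired
    return weights

The spread of predictions across these refitted models gives the bootstrap fits and confidence intervals shown in the plots that follow.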
Bootstrap and Olympic Marathon Data
Linear Fit
\[f(x, \mathbf{ w}) = w_0 + w_1 x\]
Fit of a 1-degree polynomial (a linear model) to the Olympic marathon data.
Cubic Fit
\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + w_{3} x^3\]
Fit of a 3-degree polynomial (a cubic model) to the Olympic marathon data.
9th Degree Polynomial Fit
\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + \dots + w_{9} x^{9}\]
Fit of a 9-degree polynomial to the Olympic marathon data.
16th Degree Polynomial Fit
\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + \dots + w_{16} x^{16}\]
Fit of a 16-degree polynomial to the Olympic marathon data.
Bootstrap Confidence Intervals
Bootstrap confidence intervals for a cubic polynomial fit to Olympic marathon data. The shaded region shows the 95% confidence interval, and individual bootstrap fits are shown in light green.
Bias Variance Decomposition
Generalisation error \[\begin{align*}
R(\mathbf{ w}) = & \int \left(y- f^*(\mathbf{ x})\right)^2 \mathbb{P}(y, \mathbf{ x}) \text{d}y\text{d}\mathbf{ x}\\
& \triangleq \mathbb{E}\left[ \left(y- f^*(\mathbf{ x})\right)^2 \right].
\end{align*}\]
Decompose
Decompose as \[
\begin{align*}
\mathbb{E}\left[ \left(y- f^*(\mathbf{ x})\right)^2 \right] = & \text{bias}\left[f^*(\mathbf{ x})\right]^2 \\
& + \text{variance}\left[f^*(\mathbf{ x})\right] \\
& + \sigma^2,
\end{align*}
\] where \(\sigma^2\) is the noise variance.
Variance
Given by \[
\text{variance}\left[f^*(\mathbf{ x})\right] = \mathbb{E}\left[\left(f^*(\mathbf{ x}) - \mathbb{E}\left[f^*(\mathbf{ x})\right]\right)^2\right].
\]
Slight variations in the training set cause changes in the prediction. Error due to variance arises when the model is overly complex, so that small changes in the training set lead to large changes in the predictions.
Bias-Variance Analysis for Olympic Marathon Data
Bias-variance tradeoff for polynomial models on Olympic marathon data. The bias decreases with model complexity while variance increases. The optimal model balances these two sources of error.
No Free Lunch Theorem
No universally best learner across all data-generating processes
Performance gains arise from assumptions (inductive bias) matching the task
Implication: model choice and regularisation must reflect prior beliefs about the problem
Position in this Lecture
After bias–variance and bootstrap: why we cannot expect one degree/basis to win universally
Connects to cross-validation and model selection: choose bias consistent with data
Motivates regularisation: encode assumptions to trade bias for variance appropriately
Practical Takeaways
Use validation to select assumptions that fit the domain
Prefer simple models unless evidence supports added complexity
Make assumptions explicit (document basis, priors, regularisers)
Regularisation
Linear system, solve:
\[
\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}\mathbf{ w}= \boldsymbol{ \Phi}^\top\mathbf{ y}
\] But if \(\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}\) is singular or poorly conditioned then this is not well posed.
Tikhonov Regularisation
Updated objective: \[
L(\mathbf{ w}) = (\mathbf{ y}- \mathbf{ f})^\top(\mathbf{ y}- \mathbf{ f}) + \alpha\left\Vert \mathbf{ w} \right\Vert_2^2,
\] where \(\mathbf{ f} = \boldsymbol{ \Phi}\mathbf{ w}\).
Hessian: \[
\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}+ \alpha \mathbf{I}
\]
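A sketch (assuming Phi, y and a regularisation coefficient alpha) of solving the regularised system:

import numpy as np

def fit_regularised(Phi, y, alpha):
    "Solve the Tikhonov-regularised system (Phi^T Phi + alpha I) w = Phi^T y."
    m = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + alpha * np.eye(m), Phi.T @ y)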
Splines, Functions, Hilbert Kernels
Can also regularize the function \(f(\cdot)\) directly.
This approach is taken in splines (Wahba, 1990) and kernel methods (Schölkopf and Smola, 2001).
Mathematically more elegant, but algorithmically less flexible and harder to scale.
Training with Noise
Other regularisation approaches such as dropout (Srivastava et al., 2014)
Often perturbing the neural network structure or inputs.
Can have elegant interpretations (see e.g. Bishop, 1995).
Also interpreted as ensemble or Bayesian methods.