Basis Functions and Generalisation 
  Neil Lawrence
  2025-09-09 
  Dedan Kimathi University, Nyeri, Kenya
 
Nonlinear Regression with Linear Models 
 
Nonlinear Regression 
Problem with Linear Regression—\(\mathbf{ x}\)  may not be linearly related to \(\mathbf{ y}\) . 
Potential solution: create a feature space: define \(\phi(\mathbf{ x})\)  where \(\phi(\cdot)\)  is a nonlinear function of \(\mathbf{ x}\) . 
Model for target is a linear combination of these nonlinear functions \[f(\mathbf{ x}) = \sum_{j=1}^mw_j \phi_j(\mathbf{ x})\]  
 
 
Basis Functions 
Instead of working in the input space \(\mathbf{ x}\), 
build models in a new space \(\boldsymbol{ \phi}(\mathbf{ x})\). 
 
 
Quadratic Basis 
Basis functions can be global. E.g. quadratic basis: \[
\boldsymbol{ \phi}^\top = [1, x, x^2]
\]  
 
\[
\begin{align*}
\phi_1(x) & = 1, \\
\phi_2(x) & = x, \\
\phi_3(x) & = x^2.
\end{align*}
\] 
 
Quadratic Basis 
\[
\boldsymbol{ \phi}(x) = \begin{bmatrix} 1\\ x\\ x^2\end{bmatrix}.
\] 
 
Design Matrix 
\[
\boldsymbol{ \Phi}(\mathbf{ x}) = 
\begin{bmatrix} 1 & x_1 &
x_1^2 \\
1 & x_2 & x_2^2\\
\vdots & \vdots & \vdots \\
1 & x_n& x_n^2
\end{bmatrix}
\] 
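As a sketch (not from the original slides), the quadratic design matrix can be assembled column by column in NumPy; the inputs here are illustrative placeholders.

import numpy as np

# Illustrative scalar inputs standing in for x_1, ..., x_n.
x = np.linspace(-1, 1, 5)

# Quadratic design matrix: one row per data point, columns 1, x, x^2.
Phi = np.column_stack([np.ones_like(x), x, x**2])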
 
Functions Derived from Quadratic Basis 
\[
f(x) = {\color{cyan}{w_0}} + {\color{green}{w_1 x}} + {\color{yellow}{w_2 x^2}}
\] 
 
 
 
 
Quadratic Functions 
\[
f(x) = {\color{cyan}{w_0}}   + {\color{green}{w_1 x}} + {\color{yellow}{w_2 x^2}}
\] 
 
 
 
 
Choice of Basis 
The polynomial represents one choice of basis 
 
\[
\phi_j(x_i) = x_i^j
\] 
 
Polynomial Basis 
\[
\phi_j(x) = x^j
\] 
 
 
 
 
 
 
Functions Derived from Polynomial Basis 
\[
f(x) = {\color{cyan}{w_0}} + {\color{green}{w_1 x}} + {\color{yellow}{w_2 x^2}} + {\color{magenta}{w_3 x^3}} + {\color{red}{w_4 x^4}}
\] 
 
 
 
 
Different Basis 
Polynomial basis widely used in engineering and graphics. 
Drawback in ML: the value rises quickly when \(| \mathbf{ x}| > 1\). 
 
 
Radial Basis Functions 
Basis functions can be local e.g. radial (or Gaussian) basis \[
\phi_j(x) = \exp\left(-\frac{(x-\mu_j)^2}{\ell^2}\right)
\]  
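A minimal NumPy sketch of this basis; the centres mu and lengthscale ell are illustrative choices rather than values from the slides.

import numpy as np

def radial_basis(x, mu, ell=1.0):
    # Column j holds exp(-(x - mu_j)^2 / ell^2) evaluated at every input.
    return np.exp(-(x[:, None] - mu[None, :])**2 / ell**2)

Phi = radial_basis(np.linspace(-2, 2, 100), mu=np.array([-1.0, 0.0, 1.0]))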
 
 
Radial Basis Functions 
 
 
 
 
Functions Derived from Radial Basis 
\[
f(x) = \color{cyan}{w_1 e^{-2(x+1)^2}}  + \color{green}{w_2e^{-2x^2}} + \color{yellow}{w_3 e^{-2(x-1)^2}}
\] 
 
 
 
 
Rectified Linear Units 
The rectified linear unit (ReLU) is a basis function used in neural networks; here \(H(\cdot)\)  is the Heaviside step function. 
 
\[
\phi_j(x) = xH(x+ v_j)
\] 
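A small sketch of this basis in NumPy; the offsets v are illustrative and the function name is my own.

import numpy as np

def relu_basis(x, v):
    # phi_j(x) = x * H(x + v_j): zero when x + v_j <= 0, equal to x otherwise.
    return x[:, None] * (x[:, None] + v[None, :] > 0)

Phi = relu_basis(np.linspace(-2, 2, 100), v=np.array([1.0, 0.33, -0.33, -1.0]))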
 
 
 
 
 
 
Functions Derived from Relu Basis 
\[
f(x) = \color{cyan}{w_0}   + \color{green}{w_1 xH(x+1.0) } + \color{yellow}{w_2 xH(x+0.33) } + \color{magenta}{w_3 xH(x-0.33)} +  \color{red}{w_4 xH(x-1.0)}
\] 
 
 
 
 
Hyperbolic Tangent Basis 
The hyperbolic tangent was formerly popular for neural networks. 
 
\[
\phi_j(x) = \tanh(v_j x+ v_0)
\] 
 
 
 
 
 
 
Functions Derived from Tanh Basis 
\[
f(x) = {\color{cyan}{w_0}}   + {\color{green}{w_1 \text{tanh}\left(x+1\right)}}  + {\color{yellow}{w_2 \text{tanh}\left(x+0.33\right)}}  + {\color{magenta}{w_3 \text{tanh}\left(x-0.33\right)}} + {\color{red}{w_4 \text{tanh}\left(x-1\right)}}
\] 
 
 
 
 
Fourier Basis 
In signal processing we often use the Fourier basis 
 
\[
f(x) = w_0  + w_1 \sin(x) + w_2 \cos(x) + w_3 \sin(2x) + w_4 \cos(2x)
\] 
 
 
 
 
 
 
Functions Derived from Fourier Basis 
\[
f(x) = {\color{cyan}{w_0}}  + {\color{green}{w_1 \sin(x)}} + {\color{yellow}{w_2 \cos(x)}} + {\color{magenta}{w_3 \sin(2x)}} + {\color{red}{w_4 \cos(2x)}}
\] 
 
 
 
 
Fitting Basis Function Models 
 
Olympic Marathon Data 
Gold medal times for Olympic Marathon since 1896. 
Marathons before 1924 didn’t have a standardized distance. 
Present results using pace per km. 
In 1904 the marathon was badly organized, leading to very slow times. 
 
 
Image from Wikimedia Commons  
 
 
 
Olympic Marathon Data 
Olympic marathon pace times since 1896.
 
 
Notebook Example 
In the notebook you are asked to scale the weights to fit functions to Olympic Marathon data. 
 
 
Basis Function Models 
The prediction function is now defined as \[
f(\mathbf{ x}_i) = \sum_{j=1}^mw_j \phi_{i, j},
\]  where \(\phi_{i, j} = \phi_j(\mathbf{ x}_i)\). 
 
 
Vector Notation 
Write in vector notation, \[
f(\mathbf{ x}_i) = \mathbf{ w}^\top \boldsymbol{ \phi}_i
\]  
 
 
Log Likelihood for Basis Function Model 
The likelihood of a single data point is \[
p\left(y_i|x_i\right)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\left(y_i-\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\right)^{2}}{2\sigma^2}\right).
\]  
 
 
Log Likelihood for Basis Function Model 
Leading to a log likelihood for the data set of
\[\begin{aligned}
L(\mathbf{ w},\sigma^2)= & -\frac{n}{2}\log \sigma^2-\frac{n}{2}\log 2\pi \\ & -\frac{\sum_{i=1}^{n}\left(y_i-\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\right)^{2}}{2\sigma^2}.
\end{aligned}\]  
 
 
Objective Function 
And a corresponding objective function  of the form \[
L(\mathbf{ w},\sigma^2)= \frac{n}{2}\log\sigma^2 + \frac{\sum_{i=1}^{n}\left(y_i-\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\right)^{2}}{2\sigma^2}.
\]  
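As code, the objective is the log-variance term plus the scaled sum of squared residuals; a sketch assuming Phi, y, w and sigma2 are NumPy arrays/scalars with compatible shapes.

import numpy as np

def objective(w, sigma2, Phi, y):
    # n/2 log sigma^2 plus the sum of squared residuals divided by 2 sigma^2.
    n = y.shape[0]
    residuals = y - Phi @ w
    return 0.5 * n * np.log(sigma2) + 0.5 * np.sum(residuals**2) / sigma2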
 
 
Expand the Brackets 
\[\begin{aligned}
  L(\mathbf{ w},\sigma^2) = &\frac{n}{2}\log \sigma^2 + \frac{1}{2\sigma^2}\sum_{i=1}^{n}y_i^{2} \\ & -\frac{1}{\sigma^2}\sum_{i=1}^{n}y_i\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\\ &+\frac{1}{2\sigma^2}\sum_{i=1}^{n}\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^{\top}\mathbf{ w}+\text{const}.
\end{aligned}\] 
 
Expand the Brackets 
\[\begin{aligned} L(\mathbf{ w}, \sigma^2) = & \frac{n}{2}\log \sigma^2 + \frac{1}{2\sigma^2}\sum_{i=1}^{n}y_i^{2} \\ & -\frac{1}{\sigma^2} \mathbf{ w}^\top\sum_{i=1}^{n}\boldsymbol{ \phi}_i y_i\\ & +\frac{1}{2\sigma^2}\mathbf{ w}^{\top}\left[\sum_{i=1}^{n}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^{\top}\right]\mathbf{ w}+\text{const}.\end{aligned}\] 
 
Design Matrices 
Design matrix notation \[
\boldsymbol{ \Phi}= \begin{bmatrix} \mathbf{1} & \mathbf{ x}& \mathbf{ x}^2\end{bmatrix}
\]  so that \[
\boldsymbol{ \Phi}\in \Re^{n\times p}.
\]  
 
 
Multivariate Derivatives Reminder 
\[\frac{\text{d}\mathbf{a}^{\top}\mathbf{ w}}{\text{d}\mathbf{ w}}=\mathbf{a}\]  \[\frac{\text{d}\mathbf{ w}^{\top}\mathbf{A}\mathbf{ w}}{\text{d}\mathbf{ w}}=\left(\mathbf{A}+\mathbf{A}^{\top}\right)\mathbf{ w}\] 
or
\[\frac{\text{d}\mathbf{ w}^{\top}\mathbf{A}\mathbf{ w}}{\text{d}\mathbf{ w}}=2\mathbf{A}\mathbf{ w}\] 
for symmetric \(\mathbf{A}\) .
 
Differentiate 
Differentiate wrt \(\mathbf{ w}\) 
\[\begin{aligned}
\frac{\text{d} L\left(\mathbf{ w},\sigma^2 \right)}{\text{d}\mathbf{ w}}= & -\frac{1}{\sigma^2} \sum_{i=1}^{n}\boldsymbol{ \phi}_iy_i \\ & +\frac{1}{\sigma^2} \left[\sum_{i=1}^{n}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^{\top}\right]\mathbf{ w}
\end{aligned}\] 
 
Find Stationary Point 
Set to zero leading to \[\left[\sum_{i=1}^{n}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^{\top}\right]\mathbf{ w}^{*}=\sum_{i=1}^{n}\boldsymbol{ \phi}_iy_i.\]  
 
 
Matrix Notation 
\[
\sum_{i=1}^{n}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^\top = \boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\]  \[\sum _{i=1}^{n}\boldsymbol{ \phi}_iy_i = \boldsymbol{ \Phi}^\top \mathbf{ y}
\] 
 
Update Equations 
To find \(\mathbf{ w}^{*}\)  solve \[
\left(\boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\right)  \mathbf{ w}^{*} = \boldsymbol{ \Phi}^\top \mathbf{ y}
\]  
The equation for \(\left.\sigma^2\right.^{*}\)  may also be found \[
\left.\sigma^2\right.^{*}=\frac{\sum_{i=1}^{n}\left(y_i-\left.\mathbf{ w}^{*}\right.^{\top}\boldsymbol{ \phi}_i\right)^{2}}{n}\]  
 
 
Avoid Direct Inverse 
E.g. Solve for \(\mathbf{ w}\)  \[
\left(\boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\right)\mathbf{ w}= \boldsymbol{ \Phi}^\top \mathbf{ y}
\]  
 
This is a linear system of the standard form \[
\mathbf{A}\mathbf{x} = \mathbf{b}.
\] 
 
Solving 
See np.linalg.solve (a direct solve is sketched below). 
In practice use \(\mathbf{Q}\mathbf{R}\)  decomposition (see lab class notes). 
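A sketch of the direct solve with np.linalg.solve; the data here are synthetic, not the marathon data.

import numpy as np

# Synthetic data for illustration: a noisy quadratic.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = 2.0 - x + 0.5 * x**2 + 0.1 * rng.standard_normal(x.shape)

Phi = np.column_stack([np.ones_like(x), x, x**2])  # quadratic design matrix
w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # solve the normal equations directly
sigma2_star = np.mean((y - Phi @ w_star)**2)       # maximum likelihood noise variance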
 
 
Solution with QR Decomposition 
\[
\boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\boldsymbol{\beta} =
\boldsymbol{ \Phi}^\top \mathbf{ y}
\]  substitute \(\boldsymbol{ \Phi}= \mathbf{Q}\mathbf{R}\)  \[
(\mathbf{Q}\mathbf{R})^\top
(\mathbf{Q}\mathbf{R})\boldsymbol{\beta} = (\mathbf{Q}\mathbf{R})^\top
\mathbf{ y}
\]  \[
\mathbf{R}^\top (\mathbf{Q}^\top \mathbf{Q}) \mathbf{R}
\boldsymbol{\beta} = \mathbf{R}^\top \mathbf{Q}^\top \mathbf{ y}
\] 
 
\[
\mathbf{R}^\top \mathbf{R} \boldsymbol{\beta} = \mathbf{R}^\top \mathbf{Q}^\top
\mathbf{ y}
\]  \[
\mathbf{R} \boldsymbol{\beta} = \mathbf{Q}^\top \mathbf{ y}
\] 
 
More numerically stable. 
Avoids the intermediate computation of \(\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}\) . 
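A sketch of the QR route in NumPy (reduced QR, so \(\mathbf{R}\) is square; the function name is my own).

import numpy as np

def solve_qr(Phi, y):
    # Phi = Q R with Q^T Q = I, so the normal equations reduce to R w = Q^T y.
    Q, R = np.linalg.qr(Phi)
    return np.linalg.solve(R, Q.T @ y)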
 
 
Non-linear but Linear in the Parameters 
Model is non-linear, but linear in parameters \[
f(\mathbf{ x}) = \mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x})
\]  
\(\mathbf{ x}\)  is inside the non-linearity, but \(\mathbf{ w}\)  is outside. \[
f(\mathbf{ x}) = \mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x};
\mathbf{ v}),
\]  where \(\mathbf{ v}\)  are the parameters of the basis functions. 
 
Polynomial Fits to Olympic Marathon Data 
Fit linear model with polynomial basis to marathon data. 
Try different numbers of basis functions (different degrees of polynomial). 
Check the quality of fit. 
 
 
Linear Fit 
\[f(x, \mathbf{ w}) = w_0 + w_1x\] 
Fit of a 1-degree polynomial (a linear model) to the Olympic marathon data.
 
 
Cubic Fit 
\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + w_3 x^3\] 
Fit of a 3-degree polynomial (a cubic model) to the Olympic marathon data.
 
 
9th Degree Polynomial Fit 
\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + \dots + w_9 x^9\] 
Fit of a 9-degree polynomial to the Olympic marathon data.
 
 
16th Degree Polynomial Fit 
\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + \dots + w_{16} x^{16}\] 
Fit of a 16-degree polynomial to the Olympic marathon data.
 
 
28th Degree Polynomial Fit 
\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + \dots + w_{28} x^{28}\] 
Fit of a 28-degree polynomial to the Olympic marathon data.
 
 
Polynomial Fits to Olympic Data 
 
Empirical Risk Minimisation 
 
Expected Loss 
\[
R(\mathbf{ w}) = \int L(y, x, \mathbf{ w}) \mathbb{P}(y, x) \text{d}y
\text{d}x.
\] 
 
Loss Function 
Here \(L(\cdot)\)  is the loss function. 
Different interpretation of the objective . 
The cost you pay for mistakes. 
 
 
Sample-Based Approximations 
Sample based approximation: replace true expectation with sum over samples. \[
\int f(z) p(z) \text{d}z\approx \frac{1}{s}\sum_{i=1}^s f(z_i).
\] 
Allows us to approximate true integral with a sum \[
R(\mathbf{ w}) \approx \frac{1}{n}\sum_{i=1}^{n} L(y_i, x_i, \mathbf{ w}).
\] 
 
 
Empirical Risk Minimization 
If the loss is the squared loss  \[
L(y, x, \mathbf{ w}) = (y-\mathbf{ w}^\top\boldsymbol{\phi}(x))^2,
\]  
This recovers the empirical risk  \[
R(\mathbf{ w}) \approx \frac{1}{n} \sum_{i=1}^{n}
(y_i - \mathbf{ w}^\top \boldsymbol{\phi}(x_i))^2
\]  
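In code this is just the mean squared error over the training sample; a minimal sketch with assumed names.

import numpy as np

def empirical_risk(w, Phi, y):
    # Average squared loss over the n observed points.
    return np.mean((y - Phi @ w)**2)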
 
 
Estimating Risk through Validation 
 
Validation on the Olympic Marathon Data 
 
Polynomial Fit: Training Error 
Next we consider a quadratic fit and compute the training error for both the linear and quadratic fits.
 
Polynomial Fits to Olympics Data 
 
Hold Out Validation on Olympic Marathon Data 
 
Overfitting 
Increase number of basis functions we obtain a better ‘fit’ to the data. 
How will the model perform on previously unseen data? 
Let’s consider predicting the future. 
 
 
Interpolation 
Predicting the winning time for the 1946 Olympics is interpolation. 
This is because we have times from 1936 and 1948. 
If we want a model for interpolation, how can we test it? 
One trick is to sample the validation set from throughout the data set. 
 
 
Future Prediction: Interpolation 
 
Choice of Validation Set 
The choice of validation set should reflect how you will use the model in practice. 
For extrapolation into the future we tried validating with data from the future. 
For interpolation we chose the validation set from throughout the data set. 
For different validation sets we could get different results. 
 
 
Leave One Out Error 
Take training set and remove one point. 
Train on the remaining data. 
Compute the error on the point you removed (which wasn’t in the training data). 
Do this for each point in the training set in turn. 
Average the resulting error. 
This is the leave one out error. 
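A sketch of the procedure for a least squares fit on a design matrix (names are illustrative).

import numpy as np

def leave_one_out_error(Phi, y):
    # Predict each point from a model trained on all the other points.
    n = y.shape[0]
    errors = np.empty(n)
    for i in range(n):
        train = np.delete(np.arange(n), i)
        w = np.linalg.solve(Phi[train].T @ Phi[train], Phi[train].T @ y[train])
        errors[i] = (y[i] - Phi[i] @ w)**2
    return errors.mean()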
 
 
\(k\) -fold Cross Validation
Leave one out error can be very time consuming. 
Need to train your algorithm \(n\)  times. 
An alternative: \(k\)-fold cross validation. 
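A sketch of the \(k\)-fold variant (leave one out is recovered when \(k= n\)); again the fit is a plain least squares solve.

import numpy as np

def k_fold_error(Phi, y, k=5, seed=0):
    # Split the indices into k folds, train on k-1 folds, validate on the held-out fold.
    n = y.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        w = np.linalg.solve(Phi[train].T @ Phi[train], Phi[train].T @ y[train])
        errors.append(np.mean((y[fold] - Phi[fold] @ w)**2))
    return np.mean(errors)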
 
 
The Bootstrap 
\[
\mathbf{ y}, \mathbf{X}\sim \mathbb{P}(y, \mathbf{ x})
\] 
 
Resample Dataset 
import numpy as np

def bootstrap(X):
    "Return a bootstrap sample from a data set."
    n = X.shape[0]
    ind = np.random.choice(n, n, replace=True)  # Sample randomly with replacement.
    return X[ind, :]
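In practice the rows of the design matrix and the targets must be resampled together; a hedged sketch of using the idea to collect bootstrap fits (function and argument names are my own).

import numpy as np

def bootstrap_fit(Phi, y, num_bootstraps=100, seed=0):
    # Refit the least squares weights on bootstrap resamples of (Phi, y).
    rng = np.random.default_rng(seed)
    n = y.shape[0]
    fits = []
    for _ in range(num_bootstraps):
        ind = rng.choice(n, n, replace=True)  # sample rows with replacement
        fits.append(np.linalg.solve(Phi[ind].T @ Phi[ind], Phi[ind].T @ y[ind]))
    return np.array(fits)  # one weight vector per bootstrap sample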
Bootstrap and Olympic Marathon Data 
 
Linear Fit 
\[f(x, \mathbf{ w}) = w_0 + w_1 x\] 
Fit of a 1-degree polynomial (a linear model) to the Olympic marathon data.
 
 
Cubic Fit 
\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + w_{3} x^3\] 
Fit of a 3-degree polynomial (a cubic model) to the Olympic marathon data.
 
 
9th Degree Polynomial Fit 
\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + \dots + w_{9} x^{9}\] 
Fit of a 9-degree polynomial to the Olympic marathon data.
 
 
16th Degree Polynomial Fit 
\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + \dots + w_{16} x^{16}\] 
Fit of a 16-degree polynomial to the Olympic marathon data.
 
 
Bootstrap Confidence Intervals 
Bootstrap confidence intervals for a cubic polynomial fit to Olympic marathon data. The shaded region shows the 95% confidence interval, and individual bootstrap fits are shown in light green.
 
 
Bias Variance Decomposition 
Generalisation error \[\begin{align*}
R(\mathbf{ w}) = & \int \left(y- f^*(\mathbf{ x})\right)^2 \mathbb{P}(y, \mathbf{ x}) \text{d}y\text{d}\mathbf{ x}\\
& \triangleq \mathbb{E}\left[ \left(y- f^*(\mathbf{ x})\right)^2 \right].
\end{align*}\] 
 
Decompose 
Decompose as \[
\begin{align*}
\mathbb{E}\left[ \left(y- f^*(\mathbf{ x})\right)^2 \right] = & \text{bias}\left[f^*(\mathbf{ x})\right]^2 \\
& + \text{variance}\left[f^*(\mathbf{ x})\right] + \sigma^2,
\end{align*}
\] 
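For completeness (my wording, not from the slides), the bias term is conventionally the difference between the average prediction, averaged over training sets, and the optimal prediction:
\[
\text{bias}\left[f^*(\mathbf{ x})\right] = \mathbb{E}\left[f^*(\mathbf{ x})\right] - \mathbb{E}\left[y|\mathbf{ x}\right].
\]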
 
Variance 
Given by \[
\text{variance}\left[f^*(\mathbf{ x})\right] = \mathbb{E}\left[\left(f^*(\mathbf{ x}) - \mathbb{E}\left[f^*(\mathbf{ x})\right]\right)^2\right].
\] 
Slight variations in the training set cause changes in the prediction. Error due to variance arises when the model is overly complex and therefore overly sensitive to the particular training set.
 
 
Bias-Variance Analysis for Olympic Marathon Data 
Bias-variance tradeoff for polynomial models on Olympic marathon data. The bias decreases with model complexity while variance increases. The optimal model balances these two sources of error.
 
 
No Free Lunch Theorem 
No universally best learner across all data-generating processes 
Performance gains arise from assumptions (inductive bias) matching the task 
Implication: model choice and regularisation must reflect prior beliefs about the problem 
 
 
Position in this Lecture 
After bias–variance and bootstrap: why we cannot expect one degree/basis to win universally 
Connects to cross-validation and model selection: choose bias consistent with data 
Motivates regularisation: encode assumptions to trade bias for variance appropriately 
 
 
Practical Takeaways 
Use validation to select assumptions that fit the domain 
Prefer simple models unless evidence supports added complexity 
Make assumptions explicit (document basis, priors, regularisers) 
 
 
Regularisation 
Linear system, solve:
\[
\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}\mathbf{ w}= \boldsymbol{ \Phi}^\top\mathbf{ y}
\]  But if \(\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}\)  is singular (or badly conditioned) then this is not well posed.
 
Tikhonov Regularisation 
Updated objective: \[
L(\mathbf{ w}) = (\mathbf{ y}- \mathbf{ f})^\top(\mathbf{ y}- \mathbf{ f}) + \alpha\left\Vert \mathbf{ w} \right\Vert_2^2
\]  
Hessian: \[
\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}+ \alpha \mathbf{I}
\]  
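A sketch of the regularised solve (often called ridge regression); the value of alpha is a free choice that would normally be set by validation.

import numpy as np

def ridge_fit(Phi, y, alpha=1.0):
    # Solve (Phi^T Phi + alpha I) w = Phi^T y.
    p = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + alpha * np.eye(p), Phi.T @ y)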
 
 
Splines, Functions, Hilbert Kernels 
We can also regularise the function \(f(\cdot)\)  directly. 
This approach is taken in splines (Wahba, 1990) and kernel methods (Schölkopf and Smola, 2001). 
Mathematically more elegant, but algorithmically less flexible and harder to scale. 
 
 
Training with Noise 
Other regularisation approaches include dropout (Srivastava et al., 2014). 
These often work by perturbing the neural network structure or inputs. 
They can have elegant interpretations (see e.g. Bishop (1995)). 
They can also be interpreted as ensemble or Bayesian methods.