Basis Functions and Generalisation

Neil Lawrence

Dedan Kimathi University, Nyeri, Kenya

Nonlinear Regression with Linear Models

Nonlinear Regression

  • Problem with Linear Regression—\(\mathbf{ x}\) may not be linearly related to \(\mathbf{ y}\).
  • Potential solution: create a feature space: define \(\phi(\mathbf{ x})\) where \(\phi(\cdot)\) is a nonlinear function of \(\mathbf{ x}\).
  • Model for target is a linear combination of these nonlinear functions \[f(\mathbf{ x}) = \sum_{j=1}^mw_j \phi_j(\mathbf{ x})\]

Non-linear in the Inputs

  • Go from \[ f(\mathbf{ x}) = \mathbf{ w}^\top \mathbf{ x} \] to \[ f(\mathbf{ x}) = \mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}) \]

Basis Functions

Basis Functions

  • Instead of working in the input space, \(\mathbf{ x}\),
  • build models in a new space, \(\boldsymbol{ \phi}(\mathbf{ x})\).

Quadratic Basis

  • Basis functions can be global. E.g. quadratic basis: \[ \boldsymbol{ \phi}^\top = [1, x, x^2] \]

\[ \begin{align*} \phi_1(x) & = 1, \\ \phi_2(x) & = x, \\ \phi_3(x) & = x^2. \end{align*} \]

Quadratic Basis

\[ \boldsymbol{ \phi}(x) = \begin{bmatrix} 1\\ x\\ x^2\end{bmatrix}. \]

Design Matrix

\[ \boldsymbol{ \Phi}(\mathbf{ x}) = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2\\ \vdots & \vdots & \vdots \\ 1 & x_n& x_n^2 \end{bmatrix} \]
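A minimal NumPy sketch of building this design matrix (the function name quadratic_design is an illustrative choice, not taken from the accompanying notebook):

import numpy as np

def quadratic_design(x):
    "Stack the quadratic basis [1, x, x^2] into an n x 3 design matrix."
    x = np.asarray(x, dtype=float).ravel()   # flatten the inputs into a vector
    return np.column_stack([np.ones_like(x), x, x**2])

# Example: three inputs give a 3 x 3 design matrix.
Phi = quadratic_design([0.0, 1.0, 2.0])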

Functions Derived from Quadratic Basis

\[ f(x) = {\color{cyan}{w_0}} + {\color{green}{w_1 x}} + {\color{yellow}{w_2 x^2}} \]

Quadratic Functions

\[ f(x) = {\color{cyan}{w_0}} + {\color{green}{w_1 x}} + {\color{yellow}{w_2 x^2}} \]

Choice of Basis

  • The polynomial represents one choice of basis

\[ \phi_j(x_i) = x_i^j \]

Polynomial Basis

\[ \phi_j(x) = x^j \]

Functions Derived from Polynomial Basis

\[ f(x) = {\color{cyan}{w_0}} + {\color{green}{w_1 x}} + {\color{yellow}{w_2 x^2}} + {\color{magenta}{w_3 x^3}} + {\color{red}{w_4 x^4}} \]

Different Basis

  • Polynomial basis widely used in engineering and graphics.
  • Drawback in machine learning: the basis values grow rapidly when \(| \mathbf{ x}| > 1\).

Radial Basis Functions

  • Basis functions can be local e.g. radial (or Gaussian) basis \[ \phi_j(x) = \exp\left(-\frac{(x-\mu_j)^2}{\ell^2}\right) \]
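A sketch of the corresponding design matrix in NumPy; the centres mu and lengthscale ell below are illustrative parameters, not values from the lecture:

import numpy as np

def rbf_design(x, mu, ell=1.0):
    "Design matrix of radial basis functions exp(-(x - mu_j)^2 / ell^2)."
    x = np.asarray(x, dtype=float).reshape(-1, 1)    # column of inputs
    mu = np.asarray(mu, dtype=float).reshape(1, -1)  # row of basis centres
    return np.exp(-(x - mu)**2 / ell**2)

Phi = rbf_design(np.linspace(-2, 2, 5), mu=[-1.0, 0.0, 1.0], ell=0.7)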

Radial Basis Functions

Functions Derived from Radial Basis

\[ f(x) = \color{cyan}{w_1 e^{-2(x+1)^2}} + \color{green}{w_2e^{-2x^2}} + \color{yellow}{w_3 e^{-2(x-1)^2}} \]

Rectified Linear Units

  • The ReLU is a basis function used in neural nets

\[ \phi_j(x) = xH(x+ v_j) \]
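A sketch of the ReLU design matrix with the Heaviside step \(H\) implemented as a boolean mask; the knot locations v are illustrative:

import numpy as np

def relu_design(x, v):
    "Design matrix of ReLU basis functions x * H(x + v_j)."
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    v = np.asarray(v, dtype=float).reshape(1, -1)
    return x * (x + v > 0)   # Heaviside step H as a boolean mask

Phi = relu_design(np.linspace(-2, 2, 5), v=[1.0, 0.33, -0.33, -1.0])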

Functions Derived from ReLU Basis

\[ f(x) = \color{cyan}{w_0} + \color{green}{w_1 xH(x+1.0) } + \color{yellow}{w_2 xH(x+0.33) } + \color{magenta}{w_3 xH(x-0.33)} + \color{red}{w_4 xH(x-1.0)} \]

Hyperbolic Tangent Basis

  • The hyperbolic tangent was formerly popular for neural nets.

\[ \phi_j(x) = \tanh(v_j x+ v_0) \]

Functions Derived from Tanh Basis

\[ f(x) = {\color{cyan}{w_0}} + {\color{green}{w_1 \text{tanh}\left(x+1\right)}} + {\color{yellow}{w_2 \text{tanh}\left(x+0.33\right)}} + {\color{magenta}{w_3 \text{tanh}\left(x-0.33\right)}} + {\color{red}{w_4 \text{tanh}\left(x-1\right)}} \]

Fourier Basis

  • In signal processing we often use the Fourier basis

\[ f(x) = w_0 + w_1 \sin(x) + w_2 \cos(x) + w_3 \sin(2x) + w_4 \cos(2x) \]

Functions Derived from Fourier Basis

\[ f(x) = {\color{cyan}{w_0}} + {\color{green}{w_1 \sin(x)}} + {\color{yellow}{w_2 \cos(x)}} + {\color{magenta}{w_3 \sin(2x)}} + {\color{red}{w_4 \cos(2x)}} \]
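A sketch of a Fourier design matrix with a constant column and sine/cosine pairs; num_frequencies is an illustrative parameter:

import numpy as np

def fourier_design(x, num_frequencies=2):
    "Design matrix [1, sin(x), cos(x), sin(2x), cos(2x), ...]."
    x = np.asarray(x, dtype=float).ravel()
    columns = [np.ones_like(x)]
    for k in range(1, num_frequencies + 1):
        columns.append(np.sin(k * x))
        columns.append(np.cos(k * x))
    return np.column_stack(columns)

Phi = fourier_design(np.linspace(0, 2 * np.pi, 8), num_frequencies=2)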

Fitting Basis Function Models

Fitting to Data

Olympic Marathon Data

  • Gold medal times for Olympic Marathon since 1896.
  • Marathons before 1924 didn’t have a standardized distance.
  • Present results using pace per km.
  • In 1904 the marathon was badly organized, leading to very slow times.
Image from Wikimedia Commons

Olympic Marathon Data

Notebook Example

  • In the notebook you are asked to scale the weights to fit functions to Olympic Marathon data.

Basis Function Models

  • The prediction function is now defined as \[ f(\mathbf{ x}_i) = \sum_{j=1}^mw_j \phi_{i, j} \]

Vector Notation

  • Write in vector notation, \[ f(\mathbf{ x}_i) = \mathbf{ w}^\top \boldsymbol{ \phi}_i \]

Log Likelihood for Basis Function Model

  • The likelihood of a single data point is \[ p\left(y_i|x_i\right)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\left(y_i-\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\right)^{2}}{2\sigma^2}\right). \]

Log Likelihood for Basis Function Model

  • Leading to a log likelihood for the data set of \[\begin{aligned} L(\mathbf{ w},\sigma^2)= & -\frac{n}{2}\log \sigma^2-\frac{n}{2}\log 2\pi \\ & -\frac{\sum_{i=1}^{n}\left(y_i-\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\right)^{2}}{2\sigma^2}. \end{aligned}\]

Objective Function

  • And a corresponding objective function (the negative log likelihood, dropping constant terms) \[ E(\mathbf{ w},\sigma^2)= \frac{n}{2}\log\sigma^2 + \frac{\sum_{i=1}^{n}\left(y_i-\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\right)^{2}}{2\sigma^2}. \]
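A sketch of evaluating this objective for a given design matrix Phi, targets y, weights w and noise variance sigma2 (all names are illustrative):

import numpy as np

def objective(w, sigma2, Phi, y):
    "Objective E(w, sigma^2): n/2 log sigma^2 plus the scaled sum of squared errors."
    n = y.shape[0]
    residuals = y - Phi @ w
    return 0.5 * n * np.log(sigma2) + 0.5 * np.sum(residuals**2) / sigma2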

Expand the Brackets

\[\begin{aligned} E(\mathbf{ w},\sigma^2) = &\frac{n}{2}\log \sigma^2 + \frac{1}{2\sigma^2}\sum_{i=1}^{n}y_i^{2} \\ & -\frac{1}{\sigma^2}\sum_{i=1}^{n}y_i\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\\ &+\frac{1}{2\sigma^2}\sum_{i=1}^{n}\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^{\top}\mathbf{ w}+\text{const}. \end{aligned}\]

Expand the Brackets

\[\begin{aligned} E(\mathbf{ w}, \sigma^2) = & \frac{n}{2}\log \sigma^2 + \frac{1}{2\sigma^2}\sum_{i=1}^{n}y_i^{2} \\ & -\frac{1}{\sigma^2} \mathbf{ w}^\top\sum_{i=1}^{n}\boldsymbol{ \phi}_i y_i\\ & +\frac{1}{2\sigma^2}\mathbf{ w}^{\top}\left[\sum_{i=1}^{n}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^{\top}\right]\mathbf{ w}+\text{const}.\end{aligned}\]

Design Matrices

  • Design matrix notation \[ \boldsymbol{ \Phi}= \begin{bmatrix} \mathbf{1} & \mathbf{ x}& \mathbf{ x}^2\end{bmatrix} \] so that \[ \boldsymbol{ \Phi}\in \Re^{n\times p}. \]

Multivariate Derivatives Reminder

\[\frac{\text{d}\mathbf{a}^{\top}\mathbf{ w}}{\text{d}\mathbf{ w}}=\mathbf{a}\] \[\frac{\text{d}\mathbf{ w}^{\top}\mathbf{A}\mathbf{ w}}{\text{d}\mathbf{ w}}=\left(\mathbf{A}+\mathbf{A}^{\top}\right)\mathbf{ w}\]
or
\[\frac{\text{d}\mathbf{ w}^{\top}\mathbf{A}\mathbf{ w}}{\text{d}\mathbf{ w}}=2\mathbf{A}\mathbf{ w}\]
for symmetric \(\mathbf{A}\).

Differentiate

Differentiate wrt \(\mathbf{ w}\) \[\begin{aligned} \frac{\text{d} E\left(\mathbf{ w},\sigma^2 \right)}{\text{d}\mathbf{ w}}= & -\frac{1}{\sigma^2} \sum_{i=1}^{n}\boldsymbol{ \phi}_iy_i \\ & +\frac{1}{\sigma^2} \left[\sum_{i=1}^{n}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^{\top}\right]\mathbf{ w} \end{aligned}\]

Find Stationary Point

  • Set to zero leading to \[\left[\sum_{i=1}^{n}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^{\top}\right]\mathbf{ w}^{*}=\sum_{i=1}^{n}\boldsymbol{ \phi}_iy_i.\]

Matrix Notation

\[ \sum_{i=1}^{n}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^\top = \boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\] \[\sum _{i=1}^{n}\boldsymbol{ \phi}_iy_i = \boldsymbol{ \Phi}^\top \mathbf{ y} \]

Update Equations

  • To find \(\mathbf{ w}^{*}\) solve \[ \left(\boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\right) \mathbf{ w}^{*} = \boldsymbol{ \Phi}^\top \mathbf{ y} \]
  • The equation for \(\left.\sigma^2\right.^{*}\) may also be found \[ \left.\sigma^2\right.^{*}=\frac{\sum_{i=1}^{n}\left(y_i-\left.\mathbf{ w}^{*}\right.^{\top}\boldsymbol{ \phi}_i\right)^{2}}{n}\]
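A sketch of these two updates in NumPy, using a solve rather than an explicit inverse (as recommended below); fit_basis_model is an illustrative name:

import numpy as np

def fit_basis_model(Phi, y):
    "Solve the normal equations for w* and plug in for the noise variance."
    w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
    sigma2_star = np.mean((y - Phi @ w_star)**2)
    return w_star, sigma2_star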

Avoid Direct Inverse

  • E.g. Solve for \(\mathbf{ w}\) \[ \left(\boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\right)\mathbf{ w}= \boldsymbol{ \Phi}^\top \mathbf{ y} \]

\[ \mathbf{A}\mathbf{x} = \mathbf{b}. \]

Solving

  • See np.linalg.solve
  • In practice use \(\mathbf{Q}\mathbf{R}\) decomposition (see lab class notes).

Solution with QR Decomposition

\[ \boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\mathbf{ w}= \boldsymbol{ \Phi}^\top \mathbf{ y} \] substitute \(\boldsymbol{ \Phi}= \mathbf{Q}\mathbf{R}\) \[ (\mathbf{Q}\mathbf{R})^\top (\mathbf{Q}\mathbf{R})\mathbf{ w}= (\mathbf{Q}\mathbf{R})^\top \mathbf{ y} \] \[ \mathbf{R}^\top (\mathbf{Q}^\top \mathbf{Q}) \mathbf{R} \mathbf{ w}= \mathbf{R}^\top \mathbf{Q}^\top \mathbf{ y} \]

\[ \mathbf{R}^\top \mathbf{R} \mathbf{ w}= \mathbf{R}^\top \mathbf{Q}^\top \mathbf{ y} \] \[ \mathbf{R} \mathbf{ w}= \mathbf{Q}^\top \mathbf{ y} \]

  • More numerically stable.
  • Avoids the intermediate computation of \(\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}\).
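A sketch of the QR route (not the lab-class code itself): factorise \(\boldsymbol{ \Phi}\) once, then solve the triangular system \(\mathbf{R}\mathbf{ w}= \mathbf{Q}^\top \mathbf{ y}\).

import numpy as np

def fit_qr(Phi, y):
    "Fit by QR decomposition: solve R w = Q^T y without forming Phi^T Phi."
    Q, R = np.linalg.qr(Phi)            # thin QR: Q is n x p, R is p x p upper triangular
    return np.linalg.solve(R, Q.T @ y)  # a dedicated triangular solve would also do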

Non-linear but Linear in the Parameters

  • Model is non-linear, but linear in parameters \[ f(\mathbf{ x}) = \mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}) \]
  • \(\mathbf{ x}\) is inside the non-linearity, but \(\mathbf{ w}\) is outside; the basis can also carry its own parameters, \[ f(\mathbf{ x}) = \mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}; \mathbf{ v}). \]

Polynomial Fits to Olympic Marathon Data

  • Fit linear model with polynomial basis to marathon data.
  • Try different numbers of basis functions (different degrees of polynomial).
  • Check the quality of fit.

Linear Fit

\[f(x, \mathbf{ w}) = w_0 + w_1x\]

Cubic Fit

\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + w_3 x^3\]

9th Degree Polynomial Fit

\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + \dots + w_9 x^9\]

16th Degree Polynomial Fit

\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + \dots + w_{16} x^{16}\]

28th Degree Polynomial Fit

\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + \dots + w_{28} x^{28}\]

Polynomial Fits to Olympic Data

Empirical Risk Minimisation

Expected Loss

\[ R(\mathbf{ w}) = \int L(y, x, \mathbf{ w}) \mathbb{P}(y, x) \text{d}y \text{d}x. \]

Loss Function

  • Here \(L(\cdot)\) is the loss function.
  • Different interpretation of the objective.
  • The cost you pay for mistakes.

Sample-Based Approximations

  • Sample based approximation: replace true expectation with sum over samples. \[ \int f(z) p(z) \text{d}z\approx \frac{1}{s}\sum_{i=1}^s f(z_i). \]

  • Allows us to approximate true integral with a sum \[ R(\mathbf{ w}) \approx \frac{1}{n}\sum_{i=1}^{n} L(y_i, x_i, \mathbf{ w}). \]

Empirical Risk Minimization

  • If the loss is the squared loss \[ L(y, x, \mathbf{ w}) = (y-\mathbf{ w}^\top\boldsymbol{\phi}(x))^2, \]
  • This recovers the empirical risk \[ R(\mathbf{ w}) \approx \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{ w}^\top \boldsymbol{\phi}(x_i))^2 \]
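A sketch of the empirical risk for the squared loss, assuming a design matrix Phi built from the chosen basis (names illustrative):

import numpy as np

def empirical_risk(w, Phi, y):
    "Average squared loss over the n training points."
    return np.mean((y - Phi @ w)**2)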

Estimating Risk through Validation

Validation

Validation on the Olympic Marathon Data

Polynomial Fit: Training Error

Next we consider a quadratic fit and compute the training error for the two fits.

Polynomial Fits to Olympics Data

Hold Out Validation on Olympic Marathon Data

Overfitting

  • As we increase the number of basis functions we obtain a better ‘fit’ to the data.
  • How will the model perform on previously unseen data?
  • Let’s consider predicting the future.

Future Prediction: Extrapolation

Extrapolation

  • Here we are predicting beyond the region where the model has seen data.
  • This is known as extrapolation.
  • Extrapolation is predicting into the future here, but could be:
    • Predicting back to the unseen past (pre-1896)
    • Spatial prediction (e.g. Cholera rates outside Manchester given rates inside Manchester).

Interpolation

  • Predicting the winning time for the 1946 Olympics is interpolation.
  • This is because we have times from 1936 and 1948.
  • If we want a model for interpolation how can we test it?
  • One trick is to sample the validation set from throughout the data set.

Future Prediction: Interpolation

Choice of Validation Set

  • The choice of validation set should reflect how you will use the model in practice.
  • For extrapolation into the future we tried validating with data from the future.
  • For interpolation we chose the validation set from points spread throughout the data.
  • For different validation sets we could get different results.

Leave One Out Validation

Leave One Out Error

  • Take training set and remove one point.
  • Train on the remaining data.
  • Compute the error on the point you removed (which wasn’t in the training data).
  • Do this for each point in the training set in turn.
  • Average the resulting error.
  • This is the leave one out error.
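A sketch of the procedure, refitting the least squares solution \(n\) times on a given design matrix (the helper name is illustrative):

import numpy as np

def leave_one_out_error(Phi, y):
    "Average squared error when each point is held out of training in turn."
    n = y.shape[0]
    errors = np.zeros(n)
    for i in range(n):
        keep = np.arange(n) != i        # boolean mask dropping point i
        w = np.linalg.solve(Phi[keep].T @ Phi[keep], Phi[keep].T @ y[keep])
        errors[i] = (y[i] - Phi[i] @ w)**2
    return errors.mean()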

Leave One Out Validation

[Interactive figure: leave-one-out validation fits, with controls for the fold and the number of basis functions.]

\(k\)-fold Cross Validation

  • Leave one out error can be very time consuming.
  • Need to train your algorithm \(n\) times.
  • An alternative: \(k\)-fold cross validation.
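A sketch of \(k\)-fold cross validation with random folds (the helper name and the fixed seed are illustrative choices):

import numpy as np

def k_fold_error(Phi, y, k=5, seed=0):
    "Average validation error over k random folds."
    n = y.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)   # indices not in the held-out fold
        w = np.linalg.solve(Phi[train].T @ Phi[train], Phi[train].T @ y[train])
        errors.append(np.mean((y[fold] - Phi[fold] @ w)**2))
    return np.mean(errors)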

\(k\)-fold Cross Validation

[Interactive figure: \(k\)-fold cross validation fits, with controls for the fold and the number of basis functions.]

The Bootstrap

\[ \mathbf{ y}, \mathbf{X}\sim \mathbb{P}(y, \mathbf{ x}) \]

Resample Dataset

import numpy as np

def bootstrap(X):
    "Return a bootstrap sample from a data set."
    n = X.shape[0]
    ind = np.random.choice(n, n, replace=True) # Sample randomly with replacement.
    return X[ind, :]
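A hypothetical usage, resampling the rows of a stacked data matrix and refitting the model; Phi and y are assumed to be the design matrix and targets from earlier:

data = np.hstack([Phi, y.reshape(-1, 1)])               # keep inputs and targets paired while resampling
sample = bootstrap(data)
Phi_s, y_s = sample[:, :-1], sample[:, -1]
w_s = np.linalg.solve(Phi_s.T @ Phi_s, Phi_s.T @ y_s)   # refit on the bootstrap sample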

Bootstrap and Olympic Marathon Data

Linear Fit

\[f(x, \mathbf{ w}) = w_0 + w_1 x\]

Cubic Fit

\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + w_{3} x^3\]

9th Degree Polynomial Fit

\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + \dots + w_{9} x^{9}\]

16th Degree Polynomial Fit

\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + \dots + w_{16} x^{16}\]

Bootstrap Confidence Intervals

Bias Variance Decomposition

Generalisation error \[\begin{align*} R(\mathbf{ w}) = & \int \left(y- f^*(\mathbf{ x})\right)^2 \mathbb{P}(y, \mathbf{ x}) \text{d}y\text{d}\mathbf{ x}\\ & \triangleq \mathbb{E}\left[ \left(y- f^*(\mathbf{ x})\right)^2 \right]. \end{align*}\]

Decompose

Decompose as \[ \begin{align*} \mathbb{E}\left[ \left(y- f^*(\mathbf{ x})\right)^2 \right] = & \text{bias}\left[f^*(\mathbf{ x})\right]^2 \\ & + \text{variance}\left[f^*(\mathbf{ x})\right] \\ & +\sigma^2. \end{align*} \]

Bias

  • Given by \[ \text{bias}\left[f^*(\mathbf{ x})\right] = \mathbb{E}\left[f^*(\mathbf{ x})\right] - f(\mathbf{ x}) \]

  • Error due to bias comes from a model that’s too simple.

Variance

  • Given by \[ \text{variance}\left[f^*(\mathbf{ x})\right] = \mathbb{E}\left[\left(f^*(\mathbf{ x}) - \mathbb{E}\left[f^*(\mathbf{ x})\right]\right)^2\right]. \]

  • Slight variations in the training set cause changes in the prediction. Error due to variance is error in the model due to an overly complex model.

Bias-Variance Analysis for Olympic Marathon Data

No Free Lunch Theorem

  • No universally best learner across all data-generating processes
  • Performance gains arise from assumptions (inductive bias) matching the task
  • Implication: model choice and regularisation must reflect prior beliefs about the problem

Statement (Informal)

  • Averaged over all possible labelings/functions, all algorithms have the same expected error
  • Differences in performance come only from restricting the problem family (assumptions)

Position in this Lecture

  • After bias–variance and bootstrap: why we cannot expect one degree/basis to win universally
  • Connects to cross-validation and model selection: choose bias consistent with data
  • Motivates regularisation: encode assumptions to trade bias for variance appropriately

Practical Takeaways

  • Use validation to select assumptions that fit the domain
  • Prefer simple models unless evidence supports added complexity
  • Make assumptions explicit (document basis, priors, regularisers)

Regularisation

To fit the model, solve the linear system

\[ \boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}\mathbf{ w}= \boldsymbol{ \Phi}^\top\mathbf{ y}. \] But if \(\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}\) is singular or badly conditioned this is not well posed.

Tikhonov Regularisation

  • Updated objective: \[ E(\mathbf{ w}) = (\mathbf{ y}- \mathbf{ f})^\top(\mathbf{ y}- \mathbf{ f}) + \alpha\left\Vert \mathbf{ w} \right\Vert_2^2 \]
  • Hessian: \[ \boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}+ \alpha \mathbf{I} \]
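A sketch of the regularised solve; alpha is the illustrative regularisation coefficient:

import numpy as np

def fit_ridge(Phi, y, alpha=1.0):
    "Solve (Phi^T Phi + alpha I) w = Phi^T y."
    p = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + alpha * np.eye(p), Phi.T @ y)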

Splines, Functions, Hilbert Kernels

  • Can also regularize the function \(f(\cdot)\) directly.
  • This approach is taken in splines (Wahba, 1990) and kernels (Schölkopf and Smola, 2001).
  • Mathematically more elegant, but algorithmically less flexible and harder to scale.

Training with Noise

  • Other regularisation approaches, such as dropout (Srivastava et al., 2014), often perturb the neural network structure or its inputs.
  • These can have elegant interpretations (see e.g. Bishop (1995)).
  • They are also interpreted as ensemble or Bayesian methods.

Thanks!

References

Bishop, C.M., 1995. Training with noise is equivalent to Tikhonov regularization. Neural Computation 7, 108–116. https://doi.org/10.1162/neco.1995.7.1.108
Schölkopf, B., Smola, A.J., 2001. Learning with kernels. MIT Press, Cambridge, MA.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1929–1958.
Wahba, G., 1990. Spline models for observational data, First. ed. SIAM. https://doi.org/10.1137/1.9781611970128