Basis Functions and Generalisation

Neil Lawrence

Dedan Kimathi University, Nyeri, Kenya

Nonlinear Regression with Linear Models

Nonlinear Regression

  • Problem with Linear Regression—\(\mathbf{ x}\) may not be linearly related to \(\mathbf{ y}\).
  • Potential solution: create a feature space: define \(\phi(\mathbf{ x})\) where \(\phi(\cdot)\) is a nonlinear function of \(\mathbf{ x}\).
  • Model for target is a linear combination of these nonlinear functions \[f(\mathbf{ x}) = \sum_{j=1}^mw_j \phi_j(\mathbf{ x})\]

Non-linear in the Inputs

  • Go from \[ f(\mathbf{ x}) = \mathbf{ w}^\top \mathbf{ x} \] to \[ f(\mathbf{ x}) = \mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}) \]

Basis Functions

Basis Functions

  • Instead of working in the input space, \(\mathbf{ x}\),
  • build models in a new space, \(\boldsymbol{ \phi}(\mathbf{ x})\).

Quadratic Basis

  • Basis functions can be global. E.g. quadratic basis: \[ \boldsymbol{ \phi}^\top = [1, x, x^2] \]

\[ \begin{align*} \phi_1(x) & = 1, \\ \phi_2(x) & = x, \\ \phi_3(x) & = x^2. \end{align*} \]

Quadratic Basis

\[ \boldsymbol{ \phi}(x) = \begin{bmatrix} 1\\ x\\ x^2\end{bmatrix}. \]

Design Matrix

\[ \boldsymbol{ \Phi}(\mathbf{ x}) = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2\\ \vdots & \vdots & \vdots \\ 1 & x_n& x_n^2 \end{bmatrix} \]
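A minimal NumPy sketch of building this design matrix (the function name quadratic_design is an illustrative choice, not taken from the accompanying notebook):

import numpy as np

def quadratic_design(x):
    "Stack the quadratic basis [1, x, x^2] into an n x 3 design matrix."
    x = np.asarray(x, dtype=float).ravel()   # flatten the inputs into a vector
    return np.column_stack([np.ones_like(x), x, x**2])

# Example: three inputs give a 3 x 3 design matrix.
Phi = quadratic_design([0.0, 1.0, 2.0])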

Functions Derived from Quadratic Basis

\[ f(x) = {\color{cyan}{w_0}} + {\color{green}{w_1 x}} + {\color{yellow}{w_2 x^2}} \]

Quadratic Functions

\[ f(x) = {\color{cyan}{w_0}} + {\color{green}{w_1 x}} + {\color{yellow}{w_2 x^2}} \]

Choice of Basis

  • The polynomial represents one choice of basis

\[ \phi_j(x_i) = x_i^j \]

Polynomial Basis

\[ \phi_j(x) = x^j \]

Functions Derived from Polynomial Basis

\[ f(x) = {\color{cyan}{w_0}} + {\color{green}{w_1 x}} + {\color{yellow}{w_2 x^2}} + {\color{magenta}{w_3 x^3}} + {\color{red}{w_4 x^4}} \]

Different Basis

  • Polynomial basis widely used in engineering and graphics.
  • Drawback in machine learning: the basis values grow rapidly when \(| \mathbf{ x}| > 1\).

Radial Basis Functions

  • Basis functions can be local e.g. radial (or Gaussian) basis \[ \phi_j(x) = \exp\left(-\frac{(x-\mu_j)^2}{\ell^2}\right) \]
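A sketch of the corresponding design matrix in NumPy; the centres mu and lengthscale ell below are illustrative parameters, not values from the lecture:

import numpy as np

def rbf_design(x, mu, ell=1.0):
    "Design matrix of radial basis functions exp(-(x - mu_j)^2 / ell^2)."
    x = np.asarray(x, dtype=float).reshape(-1, 1)    # column of inputs
    mu = np.asarray(mu, dtype=float).reshape(1, -1)  # row of basis centres
    return np.exp(-(x - mu)**2 / ell**2)

Phi = rbf_design(np.linspace(-2, 2, 5), mu=[-1.0, 0.0, 1.0], ell=0.7)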

Radial Basis Functions

Functions Derived from Radial Basis

\[ f(x) = \color{cyan}{w_1 e^{-2(x+1)^2}} + \color{green}{w_2e^{-2x^2}} + \color{yellow}{w_3 e^{-2(x-1)^2}} \]

Rectified Linear Units

  • The ReLU is a basis function used in neural nets

\[ \phi_j(x) = xH(x+ v_j) \]
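A sketch of the ReLU design matrix with the Heaviside step \(H\) implemented as a boolean mask; the knot locations v are illustrative:

import numpy as np

def relu_design(x, v):
    "Design matrix of ReLU basis functions x * H(x + v_j)."
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    v = np.asarray(v, dtype=float).reshape(1, -1)
    return x * (x + v > 0)   # Heaviside step H as a boolean mask

Phi = relu_design(np.linspace(-2, 2, 5), v=[1.0, 0.33, -0.33, -1.0])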

Functions Derived from ReLU Basis

\[ f(x) = \color{cyan}{w_0} + \color{green}{w_1 xH(x+1.0) } + \color{yellow}{w_2 xH(x+0.33) } + \color{magenta}{w_3 xH(x-0.33)} + \color{red}{w_4 xH(x-1.0)} \]

Hyperbolic Tangent Basis

  • The hyperbolic tangent was formerly popular for neural nets.

\[ \phi_j(x) = \tanh(v_j x+ v_0) \]

Functions Derived from Tanh Basis

\[ f(x) = {\color{cyan}{w_0}} + {\color{green}{w_1 \text{tanh}\left(x+1\right)}} + {\color{yellow}{w_2 \text{tanh}\left(x+0.33\right)}} + {\color{magenta}{w_3 \text{tanh}\left(x-0.33\right)}} + {\color{red}{w_4 \text{tanh}\left(x-1\right)}} \]

Fourier Basis

  • In signal processing we often use the Fourier basis

\[ f(x) = w_0 + w_1 \sin(x) + w_2 \cos(x) + w_3 \sin(2x) + w_4 \cos(2x) \]

Functions Derived from Fourier Basis

\[ f(x) = {\color{cyan}{w_0}} + {\color{green}{w_1 \sin(x)}} + {\color{yellow}{w_2 \cos(x)}} + {\color{magenta}{w_3 \sin(2x)}} + {\color{red}{w_4 \cos(2x)}} \]
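A sketch of a Fourier design matrix with a constant column and sine/cosine pairs; num_frequencies is an illustrative parameter:

import numpy as np

def fourier_design(x, num_frequencies=2):
    "Design matrix [1, sin(x), cos(x), sin(2x), cos(2x), ...]."
    x = np.asarray(x, dtype=float).ravel()
    columns = [np.ones_like(x)]
    for k in range(1, num_frequencies + 1):
        columns.append(np.sin(k * x))
        columns.append(np.cos(k * x))
    return np.column_stack(columns)

Phi = fourier_design(np.linspace(0, 2 * np.pi, 8), num_frequencies=2)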

Fitting Basis Function Models

Fitting to Data

Olympic Marathon Data

  • Gold medal times for Olympic Marathon since 1896.
  • Marathons before 1924 didn’t have a standardized distance.
  • Present results using pace per km.
  • In 1904 the marathon was badly organized, leading to very slow times.
Image from Wikimedia Commons

Olympic Marathon Data

Notebook Example

  • In the notebook you are asked to scale the weights to fit functions to Olympic Marathon data.

Basis Function Models

  • The prediction function is now defined as \[ f(\mathbf{ x}_i) = \sum_{j=1}^mw_j \phi_{i, j} \]

Vector Notation

  • Write in vector notation, \[ f(\mathbf{ x}_i) = \mathbf{ w}^\top \boldsymbol{ \phi}_i \]

Log Likelihood for Basis Function Model

  • The likelihood of a single data point is \[ p\left(y_i|x_i\right)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\left(y_i-\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\right)^{2}}{2\sigma^2}\right). \]

Log Likelihood for Basis Function Model

  • Leading to a log likelihood for the data set of \[\begin{aligned} L(\mathbf{ w},\sigma^2)= & -\frac{n}{2}\log \sigma^2-\frac{n}{2}\log 2\pi \\ & -\frac{\sum_{i=1}^{n}\left(y_i-\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\right)^{2}}{2\sigma^2}. \end{aligned}\]

Objective Function

  • And a corresponding objective function (the negative log likelihood, dropping constant terms) \[ E(\mathbf{ w},\sigma^2)= \frac{n}{2}\log\sigma^2 + \frac{\sum_{i=1}^{n}\left(y_i-\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\right)^{2}}{2\sigma^2}. \]
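A sketch of evaluating this objective for a given design matrix Phi, targets y, weights w and noise variance sigma2 (all names are illustrative):

import numpy as np

def objective(w, sigma2, Phi, y):
    "Objective E(w, sigma^2): n/2 log sigma^2 plus the scaled sum of squared errors."
    n = y.shape[0]
    residuals = y - Phi @ w
    return 0.5 * n * np.log(sigma2) + 0.5 * np.sum(residuals**2) / sigma2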

Expand the Brackets

\[\begin{aligned} E(\mathbf{ w},\sigma^2) = &\frac{n}{2}\log \sigma^2 + \frac{1}{2\sigma^2}\sum_{i=1}^{n}y_i^{2} \\ & -\frac{1}{\sigma^2}\sum_{i=1}^{n}y_i\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\\ &+\frac{1}{2\sigma^2}\sum_{i=1}^{n}\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^{\top}\mathbf{ w}+\text{const}. \end{aligned}\]

Expand the Brackets

\[\begin{aligned} E(\mathbf{ w}, \sigma^2) = & \frac{n}{2}\log \sigma^2 + \frac{1}{2\sigma^2}\sum_{i=1}^{n}y_i^{2} \\ & -\frac{1}{\sigma^2} \mathbf{ w}^\top\sum_{i=1}^{n}\boldsymbol{ \phi}_i y_i\\ & +\frac{1}{2\sigma^2}\mathbf{ w}^{\top}\left[\sum_{i=1}^{n}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^{\top}\right]\mathbf{ w}+\text{const}.\end{aligned}\]

Design Matrices

  • Design matrix notation \[ \boldsymbol{ \Phi}= \begin{bmatrix} \mathbf{1} & \mathbf{ x}& \mathbf{ x}^2\end{bmatrix} \] so that \[ \boldsymbol{ \Phi}\in \Re^{n\times p}. \]

Multivariate Derivatives Reminder

\[\frac{\text{d}\mathbf{a}^{\top}\mathbf{ w}}{\text{d}\mathbf{ w}}=\mathbf{a}\] \[\frac{\text{d}\mathbf{ w}^{\top}\mathbf{A}\mathbf{ w}}{\text{d}\mathbf{ w}}=\left(\mathbf{A}+\mathbf{A}^{\top}\right)\mathbf{ w}\]
or
\[\frac{\text{d}\mathbf{ w}^{\top}\mathbf{A}\mathbf{ w}}{\text{d}\mathbf{ w}}=2\mathbf{A}\mathbf{ w}\]
for symmetric \(\mathbf{A}\).

Differentiate

Differentiate wrt \(\mathbf{ w}\) \[\begin{aligned} \frac{\text{d} E\left(\mathbf{ w},\sigma^2 \right)}{\text{d}\mathbf{ w}}= & -\frac{1}{\sigma^2} \sum_{i=1}^{n}\boldsymbol{ \phi}_iy_i \\ & +\frac{1}{\sigma^2} \left[\sum_{i=1}^{n}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^{\top}\right]\mathbf{ w} \end{aligned}\]

Find Stationary Point

  • Set to zero leading to \[\left[\sum_{i=1}^{n}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^{\top}\right]\mathbf{ w}^{*}=\sum_{i=1}^{n}\boldsymbol{ \phi}_iy_i.\]

Matrix Notation

\[ \sum_{i=1}^{n}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^\top = \boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\] \[\sum _{i=1}^{n}\boldsymbol{ \phi}_iy_i = \boldsymbol{ \Phi}^\top \mathbf{ y} \]

Update Equations

  • To find \(\mathbf{ w}^{*}\) solve \[ \left(\boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\right) \mathbf{ w}^{*} = \boldsymbol{ \Phi}^\top \mathbf{ y} \]
  • The equation for \(\left.\sigma^2\right.^{*}\) may also be found \[ \left.\sigma^2\right.^{*}=\frac{\sum_{i=1}^{n}\left(y_i-\left.\mathbf{ w}^{*}\right.^{\top}\boldsymbol{ \phi}_i\right)^{2}}{n}\]
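A sketch of these two updates in NumPy, using a solve rather than an explicit inverse (as recommended below); fit_basis_model is an illustrative name:

import numpy as np

def fit_basis_model(Phi, y):
    "Solve the normal equations for w* and plug in for the noise variance."
    w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
    sigma2_star = np.mean((y - Phi @ w_star)**2)
    return w_star, sigma2_star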

Avoid Direct Inverse

  • E.g. Solve for \(\mathbf{ w}\) \[ \left(\boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\right)\mathbf{ w}= \boldsymbol{ \Phi}^\top \mathbf{ y} \]

\[ \mathbf{A}\mathbf{x} = \mathbf{b}. \]

Solving

  • See np.linalg.solve
  • In practice use \(\mathbf{Q}\mathbf{R}\) decomposition (see lab class notes).

Solution with QR Decomposition

\[ \boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\mathbf{ w}= \boldsymbol{ \Phi}^\top \mathbf{ y} \] substitute \(\boldsymbol{ \Phi}= \mathbf{Q}\mathbf{R}\) \[ (\mathbf{Q}\mathbf{R})^\top (\mathbf{Q}\mathbf{R})\mathbf{ w}= (\mathbf{Q}\mathbf{R})^\top \mathbf{ y} \] \[ \mathbf{R}^\top (\mathbf{Q}^\top \mathbf{Q}) \mathbf{R} \mathbf{ w}= \mathbf{R}^\top \mathbf{Q}^\top \mathbf{ y} \]

\[ \mathbf{R}^\top \mathbf{R} \mathbf{ w}= \mathbf{R}^\top \mathbf{Q}^\top \mathbf{ y} \] \[ \mathbf{R} \mathbf{ w}= \mathbf{Q}^\top \mathbf{ y} \]

  • More numerically stable.
  • Avoids the intermediate computation of \(\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}\).
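A sketch of the QR route (not the lab-class code itself): factorise \(\boldsymbol{ \Phi}\) once, then solve the triangular system \(\mathbf{R}\mathbf{ w}= \mathbf{Q}^\top \mathbf{ y}\).

import numpy as np

def fit_qr(Phi, y):
    "Fit by QR decomposition: solve R w = Q^T y without forming Phi^T Phi."
    Q, R = np.linalg.qr(Phi)            # thin QR: Q is n x p, R is p x p upper triangular
    return np.linalg.solve(R, Q.T @ y)  # a dedicated triangular solve would also do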

Non-linear but Linear in the Parameters

  • Model is non-linear, but linear in parameters \[ f(\mathbf{ x}) = \mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}) \]
  • \(\mathbf{ x}\) is inside the non-linearity, but \(\mathbf{ w}\) is outside; the basis can also carry its own parameters, \[ f(\mathbf{ x}) = \mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}; \mathbf{ v}). \]

Polynomial Fits to Olympic Marathon Data

  • Fit linear model with polynomial basis to marathon data.
  • Try different numbers of basis functions (different degrees of polynomial).
  • Check the quality of fit.

Linear Fit

\[f(x, \mathbf{ w}) = w_0 + w_1x\]

Cubic Fit

\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + w_3 x^3\]

9th Degree Polynomial Fit

\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + \dots + w_9 x^9\]

16th Degree Polynomial Fit

\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + \dots + w_{16} x^{16}\]

28th Degree Polynomial Fit

\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + \dots + w_{28} x^{28}\]

Polynomial Fits to Olympic Data

Empirical Risk Minimisation

Expected Loss

\[ R(\mathbf{ w}) = \int L(y, x, \mathbf{ w}) \mathbb{P}(y, x) \text{d}y \text{d}x. \]

Loss Function

  • Here \(L(\cdot)\) is the loss function.
  • Different interpretation of the objective.
  • The cost you pay for mistakes.

Sample-Based Approximations

  • Sample based approximation: replace true expectation with sum over samples. \[ \int f(z) p(z) \text{d}z\approx \frac{1}{s}\sum_{i=1}^s f(z_i). \]

  • Allows us to approximate true integral with a sum \[ R(\mathbf{ w}) \approx \frac{1}{n}\sum_{i=1}^{n} L(y_i, x_i, \mathbf{ w}). \]

Empirical Risk Minimization

  • If the loss is the squared loss \[ L(y, x, \mathbf{ w}) = (y-\mathbf{ w}^\top\boldsymbol{\phi}(x))^2, \]
  • This recovers the empirical risk \[ R(\mathbf{ w}) \approx \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{ w}^\top \boldsymbol{\phi}(x_i))^2 \]
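A sketch of the empirical risk for the squared loss, assuming a design matrix Phi built from the chosen basis (names illustrative):

import numpy as np

def empirical_risk(w, Phi, y):
    "Average squared loss over the n training points."
    return np.mean((y - Phi @ w)**2)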

Estimating Risk through Validation

Validation

Validation on the Olympic Marathon Data

Polynomial Fit: Training Error

Next we consider a quadratic fit and compute the training error for the two fits.

Polynomial Fits to Olympics Data

Hold Out Validation on Olympic Marathon Data

Overfitting

  • As we increase the number of basis functions we obtain a better ‘fit’ to the data.
  • How will the model perform on previously unseen data?
  • Let’s consider predicting the future.

Future Prediction: Extrapolation

Extrapolation

  • Here we are predicting beyond the region where the model has seen data.
  • This is known as extrapolation.
  • Extrapolation is predicting into the future here, but could be:
    • Predicting back to the unseen past (pre-1896)
    • Spatial prediction (e.g. Cholera rates outside Manchester given rates inside Manchester).

Interpolation

  • Predicting the winning time for the 1946 Olympics is interpolation.
  • This is because we have times from 1936 and 1948.
  • If we want a model for interpolation how can we test it?
  • One trick is to sample the validation set from throughout the data set.

Future Prediction: Interpolation

Choice of Validation Set

  • The choice of validation set should reflect how you will use the model in practice.
  • For extrapolation into the future we tried validating with data from the future.
  • For interpolation we chose the validation set from points spread throughout the data.
  • For different validation sets we could get different results.

Leave One Out Validation

Leave One Out Error

  • Take training set and remove one point.
  • Train on the remaining data.
  • Compute the error on the point you removed (which wasn’t in the training data).
  • Do this for each point in the training set in turn.
  • Average the resulting error.
  • This is the leave one out error.
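A sketch of the procedure, refitting the least squares solution \(n\) times on a given design matrix (the helper name is illustrative):

import numpy as np

def leave_one_out_error(Phi, y):
    "Average squared error when each point is held out of training in turn."
    n = y.shape[0]
    errors = np.zeros(n)
    for i in range(n):
        keep = np.arange(n) != i        # boolean mask dropping point i
        w = np.linalg.solve(Phi[keep].T @ Phi[keep], Phi[keep].T @ y[keep])
        errors[i] = (y[i] - Phi[i] @ w)**2
    return errors.mean()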

Leave One Out Validation

[Interactive figure: leave-one-out validation fits, with controls for the fold and the number of basis functions.]

\(k\)-fold Cross Validation

  • Leave one out error can be very time consuming.
  • Need to train your algorithm \(n\) times.
  • An alternative: \(k\)-fold cross validation.
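A sketch of \(k\)-fold cross validation with random folds (the helper name and the fixed seed are illustrative choices):

import numpy as np

def k_fold_error(Phi, y, k=5, seed=0):
    "Average validation error over k random folds."
    n = y.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)   # indices not in the held-out fold
        w = np.linalg.solve(Phi[train].T @ Phi[train], Phi[train].T @ y[train])
        errors.append(np.mean((y[fold] - Phi[fold] @ w)**2))
    return np.mean(errors)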

\(k\)-fold Cross Validation

[Interactive figure: \(k\)-fold cross validation fits, with controls for the fold and the number of basis functions.]

The Bootstrap

\[ \mathbf{ y}, \mathbf{X}\sim \mathbb{P}(y, \mathbf{ x}) \]

Resample Dataset

import numpy as np

def bootstrap(X):
    "Return a bootstrap sample from a data set."
    n = X.shape[0]
    ind = np.random.choice(n, n, replace=True) # Sample randomly with replacement.
    return X[ind, :]
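A hypothetical usage, resampling the rows of a stacked data matrix and refitting the model; Phi and y are assumed to be the design matrix and targets from earlier:

data = np.hstack([Phi, y.reshape(-1, 1)])               # keep inputs and targets paired while resampling
sample = bootstrap(data)
Phi_s, y_s = sample[:, :-1], sample[:, -1]
w_s = np.linalg.solve(Phi_s.T @ Phi_s, Phi_s.T @ y_s)   # refit on the bootstrap sample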

Bootstrap and Olympic Marathon Data

Linear Fit

\[f(x, \mathbf{ w}) = w_0 + w_1 x\]

Cubic Fit

\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + w_{3} x^3\]

9th Degree Polynomial Fit

\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + \dots + w_{9} x^{9}\]

16th Degree Polynomial Fit

\[f(x, \mathbf{ w}) = w_0 + w_1 x+ w_2 x^2 + \dots + w_{16} x^{16}\]

Bootstrap Confidence Intervals

Bias Variance Decomposition

Generalisation error \[\begin{align*} R(\mathbf{ w}) = & \int \left(y- f^*(\mathbf{ x})\right)^2 \mathbb{P}(y, \mathbf{ x}) \text{d}y\text{d}\mathbf{ x}\\ & \triangleq \mathbb{E}\left[ \left(y- f^*(\mathbf{ x})\right)^2 \right]. \end{align*}\]

Decompose

Decompose as \[ \begin{align*} \mathbb{E}\left[ \left(y- f^*(\mathbf{ x})\right)^2 \right] = & \text{bias}\left[f^*(\mathbf{ x})\right]^2 \\ & + \text{variance}\left[f^*(\mathbf{ x})\right] \\ & +\sigma^2. \end{align*} \]

Bias

  • Given by \[ \text{bias}\left[f^*(\mathbf{ x})\right] = \mathbb{E}\left[f^*(\mathbf{ x})\right] - f(\mathbf{ x}) \]

  • Error due to bias comes from a model that’s too simple.

Variance

  • Given by \[ \text{variance}\left[f^*(\mathbf{ x})\right] = \mathbb{E}\left[\left(f^*(\mathbf{ x}) - \mathbb{E}\left[f^*(\mathbf{ x})\right]\right)^2\right]. \]

  • Slight variations in the training set cause changes in the prediction. Error due to variance is error in the model due to an overly complex model.

Bias-Variance Analysis for Olympic Marathon Data

No Free Lunch Theorem

  • No universally best learner across all data-generating processes
  • Performance gains arise from assumptions (inductive bias) matching the task
  • Implication: model choice and regularisation must reflect prior beliefs about the problem

Statement (Informal)

  • Averaged over all possible labelings/functions, all algorithms have the same expected error
  • Differences in performance come only from restricting the problem family (assumptions)

Position in this Lecture

  • After bias–variance and bootstrap: why we cannot expect one degree/basis to win universally
  • Connects to cross-validation and model selection: choose bias consistent with data
  • Motivates regularisation: encode assumptions to trade bias for variance appropriately

Practical Takeaways

  • Use validation to select assumptions that fit the domain
  • Prefer simple models unless evidence supports added complexity
  • Make assumptions explicit (document basis, priors, regularisers)

Regularisation

To fit the model, solve the linear system

\[ \boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}\mathbf{ w}= \boldsymbol{ \Phi}^\top\mathbf{ y}. \] But if \(\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}\) is singular or badly conditioned this is not well posed.

Tikhonov Regularisation

  • Updated objective: \[ E(\mathbf{ w}) = (\mathbf{ y}- \mathbf{ f})^\top(\mathbf{ y}- \mathbf{ f}) + \alpha\left\Vert \mathbf{ w} \right\Vert_2^2 \]
  • Hessian: \[ \boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}+ \alpha \mathbf{I} \]
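A sketch of the regularised solve; alpha is the illustrative regularisation coefficient:

import numpy as np

def fit_ridge(Phi, y, alpha=1.0):
    "Solve (Phi^T Phi + alpha I) w = Phi^T y."
    p = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + alpha * np.eye(p), Phi.T @ y)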

Splines, Functions, Hilbert Kernels

  • Can also regularize the function \(f(\cdot)\) directly.
  • This approach is taken in splines (Wahba, 1990) and kernels (Schölkopf and Smola, 2001).
  • Mathematically more elegant, but algorithmically less flexible and harder to scale.

Training with Noise

  • Other regularisation approaches, such as dropout (Srivastava et al., 2014), often perturb the neural network structure or its inputs.
  • These can have elegant interpretations (see e.g. Bishop (1995)).
  • They are also interpreted as ensemble or Bayesian methods.

Thanks!

References

Bishop, C.M., 1995. Training with noise is equivalent to Tikhonov regularization. Neural Computation 7, 108–116. https://doi.org/10.1162/neco.1995.7.1.108
Schölkopf, B., Smola, A.J., 2001. Learning with kernels. MIT Press, Cambridge, MA.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1929–1958.
Wahba, G., 1990. Spline models for observational data, First. ed. SIAM. https://doi.org/10.1137/1.9781611970128