Week 6: Basis Functions
Abstract:
In the last session we explored least squares for univariate and multivariate regression. We introduced matrices, linear algebra and derivatives.
In this session we will introduce basis functions which allow us to implement non-linear regression models.
Setup
notutils
This small package is a helper package for various notebook utilities used below.
The software can be installed using
%pip install notutils
from the command prompt where you can access your python installation.
The code is also available on GitHub: https://github.com/lawrennd/notutils
Once notutils
is installed, it can be imported in the
usual manner.
import notutils
pods
In Sheffield we created a suite of software tools for ‘Open Data Science’. Open data science is an approach to sharing code, models and data that should make it easier for companies, health professionals and scientists to gain access to data science techniques.
You can also check this blog post on Open Data Science.
The software can be installed using
%pip install pods
from the command prompt where you can access your python installation.
The code is also available on GitHub: https://github.com/lawrennd/ods
Once pods
is installed, it can be imported in the usual
manner.
import pods
mlai
The mlai
software is a suite of helper functions for
teaching and demonstrating machine learning algorithms. It was first
used in the Machine Learning and Adaptive Intelligence course in
Sheffield in 2013.
The software can be installed using
%pip install mlai
from the command prompt where you can access your python installation.
The code is also available on GitHub: https://github.com/lawrennd/mlai
Once mlai
is installed, it can be imported in the usual
manner.
import mlai
Nonlinear Regression
We’ve now seen how we may perform linear regression. Now, we are going to consider how we can perform non-linear regression. However, before we get into the details of how to do that, we first need to consider in what ways the regression can be non-linear. Multivariate linear regression allows us to build models that take many features into account when making our prediction. In this session we are going to introduce basis functions. The term seems complicated, but they are actually based on rather a simple idea. In multivariate linear regression we have extra features that might help us predict our required response variable (or target value), \(y\). But what if we only have one input value? We can actually artificially generate more input values with basis functions.
Non-linear in the Inputs
When we refer to non-linear regression, we are normally referring to whether the regression is non-linear in the input space, or non-linear in the covariates. The covariates are the observations that move with the target (or response) variable. In our notation we have been using \(\mathbf{ x}_i\) to represent a vector of the covariates associated with the \(i\)th observation. The corresponding response variable is \(y_i\). If a model is non-linear in the inputs, it means that there is a non-linear function between the inputs and the response variable. Linear functions are functions that only involve multiplication and addition; in other words, they can be represented through linear algebra. Linear regression involves assuming that a function takes the form \[ f(\mathbf{ x}) = \mathbf{ w}^\top \mathbf{ x} \] where \(\mathbf{ w}\) are our regression weights. A very easy way to make the linear regression non-linear is to introduce non-linear functions. When we are introducing non-linear regression, these functions are known as basis functions.
Basis Functions
Here’s the idea: instead of working directly on the original input space, \(\mathbf{ x}\), we build models in a new space, \(\boldsymbol{ \phi}(\mathbf{ x})\), where \(\boldsymbol{ \phi}(\cdot)\) is a vector-valued function that is defined on the space \(\mathbf{ x}\).
Quadratic Basis
Remember that a vector-valued function is just a vector that contains functions instead of values. Here’s an example for a one dimensional input space, \(x\), being projected to a quadratic basis. First we consider each basis function in turn; we can think of the elements of our vector as being indexed so that we have \[ \begin{align*} \phi_1(x) & = 1, \\ \phi_2(x) & = x, \\ \phi_3(x) & = x^2. \end{align*} \] Now we can consider them together by placing them in a vector, \[ \boldsymbol{ \phi}(x) = \begin{bmatrix} 1\\ x\\ x^2\end{bmatrix}. \] For the vector-valued function, we have simply collected the different functions together in the same vector, making them notationally easier to deal with in our mathematics.
When we consider the vector-valued function for each data point, then we place all the data into a matrix. The result is a matrix valued function, \[ \boldsymbol{ \Phi}(\mathbf{ x}) = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2\\ \vdots & \vdots & \vdots \\ 1 & x_n& x_n^2 \end{bmatrix} \] where we are still in the one dimensional input setting so \(\mathbf{ x}\) here represents a vector of our inputs with \(n\) elements.
Let’s try constructing such a matrix for a set of inputs. First of all, we create a function that returns the matrix valued function.
import numpy as np

def quadratic(x, **kwargs):
    """Take in a vector of input values and return the design matrix associated
    with the basis functions."""
    return np.hstack([np.ones((x.shape[0], 1)), x, x**2])
Functions Derived from Quadratic Basis
\[ f(x) = {\color{red}{w_0}} + {\color{magenta}{w_1 x}} + {\color{blue}{w_2 x^2}} \]
This function takes in an \(n\times 1\) dimensional vector and returns an \(n\times 3\) dimensional design matrix containing the basis functions. We can plot those basis functions against their inputs as follows.
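A minimal matplotlib sketch of such a plot, using the quadratic function defined above (the input grid here is just for illustration):

import numpy as np
import matplotlib.pyplot as plt

# Evaluate the quadratic basis on a grid of inputs and plot each column.
x_plot = np.linspace(-2, 2, 100)[:, np.newaxis]
Phi_plot = quadratic(x_plot)
for j in range(Phi_plot.shape[1]):
    plt.plot(x_plot, Phi_plot[:, j], label='$\\phi_{}(x)$'.format(j+1))
plt.xlabel('$x$')
plt.legend()
plt.show()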
The actual function we observe is then made up of a sum of these functions. This is the reason for the name basis. The term basis means ‘the underlying support or foundation for an idea, argument, or process’, and in this context they form the underlying support for our prediction function. Our prediction function can only be composed of a weighted linear sum of our basis functions.
Quadratic Functions
Different Bases
Our choice of basis can be made based on our beliefs about what is appropriate for the data. For example, the polynomial basis extends the quadratic basis to arbitrary degree, so we might define the \(j\)th basis function associated with the model as \[ \phi_j(x_i) = x_i^j \] which is known as the polynomial basis.
Polynomial Basis
The polynomial basis combines higher order polynomials together to create the function. For example, the fourth order polynomial has five components to its basis function. \[ \phi_j(x) = x^j \]
import numpy as np
import mlai
from mlai import polynomial
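For illustration, a minimal polynomial design matrix can be written analogously to the quadratic basis above. This is a sketch, not the mlai implementation; the function name and default degree are just for illustration.

import numpy as np

def polynomial_basis(x, degree=4, **kwargs):
    """Minimal polynomial design matrix: columns are x**0, x**1, ..., x**degree."""
    return np.hstack([x**j for j in range(degree + 1)])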
Functions Derived from Polynomial Basis
\[ f(x) = {\color{red}{w_0}} + {\color{magenta}{w_1 x}} + {\color{blue}{w_2 x^2}} + {\color{green}{w_3 x^3}} + {\color{cyan}{w_4 x^4}} \]
To aid in understanding how a basis works, we’ve provided you with a small interactive tool for exploring this polynomial basis. The tool can be summoned with the following command.
Try moving the sliders around to change the weight of each basis
function. Click the control box display_basis
to show the
underlying basis functions (in red). The prediction function is shown in
a thick blue line. Warning: the sliders aren’t presented quite
in the correct order. w_0
is associated with the bias,
w_1
is the linear term, w_2
the quadratic and
here (because we have four basis functions) we have w_3
for
the cubic term. So the subscript of the weight parameter is
always associated with the corresponding polynomial’s degree.
Exercise 1
Try increasing the number of basis functions (thereby increasing the degree of the resulting polynomial). Describe what you see as you increase the number of basis functions up to 10. Is it easy to change the function in intuitive ways?
Different Bases
The polynomial basis is widely used in Engineering and graphics, but it has some drawbacks in machine learning: outside the input region between -1 and 1, the values of the polynomial basis rise very quickly.
Now we look at basis functions that have been used as the activation functions in neural network models.
Radial Basis Functions
Another type of basis is sometimes known as a ‘radial basis’ because the basis functions are constructed on ‘centres’ and the effect of each basis function decreases as the radial distance from each centre increases.
\[ \phi_j(x) = \exp\left(-\frac{(x-\mu_j)^2}{\ell^2}\right) \]
import mlai
from mlai import radial
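A minimal sketch of a radial basis of this form, with the centres and lengthscale chosen arbitrarily for illustration (this is not the mlai implementation):

import numpy as np

def radial_basis(x, centres=(-1.0, 0.0, 1.0), lengthscale=0.5, **kwargs):
    """Minimal radial design matrix: one Gaussian bump per centre."""
    return np.hstack([np.exp(-(x - mu)**2/lengthscale**2) for mu in centres])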
Functions Derived from Radial Basis
\[ f(x) = \color{red}{w_1 e^{-2(x+1)^2}} + \color{magenta}{w_2e^{-2x^2}} + \color{blue}{w_3 e^{-2(x-1)^2}} \]
Rectified Linear Units
The rectified linear unit is a basis function that emerged out of the deep learning community. Rectified linear units are popular in the current generation of multilayer perceptron models, or deep networks. These basis functions start flat, and then become linear functions at a certain threshold. \[ \phi_j(x) = xH(v_j x+ v_0) \]
import numpy as np
import mlai
from mlai import relu
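A minimal sketch of a rectified linear basis of this form, using unit slopes and fixed offsets matching the example below (this is not the mlai implementation):

import numpy as np

def relu_basis(x, offsets=(1.0, 0.33, -0.33, -1.0), **kwargs):
    """Minimal rectified linear design matrix: a constant column plus hinge
    functions x*H(x + offset) that switch on at each offset."""
    return np.hstack([np.ones_like(x)] + [x*((x + offset) > 0) for offset in offsets])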
Functions Derived from Relu Basis
\[ f(x) = \color{red}{w_0} + \color{magenta}{w_1 xH(x+1.0) } + \color{blue}{w_2 xH(x+0.33) } + \color{green}{w_3 xH(x-0.33)} + \color{cyan}{w_4 xH(x-1.0)} \]
Hyperbolic Tangent Basis
The hyperbolic tangent is a basis function that used to be used a lot for neural network models. It’s related to the sigmoid function by a scaling. \[ \phi_j(x) = \tanh(v_j x+ v_0) \]
import numpy as np
import mlai
from mlai import hyperbolic_tangent
Sigmoid or hyperbolic tangent basis was popular in the original generation of multilayer perceptron models, or deep networks. These basis functions start flat, rise and then saturate.
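A minimal sketch of a hyperbolic tangent basis with fixed offsets matching the example below (this is not the mlai implementation):

import numpy as np

def tanh_basis(x, offsets=(1.0, 0.33, -0.33, -1.0), **kwargs):
    """Minimal tanh design matrix: a constant column plus shifted tanh functions."""
    return np.hstack([np.ones_like(x)] + [np.tanh(x + offset) for offset in offsets])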
Functions Derived from Tanh Basis
\[ f(x) = {\color{red}{w_0}} + {\color{magenta}{w_1 \text{tanh}\left(x+1\right)}} + {\color{blue}{w_2 \text{tanh}\left(x+0.33\right)}} + {\color{green}{w_3 \text{tanh}\left(x-0.33\right)}} + {\color{cyan}{w_4 \text{tanh}\left(x-1\right)}} \]
Fourier Basis
Joseph Fourier suggested that functions could be converted to a sum of sines and cosines. A Fourier basis is a linear weighted sum of these functions. \[ f(x) = w_0 + w_1 \sin(x) + w_2 \cos(x) + w_3 \sin(2x) + w_4 \cos(2x) \]
import numpy as np
import mlai
from mlai import fourier
In this code, basis functions with an odd index are sine and basis functions with an even index are cosine. The first basis function (index 0, so cosine) has a frequency of 0 and then frequencies increase every time a sine and cosine are included.
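A minimal sketch that follows the same indexing convention (an illustration, not the mlai implementation):

import numpy as np

def fourier_basis(x, num_basis=5, **kwargs):
    """Minimal Fourier design matrix: alternating cosines and sines of
    increasing frequency; index 0 is cos(0*x), i.e. a constant column."""
    columns = []
    for j in range(num_basis):
        frequency = (j + 1)//2
        if j % 2 == 1:
            columns.append(np.sin(frequency*x))   # odd index: sine
        else:
            columns.append(np.cos(frequency*x))   # even index: cosine
    return np.hstack(columns)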
Functions Derived from Fourier Basis
\[ f(x) = {\color{red}{w_0}} + {\color{magenta}{w_1 \sin(x)}} + {\color{blue}{w_2 \cos(x)}} + {\color{green}{w_3 \sin(2x)}} + {\color{cyan}{w_4 \cos(2x)}} \]
Fitting to Data
Now we are going to consider how these basis functions can be adjusted to fit to a particular data set. We will return to the olympic marathon data from last time. First we will scale the output of the data to be zero mean and variance 1.
Olympic Marathon Data
The first thing we will do is load a standard data set for regression modelling. The data consists of the pace of Olympic Gold Medal Marathon winners for the Olympics from 1896 to present. Let’s load in the data and plot.
import numpy as np
import pods
data = pods.datasets.olympic_marathon_men()
x = data['X']
y = data['Y']

offset = y.mean()
scale = np.sqrt(y.var())
yhat = (y - offset)/scale
Things to notice about the data include the outlier in 1904: in that year the Olympics was held in St Louis, USA. Organizational problems and challenges with dust kicked up by the cars following the race meant that participants got lost and only a very few completed. More recent years see more consistently quick marathons.
Exercise 2
Use the tool provided above to try and find the best fit you can to the data. Explore the parameter space and give the weight values you used for the
- polynomial basis
- Radial basis
- Fourier basis
Write your answers in the code box below creating a new vector of parameters (in the correct order!) for each basis.
np.asarray([[1, 2, 3, 4]]).shape
Basis Function Models
\[ f(\mathbf{ x}_i) = \sum_{j=1}^mw_j \phi_{i, j} \]
\[ f(\mathbf{ x}_i) = \mathbf{ w}^\top \boldsymbol{ \phi}_i \]
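In code, with a design matrix Phi whose rows are the vectors \(\boldsymbol{ \phi}_i^\top\) and a weight vector w (both assumed to already exist as numpy arrays), all the predictions are computed with a single matrix-vector product:

# Each prediction is f_i = w^T phi_i; stacking over data points gives Phi w.
f = Phi@w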
Log Likelihood for Basis Function Model
The likelihood of a single data point given the model parameters is given by \[ p\left(y_i|x_i\right)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\left(y_i-\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\right)^{2}}{2\sigma^2}\right). \] Making an assumption of conditional independence given the parameters, we can write
\[ L(\mathbf{ w},\sigma^2)= -\frac{n}{2}\log \sigma^2-\frac{n}{2}\log 2\pi -\frac{\sum_{i=1}^{n}\left(y_i-\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\right)^{2}}{2\sigma^2} \] to give us the log likelihood for the whole data set.
Objective Function
Traditionally in optimization, we choose to minimize an objective function (also known as a loss function or cost function) rather than maximizing a likelihood. For these models we minimize the negative log likelihood. This function takes the form,
\[ E(\mathbf{ w},\sigma^2)= \frac{n}{2}\log\sigma^2 + \frac{\sum_{i=1}^{n}\left(y_i-\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\right)^{2}}{2\sigma^2}. \]
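As a sketch in code, assuming a design matrix Phi with rows \(\boldsymbol{ \phi}_i^\top\), a weight vector w and targets y as numpy arrays, the objective can be evaluated as:

import numpy as np

def objective(w, sigma2, Phi, y):
    """Negative log likelihood of the basis function model, dropping constants."""
    n = y.shape[0]
    residuals = y - Phi@w
    return 0.5*n*np.log(sigma2) + 0.5*np.sum(residuals**2)/sigma2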
To minimize this objective, we first expand the brackets as follows, \[ \begin{align} E(\mathbf{ w},\sigma^2) = &\frac{n}{2}\log \sigma^2 + \frac{1}{2\sigma^2}\sum_{i=1}^{n}y_i^{2}-\frac{1}{\sigma^2}\sum_{i=1}^{n}y_i\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\\ &+\frac{1}{2\sigma^2}\sum_{i=1}^{n}\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^{\top}\mathbf{ w}+\text{const}. \end{align} \]
Now we pull out the vectors, \(\mathbf{ w}\), to highlight that what we have is a multivariate quadratic form in \(\mathbf{ w}\). \[\begin{align} E(\mathbf{ w}, \sigma^2) = & \frac{n}{2}\log \sigma^2 + \frac{1}{2\sigma^2}\sum_{i=1}^{n}y_i^{2}-\frac{1}{\sigma^2} \mathbf{ w}^\top\sum_{i=1}^{n}\boldsymbol{ \phi}_i y_i\\ & +\frac{1}{2\sigma^2}\mathbf{ w}^{\top}\left[\sum_{i=1}^{n}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^{\top}\right]\mathbf{ w}+\text{const}.\end{align}\]
Design Matrices
We like to make use of design matrices for our data. Design matrices involve placing the data points into rows of the matrix and data features into the columns of the matrix. By convention, we reference a vector with a bold lower case letter and a matrix with a bold upper case letter. For the quadratic basis above, the design matrix is given by \[ \boldsymbol{ \Phi}= \begin{bmatrix} \mathbf{1} & \mathbf{ x}& \mathbf{ x}^2\end{bmatrix} \] so that \[ \boldsymbol{ \Phi}\in \Re^{n\times p}. \]
Multivariate Derivatives Reminder
To find the minimum of the objective function, we need to make use of multivariate calculus. The two results we need from multivariate calculus have the following form.
\[\frac{\text{d}\mathbf{a}^{\top}\mathbf{ w}}{\text{d}\mathbf{ w}}=\mathbf{a}\] and \[\frac{\text{d}\mathbf{ w}^{\top}\mathbf{A}\mathbf{ w}}{\text{d}\mathbf{ w}}=\left(\mathbf{A}+\mathbf{A}^{\top}\right)\mathbf{ w}\] or if \(\mathbf{A}\) is symmetric (i.e. \(\mathbf{A}=\mathbf{A}^{\top}\)) \[\frac{\text{d}\mathbf{ w}^{\top}\mathbf{A}\mathbf{ w}}{\text{d}\mathbf{ w}}=2\mathbf{A}\mathbf{ w}.\]
Differentiate
Differentiating with respect to the vector \(\mathbf{ w}\) we obtain \[\frac{\text{d} E\left(\mathbf{ w},\sigma^2 \right)}{\text{d}\mathbf{ w}}=-\frac{1}{\sigma^2} \sum_{i=1}^{n}\boldsymbol{ \phi}_iy_i+\frac{1}{\sigma^2} \left[\sum_{i=1}^{n}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^{\top}\right]\mathbf{ w}\] Leading to \[\mathbf{ w}^{*}=\left[\sum_{i=1}^{n}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^{\top}\right]^{-1}\sum_{i=1}^{n}\boldsymbol{ \phi}_iy_i,\]
Rewriting this result in matrix notation we obtain: \[ \sum_{i=1}^{n}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^\top = \boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\] \[\sum _{i=1}^{n}\boldsymbol{ \phi}_iy_i = \boldsymbol{ \Phi}^\top \mathbf{ y} \]
Setting the derivative to zero we obtain update equations for the parameter vector and the noise variance.
\[ \mathbf{ w}^{*} = \left(\boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\right)^{-1} \boldsymbol{ \Phi}^\top \mathbf{ y} \]
\[ \left.\sigma^2\right.^{*}=\frac{\sum_{i=1}^{n}\left(y_i-\left.\mathbf{ w}^{*}\right.^{\top}\boldsymbol{ \phi}_i\right)^{2}}{n}. \]
In practice we should avoid solving these equations through direct use of the inverse. Instead we solve for \(\mathbf{ w}\) in the following linear system.
\[
\left(\boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\right)\mathbf{ w}=
\boldsymbol{ \Phi}^\top \mathbf{ y}
\] Compare this system with solving for \(\mathbf{x}\) in \[
\mathbf{A}\mathbf{x} = \mathbf{b}.
\] For example, see numpy.linalg.solve or torch.linalg.solve.
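For example, a minimal sketch of the direct solve in numpy, assuming the design matrix Phi and targets y have already been constructed from one of the bases above:

import numpy as np

# Solve (Phi^T Phi) w = Phi^T y without forming an explicit inverse.
w = np.linalg.solve(Phi.T@Phi, Phi.T@y)
# Maximum likelihood estimate of the noise variance from the residuals.
sigma2 = np.sum((y - Phi@w)**2)/y.shape[0]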
But the correct and more stable approach is to make use of the QR decomposition.
Solution with QR Decomposition
Performing a solve instead of a matrix inverse is the more numerically stable approach, but we can do even better. A QR decomposition of a matrix factorizes it into the product of an orthogonal matrix \(\mathbf{Q}\), for which \(\mathbf{Q}^\top \mathbf{Q} = \mathbf{I}\), and an upper triangular matrix \(\mathbf{R}\). Starting from \[ \boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\mathbf{ w}= \boldsymbol{ \Phi}^\top \mathbf{ y} \] we substitute \(\boldsymbol{ \Phi}= \mathbf{Q}\mathbf{R}\) so we have \[ (\mathbf{Q}\mathbf{R})^\top (\mathbf{Q}\mathbf{R})\mathbf{ w}= (\mathbf{Q}\mathbf{R})^\top \mathbf{ y} \] \[ \mathbf{R}^\top (\mathbf{Q}^\top \mathbf{Q}) \mathbf{R} \mathbf{ w}= \mathbf{R}^\top \mathbf{Q}^\top \mathbf{ y} \]
\[ \mathbf{R}^\top \mathbf{R} \mathbf{ w}= \mathbf{R}^\top \mathbf{Q}^\top \mathbf{ y} \] \[ \mathbf{R} \mathbf{ w}= \mathbf{Q}^\top \mathbf{ y} \] which leaves us with an upper triangular system to solve.
This is a more numerically stable solution because it removes the need to compute \(\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}\) as an intermediate. Computing \(\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}\) is a bad idea because it involves squaring all the elements of \(\boldsymbol{ \Phi}\) and thereby potentially reducing the numerical precision with which we can represent the solution. Operating on \(\boldsymbol{ \Phi}\) directly preserves the numerical precision of the model.
This is particularly important when we work with basis functions, as we do in this session. Some systems that can be solved with the QR decomposition cannot be solved by using a direct solve on the normal equations.
import scipy as sp

# QR decomposition of the design matrix Phi (e.g. built with quadratic(x) above).
Q, R = np.linalg.qr(Phi)
# Solve the upper triangular system R w = Q^T y.
w = sp.linalg.solve_triangular(R, Q.T@y)
w
Polynomial Fits to Olympic Data
import numpy as np
import pods

basis = mlai.polynomial

data = pods.datasets.olympic_marathon_men()

x = data['X']
y = data['Y']

xlim = [1892, 2020]

basis = mlai.Basis(mlai.polynomial, number=1, data_limits=xlim)
import mlai

x = data['X']
y = data['Y']

xlim = [1892, 2020]
max_basis = 27

ll = np.array([np.nan]*(max_basis))
sum_squares = np.array([np.nan]*(max_basis))
basis = mlai.Basis(mlai.polynomial, number=1, data_limits=xlim)
Non-linear but Linear in the Parameters
One rather nice aspect of our model is that whilst it is non-linear in the inputs, it is still linear in the parameters \(\mathbf{ w}\). This means that our derivations from before still apply, allowing us to work with this model. In fact, although this is a non-linear regression, it is still known as a linear model because it is linear in the parameters,
\[ f(\mathbf{ x}) = \mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}) \] where the vector \(\mathbf{ x}\) appears inside the basis functions, making our result, \(f(\mathbf{ x})\) non-linear in the inputs, but \(\mathbf{ w}\) appears outside our basis function, making our result linear in the parameters. In practice, our basis function itself may contain its own set of parameters, \[ f(\mathbf{ x}) = \mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}; \boldsymbol{\theta}), \] that we’ve denoted here as \(\boldsymbol{\theta}\). If these parameters appear inside the basis function then our model is non-linear in these parameters.
Exercise 3
For the following prediction functions state whether the model is linear in the inputs, the parameters or both.
\(f(x) = w_1x_1 + w_2\)
\(f(x) = w_1\exp(x_1) + w_2x_2 + w_3\)
\(f(x) = \log(x_1^{w_1}) + w_2x_2^2 + w_3\)
\(f(x) = \exp(-\sum_i(x_i - w_i)^2)\)
\(f(x) = \exp(-\mathbf{ w}^\top \mathbf{ x})\)
Fitting the Model Yourself
You now have everything you need to fit a non-linear (in the inputs) basis function model to the marathon data.
Exercise 4
Choose one of the basis functions you have explored above. Compute
the design matrix on the covariates (or input data), x
. Use
the design matrix and the response variable y
to solve the
following linear system for the model parameters w. \[
\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}\mathbf{ w}= \boldsymbol{ \Phi}^\top \mathbf{ y}
\] Compute the corresponding error on the training data. How does it compare to the error you were able to achieve fitting the basis by hand above? Plot the form of your prediction function from the least squares estimate alongside the form of the prediction function you fitted by hand.
Lecture on Basis Functions from GPRS Uganda
Use of QR Decomposition for Numerical Stability
In the last session we showed how, rather than computing \(\mathbf{X}^\top\mathbf{X}\) as an intermediate step to our solution, we could compute the solution to the regression directly through the QR decomposition. Now we will consider an example with non-linear basis functions where such computation is critical for forming the right answer.
TODO example with polynomials.
import numpy as np

x = np.random.normal(size=(10, 1))
Phi = mlai.fourier(x, 5)
np.dot(Phi.T, Phi)
Phi*Phi
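As a sketch of the kind of polynomial example the TODO above refers to (the degree and year values here are illustrative assumptions): a high-degree polynomial basis evaluated on raw year values gives a design matrix whose Gram matrix \(\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}\) is far worse conditioned than \(\boldsymbol{ \Phi}\) itself, which is why the QR route, operating on \(\boldsymbol{ \Phi}\) directly, remains usable.

import numpy as np

# Illustrative polynomial design matrix on raw Olympic years.
x = np.linspace(1896, 2020, 30)[:, np.newaxis]
Phi = np.hstack([x**j for j in range(8)])

# The condition number of Phi^T Phi is roughly the square of that of Phi,
# so forming it explicitly throws away numerical precision.
print(np.linalg.cond(Phi))
print(np.linalg.cond(Phi.T@Phi))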
Further Reading
Section 1.4 of Rogers and Girolami (2011)
Chapter 1, pg 1-6 of Bishop (2006)
Chapter 3, Section 3.1 up to pg 143 of Bishop (2006)
Thanks!
For more information on these subjects and more you might want to check the following resources.
- twitter: @lawrennd
- podcast: The Talking Machines
- newspaper: Guardian Profile Page
- blog: http://inverseprobability.com