Basis Functions

Nonlinear Regression

  • Problem with Linear Regression: \(\mathbf{ x}\) may not be linearly related to \(\mathbf{ y}\).
  • Potential solution: create a feature space by defining \(\phi(\mathbf{ x})\), where \(\phi(\cdot)\) is a nonlinear function of \(\mathbf{ x}\).
  • The model for the target is then a linear combination of these nonlinear functions \[f(\mathbf{ x}) = \sum_{j=1}^mw_j \phi_j(\mathbf{ x})\]

Non-linear in the Inputs

In the linear model the prediction is a linear function of the inputs, \[ f(\mathbf{ x}) = \mathbf{ w}^\top \mathbf{ x}. \] Replacing \(\mathbf{ x}\) with \(\boldsymbol{ \phi}(\mathbf{ x})\) makes the prediction non-linear in the inputs while keeping it linear in the parameters \(\mathbf{ w}\).

Basis Functions


Quadratic Basis

  • Basis functions can be global. E.g. quadratic basis: \[ \boldsymbol{ \phi}= [1, x, x^2] \]

\[ \begin{align*} \phi_1(x) & = 1, \\ \phi_2(x) & = x, \\ \phi_3(x) & = x^2. \end{align*} \]

\[ \boldsymbol{ \phi}(x) = \begin{bmatrix} 1\\ x\\ x^2\end{bmatrix}. \]

Matrix Valued Function

\[ \boldsymbol{ \Phi}(\mathbf{ x}) = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2\\ \vdots & \vdots & \vdots \\ 1 & x_n& x_n^2 \end{bmatrix} \]
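As a concrete sketch, the quadratic design matrix can be built with a few lines of numpy (the helper name below is illustrative, not from the lecture code):

```python
import numpy as np

def quadratic_basis(x):
    """Map a column of inputs to the quadratic basis [1, x, x**2]."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    return np.hstack([np.ones_like(x), x, x**2])

# Each row of Phi is the basis vector for one input x_i.
Phi = quadratic_basis([0.0, 1.0, 2.0])   # shape (3, 3)
```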

Functions Derived from Quadratic Basis

\[ f(x) = {\color{cyan}{w_0}} + {\color{green}{w_1 x}} + {\color{yellow}{w_2 x^2}} \]


Different Bases

\[ \phi_j(x_i) = x_i^j \]

Polynomial Basis

\[ \phi_j(x) = x^j \]

Functions Derived from Polynomial Basis

\[ f(x) = {\color{cyan}{w_0}} + {\color{green}{w_1 x}} + {\color{yellow}{w_2 x^2}} + {\color{magenta}{w_3 x^3}} + {\color{red}{w_4 x^4}} \]

Different Bases

Radial Basis Functions

  • Basis functions can be local e.g. radial (or Gaussian) basis \[ \phi_j(x) = \exp\left(-\frac{(x-\mu_j)^2}{\ell^2}\right) \]
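A minimal numpy sketch of evaluating this basis on a grid of inputs (the function name and centres are illustrative):

```python
import numpy as np

def radial_basis(x, centres, lengthscale):
    """Gaussian bumps exp(-(x - mu_j)**2 / ell**2), one column per centre."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)              # (n, 1)
    centres = np.asarray(centres, dtype=float).reshape(1, -1)  # (1, m)
    return np.exp(-(x - centres) ** 2 / lengthscale ** 2)

# Centres at -1, 0, 1 with ell**2 = 0.5 match the example that follows.
Phi = radial_basis(np.linspace(-2, 2, 9), centres=[-1.0, 0.0, 1.0],
                   lengthscale=np.sqrt(0.5))
```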

Functions Derived from Radial Basis

\[ f(x) = \color{cyan}{w_1 e^{-2(x+1)^2}} + \color{green}{w_2e^{-2x^2}} + \color{yellow}{w_3 e^{-2(x-1)^2}} \]

Rectified Linear Units

\[ \phi_j(x) = xH(v_j x+ v_0) \] where \(H(\cdot)\) is the Heaviside step function.
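A small numpy sketch of this basis, using a boolean mask for the Heaviside step (the hinge locations are illustrative):

```python
import numpy as np

def relu_basis(x, v, v0):
    """Hinge functions x * H(v_j x + v0_j); the boolean mask plays H(.)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)    # (n, 1)
    v = np.asarray(v, dtype=float).reshape(1, -1)    # (1, m)
    v0 = np.asarray(v0, dtype=float).reshape(1, -1)  # (1, m)
    return x * (v * x + v0 > 0)

# Hinges at x = -1, -0.33, 0.33 and 1, as in the example below.
Phi = relu_basis(np.linspace(-2, 2, 9), v=np.ones(4),
                 v0=[1.0, 0.33, -0.33, -1.0])
```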

Functions Derived from ReLU Basis

\[ f(x) = \color{cyan}{w_0} + \color{green}{w_1 xH(x+1.0) } + \color{yellow}{w_2 xH(x+0.33) } + \color{magenta}{w_3 xH(x-0.33)} + \color{red}{w_4 xH(x-1.0)} \]

Hyperbolic Tangent Basis

\[ \phi_j(x) = \tanh(v_j x+ v_0) \]

Functions Derived from Tanh Basis

\[ f(x) = {\color{cyan}{w_0}} + {\color{green}{w_1 \tanh\left(x+1\right)}} + {\color{yellow}{w_2 \tanh\left(x+0.33\right)}} + {\color{magenta}{w_3 \tanh\left(x-0.33\right)}} + {\color{red}{w_4 \tanh\left(x-1\right)}} \]

Fourier Basis

\[ f(x) = w_0 + w_1 \sin(x) + w_2 \cos(x) + w_3 \sin(2x) + w_4 \cos(2x) \]

Functions Derived from Fourier Basis

\[ f(x) = {\color{cyan}{w_0}} + {\color{green}{w_1 \sin(x)}} + {\color{yellow}{w_2 \cos(x)}} + {\color{magenta}{w_3 \sin(2x)}} + {\color{red}{w_4 \cos(2x)}} \]

Fitting to Data

Now we are going to consider how these basis functions can be adjusted to fit a particular data set. We will return to the Olympic marathon data from last time. First we will scale the output data to have zero mean and unit variance.
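A sketch of that preprocessing step, assuming the pods data package used in earlier lectures (the 'X' and 'Y' keys holding year and winning pace are an assumption about that package's interface):

```python
import numpy as np
import pods  # assumed: the pods data package used in these lectures

data = pods.datasets.olympic_marathon_men()
x, y = data['X'], data['Y']        # year, winning pace in min/km

# Standardise the targets to zero mean and unit variance.
offset = y.mean()
scale = np.sqrt(y.var())
y_hat = (y - offset) / scale
```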

Olympic Marathon Data

  • Gold medal times for Olympic Marathon since 1896.
  • Marathons before 1924 didn’t have a standardized distance.
  • Present results using pace per km.
  • In 1904 the marathon was badly organized, leading to very slow times.
Image from Wikimedia Commons http://bit.ly/16kMKHQ


Basis Function Models

  • The prediction function is now defined as \[ f(\mathbf{ x}_i) = \sum_{j=1}^m w_j \phi_{i, j}, \] where \(\phi_{i, j} = \phi_j(\mathbf{ x}_i)\).

Vector Notation

  • Writing this in vector notation, \[ f(\mathbf{ x}_i) = \mathbf{ w}^\top \boldsymbol{ \phi}_i, \] where \(\boldsymbol{ \phi}_i\) is the vector of basis functions evaluated at \(\mathbf{ x}_i\).

Log Likelihood for Basis Function Model

  • The likelihood of a single data point is \[ p\left(y_i|x_i\right)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\left(y_i-\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\right)^{2}}{2\sigma^2}\right). \]

Log Likelihood for Basis Function Model

  • Leading to a log likelihood for the data set of \[ L(\mathbf{ w},\sigma^2)= -\frac{n}{2}\log \sigma^2-\frac{n}{2}\log 2\pi -\frac{\sum_{i=1}^{n}\left(y_i-\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\right)^{2}}{2\sigma^2}. \]

Objective Function

  • And a corresponding objective function (the negative log likelihood with the constant term dropped) of the form \[ E(\mathbf{ w},\sigma^2)= \frac{n}{2}\log\sigma^2 + \frac{\sum_{i=1}^{n}\left(y_i-\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\right)^{2}}{2\sigma^2}. \]

Expand the Brackets

\[ \begin{align} E(\mathbf{ w},\sigma^2) = &\frac{n}{2}\log \sigma^2 + \frac{1}{2\sigma^2}\sum_{i=1}^{n}y_i^{2}-\frac{1}{\sigma^2}\sum_{i=1}^{n}y_i\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\\ &+\frac{1}{2\sigma^2}\sum_{i=1}^{n}\mathbf{ w}^{\top}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^{\top}\mathbf{ w}+\text{const}. \end{align} \]

Expand the Brackets

\[\begin{align} E(\mathbf{ w}, \sigma^2) = & \frac{n}{2}\log \sigma^2 + \frac{1}{2\sigma^2}\sum_{i=1}^{n}y_i^{2}-\frac{1}{\sigma^2} \mathbf{ w}^\top\sum_{i=1}^{n}\boldsymbol{ \phi}_i y_i\\ & +\frac{1}{2\sigma^2}\mathbf{ w}^{\top}\left[\sum_{i=1}^{n}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^{\top}\right]\mathbf{ w}+\text{const}.\end{align}\]

Design Matrices

  • Design matrix notation \[ \boldsymbol{ \Phi}= \begin{bmatrix} \mathbf{1} & \mathbf{ x}& \mathbf{ x}^2\end{bmatrix} \] so that \[ \boldsymbol{ \Phi}\in \Re^{n\times p}, \] where \(p\) is the number of basis functions (here \(p=3\)).

Multivariate Derivatives Reminder

  • We will need some multivariate calculus. \[\frac{\text{d}\mathbf{a}^{\top}\mathbf{ w}}{\text{d}\mathbf{ w}}=\mathbf{a}\] and \[\frac{\text{d}\mathbf{ w}^{\top}\mathbf{A}\mathbf{ w}}{\text{d}\mathbf{ w}}=\left(\mathbf{A}+\mathbf{A}^{\top}\right)\mathbf{ w}\] or if \(\mathbf{A}\) is symmetric (i.e. \(\mathbf{A}=\mathbf{A}^{\top}\)) \[\frac{\text{d}\mathbf{ w}^{\top}\mathbf{A}\mathbf{ w}}{\text{d}\mathbf{ w}}=2\mathbf{A}\mathbf{ w}.\]

Differentiate

Differentiate wrt \(\mathbf{ w}\) \[\frac{\text{d} E\left(\mathbf{ w},\sigma^2 \right)}{\text{d}\mathbf{ w}}=-\frac{1}{\sigma^2} \sum_{i=1}^{n}\boldsymbol{ \phi}_iy_i+\frac{1}{\sigma^2} \left[\sum_{i=1}^{n}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^{\top}\right]\mathbf{ w}\] Setting the derivative to zero leads to \[\mathbf{ w}^{*}=\left[\sum_{i=1}^{n}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^{\top}\right]^{-1}\sum_{i=1}^{n}\boldsymbol{ \phi}_iy_i.\]

Matrix Notation

\[ \sum_{i=1}^{n}\boldsymbol{ \phi}_i\boldsymbol{ \phi}_i^\top = \boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\] \[\sum _{i=1}^{n}\boldsymbol{ \phi}_iy_i = \boldsymbol{ \Phi}^\top \mathbf{ y} \]

Update Equations

  • Update for \(\mathbf{ w}^{*}\) \[ \mathbf{ w}^{*} = \left(\boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\right)^{-1} \boldsymbol{ \Phi}^\top \mathbf{ y} \]
  • The equation for \(\left.\sigma^2\right.^{*}\) may also be found \[ \left.\sigma^2\right.^{{*}}=\frac{\sum_{i=1}^{n}\left(y_i-\left.\mathbf{ w}^{*}\right.^{\top}\boldsymbol{ \phi}_i\right)^{2}}{n}. \]
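A minimal numpy sketch of these updates (the function name is illustrative; the solve call avoids forming the inverse explicitly, as discussed next):

```python
import numpy as np

def fit_basis_model(Phi, y):
    """Maximum likelihood w* and sigma2* for y = Phi w + Gaussian noise."""
    # Solve (Phi^T Phi) w = Phi^T y rather than inverting Phi^T Phi.
    w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
    residuals = y - Phi @ w_star
    sigma2_star = np.sum(residuals ** 2) / len(y)
    return w_star, sigma2_star
```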

Avoid Direct Inverse

  • E.g. Solve for \(\mathbf{ w}\) \[ \left(\boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\right)\mathbf{ w}= \boldsymbol{ \Phi}^\top \mathbf{ y} \]

i.e., a linear system of the standard form \[ \mathbf{A}\mathbf{x} = \mathbf{b}. \]

  • See np.linalg.solve
  • In practice use \(\mathbf{Q}\mathbf{R}\) decomposition (see lab class notes).

Solution with QR Decomposition

\[ \boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}\mathbf{ w}= \boldsymbol{ \Phi}^\top \mathbf{ y} \] substitute \(\boldsymbol{ \Phi}= \mathbf{Q}\mathbf{R}\) \[ (\mathbf{Q}\mathbf{R})^\top (\mathbf{Q}\mathbf{R})\mathbf{ w}= (\mathbf{Q}\mathbf{R})^\top \mathbf{ y} \] \[ \mathbf{R}^\top (\mathbf{Q}^\top \mathbf{Q}) \mathbf{R} \mathbf{ w}= \mathbf{R}^\top \mathbf{Q}^\top \mathbf{ y} \]

Since \(\mathbf{Q}\) has orthonormal columns, \(\mathbf{Q}^\top \mathbf{Q}= \mathbf{I}\), so \[ \mathbf{R}^\top \mathbf{R} \mathbf{ w}= \mathbf{R}^\top \mathbf{Q}^\top \mathbf{ y} \] \[ \mathbf{R} \mathbf{ w}= \mathbf{Q}^\top \mathbf{ y} \]

  • More numerically stable.
  • Avoids the intermediate computation of \(\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}\).
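A sketch of the QR route in numpy/scipy (illustrative; see the lab class notes for the full version):

```python
import numpy as np
from scipy.linalg import solve_triangular

def solve_via_qr(Phi, y):
    """Least-squares weights from the thin QR decomposition of Phi."""
    Q, R = np.linalg.qr(Phi)             # Phi = Q R with Q^T Q = I
    return solve_triangular(R, Q.T @ y)  # back-substitution on R w = Q^T y
```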

Polynomial Fits to Olympic Data

Non-linear but Linear in the Parameters

  • Model is non-linear, but linear in parameters \[ f(\mathbf{ x}) = \mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}) \]
  • \(\mathbf{ x}\) is inside the non-linearity, but \(\mathbf{ w}\) is outside. \[ f(\mathbf{ x}) = \mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}; \boldsymbol{\theta}), \] where \(\boldsymbol{\theta}\) denotes any parameters of the basis functions themselves (e.g. the centres and lengthscale of the radial basis).

Lecture on Basis Functions from GPRS Uganda

Use of QR Decomposition for Numerical Stability

As a sketch of how the difference can be seen with a polynomial basis (illustrative code below, not the lab-class version): with a moderately high polynomial degree the matrix \(\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}\) becomes badly conditioned, because forming it squares the condition number of \(\boldsymbol{ \Phi}\), while the QR route works with \(\boldsymbol{ \Phi}\) directly.
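```python
import numpy as np

def polynomial_basis(x, degree):
    """Columns [1, x, x**2, ..., x**degree] for a column of inputs."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    return np.hstack([x ** j for j in range(degree + 1)])

# Stand-in inputs and targets, purely to illustrate the conditioning.
x = np.linspace(0.0, 1.0, 30)
y = np.sin(10 * x).reshape(-1, 1)
Phi = polynomial_basis(x, degree=6)

print(np.linalg.cond(Phi))          # condition number of Phi
print(np.linalg.cond(Phi.T @ Phi))  # roughly its square: much worse

Q, R = np.linalg.qr(Phi)
w_qr = np.linalg.solve(R, Q.T @ y)               # QR route
w_ne = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # normal equations, for comparison
```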

Further Reading

  • Section 1.4 of Rogers and Girolami (2011)

  • Chapter 1, pg 1-6 of Bishop (2006)

  • Chapter 3, Section 3.1 up to pg 143 of Bishop (2006)

Thanks!

References

Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Springer.
Rogers, S., Girolami, M., 2011. A First Course in Machine Learning. CRC Press.