Gaussian Processes

Neil D. Lawrence

Dedan Kimathi University, Nyeri, Kenya

Review

  • Yesterday: Bayesian regression
  • Today:
    • Gaussian processes: non-parametric Bayesian modelling

Multivariate Gaussian Properties

Recall Univariate Gaussian Properties

  1. Sum of Gaussian variables is also Gaussian.

\[y_i \sim \mathscr{N}\left(\mu_i,\sigma_i^2\right)\]

\[\sum_{i=1}^{n} y_i \sim \mathscr{N}\left(\sum_{i=1}^n\mu_i,\sum_{i=1}^n\sigma_i^2\right)\]

Recall Univariate Gaussian Properties

  2. Scaling a Gaussian leads to a Gaussian.

\[y\sim \mathscr{N}\left(\mu,\sigma^2\right)\]

\[wy\sim \mathscr{N}\left(w\mu,w^2 \sigma^2\right)\]
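A quick Monte Carlo sketch of both properties; the values of \(\mu\), \(\sigma\) and \(w\) below are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100000

# Sum property: y1 + y2 with y_i ~ N(mu_i, sigma_i^2)
mu1, sigma1 = 1.0, 0.5
mu2, sigma2 = -2.0, 1.5
y1 = rng.normal(mu1, sigma1, n_samples)
y2 = rng.normal(mu2, sigma2, n_samples)
s = y1 + y2
print(s.mean(), mu1 + mu2)              # empirical vs theoretical mean
print(s.var(), sigma1**2 + sigma2**2)   # empirical vs theoretical variance

# Scaling property: w*y with y ~ N(mu, sigma^2)
w, mu, sigma = 3.0, 0.2, 0.8
y = rng.normal(mu, sigma, n_samples)
print((w * y).mean(), w * mu)
print((w * y).var(), w**2 * sigma**2)
```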

Multivariate Consequence

If

\[\mathbf{ x}\sim \mathscr{N}\left(\boldsymbol{ \mu},\mathbf{C}\right)\]

And

\[\mathbf{ y}= \mathbf{W}\mathbf{ x}\]

Then

\[\mathbf{ y}\sim \mathscr{N}\left(\mathbf{W}\boldsymbol{ \mu},\mathbf{W}\mathbf{C}\mathbf{W}^\top\right)\]
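A small numpy sketch, under illustrative choices of \(\boldsymbol{\mu}\), \(\mathbf{C}\) and \(\mathbf{W}\), checking that the empirical mean and covariance of \(\mathbf{y} = \mathbf{W}\mathbf{x}\) match \(\mathbf{W}\boldsymbol{\mu}\) and \(\mathbf{W}\mathbf{C}\mathbf{W}^\top\).

```python
import numpy as np

rng = np.random.default_rng(1)

# x ~ N(mu, C) in 3 dimensions, y = W x in 2 dimensions (sizes are illustrative)
mu = np.array([0.5, -1.0, 2.0])
A = rng.standard_normal((3, 3))
C = A @ A.T + 1e-6 * np.eye(3)             # a valid covariance matrix
W = rng.standard_normal((2, 3))

x = rng.multivariate_normal(mu, C, size=100000)
y = x @ W.T                                # each row is y = W x

print(y.mean(axis=0), W @ mu)              # mean should match W mu
print(np.cov(y.T), W @ C @ W.T)            # covariance should match W C W^T
```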

Linear Gaussian Models

  1. Linear Gaussian models are easier to deal with.
  2. Even the parameters within the process can be handled by considering a particular limit.

Linear Model Overview

  • Set each activation function computed at each data point to be

\[ \phi_{i,j} = \phi(\mathbf{ w}^{(1)}_{j}, \mathbf{ x}_{i}) \] Define design matrix \[ \boldsymbol{ \Phi}= \begin{bmatrix} \phi_{1, 1} & \phi_{1, 2} & \dots & \phi_{1, h} \\ \phi_{2, 1} & \phi_{2, 2} & \dots & \phi_{2, h} \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{n, 1} & \phi_{n, 2} & \dots & \phi_{n, h} \end{bmatrix}. \]
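A minimal sketch of building \(\boldsymbol{\Phi}\), assuming Gaussian (RBF) basis functions with fixed centres as the activation; the function name `design_matrix` and the choice of basis are illustrative, not from the slides.

```python
import numpy as np

def design_matrix(X, centres, width=1.0):
    """Compute Phi with Phi[i, j] = phi(centre_j, x_i).

    phi is taken here to be a Gaussian (RBF) basis function; the slides
    leave the activation unspecified, so this choice is illustrative.
    """
    # squared distances between each data point and each basis centre
    sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * sq_dist / width**2)

X = np.linspace(-3, 3, 10)[:, None]        # n = 10 one-dimensional inputs
centres = np.linspace(-3, 3, 5)[:, None]   # h = 5 basis-function centres
Phi = design_matrix(X, centres)
print(Phi.shape)                            # (n, h) = (10, 5)
```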

Matrix Representation of a Neural Network

\[y\left(\mathbf{ x}\right) = \boldsymbol{ \phi}\left(\mathbf{ x}\right)^\top \mathbf{ w}+ \epsilon\]

\[\mathbf{ y}= \boldsymbol{ \Phi}\mathbf{ w}+ \boldsymbol{ \epsilon}\]

\[\boldsymbol{ \epsilon}\sim \mathscr{N}\left(\mathbf{0},\sigma^2\mathbf{I}\right)\]

Multivariate Gaussian Properties

  • If \[ \mathbf{ y}= \mathbf{W}\mathbf{ x}+ \boldsymbol{ \epsilon}, \]

  • Assume \[ \begin{align} \mathbf{ x}& \sim \mathscr{N}\left(\boldsymbol{ \mu},\mathbf{C}\right)\\ \boldsymbol{ \epsilon}& \sim \mathscr{N}\left(\mathbf{0},\boldsymbol{ \Sigma}\right) \end{align} \]

  • Then \[ \mathbf{ y}\sim \mathscr{N}\left(\mathbf{W}\boldsymbol{ \mu},\mathbf{W}\mathbf{C}\mathbf{W}^\top + \boldsymbol{ \Sigma}\right). \] If \(\boldsymbol{ \Sigma}=\sigma^2\mathbf{I}\), this is Probabilistic PCA (Tipping and Bishop, 1999).

Prior Density

  • Define \[ \mathbf{ w}\sim \mathscr{N}\left(\mathbf{0},\alpha\mathbf{I}\right), \]
  • Use the rules of multivariate Gaussians to see that \[ \mathbf{ y}\sim \mathscr{N}\left(\mathbf{0},\alpha \boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top + \sigma^2 \mathbf{I}\right). \]

\[ \mathbf{K}= \alpha \boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top + \sigma^2 \mathbf{I}. \]
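A short sketch computing \(\mathbf{K}\) and drawing a sample of \(\mathbf{y}\) from the marginal prior; the values of \(n\), \(h\), \(\alpha\) and \(\sigma^2\), and the random \(\boldsymbol{\Phi}\), are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# A design matrix Phi of shape (n, h); alpha and sigma2 are illustrative values
n, h = 10, 5
Phi = rng.standard_normal((n, h))
alpha, sigma2 = 1.0, 0.01

# Marginal prior covariance after integrating out w
K = alpha * Phi @ Phi.T + sigma2 * np.eye(n)

# Draw one sample of y from N(0, K)
y = rng.multivariate_normal(np.zeros(n), K)
print(y)
```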

Joint Gaussian Density

  • Elements are given by a covariance function, \(k_{i,j} = k\left(\mathbf{ x}_i, \mathbf{ x}_j\right)\)

\[ \mathbf{K}= \alpha \boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top + \sigma^2 \mathbf{I}. \]

Covariance Function

\[ k_f\left(\mathbf{ x}_i, \mathbf{ x}_j\right) = \alpha \boldsymbol{ \phi}\left(\mathbf{W}_1, \mathbf{ x}_i\right)^\top \boldsymbol{ \phi}\left(\mathbf{W}_1, \mathbf{ x}_j\right) \]

  • formed by inner products of the rows of the design matrix (see the sketch below).
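A small numerical check that filling the covariance element by element with \(k_f\) agrees with the matrix form \(\alpha\boldsymbol{\Phi}\boldsymbol{\Phi}^\top\); the random \(\boldsymbol{\Phi}\) and the value of \(\alpha\) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Rows of Phi play the role of the basis-function vectors phi(W_1, x_i)
n, h = 6, 4
Phi = rng.standard_normal((n, h))
alpha = 2.0

def k_f(phi_i, phi_j):
    # covariance function as an inner product of basis-function vectors
    return alpha * phi_i @ phi_j

K_elementwise = np.array([[k_f(Phi[i], Phi[j]) for j in range(n)] for i in range(n)])
K_matrix = alpha * Phi @ Phi.T
print(np.allclose(K_elementwise, K_matrix))   # True
```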

Gaussian Process

  • Instead of assuming the density over each data point \(y_i\) factorises i.i.d.,

  • we make a joint Gaussian assumption over our data.

  • covariance matrix is now a function of both the parameters of the activation function, \(\mathbf{W}_1\), and the input variables, \(\mathbf{X}\).

  • Arises from integrating out \(\mathbf{ w}^{(2)}\).

Basis Functions

  • Can be very complex, such as deep kernels (Cho and Saul, 2009), or one could even put a convolutional neural network inside.
  • Viewing a neural network in this way is also what allows us to perform sensible batch normalization (Ioffe and Szegedy, 2015).

Gaussian Processes

  • Basis function models give non-linear predictions.
  • Need to choose number and location of basis functions.
  • Gaussian processes are a general framework (basis function models are a special case)
  • Within the framework you can consider models with infinite basis functions.

Gaussian Processes

\[ p(\mathbf{ y}|\mathbf{X}, \mathbf{ w}) = \prod_{i=1}^{n} p(y_i | \mathbf{ x}_i, \mathbf{ w}) \]

\[ \mathbf{ y}|\mathbf{X}\sim \mathscr{N}\left(\mathbf{m}(\mathbf{X}),\mathbf{K}(\mathbf{X})\right), \]

Sampling a Function

Multi-variate Gaussians

  • We will consider a Gaussian with a particular structure of covariance matrix.
  • Generate a single sample from this 25-dimensional Gaussian density, \[ \mathbf{ f}=\left[f_{1},f_{2}\dots f_{25}\right]. \]
  • We will plot these points against their index (see the sketch below).
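A sketch of this sampling step, assuming the exponentiated quadratic covariance introduced later in the slides; the input grid, \(\alpha\) and lengthscale values are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)

# 25 one-dimensional inputs; covariance built with the exponentiated quadratic
# kernel (alpha and lengthscale are illustrative choices)
x = np.linspace(-1, 1, 25)[:, None]
alpha, lengthscale = 1.0, 0.3
sq_dist = (x - x.T) ** 2
K = alpha * np.exp(-sq_dist / (2 * lengthscale**2))

# A single sample f = [f_1, ..., f_25] from N(0, K); jitter keeps K positive definite
f = rng.multivariate_normal(np.zeros(25), K + 1e-8 * np.eye(25))

# Plot the sampled values against their index
plt.plot(np.arange(1, 26), f, 'o-')
plt.xlabel('index $i$')
plt.ylabel('$f_i$')
plt.show()
```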

Sampling a Function from a Gaussian

Joint Density of \(f_1\) and \(f_2\)

Prediction of \(f_{2}\) from \(f_{1}\)

Uluru

Prediction of \(f_2\) from \(f_1\)

  • Conditional density is also Gaussian (see the numerical sketch below). \[ p(f_2|f_1) = \mathscr{N}\left(f_2|\frac{k_{1, 2}}{k_{1, 1}}f_1, k_{2, 2} - \frac{k_{1,2}^2}{k_{1,1}}\right) \] where the covariance of the joint density is given by \[ \mathbf{K}= \begin{bmatrix} k_{1, 1} & k_{1, 2}\\ k_{2, 1} & k_{2, 2}\end{bmatrix}. \]
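A numerical sketch of this conditional; the covariance matrix and the observed value of \(f_1\) are illustrative.

```python
import numpy as np

# Joint covariance of (f_1, f_2); the numbers are illustrative
K = np.array([[1.0, 0.8],
              [0.8, 1.0]])
f1 = 0.5                                    # an observed value for f_1

# Conditional p(f_2 | f_1) from the formula above
cond_mean = K[0, 1] / K[0, 0] * f1
cond_var = K[1, 1] - K[0, 1]**2 / K[0, 0]
print(cond_mean, cond_var)
```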

Joint Density of \(f_1\) and \(f_8\)

Prediction of \(f_{8}\) from \(f_{1}\)

Details

  • The single contour of the Gaussian density represents the joint distribution, \(p(f_1, f_8)\).
  • We observe a value for \(f_1\).
  • Conditional density: \(p(f_8|f_1)\).

Multivariate Conditional Density

  • Prediction of \(\mathbf{ f}_*\) from \(\mathbf{ f}\).

  • Multivariate conditional density is also Gaussian. \[ p(\mathbf{ f}_*|\mathbf{ f}) = \mathscr{N}\left(\mathbf{ f}_*|\mathbf{K}_{*,\mathbf{ f}}\mathbf{K}_{\mathbf{ f},\mathbf{ f}}^{-1}\mathbf{ f},\mathbf{K}_{*,*}-\mathbf{K}_{*,\mathbf{ f}} \mathbf{K}_{\mathbf{ f},\mathbf{ f}}^{-1}\mathbf{K}_{\mathbf{ f},*}\right) \]

  • Here the covariance of the joint density is given by \[ \mathbf{K}= \begin{bmatrix} \mathbf{K}_{\mathbf{ f}, \mathbf{ f}} & \mathbf{K}_{\mathbf{ f}, *}\\ \mathbf{K}_{*, \mathbf{ f}} & \mathbf{K}_{*, *}\end{bmatrix} \]

Multivariate Conditional Density

  • Prediction of \(\mathbf{ f}_*\) from \(\mathbf{ f}\).

  • Multivariate conditional density is also Gaussian. \[ p(\mathbf{ f}_*|\mathbf{ f}) = \mathscr{N}\left(\mathbf{ f}_*|\boldsymbol{ \mu},\boldsymbol{ \Sigma}\right) \] \[ \boldsymbol{ \mu}= \mathbf{K}_{*,\mathbf{ f}}\mathbf{K}_{\mathbf{ f},\mathbf{ f}}^{-1}\mathbf{ f} \] \[ \boldsymbol{ \Sigma}= \mathbf{K}_{*,*}-\mathbf{K}_{*,\mathbf{ f}} \mathbf{K}_{\mathbf{ f},\mathbf{ f}}^{-1}\mathbf{K}_{\mathbf{ f},*} \]

Prediction with Correlated Gaussians

  • Here the covariance of the joint density is given by \[ \mathbf{K}= \begin{bmatrix} \mathbf{K}_{\mathbf{ f}, \mathbf{ f}} & \mathbf{K}_{\mathbf{ f}, *}\\ \mathbf{K}_{*, \mathbf{ f}} & \mathbf{K}_{*, *}\end{bmatrix} \] (a sketch of this prediction follows below).
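A sketch of this prediction, assuming the exponentiated quadratic covariance defined on the next slide; the training inputs, observed values \(\mathbf{f}\) and test inputs are illustrative.

```python
import numpy as np

def kernel(X1, X2, alpha=1.0, lengthscale=0.3):
    # exponentiated quadratic covariance (see the next slide)
    sq_dist = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return alpha * np.exp(-sq_dist / (2 * lengthscale**2))

# Training inputs with observed function values f, and test inputs x_*
X = np.array([[-0.5], [0.0], [0.7]])
f = np.array([0.2, 1.0, -0.3])
X_star = np.linspace(-1, 1, 5)[:, None]

K_ff = kernel(X, X)
K_sf = kernel(X_star, X)
K_ss = kernel(X_star, X_star)

# Conditional mean and covariance of f_* given f
K_ff_inv = np.linalg.inv(K_ff + 1e-8 * np.eye(len(X)))   # jitter for stability
mu = K_sf @ K_ff_inv @ f
Sigma = K_ss - K_sf @ K_ff_inv @ K_sf.T
print(mu)
print(np.diag(Sigma))
```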

Where Did This Covariance Matrix Come From?

\[ k(\mathbf{ x}, \mathbf{ x}^\prime) = \alpha \exp\left(-\frac{\left\Vert \mathbf{ x}- \mathbf{ x}^\prime\right\Vert^2_2}{2\ell^2}\right)\]

  • Covariance matrix is built using the inputs to the function, \(\mathbf{ x}\).

  • For the example above it was based on Euclidean distance.

  • The covariance function is also known as a kernel.

Computing Covariance
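A sketch of computing the covariance matrix element by element from the covariance function above; the input values, \(\alpha\) and lengthscale are illustrative.

```python
import numpy as np

def eq_cov(x_i, x_j, alpha=1.0, lengthscale=0.5):
    """Exponentiated quadratic covariance between two input vectors."""
    return alpha * np.exp(-np.sum((x_i - x_j) ** 2) / (2 * lengthscale**2))

# A few one-dimensional inputs (values are illustrative)
X = np.array([[-3.0], [1.2], [1.4], [2.0]])
n = X.shape[0]

# Fill the covariance matrix element by element, k_ij = k(x_i, x_j)
K = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        K[i, j] = eq_cov(X[i], X[j])
print(K)
```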

Thanks!

References

Cho, Y., Saul, L.K., 2009. Kernel methods for deep learning, in: Bengio, Y., Schuurmans, D., Lafferty, J.D., Williams, C.K.I., Culotta, A. (Eds.), Advances in Neural Information Processing Systems 22. Curran Associates, Inc., pp. 342–350.
Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: Bach, F., Blei, D. (Eds.), Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, Lille, France, pp. 448–456.
Tipping, M.E., Bishop, C.M., 1999. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B 61, 611–622. https://doi.org/10.1111/1467-9868.00196