Gaussian Processes

Neil D. Lawrence

Dedan Kimathi University, Nyeri, Kenya

Review

  • Yesterday: Bayesian regression
  • Today:
    • Gaussian processes: non-parametric Bayesian modelling

Multivariate Gaussian Properties

Recall Univariate Gaussian Properties

  1. Sum of Gaussian variables is also Gaussian.

\[y_i \sim \mathscr{N}\left(\mu_i,\sigma_i^2\right)\]

\[\sum_{i=1}^{n} y_i \sim \mathscr{N}\left(\sum_{i=1}^n\mu_i,\sum_{i=1}^n\sigma_i^2\right)\]

Recall Univariate Gaussian Properties

  2. Scaling a Gaussian leads to a Gaussian.

\[y\sim \mathscr{N}\left(\mu,\sigma^2\right)\]

\[wy\sim \mathscr{N}\left(w\mu,w^2 \sigma^2\right)\]
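A quick Monte Carlo sketch of both properties; the values of \(\mu\), \(\sigma\) and \(w\) below are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100000

# Sum property: y1 + y2 with y_i ~ N(mu_i, sigma_i^2)
mu1, sigma1 = 1.0, 0.5
mu2, sigma2 = -2.0, 1.5
y1 = rng.normal(mu1, sigma1, n_samples)
y2 = rng.normal(mu2, sigma2, n_samples)
s = y1 + y2
print(s.mean(), mu1 + mu2)              # empirical vs theoretical mean
print(s.var(), sigma1**2 + sigma2**2)   # empirical vs theoretical variance

# Scaling property: w*y with y ~ N(mu, sigma^2)
w, mu, sigma = 3.0, 0.2, 0.8
y = rng.normal(mu, sigma, n_samples)
print((w * y).mean(), w * mu)
print((w * y).var(), w**2 * sigma**2)
```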

Multivariate Consequence

If

\[\mathbf{ x}\sim \mathscr{N}\left(\boldsymbol{ \mu},\mathbf{C}\right)\]

And

\[\mathbf{ y}= \mathbf{W}\mathbf{ x}\]

Then

\[\mathbf{ y}\sim \mathscr{N}\left(\mathbf{W}\boldsymbol{ \mu},\mathbf{W}\mathbf{C}\mathbf{W}^\top\right)\]
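A small numpy sketch, under illustrative choices of \(\boldsymbol{\mu}\), \(\mathbf{C}\) and \(\mathbf{W}\), checking that the empirical mean and covariance of \(\mathbf{y} = \mathbf{W}\mathbf{x}\) match \(\mathbf{W}\boldsymbol{\mu}\) and \(\mathbf{W}\mathbf{C}\mathbf{W}^\top\).

```python
import numpy as np

rng = np.random.default_rng(1)

# x ~ N(mu, C) in 3 dimensions, y = W x in 2 dimensions (sizes are illustrative)
mu = np.array([0.5, -1.0, 2.0])
A = rng.standard_normal((3, 3))
C = A @ A.T + 1e-6 * np.eye(3)             # a valid covariance matrix
W = rng.standard_normal((2, 3))

x = rng.multivariate_normal(mu, C, size=100000)
y = x @ W.T                                # each row is y = W x

print(y.mean(axis=0), W @ mu)              # mean should match W mu
print(np.cov(y.T), W @ C @ W.T)            # covariance should match W C W^T
```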

Linear Gaussian Models

  1. Linear Gaussian models are easier to deal with.
  2. Even the parameters within the process can be handled by considering a particular limit.

Linear Model Overview

  • Set each activation function computed at each data point to be

\[ \phi_{i,j} = \phi(\mathbf{ w}^{(1)}_{j}, \mathbf{ x}_{i}) \] Define design matrix \[ \boldsymbol{ \Phi}= \begin{bmatrix} \phi_{1, 1} & \phi_{1, 2} & \dots & \phi_{1, h} \\ \phi_{2, 1} & \phi_{2, 2} & \dots & \phi_{2, h} \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{n, 1} & \phi_{n, 2} & \dots & \phi_{n, h} \end{bmatrix}. \]
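A minimal sketch of building \(\boldsymbol{\Phi}\), assuming Gaussian (RBF) basis functions with fixed centres as the activation; the function name `design_matrix` and the choice of basis are illustrative, not from the slides.

```python
import numpy as np

def design_matrix(X, centres, width=1.0):
    """Compute Phi with Phi[i, j] = phi(centre_j, x_i).

    phi is taken here to be a Gaussian (RBF) basis function; the slides
    leave the activation unspecified, so this choice is illustrative.
    """
    # squared distances between each data point and each basis centre
    sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * sq_dist / width**2)

X = np.linspace(-3, 3, 10)[:, None]        # n = 10 one-dimensional inputs
centres = np.linspace(-3, 3, 5)[:, None]   # h = 5 basis-function centres
Phi = design_matrix(X, centres)
print(Phi.shape)                            # (n, h) = (10, 5)
```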

Matrix Representation of a Neural Network

\[y\left(\mathbf{ x}\right) = \boldsymbol{ \phi}\left(\mathbf{ x}\right)^\top \mathbf{ w}+ \epsilon\]

\[\mathbf{ y}= \boldsymbol{ \Phi}\mathbf{ w}+ \boldsymbol{ \epsilon}\]

\[\boldsymbol{ \epsilon}\sim \mathscr{N}\left(\mathbf{0},\sigma^2\mathbf{I}\right)\]

Multivariate Gaussian Properties

  • If \[ \mathbf{ y}= \mathbf{W}\mathbf{ x}+ \boldsymbol{ \epsilon}, \]

  • Assume \[ \begin{align} \mathbf{ x}& \sim \mathscr{N}\left(\boldsymbol{ \mu},\mathbf{C}\right)\\ \boldsymbol{ \epsilon}& \sim \mathscr{N}\left(\mathbf{0},\boldsymbol{ \Sigma}\right) \end{align} \]

  • Then \[ \mathbf{ y}\sim \mathscr{N}\left(\mathbf{W}\boldsymbol{ \mu},\mathbf{W}\mathbf{C}\mathbf{W}^\top + \boldsymbol{ \Sigma}\right). \] If \(\boldsymbol{ \Sigma}=\sigma^2\mathbf{I}\), this is Probabilistic PCA (Tipping and Bishop, 1999).

Prior Density

  • Define \[ \mathbf{ w}\sim \mathscr{N}\left(\mathbf{0},\alpha\mathbf{I}\right), \]
  • Use the rules of multivariate Gaussians to see that \[ \mathbf{ y}\sim \mathscr{N}\left(\mathbf{0},\alpha \boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top + \sigma^2 \mathbf{I}\right). \]

\[ \mathbf{K}= \alpha \boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top + \sigma^2 \mathbf{I}. \]
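A short sketch computing \(\mathbf{K}\) and drawing a sample of \(\mathbf{y}\) from the marginal prior; the values of \(n\), \(h\), \(\alpha\) and \(\sigma^2\), and the random \(\boldsymbol{\Phi}\), are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# A design matrix Phi of shape (n, h); alpha and sigma2 are illustrative values
n, h = 10, 5
Phi = rng.standard_normal((n, h))
alpha, sigma2 = 1.0, 0.01

# Marginal prior covariance after integrating out w
K = alpha * Phi @ Phi.T + sigma2 * np.eye(n)

# Draw one sample of y from N(0, K)
y = rng.multivariate_normal(np.zeros(n), K)
print(y)
```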

Joint Gaussian Density

  • Elements are given by a covariance function, \(k_{i,j} = k\left(\mathbf{ x}_i, \mathbf{ x}_j\right)\)

\[ \mathbf{K}= \alpha \boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top + \sigma^2 \mathbf{I}. \]

Covariance Function

\[ k_f\left(\mathbf{ x}_i, \mathbf{ x}_j\right) = \alpha \boldsymbol{ \phi}\left(\mathbf{W}_1, \mathbf{ x}_i\right)^\top \boldsymbol{ \phi}\left(\mathbf{W}_1, \mathbf{ x}_j\right) \]

  • formed by inner products of the rows of the design matrix (see the sketch below).
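A small numerical check that filling the covariance element by element with \(k_f\) agrees with the matrix form \(\alpha\boldsymbol{\Phi}\boldsymbol{\Phi}^\top\); the random \(\boldsymbol{\Phi}\) and the value of \(\alpha\) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Rows of Phi play the role of the basis-function vectors phi(W_1, x_i)
n, h = 6, 4
Phi = rng.standard_normal((n, h))
alpha = 2.0

def k_f(phi_i, phi_j):
    # covariance function as an inner product of basis-function vectors
    return alpha * phi_i @ phi_j

K_elementwise = np.array([[k_f(Phi[i], Phi[j]) for j in range(n)] for i in range(n)])
K_matrix = alpha * Phi @ Phi.T
print(np.allclose(K_elementwise, K_matrix))   # True
```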

Gaussian Process

  • Instead of assuming the density over each data point \(y_i\) factorises i.i.d.,

  • we make a joint Gaussian assumption over our data.

  • covariance matrix is now a function of both the parameters of the activation function, \(\mathbf{W}_1\), and the input variables, \(\mathbf{X}\).

  • Arises from integrating out \(\mathbf{ w}^{(2)}\).

Basis Functions

  • Can be very complex, such as deep kernels (Cho and Saul, 2009), or one could even put a convolutional neural network inside.
  • Viewing a neural network in this way is also what allows us to perform sensible batch normalization (Ioffe and Szegedy, 2015).

Gaussian Processes

  • Basis function models give non-linear predictions.
  • Need to choose number and location of basis functions.
  • Gaussian processes are a general framework (basis function models are a special case)
  • Within the framework you can consider models with infinite basis functions.

Gaussian Processes

\[ p(\mathbf{ y}|\mathbf{X}, \mathbf{ w}) = \prod_{i=1}^{n} p(y_i | \mathbf{ x}_i, \mathbf{ w}) \]

\[ \mathbf{ y}|\mathbf{X}\sim \mathscr{N}\left(\mathbf{m}(\mathbf{X}),\mathbf{K}(\mathbf{X})\right), \]

Sampling a Function

Multi-variate Gaussians

  • We will consider a Gaussian with a particular structure of covariance matrix.
  • Generate a single sample from this 25-dimensional Gaussian density, \[ \mathbf{ f}=\left[f_{1},f_{2}\dots f_{25}\right]. \]
  • We will plot these points against their index (see the sketch below).
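A sketch of this sampling step, assuming the exponentiated quadratic covariance introduced later in the slides; the input grid, \(\alpha\) and lengthscale values are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)

# 25 one-dimensional inputs; covariance built with the exponentiated quadratic
# kernel (alpha and lengthscale are illustrative choices)
x = np.linspace(-1, 1, 25)[:, None]
alpha, lengthscale = 1.0, 0.3
sq_dist = (x - x.T) ** 2
K = alpha * np.exp(-sq_dist / (2 * lengthscale**2))

# A single sample f = [f_1, ..., f_25] from N(0, K); jitter keeps K positive definite
f = rng.multivariate_normal(np.zeros(25), K + 1e-8 * np.eye(25))

# Plot the sampled values against their index
plt.plot(np.arange(1, 26), f, 'o-')
plt.xlabel('index $i$')
plt.ylabel('$f_i$')
plt.show()
```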

Sampling a Function from a Gaussian

Joint Density of \(f_1\) and \(f_2\)

Prediction of \(f_{2}\) from \(f_{1}\)

Uluru

Prediction of \(f_2\) from \(f_1\)

  • Conditional density is also Gaussian (see the numerical sketch below). \[ p(f_2|f_1) = \mathscr{N}\left(f_2|\frac{k_{1, 2}}{k_{1, 1}}f_1, k_{2, 2} - \frac{k_{1,2}^2}{k_{1,1}}\right) \] where the covariance of the joint density is given by \[ \mathbf{K}= \begin{bmatrix} k_{1, 1} & k_{1, 2}\\ k_{2, 1} & k_{2, 2}\end{bmatrix}. \]
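A numerical sketch of this conditional; the covariance matrix and the observed value of \(f_1\) are illustrative.

```python
import numpy as np

# Joint covariance of (f_1, f_2); the numbers are illustrative
K = np.array([[1.0, 0.8],
              [0.8, 1.0]])
f1 = 0.5                                    # an observed value for f_1

# Conditional p(f_2 | f_1) from the formula above
cond_mean = K[0, 1] / K[0, 0] * f1
cond_var = K[1, 1] - K[0, 1]**2 / K[0, 0]
print(cond_mean, cond_var)
```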

Joint Density of \(f_1\) and \(f_8\)

Prediction of \(f_{8}\) from \(f_{1}\)

Details

  • The single contour of the Gaussian density represents the joint distribution, \(p(f_1, f_8)\).
  • We observe a value for \(f_1\).
  • Conditional density: \(p(f_8|f_1)\).

Multivariate Conditional Density

  • Prediction of \(\mathbf{ f}_*\) from \(\mathbf{ f}\).

  • Multivariate conditional density is also Gaussian. \[ p(\mathbf{ f}_*|\mathbf{ f}) = \mathscr{N}\left(\mathbf{ f}_*|\mathbf{K}_{*,\mathbf{ f}}\mathbf{K}_{\mathbf{ f},\mathbf{ f}}^{-1}\mathbf{ f},\mathbf{K}_{*,*}-\mathbf{K}_{*,\mathbf{ f}} \mathbf{K}_{\mathbf{ f},\mathbf{ f}}^{-1}\mathbf{K}_{\mathbf{ f},*}\right) \]

  • Here the covariance of the joint density is given by \[ \mathbf{K}= \begin{bmatrix} \mathbf{K}_{\mathbf{ f}, \mathbf{ f}} & \mathbf{K}_{\mathbf{ f}, *}\\ \mathbf{K}_{*, \mathbf{ f}} & \mathbf{K}_{*, *}\end{bmatrix} \]

Multivariate Conditional Density

  • Prediction of \(\mathbf{ f}_*\) from \(\mathbf{ f}\).

  • Multivariate conditional density is also Gaussian. \[ p(\mathbf{ f}_*|\mathbf{ f}) = \mathscr{N}\left(\mathbf{ f}_*|\boldsymbol{ \mu},\boldsymbol{ \Sigma}\right) \] \[ \boldsymbol{ \mu}= \mathbf{K}_{*,\mathbf{ f}}\mathbf{K}_{\mathbf{ f},\mathbf{ f}}^{-1}\mathbf{ f} \] \[ \boldsymbol{ \Sigma}= \mathbf{K}_{*,*}-\mathbf{K}_{*,\mathbf{ f}} \mathbf{K}_{\mathbf{ f},\mathbf{ f}}^{-1}\mathbf{K}_{\mathbf{ f},*} \]

Prediction with Correlated Gaussians

  • Here the covariance of the joint density is given by \[ \mathbf{K}= \begin{bmatrix} \mathbf{K}_{\mathbf{ f}, \mathbf{ f}} & \mathbf{K}_{\mathbf{ f}, *}\\ \mathbf{K}_{*, \mathbf{ f}} & \mathbf{K}_{*, *}\end{bmatrix} \] (a sketch of this prediction follows below).
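A sketch of this prediction, assuming the exponentiated quadratic covariance defined on the next slide; the training inputs, observed values \(\mathbf{f}\) and test inputs are illustrative.

```python
import numpy as np

def kernel(X1, X2, alpha=1.0, lengthscale=0.3):
    # exponentiated quadratic covariance (see the next slide)
    sq_dist = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return alpha * np.exp(-sq_dist / (2 * lengthscale**2))

# Training inputs with observed function values f, and test inputs x_*
X = np.array([[-0.5], [0.0], [0.7]])
f = np.array([0.2, 1.0, -0.3])
X_star = np.linspace(-1, 1, 5)[:, None]

K_ff = kernel(X, X)
K_sf = kernel(X_star, X)
K_ss = kernel(X_star, X_star)

# Conditional mean and covariance of f_* given f
K_ff_inv = np.linalg.inv(K_ff + 1e-8 * np.eye(len(X)))   # jitter for stability
mu = K_sf @ K_ff_inv @ f
Sigma = K_ss - K_sf @ K_ff_inv @ K_sf.T
print(mu)
print(np.diag(Sigma))
```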

Where Did This Covariance Matrix Come From?

\[ k(\mathbf{ x}, \mathbf{ x}^\prime) = \alpha \exp\left(-\frac{\left\Vert \mathbf{ x}- \mathbf{ x}^\prime\right\Vert^2_2}{2\ell^2}\right)\]

  • Covariance matrix is built using the inputs to the function, \(\mathbf{ x}\).

  • For the example above it was based on Euclidean distance.

  • The covariance function is also known as a kernel.

Computing Covariance
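A sketch of computing the covariance matrix element by element from the covariance function above; the input values, \(\alpha\) and lengthscale are illustrative.

```python
import numpy as np

def eq_cov(x_i, x_j, alpha=1.0, lengthscale=0.5):
    """Exponentiated quadratic covariance between two input vectors."""
    return alpha * np.exp(-np.sum((x_i - x_j) ** 2) / (2 * lengthscale**2))

# A few one-dimensional inputs (values are illustrative)
X = np.array([[-3.0], [1.2], [1.4], [2.0]])
n = X.shape[0]

# Fill the covariance matrix element by element, k_ij = k(x_i, x_j)
K = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        K[i, j] = eq_cov(X[i], X[j])
print(K)
```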

Thanks!

References

Cho, Y., Saul, L.K., 2009. Kernel methods for deep learning, in: Bengio, Y., Schuurmans, D., Lafferty, J.D., Williams, C.K.I., Culotta, A. (Eds.), Advances in Neural Information Processing Systems 22. Curran Associates, Inc., pp. 342–350.
Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: Bach, F., Blei, D. (Eds.), Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, Lille, France, pp. 448–456.
Tipping, M.E., Bishop, C.M., 1999. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B 61, 611–622. https://doi.org/10.1111/1467-9868.00196