Gaussian Processes II

Neil D. Lawrence

Dedan Kimathi University, Nyeri, Kenya

Review

  • Yesterday: Intro to Gaussian Processes
  • Today:
    • Gaussian Processes: optimising and covariance functions

Marginal Likelihood

  • Build on our understanding of marginal likelihood.

Generating from the Model

\[ \mathbf{ y}= \boldsymbol{ \Phi}\mathbf{ w}+ \boldsymbol{ \epsilon} \]

  • Sample \[ \mathbf{ w}\sim \mathscr{N}\left(\mathbf{0},\alpha\mathbf{I}\right) \] \[ \boldsymbol{ \epsilon}\sim \mathscr{N}\left(\mathbf{0},\sigma^2\mathbf{I}\right) \] with \(\alpha=1\) and \(\sigma^2 = 0.01\).

Olympic Marathon Polynomial Sample

Sampling from the Prior

import numpy as np

# Sample the weights from their prior, w ~ N(0, alpha I).
K = degree + 1                     # number of basis functions
z_vec = np.random.normal(size=K)
w_sample = z_vec*np.sqrt(alpha)

# Push the sampled weights through the basis to get a sampled function.
f_sample = Phi_pred@w_sample
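
The snippet above assumes a design matrix Phi_pred evaluated at the prediction inputs. A minimal sketch of how such a polynomial basis might be constructed (the helper name and the rescaling constants are assumptions for illustration, not from the slides):

import numpy as np

def polynomial_basis(x, degree, loc=1950., scale=100.):
    # Hypothetical helper: polynomial design matrix on rescaled inputs.
    # Rescaling by the assumed loc and scale keeps powers of the year well behaved.
    z = (x - loc)/scale
    return np.hstack([z**i for i in range(degree + 1)])

# Example: evaluate the basis at a grid of prediction inputs.
x_pred = np.linspace(1890., 2020., 200)[:, np.newaxis]
degree = 4
Phi_pred = polynomial_basis(x_pred, degree)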

Mean Output

  • Each sample \(\mathbf{ w}_s\) has a prediction \(\mathbf{ f}_s\)

\[ \mathbf{ f}_s = \boldsymbol{ \Phi}\mathbf{ w}_s. \]

# Plot several functions drawn from the prior over the weights.
for i in range(num_samples):
    z_vec = np.random.normal(size=K)
    w_sample = z_vec*np.sqrt(alpha)
    f_sample = np.dot(Phi_pred, w_sample)
    _ = ax.plot(x_pred.flatten(), f_sample.flatten(), linewidth=2)

Function Space View

\[ \mathbf{ w}\sim \mathscr{N}\left(\mathbf{0},\alpha \mathbf{I}\right) \] \[ \boldsymbol{ \Phi}= \begin{bmatrix}\boldsymbol{ \phi}(\mathbf{ x}_1) \\ \vdots \\ \boldsymbol{ \phi}(\mathbf{ x}_n)\end{bmatrix} \] \[ \mathbf{ f}= \begin{bmatrix} f_1 \\ \vdots \\ f_n\end{bmatrix} \] \[ \mathbf{ f}= \boldsymbol{ \Phi}\mathbf{ w}. \]

\[ \mathbf{ f}\sim \mathscr{N}\left(\mathbf{0},\alpha \boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top\right). \]

\[ \mathbf{K}= \alpha \boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top. \]

K = alpha*Phi_pred@Phi_pred.T

K = alpha*Phi_pred@Phi_pred.T
f_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)
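
Note that \(\mathbf{K}= \alpha \boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top\) has rank at most the number of basis functions, so before sampling it is common to add a small 'jitter' term to the diagonal, e.g. K += 1e-6*np.eye(x_pred.size) (the jitter value here is illustrative).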

\[ \mathbf{ y}= \mathbf{ f}+ \boldsymbol{\epsilon} \]

\[ \epsilon \sim \mathscr{N}\left(\mathbf{0},\sigma^2\mathbf{I}\right). \]

\[ \mathbf{ y}\sim \mathscr{N}\left(\mathbf{0},\alpha\boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top +\sigma^2\mathbf{I}\right). \]

Non-degenerate Gaussian Processes

  • This process is degenerate.
  • Covariance function is of rank at most \(h\), the number of basis functions.
  • As \(n\rightarrow \infty\), covariance matrix is not full rank.
  • Leading to \(\det{\mathbf{K}} = 0\).

Infinite Networks

Radford Neal
  • In machine learning, Radford Neal (Neal, 1994) asked “what would happen if you took \(h\rightarrow \infty\)?”

Roughly Speaking

  • Instead of \[ \begin{align*} k_f\left(\mathbf{ x}_i, \mathbf{ x}_j\right) & = \alpha \boldsymbol{ \phi}\left(\mathbf{W}_1, \mathbf{ x}_i\right)^\top \boldsymbol{ \phi}\left(\mathbf{W}_1, \mathbf{ x}_j\right)\\ & = \alpha \sum_k \phi\left(\mathbf{ w}^{(1)}_k, \mathbf{ x}_i\right) \phi\left(\mathbf{ w}^{(1)}_k, \mathbf{ x}_j\right) \end{align*} \]

  • Sample infinitely many from a prior density, \(p(\mathbf{ w}^{(1)})\), \[ k_f\left(\mathbf{ x}_i, \mathbf{ x}_j\right) = \alpha \int \phi\left(\mathbf{ w}^{(1)}, \mathbf{ x}_i\right) \phi\left(\mathbf{ w}^{(1)}, \mathbf{ x}_j\right) p(\mathbf{ w}^{(1)}) \text{d}\mathbf{ w}^{(1)} \]
  • Also applies for non-Gaussian \(p(\mathbf{ w}^{(1)})\) because of the central limit theorem.
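
A quick Monte Carlo check of this integral view, under assumed choices (ReLU basis functions and a standard normal prior over the hidden weight and bias); as the number of samples grows the estimate settles to a fixed value:

import numpy as np

def mc_kernel(x, x_prime, num_hidden=10000, alpha=1.0, seed=0):
    # Monte Carlo estimate of alpha * int phi(w, x) phi(w, x') p(w) dw.
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_hidden, 2))               # samples of [weight, bias] from p(w)
    phi = lambda z: np.maximum(W[:, 0]*z + W[:, 1], 0.)    # ReLU basis function
    return alpha*np.mean(phi(x)*phi(x_prime))

print(mc_kernel(0.5, 1.0, num_hidden=1000), mc_kernel(0.5, 1.0, num_hidden=1000000))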

Simple Probabilistic Program

  • If \[ \begin{align*} \mathbf{ w}^{(1)} & \sim p(\cdot)\\ \phi_i & = \phi\left(\mathbf{ w}^{(1)}, \mathbf{ x}_i\right), \end{align*} \] has finite variance.

  • Then, taking the number of hidden units to infinity, the result is also a Gaussian process.

Further Reading

  • Chapter 2 of Neal’s thesis (Neal, 1994)

  • Rest of Neal’s thesis. (Neal, 1994)

  • David MacKay’s PhD thesis (MacKay, 1992)

Gaussian Process

\[ k(\mathbf{ x}, \mathbf{ x}^\prime) = \alpha \exp\left( -\frac{\left\Vert \mathbf{ x}-\mathbf{ x}^\prime\right\Vert^2}{2\ell^2}\right), \]

\[ \left\Vert\mathbf{ x}- \mathbf{ x}^\prime\right\Vert^2 = (\mathbf{ x}- \mathbf{ x}^\prime)^\top (\mathbf{ x}- \mathbf{ x}^\prime) \]
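
A minimal numpy sketch of this covariance and of filling in a covariance matrix from it (the function names are illustrative, not from a particular library):

import numpy as np

def eq_cov(x, x_prime, alpha=1.0, lengthscale=1.0):
    # Exponentiated quadratic covariance between two input points.
    r2 = np.sum((x - x_prime)**2)
    return alpha*np.exp(-0.5*r2/lengthscale**2)

def kernel_matrix(X, X2, kernel=eq_cov, **kwargs):
    # Covariance matrix between the rows of X and the rows of X2.
    K = np.zeros((X.shape[0], X2.shape[0]))
    for i, x_i in enumerate(X):
        for j, x_j in enumerate(X2):
            K[i, j] = kernel(x_i, x_j, **kwargs)
    return K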

Making Predictions with Gaussian Process

\[ p(\mathbf{ y}|\mathbf{X}) = \frac{1}{(2\pi)^{\frac{n}{2}}\left|\mathbf{K}+\sigma^2\mathbf{I}\right|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}\mathbf{ y}^\top \left(\mathbf{K}+\sigma^2 \mathbf{I}\right)^{-1}\mathbf{ y}\right) \]

\[ E(\boldsymbol{\theta}) = \frac{1}{2} \log \left|\mathbf{K}+\sigma^2\mathbf{I}\right| + \frac{1}{2} \mathbf{ y}^\top \left(\mathbf{K}+ \sigma^2\mathbf{I}\right)^{-1}\mathbf{ y} \]

Making Predictions

\[ \begin{bmatrix}\mathbf{ f}\\ \mathbf{ f}^*\end{bmatrix} \sim \mathscr{N}\left(\mathbf{0},\begin{bmatrix} \mathbf{K}& \mathbf{K}_\ast \\ \mathbf{K}_\ast^\top & \mathbf{K}_{\ast,\ast}\end{bmatrix}\right) \]

\[ \begin{bmatrix} \mathbf{K}& \mathbf{K}_\ast \\ \mathbf{K}_\ast^\top & \mathbf{K}_{\ast,\ast}\end{bmatrix} \]
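
Conditioning this joint Gaussian on the noisy observations \(\mathbf{ y}\) gives the standard predictive distribution for \(\mathbf{ f}^*\):

\[ \mathbf{ f}^* | \mathbf{ y}\sim \mathscr{N}\left(\mathbf{K}_\ast^\top \left(\mathbf{K}+ \sigma^2\mathbf{I}\right)^{-1}\mathbf{ y},\; \mathbf{K}_{\ast,\ast} - \mathbf{K}_\ast^\top \left(\mathbf{K}+ \sigma^2\mathbf{I}\right)^{-1}\mathbf{K}_\ast\right). \]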

Making Predictions

The Importance of the Covariance Function

\[ \boldsymbol{ \mu}_f= \mathbf{A}^\top \mathbf{ y}, \]

where \(\mathbf{A}= \left(\mathbf{K}+\sigma^2\mathbf{I}\right)^{-1}\mathbf{K}_\ast\), so the posterior mean is a linear combination of the observations \(\mathbf{ y}\).

Parameter Optimisation

Improving the Numerics

In practice we shouldn't use the matrix inverse directly to solve the GP system. A more stable approach is to compute the Cholesky decomposition of the kernel matrix. The log determinant of the covariance can also be obtained directly from the Cholesky factor.
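
A minimal sketch of the negative log marginal likelihood computed this way (not the slides' own code; K is the kernel matrix, y the target vector, sigma2 the noise variance):

import numpy as np

def gp_neg_log_likelihood(K, y, sigma2):
    # E(theta) = 0.5 log det(K + sigma^2 I) + 0.5 y^T (K + sigma^2 I)^{-1} y + const.
    n = y.shape[0]
    L = np.linalg.cholesky(K + sigma2*np.eye(n))           # K + sigma^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # two triangular solves, no explicit inverse
    log_det = 2.0*np.sum(np.log(np.diag(L)))               # log determinant from the Cholesky diagonal
    return 0.5*log_det + 0.5*y@alpha + 0.5*n*np.log(2.0*np.pi)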

Capacity Control

Gradients of the Likelihood

Overall Process Scale

Capacity Control and Data Fit

Learning Covariance Parameters

Can we determine covariance parameters from the data?

\[ \mathscr{N}\left(\mathbf{ y}|\mathbf{0},\mathbf{K}\right)=\frac{1}{(2\pi)^\frac{n}{2}{\det{\mathbf{K}}^{\frac{1}{2}}}}{\exp\left(-\frac{\mathbf{ y}^{\top}\mathbf{K}^{-1}\mathbf{ y}}{2}\right)} \]

\[ \begin{aligned} \mathscr{N}\left(\mathbf{ y}|\mathbf{0},\mathbf{K}\right)=\frac{1}{(2\pi)^\frac{n}{2}\color{yellow}{\det{\mathbf{K}}^{\frac{1}{2}}}}\color{cyan}{\exp\left(-\frac{\mathbf{ y}^{\top}\mathbf{K}^{-1}\mathbf{ y}}{2}\right)} \end{aligned} \]

\[ \begin{aligned} \log \mathscr{N}\left(\mathbf{ y}|\mathbf{0},\mathbf{K}\right)=&\color{yellow}{-\frac{1}{2}\log\det{\mathbf{K}}}\color{cyan}{-\frac{\mathbf{ y}^{\top}\mathbf{K}^{-1}\mathbf{ y}}{2}} \\ &-\frac{n}{2}\log2\pi \end{aligned} \]

\[ E(\boldsymbol{ \theta}) = \color{yellow}{\frac{1}{2}\log\det{\mathbf{K}}} + \color{cyan}{\frac{\mathbf{ y}^{\top}\mathbf{K}^{-1}\mathbf{ y}}{2}} \]

Capacity Control through the Determinant

The parameters are inside the covariance function (matrix). \[k_{i, j} = k(\mathbf{ x}_i, \mathbf{ x}_j; \boldsymbol{ \theta})\]

Eigendecomposition of Covariance

\[\mathbf{K}= \mathbf{R}\boldsymbol{ \Lambda}^2 \mathbf{R}^\top\]

\(\boldsymbol{ \Lambda}\) represents distance on axes. \(\mathbf{R}\) gives rotation.

Eigendecomposition of Covariance

  • \(\boldsymbol{ \Lambda}\) is diagonal, \(\mathbf{R}^\top\mathbf{R}= \mathbf{I}\).
  • Useful representation since \(\det{\mathbf{K}} = \det{\boldsymbol{ \Lambda}^2} = \det{\boldsymbol{ \Lambda}}^2\).
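
A quick numerical check of the determinant identity (the kernel, lengthscale and jitter values below are illustrative):

import numpy as np

X = np.linspace(0., 1., 5)[:, np.newaxis]
K = np.exp(-0.5*(X - X.T)**2/0.3**2) + 1e-6*np.eye(5)     # exponentiated quadratic plus jitter
Lambda2, R = np.linalg.eigh(K)                            # K = R diag(Lambda^2) R^T
print(np.allclose(R.T@R, np.eye(5)))                      # R is a rotation
print(np.log(np.linalg.det(K)), np.sum(np.log(Lambda2)))  # two routes to log det K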

Capacity control: \(\color{yellow}{\log \det{\mathbf{K}}}\)

Quadratic Data Fit

Data Fit: \(\color{cyan}{\frac{\mathbf{ y}^\top\mathbf{K}^{-1}\mathbf{ y}}{2}}\)

\[E(\boldsymbol{ \theta}) = \color{yellow}{\frac{1}{2}\log\det{\mathbf{K}}}+\color{cyan}{\frac{\mathbf{ y}^{\top}\mathbf{K}^{-1}\mathbf{ y}}{2}}\]

Data Fit Term

Exponentiated Quadratic Covariance

\[k(\mathbf{ x}, \mathbf{ x}^\prime) = \alpha \exp\left(-\frac{\left\Vert \mathbf{ x}-\mathbf{ x}^\prime \right\Vert_2^2}{2\ell^2}\right)\]

Olympic Marathon Data

  • Gold medal times for Olympic Marathon since 1896.
  • Marathons before 1924 didn’t have a standardized distance.
  • Present results using pace per km.
  • In 1904 the Marathon was badly organised, leading to very slow times.
Image from Wikimedia Commons

Olympic Marathon Data

Gaussian Process Fit

Olympic Marathon Data GP

Fit Quality

Remove Outlier: Olympic Marathon Data GP

Della Gatta Gene Data

  • Given expression levels in the form of a time series from Della Gatta et al. (2008).

Della Gatta Gene Data

Gene Expression Example

  • Want to detect if a gene is expressed or not: fit a GP to each gene (Kalaitzis and Lawrence, 2011).

Freddie Kalaitzis
http://www.biomedcentral.com/1471-2105/12/180

TP53 Gene Data GP

TP53 Gene Data GP

TP53 Gene Data GP

Multiple Optima

Example: Prediction of Malaria Incidence in Uganda

Martin Mubangizi, Ricardo Andrade-Pacheco, John Quinn

Malaria Prediction in Uganda

(Andrade-Pacheco et al., 2014; Mubangizi et al., 2014)

Kapchorwa District

Tororo District

Malaria Prediction in Nagongera (Sentinel Site)

Mubende District

Malaria Prediction in Uganda

GP School at Makerere

Kabarole District

Early Warning System

Early Warning Systems

Additive Covariance

\[k_f(\mathbf{ x}, \mathbf{ x}^\prime) = k_g(\mathbf{ x}, \mathbf{ x}^\prime) + k_h(\mathbf{ x}, \mathbf{ x}^\prime)\]
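
The sum of two valid covariance functions is itself a valid covariance function. A tiny sketch, reusing the eq_cov function from the earlier covariance example (the linear term and its parameters are illustrative):

import numpy as np

def linear_cov(x, x_prime, alpha=1.0):
    # Linear covariance function.
    return alpha*np.dot(x, x_prime)

def additive_cov(x, x_prime):
    # k_f = k_g + k_h: add an exponentiated quadratic and a linear covariance.
    return eq_cov(x, x_prime, alpha=1.0, lengthscale=2.0) + linear_cov(x, x_prime, alpha=0.1)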

Aki Vehtari

Gelman Book

Gelman et al. (2013)

Basis Function Covariance

\[k(\mathbf{ x}, \mathbf{ x}^\prime) = \boldsymbol{ \phi}(\mathbf{ x})^\top \boldsymbol{ \phi}(\mathbf{ x}^\prime)\]

Brownian Covariance

\[k(t, t^\prime)=\alpha \min(t, t^\prime)\]

MLP Covariance

\[k(\mathbf{ x}, \mathbf{ x}^\prime) = \alpha \arcsin\left(\frac{w \mathbf{ x}^\top \mathbf{ x}^\prime + b}{\sqrt{\left(w \mathbf{ x}^\top \mathbf{ x}+ b + 1\right)\left(w \left.\mathbf{ x}^\prime\right.^\top \mathbf{ x}^\prime + b + 1\right)}}\right)\]

RELU Covariance

\[k(\mathbf{ x}, \mathbf{ x}^\prime) = \alpha \arcsin\left(\frac{w \mathbf{ x}^\top \mathbf{ x}^\prime + b} {\sqrt{\left(w \mathbf{ x}^\top \mathbf{ x}+ b + 1\right) \left(w \left.\mathbf{ x}^\prime\right.^\top \mathbf{ x}^\prime + b + 1\right)}}\right)\]

Sinc Covariance

\[k(\mathbf{ x}, \mathbf{ x}^\prime) = \alpha \text{sinc}\left(\pi w r\right)\]

where \(r = \left\Vert \mathbf{ x}- \mathbf{ x}^\prime\right\Vert_2\).

Polynomial Covariance

\[k(\mathbf{ x}, \mathbf{ x}^\prime) = \alpha(w \mathbf{ x}^\top\mathbf{ x}^\prime + b)^d\]

Periodic Covariance

\[k(\mathbf{ x}, \mathbf{ x}^\prime) = \alpha\exp\left(\frac{-2\sin(\pi rw)^2}{\ell^2}\right)\]

Intrinsic Coregionalization Model Covariance

\[k(i, j, \mathbf{ x}, \mathbf{ x}^\prime) = b_{i,j} k(\mathbf{ x}, \mathbf{ x}^\prime)\]

Linear Model of Coregionalization Covariance

\[k(i, j, \mathbf{ x}, \mathbf{ x}^\prime) = \sum_q b^q_{i,j} k_q(\mathbf{ x}, \mathbf{ x}^\prime)\]

Semi Parametric Latent Factor Covariance

GPSS: Gaussian Process Summer School

GPy: A Gaussian Process Framework in Python

https://github.com/SheffieldML/GPy

GPy: A Gaussian Process Framework in Python

  • BSD Licensed software base.
  • Wide availability of libraries, ‘modern’ scripting language.
  • Allows us to set GP-based projects for undergraduates in Computer Science.
  • Available through GitHub https://github.com/SheffieldML/GPy
  • Reproducible Research with Jupyter Notebook.

Features

  • Probabilistic-style programming (specify the model, not the algorithm).
  • Non-Gaussian likelihoods.
  • Multivariate outputs.
  • Dimensionality reduction.
  • Approximations for large data sets.
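
A minimal usage sketch (the data and kernel choice here are made up for illustration):

import numpy as np
import GPy

X = np.linspace(0., 10., 50)[:, np.newaxis]
y = np.sin(X) + 0.1*np.random.randn(50, 1)                # synthetic data

kernel = GPy.kern.RBF(input_dim=1) + GPy.kern.Bias(input_dim=1)   # additive covariance
model = GPy.models.GPRegression(X, y, kernel)
model.optimize()                                          # maximise the log marginal likelihood
mean, variance = model.predict(np.array([[5.5]]))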

Further Reading

  • Chapter 2 of Neal (1994)

  • Rest of Neal (1994)

  • All of MacKay (1992)

Thanks!

References

Andrade-Pacheco, R., Mubangizi, M., Quinn, J., Lawrence, N.D., 2014. Consistent mapping of government malaria records across a changing territory delimitation. Malaria Journal 13. https://doi.org/10.1186/1475-2875-13-S1-P5
Della Gatta, G., Bansal, M., Ambesi-Impiombato, A., Antonini, D., Missero, C., Bernardo, D. di, 2008. Direct targets of the TRP63 transcription factor revealed by a combination of gene expression profiling and reverse engineering. Genome Research 18, 939–948. https://doi.org/10.1101/gr.073601.107
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B., 2013. Bayesian data analysis, 3rd ed. Chapman; Hall.
Kalaitzis, A.A., Lawrence, N.D., 2011. A simple approach to ranking differentially expressed gene expression time courses through Gaussian process regression. BMC Bioinformatics 12. https://doi.org/10.1186/1471-2105-12-180
MacKay, D.J.C., 1992. Bayesian methods for adaptive models (PhD thesis). California Institute of Technology.
Mubangizi, M., Andrade-Pacheco, R., Smith, M.T., Quinn, J., Lawrence, N.D., 2014. Malaria surveillance with multiple data sources using Gaussian process models, in: 1st International Conference on the Use of Mobile ICT in Africa.
Neal, R.M., 1994. Bayesian learning for neural networks (PhD thesis). Dept. of Computer Science, University of Toronto.