Multi-output Gaussian Processes

GPy: A Gaussian Process Framework in Python

https://github.com/SheffieldML/GPy

GPy: A Gaussian Process Framework in Python

  • BSD Licensed software base.
  • Wide availability of libraries, ‘modern’ scripting language.
  • Allows us to set GP-based projects for undergraduates in Computer Science.
  • Available through GitHub https://github.com/SheffieldML/GPy
  • Reproducible Research with Jupyter Notebook.

Features

  • Probabilistic-style programming (specify the model, not the algorithm).
  • Non-Gaussian likelihoods.
  • Multivariate outputs.
  • Dimensionality reduction.
  • Approximations for large data sets.
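
For example, a minimal regression fit in GPy looks like the following. This is a sketch assuming a standard GPy install; the data here are synthetic and only illustrate the "specify the model, not the algorithm" style.

```python
import numpy as np
import GPy

# Synthetic 1-D regression data (assumed for illustration).
np.random.seed(0)
X = np.linspace(0, 10, 50)[:, None]
Y = np.sin(X) + 0.1 * np.random.randn(50, 1)

# Specify the model, not the algorithm: a GP with an RBF covariance.
kernel = GPy.kern.RBF(input_dim=1)
model = GPy.models.GPRegression(X, Y, kernel)
model.optimize()   # maximum-likelihood fit of the hyperparameters
print(model)       # summary of kernel variance, lengthscale and noise
```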

Simple Kalman Filter

  • We have state vector \(\mathbf{X}= \left[\mathbf{ x}_1 \dots \mathbf{ x}_q\right] \in \mathbb{R}^{T \times q}\) and if each state evolves independently we have \[ \begin{align*} p(\mathbf{X}) &= \prod_{i=1}^qp(\mathbf{ x}_{:, i}) \\ p(\mathbf{ x}_{:, i}) &= \mathcal{N}\left(\mathbf{ x}_{:, i}|\mathbf{0},\mathbf{K}\right). \end{align*} \]

  • We want to obtain outputs through: \[ \mathbf{ y}_{i, :} = \mathbf{W}\mathbf{ x}_{i, :} \]

Stacking and Kronecker Products

  • Represent with a ‘stacked’ system: \[ p(\mathbf{ x}) = \mathcal{N}\left(\mathbf{ x}|\mathbf{0},\mathbf{I}\otimes \mathbf{K}\right) \] where the stacking is placing each column of \(\mathbf{X}\) one on top of another as \[ \mathbf{ x}= \begin{bmatrix} \mathbf{ x}_{:, 1}\\ \mathbf{ x}_{:, 2}\\ \vdots\\ \mathbf{ x}_{:, q} \end{bmatrix} \]
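
A small numpy check of this stacked representation; the time covariance \(\mathbf{K}\) is an assumed RBF and the dimensions are illustrative.

```python
import numpy as np

np.random.seed(0)
T, q = 5, 3
t = np.linspace(0, 1, T)[:, None]
K = np.exp(-0.5 * (t - t.T) ** 2 / 0.1 ** 2)   # assumed RBF covariance over time

# Column stacking x = [x_{:,1}; ...; x_{:,q}]: each column is an independent
# draw from N(0, K), so the joint covariance is block diagonal, i.e. I ⊗ K.
cov_stacked = np.kron(np.eye(q), K)

# Empirical check: stack independent columns and compare the sample covariance.
samples = np.hstack([np.random.multivariate_normal(np.zeros(T), K, size=200000)
                     for _ in range(q)])        # shape (200000, T*q)
print(np.allclose(np.cov(samples.T), cov_stacked, atol=0.05))   # True (up to Monte Carlo error)
```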

Kronecker Product

Column Stacking

For this stacking the marginal distribution over time is given by the block diagonals.

Two Ways of Stacking

Can also stack each row of \(\mathbf{X}\) to form column vector: \[\mathbf{ x}= \begin{bmatrix} \mathbf{ x}_{1, :}\\ \mathbf{ x}_{2, :}\\ \vdots\\ \mathbf{ x}_{T, :} \end{bmatrix}\] \[p(\mathbf{ x}) = \mathcal{N}\left(\mathbf{ x}|\mathbf{0},\mathbf{K}\otimes \mathbf{I}\right)\]

Row Stacking

For this stacking the marginal distribution over the latent dimensions is given by the block diagonals.

Mapping from Latent Process to Observed

Observed Process

The observations are related to the latent points by a linear mapping matrix, \[ \mathbf{ y}_{i, :} = \mathbf{W}\mathbf{ x}_{i, :} + \boldsymbol{ \epsilon}_{i, :} \] \[ \boldsymbol{ \epsilon}\sim \mathcal{N}\left(\mathbf{0},\sigma^2\mathbf{I}\right) \]

Output Covariance

This leads to a covariance of the form \[ (\mathbf{I}\otimes \mathbf{W}) (\mathbf{K}\otimes \mathbf{I}) (\mathbf{I}\otimes \mathbf{W}^\top) + \mathbf{I}\sigma^2. \] Using the identity \((\mathbf{A}\otimes\mathbf{B}) (\mathbf{C}\otimes\mathbf{D}) = \mathbf{A}\mathbf{C} \otimes \mathbf{B}\mathbf{D}\), this simplifies to \[ \mathbf{K}\otimes {\mathbf{W}}{\mathbf{W}}^\top + \mathbf{I}\sigma^2 \] or, switching to the column-stacked ordering of the outputs, \[ \mathbf{ y}\sim \mathcal{N}\left(\mathbf{0},\mathbf{W}\mathbf{W}^\top \otimes \mathbf{K}+ \mathbf{I}\sigma^2\right). \]
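
This algebra can be checked numerically. The sketch below uses an assumed RBF covariance for \(\mathbf{K}\), a random mapping \(\mathbf{W}\), and arbitrary dimensions.

```python
import numpy as np

np.random.seed(1)
T, q, p = 4, 3, 2
t = np.linspace(0, 1, T)[:, None]
K = np.exp(-0.5 * (t - t.T) ** 2 / 0.3 ** 2)   # assumed RBF covariance over time
W = np.random.randn(p, q)                      # linear mapping from latent to observed
sigma2 = 0.1

# Row-stacked latent covariance is K ⊗ I_q; the map to the outputs is I_T ⊗ W.
latent_cov = np.kron(K, np.eye(q))
mapping = np.kron(np.eye(T), W)
output_cov = mapping @ latent_cov @ mapping.T + sigma2 * np.eye(T * p)

# Mixed-product identity: (I ⊗ W)(K ⊗ I)(I ⊗ W)^T = K ⊗ W W^T.
print(np.allclose(output_cov, np.kron(K, W @ W.T) + sigma2 * np.eye(T * p)))   # True
```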

Kernels for Vector Valued Outputs: A Review

Kronecker Structure GPs

  • This Kronecker structure leads to several published models. \[ (\mathbf{K}(\mathbf{ x},\mathbf{ x}^\prime))_{i,i^\prime}=k(\mathbf{ x},\mathbf{ x}^\prime)k_T(i,i^\prime), \] where \(k\) takes \(\mathbf{ x}\) as its input and \(k_T\) takes the output index \(i\) as its input.

  • Can think of multiple output covariance functions as covariances with augmented input.

  • Alongside \(\mathbf{ x}\) we also input the \(i\) associated with the output of interest.

Separable Covariance Functions

  • Taking \(\mathbf{B}= {\mathbf{W}}{\mathbf{W}}^\top\) we have a matrix expression across outputs. \[\mathbf{K}(\mathbf{ x},\mathbf{ x}^\prime)=k(\mathbf{ x},\mathbf{ x}^\prime)\mathbf{B},\] where \(\mathbf{B}\) is a \(p\times p\) symmetric and positive semi-definite matrix.

  • \(\mathbf{B}\) is called the coregionalization matrix.

  • We call this class of covariance functions separable due to their product structure.

Sum of Separable Covariance Functions

  • In the same spirit a more general class of kernels is given by \[\mathbf{K}(\mathbf{ x},\mathbf{ x}^\prime)=\sum_{j=1}^qk_{j}(\mathbf{ x},\mathbf{ x}^\prime)\mathbf{B}_{j}.\]

  • This can also be written as \[\mathbf{K}(\mathbf{X}, \mathbf{X}) = \sum_{j=1}^q\mathbf{B}_{j}\otimes k_{j}(\mathbf{X}, \mathbf{X}),\]

  • This is like several Kalman filter-type models added together, but each one with a different set of latent functions.

  • We call this class of kernels sum of separable kernels (SoS kernels).
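
The sum-of-separable form above can be constructed directly in numpy; the coregionalization matrices and lengthscales below are illustrative choices only.

```python
import numpy as np

def rbf(X, lengthscale):
    """Exponentiated quadratic covariance on a 1-D input."""
    return np.exp(-0.5 * (X - X.T) ** 2 / lengthscale ** 2)

X = np.linspace(0, 5, 30)[:, None]

# Two separable terms B_j ⊗ k_j(X, X); values chosen only for illustration.
B1 = np.array([[1.0, 0.5], [0.5, 1.5]])
B2 = np.array([[2.0, 0.0], [0.0, 0.5]])
K = np.kron(B1, rbf(X, 1.0)) + np.kron(B2, rbf(X, 0.2))

# K is a valid covariance over the column-stacked outputs of both processes.
print(K.shape)                                  # (60, 60)
print(np.all(np.linalg.eigvalsh(K) > -1e-9))    # positive semi-definite
```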

Geostatistics

  • Use of GPs in Geostatistics is called kriging.

  • These multi-output GPs were pioneered in geostatistics: prediction over vector-valued output data is known as cokriging.

  • The model in geostatistics is known as the linear model of coregionalization (LMC; Journel and Huijbregts, 1978; Goovaerts, 1997).

  • Most machine learning multitask models can be placed in the context of the LMC model.

Weighted sum of Latent Functions

  • In the linear model of coregionalization (LMC) outputs are expressed as linear combinations of independent random functions.

  • In the LMC, each component \(f_i\) is expressed as a linear sum \[f_i(\mathbf{ x}) = \sum_{j=1}^q{w}_{i,{j}}{u}_{j}(\mathbf{ x}).\] where the latent functions are independent and have covariance functions \(k_{j}(\mathbf{ x},\mathbf{ x}^\prime)\).

  • The latent processes \(\{{u}_{j}(\mathbf{ x})\}_{j=1}^q\) are independent for \({j}\neq {j}^\prime\) (a sampling sketch follows below).
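
A minimal sampling sketch of this construction; the lengthscales of the latent covariances and the weight matrix \(\mathbf{W}\) are illustrative assumptions.

```python
import numpy as np

def rbf(X, lengthscale):
    """Exponentiated quadratic covariance on a 1-D input."""
    return np.exp(-0.5 * (X - X.T) ** 2 / lengthscale ** 2)

def sample_gp(K, jitter=1e-6):
    """Draw one sample from N(0, K) via a jittered Cholesky factor."""
    L = np.linalg.cholesky(K + jitter * np.eye(K.shape[0]))
    return L @ np.random.randn(K.shape[0])

np.random.seed(2)
X = np.linspace(0, 5, 200)[:, None]
q, p = 2, 3                                    # latent functions and outputs (illustrative)

# Draw q independent latent functions u_j ~ GP(0, k_j); lengthscales are assumed.
U = np.column_stack([sample_gp(rbf(X, ell)) for ell in (1.0, 0.2)])

# Mix them through a weight matrix W: f_i(x) = sum_j w_ij u_j(x).
W = np.random.randn(p, q)
F = U @ W.T                                    # one column per output f_i
print(F.shape)                                 # (200, 3)
```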

Kalman Filter Special Case

  • The Kalman filter is an example of the LMC where \({u}_i(\mathbf{ x}) \rightarrow {x}_i(t)\).

  • I.e., we’ve moved from a time input to a more general input space.

  • In matrix notation:

    1. Kalman filter \[\mathbf{F}= {\mathbf{W}}\mathbf{X}\]
    2. LMC \[\mathbf{F}= {\mathbf{W}}{\mathbf{U}}\] where the rows of these matrices \({\mathbf{F}}\), \(\mathbf{X}\), \({\mathbf{U}}\) each contain \(q\) samples from their corresponding functions at a different time (Kalman filter) or spatial location (LMC).

Intrinsic Coregionalization Model

  • If only one covariance function is used for the latent functions (as in the Kalman filter).

  • This is called the intrinsic coregionalization model (ICM, Goovaerts (1997)).

  • The kernel matrix corresponding to a dataset \(\mathbf{X}\) takes the form \[ \mathbf{K}(\mathbf{X}, \mathbf{X}) = \mathbf{B}\otimes k(\mathbf{X}, \mathbf{X}). \]
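
In GPy this covariance can be put together with the multi-output utilities. The sketch below assumes a standard GPy install and its `GPy.util.multioutput.ICM` helper with an RBF base kernel; it is not code from these slides.

```python
import GPy

# ICM kernel over one input dimension and two outputs; GPy parameterises the
# coregionalization matrix as B = W W^T + diag(kappa), here with a rank-1 W.
base_kernel = GPy.kern.RBF(input_dim=1)
icm = GPy.util.multioutput.ICM(input_dim=1, num_outputs=2,
                               kernel=base_kernel, W_rank=1)
print(icm)
```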

Autokrigeability

  • If outputs are noise-free, maximum likelihood is equivalent to independent fits of \(\mathbf{B}\) and \(k(\mathbf{ x}, \mathbf{ x}^\prime)\) (Helterbrand and Cressie, 1994).

  • In geostatistics this is known as autokrigeability (Wackernagel, 2003).

  • In multitask learning it is known as the cancellation of inter-task transfer (Bonilla et al., n.d.).

Intrinsic Coregionalization Model

\[ \mathbf{K}(\mathbf{X}, \mathbf{X}) = \mathbf{ w}\mathbf{ w}^\top \otimes k(\mathbf{X}, \mathbf{X}). \]

\[ \mathbf{ w}= \begin{bmatrix} 1 \\ 5\end{bmatrix} \] \[ \mathbf{B}= \begin{bmatrix} 1 & 5\\ 5&25\end{bmatrix} \]

Intrinsic Coregionalization Model Covariance

\[k(i, j, \mathbf{ x}, \mathbf{ x}^\prime) = b_{i,j} k(\mathbf{ x}, \mathbf{ x}^\prime)\]

Intrinsic Coregionalization Model

\[ \mathbf{K}(\mathbf{X}, \mathbf{X}) = \mathbf{B}\otimes k(\mathbf{X}, \mathbf{X}). \]

\[ \mathbf{B}= \begin{bmatrix} 1 & 0.5\\ 0.5& 1.5\end{bmatrix} \]

LMC Samples

\[\mathbf{K}(\mathbf{X}, \mathbf{X}) = \mathbf{B}_1 \otimes k_1(\mathbf{X}, \mathbf{X}) + \mathbf{B}_2 \otimes k_2(\mathbf{X}, \mathbf{X})\]

\[\mathbf{B}_1 = \begin{bmatrix} 1.4 & 0.5\\ 0.5& 1.2\end{bmatrix}\] \[{\ell}_1 = 1\] \[\mathbf{B}_2 = \begin{bmatrix} 1 & 0.5\\ 0.5& 1.3\end{bmatrix}\] \[{\ell}_2 = 0.2\]
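
Samples with these parameters can be drawn as follows. Only \(\mathbf{B}_1\), \(\mathbf{B}_2\) and the two lengthscales are taken from the slide; the exponentiated quadratic form of the base covariances and the jitter are assumptions.

```python
import numpy as np

def rbf(X, lengthscale):
    """Exponentiated quadratic covariance on a 1-D input."""
    return np.exp(-0.5 * (X - X.T) ** 2 / lengthscale ** 2)

np.random.seed(3)
X = np.linspace(0, 5, 100)[:, None]
B1 = np.array([[1.4, 0.5], [0.5, 1.2]])
B2 = np.array([[1.0, 0.5], [0.5, 1.3]])

# Joint covariance over the column-stacked pair of outputs.
K = np.kron(B1, rbf(X, 1.0)) + np.kron(B2, rbf(X, 0.2))

# Draw one joint sample via a jittered Cholesky factor and split it into the two outputs.
L = np.linalg.cholesky(K + 1e-6 * np.eye(K.shape[0]))
f1, f2 = np.split(L @ np.random.randn(K.shape[0]), 2)
print(f1.shape, f2.shape)                      # (100,) (100,)
```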

LMC in Machine Learning and Statistics

  • Used in machine learning with GPs for multivariate regression and in statistics for computer emulation of expensive multivariate computer codes.

  • Imposes the correlation of the outputs explicitly through the set of coregionalization matrices.

  • Setting \(\mathbf{B}= \mathbf{I}_p\) assumes outputs are conditionally independent given the parameters \(\boldsymbol{ \theta}\). (Lawrence and Platt, 2004; Minka and Picard, 1997; Yu et al., 2005).

  • More recent approaches for multiple output modeling are different versions of the linear model of coregionalization.

Semiparametric Latent Factor Model

  • Coregionalization matrices are rank 1 (Teh et al., n.d.). We can rewrite the kernel as \[\mathbf{K}(\mathbf{X}, \mathbf{X}) = \sum_{j=1}^q\mathbf{ w}_{:, {j}}\mathbf{ w}^{\top}_{:, {j}} \otimes k_{j}(\mathbf{X}, \mathbf{X}).\]

  • Like the Kalman filter, but each latent function has a different covariance.

  • The authors suggest using an exponentiated quadratic covariance with a characteristic length-scale for each input dimension.

Semiparametric Latent Factor Covariance

Semiparametric Latent Factor Model Samples

\[ \mathbf{K}(\mathbf{X}, \mathbf{X}) = \mathbf{ w}_{:, 1}\mathbf{ w}_{:, 1}^\top \otimes k_1(\mathbf{X}, \mathbf{X}) + \mathbf{ w}_{:, 2} \mathbf{ w}_{:, 2}^\top \otimes k_2(\mathbf{X}, \mathbf{X}) \]

\[ \mathbf{ w}_1 = \begin{bmatrix} 0.5 \\ 1\end{bmatrix} \] \[ \mathbf{ w}_2 = \begin{bmatrix} 1 \\ 0.5\end{bmatrix} \]
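
A sketch of this covariance using the weight vectors above; the base covariances and their lengthscales are assumed for illustration.

```python
import numpy as np

def rbf(X, lengthscale):
    """Exponentiated quadratic covariance on a 1-D input."""
    return np.exp(-0.5 * (X - X.T) ** 2 / lengthscale ** 2)

X = np.linspace(0, 5, 100)[:, None]
w1 = np.array([0.5, 1.0])
w2 = np.array([1.0, 0.5])

# Rank-1 coregionalization matrices w_j w_j^T, one per latent function;
# the two lengthscales below are illustrative assumptions.
K = np.kron(np.outer(w1, w1), rbf(X, 1.0)) + np.kron(np.outer(w2, w2), rbf(X, 0.2))
print(K.shape)                                  # (200, 200)
```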

Gaussian processes for Multi-task, Multi-output and Multi-class

  • Bonilla et al. (n.d.) suggest ICM for multitask learning.

  • Use a PPCA form for \(\mathbf{B}\): similar to our Kalman filter example.

  • Refer to the autokrigeability effect as the cancellation of inter-task transfer.

  • Also discuss the similarities between the multi-task GP and the ICM, and its relationship to the SLFM and the LMC.

Multitask Classification

  • Mostly restricted to the case where the outputs are conditionally independent given the hyperparameters \(\boldsymbol{\phi}\) (Lawrence and Platt, 2004; Minka and Picard, 1997; Rasmussen and Williams, 2006; Seeger and Jordan, 2004; Williams and Barber, 1998; Yu et al., 2005).

  • The intrinsic coregionalization model has also been used in the multiclass scenario: Skolidis and Sanguinetti (2011) use it for classification, introducing a probit noise model as the likelihood.

  • Posterior distribution is no longer analytically tractable: approximate inference is required.

Computer Emulation

  • A statistical model used as a surrogate for a computationally expensive computer model.

  • Higdon et al. (2008) use the linear model of coregionalization to model images representing the evolution of the implosion of steel cylinders.

  • Conti and O’Hagan (2009) use the ICM to emulate a vegetation model: the Sheffield Dynamic Global Vegetation Model (Woodward et al., 1998).

Modelling Multiple Outputs

Running Example

Olympic Sprint Data

  • Gold medal times for Olympic Sprints for Men and Women
  • 100m, 200m, 400m
  • In the early years of the Olympics not all events were run.
Image from Wikimedia Commons http://bit.ly/16kMKHQ by Darren Wilkinson

Olympic Sprint Data

Gaussian Process Fit

Olympic Marathon Data GP

Olympic Marathon Data LMC GP

Olympic Marathon Data ICM GP
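
A sketch of how such a multi-output fit might be set up in GPy. The data arrays here are placeholders standing in for the gold-medal times, and `GPy.util.multioutput.ICM` and `GPy.models.GPCoregionalizedRegression` are assumed from a standard GPy install; this is not the exact code behind the figures.

```python
import numpy as np
import GPy

# Placeholder data: one (year, time) set per event; the real 100m, 200m and
# 400m gold-medal times would be substituted here.
np.random.seed(4)
X_list = [np.sort(np.random.uniform(1896, 2012, (20, 1)), axis=0) for _ in range(3)]
Y_list = [10.0 + 0.1 * np.random.randn(20, 1) for _ in range(3)]

# ICM covariance shared across the three events: B ⊗ k(X, X).
icm = GPy.util.multioutput.ICM(input_dim=1, num_outputs=3,
                               kernel=GPy.kern.RBF(input_dim=1))
model = GPy.models.GPCoregionalizedRegression(X_list, Y_list, kernel=icm)
model.optimize()
print(model)
```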

Thanks!

References

Bonilla, E.V., Chai, K.M., Williams, C.K.I., n.d. Multi-task Gaussian process prediction.
Conti, S., O’Hagan, A., 2009. Bayesian emulation of complex multi-output and dynamic computer models. Journal of Statistical Planning and Inference 140, 640–651. https://doi.org/10.1016/j.jspi.2009.08.006
Goovaerts, P., 1997. Geostatistics For Natural Resources Evaluation. Oxford University Press.
Helterbrand, J.D., Cressie, N.A.C., 1994. Universal cokriging under intrinsic coregionalization. Mathematical Geology 26, 205–226.
Higdon, D.M., Gattiker, J., Williams, B., Rightley, M., 2008. Computer model calibration using high dimensional output. Journal of the American Statistical Association 103, 570–583.
Journel, A.G., Huijbregts, C.J., 1978. Mining geostatistics. Academic Press, London.
Lawrence, N.D., Platt, J.C., 2004. Learning to learn with the informative vector machine. pp. 512–519. https://doi.org/10.1145/1015330.1015382
Minka, T.P., Picard, R.W., 1997. Learning how to learn is learning with point sets.
Rasmussen, C.E., Williams, C.K.I., 2006. Gaussian processes for machine learning. MIT Press, Cambridge, MA.
Rogers, S., Girolami, M., 2011. A first course in machine learning. CRC Press.
Seeger, M., Jordan, M.I., 2004. Sparse Gaussian Process classification with multiple classes (No. 661). Department of Statistics, University of California at Berkeley.
Skolidis, G., Sanguinetti, G., 2011. Bayesian multitask classification with Gaussian process priors. IEEE Transactions on Neural Networks 22, 2011–2021.
Teh, Y.W., Seeger, M., Jordan, M.I., n.d. Semiparametric latent factor models. pp. 333–340.
Wackernagel, H., 2003. Multivariate geostatistics: An introduction with applications, 3rd ed. Springer.
Williams, C.K.I., Barber, D., 1998. Bayesian Classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 1342–1351.
Woodward, I., Lomas, M.R., Betts, R.A., 1998. Vegetation-climate feedbacks in a greenhouse world. Philosophical Transactions: Biological Sciences 353, 29–39.
Yu, K., Tresp, V., Schwaighofer, A., 2005. Learning Gaussian processes from multiple tasks, in: Proceedings of the 22nd International Conference on Machine Learning (ICML 2005). pp. 1012–1019.