We have a state matrix \(\mathbf{X}= \left[\mathbf{ x}_1 \dots \mathbf{ x}_q\right] \in \mathbb{R}^{T \times q}\), and if each state dimension evolves independently we have \[ \begin{align*} p(\mathbf{X}) &= \prod_{i=1}^qp(\mathbf{ x}_{:, i}) \\ p(\mathbf{ x}_{:, i}) &= \mathcal{N}\left(\mathbf{ x}_{:, i}|\mathbf{0},\mathbf{K}\right). \end{align*} \]
We want to obtain outputs through: \[ \mathbf{ y}_{i, :} = \mathbf{W}\mathbf{ x}_{i, :} \]
Can stack the columns of \(\mathbf{X}\) to form a column vector: \[\mathbf{ x}= \begin{bmatrix} \mathbf{ x}_{:, 1}\\ \mathbf{ x}_{:, 2}\\ \vdots\\ \mathbf{ x}_{:, q} \end{bmatrix}\] \[p(\mathbf{ x}) = \mathcal{N}\left(\mathbf{ x}|\mathbf{0},\mathbf{I}\otimes \mathbf{K}\right)\] For this stacking the marginal distribution over time is given by the block diagonals.
Can also stack each row of \(\mathbf{X}\) to form column vector: \[\mathbf{ x}= \begin{bmatrix} \mathbf{ x}_{1, :}\\ \mathbf{ x}_{2, :}\\ \vdots\\ \mathbf{ x}_{T, :} \end{bmatrix}\] \[p(\mathbf{ x}) = \mathcal{N}\left(\mathbf{ x}|\mathbf{0},\mathbf{K}\otimes \mathbf{I}\right)\]
For this stacking the marginal distribution over the latent dimensions is given by the block diagonals.
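As a minimal numerical sketch of the two stackings (the exponentiated quadratic covariance over a time grid is an illustrative choice, not part of the model above):

```python
import numpy as np

# Illustrative exponentiated quadratic covariance over a regular time grid.
T, q = 10, 3
t = np.linspace(0, 1, T)[:, None]
K = np.exp(-0.5 * (t - t.T) ** 2 / 0.1 ** 2)

# Column stacking x = [x_{:,1}; ...; x_{:,q}]: covariance I ⊗ K,
# block diagonal holds one copy of K per latent dimension (marginal over time).
cov_cols = np.kron(np.eye(q), K)

# Row stacking x = [x_{1,:}; ...; x_{T,:}]: covariance K ⊗ I,
# each diagonal block is K[t, t] * I (marginal over latent dimensions at time t).
cov_rows = np.kron(K, np.eye(q))

# Both orderings describe the same joint Gaussian over the entries of X.
x = np.random.multivariate_normal(np.zeros(T * q), cov_cols + 1e-8 * np.eye(T * q))
X = x.reshape(q, T).T   # recover the T x q matrix with independent columns
print(X.shape)          # (10, 3)
```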
The observations are related to the latent points by a linear mapping matrix, \[ \mathbf{ y}_{i, :} = \mathbf{W}\mathbf{ x}_{i, :} + \boldsymbol{ \epsilon}_{i, :} \] \[ \boldsymbol{ \epsilon}\sim \mathcal{N}\left(\mathbf{0},\sigma^2\mathbf{I}\right) \]
This leads to a covariance of the form \[ (\mathbf{I}\otimes \mathbf{W}) (\mathbf{K}\otimes \mathbf{I}) (\mathbf{I}\otimes \mathbf{W}^\top) + \mathbf{I}\sigma^2 \] Using \((\mathbf{A}\otimes\mathbf{B}) (\mathbf{C}\otimes\mathbf{D}) = \mathbf{A}\mathbf{C} \otimes \mathbf{B}\mathbf{D}\), this leads to \[ \mathbf{K}\otimes {\mathbf{W}}{\mathbf{W}}^\top + \mathbf{I}\sigma^2 \] or, under the alternative (column) stacking of the outputs, \[ \mathbf{ y}\sim \mathcal{N}\left(\mathbf{0},\mathbf{W}\mathbf{W}^\top \otimes \mathbf{K}+ \mathbf{I}\sigma^2\right) \]
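A quick numerical check of this Kronecker identity (the particular \(\mathbf{K}\) and \(\mathbf{W}\) below are arbitrary, generated only for the check):

```python
import numpy as np

# Numerical check of (I ⊗ W)(K ⊗ I)(I ⊗ W^T) = K ⊗ W W^T for arbitrary K and W.
rng = np.random.default_rng(0)
T, q, p = 5, 3, 2
A = rng.standard_normal((T, T))
K = A @ A.T                        # any positive semi-definite K will do
W = rng.standard_normal((p, q))    # linear mapping from latent to observed space
sigma2 = 0.1

lhs = np.kron(np.eye(T), W) @ np.kron(K, np.eye(q)) @ np.kron(np.eye(T), W.T)
rhs = np.kron(K, W @ W.T)
assert np.allclose(lhs, rhs)       # (A ⊗ B)(C ⊗ D) = AC ⊗ BD

cov_y = rhs + sigma2 * np.eye(T * p)   # covariance of the stacked, noisy outputs
```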
This Kronecker structure leads to several published models. \[ (\mathbf{K}(\mathbf{ x},\mathbf{ x}^\prime))_{i,i^\prime}=k(\mathbf{ x},\mathbf{ x}^\prime)k_T(i,i^\prime), \] where \(k\) takes \(\mathbf{ x}, \mathbf{ x}^\prime\) as inputs and \(k_T\) takes the output indices \(i, i^\prime\) as inputs.
Can think of multiple output covariance functions as covariances with augmented input.
Alongside \(\mathbf{ x}\) we also input the \(i\) associated with the output of interest.
Taking \(\mathbf{B}= {\mathbf{W}}{\mathbf{W}}^\top\) we have a matrix expression across outputs. \[\mathbf{K}(\mathbf{ x},\mathbf{ x}^\prime)=k(\mathbf{ x},\mathbf{ x}^\prime)\mathbf{B},\] where \(\mathbf{B}\) is a \(p\times p\) symmetric and positive semi-definite matrix.
\(\mathbf{B}\) is called the coregionalization matrix.
We call this class of covariance functions separable due to their product structure.
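A minimal sketch of the augmented-input view, assuming an exponentiated quadratic base kernel (the helper names `k_base` and `k_multi` are illustrative, not from any library):

```python
import numpy as np

def k_base(x, x_prime, lengthscale=1.0):
    """Exponentiated quadratic covariance on the original input space."""
    return np.exp(-0.5 * np.sum((x - x_prime) ** 2) / lengthscale ** 2)

def k_multi(x, i, x_prime, i_prime, B):
    """Separable multi-output covariance on the augmented input (x, i)."""
    return k_base(x, x_prime) * B[i, i_prime]

# p = 2 outputs with a symmetric, positive semi-definite coregionalization matrix.
B = np.array([[1.0, 0.5],
              [0.5, 1.5]])
x, x_prime = np.array([0.3]), np.array([0.7])

# Covariance between output 0 at x and output 1 at x'.
print(k_multi(x, 0, x_prime, 1, B))
```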
In the same spirit a more general class of kernels is given by \[\mathbf{K}(\mathbf{ x},\mathbf{ x}^\prime)=\sum_{j=1}^qk_{j}(\mathbf{ x},\mathbf{ x}^\prime)\mathbf{B}_{j}.\]
This can also be written as \[\mathbf{K}(\mathbf{X}, \mathbf{X}) = \sum_{j=1}^q\mathbf{B}_{j}\otimes k_{j}(\mathbf{X}, \mathbf{X}),\]
This is like several Kalman filter-type models added together, but each one with a different set of latent functions.
We call this class of kernels sum of separable kernels (SoS kernels).
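A sketch of assembling an SoS kernel matrix over a dataset; the `sum_of_separable` helper and the coregionalization matrices below are illustrative assumptions:

```python
import numpy as np

def eq_gram(X, lengthscale):
    """Exponentiated quadratic Gram matrix k_j(X, X) on a set of inputs."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

def sum_of_separable(X, Bs, lengthscales):
    """K(X, X) = sum_j B_j ⊗ k_j(X, X)."""
    return sum(np.kron(B, eq_gram(X, ell)) for B, ell in zip(Bs, lengthscales))

X = np.linspace(0, 1, 25)[:, None]
Bs = [np.array([[1.0, 0.3], [0.3, 0.8]]),   # illustrative PSD coregionalization matrices
      np.array([[0.5, 0.2], [0.2, 0.6]])]
K_sos = sum_of_separable(X, Bs, lengthscales=[1.0, 0.2])
print(K_sos.shape)   # (50, 50): p*n x p*n with p = 2 outputs, n = 25 inputs
```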
The use of GPs in geostatistics is called kriging.
These multi-output GPs were pioneered in geostatistics, where prediction over vector-valued output data is known as cokriging.
The model in geostatistics is known as the linear model of coregionalization (LMC; Journel and Huijbregts, 1978; Goovaerts, 1997).
Most machine learning multitask models can be placed in the context of the LMC model.
In the linear model of coregionalization (LMC) outputs are expressed as linear combinations of independent random functions.
In the LMC, each component \(f_i\) is expressed as a linear sum \[f_i(\mathbf{ x}) = \sum_{j=1}^q{w}_{i,{j}}{u}_{j}(\mathbf{ x}).\] where the latent functions are independent and have covariance functions \(k_{j}(\mathbf{ x},\mathbf{ x}^\prime)\).
The latent processes \(\{{u}_{j}(\mathbf{ x})\}_{j=1}^q\) are independent, i.e. \({u}_{j}\) and \({u}_{{j}^\prime}\) are independent for \(j\neq {j}^\prime\).
The Kalman filter is an example of the LMC where \({u}_i(\mathbf{ x}) \rightarrow {x}_i(t)\).
I.e. we’ve moved from a time input to a more general input space.
In matrix notation: \[\mathbf{ f}(\mathbf{ x}) = \mathbf{W}\mathbf{ u}(\mathbf{ x}),\] where \(\mathbf{ u}(\mathbf{ x})\) stacks the latent functions \({u}_{j}(\mathbf{ x})\).
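A generative sketch of the LMC under these assumptions (latent functions drawn from independent GPs with exponentiated quadratic covariances, then mixed linearly; all values illustrative):

```python
import numpy as np

def eq_gram(t, lengthscale):
    """Exponentiated quadratic Gram matrix on a 1-D input grid."""
    return np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / lengthscale ** 2)

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 100)
n = len(t)

q, p = 2, 3                        # q latent functions, p outputs
lengthscales = [1.0, 0.2]          # one covariance k_j per latent function
W = rng.standard_normal((p, q))    # mixing weights w_{i,j}

# Draw each latent function independently: u_j ~ GP(0, k_j).
U = np.column_stack([
    rng.multivariate_normal(np.zeros(n), eq_gram(t, ell) + 1e-8 * np.eye(n))
    for ell in lengthscales
])

# f(x) = W u(x), evaluated at every input simultaneously.
F = U @ W.T   # shape (n, p): column i holds f_i on the grid
print(F.shape)
```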
If a single covariance function is used for all the latent functions (as in the Kalman filter example), this is called the intrinsic coregionalization model (ICM; Goovaerts, 1997).
The kernel matrix corresponding to a dataset \(\mathbf{X}\) takes the form \[ \mathbf{K}(\mathbf{X}, \mathbf{X}) = \mathbf{B}\otimes k(\mathbf{X}, \mathbf{X}). \]
If outputs are noise-free, maximum likelihood is equivalent to independent fits of \(\mathbf{B}\) and \(k(\mathbf{ x}, \mathbf{ x}^\prime)\) (Helterbrand and Cressie, 1994).
In geostatistics this is known as autokrigeability (Wackernagel, 2003).
In multi-task learning it is known as the cancellation of inter-task transfer (Bonilla et al., n.d.).
\[ \mathbf{K}(\mathbf{X}, \mathbf{X}) = \mathbf{ w}\mathbf{ w}^\top \otimes k(\mathbf{X}, \mathbf{X}). \]
\[ \mathbf{ w}= \begin{bmatrix} 1 \\ 5\end{bmatrix} \] \[ \mathbf{B}= \begin{bmatrix} 1 & 5\\ 5&25\end{bmatrix} \]
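For this weight vector the coregionalization matrix is just the outer product; a minimal check, with an illustrative base kernel:

```python
import numpy as np

w = np.array([1.0, 5.0])
B = np.outer(w, w)                  # [[1, 5], [5, 25]]: rank 1
print(np.linalg.matrix_rank(B))     # 1

X = np.linspace(0, 1, 30)[:, None]
k = np.exp(-0.5 * (X - X.T) ** 2)   # illustrative base kernel k(X, X)
K_icm = np.kron(B, k)               # (2*30) x (2*30) multi-output covariance
```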
\[ \mathbf{K}(\mathbf{X}, \mathbf{X}) = \mathbf{B}\otimes k(\mathbf{X}, \mathbf{X}). \]
\[ \mathbf{B}= \begin{bmatrix} 1 & 0.5\\ 0.5& 1.5\end{bmatrix} \]
\[\mathbf{K}(\mathbf{X}, \mathbf{X}) = \mathbf{B}_1 \otimes k_1(\mathbf{X}, \mathbf{X}) + \mathbf{B}_2 \otimes k_2(\mathbf{X}, \mathbf{X})\]
\[\mathbf{B}_1 = \begin{bmatrix} 1.4 & 0.5\\ 0.5& 1.2\end{bmatrix}\] \[{\ell}_1 = 1\] \[\mathbf{B}_2 = \begin{bmatrix} 1 & 0.5\\ 0.5& 1.3\end{bmatrix}\] \[{\ell}_2 = 0.2\]
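A sketch of sampling jointly from this two-component covariance, assuming an exponentiated quadratic base kernel for each \(k_j\) (an assumption about the example's setup):

```python
import numpy as np

def eq_gram(X, lengthscale):
    """Exponentiated quadratic Gram matrix on a column of 1-D inputs."""
    sq = (X[:, None, 0] - X[None, :, 0]) ** 2
    return np.exp(-0.5 * sq / lengthscale ** 2)

X = np.linspace(0, 1, 100)[:, None]
B1 = np.array([[1.4, 0.5], [0.5, 1.2]])
B2 = np.array([[1.0, 0.5], [0.5, 1.3]])
K = np.kron(B1, eq_gram(X, 1.0)) + np.kron(B2, eq_gram(X, 0.2))

rng = np.random.default_rng(0)
f = rng.multivariate_normal(np.zeros(K.shape[0]), K + 1e-8 * np.eye(K.shape[0]))
f1, f2 = np.split(f, 2)   # the two jointly sampled, correlated output functions
```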
The LMC is used in machine learning for multivariate regression with GPs and in statistics for emulation of expensive multivariate computer codes.
Imposes the correlation of the outputs explicitly through the set of coregionalization matrices.
Setting \(\mathbf{B}= \mathbf{I}_p\) assumes outputs are conditionally independent given the parameters \(\boldsymbol{ \theta}\) (Lawrence and Platt, 2004; Minka and Picard, 1997; Yu et al., 2005).
More recent approaches for multiple output modeling are different versions of the linear model of coregionalization.
The coregionalization matrices are rank 1 (Teh et al., n.d.), so we can rewrite the kernel as \[\mathbf{K}(\mathbf{X}, \mathbf{X}) = \sum_{j=1}^q\mathbf{ w}_{:, {j}}\mathbf{ w}^{\top}_{:, {j}} \otimes k_{j}(\mathbf{X}, \mathbf{X}).\]
Like the Kalman filter, but each latent function has a different covariance.
The authors suggest using an exponentiated quadratic covariance with a characteristic length-scale for each input dimension.
\[ \mathbf{K}(\mathbf{X}, \mathbf{X}) = \mathbf{ w}_{:, 1}\mathbf{ w}_{:, 1}^\top \otimes k_1(\mathbf{X}, \mathbf{X}) + \mathbf{ w}_{:, 2} \mathbf{ w}_{:, 2}^\top \otimes k_2(\mathbf{X}, \mathbf{X}) \]
\[ \mathbf{ w}_1 = \begin{bmatrix} 0.5 \\ 1\end{bmatrix} \] \[ \mathbf{ w}_2 = \begin{bmatrix} 1 \\ 0.5\end{bmatrix} \]
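Plugging these weight vectors into the rank-1 form (the base kernels and length-scales below are illustrative assumptions):

```python
import numpy as np

def eq_gram(X, lengthscale):
    """Exponentiated quadratic Gram matrix on a column of 1-D inputs."""
    sq = (X[:, None, 0] - X[None, :, 0]) ** 2
    return np.exp(-0.5 * sq / lengthscale ** 2)

X = np.linspace(0, 1, 50)[:, None]
w1 = np.array([0.5, 1.0])
w2 = np.array([1.0, 0.5])

# Each term uses a rank-1 coregionalization matrix w_{:,j} w_{:,j}^T.
K_slfm = (np.kron(np.outer(w1, w1), eq_gram(X, 1.0)) +
          np.kron(np.outer(w2, w2), eq_gram(X, 0.2)))
print(K_slfm.shape)   # (100, 100)
```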
Bonilla et al. (n.d.) suggest ICM for multitask learning.
Use a PPCA form for \(\mathbf{B}\): similar to our Kalman filter example.
They refer to the autokrigeability effect as the cancellation of inter-task transfer.
They also discuss the similarities between the multi-task GP and the ICM, and its relationship to the SLFM and the LMC.
Mostly restricted to the case where the outputs are conditionally independent given the hyperparameters \(\boldsymbol{\phi}\) (Lawrence and Platt, 2004; Minka and Picard, 1997; Rasmussen and Williams, 2006; Seeger and Jordan, 2004; Williams and Barber, 1998; Yu et al., 2005).
Intrinsic coregionalization model has been used in the multiclass scenario. Skolidis and Sanguinetti (2011) use the intrinsic coregionalization model for classification, by introducing a probit noise model as the likelihood.
Posterior distribution is no longer analytically tractable: approximate inference is required.
A computer emulator is a statistical model used as a surrogate for a computationally expensive computer model.
Higdon et al. (2008) use the linear model of coregionalization to model images representing the evolution of the implosion of steel cylinders.
Conti and O’Hagan (2009) use the ICM to model a vegetation simulator, the Sheffield Dynamic Global Vegetation Model (Woodward et al., 1998).