Week 2: Practical Gaussian Processes
Abstract:
Gaussian processes provide a probability measure that allows us to perform statistical inference over the space of functions. While GPs are nice as mathematical objects, when we need to implement them in practice we often run into issues. In this worksheet we take a whirlwind tour of a couple of approaches to address these problems. We look at how to handle the numerical issues that often appear, and at approximations that circumvent the computational cost associated with Gaussian processes. Importantly, when you continue using these models in the course you are most likely not going to implement them yourself but instead use one of the many excellent software packages that exist. The methods described here show how these packages implement GPs, and they should give you an idea of the type of thinking that goes into the implementation of machine learning models.
Setup
notutils
This small package is a helper package for various notebook utilities used below.
The software can be installed using
%pip install notutils
from the command prompt where you can access your python installation.
The code is also available on GitHub: https://github.com/lawrennd/notutils
Once notutils
is installed, it can be imported in the
usual manner.
import notutils
pods
In Sheffield we created a suite of software tools for ‘Open Data Science’. Open data science is an approach to sharing code, models and data that should make it easier for companies, health professionals and scientists to gain access to data science techniques.
You can also check this blog post on Open Data Science.
The software can be installed using
%pip install pods
from the command prompt where you can access your python installation.
The code is also available on GitHub: https://github.com/lawrennd/ods
Once pods
is installed, it can be imported in the usual
manner.
import pods
mlai
The mlai
software is a suite of helper functions for
teaching and demonstrating machine learning algorithms. It was first
used in the Machine Learning and Adaptive Intelligence course in
Sheffield in 2013.
The software can be installed using
%pip install mlai
from the command prompt where you can access your python installation.
The code is also available on GitHub: https://github.com/lawrennd/mlai
Once mlai
is installed, it can be imported in the usual
manner.
import mlai
from mlai import plot
So far we have not done any type of learning with Gaussian processes. Learning is the process where we adapt the parameters of the model to the data that we observe. For a probabilistic model the objective is to fit the distribution of the model such that the data has high likelihood under the model. You can do this at many different levels. You can fit the parameters of the likelihood directly to data, referred to as maximum likelihood, but in that case you have not taken into account the information in the prior, i.e. you are fitting the data in a completely unconstrained way, which is likely to lead to overfitting. Instead we try to marginalise out as many of the parameters as we can, to reflect the knowledge that we have of the problem, and then optimise the remaining ones.
Remember, there will always be parameters left in the model as long as the distributions you integrate out are themselves parametrised. For Gaussian processes, in most practical applications, we marginalise the prior over the function values, \(p(f | X, \theta)\), against the likelihood to reach the marginal likelihood,
\[p(y | X, \theta) = \int p(y | f)\,p(f | X, \theta)\,df\]
This distribution no longer has \(f\) as a parameter, but it still depends on the parameters \(\theta\) of the prior. Learning now implies altering these parameters so that we find the ones for which the probability of the observed data is maximised,
\[\theta^* = \argmax_\theta p(y | X, \theta)\]
Now let us do this specifically for a zero-mean Gaussian process prior. We focus on the zero-mean setting as it is usually not the mean that gives us problems but the covariance function; if you want a parametrised mean function, most of what we discuss carries over unchanged. Given that most distributions you will ever work with are in the exponential family, the standard approach is to maximise the log marginal likelihood. If we write this out for a Gaussian process it has the following form,
\[\log p(y | X, \theta) = \log \int p(y | f)\,p(f | X, \theta)\,df\] \[ = -\frac{1}{2}y^T\left(k(X, X) + \beta^{-1}I\right)^{-1}y - \frac{1}{2}\log \det\left(k(X, X) + \beta^{-1}I\right) - \frac{N}{2}\log 2\pi\]
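To make the expression concrete, here is a naive sketch of this computation that forms the inverse and determinant explicitly; the kernel matrix `K` (i.e. \(k(X, X)\)) and noise precision `beta` are assumed to be given. The next section shows why this direct approach is a bad idea numerically.
import numpy as np

def log_marginal_likelihood_naive(y, K, beta):
    """
    Naive log marginal likelihood that forms the inverse and determinant
    of the covariance directly. For large N the determinant will
    typically overflow or underflow.
    """
    N = y.shape[0]
    C = K + np.eye(N) / beta  # k(X, X) + beta^{-1} I
    return (-0.5 * y.T @ np.linalg.inv(C) @ y
            - 0.5 * np.log(np.linalg.det(C))
            - 0.5 * N * np.log(2 * np.pi))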
Numerical Stability
To address the numerical issues related to the two expressions above we are going to exploit the fact that the problem we work on is quite structured. The covariance matrix belongs to the class of positive definite matrices: it is symmetric, full rank, and all of its eigenvalues are positive. We can exploit this to make the computations better conditioned. To do so we use the Cholesky decomposition, which allows us to write a positive definite matrix as the product of a lower-triangular matrix and its transpose,
\[K = LL^T\]
Let’s first look at how the decomposition can be used to compute the log determinant term in the marginal log-likelihood,
import numpy as np
from scipy import linalg

def compute_logdet(K):
    """
    Compute the log determinant of the matrix K using its Cholesky decomposition.

    Parameters:
        K (ndarray): Positive definite matrix

    Returns:
        float: log determinant of K
    """
    L = np.linalg.cholesky(K)
    return 2 * np.sum(np.log(np.diag(L)))
The decomposition allows us to write:
\[\log \det K = \log \det(LL^T) = \log (\det L)^2\]
Now we have to compute the determinant of the Cholesky factor \(L\). This turns out to be very easy, as the determinant of a triangular matrix is the product of the values on its diagonal,
\[\det L = \prod_{i} \ell_{ii}\]
If we put everything together we get:
\[\log \det K = \log\left(\prod_{i} \ell_{ii}\right)^2 = 2\sum_{i} \log \ell_{ii}\]
So the log determinant of the covariance matrix is simply twice the sum of the logarithms of the diagonal elements of the Cholesky factor. From our first classes in computer science we know that summing many values of different scales is much better for retaining precision than taking their product.
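As a quick sanity check, the Cholesky-based computation should agree with `np.linalg.slogdet`; the small positive definite test matrix below is constructed just for illustration.
# Random positive definite test matrix
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
K_test = B @ B.T + 5 * np.eye(5)  # symmetric and positive definite by construction

sign, logdet = np.linalg.slogdet(K_test)
print(compute_logdet(K_test), logdet)  # the two values should agree closely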
You can also use the Cholesky factor to address the term that includes the inverse and make it better conditioned. Rather than forming the inverse explicitly, we solve two systems of linear equations, both of which are triangular and therefore trivial to solve.
def solve_cholesky(K, y):
    """
    Solve the system Kx = y using the Cholesky decomposition.

    Parameters:
        K (ndarray): Positive definite matrix
        y (ndarray): Right hand side vector

    Returns:
        ndarray: Solution x
    """
    L = np.linalg.cholesky(K)
    # Forward substitution
    z = linalg.solve_triangular(L, y, lower=True)
    # Backward substitution
    x = linalg.solve_triangular(L.T, z, lower=False)
    return x
Let’s write down how this works step by step. For solving the system \(Ax = b\), instead of computing \(x = A^{-1}b\) directly, we use the Cholesky decomposition \(A = LL^T\) giving us:
\[LL^Tx = b\]
We can solve this in two steps:
First solve \(Lz = b\) by forward substitution: \[\begin{aligned} \ell_{1,1}z_1 &= b_1 \\ \ell_{2,1}z_1 + \ell_{2,2}z_2 &= b_2 \\ &\vdots \\ \ell_{n,1}z_1 + \ell_{n,2}z_2 + \ldots + \ell_{n,n}z_n &= b_n \end{aligned}\]
Then solve \(L^Tx = z\) by backward substitution: \[\begin{aligned} \ell_{n,n}x_n &= z_n \\ \ell_{n-1,n-1}x_{n-1} + \ell_{n,n-1}x_n &= z_{n-1} \\ &\vdots \\ \ell_{1,1}x_1 + \ell_{2,1}x_2 + \ldots + \ell_{n,1}x_n &= z_1 \end{aligned}\]
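Putting the two pieces together, a minimal sketch of a numerically stabler log marginal likelihood might look as follows; here the matrix `K` is assumed to already include the noise term, i.e. \(K = k(X, X) + \beta^{-1}I\).
def log_marginal_likelihood_stable(y, K):
    """
    Log marginal likelihood computed from a single Cholesky factorisation.
    K is assumed to be k(X, X) + beta^{-1} I.
    """
    N = y.shape[0]
    L = np.linalg.cholesky(K)
    # alpha = K^{-1} y via forward and backward substitution
    alpha = linalg.solve_triangular(L.T,
                                    linalg.solve_triangular(L, y, lower=True),
                                    lower=False)
    logdet = 2 * np.sum(np.log(np.diag(L)))
    return -0.5 * y.T @ alpha - 0.5 * logdet - 0.5 * N * np.log(2 * np.pi)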
Approximate Inference
Computing the marginal likelihood of a Gaussian process requires inverting a matrix whose size is given by the number of data points. As we have already seen this can be numerically tricky. It is also very expensive, with cubic computational complexity, which severely limits the size of the datasets that we can use.
Variational Inference
The key idea behind variational inference is to approximate an intractable posterior distribution \(p(f|y)\) with a simpler distribution \(q(f)\). We do this by minimising the KL divergence between these distributions.
def kl_divergence(q_mean, q_cov, p_mean, p_cov):
    """
    Compute the KL divergence between two multivariate Gaussians.

    Parameters:
        q_mean, p_mean (ndarray): Mean vectors
        q_cov, p_cov (ndarray): Covariance matrices

    Returns:
        float: KL(q||p)
    """
    k = len(q_mean)

    # Compute inverse of p_cov via its Cholesky factor
    L = np.linalg.cholesky(p_cov)
    p_cov_inv = linalg.solve_triangular(L.T,
                                        linalg.solve_triangular(L, np.eye(k), lower=True),
                                        lower=False)

    # Compute the terms of the closed-form Gaussian KL divergence
    trace_term = np.trace(p_cov_inv @ q_cov)
    mu_term = (p_mean - q_mean).T @ p_cov_inv @ (p_mean - q_mean)
    logdet_term = np.log(np.linalg.det(p_cov)) - np.log(np.linalg.det(q_cov))

    return 0.5 * (trace_term + mu_term - k + logdet_term)
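A simple check of the implementation is that the divergence between a Gaussian and itself is zero, and that it grows as the two distributions move apart:
# KL between identical Gaussians should be (numerically) zero
m = np.zeros(3)
S = np.eye(3)
print(kl_divergence(m, S, m, S))            # approximately 0.0
print(kl_divergence(m, S, m + 1.0, 2 * S))  # strictly positive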
We can derive a lower bound on the log marginal likelihood using Jensen’s inequality. For a convex function \(f\) and a probability density \(q(x)\):
\[f\left(\int g(x)\,q(x)\,dx\right) \leq \int f\left(g(x)\right)q(x)\,dx\]
For the log function, which is concave, the inequality is reversed:
\[\log\left(\int g(x)\,q(x)\,dx\right) \geq \int \log\left(g(x)\right)q(x)\,dx\]
Using this, we can derive the Evidence Lower BOund (ELBO):
\[\log p(y) \geq \mathbb{E}_{q(f)}[\log p(y|f)] - \text{KL}(q(f)||p(f))\]
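For GP regression with a Gaussian likelihood, both terms of the bound are available in closed form. The sketch below assumes a Gaussian \(q(f)\) with mean `q_mean` and covariance `q_cov` and reuses `kl_divergence` from above; it is illustrative rather than a complete variational scheme.
def gp_regression_elbo(y, q_mean, q_cov, prior_mean, prior_cov, noise_var):
    """
    Evidence lower bound for GP regression with a Gaussian q(f):
    ELBO = E_q[log p(y | f)] - KL(q(f) || p(f)).
    """
    N = y.shape[0]
    # Expected log likelihood under q(f) for a Gaussian observation model
    expected_loglik = (-0.5 * N * np.log(2 * np.pi * noise_var)
                       - 0.5 / noise_var * (np.sum((y - q_mean) ** 2)
                                            + np.trace(q_cov)))
    return expected_loglik - kl_divergence(q_mean, q_cov, prior_mean, prior_cov)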
Sparse Approximations
To make GPs scalable to large datasets, we introduce inducing points: a smaller set of points that summarise the GP. Let’s implement this approach:
def sparse_gp(X, y, Z, kernel, noise_var=1.0):
    """
    Sparse GP prediction at the training inputs using inducing points.

    Parameters:
        X (ndarray): Input locations (N x D)
        y (ndarray): Observations (N,)
        Z (ndarray): Inducing point locations (M x D)
        kernel: Kernel function
        noise_var: Observation noise variance

    Returns:
        tuple: Predictive mean and covariance
    """
    # Compute kernel matrices
    Kuf = kernel(Z, X)  # M x N
    Kuu = kernel(Z, Z)  # M x M

    # Compute Cholesky of Kuu
    L = np.linalg.cholesky(Kuu)

    # Whitened cross-covariance A = L^{-1} Kuf and its outer product
    A = linalg.solve_triangular(L, Kuf, lower=True)
    AAT = A @ A.T

    # B = I + A A^T / noise_var plays the role of Kuu + Kuf Kfu / noise_var
    B = np.eye(Kuu.shape[0]) + AAT / noise_var
    LB = np.linalg.cholesky(B)

    # Nystrom approximation Qff = Kfu Kuu^{-1} Kuf = A^T A
    Qff = A.T @ A

    # Predictive mean: A^T B^{-1} A y / noise_var
    c = linalg.solve_triangular(LB, A @ y, lower=True) / noise_var
    mean = A.T @ linalg.solve_triangular(LB.T, c, lower=False)

    # Predictive covariance: Kff - Qff + A^T B^{-1} A
    V = linalg.solve_triangular(LB, A, lower=True)
    var = kernel(X, X) - Qff + V.T @ V
    return mean, var
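To try the function out you need a kernel; the `rbf_kernel` helper and the toy data below are illustrative assumptions made just for this example.
def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Exponentiated quadratic (RBF) covariance between two sets of inputs."""
    sq_dist = (np.sum(X1 ** 2, axis=1)[:, None]
               + np.sum(X2 ** 2, axis=1)[None, :]
               - 2 * X1 @ X2.T)
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)

# Toy data: a noisy sine observed at 200 points, summarised by 10 inducing points
rng = np.random.default_rng(1)
X = np.linspace(0, 10, 200)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
Z = np.linspace(0, 10, 10)[:, None]

mean, var = sparse_gp(X, y, Z, rbf_kernel, noise_var=0.01)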
Software Packages
Several excellent software packages exist for working with Gaussian processes. Here’s a brief example using GPyTorch:
%pip install gpytorch
import gpytorch
import torch
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
def train_gp(train_x, train_y, n_iterations=100):
    """
    Train a GP model using GPyTorch.
    """
    likelihood = gpytorch.likelihoods.GaussianLikelihood()
    model = ExactGPModel(train_x, train_y, likelihood)

    # Use the adam optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

    # "Loss" for GPs - the marginal log likelihood
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

    model.train()
    likelihood.train()

    for i in range(n_iterations):
        optimizer.zero_grad()
        output = model(train_x)
        loss = -mll(output, train_y)
        loss.backward()
        optimizer.step()

    return model, likelihood
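A minimal usage sketch, assuming some toy one-dimensional data; after training we switch to evaluation mode to make predictions:
import math

# Toy data: a noisy sine wave
train_x = torch.linspace(0, 1, 100)
train_y = torch.sin(train_x * (2 * math.pi)) + 0.1 * torch.randn(train_x.size(0))

model, likelihood = train_gp(train_x, train_y, n_iterations=100)

# Predictions at test locations
model.eval()
likelihood.eval()
test_x = torch.linspace(0, 1, 51)
with torch.no_grad():
    pred = likelihood(model(test_x))
    mean = pred.mean
    lower, upper = pred.confidence_region()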
Summary
In this worksheet, we’ve covered the practical aspects of implementing Gaussian processes:
- Numerical stability through Cholesky decomposition
- Approximate inference using variational methods
- Sparse approximations for scaling to larger datasets
- Practical implementation using modern software packages
While GPs provide an elegant mathematical framework, making them work in practice requires careful attention to computational details. The methods we’ve discussed here form the basis of modern GP implementations.
Thanks!
For more information on these subjects and more you might want to check the following resources.
- book: The Atomic Human
- twitter: @lawrennd
- podcast: The Talking Machines
- newspaper: Guardian Profile Page
- blog: http://inverseprobability.com