at on Friday, Oct 22, 2021 at 12:00 [jupyter][google colab][reveal]
Neil D. Lawrence, University of Cambridge gscholar

#### Abstract

In this lecture we motivate the use of emulation, and introduce the GPy software as a framework for building Gaussian process emulators.

# Emulation

There are a number of ways we can use machine learning to accelerate scientific discovery. But one way is to have the machine learning model learn the effect of the rules. Rather than worrying about the detail of the rules through computing each step, we can have the machine learning model look to abstract the rules and capture emergent phenomena, just as the Maxwell-Boltzmann distribution captures the essence of the behavior of the ideal gas.

In the papers listed above, neural networks are being used to speed up computations. In this course we’ve introduced Gaussian processes that will be used to speed up these computations. In both cases the ideas are similar. Rather than rerunning the simulation, we use data from the simulation to fit the neural network or the Gaussian process to the data.

We’ll see an example of how this is done in a moment, taken from a simple ride hailing simulator, but before we look at that, we’ll first consider why this might be a useful approach.

# Surrogate Modelling in Practice

As we’ve seen from the very simple rules in the Game of Life, emergent phenomena we might be interested in take computation power to discover, just as Laplace’s and Dirac’s quotes suggest. The objective in surrogate modelling is to harness machine learning models to learn those physical characteristics.

## Types of Simulations

We’ve introduced simulations from the perspective of laws of physics. In practice, many simulations may not directly encode for the laws of physics, but they might encode expert intuitions about a problem.

For example, in Formula 1 races, the cars have tyres that wear at different rates. Softer tyres allow the cars to drive faster but wear quicker. Harder tyres man the car drives slower but they last longer. Changing between tyres is part of the race, and it has a time penalty. Before each race the teams decide what their strategy will be with tyre changes. It’s not only how many tyre changes that are important, but when they happen. If you change your tyre early, you might get a speed advantage and be able to pass your rival when they change their tyre later. This is a trick known as ‘undercutting,’ but if your early change puts you back onto the track behind other slower cars, you will lose this advantage.

Formula 1 teams determine their strategy through simulating the race. Each team knows how fast other teams are around the track, and what their top speeds are. So the teams simulate many thousands or millions of races with different strategies for their rivals, and they choose the strategy for which they maximize their number of expected points.

When many simulations are done, the results take time to come. During the actual race, the simulations are too slow to provide the real time information teams would need. In this case F1 teams can use emulators, models that have learnt the effect of the simulations, to give real time updates.

Formula 1 race simulations contain assumptions that derive from physics but don’t directly encode the physical laws. For example, if one car is stuck behind another, in any given lap, it might overtake. A typical race simulation will look at the lap speed of each car and the top speed of each car (as measured in ‘speed traps’ that are placed on the straight). It will assume a probability of overtake for each lap that is a function of these values. Of course, underlying that function is the physics of how cars overtake each other, but that can be abstracted away into a simpler function that the Race Strategy Engineer defines from their knowledge and previous experience.

Many simulations have this characteristic: major parts of the simulation are the result of encoding expert knowledge in the code. But this can lead to challenges. I once asked a strategy engineer, who had completed a new simulation, how it was going. He replied that things had started out well, but over time its performance was degrading. We discussed this for a while and over time a challenge of mispecified granularity emerged.

## Fidelity of the Simulation

The engineer explained how there’d been a race where the simulation had suggested that their main driver shouldn’t pit because he would have emerged behind a car with a slower lap speed, but a high top-speed. This would have made that car difficult to overtake. However, the driver of that slower car was also in the team’s ‘development program,’ so everyone in the team knew that the slower car would have moved aside to let their driver through. Unfortunately, the simulation didn’t know this. So, the team felt the wrong stategy decision was made. After the race, the simulation was updated to include a special case for this situation. The new code checked whether the slower car was a development driver, making it ‘more realistic.’

Over time there were a number of similar changes, each of which should have improved the simulation, but the reality was the code was now ‘mixing granularities.’ The formula for computing the probability of overtake as a function of speeds is one that is relatively easy to verify. It ignores the relationships between drivers, whether a given driver is a development driver, whether one bears a grudge or not, whether one is fighting for their place in the team. That’s all assimilated into the equation. The original equation is easy to calibrate, but as soon as you move to a finer granularity and consider more details about individual drivers, the model seems more realistic, but it becomes difficult to specify, and therefore performance degrades.

Simulations work at different fidelities, but as the Formula 1 example shows you must be very careful about mixing fidelities within the same simulation. The appropriate fidelity of a simulation is strongly dependent on the question being asked of it. On the context. For example, in Formula 1 races you can also simulate the performance of the car in the wind tunnel and using computational fluid dynamics represenations of the Navier Stokes equations. That level of fidelity is appropriate when designing the aerodynamic components of the car, but inappropriate when building a strategy simulation.

# Epidemiology

The same concept of modelling at a particular fidelity comes up in epidemiology. Disease is transmitted by direct person to person interactions between individuals and objects. But in theoretical epidemiology, this is approximated by differential equations. The resulting models look very similar to reaction rate models used in Chemistry for well mixed beakers. Let’s have a look at a simple example used for modelling the policy of ‘herd immunity’ for Covid19.

## Modelling Herd Immunity

This example is taken from Thomas House’s blog post on Herd Immunity. This model was shared at the beginning of the Covid19 pandemic when the first UK lockdown hadn’t yet occurred.

import numpy as np
from scipy import integrate

The next piece of code sets up the dynamics of the compartmental model. He doesn’t give the specific details in the blog post, but my understanding is that the four states are as follows. x[0] is the susceptible population, those that haven’t had the disease yet. The susceptible population decreases by encounters with infections people. In Thomas’s model, both x[3] and x[4] are infections. So the dynamics of the reduction of the susceptible is given by $\frac{\text{d}{S}}{\text{d}t} = - \beta S (I_1 + I_2).$ Here, we’ve used $I_1$ and $I_2$ to represent what appears to be two separate infectious compartments in Thomas’s model. We’ll speculate about why there are two in a moment.

The model appears to be an SEIR model, so rather than becoming infectious directly you next move to an ‘exposed,’ where you have the disease, but you are not yet infectious. There are again two exposed states, we’ll return to that in a moment. We denote the first, x[1] by $E_1$. We have $\frac{\text{d}{E_1}}{\text{d}t} = \beta S (I_1 + I_2) - \sigma E_1.$ Note that the first term matches the term from the Susceptible equation. This is because it is the incoming exposed population.

The exposed population move to a second compartment of exposure, $E_2$. I believe the reason for this is that if you use only one exposure compartment, then the statistics of the duration of exposure are incorrect (implicitly they are exponentially distributed in the underlying stochastic version of the model). By using two exposure departments, Thomas is making a slight correction to this which would impose a first order gamma distribution on those statistics. A similar trick is being deployed for the ‘infectious group.’ So we gain an additional equation to help with these statistics, $\frac{\text{d}{E_2}}{\text{d}t} = \sigma E_1 - \sigma E_2.$ giving us the exposed group as the sum of the two compartments $E_1$ and $E_2$. The exposed group from the second compartment then become ‘infected,’ which we represent with $I_1$, in the code this is x[3], $\frac{\text{d}{I_1}}{\text{d}t} = \sigma E_2 - \gamma I_1,$ and similarly, Thomas is using a two-compartment infectious group to fix up the duration model. So we have, $\frac{\text{d}{I_2}}{\text{d}t} = \gamma I_1 - \gamma I_2.$ And finally, we have those that have recovered emerging from the second infections compartment. In this model there is no separate model for ‘deaths,’ so the recovered compartment, $R$, would also include those that die, $\frac{\text{d}R}{\text{d}t} = \gamma I_2.$ All of these equations are then represented in code as follows.

def odefun(t,x,beta0,betat,t0,t1,sigma,gamma):
dx = np.zeros(6)
if ((t>=t0) and (t<=t1)):
beta = betat
else:
beta = beta0
dx[0] = -beta*x[0]*(x[3] + x[4])
dx[1] = beta*x[0]*(x[3] + x[4]) - sigma*x[1]
dx[2] = sigma*x[1] - sigma*x[2]
dx[3] = sigma*x[2] - gamma*x[3]
dx[4] = gamma*x[3] - gamma*x[4]
dx[5] = gamma*x[4]
return dx

Where the code takes in the states of the compartments (the values of x) and returns the gradients of those states for the provided parameters (sigma, gamma and beta). Those parameters are set according to the known characteristics of the disease.

The next block of code sets up the parameters of the SEIR model. A particularly important parameter is the reproduction number ($R_0$), here Thomas has assumed a reproduction number of 2.5, implying that each infected member of the population transmits the infection up to 2.5 other people. The effective $R$ decreases over time though, because some of those people they meet will no longer be in the susceptible group.

# Parameters of the model
N = 6.7e7 # Total population
i0 = 1e-4 # 0.5*Proportion of the population infected on day 0
tlast = 365.0 # Consider a year
latent_period = 5.0 # Days between being infected and becoming infectious
infectious_period = 7.0 # Days infectious
R0 = 2.5 # Basic reproduction number in the absence of interventions
Rt = 0.75 # Reproduction number in the presence of interventions
tend = 21.0 # Number of days of interventions

The parameters are correct for the ‘discrete system,’ where the infectious period is a discrete time, and the numbers are discrete values. To translate into our continuous differential equation system’s parameters, we need to do a couple of manipulations. Note the factor of 2 associated with gamma and sigma. This is a doubling of the rate to account for the fact that there are two compartments for each of these states (to fix-up the statistics of the duration models).

beta0 = R0 / infectious_period
betat = Rt / infectious_period
sigma = 2.0 / latent_period
gamma = 2.0 / infectious_period

Next, we solve the system using scipy’s initial value problem solver. The solution method is Runge-Kutta-Fehlberg method, as indicated by the 'RK45' solver. This is a numerical method for solving differential equations. The 45 is the order of the method and the error estimator.

We can view the solver itself as somehow a piece of simulation code, but here it’s being called as sub routine in the system. It returns a solution for each time step, stored in a list sol.

This is typical of this type of non-linear differential equation problem. Whether it’s partial differential equations, ordinary differential equations, there’s a step where a numerical solver needs to be called. These are often expensive to run. For climate and weather models, this would be where we solved the Navier-Stokes equations. For this simple model, the solution is relatively quick.


t0ran = np.array([-100, 40, 52.5, 65])
sol=[]
for tt in range(0,len(t0ran)):
sol.append(integrate.solve_ivp(lambda t,x: odefun(t,x,beta0,betat,t0ran[tt],t0ran[tt]+tend,sigma,gamma),
(0.0,tlast),
np.array([1.0-2.0*i0, 0.0, 0.0, i0, i0, 0.0]),
'RK45',
atol=1e-8,
rtol=1e-9))

In practice, immunity for Covid19 may only last around 6 months. As an exercise, try to extend Thomas’s model for the case where immunity is temporary. You’ll need to account for deaths as well in your new model.

Thinking about our Formula 1 example, and the differing levels of fidelity that might be included in a model, you can now imagine the challenges of doing large scale theoretical epidemiology. The compartment model is operating at a particular level of fidelity. Imagine trying to modify this model for a specific circumstance, like the way that the University of Cambridge chooses to do lectures. It’s not appropriate for this level of fidelity. You need to use different types of models for that decision making. Later, we’ll look at a simulation that was used to advise the government on the Test Trace Isolate program that took a different approach (The DELVE Initiative, 2020a).

# Strategies for Simulation

Within any simulation, we can roughly split the variables of interest into the state variables and the parameters. In the Herd immunity example, the state variables were the different susceptible, exposed, infectious and recovered groups. The parameters were the reproduction number and the expected lengths of infection and the timing of lockdown. Often parameters are viewed as the inputs to the simulation, the things we can control. We might want to know how to time lock down to minimize the number of deaths. This behavior of the simulator is what we may want to emulate with our Gaussian process model.

So far, we’ve introduced simulation motivated by the physical laws of the universe. Those laws are sometimes encoded in differential equations, in which case we can try to solve those systems (like with Herd Immunity or Navier Stokes). An alternative approach is taken in the Game of Life. There a turn-based simulation is used, at each turn, we iterate through the simulation updating the state of the simulation. This is known as a discrete event simulation. In race simulation for Formula 1 a discrete event simulation is also used. There is another form of discrete event simulation, often used in chemical models, where the events don’t take place at regular intervals. Instead, the timing to the next event is computed, and the simulator advances that amount of time. For an example of this see the Gillespie algorithm.

There is a third type of simulation that we’d also like to introduce. That is simulation within computer software. In particular, the need to backtest software with ‘what if’ ideas, or to trace errors that may have occurred in production. This can involve loading up entire code bases and rerunning them with simulated inputs. This is a third form of simulation where emulation can also come in useful.

## Backtesting Production Code

In Amazon the team I led looked at examples of simulations and emulation as varied as Prime Air drones across to the Amazon Supply Chain. In a purchasing system, the idea is to store stock to balance supply and demand. The aim is to keep product in stock for quick dispatch while keeping prices (and therefore costs) low. This idea is at the heart of Amazon’s focus on customer experience.

## Supply Chain Optimization

Supply chain is the process of matching between the supply of the product and demand for the product. This matching process is done through the deployment of resources: places to store the products, ways and means of transporting the product and even the ability to transform products from one form to another (e.g. transforming paper into books through printing).

Arugably the Amazon supply chain is the largest automated decision making system in the world in terms of the amount of money spent and the quantity of product moved through automated decision making.

## Supply Chain Optimization

At the heart of an automated supply chain is the buying system, which determines the optimal stock level for products and makes purchases based on the difference between that stock level and the current stock level.

To make these decisions predictive models (often machine learning or statistical models) have to be moved. For example, the demand for a particular product needs to be pforecast.

## Forecasting

The process of forecasting is described by Jenny Freshwater (at the time Director for Forecasting within Amazon’s Supply Chain Optimization team in the video in Figure .

For each product in the Amazon catalogue, the demand is forecast across the a given future period.

Forecast information is combined with predictions around lead times from suppliers, understanding of the network’s capacity (in terms of how much space is available in which fulfillment centres), the cost of storing and transporting products and the “value” for the consumer in finding the product is in stock. These models are typically operational research models (such as the “newsvendor problem” combined with machine learning and/or statistical forecasts.

An example of a complex decision-making system might be an automated buying system. In such a system, the idea is to match demand for products to supply of products.

The matching of demand and supply is a repetitive theme for decision making systems. Not only does it occur in automated buying, but also in the allocation of drivers to riders in a ride sharing system. Or in the allocation of compute resource to users in a cloud system.

The components of any of these system include predictions of the demand for the product, the drivers, or the compute. Predictions of the supply. Decisions are then made for how much material to keep in stock, or how many drivers to have on the road, or how much computer capacity to have in your data centers. These decisions have cost implications. The optimal amount of product will depend on the cost of making it available. For a buying system this is the storage costs.

Decisions are made based on the supply and demand to make new orders, to encourage more drivers to come into the system or to build new data centers or rent more computational power.

## Monolithic System

The classical approach to building these systems was a ‘monolithic system.’ Built in a similar way to the successful applications software such as Excel or Word, or large operating systems, a single code base was constructed. The complexity of such code bases run to many lines.

In practice, shared dynamically linked libraries may be used for aspects such as user interface, or networking, but the software often has many millions of lines of code. For example, the Microsoft Office suite is said to contain over 30 million lines of code.

Such software is not only difficult to develop, but also to scale when computation demands increase. Amazon’s original website software (called Obidos) was a monolithic design but by the early noughties it was becoming difficult to sustain and maintain. The software was phased out in 2006 to be replaced by a modularized software known as a ‘service-oriented architecture.’

## Service Oriented Architecture

In Service Oriented Architecture, or “Software as a Service” the idea is that code bases are modularized and communicate with one another using network requests. A standard approach is to use a REST API. So, rather than a single monolithic code base, the code is developed with individual services that handle the different requests.

Modern software development uses an approach known as service-oriented architecture to build highly complex systems. Such systems have similar emergent properties to Conway’s “Game of Life.” Understanding these emergent properties is vitally important when diagnosing problems in the system.

In the context of machine learning and complex systems, Jonathan Zittrain has coined the term “Intellectual Debt” to describe the challenge of understanding what you’ve created.

This is the landscape we now find ourselves in with regard to software development. In practice, each of these services is often ‘owned’ and maintained by an individual team. The team is judged by the quality of their service provision. They work to detailed specifications on what their service should output, what its availability should be and other objectives like speed of response. This allows for conditional independence between teams and for faster development.

Clearly Conway’s Game of Life exhibits an enormous amount of intellectual debt, indeed that was the idea. Build something simple that exhibits great complexity. That’s what makes it so popular. But in deployed system software, intellectual debt is a major headache and emulation presents one way of dealing with it.

Unfortunately, it also makes sophisticated software systems a breeding ground for intellectual debt. Particularly when they contain components which are themselves ML components. Dealing with this challenge is a major objective of my Senior AI Fellowship at the Alan Turing Institute. You can see me talking about the problems at this recent seminar given virtually in Manchester.

## Simulation System

An example of a complex decision-making system might be a climate model, in such a system there are separate models for the atmosphere, the ocean and the land.

The components of these systems include flowing of currents, chemical interactions in the upper atmosphere, evaporation of water etc..

The influence of human activity also needs to be incorporated and modelled so we can make judgments about how to mitigate the effects of global warming.

## Monolithic System

The classical approach to building these systems was a ‘monolithic system.’ Built in a similar way to the successful applications software such as Excel or Word, or large operating systems, a single code base was constructed. The complexity of such code bases run to many lines.

In practice, shared dynamically linked libraries may be used for aspects such as user interface, or networking, but the software often has many millions of lines of code. For example, the Microsoft Office suite is said to contain over 30 million lines of code.

## Service Oriented Architecture

Such software is not only difficult to develop, but also to scale when computation demands increase. For example, Amazon’s original website software (called Obidos) was a monolithic design but by the early noughties it was becoming difficult to sustain and maintain. The software was phased out in 2006 to be replaced by a modularized software known as a ‘service-oriented architecture.’

In Service Oriented Architecture, or “Software as a Service” the idea is that code bases are modularized and communicate with one another using network requests. A standard approach is to use a REST API. So, rather than a single monolithic code base, the code is developed with individual services that handle the different requests.

The simulation software is turned inside out to expose the individual components to the operator.

This is the landscape we now find ourselves in for software development. In practice, each of these services is often ‘owned’ and maintained by an individual team. The team is judged by the quality of their service provision. They work to detailed specifications on what their service should output, what its availability should be and other objectives like speed of response. This allows for conditional independence between teams and for faster development.

One question is to what extent is the same approach possible/desirable for scientific models? The components we listed above are already separated and often run independently. But those components themselves are made up of other sub-components that could also be exposed in a similar manner to software-as-a-service, giving us the notion of “simulation as a service.”

One thing about working in an industrial environment, is the way that short-term thinking actions become important. For example, in Formula One, the teams are working on a two-week cycle to digest information from the previous week’s race and incorporate updates to the car or their strategy.

However, businesses must also think about more medium-term horizons. For example, in Formula 1 you need to worry about next year’s car. So while you’re working on updating this year’s car, you also need to think about what will happen for next year and prioritize these conflicting needs appropriately.

In the Amazon supply chain, there are the equivalent demands. If we accept that an artificial intelligence is just an automated decision-making system. And if we measure in terms of money automatically spent, or goods automatically moved, then Amazon’s buying system is perhaps the world’s largest AI.

Those decisions are being made on short time schedules; purchases are made by the system on weekly cycles. But just as in Formula 1, there is also a need to think about what needs to be done next month, next quarter and next year. Planning meetings are held not only on a weekly basis (known as weekly business reviews), but monthly, quarterly, and then yearly meetings for planning spends and investments.

Amazon is known for being longer term thinking than many companies, and a lot of this is coming from the CEO. One quote from Jeff Bezos that stuck with me was the following.

“I very frequently get the question: ‘What’s going to change in the next 10 years?’ And that is a very interesting question; it’s a very common one. I almost never get the question: ‘What’s not going to change in the next 10 years?’ And I submit to you that that second question is actually the more important of the two – because you can build a business strategy around the things that are stable in time. … [I]n our retail business, we know that customers want low prices, and I know that’s going to be true 10 years from now. They want fast delivery; they want vast selection. It’s impossible to imagine a future 10 years from now where a customer comes up and says, ‘Jeff I love Amazon; I just wish the prices were a little higher,’ [or] ‘I love Amazon; I just wish you’d deliver a little more slowly.’ Impossible. And so the effort we put into those things, spinning those things up, we know the energy we put into it today will still be paying off dividends for our customers 10 years from now. When you have something that you know is true, even over the long term, you can afford to put a lot of energy into it.”

This quote is incredibly important for long term thinking. Indeed, it’s a failure of many of our simulations that they focus on what is going to happen, not what will not happen. In Amazon, this meant that there was constant focus on these three areas, keeping costs low, making delivery fast and improving selection. For example, shortly before I left Amazon moved its entire US network from two-day delivery to one-day delivery. This involves changing the way the entire buying system operates. Or, more recently, the company has had to radically change the portfolio of products it buys in the face of Covid19.

From the perspective of the team that we had in the supply chain, we looked at what we most needed to focus on. Amazon moves very quickly, but we could also take a leaf out of Jeff’s book, and instead of worrying about what was going to change, remember what wasn’t going to change.

We don’t know what science we’ll want to do in five years’ time, but we won’t want slower experiments, we won’t want more expensive experiments and we won’t want a narrower selection of experiments.

As a result, our focus was on how to speed up the process of experiments, increase the diversity of experiments that we can do, and keep the experiments price as low as possible.

The faster the innovation flywheel can be iterated, then the quicker we can ask about different parts of the supply chain, and the better we can tailor systems to answering those questions.

We need faster, cheaper and more diverse experiments which implies we need better ecosystems for experimentation. This has led us to focus on the software frameworks we’re using to develop machine learning systems including data oriented architectures (Borchert (2020);(Lawrence-doa19?);Vorhemus and Schikuta (2017);Joshi (2007)), data maturity assessments (Lawrence et al. (2020)) and data readiness levels (See this blog post on Data Readiness Levels. and Lawrence (2017);The DELVE Initiative (2020b))

## Statistical Emulation

In many real-world systems, decisions are made through simulating the environment. Simulations may operate at different granularities. For example, simulations are used in weather forecasts and climate forecasts. Interestingly, the UK Met office uses the same code for both, it has a “Unified Model” approach, but they operate climate simulations at greater spatial and temporal resolutions.

A statistical emulator is a data-driven model that learns about the underlying simulation. Importantly, learns with uncertainty, so it ‘knows what it doesn’t know.’ In practice, we can call the emulator in place of the simulator. If the emulator ‘doesn’t know,’ it can call the simulator for the answer.

As well as reconstructing an individual simulator, the emulator can calibrate the simulation to the real world, by monitoring differences between the simulator and real data. This allows the emulator to characterize where the simulation can be relied on, i.e., we can validate the simulator.

Similarly, the emulator can adjudicate between simulations. This is known as multi-fidelity emulation. The emulator characterizes which emulations perform well where.

If all this modelling is done with judicious handling of the uncertainty, the computational doubt, then the emulator can assist in desciding what experiment should be run next to aid a decision: should we run a simulator, in which case which one, or should we attempt to acquire data from a real-world intervention.

%pip install gpy

## GPy: A Gaussian Process Framework in Python

Gaussian processes are a flexible tool for non-parametric analysis with uncertainty. The GPy software was started in Sheffield to provide a easy to use interface to GPs. One which allowed the user to focus on the modelling rather than the mathematics.

GPy is a BSD licensed software code base for implementing Gaussian process models in python. This allows GPs to be combined with a wide variety of software libraries.

The software itself is available on GitHub and the team welcomes contributions.

The aim for GPy is to be a probabilistic-style programming language, i.e., you specify the model rather than the algorithm. As well as a large range of covariance functions the software allows for non-Gaussian likelihoods, multivariate outputs, dimensionality reduction and approximations for larger data sets.

The documentation for GPy can be found here.

## GPy Tutorial

This GPy tutorial is based on material we share in the Gaussian process summer school for teaching these models https://gpss.cc. It contains material from various members and former members of the Sheffield machine learning group, but particular mention should be made of Nicolas Durrande and James Hensman, see http://gpss.cc/gpss17/labs/GPSS_Lab1_2017.ipynb.

%pip install mlai
import numpy as np
import GPy

To give a feel for the software we’ll start by creating an exponentiated quadratic covariance function, $k(\mathbf{ x}, \mathbf{ x}^\prime) = \alpha \exp\left(-\frac{\left\Vert \mathbf{ x}- \mathbf{ x}^\prime \right\Vert_2^2}{2\ell^2}\right),$ where the length scale is $\ell$ and the variance is $\alpha$.

To set this up in GPy we create a kernel in the following manner.

input_dim=1
alpha = 1.0
lengthscale = 2.0
kern = GPy.kern.RBF(input_dim=input_dim, variance=alpha, lengthscale=lengthscale)

That builds a kernel object for us. The kernel can be displayed.

display(kern)

Or because it’s one dimensional, you can also plot the kernel as a function of its inputs (while the other is fixed).

You can set the length scale of the covariance to different values and plot the result.

kern = GPy.kern.RBF(input_dim=input_dim)     # By default, the parameters are set to 1.
lengthscales = np.asarray([0.2,0.5,1.,2.,4.])

## Covariance Functions in GPy

Many covariance functions are already implemented in GPy. Instead of rbf, try constructing and plotting the following covariance functions: exponential, Matern32, Matern52, Brownian, linear, bias, rbfcos, periodic_Matern32, etc. Some of these covariance functions, such as rbfcos, are not parametrized by a variance and a length scale. Further, not all kernels are stationary (i.e., they can’t all be written as $k(\mathbf{ x}, \mathbf{ x}^\prime) = f(\mathbf{ x}-\mathbf{ x}^\prime)$, see for example the Brownian covariance function). So for plotting it may be interesting to change the value of the fixed input.

## Combining Covariance Functions in GPy

In GPy you can easily combine covariance functions you have created using the sum and product operators, + and *. So, for example, if we wish to combine an exponentiated quadratic covariance with a Matern 5/2 then we can write

kern1 = GPy.kern.RBF(1, variance=1., lengthscale=2.)
kern2 = GPy.kern.Matern52(1, variance=2., lengthscale=4.)
kern = kern1 + kern2
display(kern)

Or if we wanted to multiply them, we can write

kern1 = GPy.kern.RBF(1, variance=1., lengthscale=2.)
kern2 = GPy.kern.Matern52(1, variance=2., lengthscale=4.)
kern = kern1 * kern2
display(kern)

You can learn about how to implement new kernel objects in GPy here.

## A Gaussian Process Regression Model

We will now combine the Gaussian process prior with some data to form a GP regression model with GPy. We will generate data from the function $f( x) = − \cos(\pi x) + \sin(4\pi x)$ over the domain $[0, 1]$, adding some noise to gives $y(x) = f(x) + \epsilon,$ with the noise being Gaussian distributed, $\epsilon\sim \mathcal{N}\left(0,0.01\right)$.

X = np.linspace(0.05,0.95,10)[:,np.newaxis]
Y = -np.cos(np.pi*X) + np.sin(4*np.pi*X) + np.random.normal(loc=0.0, scale=0.1, size=(10,1))

A GP regression model based on an exponentiated quadratic covariance function can be defined by first defining a covariance function.

kern = GPy.kern.RBF(input_dim=1, variance=1., lengthscale=1.)

And then combining it with the data to form a Gaussian process model.

model = GPy.models.GPRegression(X,Y,kern)

Just as for the covariance function object, we can find out about the model using the command display(model).

display(model)

Note that by default the model includes some observation noise with variance 1. We can see the posterior mean prediction and visualize the marginal posterior variances using model.plot().

You can also look directly at the predictions for the model using.

Xstar = np.linspace(0, 10, 100)[:, np.newaxis]
Ystar, Vstar = model.predict(Xstar)

Which gives you the mean (Ystar), the variance (Vstar) at the locations given by Xstar.

## Covariance Function Parameter Estimation

As we have seen during the lectures, the parameters values can be estimated by maximizing the likelihood of the observations. Since we don’t want any of the variances to become negative during the optimization, we can constrain all parameters to be positive before running the optimization.

model.constrain_positive()

The warnings are because the parameters are already constrained by default, the software is warning us that they are being reconstrained.

Now we can optimize the model using the model.optimize() method. Here we switch messages on, which allows us to see the progression of the optimization.

model.optimize(messages=True)

By default, the optimization is using a limited memory BFGS optimizer (Byrd et al., 1995).

Once again, we can display the model, now to see how the parameters have changed.

display(model)

The length scale is much smaller, as well as the noise level. The variance of the exponentiated quadratic has also reduced.

## GPy and Emulation

Let $\mathbf{ x}$ be a random variable defined over the real numbers, $\Re$, and $f(\cdot)$ be a function mapping between the real numbers $\Re \rightarrow \Re$.

The problem of uncertainty propagation is the study of the distribution of the random variable $f(\mathbf{ x})$.

We’re going to address this problem using emulation and GPy. We will see in this section the advantage of using a model when only a few observations of $f$ are available.

Firstly, we’ll make use of a test function known as the Branin test function. $f(\mathbf{ x}) = a(x_2 - bx_1^2 + cx_1 - r)^2 + s(1-t \cos(x_1)) + s$ where we are setting $a=1$, $b=5.1/(4\pi^2)$, $c=5/\pi$, $r=6$, $s=10$ and $t=1/(8\pi)$.

import numpy as np
def branin(X):
y = ((X[:,1]-5.1/(4*np.pi**2)*X[:,0]**2+5*X[:,0]/np.pi-6)**2
+ 10*(1-1/(8*np.pi))*np.cos(X[:,0])+10)
return(y)

We’ll define a grid of twenty-five observations over [−5, 10] × [0, 15] and a set of 25 observations.

# Training set defined as a 5*5 grid:
xg1 = np.linspace(-5,10,5)
xg2 = np.linspace(0,15,5)
X = np.zeros((xg1.size * xg2.size,2))
for i,x1 in enumerate(xg1):
for j,x2 in enumerate(xg2):
X[i+xg1.size*j,:] = [x1,x2]

Y = branin(X)[:,np.newaxis]

The task here will be to consider the distribution of $f(U)$, where $U$ is a random variable with uniform distribution over the input space of $f$. We focus on the computaiton of two quantities, the expectation of $f(U)$, $\left\langle f(U)\right\rangle$, and the probability that the value is greater than 200.

## Computation of $\left\langle f(U)\right\rangle$

The expectation of $f(U )$ is given by $\int_\mathbf{ x}f( \mathbf{ x})\text{d}\mathbf{ x}$. A basic approach to approximate this integral is to compute the mean of the 25 observations: np.mean(Y). Since the points are distributed on a grid, this can be seen as the approximation of the integral by a rough Riemann sum.

print('Estimate of the expectation is given by: {mean}'.format(mean=Y.mean()))

The result can be compared with the actual mean of the Branin function which is 54.31.

Alternatively, we can fit a GP model and compute the integral of the best predictor by Monte Carlo sampling.

Firstly, we create the covariance function. Here we’re going to use an exponentiated quadratic, but we’ll augment it with the ‘bias’ covariance function. This covariance function represents a single fixed bias that is added to the overall covariance. It allows us to deal with non-zero-mean emulations.

import GPy
# Create an exponentiated quadratic plus bias covariance function
kern_eq = GPy.kern.RBF(input_dim=2, ARD = True)
kern_bias = GPy.kern.Bias(input_dim=2)
kern = kern_eq + kern_bias

Now we construct the Gaussian process regression model in GPy.

# Build a GP model
model = GPy.models.GPRegression(X,Y,kern)

In the sinusoid example above, we learnt the variance of the process. But in this example, we are fitting an emulator to a function we know is noise-free. However, we don’t fix the noise value to precisely zero, as this can lead to some numerical errors. Instead, we fix the variance of the Gaussian noise to a very small value.

# fix the noise variance
model.likelihood.variance.fix(1e-5)

Now we fit the model. Note, that the initial values for the length scale are not appropriate. So first set the length scale of the model needs to be reset.

kern.rbf.lengthscale = np.asarray([3, 3])

It’s a common error in Gaussian process fitting to initialize the length scale too small or too big. The challenge is that the error surface is normally multimodal, and the final solution can be very sensitive to this initialization. If the length scale is initialized too small, the solution can converge on an place where the signal isn’t extracted by the covariance function. If the length scale is initialized too large, then the variations of the function are often missing. Here the length scale is set for each dimension of inputs as 3. Now that’s done, we can optimize the model.

# Randomize the model and optimize
model.optimize(messages=True)

Finally, we can compute the mean of the model predictions using very many Monte Carlo samples.

Note, that in this example, because we’re using a test function, we could simply have done the Monte Carlo estimation directly on the Branin function. However, imagine instead that we were trying to understand the results of a complex computational fluid dynamics simulation, where each run of the simulation (which is equivalent to our test function) took many hours. In that case the advantage of the emulator is clear.

# Compute the mean of model prediction on 1e5 Monte Carlo samples
Xp = np.random.uniform(size=(int(1e5),2))
Xp[:,0] = Xp[:,0]*15-5
Xp[:,1] = Xp[:,1]*15
mu, var = model.predict(Xp)
print('The estimate of the mean of the Branin function is {mean}'.format(mean=np.mean(mu)))

### Exercise 2

Now think about how to make use of the variance estimation from the Gaussian process to obtain error bars around your estimate.

### Exercise 3

You’ve seen how the Monte Carlo estimates work with the Gaussian process. Now make your estimate of the probability that the Branin function is greater than 200 with the uniform random inputs.

## Uncertainty Quantification and Design of Experiments

We’re introducing you to the optimization and analysis of real-world models through emulation, this domain is part of a broader field known as surrogate modelling.

Although we’re approaching this from the machine learning perspective, with a computer-scientist’s approach, you won’t be surprised to find out that this field is not new and there are a range of research groups interested in this domain.

We’ve also been mainly focussing on active experimental design. In particular the case where we are sequentially selecting points to run our simulation based on previous results.

Here, we pause for a moment and cover approaches to passive experimental design. Almost all the emulation examples we’ve looked at so far need some initial points to ‘seed’ the emulator. Selecting these is also a task of experimental design, but one we perform without running our simulator.

This type of challenge, of where to run the simulation to get the answer you require is an old challenge. One classic paper, McKay et al. (1979), reviews three different methods for designing these inputs. They are random sampling, stratified sampling and Latin hypercube sampling.

Let the input values $\mathbf{ x}_1, \dots, \mathbf{ x}_n$ be a random sample from $f(\mathbf{ x})$. This method of sampling is perhaps the most obvious, and an entire body of statistical literature may be used in making inferences regarding the distribution of $Y(t)$.

Using stratified sampling, all areas of the sample space of $\mathbf{ x}$ are represented by input values. Let the sample space $S$ of $\mathbf{ x}$ be partitioned into $I$ disjoint strata $S_t$. Let $\pi = P(\mathbf{ x}\in S_i)$ represent the size of $S_i$. Obtain a random sample $\mathbf{ x}_{ij}$, $j = 1, \dots, n$ from $S_i$. Then of course the $n_i$ sum to $n$. If $I = 1$, we have random sampling over the entire sample space.

The same reasoning that led to stratified sampling, ensuring that all portions of $S$ were sampled, could lead further. If we wish to ensure also that each of the input variables $\mathbf{ x}_k$ has all portions of its distribution represented by input values, we can divide the range of each $\mathbf{ x}_k$ into $n$ strata of equal marginal probability $1/n$, and sample once from each stratum. Let this sample be $\mathbf{ x}_{kj}$, $j = 1, \dots, n$. These form the $\mathbf{ x}_k$ component, $k = 1, \dots , K$, in $\mathbf{ x}_i$, $i = 1, \dots, n$. The components of the various $\mathbf{ x}_k$’s are matched at random. This method of selecting input values is an extension of quota sampling (Steinberg 1963), and can be viewed as a $K$-dimensional extension of Latin square sampling (Raj 1968).

The paper’s rather dated reference to “Output from a Computer Code” does carry forward through this literature, which has continued to be a focus of interest for statisticians. Tony O’Hagan, who was a colleague in Sheffield but is also one of the pioneers of Gaussian process models was developing these methods when I first arrived there (Kennedy and O’Hagan, 2001), and continued with a large EPSRC funded project for managing uncertainty in computational models, http://www.mucm.ac.uk/. You can see a list of their technical reports here.

Another important group based in France is the “MASCOT-NUM Research Group,” https://www.gdr-mascotnum.fr/. These researchers bring together statisticians, applied mathematicians and engineers in solving these problems.

## Emukit Playground

Emukit playground is a software toolkit for exploring the use of statistical emulation as a tool. It was built by Leah Hirst, during her software engineering internship at Amazon and supervised by Cliff McCollum.

You can explore Bayesian optimization of a taxi simulation.

### Exercise 4

You now know enough to build a simple emulation. To test your knowledge have a go at cobmining GPy with Thomas House’s herd immunity simulation. Can you build a Gaussian process emulator of the simulation? Don’t spent do long on this exercise. The idea is just to consolidate things like what the inputs and outputs should be.

## Conclusions

We summarized the different types of simmulation into roughly three groups. Firstly, those based on physical laws in the form of differential equations. Examples include certain compartmental epidemiological models, climate models and weather models. Secondly, discrete event simulations. These simulations often run to a ‘clock,’ where updates to the state are taken in turns. The Game of Life is an example of this type of simulation, and Formula 1 models of race strategy also use this approach. There is another type of discrete event simulation that doesn’t use a turn-based approach but waits for the next event. The Gillespie algorithm is an example of such an approach but we didn’t cover it here. Finally, we realised that general computer code bases are also simulations. If a company has a large body of code, and particularly if it’s hosted within a streaming environment (such as Apache Kafka), it’s possible to back test the code with different inputs. Such backtests can be viewed as simulations, and in the case of large bodies of code (such as the code that manages Amazon’s automated buying systems) the back tests can be slow and could also benefit from emulation.

We’ve introduced emulation as a way of dealing with different fidelities of simulations and removing the computational demands that come with them. We’ve highlighted how emulation can be deployed and introduced the GPy software for Gaussian process modelling.

## Thanks!

For more information on these subjects and more you might want to check the following resources.

# References

Borchert, T., 2020. Milan: An evolution of data-oriented programming.
Byrd, R.H., Lu, P., Nocedal, J., 1995. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific and Statistical Computing 16, 1190–1208.
Joshi, R., 2007. A loosely-coupled real-time SOA. Real-Time Innovations Inc.
Kennedy, M.C., O’Hagan, A., 2001. Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63, 425–464. https://doi.org/10.1111/1467-9868.00294
Lawrence, N.D., 2017. Data readiness levels. ArXiv.
Lawrence, N.D., Montgomery, J., Paquet, U., 2020. Organisational data maturity. The Royal Society.
McKay, M.D., Beckman, R.J., Conover, W.J., 1979. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21, 239–245.
The DELVE Initiative, 2020b. Data readiness: Lessons from an emergency. The Royal Society.
The DELVE Initiative, 2020a. Test, trace, isolate. The Royal Society.
Vorhemus, C., Schikuta, E., 2017. A data-oriented architecture for loosely coupled real-time information systems, in: Proceedings of the 19th International Conference on Information Integration and Web-Based Applications & Services, iiWAS ’17. Association for Computing Machinery, New York, NY, USA, pp. 472–481. https://doi.org/10.1145/3151759.3151770