Week 3: Emulation


Neil D. Lawrence

Abstract:

In this lecture we motivate the use of emulation and introduce the GPy software as a framework for building Gaussian process emulators.

notutils


This small package is a helper package for various notebook utilities used below.

The software can be installed using

%pip install notutils

from the command prompt where you can access your python installation.

The code is also available on GitHub: https://github.com/lawrennd/notutils

Once notutils is installed, it can be imported in the usual manner.

import notutils

mlai


The mlai software is a suite of helper functions for teaching and demonstrating machine learning algorithms. It was first used in the Machine Learning and Adaptive Intelligence course in Sheffield in 2013.

The software can be installed using

%pip install mlai

from the command prompt where you can access your python installation.

The code is also available on GitHub: https://github.com/lawrennd/mlai

Once mlai is installed, it can be imported in the usual manner.

import mlai

Emulation

There are a number of ways we can use machine learning to accelerate scientific discovery. But one way is to have the machine learning model learn the effect of the rules. Rather than worrying about the detail of the rules through computing each step, we can have the machine learning model look to abstract the rules and capture emergent phenomena, just as the Maxwell-Boltzmann distribution captures the essence of the behavior of the ideal gas.

The challenges of Laplace’s gremlin present us with issues that we solve in a particular way; this is the focus of Chapter 6 in The Atomic Human.

The Atomic Human


Figure: The Atomic Human (Lawrence, 2024) due for release in June 2024.

What follows is a quote from Chapter 6, which introduces Laplace’s gremlin and its consequences.

In Douglas Adams’s Hitchhiker’s Guide to the Galaxy the computer Deep Thought is asked to provide the answer to the ‘great question’ of ‘life, the universe and everything’. After seven and a half million years of computation, Deep Thought has completed its program but is reluctant to give its creators the answer.

‘You’re really not going to like it,’ observed Deep Thought.
‘Tell us!’
‘All right,’ said Deep Thought. ‘The Answer to the Great Question . . .’
‘Yes . . . !’
‘Of Life, the Universe and Everything …’ said Deep Thought.
‘Yes … !’
‘Is …’ said Deep Thought, and paused.
‘Yes … !’
‘Is …’
‘Yes … !!! … ?’
‘Forty-two,’ said Deep Thought, with infinite majesty and calm.

Douglas Adams The Hitchhiker’s Guide to the Galaxy, 1979, Chapter 27

After a period of shock from the questioners, the machine goes on to explain.

‘I checked it very thoroughly,’ said the computer, ‘and that quite definitely is the answer. I think the problem, to be quite honest with you, is that you’ve never actually known what the question is.’

Douglas Adams The Hitchhiker’s Guide to the Galaxy, 1979, Chapter 28

To understand the question, Deep Thought goes on to agree to design a computer which will work out what the question is. In the book that machine is the planet Earth, and its operators are mice. Deep Thought’s idea is that the mice will observe the Earth and their observations will allow them to know what the Great Question is.

To understand the consequences of Hawking’s Theory of Everything, we would have to carry out a scheme similar to Deep Thought’s. The Theory wouldn’t directly tell us that hurricanes exist or that when the sun sets on Earth the sky will have a red hue. It wouldn’t directly tell us that water will boil at 100 degrees centigrade. These consequences of the Theory would only play out once it was combined with the data to give us the emergent qualities of the Universe. The Deep Thought problem hints at the intractability of doing this. The computation required to make predictions from Laplace’s demon can be enormous, Deep Thought intends to create a planet to run it. As a result, this isn’t how our intelligence works in practice. The computations required are just too gargantuan. Relative to the scale of the Universe our brains are extremely limited. Fortunately though, to make these predictions, we don’t have to build our own Universe, because we’ve already got one.

Figure: Rabbit and Pooh watch the result of Pooh’s hooshing idea to move Eeyore towards the shore.

When you are a Bear of Very Little Brain, and you Think of Things, you find sometimes that a Thing which seemed very Thingish inside you is quite different when it gets out into the open and has other people looking at it.

A.A. Milne as Winnie-the-Pooh in The House at Pooh Corner, 1928

This comment from Pooh bear comes just as he’s tried to rescue his donkey friend, Eeyore, from a river by dropping a large stone on him from a bridge. Pooh’s idea had been to create a wave to push the donkey to the shore, a process that Pooh’s rabbit friend calls “hooshing”.

Hooshing is a technique many children will have tried to retrieve a ball from a river. It can work, so Pooh’s idea wasn’t a bad one, but the challenge he faced was in its execution. Pooh aimed to the side of Eeyore, unfortunately the stone fell directly on the stuffed donkey. But where is Laplace’s demon in hooshing? Just as we can talk about Gliders and Loafers in Conway’s Game of Life, we talk about stones and donkeys in our Universe. Pooh’s prediction that he can hoosh the donkey with the stone is not based on the Theory, it comes from observing the way objects interact in the actual Universe. Pooh is like the mice in Douglas Adams’s Earth. He is observing his environment. He looks for patterns in that environment. Pooh then borrows the computation that the Universe has already done for us. He has seen similar situations before, perhaps he once used a stone to hoosh a ball. He is then generalising from these previous circumstances to suggest that he can also hoosh the donkey. Despite being a bear of little brain, like the mice on Adams’s Earth, Pooh can answer questions about his universe by observing the results of the Theory of Everything playing out around him.

Surrogate Modelling in Practice

The knowledge of one’s own limitations that Pooh shows is sometimes known as Socratic wisdom, and it’s a vital part of our intelligence. It expresses itself as humility and skepticism.

In the papers we reviewed last lecture, neural networks are being used to speed up computations. In this course we’ve introduced Gaussian processes that will be used to speed up these computations. In both cases the ideas are similar. Rather than rerunning the simulation, we use data from the simulation to fit the neural network or the Gaussian process to the data.

We’ll see an example of how this is done in a moment, taken from a simple ride hailing simulator, but before we look at that, we’ll first consider why this might be a useful approach.

As we’ve seen from the very simple rules in the Game of Life, emergent phenomena we might be interested in take computational power to discover, just as Laplace’s and Dirac’s quotes suggest. The objective in surrogate modelling is to harness machine learning models to learn those physical characteristics.

Types of Simulations

We’ve introduced simulations from the perspective of laws of physics. In practice, many simulations may not directly encode for the laws of physics, but they might encode expert intuitions about a problem (like Pooh’s intuition about hooshing).

For example, in Formula 1 races, the cars have tyres that wear at different rates. Softer tyres allow the cars to drive faster but wear more quickly. Harder tyres mean the car drives more slowly, but they last longer. Changing between tyres is part of the race, and it has a time penalty. Before each race the teams decide what their strategy will be for tyre changes. It’s not only the number of tyre changes that is important, but when they happen. If you change your tyres early, you might get a speed advantage and be able to pass your rival when they change their tyres later. This is a trick known as ‘undercutting’, but if your early change puts you back onto the track behind other slower cars, you will lose this advantage.

Formula 1 teams determine their strategy through simulating the race. Each team knows how fast other teams are around the track, and what their top speeds are. So, the teams simulate many thousands or millions of races with different strategies for their rivals, and they choose the strategy for which they maximize their number of expected points.

When many simulations are run, the results take time to arrive. During the actual race, the simulations are too slow to provide the real-time information teams need. In this case F1 teams can use emulators, models that have learnt the effect of the simulations, to give real-time updates.

Formula 1 race simulations contain assumptions that derive from physics but don’t directly encode the physical laws. For example, if one car is stuck behind another, in any given lap, it might overtake. A typical race simulation will look at the lap speed of each car and the top speed of each car (as measured in ‘speed traps’ that are placed on the straight). It will assume a probability of overtake for each lap that is a function of these values. Of course, underlying that function is the physics of how cars overtake each other, but that can be abstracted away into a simpler function that the Race Strategy Engineer defines from their knowledge and previous experience.
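To make this concrete, here is a hypothetical sketch of what such an overtake function might look like. The logistic form and the coefficients are invented for illustration; a real Race Strategy Engineer would calibrate their own function from historical data.

import numpy as np

def overtake_probability(lap_speed_ahead, top_speed_ahead,
                         lap_speed_behind, top_speed_behind,
                         k_lap=0.5, k_top=0.2):
    # Hypothetical per-lap overtake probability: a logistic function of the
    # lap-speed and top-speed differences. A faster lap pace for the chasing
    # car raises the probability; a higher top speed for the car ahead
    # (hard to pass on the straight) lowers it. k_lap and k_top are made up.
    advantage = (k_lap * (lap_speed_behind - lap_speed_ahead)
                 - k_top * (top_speed_ahead - top_speed_behind))
    return 1.0 / (1.0 + np.exp(-advantage))

# Example: the chasing car laps 1.5 km/h faster, but the car ahead is
# 10 km/h quicker through the speed trap.
p = overtake_probability(lap_speed_ahead=210.0, top_speed_ahead=330.0,
                         lap_speed_behind=211.5, top_speed_behind=320.0)
print('Per-lap overtake probability: {p:.2f}'.format(p=p))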

Many simulations have this characteristic: major parts of the simulation are the result of encoding expert knowledge in the code. But this can lead to challenges. I once asked a strategy engineer, who had completed a new simulation, how it was going. He replied that things had started out well, but over time its performance was degrading. We discussed this for a while and over time a challenge of mis-specified granularity emerged.

Fidelity of the Simulation

The engineer explained how there’d been a race where the simulation had suggested that their main driver shouldn’t pit because he would have emerged behind a car with a slower lap speed, but a high top-speed. This would have made that car difficult to overtake. However, the driver of that slower car was also in the team’s ‘development program’, so everyone in the team knew that the slower car would have moved aside to let their driver through. Unfortunately, the simulation didn’t know this. So, the team felt the wrong strategy decision was made. After the race, the simulation was updated to include a special case for this situation. The new code checked whether the slower car was a development driver, making it ‘more realistic’.

Over time there were a number of similar changes, each of which should have improved the simulation, but the reality was the code was now ‘mixing granularities’. The formula for computing the probability of overtake as a function of speeds is one that is relatively easy to verify. It ignores the relationships between drivers, whether a given driver is a development driver, whether one bears a grudge or not, whether one is fighting for their place in the team. That’s all assimilated into the equation. The original equation is easy to calibrate, but as soon as you move to a finer granularity and consider more details about individual drivers, the model seems more realistic, but it becomes difficult to specify, and therefore performance degrades.

Simulations work at different fidelities, but as the Formula 1 example shows, you must be very careful about mixing fidelities within the same simulation. The appropriate fidelity of a simulation is strongly dependent on the question being asked of it, on the context. For example, in Formula 1 races you can also simulate the performance of the car in the wind tunnel and using computational fluid dynamics representations of the Navier-Stokes equations. That level of fidelity is appropriate when designing the aerodynamic components of the car, but inappropriate when building a strategy simulation.

Epidemiology

The same concept of modelling at a particular fidelity comes up in epidemiology. Disease is transmitted by direct person to person interactions between individuals and objects. But in theoretical epidemiology, this is approximated by differential equations. The resulting models look very similar to reaction rate models used in Chemistry for well mixed beakers. Let’s have a look at a simple example used for modelling the policy of ‘herd immunity’ for Covid19.

Modelling Herd Immunity


This example is taken from Thomas House’s blog post on Herd Immunity. This model was shared at the beginning of the Covid19 pandemic when the first UK lockdown hadn’t yet occurred.

import numpy as np
from scipy import integrate

The next piece of code sets up the dynamics of the compartmental model. He doesn’t give the specific details in the blog post, but my understanding is that the states are as follows. x[0] is the susceptible population, those that haven’t had the disease yet. The susceptible population decreases by encounters with infectious people. In Thomas’s model, both x[3] and x[4] are infectious compartments. So, the dynamics of the reduction of the susceptibles is given by \[ \frac{\text{d}{S}}{\text{d}t} = - \beta S (I_1 + I_2). \] Here, we’ve used \(I_1\) and \(I_2\) to represent what appears to be two separate infectious compartments in Thomas’s model. We’ll speculate about why there are two in a moment.

The model appears to be an SEIR model, so rather than becoming infectious directly, you first move to an ‘exposed’ state, where you have the disease but are not yet infectious. There are again two exposed states; we’ll return to that in a moment. We denote the first, x[1], by \(E_1\). We have \[ \frac{\text{d}{E_1}}{\text{d}t} = \beta S (I_1 + I_2) - \sigma E_1. \] Note that the first term matches the term from the Susceptible equation. This is because it is the incoming exposed population.

The exposed population moves to a second compartment of exposure, \(E_2\). I believe the reason for this is that if you use only one exposure compartment, then the statistics of the duration of exposure are incorrect (implicitly they are exponentially distributed in the underlying stochastic version of the model). By using two exposure compartments, Thomas is making a slight correction to this, which imposes a gamma (Erlang) distribution on those statistics. A similar trick is being deployed for the ‘infectious group’. So, we gain an additional equation to help with these statistics, \[ \frac{\text{d}{E_2}}{\text{d}t} = \sigma E_1 - \sigma E_2, \] giving us the exposed group as the sum of the two compartments \(E_1\) and \(E_2\). The exposed group from the second compartment then becomes ‘infected’, which we represent with \(I_1\); in the code this is x[3], \[ \frac{\text{d}{I_1}}{\text{d}t} = \sigma E_2 - \gamma I_1, \] and similarly, Thomas is using a two-compartment infectious group to fix up the duration model. So we have, \[ \frac{\text{d}{I_2}}{\text{d}t} = \gamma I_1 - \gamma I_2. \] And finally, we have those that have recovered emerging from the second infectious compartment. In this model there is no separate model for ‘deaths’, so the recovered compartment, \(R\), would also include those that die, \[ \frac{\text{d}R}{\text{d}t} = \gamma I_2. \] All these equations are then represented in code as follows.

def odefun(t,x,beta0,betat,t0,t1,sigma,gamma):
    # States: x[0]=S, x[1]=E1, x[2]=E2, x[3]=I1, x[4]=I2, x[5]=R
    dx = np.zeros(6)
    # Use the intervention transmission rate betat between t0 and t1
    if ((t>=t0) and (t<=t1)):
        beta = betat
    else:
        beta = beta0
    dx[0] = -beta*x[0]*(x[3] + x[4])              # susceptible
    dx[1] = beta*x[0]*(x[3] + x[4]) - sigma*x[1]  # first exposed compartment
    dx[2] = sigma*x[1] - sigma*x[2]               # second exposed compartment
    dx[3] = sigma*x[2] - gamma*x[3]               # first infectious compartment
    dx[4] = gamma*x[3] - gamma*x[4]               # second infectious compartment
    dx[5] = gamma*x[4]                            # recovered
    return dx

The code takes in the states of the compartments (the values of x) and returns the gradients of those states for the provided parameters (sigma, gamma and beta). Those parameters are set according to the known characteristics of the disease.

The next block of code sets up the parameters of the SEIR model. A particularly important parameter is the reproduction number (\(R_0\)); here Thomas has assumed a reproduction number of 2.5, implying that each infected member of the population transmits the infection to 2.5 other people on average. The effective \(R\) decreases over time though, because some of those people they meet will no longer be in the susceptible group.

# Parameters of the model
N = 6.7e7 # Total population
i0 = 1e-4 # 0.5*Proportion of the population infected on day 0
tlast = 365.0 # Consider a year
latent_period = 5.0 # Days between being infected and becoming infectious
infectious_period = 7.0 # Days infectious
R0 = 2.5 # Basic reproduction number in the absence of interventions
Rt = 0.75 # Reproduction number in the presence of interventions
tend = 21.0 # Number of days of interventions

The parameters are correct for the ‘discrete system’, where the infectious period is a discrete time, and the numbers are discrete values. To translate into our continuous differential equation system’s parameters, we need to do a couple of manipulations. Note the factor of 2 associated with gamma and sigma. This is a doubling of the rate to account for the fact that there are two compartments for each of these states (to fix-up the statistics of the duration models).

beta0 = R0 / infectious_period
betat = Rt / infectious_period
sigma = 2.0 / latent_period
gamma = 2.0 / infectious_period

Next, we solve the system using scipy’s initial value problem solver. The solution method, indicated by ‘RK45’, is an explicit Runge-Kutta method of order 5(4) (scipy implements the Dormand-Prince pair). This is a numerical method for solving differential equations; the 4 and 5 refer to the orders of the embedded error estimator and of the method itself.

We can view the solver itself as a piece of simulation code in its own right, but here it’s being called as a subroutine in the system. It returns a solution for each time step, stored in a list sol.

This is typical of this type of non-linear differential equation problem. Whether it’s partial differential equations or ordinary differential equations, there’s a step where a numerical solver needs to be called. These solvers are often expensive to run. For climate and weather models, this would be where we solve the Navier-Stokes equations. For this simple model, the solution is relatively quick.


t0ran = np.array([-100, 40, 52.5, 65]) # intervention start days (-100 places the intervention before day 0, i.e. no intervention)
sol=[]
for tt in range(0,len(t0ran)):
    sol.append(integrate.solve_ivp(lambda t,x: odefun(t,x,beta0,betat,t0ran[tt],t0ran[tt]+tend,sigma,gamma),
                              (0.0,tlast),
                              np.array([1.0-2.0*i0, 0.0, 0.0, i0, i0, 0.0]), # initial state: S, E1, E2, I1, I2, R
                              'RK45',
                              atol=1e-8,
                              rtol=1e-9))
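To visualise the solutions (as in the figures that follow), a minimal matplotlib sketch might look like the following. The state ordering follows odefun, and the labels are my own.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
for tt, s in enumerate(sol):
    infectious = s.y[3, :] + s.y[4, :]  # x[3] and x[4] are the two infectious compartments
    if t0ran[tt] < 0:
        label = 'no intervention'
    else:
        label = 'intervention at day {t0}'.format(t0=t0ran[tt])
    ax.plot(s.t, N*infectious, label=label)  # scale fractions up to population counts
ax.set_xlabel('days')
ax.set_ylabel('number infectious')
ax.legend()
plt.show()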

Figure: A zoomed in version of Thomas House’s variation on the SEIR model for evaluating the effect of early interventions.

Figure: The full progress of the disease in Thomas House’s variation on the SEIR model for evaluating the effect of early interventions.

In practice, immunity for Covid19 may only last around 6 months. As an exercise, try to extend Thomas’s model for the case where immunity is temporary. You’ll need to account for deaths as well in your new model.

Thinking about our Formula 1 example, and the differing levels of fidelity that might be included in a model, you can now imagine the challenges of doing large scale theoretical epidemiology. The compartment model is operating at a particular level of fidelity. Imagine trying to modify this model for a specific circumstance, like the way that the University of Cambridge chooses to do lectures. It’s not appropriate for this level of fidelity. You need to use different types of models for that decision making. One of our case studies looks at a simulation that was used to advise the government on the Test Trace Isolate program that took a different approach (The DELVE Initiative, 2020a).

Strategies for Simulation

Within any simulation, we can roughly split the variables of interest into the state variables and the parameters. In the Herd immunity example, the state variables were the different susceptible, exposed, infectious and recovered groups. The parameters were the reproduction number and the expected lengths of infection and the timing of lockdown. Often parameters are viewed as the inputs to the simulation, the things we can control. We might want to know how to time lock down to minimize the number of deaths. This behavior of the simulator is what we may want to emulate with our Gaussian process model.
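To make the split concrete, a minimal sketch (reusing odefun and the parameters defined above; the function name and the choice of output are my own) wraps the simulator as a mapping from an input we control, the intervention start day, to a quantity of interest, the peak infectious fraction. It is this input-to-output mapping that the Gaussian process emulator is fitted to.

import numpy as np
from scipy import integrate

def peak_infectious(t0):
    # Run the SEIR simulation with an intervention starting at day t0 and
    # return the peak infectious fraction of the population. This is the
    # mapping a Gaussian process emulator would learn.
    res = integrate.solve_ivp(
        lambda t, x: odefun(t, x, beta0, betat, t0, t0+tend, sigma, gamma),
        (0.0, tlast),
        np.array([1.0-2.0*i0, 0.0, 0.0, i0, i0, 0.0]),
        'RK45', atol=1e-8, rtol=1e-9)
    return np.max(res.y[3, :] + res.y[4, :])

# Evaluate the 'simulator' at a few intervention timings.
for t0 in [40.0, 52.5, 65.0]:
    print(t0, peak_infectious(t0))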

So far, we’ve introduced simulation motivated by the physical laws of the universe. Those laws are sometimes encoded in differential equations, in which case we can try to solve those systems (as with herd immunity or Navier-Stokes). An alternative approach is taken in the Game of Life. There a turn-based simulation is used: at each turn, we iterate through the simulation, updating its state. This is known as a discrete event simulation. In race simulation for Formula 1 a discrete event simulation is also used. There is another form of discrete event simulation, often used in chemical models, where the events don’t take place at regular intervals. Instead, the time to the next event is computed, and the simulator advances by that amount of time. For an example of this, see the Gillespie algorithm.
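To give a flavour of the Gillespie approach, here is a minimal sketch of a stochastic SIR model where the time to the next event is sampled rather than fixed; it illustrates the algorithm rather than reproducing any particular package.

import numpy as np

def gillespie_sir(S, I, R, beta, gamma, t_max, seed=0):
    # Minimal Gillespie simulation of a stochastic SIR model. There are two
    # event types: infection (rate beta*S*I/N) and recovery (rate gamma*I).
    # The time to the next event is exponentially distributed with the total
    # rate, so events occur at irregular intervals.
    rng = np.random.default_rng(seed)
    N = S + I + R
    t, history = 0.0, [(0.0, S, I, R)]
    while t < t_max and I > 0:
        rate_infection = beta*S*I/N
        rate_recovery = gamma*I
        total_rate = rate_infection + rate_recovery
        t += rng.exponential(1.0/total_rate)  # advance to the next event
        if rng.uniform() < rate_infection/total_rate:
            S, I = S - 1, I + 1               # infection event
        else:
            I, R = I - 1, R + 1               # recovery event
        history.append((t, S, I, R))
    return history

trace = gillespie_sir(S=990, I=10, R=0, beta=0.3, gamma=0.1, t_max=100.0)
print(trace[-1])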

There is a third type of simulation that we’d also like to introduce. That is simulation within computer software. In particular, the need to backtest software with ‘what if’ ideas – known as counterfactual simulation – or to trace errors that may have occurred in production. This can involve loading up entire code bases and rerunning them with simulated inputs. This is a third form of simulation where emulation can also come in useful.

Backtesting Production Code

At Amazon the team I led looked at examples of simulation and emulation ranging from Prime Air drones to the Amazon supply chain. In a purchasing system, the idea is to store stock to balance supply and demand. The aim is to keep product in stock for quick dispatch while keeping prices (and therefore costs) low. This idea is at the heart of Amazon’s focus on customer experience.

Alexa

Tom Taylor Joe Walowski Rohit Prasad

Figure: Alexa Architecture overview taken from the Alexa Skills programme.

Rohit Prasad

Figure: Rohit Prasad talking about Alexa at the Amazon 2019 re:MARS event.

Tom Taylor Joe Walowski Rohit Prasad

Figure: Simple schematic of an intelligent agent’s components.

Speech to Text

Catherine Breslin

Cloud Service: Knowledge Base

David Hardcastle Arpit Mittal Christos Christodoulopoulos

Text to Speech

Andrew Breen Roberto Barra Chicote

Prime Air


One project where the components of machine learning and the physical world come together is Amazon’s Prime Air drone delivery system.

Automating the process of moving physical goods through autonomous vehicles completes the loop between the ‘bits’ and the ‘atoms’. In other words, the information and the ‘stuff’. The idea of the drone is to complete a component of package delivery, the notion of last mile movement of goods, but in a fully autonomous way.

Gur Kimchi Paul Viola David Moro

Figure: An actual ‘Santa’s sleigh’. Amazon’s prototype delivery drone. Machine learning algorithms are used across various systems including sensing (computer vision for detection of wires, people, dogs etc) and piloting. The technology is necessarily a combination of old and new ideas. The transition from vertical to horizontal flight is vital for efficiency and uses sophisticated machine learning to achieve.

As Jeff Wilke (who was CEO of Amazon Retail at the time) announced in June 2019 the technology is ready, but still needs operationalization including e.g. regulatory approval.

Figure: Jeff Wilke (CEO Amazon Consumer) announcing the new drone at the Amazon 2019 re:MARS event alongside the scale of the Amazon supply chain.

When we announced earlier this year that we were evolving our Prime two-day shipping offer in the U.S. to a one-day program, the response was terrific. But we know customers are always looking for something better, more convenient, and there may be times when one-day delivery may not be the right choice. Can we deliver packages to customers even faster? We think the answer is yes, and one way we’re pursuing that goal is by pioneering autonomous drone technology.

Today at Amazon’s re:MARS Conference (Machine Learning, Automation, Robotics and Space) in Las Vegas, we unveiled our latest Prime Air drone design. We’ve been hard at work building fully electric drones that can fly up to 15 miles and deliver packages under five pounds to customers in less than 30 minutes. And, with the help of our world-class fulfillment and delivery network, we expect to scale Prime Air both quickly and efficiently, delivering packages via drone to customers within months.

The 15 miles in less than 30 minutes implies air speed velocities of around 50 kilometers per hour.

Our newest drone design includes advances in efficiency, stability and, most importantly, in safety. It is also unique, and it advances the state of the art. How so? First, it’s a hybrid design. It can do vertical takeoffs and landings – like a helicopter. And it’s efficient and aerodynamic—like an airplane. It also easily transitions between these two modes—from vertical-mode to airplane mode, and back to vertical mode.

It’s fully shrouded for safety. The shrouds are also the wings, which makes it efficient in flight.

Figure: Picture of the drone from Amazon Re-MARS event in 2019.

Our drones need to be able to identify static and moving objects coming from any direction. We employ diverse sensors and advanced algorithms, such as multi-view stereo vision, to detect static objects like a chimney. To detect moving objects, like a paraglider or helicopter, we use proprietary computer-vision and machine learning algorithms.

A customer’s yard may have clotheslines, telephone wires, or electrical wires. Wire detection is one of the hardest challenges for low-altitude flights. Through the use of computer-vision techniques we’ve invented, our drones can recognize and avoid wires as they descend into, and ascend out of, a customer’s yard.

Supply Chain Optimization

Llew Mason Devesh Mishra

Supply chain is the process of matching between the supply of the product and demand for the product. This matching process is done through the deployment of resources: places to store the products, ways and means of transporting the product and even the ability to transform products from one form to another (e.g. transforming paper into books through printing).

Figure: Promotional video for the Amazon supply chain optimization team.

Arguably the Amazon supply chain is (or was) the largest automated decision-making system in the world in terms of the amount of money spent and the quantity of product moved through automated decision making.

Supply Chain Optimization

Llew Mason Devesh Mishra

At the heart of an automated supply chain is the buying system, which determines the optimal stock level for products and makes purchases based on the difference between that stock level and the current stock level.

Figure: A schematic of a typical buying system for supply chain.

To make these decisions, predictive models (often machine learning or statistical models) have to be deployed. For example, the demand for a particular product needs to be forecast.

Forecasting

Jenny Freshwater Ping Xu Dean Foster

Figure: Jenny Freshwater speaking at the Amazon re:MARS event in June 2019.

The process of forecasting is described by Jenny Freshwater (at the time Director for Forecasting within Amazon’s Supply Chain Optimization team) in the video in Figure \(\ref{jenny-freshwater-remars}\).

For each product in the Amazon catalogue, the demand is forecast across a given future period.

Inventory and Buying

Deepak Bhatia Piyush Saraogi Raman Iyer Salal Humair Narayan Venkatasubramanyan

Forecast information is combined with predictions around lead times from suppliers, understanding of the network’s capacity (in terms of how much space is available in which fulfillment centers), the cost of storing and transporting products and the “value” for the consumer of finding the product in stock. These models are typically operational research models (such as the “newsvendor problem”) combined with machine learning and/or statistical forecasts.

Buying System


An example of a complex decision-making system might be an automated buying system. In such a system, the idea is to match demand for products to supply of products.

The matching of demand and supply is a repetitive theme for decision making systems. Not only does it occur in automated buying, but also in the allocation of drivers to riders in a ride sharing system. Or in the allocation of compute resource to users in a cloud system.

The components of any of these systems include predictions of the demand for the product, the drivers, or the compute. Predictions of the supply. Decisions are then made for how much material to keep in stock, or how many drivers to have on the road, or how much computer capacity to have in your data centers. These decisions have cost implications. The optimal amount of product will depend on the cost of making it available. For a buying system this is the storage costs.

Decisions are made based on the supply and demand to make new orders, to encourage more drivers to come into the system or to build new data centers or rent more computational power.

Figure: The components of a putative automated buying system

Monolithic System


The classical approach to building these systems was a ‘monolithic system’. Built in a similar way to successful applications software such as Excel or Word, or large operating systems, a single code base was constructed. Such code bases are complex and run to many lines of code.

In practice, shared dynamically linked libraries may be used for aspects such as user interface, or networking, but the software often has many millions of lines of code. For example, the Microsoft Office suite is said to contain over 30 million lines of code.

Charlie Bell Peter Vosshall

Figure: A potential path of models in a machine learning system.

Such software is not only difficult to develop, but also to scale when computation demands increase. Amazon’s original website software (called Obidos) was a monolithic design but by the early noughties it was becoming difficult to sustain and maintain. The software was phased out in 2006 to be replaced by a modularized software known as a ‘service-oriented architecture’.

Service Oriented Architecture

Charlie Bell Peter Vosshall

In Service Oriented Architecture, or “Software as a Service” the idea is that code bases are modularized and communicate with one another using network requests. A standard approach is to use a REST API. So, rather than a single monolithic code base, the code is developed with individual services that handle the different requests.

Figure: A potential path of models in a machine learning system.

Modern software development uses an approach known as service-oriented architecture to build highly complex systems. Such systems have similar emergent properties to Conway’s “Game of Life”. Understanding these emergent properties is vitally important when diagnosing problems in the system.

Intellectual Debt


Figure: Jonathan Zittrain’s term to describe the challenges of explanation that come with AI is Intellectual Debt.

In the context of machine learning and complex systems, Jonathan Zittrain has coined the term “Intellectual Debt” to describe the challenge of understanding what you’ve created. In the ML@CL group we’ve been focussing on developing the notion of a data-oriented architecture to deal with intellectual debt (Cabrera et al., 2023).

Zittrain points out the challenge around the lack of interpretability of individual ML models as the origin of intellectual debt. In machine learning I refer to work in this area as fairness, interpretability and transparency or FIT models. To an extent I agree with Zittrain, but if we understand the context and purpose of the decision making, I believe this is readily put right by the correct monitoring and retraining regime around the model. A concept I refer to as “progression testing”. Indeed, the best teams do this at the moment, and their failure to do it feels more of a matter of technical debt rather than intellectual, because arguably it is a maintenance task rather than an explanation task. After all, we have good statistical tools for interpreting individual models and decisions when we have the context. We can linearise around the operating point, we can perform counterfactual tests on the model. We can build empirical validation sets that explore fairness or accuracy of the model.

This is the landscape we now find ourselves in with regard to software development. In practice, each of these services is often ‘owned’ and maintained by an individual team. The team is judged by the quality of their service provision. They work to detailed specifications on what their service should output, what its availability should be and other objectives like speed of response. This allows for conditional independence between teams and for faster development.

Clearly Conway’s Game of Life exhibits an enormous amount of intellectual debt, indeed that was the idea. Build something simple that exhibits great complexity. That’s what makes it so popular. But in deployed system software, intellectual debt is a major headache and emulation presents one way of dealing with it.

Unfortunately, it also makes sophisticated software systems a breeding ground for intellectual debt. Particularly when they contain components which are themselves ML components. Dealing with this challenge is a major objective of my Senior AI Fellowship at the Alan Turing Institute. You can see me talking about the problems at this recent seminar given virtually in Manchester.

Examples in Python

There are a number of different simulations available in python, and tools specifically designed for simulation. For example, simpy has a list of example simulations covering machine shops, gas stations and car washes.

Operations research ideas, like the newsvendor problem, can also be solved; Andrea Fabry provides an example of the newsvendor problem in python for ordering NFL replica jerseys here.

The newsvendor problem is also a component in the Amazon supply chain. The Amazon supply chain makes predictions about supply and demand, combines them with a cost basis and computes optimal stock. Inventory orders are based on the difference between this optimal stock and current stock. The MiniSCOT simulation provides code that illustrates this.
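To give a feel for the underlying logic, here is a hedged sketch of newsvendor-style ordering: the optimal stock level is the critical fractile of a demand forecast (here taken to be Gaussian), and the order is the gap between that target and current stock. The function and the numbers are invented for illustration and are not Amazon’s actual system.

from scipy import stats

def newsvendor_order(mean_demand, sd_demand, current_stock,
                     underage_cost, overage_cost):
    # Newsvendor-style ordering sketch. The optimal stock level is the
    # critical fractile of the demand forecast, F^{-1}(cu/(cu+co)), and the
    # order is the gap between that target and the stock currently on hand.
    critical_fractile = underage_cost/(underage_cost + overage_cost)
    optimal_stock = stats.norm.ppf(critical_fractile,
                                   loc=mean_demand, scale=sd_demand)
    return max(0.0, optimal_stock - current_stock)

# Made-up example: Gaussian demand forecast of 100 +/- 20 units, 60 units on
# hand, losing a sale costs 5, holding an unsold unit costs 1.
print(newsvendor_order(mean_demand=100, sd_demand=20, current_stock=60,
                       underage_cost=5.0, overage_cost=1.0))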

Control problems are also systems that aim to achieve objectives given a set of parameters. A classic control problem is the inverted pendulum. This python code generalises the standard inverted pendulum to one with \(n\) links using symbolic python.

In reinforcement learning similar control problems are studied, a classic reinforcement learning problem is known as the Mountain Car. You can work with this problem using an environment known as OpenAI gym that includes many different reinforcement learning scenarios.

In neuroscience Hodgkin and Huxley studied the giant axon in squid to work out a set of differential equations that are still used today to model spiking neurons. Mark Kramer explains how to simulate them in python here.

All Formula One race teams simulate the race to determine their strategy (tyres, pit stops etc). While those simulations are proprietary, there is sufficient interest in the wider community that race simulations have been developed in python. Here is one that Nick Roth has made available.

Formula One teams also make use of simulation to design car parts. There are at least two components to this: simulations that allow the modelling of fluid flow across the car, and simulations that analyze the capabilities of materials under stress. You can find computational fluid dynamics simulations in python here and finite element analysis methods for computing stress in materials here.

Alongside continuous system simulation through partial differential equations, another form of simulation that arises is known as discrete event simulation. This comes up in computer networks and also in chemical systems, where the approach to simulating is known as the Gillespie algorithm.

Simulation System


An example of a complex decision-making system might be a climate model, in such a system there are separate models for the atmosphere, the ocean and the land.

The components of these systems include the flow of currents, chemical interactions in the upper atmosphere, evaporation of water, etc.

Figure: Representation of the Carbon Cycle from the US National Oceanic and Atmospheric Administration. While everything is interconnected in the system, we can decompose into separate models for atmosphere, ocean, land.

The influence of human activity also needs to be incorporated and modelled so we can make judgments about how to mitigate the effects of global warming.

Figure: The components of a simulation system for climate modelling.

Monolithic System

The classical approach to building these systems was a ‘monolithic system’. Built in a similar way to successful applications software such as Excel or Word, or large operating systems, a single code base was constructed. Such code bases are complex and run to many lines of code.

In practice, shared dynamically linked libraries may be used for aspects such as user interface, or networking, but the software often has many millions of lines of code. For example, the Microsoft Office suite is said to contain over 30 million lines of code.

Figure: A potential path of models in a machine learning system.

Service Oriented Architecture

Such software is not only difficult to develop, but also to scale when computation demands increase. For example, Amazon’s original website software (called Obidos) was a monolithic design but by the early noughties it was becoming difficult to sustain and maintain. The software was phased out in 2006 to be replaced by a modularized software known as a ‘service-oriented architecture’.

In Service Oriented Architecture, or “Software as a Service” the idea is that code bases are modularized and communicate with one another using network requests. A standard approach is to use a REST API. So, rather than a single monolithic code base, the code is developed with individual services that handle the different requests.

The simulation software is turned inside out to expose the individual components to the operator.

Figure: A potential path of models in a machine learning system.

This is the landscape we now find ourselves in for software development. In practice, each of these services is often ‘owned’ and maintained by an individual team. The team is judged by the quality of their service provision. They work to detailed specifications on what their service should output, what its availability should be and other objectives like speed of response. This allows for conditional independence between teams and for faster development.

One question is to what extent is the same approach possible/desirable for scientific models? The components we listed above are already separated and often run independently. But those components themselves are made up of other sub-components that could also be exposed in a similar manner to software-as-a-service, giving us the notion of “simulation as a service”.

One thing about working in an industrial environment is the way that short-term thinking and actions become important. For example, in Formula One, the teams are working on a two-week cycle to digest information from the previous week’s race and incorporate updates to the car or their strategy.

However, businesses must also think about more medium-term horizons. For example, in Formula 1 you need to worry about next year’s car. So, while you’re working on updating this year’s car, you also need to think about what will happen for next year and prioritize these conflicting needs appropriately.

In the Amazon supply chain, there are equivalent demands. If we accept that an artificial intelligence is just an automated decision-making system, and if we measure in terms of money automatically spent or goods automatically moved, then Amazon’s buying system is perhaps the world’s largest AI.

Those decisions are being made on short time schedules; purchases are made by the system on weekly cycles. But just as in Formula 1, there is also a need to think about what needs to be done next month, next quarter and next year. Planning meetings are held not only on a weekly basis (known as weekly business reviews), but monthly, quarterly, and then yearly meetings for planning spends and investments.

Amazon is known for being longer term thinking than many companies, and a lot of this is coming from the CEO. One quote from Jeff Bezos that stuck with me was the following.

“I very frequently get the question: ‘What’s going to change in the next 10 years?’ And that is a very interesting question; it’s a very common one. I almost never get the question: ‘What’s not going to change in the next 10 years?’ And I submit to you that that second question is actually the more important of the two – because you can build a business strategy around the things that are stable in time. … [I]n our retail business, we know that customers want low prices, and I know that’s going to be true 10 years from now. They want fast delivery; they want vast selection. It’s impossible to imagine a future 10 years from now where a customer comes up and says, ‘Jeff I love Amazon; I just wish the prices were a little higher,’ [or] ‘I love Amazon; I just wish you’d deliver a little more slowly.’ Impossible. And so the effort we put into those things, spinning those things up, we know the energy we put into it today will still be paying off dividends for our customers 10 years from now. When you have something that you know is true, even over the long term, you can afford to put a lot of energy into it.”

This quote is incredibly important for long-term thinking. Indeed, it’s a failure of many of our simulations that they focus on what is going to happen, not on what will not happen. In Amazon, this meant that there was constant focus on these three areas: keeping costs low, making delivery fast and improving selection. For example, shortly before I left, Amazon moved its entire US network from two-day delivery to one-day delivery. This involved changing the way the entire buying system operates. Or, more recently, the company has had to radically change the portfolio of products it buys in the face of Covid19.

Figure: Experiment, analyze and design is a flywheel of knowledge that is the dual of the model, data and compute. By running through this spiral, we refine our hypothesis/model and develop new experiments which can be analyzed to further refine our hypothesis.

From the perspective of the team that we had in the supply chain, we looked at what we most needed to focus on. Amazon moves very quickly, but we could also take a leaf out of Jeff’s book, and instead of worrying about what was going to change, remember what wasn’t going to change.

We don’t know what science we’ll want to do in five years’ time, but we won’t want slower experiments, we won’t want more expensive experiments and we won’t want a narrower selection of experiments.

As a result, our focus was on how to speed up the process of experiments, increase the diversity of experiments that we can do, and keep the price of experiments as low as possible.

The faster the innovation flywheel can be iterated, then the quicker we can ask about different parts of the supply chain, and the better we can tailor systems to answering those questions.

We need faster, cheaper and more diverse experiments, which implies we need better ecosystems for experimentation. This has led us to focus on the software frameworks we’re using to develop machine learning systems, including data-oriented architectures (Borchert, 2020; Lawrence, 2019; Vorhemus and Schikuta, 2017; Joshi, 2007), data maturity assessments (Lawrence et al., 2020) and data readiness levels (see this blog post on Data Readiness Levels and Lawrence, 2017; The DELVE Initiative, 2020b).

Packing Problems


Figure: Packing 9 squares into a single square. This example is trivially solved. Credit https://erich-friedman.github.io/packing/

Figure: Packing 17 squares into a single square. The optimal solution is sometimes hard to find. Here the side length of the smallest square that holds 17 similarly shaped squares is at least 4.675 times the side length of the smaller squares. This solution was found by John Bidwell in 1997. Credit https://erich-friedman.github.io/packing/

Another example of a problem where the “physics” is understood, because it’s really mathematics, is packing problems. Here the mathematics is just geometry, but we still need some form of compute to solve these problems. Erich Friedman’s website contains a host of these problems, only some of which are analytically tractable.

Figure: Packing 10 squares into a single square. This example is proven by Walter Stromquist (Stromquist, 1984). Here \(s=3+\frac{1}{\sqrt{2}}\). Credit https://erich-friedman.github.io/packing/

Modelling with a Function

What if the question of interest were quite simple? For example, in the packing problem we might just want to know the minimum side length. Sometimes, regardless of the complexity of the problem, there can be a pattern to the answer that is emergent due to regularities in the underlying problem.

Erich Friedman Packing Data

%pip install pods
import numpy as np
import pods
data = pods.datasets.erich_friedman_packing_data()
x = data['X']
y = data['Y']

Figure: Plot of minimum side length known as a function of number of squares inside.

%pip install gpy

Gaussian Process Fit


Our first objective will be to perform a Gaussian process fit to the data; we’ll do this using the GPy software.

import GPy
m_full = GPy.models.GPRegression(x,y)
_ = m_full.optimize() # Optimize parameters of covariance function

The first command sets up the model, then m_full.optimize() optimizes the parameters of the covariance function and the noise level of the model. Once the fit is complete, we’ll try creating some test points and computing the output of the GP model in terms of the mean and standard deviation of the posterior functions between 0 and 100 squares. We plot the mean function and the standard deviation at 400 locations. We can obtain the predictions using yt_mean, yt_var = m_full.predict(xt).

xt = np.linspace(0,100,400)[:,np.newaxis]
yt_mean, yt_var = m_full.predict(xt)
yt_sd=np.sqrt(yt_var)

Now we plot the results using the helper function in mlai.plot.
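If you don’t have the mlai plotting helpers to hand, a plain matplotlib version of the same plot might look like this (a sketch using the predictions computed above).

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(xt, yt_mean, 'b-', label='GP mean')
ax.fill_between(xt[:, 0],
                (yt_mean - 2*yt_sd)[:, 0],
                (yt_mean + 2*yt_sd)[:, 0],
                alpha=0.3, label='GP mean plus or minus 2 sd')
ax.plot(x, y, 'r.', markersize=10, label='data')
ax.set_xlabel('number of squares')
ax.set_ylabel('minimum side length')
ax.legend()
plt.show()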

Figure: Gaussian process fit to the Erich Friedman Packing data.

Statistical Emulation


Figure: The UK Met office runs a shared code base for its simulations of climate and the weather. This plot shows the different spatial and temporal scales used.

In many real-world systems, decisions are made through simulating the environment. Simulations may operate at different granularities. For example, simulations are used in weather forecasts and climate forecasts. Interestingly, the UK Met Office uses the same code for both; it has a “Unified Model” approach, but the climate and weather simulations operate at different spatial and temporal scales.

Figure: Real world systems consist of simulators that capture our domain knowledge about how our systems operate. Different simulators run at different speeds and granularities.

Figure: A statistical emulator is a system that reconstructs the simulation with a statistical model.

A statistical emulator is a data-driven model that learns about the underlying simulation. Importantly, it learns with uncertainty, so it ‘knows what it doesn’t know’. In practice, we can call the emulator in place of the simulator. If the emulator ‘doesn’t know’, it can call the simulator for the answer.

Figure: A statistical emulator is a system that reconstructs the simulation with a statistical model. As well as reconstructing the simulation, a statistical emulator can be used to correlate with the real world.

As well as reconstructing an individual simulator, the emulator can calibrate the simulation to the real world, by monitoring differences between the simulator and real data. This allows the emulator to characterize where the simulation can be relied on, i.e., we can validate the simulator.

Similarly, the emulator can adjudicate between simulations. This is known as multi-fidelity emulation. The emulator characterizes which simulations perform well where.

If all this modelling is done with judicious handling of the uncertainty, the computational doubt, then the emulator can assist in deciding what experiment should be run next to aid a decision: should we run a simulator, in which case which one, or should we attempt to acquire data from a real-world intervention.

GPy: A Gaussian Process Framework in Python


Gaussian processes are a flexible tool for non-parametric analysis with uncertainty. The GPy software was started in Sheffield to provide an easy-to-use interface to GPs, one which allowed the user to focus on the modelling rather than the mathematics.

Figure: GPy is a BSD licensed software code base for implementing Gaussian process models in Python. It is designed for teaching and modelling. We welcome contributions which can be made through the GitHub repository https://github.com/SheffieldML/GPy

GPy is a BSD licensed software code base for implementing Gaussian process models in python. This allows GPs to be combined with a wide variety of software libraries.

The software itself is available on GitHub and the team welcomes contributions.

The aim for GPy is to be a probabilistic-style programming language, i.e., you specify the model rather than the algorithm. As well as a large range of covariance functions the software allows for non-Gaussian likelihoods, multivariate outputs, dimensionality reduction and approximations for larger data sets.

The documentation for GPy can be found here.

GPy Tutorial

James Hensman Nicolas Durrande

This GPy tutorial is based on material we share in the Gaussian process summer school for teaching these models https://gpss.cc. It contains material from various members and former members of the Sheffield machine learning group, but particular mention should be made of Nicolas Durrande and James Hensman, see http://gpss.cc/gpss17/labs/GPSS_Lab1_2017.ipynb.

import numpy as np
import GPy

To give a feel for the software we’ll start by creating an exponentiated quadratic covariance function, \[ k(\mathbf{ x}, \mathbf{ x}^\prime) = \alpha \exp\left(-\frac{\left\Vert \mathbf{ x}- \mathbf{ x}^\prime \right\Vert_2^2}{2\ell^2}\right), \] where the length scale is \(\ell\) and the variance is \(\alpha\).

To set this up in GPy we create a kernel in the following manner.

input_dim=1
alpha = 1.0
lengthscale = 2.0
kern = GPy.kern.RBF(input_dim=input_dim, variance=alpha, lengthscale=lengthscale)

That builds a kernel object for us. The kernel can be displayed.

display(kern)

Or, because it’s one dimensional, you can also plot the kernel as a function of one of its inputs (while the other is fixed).

Figure: The exponentiated quadratic covariance function as plotted by the GPy.kern.plot command.

You can set the length scale of the covariance to different values and plot the result.

kern = GPy.kern.RBF(input_dim=input_dim)     # By default, the parameters are set to 1.
lengthscales = np.asarray([0.2,0.5,1.,2.,4.])
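If you prefer not to rely on the kernel’s built-in plot method, one way to produce the plot is to evaluate the covariance directly with kern.K; a sketch follows, using the kern and lengthscales defined above.

import numpy as np
import matplotlib.pyplot as plt

xx = np.linspace(-6, 6, 200)[:, np.newaxis]  # grid of first arguments
x_fixed = np.zeros((1, 1))                   # second argument fixed at 0

fig, ax = plt.subplots()
for ell in lengthscales:
    kern.lengthscale = ell                   # update the kernel's length scale
    ax.plot(xx, kern.K(xx, x_fixed), label='lengthscale = {l}'.format(l=ell))
ax.set_xlabel('x')
ax.set_ylabel('k(x, 0)')
ax.legend()
plt.show()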

Figure: The exponentiated quadratic covariance function plotted for different length scales by GPy.kern.plot command.

Covariance Functions in GPy

Many covariance functions are already implemented in GPy. Instead of rbf, try constructing and plotting the following covariance functions: exponential, Matern32, Matern52, Brownian, linear, bias, rbfcos, periodic_Matern32, etc. Some of these covariance functions, such as rbfcos, are not parametrized by a variance and a length scale. Further, not all kernels are stationary (i.e., they can’t all be written as \(k(\mathbf{ x}, \mathbf{ x}^\prime) = f(\mathbf{ x}-\mathbf{ x}^\prime)\), see for example the Brownian covariance function). So for plotting it may be interesting to change the value of the fixed input.
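As a starting point for this exercise, here is a sketch that constructs and displays a few of the covariance functions mentioned; the class names are taken from GPy’s kern module and their availability may vary between GPy versions.

import GPy

kernels = [GPy.kern.Exponential(input_dim=1),
           GPy.kern.Matern32(input_dim=1),
           GPy.kern.Matern52(input_dim=1),
           GPy.kern.Brownian(input_dim=1),
           GPy.kern.Linear(input_dim=1),
           GPy.kern.Bias(input_dim=1)]
for kern in kernels:
    display(kern)  # inspect the parameters of each covariance function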

Combining Covariance Functions in GPy

In GPy you can easily combine covariance functions you have created using the sum and product operators, + and *. So, for example, if we wish to combine an exponentiated quadratic covariance with a Matern 5/2 then we can write

kern1 = GPy.kern.RBF(1, variance=1., lengthscale=2.)
kern2 = GPy.kern.Matern52(1, variance=2., lengthscale=4.)
kern = kern1 + kern2
display(kern)

Figure: A combination of the exponentiated quadratic covariance plus the Matern \(5/2\) covariance.

Or if we wanted to multiply them, we can write

kern1 = GPy.kern.RBF(1, variance=1., lengthscale=2.)
kern2 = GPy.kern.Matern52(1, variance=2., lengthscale=4.)
kern = kern1 * kern2
display(kern)

Figure: A combination of the exponentiated quadratic covariance multiplied by the Matern \(5/2\) covariance.

You can learn about how to implement new kernel objects in GPy here.

Figure: Designing the covariance function for your Gaussian process is a key place in which you introduce your understanding of the data problem. To learn more about the design of covariance functions, see this talk from Nicolas Durrande at GPSS in 2016.

A Gaussian Process Regression Model

We will now combine the Gaussian process prior with some data to form a GP regression model with GPy. We will generate data from the function \[ f(x) = -\cos(\pi x) + \sin(4\pi x) \] over the domain \([0, 1]\), adding some noise to give \[ y(x) = f(x) + \epsilon, \] with the noise being Gaussian distributed, \(\epsilon\sim \mathcal{N}\left(0,0.01\right)\).

X = np.linspace(0.05,0.95,10)[:,np.newaxis]
Y = -np.cos(np.pi*X) + np.sin(4*np.pi*X) + np.random.normal(loc=0.0, scale=0.1, size=(10,1))

Figure: Data from the noisy sine wave for fitting with a GPy model.

A GP regression model based on an exponentiated quadratic covariance function can be defined by first defining a covariance function.

kern = GPy.kern.RBF(input_dim=1, variance=1., lengthscale=1.)

And then combining it with the data to form a Gaussian process model.

model = GPy.models.GPRegression(X,Y,kern)

Just as for the covariance function object, we can find out about the model using the command display(model).

display(model)

Note that by default the model includes some observation noise with variance 1. We can see the posterior mean prediction and visualize the marginal posterior variances using model.plot().

Figure: A Gaussian process fit to the noisy sine data. Here the parameters of the process and the covariance function haven’t yet been optimized.

You can also look directly at the predictions from the model:

Xstar = np.linspace(0, 1, 100)[:, np.newaxis]  # test locations across the domain [0, 1]
Ystar, Vstar = model.predict(Xstar)

This gives you the mean (Ystar) and the variance (Vstar) at the locations given by Xstar.
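As well as the predictive mean and variance, you can draw sample functions from the posterior. The sketch below uses GPy’s posterior_samples_f method; its output shape differs between GPy versions, hence the reshape.

import matplotlib.pyplot as plt

samples = model.posterior_samples_f(Xstar, size=10)  # draw 10 posterior functions
samples = samples.reshape(Xstar.shape[0], -1)        # flatten to (n_locations, n_samples)

fig, ax = plt.subplots()
ax.plot(Xstar, samples, 'b-', alpha=0.4)
ax.plot(X, Y, 'r.', markersize=10)
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
plt.show()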

Covariance Function Parameter Estimation

As we have seen during the lectures, the parameter values can be estimated by maximizing the likelihood of the observations. Since we don’t want any of the variances to become negative during the optimization, we can constrain all parameters to be positive before running the optimization.

model.constrain_positive()

The warnings arise because the parameters are already constrained by default; the software is warning us that they are being reconstrained.

Now we can optimize the model using the model.optimize() method. Here we switch messages on, which allows us to see the progression of the optimization.

model.optimize(messages=True)

By default, the optimization is using a limited memory BFGS optimizer (Byrd et al., 1995).

Once again, we can display the model, now to see how the parameters have changed.

display(model)

The length scale is much smaller, as is the noise level. The variance of the exponentiated quadratic has also been reduced.

Figure: A Gaussian process fit to the noisy sine data with parameters optimized.

GPy and Emulation


Let \(\mathbf{ x}\) be a random variable defined over the real numbers, \(\Re\), and \(f(\cdot)\) be a function mapping between the real numbers \(\Re \rightarrow \Re\).

The problem of uncertainty propagation is the study of the distribution of the random variable \(f(\mathbf{ x})\).

We’re going to address this problem using emulation and GPy. We will see in this section the advantage of using a model when only a few observations of \(f\) are available.

Firstly, we’ll make use of a test function known as the Branin test function. \[ f(\mathbf{ x}) = a(x_2 - bx_1^2 + cx_1 - r)^2 + s(1-t) \cos(x_1) + s \] where we are setting \(a=1\), \(b=5.1/(4\pi^2)\), \(c=5/\pi\), \(r=6\), \(s=10\) and \(t=1/(8\pi)\).

import numpy as np

def branin(X):
    """Branin test function with a=1, b=5.1/(4*pi^2), c=5/pi, r=6, s=10, t=1/(8*pi)."""
    y = ((X[:,1] - 5.1/(4*np.pi**2)*X[:,0]**2 + 5*X[:,0]/np.pi - 6)**2
         + 10*(1 - 1/(8*np.pi))*np.cos(X[:,0]) + 10)
    return y

We’ll define a set of twenty-five observations, laid out on a 5 × 5 grid over the domain [−5, 10] × [0, 15].

# Training set defined as a 5*5 grid:
xg1 = np.linspace(-5,10,5)
xg2 = np.linspace(0,15,5)
X = np.zeros((xg1.size * xg2.size,2))
for i,x1 in enumerate(xg1):
    for j,x2 in enumerate(xg2):
        X[i+xg1.size*j,:] = [x1,x2]

Y = branin(X)[:,np.newaxis]

The task here will be to consider the distribution of \(f(U)\), where \(U\) is a random variable with uniform distribution over the input space of \(f\). We focus on the computation of two quantities: the expectation of \(f(U)\), \(\left\langle f(U)\right\rangle\), and the probability that the value is greater than 200.

Computation of \(\left\langle f(U)\right\rangle\)

The expectation of \(f(U)\) is given by \(\int f(\mathbf{ x})p(\mathbf{ x})\text{d}\mathbf{ x}\), where \(p(\mathbf{ x})\) is the uniform density over the input domain. A basic approach to approximating this integral is to compute the mean of the 25 observations: np.mean(Y). Since the points are distributed on a grid, this can be seen as approximating the integral by a rough Riemann sum.

print('Estimate of the expectation is given by: {mean}'.format(mean=Y.mean()))

The result can be compared with the actual mean of the Branin function which is 54.31.
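
Because the Branin function is cheap to evaluate, we can also sanity-check this against a direct Monte Carlo estimate on the true function, something that would not be possible for an expensive simulator:

# direct Monte Carlo estimate using the true (cheap) test function
Xmc = np.random.uniform(size=(int(1e6), 2))
Xmc[:, 0] = Xmc[:, 0]*15 - 5   # map to [-5, 10]
Xmc[:, 1] = Xmc[:, 1]*15       # map to [0, 15]
print('Monte Carlo estimate of the mean: {mean}'.format(mean=branin(Xmc).mean()))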

Alternatively, we can fit a GP model and compute the integral of the best predictor by Monte Carlo sampling.

Firstly, we create the covariance function. Here we’re going to use an exponentiated quadratic, but we’ll augment it with the ‘bias’ covariance function. This covariance function represents a single fixed bias that is added to the overall covariance. It allows us to deal with functions that have a non-zero mean.

import GPy
# Create an exponentiated quadratic plus bias covariance function
kern_eq = GPy.kern.RBF(input_dim=2, ARD = True)
kern_bias = GPy.kern.Bias(input_dim=2)
kern = kern_eq + kern_bias

Now we construct the Gaussian process regression model in GPy.

# Build a GP model
model = GPy.models.GPRegression(X,Y,kern)

In the sinusoid example above, we learnt the variance of the process. But in this example, we are fitting an emulator to a function we know is noise-free. However, we don’t fix the noise value to precisely zero, as this can lead to some numerical errors. Instead, we fix the variance of the Gaussian noise to a very small value.

# fix the noise variance
model.likelihood.variance.fix(1e-5)

Now we fit the model. Note that the initial values for the length scale are not appropriate, so first the length scale of the model needs to be reset.

kern.rbf.lengthscale = np.asarray([3, 3])

It’s a common error in Gaussian process fitting to initialize the length scale too small or too big. The challenge is that the error surface is normally multimodal, and the final solution can be very sensitive to this initialization. If the length scale is initialized too small, the solution can converge on a place where the signal isn’t extracted by the covariance function. If the length scale is initialized too large, then the variations of the function are often missed. Here the length scale is set to 3 for each input dimension. Now that’s done, we can optimize the model.

# Optimize the model
model.optimize(messages=True)

Figure: A Gaussian process fit to the Branin test function, used to assess the mean of the function by emulation.

Finally, we can compute the mean of the model predictions using a large number of Monte Carlo samples.

Note, that in this example, because we’re using a test function, we could simply have done the Monte Carlo estimation directly on the Branin function. However, imagine instead that we were trying to understand the results of a complex computational fluid dynamics simulation, where each run of the simulation (which is equivalent to our test function) took many hours. In that case the advantage of the emulator is clear.

# Compute the mean of model prediction on 1e5 Monte Carlo samples
Xp = np.random.uniform(size=(int(1e5),2))
Xp[:,0] = Xp[:,0]*15-5
Xp[:,1] = Xp[:,1]*15
mu, var = model.predict(Xp)
print('The estimate of the mean of the Branin function is {mean}'.format(mean=np.mean(mu)))

Exercise 1

Now think about how to make use of the variance estimation from the Gaussian process to obtain error bars around your estimate.

Exercise 2

You’ve seen how the Monte Carlo estimates work with the Gaussian process. Now make your estimate of the probability that the Branin function is greater than 200 with the uniform random inputs.

Uncertainty Quantification and Design of Experiments

[edit]

We’re introducing you to the optimization and analysis of real-world models through emulation; this domain is part of a broader field known as surrogate modelling.

Although we’re approaching this from the machine learning perspective, with a computer-scientist’s approach, you won’t be surprised to find out that this field is not new and there are a diverse range of research communities interested in this domain.

We’ve been focussing on active experimental design. In particular, the case where we are sequentially selecting points to run our simulation based on previous results.

Here, we pause for a moment and cover approaches to passive experimental design. Almost all the emulation examples we’ve looked at so far need some initial points to ‘seed’ the emulator. Selecting these is also a task of experimental design, but one we perform without running our simulator.

This challenge, of where to run the simulation to get the answer you require, is an old one. One classic paper, McKay et al. (1979), reviews three different methods for designing these inputs: random sampling, stratified sampling, and Latin hypercube sampling.

Random sampling is the default approach: across the input domain of interest, we simply select samples at random (perhaps uniformly, or, if we believe there is an underlying distribution, according to that distribution).

Let the input values \(\mathbf{ x}_1, \dots, \mathbf{ x}_n\) be a random sample from \(f(\mathbf{ x})\). This method of sampling is perhaps the most obvious, and an entire body of statistical literature may be used in making inferences regarding the distribution of \(Y(t)\).

Stratified sampling is an approach from statistics where a population is divided into sub-populations before sampling. For example, imagine that we are interested in surveillance data for Covid-19 tracking. If we can only afford to track 100 people, we could sample them randomly from across the population. But we might worry that (by chance) we don’t get many people from a particular age group. So instead, we could divide the population into sub-groups and sample a fixed number from each group. This ensures we get coverage across all ages, although downstream we might have to weight the samples when considering the population-level effect.

The same ideas can be deployed in emulation: your input domain can be divided into regions of particular interest. For example, if testing the performance of a component from an F1 car, you might be interested in the performance on the straights, the performance in “fast corners” and the performance in “slow corners”. Because slow corners have a very large effect on the final performance, you might take more samples from slow corners relative to the frequency with which such corners appear in the actual races.

Using stratified sampling, all areas of the sample space of \(\mathbf{ x}\) are represented by input values. Let the sample space \(S\) of \(\mathbf{ x}\) be partitioned into \(I\) disjoint strata \(S_i\). Let \(\pi_i = P(\mathbf{ x}\in S_i)\) represent the size of \(S_i\). Obtain a random sample \(\mathbf{ x}_{ij}\), \(j = 1, \dots, n_i\) from \(S_i\). Then of course the \(n_i\) sum to \(n\). If \(I = 1\), we have random sampling over the entire sample space.
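
As an illustrative sketch, here is stratified sampling over the Branin input domain using a 5 × 5 grid of equal-probability strata, with one point drawn uniformly from each stratum (the stratum layout is an assumption for illustration):

# stratified sample: one uniform draw from each cell of a 5x5 partition
# of the domain [-5, 10] x [0, 15]
n_strata = 5
samples = []
for i in range(n_strata):
    for j in range(n_strata):
        x1 = -5 + (i + np.random.uniform())*15/n_strata
        x2 = (j + np.random.uniform())*15/n_strata
        samples.append([x1, x2])
X_strat = np.asarray(samples)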

Latin hypercube sampling is a form of stratified sampling. For a Latin square, if \(M\) samples are required, then the strata are determined by dividing the input area into \(M\) discrete rows and \(M\) columns. The samples are then taken so that each row and each column contains exactly one sample. The Latin hypercube is the generalisation of this idea to more than two dimensions.

The same reasoning that led to stratified sampling, ensuring that all portions of \(S\) were sampled, could lead further. If we wish to ensure also that each of the input variables \(\mathbf{ x}_k\) has all portions of its distribution represented by input values, we can divide the range of each \(\mathbf{ x}_k\) into \(n\) strata of equal marginal probability \(1/n\), and sample once from each stratum. Let this sample be \(\mathbf{ x}_{kj}\), \(j = 1, \dots, n\). These form the \(\mathbf{ x}_k\) component, \(k = 1, \dots , K\), in \(\mathbf{ x}_i\), \(i = 1, \dots, n\). The components of the various \(\mathbf{ x}_k\)’s are matched at random. This method of selecting input values is an extension of quota sampling (Steinberg 1963), and can be viewed as a \(K\)-dimensional extension of Latin square sampling (Raj 1968).
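
A minimal sketch of Latin hypercube sampling on the unit hypercube is given below; the helper latin_hypercube is hypothetical, and libraries such as scipy.stats.qmc provide more complete implementations.

def latin_hypercube(n, d):
    """Draw n points in [0, 1]^d so that each dimension's n equal strata
    each contain exactly one point, matched at random across dimensions."""
    samples = np.zeros((n, d))
    for k in range(d):
        # one uniform draw in each of the n strata, then shuffle the strata
        strata = (np.arange(n) + np.random.uniform(size=n)) / n
        samples[:, k] = np.random.permutation(strata)
    return samples

# e.g. 25 design points over the Branin input domain
X_lhs = latin_hypercube(25, 2)
X_lhs[:, 0] = X_lhs[:, 0]*15 - 5
X_lhs[:, 1] = X_lhs[:, 1]*15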

The paper’s rather dated reference to “Output from a Computer Code” does carry forward through this literature, which has continued to be a focus of interest for statisticians. Tony O’Hagan, who was a colleague in Sheffield but is also one of the pioneers of Gaussian process models, was developing these methods when I first arrived there (Kennedy and O’Hagan, 2001), and continued with a large EPSRC-funded project for managing uncertainty in computational models, http://www.mucm.ac.uk/. You can see a list of their technical reports here.

Another important group based in France is the “MASCOT-NUM Research Group”, https://www.gdr-mascotnum.fr/. These researchers bring together statisticians, applied mathematicians and engineers in solving these problems.

Emukit Playground

[edit]
Leah Hirst Cliff McCollum

Emukit playground is a software toolkit for exploring the use of statistical emulation as a tool. It was built by Leah Hirst during her software engineering internship at Amazon, supervised by Cliff McCollum.

Figure: Emukit playground is a tutorial for understanding the simulation/emulation relationship. https://amzn.github.io/emukit-playground/

Figure: Tutorial on Bayesian optimization of the number of taxis deployed from Emukit playground. https://amzn.github.io/emukit-playground/#!/learn/bayesian_optimization

You can explore Bayesian optimization of a taxi simulation.

Exercise 3

You now know enough to build a simple emulation. To test your knowledge, have a go at combining GPy with Thomas House’s herd immunity simulation. Can you build a Gaussian process emulator of the simulation? Don’t spend too long on this exercise. The idea is just to consolidate things like what the inputs and outputs should be.

Conclusions

We summarized the different types of simulation into roughly three groups. Firstly, those based on physical laws in the form of differential equations. Examples include certain compartmental epidemiological models, climate models and weather models. Secondly, discrete event simulations. These simulations often run to a ‘clock’, where updates to the state are taken in turns. The Game of Life is an example of this type of simulation, and Formula 1 models of race strategy also use this approach. There is another type of discrete event simulation that doesn’t use a turn-based approach but waits for the next event. The Gillespie algorithm is an example of such an approach but we didn’t cover it here. Finally, we realized that general computer code bases are also simulations. If a company has a large body of code, and particularly if it’s hosted within a streaming environment (such as Apache Kafka), it’s possible to back test the code with different inputs. Such backtests can be viewed as simulations, and in the case of large bodies of code (such as the code that manages Amazon’s automated buying systems) the back tests can be slow and could also benefit from emulation.

We’ve introduced emulation as a way of dealing with different fidelities of simulations and removing the computational demands that come with them. We’ve highlighted how emulation can be deployed and introduced the GPy software for Gaussian process modelling.

Thanks!

For more information on these subjects and more you might want to check the following resources.

References

Borchert, T., 2020. Milan: An evolution of data-oriented programming.
Byrd, R.H., Lu, P., Nocedal, J., 1995. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific and Statistical Computing 16, 1190–1208.
Cabrera, C., Paleyes, A., Thodoroff, P., Lawrence, N.D., 2023. Real-world machine learning systems: A survey from a data-oriented architecture perspective.
Joshi, R., 2007. A loosely-coupled real-time SOA. Real-Time Innovations Inc.
Kennedy, M.C., O’Hagan, A., 2001. Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63, 425–464. https://doi.org/10.1111/1467-9868.00294
Lawrence, N.D., 2024. The atomic human: Understanding ourselves in the age of AI. Allen Lane.
Lawrence, N.D., 2019. Modern data oriented programming.
Lawrence, N.D., 2017. Data readiness levels. ArXiv.
Lawrence, N.D., Montgomery, J., Paquet, U., 2020. Organisational data maturity. The Royal Society.
McKay, M.D., Beckman, R.J., Conover, W.J., 1979. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21, 239–245.
Stromquist, W.R., 1984. Packing unit squares inside squares, III. Daniel H. Wagner Associates.
The DELVE Initiative, 2020b. Data readiness: Lessons from an emergency. The Royal Society.
The DELVE Initiative, 2020a. Test, trace, isolate. The Royal Society.
Vorhemus, C., Schikuta, E., 2017. A data-oriented architecture for loosely coupled real-time information systems, in: Proceedings of the 19th International Conference on Information Integration and Web-Based Applications & Services, iiWAS ’17. Association for Computing Machinery, New York, NY, USA, pp. 472–481. https://doi.org/10.1145/3151759.3151770