Probabilistic-style programming (specify the model, not the
algorithm).
Non-Gaussian likelihoods.
Multivariate outputs.
Dimensionality reduction.
Approximations for large data sets.
The Importance of the Covariance Function
$$\boldsymbol{\mu}_f = \mathbf{A}^\top \mathbf{y},$$
Improving the Numerics
In practice we shouldn't use the matrix inverse directly to solve the GP system. A more stable approach is to compute the Cholesky decomposition of the kernel matrix. The log determinant of the covariance can also be obtained from the Cholesky decomposition.
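A minimal sketch of this, assuming a positive definite kernel matrix K and target vector y; the function name and jitter term are illustrative, not part of the original notes:

```python
import numpy as np

def gp_solve(K, y, jitter=1e-6):
    """Solve K alpha = y and return log det K using a Cholesky factorisation."""
    n = K.shape[0]
    L = np.linalg.cholesky(K + jitter * np.eye(n))    # K = L L^T (jitter for stability)
    # Two triangular solves replace the explicit inverse: alpha = K^{-1} y.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # log det K is twice the sum of the log-diagonal of the Cholesky factor.
    log_det_K = 2.0 * np.sum(np.log(np.diag(L)))
    return alpha, log_det_K
```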
Capacity Control
Gradients of the Likelihood
Overall Process Scale
Capacity Control and Data Fit
Learning Covariance Parameters
Can we determine covariance parameters from the data?
$$\mathcal{N}(\mathbf{y}|\mathbf{0},\mathbf{K}) = \frac{1}{(2\pi)^{\frac{n}{2}}\det(\mathbf{K})^{\frac{1}{2}}}\exp\left(-\frac{\mathbf{y}^\top\mathbf{K}^{-1}\mathbf{y}}{2}\right)$$
$$\log \mathcal{N}(\mathbf{y}|\mathbf{0},\mathbf{K}) = -\frac{1}{2}\log\det\mathbf{K} - \frac{\mathbf{y}^\top\mathbf{K}^{-1}\mathbf{y}}{2} - \frac{n}{2}\log 2\pi$$
$$E(\boldsymbol{\theta}) = \frac{1}{2}\log\det\mathbf{K} + \frac{\mathbf{y}^\top\mathbf{K}^{-1}\mathbf{y}}{2}$$
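As a concrete sketch, the objective above can be evaluated through the Cholesky factor rather than an explicit inverse; the function name and jitter term here are illustrative:

```python
import numpy as np

def neg_log_likelihood(K, y, jitter=1e-6):
    """E(theta) = 0.5 * log det K + 0.5 * y^T K^{-1} y (constant term dropped)."""
    n = K.shape[0]
    L = np.linalg.cholesky(K + jitter * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # alpha = K^{-1} y
    log_det_K = 2.0 * np.sum(np.log(np.diag(L)))
    return 0.5 * log_det_K + 0.5 * y @ alpha
```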
Capacity Control through the Determinant
The parameters are inside the covariance function (matrix).
$$k_{i,j} = k(\mathbf{x}_i, \mathbf{x}_j; \boldsymbol{\theta})$$
Eigendecomposition of Covariance
$$\mathbf{K} = \mathbf{R}\boldsymbol{\Lambda}^2\mathbf{R}^\top$$
$\boldsymbol{\Lambda}$ represents distance on axes. $\mathbf{R}$ gives rotation.
Eigendecomposition of Covariance
$\boldsymbol{\Lambda}$ is diagonal, $\mathbf{R}^\top\mathbf{R} = \mathbf{I}$.
Useful representation since $\det\mathbf{K} = \det\boldsymbol{\Lambda}^2 = \left(\det\boldsymbol{\Lambda}\right)^2$.
Capacity control: $\log \det \mathbf{K}$.
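A small numerical illustration of this relationship, with an illustrative 2x2 covariance:

```python
import numpy as np

K = np.array([[1.0, 0.5],
              [0.5, 1.0]])
eigvals, R = np.linalg.eigh(K)        # columns of R are the eigenvectors
Lambda = np.sqrt(eigvals)             # so K = R diag(Lambda**2) R^T
# det K = det(Lambda^2) = (det Lambda)^2, so the capacity term is a sum of log eigenvalues.
log_det_K = np.sum(np.log(Lambda**2))  # equals np.linalg.slogdet(K)[1]
```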
Quadratic Data Fit
Data fit: $\frac{\mathbf{y}^\top\mathbf{K}^{-1}\mathbf{y}}{2}$
$$E(\boldsymbol{\theta}) = \frac{1}{2}\log\det\mathbf{K} + \frac{\mathbf{y}^\top\mathbf{K}^{-1}\mathbf{y}}{2}$$
Data Fit Term
Exponentiated Quadratic Covariance
$$k(\mathbf{x},\mathbf{x}') = \alpha\exp\left(-\frac{\|\mathbf{x}-\mathbf{x}'\|_2^2}{2\ell^2}\right)$$
Where Did This Covariance Matrix Come From?
$$k(\mathbf{x},\mathbf{x}') = \alpha\exp\left(-\frac{\|\mathbf{x}-\mathbf{x}'\|_2^2}{2\ell^2}\right)$$
The covariance matrix is built using the inputs to the function, $\mathbf{x}$.
For the example above it was based on Euclidean distance.
The covariance function is also known as a kernel.
Computing Covariance
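As a sketch, the exponentiated quadratic covariance above can be computed from a set of inputs as follows; the function and parameter names are illustrative:

```python
import numpy as np

def eq_cov(X, X2=None, alpha=1.0, lengthscale=1.0):
    """Exponentiated quadratic: k(x, x') = alpha * exp(-||x - x'||^2 / (2 l^2))."""
    if X2 is None:
        X2 = X
    sq_dist = np.sum((X[:, None, :] - X2[None, :, :])**2, axis=-1)
    return alpha * np.exp(-0.5 * sq_dist / lengthscale**2)

X = np.linspace(-3, 3, 25)[:, None]   # inputs on which the covariance is built
K = eq_cov(X, lengthscale=2.0)
```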
Brownian Covariance
$$k(t,t') = \alpha\min(t,t')$$
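A minimal sketch of building this covariance and drawing samples from the corresponding process; names and values are illustrative:

```python
import numpy as np

def brownian_cov(t, t2=None, alpha=1.0):
    """Brownian motion covariance k(t, t') = alpha * min(t, t') for t, t' > 0."""
    if t2 is None:
        t2 = t
    return alpha * np.minimum(t[:, None], t2[None, :])

t = np.linspace(0.01, 2.0, 50)                 # strictly positive times
K = brownian_cov(t)
# Small jitter keeps the sampling numerically well behaved.
samples = np.random.multivariate_normal(np.zeros(len(t)), K + 1e-8 * np.eye(len(t)), size=5)
```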
Where did this covariance matrix come from?
Markov Process
Visualization of inverse covariance (precision).
Precision matrix is sparse: only neighbours in matrix are
non-zero.
This reflects conditional independencies in
data.
In this case Markov structure.
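A small numerical check of this structure, assuming the Brownian covariance $k(t,t') = \min(t,t')$ from above; the values are illustrative:

```python
import numpy as np

# The inverse of the Brownian-motion covariance is tridiagonal, reflecting the
# Markov (neighbours-only) conditional independence structure.
t = np.linspace(0.1, 1.0, 6)
K = np.minimum(t[:, None], t[None, :])
precision = np.linalg.inv(K)
print(np.round(precision, 2))   # only the diagonal and first off-diagonals are non-zero
```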
Where did this covariance matrix come from?
Exponentiated Quadratic
Visualization of inverse covariance (precision).
Precision matrix is not sparse.
Each point is dependent on all the others.
In this case non-Markovian.
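An analogous check for the exponentiated quadratic covariance; the lengthscale and jitter values are illustrative:

```python
import numpy as np

# Inverting an exponentiated quadratic covariance gives a dense precision matrix:
# dependencies are not restricted to neighbouring points.
x = np.linspace(-1.0, 1.0, 6)[:, None]
sq_dist = np.sum((x[:, None, :] - x[None, :, :])**2, axis=-1)
K = np.exp(-0.5 * sq_dist / 0.5**2)
precision = np.linalg.inv(K + 1e-8 * np.eye(6))   # jitter for numerical stability
print(np.round(precision, 3))   # entries away from the diagonal remain non-zero
```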
Covariance Functions
Exponential Covariance
$$k(\mathbf{x},\mathbf{x}') = \alpha\exp\left(-\frac{\|\mathbf{x}-\mathbf{x}'\|_2}{\ell}\right)$$
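A sketch of this covariance, mirroring the exponentiated quadratic example earlier; the names are illustrative:

```python
import numpy as np

def exp_cov(X, X2=None, alpha=1.0, lengthscale=1.0):
    """Exponential covariance k(x, x') = alpha * exp(-||x - x'||_2 / l)."""
    if X2 is None:
        X2 = X
    dist = np.sqrt(np.sum((X[:, None, :] - X2[None, :, :])**2, axis=-1))
    return alpha * np.exp(-dist / lengthscale)
```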
Basis Function Covariance
$$k(\mathbf{x},\mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^\top\boldsymbol{\phi}(\mathbf{x}')$$
Degenerate Covariance Functions
RBF Basis Functions
$$\phi_k(\mathbf{x}) = \exp\left(-\frac{\|\mathbf{x}-\boldsymbol{\mu}_k\|_2^2}{\ell^2}\right).$$
$$\boldsymbol{\mu} = \begin{bmatrix}-1 \\ 0 \\ 1\end{bmatrix},$$
$$k(\mathbf{x},\mathbf{x}') = \alpha\boldsymbol{\phi}(\mathbf{x})^\top\boldsymbol{\phi}(\mathbf{x}').$$
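A minimal sketch of this degenerate covariance with the three centres above; the function names are illustrative:

```python
import numpy as np

mu = np.array([-1.0, 0.0, 1.0])   # basis centres from the slide above

def rbf_basis(x, lengthscale=1.0):
    """phi_k(x) = exp(-||x - mu_k||^2 / l^2), evaluated for each centre."""
    return np.exp(-((x[:, None] - mu[None, :])**2) / lengthscale**2)

def basis_cov(x, x2=None, alpha=1.0, lengthscale=1.0):
    """k(x, x') = alpha * phi(x)^T phi(x'); rank is at most the number of basis functions."""
    if x2 is None:
        x2 = x
    return alpha * rbf_basis(x, lengthscale) @ rbf_basis(x2, lengthscale).T

x = np.linspace(-2, 2, 20)
K = basis_cov(x)
print(np.linalg.matrix_rank(K))   # at most 3: the covariance is degenerate
```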
Bochner's Theorem
Given a positive finite Borel measure $\mu$ on the real line $\mathbb{R}$, the Fourier transform $Q$ of $\mu$ is the continuous function
$$Q(t) = \int_{\mathbb{R}} e^{-itx}\,\text{d}\mu(x).$$
$Q$ is continuous since, for a fixed $x$, the function $e^{-itx}$ is continuous and periodic. The function $Q$ is a positive definite function, i.e. the kernel $k(x, x') = Q(x' - x)$ is positive definite. Bochner's theorem (Bochner, 1959) says the converse is true, i.e. every positive definite function $Q$ is the Fourier transform of a positive finite Borel measure. A proof can be sketched as follows (Stein, 1999):
$$f(\mathbf{x}) = \sum_{i=1}^n y_i \delta(\mathbf{x}-\mathbf{x}_i),$$
$$F(\boldsymbol{\omega}) = \int_{-\infty}^{\infty} f(\mathbf{x}) \exp\left(-i 2\pi \boldsymbol{\omega}^\top \mathbf{x}\right) \text{d}\mathbf{x}$$
$$F(\boldsymbol{\omega}) = \sum_{i=1}^n y_i \exp\left(-i 2\pi \boldsymbol{\omega}^\top \mathbf{x}_i\right)$$
$$F(\omega) = \int_{-\infty}^{\infty} f(t)\left[\cos(2\pi\omega t) - i\sin(2\pi\omega t)\right]\text{d}t$$
where, using Euler's formula $\exp(ix) = \cos x + i\sin x$, we can re-express this form as
$$F(\omega) = \int_{-\infty}^{\infty} f(t)\exp(-i 2\pi\omega t)\,\text{d}t.$$
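A minimal numerical sketch of this step, evaluating $F(\omega)$ as a weighted sum of complex exponentials; the locations and weights are illustrative:

```python
import numpy as np

# Fourier transform of a weighted train of delta functions:
# F(omega) = sum_i y_i exp(-i 2 pi omega x_i).
x = np.array([0.3, 1.2, 2.7])    # locations x_i (illustrative values)
y = np.array([1.0, -0.5, 2.0])   # weights y_i (illustrative values)

def F(omega):
    return np.sum(y * np.exp(-1j * 2 * np.pi * omega * x))

print(F(0.0))   # at omega = 0 this is simply the sum of the weights
```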